<h1>Living in an Ivory Basement</h1>
<p><em>Stochastic thoughts on science, testing, and programming.</em></p>
<h1>Speeding sourmash the heck up (2024-02-20, C. Titus Brown)</h1>
<p>Faster things are always nice, right?</p>
<p>sourmash is our tool for genome and metagenome investigation. Using and developing it has been a major focus of our lab for over 7 years, and maintaining and extending it is my main passion project. sourmash is a k-mer multitool that enables all sorts of really neat bulk metagenome analyses!</p>
<p>I'm proud to say that last week we released <a href="https://github.com/sourmash-bio/sourmash/releases/tag/v4.8.6">a new version of sourmash, v4.8.6</a>, that continues to improve functionality, increase documentation, and decrease computational requirements. But, you know, we release new versions of sourmash pretty regularly, so that's only moderately exciting :).</p>
<p>A bit more exciting - we are hopefully closing in on an updated Journal of Open Source Software publication via our <a href="https://github.com/pyOpenSci/software-submission/issues/129">pyopensci review</a>. I wanted to highlight something very nice one of our reviewers said:</p>
<blockquote>
<p>Outstanding work with sourmash! Your commitment to creating a package that's both easily maintainable and well-documented truly shines through. The code is impressively organized, accompanied by clear comments explaining each section, making it easy to comprehend the purpose of each file and function.</p>
</blockquote>
<p>It's so nice to have your multiple years of effort be appreciated!</p>
<p>The most exciting news is that we've released a significant update to our <a href="https://github.com/sourmash-bio/sourmash_plugin_branchwater">branchwater plugin for sourmash</a>. This plugin supplies fast, low-memory, and multithreaded versions of common sourmash functions. <a href="https://github.com/sourmash-bio/sourmash_plugin_branchwater/releases/tag/v0.9.0">Version 0.9.0 of sourmash_plugin_branchwater</a> dramatically improves the convenience of using the plugin while also speeding up a common use case and, perhaps most importantly to us maintainers, making significant moves towards convergence with the core sourmash code base.</p>
<p>What's that, you say?? <strong>Fast, low-memory, and multithreaded</strong> sourmash functionality?</p>
<p>Yep. Using our test metagenome, the SRR606249 mock community, you can search all 400,000 genomes in the GTDB rs214 release in around 2 minutes, using under 2 GB of RAM and 64 cores. That's roughly 7-fold lower memory than regular ol' sourmash, and approximately 20x faster. Even cooler, if you index GTDB first, you can do it in 600 MB of RAM!</p>
<table>
<thead>
<tr>
<th>software/version</th>
<th>command</th>
<th>details</th>
<th>time</th>
<th>max RAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>sourmash v4.8.6</td>
<td><code>gather</code></td>
<td>the OG</td>
<td>42m 26s</td>
<td>14.5 GB</td>
</tr>
<tr>
<td>branchwater v0.9.0</td>
<td><code>fastgather</code></td>
<td>against zip</td>
<td><span style="color:green"><strong>2m 5s</strong></span></td>
<td><span style="color:red"><strong>14.1 GB^</strong></span></td>
</tr>
<tr>
<td>branchwater v0.9.0</td>
<td><code>fastgather</code></td>
<td>against pathlist</td>
<td><span style="color:green"><strong>2m 26s</strong></span></td>
<td><span style="color:green"><strong>1.8 GB</strong></span></td>
</tr>
<tr>
<td>branchwater v0.9.0</td>
<td><code>fastgather</code></td>
<td>against manifest</td>
<td><span style="color:green"><strong>2m 19s</strong></span></td>
<td><span style="color:green"><strong>1.9 GB</strong></span></td>
</tr>
<tr>
<td>branchwater v0.9.0</td>
<td><code>fastmultigather</code></td>
<td>against rocksdb</td>
<td><span style="color:green"><strong>2m 8s</strong></span></td>
<td><span style="color:green"><strong>600 MB</strong></span></td>
</tr>
<tr>
<td>branchwater v0.8.6</td>
<td><code>fastgather</code></td>
<td>against pathlist</td>
<td>2m 24s</td>
<td>1.6 GB</td>
</tr>
<tr>
<td>branchwater v0.8.6</td>
<td><code>fastgather</code></td>
<td>against zip</td>
<td><span style="color:purple"><strong>28m 34s</strong></span></td>
<td>1.7 GB</td>
</tr>
</tbody>
</table>
<p>^ This benchmark number isn't really real, despite it being reported under Max RSS. The measurement is high because the zip library we're using in Rust uses <code>memmap</code> - actual heap consumption is in the 2 GB range, matching the other approaches. See <a href="https://github.com/sourmash-bio/sourmash/issues/2340">sourmash#2340</a> for more info. </p>
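<p>For orientation, here's roughly what the benchmarked <code>fastgather</code> run looks like on the command line. Treat this as a hedged sketch rather than exact syntax: the <code>sourmash sketch</code> line follows the usage shown elsewhere on this blog, but the database filename and options here are illustrative - check the <a href="https://github.com/sourmash-bio/sourmash_plugin_branchwater">branchwater plugin README</a> for the current invocation.</p>
<div class="highlight"><pre><span></span><code># sketch the query metagenome, then gather against a zipfile database
# (filenames are placeholders; see the plugin docs for all options)
sourmash sketch dna -p k=31 SRR606249.fastq.gz -o SRR606249.sig
sourmash scripts fastgather SRR606249.sig gtdb-rs214.zip -o results.csv
</code></pre></div>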
<p>Anyhoo. sourmash v4.8.6 and sourmash_plugin_branchwater v0.9.0 are both available via conda & conda-forge. Enjoy!</p>
<p>--titus</p>
<h1>The history of the "Tragedy of the Commons" (2024-01-13, C. Titus Brown)</h1>
<p>No, just no.</p>
<p>I've been really interested in applying lessons from
<a href="https://en.wikipedia.org/wiki/Common-pool_resource">common pool resource theory</a>
to my own work and interests in open source and open science
<a href="http://ivory.idyll.org/blog/tag/cpr.html">(see my various posts)</a>.
The
<a href="https://en.wikipedia.org/wiki/Common-pool_resource#Common_property_protocols">framework around this created by Dr. Elinor Ostrom</a>,
for which she received the Nobel Prize in Economics, is awe-inspiring
and incredibly motivational! I've also thoroughly enjoyed the <a href="https://podcasts.apple.com/us/podcast/frontiers-of-commoning-with-david-bollier/id1501085005">Frontiers of Commoning podcast</a> that David Bollier runs, which showcases many ongoing communities and efforts in these areas.</p>
<p>All of this is strongly coupled (negatively) to the well-known concept
of the Tragedy of the Commons, published in 1968 by
<a href="https://en.wikipedia.org/wiki/Garrett_Hardin">Dr. Garret Hardin</a>, a
professor at UCSB. It turns out that Hardin was not only very wrong
(see above links on CPR!) but also
<a href="https://blogs.scientificamerican.com/voices/the-tragedy-of-the-tragedy-of-the-commons/">a terrible person</a>,
and if you care to read the Tragedy of the Commons article, it's,
well, very bad (ibid). (If you prefer a podcast to reading,
<a href="https://srslywrong.com/podcast/235-the-imaginary-tragedy-of-the-hypothetical-commons/">here's one that looks good</a>,
from srsly wrong.)</p>
<p>Anyway, I find CPR theory tremendously inspiring, and it provides
wonderful counterexamples to the beliefs that only strong hierarchy,
authoritarian governance, and/or corporate enclosure can work to
manage resources. Highly recommended. Always happy to chat, although I'd
suggest just reading widely instead, since I'm by no means an expert on any
of this!</p>
<p>salud!</p>
<p>--titus</p>
<h1>Sourmash and branchwater licensing: thoughts on extractive engagement with projects (2024-01-07, C. Titus Brown)</h1>
<p>What licenses should be used, for what purpose?</p>
<p>I am helping maintain some petabase-scale genomic search
infrastructure as part of the
<a href="https://sourmash.readthedocs.io/">sourmash</a> and
<a href="https://branchwater.sourmash.bio/">branchwater</a> projects. One of the
questions that's frequently in the back of my mind is how to
incentivize
<a href="http://ivory.idyll.org/blog/2018-how-open-is-too-open.html">commons-style engagement rather than extractive engagement</a>,
and a key tool for this purpose is licensing.</p>
<p>Sourmash is BSD-licensed, which, in essence, means that anyone can do
whatever they want with the code - including incorporating it
unchanged into a commercial closed-source product, rebranding it as a
new product, and/or changing it in incompatible ways (and then
rebranding it as a new and better product). This is typically
something that companies will do, although it also happens with open
source forks. (See: <a href="https://www.infoq.com/news/2021/04/amazon-opensearch/">Elasticsearch to OpenSearch</a>; and <a href="https://matrix.org/blog/2023/11/06/future-of-synapse-dendrite/">Matrix</a>).</p>
<p>Branchwater, our internal code-name for the collection of
sourmash-based functionality that enables petabase-scale search, is
<a href="https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/60">licensed under AGPL</a>. This
means that anyone can use it however they want, as long as they
release any modifications they make to the source code. In particular,
this also applies to people providing a service based on the
branchwater code:</p>
<blockquote>
<p>Let’s say you create a software program. Another developer takes and
modifies it, and then provides access to that modification to paying
customers through a software-as-a-service model. Under the GPL v3,
that modification would essentially become proprietary because it
wasn’t technically distributed. Under AGPL, however, that developer
would need to make their modified source code available for
download. <a href="https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/60">(link)</a></p>
</blockquote>
<p>IIRC, there are a couple of reasons that Dr. Luiz Irber (the initial
author of the branchwater code, and the originator of most of the
branchwater code and supporting infrastructure) chose AGPL. One of the
main ones (again, IIRC) is to discourage incompatible forks of the
source code. But it also discourages many kinds of extractive
behavior: a company could not, for example, take this code, modify it
in sekret ways, and provide services based upon that sekrecy, without
providing the modified code openly under the AGPL license.</p>
<p>You could argue that the AGPL license decreases certain kinds of
uptake. Perhaps so, and I chose the BSD license for sourmash (with
Luiz's OK, albeit in a situation where I was his supervisor...)
specifically to encourage uptake, reuse, modification, and
experimentation. I don't know how to evaluate the success of this
choice, really, other than to say that I still don't see a blindingly
obvious downside to it (as of Jan 5, 2024 :).</p>
<p>At the end of the day, my thoughts trend towards seeing the value in
sourmash as less algorithmic innovation and more infrastructure
innovation. We are maintaining and sustaining a very functional and
useful piece of software, with good documentation and an
ever-expanding range of use cases. And it remains very useful to me
and my lab, specifically. Not only do I not care if companies extract
value from it - there are many ways to skin this particular cat - but
I am happy and excited that my labor as an academic is actually useful
to someone else.</p>
<p>On the flip side, branchwater is both more niche and more
difficult. There aren't many ways to do petabase-scale search, and
there is a lot more infrastructure maintenance involved. I would be
sad to see someone take our (collective) investment in this
functionality and build upon it without returning something to the
community of developers.</p>
<p>I'm not sure what and where the dividing line between these two
situations is for me. But I think sketching out the current line is a
good start :).</p>
<p>--titus</p>
<h1>snakemake for doing bioinformatics - inputs and outputs and more! (2023-04-07, C. Titus Brown)</h1>
<p>Slithering your way into bioinformatics with snakemake - inputs and outputs and more!</p>
<h1><code>input:</code> and <code>output:</code> blocks</h1>
<p>As we saw <a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-section-1.html">before</a>, snakemake will automatically
"chain" rules by connecting inputs to outputs. That is, snakemake
will figure out <em>what to run</em> in order to produce the desired output,
even if it takes many steps.</p>
<p>We also saw that snakemake will fill
in <code>{input}</code> and <code>{output}</code> in the shell command based on the contents
of the <code>input:</code> and <code>output:</code> blocks. This becomes even more useful
when using wildcards to generalize rules, where wildcard values are properly
substituted into the <code>{input}</code> and <code>{output}</code> values.</p>
<p>Input and output blocks are key components of snakemake workflows.
Below, we will discuss the use of input and output blocks
a bit more comprehensively.</p>
<h2>Providing inputs and outputs</h2>
<p>As we saw previously, snakemake will happily take multiple input and
output values via comma-separated lists and substitute them into strings
in shell blocks.</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">example</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"file1.txt"</span><span class="p">,</span>
<span class="s2">"file2.txt"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"output file1.txt"</span><span class="p">,</span>
<span class="s2">"output file2.txt"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo {input:q}</span>
<span class="s2"> echo {output:q}</span>
<span class="s2"> touch {output:q}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>When these are substituted into shell commands with <code>{input}</code> and
<code>{output}</code> they will be turned into space-separated ordered lists:
e.g. the above shell command will print out first <code>file1.txt
file2.txt</code> and then <code>output file1.txt output file2.txt</code> before using <code>touch</code> to
create the empty output files.</p>
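<p>Concretely, running this rule should print something like the following before creating the two (empty) output files:</p>
<div class="highlight"><pre><span></span><code>file1.txt file2.txt
output file1.txt output file2.txt
</code></pre></div>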
<p>In this example we are also asking snakemake to quote filenames for
the shell command using <code>:q</code> - this means that if there are spaces,
characters like single or double quotation marks, or other characters
with special meaning they will be properly escaped using
<a href="https://docs.python.org/3/library/shlex.html#shlex.quote">Python's shlex.quote function</a>.
For example, here both output files contain a space, and so <code>touch
{output}</code> would create three files -- <code>output</code>, <code>file1.txt</code>, and
<code>file2.txt</code> -- rather than the correct two files, <code>output file1.txt</code>
and <code>output file2.txt</code>.</p>
<p><strong>Quoting filenames with <code>{...:q}</code> should always be used for anything
executed in a shell block</strong> - it does no harm and it can prevent
serious bugs!</p>
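<p>Under the hood this is plain Python string quoting. A minimal sketch of what <code>:q</code> does to each filename, using the <code>shlex.quote</code> function mentioned above:</p>
<div class="highlight"><pre><span></span><code>import shlex

# safe filenames pass through unchanged...
print(shlex.quote("file1.txt"))         # file1.txt

# ...while filenames with spaces get quoted for the shell
print(shlex.quote("output file1.txt"))  # 'output file1.txt'
</code></pre></div>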
<h3>Digression: Where can we (and should we) put commas?</h3>
<p>In the above code example, you will notice that <code>"file2.txt"</code> and
<code>"output file2.txt"</code> have commas after them:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">example</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"file1.txt"</span><span class="p">,</span>
<span class="s2">"file2.txt"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"output file1.txt"</span><span class="p">,</span>
<span class="s2">"output file2.txt"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo {input:q}</span>
<span class="s2"> echo {output:q}</span>
<span class="s2"> touch {output:q}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Are these required? <strong>No.</strong> The above code is equivalent to:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">example</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"file1.txt"</span><span class="p">,</span>
<span class="s2">"file2.txt"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"output file1.txt"</span><span class="p">,</span>
<span class="s2">"output file2.txt"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo {input:q}</span>
<span class="s2"> echo {output:q}</span>
<span class="s2"> touch {output:q}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>where there are no commas after the last line in input and output.</p>
<p>The general rule is this: you need internal commas to separate items
in the list, because otherwise strings will be concatenated to each
other - i.e. <code>"file1.txt" "file2.txt"</code> will become <code>"file1.txtfile2.txt"</code>,
even if there's a newline between them! But a comma trailing after the
last filename is optional (and ignored).</p>
<p>Why!? These are <em>Python tuples</em> and you can add a trailing comma if
you like: <code>a, b, c,</code> is equivalent to <code>a, b, c</code>.</p>
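<p>You can check both behaviors in plain Python - adjacent string literals silently concatenate, while commas (trailing or not) build a tuple:</p>
<div class="highlight"><pre><span></span><code># no comma: the two literals become ONE string
files = ("file1.txt"
         "file2.txt")
print(files)    # file1.txtfile2.txt

# commas, including an optional trailing one: a tuple of two strings
files = ("file1.txt",
         "file2.txt",)
print(files)    # ('file1.txt', 'file2.txt')
</code></pre></div>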
<p>So why do we add a trailing comma?! I suggest using trailing commas
because it makes it easy to add a new input or output without
forgetting to add a comma, and this is a mistake I make frequently!
This is a (small and simple but still useful) example of <em>defensive
programming</em>, where we can use optional syntax rules to head off common
mistakes.</p>
<h2>Inputs and outputs are <em>ordered lists</em></h2>
<p>We can also refer to individual input and output entries by using
square brackets to index them as lists, starting with position 0:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">example</span><span class="p">:</span>
<span class="o">...</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo first input is {input[0]:q}</span>
<span class="s2"> echo second input is {input[1]:q}</span>
<span class="s2"> echo first output is {output[0]:q}</span>
<span class="s2"> echo second output is {output[1]:q}</span>
<span class="s2"> touch </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>However, <strong>we don't recommend this</strong> because it's fragile. If you
change the order of the inputs and outputs, or add new inputs, you
have to go through and adjust the indices to match. Relying on the
number and position of indices in a list is error prone and will make
your Snakefile harder to change later on!</p>
<h2>Using keywords for input and output files</h2>
<p>You can also name specific inputs and outputs using the <em>keyword</em>
syntax, and then refer to those using <code>input.</code> and <code>output.</code> prefixes.
The following Snakefile rule does this:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">example</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">a</span><span class="o">=</span><span class="s2">"file1.txt"</span><span class="p">,</span>
<span class="n">b</span><span class="o">=</span><span class="s2">"file2.txt"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="n">a</span><span class="o">=</span><span class="s2">"output file1.txt"</span><span class="p">,</span>
<span class="n">c</span><span class="o">=</span><span class="s2">"output file2.txt"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo first input is {input.a:q}</span>
<span class="s2"> echo second input is {input.b:q}</span>
<span class="s2"> echo first output is {output.a:q}</span>
<span class="s2"> echo second output is {output.c:q}</span>
<span class="s2"> touch {output:q}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Here, <code>a</code> and <code>b</code> in the input block, and <code>a</code> and <code>c</code> in the output block,
are keyword names for the input and output files; in the shell command,
they can be referred to with <code>{input.a}</code>, <code>{input.b}</code>, <code>{output.a}</code>, and
<code>{output.c}</code> respectively. Any valid variable name can be used, and the
same name can be used in the input and output blocks without collision,
as with <code>input.a</code> and <code>output.a</code>, above, which are distinct values.</p>
<p><strong>This is our recommended way of referring to specific input and
output files.</strong> It is clearer to read, robust to rearrangements or
additions, and (perhaps most importantly) can help guide the reader
(including "future you") to the <em>purpose</em> of each input and output.</p>
<p>If you use the wrong keyword names in your shell code, you'll get an
error message. For example, this code:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">example</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">a</span><span class="o">=</span><span class="s2">"file1.txt"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="n">a</span><span class="o">=</span><span class="s2">"output file1.txt"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo first input is {input.z:q}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>gives this error message:</p>
<div class="highlight"><pre><span></span><code>AttributeError: 'InputFiles' object has no attribute 'z', when formatting the following:
echo first input is {input.z:q}
</code></pre></div>
<h2>Example: writing a flexible command line</h2>
<p>One example where it's particularly useful to be able to refer to
specific inputs is when running programs on files where the input
filenames need to be specified as optional arguments. One such
program is the <code>megahit</code> assembler when it runs on paired-end input
reads. Consider the following Snakefile:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"assembly_out"</span>
<span class="n">rule</span> <span class="n">assemble</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">R1</span><span class="o">=</span><span class="s2">"sample_R1.fastq.gz"</span><span class="p">,</span>
<span class="n">R2</span><span class="o">=</span><span class="s2">"sample_R2.fastq.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="n">directory</span><span class="p">(</span><span class="s2">"assembly_out"</span><span class="p">)</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> megahit -1 </span><span class="si">{input.R1}</span><span class="s2"> -2 </span><span class="si">{input.R2}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>In the shell command here, we need to supply the input reads as two
separate files, with <code>-1</code> before one and <code>-2</code> before the second. As a
bonus the resulting shell command is very readable!</p>
<h2>Input functions and more advanced features</h2>
<p>There are a number of more advanced uses of input and output that rely
on Python programming - for example, one can define a Python function
that is called to <em>generate</em> a value dynamically, as below -</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">multiply_by_5</span><span class="p">(</span><span class="n">w</span><span class="p">):</span>
<span class="k">return</span> <span class="sa">f</span><span class="s2">"file</span><span class="si">{</span><span class="nb">int</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">val</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">5</span><span class="si">}</span><span class="s2">.txt"</span>
<span class="n">rule</span> <span class="n">make_file</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="c1"># look for input file{val*5}.txt if asked to create output{val}.txt</span>
<span class="n">filename</span><span class="o">=</span><span class="n">multiply_by_5</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"output</span><span class="si">{val}</span><span class="s2">.txt"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> cp </span><span class="si">{input}</span><span class="s2"> {output:q}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>When asked to create <code>output5.txt</code>, this rule will look for
<code>file25.txt</code> as an input.</p>
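<p>To try this out, you could ask snakemake for a matching target by name - something like the below, which will fail unless a <code>file25.txt</code> actually exists to be copied:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 1 output5.txt
</code></pre></div>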
<p>Since this functionality relies on knowledge of
<a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-wildcards.html">wildcards</a> as well as some knowledge of Python, it's too advanced
to talk about here. More on that later!</p>
<h2>References and Links</h2>
<ul>
<li><a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-and-rules">Snakemake manual section on rules</a></li>
</ul>
<h1><code>params:</code> blocks and <code>{params}</code></h1>
<p>As we saw above, input and output blocks are key to the way snakemake works: they let
snakemake automatically connect rules based on the inputs necessary
to create the desired output. However, input and output blocks are
limited in certain ways: most specifically, every entry in both input
and output blocks <em>must</em> be a filename. And, because of the way
snakemake works, the filenames specified in the input and output
blocks must exist in order for the workflow to proceed past that
rule.</p>
<p>Frequently, shell commands need to take parameters other than
filenames, and these parameters may be values that can or should be
calculated by snakemake. Therefore, snakemake also supports a
<code>params:</code> block that can be used to provide parameter strings that are <em>not</em>
filenames in the shell block. As
you'll see below, these can be used for a variety of purposes,
including user-configurable parameters as well as parameters that can
be calculated automatically by Python code.</p>
<h2>A simple example of a params block</h2>
<p>Consider:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">use_params</span><span class="p">:</span>
<span class="n">params</span><span class="p">:</span>
<span class="n">val</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"output.txt"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo </span><span class="si">{params.val}</span><span class="s2"> > </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Here, the value <code>5</code> is assigned to the name <code>val</code> in the <code>params:</code> block,
and is then available under the name <code>{params.val}</code> in the <code>shell:</code> block.
This is analogous to using keywords in input and output blocks, but unlike in
input and output blocks, keywords <em>must</em> be used in params blocks.</p>
<p>In this example, there's no gain in functionality, but there is some
gain in readability: the syntax makes it clear that <code>val</code> is a tunable
parameter that can be modified without understanding the details of
the shell block.</p>
<h2>Params blocks have access to wildcards</h2>
<p>Just like the <code>input:</code> and <code>output:</code> blocks, wildcard values are
directly available in <code>params:</code> blocks without using the <code>wildcards</code>
prefix; for example, this means that you can use them in strings with
the standard <a href="https://docs.python.org/3/library/string.html#formatspec">string formatting operations</a>.</p>
<p>This is useful when a shell command needs to use something other than
the filename - for example, the <code>bowtie</code> read alignment software takes
the <em>prefix</em> of the output SAM file via <code>-S</code>, which means you cannot
name the file correctly with <code>bowtie ... -S {output}</code>. Instead, you
could use <code>{params.prefix}</code> like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"reads.sam"</span>
<span class="n">rule</span> <span class="n">use_params</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}</span><span class="s2">.fq"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}</span><span class="s2">.sam"</span><span class="p">,</span>
<span class="n">params</span><span class="p">:</span>
<span class="n">prefix</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{prefix}</span><span class="s2">"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> bowtie index -U </span><span class="si">{input}</span><span class="s2"> -S </span><span class="si">{params.prefix}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>If you were to use <code>-S {output}</code> here, you would end up producing a file
<code>reads.sam.sam</code>!</p>
<h2>Links and references:</h2>
<ul>
<li>Snakemake docs: <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#non-file-parameters-for-rules">non-file parameters for rules</a></li>
</ul>
<h1>Using <code>expand</code> to generate filenames</h1>
<p><a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-wildcards.html">Snakemake wildcards</a> make it easy to apply rules to
many files, but also create a new challenge: how do you generate all the
filenames you want?</p>
<p>As an example of this challenge, consider the list of genomes needed
for rule <code>compare_genomes</code> from <a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-section-2.html">before</a> -</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_008423265.1.fna.gz.sig"</span><span class="p">,</span>
</code></pre></div>
<p>This list is critical because it specifies the sketches to be created
by the wildcard rule. However, writing this list out is annoying and
error prone, because parts of every filename are identical and
repeated.</p>
<p>Even worse, if you need to use this list in multiple places, or
produce slightly different filenames with the same accessions, things
get even more error prone: you are likely to want to add, remove, or edit
elements of the list, and you will need to change it in multiple
places.</p>
<p><a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-section-2.html">Previously</a>, we showed how to change this to a list of the
accessions at the top of the Snakefile and then used a function called
<code>expand</code> to generate the list:</p>
<div class="highlight"><pre><span></span><code><span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"GCF_000017325.1"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1"</span><span class="p">,</span>
<span class="s2">"GCF_008423265.1"</span><span class="p">]</span>
<span class="c1">#...</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">),</span>
</code></pre></div>
<p>Using <code>expand</code> to generate lists of filenames is a common pattern in
Snakefiles, and we'll explore it more below!</p>
<h2>Using <code>expand</code> with a single pattern and one list of values</h2>
<p>In the example above, we provide a single pattern, <code>{acc}.fna.gz.sig</code>,
and ask <code>expand</code> to resolve it into many filenames by filling in values for
the field name <code>acc</code> from each element in <code>ACCESSIONS</code>. (You may recognize
the keyword syntax for specifying values, <code>acc=ACCESSIONS</code>, from
input and output blocks, above!)</p>
<p>The result of <code>expand('{acc}.fna.gz.sig', acc=...)</code> here is
<em>identical</em> to writing out the four filenames in long form:</p>
<div class="highlight"><pre><span></span><code>"GCF_000017325.1.fna.gz.sig",
"GCF_000020225.1.fna.gz.sig",
"GCF_000021665.1.fna.gz.sig",
"GCF_008423265.1.fna.gz.sig"
</code></pre></div>
<p>That is, <code>expand</code> doesn't do any special wildcard matching or pattern
inference - it just fills in the values and returns the resulting list.</p>
<p>Here, <code>ACCESSIONS</code> can be any Python <em>iterable</em> - for example a list, a tuple,
or a dictionary.</p>
<h2>Using <code>expand</code> with multiple lists of values</h2>
<p>You can also use <code>expand</code> with multiple field names. Consider:</p>
<div class="highlight"><pre><span></span><code>expand('{acc}.fna.{extension}`, acc=ACCESSIONS, extension=['.gz.sig', .gz'])
</code></pre></div>
<p>This will produce the following eight filenames:</p>
<div class="highlight"><pre><span></span><code>"GCF_000017325.1.fna.gz.sig",
"GCF_000017325.1.fna.gz",
"GCF_000020225.1.fna.gz.sig",
"GCF_000020225.1.fna.gz",
"GCF_000021665.1.fna.gz.sig",
"GCF_000021665.1.fna.gz",
"GCF_008423265.1.fna.gz.sig",
"GCF_008423265.1.fna.gz"
</code></pre></div>
<p>by substituting <em>all possible</em> combinations of <code>acc</code> and <code>extension</code> into
the provided pattern.</p>
<h2>Generating <em>all</em> combinations vs <em>pairwise</em> combinations</h2>
<p>As we saw above, with multiple lists of values, <code>expand</code> will generate all
possible combinations: that is,</p>
<div class="highlight"><pre><span></span><code><span class="n">X</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="n">Y</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'a'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">,</span> <span class="s1">'c'</span><span class="p">]</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s1">'</span><span class="si">{x}</span><span class="s1">.by.</span><span class="si">{y}</span><span class="s1">'</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">Y</span><span class="p">)</span>
</code></pre></div>
<p>will generate 9 filenames: <code>1.by.a</code>, <code>1.by.b</code>, <code>1.by.c</code>, <code>2.by.a</code>, etc.
And if you added a third field to the <code>expand</code> pattern, <code>expand</code> would
also add that into the combinations!</p>
<p>So what's going on here?</p>
<p>By default, expand does an all-by-all expansion containing all
possible combinations. (This is sometimes
called a Cartesian product, a cross-product, or an outer join.)</p>
<p>But you don't always want that. How can we change this behavior?</p>
<p>The <code>expand</code> function takes an optional second argument, the
combinator, which tells <code>expand</code> how to combine the lists of values
that come after. By default <code>expand</code> uses a Python function called
<code>itertools.product</code>, which creates all possible combinations, but you
can give it other functions.</p>
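<p>You can see the difference between the default combinator and <code>zip</code> in plain Python:</p>
<div class="highlight"><pre><span></span><code>import itertools

X = [1, 2, 3]
Y = ['a', 'b', 'c']

# all-by-all, as expand uses by default: 9 combinations
print(list(itertools.product(X, Y)))
# [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), ..., (3, 'c')]

# pairwise: 3 combinations
print(list(zip(X, Y)))
# [(1, 'a'), (2, 'b'), (3, 'c')]
</code></pre></div>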
<p>In particular, you can tell <code>expand</code> to create pairwise combinations
by using <code>zip</code> instead - something we did in one of the
<a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-wildcards.html">wildcard examples</a>.</p>
<p>Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="n">X</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="n">Y</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'a'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">,</span> <span class="s1">'c'</span><span class="p">]</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s1">'</span><span class="si">{x}</span><span class="s1">.by.</span><span class="si">{y}</span><span class="s1">'</span><span class="p">,</span> <span class="nb">zip</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">Y</span><span class="p">)</span>
</code></pre></div>
<p>which will now generate only three filenames: <code>1.by.a</code>, <code>2.by.b</code>, and <code>3.by.c</code>.</p>
<p>The big caveat here is that <code>zip</code> will create an output list the length
of the shortest input list - so if you give it one list of three elements,
and one list of two elements, it will only use two elements from the first
list.</p>
<p>For example, in the <code>expand</code> in this <code>Snakefile</code>,</p>
<div class="highlight"><pre><span></span><code><span class="n">X</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="n">Y</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'a'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">]</span>
<span class="n">rule</span> <span class="n">all_zip_short</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s1">'</span><span class="si">{x}</span><span class="s1">.by.</span><span class="si">{y}</span><span class="s1">'</span><span class="p">,</span> <span class="nb">zip</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">Y</span><span class="p">)</span>
</code></pre></div>
<p>only <code>1.by.a</code> and <code>2.by.b</code> will be generated, as there is no partner
for <code>3</code> in the second list.</p>
<p>For more information see the <a href="https://snakemake.readthedocs.io/en/stable/project_info/faq.html#i-don-t-want-expand-to-use-the-product-of-every-wildcard-what-can-i-do">snakemake documentation on using zip instead of product</a>.</p>
<h2>Getting a list of identifiers to use in <code>expand</code></h2>
<p>The <code>expand</code> function provides an effective solution when you have
lists of identifiers that you use multiple times in a workflow - a common
pattern in bioinformatics! Writing these lists out in a Snakefile
(as we do in the above examples) is not always practical, however;
you may have dozens to hundreds of identifiers!</p>
<p>Lists of identifiers can be loaded from <em>other</em> files in a variety of
ways, and they can also be generated from the set of actual files in
a directory using <code>glob_wildcards</code>.</p>
<h2>Examples of loading lists of accessions from files or directories</h2>
<h3>Loading a list of accessions from a text file</h3>
<p>If you have a simple list of accessions in a text file
<code>accessions.txt</code>, like so:</p>
<p>File <code>accessions.txt</code>:</p>
<div class="highlight"><pre><span></span><code>GCF_000017325.1
GCF_000020225.1
GCF_000021665.1
GCF_008423265.1
</code></pre></div>
<p>then the following code will load each line of the text file as a separate ID.</p>
<p>Snakefile to load <code>accessions.txt</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'accessions.txt'</span><span class="p">,</span> <span class="s1">'rt'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fp</span><span class="p">:</span>
<span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="n">fp</span><span class="o">.</span><span class="n">readlines</span><span class="p">()</span>
<span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="p">[</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">ACCESSIONS</span> <span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'ACCESSIONS is a Python list of length </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>The <code>sketch_genome</code> rule then builds a sourmash signature for each accession.</p>
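<p>One small refinement, in case your <code>accessions.txt</code> ever contains blank lines or stray whitespace: a slightly more defensive version of the loading code (same result for a clean file) skips empty lines entirely.</p>
<div class="highlight"><pre><span></span><code>with open('accessions.txt', 'rt') as fp:
    # strip whitespace and drop any blank lines
    ACCESSIONS = [ line.strip() for line in fp if line.strip() ]
</code></pre></div>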
<h3>Loading a specific column from a CSV file</h3>
<p>If instead of a text file you have a CSV file with multiple columns,
and the IDs to load are all in one column, you can use the Python
<a href="https://pandas.pydata.org/">pandas library</a> to read in the CSV. In
the code below, <code>pandas.read_csv</code> loads the CSV into a pandas
DataFrame object, and then we select the <code>accession</code> column and use
that as an iterable.</p>
<p>File <code>accessions.csv</code>:</p>
<div class="highlight"><pre><span></span><code>accession,information
GCF_000017325.1,genome 1
GCF_000020225.1,genome 2
GCF_000021665.1,genome 3
GCF_008423265.1,genome 4
</code></pre></div>
<p>Snakefile to load <code>accessions.csv</code>:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span>
<span class="n">CSV_DATAFRAME</span> <span class="o">=</span> <span class="n">pandas</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'accessions.csv'</span><span class="p">)</span>
<span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="n">CSV_DATAFRAME</span><span class="p">[</span><span class="s1">'accession'</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'ACCESSIONS is a pandas Series of length </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<h3>Loading from the config file</h3>
<p>Snakemake also supports the use of configuration files, where the
snakefile supplies the name of a default config file (which can in
turn be overridden on the command line).</p>
<p>A config file can also be a good place to put accessions. Consider:</p>
<div class="highlight"><pre><span></span><code><span class="nt">accessions</span><span class="p">:</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">GCF_000017325.1</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">GCF_000020225.1</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">GCF_000021665.1</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">GCF_008423265.1</span>
</code></pre></div>
<p>which is used by the following Snakefile.</p>
<p>Snakefile to load accessions from <code>config.yml</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">configfile</span><span class="p">:</span> <span class="s2">"config.yml"</span>
<span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="n">config</span><span class="p">[</span><span class="s1">'accessions'</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'ACCESSIONS is a Python list of length </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Here, <code>config.yml</code> is a <a href="https://en.wikipedia.org/wiki/YAML">YAML file</a>,
which is a human-readable format that can also be read by computers.
We will talk about config files later!</p>
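<p>As a preview, the config file named by <code>configfile:</code> can typically be swapped out at run time with snakemake's <code>--configfile</code> option - for example:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 1 --configfile other-config.yml
</code></pre></div>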
<h3>Using <code>glob_wildcards</code> to load IDs or accessions from a set of files</h3>
<p>We introduced the <code>glob_wildcards</code> command briefly in the
<a href="https://ivory.idyll.org/blog/2023-snakemake-slithering-wildcards.html">post on wildcards</a>: <code>glob_wildcards</code> does pattern matching on
files <em>actually present in the directory</em>. </p>
<p>Here's a Snakefile that uses <code>glob_wildcards</code> to get the four accessions
from the actual filenames:</p>
<div class="highlight"><pre><span></span><code><span class="n">GLOB_RESULTS</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"genomes/</span><span class="si">{acc}</span><span class="s2">.fna.gz"</span><span class="p">)</span>
<span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="n">GLOB_RESULTS</span><span class="o">.</span><span class="n">acc</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'ACCESSIONS is a Python list of length </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>This is a particularly convenient way to get a list of accessions,
but it can also be dangerous: it is easy to accidentally delete a file
and then not notice that a sample is missing! For that reason, in many
situations we suggest providing an independent list of files to load.</p>
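<p>If you do use <code>glob_wildcards</code>, one lightweight guard (our suggestion, not required by snakemake) is to fail loudly when the glob finds nothing at all:</p>
<div class="highlight"><pre><span></span><code>GLOB_RESULTS = glob_wildcards("genomes/{acc}.fna.gz")
ACCESSIONS = GLOB_RESULTS.acc

# fail early if the glob matched nothing - e.g. wrong working directory
assert len(ACCESSIONS) > 0, "no genomes found in genomes/ - check your path!"
</code></pre></div>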
<h2>Wildcards and <code>expand</code> - some closing thoughts</h2>
<p>Combined with wildcards, <code>expand</code> is extremely powerful and useful.
Just like wildcards, however, this power comes with some complexity.
Here is a brief rundown of how these features combine.</p>
<p>The <code>expand</code> function makes a <em>list of files to create</em> from a pattern and
a list of values to fill in.</p>
<p>Wildcards in rules provide <em>recipes</em> to create files whose names match a
pattern.</p>
<p>Typically in Snakefiles we use <code>expand</code> to generate a list of files that
match a certain pattern, and then write a rule that uses wildcards to
generate those actual files.</p>
<p>The list of values to use with <code>expand</code> can come from many places, including
text files, CSV files, and config files. It can <em>also</em> come from
<code>glob_wildcards</code>, which uses a pattern to <em>extract</em> the list of values from
files that are actually present.</p>
<h2>Links and references</h2>
<ul>
<li><a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function">snakemake reference documentation for expand</a></li>
<li>The <a href="https://docs.python.org/3/library/itertools.html">Python <code>itertools</code></a> documentation.</li>
</ul>
<h1>snakemake for doing bioinformatics - using wildcards to generalize your rules (2023-03-03, C. Titus Brown)</h1>
<p>Slithering your way into bioinformatics with snakemake, wildcard version</p>
<p>As we showed <a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-section-2.html">in a previous blog post</a>,
when you have repeated
substrings between input and output, you can extract them into
wildcards - going from a rule that makes specific outputs:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes_1</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/GCF_000017325.1.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
</code></pre></div>
<p>to a rule that makes any output that fits a pattern:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes_1</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Here, <code>{accession}</code> is a wildcard that "fills in" as needed for any filename
that is under the <code>genomes/</code> directory and ends with <code>.fna.gz</code>.</p>
<p>Snakemake uses simple <em>pattern matching</em> to determine the value of
<code>{accession}</code> - if asked for a filename ending in <code>.fna.gz.sig</code>, snakemake
takes the prefix, and then looks for the matching input file
<code>genomes/{accession}.fna.gz</code>, and fills in <code>{input}</code> accordingly.</p>
<p>This is incredibly useful and means that in many cases you can write
a single rule that can generate hundreds or thousands of files!</p>
<p>However, there are a few subtleties to consider. In this
chapter, we're going to cover the most important of those subtleties, and
provide links where you can learn more.</p>
<h2>Rules for wildcards</h2>
<p>First, let's go through some basic rules for wildcards.</p>
<h3>Wildcards are determined by the desired output</h3>
<p>The first and most important rule of wildcards is this: snakemake
fills in wildcard values based on the filename it is asked to produce.</p>
<p>Consider the following rule:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">a</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}</span><span class="s2">.a.out"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"touch </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>The wildcard in the output block will match <em>any</em> file that ends with
<code>.a.out</code>, and the associated shell command will create it! This is both
powerful and constraining: you can create any file with the suffix
<code>.a.out</code> - but you also need to <em>ask</em> for the file to be created.</p>
<p>This means that in order to make use of this rule, there needs to be
another rule that has a file that ends in <code>.a.out</code> as a required input.
(You can also explicitly ask for such a file on the command line.)
There's otherwise no way for snakemake to determine the
value of the wildcard: snakemake follows the dictum that explicit is
better than implicit, and it will not guess at what files you want created.</p>
<p>For example, the above rule could be paired with another rule that asks
for one or more filenames ending in <code>.a.out</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">make_me_a_file</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"result1.a.out"</span><span class="p">,</span>
<span class="s2">"result2.a.out"</span><span class="p">,</span>
</code></pre></div>
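<p>As mentioned above, you can also ask for such a file explicitly on the
command line - a minimal example, where the target filename is illustrative:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 1 result1.a.out
</code></pre></div>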
<p>This also means that once you put a wildcard in a
rule, you can no longer run that rule by the rule name - you have to
ask for a filename, instead. If you try to run a rule that contains a
wildcard but don't tell it what filename you want to create, you'll get:</p>
<div class="highlight"><pre><span></span><code>Target rules may not contain wildcards.
</code></pre></div>
<p>One common way to work with wildcard rules is to have another rule that
uses <code>expand</code> to construct a list of desired files; this is often paired
with <code>glob_wildcards</code> to load a list of wildcard values. See the recipe for
renaming files by prefix, below.</p>
<h3>All wildcards used in a rule must appear in the <code>output:</code> block</h3>
<p>snakemake uses the wildcards in the <code>output:</code> block to fill in the wildcards
elsewhere in the rule, so you can only use wildcards mentioned in <code>output:</code>.</p>
<p>So, for example, every wildcard in the <code>input:</code> block needs to be used
in <code>output:</code>. Consider the following example, where the input block
contains a wildcard <code>analysis</code> that is not used in the output block:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># this does not work:</span>
<span class="n">rule</span> <span class="n">analyze_sample</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{sample}</span><span class="s2">.x.</span><span class="si">{analysis}</span><span class="s2">.in"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{sample}</span><span class="s2">.out"</span>
</code></pre></div>
<p>This doesn't work because snakemake doesn't know how to fill in the
<code>analysis</code> wildcard in the <em>input</em> block.</p>
<p>Think about it this way: if this worked, there would be multiple
different input files for the same output, and snakemake would
have no way to choose which input file to use.</p>
<p>There are situations where wildcards in the <code>output:</code> block do <em>not</em>
need to be in the <code>input:</code> block, however - see "Using wildcards to
determine parameters to use in the shell block", below.</p>
<h3>Wildcards are local to each rule</h3>
<p>Wildcard names only need to match <em>within</em> a rule block. You can use the same
wildcard names in multiple rules for consistency and readability, but
snakemake will treat them as independent wildcards, and wildcard values
will not be shared.</p>
<p>So, for example, these two rules use the same wildcard <code>a</code> in both rules -</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">analyze_this</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.first.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.second.txt"</span>
<span class="n">rule</span> <span class="n">analyze_that</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.second.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.third.txt"</span>
</code></pre></div>
<p>but this is equivalent to these next two rules, which use <em>different</em>
wildcards <code>a</code> and <code>b</code> in the separate rules:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">analyze_this</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.first.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.second.txt"</span>
<span class="n">rule</span> <span class="n">analyze_that</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{b}</span><span class="s2">.second.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{b}</span><span class="s2">.third.txt"</span>
</code></pre></div>
<p>There is an exception to the rule that wildcards are independent:
when you use global wildcard constraints to
constrain wildcard matching by wildcard name, the constraints
apply across all uses of that wildcard name in the Snakefile.
However, the <em>values</em> of the wildcards remain independent - it's just
the constraint that is shared.</p>
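<p>For example, a global <code>wildcard_constraints:</code> block like the following
(a minimal sketch - the regular expression here is illustrative) constrains
every use of <code>{sample}</code>, in every rule:</p>
<div class="highlight"><pre><span></span><code># constrain {sample} across all rules in this Snakefile
wildcard_constraints:
    sample = "[A-Za-z0-9_]+"
</code></pre></div>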
<!-- CTB: fix link to point directly to global wildcard constraints. -->
<p>While wildcards are independent in values, it is a good convention to
choose wildcards to have the same semantic meaning across the
Snakefile - e.g. always use <code>sample</code> consistently to refer to a
sample. This makes reading the Snakefile easier!</p>
<p>One interesting addendum: because wildcards are local to each rule, you
are free to match different parts of patterns in different rules!
See "Mixing and matching wildcards", below.</p>
<h3>The wildcard namespace is implicitly available in <code>input:</code> and <code>output:</code> blocks, but not in other blocks</h3>
<p>Within the <code>input:</code> and <code>output:</code> blocks in a rule, you can refer to
wildcards directly by name. If you want to use wildcards in other
parts of a rule you need to use the <code>wildcards.</code> prefix. Here,
<code>wildcards</code> is a <em>namespace</em>, which we will talk about more later.</p>
<p>Consider this Snakefile:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># this does not work:</span>
<span class="n">rule</span> <span class="n">analyze_this</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.first.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.second.txt"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"analyze </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span><span class="s2"> --title </span><span class="si">{a}</span><span class="s2">"</span>
</code></pre></div>
<p>Here you will get an error,</p>
<div class="highlight"><pre><span></span><code><span class="n">NameError</span><span class="o">:</span><span class="w"> </span><span class="n">The</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="s1">'a'</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">unknown</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="k">this</span><span class="w"> </span><span class="n">context</span><span class="o">.</span><span class="w"> </span><span class="n">Did</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="s1">'wildcards.a'</span><span class="o">?</span>
</code></pre></div>
<p>As the error suggests, you need to use <code>wildcards.a</code> in
the shell block instead:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">analyze_this</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.first.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.second.txt"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"analyze </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span><span class="s2"> --title </span><span class="si">{wildcards.a}</span><span class="s2">"</span>
</code></pre></div>
<h3>Wildcards match greedily, unless constrained</h3>
<p>Wildcard pattern matching chooses the <em>longest possible</em> match to
<em>any</em> characters, which can result in slightly confusing
behavior. Consider:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"x.y.z.gz"</span>
<span class="n">rule</span> <span class="n">something</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}</span><span class="s2">.</span><span class="si">{suffix}</span><span class="s2">.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}</span><span class="s2">.</span><span class="si">{suffix}</span><span class="s2">.gz"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"gzip -c </span><span class="si">{input}</span><span class="s2"> > </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>In the <code>something</code> rule, for the desired output file <code>x.y.z.gz</code>,
<code>{prefix}</code> will be <code>x.y</code> and <code>{suffix}</code> will be <code>z</code>, because
<code>{prefix}</code> matches as much as it can. But it would be equally valid
for <code>{prefix}</code> to be <code>x</code> and <code>{suffix}</code> to be <code>y.z</code>.</p>
<p>A more extreme example shows the greedy matching even more clearly:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"longer_filename.gz"</span>
<span class="n">rule</span> <span class="n">something</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}{suffix}</span><span class="s2">.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}{suffix}</span><span class="s2">.gz"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"gzip -c </span><span class="si">{input}</span><span class="s2"> > </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>Here, <code>{suffix}</code> is reduced down to a single character, <code>e</code>, while
<code>{prefix}</code> matches everything else: <code>longer_filenam</code>!</p>
<p>Two simple rules for wildcard matching are:</p>
<ul>
<li>all wildcards must match at least one character.</li>
<li>after that, wildcards will match greedily: each wildcard will match everything it can before the next wildcard is considered.</li>
</ul>
<p>Therefore, it's good practice to use
wildcard constraints to limit
wildcard matching. See "Constraining wildcards to avoid
subdirectories and/or periods", below, for some examples.</p>
<h2>Some examples of wildcards</h2>
<h3>Running one rule on many files</h3>
<p>Wildcards can be used to run the same simple rule on many files - this is
one of the simplest and most powerful uses for snakemake!</p>
<p>Consider this Snakefile for compressing many files:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compressed/F3D141_S207_L001_R1_001.fastq.gz"</span><span class="p">,</span>
<span class="s2">"compressed/F3D141_S207_L001_R2_001.fastq.gz"</span><span class="p">,</span>
<span class="s2">"compressed/F3D142_S208_L001_R1_001.fastq.gz"</span><span class="p">,</span>
<span class="s2">"compressed/F3D142_S208_L001_R2_001.fastq.gz"</span>
<span class="n">rule</span> <span class="n">gzip_file</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"original/</span><span class="si">{filename}</span><span class="s2">"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compressed/</span><span class="si">{filename}</span><span class="s2">.gz"</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s2">"gzip -c </span><span class="si">{input}</span><span class="s2"> > </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>This Snakefile specifies a list of compressed files that it wants produced,
and relies on wildcards to do the pattern matching required to find the
input files and fill in the shell block.</p>
<p>That having been said, this Snakefile is inconvenient to write and is
somewhat error-prone:</p>
<ul>
<li>writing out the files individually is annoying if you have many of them!</li>
<li>to generate the list of files, you have to hand-rename them, which is
error-prone!</li>
</ul>
<p>Snakemake provides several features that can help with these issues. You
can load the list of files from a text file or spreadsheet, or get the
list directly from the directory using <code>glob_wildcards</code>; and you can
use <code>expand</code> to rename them in bulk. Read on for some examples!</p>
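<p>As a sketch of the first option - assuming a file <code>filenames.txt</code>
with one filename per line - remember that Snakefiles are Python, so you
can load the list directly:</p>
<div class="highlight"><pre><span></span><code># load the list of files to compress, one filename per line
with open('filenames.txt') as fp:
    FILENAMES = [ line.strip() for line in fp ]

rule all:
    input:
        expand("compressed/{filename}.gz", filename=FILENAMES)
</code></pre></div>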
<h4>Why use snakemake here?</h4>
<p>It is possible to accomplish the same task by using <code>gzip -k original/*</code>,
although you'd have to move the files into their final location, too.</p>
<p>How is using <code>gzip -k original/*</code> different from using snakemake? And
is it better?</p>
<p>First, the results aren't different - both approaches compress the
set of input files, which is what you want! But the <code>gzip -k</code> command
runs in <em>serial</em>: gzip compresses one file at a time, one after the
other. The Snakefile will run the rule <code>gzip_file</code> <em>in parallel</em>,
using as many processors as you specify with <code>-j</code>. That means that if
you had many, many such files - a common problem in bioinformatics! -
the snakemake version could run many times faster.</p>
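<p>For example, to run up to eight jobs at once with the Snakefile above:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 8
</code></pre></div>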
<p>Second, specifying many files on the command line with <code>gzip -k
original/*</code> works with <code>gzip</code> but not with every shell command. Some
commands only run on one file at a time; <code>gzip</code> just happens to work
whether you give it one or many files. Many other programs do not work
on multiple input files; e.g. the <code>fastp</code> program for preprocessing
FASTQ files runs on one dataset at a time. (It's also worth
mentioning that snakemake gives you a way to flexibly write custom
command lines; more on that later.)</p>
<p>Third, in the Snakefile we are being explicit about which files we
expect to exist after the rules are run, while if we just ran <code>gzip -k
original/*</code> we are asking the shell to compress every file in
<code>original/</code>. If we accidentally deleted a file in the <code>original</code>
subdirectory, then gzip would not know about it and would not
complain - but snakemake would. This is a theme that will come up
repeatedly - it's often safer to be really explicit about what files
you expect, so that you can be alerted to possible mistakes.</p>
<p>And, fourth, the Snakefile approach will let you rename the output
files in interesting ways - with <code>gzip -k original/*</code>, you're stuck
with the original filenames. This is a feature we will explore in the
next subsection!</p>
<h3>Renaming files by prefix using <code>glob_wildcards</code></h3>
<p>Consider a set of files named like so:</p>
<div class="highlight"><pre><span></span><code>F3D141_S207_L001_R1_001.fastq
F3D141_S207_L001_R2_001.fastq
</code></pre></div>
<p>within the <code>original/</code> subdirectory.</p>
<p>Now suppose you want to rename them all to get rid of the <code>_001</code> suffix
before <code>.fastq</code>. This is very easy with wildcards!</p>
<p>The below Snakefile uses <code>glob_wildcards</code> to load in a list of files from
a directory and then make a copy of them with the new name under the
<code>renamed/</code> subdirectory. Here, <code>glob_wildcards</code> extracts the <code>{sample}</code>
pattern <em>from</em> the set of available files in the directory:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># first, find matches to filenames of this form:</span>
<span class="n">files</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"original/</span><span class="si">{sample}</span><span class="s2">_001.fastq"</span><span class="p">)</span>
<span class="c1"># next, specify the form of the name you want:</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"renamed/</span><span class="si">{sample}</span><span class="s2">.fastq"</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">files</span><span class="o">.</span><span class="n">sample</span><span class="p">)</span>
<span class="c1"># finally, give snakemake a recipe for going from inputs to outputs.</span>
<span class="n">rule</span> <span class="n">rename</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"original/</span><span class="si">{sample}</span><span class="s2">_001.fastq"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"renamed/</span><span class="si">{sample}</span><span class="s2">.fastq"</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s2">"cp </span><span class="si">{input}</span><span class="s2"> </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>This Snakefile also makes use of <code>expand</code> to rewrite the loaded list
into the desired set of filenames. This means that we no
longer have to write out the list of files ourselves - we can let
snakemake do it with <code>expand</code>!</p>
<p>Note that here you could do a <code>mv</code> instead of a <code>cp</code> - but then the
original files would be gone after the first run, and <code>glob_wildcards</code>
would find nothing to match the next time you ran snakemake.</p>
<p>This Snakefile loads the list of files from the directory itself,
which means that if an input file is accidentally deleted, snakemake
won't complain. When renaming files, this is unlikely to cause
problems; however, when running workflows, we recommend loading the
list of samples from a text file or spreadsheet to avoid problems.</p>
<!-- (CTB point to a recipe). -->
<p>Also note that this Snakefile will find and rename all files in
<code>original/</code> as well as any subdirectories! This is because
<code>glob_wildcards</code> by default includes all subdirectories. See
the next section below to see how to use wildcard constraints to
prevent loading from subdirectories.</p>
<h3>Constraining wildcards to avoid subdirectories and/or periods</h3>
<p>Wildcards match to any string, including '/', and so <code>glob_wildcards</code>
will automatically find files in subdirectories and will also "stretch
out" to match common delimiters in filenames such as '.' and '-'. This
is commonly referred to as "greedy matching" and it means that
sometimes your wildcards will match to far more of a filename than you
want! You can limit wildcard matches using wildcard constraints.</p>
<p>Two common wildcard constraints are shown below, separately and in
combination. The first constraint avoids files in subdirectories, and
the second constraint avoids periods.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># match all .txt files - no constraints</span>
<span class="n">all_files</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"</span><span class="si">{filename}</span><span class="s2">.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">filename</span>
<span class="c1"># match all .txt files in this directory only - avoid /</span>
<span class="n">this_dir_files</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"{filename,[^/]+}.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">filename</span>
<span class="c1"># match all files with only a single period in their name - avoid .</span>
<span class="n">prefix_only</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"{filename,[^.]+}.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">filename</span>
<span class="c1"># match all files in this directory with only a single period in their name</span>
<span class="c1"># avoid / and .</span>
<span class="n">prefix_and_dir_only</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"{filename,[^./]+}.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">filename</span>
</code></pre></div>
<p>Check out wildcard constraints for more information and details.</p>
<h2>Advanced wildcard examples</h2>
<h3>Renaming files using multiple wildcards</h3>
<p>The first renaming example above works really well when you want to change just
the suffix of a file and can use a single wildcard, but if you want to
do more complicated renaming you may have to use multiple wildcards.</p>
<p>Consider the situation where you want to rename files from the form of
<code>F3D141_S207_L001_R1_001.fastq</code> to <code>F3D141_S207_R1.fastq</code>. You can't
do that with a single wildcard, unfortunately - but you can use two,
like so:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># first, find matches to filenames of this form:</span>
<span class="n">files</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"original/</span><span class="si">{sample}</span><span class="s2">_L001_</span><span class="si">{r}</span><span class="s2">_001.fastq"</span><span class="p">)</span>
<span class="c1"># next, specify the form of the name you want:</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"renamed/</span><span class="si">{sample}</span><span class="s2">_</span><span class="si">{r}</span><span class="s2">.fastq"</span><span class="p">,</span> <span class="nb">zip</span><span class="p">,</span>
<span class="n">sample</span><span class="o">=</span><span class="n">files</span><span class="o">.</span><span class="n">sample</span><span class="p">,</span> <span class="n">r</span><span class="o">=</span><span class="n">files</span><span class="o">.</span><span class="n">r</span><span class="p">)</span>
<span class="c1"># finally, give snakemake a recipe for going from inputs to outputs.</span>
<span class="n">rule</span> <span class="n">rename</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"original/</span><span class="si">{sample}</span><span class="s2">_L001_</span><span class="si">{r}</span><span class="s2">_001.fastq"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"renamed/</span><span class="si">{sample}</span><span class="s2">_</span><span class="si">{r}</span><span class="s2">.fastq"</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s2">"cp </span><span class="si">{input}</span><span class="s2"> </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>We're making use of three new features in this code:</p>
<p>First, <code>glob_wildcards</code> is matching multiple wildcards, and
puts the resulting values into a single result variable (here, <code>files</code>).</p>
<p>Second, the matching values are placed in two ordered lists,
<code>files.sample</code> and <code>files.r</code>, such that values extracted from file names
match in pairs.</p>
<p>Third, when we use <code>expand</code>, we're asking it to "zip" the two lists of
wildcards together, rather than the default, which is to make all
possible combinations with <code>product</code>.</p>
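<p>To make the difference concrete, here is what <code>expand</code> produces with
and without <code>zip</code> (the values are illustrative):</p>
<div class="highlight"><pre><span></span><code># with zip - pair the two lists element by element:
expand("renamed/{sample}_{r}.fastq", zip,
       sample=["F3D141_S207", "F3D142_S208"], r=["R1", "R2"])
# => ["renamed/F3D141_S207_R1.fastq", "renamed/F3D142_S208_R2.fastq"]

# default - all possible combinations (product):
expand("renamed/{sample}_{r}.fastq",
       sample=["F3D141_S207", "F3D142_S208"], r=["R1", "R2"])
# => ["renamed/F3D141_S207_R1.fastq", "renamed/F3D141_S207_R2.fastq",
#     "renamed/F3D142_S208_R1.fastq", "renamed/F3D142_S208_R2.fastq"]
</code></pre></div>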
<p>Also - as with the previous example, this Snakefile will find and
rename all files in <code>original/</code> as well as any subdirectories!</p>
<p>Links:</p>
<ul>
<li><a href="https://snakemake.readthedocs.io/en/stable/project_info/faq.html#i-don-t-want-expand-to-use-the-product-of-every-wildcard-what-can-i-do">snakemake documentation on using zip instead of product</a></li>
</ul>
<h3>Mixing and matching strings</h3>
<p>A somewhat nonintuitive (but also very useful) consequence of wildcards
being local to rules is that you can do clever string matching to mix and
match generic rules with more specific rules.</p>
<p>Consider this Snakefile, in which we are mapping reads from multiple
samples to multiple references (rule <code>map_reads_to_reference</code>) as well
as converting SAM to BAM files:</p>
<!-- CTB: transfer to functional Snakefile? -->
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"sample1.x.ecoli.bam"</span><span class="p">,</span>
<span class="s2">"sample2.x.shewanella.bam"</span><span class="p">,</span>
<span class="s2">"sample1.x.shewanella.bam"</span>
<span class="n">rule</span> <span class="n">map_reads_to_reference</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">reads</span><span class="o">=</span><span class="s2">"</span><span class="si">{sample}</span><span class="s2">.fq"</span><span class="p">,</span>
<span class="n">reference</span><span class="o">=</span><span class="s2">"</span><span class="si">{genome}</span><span class="s2">.fa"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{reads}</span><span class="s2">.x.</span><span class="si">{reference}</span><span class="s2">.sam"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"minimap2 -ax sr </span><span class="si">{input.reference}</span><span class="s2"> </span><span class="si">{input.reads}</span><span class="s2"> > </span><span class="si">{output}</span><span class="s2">"</span>
<span class="n">rule</span> <span class="n">convert_sam_to_bam</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{filename}</span><span class="s2">.sam"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{filename}</span><span class="s2">.bam"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"samtools view -b </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span>
</code></pre></div>
<p>Here, snakemake is happily using different wildcards in each rule, and
matching them to different parts of the pattern! So,</p>
<ul>
<li>
<p>Rule <code>convert_sam_to_bam</code> will generically convert any SAM file to a BAM
file based solely on the <code>.bam</code> and <code>.sam</code> suffixes.</p>
</li>
<li>
<p>However, <code>map_reads_to_reference</code> will only produce mapping files that
match the pattern of <code>{sample}.x.{reference}</code>, which in turn depend on the
existence of <code>{sample}.fq</code> and <code>{reference}.fa</code>.</p>
</li>
</ul>
<p>This works because, ultimately, snakemake is just matching strings
and does not "know" anything about the structure of the strings that
it's matching. And it also doesn't remember wildcards across rules. So
snakemake will happily match one set of wildcards in one rule, and a
different set of wildcards in another rule!</p>
<h3>Using wildcards to determine parameters to use in the shell block</h3>
<p>You can also use wildcards to build rules that produce output files
where the parameters used to <em>generate</em> the contents are based on the
filename; for example, consider this example of generating subsets
of FASTQ files:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"big.subset100.fastq"</span>
<span class="n">rule</span> <span class="n">subset</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"big.fastq"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"big.subset</span><span class="si">{num_lines}</span><span class="s2">.fastq"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> head -</span><span class="si">{wildcards.num_lines}</span><span class="s2"> </span><span class="si">{input}</span><span class="s2"> > </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Here, the wildcard is <em>only</em> in the output filename, not in the
input filename. The wildcard value is used by snakemake to determine
how to fill in the number of lines for <code>head</code> to select from the file!</p>
<p>This can be really useful for generating files with many different
parameters to a particular shell command - "parameter sweeps". More
about this later.</p>
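<p>Here's a brief sketch of what such a parameter sweep might look like,
reusing the <code>subset</code> rule above (the line counts are illustrative):</p>
<div class="highlight"><pre><span></span><code>NUM_LINES = [100, 1000, 10000]

rule all_subsets:
    input:
        expand("big.subset{num_lines}.fastq", num_lines=NUM_LINES)
</code></pre></div>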
<!-- See CTB XXX.
CTB link to:
* params functions, params lambda?
* parameter sweeps with this and expand
-->
<h2>How to think about wildcards</h2>
<p>Wildcards (together with <code>expand</code> and <code>glob_wildcards</code>) are perhaps
the single most powerful feature in snakemake: they permit generic
application of rules to an arbitrary number of files, based entirely
on simple patterns.</p>
<p>However, with that power comes quite a bit of complexity!</p>
<p>Ultimately, wildcards are all about <em>strings</em> and <em>patterns</em>.
Snakemake is using pattern matching to extract patterns from the
desired output files, and then filling those matches in elsewhere in
the rule. Most of the ensuing complexity comes from avoiding ambiguity in
matching and filling in patterns, along with the paired challenge of
constructing all the names of the files you actually want to create.</p>
<h2>Additional references</h2>
<p>See also: the
<a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-wildcards">snakemake docs on wildcards</a>.</p>conda & mamba on shared clusters works better now!2023-02-09T00:00:00+01:002023-02-09T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2023-02-09:/blog/2023-conda-mamba-shared.html<p>conda is great!</p><p>Friends! Countrymen! I bring you good tidings! The <a href="https://github.com/mamba-org/mamba/issues/488#issuecomment-1400575225">bug is dead!</a> Long live conda/mamba on shared clusters!</p>
<p>OK, wait. Let's back up. What's this bug, and why does it matter that it's fixed?</p>
<p>It all starts with teaching...</p>
<h2>conda is, like, the best for teaching bioinformatics!!</h2>
<p>I've been teaching bioinformatics using conda for about 5 years now. Not only do I straight up <a href="https://hackmd.io/VTcCz9dmSf6vclaHRwavlw?view">teach conda/mamba</a> but I also use it extensively in my Intro Bioinformatics hands-on lab for graduate students, where I teach <a href="https://github.com/ngs-docs/2023-ggg-201b-lab/blob/main/lab-1.md">variant calling</a>, de novo assembly, and RNAseq.</p>
<p>Mostly I teach on a shared cluster, the 'farm' HPC, because that's where many of the students will be doing their research.</p>
<p>And I teach conda (and mamba) for a few reasons:</p>
<ul>
<li>it works!</li>
<li>you don't need admin privileges to install specific versions of your software!</li>
<li>most bioinformatics command-line software is available via conda!</li>
<li>many (most?) Python packages <em>and</em> many (most?) R packages are available from conda-forge or bioconda!</li>
<li>and, most recently, one of our admins, Camille Scott, got RStudio Server working so that it loads R and R packages from conda environments!</li>
</ul>
<p>So, basically, conda is a full solution for students to take and use <em>after</em> my class is over.</p>
<h2>My teaching setup for conda</h2>
<p>I teach using a bunch of accounts specifically created for the course. These accounts are set up so that I have ssh access into them, which is really important; and they have specific queue access. It all works really well! Well, mostly.</p>
<p>Things that work out of the box: software installed with conda. Yay!</p>
<p>Things that don't work out of the box: 30 students simultaneously downloading the same packages from conda-forge.</p>
<p>This is because 30 students downloading 500 MB of packages from the same remote Web site is slow ;).</p>
<p>The thing is, it's not really necessary for everyone to download the packages - most of the time, students are only downloading packages all at the same time during class, and they're all downloading the <em>same</em> packages. We should be able to cache them!</p>
<p>So I've set up the accounts with a central cache. Read on...</p>
<h2>Using a central package cache for a bunch of accounts</h2>
<p>It's actually pretty straightforward to set up; there are two components: a <a href="https://github.com/ngs-docs/shared-conda-on-farm/blob/main/condarc">condarc file</a>,</p>
<div class="highlight"><pre><span></span><code><span class="nt">pkgs_dirs</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">~/.conda/pkgs</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/home/ctbrown/remote-computing.cache</span>
</code></pre></div>
<p>that specifies a package cache directory that's shared; and an <a href="https://github.com/ngs-docs/shared-conda-on-farm/blob/main/install-mambaforge.sh">install script</a> that I run in each "child" account that installs and configures conda to use the shared cache:</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span><span class="nb">cd</span><span class="w"> </span>~/
$<span class="w"> </span>mkdir<span class="w"> </span>-p<span class="w"> </span>~/.conda/pkgs
$<span class="w"> </span>cp<span class="w"> </span>~ctbrown/shared-conda-on-farm/condarc<span class="w"> </span>~/.condarc
$<span class="w"> </span>bash<span class="w"> </span>~ctbrown/shared-conda-on-farm/Mambaforge-Linux-x86_64.sh<span class="w"> </span>-b<span class="w"> </span>-p<span class="w"> </span><span class="nv">$HOME</span>/miniforge3
</code></pre></div>
<p>This sets things up so that all the accounts look for packages in one place, and download them to their local account if they're not there.</p>
<p>I run this script in each child account, and then I set up a separate parent account that has write privileges to the cache directory. This parent account must then download all of the desired conda packages, at which point they are then available to all the child accounts to use without download.</p>
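<p>For example, the parent account might seed the cache simply by creating an
environment containing the packages the class will need - a sketch, where the
environment name and package list are illustrative:</p>
<div class="highlight"><pre><span></span><code>mamba create -y -n seed-cache sourmash snakemake-minimal
</code></pre></div>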
<p>This works great, except for one thing: until recently, the child account mamba calls would complain bitterly if permissions were wrong. And sometimes things would work out even less well and there would be crashes. So I had to be very mindful of how I installed packages. Which I wasn't always. Which caused problems.</p>
<p>And that's the bug that was fixed! - the specific <a href="https://github.com/mamba-org/mamba/issues/488#issuecomment-1400575225">conda issue I've been paying attention to</a> references <a href="https://github.com/mamba-org/mamba/pull/2141">this fix</a>, which was actually pointed at <a href="https://github.com/mamba-org/mamba/issues/1123">this issue</a>. </p>
<p>All's well that ends well - I upgraded all of the accounts to mamba 1.3.0 and ran some tests and it all seems to work! We did a stress test on Wednesday with ~30 people running through my snakemake lesson, and other than network glitches, life was good!</p>
<h2>Taking a step back: is conda all that?</h2>
<p>Yes, it's great.</p>
<p>I'm sure it doesn't solve all the packaging problems, and I'm positive it's theoretically inferior to many things, but I've gotta say, it really <strong>just works</strong> for me (and people in my lab) 99% of the time.</p>
<p>Even better, other people are reporting that it's working well for them - including for R software installations.</p>
<h2>Conda and R</h2>
<p>Conda solves a lot of R package installation problems for me.</p>
<p>I'm no R expert, but here is what I've gathered as to why I have a lot of problems:</p>
<p>The challenge with R installation is that many R packages need to be compiled before installation; I gather the R packaging ecosystem typically distributes things as source. This means installing them requires having a particular compiler tool-chain installed. Dependencies also become an issue. Basically, this is a point of fragility.</p>
<p>Conda conveniently does things in a different way: packages are distributed as binaries with no compilation required, and their dependencies include everything required for runtime. When this works, it works really well - you just download and install the compiled package for your system!</p>
<p>Even better, all of the conda magic works - you get to use an isolated environment, with the version of R you wanted to use, with all of the compatible packages installed. And if you need to install something yourself, you can do so <em>in</em> that isolated conda environment without potentially contaminating your other R installs.</p>
<p>So, I now regularly use conda environments that look like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">channels</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">conda-forge</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bioconda</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">defaults</span>
<span class="nt">dependencies</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-ggplot2</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-dplyr</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-readr</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-pheatmap</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-knitr</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-rmarkdown</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-rsqlite</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-data.table</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-kableextra</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bioconductor-tximeta</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bioconductor-deseq2</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bioconductor-summarizedexperiment</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-base</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-irkernel=1.1</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-devtools</span>
</code></pre></div>
<p>and it works really well for me.</p>
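<p>To use an environment file like this, you can create and activate the
environment like so (the environment and file names are illustrative):</p>
<div class="highlight"><pre><span></span><code>mamba env create -n rnaseq -f rnaseq-env.yml
conda activate rnaseq
</code></pre></div>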
<p>I'll note that the situation has really improved over the last 3 years - I used to have lots of issues, but conda-forge has really stepped up their game and now most of my problems occur elsewhere (problem-specific stuff, basically).</p>
<p>One concern with conda has been the availability of common R packages. Here I'm happy to say that Fredrik Boulund reported that all but one of the 600 R packages they use internally were already available on conda-forge. So that's pretty cool!</p>
<h2>One last thought for you...</h2>
<p>...or maybe two ;).</p>
<p>Packaging for data science software really requires a community. There are so many packages, and so many diverse and disparate needs, that if you want a solution that satisfies > 80% of the needs you need to build off a diverse community. If the community mechanisms include a way to add your own packages of interest (like conda-forge and bioconda do) then that results in magic!</p>
<p>Also, I think software solutions have to incorporate the newbie/learners perspective. If I can't get a class of 30 people to robustly use your solution, then that's a problem.</p>
<p>--titus</p>A brief overview of automation and parallelization options in UNIX/on an HPC2023-01-31T00:00:00+01:002023-01-31T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2023-01-31:/blog/2023-automation-and-parallelization.html<p>Automating things! Parallelizing them!</p><p>What do you do if you have a lot of computing jobs to run, and lots of computing resources to run them?</p>
<p>Let's play with some options! We'll run a simple set of bioinformatics analyses as an example, but all of the approaches below should work for a wide variety of command line needs.</p>
<p>Most of the commands below should work as straight-up copy/paste. Please let me know if they don't!</p>
<h2>Setup and file preparation</h2>
<p>Download some metagenome assemblies from <a href="https://osf.io/vk4fa/?view_only">our metagenome assembly evaluation project</a>. These are all files generated from from <a href="https://pubmed.ncbi.nlm.nih.gov/23387867/">Shakya et al., 2014</a> - specifically, assemblies of SRR606249.</p>
<div class="highlight"><pre><span></span><code><span class="n">mkdir</span><span class="w"> </span><span class="n">queries</span><span class="o">/</span>
<span class="n">cd</span><span class="w"> </span><span class="n">queries</span><span class="o">/</span>
<span class="n">curl</span><span class="w"> </span><span class="o">-</span><span class="n">JLO</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">osf</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">download</span><span class="o">/</span><span class="n">q8h97</span><span class="o">/</span>
<span class="n">curl</span><span class="w"> </span><span class="o">-</span><span class="n">JLO</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">osf</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">download</span><span class="o">/</span><span class="mi">7</span><span class="n">bzrc</span><span class="o">/</span>
<span class="n">curl</span><span class="w"> </span><span class="o">-</span><span class="n">JLO</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">osf</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">download</span><span class="o">/</span><span class="mi">3</span><span class="n">kgvd</span><span class="o">/</span>
<span class="n">cd</span><span class="w"> </span><span class="o">..</span>
<span class="n">mkdir</span><span class="w"> </span><span class="o">-</span><span class="n">p</span><span class="w"> </span><span class="n">database</span><span class="o">/</span>
<span class="n">cd</span><span class="w"> </span><span class="n">database</span><span class="o">/</span>
<span class="n">curl</span><span class="w"> </span><span class="o">-</span><span class="n">JLO</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">osf</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">download</span><span class="o">/</span><span class="mi">4</span><span class="n">kfv9</span><span class="o">/</span>
<span class="n">cd</span><span class="w"> </span><span class="o">../</span>
</code></pre></div>
<p>Now you should have three files in queries/</p>
<div class="highlight"><pre><span></span><code>ls -1 queries/
</code></pre></div>
<div class="highlight"><pre><span></span><code>>idba.scaffold.fa.gz
>megahit.final.contigs.fa.gz
>spades.scaffolds.fasta.gz
</code></pre></div>
<p>and one file in database/</p>
<div class="highlight"><pre><span></span><code>ls -1 database/
</code></pre></div>
<div class="highlight"><pre><span></span><code>>podar-complete-genomes-17.2.2018.tar.gz
</code></pre></div>
<p>Let's sketch the queries with sourmash:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span>i<span class="w"> </span><span class="k">in</span><span class="w"> </span>queries/*.gz
<span class="k">do</span>
<span class="w"> </span>sourmash<span class="w"> </span>sketch<span class="w"> </span>dna<span class="w"> </span>-p<span class="w"> </span><span class="nv">k</span><span class="o">=</span><span class="m">31</span>,scaled<span class="o">=</span><span class="m">10000</span><span class="w"> </span><span class="nv">$i</span><span class="w"> </span>-o<span class="w"> </span><span class="nv">$i</span>.sig
<span class="k">done</span>
</code></pre></div>
<p>Next, unpack the database and create <code>database.zip</code>:</p>
<div class="highlight"><pre><span></span><code><span class="nb">cd</span><span class="w"> </span>database/
tar<span class="w"> </span>xzf<span class="w"> </span>podar*.tar.gz
sourmash<span class="w"> </span>sketch<span class="w"> </span>dna<span class="w"> </span>-p<span class="w"> </span><span class="nv">k</span><span class="o">=</span><span class="m">31</span>,scaled<span class="o">=</span><span class="m">10000</span><span class="w"> </span>*.fa<span class="w"> </span>--name-from-first<span class="w"> </span>-o<span class="w"> </span>../database.zip
<span class="nb">cd</span><span class="w"> </span>../
</code></pre></div>
<p>Finally, make all your inputs read-only:</p>
<div class="highlight"><pre><span></span><code>chmod a-w queries/* database.zip database/*
</code></pre></div>
<p>This protects against accidental overwriting of the files.</p>
<h2>Running your basic queries</h2>
<p>We're going to run <a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-gather-find-metagenome-members">sourmash gather</a> for all three assembly files in <code>queries/</code> against the 64 genomes in <code>database.zip</code>. These specific commands will run quickly, but note that they are a proxy for a much bigger analysis against larger databases.</p>
<p>You could do these queries in serial:</p>
<div class="highlight"><pre><span></span><code>sourmash<span class="w"> </span>gather<span class="w"> </span>queries/idba.scaffold.fa.gz.sig<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span>idba.scaffold.fa.gz.csv
sourmash<span class="w"> </span>gather<span class="w"> </span>queries/megahit.final.contigs.fa.gz.sig<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span>megahit.final.contigs.fa.gz.csv
sourmash<span class="w"> </span>gather<span class="w"> </span>queries/spades.scaffolds.fasta.gz.sig<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span>spades.scaffolds.fasta.gz.csv
</code></pre></div>
<p>but then your total compute time would be the sum of the individual compute times. And what if each query is super slow and/or big, and you have dozens or hundreds of them? WHAT THEN?</p>
<p>Read on!</p>
<h2>Automation and parallelization</h2>
<h3>1. Write a shell script.</h3>
<p>Let's start by automating the queries so that you can just run one command and have it do all three (or N) queries.</p>
<p>Create the following shell script:</p>
<p><code>run1.sh</code>:</p>
<div class="highlight"><pre><span></span><code>sourmash<span class="w"> </span>gather<span class="w"> </span>queries/idba.scaffold.fa.gz.sig<span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span>idba.scaffold.fa.gz.csv
sourmash<span class="w"> </span>gather<span class="w"> </span>queries/megahit.final.contigs.fa.gz.sig<span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span>megahit.final.contigs.fa.gz.csv
sourmash<span class="w"> </span>gather<span class="w"> </span>queries/spades.scaffolds.fasta.gz.sig<span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span>spades.scaffolds.fasta.gz.csv
</code></pre></div>
<p>and run it:</p>
<div class="highlight"><pre><span></span><code>bash run1.sh
</code></pre></div>
<p>This automates the commands, but nothing else.</p>
<p>Notes:</p>
<ul>
<li>all your commands will run in serial, one after the other;</li>
<li>the memory usage of the script will be the same as the memory usage of the largest command;</li>
</ul>
<h3>2. Add a for loop to your shell script.</h3>
<p>There's a lot of duplication in the script above. Duplication leads to typos, which lead to fear, anger, hatred, and suffering.</p>
<p>Let's make a script <code>run2.sh</code> that contains a for loop instead.</p>
<p><code>run2.sh</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span>query<span class="w"> </span><span class="k">in</span><span class="w"> </span>queries/*.sig
<span class="k">do</span>
<span class="nv">output</span><span class="o">=</span><span class="k">$(</span>basename<span class="w"> </span><span class="nv">$query</span><span class="w"> </span>.sig<span class="k">)</span>.csv
sourmash<span class="w"> </span>gather<span class="w"> </span><span class="nv">$query</span><span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span><span class="nv">$output</span>
<span class="k">done</span>
</code></pre></div>
<p>While this does exactly the same thing <em>computationally</em> as <code>run1.sh</code>, it is a bit nicer because it is less repetitive and lets you run as many queries as you have.</p>
<p>Notes:</p>
<ul>
<li>yes, we carefully structured the filenames so that the <code>for</code> loop would work :)</li>
<li>the <code>output=</code> line uses <code>basename</code> to remove the <code>queries/</code> prefix and <code>.sig</code> suffix from each query filename - see the example just below.</li>
</ul>
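<p>If you haven't seen <code>basename</code> with a suffix argument before, it strips
both the leading directories and the given suffix:</p>
<div class="highlight"><pre><span></span><code>$ basename queries/idba.scaffold.fa.gz.sig .sig
idba.scaffold.fa.gz
</code></pre></div>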
<h3>3. Write a for loop that creates a shell script.</h3>
<p>Sometimes it's nice to <em>generate</em> a script that you can edit to fine tune and customize the commands. Let's try that!</p>
<p>At the shell prompt, run</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span>query<span class="w"> </span><span class="k">in</span><span class="w"> </span>queries/*.sig
<span class="k">do</span>
<span class="nv">output</span><span class="o">=</span><span class="k">$(</span>basename<span class="w"> </span><span class="nv">$query</span><span class="w"> </span>.sig<span class="k">)</span>.csv
<span class="nb">echo</span><span class="w"> </span>sourmash<span class="w"> </span>gather<span class="w"> </span><span class="nv">$query</span><span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span><span class="nv">$output</span>
<span class="k">done</span><span class="w"> </span>><span class="w"> </span>run3.sh
</code></pre></div>
<p>This creates a file <code>run3.sh</code> that contains the commands to run. Neato! You could now edit this file if you wanted to individually change up the commands. Or, you could adjust the for loop if you wanted to change <em>all</em> the commands.</p>
<p>Notes:</p>
<ul>
<li>same runtime parameters as above: everything runs in serial.</li>
<li>be careful about overwriting <code>run3.sh</code> by accident after you've edited it!</li>
</ul>
<h3>4. Use <code>parallel</code> to run the commands instead.</h3>
<p>Once we have this script file ready, we can actually run the commands in parallel, using
<a href="https://www.gnu.org/software/parallel/">GNU <code>parallel</code></a>:</p>
<div class="highlight"><pre><span></span><code>parallel -j 2 < run3.sh
</code></pre></div>
<p>This runs up to two commands from <code>run3.sh</code> at a time (<code>-j 2</code>). Neat, right?!</p>
<p>Notes:</p>
<ul>
<li>depending on the parameter to <code>-j</code>, this can be much faster - here, twice as fast!</li>
<li>it will also use twice as much memory...!</li>
<li><code>parallel</code> runs each line on its own. So if you have multiple things you want to run in each parallel session, you need to do something different - like write a shell script to do each compute action, and <em>then</em> run those in parallel.</li>
</ul>
<h3>5. Write a second shell script that takes a parameter.</h3>
<p>Let's switch things up - let's write a generic shell script that does the computation. Note that it's the same set of commands as in the for loops above!</p>
<p><code>do-gather.sh</code>:</p>
<div class="highlight"><pre><span></span><code><span class="nv">output</span><span class="o">=</span><span class="k">$(</span>basename<span class="w"> </span><span class="nv">$1</span><span class="w"> </span>.sig<span class="k">)</span>.csv
sourmash<span class="w"> </span>gather<span class="w"> </span><span class="nv">$1</span><span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span><span class="nv">$output</span>
</code></pre></div>
<p>Now you can run this in a loop like so:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span>i<span class="w"> </span><span class="k">in</span><span class="w"> </span>queries/*.sig
<span class="k">do</span>
<span class="w"> </span>bash<span class="w"> </span><span class="k">do</span>-gather.sh<span class="w"> </span><span class="nv">$i</span>
<span class="k">done</span>
</code></pre></div>
<p>Notes:</p>
<ul>
<li>here, <code>$1</code> is the first command-line parameter after the shell script name.</li>
<li>this is back to processing in serial, not parallel.</li>
</ul>
<p>It would be easy to make this into something you can run in parallel, by providing a list of <code>do-gather.sh</code> commands as in (4), above.</p>
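<p>Here's a minimal sketch of that, writing the commands into a (hypothetical) <code>run5.sh</code> and then handing it to <code>parallel</code> as in (4):</p>
<div class="highlight"><pre><code># generate one do-gather.sh command per query...
for i in queries/*.sig
do
    echo bash do-gather.sh $i
done > run5.sh

# ...and then run up to two of them at a time
parallel -j 2 < run5.sh
</code></pre></div>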
<h3>6. Change the second shell script to be an sbatch script.</h3>
<p>Suppose you have access to an HPC that has many different computers, and you want to run a bunch of big jobs <em>across</em> those computers. How do we do that?</p>
<p>All (most?) clusters have a queuing system; ours is called slurm. (You can see a tutorial <a href="https://ngs-docs.github.io/2021-august-remote-computing/executing-large-analyses-on-hpc-clusters-with-slurm.html">here</a>.)</p>
<p>To send jobs to many different computers, you can write a shell script that executes a particular job, and then run lots of those.</p>
<p>Change <code>do-gather.sh</code> to look like the following.</p>
<div class="highlight"><pre><span></span><code><span class="c1">#SBATCH -c 1 # cpus per task</span>
<span class="c1">#SBATCH --mem=5Gb # memory needed</span>
<span class="c1">#SBATCH --time=00-00:05:00 # time needed</span>
<span class="c1">#SBATCH -p med2 </span>
<span class="nv">output</span><span class="o">=</span><span class="k">$(</span>basename<span class="w"> </span><span class="nv">$1</span><span class="w"> </span>.sig<span class="k">)</span>.csv
sourmash<span class="w"> </span>gather<span class="w"> </span><span class="nv">$1</span><span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span><span class="nv">$output</span>
</code></pre></div>
<p>This is now a script you can send to the HPC to run, using <code>sbatch</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span>i<span class="w"> </span><span class="k">in</span><span class="w"> </span>queries/*.sig
<span class="k">do</span>
<span class="w"> </span>sbatch<span class="w"> </span><span class="k">do</span>-gather.sh<span class="w"> </span><span class="nv">$i</span>
<span class="k">done</span>
</code></pre></div>
<p>The advantage here is these commands can be scheduled by the HPC to run whenever and wherever there is computational "space" to run them. (Here, the <code>#SBATCH</code> lines in the shell script specify how much compute time/memory is needed.)</p>
<p>Notes:</p>
<ul>
<li>this distributes your job across the HPC;</li>
<li>each job still only requests the time/memory it needs individually - but now the jobs are running in parallel on multiple machines!</li>
<li><code>do-gather.sh</code> is actually still a bash script so you can still run it that way, too.</li>
</ul>
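<p>Once you've submitted the jobs, you can check on their status with slurm's <code>squeue</code> command:</p>
<div class="highlight"><pre><code># list your own pending and running jobs
squeue -u $USER
</code></pre></div>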
<h3>7. Write a snakemake file.</h3>
<p>An alternative to all of the above is to have snakemake run things for you. Here's a simple snakefile to run things in parallel:</p>
<p><code>Snakefile</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">QUERY</span><span class="p">,</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"queries/</span><span class="si">{q}</span><span class="s2">.sig"</span><span class="p">)</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{q}</span><span class="s2">.csv"</span><span class="p">,</span> <span class="n">q</span><span class="o">=</span><span class="n">QUERY</span><span class="p">)</span>
<span class="n">rule</span> <span class="n">run_query</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">sig</span> <span class="o">=</span> <span class="s2">"queries/</span><span class="si">{q}</span><span class="s2">.sig"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="n">csv</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{q}</span><span class="s2">.csv"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash gather </span><span class="si">{input.sig}</span><span class="s2"> database.zip -o </span><span class="si">{output.csv}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>and run it in parallel:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 2
</code></pre></div>
<p>Notes:</p>
<ul>
<li>this will run things in parallel as in the above example (4).</li>
</ul>
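<p>One handy trick: you can ask snakemake for a "dry run" with <code>-n</code>, which prints the jobs it <em>would</em> run without actually running anything:</p>
<div class="highlight"><pre><code>snakemake -n
</code></pre></div>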
<h2>Strategies for testing and evaluation</h2>
<p>Here are the three strategies I use when trying to scale something up to run in multiple jobs and across multiple computers:</p>
<ol>
<li>Build around an existing example.</li>
<li>Subsample your query data.</li>
<li>Test on a smaller version of your problem.</li>
</ol>
<h2>Appendix: making your shell script(s) nicer</h2>
<p>The above shell scripts are not actually the way I recommend writing shell scripts! Here are a few additional thoughts for you -</p>
<h3>1. Make them runnable without an explicit <code>bash</code></h3>
<p>Put <code>#! /bin/bash</code> at the top of the shell script and run <code>chmod +x &lt;scriptname&gt;</code>, and now you will be able to run it directly:</p>
<div class="highlight"><pre><span></span><code>./run1.sh
</code></pre></div>
<h3>2. Set error exit</h3>
<p>Add <code>set -e</code> to the top of your shell script and it will stop running when there's an error.</p>snakemake for doing bioinformatics - a beginner's guide (part 2)2023-01-23T00:00:00+01:002023-01-23T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2023-01-23:/blog/2023-snakemake-slithering-section-2.html<p>Slithering your way into bioinformatics with snakemake, round 2.</p><p>(The below post contains excerpts from <em>Slithering your way into
bioinformatics with snakemake</em>, Hackmd Press, 2023.)</p>
<p>In
<a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-section-1.html">Section 1</a>,
we introduced snakemake as a system for (efficiently and effectively)
running a series of shell commands.</p>
<p>In Section 2, we'll explore a number of important features of
snakemake. Together with Section 1, this section covers the core set
of snakemake functionality that you need to know in order to effectively
leverage snakemake.</p>
<p>After this section, you'll be well positioned to write a few workflows
of your own, and then you can come back and explore more advanced
features as you need them.</p>
<h2>Chapter 4: running rules in parallel</h2>
<p>Let's take a look at the <code>sketch_genomes</code> rule from the last
<code>Snakefile</code> entry:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/GCF_000017325.1.fna.gz"</span><span class="p">,</span>
<span class="s2">"genomes/GCF_000020225.1.fna.gz"</span><span class="p">,</span>
<span class="s2">"genomes/GCF_000021665.1.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
</code></pre></div>
<p>This command works fine as it is, but it is <em>slightly</em> awkward - because,
bioinformatics being bioinformatics, we are likely to want to add more
genomes into the comparison at some point, and right now each additional
genome is going to have to be added to both input and output. It's not
a lot of work, but it's unnecessary.</p>
<p>Moreover, if we add in a <em>lot</em> of genomes, then this step could
quickly become a bottleneck. <code>sourmash sketch</code> may run quickly on 10
or 20 genomes, but it will slow down if you give it 100 or 1000! (In
fact, <code>sourmash sketch</code> scales linearly with the number of genomes - so it
will take 100 times longer on 100 genomes than on 1.) Is there a
way to speed that up?</p>
<p>Yes - we can write a rule that can be run for each genome, and then
let snakemake run it in parallel for us!</p>
<p>Let's start by breaking this one rule into three <em>separate</em> rules:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes_1</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/GCF_000017325.1.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">sketch_genomes_2</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/GCF_000020225.1.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">sketch_genomes_3</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/GCF_000021665.1.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="c1"># rest of Snakefile here!</span>
</code></pre></div>
<p>It's wordy, but it will work - run:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>--delete-all<span class="w"> </span>plot_comparison
snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>plot_comparison
</code></pre></div>
<p>Before we modify the file further, let's enjoy the fruits of our labor:
we can now tell snakemake to run more than one rule at a time!</p>
<p>Try typing this:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>--delete-all<span class="w"> </span>plot_comparison
snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">3</span><span class="w"> </span>plot_comparison
</code></pre></div>
<p>If you look closely, you should see that snakemake is running all three
<code>sourmash sketch dna</code> commands <em>at the same time</em>.</p>
<p>This is pretty cool and is one of the more powerful practical features
of snakemake: once you tell snakemake <em>what you want it to do</em> (by
specifying your desired output(s)) and give snakemake the set of
recipes telling it <em>how to do each step</em>, snakemake will figure out
the fastest way to run all the necessary steps with the resources
you've given it.</p>
<p>In this case, we told snakemake that it could run up to three jobs at
a time, with <code>-j 3</code>. We could also have told it to run more jobs at a
time, but at the moment there are only three rules that can actually
be run at the same time - <code>sketch_genomes_1</code>, <code>sketch_genomes_2</code>, and
<code>sketch_genomes_3</code>. This is because the rule <code>compare_genomes</code> needs the
output of these three rules to run, and likewise <code>plot_comparison</code> needs
the output of <code>compare_genomes</code> to run. So they can't be run at the
same time as any other rules!</p>
<h2>Chapter 5 - visualizing workflows</h2>
<p>Let's visualize what we're doing! Here's the output of <code>snakemake
--dag plot_comparison</code>, visualized with the graphviz package:</p>
<p><img alt="interm2 graph of jobs" src="images/2023-snakemake-slithering-section-2-interm2-dag.png?raw=true"></p>
<p>This diagram shows the relationship between the rules we've put in the
Snakefile: <code>compare_genomes</code> takes the output of the <code>sketch_genomes_*</code>
rules as its own input, and then <code>plot_comparison</code> uses the output of
<code>compare_genomes</code> to build its own plot.</p>
<p>One key aspect of this graph is that it shows you where the various
rules can be run at the same time as each other because they neither
require nor are required for the others - here, the three
<code>sketch_genomes_*</code> rules. That is what let us run all three simultaneously
in the previous chapter!</p>
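<p>If you want to generate this kind of diagram yourself, one way to do it (assuming you have graphviz's <code>dot</code> command installed) is:</p>
<div class="highlight"><pre><code># render the workflow graph to a PNG; dag.png is just an example name
snakemake --dag plot_comparison | dot -Tpng > dag.png
</code></pre></div>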
<p>Note: sometimes you have to have a single rule that deals with all of
the genomes - for example, <code>compare_genomes</code> has to compare <em>all</em> the
genomes, and there's no simple way around that. But with <code>sketch_genomes</code>,
we do have the option of breaking the rule up!</p>
<h2>Chapter 6 - using wildcards to make rules more generic</h2>
<p>Let's take another look at one of those <code>sketch_genomes_</code> rules:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes_1</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/GCF_000017325.1.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
</code></pre></div>
<p>There's some redundancy in there - the accession <code>GCF_000017325.1</code> shows up
twice. Can we do anything about that?</p>
<p>Yes, we can! We can use a snakemake feature called "wildcards", which will
let us give snakemake a blank space to fill in automatically.</p>
<p>With wildcards, you signal to snakemake that a particular part of an
input or output filename is fair game for substitutions using <code>{</code> and <code>}</code>
surrounding the wildcard name. Let's create a wildcard named <code>accession</code>
and put it into the input and output blocks for the rule:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes_1</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
</code></pre></div>
<p>What this does is tell snakemake that whenever you want an output file
ending with <code>.fna.gz.sig</code>, it should look in the <code>genomes/</code>
directory for a file with the same prefix (the text before <code>.fna.gz.sig</code>)
and ending in <code>.fna.gz</code>, and, <strong>if that file exists</strong>, use it as the input.</p>
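<p>To make the matching concrete, here's how snakemake fills in the wildcard for one of our genomes:</p>
<div class="highlight"><pre><code># requested file:   GCF_000017325.1.fna.gz.sig
# output pattern:   {accession}.fna.gz.sig
#   => accession is "GCF_000017325.1"
# input pattern:    genomes/{accession}.fna.gz
#   => input becomes genomes/GCF_000017325.1.fna.gz
</code></pre></div>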
<p>(Yes, there can be multiple wildcards in a rule! We'll show you that later!)</p>
<p>If you go through and use the wildcards in <code>sketch_genomes_2</code> and
<code>sketch_genomes_3</code>, you'll notice that the rules end up looking <em>identical</em>.
And, as it turns out, you only need (and in fact can only have) one rule -
you can now collapse the three rules into one <code>sketch_genome</code> rule again.</p>
<p>Here's the full <code>Snakefile</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot </span><span class="si">{input}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>It looks a lot like the Snakefile we started with, with the crucial
difference that we are now using wildcards.</p>
<p>Here, unlike the situation we were in at the end of last section where
we had one rule that sketched three genomes, we now have one rule
that sketches one genome at a time, but also can be run in parallel!
So <code>snakemake -j 3</code> will still work! And it will continue to work as
you add more genomes in, and increase the number of jobs you want to
run at the same time.</p>
<p>Before we do that, let's take another look at the workflow now -
you'll notice that it's the same shape, but looks slightly different!
Now, instead of the three rules for sketching genomes having different names,
they all have the same name but have different values for the <code>accession</code> wildcard!</p>
<p><img alt="interm3 graph of jobs" src="images/2023-snakemake-slithering-section-2-interm3-dag.png?raw=true"></p>
<h2>Chapter 7 - giving snakemake filenames instead of rule names</h2>
<p>Let's add a new genome into the mix, and start by generating a sketch
file (ending in <code>.sig</code>) for it.</p>
<p>Download the RefSeq assembly file (the <code>_genomic.fna.gz</code> file) for GCF_008423265.1 from <a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_008423265.1">this NCBI link</a>, and put it in the <code>genomes/</code> subdirectory as <code>GCF_008423265.1.fna.gz</code>. (You can also download a saved copy with the right name from <a href="https://osf.io/7cdxn">this osf.io link</a>.)</p>
<p>Now, we'd like to build a sketch by running <code>sourmash sketch dna</code>
(via snakemake).</p>
<p>Do we need to add anything to the <code>Snakefile</code> to do this? No, no we don't!</p>
<p>To build a sketch for this new genome, you can just ask snakemake to make the
right filename like so:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>GCF_008423265.1.fna.gz.sig
</code></pre></div>
<p>Why does this work? It works because we have a generic wildcard rule for
building <code>.sig</code> files from files in <code>genomes/</code>!</p>
<p>When you ask snakemake to build that filename, it looks through the
output blocks of all of its rules and chooses the rule with a matching output -
importantly, this rule <em>can</em> have wildcards, and if it does, snakemake
extracts the wildcard values from the filename!</p>
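<p>If you're ever unsure which rule snakemake will pick for a given filename, a dry run with <code>-n</code> will show you the matching job (if any) without running it:</p>
<div class="highlight"><pre><code>snakemake -n -j 1 GCF_008423265.1.fna.gz.sig
</code></pre></div>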
<h3>Warning: the <code>sketch_genome</code> rule has now changed!</h3>
<p>As a side note, you can no longer ask snakemake to run the rule by its
name, <code>sketch_genome</code> - this is because the rule needs to fill in the
wildcard, and it can't know what <code>{accession}</code> should be without us
giving it the filename.</p>
<p>If you try running <code>snakemake -j 1 sketch_genome</code>, you'll get the following error:</p>
<blockquote>
<p>WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).</p>
</blockquote>
<p>This is telling you that snakemake doesn't know how to fill in the wildcard
(and giving you some suggestions as to how you might do that, which we'll
explore below).</p>
<p>In this chapter we didn't need to modify the Snakefile at all to make use
of new functionality!</p>
<h2>Chapter 8 - adding new genomes</h2>
<p>So we've got a new genome, and we can build a sketch for it. Let's
add it into our comparison, so we're building a comparison matrix
for <em>four</em> genomes, and not just three!</p>
<p>To add this new genome into the comparison, all you need to do is add
the sketch into the <code>compare_genomes</code> input, and snakemake will
automatically locate the associated genome file and run
<code>sketch_genome</code> on it (as in the previous chapter), and then run
<code>compare_genomes</code> on it. snakemake will take care of the rest!</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_008423265.1.fna.gz.sig"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot </span><span class="si">{input}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Now when you run <code>snakemake -j 3 plot_comparison</code> you will get a
<code>compare.mat.matrix.png</code> file that contains a 4x4 matrix! (See Figure.)</p>
<p><img alt="4x4 matrix comparison of genomes" src="images/2023-snakemake-slithering-section-2-4x4-mat.png"></p>
<p>Note that the workflow diagram has now expanded to include our fourth genome, too!</p>
<p><img alt="interm3 graph of jobs" src="images/2023-snakemake-slithering-section-2-interm4-dag.png?raw=true"></p>
<h2>Chapter 9 - using <code>expand</code> to make filenames</h2>
<p>You might note that the list of files in the <code>compare_genomes</code> rule
all share the same suffix, and they're all built using the same rule.
Can we use that in some way?</p>
<p>Yes! We can use a function called <code>expand(...)</code> and give it a template
filename to build, and a list of values to insert into that filename.</p>
<p>Below, we build a list of accessions named <code>ACCESSIONS</code>, and then use
<code>expand</code> to build the list of input files of the format <code>{acc}.fna.gz.sig</code>
from that list, creating one filename for each value in <code>ACCESSIONS</code>.</p>
<div class="highlight"><pre><span></span><code><span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"GCF_000017325.1"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1"</span><span class="p">,</span>
<span class="s2">"GCF_008423265.1"</span><span class="p">]</span>
<span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">),</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot </span><span class="si">{input}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>While wildcards and <code>expand</code> use the same syntax, they do quite different
things.</p>
<p><code>expand</code> generates a list of filenames, based on a template and a list
of values to insert into the template. It is typically used to make a
list of files that you want snakemake to create for you.</p>
<p>Wildcards in rules provide the recipes by which one or more files
will actually be created. They say, "when you want to make a file
whose name looks like THIS, you can do so from files that look like
THAT, and here's what to run to make that happen."</p>
<p><code>expand</code> tells snakemake WHAT you want to make, wildcard rules tell
snakemake HOW to make those things.</p>
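<p>As a quick sketch of what <code>expand</code> produces - here with just two of our accessions:</p>
<div class="highlight"><pre><code>expand("{acc}.fna.gz.sig", acc=["GCF_000017325.1", "GCF_000020225.1"])
# => ["GCF_000017325.1.fna.gz.sig", "GCF_000020225.1.fna.gz.sig"]
</code></pre></div>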
<h2>Chapter 10 - using default rules</h2>
<p>The last change we'll make to the Snakefile in this section is
to add what's known as a default rule. What is this, and why?</p>
<p>The 'why' is easier. Above, we've been careful to provide specific rule
names or filenames to snakemake, because otherwise it defaults to running
the first rule in the Snakefile. (There's no other way in which the order
of rules in the file matters - but snakemake will try to run the first
rule in the file if you don't give it a rule name or a filename on the
command line.)</p>
<p>This is less than great, because it's one more thing to remember and to
type. In general, it's better to have what's called a "default rule"
that lets you just run <code>snakemake -j 1</code> to generate the file or files you
want.</p>
<p>This is straightforward to do, but it involves a slightly different syntax -
a rule with <em>only</em> an <code>input</code>, and no shell or output blocks. Here's
a default rule for our Snakefile that should be put in the file as
the first rule:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
</code></pre></div>
<p>What this rule says is, "I want the file <code>compare.mat.matrix.png</code>."
It doesn't give any instructions on how to do that - that's what the
rest of the rules in the file are! - and it doesn't <em>run</em> anything,
because it has no shell block, and nor does it <em>create</em> anything,
because it has no output block.</p>
<p>The logic here is simple, if not immediately obvious: this rule succeeds
when that input exists.</p>
<p>If you place that at the top of the Snakefile, then running
<code>snakemake -j 1</code> will produce <code>compare.mat.matrix.png</code>. You no
longer need to provide either a rule name or a filename on the command
line unless you want to do something <em>other</em> than generate that file,
in which case whatever you put on the command line will take
precedence over the <code>rule all:</code>.</p>
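<p>So, with the default rule in place, both of these work - the first uses the default rule, while the second overrides it with an explicit filename:</p>
<div class="highlight"><pre><code>snakemake -j 1               # builds compare.mat.matrix.png via 'rule all'
snakemake -j 1 compare.mat   # builds only compare.mat
</code></pre></div>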
<h2>Chapter 11 - our final Snakefile - review and discussion</h2>
<p>Here's the final Snakefile:</p>
<div class="highlight"><pre><span></span><code><span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"GCF_000017325.1"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1"</span><span class="p">,</span>
<span class="s2">"GCF_008423265.1"</span><span class="p">]</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">),</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot </span><span class="si">{input}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>This <code>Snakefile</code> provides some nice features.</p>
<p>First, it's easy to add new genomes into the comparison - we download
the genome, name it for its accession, and add it to <code>ACCESSIONS</code> at the
top. Voila!</p>
<p>Second, we don't have to remember the names of any rules to run the whole
workflow, because the <code>rule all:</code> at the top provides a sensible default.</p>
<p>Third, it is easy to change the sketching or comparison parameters and
then rerun the entire workflow from scratch - thus letting us quickly
explore alternate parameters for sketching and comparisons if we so
choose.</p>
<p>In future sections, we'll revisit this basic Snakefile from the top,
and explore some of the details of rules, wildcards, and other features.</p>snakemake for doing bioinformatics - a beginner's guide (part 1)2023-01-14T00:00:00+01:002023-01-14T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2023-01-14:/blog/2023-snakemake-slithering-section-1.html<p>Slithering your way into bioinformatics with snakemake</p><p>(The below post contains excerpts from <em>Slithering your way into
bioinformatics with snakemake</em>, Hackmd Press, 2023.)</p>
<h2>Installation and setup!</h2>
<p>I suggest working in a new directory.</p>
<p>You'll need to <a href="https://snakemake.readthedocs.io/en/stable/getting_started/installation.html">install snakemake</a> and <a href="https://sourmash.readthedocs.io/en/latest/#installing-sourmash">sourmash</a>. We suggest using <a href="https://github.com/conda-forge/miniforge#mambaforge">mamba, via miniforge/mambaforge</a>, for this.</p>
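<p>For example, one way to install both into a fresh environment - a sketch assuming you've installed mamba, with <code>smk</code> as an arbitrary environment name - is:</p>
<div class="highlight"><pre><code>mamba create -n smk -c conda-forge -c bioconda snakemake sourmash
mamba activate smk
</code></pre></div>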
<h4>Getting the data:</h4>
<p>You'll need to download these three files:</p>
<ul>
<li><a href="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/021/665/GCF_000021665.1_ASM2166v1/GCF_000021665.1_ASM2166v1_genomic.fna.gz">GCF_000021665.1_ASM2166v1_genomic.fna.gz</a></li>
<li><a href="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/017/325/GCF_000017325.1_ASM1732v1/GCF_000017325.1_ASM1732v1_genomic.fna.gz">GCF_000017325.1_ASM1732v1_genomic.fna.gz</a></li>
<li><a href="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/225/GCF_000020225.1_ASM2022v1/GCF_000020225.1_ASM2022v1_genomic.fna.gz">GCF_000020225.1_ASM2022v1_genomic.fna.gz</a></li>
</ul>
<p>and rename them so that they are in a subdirectory <code>genomes/</code> with the names:</p>
<div class="highlight"><pre><span></span><code>GCF_000017325.1.fna.gz
GCF_000020225.1.fna.gz
GCF_000021665.1.fna.gz
</code></pre></div>
<p>Note, you can download saved copies of them here, with the right names: <a href="https://osf.io/2g4dm/">osf.io/2g4dm/</a>.</p>
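<p>If you prefer the command line, here's a sketch that downloads and renames all three in one go (assuming <code>curl</code> is installed; URLs as above):</p>
<div class="highlight"><pre><code>mkdir -p genomes
curl -L -o genomes/GCF_000017325.1.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/017/325/GCF_000017325.1_ASM1732v1/GCF_000017325.1_ASM1732v1_genomic.fna.gz
curl -L -o genomes/GCF_000020225.1.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/225/GCF_000020225.1_ASM2022v1/GCF_000020225.1_ASM2022v1_genomic.fna.gz
curl -L -o genomes/GCF_000021665.1.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/021/665/GCF_000021665.1_ASM2166v1/GCF_000021665.1_ASM2166v1_genomic.fna.gz
</code></pre></div>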
<h2>Chapter 1 - snakemake runs programs for you!</h2>
<p>Bioinformatics often involves running many different programs to characterize and reduce sequencing data, and I use snakemake to help me do that.</p>
<h3>A first, simple snakemake workflow</h3>
<p>Here's a simple, useful snakemake workflow:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first </span>
<span class="s2"> sourmash compare GCF_000021665.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> GCF_000017325.1.fna.gz.sig GCF_000020225.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> -o compare.mat</span>
<span class="s2"> sourmash plot compare.mat</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Put it in a file called <code>Snakefile</code>, and run it with <code>snakemake -j 1</code>.</p>
<p>This will produce the output file <code>compare.mat.matrix.png</code> which contains a similarity matrix and a dendrogram of the three genomes (see Figure 1).</p>
<p><img alt="similarity matrix and dendrogram" src="images/2023-snakemake-slithering-section-1-mat.png"></p>
<p>This is functionally equivalent to putting these three commands into a file <code>compare-genomes.sh</code> and running it with <code>bash compare-genomes.sh</code> -</p>
<div class="highlight"><pre><span></span><code>sourmash<span class="w"> </span>sketch<span class="w"> </span>dna<span class="w"> </span>-p<span class="w"> </span><span class="nv">k</span><span class="o">=</span><span class="m">31</span><span class="w"> </span>genomes/*.fna.gz<span class="w"> </span>--name-from-first<span class="w"> </span>
sourmash<span class="w"> </span>compare<span class="w"> </span>GCF_000021665.1.fna.gz.sig<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>GCF_000017325.1.fna.gz.sig<span class="w"> </span>GCF_000020225.1.fna.gz.sig<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-o<span class="w"> </span>compare.mat<span class="w"> </span>
sourmash<span class="w"> </span>plot<span class="w"> </span>compare.mat<span class="w"> </span>
</code></pre></div>
<p>The snakemake version is already a little bit nicer because it will
give you encouragement when the commands run successfully (with nice
green text saying "1 of 1 steps (100%) done"!) and if the commands
fail you'll get red text alerting you to that, too.</p>
<p>But! We can further improve the snakemake version over the shell
script version!</p>
<h3>Avoiding unnecessary rerunning of commands: a second snakemake workflow</h3>
<p>The commands will run every time you invoke snakemake with <code>snakemake -j 1</code>. But most of the time you don't need to rerun them because you've already got the output files you wanted!</p>
<p>How do you get snakemake to avoid rerunning rules?</p>
<p>We can do that by telling snakemake what we expect the output to be by adding an <code>output:</code> block in front of the shell block:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first</span>
<span class="s2"> sourmash compare GCF_000021665.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> GCF_000017325.1.fna.gz.sig GCF_000020225.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> -o compare.mat</span>
<span class="s2"> sourmash plot compare.mat</span>
<span class="s2"> """</span>
</code></pre></div>
<p>and now when we run <code>snakemake -j 1</code> once, it will run the commands; but when we run it again, it will say, "Nothing to be done (all requested files are present and up to date)."</p>
<p>This is because the desired output file, <code>compare.mat.matrix.png</code>, already exists. So snakemake knows it doesn't need to do anything!</p>
<p>If you remove <code>compare.mat.matrix.png</code> and run <code>snakemake -j 1</code> again, snakemake will happily make the files again:</p>
<div class="highlight"><pre><span></span><code>rm<span class="w"> </span>compare.mat.matrix.png
snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span>
</code></pre></div>
<p>So snakemake makes it easy to avoid re-running a set of commands if it
has already produced the files you wanted. This is one of the best
reasons to use a workflow system like snakemake for running
bioinformatics workflows; shell scripts don't automatically avoid
re-running commands.</p>
<h3>Running only the commands you need to run</h3>
<p>The last Snakefile above has three commands in it, but if you remove the <code>compare.mat.matrix.png</code> file you only need to run the last command again - the files created by the first two commands already exist and don't need to be recreated. However, snakemake doesn't know that - it treats the entire rule as a single step, and doesn't look into the shell command to work out what it doesn't need to run.</p>
<p>If we want to avoid re-creating the files that already exist, we need to make the Snakefile a little bit more complicated.</p>
<p>First, let's break out the commands into three separate rules.</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes</span><span class="p">:</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare GCF_000021665.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> GCF_000017325.1.fna.gz.sig GCF_000020225.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> -o compare.mat</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot compare.mat</span>
<span class="s2"> """</span>
</code></pre></div>
<p>We didn't do anything too complicated here - we made two new rule blocks, with their own names, and split the shell commands up so that each shell command has its own rule block.</p>
<p>You can tell snakemake to run all three:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>sketch_genomes<span class="w"> </span>compare_genomes<span class="w"> </span>plot_comparison
</code></pre></div>
<p>and it will successfully run them all!</p>
<p>However, we're back to snakemake running some of the commands every time - it won't run <code>plot_comparison</code> every time, because <code>compare.mat.matrix.png</code> exists, but it will run <code>sketch_genomes</code> and <code>compare_genomes</code> repeatedly.</p>
<p>How do we fix this?</p>
<h3>Adding output blocks to each rule</h3>
<p>If we add output blocks to <em>each</em> rule, then snakemake will only run rules
where the output needs to be updated (e.g. because it doesn't exist).</p>
<p>Let's do that:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare GCF_000021665.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> GCF_000017325.1.fna.gz.sig GCF_000020225.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> -o compare.mat</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot compare.mat</span>
<span class="s2"> """</span>
</code></pre></div>
<p>and now</p>
<div class="highlight"><pre><span></span><code>snakemake -j 1 sketch_genomes compare_genomes plot_comparison
</code></pre></div>
<p>will run each command only once, as long as the output files are still there. Huzzah!</p>
<p>But we still have to specify the names of all three rules, in the right order, to run this. That's annoying! Let's fix that next.</p>
<h2>Chapter 2: snakemake connects rules for you!</h2>
<h3>Chaining rules with <code>input:</code> blocks</h3>
<p>We can get snakemake to automatically connect rules by providing
information about the <em>input</em> files a rule needs. Then, if you ask
snakemake to run a rule that requires certain inputs, it will
automatically figure out which rules produce those inputs as their
output, and automatically run them.</p>
<p>Let's add input information to the <code>plot_comparison</code> and <code>compare_genomes</code>
rules:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare GCF_000021665.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> GCF_000017325.1.fna.gz.sig GCF_000020225.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> -o compare.mat</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot compare.mat</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Now you can just ask snakemake to run the last rule:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>plot_comparison
</code></pre></div>
<p>and snakemake will run the other rules only if those input files don't exist and need to be created.</p>
<h3>Taking a step back</h3>
<p>The Snakefile is now a lot longer, but it's not <em>too</em> much more complicated - what we've done is split the shell commands up into separate rules and annotated each rule with information about what file it produces (the output), and what files it requires in order to run (the input).</p>
<p>This has the advantage of making it so you don't need to rerun commands unnecessarily. This is only a small advantage with our current workflow, because sourmash is pretty fast. But if each step takes an hour to run, avoiding unnecessary steps can make your work go much faster!</p>
<p>And, as you'll see later, these rules are reusable building blocks that can be incorporated into workflows that each produce different files. So there are other good reasons to break shell commands out into individual rules!</p>
<h2>Chapter 3: snakemake helps you avoid redundancy!</h2>
<h3>Avoiding repeated filenames by using <code>{input}</code> and <code>{output}</code></h3>
<p>If you look at the previous Snakefile, you'll see a few repeated filenames - in particular, rule <code>compare_genomes</code> has three filenames in the input block and then repeats them in the shell block, and <code>compare.mat</code> is repeated several times in both <code>compare_genomes</code> and <code>plot_comparison</code>.</p>
<p>We can tell snakemake to reuse filenames by using <code>{input}</code> and <code>{output}</code>. The <code>{</code> and <code>}</code> tell snakemake to interpret these not as literal strings but as template variables that should be replaced with the value of <code>input</code> and <code>output</code>.</p>
<p>Let's give it a try!</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot </span><span class="si">{input}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>This approach not only involves less typing in the first place, but also makes it so that you only have to edit filenames in one place. This avoids mistakes caused by adding or changing filenames in one place and not another place - a mistake I've made plenty of times!</p>
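<p>To see what the templating does, here is what the <code>compare_genomes</code> shell command expands to with the input and output above:</p>
<div class="highlight"><pre><code>sourmash compare GCF_000017325.1.fna.gz.sig GCF_000020225.1.fna.gz.sig \
    GCF_000021665.1.fna.gz.sig -o compare.mat
</code></pre></div>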
<h3>snakemake makes it easy to rerun workflows!</h3>
<p>It is common to want to rerun an entire workflow from scratch, to make sure that you're using the latest data files and software. Snakemake makes this easy!</p>
<p>You can ask snakemake to clean up all the files that it knows how to generate - and <em>only</em> those files:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>plot_comparison<span class="w"> </span>--delete-all-output
</code></pre></div>
<p>which can then be followed by asking snakemake to regenerate the results:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 1 plot_comparison
</code></pre></div>
<h3>snakemake will alert you to missing files if it can't make them!</h3>
<p>Suppose you add a new file that does not exist to <code>compare_genomes</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"does-not-exist.sig"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare </span><span class="si">{input}</span><span class="s2"> GCF_000021665.1.sig -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot </span><span class="si">{input}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Here, <code>does-not-exist.sig</code> doesn't exist, and we haven't given snakemake a rule to make it, either. What will snakemake do??</p>
<p>It will complain, loudly and clearly! And it will do so before running anything.</p>
<p>First, let's force the <code>compare_genomes</code> rule to rerun by removing its output file:</p>
<div class="highlight"><pre><span></span><code>rm<span class="w"> </span>compare.mat
</code></pre></div>
<p>and then run <code>snakemake -j 1</code>. You should see:</p>
<div class="highlight"><pre><span></span><code><span class="nv">Missing</span><span class="w"> </span><span class="nv">input</span><span class="w"> </span><span class="nv">files</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nv">rule</span><span class="w"> </span><span class="nv">compare_genomes</span>:
<span class="w"> </span><span class="nv">output</span>:<span class="w"> </span><span class="nv">compare</span>.<span class="nv">mat</span>
<span class="w"> </span><span class="nv">affected</span><span class="w"> </span><span class="nv">files</span>:
<span class="w"> </span><span class="nv">does</span><span class="o">-</span><span class="nv">not</span><span class="o">-</span><span class="nv">exist</span>.<span class="nv">sig</span>
</code></pre></div>
<p>This is exactly what you want - a clear indication of what is missing before your workflow runs.</p>
<h2>Next steps</h2>
<p>We've introduced basic snakemake workflows, which give you a simple way to run shell commands in the right order. snakemake already offers a few nice improvements over running the shell commands by yourself or in a shell script -</p>
<ul>
<li>it doesn't run shell commands if you already have all the files you need</li>
<li>it lets you avoid typing the same filenames over and over again</li>
<li>it gives simple, clear errors when something fails</li>
</ul>
<p>While this functionality is nice, there are many more things we can do to improve the efficiency of our bioinformatics!</p>
<p>In the next section, we'll explore </p>
<ul>
<li>writing more generic rules using <em>wildcards</em>;</li>
<li>typing fewer filenames by using more templates;</li>
<li>providing a list of default output files to produce;</li>
<li>running commands in parallel on a single computer;</li>
<li>loading lists of filenames from spreadsheets;</li>
<li>configuring workflows with input files.</li>
</ul>sourmash has a plugin interface!2023-01-08T00:00:00+01:002023-01-08T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2023-01-08:/blog/2023-sourmash-plugins-first-effort.html<p>Enabling plugins in sourmash, for less directed & more incoherent progress!</p><p>Over the holiday break, I took on a "palate cleansing" project - something technically neat, that wasn't critically important to anyone or anything, but could be useful. I decided to implement plugins for sourmash.</p>
<p><a href="sourmash.readthedocs.io/">Sourmash</a> is open-source scientific software for fast, lightweight exploration of sequencing data set comparison, with a focus on metagenomics. It's largely a command-line program written in Python on top of a Rust library. It is maintained by a small group of developers, most of whom are (or were) affiliated in some way with my academic lab at UC Davis.</p>
<p>Python has (what seems to be) robust support for third-party plugins, where a project can provide hooks for <em>other people</em> to customize functionality.</p>
<p>So the question was, can we add Python plugin support to sourmash?</p>
<h2>First - why focus on plugins?</h2>
<p>Plugins serve a lot of purposes for a project, but I think the most interesting justification for supporting them came from Tim Head, who channeled his observations of Simon Willison's <a href="https://datasette.io/">datasette</a> project into a statement that <strong>plugins are an alternate way to direct open source projects</strong>. (You can read the whole Twitter thread <a href="https://twitter.com/betatim/status/1355902709237473281">here</a>.)</p>
<p>Tim's tl;dr was this: </p>
<blockquote>
<p>"first class plugins" is my current best answer to "we need a project roadmap"</p>
</blockquote>
<p>but what does that mean?</p>
<p>The central idea is that the more extensible you make a project with plugins, the easier it is for everyone to "play" with the project,
pursue their own directions, and figure out what to do next.</p>
<p>Or, to rephrase: if you focus your planning and governance efforts on defining how others can extend the core functionality of your software, then you free others up to do so without permission or close engagement. This can enable a lot of experimentation and creativity!</p>
<p>That was a large part of my sociotechnical motivation in looking into plugins, but there were several more reasons:</p>
<ul>
<li>
<p>Maintaining an open source project is a fair bit of work, and I have a lot of interest in keeping the "feature surface" of sourmash small so that there's less to maintain. That battles with the desire to add more functionality to meet research and user needs. Plugins offer a way to segregate efforts to either side of a well-defined interface: either it's a "core" effort (lots of coordination and work!) or an "external" effort (maybe less work, certainly less coordination), and we can allocate our attention appropriately.</p>
</li>
<li>
<p>With a robust core, plugins can combine to expand the feature surface of sourmash combinatorially. That's a fancy way of saying that if there's a neat new visualization plugin written by Tina, and a neat new remote-collection loading mechanism written by Steve, people can use these plugins in <em>combination</em> to apply the viz to remote collections.</p>
</li>
<li>
<p>Right now it's quite hard to add platform-specific features to sourmash - in particular, there are some software packages that we'd like to use that don't compile on Mac OS ARM laptops. Plugins would be one way to support those features on specific platforms.</p>
</li>
<li>
<p>Refactoring internals to support plugins can clean up the internal code! The loading and saving plugins are implemented in exactly the same way as our internal code, and I think the effort to modularize loading/saving over time has ended up with reasonably simple and decent code internally. Plugins reinforce that by standardizing the API.</p>
</li>
</ul>
<h2>And how's that going, Dr. Brown?</h2>
<p>What I can say after putting in a dozen or so hours of work on the plugin framework is that it's been very liberating - it's just <em>so much easier</em> to try out new ideas, and clearly distinguish them from "serious" core code contributions that need more care and thought.</p>
<p>So, ...it's going well!</p>
<h2>What types of plugins does sourmash support?</h2>
<p>As of this morning, the main branch of sourmash supports <code>load_from</code> and <code>save_to</code> plugins. As the names suggest, these plugins provide alternate ways of loading and saving sourmash sketches.</p>
<p>Using these, I've built out an <a href="https://github.com/sourmash-bio/sourmash_plugin_avro">Avro format saving/loading plugin</a> as well as a <a href="https://github.com/sourmash-bio/sourmash_plugin_load_urls">load-sketches-from-URIs plugin based on fsspec</a>.</p>
<p>I'm <a href="https://github.com/sourmash-bio/sourmash/pull/2438">currently working</a> on adding support for new command-line subcommands. The idea is that you would be able to add new commands under <code>sourmash scripts</code> (a provisional name).</p>
<h2>How did we implement plugin support?</h2>
<p>You can see <a href="https://github.com/sourmash-bio/sourmash/pull/2428">the first plugin PR here, in sourmash#2428</a>, but the tl;dr is: we used <a href="https://docs.python.org/3/library/importlib.metadata.html"><code>importlib.metadata</code></a> to support plugins via <a href="https://setuptools.pypa.io/en/latest/userguide/entry_point.html">entry points</a>.</p>
<p>The code to support plugins is pretty minimal, and currently resides in <a href="https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/plugins.py">sourmash.plugins</a>. It looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># load entry points.</span>
<span class="n">_plugin_load_from</span> <span class="o">=</span> <span class="n">entry_points</span><span class="p">(</span><span class="n">group</span><span class="o">=</span><span class="s1">'sourmash.load_from'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_load_from_functions</span><span class="p">():</span>
<span class="s2">"Load the 'load_from' plugins and yield tuples (priority, name, fn)."</span>
<span class="c1"># Load each plugin,</span>
<span class="k">for</span> <span class="n">plugin</span> <span class="ow">in</span> <span class="n">_plugin_load_from</span><span class="p">:</span>
<span class="n">loader_fn</span> <span class="o">=</span> <span class="n">plugin</span><span class="o">.</span><span class="n">load</span><span class="p">()</span>
<span class="c1"># get 'priority' if it is available</span>
<span class="n">priority</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">loader_fn</span><span class="p">,</span> <span class="s1">'priority'</span><span class="p">,</span> <span class="n">DEFAULT_LOAD_FROM_PRIORITY</span><span class="p">)</span>
<span class="c1"># retrieve name (which is specified by plugin?)</span>
<span class="n">name</span> <span class="o">=</span> <span class="n">plugin</span><span class="o">.</span><span class="n">name</span>
<span class="k">yield</span> <span class="n">priority</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">loader_fn</span>
</code></pre></div>
<p>Then, in the <code>pyproject.toml</code> of a Python package, anyone can state that there's a sourmash plugin available like so:</p>
<div class="highlight"><pre><span></span><code><span class="k">[project.entry-points."sourmash.load_from"]</span>
<span class="na">a_reader</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"module_name:load_sketches"</span>
<span class="k">[project.entry-points."sourmash.save_to"]</span>
<span class="na">a_writer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"module_name:SaveSignatures_WriteFile"</span>
</code></pre></div>
<p>and this will get automatically loaded and used by sourmash.</p>
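<p>For concreteness, here's roughly what such a plugin module might look like on the other side of the entry point. This is a hedged sketch, not canonical sourmash API: the module and function names echo the <code>pyproject.toml</code> snippet above, and the <code>priority</code> attribute is just the optional hook that <code>get_load_from_functions()</code> retrieves via <code>getattr</code>.</p>
<div class="highlight"><pre><span></span><code># module_name.py -- a hypothetical 'load_from' plugin module.
# Only the general shape is shown; see the sourmash developer docs
# for the exact signature a loader function must implement.

def load_sketches(location, *args, **kwargs):
    # inspect 'location' and return loaded sketches if this loader
    # applies; otherwise signal failure so other loaders get a turn
    # (an assumption about the protocol, not a documented contract)
    ...

# optional attribute read by get_load_from_functions() via getattr()
load_sketches.priority = 75
</code></pre></div>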
<h2>How do plugins fit into the sourmash ecosystem?</h2>
<p>We have an interesting lab-centric / lab-adjacent ecosystem developing around sourmash.</p>
<p>sourmash itself provides a reasonably rich Python and Rust API, for people wanting to do clever things with it. For example, the <a href="https://www.biorxiv.org/content/10.1101/2022.11.02.514947v1">branchwater software</a> is a fairly small script for doing parallel search of many genomes, built on top of the Rust library.</p>
<p>There are workflows that make use of sourmash to do cool things, like characterizing metagenomes (<a href="https://dib-lab.github.io/genome-grist/">genome-grist</a>) and decontaminating databases (<a href="https://github.com/dib-lab/charcoal/">charcoal</a>). These (and other) workflows wrap sourmash in a larger workflow (snakemake, nextflow, CWL, ...?) to do various things.</p>
<p>There's also a nascent <a href="https://github.com/Arcadia-Science/sourmashconsumr">R library, sourmashconsumr</a> being built by Taylor Reiter (and others) at Arcadia Science, for taking the output of sourmash and doing nice things with it.</p>
<p>Taylor is also developing code to load sourmash <code>compare</code> and <code>gather</code> output into MultiQC (see <a href="https://github.com/ewels/MultiQC/issues/1805">issue</a>). This is in effect using sourmash as a plugin for <em>other</em> software.</p>
<p>And now, sourmash plugins add a nice set of opportunities to diversify sourmash internal functionality to this ecosystem. It will be interesting to watch what happens as we build out this functionality!</p>
<h2>How do we support plugin developers?</h2>
<p>An important aspect of plugins is supporting plugin developers - so, <a href="https://sourmash.readthedocs.io/en/latest/dev_plugins.html">we have some nascent documentation</a>, as well as a <a href="https://github.com/sourmash-bio/sourmash_plugin_template">getting-started template repository</a> to eliminate a lot of the boilerplate.</p>
<p>I'm not 100% sure what to add beyond this, but I find dogfooding it to be a good approach - every time I work on a plugin, I will sand down the sharper corners of our documentation a bit more.</p>
<h2>Where next?</h2>
<p>For now, plugins remain experimental and are not subject to semantic versioning considerations. I'm not sure when that will change, but I want to write a few more plugins before committing to the current interfaces!</p>
<p>I think we probably have room for many more <em>types</em> of plugins. We're thinking about how to enable different taxonomy loading functions, for example; and I have specific need for better manifest/picklist manipulation that I think is amenable to being made a plugin. (The plugin design issue is <a href="https://github.com/sourmash-bio/sourmash/issues/1353">sourmash#1353</a> if you're interested.)</p>
<p>I am also starting to think more about the user experience. How do users find, install, use, debug, and remove plugins? This is all relatively easy if you're a Python developer who is familiar with <code>pip</code> and <code>importlib.metadata</code>, but that's not our user base ;). For now, I've started by adding plugin reporting to <code>sourmash info -v</code>, which at least gives us a chance of figuring out what plugins might be around!</p>
<p>I'm also not quite sure how to manage what I expect to be a flood of small plugins from within my lab. Will we want to have a set of recommended plugins that evolves and matures over time? And how do we avoid massively increasing our maintenance surface? (Simon Willison has some <a href="https://simonwillison.net/2022/Nov/26/productivity/">sage advice for the serial project hoarder that applies here</a>.)</p>
<p>I also have this niggling feeling that I should read through datasette's plugin interface to see what I can learn from all of Simon's hard work and experience...</p>
<p>--titus</p>Reading "Orwell's Roses" by Rebecca Solnit2022-12-31T00:00:00+01:002022-12-31T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2022-12-31:/blog/2022-reading-orwells-roses.html<p>This is a good book!</p><p>Happy New Year's Eve!</p>
<p>So, one of my resolutions for 2023 is that I want to do more
non-escapist reading.</p>
<p>Why? And what had I been reading??</p>
<p>For the last three years I've been reading a lot of trashy
books. Unless it was for work (biology/bioinformatics papers) or
random infovore articles that I found online, I've read almost nothing
but mystery novels, romance novels, and LitRPG. (Don't judge, it's
been a weird three years. ;)</p>
<p>Now, LitRPG is all well and good (He Who Fights with Monsters is super
fun!) and I had plenty of reasons to escape, but I was avoiding
anything requiring an attention span, and my stack of Good But Serious
Books was piling up. Every now and then I'd get a chance to read
something more serious and I'd remember how much fun it was to read
something that was well written and meaningful and horizon-expanding,
but soon enough my attention span would lapse and I'd be back to
reading LitRPG at 9pm.</p>
<p>SO.</p>
<p>Sometime in the last year, I picked up
<a href="https://www.penguinrandomhouse.com/books/607057/orwells-roses-by-rebecca-solnit/">Orwell's Roses</a>
at our <a href="https://avidreaderbooks.com/">local bookstore, Avid Reader</a>. I
had been introduced to Rebecca Solnit's writing through her amazing
book
<a href="https://www.amazon.com/Paradise-Built-Hell-Extraordinary-Communities/dp/0143118072">A Paradise Built in Hell: The Extraordinary Communities That Arise in Disaster</a>,
which Tracy Teal had recommended to me. While I hadn't (haven't yet!)
finished that book, the parts that I had read were amazing. Sometime
in 2022 I started following
<a href="https://twitter.com/rebeccasolnit">@RebeccaSolnit</a> on Twitter,
perhaps because I saw her retweeted by
<a href="https://twitter.com/m_older">Malka Older (@m_older)</a>, and I found
Solnit's tweets and articles inspirational. And through her tweets I
found her book
<a href="https://www.haymarketbooks.org/books/791-hope-in-the-dark">Hope In the Dark: Untold Histories, Wild Possibilities</a>. That, in turn, became my end-of-quarter
reading for December (and maybe more on that book later!)</p>
<p>This is all to say that for about a year, Orwell's Roses had been
staring at me from my bookshelf. With its bright red cover, it's a
distinctive book. Moreover, I've been a huge fan of Orwell's writing
ever since reading
<a href="https://en.wikipedia.org/wiki/Down_and_Out_in_Paris_and_London">Down and Out in Paris and London</a>
in my teens, and several scenes from that book remain burned into my
memory. And so I decided that <em>Orwell's Roses</em> would be my next book
to read!</p>
<h2>What's "Orwell's Roses" about?</h2>
<p>It's a lovely, meandering tale about Orwell's life and beliefs, and how
his politics intersected with his love of gardening and farming. Along the
way we are treated to extended (and very relevant!) digressions into
other intersecting stories. It's kind of a biography, but also a partial
history of certain kinds of thinking.</p>
<p>Two particular themes caught me the most. One was the discussion of
"Bread and Roses", a
<a href="https://en.wikipedia.org/wiki/Bread_and_Roses">political slogan linked to women's suffrage</a>. The
"bread" here refers to the basic needs of sustenance - food, water,
housing, and the like. The "roses" refers to something a bit more
indefinite - the freedom to pursue an independent life of the mind,
whether it be art, music, literature, or something else. I can't
possibly do justice to the discussion of this in the book, other than
to say that the theme of "Bread for all, and Roses too" resonates in
this age of COVID, r/antiwork, and union organizing.</p>
<p>The other theme that caught me is that of locality vs uniformity, or
community vs systems, or bottom-up vs top-down. This is maybe a bit
more entwined with my personal interests, and richer in my mind and
hence harder to explain -- but it is also a theme that is guiding a
lot of my reading choices, so I expect to get lots of practice
thinking and maybe writing about it!</p>
<p>In brief, Solnit describes how Orwell took great pleasure in the
particulars of gardening and farming, and grounded himself in the
daily routines of his life. Solnit does a lovely job of connecting
this to Orwell's writing, and talked about how taking joy in both
productive and "non-productive" tasks (such as planting roses!) is an
important and small-scale rebellion against the cult of productivity
and grind that capitalism has instilled in our modern life. There
were strong resonances with the book
<a href="https://en.wikipedia.org/wiki/Seeing_Like_a_State">Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed</a>,
as well as the
<a href="https://pluralistic.net/2020/07/14/poesy-the-monster-slayer/#stay-on-target">"chickenization" theme</a>
of gig working introduced to me by Cory Doctorow. (In my imagination,
Solnit and Doctorow have very productive regular chats over coffee.
They live in the same city, I think, so it's a possibility, right? Oh to
be a fly on the wall!)</p>
<p>I don't really have a conclusion here, other than that <em>Orwell's Roses</em>
was a very rewarding read that made me think, and think differently.
And it was a great book to break my fast on!</p>
<h2>What book is next?</h2>
<p>I'm planning to start on a new book today. I'm currently being eyed by
<a href="https://milkweed.org/book/braiding-sweetgrass">Braiding Sweetgrass: Indigenous Wisdom, Scientific Knowledge and the Teachings of Plants</a>,
by Robin Wall Kimmerer, which has been sitting on my shelf for almost
a year... it looks like a thick book, but I'll just take it 30 minutes
at a time and we'll see how it goes!</p>
<p>--titus</p>So! You want to search all the public metagenomes with a genome sequence!2022-08-31T00:00:00+02:002022-08-31T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2022-08-31:/blog/2022-sourmash-mastiff.html<p>Searching all the things - faster!</p><p>Imagine you have a (microbial) genome. Or a contig. And you want to
find similar sequences, either in genomes or in metagenomes.</p>
<p>Looking for it in genomes is possible, if not always easy - you can go
to NCBI and do a BLAST of some sort, but BLAST is intended for more
sensitive and shorter matches. But there are other tools, including
<a href="https://sourmash.readthedocs.io/">sourmash</a>, a tool we've been
developing for a few years, that will happily do it for you.</p>
<p>Looking for something in <em>metagenomes</em> is harder. Metagenomes are
hundreds, thousands, or even millions of times larger than genomes,
and doing <em>anything</em> with them quickly is hard. sourmash supports
doing it one metagenome at a time, but it's slow and memory intensive;
<a href="https://serratus.io/">serratus</a> will do it for you using the power of
the cloud, but it will cost you (at least) a few thousand $$.</p>
<p>If you're interested in how we're doing DNA sequence search, here's an
excerpt from
<a href="http://ivory.idyll.org/blog/2022-storing-ulong-in-sqlite-sourmash.html">a previous blog post about using SQLite to store our data</a> -</p>
<blockquote>
<p>The basic idea is that we take long DNA sequences, extract
sub-sequences of a fixed length (say k=31), hash them, and then sketch
them by retaining only those that fall below a certain threshold
value. Then we search for matches between sketches based on number of
overlapping hashes. This is a proxy for the number of overlapping k=31
subsequences, which is in turn convertible into various sequence
similarity metrics.</p>
</blockquote>
<h2>MAGsearch exists! It works! But it's hard to share.</h2>
<p>For a couple of years now, we've had something called
<a href="http://ivory.idyll.org/blog/2021-MAGsearch.html">MAGsearch</a> working
on our own private infrastructure. MAGsearch is sourmash on steroids:
it uses the same underlying Rust library as sourmash and loads and
searches the metagenomes quickly. And it will do all of this on
commodity hardware that many people have access to - a search of up to
a thousand genomes against the SRA takes under 12 GB of RAM, and under
11 hours, using 32 cores.</p>
<p>MAGsearch does a fairly straightforward thing: it loads all the query
genomes into memory and then iteratively loads each of ~700,000
metagenome sketches, reporting any overlaps. It does so in parallel,
which is why it's so fast - doing this with sourmash would take about
40 times as long, because sourmash isn't parallelized.</p>
<p>One problem with MAGsearch is that it's not real time. 10 hours is
great!!, especially for 1000 genomes, but that's still only about two
genomes a minute. And it's too slow for us to provide MAGsearch as a
service.</p>
<p>Another problem is that the underlying data is about 10 TB at the
moment, and we don't really have a way to share that data.</p>
<p>So we've been using MAGsearch a fair bit over the last two years to do
searches for others, but it's always done in a kind of batch mode
where we run it in between other things we're doing.</p>
<h2>Enter 'mastiff' - using RocksDB to do things faster</h2>
<p>For the
<a href="https://usermeeting.jgi.doe.gov/agenda/">2022 JGI User Meeting</a>
Dr. Luiz Irber was invited to talk about his MAGsearch work, and he
got inspired to try out an alternative solution.</p>
<p>He decided to implement an inverted index using
<a href="http://rocksdb.org/">RocksDB</a>, an embeddable database. I haven't dug
into <a href="https://github.com/sourmash-bio/mastiff">the implementation</a>,
but I believe mastiff uses individual hashes as keys and stores a
vector of dataset IDs as values. A search, then, looks up each query
hash as a key and tallies the dataset IDs stored in the values,
reporting the datasets with sufficient estimated overlap.</p>
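<p>To make the idea concrete - and only the idea; the real thing is Rust on top of RocksDB - here is a toy in-memory inverted index in Python. All names here are mine, not mastiff's:</p>
<div class="highlight"><pre><span></span><code>from collections import defaultdict

# toy inverted index: hashval -> list of dataset IDs containing it
index = defaultdict(list)

def add_dataset(dataset_id, hashes):
    for hashval in hashes:
        index[hashval].append(dataset_id)

def search(query_hashes, min_overlap=1):
    # tally, per dataset, how many of the query hashes it contains
    counts = defaultdict(int)
    for hashval in query_hashes:
        for dataset_id in index.get(hashval, ()):
            counts[dataset_id] += 1
    return {d: c for d, c in counts.items() if c >= min_overlap}
</code></pre></div>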
<p>Luiz reported that it took a bit under three weeks to build a RocksDB
index for 500,000 datasets at k=21, scaled=1000. The resulting
database is about 700 GB. He then wrote a Web server to enable queries
against the database.</p>
<h2>mastiff allows real-time search of SRA-scale data sets!</h2>
<p>So... it's fast. Like, really fast.</p>
<p>It's so fast, you can just go try it out yourself - I've put up a
simple notebook
<a href="https://github.com/sourmash-bio/2022-search-sra-with-mastiff/blob/main/interpret-sra-live.ipynb">here</a>
in
<a href="https://github.com/sourmash-bio/2022-search-sra-with-mastiff">this github repo</a>,
and you can run it directly by clicking on the button below:
<a href="https://mybinder.org/v2/gh/sourmash-bio/2022-search-sra-with-mastiff/stable?labpath=interpret-sra-live.ipynb"><img alt="Binder" src="https://mybinder.org/badge_logo.svg"></a></p>
<p>This notebook does the following:</p>
<ul>
<li>downloads some SRA metadata (once)</li>
<li>loads and sketches a Shewanella genome query into a sourmash signature (~45 KB, for a ~5.3 Mbp genome)</li>
<li>serializes the signature and sends it to the mastiff server to search it against the SRA (sketched in code below)</li>
<li>receives the resulting CSV of dataset + containment estimates</li>
<li>interprets the CSV in light of the SRA metadata</li>
</ul>
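<p>In rough outline, the sketch-and-query part of that flow looks something like the code below. This is an approximation rather than the notebook's exact code - in particular, <code>MASTIFF_URL</code> is a placeholder, not the real endpoint - though the sourmash calls are the standard Python API:</p>
<div class="highlight"><pre><span></span><code>import io
import requests   # assumption: a plain HTTP POST to the server
import screed     # the sequence parser sourmash uses
import sourmash

MASTIFF_URL = "https://example.org/search"   # placeholder endpoint

# sketch the query at k=21, scaled=1000, matching the index
mh = sourmash.MinHash(n=0, ksize=21, scaled=1000)
for record in screed.open("shewanella.fa.gz"):
    mh.add_sequence(record.sequence, force=True)

sig = sourmash.SourmashSignature(mh, name="query")

# serialize the signature to JSON and send it off
buf = io.StringIO()
sourmash.save_signatures([sig], buf)
response = requests.post(MASTIFF_URL, data=buf.getvalue())

# the server responds with a CSV of accessions + containment
print(response.text)
</code></pre></div>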
<p>What you'll see at the bottom of the notebook is that this particular
genome tends to show up in freshwater and wastewater.</p>
<p>The cool thing is that you can run your own queries if you like - just
replace the <code>shewanella.fa.gz</code> file references with your own queries
of interest!</p>
<p>(There's also
<a href="https://snakemake.readthedocs.io/">a snakemake workflow</a> to query
mastiff if you want to run many queries, and a mastiff command-line
program that will sketch and query all in one go.)</p>
<h2>What can mastiff be used for?</h2>
<p>MAGsearch is already being used by people for
<a href="https://twitter.com/phiweger/status/1402506165452513283">outbreak analysis</a>
and biogeography studies, among other things. We have a few different
active research projects in the lab that are exploring its utility for
various questions. So we will soon be able to do those things a lot
faster. Yay!</p>
<p>I personally am looking forward to digging into strain dynamics and
content-based alerts of new metagenomes, among other things.</p>
<p>We can also enable other cool projects, including (perhaps most
importantly) things that we didn't think of.</p>
<p>A rule of thumb that I like is that a technology will be most useful
for researchers when a summer undergrad can casually use it to explore
wild-haired ideas and initiate summer projects based on rapidly
generated exploratory results - and I'm really curious to see what we
can enable others to do with this ;). I can imagine that once people
can casually search the SRA with queries, they'll come up with lots of
ideas and make lots of discoveries. (Of course, lots of follow-up work
would be needed, too - chasing down what detection of a genome in a
metagenome means <em>biologically</em> is tough!)</p>
<p>It has not escaped our notice that this can be used for much smaller
databases, too. So we're looking forward to enabling real-time search
of all the NCBI microbial genomes, as well as ..well, whatever we can
get our hands on :).</p>
<p>mastiff will eventually (see below, "Whither mastiff?") be integrated
into sourmash and/or robustified, and then it will support private
databases, too.</p>
<h2>Well, but wait, you said "real-time"</h2>
<p>Right, I did - it takes between 2 and 10 seconds to do a search, and
IIRC the server can handle up to 200 simultaneous queries at a time.</p>
<p>And I've gotta be honest... at first I missed the point that this was
real-time. And web-enabled.</p>
<p>I was describing it to some collaborators, and while I was describing
it I realized, oh, cool, we can actually do this all in JavaScript via
WebAssembly too, of course.</p>
<p>So, also coming eventually (if not, like, tomorrow), I expect we will
provide a Web site where you can sketch a genome client-side (e.g. in
the browser - see
<a href="https://github.com/sourmash-bio/sourmash/issues/1973">sourmash#1973</a>),
and then receive near-instantaneous reporting on similarities to any
known genome as well as presence within public metagenomes.</p>
<p>And, once various things are worked out &lt;waves hands about
infrastructure and sustainability and cost&gt;, I hope we can provide
this as a generic service for others to use.</p>
<p>So that seems neat, right?</p>
<h2>Cautions, reservations, and limitations</h2>
<p>There are a few things you should know before you get too excited. I
mean, you should totally be excited, but... read on.</p>
<p>First, this is a proof of concept. It shows it can be done, but it is
not (yet) something that anyone other than Luiz can run! Engineering
and testing and releasing needs to happen, and that will take time.</p>
<p>Second, there are reasonably significant limitations to this on the
scientific side. The search will only work out to about
<a href="https://github.com/sourmash-bio/sourmash/issues/1859">90% average nucleotide identity (ANI) - a containment of .01-.05</a>,
which means you can robustly find matches out to the genus level, but
not beyond. That's a limitation of nucleotide k-mers and it's
something we're working on.</p>
<p>Small-ish queries also don't work well - we can robustly find exact
matches to 10kb chunks of sequence, but not shorter.</p>
<p>Third, mastiff is mostly designed around searching for <em>small</em>
queries. Query times should scale approximately linearly with the
query size. Luiz has limited the server to a 5MB query for this
reason.</p>
<p>And last but by no means least, this is <em>not</em> the entire SRA, it's
only about 480,000 records (of about 700,000). We'll update it
eventually, but for now it's a sufficient proof of concept ;).</p>
<h2>Whither mastiff?</h2>
<p>We (mostly Luiz ;) are working to integrate mastiff functionality into
sourmash. There's a pretty wide gap between a proof-of-concept
implementation and mature, robust, end-user-usable software, of
course, but we know how to do it.</p>
<p>There's probably other super cool back-end approaches we could use,
and we'd love to talk to you about them if you're interested in trying
out alternative implementations. At this point we have a fairly good
understanding of the conceptual operations and can even convey them to
you in functioning code snippets :).</p>
<p>I also gotta tell you that we don't know how to support this kind of
work exactly. This developed out of Luiz's thesis work but is now done
on a volunteer basis by him. JGI is supporting the server development
for a year (thanks!!) but we are a bit bottlenecked on UX support and
backend/frontend development. So
<a href="mailto:ctbrown@ucdavis.edu,lcirberjr@ucdavis.edu">drop us a line</a> if
you've got some spare change - we'd be looking for 3-5 years of
support.</p>
<p>(I'd be interested in exploring governance and sustainability issues
around this kind of thing, too.)</p>
<h2>Acknowledgements</h2>
<p>The interpretation and understanding of MAGsearch results has been
tremendously helped by work from Dr. Tessa Pierce-Ward (ANI),
Dr. Adrian Viehweger (pathogen outbreaks), Dr. Jessica Lumian
(biogeography), Dr. Christy Grettenberger (biogeography and more), and
others. Thank you!!</p>Announcing ribbity - a hacky project to build Web sites from GitHub issue trackers2022-05-23T00:00:00+02:002022-05-23T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2022-05-23:/blog/2022-announcing-ribbity-github-issue-munging.html<p>Munging GitHub issue trackers for fun!</p><p>For the last few weeks, I've been hacking on a new passion microproject on the side, code-named <code>ribbity</code>.</p>
<p>ribbity is the software that builds the <a href="https://sourmash-bio.github.io/sourmash-examples/">sourmash-examples Web site</a>, by producing a <a href="https://www.mkdocs.org">mkdocs</a> site from the <a href="https://github.com/sourmash-bio/sourmash-examples/issues/">sourmash-examples issue tracker</a>.</p>
<p>In brief, ribbity takes issue descriptions from GitHub and puts them in Markdown files so you can run mkdocs :).</p>
<p>You can see the install and config documentation for ribbity <a href="https://ribbity-org.github.io/ribbity-docs/">here</a>.</p>
<h2>Why oh why would you do this?</h2>
<p>You might well ask... why not "just build a Markdown site", maybe with pull requests? A few reasons -</p>
<h3>The GitHub issue tracker is awesome</h3>
<p>First, I really like using GitHub issue trackers to organize resources and notes. For example, the <a href="https://github.com/dib-lab/sourmash/issues">sourmash issue tracker</a> is my "external brain" for all things related to sourmash and genome comparison. I also have several private repos that I use to organize link collections.</p>
<p>Most specifically, I really love the "backlinks" feature of github (where when you refer to issue A from issue B, issue A receives a pointer back to issue B) - this was in the original <a href="https://en.wikipedia.org/wiki/Project_Xanadu">Project Xanadu</a> plan for interlinked hypertext documents, but it never really made it into the Web. It's awfully handy.</p>
<p>Here, the ability to see backlinks from private repos into public repos is particularly lovely!</p>
<h3>Flexible organization and commenting</h3>
<p>I also really like the labeling (categorization) and commenting functionality of github.</p>
<p>Moreover, github has very nice Markdown support, along with a usable editor. And, while writing Markdown in a Web browser is not my most favorite of activities, it sure is nice to be able to do it in a pinch. But more importantly I can write Markdown in a <a href="https://hackmd.io/">hackmd page</a> and then copy/paste it into a github issue - this is an <a href="https://github.com/sourmash-bio/sourmash/issues/1968">increasingly common workflow</a> for me!</p>
<h3>Flexible authentication and notifications</h3>
<p>I really like (and use heavily) github's auth and notification systems. You can enable and disable access to repositories, watch specific issues and silence others, lock issues, block people from posting, etc. etc.</p>
<p>I need auth and notifications, but I'm not interested in doing any of
that myself. Building on top of all of that is a nice simplification.</p>
<h3>GitHub as a platform</h3>
<p>More generally, I really like how GitHub is becoming a platform for stuff; you can see an earlier project of mine here, <a href="http://ivory.idyll.org/blog/2019-github-project-reporting.html">Using GitHub for janky project reporting - some code</a>.</p>
<p>Other inspirational projects in this space include <a href="https://utteranc.es/">utteranc.es</a>, which builds a blog commenting platform on top of github; and <a href="https://angeliqueweger.com/blog/2021/love-letter-to-lftm/">Coraline's "low-friction project management"</a> site. And, while I don't specifically use <a href="https://datasette.io/">datasette</a> (yet) in any way, it has been a major conceptual contributor to the idea that hosting things statically is a great idea :).</p>
<p>(If you know of other github-based hacks like this, please drop them in the comments or <a href="https://twitter.com/ctitusbrown/">ping me on Twitter!</a>)</p>
<h3>mkdocs static site hosting is simple, esp via github pages</h3>
<p>mkdocs produces static sites, and static sites are awesome! (inspiration from <a href="https://datasette.io/">datasette</a> here, again.) No complicated databases, or authentication, or nasty JavaScript creeping across my pages. (Side note: I don't know JavaScript.)</p>
<p>Also, github pages sure is easy (and mkdocs natively supports deploying to github pages).</p>
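<p>Deploying is literally a one-liner: <code>mkdocs gh-deploy</code> builds the site and pushes it to the <code>gh-pages</code> branch.</p>
<div class="highlight"><pre><span></span><code>mkdocs gh-deploy
</code></pre></div>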
<p>And of course you can host mkdocs sites in many places. So it's pretty flexible and enabling to build on top of mkdocs.</p>
<h2>But does it, like, enable anything cool?</h2>
<p>One of the prime proximal motivations for building ribbity was the <a href="https://github.com/sourmash-bio/sourmash/issues/2054">increasing complexity of the sourmash documentation</a>, which is in danger of becoming sprawling and labyrinthine.</p>
<p>I really like the idea of a set of documentation that is explicitly intended to be explored and searched in a non-linear way.</p>
<p>That's how I use github issues in practice.</p>
<p>So it seemed natural to try out something new that strips away some of the complexity of the github interface and makes it customizable.</p>
<p>And I'm pretty happy with the resulting <a href="https://sourmash-bio.github.io/sourmash-examples/">sourmash examples</a> Web site!</p>
<p>In particular, it has really lowered the barrier to contribution for me, personally. I don't have to worry about pull requests or integrating new examples into a big, complicated doc site in a good way - I just throw a new example together, slap a few labels on it, and get on with my day.</p>
<p>In some regards, this is a version of <a href="https://felixge.de/2013/03/11/the-pull-request-hack/">the pull request hack</a>, a contribution model that has always intrigued me. Except instead of giving contributors PR access, they just need to be able to add issues - which, by default, anyone can do on any visible GitHub project!</p>
<h2>How is ribbity implemented??</h2>
<p>It's pretty simple underneath -</p>
<ol>
<li>"pull" GitHub issues into a Python pickle dump.</li>
<li>process the pickle dump into Python objects, salted with a few regexps.</li>
<li>run object model through jinja2 templating to build a <code>docs/</code> directory.</li>
<li>feed <code>docs/</code> directory into mkdocs, which builds a <code>site/</code> directory.</li>
</ol>
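<p>As a hedged sketch of step 1, pulling open issues through the GitHub REST API and pickling them might look like the following. The repo and output filename are illustrative - ribbity's actual code lives in the repo linked below:</p>
<div class="highlight"><pre><span></span><code>import pickle
import requests

# illustrative values, not ribbity's actual configuration
repo = "sourmash-bio/sourmash-examples"
url = f"https://api.github.com/repos/{repo}/issues"

# step 1: "pull" issues, paging through the REST API
issues = []
page = 1
while True:
    resp = requests.get(url, params={"state": "open",
                                     "per_page": 100, "page": page})
    resp.raise_for_status()
    batch = resp.json()
    if not batch:
        break
    issues.extend(batch)
    page += 1

with open("issues.pickle", "wb") as fp:
    pickle.dump(issues, fp)
</code></pre></div>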
<p>I've layered on some tests and some Python package stuff and some CLI, but the core code is pleasingly simple - under 400 lines of code, including spaces and comments.</p>
<h2>Whither ribbity?</h2>
<p>A few people have looked at ribbity and gone ...whoa. I want that! So that's nice and validating!</p>
<p>In particular, there's been some enthusiasm amongst colleagues about having a different interface to github issue trackers. One specific motivation is that the responsive search offered by the default mkdocs interface is nice! And I could see an argument for aggregating together multiple issue trackers in a single site, which is a use case some colleagues are interested in.</p>
<p>Basically I see a lot of enthusiasm around specific, customizable hackage of github things.</p>
<p>But... I dunno. There's quite some space between a minimal "this is useful! and limited enough that we can keep it working!" approach, and a janky, badly reimplemented version of everything the github Web site already offers. I'm leaning more towards the former, because I think that's achievable and offers specific utility. But I also have a lot of ideas for how to do ribbity-like things in other directions (Watch This Space!)</p>
<p>If I had to guess, I think my personal interest in ribbity will evolve in the following ways:</p>
<ul>
<li>I'll work to push more of ribbity's text munging functionality into jinja2, and make the github download a more complete (and more standardized!) version of the issue repo.</li>
<li>this will in turn push the core ribbity into being a simple merge of (a) jinja2 templates overlaid on (b) a github object model.</li>
<li>if I can get the primitives right, this would then make it easy to build custom overlays on github issue trackers entirely in jinja2.</li>
</ul>
<p>And that actually seems pretty maintainable to me.</p>
<p>Then the current ribbity functionality would just be a specific set of templates we use to build a particular kind of Web site. And new functionality or different issue tracker overlays could be built entirely in jinja2.</p>
<p>But, who knows? I'm definitely not committing to anything; just playing around for now.</p>
<p>That having been said, I'm thinking about applying ribbity to building a directory of training resources, and throwing it at the newsletter problem, and a colleague is using it for their own examples site. So we'll see!</p>
<h2>What other fun experiences did you want to relate?</h2>
<p>This was my first experience with <a href="https://docs.python.org/3/library/dataclasses.html">Python dataclasses</a>! Super cool! Code <a href="https://github.com/ribbity-org/ribbity/blob/main/ribbity/objects.py">here</a>.</p>
<p>(A colleague in the lab, Tessa Pierce, started using them <a href="https://sourmash.readthedocs.io/">over in sourmash</a>, and that finally motivated me to move on from namedtuples or straight up bare Python objects.)</p>
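<p>For flavor, here's a minimal sketch of the kind of thing dataclasses buy you - the field names are made up, not ribbity's actual object model:</p>
<div class="highlight"><pre><span></span><code>from dataclasses import dataclass, field

@dataclass
class Issue:
    # illustrative fields only; see ribbity/objects.py for the real ones
    number: int
    title: str
    body: str
    labels: list = field(default_factory=list)
</code></pre></div>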
<p>This was also my first parsing experience with <a href="https://toml.io/en/">TOML</a>, which is pretty nice! And I found the <a href="https://github.com/hukkin/tomli">tomli</a> parser to be easy to use, and thought the <a href="https://peps.python.org/pep-0680/">tomllib PEP</a> was really great.</p>
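<p>(The tomli API is pleasingly small; the config filename below is made up, and note that <code>tomli.load</code> wants a file opened in binary mode:)</p>
<div class="highlight"><pre><span></span><code>import tomli

with open("ribbity.toml", "rb") as fp:   # binary mode is required
    config = tomli.load(fp)
</code></pre></div>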
<h2>Concluding thoughts</h2>
<p><a href="https://github.com/ribbity-org/ribbity">ribbity</a> is open source - BSD 3-clause!</p>
<p><a href="https://github.com/ribbity-org/ribbity/issues">Please file issues</a> if you have ideas for how you might want to use ribbity!</p>
<p>Pull requests are welcome, but this is a side project, so unless they're fairly minimal or accompanied by good, clear, obvious tests, I might defer them as "too much brain needed". I encourage forking and experimentation!</p>
<p>Your thoughts welcome!</p>
<p>--titus</p>The second Common Fund Data Ecosystem hackathon - May 9-13, 2022!2022-05-01T00:00:00+02:002022-05-01T00:00:00+02:00Rayna Harris and Jessica Lumiantag:ivory.idyll.org,2022-05-01:/blog/2022-second-cfde-hackathon.html<p>We're running another hackathon!</p><p>We are pleased to announce that the <a href="http://nih-cfde.org/">NIH Common Fund Data Ecosystem</a> will be hosting a hackathon on <a href="https://commonfund.nih.gov/">NIH Common Fund</a> data sets from May 9 - 13! This follows on our first hackathon (<a href="http://ivory.idyll.org/blog/2022-feb-hackathon.html">see recap blog post</a>).</p>
<p>This hackathon has both synchronous and asynchronous work, with concentrated hackathon sessions on specific data sets and co-working sessions on Thursday. Participants can attend whichever hackathon sessions they are interested in. There is no minimum work requirement, all are welcome to participate as much or as little as schedules and interest allow!</p>
<p>See our schedule and find more information about this event here: <a href="https://nih-cfde.github.io/2022-may-hackathon/">https://nih-cfde.github.io/2022-may-hackathon/</a></p>
<p>Register for the hackathon here: <a href="https://www.nih-cfde.org/events/may-2022-hackathon/">https://www.nih-cfde.org/events/may-2022-hackathon/</a></p>
<p>Hackathon Benefits:</p>
<ul>
<li>Gain experience with Common Fund data sets and have access to data set curators!</li>
<li>See an immediate product from a short burst of concentrated effort!</li>
<li>Meet researchers with common interests and potentially spur collaborations or funding efforts!</li>
</ul>
<h2>Common Fund Session Details</h2>
<h3><a href="https://kidsfirstdrc.org/">Gabriella Miller Kids First Pediatric Research Program</a></h3>
<p>The goal of the Gabriella Miller Kids First Pediatric Research Program is to help researchers uncover new insights into the biology of childhood cancer and structural birth defects, including the discovery of shared genetic pathways between these disorders.</p>
<p>Kids First will host a session on accessing and using federated Common Fund Data Ecosystem graph data through the Kids First-Human BioMolecular Atlas Program graph database with an API.</p>
<h3><a href="https://sparc.science/">Stimulating Peripheral Activity to Relieve Conditions</a></h3>
<p>The Common Fund’s Stimulating Peripheral Activity to Relieve Conditions (SPARC) program accelerates development of therapeutic devices that modulate electrical activity in nerves to improve organ function.</p>
<p>SPARC will host a session on providing information on access to SPARC resources via the SPARC portal and associated APIs.</p>
<h3><a href="https://portal.hmpdacc.org/">Human Microbiome Project</a></h3>
<p>The Human Microbiome project has DNA sequencing data to characterize the microbiome in healthy adults and people with specific microbiome-associated diseases. It also contains integrated datasets with multiple biological projects from the microbiome and host over time for specific microbiome associated diseases.</p>
<p>A session on Human Microbiome Project data will involve obtaining this data from the Common Fund Data Ecosystem search portal and working with it using Amazon Web Services.</p>
<h3><a href="https://app.nih-cfde.org/">Common Fund Data Ecosystem Search Portal</a></h3>
<p>The Common Fund Data Ecosystem Coordinating Center supports efforts to make Common Fund data sets more findable, accessible, interoperable, and reusable for the scientific community through collaboration, end-user training, and data set sustainability. </p>
<p>The Common Fund Data Ecosystem Portal Demonstration will be a demonstration session on how to access data in the Portal.</p>
<h3>Introduction to R for RNA-Seq Analysis Workshop</h3>
<p>RNA-Sequencing (RNA-Seq) is a popular method for determining the presence and quantity of RNA in biological samples. In this 3 hour workshop, we will use R to explore publicly-available RNA-Seq data from the Gene Expression Tissue Project (GTEx). Attendees will be introduced to the R syntax, variables, functions, packages, and data structures common to RNA-Seq projects. We will use RStudio to import, tidy, transform, and visualize RNA-Seq count data. Attendees will learn tips and tricks for making the processes of data wrangling and data harmonization more manageable. This workshop will not cover cloud-based workflows for processing RNA-seq reads or statistics and modeling because these topics are covered in our RNA-Seq Concepts and RNA-Seq in the Cloud workshops. Rather, this workshop will focus on general R concepts applied to RNA-Seq data. Familiarity with R is not required but would be useful.</p>
<h2>Participant Skill Level:</h2>
<p>The hackathon is open to the public, and anyone can attend. Despite the name “hackathon”, participants don’t need to be experts in computer science! The most important criterion is interest in the data sets, and some familiarity with the command line and GitHub is helpful but not required.</p>
<p>See our schedule and find more information about this event here: <a href="https://nih-cfde.github.io/2022-may-hackathon/">https://nih-cfde.github.io/2022-may-hackathon/</a></p>
<p>Register for the hackathon here: <a href="https://www.nih-cfde.org/events/may-2022-hackathon/">https://www.nih-cfde.org/events/may-2022-hackathon/</a></p>
<p>Please don’t hesitate to contact <a href="mailto:training@cfde.atlassian.net">training@cfde.atlassian.net</a> with any questions!</p>Storing 64-bit unsigned integers in SQLite databases, for fun and profit2022-04-22T00:00:00+02:002022-04-22T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2022-04-22:/blog/2022-storing-ulong-in-sqlite-sourmash.html<p>Storing unsigned longs in SQLite is possible, and can be fast.</p><h2>The problem: storing <em>and querying</em> lots of 64-bit unsigned integers</h2>
<p>For the past ~6 years, we've been going down quite a rabbit hole with hashing-based sequence search, using a <a href="https://en.wikipedia.org/wiki/MinHash">MinHash</a>-derived approach called FracMinHash. (You can read more about FracMinHash <a href="https://www.biorxiv.org/content/10.1101/2022.01.11.475838">here</a>, but it's essentially a bottom-sketch version of ModHash.) This is all implemented in <a href="https://sourmash.readthedocs.io/">the sourmash software</a>, a Python and Rust-based command-line bioinformatics toolkit.</p>
<p>The basic idea is that we take long DNA sequences, extract sub-sequences of a fixed length (say k=31), hash them, and then sketch them by retaining only those that fall below a certain threshold value. Then we search for matches between sketches based on number of overlapping hashes. This is a proxy for the number of overlapping k=31 subsequences, which is in turn convertible into various sequence similarity metrics.</p>
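<p>Here's a minimal sketch of that idea in Python, under some loudly-stated assumptions: sourmash actually uses MurmurHash3 on canonically-oriented k-mers, while this toy version skips reverse-complement handling and borrows a 64-bit hash from the standard library:</p>
<div class="highlight"><pre><span></span><code>import hashlib

def hash_kmer(kmer):
    # map a k-mer to a 64-bit unsigned integer (a stand-in for
    # the MurmurHash3 function sourmash really uses)
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "little")

def fracminhash(sequence, ksize=31, scaled=1000):
    # keep only hashes below 2**64 / scaled, i.e. roughly a
    # fraction 1/scaled of all k-mers
    max_hash = 2 ** 64 // scaled
    sketch = set()
    for i in range(len(sequence) - ksize + 1):
        hashval = hash_kmer(sequence[i:i + ksize])
        if hashval &lt; max_hash:
            sketch.add(hashval)
    return sketch

# overlap between two sketches estimates overlap between the full
# k-mer sets; containment(A, B) = size(A intersect B) / size(A)
</code></pre></div>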
<p>The scale of the problems we're tackling is pretty big. As one example, we have a database (Genbank bacterial) with 1.15 million buckets of hashes, containing a total of 4.6 billion hashes across these buckets (representing approximately 4.6 trillion original k-mers). So we need to do moderately clever things to store them and search them quickly.</p>
<p>We already have a variety of formats for storing and querying sketch collections, including straight-up zip files that contain JSON-serialized sketches, a custom disk-based <a href="https://www.nature.com/articles/nbt.3442">Sequence Bloom Tree</a> implementation, and an inverted index that lives in memory. The inverted index turns out to be fast once loaded, but serialization is ...not that great, and memory consumption is very high. This is something I wanted to fix!</p>
<p>I've had <a href="http://ivory.idyll.org/blog/storing-and-retrieving-sequences.html">a long-time love of SQLite</a>, the tiny little embedded database engine that is just ridiculously fast, and I decided to figure out how to store <em>and query</em> our sketches in SQLite.</p>
<h2>Using SQLite to store 64-bit unsigned integers: a first attempt</h2>
<p>The challenge I faced here was that our sketches are composed of 64-bit unsigned integers, and SQLite <em>does not store</em> 64-bit unsigned ints natively. But this is exactly what I needed!</p>
<p>Enter type converters! I found two really nice resources on automatically converting 64-bit uints into data types that SQLite could handle: <a href="https://stackoverflow.com/questions/57464671/peewee-python-int-too-large-to-convert-to-sqlite-integer">this stackoverflow post, "Python int too large to convert to SQLite INTEGER"</a>, and this great <a href="https://wellsr.com/python/adapting-and-converting-sqlite-data-types-for-python/">tutorial from wellsr.com, Adapting and Converting SQLite Data Types for Python</a>.</p>
<p>In brief, I swiped code from the stackoverflow answer to do the following:</p>
<ul>
<li>write a function that, for any hash value larger than 2**63-1, converts numbers into a hex string;</li>
<li>write the opposite function that converts hex strings back to numbers;</li>
<li>register these functions as adapters on a SQLite data type to automatically run for every column of that type.</li>
</ul>
<p>This works because SQLite has a really flexible internal typing system where it can store basically anything as a string, no matter the official column type.</p>
<p>The python code looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">MAX_SQLITE_INT</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">63</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">sqlite3</span><span class="o">.</span><span class="n">register_adapter</span><span class="p">(</span>
<span class="nb">int</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">hex</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">if</span> <span class="n">x</span> <span class="o">></span> <span class="n">MAX_SQLITE_INT</span> <span class="k">else</span> <span class="n">x</span><span class="p">)</span>
<span class="n">sqlite3</span><span class="o">.</span><span class="n">register_converter</span><span class="p">(</span>
<span class="s1">'integer'</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">b</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="mi">16</span> <span class="k">if</span> <span class="n">b</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span> <span class="o">==</span> <span class="sa">b</span><span class="s1">'0x'</span> <span class="k">else</span> <span class="mi">10</span><span class="p">))</span>
</code></pre></div>
<p>and when you connect to the database, you can tell SQLite to pay attention to those adapters like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="n">dbfile</span><span class="p">,</span>
<span class="n">detect_types</span><span class="o">=</span><span class="n">sqlite3</span><span class="o">.</span><span class="n">PARSE_DECLTYPES</span><span class="p">)</span>
</code></pre></div>
<p>Then you define your tables in SQLite,</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">hashes</span>
<span class="w"> </span><span class="p">(</span><span class="n">hashval</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span>
<span class="w"> </span><span class="n">sketch_id</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span>
<span class="w"> </span><span class="k">FOREIGN</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">sketch_id</span><span class="p">)</span><span class="w"> </span><span class="k">REFERENCES</span><span class="w"> </span><span class="n">sketches</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">))</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">sketches</span>
<span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span>
<span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span>
<span class="w"> </span><span class="p">...)</span>
</code></pre></div>
<p>and you can do all the querying you want, and large integers will be converted into hex strings, and life is good. Right?</p>
<p>This code actually worked fine! Except for one problem.</p>
<p><strong>It was very slow.</strong> One key to making relational databases in general (and SQLite in specific) fast is to use indices, and these INTEGER columns could no longer be indexed as INTEGER columns because they contained hex strings! Which means that once databases got big, well, basically searching and retrieval was too slow to be useful.</p>
<p>This code was perfectly functional and lives on in <a href="https://github.com/sourmash-bio/sourmash/blob/3259fbddf6c33b6093bea2717a4e24642145a32d/src/sourmash/sqlite_index.py">some commits</a>, but it wasn't fast enough to be used for production code.</p>
<p>Unfortunately (or fortunately?), I was now <em>in it</em>. I'd sunk enough time into this problem already, and had enough functioning code and tests, that I decided to keep on going. See: <a href="https://en.wikipedia.org/wiki/Sunk_cost">sunk cost fallacy</a>.</p>
<h2>Storing 64-bit unsigned integers <em>efficiently</em> in SQLite</h2>
<p>I wasn't actually convinced that SQLite could do it efficiently, so <a href="https://twitter.com/ctitusbrown/status/1490695385781661697">I asked on Twitter</a> about alternative approaches. Among a variety of responses, @jgoldschrafe <a href="https://twitter.com/jgoldschrafe/status/1490700497988329485">said something very important that resonated</a>:</p>
<blockquote>
<p>SQLite isn't a performance monster for complex use cases, but should be absolutely fine for this.</p>
</blockquote>
<p>and that gave me the courage to stay the course and work on a SQLite-based resolution.</p>
<p>The next key was an idea that I had toyed with, based on hints <a href="https://sqlite-users.sqlite.narkive.com/hlY9vnL7/sqlite-storing-unsigned-64-bit-values">here</a> and then <a href="https://twitter.com/jgoldschrafe/status/1490701880099590146">confirmed</a> by the still-awesome @jgoldschrafe - I didn't need <em>more</em> than 64 bits, and I just needed to do searching based on equality. So I could convert unsigned 64-bit ints into signed 64-bit numbers, shove them into the database, and do equality testing between a query and the hashvals. As long as I was doing the conversion systematically, it would all work! </p>
<p>I ended up writing two adapter functions that I call in Python code for the relevant values (<em>not</em> using the SQLite type converter registry) -</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">bitstring</span> <span class="kn">import</span> <span class="n">BitArray</span>

<span class="n">MAX_SQLITE_INT</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">63</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">convert_hash_to</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">BitArray</span><span class="p">(</span><span class="n">uint</span><span class="o">=</span><span class="n">x</span><span class="p">,</span> <span class="n">length</span><span class="o">=</span><span class="mi">64</span><span class="p">)</span><span class="o">.</span><span class="n">int</span> <span class="k">if</span> <span class="n">x</span> <span class="o">></span> <span class="n">MAX_SQLITE_INT</span> <span class="k">else</span> <span class="n">x</span>
<span class="n">convert_hash_from</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">BitArray</span><span class="p">(</span><span class="nb">int</span><span class="o">=</span><span class="n">x</span><span class="p">,</span> <span class="n">length</span><span class="o">=</span><span class="mi">64</span><span class="p">)</span><span class="o">.</span><span class="n">uint</span> <span class="k">if</span> <span class="n">x</span> <span class="o"><</span> <span class="mi">0</span> <span class="k">else</span> <span class="n">x</span>
</code></pre></div>
<p>Note here I am using the lovely <a href="https://pypi.org/project/bitstring/">bitstring package</a> so that I don't have to think hard about bit twiddling (although that's a possible optimization now that I have everything locked down with tests).</p>
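<p>(For the curious, here's a minimal sketch of what that bit-twiddling optimization might look like - this is an illustration, not the code in the PR. It relies on plain two's-complement arithmetic: values with the top bit set get shifted down by 2**64 on the way in, and negative values get shifted back up on the way out.)</p>
<div class="highlight"><pre><span></span><code># illustrative sketch only - NOT the code in the PR
MAX_SQLITE_INT = 2 ** 63 - 1

def convert_hash_to(x):
    # reinterpret uint64 as int64: same 64-bit pattern, shifted down by 2**64
    return x - 2 ** 64 if x > MAX_SQLITE_INT else x

def convert_hash_from(x):
    # reinterpret int64 as uint64: shift negative values back up by 2**64
    return x + 2 ** 64 if x &lt; 0 else x
</code></pre></div>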
<p>The SQL schema I am using looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">sourmash_sketches</span>
<span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span>
<span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span>
<span class="w">    </span><span class="p">...);</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">sourmash_hashes</span><span class="w"> </span><span class="p">(</span>
<span class="w"> </span><span class="n">hashval</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span>
<span class="w"> </span><span class="n">sketch_id</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span>
<span class="w"> </span><span class="k">FOREIGN</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">sketch_id</span><span class="p">)</span><span class="w"> </span><span class="k">REFERENCES</span><span class="w"> </span><span class="n">sourmash_sketches</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div>
<p>and I also build three indices that correspond to the various kinds of queries I want to do -</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">sourmash_hashval_idx</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">sourmash_hashes</span><span class="w"> </span><span class="p">(</span>
<span class="w"> </span><span class="n">hashval</span><span class="p">,</span>
<span class="w"> </span><span class="n">sketch_id</span>
<span class="p">);</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">sourmash_hashval_idx2</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">sourmash_hashes</span><span class="w"> </span><span class="p">(</span>
<span class="w"> </span><span class="n">hashval</span>
<span class="p">);</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">sourmash_sketch_idx</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">sourmash_hashes</span><span class="w"> </span><span class="p">(</span>
<span class="w"> </span><span class="n">sketch_id</span>
<span class="p">)</span>
</code></pre></div>
<p>One of the design decisions I made midway through this PR was to allow duplicate hashvals in <code>sourmash_hashes</code> - since different sketches can share hashvals with other sketches, we have to either do things this way, or have another intermediate table that links unique hashvals to potentially multiple sketch_ids. It just seemed simpler to have hashvals be non-unique, and instead build an index for the possible queries. (I might revisit this later, now that I can refactor fearlessly ;).</p>
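<p>For concreteness, here's a rough sketch of what that alternative, normalized design could look like - this is <em>not</em> the schema I used, and the <code>hash_values</code> and <code>hash_to_sketch</code> names are invented for illustration:</p>
<div class="highlight"><pre><span></span><code>import sqlite3

# hypothetical alternative schema - NOT what sourmash uses
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE IF NOT EXISTS sourmash_sketches
   (id INTEGER PRIMARY KEY,
    name TEXT);

-- each distinct hashval is stored exactly once...
CREATE TABLE IF NOT EXISTS hash_values
   (hash_id INTEGER PRIMARY KEY,
    hashval INTEGER NOT NULL UNIQUE);

-- ...and a many-to-many link table connects hashvals to sketches.
CREATE TABLE IF NOT EXISTS hash_to_sketch
   (hash_id INTEGER NOT NULL,
    sketch_id INTEGER NOT NULL,
    FOREIGN KEY (hash_id) REFERENCES hash_values (hash_id),
    FOREIGN KEY (sketch_id) REFERENCES sourmash_sketches (id));
""")
</code></pre></div>
<p>The upside would be a deduplicated hashval table; the downside is an extra join and more complicated inserts, which is the tradeoff I decided against.</p>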
<p>At this point, insertion is now easy:</p>
<div class="highlight"><pre><span></span><code><span class="n">sketch_id</span> <span class="o">=</span> <span class="o">...</span>

<span class="c1"># insert all the hashes</span>
<span class="n">hashes_to_sketch</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">h</span> <span class="ow">in</span> <span class="n">ss</span><span class="o">.</span><span class="n">minhash</span><span class="o">.</span><span class="n">hashes</span><span class="p">:</span>
    <span class="n">hh</span> <span class="o">=</span> <span class="n">convert_hash_to</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
    <span class="n">hashes_to_sketch</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">hh</span><span class="p">,</span> <span class="n">sketch_id</span><span class="p">))</span>

<span class="n">c</span><span class="o">.</span><span class="n">executemany</span><span class="p">(</span><span class="s2">"INSERT INTO sourmash_hashes (hashval, sketch_id) VALUES (?, ?)"</span><span class="p">,</span>
              <span class="n">hashes_to_sketch</span><span class="p">)</span>
</code></pre></div>
<p>and retrieval is similarly simple:</p>
<div class="highlight"><pre><span></span><code><span class="n">sketch_id</span> <span class="o">=</span> <span class="o">...</span>

<span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"SELECT hashval FROM sourmash_hashes WHERE sourmash_hashes.sketch_id=?"</span><span class="p">,</span> <span class="p">(</span><span class="n">sketch_id</span><span class="p">,))</span>
<span class="k">for</span> <span class="n">hashval</span><span class="p">,</span> <span class="ow">in</span> <span class="n">c</span><span class="p">:</span>
    <span class="n">hh</span> <span class="o">=</span> <span class="n">convert_hash_from</span><span class="p">(</span><span class="n">hashval</span><span class="p">)</span>
    <span class="n">minhash</span><span class="o">.</span><span class="n">add_hash</span><span class="p">(</span><span class="n">hh</span><span class="p">)</span>
</code></pre></div>
<p>So this was quite effective for storing the sketches in SQLite! I could perfectly reconstruct sketches after a round-trip through SQLite, which was a great first step.</p>
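<p>Here's the kind of round-trip check I mean, written out as a standalone snippet - the specific values are made up, and the converters are the same ones defined above:</p>
<div class="highlight"><pre><span></span><code>import sqlite3
from bitstring import BitArray

MAX_SQLITE_INT = 2 ** 63 - 1
convert_hash_to = lambda x: BitArray(uint=x, length=64).int if x > MAX_SQLITE_INT else x
convert_hash_from = lambda x: BitArray(int=x, length=64).uint if x &lt; 0 else x

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (hashval INTEGER NOT NULL)")

original = [0, 42, 2 ** 63, 2 ** 64 - 1]   # includes values past MAX_SQLITE_INT
db.executemany("INSERT INTO t (hashval) VALUES (?)",
               [(convert_hash_to(h),) for h in original])

recovered = sorted(convert_hash_from(h) for h, in db.execute("SELECT hashval FROM t"))
assert recovered == sorted(original)
</code></pre></div>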
<p>Next question: could I quickly <em>search</em> the hashes as an inverted index? That is, could I find sketches based on querying with hashes, rather than (as above) using <code>sketch_id</code> to retrieve hashes for an already identified sketch?</p>
<h2>Matching on 64-bit unsigned ints in SQLite</h2>
<p>This ended up being pretty simple!</p>
<p>To query with a collection of hashes, I set up a temporary table containing the query hashes, and then do a join on exact value matching. Conveniently, this doesn't care whether the values in the database are signed or not - it just cares if the bit patterns are equal!</p>
<p>The code, for a cursor c:</p>
<div class="highlight"><pre><span></span><code>def _get_matching_sketches(self, c, hashes, max_hash):
    """
    For hashvals in 'hashes', retrieve all matching sketches,
    together with the number of overlapping hashes for each sketch.
    """
    c.execute("DROP TABLE IF EXISTS sourmash_hash_query")
    c.execute("CREATE TEMPORARY TABLE sourmash_hash_query (hashval INTEGER PRIMARY KEY)")

    hashvals = [ (convert_hash_to(h),) for h in hashes ]
    c.executemany("INSERT OR IGNORE INTO sourmash_hash_query (hashval) VALUES (?)",
                  hashvals)

    c.execute("""
       SELECT DISTINCT sourmash_hashes.sketch_id, COUNT(sourmash_hashes.hashval) AS CNT
       FROM sourmash_hashes, sourmash_hash_query
       WHERE sourmash_hashes.hashval=sourmash_hash_query.hashval
       GROUP BY sourmash_hashes.sketch_id ORDER BY CNT DESC
       """)
    return c
</code></pre></div>
<p>As a side benefit, this query orders the results by the size of overlap between sketches, which leads to some pretty nice and efficient thresholding code.</p>
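<p>A sketch of what that thresholding can look like - the function and its name are mine, but the early exit works precisely because of the <code>ORDER BY CNT DESC</code> above:</p>
<div class="highlight"><pre><span></span><code>def matches_above_threshold(cursor, threshold):
    """Yield (sketch_id, cnt) pairs until overlap drops below 'threshold'.

    Relies on the cursor returning rows ordered by CNT, descending.
    """
    for sketch_id, cnt in cursor:
        if cnt &lt; threshold:
            break             # everything after this overlaps even less
        yield sketch_id, cnt
</code></pre></div>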
<h2>Benchmarking!!</h2>
<p>I'll just say that performance is definitely acceptable - the below benchmarks compare sqldb against our other database formats. The database we're searching is a collection of 48,000 sketches with 161 million total hashes - GTDB RS202, if you're curious :).</p>
<p>For 53.9k query hashes, with 19.0k found in the database, the SQLite implementation is nice and fast, albeit with a large disk footprint:</p>
<table>
<thead>
<tr>
<th>db format</th>
<th>db size</th>
<th>time</th>
<th>memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>sqldb</td>
<td>15 GB</td>
<td>28.2s</td>
<td>2.6 GB</td>
</tr>
<tr>
<td>sbt</td>
<td>3.5 GB</td>
<td>2m 43s</td>
<td>2.9 GB</td>
</tr>
<tr>
<td>zip</td>
<td>1.7 GB</td>
<td>5m 16s</td>
<td>1.9 GB</td>
</tr>
</tbody>
</table>
<p>For larger queries, with 374.6k query hashes, where we find 189.1k in the database, performance evens out a bit:</p>
<table>
<thead>
<tr>
<th>db format</th>
<th>db size</th>
<th>time</th>
<th>memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>sqldb</td>
<td>15 GB</td>
<td>3m 58s</td>
<td>9.9 GB</td>
</tr>
<tr>
<td>sbt</td>
<td>3.5 GB</td>
<td>7m 33s</td>
<td>2.6 GB</td>
</tr>
<tr>
<td>zip</td>
<td>1.7 GB</td>
<td>5m 53s</td>
<td>2.0 GB</td>
</tr>
</tbody>
</table>
<p>Note that zip file searches don't use any indexing at all, so the search is linear and it's expected that the time will be more or less the same regardless of the query. And SBTs are not really meant for this use case, but they are the other "fast search" database we have, so I benchmarked them anyway.</p>
<p>(There are lots of nuances to what we're doing here and I think I mostly understand these performance numbers; see <a href="https://github.com/sourmash-bio/sourmash/issues/1958">the benchmarking issue</a> for my thoughts.)</p>
<p>The really nice thing is that for our motivating use case, looking hashes up in a reverse index to correlate with other labels, the performance with SQLite is <em>much</em> better than our current JSON-on-disk/in-memory search format.</p>
<p>For 53.9k query hashes, we get:</p>
<table>
<thead>
<tr>
<th>lca db format</th>
<th>db size</th>
<th>time</th>
<th>memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>SQL</td>
<td>1.6 GB</td>
<td>20s</td>
<td>380 MB</td>
</tr>
<tr>
<td>JSON</td>
<td>175 MB</td>
<td>1m 21s</td>
<td>6.2 GB</td>
</tr>
</tbody>
</table>
<p>which is frankly excellent - for an 8x increase in disk size, we get a 4x faster query and 16x lower memory usage! (The in-memory performance includes loading from disk, which is the main reason it's so terrible.)</p>
<h2>Further performance improvements?</h2>
<p>I'm still pretty exhausted from this coding odyssey (> 250 commits, ending with nearly 3000 lines of code added or changed), so I'm leaving some work for the future. Most specifically, we'd like to benchmark having multiple <em>readers</em> read from the database at once, e.g. for Web server backends. I expect it to work pretty well for that, but we'll need to check.</p>
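<p>(For the record, here's roughly the experiment I have in mind - a hypothetical sketch where each thread opens its own read-only connection using SQLite's URI filename syntax. The database filename and the query are placeholders.)</p>
<div class="highlight"><pre><span></span><code>import sqlite3
import threading

DB_PATH = "sourmash.sqldb"          # placeholder filename

def count_hashes(results, i):
    # one read-only connection per thread
    conn = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)
    results[i], = conn.execute("SELECT COUNT(*) FROM sourmash_hashes").fetchone()
    conn.close()

results = [None] * 4
threads = [threading.Thread(target=count_hashes, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
</code></pre></div>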
<p>I do use the following PRAGMAs for configuration, and I'm wondering if I should spend time trying out different parameters; this is mostly a database built around writing once, and reading many times. Advice welcome :).</p>
<div class="highlight"><pre><span></span><code><span class="n">PRAGMA</span><span class="w"> </span><span class="n">cache_size</span><span class="o">=</span><span class="mi">10000000</span>
<span class="n">PRAGMA</span><span class="w"> </span><span class="n">synchronous</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">OFF</span>
<span class="n">PRAGMA</span><span class="w"> </span><span class="n">journal_mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MEMORY</span>
<span class="n">PRAGMA</span><span class="w"> </span><span class="n">temp_store</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MEMORY</span>
</code></pre></div>
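<p>(In case it's useful, applying these from Python is just a matter of executing them on each new connection - a minimal sketch, with a placeholder filename:)</p>
<div class="highlight"><pre><span></span><code>import sqlite3

conn = sqlite3.connect("sourmash.sqldb")   # placeholder filename
c = conn.cursor()
c.execute("PRAGMA cache_size=10000000")
c.execute("PRAGMA synchronous = OFF")
c.execute("PRAGMA journal_mode = MEMORY")
c.execute("PRAGMA temp_store = MEMORY")
</code></pre></div>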
<h2>Concluding thoughts</h2>
<p>The second solution above is the code that is in <a href="https://github.com/sourmash-bio/sourmash/pull/1808">my current pull request</a>, and I expect it will eventually be merged into sourmash and released as part of sourmash v4.4.0. It's fully integrated into sourmash (with a much broader range of use cases than I explained above ;), and I'm pretty happy with it. There's actually a whole 'nother story about manifests that motivated some part of the above; you can read about that <a href="https://github.com/sourmash-bio/sourmash/issues/1930">here</a>.</p>
<p>I'm not planning on revisiting reverse indices in sourmash anytime soon, but we are starting to think more seriously about better (read: non-JSON) ways of serializing sketches. Avro looks interesting, and there are some fast columnar formats like Arrow and Parquet; see <a href="https://github.com/sourmash-bio/sourmash/issues/1262">this issue</a> for our notes.</p>
<p>Anyway, so that's my SQLite odyssey. Thoughts welcome!</p>
<p>--titus</p>The First Common Fund Data Ecosystem Hackathon2022-03-05T00:00:00+01:002022-03-05T00:00:00+01:00Rayna Harris and Jessica Lumiantag:ivory.idyll.org,2022-03-05:/blog/2022-feb-hackathon.html<p>We ran a successful pilot hackathon, and we will run a second one soon!</p><p>The week of February 21-25, 2022, we hosted
<a href="https://nih-cfde.github.io/2022-feb-hackathon">the first Common Fund Data Ecosystem (CFDE) Hackathon</a>. The
goals of the hackathon were to increase familiarity with data sets
from <a href="https://commonfund.nih.gov/programs">Common Fund programs</a> and
work towards cross-cutting, integrative analyses.</p>
<p><img alt="" src="images/2022-hackathon-img1.png"></p>
<p>We invited members of the CFDE to propose hackathon sessions to
introduce their Common Fund data sets and provide technical support
while attendees explored the data. Sessions featured data from the
<a href="https://app.nih-cfde.org/">CFDE Portal</a>,
<a href="https://hmpdacc.org/">Human Microbiome Project (HMP)</a>,
<a href="https://commonfund.nih.gov/exrna">Extracellular RNA Communication (exRNA)</a>,
<a href="https://commonfund.nih.gov/metabolomics">Metabolomics Workbench (MW)</a>,
and
<a href="https://commonfund.nih.gov/LINCS">Signature Commons Library of Integrated Network-Based Cellular Signatures (SigCom LINCS)</a>.</p>
<h2>The Hackathon Sessions</h2>
<p><strong>All sessions were recorded and can be viewed on the <a href="https://nih-cfde.github.io/2022-feb-hackathon/about/">Session Details and Recordings</a> page of the hackathon website!</strong></p>
<p>This virtual event began Monday morning with a welcome address by Dr. Titus Brown (UC Davis) followed by presentations from each Common Fund Program to give a brief overview of their data and session goals.</p>
<p>On Monday afternoon, Dr. Amanda Charbonneau (UC Davis) taught
attendees how to use the <a href="https://app.nih-cfde.org/"><strong>CFDE Portal</strong></a>
to find datasets from participating Common Fund
programs. Dr. Charbonneau used <strong>HMP</strong> data as a motivating example,
then helped attendees discover data sets from other programs. These
datasets are quite large, so on Tuesday afternoon, Dr. Charbonneau
taught a second session on how to download and process data from the
CFDE Portal using Amazon Web Services (AWS). Attendees were provided
with AWS accounts that they could use to analyze data discovered
through the portal.</p>
<p>On Tuesday morning, Emily LaPlante and Keyang Yu (Baylor College of
Medicine) provided an overview of the
<strong><a href="https://exrna-atlas.org/">exRNA Atlas</a></strong>, which contains over 7,500
small RNA sequences and qPCR profiles from human and mouse, and
introduced attendees to a variety of software tools for exploring RNA
binding proteins. This session explored two use cases:</p>
<p>1) Finding RNA binding proteins and their associated RNA cargo in a variety of human biofluids and exploring their utility as biomarkers</p>
<p>2) Exploring other sites across the genome by intersecting exRNA Atlas
data with regions of interest using BedGraph files, as well as
applying this approach to other datasets.</p>
<p>On Wednesday morning, Eoin Fahy and Mano Maurya (UCSD) introduced the
<strong><a href="https://www.metabolomicsworkbench.org/">Metabolomics Workbench</a></strong>
database which contains over 164,000 molecular structures covering
100+ species! Attendees learned how to interact with the Metabolomics
Workbench Portal and then viewed a demonstration of
<strong><a href="https://www.biorxiv.org/content/10.1101/2020.11.20.391912v1">MetENP</a></strong>,
an R package that enables detection of significant metabolites from
metabolite information.</p>
<p>The final data-driven hackathon session took place on Thursday afternoon. John
Erol Evangelista (Mt Sinai) introduced the <strong><a href="https://maayanlab.cloud/sigcom-lincs/#/SignatureSearch/UpDown">SigCom LINCS</a></strong>
API, which contains over 1.5 million gene expression signatures from LINCS,
the Genotype-Tissue Expression (GTEx) project, and the Gene Expression Omnibus
(GEO) database. Then, Daniel Clarke (Mt Sinai) gave an introduction to
building <strong><a href="https://appyters.maayanlab.cloud/#/">Appyters</a></strong> and how to
use the SigCom LINCS APIs within Appyters. </p>
<p>On Friday we ran a Wrap Up and Future Directions session for
presenters to recap what happened at their sessions and talk about
future goals for their tools. This allowed everyone to learn about
sessions they might not have attended, and possibly sparked interest in
watching the video recordings of those sessions.</p>
<h2>Reflection</h2>
<p>Overall the sessions were well attended and well received! In a
pre-hackathon survey, we asked participants which hackathon sessions
they were interested in attending. More people attended each session
than we anticipated, which indicated that the introduction session on
Monday was critical for spurring interest.</p>
<p><img alt="" src="images/2022-hackathon-img2.png"></p>
<p>According to our survey, participants walked away with a greater understanding of Common Fund databases and tools, so we achieved our main goal of increasing familiarity with the diverse datasets supported by the Common Fund Data Ecosystem. Additionally, our team of trainers identified new Common Fund datasets that we plan on integrating into our <a href="https://training.nih-cfde.org/">training program</a> in the future. </p>
<p>A common observation was that some sessions felt more like webinars or
workshops than what the name "hackathon" implies. For our next
hackathon, we will work with presenters to define sessions as webinars
(a demonstration of a data tool), workshops (a training event with
live coding), or hackathons (a defined problem that participants work
on). We also received requests for more advanced notice and
information about the content of sessions, which we will incorporate
into our next round of event planning.</p>
<p>This event, along with many other online events, lacked the sense of
community that can be present with in person multi-day events. We
tried using GitHub Issues or Discussions to foster conversations
between participants, but these tools were rarely used. We are
thinking about how to address this for our next event, and we're open
to feedback!</p>
<p>Finally, the hackathon coordination team would like to reiterate our
thanks to all Common Fund groups that ran sessions for this event! We
could not have achieved a diversity of datasets and tools at this
event without your time and efforts.</p>
<h2>Next steps</h2>
<p>We are excited to announce that <strong>the second CFDE Hackathon will take
place April 25-29th!</strong> We're going to fine-tune the event
with the feedback from our February event, and we hope you will join
us!</p>
<p>If you are interested in learning more about attending the April 2022
Hackathon as a participant, please
<a href="https://www.nih-cfde.org/events/april-2022-hackathon/">register here</a>!
We hope to see you there :)</p>
<hr>
<p><em>The Common Fund Data Ecosystem Training Program is funded by the
National Institutes of Health (1OT3OD025459-01).</em></p>On minimum metagenome covers, and calculating them for your own data.2022-01-18T00:00:00+01:002022-01-18T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2022-01-18:/blog/2022-calculating-minimum-metagenome-covers-with-genome-grist.html<p>You, too, can run our software!</p><p>We just posted a preprint, <a href="https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2">Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers</a>, Irber et al., 2022! Some day soon I'd like to write a long blog post about how this is six years in the making, part of a major intellectual endeavor in the lab that I'm incredibly excited about, yada yada yada, but for now let me just say that I think it's got some interesting ideas in it and if you're at all interested in analyzing shotgun metagenome data you should open it in a tab somewhere; a <a href="https://dib-lab.github.io/2020-paper-sourmash-gather/">very readable HTML version is available for just that purpose</a>.</p>
<h2>There is a super cool figure. You should check it out!</h2>
<p>But what I'm really here to say is this: you might see a super cool figure in the paper that looks like this:</p>
<p><img alt="figure comparing mapping to k-mer hash matching" src="images/2022-calculating-gather-vs-mapping.png"></p>
<p>That figure is super cool in part because it tells you what microbial genomes from Genbank are present in your shotgun metagenome.</p>
<p>And it's even <em>super cooler</em> because our software figures out which genomes are present <strong>automatically</strong>, and can use all of Genbank microbial to do so!<sup>1</sup></p>
<p>We're not talking taxonomic information here, BTW, where you then have to go pick a representative genome after doing an analysis that only gives you vague species-level designations. Nope, we're talking cold, hard DNA-sequence-on-the-table, genome-files-in-a-directory, automatically retrieved and analyzed for you. With mapping and everything.<sup>2</sup></p>
<p>(Taxonomy <em>is</em> available, if you're interested in such. You can use GTDB or NCBI taxonomy as you wish. But you can just have the genomes, too!)</p>
<h2>Where can I get this magickal software?</h2>
<p>What, you say? How is this magic possible!?</p>
<p>We wrote some software! And workflows! It's called genome-grist and it's <a href="https://github.com/dib-lab/genome-grist">available</a> NOW NOW NOW for the LOW LOW COST of FREE!</p>
<p>And (I can't stress enough how excited I am about this) <a href="https://dib-lab.github.io/genome-grist/">it's got documentation, too!</a></p>
<p>And, for an <em>unlimited time only</em>, you can even integrate <a href="https://dib-lab.github.io/genome-grist/configuring/#preparing-information-on-local-genomes"><strong>your own private, unpublished, cherished and hoarded genome sequences!</strong></a><sup>3</sup></p>
<p>&lt;ahem&gt;</p>
<p>Anyhoo. Feedback is welcome.</p>
<p>And yes, this is actually the software that was used to calculate the figures in the paper about the sourmash software referenced at the top. Yes, we are writing software so that we can generate figures for papers about other software. No, this will never end.</p>
<p>--titus</p>
<p>1: terms and conditions may apply: right now we can only give you Genbank as of July 2020. Sorry. We're working on it.</p>
<p>2: terms and conditions may apply: right now this really only works well with paired-end Illumina metagenomes.</p>
<p>3: And, like, your own private taxonomic classifications for your genomes, if you're into that kind of thing.</p>A bioinformatics training career panel in the DIB Lab2021-11-08T00:00:00+01:002021-11-08T00:00:00+01:00Saranya Canchitag:ivory.idyll.org,2021-11-08:/blog/2021-training-career-panel.html<p>Careers in training!</p><p><strong>Note:</strong> The below blog post was written by <a href="https://s-canchi.github.io/">Dr. Saranya Canchi</a>.</p>
<hr>
<p>(Thanks to Marisa Lim, Abhijna Parigi, and Titus Brown for reading drafts!)</p>
<p>On August 6th, we held a career panel for the
<a href="http://ivory.idyll.org/lab/">Lab for Data Intensive Biology (DIB Lab)</a>
The panel consisted of Drs. Tracy Teal, Karen Word and Kate Hertweck, all of whom are friends or alumni of the DIB lab and have built successful careers in biology and bioinformatics training. The discussion was attended by graduate students, post docs and alumni of the lab.</p>
<p>Given the non-traditional nature of the careers, we started the session learning about the career journeys of each panelist leading to their current roles. Interestingly, all our panelists had some shared experiences over time.</p>
<p><strong>Kate</strong> learned computational methods and other model systems during postdoctoral training, which led to an assistant professor position at University of East Texas. Kate especially enjoyed teaching while in this position and that inspired her to transition to Fred Hutchinson Cancer Research Center as a bioinformatics training manager, where she developed and taught many Carpentries style "intro to" type lessons. She was also heavily involved in the Carpentries, having served on the executive council. In her current position as an open science specialist at Chan Zuckerberg Initiative, she combines her experience in teaching with her passion for open science methods. </p>
<p><strong>Karen</strong> got her PhD in physiology, but spent a considerable portion of her graduate time as an educator working at science museums and teaching high school curriculum. She contributed to designing curriculums as well as teaching as part of her pre-doctoral experience, which carried over to her postdoctoral work in the DIB lab. Here she started working with the Carpentries training model, which helped pave the path to her current position as the Director of Education for The Carpentries. </p>
<p><strong>Tracy</strong>, like the other two panelists, also had a non-traditional career path, starting with a PhD in computational neuroscience followed by postdoctoral research in microbial ecology and genomics. She decided against a non-tenure-track assistant professorship, as that was non-optimal for planning, in addition to the burden of raising one's own funding. She had worked on Data Carpentry as part of an NSF grant during her postdoctoral period, which led to her obtaining a large-scale grant from the Gordon and Betty Moore Foundation for Data Carpentry. The merger of Data Carpentry with Software Carpentry into the Carpentries allowed her to come on board as the executive director for the overall program. She further expanded her skills as executive director at Dryad, working with data curation and open science tools. In her current role as the Director for Open Source at RStudio, she combines her rich experience in computational genomics, teaching, and open science knowledge to help drive the mission of RStudio.</p>
<p><strong><em>What are the common job titles that are at the intersection of science, training, and community management?</em></strong></p>
<p>All panelists agreed that job titles can be vague with fuzzy descriptions. While that poses challenges in understanding the required skill set and introduces a level of uncertainty, it also allows for flexibility in defining the boundaries and responsibilities of the role. Looking for keywords like training/support/community in the description can be helpful. Some possible titles include community manager, open source manager, science technician, training specialist etc. It can also be helpful to talk to your friends and peers when looking for jobs as they may be able to provide insights into your strengths, as well as point towards positions (including unlisted jobs) that could be a good fit irrespective of the job title. </p>
<p><strong><em>How do you find these open-ended positions? Are most of them full-time positions?</em></strong></p>
<p>The jobs channel on the <a href="https://www.cscce.org">Center for Scientific Collaboration and Community Engagement</a> (CSCCE) Slack space is a good resource for these types of job postings. Informational interviews are a great networking strategy. It is best to seek out people in interesting positions or at interesting companies and ask about their experience and career path. Such sessions have the potential to become a networking opportunity, with the possibility of future job postings being sent your way. Another great networking source is your undergrad or graduate alumni network. UC Berkeley has a <a href="https://career.berkeley.edu/Info/InfoQuestions">neat template for informational interviews</a> which can be helpful in preparing potential questions.</p>
<p>Positions that rely on grant support typically have a finite timeline. Some open-ended positions are yearly and contract-based. If you are unsure, it is best to ask the hiring manager/HR/recruiter in the initial stages of the hiring process.</p>
<p><strong><em>What are some important aspects to consider when evaluating a position with a company?</em></strong></p>
<p>It is important to have lots of latitude to develop the responsibilities of the position as well as for personal growth. The position should allow you to challenge yourself and your team (if applicable) and try new initiatives. There should also be enough scope to read, reflect, and engage in professional development to continually improve yourself. While it may be difficult to gauge prior to starting a position, it is also important to think about the colleagues you may work with every day. Working at a company is not the whole experience but rather depends on which team and people you interact with the most. Staff restructuring can significantly change the overall company experience. It is also important to consider the values, vision, and cultural fit of the company.</p>
<p>Tracy suggested asking these questions during the interview process:</p>
<ul>
<li>What kind of values are there at X?</li>
<li>How do you run your meetings?</li>
<li>What are the expectations around communications?</li>
<li>How do you manage employee time away?</li>
</ul>
<p>She pointed out that consistent answers across employees within a given team/company illustrate shared understanding, which is helpful in evaluating potential fit.</p>
<p><strong><em>What types of other jobs can one do as a graduate student/postdoc to gain experience beyond research?</em></strong></p>
<p>Academic experience can focus specifically on research skills, rather than outreach/training/community building skills. But to understand aspects of a job broadly it is important to gain experience outside of it. Volunteering and engaging in hobbies can be critical to developing skill sets that are helpful in non academic jobs. Volunteering is a safe route to try new roles while figuring out the career direction you would like to pursue. </p>
<p>It can be useful to look at job descriptions and work backwards. If you are interested in teaching and teaching leadership roles, consider volunteering for <a href="https://carpentries.org">The Carpentries</a>, a global organization that focuses on teaching essential data and computational skills! You can also join groups like <a href="https://rladies.org">R-Ladies</a>, <a href="https://pyladies.com">PyLadies</a>, coding meetings or start a group of your own. Offer to host an event or help with documentation within your lab or beyond. Kate also shared a <a href="https://github.com/k8hertweck/professional_assets_data_science">resource that she developed for a workshop on professional assets in data science careers</a>. It is also useful to network and talk to people from varied fields to gain a unique perspective. </p>
<p>It is difficult to land a manager/director level position straight out of graduate school or postdoc, since it requires management experience. While not identical to managing employees, mentoring students can provide substantial people management experience in academic settings. </p>
<p><strong><em>Do you have any regrets about not continuing the scientific research career route?</em></strong></p>
<p>Science comes in many flavors. Working in science-adjacent fields such as training/teaching can still offer the opportunity to make scientific, data-driven decisions while providing rewards such as learner satisfaction, developing open source materials, and engaging with a broader community. There are some aspects of the scientific process that you may not get to do in this line of work. Allowing yourself space to process the disconnect and to feel sadness is important for moving forward. However, you are constantly learning, adapting, and thinking about the impact your current position has on a larger scale of teaching, and that can be very empowering.</p>
<p>With the clock moving forward we had to end the lively and inspiring discussion! </p>Using snakemake to do simple wildcard operations on many, many, many files2021-08-30T00:00:00+02:002021-08-30T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2021-08-30:/blog/2021-snakemake-simple-operations.html<p>snakemake is awesome</p><p>I recently co-taught <a href="https://ngs-docs.github.io/2021-august-remote-computing/automating-your-analyses-with-the-snakemake-workflow-system.html">another snakemake lesson</a> (with Dr. Abhijna Parigi), and was reminded of one of my favorite off-label uses of snakemake: replacing complicated bash <code>for</code> loops with simple and robust snakemake workflows.</p>
<h2>An example</h2>
<p>As a bioinformatics researcher, I frequently need to do simple operations to many files. As part of this, I usually want to change the filename to represent the change in file content.</p>
<p>For example, let's suppose I have a bunch of FASTQ files (say, the ones <a href="https://github.com/ngs-docs/2021-remote-computing-binder/tree/latest/data/MiSeq">here</a>), and I want to subset them to the first 400 lines. The filenames all have the form <code>NAME.fastq</code>, and I want to add <code>.subset.fastq</code> to the end of the subset filenames to distinguish them.
(See <a href="https://ngs-docs.github.io/2021-august-remote-computing/automating-your-analyses-and-executing-long-running-analyses-on-remote-computers.html#subsetting">this shell scripting lesson</a> for more background and motivation for this particular operation.)</p>
<h3>Using <code>bash</code>, round 1</h3>
<p>For many years I did this with bash <code>for</code> loops. The following code works, assuming the original fastq files are in a <code>data/</code> subdirectory:</p>
<div class="highlight"><pre><span></span><code>mkdir subset
for i in data/*.fastq
do
    head -400 $i > subset/$(basename $i).subset.fastq
done
</code></pre></div>
<p>Starting from a bunch of files,</p>
<div class="highlight"><pre><span></span><code>data/F3D0_S188_L001_R1_001.fastq
data/F3D0_S188_L001_R2_001.fastq
...
</code></pre></div>
<p>this loop will produce</p>
<div class="highlight"><pre><span></span><code>subset/F3D0_S188_L001_R1_001.fastq.subset.fastq
subset/F3D0_S188_L001_R2_001.fastq.subset.fastq
</code></pre></div>
<h3>Improving the bash solution</h3>
<p>The output filenames are kind of ugly, because <code>fastq</code> is repeated. That's just because bash makes it so easy to append to filenames - we can fix this by adding <code>.fastq</code> into the <code>$(basename ...)</code> call:</p>
<div class="highlight"><pre><span></span><code>mkdir subset2
for i in data/*.fastq
do
    head -400 $i > subset2/$(basename $i .fastq).subset.fastq
done
</code></pre></div>
<p>So... not difficult to read, and fairly straightforward. Why would I use anything else?</p>
<h2>Using snakemake instead</h2>
<p>tl;dr The bash code above is brittle when I modify it; it's not robust enough for important work.</p>
<p>In my (extensive &lt;sigh&gt;) experience, the above approach fails some reasonable percent of the time. Usually I get it right the first time I write it, and then I modify and tweak it, and chaos ensues because I omit something in the for loop.</p>
<p>So a year or two ago, I decided to try out snakemake for one of these operations.</p>
<p>Here's the contents of a file named <code>Snakefile.subset</code>, which does the same thing as the for loop above -</p>
<div class="highlight"><pre><span></span><code># pull in all files with .fastq on the end in the 'data' directory
FILES = glob_wildcards('data/{name}.fastq')

# extract the {name} values into a list
NAMES = FILES.name

rule all:
    input:
        # use the extracted name values to build new filenames
        expand("subset3/{name}.subset.fastq", name=NAMES)

rule subset:
    input:
        "data/{n}.fastq"
    output:
        "subset3/{n}.subset.fastq"
    shell: """
        head -400 {input} > {output}
    """
</code></pre></div>
<p>and you can run it with <code>snakemake -s Snakefile.subset -j 1</code>.</p>
<p>With this Snakefile, snakemake pulls in all files that match the glob pattern and extracts their names, and then constructs a set of "targets" in rule <code>all</code> that it must create. The <code>subset</code> rule specifies how to build targets of that name.</p>
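<p>(One habit worth mentioning: <code>snakemake -s Snakefile.subset -j 1 -n</code> does a dry run, printing the jobs snakemake plans to execute without running them - a nice way to check that the wildcard matching picked up the files you expected.)</p>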
<h2>Why I like snakemake more than bash for this</h2>
<p>So why do I like snakemake more? A few reasons that I think are intrinsic to snakemake vs bash -</p>
<ul>
<li>snakemake fails if something is wonky about the filenames, <em>before</em> doing anything!</li>
<li>if any of the operations fail, snakemake stops and alerts me by default!</li>
<li>I can do the operations in parallel by specifying e.g. <code>snakemake -j 4</code> to use 4 cores.</li>
<li>the templating language for using <code>{...}</code> is nice, simple, and Python-standard (see <a href="https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python">this blog post on f-strings</a> and also <a href="https://docs.python.org/3/library/string.html#formatspec">the templating minilanguage ref</a>).</li>
<li>as the operations get more complicated, snakemake doesn't need to get more complicated, while the bash solution tends to complexify into illegibility...</li>
<li>I think the snakemake solution is easier to understand and modify!</li>
</ul>
<p>Above all, the overall structure of snakemake is <em>declarative</em> rather than <em>procedural</em>. We declare what we want the result to look like, and snakemake uses the available rules to create the overall set of steps that must be executed and Makes It Happen. This is what makes the error checking and parallelization possible.</p>
<p>Another "feature" of this solution is that there are more comments because I comment Snakefiles more than bash scripts. This is probably a me-problem that is caused by snakemake <em>forcing</em> me to edit a file :).</p>
<p>I haven't reused Snakefiles that much, but I think you can reuse Snakefiles fairly easily - see next section.</p>
<p>Are there any downsides? The main one is that the snakemake solution feels more heavyweight to me - it involves creating a file, getting the spacing/indentation right, etc. etc. So I still don't use it as much as I probably should.</p>
<p>Thoughts welcome!</p>
<p>--titus</p>
<h2>Appendix: A more reusable Snakefile</h2>
<p>Below is a Snakefile that's a bit more reusable for situations where your input and output directories don't match the names I used above - you can override PREFIX and OUTPUT by running <code>snakemake -C prefix=PREFIX output=OUTPUT</code>.</p>
<p>(I don't really like the syntax of using f-strings here, but it's cleaner than anything else I've found. Suggestions welcome.)</p>
<div class="highlight"><pre><span></span><code># pull in all files with .fastq on the end in the 'data' directory.
PREFIX = config.get('prefix', 'data')
print(f"looking for FASTQ files under '{PREFIX}'/")

OUTPUT = config.get('output', 'subset5')
print(f"subset results will go under '{OUTPUT}'/")

FILES = glob_wildcards(f'{PREFIX}/{{name}}.fastq')

# extract the {name} values into a list
NAMES = FILES.name

# request the output files
rule all:
    input:
        # use the extracted 'name' values to build new filenames
        expand("{output}/{name}.subset.fastq", output=OUTPUT, name=NAMES)

# actually do the subsetting
rule subset_wc:
    input:
        f"{PREFIX}/{{n}}.fastq"
    output:
        "{output}/{n}.subset.fastq"
    shell: """
        head -400 {input} > {output}
    """
</code></pre></div>A biotech career panel in the DIB Lab2021-07-20T00:00:00+02:002021-07-20T00:00:00+02:00Marisa Limtag:ivory.idyll.org,2021-07-20:/blog/2021-biotech-career-panel.html<p>Careers outside of universities!</p><p><strong>Note:</strong> The below blog post was written by Dr. Marisa Lim.</p>
<hr>
<p>(Thanks to Titus Brown, Abhijna Parigi, Tessa Pierce, and Saranya Canchi for reading drafts!)</p>
<p>On June 25th, we held a career panel discussion for the DIB lab on bioinformatics and biomedical data science careers. We invited four DIB-lab alumni and affiliates to be our panelists - Shaun Jackman, Lisa Johnson, Phil Brooks, and Olga Botvinnik - and graduate students and post-docs in the lab attended the event.</p>
<p>Each panelist shared their career journey leading to their current roles and then we discussed topics of interest from the audience, which roughly fell into two categories: 1) advice for finding jobs and interviewing and 2) a comparison of academic research vs. biotech industry careers.</p>
<h2>Finding jobs & interviewing</h2>
<p>Everyone agreed that you're more likely to land interviews and jobs when you've got a contact that can refer and recommend you to the hiring manager. This is why it's important to create a professional network, as cold applying (applying to jobs without any prior contact) is a more difficult approach to finding jobs. Here was the advice for online and in-person networking:</p>
<ul>
<li>Have some form of public online presence, whether that be a LinkedIn profile, a Twitter account (which functions as a personal Stack Overflow for asking questions and a place to advertise your own work), and/or a personal website. These serve as public portfolios for your research and work experience.</li>
<li>Industry resumes are typically short, so one tip our panelists recommended was embedding hyperlinks in the text, so recruiters/hiring managers can find the resources listed above for additional information. </li>
<li>In-person networking might occur at conferences or at smaller group events. For example, you can request an informational interview with someone to learn more about their job (most people will be willing to chat with you!). This is a great option for a smaller group discussion, which may be less overwhelming than trying to talk to people at large conferences. Be sure to take notes from informational interviews! One panelist suggested making a new document (e.g., a Google Doc) for each interview and time-stamping the conversation to keep track of the information. An added benefit to doing informational interviews is that they might generate positions or lead to formal interviews. It was mentioned that a large proportion of biotech jobs are actually <em>not</em> publicly announced.</li>
</ul>
<p>Besides networking, our panelists suggested keeping up to date on biotech company news - if a company has recently gotten an infusion of funding, they're likely to be hiring soon!</p>
<ul>
<li>Read biotech news and blogs - e.g., https://www.genomeweb.com/</li>
<li>Get in touch with venture capital (vc) recruiter firms.</li>
</ul>
<p>At the job search/application stage, a big concern is whether to apply for jobs if you don't think you meet all the exact requirements. Our panelists very enthusiastically made the following suggestions:</p>
<ul>
<li>As long as you're interested and show that you're motivated to learn, go for it! Let hiring managers decide!</li>
<li>You <em>can</em> learn new skills on the job (this is something to look out for when assessing whether a job allows you to grow)</li>
</ul>
<p>At the interview stage, our panel had this advice to share:</p>
<ul>
<li>Have your list of references (usually 3 people) ready to go! Don't wait until you need them for a job interview and be sure they're people you trust to support you.</li>
<li>Nobody has all of the skills listed in job descriptions, but make sure you know the <em>purpose</em> of every tool, even if you don't know the exact details. During the interview, you can say you know what the tools are for and if true, that you'd like to learn more about how to use them for your job.</li>
<li>Don't oversell your skills however, because interviewers can tell and it's perfectly ok to say you're keen to learn.</li>
<li>Know what job you're applying for and why you're a good fit for the team. For example, if you're interviewing for a customer support role, it's less pertinent to go into the fine details about your research, unless you can link the story back to something support-related.</li>
<li>Interviewing is a skill too and takes practice. Even if it's not your dream job, if there's a chance you'd take the job, consider going through with the interview process to gain experience.</li>
</ul>
<p>Recognize that interviews are a two-way conversation. In addition to answering interviewer questions, be sure to ask questions to help determine whether the company, role, and team will be a good fit for you as well. Olga shared 3 questions she asks at every interview:</p>
<ul>
<li>What has kept you at company X?</li>
<li>What would you change at company X?</li>
<li>Is there anything else I should have asked?</li>
</ul>
<p>What to do if you notice warning signs during interviews? </p>
<ul>
<li>It's a good idea to ask about company culture during one-on-one interviews. It can really help to have a contact at the company to talk to about potential issues.</li>
<li>If something feels really wrong, it's ok to say you want to stop early on. This will save your energy and time, as well as that of the interviewers.</li>
</ul>
<h2>Academia vs. Industry</h2>
<p>As students and postdocs with training and research experience primarily at universities, one of the most popular topics is comparing academic research and biotech industry careers. While we only scratched the surface of this topic during the panel, here were our main discussion points:</p>
<ul>
<li>The 'balance' part of work-life balance is highly dependent on the biotech company culture, job role, and timing. In general, the workload in industry is not spread evenly over the year. For example, there may be more intense working conditions leading up to product release deadlines. However, our panelists said they generally get to schedule their own time as long as they are meeting their commitments. For companies with a global customer base, the schedule may require some employees to work at night to accommodate time zone differences - however this may actually offer some work-time flexibility depending on your circumstances. It's important to communicate early on with your team to determine work and working condition expectations.</li>
<li>Perhaps one of the more visible differences between academia and industry is that industry jobs tend to be team-oriented and focused on specific aims; communication and interdisciplinary teamwork are consequently very important to meet the responsibilities of your group within the company. Deadlines are determined by business decisions and are often less flexible. In contrast, academic researchers and faculty tend to work more independently within their lab group or department, and wear multiple hats - i.e., apply for funding, mentor students, teach, publish, contribute to department and other service duties, and manage their lab. Project milestones are often less defined and deadlines may be more flexible.</li>
</ul>
<p>Before we knew it, 1 hour had passed and it was time to wrap up the panel! </p>Scaling sourmash to millions of samples2021-07-13T00:00:00+02:002021-07-13T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2021-07-13:/blog/2021-sourmash-scaling-to-millions.html<p>Bigger and better!</p><p>(Many thanks to Dr. Luiz Irber, who sunk the pillars and laid the foundations for a lot of the work below. Dr. Tessa Pierce and Dr. Taylor Reiter drove much of our engineering work by constantly coming up with new! and bigger! use cases that were also quite exciting and motivating ;)</p>
<p><a href="https://sourmash.readthedocs.io/">sourmash</a> is our software for quickly searching large volumes of genomic and metagenomic sequence data using k-mer sketching. We're up to version v4.2.0 now, and looking forward to releasing v4.2.1 sometime in the next month.</p>
<p>One emerging theme for sourmash for the v4 series has been <em>scaling</em>. There are a variety of large-scale data sets that continue to grow in size, and it sure would be nice to be able to work with them easily.</p>
<p>The challenges are big and growing. In no particular order,</p>
<ul>
<li>NCBI has about a million microbial genomes in their GenBank database;</li>
<li>the Sequence Read Archive contains well over a million microbial shotgun sequencing data sets, with about 600,000 of them being large metagenomes;</li>
<li>the <a href="https://gtdb.ecogenomic.org/">GTDB taxonomy group</a> has produced revised taxonomic annotations for 250,000 of the GenBank genomes;</li>
<li>individual research projects can now quickly and easily produce hundreds of genomes and dozens to hundreds of metagenomic samples, so the numbers above are growing rapidly.</li>
</ul>
<p>Thanks to Luiz Irber's work, discussed in his <a href="https://github.com/luizirber/phd/releases">thesis</a>, we have a nice distributed system ('wort') that computes new sourmash sketches as new data enters the system. (Some of that is also described in <a href="http://ivory.idyll.org/blog/2021-MAGsearch.html">my blog post about searching all public metagenomes</a>.) Also thanks to Luiz, over the last two years sourmash was <a href="https://blog.luizirber.org/2018/08/23/sourmash-rust/">refactored to use Rust underneath</a>, and we've been enjoying a number of <a href="https://twitter.com/ctitusbrown/status/1356344041978228736">raw performance gains</a>.</p>
<p>The challenges we're struggling with now stem from all of this. We <em>have</em> a lot of data that we <em>can</em> work with (package, search, etc.), but our processes and infrastructure for working with it haven't scaled to meet the new capabilities of sourmash. Briefly,</p>
<ul>
<li>we have several millions of files sitting in various directories, representing sketches for a lot of public data.</li>
<li>it's prohibitively slow to scan through all of that information repeatedly, and difficult or impossible to fit it all in memory (depending on the collection in question).</li>
<li>most sketches are not really interesting for any given operation, so a lot of our scanning would be redundant anyway.</li>
</ul>
<p>Because of this, a lot of our work since the 4.0 release has been on technical changes to support <em>better processes</em> that will better handle searching, collating, and updating collections of bajillions of files.</p>
<h2>Motivation: building new database releases</h2>
<p>One of the several major uses of sourmash is searching genome collections, with the goal of finding matches to and/or classifying genome or metagenome samples. We variously use the GTDB genomic representatives database (48k genomes), the GTDB complete database (250k genomes), or the NCBI microbial database (~800k genomes). And we want to <a href="https://sourmash.readthedocs.io/en/latest/databases.html">provide these databases</a> for download so that sourmash users don't have to do all the prep work themselves.</p>
<p>To provide these databases,</p>
<ul>
<li>first, we need to sketch all the genomes. This involves downloading each genome, running <code>sourmash sketch</code>, and saving the results somewhere. This is what wort does - it monitors NCBI for new genome entries, calculates the sketches, and makes them available for download.</li>
<li>then we need to select a <em>specific</em> set of sketches based on a catalog (GenBank microbial, or GTDB genomic reps, or whatnot) and a set of parameters (k-mer size, mostly).</li>
<li>next we need to figure out which sketches do not exist in our overall collection, for whatever reason, and find/build those. (e.g. NCBI GenBank is somewhat fluid, and GTDB isn't always synced with its releases; or something just slipped through the wort cracks; or GenBank never actually <em>had</em> the right sequence, so it needs to be calculated)</li>
</ul>
<p>If you have 100, 1000, or even 10,000 sketches, this is all pretty easy. It only starts to get annoying when you have 100,000 and more. We have a million :).</p>
<h2>Investing in scaling - some principles</h2>
<p>There are a number of techniques for working with large volumes of data.</p>
<p>First, <strong>lazy loading</strong>. Better known as <a href="https://en.wikipedia.org/wiki/Lazy_evaluation">Lazy Evaluation</a>, this is a CS concept where you pass around references to objects, and only resolve those references when you decide to actually use the object. Since references are usually (much) cheaper than the full object, you can save on memory. In the case of sourmash, one of our on-disk search structures, the Sequence Bloom Tree (SBT), has relied on lazy loading for years, and we've been expanding this to sketch collections more generally.</p>
<p>Second, <strong>streaming input and output</strong>. Another CS concept, <a href="https://en.wikipedia.org/wiki/Stream_(computing)">streaming</a> means that you perform as many operations as possible on individual items, and don't hold all the items in memory (ever). We've always intended to support large-scale streaming I/O in sourmash, but it hasn't been a priority before this. Luckily Python gives us lots of tools - generators, in particular - for doing streaming!</p>
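<p>To make these first two principles concrete, here's a minimal Python sketch (illustrative only - this is not the actual sourmash API) showing how a cheap reference object and a generator combine lazy loading with streaming:</p>
<pre><code>import json

class LazySketchRef:
    """A cheap reference to an on-disk sketch; nothing is read until load()."""
    def __init__(self, path):
        self.path = path

    def load(self):
        # resolve the reference only when the object is actually needed
        with open(self.path) as fp:
            return json.load(fp)

def stream_sketches(paths):
    """Generator: yield loaded sketches one at a time, never all at once."""
    for path in paths:
        yield LazySketchRef(path).load()

# usage sketch - memory stays flat no matter how many paths there are:
# for sketch in stream_sketches(["a.sig", "b.sig"]):  # hypothetical files
#     process(sketch)                                 # hypothetical operation
</code></pre>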
<p>A related concept to streaming is to <strong>avoid accumulating anything big in memory</strong>. This is easier said than done - for one not-so-random example, if you're searching a big database for matches, it is very easy to just keep matches in memory and then deal with them as a single collection. But what if a large portion of that database matches? You need a place to store the matches!</p>
<p>Fourth, <strong>use metadata to filter as much as possible</strong>. This seems separate from but maybe overlaps with lazy loading... basically, you want to work with catalogs of your data (which are less bulky), rather than your data itself (which is usually much larger).</p>
<p>Fifth, <strong>support flexible filtering</strong>. It's very easy to write custom solutions that get you what you need today, but a more general solution may not be much more work and will save you time later, as your use cases evolve.</p>
<p>Sixth, <strong>use databases</strong>. This may be obvious, but it's always worth remembering that there are literally decades of work on storing and searching structured catalogs of data! We should make use of that software! And if we use sqlite3, we have a superbly engineered and high-performance SQL database that is embedded in many programming languages!</p>
<p>Seventh, <strong>think declaratively</strong> instead of procedurally. Try to describe <em>what</em> you want to do with the data, not <em>how</em> you want it done (and in particular, avoid for loops as much as possible :). Abstracting the operations you want to do into a declarative form permits refactoring and optimization of the underlying implementation.</p>
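<p>A toy contrast (again, not sourmash code) makes the difference visible: the procedural version hard-codes the "how", while the declarative version takes the criteria as data that a smarter backend could later push down into, say, a SQL query:</p>
<pre><code># procedural: the "how" is baked into a for loop
def find_k31_dna(rows):
    out = []
    for row in rows:
        if row["ksize"] == 31 and row["moltype"] == "DNA":
            out.append(row)
    return out

# declarative: describe the "what"; the implementation is free to change
def select(rows, **criteria):
    return (r for r in rows if all(r[k] == v for k, v in criteria.items()))

# select(manifest_rows, ksize=31, moltype="DNA") reads as a specification
# and could be reimplemented as a SQL WHERE clause without touching callers.
</code></pre>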
<p>So how is this all shaking out in sourmash?</p>
<h2>Iterating towards nerdvana: a progress report</h2>
<p>We've invested considerable amounts of effort into engineering over the last year, iterating towards implementations of the above practices.</p>
<h3>Round 0 (sourmash 3.x through sourmash 4.0)</h3>
<p>By the release of <a href="https://github.com/sourmash-bio/sourmash/releases/tag/v4.0.0">sourmash 4.0</a> in March 2021, we had included a lot of good optimizations and refactoring already.</p>
<p>Way back in 3.x sometime, Luiz had moved sketch loading into Rust. This led to a ridiculous speedup in pretty much everything - 100-1000x.</p>
<p>We had slowly but surely made our way to a standard <code>Index</code> class API that let us collect, select, and search on large piles of sketches.</p>
<p>In particular, we'd started to invest in <em>selectors</em>, which let us specify features (like k-mer size, or molecule type) that we wanted our collection limited to.</p>
<h3>Round 1 (sourmash 4.1)</h3>
<p>For <a href="https://github.com/sourmash-bio/sourmash/releases/tag/v4.1.0">sourmash v4.1.0</a>, two months later, we evolved things more.</p>
<p>We had a lazy selection <code>Index</code> class that deferred running the selectors until the actual sketches themselves were requested. Getting the class to work properly and supporting it fully throughout the code base (and testing the bejeezus out of it) forced us to regularize the class API some more, which opened up many more opportunities.</p>
<p>We also added <em>generic</em> support for retrieval of sketches by random access into a collection, through our use of .zip collections and <code>ZipFileLinearIndex</code>. This was an expansion of the lazy loading and on-disk storage that SBTs had enjoyed since v3.2, but without the same overhead cost of the data structures. So, now it was possible to package really large collections of sketches in a compressed format <em>and retrieve individual sketches directly</em>, with minimal overhead. Not so incidentally, this was also our first random-access/on-disk mechanism that could store <em>incompatible</em> sketches - so it was much more flexible than what we'd been doing before.</p>
<p>The internal (and command-line) support for the streaming <code>prefetch</code> functionality was also a watershed moment. Prior to <code>prefetch</code>, all of our database search methods did a search and then sorted all of the results to present a nice summary to the user. While useful, this meant that if you had lots of matches, you had to store them all in memory so you could sort them later. This could be ...prohibitive in terms of memory, and we already had specific examples where we knew it wouldn't work. <code>prefetch</code> was a new feature that was <em>explicitly</em> streaming and was meant to search Databases of Unusual Size: so, it simply output matches as it found them, with no sorting.</p>
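<p>The streaming idea behind <code>prefetch</code> fits in a few lines of Python. This is a schematic with a made-up <code>containment()</code> helper, not the real implementation: emit each match the moment it is found, and never sort or accumulate.</p>
<pre><code>import csv, sys

def streaming_search(query, sketches, threshold=0.05):
    """Yield (name, score) for each match as it is found. No global sort and
    no accumulation, so memory use is independent of the number of matches."""
    for sketch in sketches:
        score = containment(query, sketch)  # hypothetical comparison function
        if score >= threshold:
            yield sketch.name, score

# write results as they arrive; nothing is held in memory:
# writer = csv.writer(sys.stdout)
# for name, score in streaming_search(query, stream_sketches(paths)):
#     writer.writerow([name, score])
</code></pre>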
<p>Last but by no means least, once we had streaming <em>input</em> we needed streaming <em>output</em>, so we implemented a general sketch saving method that supported several standard output methods (to directories and zipfiles, in particular) to offload sketches directly to disk.</p>
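<p>The shape of that saving method, as a hedged sketch (the real sourmash classes differ): a small context manager that appends each sketch to a zipfile as it arrives, so output streams to disk alongside the search.</p>
<pre><code>import json, zipfile

class SaveSketchesToZip:
    """Append each sketch to a zipfile as it is produced (streaming output)."""
    def __init__(self, location):
        self.location = location

    def __enter__(self):
        self.zf = zipfile.ZipFile(self.location, "w")
        self.count = 0
        return self

    def add(self, sketch):
        # offload immediately; nothing accumulates in memory
        self.zf.writestr(f"signatures/{self.count}.sig", json.dumps(sketch))
        self.count += 1

    def __exit__(self, *exc):
        self.zf.close()

# with SaveSketchesToZip("matches.zip") as out:
#     for match in matches:   # e.g. fed by a streaming search
#         out.add(match)
</code></pre>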
<p>Together, what this all meant was that we could finally:</p>
<ul>
<li>take an arbitrarily large collection of on-disk sketches,</li>
<li>select just the ones we wanted without necessarily loading them all,</li>
<li>walk across those sketches one by one, storing no more than a small number of them in memory,</li>
<li>find matches and offload those matches to disk as we went.</li>
</ul>
<p>There were still some suboptimal constraints that had to be obeyed, but they were in the implementation, not in the API, so we "just" needed to iterate on the implementation :).</p>
<p>This was the first crack in the dam of database building (one of our motivating use cases): once we had zipfile collections implemented, we could first build zipfile collections and then use those zipfile collections as our source build for all the other database types that supported fast search. (And, indeed, we now have <a href="https://github.com/sourmash-bio/sourmash/issues/1511#issuecomment-867759491">a snakemake workflow</a> that does exactly that!)</p>
<h3>Round 2 (sourmash 4.2)</h3>
<p>Between v4.1 and v4.2, we had several minor releases that cleaned things up and improved edge case efficiency. </p>
<p>For <a href="https://github.com/sourmash-bio/sourmash/releases/tag/v4.2.0">sourmash v4.2.0</a> in early July 2021, however, we doubled down on the "working with large collections" theme.</p>
<p>First, we introduced <a href="https://sourmash.readthedocs.io/en/latest/command-line.html?highlight=picklist#using-picklists-to-subset-large-collections-of-signatures">"picklists"</a>, which give command-line and API-level support for selecting sketches based on their metadata features (not their content). The initial implementation was slooooooow on large data sets, but this was an important declarative mechanism (that immediately saw extension in unexpected directions, too!)</p>
<p>This was followed (in the same release) by database <em>manifests</em>, a feature that is not user-facing at all (and doesn't show up in the docs, either - oops!). Manifests are simply a spreadsheet-style catalog of the metadata for all the sketches in a particular database, and they can be calculated <em>once</em> and then included in zipfiles. They support direct retrieval of sketches by id, as well as rapid intersection with picklists.</p>
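<p>Conceptually, a manifest is just rows of metadata keyed by sketch identity, and intersecting it with a picklist is a set operation. A minimal illustration (the real manifest columns are richer than this):</p>
<pre><code>import csv

def load_manifest(path):
    with open(path, newline="") as fp:
        return list(csv.DictReader(fp))

def apply_picklist(manifest_rows, picklist_idents):
    """Keep only the manifest rows whose identifier is in the picklist."""
    wanted = set(picklist_idents)
    return [row for row in manifest_rows if row["ident"] in wanted]

# rows = load_manifest("collection-manifest.csv")   # hypothetical filename
# keep = apply_picklist(rows, {"GCF_000005845.2"})
# ...then load only the sketches listed in 'keep' from disk.
</code></pre>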
<p>These two features were relatively minor in terms of new user-facing functionality - although they do support some cool stuff! - but were massive in terms of internal improvements.</p>
<p>For example, it was now virtually instant to take a zipfile collection of 260,000 sketches and pick out the three sketches you were interested in, based on whatever criteria you wanted.</p>
<p>So, as a not so random example, you could run <code>prefetch</code> on a big database (low memory, streaming...) and save only the CSV with match names, and then <em>just use that CSV</em> as a picklist to run further operations on the database - search, gather, etc. There's no need for intermediate collections of sketches in workflows! (This has saved us literally 100s of GBs of disk space already!)</p>
<p>As another not-so-random example, you could load a manifest from a zipfile collection, run your sketch selection (ksize, molecule type, identifier, etc.) on the manifest in memory, and then go back to load <em>only</em> the relevant sketches from disk as you needed them.</p>
<p>Again, the internal implementation is leading the user-facing features here, and there are still some performance issues, but the API support is there and seems flexible enough to support a wide range of optimizations.</p>
<h3>Round 3 (sourmash 4.2.1, maybe?)</h3>
<p>The next release will add some internal support for more/better manifest stuff. In particular, I've been experimenting with a <em>generic</em> lazy loading index class, which lets us do clever things like load a manifest, do selection and filtering on it, and only actually go to the disk to load the index object when we're ready - previous approaches always worked on the loaded index, which is suboptimal when you have thousands of them.</p>
<p>With this new class, we can apply the manifest directly as a picklist and subset down to just the sketches we care about. (As with all of these things, I've been <a href="https://github.com/sourmash-bio/sourmash/pull/1619">playing around</a> with different implementations and throwing different use cases at them, and it's been "interesting" to watch various solutions fall apart under the burden of really large collections!)</p>
<p>One thing this has let me do is (finally!) re-engineer database releases around manifests, using (tada) <a href="https://github.com/ctb/2021-sourmash-mom">manifests of manifests</a>. With this we do the following (sketched in code after the list) -</p>
<ul>
<li>load many manifests from many collections into a single SQLite database;</li>
<li>run our metadata selection (k-mer size, molecule type, picklists, etc.) on this database, using SQL primitives;</li>
<li>and then go grab precisely those sketches we care about, for further downstream processing.</li>
</ul>
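<p>In code, the flow looks roughly like this - a simplified sketch in which the table and column names are invented, not the actual manifests-of-manifests schema:</p>
<pre><code>import csv, sqlite3

db = sqlite3.connect("mom.sqlite")
db.execute("CREATE TABLE IF NOT EXISTS manifest "
           "(collection TEXT, location TEXT, ksize INTEGER, moltype TEXT)")

def add_manifest(collection, manifest_csv):
    """Load one collection's manifest CSV into the shared SQLite database."""
    with open(manifest_csv, newline="") as fp:
        for row in csv.DictReader(fp):
            db.execute("INSERT INTO manifest VALUES (?, ?, ?, ?)",
                       (collection, row["location"],
                        int(row["ksize"]), row["moltype"]))
    db.commit()

def locate(ksize, moltype):
    """Metadata selection runs as SQL; only matching locations come back."""
    cur = db.execute("SELECT collection, location FROM manifest "
                     "WHERE ksize = ? AND moltype = ?", (ksize, moltype))
    return cur.fetchall()
</code></pre>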
<p>In <em>practice</em> this lets us do things like cut a new database release for GTDB quickly and easily - it takes <a href="https://github.com/sourmash-bio/sourmash/issues/1652#issuecomment-877647611">only a minute</a> to verify that we have all of the necessary sketches and return their locations. (And like everything else, it can probably be optimized dramatically.)</p>
<h3>What's next? (sourmash 4.3 or later)</h3>
<p>We've been <a href="https://github.com/sourmash-bio/sourmash/issues/1350">somewhat fixated</a> on trying to provide good user experience (fast, performant, communicative) around searching Extremely Large Collections.</p>
<p>We have a prototype solution for near-realtime search of 50k+ sketches (see <a href="https://github.com/sourmash-bio/sourmash/issues/1226">sourmash#1226</a> and <a href="https://github.com/sourmash-bio/sourmash/issues/1641">sourmash#1641</a>) and at some point that will make it into the codebase. At that point we will be closer to fully exploiting CPU and disk capabilities; right now our speed is mostly bound by our lack of parallelism.</p>
<p>Somewhere down the road we're going to expand our persistent storage options. We support file storage, zipfiles, Redis, and IPFS for SBTs already (thanks again Luiz!) but want to support these for more collection types. Not hard now, just ...work.</p>
<p>And, as a nice cherry on top of the sundae, after all of the <code>Index</code> API refactoring we did back in 4.0 and 4.1, we can now easily support client/server mechanisms via remote procedure calls using only the standard interface - see <a href="https://github.com/sourmash-bio/sourmash/issues/1484">sourmash#1484</a>. This opens the door to using larger in-memory database types, which have been hampered thus far by the loading time.</p>
<h2>Some concluding thoughts</h2>
<p>Why are we putting so much effort into all of this? There are a couple of reasons:</p>
<ul>
<li>sourmash is underpinning a lot of different work in our lab, and these kinds of efficiency enhancements really make a difference when amortized over 5 projects!</li>
<li>routine lightweight search of all public data will unlock a lot of use cases that we can only see dimly right now. But our experience has been that <em>actually building</em> stepping stones towards this dimly-seen future set of use cases is the best way to make them happen (or to figure out why they can't or shouldn't happen).</li>
<li>it's fun! This has been a labor of love during some rough pandemic times, and it's been nice to actually make visible progress on something over the last 18 months...</li>
</ul>
<p>That all having been said, it's been a lot of work to solve the engineering challenges, with only fuzzy use cases to motivate us. Moreover, a lot of the work we describe above is not directly publishable. It's not entirely clear how we'll roll this out in a way that supports people's careers. So it's a bit of a gamble, but hey, that's what tenure's for, right?</p>
<p>(There's definitely some more to discuss here about the tension between grant writing and a focus on slow, careful, and iterative engineering of new capabilities - my tagline in lab for this is "boring in theory, transformative in practice" - but that's another blog post.)</p>
<p>Some other thoughts -</p>
<p><strong>Abstractions sure are convenient!</strong> Figuring out the right APIs internally has led to a renaissance in our internal code, although I think we need to step up our code docs game, too, so that someone other than the core developers can make use of these features :(.</p>
<p><strong>Declarative approaches are awesome.</strong> It's been really nice to redefine our APIs in terms of what we want to have happen, and then implement differently performing classes (storage, search, selection) that we can mix and match depending on our requirements.</p>
<p><strong>Automated testing has been key!</strong> We embarked upon the v4 journey with a codebase at about 85% code coverage, and having those solid building blocks has been critical. We continue to discover API edge cases that need to be resolved, and then we <em>immediately</em> lock them down with more tests. Without these tests, the massive-scale refactoring we've been doing would never have worked. (A <code>git diff --stat v3.5.1</code> shows virtually every source file changed, with 17738 lines added, and 3546 removed, in a codebase with only 50,000 lines of code and tests!)</p>
<p><strong>Python is awesome</strong>. The language supports really nice abstraction layers, provides good language primitives for streaming and lazy evaluation (generators in particular!), and has both a massive stdlib and straightforward installation system that means that adding new capabilities to software built in Python is quite easy.</p>
<p><strong>Rust is awesome</strong>. We're really liking Rust as a high-performance layer under Python. The boundaries are still being negotiated - who owns what objects is a persistent theme, for example, and we're still working on fleshing out Rust support for new storage types - but Python is challenging for compute-focused multithreading while Rust supports it very straightforwardly (and way, way better than C++).</p>
<p>Last but not least, <strong>sqlite3 continues to be amazing</strong>. I'm not even that good at tuning it, and it's already incredibly efficient; if we put time and effort into better schema, we'll probably get an order of magnitude improvement out of it with only a few hours of work. We just don't need that yet :).</p>
<p>--t</p>New sourmash databases are available!2021-06-29T00:00:00+02:002021-06-29T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2021-06-29:/blog/2021-sourmash-new-databases.html<p>Databases are now available for GTDB!</p><p>(Many thanks go to Dr. Tessa Pierce for refitting our database construction
process, to Dr. Luiz Irber for underlying infrastructure work, and to Dr.
Taylor Reiter for updating the docs :)</p>
<p>While we are working on releasing sourmash 4.2, I wanted to drop a
short note - we have some new databases (and database types!)
available for <a href="https://sourmash.readthedocs.io/en/latest/">sourmash</a>,
our genome and metagenome analysis tool.</p>
<p>If you go to the
<a href="https://sourmash.readthedocs.io/en/latest/databases.html">prepared databases</a>
page for sourmash, you'll see that we now make three types of
databases available, for two different collections of GenBank genomes.</p>
<h2>Collections</h2>
<p>We've created two collections of sourmash signatures for
GTDB 06-RS202, the latest release of the <a href="https://gtdb.ecogenomic.org/">Genome Taxonomy Database</a>. (Since every genome in GTDB is in GenBank, these are really just subsets of GenBank.)</p>
<p>The smaller collection contains the 48,000 genomic representatives, a collection of genomes that is non-redundant at the species level.</p>
<p>The larger collection contains all 258k GenBank genomes for which GTDB has calculated taxonomies.</p>
<p>Why do we have this focus on GTDB? It's a nice collection of high quality genomes; it covers most of the bacterial and archaeal species diversity present in GenBank; it's not massively redundant; and it's less monstrously huge than all of GenBank microbial (also see below). And, since sourmash is (mostly) taxonomy agnostic, it doesn't matter whether you are a fan of GTDB or a fan of NCBI taxonomies.</p>
<p>For all these reasons, GTDB has become our default for searches. But again, see below.</p>
<h2>Database types</h2>
<p>We provide three database types: SBT, LCA, and Zipfile collections.</p>
<p>The SBT and LCA databases are the same database types that we've provided for several years, and you probably want one of these. They are compatible with both sourmash 3.5 and sourmash 4.x.</p>
<p>To quote from <a href="https://sourmash.readthedocs.io/en/docs_4.0/command-line.html#storing-and-searching-signatures">the documentation</a>,</p>
<blockquote>
<p>SBT databases are low memory and disk-intensive databases that allow for fast searches using a tree structure, while LCA databases are higher memory and (after a potentially significant load time) are quite fast.</p>
</blockquote>
<p>But!</p>
<p>With <a href="https://github.com/sourmash-bio/sourmash/releases/tag/v4.1.0">sourmash 4.1</a>, we <em>also</em> support a new type of sourmash database - zipfile collections. These are unindexed collections of signatures, and now serve as the basis for our database release process. They are not (yet) that useful for users, is all.</p>
<h2>Do we provide anything else?</h2>
<p>Why, yes, thanks for asking! If you look at our full <a href="https://drive.google.com/drive/folders/1ohyggli2FsOoA2PO9h74FMp8A4mznzjt">google drive folder</a>, you'll see that we also provide full manifests for the content of these databases, along with a report from <code>sourmash lca index</code>.</p>
<h2>What other collections are we planning to provide?</h2>
<p>We've spent much of the last year trying to figure out how to make all GenBank microbial genomes (=~ all non-animal/non-plant) searchable in a useful way - see e.g. <a href="https://github.com/sourmash-bio/sourmash/releases/tag/v4.1.0">sourmash 4.1</a>, which massively sped up search and gather. We'll probably provide that as a zip collection soon (not sure about an SBT, though! and almost certainly not as an LCA database).</p>
<h2>What about protein databases? And support for multiple taxonomies?</h2>
<p>At the present time, I can neither confirm nor deny that we will soon be providing prepared database for protein search. Likewise, I can neither confirm nor deny that we will soon release support for doing taxonomic analysis with either NCBI or GTDB taxonomies, or indeed <em>both at the same time</em>. So please do not engage in unwarranted speculation.</p>
<h2>Questions? Comments? Thoughts? Requests?</h2>
<p>File an <a href="https://github.com/sourmash-bio/sourmash/issues">issue</a> or come chat with us over on our <a href="https://gitter.im/sourmash-bio/community#">new gitter channel, sourmash-bio/community</a>!</p>
<p>And stay tuned!</p>
<p>--titus</p>Moving sourmash towards more community engagement - a funding application2021-06-09T00:00:00+02:002021-06-09T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2021-06-09:/blog/2021-sourmash-czi-application.html<p>CZI EOSS4 application for sourmash support</p><p>We applied for funding from CZI for sourmash a few weeks back, via the <a href="https://chanzuckerberg.com/eoss/">Essential Open Source Software for Science</a> program. Here's the core of the application (lightly edited).</p>
<p>(We'll hear about funding by end of September, I believe.)</p>
<p>Feedback welcome, unless you're alerting me to the presence of typos :)</p>
<h2>Proposal details</h2>
<p>We seek funding for maintenance and user support for the sourmash software, while embarking on an ambitious plan to improve sustainability through improved governance, enhanced inclusivity, and robust community engagement.</p>
<h2>Short description of software project:</h2>
<p>Sourmash is mature software that enables lightweight content search, comparison and classification of microbial genomes and metagenomes. Sourmash works in low memory with compact databases, supports both NCBI and GTDB taxonomies, and can operate on private collections of genomes and metagenomes. The release of v4.1 brings massive-scale search of all GenBank microbial genomes and all public metagenomes to commodity hardware. These features are underpinned by novel data structures and algorithms, including an extension of MinHash that supports containment and the use of min-set-cov to do highly accurate metagenome analysis. Sourmash serves as a robust, reliable, and performant backbone for microbial sequence analysis.</p>
<p>We use development practices based on 30 years of scientific software engineering expertise: we develop in the open, do code review, have tests with 90%+ line coverage, and have a robust release process with semantic versioning. We provide thorough documentation, engage with users via our issue tracker, and use social media to broadcast new features and use cases. The utility of sourmash has been recognized by both users and funding agencies: we are increasingly well cited, the NSF is supporting the development of flexible taxonomies and distant evolutionary classification via protein k-mers, and the NIH is supporting iHMP reanalysis.</p>
<h2>Proposal Summary</h2>
<p>Sourmash is mature software that serves as a stable component of sequence analysis workflows, a fast and lightweight tool for massive-scale search of public and private sequence databases, and a platform for novel data structure and algorithm exploration. Sourmash is explicitly designed to meet the computational needs created by the massive expansion of sequencing capacity in microbiome biology.</p>
<p>We have arrived at an important crossroads with sourmash. We are just now releasing mature support for petabase-scale content search (v4.1.x and v4.2), and are currently writing up our novel data structures and algorithms for publication. We have ongoing projects using sourmash to analyze Human Microbiome Project datasets, including discovering strain-specific markers of Inflammatory Bowel Disease. Simultaneously, grant support for the core development of sourmash is ending, and Dr. Luiz Irber, the core developer behind most of the scaling work, is moving to another job where sourmash will become his part-time project. While sourmash research development will continue, we have no way to robustly support our current user base and grow the developer community with traditional funding, and do not have the governance infrastructure to productively engage with other support mechanisms.</p>
<p>We request support from CZI to maintain our newly released features through continued sourmash core development, while working toward sustainability by growing the project out of the lab and into the community. We propose to use the funding period to expand the sourmash community, define and grow a governance framework, connect to the Python and Rust bioinformatics ecosystem, and train both biologists and bioinformaticians to better engage with open source bioinformatics software. In particular, we see an opportunity to use sourmash to provide one example of how to grow a small project based in a single lab into a more sustainable community-based project. Importantly, this kind of maintenance and community growth does not fall within the scope of traditional funding opportunities.</p>
<p>At the end of this two year period, we will have continued to release and support high performance, high impact software. We will also have expanded our developer and user community, chosen a governance framework, identified a fiscal sponsorship plan, and published our strategies for project growth and sustainability.</p>
<h2>Work Plan</h2>
<h3>Software development activities:</h3>
<p>We propose to follow a “python-dev” model in which maintenance and feature releases proceed on their own timeline, while the roadmap process coordinates the planning and development of related feature sets (e.g. taxonomy extensions and database formats are connected). This separates maintenance updates from the “slow science” process of developing, testing, and evaluating new functionality against scientific use cases, while also ensuring that fully baked new functionality does regularly get released. Software development will proceed under our current “async” model, in which all decisions are discussed and documented openly in GitHub. </p>
<p>Fully 50% of the funded effort on this proposal goes to the “maintenance mode” activities, which are intended to further regularize the development process and support iterative, gradual performance improvement while preventing feature and performance regressions. This will include:</p>
<ul>
<li>regular releases;</li>
<li>continued maintenance of and improvements to the software development and release process;</li>
<li>database updates and releases as new genomes and metagenomes are made public;</li>
<li>regular JOSS publications on major new versions (v4, v5, etc.);</li>
<li>structural improvements to the sourmash core, including a plugin architecture for storage formats, new command-line subcommands, and visualizations;</li>
<li>sketch serialization documentation and format upgrades to store more metadata and support higher-performance binary formats.</li>
</ul>
<h3>Community engagement activities:</h3>
<p>The community engagement activities below seek to build, grow, and support an active and robust user and developer community that includes biologists, bioinformaticians, computer scientists, and software engineers.</p>
<p>As sourmash matured, we focused our efforts toward building sustainable software and developing advanced use cases within the lab first, with documentation for new users added via GitHub issues, blog posts, and feature papers. However, this has resulted in somewhat uneven support resources: e.g. we lack intermediate-level tutorials helping users transition from our introductory tutorials to advanced use cases or Python API usage. We will upgrade our documentation systematically, create a “recipes” site, and construct an FAQ section that is well integrated with the documentation by reorganizing and amending existing content.</p>
<p>We plan to provide a warm, welcoming community forum that encourages new user questions and contributions. This will require engaged moderators, a strong Code of Conduct process, and a large user base, which we have not had the bandwidth to support previously. A key outcome of this funding will be the clear definition of a single support forum for sourmash, as one of the first outputs of our governance process.</p>
<p>Contributors may come from both the user community and the broader bioinformatics/CS community. We routinely source use cases, ideas for new functionality, and requests for performance improvement from the current biology-focused user community, and will encourage deeper and broader contributions through our governance and contributor framework, discussed below.</p>
<p>Similarly, there are many implementation aspects of sourmash that are interesting to, and may provide fodder for, CS and software engineers who are interested in contributing to bioinformatics software. While this is supported within the lab, these challenges are not immediately obvious or accessible to others without some biological background and appropriate documentation. We will build tutorials and documentation that highlight the algorithmic and implementation aspects of sourmash (sketching approximations, scaling issues, indexing formats, performance benchmarking, and quality-of-result benchmarking) and provide guidance for CS researchers who wish to evaluate new algorithms. Our governance and contributor framework will welcome extensions and evaluations and require neither permission nor involvement from sourmash core.</p>
<p>We see great value in further broadening our contributor base, and will continue to improve our current support for first-time OSS contributors by expanding our new contributor issue labels beyond “good first issue”, “good next issue”, and “repeatable quest”. While we do not expect many of these contributors to become long-term sourmash contributors, some may; more importantly, a steady influx of new first-time contributors will ensure that our development documentation remains accurate and useful. In support of this effort, we have budgeted for two 10 hrs/wk undergraduates to continue to contribute. We will also offer first-time contributor collaboratives, run documentation and visualization improvement hackyfests, and contribute to hackathons at BOSC and PyCon.</p>
<h3>Governance activities:</h3>
<p>We will build a Steering Council that guides governance, defines contributor guidelines, authorship considerations, and oversees the roadmap process. As part of this, we will nucleate “sourmash.bio” and move development activities out of the dib-lab organization. The Steering Council will also define the scope of the project and outline contribution mechanisms, most likely via a fiscal sponsor (perhaps the Software Freedom Conservancy).</p>
<h2>Milestones and Deliverables:</h2>
<p>We will deliver regular releases of sourmash under semantic versioning, per http://ivory.idyll.org/blog/2021-sourmash-v4-released.html. We anticipate approximately quarterly releases of major.minor versions, with more frequent patch releases.</p>
<p>We will update our roadmaps for v4.2.x, v5, and beyond on a quarterly basis. All planned features for these versions are discussed in the issue tracker. Each minor release will feature a link to updated roadmaps for the coming features. The issue tracker will continue to be constantly updated and refined in conjunction with releases and roadmaps.</p>
<p>These releases will also see regular refinement and updates of both the Python layer and the Rust layer; a major goal of our project is to expand our Rust contributor pool via CS undergrads and also (potentially) engagement with rust-bio.</p>
<p>We will simultaneously engage in iterative refactoring of our documentation to include not just getting-started docs and tutorials, but also detailed guidelines on how to get started contributing, video guides to sourmash, a “recipe” site that outlines solutions to common use cases, developer-oriented documentation for new plugins and visualizations, and a CS-focused introduction to the problems that sourmash is tackling. Recipes will be in place by mid-2022 and major updates will be delivered on a semi-annual basis.</p>
<p>Each summer (2022 and 2023) we will participate in undergraduate research projects (e.g. the National Summer Undergraduate Research Program) and introduce biology and CS undergraduates to problems in microbial genomics and metagenomics, including but not limited to sourmash. We will also participate in summer training courses (STAMPS at MBL, and DIBSI at UC Davis) as was our usual pre-pandemic practice (2010-2019).</p>
<p>We will offer at least two webinars and four hackfests annually, with our focus varied between attracting new users, attracting new developers, refining our documentation, exploring new functionality and improving our UX, and highlighting new analysis opportunities.</p>
<p>In December of 2021, 2022, and 2023 we will provide a detailed update of our governance progress and future plans. By December 2021, we will have issued invitations to a Steering Council, and begun the process of holding quarterly meetings. By December 2022, we will have engaged with potential fiscal sponsors and identified a path forward.</p>
<p>By mid-2022, we will have designated and seeded a support forum for sourmash.</p>
<p>While this will not be supported by this proposal specifically, we will also have submitted two papers on sourmash by December 2021.</p>
<p>In terms of metrics,</p>
<ul>
<li>We will have engaged with over 1000 new users via hackyfests, webinars, etc. as a direct result of CZI funding.</li>
<li>Our stretch goal is over 500 citations combined for sourmash core papers by Dec 2023.</li>
<li>We hope to be the “stable, boring” option for petabase-scale content search and expect to have seen substantial growth in user support and functionality requests for these use cases.</li>
<li>We also expect to see a dozen or more 3rd-party extension modules adding new format import/export and visualizations to sourmash.</li>
<li>We will have submitted at least two major releases (v4 and v5) to JOSS, one in 2021 and one by the end of 2023.</li>
</ul>
<h2>Value to Biomedical Users:</h2>
<p>As the biomedical field increasingly moves towards large-scale sequencing, both of single genomes (e.g. individuals) and metagenomes (e.g. gut microbiome), lightweight analysis tools are becoming an essential part of core biomedical treatment and research. Sourmash provides a lightweight and robust interface for these analyses. In particular, we note four of our well-developed applications have considerable biomedical relevance for sequencing data analysis generally, and microbiome work specifically:</p>
<p>(1) finding the minimal list of relevant genomes for a microbiome, from all available (800k+) microbial and viral genomes;</p>
<p>(2) searching all microbiome data sets for a specific genome;</p>
<p>(3) detecting and removing contamination in metagenome, genome and transcriptome data sets;</p>
<p>(4) extraction of annotation independent features to support machine learning.</p>
<p>These applications are already under active use for large-scale biomedical data: the NIH has provided short-term funding to Dr. Brown in support of applying sourmash systematically to the Human Microbiome Project data sets, and we have an ongoing project using sourmash to discover strain-specific markers of Inflammatory Bowel Disease using a random forest approach.</p>
<p>Beyond the technical aspects of sourmash, we will work towards being a good example of a scientific open source project in biology/bioinformatics, by intentionally moving towards community governance, rewarding a wide variety of contributions, providing use-case focused tutorials, and guiding sourmash users towards how to support and evolve sourmash.</p>
<h2>Diversity, Equity, and Inclusion Statement:</h2>
<p>We believe that social barriers to contribution are a major cause of the low diversity in scientific OSS, and we are committed to systematically lowering these barriers while also lifting contributors over these barriers.</p>
<p>We also believe that lightweight and robust methods that support large-scale data discovery and reuse can expand bioinformatics into the “lightly resourced” space, e.g. Primarily Undergraduate Institutions; this is an equity issue because so many current methods require substantial resources simply to get started.</p>
<p>Training modules at the DIBSI and STAMPS workshops will introduce sourmash to a diverse range of research-focused participants. NSURP is focused on undergraduates from underrepresented backgrounds, and in 2020 we hosted two Latinx undergraduates. UC Davis is also an HSI and our undergraduate researchers will be recruited with attention to diversity.</p>
<p>We need a stronger CoC response framework, both for forum moderation and for project contributors; currently, the CoC process is based on the BDFL model, which is inadequate. This is important for DEI and antiracism, and improving our CoC process is one of our main goals in finding a fiscal sponsor who can provide a larger framework within which we can operate.</p>
<p>Last but not least, we believe that providing authorship for all contributors, including those who contribute use cases, recipes, and documentation, provides a way to formally recognize contributions that are traditionally undervalued in both open source projects and academia. Recognizing this kind of “invisible labor” is fundamentally an equity issue.</p>Searching all public metagenomes with sourmash2021-06-08T00:00:00+02:002021-06-08T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2021-06-08:/blog/2021-MAGsearch.html<p>Searching all the things!</p><p>In preparation for an NIH/DOE workshop I'm attending today on
"Emerging Solutions in Petabyte Scale Sequence Search", I thought I'd
write down what we're currently doing with sourmash for public
metagenome search. I'm writing this blog post in a hurry, and I may
revise it later as I receive comments and feedback; I'll point to a
diff if I do.</p>
<p>This is based largely on work that was <a href="https://blog.luizirber.org/2020/07/24/mag-results/">done by Dr. Luiz Irber last year</a>, as part of his PhD work with me.</p>
<p>sourmash itself is available (see
<a href="https://sourmash.readthedocs.io">sourmash.readthedocs.io/</a>), and we
just released v4.1.2 yesterday! It's under the BSD 3-clause license
and is fully available via conda and pip.</p>
<h2>In brief - lightweight metagenome search with MAGsearch</h2>
<p>Today, we can use MAGsearch to robustly find matches to 10kb+ sequences (or collections of 10,000 or more k-mers) across all publicly available metagenomes, out to about 93% ANI.</p>
<p>It's particularly useful for -</p>
<ul>
<li>gathering candidates from public metagenomes for e.g. outbreak detection.</li>
<li>finding matches to a particular species or genus so as to study its ecological distribution.</li>
<li>gathering data sets to expand our knowledge of a species' pangenome.</li>
</ul>
<p>A search with ~100 query genomes takes about 17 hours, today, and will search 580,000 metagenomes representing 530 TB of original sequence data.</p>
<h2>How it works underneath</h2>
<p>We use <a href="https://sourmash.readthedocs.io/en/latest/">sourmash</a> to support metagenome containment search with scaled signatures.</p>
<p>sourmash scaled signatures are derived from MinHash techniques. They are compressed representations of k-mer collections, and can reliably be used to find exact matches of ~10kb segments of DNA between any two collections; larger matches can be found out to about 93% ANI.</p>
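<p>The "scaled" idea is simple enough to sketch in a few lines of Python: hash every k-mer and keep only the hashes that fall below a fixed fraction of the 64-bit hash space, so any two sketches retain a comparable, containment-friendly subsample. This toy version uses a stand-in hash function - sourmash itself uses MurmurHash3.</p>
<pre><code>import hashlib

MAX_HASH = 2**64

def kmer_hash(kmer):
    # stand-in 64-bit hash; sourmash actually uses MurmurHash3
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def scaled_sketch(seq, ksize=31, scaled=1000):
    """Keep roughly 1/scaled of all k-mer hashes (a 'scaled' sketch)."""
    threshold = MAX_HASH // scaled
    return {h for i in range(len(seq) - ksize + 1)
            if threshold > (h := kmer_hash(seq[i:i + ksize]))}

def containment(query, subject):
    """Fraction of the query sketch found in the subject sketch."""
    return len(query.intersection(subject)) / len(query) if query else 0.0
</code></pre>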
<p>One key aspect here is that search can be done without access to the original data.</p>
<p>We maintain a collection of signatures for ~580,000 public metagenomes with the SRA for k=21, 31, and 51. A search with about 100 genome-sized queries currently takes about 17 hours using 32 threads with 48 GB of RAM (on our HPC).</p>
<p>Our complete collection of signatures is approximately 10 TB total, although this contains far more than the metagenome data - it contains 3.7m signatures, representing 1.3 PB of total data (SRA metagenomes + SRA non-plant/animals + GenBank/RefSeq microbial genomes).</p>
<p>This collection of signatures is automatically updated by <a href="https://github.com/dib-lab/wort">wort</a>, which coordinates a distributed collection of workers to compute signatures as new data arrives at NCBI.</p>
<h2>Simple opportunities for improvement</h2>
<p>MAGsearch is a robust prototype, with many straightforward opportunities for improvement. I would guess that with a few weeks of focused investment, we could get down to ~1 hour per search.</p>
<p>First, the MAGsearch code doesn't do anything special in terms of loading; it's using the default sourmash signature format, which is JSON. For example, binary encodings would decrease the collection size a lot, while also speeding up search (by decreasing the load time).</p>
<p>Second, searching the signatures is done linearly, and uses Rust to do so in parallel. It uses the same Rust code that underlies sourmash (but is several versions behind the latest version). Making use of recent improvements in sourmash Rust code would probably speed this up several fold.</p>
<p>Third, we can now add protein signatures to our collection of DNA signatures, which would enable much more sensitive search. (We'd have to sketch a lot of data, though. :)</p>
<h2>Broader limitations</h2>
<p>The internal data structures we use in sourmash are optimized for relatively small collections of k-mers, because sourmash is built around downsampling k-mer collections. We're slowly improving our internal structures, but supporting <em>all</em> k-mers is not straightforward and is not something on our current roadmap.</p>
<p>Our sketching techniques only support individual k-mer sizes/molecule types. So while we can compute, store and search multiple k-mer sizes for DNA, protein, Dayhoff encodings, etc., they are stored separately and don't "compress" together. This means that signature collections grow quickly in size as we provide more k-mer sizes and molecule types!</p>
<p>We're not quite sure how to provide our current databases to people. Personally I'm not really ready to support MAGsearch as a service, either, but that's partly because of a lack of funding.</p>
<h2>What else does sourmash offer?</h2>
<p>sourmash itself is stable and well tested, and can be used with confidence to do many bioinformatics tasks. It is easy to install (pip/conda), and is reasonably well documented.</p>
<p>Our data structures and algorithms are simple and well-understood and straightforward to (re)implement. While they aren't yet all published, we are happy to explain them and tell you where they will and won't work.</p>
<p>sourmash is fast, and low memory, and requires little disk space for even pretty large collections of signatures.</p>
<p>sourmash has an increasingly useful command-line interface that supports many common k-mer and search operations. In this sense, it can be used as a partial guide for a good "default" set of operations that k-mer-based tools could support. We have paid a fair amount of attention to user experience, too.</p>
<p>Underneath, sourmash has a flexible Python API whose performance-critical internals are slowly being replaced with Rust. This means that we can quickly prototype new functionality while refactoring critical functionality underneath, so sourmash performance is continually improving while we are also tackling new use cases.</p>
<p>We have an open, robust approach to software development, with an increasingly diverse array of contributors. I'm not sure we're ready to take on a lot of new contributors quite yet, because our roadmapping processes are not very mature, but we're working on that.</p>
<p>We use semantic versioning for the sourmash package itself, and we communicate clearly about breaking changes. As a result, sourmash can be cleanly integrated into workflows with simple version pinning requirements.</p>
<p>We support public and private collections of signatures, and all of our primary search and analysis approaches work with multiple databases or signature collections without needing to re-index them or combine them from scratch. </p>
<p>We also support flexible "free-form" taxonomy, and in particular support both NCBI and GTDB taxonomies.</p>
<h2>Where would I like to see petabase-scale search go?</h2>
<p>I wouldn't advocate for sourmash itself (either the software or the underlying techniques) as the one true method for searching all (meta)genomic data. Among other things, sourmash has a lot of other use cases that matter to us!</p>
<p>But I think we have a few experiences to offer to any such effort -</p>
<ul>
<li>we have functioning implementations that support a number of really useful use cases for metagenome search and analysis. It would be nice not to lose those use cases!</li>
<li>high-sensitivity prefiltering approaches are good and enable flexible triage afterwards. We mostly use sourmash as a lightweight way to find all the things that we <em>might</em> care about, before doing more in-depth analysis.</li>
<li>having both command-line and Python APIs has been incredibly useful, and I think it would be a mistake to bypass good APIs in favor of a Web API. Of course, this also increases the developer effort by a lot, but the return is that you enable a lot more flexibility.</li>
<li>riffing more on that, I think it would be a mistake to write a custom Web-hosted indexing and search tool that only works with NCBI formats and taxonomies.</li>
<li>riffing even more on that, it's been great to be able to quickly add databases/collections to search, and supporting both completely private databases as well as rapid updating of public database collections is something that has been really useful in comparison to many other metagenome analysis tools.</li>
<li>simplicity of data structures and algorithms has helped us a lot with sourmash. Software support is fundamentally a game of maintenance and it has been great to be able to reimplement our core data structures and algorithms in multiple languages. In particular, I worry a lot about premature optimization when I look at other packages.</li>
</ul>
<p>Luiz has also done a lot of thinking about distributed computing and decentralization via Dat and IPFS that I think could be valuable, but I'm not expert enough to summarize it myself. Hopefully Luiz will write something up :). (You can already <a href="https://github.com/luizirber/phd/tree/master/thesis">check out his PhD thesis, chapters 4 and 5, for some juicy details and discussion, though!</a>)</p>
<h2>What other tools should we be looking at for large scale search?</h2>
<p>I think <a href="https://www.biorxiv.org/content/10.1101/2020.08.07.241729v2">Serratus</a> did an excellent job of showing some of the possibilities of massive-scale metagenome search!</p>
<p>There's lots of tools out there in various stages of development, but I am particularly interested in <a href="https://www.biorxiv.org/content/10.1101/2020.10.01.322164v1">metagraph</a>.</p>
<p>I'd love to hear about more tools and approaches - please drop them in the comments or on twitter!</p>
<p>--titus</p>sourmash 4.1.0 released!!2021-05-17T00:00:00+02:002021-05-17T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2021-05-17:/blog/2021-sourmash-v4.1.0-released.html<p>sourmash v4.1.0 is here!</p><p>We are pleased to announce that sourmash v4.1 is now out! As usual <a href="https://sourmash.readthedocs.io/en/latest/#installing-sourmash">it can be installed via conda or pip</a>. You can <a href="https://github.com/dib-lab/sourmash/releases/tag/v4.1.0">read the release notes here</a> for details, or just read on here for the highlights!</p>
<h2>One big new command-line feature - zipfile collections.</h2>
<p>One command-line feature that opens up a lot of new opportunities down the line is support for zipfile collections.</p>
<p>Zipfile collections provide a way for sourmash to take in potentially <em>very</em> large collections of signatures. Briefly, you can take a directory hierarchy of signatures and zip them all up, and sourmash can now load the signatures directly from the zip file - so you can distribute collections of signatures, search and gather and compare on them, and so on.</p>
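<p>Since signatures are stored as JSON, loading them straight out of a zip needs nothing beyond the Python standard library. A rough sketch, assuming uncompressed <code>.sig</code> members (real collections may gzip them):</p>
<pre><code>import json, zipfile

def iter_signatures(zip_path):
    """Lazily yield signatures from a zipped collection, one member at a time."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".sig"):
                yield json.loads(zf.read(name))

# nothing is decompressed until requested, and only one signature is
# ever held in memory:
# for sig in iter_signatures("gtdb-collection.zip"):  # hypothetical filename
#     ...
</code></pre>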
<p>Now, back in v3.3.0, Luiz <a href="http://ivory.idyll.org/blog/2020-sourmash-databases-as-zip-files.html">added .zip as a storage format for Sequence Bloom Trees</a>, indexed databases of signatures. These are fantastic, but because of the nature of SBT indices, they came with some restrictions - the signatures they contained had to be compatible, and big SBTs consumed a lot of disk space and memory. While we're working on fixing that separately, zipfile collections offer an alternative that is not faster but is considerably more convenient.</p>
<p>In particular, unlike SBTs, zipfile collections can store incompatible signatures, and they don't consume any extra memory, and they don't require any ancillary files. This lets us (not so hypothetically...) store k=21, k=31, and k=51 signatures for <a href="https://osf.io/wxf9z/files/">all 300k+ GTDB genomes</a> in a fairly small (~8.5 GB) zipfile. You can also get the GTDB representatives in even smaller files (1.5 GB) and we built SBTs for them, too (2.8 GB each).</p>
<p>The remaining problem is that zipfile collections aren't indexed, and so searching 300k+ signatures is not really that fast because you're doing it linearly. While <code>search</code> can handle it, iterative approaches like <code>gather</code> cannot. To that end, we added interim support via <code>prefetch</code>. Read on!</p>
<h2>Another nifty command line feature: <code>prefetch</code>.</h2>
<p><code>sourmash prefetch</code> is a command that basically does a <code>sourmash search --containment</code>. It only works on scaled signatures (more about that soon, promise), and it's meant as a prefilter for <code>sourmash gather</code>. The idea is, you run <code>prefetch</code> with a metagenome query, and <code>prefetch</code> finds all of the potentially relevant signatures, and then saves them for you. Then, you run <code>sourmash gather</code> on the saved signatures, which winnows them down to the smallest possible list of genomes relevant to your metagenome.</p>
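<p>Under the hood, the winnowing step in <code>gather</code> is a greedy minimum-set-cover: repeatedly take the match that covers the most remaining query hashes, subtract those hashes, and repeat. A toy version over plain hash sets (not sourmash's actual implementation):</p>
<pre><code>def greedy_gather(query_hashes, candidates, min_overlap=3):
    """Winnow candidates (a dict mapping name to hash set) down to a minimal
    list of matches that together cover the query."""
    remaining = set(query_hashes)
    results = []
    while remaining and candidates:
        # pick the candidate covering the most still-unexplained hashes
        name = max(candidates,
                   key=lambda n: len(remaining.intersection(candidates[n])))
        overlap = len(remaining.intersection(candidates[name]))
        if min_overlap > overlap:
            break
        results.append((name, overlap))
        remaining.difference_update(candidates.pop(name))
    return results
</code></pre>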
<p>Why implement <code>prefetch</code>? A couple of reasons -</p>
<ul>
<li>we already had code in some other projects that did this and was quite useful.</li>
<li>it was an easy feature to implement that led to massive speedups when doing certain kinds of parameter exploration.</li>
<li>it's explicitly streaming compatible, because it doesn't need to hold anything in memory long-term - it's meant to walk across whatever (potentially very, very large...) databases and collections you give it, and output any relevant matches. As we're approaching a million genomes in GenBank, this feature seemed ...relevant.</li>
<li>last but not least, <em>internal</em> support for prefetch goes with some excellent internal primitives that can now be further optimized. More about THAT in some future releases :).</li>
<li>prefetch also lets us support some other features, such as reporting ties in <code>sourmash gather</code>. We don't do that yet, but we can do so <em>much</em> more easily now.</li>
</ul>
<p>So how would you use prefetch? You don't need to, really - it now underpins gather, so <code>sourmash gather</code> on a zipfile containing all 300k+ GTDB genomes will actually run much, much faster than it ever would have before, despite using a linear search underneath.</p>
<p>For some speed comparisons of the new features, see <a href="https://github.com/dib-lab/sourmash/issues/1530">sourmash issue #1530</a> - here's the summary, for searching approximately 45,000 signatures from GTDB with a fake metagenome built from 4 genomes -</p>
<table>
<thead>
<tr>
<th>scenario</th>
<th>time (s)</th>
<th>max RAM (MB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. index, prefetch</td>
<td>10</td>
<td>215</td>
</tr>
<tr>
<td>2. index, no prefetch</td>
<td>22</td>
<td>214</td>
</tr>
<tr>
<td>3. no index, prefetch</td>
<td>207</td>
<td>81</td>
</tr>
<tr>
<td>4. no index, no prefetch</td>
<td>811</td>
<td>87</td>
</tr>
</tbody>
</table>
<p>So obviously you want to use an index if you have the memory, but if you don't, you definitely want to use prefetch! Happy to discuss the scaling behavior in the comments or over at the GitHub issue, too - the short version is that the time for rows (2) and (4) should scale with the diversity of the metagenome, while (1) and (3) should be mostly independent of diversity (which is what you want!).</p>
<p>Important note: before sourmash 4.1, row (2) above was the only behavior supported. :) All of the behaviors above can be toggled at the command line.</p>
<h2>Last but by no means least: flexible and online output formats for saving signatures</h2>
<p>As we were implementing all this, it turned out to be easy to refactor in some more flexible output formats. You can now specify that <code>sketch</code>, <code>search</code>, <code>gather</code>, and <code>prefetch</code>, as well as many of the <code>sourmash sig</code> manipulation commands, should put their output signatures in a directory (<code>output/</code>), a Zip file (<code>output.zip</code>), or a compressed sig file (<code>output.sig.gz</code>). These are also streaming compatible - the zipfile and directory outputs save matches "as you go", without holding them in memory. And, since all of these can be passed into sourmash as collections to search and manipulate, we have a pleasingly complete set of storage formats!</p>
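<p>To make that concrete, here's a hypothetical sketch; the storage format is picked from the output name you give, so the same command can write to any of the three.</p>
<div class="highlight"><pre><span></span><code>import subprocess

# hypothetical sketch: a trailing slash means "directory", .zip means
# "zipfile", and .sig.gz means "compressed signature file".
for output in ("sigs/", "sigs.zip", "sigs.sig.gz"):
    subprocess.run(["sourmash", "sketch", "dna", "genome.fa",
                    "-o", output], check=True)
</code></pre></div>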
<h2>Other features</h2>
<p>The <a href="https://github.com/dib-lab/sourmash/releases/tag/v4.1.0">release notes</a> should be pretty comprehensive, and they do contain links into the pull requests (and from there, into the issues) that we addressed for this release. In particular, note that as our user base expands we're getting a wider range of issues submitted. Many of these are straightforward to fix - so this release addresses a fair number of those user requests, too.</p>
<p>Notably, this release should address a number of Windows issues around encodings and newlines; we don't yet provide wheels for Windows, but we're getting a lot closer!</p>
<h2>Internal improvements and enhanced flexibility</h2>
<p>For me the really exciting thing is the internal refactoring that underpins the features above. We've significantly reworked the internals to consolidate code, make new features easier to add, better support streaming of large signature collections, and permit many more optimizations. Coincidentally (we swear!) the refactoring also sped up some of our core operations - <code>sourmash gather</code>, in particular, is twice as fast and consumes 80-90% less memory on SBTs! Maybe that's the sign of a good refactoring? Or maybe we just got lucky...</p>
<p>sketchily yours,
--titus</p>sourmash 4.0 is now available! Low low cost if you buy now!2021-03-04T00:00:00+01:002021-03-04T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2021-03-04:/blog/2021-sourmash-v4-released.html<p>sourmash v4.0.0 is here!</p><p>So, we just <a href="https://github.com/dib-lab/sourmash/releases/tag/v4.0.0">released sourmash 4.0</a>, our Python- and Rust-based open source tool for k-mer sketch-based analysis of metagenomes and genomes.</p>
<p>The high notes of this release are -</p>
<ul>
<li>much better user experience design around creating and storing sketches;</li>
<li>removal of several obsolete features that were holding us back;</li>
<li>improved default Python API;</li>
</ul>
<p>but they really don't particularly matter, to be honest :).</p>
<p>What's most cool about this release is...</p>
<h2>Semantic versioning, feature compatibility, and deprecations</h2>
<p>...it's a release where we purposely broke compatibility with previous versions, and went through a whole deprecation effort, and documented it all.</p>
<p>We use semantic versioning for sourmash. What this means is that major versions of sourmash (v3, v4, etc.) can break backwards compatibility, but minor versions (v3.1, v3.2) cannot. In practical terms, it means that when you use sourmash in a workflow or application, you can pin your software install to the major version without worrying about breakage - e.g. specify <code>sourmash >=3,<4</code>.</p>
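<p>For example, a (hypothetical) downstream package might express that pin in its <code>setup.py</code> like so; the same constraint works in a requirements file:</p>
<div class="highlight"><pre><span></span><code>from setuptools import setup

setup(
    name="my-sourmash-workflow",   # hypothetical dependent package
    version="0.1",
    # any sourmash 3.x release is fine, but 4.0 may break us:
    install_requires=["sourmash>=3,<4"],
)
</code></pre></div>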
<p>In the case of v3.x and v4.0, we systematically upgraded and improved sourmash performance and features during 3.x, and reserved breaking features for 4.0. Further, we added warnings and deprecations to v3.5 about features that were <em>going</em> to break in v4.0. Then we wrote a <a href="https://sourmash.readthedocs.io/en/latest/support.html#migrating-from-sourmash-v3-x-to-sourmash-v4-x">migration guide</a>.</p>
<p>It was a lot of work! I probably put 40-80 hours into just this aspect of things over the last three months.</p>
<p>So... why did we do it?</p>
<h2>Why did we do all this work?</h2>
<p>We're not really sure how many people use sourmash outside the lab, but <em>in</em> the lab, we use it quite a bit. It's a pretty effective Swiss army knife tool for hacking and slashing at sequencing data, and a lot of basic questions about taxonomy and k-mer content can be answered with a little creative sourmashing.</p>
<p>And we have 5-6 workflows and pipelines that rely on sourmash.</p>
<p>So the first answer is that we did it for ourselves, so that we could robustly rely on sourmash in our workflows.</p>
<p>But the more complete answer is that we wanted to go through the semantic versioning & deprecation workflow and user communication/documentation stuff, so that we could just bake it into project expectations (for this project and for others). And we wanted to do this because we think this is the right way to do scientific software, and we wanted to communicate our expectations about changing sourmash behavior clearly and unambiguously.</p>
<h2>What do we get out of it?</h2>
<p>We're signaling to our current and prospective user base that we are open to their concerns. This results in improved user communication: for example, after our first release candidate we got some feedback that caused us to explicitly note that (1) numerical results shouldn't change and (2) old sourmash databases are still compatible with the new version.</p>
<p>We're providing a path for ourselves and future developers on how to think about pacing our changes to the software. On the one hand it's frustrating to delay cool and important changes because we're not yet ready to release a big version; on the other hand, we took the time to more completely bake some of the new features and did several rounds of documentation improvement.</p>
<p>And, frankly, I think we ended up with better code reviews and development processes internally, because we had to think explicitly about how each particular change would impact users. (FWIW, our best guess is that we have about 1,000 users.)</p>
<h2>What are the downsides?</h2>
<p>Well, it was a lot of work :). And investment in the future of an academic software project is always a gamble!</p>
<p>Also, we don't have the person power to maintain multiple releases of sourmash, so it does mean we're more or less abandoning people who want to continue using sourmash v3.x. We didn't break any particularly big features, but it does require effort on our users' side to upgrade, so maybe some people will hold off because of that. And while I'll backport fixes to really important bugs if we have any in the next few months, we don't intend to backport performance improvements or new features. So maybe users will suffer a bit from that.</p>
<h2>What's next?</h2>
<p>We have a lot of new features that will probably come out in v4.1 and beyond, now that we can switch our efforts to that! Lots of exciting stuff is coming in the areas of protein k-mers and massive-scale database search!</p>
<p>I'm already looking forward to v5, where we can remove some of the features that we deprecated for v4.0.</p>
<p>And I think it's more than time for a new JOSS paper...</p>
<p>--titus</p>sourmash v4.0.0 release candidate 1 is now available for comment!2021-02-19T00:00:00+01:002021-02-19T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2021-02-19:/blog/2021-sourmash-v4-rc1-now-available.html<p>sourmash v4.0.0 is coming!</p><p>Hello everyone,</p>
<p>we are happy to announce the (imminent ;) release of sourmash 4.0, and present sourmash v3.5.1 and sourmash v4.0.0rc1 (release candidate 1) for your comments and questions!</p>
<p>sourmash is a command-line tool + Python & Rust library for quickly searching, comparing, and analyzing genomic and metagenomic data sets.</p>
<p>If you use sourmash regularly and are interested in upgrading, we are providing you with this release candidate so you can try out the migration guide and the new/revised functionality.</p>
<p>Draft release notes for 4.0.0 are <a href="https://github.com/dib-lab/sourmash/releases/tag/v4.0.0rc1">here</a>, and we have a <a href="https://sourmash.readthedocs.io/en/latest/support.html#migrating-from-sourmash-v3-x-to-sourmash-v4-x">migration guide as well</a>.</p>
<p>Please note that sourmash uses <a href="https://sourmash.readthedocs.io/en/latest/support.html#versioning-and-stability-of-features-and-apis">semantic versioning</a>, so v3.5.1 should not break any features or functionality. You should <a href="https://sourmash.readthedocs.io/en/latest/support.html#version-pinning">version pin</a> your sourmash dependencies to <code>>=3,<4</code> if you want to continue using sourmash as before.</p>
<h2>sourmash v3.5.1 is the last release of v3.x</h2>
<p>sourmash v3.5.1 should be the last release of sourmash v3. It adds warnings for features changed in 4.0.</p>
<p>More info on 3.5.1: https://github.com/dib-lab/sourmash/releases/tag/v3.5.1</p>
<p>You can install it like so from <a href="https://pypi.org/project/sourmash/3.5.1/">PyPI</a>:</p>
<div class="highlight"><pre><span></span><code>pip install sourmash==3.5.1
</code></pre></div>
<p>or with conda from bioconda and conda-forge:</p>
<div class="highlight"><pre><span></span><code>conda install -c conda-forge -c bioconda sourmash=3.5.1
</code></pre></div>
<h2>sourmash v4.0.0 is coming soon!</h2>
<p>sourmash v4.0.0rc1 is a feature-complete release of 4.0, with full migration docs. It contains many improvements and some breaking changes from 3.x.</p>
<p>Please see https://github.com/dib-lab/sourmash/releases/tag/v4.0.0rc1 for details!</p>
<p>To install <a href="https://pypi.org/project/sourmash/4.0.0rc1/">sourmash v4rc1 from PyPI</a>, please use:</p>
<div class="highlight"><pre><span></span><code>pip install --pre sourmash==4.0.0rc1
</code></pre></div>
<p>(You can also install sourmash v3.5.1 from conda to get the
dependencies, and then upgrade to the latest version using pip.)</p>
<h2>Feedback requested!</h2>
<p>We would very much appreciate feedback on the new features in sourmash, as well as comments and questions about upgrading. Please put comments on <a href="https://github.com/dib-lab/sourmash/issues/1338">the migration issue</a>.</p>
<p>C. Titus Brown and Luiz Irber</p>
<p>(for the sourmash development team :)</p>Transition your Python project to use pyproject.toml and setup.cfg! (An example.)2021-02-02T00:00:00+01:002021-02-02T00:00:00+01:00Titus Browntag:ivory.idyll.org,2021-02-02:/blog/2021-transition-to-pyproject.toml-example.html<p>Updating old Python packages, in this year of the PSF 2021!</p><p><em>Thanks to Luiz Irber for all the pre-work on sourmash, as well as the code reviews on screed; and Brett Cannon for a review of an earlier version of this blog post!</em></p>
<p>The future of Python packaging is pyproject.toml, and (for now) setup.cfg, based on <a href="https://snarky.ca/clarifying-pep-518/">PEP 518</a> and (soon) <a href="https://www.python.org/dev/peps/pep-0621/">PEP 621</a>.</p>
<p>For some background, please read <a href="https://snarky.ca/what-the-heck-is-pyproject-toml/">"What the heck is pyproject.toml"</a>, and also <a href="https://discuss.python.org/t/where-to-get-started-with-pyproject-toml/4906">Where to get started with pyproject.toml?</a></p>
<p>The <a href="https://setuptools.readthedocs.io/en/latest/build_meta.html">relevant setuptools docs</a> have been updated to reflect the new toolchain, too!</p>
<p>My takeaway from all of this is:</p>
<ul>
<li>configuration files are better than scripts</li>
<li>a few standard configuration files are better than many</li>
<li>declarative/static is better than procedural/dynamic</li>
</ul>
<p>OK! <em>rubs hands with glee</em> let's do this!</p>
<h2>"But what do I actually <em>do</em>?"</h2>
<p>Brett's post is pretty excellent and was really informative for me, but I have a high tolerance for reading lots of text! It's probably a bit long for people who just want to update their project, though ;).</p>
<p>So I decided to give it a try myself and then post an example!</p>
<p>Recently, I wanted to release a new version of <a href="https://github.com/dib-lab/screed/">screed</a>, in order to get rid of some DeprecationWarnings for the release of <a href="https://github.com/dib-lab/sourmash/">sourmash 4.0</a>. Now, screed is a remarkably ...stable project, by which I mean it does the thing we need it to do and no more, and we're not changing it at all.</p>
<p>BUT. Screed was based on an old school setup.py. So, inspired by Luiz Irber's updating of sourmash to use pyproject.toml, I updated screed similarly. (It was REALLY helpful to have an example!)</p>
<h2>tl;dr</h2>
<p><a href="https://github.com/dib-lab/screed/pull/83/files">Here is the diff.</a></p>
<p>In brief,</p>
<ul>
<li>your <code>pyproject.toml</code> can be very close to boilerplate.<ul>
<li>It's basically the three lines that Brett posted...</li>
<li>the additional stuff for screed has to do with <a href="https://pypi.org/project/setuptools-scm/">setuptools_scm</a>, which we're using to automatically convert git tags like <code>v1.0.4</code> into actual version numbers.</li>
</ul>
</li>
<li>your <code>setup.cfg</code> basically contains almost everything your <code>setup.py</code> contained, just a bit reformatted to fit into the <a href="https://docs.python.org/3/distutils/configfile.html">setup.cfg format</a>.</li>
<li>your new <code>setup.py</code> can now be a really short stub to permit <code>python setup.py ...</code> to continue to work (see the sketch just after this list).</li>
</ul>
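<p>For reference, that stub can be as short as this minimal sketch:</p>
<div class="highlight"><pre><span></span><code>#!/usr/bin/env python
# minimal stub: the real metadata lives in setup.cfg/pyproject.toml;
# this file exists only so that 'python setup.py ...' keeps working.
from setuptools import setup

setup()
</code></pre></div>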
<p>I hope this helps! Comment and ask questions as you have them!</p>
<p>(Also: I just released <a href="https://github.com/dib-lab/screed/releases/tag/v1.0.5">screed v1.0.5!</a> :tada:)</p>
<p>--titus</p>A snakemake hack for checkpoints2021-01-25T00:00:00+01:002021-01-25T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2021-01-25:/blog/2021-snakemake-checkpoints.html<p>snakemake checkpoints r awesome</p><p>As I get deeper and deeper into using the excellent snakemake workflow system for ...everything, I have had to learn how to use checkpoints. I ended up hacking together an approach that made checkpoints easy for me, and now I'm caught between being proud of it and wondering if it's Actually Bad. So I thought I'd share and see what y'all thought.</p>
<p>(Thanks to Taylor Reiter, Tessa Pierce, and Luiz Irber for their comments on early drafts of this blog post!)</p>
<h2>What are checkpoints for?</h2>
<p>By default, snakemake figures out what to run based on the rules in the Snakefile and whatever files are present in the working space. It implements this using a simple but incredibly powerful pattern matching technique that is executed at the very beginning of the run.</p>
<p>The one big problem with doing everything at the beginning of the run is that if you don't know exactly which files are going to be produced by a particular step, you can't write a regular rule to depend on them.</p>
<p>For example, suppose you want to run a BLAST search against a query sequence, and then for each BLAST match you want to download the matching sequence and do more analysis. snakemake could handle doing the BLAST easily enough, but the rule that downloads matching sequences would have somewhere between 0 and N outputs. How many wouldn't be known until the BLAST was done!</p>
<p>There are a few different approaches you can use --</p>
<ul>
<li>simply having multiple workflows (I did this for a while :)</li>
<li><a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-files">snakemake dynamic</a></li>
<li><a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#data-dependent-conditional-execution">snakemake checkpoints</a></li>
</ul>
<p>This blog post is about the last option, which is (apparently) the stable approach, going forward!</p>
<h2>What do checkpoints do?</h2>
<p>Briefly put, checkpoints trigger a re-evaluation of the Snakefile rules in light of new information.</p>
<p>For each checkpoint, snakemake looks at the rules that depend on the checkpoint's output, and holds off on evaluating their inputs until the checkpoint has run. Once it completes, those rules are evaluated and new jobs are entered into snakemake's TODO list.</p>
<h2>How do checkpoints work, under the hood?</h2>
<p>It took me a surprisingly large amount of time to figure out these details, so I'm going to share in case others are in a similar boat :).</p>
<p>There is a <code>checkpoints</code> namespace.</p>
<p>When you create a checkpoint, it is entered into this namespace.</p>
<p>When another rule's input <em>refers</em> to a checkpoint to get its outputs, by calling <code>checkpoints.<name>.get(...)</code>, snakemake raises an exception. This exception tells snakemake to defer evaluation of the checkpoint's outputs until the checkpoint has actually run; snakemake tracks the calling rule and waits.</p>
<p>Once the checkpoint is executed, the output becomes available and the rules that depend on it are re-evaluated.</p>
<h2>An example of the syntax</h2>
<p>The syntax is straightforward - you define checkpoints
the same way you do rules, and then you refer to the
checkpoint in <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-input-functions">an input function</a>.</p>
<div class="highlight"><pre><span></span><code><span class="n">checkpoint</span> <span class="n">a</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span> <span class="n">touch</span><span class="p">(</span><span class="s2">"a.out"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">input_for_b</span><span class="p">(</span><span class="o">*</span><span class="n">wildcards</span><span class="p">):</span>
<span class="k">return</span> <span class="n">checkpoints</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">get</span><span class="p">()</span><span class="o">.</span><span class="n">output</span>
<span class="n">rule</span> <span class="n">b</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">input_for_b</span>
<span class="n">run</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'input is: </span><span class="si">{</span><span class="nb">input</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
</code></pre></div>
<p>and if you run <a href="https://github.com/ctb/2021-snakemake-checkpoints-example/blob/latest/Snakefile.example">rule b in this example</a> like so,</p>
<div class="highlight"><pre><span></span><code><span class="c">% snakemake -j 1 -s Snakefile.example b</span>
</code></pre></div>
<p>you will see <code>input is: a.out</code>.</p>
<p>Note this example is a bit useless, though, because in this case you could make <code>checkpoint a</code> a rule; it doesn't do anything here that requires it to be a checkpoint. Specifically, the output of rule <code>a</code> and input of rule <code>b</code> are both known.</p>
<p>Nonetheless, I think it serves as a useful example of the syntax:</p>
<ul>
<li>the output of the checkpoint must be something that fits into the snakemake rules - a filename or a wildcard pattern or something specific.</li>
<li>the rules that depend on this checkpoint need to use a <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-input-functions">function as an input</a>, so that snakemake can <em>try</em> to run it and generate the exception that lets it know this depends on a checkpoint.</li>
<li>the input function must take a list of potential wildcards, even if there are no wildcards and/or the wildcards aren't used.</li>
</ul>
<h2>A real example: making a spreadsheet dynamically, and then using that spreadsheet</h2>
<p><a href="https://github.com/ctb/2021-snakemake-checkpoints-example/blob/latest/Snakefile.random">Here is an example Snakefile</a> that is closer to how I use checkpoints in real Snakefiles.</p>
<p>Briefly, a rule <code>make_spreadsheet</code> builds a spreadsheet with some filenames in it (here, the entries are random, but it could be doing something useful, like running BLAST).</p>
<p>Then, I define a checkpoint that waits for that file to be created, and ...does nothing.</p>
<p>Last, I define a rule that depends on that checkpoint. This rule reads in all the names from the spreadsheet and then builds a list of output filenames, <code>output-{name}.txt</code>, where <code>{name}</code> is taken from the spreadsheet.</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">make_all_files</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">Checkpoint_MakePattern</span><span class="p">(</span><span class="s2">"output-</span><span class="si">{name}</span><span class="s2">.txt"</span><span class="p">)</span>
</code></pre></div>
<p>The individual files are created by another rule; <code>make_all_files</code> just has the responsibility of laying out the list of files to be created - a list built by the <code>Checkpoint_MakePattern</code> class, discussed a few paragraphs below.</p>
<p>The interesting thing here is that the checkpoint doesn't really do anything; it just requires that the <code>names.csv</code> file exist (triggering the correct upstream rule), and it touches a file (because, as it turns out, checkpoints <em>must</em> have an output.)</p>
<div class="highlight"><pre><span></span><code><span class="c1"># second rule, a checkpoint for rules that depend on contents of "count.csv"</span>
<span class="n">checkpoint</span> <span class="n">check_csv</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"names.csv"</span>
<span class="n">output</span><span class="p">:</span> <span class="c1"># checkpoints _must_ have output.</span>
<span class="n">touch</span><span class="p">(</span><span class="s2">".make_spreadsheet.touch"</span><span class="p">)</span>
</code></pre></div>
<p>The "magic" here is in the <code>Checkpoint_MakePattern</code> class, which I defined. This class takes in and saves a pattern:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Checkpoint_MakePattern</span><span class="p">:</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pattern</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">pattern</span> <span class="o">=</span> <span class="n">pattern</span>
</code></pre></div>
<p>and then, when called as part of the input function in <code>make_all_files</code>, it (a) waits for the checkpoint, (b) gets the names from the CSV file (<code>get_names()</code> call), and (c) expands the pattern with the names from the CSV file:</p>
<div class="highlight"><pre><span></span><code> <span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">w</span><span class="p">):</span>
<span class="k">global</span> <span class="n">checkpoints</span>
<span class="c1"># wait for the results of 'check_csv'; this will trigger an</span>
<span class="c1"># exception until that rule has been run.</span>
<span class="n">checkpoints</span><span class="o">.</span><span class="n">check_csv</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="o">**</span><span class="n">w</span><span class="p">)</span>
<span class="c1"># the magic, such as it is, happens here: we create the</span>
<span class="c1"># information used to expand the pattern, using arbitrary</span>
<span class="c1"># Python code.</span>
<span class="n">names</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">get_names</span><span class="p">()</span>
<span class="n">pattern</span> <span class="o">=</span> <span class="n">expand</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">pattern</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="n">names</span><span class="p">,</span> <span class="o">**</span><span class="n">w</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pattern</span>
</code></pre></div>
<p>The only application-specific bit of code is in <code>get_names()</code>, which reads in the CSV:</p>
<div class="highlight"><pre><span></span><code> <span class="k">def</span> <span class="nf">get_names</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'names.csv'</span><span class="p">,</span> <span class="s1">'rt'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fp</span><span class="p">:</span>
<span class="n">names</span> <span class="o">=</span> <span class="p">[</span> <span class="n">x</span><span class="o">.</span><span class="n">rstrip</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">fp</span> <span class="p">]</span>
<span class="k">return</span> <span class="n">names</span>
</code></pre></div>
<p>This function can do pretty much anything it needs to do, and could (in cases where a bunch of output files are created) be replaced with snakemake's <a href="https://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-do-i-run-my-rule-on-all-files-of-a-certain-directory"><code>glob_wildcards</code> function</a>.</p>
<h2>Another example: taking a count from a file.</h2>
<p><a href="https://github.com/ctb/2021-snakemake-checkpoints-example/blob/latest/Snakefile.count">Here is another Snakefile</a> that outputs <code>h+2</code> (where h is the current hour of the day) to a file <code>count.txt</code>.
The number in <code>count.txt</code> is then used to create files named "output-1.txt" through "output-{n}.txt".</p>
<p>Clearly snakemake's runtime analyzer can't know how many files are going to be output up front, so the Snakefile uses a checkpoint to read in the hour from <code>count.txt</code>, and then uses <code>expand</code> to generate the output file patterns:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">make_file</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"output-</span><span class="si">{n}</span><span class="s2">.txt"</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s2">"echo hello, world > </span><span class="si">{output}</span><span class="s2">"</span>
<span class="n">rule</span> <span class="n">make_all_files</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">Checkpoint_MakePattern</span><span class="p">(</span><span class="s2">"output-</span><span class="si">{n}</span><span class="s2">.txt"</span><span class="p">)</span>
</code></pre></div>
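<p>As in the earlier example, the application-specific bit lives in <code>get_names()</code>, which for this Snakefile might look something like the following sketch (hypothetical; the linked Snakefile has the real details):</p>
<div class="highlight"><pre><span></span><code>    def get_names(self):
        # read n from count.txt and generate the names 1..n, so that
        # the pattern expands to output-1.txt ... output-n.txt
        with open('count.txt', 'rt') as fp:
            n = int(fp.read().strip())
        return [str(i) for i in range(1, n + 1)]
</code></pre></div>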
<h2>A third example - reimplementing <code>dynamic</code></h2>
<p>Luiz made an interesting comment when he read a draft of this blog post: he pointed out that this gets pretty close to the <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-files"><code>dynamic</code> behavior</a>. So I thought I'd (try) to reimplement that!</p>
<p>The result is <a href="https://github.com/ctb/2021-snakemake-checkpoints-example/blob/latest/Snakefile.dynamic">here, in Snakefile.dynamic</a>.</p>
<p>The <code>make_files</code> rule makes a bunch of files (mimicking clustering output, for example). Then the <code>Checkpoint_MakePattern</code> class uses <code>glob_wildcards</code> to figure out what files there are and extract wildcards, which it uses to fill in the pattern:</p>
<div class="highlight"><pre><span></span><code> <span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">w</span><span class="p">):</span>
<span class="k">global</span> <span class="n">checkpoints</span>
<span class="c1"># wait for the results of 'check_csv'; this will trigger an</span>
<span class="c1"># exception until that rule has been run.</span>
<span class="n">checkpoints</span><span class="o">.</span><span class="n">make_files</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="o">**</span><span class="n">w</span><span class="p">)</span>
<span class="c1"># use glob_wildcards to find the (as-yet-unknown) new files.</span>
<span class="n">names</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s1">'output-</span><span class="si">{rs}</span><span class="s1">.txt'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">pattern</span> <span class="o">=</span> <span class="n">expand</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">pattern</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="n">names</span><span class="p">,</span> <span class="o">**</span><span class="n">w</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pattern</span>
</code></pre></div>
<p>For example, this rule transforms all of the <code>output-{random}.txt</code> files into <code>output-{random}.round2</code> names:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># final rule that depends on that checkpoint and transforms</span>
<span class="c1"># dynamically created files into something else.</span>
<span class="n">rule</span> <span class="n">make_patterns</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s1">'.make_rs_files.touch'</span><span class="p">,</span>
<span class="n">Checkpoint_MakePattern</span><span class="p">(</span><span class="s1">'output-</span><span class="si">{name}</span><span class="s1">.round2'</span><span class="p">)</span>
</code></pre></div>
<p>A bonus feature is that you can easily compute a summary across all the files like so:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># bonus rule that does something with all the files</span>
<span class="n">rule</span> <span class="n">make_summary</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s1">'.make_rs_files.touch'</span><span class="p">,</span>
<span class="n">files</span><span class="o">=</span><span class="n">Checkpoint_MakePattern</span><span class="p">(</span><span class="s1">'output-</span><span class="si">{name}</span><span class="s1">.txt'</span><span class="p">)</span>
<span class="n">output</span><span class="p">:</span>
<span class="s1">'output-random.summary'</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> cat </span><span class="si">{input.files}</span><span class="s2"> > </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<h2>This works, but is it a good way to do things?</h2>
<p>The <code>Checkpoint_MakePattern</code> code that I used above gave me a simple way to make use of checkpoints. I largely ignored the internal snakemake mechanism for passing around information that is laid out in the docs and in (e.g.) <a href="https://evodify.com/snakemake-checkpoint-tutorial/">this very useful blog post</a>.</p>
<p>I just write Python code that (a) triggers the checkpoint exception and then (b) Does Something in pure Python to spit out a list of files to be created.</p>
<p>I've used essentially this same code a few times now, and I like it a lot! But I would love feedback as to whether I'm doing something unnatural here :), or if I'm missing something that's really much simpler. Feedback welcome!</p>
<p>--titus</p>Improved workflows-as-applications: tips and tricks for building applications on top of snakemake2020-08-06T00:00:00+02:002020-08-06T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2020-08-06:/blog/2020-improved-workflows-as-applications.html<p>Writing applications around workflow systems, take 2.</p><p>(Thanks to Camille Scott, Phillip Brooks, Charles Reid, Luiz Irber, Tessa Pierce, and Taylor Reiter for all their efforts over the years! Thanks also to Silas Kieser for his work on <a href="https://github.com/metagenome-atlas/atlas">ATLAS</a>, which gave us inspiration and some working code :).)</p>
<p>A while back, <a href="http://ivory.idyll.org/blog/2018-workflows-applications.html">I wrote about workflows as applications</a>, in which I talked about how Camille Scott had written dammit (link below) in the pydoit workflow system, and released it as an application. In doing so, Camille made a fundamental observation: many bioinformatics tools are wrappers that run other bioinformatics tools, and that is literally what workflow tools are designed to do!</p>
<p>Since that post, we've doubled down on workflow systems, improved and adapted our <a href="http://ivory.idyll.org/blog/2020-software-and-workflow-dev-practices.html">good enough in-lab practices for software and workflow development</a>, and written <a href="https://dib-lab.github.io/2020-workflows-paper/">a paper on workflow systems</a> - <a href="https://www.biorxiv.org/content/10.1101/2020.06.30.178673v1">(also on bioRxiv)</a>.</p>
<p>Projects that we write this way end up consisting of large collections of interrelated Python scripts (more on how we manage that later - see <a href="https://github.com/spacegraphcats/spacegraphcats/tree/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/search">e.g. the spacegraphcats.search package for an example</a>). This strategy also allows integration of multiple different languages under a single umbrella, including (potentially) R scripts and bash scripts and... whatever else you want :). </p>
<p>As part of this effort, we've developed much improved practices around better (more functional) user experiences with our software. In this blog post, I'm going to talk about some of these - read on for details!</p>
<p>This post extracts experience from the following in-lab projects: </p>
<ul>
<li><a href="https://github.com/dib-lab/dammit">the dammit transcriptome annotator</a>,</li>
<li><a href="https://github.com/dahak-metagenomics/dahak">the dahak metagenomics pipeline</a>,</li>
<li><a href="https://github.com/spacegraphcats/spacegraphcats">the spacegraphcats metagenomics graph query software</a>,</li>
<li><a href="https://github.com/dib-lab/elvers">the elvers de novo transcriptome pipeline</a>,</li>
<li>and <a href="https://github.com/dib-lab/charcoal">the charcoal genome decontamination pipeline</a>.</li>
</ul>
<h2>Some background: how do we build applications on top of snakemake?</h2>
<p>We've done this quite a few times now, and there are 3 parts to the pattern:</p>
<p>first, we build a <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/conf/Snakefile">Snakefile</a> that does the things we want to do, and stuff it into a Python package.</p>
<p>second, we create a Python entry point (<a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/__main__.py">see <code>__main__</code> in spacegraphcats</a>) that calls snakemake - in <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/__main__.py#L128">this case</a> it does it by calling the Python API (but see below for better options).</p>
<p>third, in that entry point we <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/__main__.py">load config files, salt in our own overrides, and otherwise customize the snakemake session</a>.</p>
<p>and voila, now when you call that entry point, you run a custom-configured snakemake that runs whatever workflows are needed to create the specified targets! See for example <a href="https://github.com/spacegraphcats/spacegraphcats/blob/master/doc/running-spacegraphcats.md#running-spacegraphcats-search--output-files">the docs on running spacegraphcats</a>.</p>
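<p>Putting those three parts together, a minimal (hypothetical) entry point looks something like this - here calling the snakemake executable via subprocess, which is one of the "better options" discussed below:</p>
<div class="highlight"><pre><span></span><code>#!/usr/bin/env python
"""Hypothetical minimal __main__.py wrapping a packaged Snakefile."""
import os
import subprocess
import sys


def main(args):
    # find the Snakefile that ships inside this package
    thisdir = os.path.dirname(__file__)
    snakefile = os.path.join(thisdir, "conf", "Snakefile")

    # hand the user's targets and options off to snakemake
    cmd = ["snakemake", "-s", snakefile, "--use-conda"] + list(args)
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
</code></pre></div>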
<h2>Problems that we've run into, and their solutions.</h2>
<p>The strategy above works great in general, but there are a few annoying problems that have popped up over time.</p>
<ul>
<li>we want more flexible config than is provided by a single config file.</li>
<li>we want to distribute jobs from our application across clusters.</li>
<li>we don't want to have to manually implement all of snakemake's (many) command line options and functionality.</li>
<li>we want to support better testing!</li>
<li>we want to run our applications from within Snakemake workflows.</li>
</ul>
<p>So, over time, we've come up with the following solutions. Read on!</p>
<h3>Stacking config files</h3>
<p>One thing we've been doing for a while is providing configuration options via a YAML file (see e.g. <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/conf/twofoo.yaml">spacegraphcats config files</a>). But once you've got more than a few config files, you end up with a whole host of options in common and only a few config parameters that you change for each run.</p>
<p>With our newer project, charcoal, I decided to try out stacking config files, so that there's an <strong>installation-wide</strong> set of defaults and config parameters, as well as a <strong>project-specific</strong> config.</p>
<p>This makes it possible to have sensible defaults that can be overridden easily on a per-project basis.</p>
<p>The way this works with snakemake is that you supply one or more JSON or YAML files <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/conf/system.conf">like this</a> to snakemake. Snakemake then loads them all in order and supplies the parameters <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/Snakefile">in the Snakefile namespace via the <code>config</code> variable</a>.</p>
<p>The Python code to do this via the wrapper command-line is <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/__main__.py#L40">pretty straightforward - you make a list of all the config files and supply that to <code>subprocess</code>!</a></p>
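<p>In sketch form (paths are hypothetical; snakemake applies <code>--configfile</code> arguments in order, so later files override earlier ones):</p>
<div class="highlight"><pre><span></span><code>import os
import subprocess


def run_snakemake(project_config, extra_args=()):
    # installation-wide defaults first, then the project config;
    # later config files override earlier ones.
    thisdir = os.path.dirname(__file__)
    system_config = os.path.join(thisdir, "conf", "system.conf")

    cmd = ["snakemake", "-s", os.path.join(thisdir, "Snakefile"),
           "--configfile", system_config, project_config]
    cmd.extend(extra_args)
    return subprocess.run(cmd).returncode
</code></pre></div>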
<h3>Supporting snakemake job management on clusters</h3>
<p>Snakemake conveniently supports <a href="https://snakemake.readthedocs.io/en/v5.1.4/executable.html#cluster-execution">cluster execution</a>, where you can distribute jobs across HPC clusters.</p>
<p>With both spacegraphcats and elvers, we couldn't get this to work at first. This is because we were <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/__main__.py#L128">calling snakemake via its Python API</a>, while the cluster execution engine wanted to call snakemake at the command line and couldn't figure out how to do that properly in our application setup.</p>
<p>The <a href="https://github.com/metagenome-atlas/atlas">ATLAS</a> folk had figured this out, though: ATLAS uses subprocess to run the snakemake executable, and when I was writing charcoal, I <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/__main__.py#L51">tried doing that instead</a>. It works great, and is surprisingly much easier than using the Python API!</p>
<p>So, now our applications can take full advantage of snakemake's underlying cluster distribution functionality!</p>
<h3>Supporting snakemake's (many) parameters</h3>
<p>With spacegraphcats, the first application we built on snakemake, we implemented a kind of janky parameter passing thing where we <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/__main__.py#L45">just mapped our own parameters over to snakemake parameters explicitly</a>.</p>
<p>However, snakemake has <em>tons</em> of command line arguments that do useful things, and it's really annoying to reimplement them all. So in charcoal, <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/__main__.py#L67">we switched from argparse to click for argument parsing</a>, and simply pass all "extra" arguments on to snakemake.</p>
<p>This occasionally leads to weird logic like <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/__main__.py#L29">the code needed to support<code>--no-use-conda</code></a>, where we by default pass <code>--use-conda</code> to snakemake, and then have to override that to turn it off. But by and large it's worked out quite smoothly.</p>
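<p>Here's roughly what that pattern looks like with click - a hypothetical sketch, not charcoal's actual code:</p>
<div class="highlight"><pre><span></span><code>import subprocess
import sys

import click


@click.command(context_settings={"ignore_unknown_options": True})
@click.option("--no-use-conda", is_flag=True, default=False)
@click.argument("snakemake_args", nargs=-1, type=click.UNPROCESSED)
def run(no_use_conda, snakemake_args):
    # anything click doesn't recognize is passed straight to snakemake
    cmd = ["snakemake", "-s", "Snakefile"]
    if not no_use_conda:
        cmd.append("--use-conda")   # on by default; --no-use-conda disables
    cmd.extend(snakemake_args)
    sys.exit(subprocess.run(cmd).returncode)


if __name__ == "__main__":
    run()
</code></pre></div>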
<h3>A drop-in module for a command-line API</h3>
<p>As we build more applications this way, we're starting to recognize commonalities in the use cases. Recently I wanted to upgrade the spacegraphcats CLI to take advantage of lessons learned, and so I <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/__main__.py">copied the charcoal __main__.py</a> over to <a href="https://github.com/spacegraphcats/spacegraphcats/blob/d049876a2f4c452fe9ea42a0db70b7c2f3b6112d/spacegraphcats/click.py">spacegraphcats.click</a> and started editing it. Somewhat to my surprise, it was really easy to adapt to spacegraphcats - like, 15 minutes easy!</p>
<p>So, we're pretty close to having a "standard" entry point module that we can copy between projects and quickly customize.</p>
<h3>Testing, testing, testing!</h3>
<p>We get a lot of value from writing automated functional and integration tests for our command-line apps; they help pin down functionality and make sure it's still working over time.</p>
<p>However, with spacegraphcats, I really struggled to write good tests. It's hard to test the whole workflow when you have piles of interacting Python scripts in a workflow - e.g. the <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/search/test_workflow.py">workflow tests</a> are terrible: clunky to write and hard to modify.</p>
<p>In contrast, once I had the new command-line API working, I had the tools to make really nice and simple workflow tests that relied on snakemake underneath - see <a href="https://github.com/spacegraphcats/spacegraphcats/blob/d049876a2f4c452fe9ea42a0db70b7c2f3b6112d/spacegraphcats/test_snakemake.py">test_snakemake.py</a>. Now our tests look like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">def</span><span class="w"> </span><span class="n">test_dory_build_cdbg</span><span class="p">()</span><span class="err">:</span>
<span class="w"> </span><span class="k">global</span><span class="w"> </span><span class="n">_tempdir</span>
<span class="w"> </span><span class="n">dory_conf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">utils</span><span class="p">.</span><span class="n">relative_file</span><span class="p">(</span><span class="s1">'spacegraphcats/conf/dory-test.yaml'</span><span class="p">)</span>
<span class="w"> </span><span class="n">target</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'dory/bcalm.dory.k21.unitigs.fa'</span>
<span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">run_snakemake</span><span class="p">(</span><span class="n">dory_conf</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="o">=</span><span class="k">True</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="o">=</span><span class="n">_tempdir</span><span class="p">,</span>
<span class="w"> </span><span class="n">extra_args</span><span class="o">=[</span><span class="n">target</span><span class="o">]</span><span class="p">)</span>
<span class="w"> </span><span class="n">assert</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">assert</span><span class="w"> </span><span class="n">os</span><span class="p">.</span><span class="k">path</span><span class="p">.</span><span class="ow">exists</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="k">path</span><span class="p">.</span><span class="k">join</span><span class="p">(</span><span class="n">_tempdir</span><span class="p">,</span><span class="w"> </span><span class="n">target</span><span class="p">))</span>
</code></pre></div>
<p>which is about as simple as you can get - specify config file and a target, run snakemake, check that the file exists.</p>
<p>The one tricky bit in <a href="https://github.com/spacegraphcats/spacegraphcats/blob/d049876a2f4c452fe9ea42a0db70b7c2f3b6112d/spacegraphcats/test_snakemake.py">test_snakemake.py</a> is that the tests should be run in a particular order, because they build on each other. (You can actually run them in any order you want, because snakemake will create the files as needed, but it makes the test steps take longer.)</p>
<p>I ended up using <a href="https://github.com/spacegraphcats/spacegraphcats/blob/d049876a2f4c452fe9ea42a0db70b7c2f3b6112d/spacegraphcats/test_snakemake.py">pytest-dependency</a> to recapitulate which steps in the workflow depended on each other, and now I have a fairly nice granular breakdown of tests, and they seem to work well.</p>
<p>(I'm still stuck on how to ensure that the outputs of the tests have the correct content, but that's a problem for another day :).)</p>
<h3>Using workflows inside of workflows</h3>
<p>Last but not least, we tend to want to run our applications <em>within</em> workflows. This is true even when our applications <em>are</em> workflows :).</p>
<p>However, we ran into a little bit of a problem with paths. Because snakemake relies heavily on file system paths, the applications we built on top of snakemake had fairly hardcoded outputs. For example, spacegraphcats produces lots of directories like <code>genome_name</code>, <code>genome_name_k31_r1</code>, <code>genome_name_k31_r1_search</code>, etc. that have to be in the working directory. This turns into an ugly mess for any reasonably complicated workflow.</p>
<p>So, we took advantage of <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#configure-working-directory">snakemake's <code>workdir:</code> parameter</a> to provide a command-line feature in our applications that would stuff all of the outputs in a particular directory.</p>
<p>This, however, meant some <em>input</em> locations needed to be adjusted to absolute rather than relative paths. Snakemake handled this automatically for filenames specified in the Snakefile, but for paths loaded from config files, <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/conf/Snakefile#L19">we had to do it manually</a>. This turned out to be quite easy and works robustly!</p>
<p>You can see an example of this usage <a href="https://github.com/dib-lab/2020-ibd/blob/50404a7cfcd41ce1ed809ba8aaddb1939e279e8e/Snakefile#L941">here</a>. The <code>--outdir</code> parameter tells spacegraphcats to just put everything under a particular location.</p>
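<p>A hypothetical Snakefile fragment showing the trick - absolutize the config-supplied inputs <em>before</em> <code>workdir:</code> takes effect:</p>
<div class="highlight"><pre><span></span><code>import os

# input paths come from the user's config file and may be relative to
# wherever they ran the command - make them absolute *before* workdir:
# redirects all relative paths. (config keys here are hypothetical.)
config['input_sequences'] = os.path.abspath(config['input_sequences'])

# everything the workflow writes now lands under the output directory
workdir: config.get('outdir', 'outputs')
</code></pre></div>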
<h2>Concluding thoughts</h2>
<p>I've been pleasantly surprised at how easy it has been to build applications on top of snakemake. We've accumulated some good experience with this, and have some fairly robust and re-usable code that solves many of our problems. I hope you find it useful!</p>
<p>--titus</p>sourmash databases as zip files, in sourmash v3.3.02020-05-07T00:00:00+02:002020-05-07T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2020-05-07:/blog/2020-sourmash-databases-as-zip-files.html<p>Use compressed databases directly!</p><p>The feature that I'm most excited about in <a href="https://github.com/dib-lab/sourmash/releases/tag/v3.3.0">sourmash 3.3.0</a> is the ability to directly use <em>compressed</em> SBT search databases.</p>
<p>Previously, if you wanted to search (say) 100,000 genomes from GenBank, you'd have to download a several-GB .tar.gz file, and then uncompress it to ~20 GB before searching it. The time and disk space requirements for this were major barriers for teaching and use.</p>
<p>In v3.3.0, <a href="https://twitter.com/luizirber/">Luiz Irber</a> fixed this by, first, releasing the <a href="https://lib.rs/crates/niffler">niffler</a> Rust library with <a href="https://twitter.com/pierre_marijon">Pierre Marijon</a>, to read and write compressed files; second, replacing our old khmer Bloom filter nodegraph with a Rust implementation (<a href="https://github.com/dib-lab/sourmash/pull/799">sourmash PR #799</a>); and, third, adding direct zip file storage (<a href="https://github.com/dib-lab/sourmash/pull/648">sourmash #648</a>).</p>
<p>So, as of the latest release, you can do the following:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># install sourmash v3.3.0</span>
<span class="n">conda</span><span class="w"> </span><span class="n">create</span><span class="w"> </span><span class="o">-</span><span class="n">y</span><span class="w"> </span><span class="o">-</span><span class="n">n</span><span class="w"> </span><span class="n">sourmash</span><span class="o">-</span><span class="n">demo</span><span class="w"> </span>\
<span class="w"> </span><span class="o">-</span><span class="n">c</span><span class="w"> </span><span class="n">conda</span><span class="o">-</span><span class="n">forge</span><span class="w"> </span><span class="o">-</span><span class="n">c</span><span class="w"> </span><span class="n">bioconda</span><span class="w"> </span><span class="n">sourmash</span><span class="o">=</span><span class="mf">3.3</span><span class="o">.</span><span class="mi">0</span>
<span class="c1"># activate environment</span>
<span class="n">conda</span><span class="w"> </span><span class="n">activate</span><span class="w"> </span><span class="n">sourmash</span><span class="o">-</span><span class="n">demo</span>
<span class="c1"># download the 25k GTDB release89 guide database (~1.4 GB)</span>
<span class="n">curl</span><span class="w"> </span><span class="o">-</span><span class="n">L</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">osf</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="mi">5</span><span class="n">mb9k</span><span class="o">/</span><span class="n">download</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">gtdb</span><span class="o">-</span><span class="n">release89</span><span class="o">-</span><span class="n">k31</span><span class="o">.</span><span class="n">sbt</span><span class="o">.</span><span class="n">zip</span>
<span class="c1"># grab a genome signature - here, download a demo one from OSF</span>
<span class="n">curl</span><span class="w"> </span><span class="o">-</span><span class="n">L</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">osf</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">vhnk4</span><span class="o">/</span><span class="n">download</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">genome</span><span class="o">.</span><span class="n">sig</span>
<span class="c1"># search!</span>
<span class="n">sourmash</span><span class="w"> </span><span class="n">search</span><span class="w"> </span><span class="n">genome</span><span class="o">.</span><span class="n">sig</span><span class="w"> </span><span class="n">gtdb</span><span class="o">-</span><span class="n">release89</span><span class="o">-</span><span class="n">k31</span><span class="o">.</span><span class="n">sbt</span><span class="o">.</span><span class="n">zip</span>
</code></pre></div>
<p>This takes less than 2 GB of disk space total (including conda env), and the search runs in about 3 seconds and 120 MB of RAM.</p>
<p>Using the zip file stuff alone is a slight speed drag (~10-20%?), but the shift to Rust <a href="https://twitter.com/ctitusbrown/status/1257419632572538882">leads to an overall speed increase of about 4x</a>. And you can always unpack the zip file and use the unpacked files directly.</p>
<p>Yay!</p>
<h2>New database releases are coming!</h2>
<p>Over the next few months, we plan to release all our SBT databases as zip files!</p>
<p>As usual, per our semantic versioning guidelines, you'll need sourmash v3.3 or later to use the zip files. However, old databases will continue to work for all sourmash v3.x, and probably v4.x as well (and maybe beyond :).</p>
<p>--titus</p>Software and workflow development practices (April 2020 update)2020-04-20T00:00:00+02:002020-04-20T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2020-04-20:/blog/2020-software-and-workflow-dev-practices.html<p>How we develop software and workflows in the DIB Lab, in 2020.</p><p>Over the last 10-15 years, I've blogged periodically about how my lab develops research software and build scientific workflows. The <a href="http://ivory.idyll.org/blog/2018-repeatability-in-practice.html">last update</a> talked a bit about how we've transitioned to snakemake and conda for automation, but I was spurred by an e-mail conversation into another update - because, y'all, it's going pretty well and I'm pretty happy!</p>
<p>Below, I talk through our current practice of building workflows and software.
These procedures work pretty well for our (fairly small) lab of people who mostly work part-time on workflow and software development. <strong>By far</strong> the majority of our effort is usually spent trying to understand the <strong>results</strong> of our workflows; except in rare cases, I try to guide people to spend at most 20% of their time writing new analysis code - preferably less.</p>
<p>Nothing about these processes ensures that the scientific output is correct or useful, of course. While scientific correctness of computational workflows necessarily depends (often critically) on the correctness of the code underlying those workflows, the code could ultimately be doing the wrong thing scientifically. That having been said, I've found that the processes below let us focus much more cleanly on the scientific value of the code because we don't worry as much about whether the code is correct, and moreover our processes support rapid iteration of software and workflows as we iteratively develop our use cases.</p>
<p>As one side note, I should say that the complexity of the scientific process is one thing that distinguishes research computing from other software engineering projects. <strong>Often we don't actually have a good idea of what we're trying to achieve</strong>, at least not at any level of specificity. This is a recipe for disaster in a software engineering project, but it's our day-to-day life in science! What ...fun? (I mean, it kind of is. But it's also hellishly complicated.)</p>
<h2>Workflows and scripts</h2>
<p>Pretty much every scientific computing project I've worked on in the last (counts on fingers and toes... runs out of toes... 27 years!? eek) has grown into a gigantic mess of scripts and data files. Over the (many) years I've progressively worked on taming these messes using a variety of techniques.</p>
<p>Phillip Brooks, Charles Reid, Tessa Pierce, and Taylor Reiter have been the source of a lot of the workflow approaches I discuss below, although everyone in the lab has been involved in the discussions!</p>
<h3>Store code and configuration in version control</h3>
<p>Since I "grew up" simultaneously in science and open source, I started using version control early on - first RCS, then CVS, then darcs, then Subversion, and finally git. Version control is second nature, and it applies to science too!</p>
<p>The first basic rule of scientific projects is, <strong>put it in git.</strong></p>
<p>This means that I can (almost) always figure out what I was doing a month ago when I got that neat result that I haven't been able to replicate again. More importantly I can see <em>exactly</em> what I changed in the last hour, and either fix it or revert to what last worked.</p>
<p>Over almost 30 years of sciencing, project naming becomes a problem! Especially since I tend to start projects small and grow them (or let them die on the vine if my focus shifts). So my repo names usually start with the year, followed by a few keywords -- e.g. <a href="https://github.com/ctb/2020-long-read-assembly-decontam">2020-long-read-assembly-decontam</a>. While I can't predict which code I'll go back to, I always end up going back to some of it!</p>
<h3>Write scripts using a language that encourages modularity and code sharing</h3>
<p>I've developed scientific workflows in C, bash, Perl, Tcl, Java, and Python. By far my favorite language of these is Python. The main reason I switched wholeheartedly to Python is that, more than any of the others, Python had a nice blend of modularity and reusability. I could quickly pick up a blob of useful code from one script and put it in a shared module for other scripts to use. And it even had its own simple namespace scheme, which encouraged modularity by default!</p>
<p>At the time (late '90s, early '00s) this kind of namespacing was something that wasn't as well supported by other interpreted languages like Perl (v4?) and Tcl. While I was already a knowledgeable programmer, the ease of sharing code combined with such simple modularity encouraged systematic code reuse in my scripts in a new way. When combined with the straightforward C extension module API, Python was a huge win.</p>
<p>Nowadays there are many good options, of course, but Python is still one of them, so I haven't had to change! My lab now uses an increasing amount of R, because of its dominance in stats and viz. And we're starting to use Rust instead of C/C++ for extension modules.</p>
<h3>Automate scientific workflows</h3>
<p>Every project ends up with a mess of scripts.</p>
<p>When you have a pile of scripts, it's usually not clear how to run them in order. When you're actively developing the scripts, it becomes confusing to remember whether your output files have been updated by the latest code. Enter workflows!</p>
<p>I've been using <code>make</code> to run workflows for ages, but about 2 years ago the entire lab switched over to snakemake. This is in part because it's well integrated with Python, and in part because it supports conda environments. It's been lovely! And we now have a body of shared snakemake expertise in the lab that is hard to beat.</p>
<p>snakemake also works really well for combining my own scripts with other programs, which is of course something that we do a <em>lot</em> in bioinformatics.</p>
<p>There are a few problems with snakemake, of course. It doesn't readily scale to hundreds of thousands of jobs, and we're still working out the best way to orchestrate complex workflows on a cluster. But it's proven <a href="https://github.com/ngs-docs/2020-GGG201b-lab">relatively straightforward to teach</a>, and it's nicely designed, with an awful lot of useful features. I've heard good things about nextflow, and if I were going to operate at larger scales, I'd be looking at CWL or WDL.</p>
<h3>New: Work in isolated execution environments</h3>
<p>One problem that we increasingly encounter is the need to run different incompatible versions of software within the same workflow. Usually this manifests in underlying dependencies -- <strong>this</strong> package needs Python 2 while <strong>this other</strong> package requires Python 3.</p>
<p>Previously, tackling this required ugly heavyweight hacks such as VMs or docker containers. I personally spent a few years negotiating with Python virtualenvs, but they only solved some of the problems, and even then only in Python-land.</p>
<p>Now, we are 100% conda, all the time. In snakemake, we can provide <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/environment.yml">environment config files</a> for running the basic pipeline, with rule/step-specific <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/conf/env-sourmash.yml">environment files</a> that rely on pinned (specific) versions of software.</p>
<p>Briefly, with <code>--use-conda</code> on the command line and <code>conda:</code> directives <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/Snakefile#L26">in the Snakefile</a>, snakemake manages creating and updating these environments for you, and activates/deactivates them on a per-rule basis. It's beautiful and Just Works.</p>
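<p>As a minimal sketch of what this looks like (the rule name, file paths, and the <code>annotate-genome</code> command below are hypothetical, just for illustration):</p>
<div class="highlight"><pre><code># hypothetical rule with its own pinned conda environment.
rule annotate:
    input: "outputs/{sample}.fa"
    output: "outputs/{sample}.gff"
    conda: "conf/env-annotate.yml"    # pinned package versions live here
    shell: "annotate-genome {input} -o {output}"
</code></pre></div>
<p>Run <code>snakemake --use-conda</code> and snakemake builds the environment from <code>conf/env-annotate.yml</code> the first time through, then activates it whenever this rule executes.</p>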
<h3>New: Provide quickstart demonstration data sets.</h3>
<p>(This is a brand new approach to my daily practice, supported by the easy configurability of snakemake!)</p>
<p>The problem is this: often I want to develop and rapidly execute workflows on small test data sets, while also periodically running them on bigger "real" data sets to see what the results look like. It turns out this is hard to stage-manage! Enter ...snakemake config files! These are YAML or JSON files that are automatically loaded into your Snakefile namespace.</p>
<p><strong>Digression:</strong> A year or three ago, <a href="http://ivory.idyll.org/blog/2018-workflows-applications.html">I got excited</a> about using workflows as applications. This was a trend that Camille Scott, a PhD student in the lab, had started with <a href="https://dib-lab.github.io/dammit/">dammit</a>, and we've been using it for <a href="https://github.com/spacegraphcats/spacegraphcats/">spacegraphcats</a> and <a href="https://github.com/dib-lab/elvers">elvers</a>.</p>
<p>The basic idea is this: Increasingly, bioinformatics "applications" are workflows that involve running other software packages. Writing your own scripts that stage-manage other software execution is problematic, since you have to reinvent a lot of error handling that workflow engines already have. This is also true of issues like parallelization and versioning.</p>
<p>So why not write your applications as wrappers around a workflow engine? It turns out with both pydoit and snakemake, you can do this pretty easily! So that's an avenue we've been exploring in a few projects.</p>
<p><strong>Back to the problem to be solved:</strong> What I want for workflows is the following:</p>
<ol>
<li>A workflow that is approximately the same, independent of the input data.</li>
<li>Different sets of input data, ready to go.</li>
<li>In particular, a demo data set (a real data set cut down in size, or synthetic data) that exercises most or all of the features of the workflow.</li>
<li>The ability to switch between input data sets quickly and easily <strong>without</strong> changing any source code.</li>
<li>In a perfect world, I would have the ability to develop and run the same workflow code on both my laptop and in an HPC queuing system.</li>
</ol>
<p>This set of functionality is something that snakemake easily supports with its <code>--configfile</code> option - you specify a <em>default</em> config file <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/Snakefile#L6">in your Snakefile</a>, and then override that with other config files when you want to run for realz. Moreover, with the rule-specific conda environment files (see previous section!), I don't even need to worry about installing the software; snakemake manages it all for me!</p>
<p>With this approach, my workflow development process becomes very fluid. I prototype scripts on my laptop, where I have a full dev environment, and I develop synthetic data sets to exercise various features of the scripts. I bake this demo data set into <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/test-data/conf.yml">my default snakemake config</a> so that it's what's run by default. For real analyses, I then override this by specifying <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/conf/conf-necator.yml">a different config file</a> on the command line with <code>--configfile</code>. And this all interacts perfectly well with snakemake's cluster execution approach.</p>
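<p>Schematically, the pattern looks something like this (the config keys and file paths are made up for illustration):</p>
<div class="highlight"><pre><code># at the top of the Snakefile: a default config pointing at demo data.
configfile: "test-data/conf.yml"

SAMPLES = config["samples"]        # e.g. a list of sample names
OUTDIR = config["outdir"]          # e.g. where results should go

rule all:
    input:
        expand(OUTDIR + "/{sample}.sig", sample=SAMPLES)
</code></pre></div>
<p>Running plain <code>snakemake</code> uses the demo config; running <code>snakemake --configfile conf/real-data.yml</code> overrides it for the real analysis, with no code changes.</p>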
<p>As a bonus, the demo data set provides a simple quickstart and example config file for people who want to use your software. This makes <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/README.md#installing">the installation and quickstart docs</a> really simple and nearly identical across multiple projects!</p>
<p>(Note that I develop on Mac OS X and execute at scale on Linux HPCs. I'd probably be less happy with this approach if I developed on Windows, for which bioconda doesn't provide packages.)</p>
<h2>Libraries and applications</h2>
<p>On the opposite end of the spectrum from "piles of scripts" is research software engineering, where we are trying explicitly to build maintainable and reusable libraries and command-line applications. Here we take a very different approach from the workflow style detailed above, although in recent years I've noticed that we're working across this full spectrum on several projects. (This is perhaps because workflows, done like we are doing them above, start to resemble serious software engineering :).</p>
<p>Whenever we find a core set of functionality that is being used across multiple projects in the lab, we start to abstract that functionality into a library and/or command line application. We do this in part because <a href="http://ivory.idyll.org/blog/automated-testing-and-research-software.html">most scripts have bugs</a> that should be fixed, and we remain ignorant of them until we start reusing the scripts; but it also aids in efficiency and code reuse. It's a nice use-case driven way to develop software!</p>
<p>We've developed several software packages this way. For example, the <a href="https://github.com/dib-lab/khmer/">khmer</a> and <a href="https://github.com/dib-lab/screed/">screed</a> libraries emerged from piles of code that slowly got unified into a shared library.</p>
<p>More recently, the <a href="https://github.com/dib-lab/sourmash/">sourmash</a> project has become the in-lab exemplar of intentional software development practices. We now have 3-5 people working regularly on sourmash, and it's being used by an increasingly wide range of people. Below are some of the key techniques we've been using, which will (in most cases) be readily recognized as matching basic open source development practices!</p>
<p>I want to give an especially big shoutout here to Michael Crusoe, Camille Scott, and Luiz Irber, who have been the three key people leading our adoption of these techniques.</p>
<h3>Automate tests</h3>
<p>Keeping software working is hard. Automated tests are one of the solutions.</p>
<p>We have an increasingly expansive <a href="https://github.com/dib-lab/sourmash/tree/master/tests">set of automated tests</a> for sourmash - over 600 at the moment. It takes about a minute to run the whole test suite on my laptop. If it looks intimidating, that's because we've grown it over the years. We started with one test, and went from there.</p>
<p>We don't really use test-driven development extensively, or at least I don't. I know Camille has used it <a href="http://www.camillescott.org/2017/11/15/pytest-magic/">in her De Bruijn graph work</a>. I tend to reserve it for situations where the code is becoming complicated enough at a class or function level that I can't understand it -- and that's rarely necessary in my work. (Usually it means that I need to take a step back and rethink what I'm doing! I'm a big believer in <a href="https://www.linusakesson.net/programming/kernighans-lever/index.php">Kernighan's Lever</a> - if you're writing code at the limit of your ability to understand it, you'll never be able to debug it!)</p>
<h3>Use code review</h3>
<p>Maintainability, sustainability, and correctness of code are all enhanced by having multiple people's eyes on it.</p>
<p>We basically use <a href="https://guides.github.com/introduction/flow/">GitHub Flow</a>, as I understand it. Every PR runs all the tests on each commit, and we have a checklist to help guide contributors.</p>
<p>We have a two-person sign-off rule on every PR. This can slow down code development when some of us are busy, but on the flip side no one person is solely responsible when bad code makes it into a release :).</p>
<p>Most importantly, it means that our code quality is consistently better than what I would produce working on my own.</p>
<h3>Use semantic versioning</h3>
<p><a href="https://semver.org/">Semantic versioning</a> means that when we release a new version, outside observers can quickly know if they can upgrade without a problem. For example, within the sourmash 3.x series, the only reason for the same command line options to produce different output is if <a href="https://github.com/dib-lab/sourmash/pull/942">there was a bug</a>.</p>
<p><a href="https://github.com/dib-lab/sourmash/issues/655">We are still figuring out some of the details, of course.</a> For example, we have only recently started tracking performance regressions. And it's unclear exactly what parts of our API should be considered public. Since sourmash isn't <em>that</em> widely used, I'm not pushing hard on resolving these kinds of high level issues, but they are a regular background refrain in my mind.</p>
<p>In any case, what semantic versioning does is provide a simple way for people to know if it's safe to upgrade. It also lets us pin down versions in our own workflows, with some assurance that the behavior shouldn't be changing (but performance might improve!) if we pin to a major version.</p>
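<p>In practice, the pinning can be as simple as a version constraint in a conda environment file - a sketch, with the package list trimmed for illustration:</p>
<div class="highlight"><pre><code># environment.yml -- pin to the sourmash 3.x series: behavior stays
# stable under semantic versioning, though performance may improve.
channels:
  - conda-forge
  - bioconda
dependencies:
  - sourmash&gt;=3.3,&lt;4
</code></pre></div>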
<h3>Nail down behavior with tests, then refactor underneath</h3>
<p>I write a lot of hacky code when I'm exploring research functionality. Often this code gets baked into our packages with a limited understanding of its edge cases. As I explore and expand the use cases more and more, I find more of these edge cases. And, if the code is in a library, I nail down the edge cases with <a href="http://ivory.idyll.org/blog/stupidity-driven-testing.html">stupidity-driven testing</a>. This then lets me (or others) refactor the code to be less hacky and more robust, without changing its functionality.</p>
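<p>As a (hypothetical) example of what nailing down an edge case looks like - the module and function here are invented for illustration, not real sourmash API:</p>
<div class="highlight"><pre><code>import pytest

from mymodule import load_signatures   # hypothetical function under test

def test_load_signatures_empty_file(tmp_path):
    # stupidity-driven testing: a user hit a crash on an empty file,
    # so we pin down the intended behavior before refactoring.
    sigfile = tmp_path / "empty.sig"
    sigfile.write_text("")

    with pytest.raises(ValueError):
        load_signatures(str(sigfile))
</code></pre></div>
<p>With a test like this in place, the implementation underneath can be rewritten freely, and the edge case stays fixed.</p>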
<p>For example, I'm currently going through a <a href="https://github.com/dib-lab/sourmash/pull/946">long, slow refactor</a> of some formerly ugly sourmash code that creates a certain kind of indexed database. This code worked reasonably well for years, but as we developed more uses for it, it became clear that there were, ahem, opportunities for refactoring it to be more usable in other contexts.</p>
<p>We don't start with good code. We don't pretend that our code is good (or at least I wouldn't, and can't :). But we iteratively improve upon our code as we work with it.</p>
<h3>Explore useful behavior, then nail it down with tests, and only <strong>then</strong> optimize the heck out of it</h3>
<p>The previous section is how we clean up code, but it turns out it also works really well for <strong>speeding up code</strong>.</p>
<p>There is a really frustrating bias amongst software developers towards <a href="https://wiki.c2.com/?PrematureOptimization">premature optimization</a>, which leads to ugly and unmaintainable code. In my experience, flexibility trumps optimization 80% or more of the time, so I take this to the other extreme and rarely worry about optimizing code. Luckily some people in my lab counterbalance me in this preference, so we occasionally produce performant code as well :).</p>
<p>What we do is get to the point where we have pretty well-specified functionality, and then benchmark, and then refactor and optimize based on the benchmarking.</p>
<p>A really clear example of this applied to sourmash was <a href="https://github.com/dib-lab/sourmash/issues/573">here</a>, when Luiz and Taylor noticed that I'd written really bad code that was recreating large sets again and again in Python. Luiz added a simple "remove_many" method that did the same operation in place and we got a really substantial (order of magnitude?) speed increase.</p>
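<p>To illustrate the shape of that fix with plain Python sets (an analogy only - the real change was inside sourmash's MinHash code):</p>
<div class="highlight"><pre><code>remaining = set(range(1_000_000))
found = set(range(0, 1_000_000, 2))

# before: each subtraction allocates a brand-new set, so doing this
# repeatedly copies the large remaining set over and over.
remaining = remaining - found

# after: remove the elements in place - no copy, same result.
remaining.difference_update(found)
</code></pre></div>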
<p>Critically, this optimization was to a new research algorithm that we developed over the period of years. <strong>First</strong> we got the research algorithm to work. <strong>Then</strong> we spent a lot of time understanding how and why and where it was useful. <strong>During this period</strong> we wrote a whole bunch of tests that nailed down the behavior. And then when Luiz optimized the code, we just dropped in a faster replacement that passed all the tests.</p>
<p>This has become a bit of a trend in recent years. As sourmash has moved from C to C++ to Rust, Luiz has systematically improved the runtimes for various operations. But this has always occurred in the context of well-understood features with lots of tests. Otherwise we just end up breaking our software when we optimize it.</p>
<p>As a side note, whenever I hear someone emphasize the speed of their just-released scientific software, my strong Bayesian prior is that they are really telling me their code is not only full of bugs (all software is!) but that it'll be really hard to find and fix them...</p>
<h3>Collaborate by insisting the tests pass</h3>
<p>Working on multiple independent feature sets at the same time is hard, whether it's only one person or five. Tests can help here, too!</p>
<p>One of the cooler things to happen in sourmash land in the last two years is that <a href="https://twitter.com/olgabot">Olga Botvinnik</a> and some of her colleagues at CZBioHub started contributing substantially to sourmash. This started with Olga's interest in using sourmash for single-cell RNAseq analysis, which presents challenging new scalability problems.</p>
<p>Recently, the CZBioHub folk <a href="https://github.com/dib-lab/sourmash/pull/925">submitted a pull request to significantly change one of our core data structures</a> so as to scale it better. (It's going to be merged soon!) Almost all of our review comments have focused on reviewing the code for understandability, rather than questioning the correctness - this is because the interface for this data structure is pretty well tested at a functional level. <strong>Since the tests pass, I'm not worried that the code is wrong.</strong></p>
<p>What this overall approach lets us do is simultaneously work on multiple parts of the sourmash code base with some basic assurances that it will still work after all the merges are done.</p>
<h3>Distribute via (bio)conda, install via environments</h3>
<p>Installation for end users is hard.
I've spent many, many years writing installation tutorials. Conda just solves this, and is our go-to approach now for supporting user installs.</p>
<p>Conda software installation is awesome and awesomely simple. Even when software isn't yet packaged for conda install (like <a href="https://github.com/spacegraphcats/spacegraphcats/">spacegraphcats</a>, which is research-y enough that I haven't bothered) you <a href="https://github.com/spacegraphcats/spacegraphcats/blob/master/environment.yml">can still install it that way -- see the pip commands, here</a>.</p>
<h3>Put everything in issues</h3>
<p>You can find most design decisions, feature requests, and long-term musings for sourmash in our <a href="https://github.com/dib-lab/sourmash/issues">issue tracker</a>. This is where we discuss almost everything, and it's our primary help forum as well. Having a one-stop shop that ties together design, bugs, code reviews, and documentation updates is really nice. We even try to <a href="https://twitter.com/ctitusbrown/status/1247524069596991488">archive slack conversations</a> there!</p>
<h2>Concluding thoughts</h2>
<p>Academic workflow and software development is a tricky business. We operate in slow moving and severely resource-constrained environments, with a constant influx of people who have a variety of experience, to solve problems that are often poorly understood in the beginning (and maybe at the end). The practices above have been developed for a small lab and are battle-tested over a decade and more.</p>
<p>While your mileage may vary in terms of tools and approaches, I've seen convergence across the social-media enabled biological data science community to similar practices. This suggests these practices solve real problems that are being experienced by multiple labs. Moreover, we're developing a solid community of practice in not only using these approaches but also teaching them to new trainees. Huzzah!</p>
<p>--titus</p>
<p>(Special thanks go to the USDA, the NIH, and the Moore Foundation for funding so much of our software development!)</p>How to give a bad online talk2020-04-13T00:00:00+02:002020-04-13T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2020-04-13:/blog/2020-bad-online-talk.html<p>A bad example...</p><p>Today at lab meeting, I wanted to brainstorm about how to give good online talks, because I'm giving a few remote talks in the next month. Tracy suggested that perhaps I should demonstrate a <em>bad</em> talk first, just to get everyone on the same page.</p>
<p>So I did!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/F4czvzciTlE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p><a href="https://t.co/fX1YE2yjDp">Direct (YouTube link)</a></p>
<p>...enjoy? It's short, and not TOO painful if you show up with low expectations!</p>
<hr>
<p>First, let me say that we were tremendously ...inspired by Greg Wilson's <a href="https://t.co/p7LwjAmUOB">How to Teach Badly</a> and <a href="https://t.co/AUCbyN5fDV">How to Teach Badly (part 2)</a>!</p>
<p>So here's what I did --</p>
<p>I put together a few slides on some stuff that I'd been working on recently, so it would look reasonable.</p>
<p>My initial screen opened with a private Twitter message up, to mimic inadvertent content sharing :).</p>
<p>I started out with "I didn't have a lot of time to prepare for this meeting so apologies for some of the slides."</p>
<p>My slide theme was very hard to read - bad fonts and colors.</p>
<p>A few slides in I went with "I know we're all busy on time so I'm going to be brief. I'll just skip some of the background and through these first slides quickly."</p>
<p>On the first slide with an image, I had Taylor Reiter break in to ask a question, and I shut her down with "Just hold questions, I'll get to them at the end if we have time."</p>
<p>All of my slide content was just ...terrible. I am especially "proud" of the screenshots of code (I carefully cropped off the code comments).</p>
<p>And of course I spoke quickly, imparted little to no useful information in any way, and took no questions at the end, either...</p>
<hr>
<p>I only informed one or two people in advance that I was doing this, and so I got some good reactions ;). I also got some amazing recommendations for how to make it far, far worse...</p>
<p>Anyway, enjoy! I will write another blog post on what the various suggestions for giving good online talks were -- I'm giving two remote talks in the next month or so, and I'll come back with some specific recommendations, too!</p>
<p>--titus</p>
<p>p.s. Yes, these are real projects and you CAN find them on github :).</p>Some snakemake hacks for dealing with large collections of files2020-03-09T00:00:00+01:002020-03-09T00:00:00+01:00C. Titus Brown and N. Tessa Pierce and Taylor Reitertag:ivory.idyll.org,2020-03-09:/blog/2020-snakemake-hacks-collections-files.html<p>snakemake4life</p><p>This winter quarter I taught my usual graduate-level introductory
bioinformatics lab at UC Davis, GGG 201(b), for the fourth time. The
course lectures are given by Megan Dennis and Fereydoun Hormozdiari,
and I do a largely separate lab that aims to teach the basics of
practical variant calling, de novo assembly, and RNAseq differential
expression.</p>
<p>I also co-developed and co-taught a new course, GGG 298 / Tools for
Data Intensive Research, with Shannon Joslin, a graduate student here
in Genetics & Genomics who (among other things) took GGG 201(b) the
first time I offered it. GGG 298 is a series of ten half-day workshops
where we teach shell, conda, snakemake, git, RMarkdown, etc - you can
see
<a href="https://github.com/ngs-docs/2020-GGG298/">the syllabus for GGG 298 here</a>.</p>
<p>This time around, I did a complete redesign of the
<a href="https://github.com/ngs-docs/2020-GGG201b-lab">GGG 201(b) lab (see syllabus)</a>
to focus on using
<a href="http://snakemake.readthedocs.io/en/stable/">snakemake workflows</a>.</p>
<p>I'm 80% happy with how it went - there's some overall fine tuning to
be done, and snakemake has some corners that need more explaining than
other corners, but I think the basic concepts got through to a lot of
the students. I also think I'm finally teaching people something they
<em>really</em> need to know, which is how to build, automate, place controls
on, and execute complex bioinformatics workflows.</p>
<p>I was traveling the week before last, so I asked Taylor Reiter and
Tessa Pierce to do the first RNAseq lecture for the class (week 8!). As
part of their
<a href="https://github.com/ngs-docs/2020-ggg-201b-rnaseq">brilliant RNAseq materials</a>
for the class (snakemake! salmon! tximeta! DESeq2! RMarkdown!), Tessa
used a cute trick in the Snakefile that I hadn't seen before. It's
"obvious" if you're a Python+snakemake expert, but many people aren't,
and in any case it's always nice to share, right??</p>
<p>Below, I take the opportunity to share several solutions for loading
sample names into the Snakefile.</p>
<p>(These are fairly boilerplate examples that you can use in your own
code with little modification, too!)</p>
<h2>Cute snakemake trick #1: dictionaries for downloads</h2>
<p>The following code snippet is a nice, simple Pythonic way to download
a bunch of files from Web URLs.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># list sample names & download URLs.</span>
<span class="n">sample_links</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"ERR458493"</span><span class="p">:</span> <span class="s2">"https://osf.io/5daup/download"</span><span class="p">,</span>
<span class="s2">"ERR458494"</span><span class="p">:</span><span class="s2">"https://osf.io/8rvh5/download"</span><span class="p">,</span>
<span class="s2">"ERR458495"</span><span class="p">:</span><span class="s2">"https://osf.io/2wvn3/download"</span><span class="p">,</span>
<span class="s2">"ERR458500"</span><span class="p">:</span><span class="s2">"https://osf.io/xju4a/download"</span><span class="p">,</span>
<span class="s2">"ERR458501"</span><span class="p">:</span> <span class="s2">"https://osf.io/nmqe6/download"</span><span class="p">,</span>
<span class="s2">"ERR458502"</span><span class="p">:</span> <span class="s2">"https://osf.io/qfsze/download"</span><span class="p">}</span>
<span class="c1"># the sample names are dictionary keys in sample_links. extract them to a list we can use below</span>
<span class="n">SAMPLES</span><span class="o">=</span><span class="n">sample_links</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span>
<span class="c1"># download yeast rna-seq data from Schurch et al, 2016 study</span>
<span class="n">rule</span> <span class="n">download_all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"rnaseq/raw_data/</span><span class="si">{sample}</span><span class="s2">.fq.gz"</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">)</span>
<span class="c1"># rule to download each individual file specified in sample_links</span>
<span class="n">rule</span> <span class="n">download_reads</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"rnaseq/raw_data/</span><span class="si">{sample}</span><span class="s2">.fq.gz"</span>
<span class="n">params</span><span class="p">:</span>
<span class="c1"># dynamically generate the download link directly from the dictionary</span>
<span class="n">download_link</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">wildcards</span><span class="p">:</span> <span class="n">sample_links</span><span class="p">[</span><span class="n">wildcards</span><span class="o">.</span><span class="n">sample</span><span class="p">]</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> curl -L </span><span class="si">{params.download_link}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<h2>Cute snakemake trick #2: loading filenames from the current directory.</h2>
<p>(<strong>I don't recommend this approach.</strong> Read on.)</p>
<p>One of the most common questions I've been asked in the last few weeks
is how to avoid typing all of the sample names into the
Snakefile. (This can matter a lot when you have hundreds of samples!)</p>
<p>After you download the files above, you can get a list of the
downloaded files like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">sample_ids</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"rnaseq/raw_data/</span><span class="si">{sample}</span><span class="s2">.fq.gz"</span><span class="p">)</span>
</code></pre></div>
<p>Now, <code>sample_ids</code> is a Python list that behaves just like <code>SAMPLES</code>,
and it can be used with <code>expand</code>. (Note the <code>.sample</code> at the end: <code>glob_wildcards</code> returns a named tuple of wildcard lists, so you extract the list for your wildcard by name.)</p>
<p>Note, for this example, <code>SAMPLES</code> and <code>sample_ids</code> are going to
contain the same list of sample names. The difference is that <code>sample_ids</code>
is loaded from the directory listing, while <code>SAMPLES</code> has to be
written out in the Snakefile somehow (here, in <code>sample_links</code>).</p>
<p><strong>Why don't I recommend this approach?</strong> You can only use this
approach if the files already exist in the directory. That's fine -
often you don't want to copy them in or download them dynamically! -
but it sets up a particular kind of potential error. If you load the
list of samples from your working directory, and you accidentally
delete one of the sample files, you'll omit it from your workflow
without knowing.</p>
<p>It's much better to <em>independently</em> specify the list of files, so that
if you accidentally delete one, snakemake will complain. That's where
the next trick comes in.</p>
<p>As a bonus, the next approach lets you specify metadata in the
spreadsheet, which is important!</p>
<h2>Cute snakemake trick #3: loading a list of sample names from a spreadsheet.</h2>
<p>This is taken from a really nice, clean
<a href="https://github.com/snakemake-workflows/rna-seq-star-deseq2">example RNAseq workflow that uses STAR and DESeq2</a>,
written by Johannes Köster, Sebastian Schmeier, and Jose Maturana.</p>
<p>Here,
<a href="https://github.com/snakemake-workflows/rna-seq-star-deseq2/blob/master/Snakefile">the Snakefile</a>
loads sample names from a tab-separated values spreadsheet using
pandas; a simplified version of the code follows:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">samples_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'samples.tsv'</span><span class="p">)</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s2">"sample"</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">sample_names</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">samples_df</span><span class="p">[</span><span class="s1">'sample'</span><span class="p">])</span>
</code></pre></div>
<p>Here, <code>sample_names</code> is the same as <code>SAMPLES</code> and <code>sample_ids</code>, above - a list that you can use in <code>expand</code> and so on. The difference here is that <code>samples_df</code> is a <a href="https://www.geeksforgeeks.org/python-pandas-dataframe/">Pandas dataframe</a> that contains other information, such as sample metadata; and it's loaded from a TSV file that can be created, visualized, and edited using spreadsheet software.</p>
<h2>Cute snakemake trick #4: loading a list of download links from a spreadsheet.</h2>
<p>The TSV approach is particularly useful for downloading files or
moving files, as the download links or file paths can be included in
the spreadsheet, rather than at the top of the Snakefile (as they were
in cute trick #1).</p>
<p>Considering the same yeast RNAseq data as the first example and a TSV
file containing the sample names and download links, samples can be
downloaded like so:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">samples_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'samples.tsv'</span><span class="p">)</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s2">"sample"</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">SAMPLES</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">samples_df</span><span class="p">[</span><span class="s1">'sample'</span><span class="p">])</span>
<span class="c1"># download yeast rna-seq data from Schurch et al, 2016 study</span>
<span class="n">rule</span> <span class="n">download_all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"rnaseq/raw_data/</span><span class="si">{sample}</span><span class="s2">.fq.gz"</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">)</span>
<span class="c1"># rule to download each individual file specified in samples_df</span>
<span class="n">rule</span> <span class="n">download_reads</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"rnaseq/raw_data/</span><span class="si">{sample}</span><span class="s2">.fq.gz"</span>
<span class="n">params</span><span class="p">:</span>
<span class="c1"># dynamically grab the download link from the "dl_link" column in the samples data frame</span>
<span class="n">download_link</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">wildcards</span><span class="p">:</span> <span class="n">samples_df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">wildcards</span><span class="o">.</span><span class="n">sample</span><span class="p">,</span> <span class="s2">"dl_link"</span><span class="p">]</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> curl -L </span><span class="si">{params.download_link}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Enjoy! And comments are welcome!</p>
<p>--titus (and Tessa and Taylor!)</p>Two talks at JGI in May: sourmash, spacegraphcats, and disease associations in the human microbiome.2020-02-17T00:00:00+01:002020-02-17T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2020-02-17:/blog/2020-talks-at-jgi.html<p>Using k-mers and taxonomy to find contamination in metagenomes</p><p>Hello all! I'm giving two metagenomics talks - a tech talk and a bio
talk - at the Joint Genome Institute on May 7, 2020. The abstracts are
below.</p>
<p>The JGI just moved to a new building at LBNL, so these talks are much
more accessible to the UC Berkeley and LBNL communities than they
would have been a year ago. I hope interested people can make it!</p>
<p>The talks will be in the afternoon on May 7th at the
<a href="https://www.lbl.gov/community/integrative-genomics-building/">Integrative Genomics Building</a>,
LBNL Bldg 91-310. I've put the tentative times down. I'll update this
post with final times and contact information for security + parking
passes closer to the day.</p>
<h2>Bio talk: Novel approaches to metagenome analysis reveal microbial signatures of IBD</h2>
<p>(This will be the Science and Technology seminar, 3-4pm on May 7.)</p>
<p>Inflammatory bowel disease (IBD) is a spectrum of diseases
characterized by chronic inflammation of the intestines; it is likely
caused by host-mediated inflammatory responses at least in part
elicited by microorganisms. As of 2015, 1.3% of US adults have been
diagnosed with IBD. To date, although significant microbial
associations have been uncovered, no causative or consistent microbial
signature has been associated with IBD.</p>
<p>In a meta-analysis of six IBD cohorts comprising 2290 gut microbiome
shotgun metagenomes, we sought to uncover microbial signatures of
IBD. We developed a k-mer-based analysis approach based on sourmash
scaled signatures that comprehensively characterizes each metagenome
sample. We demonstrate that this approach explains substantial PCoA
variation across samples, and that patient, study, and diagnosis
account for the majority of variation. We then built an accurate
random forest classifier to predict IBD subtype. This classifier is
built on approximately 14,000 predictive k-mers and outperforms all
previously published work for characterization of IBD subtype. We next
sought to uncover the biological signal of the predictive k-mers. To
determine the origin of the predictive k-mers, we used sourmash gather
to search 400,000 microbial genomes from GenBank as well as recent
human metagenome reanalysis efforts.</p>
<p>We found that 69% of predictive k-mers were contained in 129 genomes,
many of which match known IBD correlates. We reasoned that many
additional predictive k-mers were likely in the pangenomes of these
129 predictive genomes, so we next used spacegraphcats to query
neighborhoods in compact de Bruijn graphs and extract sequences that
were near our predictive genomes in graph space. This increased the
annotated fraction of predictive k-mers to 85%.</p>
<p>This suggests that ~16% of predictive k-mers originate from
strain-variable or accessory components of pangenomes, and that this
variation is hidden from reference-based approaches but is important
for determining IBD subtypes. Interestingly, the fraction of
predictive k-mers associated with the 129 genomes changed
substantially after spacegraphcats queries. For example, a genome from
the genus Bacteroides increased from owning 2.1% to 10.7% of
predictive k-mers, surpassing the genome that was most predictive
prior to spacegraphcats queries (Clostridiales bacterium, 2.9% to
7.4%). We are now working to bioinformatically characterize the genes
associated with the pangenomes.</p>
<p>Our pipeline is lightweight and open source, extensible to similar
comparative metagenomic studies, and has the potential to improve
diagnostic criteria for IBD subtype.</p>
<h2>Tech talk: No k-mer left behind.</h2>
<p>(This is part of the Compute Next Generation talk series at JGI, 2-3pm
on May 7.)</p>
<p>Here at the DIB Lab @ UC Davis, we've developed and implemented a few
techniques that might be of interest to microbiology and metagenomics
computational researchers. In this tech talk, I will dig into the theory
and implementation of our approaches, and discuss some of our current
and future use cases. While there may be some extreme speculation
involved, I will be sure to highlight it as such :).</p>
<p>The first technique is DensityHash, an extension and simplification of
the modulo hash technique proposed as an alternative to MinHash by
Broder (1997). Briefly, we massively downsample k-mers by intersecting
with a subset of hash space. This permits efficient and accurate
estimation of Jaccard similarity and containment on large sequencing
data sets. We have implemented this technique in sourmash
(github.com/dib-lab/sourmash), which offers a pleasant user experience
for comparing samples, searching large databases (e.g. all of
GenBank), estimating the composition of metagenomes, and discovering
contaminated MAGs, among others. We also have a taxonomic module that
slices and dices arbitrary taxonomies, and associates them with hashes
for fun and profit.</p>
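<p>(The core downsampling idea fits in a few lines of Python. Here's a sketch - not sourmash's actual implementation, which uses MurmurHash and also canonicalizes k-mers against their reverse complements:)</p>
<div class="highlight"><pre><code>import hashlib

SCALED = 1000                    # keep roughly 1/1000 of all k-mers
MAX_HASH = 2**64 // SCALED       # keep hashes below this threshold

def kmer_hash(kmer):
    # stand-in 64-bit hash; sourmash itself uses MurmurHash.
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")

def sketch(sequence, ksize=31):
    hashes = set()
    for i in range(len(sequence) - ksize + 1):
        h = kmer_hash(sequence[i:i + ksize])
        if h &lt; MAX_HASH:         # the bottom 1/SCALED slice of hash space
            hashes.add(h)
    return hashes
</code></pre></div>
<p>(Because every sketch keeps the same fixed slice of hash space, set intersections and unions of sketches directly estimate containment and Jaccard similarity.)</p>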
<p>The second technique is neighborhood query into large compact De
Bruijn graphs, using dominating sets. Briefly, we implement a
practically efficient linear-time neighborhood clustering on
metagenome compact De Bruijn graphs, and then use this to query and
characterize neighborhoods. This is implemented in spacegraphcats
(github.com/spacegraphcats/spacegraphcats/). Spacegraphcats permits
recovery of accessory elements and strain variation from metagenomes,
for additional fun and profit.</p>
<p>All of our software is open source under the BSD license, developed
openly on GitHub, and implemented in a combination of Python and
Rust. We use automated tests, continuous integration, code coverage
analysis, and pull request review in our development processes.</p>
<p>References:</p>
<p>sourmash: <a href="https://f1000research.com/articles/8-1006">Pierce et al., 2019</a></p>
<p>spacegraphcats: <a href="https://www.biorxiv.org/content/10.1101/462788v3">Brown et al., 2020</a></p>
<hr>
<p>Hope to see you there!</p>
<p>--titus</p>sourmash-oddify: a workflow for exploring contamination in metagenome-assembled genomes2020-01-02T00:00:00+01:002020-01-02T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2020-01-02:/blog/2020-sourmash-oddify.html<p>Using k-mers and taxonomy to find contamination in metagenomes</p><p>(Thanks to Erich Schwarz, Taylor Reiter, and Donovan Parks for brainstorming and feedback on this stuff. Thanks also to Luiz Irber and Phillip Brooks for their work on sourmash!)</p>
<p>Yesterday, <a href="http://ivory.idyll.org/blog/2020-sourmash-gtdb-oddities.html">I posted</a> about using k-mers and taxonomy to investigate Genbank genomes for potential contamination.</p>
<p>The underlying idea is pretty simple: look for subsets of k-mers that don't match the inferred taxonomy of the genome bins they're from, then analyze.</p>
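<p>In simplified Python, the idea looks roughly like this (the data structures are stand-ins for sourmash's LCA databases, not the real API):</p>
<div class="highlight"><pre><code>def find_odd_hashes(bin_hashes, bin_lineage, hash_to_lineage):
    """Flag hashed k-mers whose database lineage disagrees with the
    lineage assigned to the genome bin containing them.

    Lineages are tuples like ("d__Bacteria", "p__Actinobacteriota", ...);
    hash_to_lineage is a stand-in for a sourmash LCA database lookup."""
    odd = set()
    for h in bin_hashes:
        lineage = hash_to_lineage.get(h)
        if lineage is None:
            continue               # novel k-mer, no taxonomic info
        n = min(len(lineage), len(bin_lineage))
        if lineage[:n] != bin_lineage[:n]:
            odd.add(h)             # disagrees at some shared rank
    return odd
</code></pre></div>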
<p>What started me down this path <a href="http://ivory.idyll.org/blog/2017-comparing-genomes-from-metagenomes.html">over two years ago (!!)</a> was the use of the same underlying Tara Oceans metagenomic data for two separate papers, <a href="https://www.nature.com/articles/sdata2017203">Tully et al., 2018</a> and <a href="https://www.nature.com/articles/s41564-018-0176-9">Delmont et al., 2018</a>. Both groups released their data early along with bioRxiv preprints, and it proved to be a treasure trove for my bioinformatics methods development - <a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-lca-subcommands-for-taxonomic-classification">all of the sourmash lca functionality</a> as well a lot of other functionality came from a series of about 14 blog posts examining these genomes.</p>
<p>I last left off with the Tara oceans taxonomic analysis <a href="http://ivory.idyll.org/blog/2017-taxonomic-disagreements-in-tara-mags.html">back around Thanksgiving 2017</a>, with the realization that I needed to dig some more in order to really understand what was going on.</p>
<p>Then, over the 2019 winter break, while updating our Genbank databases, I started playing with <a href="http://ivory.idyll.org/blog/2019-sourmash-lca-db-gtdb.html">making sourmash databases for the GTDB taxonomy</a>, and <a href="http://ivory.idyll.org/blog/2019-sourmash-lca-vs-gtdb-classify.html">while trying to understand why sourmash classifications were different from GTDB classifications</a>, I <a href="http://ivory.idyll.org/blog/2020-sourmash-gtdb-oddities.html">developed a pile of scripts to dig into taxonomically divergent genomes that share sequence</a>.</p>
<p>While corresponding with Donovan Parks about some of the Genbank oddities I found, he pointed out that this approach might be a useful technique for exploring contamination in metagenome-assembled genomes more generally.</p>
<p>Yep!</p>
<h2>The challenge: metagenome-assembled genome analysis</h2>
<p>When people compute metagenome-assembled genomes by assembling metagenomes and then binning the resulting contigs into inferred genomes, they usually assign taxonomy to the genomes using single-copy marker genes. These same genes can also be used in the binning pipeline, and/or in an evaluation step (see e.g. <a href="https://genome.cshlp.org/content/early/2015/05/14/gr.186072.114">CheckM</a>).</p>
<p>What has always worried me (and others!) is that this taxonomic assignment step drags with it many contigs whose only association with the single-copy marker genes is often that they were binned together. And, absent detailed inspection or knowledge of the genes in those contigs, it's been unclear how to evaluate the inclusion of those accessory contigs.</p>
<p>Here I should note that we + collaborators have looked into similar questions using <a href="https://www.biorxiv.org/content/10.1101/462788v2">assembly graph proximity</a>, which may work as well. Regardless, the question of how to QC MAGs is definitely an obsession of mine!</p>
<p>An angle suggested by the <a href="http://ivory.idyll.org/blog/2020-sourmash-gtdb-oddities.html">above Genbank analysis</a> was to look at the accessory contigs by doing k-mer-based taxonomic analysis on them, and then see if the k-mer taxonomy agreed or disagreed with the marker-gene-based taxonomy.</p>
<p>There are many reasons why this might fail - the main one being that you would generally expect the DNA sequence in MAGs to be novel, followed closely by issues of genuine horizontal gene transfer, plasmids, etc. But nothing ventured, nothing gained - and I already had functioning scripts! So I gave it a try.</p>
<h2>Connecting everything into a workflow</h2>
<p>In my lab, we have been using <a href="https://snakemake.readthedocs.io/">the snakemake workflow system</a> a lot. It's an excellent way to tie together a bunch of disparate scripts!</p>
<p>So I put together a workflow, <a href="https://github.com/dib-lab/sourmash-oddify">sourmash-oddify</a>, to automate the analysis of genome bins. The steps are:</p>
<ol>
<li>Given a collection of genome bins,</li>
<li>Assign taxonomy using <a href="https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz848/5626182">GTDB-Tk</a></li>
<li>Build a <a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-lca-index">sourmash taxonomy/LCA database</a> using the resulting taxonomy</li>
<li>Run the <a href="https://github.com/dib-lab/sourmash-oddify/blob/master/scripts/find-oddities.py">find-oddities</a> and <a href="https://github.com/dib-lab/sourmash-oddify/blob/master/scripts/find-oddities-examine.py">find-oddities-examine</a> scripts.</li>
</ol>
<p>and voila! Sprinkle some <a href="https://github.com/dib-lab/sourmash-oddify/blob/master/conf/default.yml">YAML config file magic pixie dust on top</a> and you have a configurable workflow!</p>
<p>Now I needed to run it on some interesting data... hmm, what collections of MAGs do I have lying around... <a href="http://ivory.idyll.org/blog/2019-comparing-binnings.html">hey look, the Tara MAGs!</a></p>
<h2>Running sourmash-oddify on the Tara genomes</h2>
<p>So, I ran this on the 2,631 genomes from Tully et al., 2018, and the 957 genomes from Delmont et al., 2018. The GTDB-Tk step took about 12 hours, and the rest (computing signatures, building the LCA database, extracting oddities, aligning genomes) took about an hour. (The config file is <a href="https://github.com/dib-lab/sourmash-oddify/blob/master/conf/config-tara.yml">here</a>.)</p>
<p>The results on the Delmont data are <a href="https://osf.io/xj87f/">here</a> and <a href="https://osf.io/rt6qm/">here</a>, and the results on the Tully data are <a href="https://osf.io/xqt3n/">here</a> and <a href="https://osf.io/jhq62/">here</a>.</p>
<p>I decided to dig into two results, one from each data set. In both cases, the two genomes were classified in different superkingdoms:</p>
<div class="highlight"><pre><span></span><code> - TOBG_MED-875 (d__Archaea;p__Thermoplasmatota;c__Poseidoniia;o__Poseidoniales;f__Thalassoarchaeaceae;g__MGIIb-O5;;)
- TOBG_SAT-1614 (d__Bacteria;p__Actinobacteriota;c__Acidimicrobiia;o__Microtrichales;f__TK06;g__UBA7388;s__UBA7388 sp002470695;)
- TARA_MED_MAG_00140 (d__Archaea;p__Asgardarchaeota;c__Heimdallarchaeia;;;;;)
- TARA_PON_MAG_00079 (d__Bacteria;p__Patescibacteria;c__CG2-30-54-11;;;;;)
</code></pre></div>
<p>and both pairs shared a lot of sequence between them:</p>
<div class="highlight"><pre><span></span><code><span class="n">TOBG</span><span class="o">:</span>
<span class="n">cluster2</span><span class="o">.</span><span class="mi">0</span><span class="o">:</span><span class="w"> </span><span class="mi">208</span><span class="n">kb</span><span class="w"> </span><span class="n">aln</span><span class="w"> </span><span class="o">(</span><span class="mi">130</span><span class="n">k</span><span class="w"> </span><span class="mi">51</span><span class="o">-</span><span class="n">mers</span><span class="o">)</span><span class="w"> </span><span class="n">across</span><span class="w"> </span><span class="o">(</span><span class="n">root</span><span class="o">);</span><span class="w"> </span><span class="n">longest</span><span class="w"> </span><span class="n">contig</span><span class="o">:</span><span class="w"> </span><span class="mi">115</span><span class="w"> </span><span class="n">kb</span>
<span class="n">weighted</span><span class="w"> </span><span class="n">percent</span><span class="w"> </span><span class="n">identity</span><span class="w"> </span><span class="n">across</span><span class="w"> </span><span class="n">alignments</span><span class="o">:</span><span class="w"> </span><span class="mf">98.1</span><span class="o">%</span>
<span class="n">skipped</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">alignments</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="n">alignments</span><span class="w"> </span><span class="o">(<</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="n">bp</span><span class="w"> </span><span class="n">or</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="mi">95</span><span class="o">%</span><span class="w"> </span><span class="n">identity</span><span class="o">)</span>
<span class="n">TOBG_SAT</span><span class="o">-</span><span class="mi">1614</span><span class="o">:</span><span class="w"> </span><span class="n">removed</span><span class="w"> </span><span class="mi">330</span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">2514</span><span class="n">kb</span><span class="w"> </span><span class="o">(</span><span class="mi">13</span><span class="o">%),</span><span class="w"> </span><span class="mi">3</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">28</span><span class="w"> </span><span class="n">contigs</span>
<span class="n">TOBG_MED</span><span class="o">-</span><span class="mi">875</span><span class="o">:</span><span class="w"> </span><span class="n">removed</span><span class="w"> </span><span class="mi">238</span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">1305</span><span class="n">kb</span><span class="w"> </span><span class="o">(</span><span class="mi">18</span><span class="o">%),</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">55</span><span class="w"> </span><span class="n">contigs</span>
<span class="n">TARA</span><span class="o">:</span>
<span class="n">cluster14</span><span class="o">.</span><span class="mi">0</span><span class="o">:</span><span class="w"> </span><span class="mi">1497</span><span class="n">kb</span><span class="w"> </span><span class="n">aln</span><span class="w"> </span><span class="o">(</span><span class="mi">970</span><span class="n">k</span><span class="w"> </span><span class="mi">51</span><span class="o">-</span><span class="n">mers</span><span class="o">)</span><span class="w"> </span><span class="n">across</span><span class="w"> </span><span class="o">(</span><span class="n">root</span><span class="o">);</span><span class="w"> </span><span class="n">longest</span><span class="w"> </span><span class="n">contig</span><span class="o">:</span><span class="w"> </span><span class="mi">11</span><span class="w"> </span><span class="n">kb</span>
<span class="n">weighted</span><span class="w"> </span><span class="n">percent</span><span class="w"> </span><span class="n">identity</span><span class="w"> </span><span class="n">across</span><span class="w"> </span><span class="n">alignments</span><span class="o">:</span><span class="w"> </span><span class="mf">98.9</span><span class="o">%</span>
<span class="n">skipped</span><span class="w"> </span><span class="mi">15</span><span class="w"> </span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">alignments</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">37</span><span class="w"> </span><span class="n">alignments</span><span class="w"> </span><span class="o">(<</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="n">bp</span><span class="w"> </span><span class="n">or</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="mi">95</span><span class="o">%</span><span class="w"> </span><span class="n">identity</span><span class="o">)</span>
<span class="n">TARA_PON_MAG_00079</span><span class="o">:</span><span class="w"> </span><span class="n">removed</span><span class="w"> </span><span class="mi">1788</span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">6127</span><span class="n">kb</span><span class="w"> </span><span class="o">(</span><span class="mi">29</span><span class="o">%),</span><span class="w"> </span><span class="mi">507</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">1767</span><span class="w"> </span><span class="n">contigs</span>
<span class="n">TARA_MED_MAG_00140</span><span class="o">:</span><span class="w"> </span><span class="n">removed</span><span class="w"> </span><span class="mi">2791</span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">6746</span><span class="n">kb</span><span class="w"> </span><span class="o">(</span><span class="mi">41</span><span class="o">%),</span><span class="w"> </span><span class="mi">472</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">1411</span><span class="w"> </span><span class="n">contigs</span>
</code></pre></div>
<p>As a control, I then took the "cleaned" genomes and re-ran the classification with GTDB-Tk. Three of the four were classified the same as before, indicating that the removed sequence didn’t contain essential marker genes (as I would have guessed). TARA_PON_MAG_00079 wasn't classified as anything by GTDB-Tk, because fewer than 10% of the markers were present. (AFAICT, GTDB-Tk doesn’t give any more details than that in its logs, so I’ll have to dig to figure out what happened.)</p>
<p>TOBG_MED-875 is classified as d__Archaea, while TOBG_SAT-1614 is classified as d__Bacteria. So what is the sequence that is shared? Conveniently, find-oddities-examine.py outputs the contigs it removes, but what next?</p>
<p>I decided to run a quick analysis using Torsten Seemann's <a href="https://github.com/tseemann/prokka">prokka</a>, which did gene calling and gave me protein sequences in FASTA format with a minimum amount of fuss (thanks Torsten!). I took the resulting aa sequences, extracted those over 100 aa in length, shuffled them, and took the first 10. I then BLASTed <a href="https://osf.io/uds2n/">these ten sequences</a> over at <a href="https://blast.ncbi.nlm.nih.gov/Blast.cgi">NCBI BLAST</a>.</p>
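<p>(For concreteness, the "filter by length, shuffle, take 10" step is only a few lines of Python. Here's a minimal sketch, assuming prokka's protein FASTA output and the screed parsing library; the filenames are hypothetical.)</p>
<div class="highlight"><pre><span></span><code>import random
import screed   # FASTA/FASTQ parsing library

# keep only proteins that are at least 100 aa long
long_proteins = [record for record in screed.open('proteins.faa')
                 if len(record.sequence) >= 100]

# shuffle, then take the first ten for BLASTing
random.shuffle(long_proteins)
with open('ten-proteins.faa', 'w') as fp:
    for record in long_proteins[:10]:
        fp.write('>{}\n{}\n'.format(record.name, record.sequence))
</code></pre></div>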
<p>The top hit to these genes in all 10 cases is to the <a href="https://www.ncbi.nlm.nih.gov/biosample/SAMN07618765">TOBG_MED-875 genome in Genbank</a>, which is labeled as a <em>Euryarchaeota archaeon</em>.</p>
<p>However, the second and third hits are generally to a variety of Chloroflexi and/or Acidimicrobiales proteins, in the Bacterial superkingdom. This suggests that the majority of the predicted genes in the DNA shared between TOBG_MED-875 and TOBG_SAT-1614 are bacterial.</p>
<p>Moreover, it suggests that the inclusion of TOBG_MED-875 in Genbank may be messing up some gene taxonomies.</p>
<h2>Summary thoughts</h2>
<p>I think it is safe to argue that two different binned genomes from the same metagenomic samples should not share much genomic DNA, unless they are from closely related species. (In general, I would not trust conclusions about lateral gene transfer based solely on computationally inferred genomes.)</p>
<p><a href="https://github.com/dib-lab/sourmash-oddify">sourmash-oddify</a> is an alpha-stage automated workflow to identify k-mers and DNA segments that don't follow the taxonomy of their containing genomes. I think using it to flag contamination in metagenome-assembled genomes is (or will be :) straightforward.</p>
<p>It uses the GTDB taxonomy assignment pipeline, GTDB-Tk, to generate the taxonomies, uses a Kraken-inspired approach to identify "incoherent" k-mers shared between genomes, and then runs nucmer to align the genomes.</p>
<p>Indications are that at least on some genomes, it correctly identifies contamination.</p>
<p>This is a pretty lightweight workflow, too, especially if you're already using GTDB-Tk!</p>
<h2>What's next?</h2>
<p>I'm not really sure. I have a few ideas for some larger scale analyses, but I'm at the stage where I have 80% of the coding done for a full project, but it's only 20% of the work needed to bring the project to some sort of real fruition.</p>
<p>I think what I'd be looking to do next is to automate the Prokka steps above, and find some way of semi-automatically reaching conclusions about who the contamination belongs to.</p>
<p>I'd like to work with a group or two who have a large collection of pre-publication MAGs to investigate with this approach. I think the best way to mature an approach like this is in tandem with a biology team that really cares about the specifics of the genomes and genes. <a href="mailto:ctbrown@ucdavis.edu">Drop me a line if you're interested!</a></p>
<p>I have other ideas and questions, too - can we use this pipeline on single-cell genomes? Should we work to ingest all genomes everywhere, and would that make this more sensitive in a useful way?</p>
<p>--titus</p>Finding problematic bacterial/archaeal genomes using k-mers and taxonomy2020-01-01T00:00:00+01:002020-01-01T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2020-01-01:/blog/2020-sourmash-gtdb-oddities.html<p>Some things in Genbank look ...odd.</p><p>(Happy New Year, everyone! Thanks on this blog post go out to Erich
Schwarz and Taylor Reiter, for offering helpful suggestions and asking
tough questions as I meandered through this work!)</p>
<p>Yesterday,
<a href="http://ivory.idyll.org/blog/2019-sourmash-lca-vs-gtdb-classify.html">I posted</a>
about using <code>sourmash lca classify</code> to taxonomically classify
bacterial and archaeal genomes quickly, and compared the results to
the full GTDB taxonomy. The tl;dr was that sourmash works pretty well
and returns results consistent with GTDB and GTDB-Tk, but that it
often doesn't classify as precisely as GTDB-Tk.</p>
<p>I was kind of expecting that at the species level, because there is a
limit to the kind of precision that downsampled k-mers can achieve:
the last 1-0.1% of nucleotide similarity can be a bit wobbly with
sourmash (<- technical term).</p>
<p>But I was surprised to see limits at the phylum and superkingdom
levels. <code>sourmash lca classify</code> couldn't classify 235 genomes beyond
the phylum level! What could be causing this?</p>
<h2>Digging into a single case of imprecise classification by sourmash</h2>
<p>I took a closer look at
<a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_001477405.1/">GCF_001477405</a>,
a genome tagged as <em>Staphylococcus sciuri</em> in Genbank. Using <code>sourmash
lca summarize</code>, I output a summary of the taxonomic labels of the
31-mers in this genome, downsampled at 10,000. At the phylum level, I
saw:</p>
<div class="highlight"><pre><span></span><code><span class="mf">67.9</span><span class="err">%</span><span class="w"> </span><span class="mf">199</span><span class="w"> </span><span class="n">d__Bacteria</span><span class="p">;</span><span class="n">p__Firmicutes</span>
<span class="mf">2.0</span><span class="err">%</span><span class="w"> </span><span class="mf">6</span><span class="w"> </span><span class="n">d__Bacteria</span><span class="p">;</span><span class="n">p__Proteobacteria</span>
</code></pre></div>
<p>which suggests that there are about 60k 31-mers (6 hashes, at a
downsampling rate of 10,000) in the genome that belong to genomes in
the phylum Proteobacteria (they're from the
<em>Bradyrhizobium sp003020075</em> genome, if you're interested :).</p>
<p>And, because of the mechanism and thresholds by which <code>sourmash lca
classify</code> works, those 60k k-mers were enough to trigger confusion
about whether the genome belonged to the Firmicutes or the
Proteobacteria.</p>
<h2>The limitations of a naive lowest common ancestor algorithm</h2>
<p>Please indulge me in a brief digression about lowest common ancestor
approaches to classification. Per
<img alt="this figure" src="http://ivory.idyll.org/blog/images/2017-kmers-kraken.jpg">
(from Wood and Salzberg 2014), the algorithm for taxonomic
classification of collections of k-mers looks something like this:</p>
<ol>
<li>Classify all k-mers individually</li>
<li>Collect all the classifications into a single tree</li>
<li>Compute the lowest common ancestor of all the classifications</li>
<li>Assign that classification to the collection</li>
</ol>
<p>This is the algorithm that sourmash uses, with the addition of a
filtering step to remove classifications that are few in number before
computing the lowest common ancestor.</p>
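<p>In code, the naive LCA-plus-filtering idea is compact. Here's a toy sketch (not sourmash's actual implementation; the <code>min_count</code> default is illustrative), with lineages represented as tuples ordered from superkingdom on down:</p>
<div class="highlight"><pre><span></span><code>from collections import Counter

def classify(kmer_lineages, min_count=2):
    # steps 1 & 2: tally the per-k-mer classifications
    counts = Counter(kmer_lineages)
    # the filtering step: drop classifications with few supporting k-mers
    survivors = [lin for lin, n in counts.items() if n >= min_count]
    if not survivors:
        return ()
    # step 3: the LCA is the longest lineage prefix shared by all survivors
    lca = survivors[0]
    for lin in survivors[1:]:
        shared = 0
        for a, b in zip(lca, lin):
            if a != b:
                break
            shared += 1
        lca = lca[:shared]
    return lca   # step 4: assign this to the whole collection

# e.g., for the genome above:
# classify([('d__Bacteria', 'p__Firmicutes')] * 199 +
#          [('d__Bacteria', 'p__Proteobacteria')] * 6)
# returns ('d__Bacteria',) -- no phylum-level call, as described above.
</code></pre></div>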
<p>(Kraken takes a more sophisticated approach than this, where it
computes the highest-weighted root-to-leaf path through the taxonomy
(as pictured in the above figure).)</p>
<p>So that's what's going on with this specific genome: sourmash is doing
the right thing (by our implementation of the LCA algorithm), and
refusing to classify the genome beyond the phylum level, because the
genome has bits and pieces of Firmicutes <em>and</em>
Proteobacteria. Meanwhile, GTDB is appropriately deciding that this is
almost certainly a <em>Staphylococcus sciuri</em>, based on its marker genes.</p>
<h2>Looking systematically at genome composition across 25k genomes</h2>
<p>On the other hand, why the heck is 2% of this genome shared with
<em>Bradyrhizobium sp003020075</em>?? That's a good question...</p>
<p>Rather than dig into this on a case-by-case basis, I decided to look
across 25k of the GTDB genomes - these are the 25k dereplicated
genomes that are part of the GTDB toolkit, and (not coincidentally)
the ones in the databases that
<a href="http://ivory.idyll.org/blog/2019-sourmash-lca-db-gtdb.html">we posted on Monday</a>. These
"LCA" databases contain a dictionary of all of the k-mers in all 25k
genomes, together with their taxonomic lineages - perfect!</p>
<p>So I devised the following algorithm:</p>
<ol>
<li>Gather all 51-mers in the 25k genomes</li>
<li>Identify those that are "taxonomically incoherent" at the superkingdom or phylum level, by which I mean they belong to genomes in both Archaea and Bacteria, or multiple phyla within Archaea or Bacteria.</li>
<li>Find pairs of genomes that belong to different phyla or superkingdoms and contain approximately 100,000 or more 51-mers in common.</li>
</ol>
<p>This algorithm is implemented in the
<a href="https://github.com/dib-lab/sourmash-oddify/blob/master/scripts/find-oddities.py">find-oddities.py script</a>,
if you're interested; it'll run on any sourmash LCA database file, and
takes about a minute to run on the 25k-genome database.</p>
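<p>Stripped way down, the core of the scan looks something like the toy sketch below. (This is not the real script; <code>hashval_to_entries</code> is a hypothetical dict mapping each hashval to (genome, lineage-tuple) pairs, which is roughly the information an LCA database holds.)</p>
<div class="highlight"><pre><span></span><code>from collections import defaultdict
from itertools import combinations

SCALED = 10000                   # downsampling rate of the database
MIN_SHARED = 100000 // SCALED    # ~100kb worth of shared 51-mers

def find_oddities(hashval_to_entries):
    pair_counts = defaultdict(int)
    for hashval, entries in hashval_to_entries.items():
        # map each genome containing this hashval to its phylum-level lineage
        by_genome = {genome: lineage[:2] for genome, lineage in entries}
        # "incoherent" = the hashval spans more than one phylum
        if len(set(by_genome.values())) == 1:
            continue
        for g1, g2 in combinations(sorted(by_genome), 2):
            if by_genome[g1] != by_genome[g2]:   # cross-phylum pairs only
                pair_counts[(g1, g2)] += 1

    # report pairs sharing ~100kb or more of incoherent sequence
    for (g1, g2), n in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
        if n >= MIN_SHARED:
            print('{} & {} share ~{} bp'.format(g1, g2, n * SCALED))
</code></pre></div>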
<p>What does the output look like? This!</p>
<div class="highlight"><pre><span></span><code><span class="n">cluster</span><span class="w"> </span><span class="mh">0</span><span class="w"> </span><span class="n">has</span><span class="w"> </span><span class="mh">2</span><span class="w"> </span><span class="n">assignments</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="mh">47</span><span class="w"> </span><span class="n">hashvals</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="mh">470000</span><span class="w"> </span><span class="n">bp</span>
<span class="w"> </span><span class="n">rank</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="nl">lca:</span><span class="w"> </span><span class="n">superkingdom</span><span class="w"> </span><span class="n">d__Bacteria</span>
<span class="w"> </span><span class="n">Candidate</span><span class="w"> </span><span class="n">genome</span><span class="w"> </span><span class="n">pairs</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">these</span><span class="w"> </span><span class="nl">lineages:</span>
<span class="w"> </span><span class="n">cluster</span><span class="p">.</span><span class="n">pair</span><span class="w"> </span><span class="mf">0.0</span><span class="w"> </span><span class="n">share</span><span class="w"> </span><span class="mh">470000</span><span class="w"> </span><span class="n">bases</span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">GCA_003220225</span><span class="w"> </span><span class="p">(</span><span class="n">d__Bacteria</span><span class="p">;</span><span class="n">p__Methylomirabilota</span><span class="p">;</span><span class="n">c__Methylomirabilia</span><span class="p">;</span><span class="n">o__Ro</span><span class="se">\</span>
<span class="n">kubacteriales</span><span class="p">;</span><span class="n">f__GWA2</span><span class="o">-</span><span class="mh">73</span><span class="o">-</span><span class="mh">35</span><span class="p">;</span><span class="n">g__AR12</span><span class="p">;</span><span class="n">s__AR12</span><span class="w"> </span><span class="n">sp003220225</span><span class="p">;)</span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">GCA_003222275</span><span class="w"> </span><span class="p">(</span><span class="n">d__Bacteria</span><span class="p">;</span><span class="n">p__Acidobacteriota</span><span class="p">;</span><span class="n">c__Vicinamibacteria</span><span class="p">;</span><span class="n">o__Fen</span><span class="o">-</span><span class="se">\</span>
<span class="mh">336</span><span class="p">;</span><span class="n">f__Fen</span><span class="o">-</span><span class="mh">336</span><span class="p">;</span><span class="n">g__AA32</span><span class="p">;</span><span class="n">s__AA32</span><span class="w"> </span><span class="n">sp003222275</span><span class="p">;)</span>
</code></pre></div>
<p>This is flagging two genomes, GCA_003220225 and GCA_003222275, as
sharing approximately 470,000 51-mers. The output is sorted by number
of shared k-mers, descending, and for the particular thresholds and
parameters that I'm using, there are 21 sets of lineages in GTDB that
share 100,000 or more 51-mers across the superkingdom or phylum level.</p>
<h2>Looking at actual alignments</h2>
<p>The big problem with the above approach is that it relies on k-mers,
and downsampled k-mers at that. To dig into the actual genomes, I
decided to actually do some alignments. Briefly, I gathered each pair
of genomes, aligned them with nucmer, and then filtered the resulting
alignments using pymummer; this is implemented in the script
<a href="https://github.com/dib-lab/sourmash-oddify/blob/master/scripts/find-oddities-examine.py">find-oddities-examine.py</a>.</p>
<p>The resulting output is this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">cluster0</span><span class="p">.</span><span class="nc">0</span><span class="o">:</span><span class="w"> </span><span class="nt">557kb</span><span class="w"> </span><span class="nt">aln</span><span class="w"> </span><span class="o">(</span><span class="nt">470k</span><span class="w"> </span><span class="nt">51-mers</span><span class="o">)</span><span class="w"> </span><span class="nt">across</span><span class="w"> </span><span class="nt">d__Bacteria</span><span class="o">;</span><span class="w"> </span><span class="nt">longest</span><span class="w"> </span><span class="nt">contig</span><span class="o">:</span><span class="w"> </span><span class="nt">26</span><span class="w"> </span><span class="nt">kb</span>
<span class="nt">weighted</span><span class="w"> </span><span class="nt">percent</span><span class="w"> </span><span class="nt">identity</span><span class="w"> </span><span class="nt">across</span><span class="w"> </span><span class="nt">alignments</span><span class="o">:</span><span class="w"> </span><span class="nt">97</span><span class="p">.</span><span class="nc">6</span><span class="o">%</span>
<span class="nt">skipped</span><span class="w"> </span><span class="nt">79</span><span class="w"> </span><span class="nt">kb</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">alignments</span><span class="w"> </span><span class="nt">in</span><span class="w"> </span><span class="nt">97</span><span class="w"> </span><span class="nt">alignments</span><span class="w"> </span><span class="o">(<</span><span class="w"> </span><span class="nt">0</span><span class="w"> </span><span class="nt">bp</span><span class="w"> </span><span class="nt">or</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="nt">95</span><span class="o">%</span><span class="w"> </span><span class="nt">identity</span><span class="o">)</span>
<span class="nt">GCA_003222275</span><span class="o">:</span><span class="w"> </span><span class="nt">removed</span><span class="w"> </span><span class="nt">760kb</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">6756kb</span><span class="w"> </span><span class="o">(</span><span class="nt">11</span><span class="o">%),</span><span class="w"> </span><span class="nt">154</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">628</span><span class="w"> </span><span class="nt">contigs</span>
<span class="nt">GCA_003220225</span><span class="o">:</span><span class="w"> </span><span class="nt">removed</span><span class="w"> </span><span class="nt">4739kb</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">6031kb</span><span class="w"> </span><span class="o">(</span><span class="nt">79</span><span class="o">%),</span><span class="w"> </span><span class="nt">95</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">167</span><span class="w"> </span><span class="nt">contigs</span>
</code></pre></div>
<p>and hopefully it's pretty self-explanatory.</p>
<p>The script makes some minimal attempt to "cleanse" the genomes of
things that align between them, and that's what the last two lines
are. But, rather than doing anything clever, I just discard any contig
that has an alignment in it. This is obviously wrong in a general
sense but...</p>
<p>...interestingly, it provides an opportunity to see that in this pair
of genomes, one genome has alignments to a bunch of fragments (that's
the first one), while the other has alignments throughout (the second
one). The signature of this is that you can cleanly remove all of the
alignments from the first genome and only get rid of 11% of the
sequence, whereas 79% of the second genome goes away when you
eliminate contigs with alignments.</p>
<p>So, in this case, it's pretty clear that the first genome is probably
contaminated by sequence from the second genome.</p>
<p>There are other situations where it's less clear what's going on:</p>
<div class="highlight"><pre><span></span><code><span class="nt">cluster21</span><span class="p">.</span><span class="nc">0</span><span class="o">:</span><span class="w"> </span><span class="nt">115kb</span><span class="w"> </span><span class="nt">aln</span><span class="w"> </span><span class="o">(</span><span class="nt">100k</span><span class="w"> </span><span class="nt">51-mers</span><span class="o">)</span><span class="w"> </span><span class="nt">across</span><span class="w"> </span><span class="nt">d__Bacteria</span><span class="o">;</span><span class="w"> </span><span class="nt">longest</span><span class="w"> </span><span class="nt">contig</span><span class="o">:</span><span class="w"> </span><span class="nt">1</span><span class="w"> </span><span class="nt">kb</span>
<span class="nt">weighted</span><span class="w"> </span><span class="nt">percent</span><span class="w"> </span><span class="nt">identity</span><span class="w"> </span><span class="nt">across</span><span class="w"> </span><span class="nt">alignments</span><span class="o">:</span><span class="w"> </span><span class="nt">99</span><span class="p">.</span><span class="nc">2</span><span class="o">%</span>
<span class="nt">skipped</span><span class="w"> </span><span class="nt">4</span><span class="w"> </span><span class="nt">kb</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">alignments</span><span class="w"> </span><span class="nt">in</span><span class="w"> </span><span class="nt">6</span><span class="w"> </span><span class="nt">alignments</span><span class="w"> </span><span class="o">(<</span><span class="w"> </span><span class="nt">0</span><span class="w"> </span><span class="nt">bp</span><span class="w"> </span><span class="nt">or</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="nt">95</span><span class="o">%</span><span class="w"> </span><span class="nt">identity</span><span class="o">)</span>
<span class="nt">GCF_000477555</span><span class="o">:</span><span class="w"> </span><span class="nt">removed</span><span class="w"> </span><span class="nt">1439kb</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">2775kb</span><span class="w"> </span><span class="o">(</span><span class="nt">52</span><span class="o">%),</span><span class="w"> </span><span class="nt">163</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">207</span><span class="w"> </span><span class="nt">contigs</span>
<span class="nt">GCF_000427295</span><span class="o">:</span><span class="w"> </span><span class="nt">removed</span><span class="w"> </span><span class="nt">5423kb</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">6292kb</span><span class="w"> </span><span class="o">(</span><span class="nt">86</span><span class="o">%),</span><span class="w"> </span><span class="nt">88</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">202</span><span class="w"> </span><span class="nt">contigs</span>
</code></pre></div>
<p>and here we would need to dig further.</p>
<h2>Examining potential contamination across 25k Genbank genomes</h2>
<p>Of the 21 pairs of genomes found with the above approach, it looks
like there are 11 that have cleanly isolatable contigs with
taxonomically incoherent k-mers.</p>
<p>The most interesting one is this:</p>
<div class="highlight"><pre><span></span><code><span class="n">cluster</span><span class="w"> </span><span class="mh">2</span><span class="w"> </span><span class="n">has</span><span class="w"> </span><span class="mh">2</span><span class="w"> </span><span class="n">assignments</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="mh">25</span><span class="w"> </span><span class="n">hashvals</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="mh">250000</span><span class="w"> </span><span class="n">bp</span>
<span class="w"> </span><span class="n">rank</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="nl">lca:</span><span class="w"> </span><span class="n">superkingdom</span><span class="w"> </span><span class="n">d__Bacteria</span>
<span class="w"> </span><span class="n">Candidate</span><span class="w"> </span><span class="n">genome</span><span class="w"> </span><span class="n">pairs</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">these</span><span class="w"> </span><span class="nl">lineages:</span>
<span class="w"> </span><span class="n">cluster</span><span class="p">.</span><span class="n">pair</span><span class="w"> </span><span class="mf">2.0</span><span class="w"> </span><span class="n">share</span><span class="w"> </span><span class="mh">260000</span><span class="w"> </span><span class="n">bases</span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">GCF_002705755</span><span class="w"> </span><span class="p">(</span><span class="n">d__Bacteria</span><span class="p">;</span><span class="n">p__Actinobacteriota</span><span class="p">;</span><span class="n">c__Actinobacteria</span><span class="p">;</span><span class="n">o__Actin</span><span class="se">\</span>
<span class="n">omycetales</span><span class="p">;</span><span class="n">f__Microbacteriaceae</span><span class="p">;</span><span class="n">g__Microbacterium</span><span class="p">;</span><span class="n">s__Microbacterium</span><span class="w"> </span><span class="n">esteraromat</span><span class="se">\</span>
<span class="n">icum_A</span><span class="p">;)</span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">GCA_003265155</span><span class="w"> </span><span class="p">(</span><span class="n">d__Bacteria</span><span class="p">;</span><span class="n">p__Firmicutes</span><span class="p">;</span><span class="n">c__Bacilli</span><span class="p">;</span><span class="n">o__Mycoplasmatales</span><span class="p">;</span><span class="n">f_</span><span class="se">\</span>
<span class="n">_Mycoplasmoidaceae</span><span class="p">;</span><span class="n">g__Eperythrozoon_A</span><span class="p">;</span><span class="n">s__Eperythrozoon_A</span><span class="w"> </span><span class="n">wenyonii_A</span><span class="p">;)</span>
<span class="n">weighted</span><span class="w"> </span><span class="n">percent</span><span class="w"> </span><span class="n">identity</span><span class="w"> </span><span class="n">across</span><span class="w"> </span><span class="nl">alignments:</span><span class="w"> </span><span class="mf">97.9</span><span class="o">%</span>
<span class="n">skipped</span><span class="w"> </span><span class="mh">45</span><span class="w"> </span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">alignments</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="mh">42</span><span class="w"> </span><span class="n">alignments</span><span class="w"> </span><span class="p">(</span><span class="o"><</span><span class="w"> </span><span class="mh">0</span><span class="w"> </span><span class="n">bp</span><span class="w"> </span><span class="k">or</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="mh">95</span><span class="o">%</span><span class="w"> </span><span class="n">identity</span><span class="p">)</span>
<span class="nl">GCA_003265155:</span><span class="w"> </span><span class="n">removed</span><span class="w"> </span><span class="mh">593</span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mh">597</span><span class="n">kb</span><span class="w"> </span><span class="p">(</span><span class="mh">99</span><span class="o">%</span><span class="p">),</span><span class="w"> </span><span class="mh">34</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mh">37</span><span class="w"> </span><span class="n">contigs</span>
<span class="nl">GCF_002705755:</span><span class="w"> </span><span class="n">removed</span><span class="w"> </span><span class="mh">534</span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mh">3626</span><span class="n">kb</span><span class="w"> </span><span class="p">(</span><span class="mh">15</span><span class="o">%</span><span class="p">),</span><span class="w"> </span><span class="mh">139</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mh">225</span><span class="w"> </span><span class="n">contigs</span>
</code></pre></div>
<p>Here it looks like essentially the entire genome (99%) of <em>Eperythrozoon_A wenyonii_A</em>
is contained in the genome of <em>Microbacterium esteraromaticum_A</em>!</p>
<p>I'm still fine-tuning the approach but I think this is a promising way
to flag Genbank genomes that are candidates for further examination.</p>
<h2>Some concluding thoughts for today</h2>
<p>To summarize, what we're seeing is that whole-genome approaches to
taxonomic classification (either based on phylogeny of marker genes,
or on whole-genome nucleotide comparisons, or both) sometimes disagree
with the taxonomic signal of small bits of the genome's content.</p>
<p>Let me hasten to add: this is a well-known approach to looking at
horizontal gene transfer, and the only small bits of interesting
novelty here are (a) the scaling power of sourmash and (b) the
large-scale application.</p>
<p>Concerning my own initial question: at least some of the imprecise
classifications by sourmash are probably due to cross-genome shared
nucleotides, some of which may be contamination (and others of which
might be legitimate lateral gene transfer, plasmids, etc.) It's hard
to tell without digging in further, of course!</p>
<p>I think it's interesting to contrast compositional approaches like the
above with approaches like average nucleotide identity (ANI). ANI is a
good way to do a comparison of two (or more) genomes, but it's a bulk
measure that (like all bulk measures) elides details. A k-mer based
approach can detect compositional commonalities between genomes, but
of course has its own limitations. Using both seems like a good
opportunity!</p>
<p>I've tried analyses like this with the Genbank taxonomy, but because
that taxonomy isn't constructed using whole-genome comparisons, the
results are too messy for me to look into; I'm too likely to discover
that the problem is an incoherent taxonomic assignment of the whole
genome, rather than a smaller portion of the genome being confused. So
I'm really appreciating GTDB, which resolves a lot of these issues!</p>
<p>Donovan Parks made the excellent point to me in e-mail that many of
the exciting new taxa in the tree of life are based on species known
only from metagenome-assembled genomes, and so some contamination is
not unexpected. (See also
<a href="https://mbio.asm.org/content/10/3/e00725-19.abstract">"Composite Metagenome-Assembled Genomes Reduce the Quality of Public Genome Repositories", Shaiber and Eren, 2019</a>
and
<a href="https://www.biorxiv.org/content/10.1101/808410v2">"Accurate and Complete Genomes from Metagenomes", Chen et al., bioRxiv, 2019</a>
for some relevant discussions.) My interest, at least for the moment,
is in building tools to dig into this quickly and easily; we'll see
where that goes!</p>
<p>--titus</p>
<p>p.s. The full <code>oddities-k51.txt</code> is <a href="https://osf.io/n8vcg/">here</a>, and the full <code>oddities-k51.examine.txt</code> is <a href="https://osf.io/azyst/">here</a>.</p>
<p>p.p.s The command lines to generate the above files are in <a href="https://github.com/dib-lab/sourmash_databases/blob/master/gtdb/run-oddities.sh">this script</a>.</p>How does sourmash's lca classification routine compare with GTDB classifications?2019-12-31T00:00:00+01:002019-12-31T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2019-12-31:/blog/2019-sourmash-lca-vs-gtdb-classify.html<p>GTDB databases again!</p><p>Yesterday
<a href="http://ivory.idyll.org/blog/2019-sourmash-lca-db-gtdb.html">I posted</a>
about the
<a href="https://www.biorxiv.org/content/10.1101/256800v2">GTDB taxonomy</a>; we
are now providing prepared databases that can be used with sourmash's
taxonomy classification routines to classify genomes with GTDB.</p>
<p>The databases we posted are built from the dereplicated 25k GTDB
genomes distributed as part of the
<a href="https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz848/5626182">GTDB-Tk classification toolkit</a>,
and not the full 145k classifications in GTDB. So they are smaller
than they could be, and also potentially lower resolution. Moreover,
sourmash uses k-mers instead of amino acids, which may lead to
different classifications.</p>
<p>A good first question is, how well do classifications with <code>sourmash
lca classify</code> & 25k genomes compare to the full 145k classifications
in GTDB? This is basically a measure of generalizability - how
reliably can we infer the classifications of the 145k genomes from the
25k?</p>
<h2>Comparing <code>sourmash lca classify</code> on Genbank to GTDB</h2>
<p>I classified all 420k Genbank genomes using <code>sourmash lca classify</code>
with k=31, and I then wrote a script to compare the output to the GTDB
taxonomy. This involved some rather nasty identifier conversion which
sometimes failed, but we ended up with a good number of comparable
items:</p>
<div class="highlight"><pre><span></span><code>identifiers in gtdb only: 6901
identifiers in sourmash lca classify only: 247987
identifiers in both: 137185
</code></pre></div>
<p>So, for the comparisons below, we are using the 95% of GTDB
identifiers and 35% of sourmash identifiers that could be harmonized.
(The missing items are due to failed identifier munging, different
versions of Genbank, and my use of genbank-entire instead of RefSeq,
which is the source of the bulk of the sourmash-specific
identifiers.)</p>
<p>Of the 137,185 genomes in common, a straight-up comparison of
classifications gave the following:</p>
<div class="highlight"><pre><span></span><code><span class="n">same</span><span class="o">:</span><span class="w"> </span><span class="mi">79666</span><span class="w"> </span><span class="o">(</span><span class="mf">58.1</span><span class="o">%)</span>
<span class="n">different</span><span class="o">:</span><span class="w"> </span><span class="mi">57519</span><span class="w"> </span><span class="o">(</span><span class="mf">41.9</span><span class="o">%)</span>
</code></pre></div>
<p>The 58.1% identical number is reassuring, but 41.9% disagreement is
not great - what's going on here?</p>
<p>It turns out that, in almost all situations, sourmash <strong>agrees with</strong>
but is <strong>lower resolution than</strong> GTDB.</p>
<div class="highlight"><pre><span></span><code>different but consistent: 57498
rank: superkingdom / count: 201
rank: phylum / count: 36
rank: class / count: 176
rank: order / count: 94
rank: family / count: 2260
rank: genus / count: 54731
rank: species / count: 0
</code></pre></div>
<p>That is, 201 of the sourmash classifications stop at the superkingdom
level, 36 continue only to the phylum level, and so on. Fully 95.1% match
down to the genus level! And these sourmash classifications agree with
the GTDB taxonomy as far as they go - 57,498 of the 57,519 differing
classifications, or 99.96%.</p>
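<p>(Concretely, "different but consistent" means that the shorter lineage is a prefix of the longer one - the two classifications agree at every rank they both assign. This isn't how my actual comparison script is written, but the check boils down to something like this, with lineages as tuples ordered superkingdom to species:)</p>
<div class="highlight"><pre><span></span><code>def compare_lineages(sourmash_lin, gtdb_lin):
    if sourmash_lin == gtdb_lin:
        return 'same'
    n = min(len(sourmash_lin), len(gtdb_lin))
    if sourmash_lin[:n] == gtdb_lin[:n]:
        # lower resolution, but the same taxonomy as far as it goes
        return 'different but consistent'
    return 'inconsistent'
</code></pre></div>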
<p>What about the disagreements?</p>
<div class="highlight"><pre><span></span><code><span class="n">inconsistent</span><span class="o">:</span><span class="w"> </span><span class="mi">21</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="n">superkingdom</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="n">phylum</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="kd">class</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="n">order</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="n">genus</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">21</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="n">species</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
</code></pre></div>
<p>So all 21 of the disagreements are at the genus level... whew.</p>
<p>The upshot is that <code>sourmash lca classify</code> seems to work pretty well
as a first-round classification system, and will only lead you astray
at the genus level (and even then only rarely). The species-level
accuracy could potentially be improved by using k=51 instead of k=31,
but that would probably decrease the number being identified, too.</p>
<h2>Comparing <code>sourmash lca classify</code> to GTDB-Tk</h2>
<p>The next question I had was: how does sourmash's computational
performance compare with the
<a href="https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz848/5626182">GTDB-Tk toolkit</a>,
the standard way to classify new genomes using the GTDB taxonomy?</p>
<p>Using the sourmash k=21 LCA database (https://osf.io/9d5rx/), I
analyzed 336 randomly chosen genbank genomes with both sourmash and
GTDB-Tk. As with the full comparison above, the results are pretty
comparable:</p>
<ul>
<li>if sourmash lca classify yields a species-level designation, it is
identical to what GTDB-Tk produces.</li>
<li>at k=21, sourmash lca classify will never disagree with GTDB-Tk. At
worst, it will fail to classify out to the species, genus, etc. level.</li>
</ul>
<p>But how did the compute compare?</p>
<p>sourmash lca classify takes about 2 minutes to compute the signatures
and 35 seconds to classify 336 signatures. GTDB-Tk takes about 2 hours
on the same genomes, using 8 threads.</p>
<p>sourmash lca classify used about 5 GB of RAM, compared to about 120 GB
of RAM for GTDB-Tk.</p>
<h2>Conclusions</h2>
<p>So, it seems like sourmash lca classify is a decent prefilter for
GTDB-Tk, and that if you need to classify a lot of genomes quickly,
you could start with sourmash and then use GTDB-Tk to focus in on the
ones that aren’t classified at the species level.</p>
<p>In summary,</p>
<ol>
<li>sourmash rarely disagrees with GTDB-Tk, and when it does, it's only
at the genus level.</li>
<li>sourmash often fails to classify genomes that GTDB-Tk does.</li>
<li>sourmash is faster and requires less memory than GTDB-Tk. Compute
efficiency is admittedly a focus of our project, so ...that's good?
:)</li>
</ol>
<p>Special thanks go to Taylor Reiter for suggesting that we look into
the GTDB taxonomy for sourmash, and Donovan Parks for corresponding with
me on various GTDB issues!</p>
<p>--titus</p>
<p>p.s. Here's the sourmash command I used to classify genomes:</p>
<p><code>sourmash compute -k 21,31,51 --scaled=1000 *.fna.gz
sourmash lca classify \
--query *.sig \
--db gtdb-release89-k31.lca.json.gz > lca-classify-all-k31.txt</code></p>
<p>p.p.s. To do the comparison, I ran our <a href="https://github.com/dib-lab/2019-sourmash-gtdb/blob/master/bulk-classify-sbt-with-lca.py">sourmash bulk classify</a> script and then <a href="https://github.com/dib-lab/2019-sourmash-gtdb/blob/master/bulk-csv-to-lineages-csv.py">converted the results into a lineage CSV</a>. I separately <a href="https://github.com/dib-lab/sourmash_databases/blob/master/translate_gtdb_gb_foo.py">converted the GTDB taxonomy file</a> to a lineage CSV, and then <a href="https://github.com/dib-lab/2019-sourmash-gtdb/blob/master/compare-bulk-lca-to-gtdb-entire.py">compared the two</a>. Do not try this at home, the scripts are ugly and require a lot of data that's only on our HPC at the moment :)</p>Sourmash LCA databases now available for the GTDB taxonomy2019-12-30T00:00:00+01:002019-12-30T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2019-12-30:/blog/2019-sourmash-lca-db-gtdb.html<p>GTDB databases!</p><p>I am happy to announce that we have made available prepared sourmash
taxonomy ("LCA") databases for release 89 of the
<a href="https://www.biorxiv.org/content/10.1101/256800v2">GTDB taxonomy</a>.</p>
<p>The databases are available for download from the Open Science
Framework in <a href="https://osf.io/wxf9z/">this project</a>. There are prepared
databases available for k=21, k=31, and k=51.</p>
<h2>What is the GTDB taxonomy?</h2>
<p>GTDB is a revised bacterial and archaeal taxonomy based on
phylogenetic relations between proteins from approximately 25k
genomes. You can read more about it
<a href="https://www.biorxiv.org/content/10.1101/256800v2">here</a>.</p>
<p>GTDB is an alternative to the NCBI taxonomy. It is used by (among
others) <a href="https://www.ebi.ac.uk/metagenomics/">MGnify</a>, the EBI
metagenomics resource.</p>
<h2>What is sourmash?</h2>
<p>Sourmash is a research platform and bioinformatics tool for searching
and analyzing genomes, based on a
<a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x">MinHash</a>-inspired
approach that allows genome similarity searches, genome containment
searches, and compositional analysis of k-mers in large sequence data
sets. You can read more about it
<a href="https://f1000research.com/articles/8-1006">here</a>.</p>
<h2>What do these databases let you do?</h2>
<p>There are three immediate uses for these databases:</p>
<ul>
<li>
<p>you can use the
<a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-lca-classify"><code>sourmash lca classify</code></a>
routine (and other LCA commands) to do taxonomic classification of
genomes using the GTDB taxonomy. (See <a href="https://sourmash.readthedocs.io/en/latest/tutorials-lca.html">our tutorial on sourmash lca!</a>)</p>
</li>
<li>
<p>you can do compositional analysis of metagenomes using
<a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-lca-summarize"><code>sourmash lca summarize</code></a>.</p>
</li>
<li>
<p>you can search for genomes in GTDB that are similar to genomes (or
metagenomes) of interest, using
<a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-search"><code>sourmash search</code></a>
and
<a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-gather"><code>sourmash gather</code></a>.</p>
</li>
</ul>
<h2>How much memory does sourmash need to use these databases?</h2>
<p>LCA databases take up less disk space than SBT databases, but are more
memory intensive. Using these databases requires about 5 GB of RAM.</p>
<p>--titus</p>
<h2>Appendix: How are these databases built?</h2>
<p>We use a fully automated snakemake workflow to build them,
<a href="https://github.com/dib-lab/sourmash_databases/tree/master/gtdb">here</a>. It
takes about 12 hours and under 100 GB of RAM to build the databases from the
genomes under <code>release89/fastani/database/</code>.</p>An initial report on the Common Fund Data Ecosystem2019-08-15T00:00:00+02:002019-08-15T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-08-15:/blog/2019-cfde-july-report.html<p>Helping DCCs get FAIRer</p><p>For the past 6 months or so, I've been working with a team of people on a project called the Common Fund Data Ecosystem. This is a targeted effort within the <a href="https://commonfund.nih.gov">NIH Common Fund</a> (CF) to improve the Findability, Accessibility, Interoperability, and Reusability - a.k.a. <a href="https://www.nature.com/articles/sdata201618">"FAIRness"</a> - of the data sets hosted by their Data Coordinating Centers.</p>
<p>(You can see <a href="https://dpcpsi.nih.gov/sites/default/files/CoC_May_2019_2.00PM_Data_Ecosystem_508.pdf">Dr. Vivien Bonazzi's presentation</a> if you're interested in more details on the background motivation of this project.)</p>
<p>I'm thrilled to announce that our first report is <a href="https://figshare.com/articles/2019-July_CFDE_AssessmentReport_pdf/9588374">now available!</a> This is the product of a tremendous data gathering effort (by many people), four interviews, and an ensuing distillation and writing effort with Owen White and Amanda Charbonneau. To quote,</p>
<blockquote>
<p>This assessment was generated from a combination of systematic review of online materials, in-person site visits to the Genotype Tissue Expression (GTEx) DCC and Kids First, and online interviews with Library of Integrated Network-Based Cellular Signatures (LINCS) and Human Microbiome Project (HMP) DCCs. Comprehensive reports of the site visits and online interviews are available in the appendices. We summarize the results within the body of the report.</p>
</blockquote>
<p>The executive summary is just under four pages, and the full report is about 30 - the bulk of the report document (another 100 pages or so) consists of appendices to the main report.</p>
<p>I wanted to highlight a few things about the report in particular.</p>
<h2>1. Putting your data in the cloud ...is just the start.</h2>
<p>This may be obvious to those of us in the weeds, but supporting long-term availability of data through the use of cloud hosting is only one of many steps. Indexing of (meta)data, auth and access, and a host of other issues are all important to spur actual data reuse.</p>
<h2>2. Just, like, talking with people is, y'know, really useful!</h2>
<p>We did a lot of interviewing and found out some surprising things! In partial reaction to <a href="http://ivory.idyll.org/blog/2019-nih-data-commons-update.html">our experience with the Data Commons</a>, we are taking a much lower key and more ethnographic approach to understanding the opportunities and challenges that <em>actually</em> exist on the ground. A lot of the good stuff in the report emerged from these interviews.</p>
<h2>3. Interoperability is contingent on the data sets (and processing pipelines) you're talking about.</h2>
<p>The I in FAIR stands for "Interoperability", and (at least in the context of the CFDE) this is probably the trickiest to measure and evaluate. Why?</p>
<p>Suppose, not-so-hypothetically, that you want to take some data from the GTEx human tissue RNAseq collection, and compare the expression of genes in that data with some data from the Kids First datasets.</p>
<p>At some basic level, you might think "RNAseq is RNAseq, surely you just grab both data sets and go for it", right?</p>
<p>Not so fast!</p>
<p>First, you need to make sure that the raw data is comparable - not all RNAseq can be compared, at least not without removing technical biases. (And I'm honestly not sure what the state of the art is around comparing different protocols, e.g. strand-specific RNAseq to generic RNAseq.)</p>
<p>Second, the processing pipeline used to analyze the
RNAseq data needs to be the same. Practically speaking
this means that you may need to reanalyze all of the raw data.</p>
<p>Third, you need to deal with batch effects. I'm again not actually sure how you do this on data from a variety of different studies.</p>
<p>Fourth, and more fundamental, you need to connect your sample metadata across the various studies so that you are comparing apples to apples. (Spoiler alert: this turns out to be really hard, and seems to be the main conceptual barrier to actual widespread reuse of data across multiple studies.)</p>
<p>There are some techniques and perspectives being developed by various Common Fund DCCs that may help with this, and I hope to talk about them in a future blog post. But it's just hard.</p>
<h2>4. Computational training is second on everybody's list.</h2>
<p>This is something that I first saw when a group of us were talking with a bunch of NSF Science and Technology Centers (STCs): when asked what their challenges were, everyone said "in addition to our primary mission, computational training is really critical." (This broad realization by the STCs led to two funded NSF supplements that are part of Data Carpentry's back story!)</p>
<p>We saw the same thing here - a surprising result of our interviews was the extent to which the Common Fund Data Coordinating Centers felt that computational training could help foster data use and reuse. I say "surprising" not in the sense that it surprised me that training could be important - I've been banging that drum for well over a decade! - but that it was so high on everybody's list. We only had to mention it - "so, what role do you see for training?" - to have people at the DCCs jump on it enthusiastically!</p>
<p>There are many challenges with building training programs with the CF DCCs, but it seems likely that training will be a focus of the CFDE moving forward.</p>
<h2>What's next?</h2>
<p>This is only an interim report, and we've only interviewed four DCCs - we have another five to go. Expect to hear more!</p>
<p>--titus</p>
<p>Brown, C. T., Charbonneau, A., & White, O.. (2019, August 13). 2019-July_CFDE_AssessmentReport.pdf (Version 1). figshare. <a href="https://doi.org/10.6084/m9.figshare.9588374.v1">doi: 10.6084/m9.figshare.9588374.v1</a></p>Comparing two genome binnings quickly with sourmash2019-07-23T00:00:00+02:002019-07-23T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-07-23:/blog/2019-comparing-binnings.html<p>Comparing two sets of MAGs, for fun and profit!</p><p>tl;dr? Compare and cluster two collections of 1000+ metagenome-assembled genomes in a few minutes with sourmash!</p>
<hr>
<p>A week ago, someone e-mailed me with an interesting question: how can we compare two collections of genome bins with <a href="http://sourmash.rtfd.io">sourmash</a>?</p>
<p>Why would you want to do this? Well, there's lots of reasons! The main
one that caught my attention is <em>comparing</em> genomes extracted from
metagenomes via two different binning procedures - that's where I
started almost two years ago,
<a href="http://ivory.idyll.org/blog/2017-comparing-genomes-from-metagenomes.html">with two sets of bins extracted from the Tara ocean data</a>. You
might also want to merge bins that were similar to produce a
(hopefully) more complete bin, or you could intersect bins that were
similar to produce a consensus bin that might be higher quality, or
you could identify bins that were in one collection and not in the
other, to round out your collection.</p>
<p>I'm assuming this is done by lots of workflows - I note, for example,
that the <a href="https://github.com/bxlab/metaWRAP">metaWRAP</a> workflow
includes a 'bin refinement' step that must do something like this.</p>
<p>I (ahem) haven't really read up on what others do, because I was mostly
interested in hacking something together myself. So here goes :).</p>
<h2>How do you compare two collections of bins??</h2>
<p>There are a few different strategies. My previous attempts were --</p>
<ul>
<li>
<p><a href="http://ivory.idyll.org/blog/2017-comparing-genomes-from-metagenomes.html">comparing two directories in bulk</a>, focusing on summary statistics;</p>
</li>
<li>
<p><a href="http://ivory.idyll.org/blog/2017-taxonomic-disagreements-in-tara-mags.html">reclassifying each bin set with the taxonomy from the other bin set</a></p>
</li>
</ul>
<p>In both cases, my conclusions ended with "wow, there are some real differences
here" but I never dug deeply into what was going on in detail.</p>
<p>This time, though, I had a bit more experience under my belt and I
realized that a fairly simple thing to do would be to cluster <em>all</em> of
the bins together while tracking the origin of each bin, and then
deconvolving the clusters so that you could dig into each cluster at
arbitrary detail.</p>
<h2>The basic strategy</h2>
<ol>
<li>
<p>Load in two lists of sourmash signatures.</p>
</li>
<li>
<p>Compare them all.</p>
</li>
<li>
<p>Perform some kind of clustering on the all-by-all comparison.</p>
</li>
<li>
<p>Output clusters.</p>
</li>
</ol>
<p>Conveniently, I had already implemented the key bits in a Jupyter
notebook about a year ago
(<a href="https://github.com/ctb/2017-sourmash-cluster/blob/6ea9e2161fe72e6b7e4865070b66ac02a3dec373/species-clustering.ipynb">here</a>),
so it was ready to go! I turned it into a command-line script called
<a href="https://github.com/ctb/2017-sourmash-cluster/blob/afdde27619c36432ee5426c8032e2a785bc57755/cocluster.py">cocluster.py</a>
and tested it out; on data where I knew the answer, it performed fine, grouping
identical bins together and grouping or splitting strain variants depending
on the cut point for the dendrogram.</p>
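<p>The heart of the script is only a handful of lines. Here's a minimal sketch of the approach (not cocluster.py itself), assuming the sourmash Python API and scipy; <code>first_files</code> and <code>second_files</code> are hypothetical lists of signature paths, and the linkage method and cut point are illustrative choices:</p>
<div class="highlight"><pre><span></span><code>from collections import defaultdict
import numpy as np
import sourmash
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# load both collections, remembering where each signature came from
sigs = [(path, origin, sourmash.load_one_signature(path, ksize=31))
        for origin, paths in [('first', first_files), ('second', second_files)]
        for path in paths]

# all-by-all distance matrix (1 - Jaccard similarity)
n = len(sigs)
dmat = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1.0 - sigs[i][2].similarity(sigs[j][2])
        dmat[i, j] = dmat[j, i] = d

# hierarchical clustering; cut the dendrogram at a chosen distance
Z = linkage(squareform(dmat), method='single')
labels = fcluster(Z, t=1.0, criterion='distance')

# deconvolve: group members by cluster, tracking each signature's origin
clusters = defaultdict(list)
for (path, origin, _), label in zip(sigs, labels):
    clusters[label].append((origin, path))
</code></pre></div>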
<p>You do have to run it on collections of already-computed signatures;
an example command line for cocluster.py is:</p>
<div class="highlight"><pre><span></span><code>cocluster.py --first podar-ref/?.fa.sig --second podar-ref/*.fa.sig -k 31
</code></pre></div>
<p>This version outputs a dendrogram showing the clustering, as well as a
spreadsheet containing the cluster assignments.</p>
<h2>Speeding it up</h2>
<p>The problem is, it's kind of slow for big data sets where you have to do millions of comparisons!</p>
<p>Since comparing N signatures against N signatures is inherently an N**2
problem, any work we can put into filtering out signatures at the front
end of the analysis will be paid back in serious coin.</p>
<p>So, I <a href="https://github.com/ctb/2017-sourmash-cluster/commit/114fabc0275e4f9deb55bfb0e4add3bd5860035b#diff-250fb0b09f82971f54bb26c454679ed7">added two optimizations</a>.</p>
<p>First, you can now pass in a <code>--threshold</code> argument that specifies, in
basepairs, roughly how many bp need to be shared by a signature from
the first list with <em>any</em> of the signatures in the second list. If this
threshold isn't met, the signature from the first list is dropped. The
same filter is then applied to each signature in the second list, with
respect to the first list.</p>
<p>Second, you can now downsample the signatures by specifying a
<code>--scaled</code> parameter. (Read more about this <a href="https://sourmash.readthedocs.io/en/latest/using-sourmash-a-guide.html#what-resolution-should-my-signatures-be-how-should-i-compute-them">here</a>.) The logic here is that if you're comparing
genomes, you probably don't really need to look at a high resolution to
get a rough estimate of what's going on. This optimization speeds up every
comparison done.</p>
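<p>A rough sketch of that first optimization (one direction only), assuming each signature's MinHash exposes <code>count_common()</code> and a <code>scaled</code> attribute:</p>
<div class="highlight"><pre><span></span><code>def prefilter(first_sigs, second_sigs, threshold_bp):
    # keep a signature only if it shares an estimated threshold_bp or
    # more with at least one signature from the other list
    kept = []
    for s in first_sigs:
        mh = s.minhash
        # estimated shared bp = shared hashes * downsampling rate
        best = max(mh.count_common(t.minhash) * mh.scaled
                   for t in second_sigs)
        if best >= threshold_bp:
            kept.append(s)
    return kept
</code></pre></div>
<p>Run symmetrically on both lists, this cheaply discards signatures that can't possibly share enough sequence with anything on the other side.</p>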
<p>Together, this made it straightforward to apply this stuff to scads of
genomes!</p>
<h2>More/better output</h2>
<p>Last but not least, I <a href="https://github.com/ctb/2017-sourmash-cluster/blob/1b8095722c890f3a43cd50ad40ab1da5717fb2c3/cocluster.py">updated the script</a> to output clusters, and provide summary output too!</p>
<h2>An example!</h2>
<p>Here is an annotated example of the complete workflow - this is done on the reference genome data set from <a href="https://www.ncbi.nlm.nih.gov/pubmed/23387867">Shakya et al., 2013</a>, which we updated in <a href="https://www.biorxiv.org/content/10.1101/155358v3">Awad et al., 2017</a>. This genome collection contains 64 genomes, some of which are strain variants of each other.</p>
<p>Briefly, after computing signatures, cocluster.py
calculates an all-by-all comparison of the two input collections, which results in a matrix like this (not currently output by cocluster.py) --</p>
<p><img alt="comparison matrix" src="https://raw.githubusercontent.com/ctb/2017-sourmash-cluster/master/podar-coclust/podar.cmp.matrix.png"></p>
<p>The dendrogram is then cut at some given phenetic distance - in this case I chose 1.8, based on
visual inspection of this next dendrogram:</p>
<p><img alt="dendrogram annotated with distances" src="https://raw.githubusercontent.com/ctb/2017-sourmash-cluster/master/podar-coclust/podar.coclust.dendro.png"></p>
<p>The cocluster.py script then outputs <a href="https://github.com/ctb/2017-sourmash-cluster/blob/master/podar-coclust/podar.coclust.csv">a cluster details CSV file</a> that lists all of the clusters and their members. (The clustered signatures themselves are also provided, along with singletons.)</p>
<p>And, finally, all of this activity is <a href="https://github.com/ctb/2017-sourmash-cluster/blob/master/podar-coclust/podar.coclust.log">logged</a> and <a href="https://github.com/ctb/2017-sourmash-cluster/blob/master/podar-coclust/podar.coclust.txt">summarized in the results output</a>:</p>
<div class="highlight"><pre><span></span><code>...
total clusters: 60
num 1:1 pairs: 56
num singletons in first: 0
num singletons in second: 0
num multi-sig clusters w/only first: 0
num multi-sig clusters w/only second: 0
num multi-sig clusters mixed: 4
</code></pre></div>
<p>The full set of commands is <a href="https://github.com/ctb/2017-sourmash-cluster/blob/master/podar-coclust/Snakefile">listed in this Snakefile</a>, and commands to repeat it are in the appendix below.</p>
<h2>Playing with real data</h2>
<p>Since both the Tully et al. and the Delmont et al. papers have been
published now, I first re-downloaded the published data and calculated
all the signatures for the 3500 or so genomes -- see the instructions
and <a href="https://github.com/ctb/2019-tara-binning2/blob/master/Snakefile">Snakefile</a> in <a href="https://github.com/ctb/2019-tara-binning2/">github.com/ctb/2019-tara-binning2/</a>.</p>
<p>Once downloaded, computing the signatures takes about 15 minutes, using
<code>snakemake -j 16</code>.</p>
<p>Then, I ran the cocluster script from https://github.com/ctb/2017-sourmash-cluster like so:</p>
<div class="highlight"><pre><span></span><code>./2017-sourmash-cluster/cocluster.py --threshold=50000 -k 31 \
--first ../data/tara/tara-tully/*.sig \
--second ../data/tara/tara-delmont/NON_REDUNDANT_MAGs/*.sig \
--prefix=tara.coclust --cut-point=1.0
</code></pre></div>
<p>This took about 2 minutes to run on my HPC cluster, and produced the
following output with a cut point of 1.0 (which is pretty liberal).</p>
<div class="highlight"><pre><span></span><code>...
total clusters: 2838
num 1:1 pairs: 331
num singletons in first: 1886
num singletons in second: 443
num multi-sig clusters w/only first: 42
num multi-sig clusters w/only second: 4
num multi-sig clusters mixed: 132
</code></pre></div>
<p>When I re-run it with a more stringent cut-point of 0.1, I get:</p>
<div class="highlight"><pre><span></span><code><span class="c">% ./2017-sourmash-cluster/cocluster.py --threshold=50000 -k 31 \</span>
<span class="w"> </span><span class="o">--</span><span class="n">first</span><span class="w"> </span><span class="p">.</span><span class="o">./</span><span class="n">data</span><span class="o">/</span><span class="n">tara</span><span class="o">/</span><span class="n">tara</span><span class="o">-</span><span class="n">tully</span><span class="o">/*</span><span class="p">.</span><span class="n">sig</span><span class="w"> </span><span class="o">\</span>
<span class="w"> </span><span class="o">--</span><span class="nb">second</span><span class="w"> </span><span class="p">.</span><span class="o">./</span><span class="n">data</span><span class="o">/</span><span class="n">tara</span><span class="o">/</span><span class="n">tara</span><span class="o">-</span><span class="n">delmont</span><span class="o">/</span><span class="n">NON_REDUNDANT_MAGs</span><span class="o">/*</span><span class="p">.</span><span class="n">sig</span><span class="w"> </span><span class="o">\</span>
<span class="w"> </span><span class="o">--</span><span class="n">prefix</span><span class="p">=</span><span class="n">tara</span><span class="p">.</span><span class="n">coclust</span><span class="w"> </span><span class="o">--</span><span class="n">cut</span><span class="o">-</span><span class="n">point</span><span class="p">=</span><span class="mf">0.1</span>
<span class="k">...</span>
<span class="n">total</span><span class="w"> </span><span class="s">clusters:</span><span class="w"> </span><span class="s">3520</span>
<span class="n">num</span><span class="w"> </span><span class="s">1:1</span><span class="w"> </span><span class="s">pairs:</span><span class="w"> </span><span class="s">43</span>
<span class="n">num</span><span class="w"> </span><span class="s">singletons</span><span class="w"> </span><span class="s">in</span><span class="w"> </span><span class="s">first:</span><span class="w"> </span><span class="s">2557</span>
<span class="n">num</span><span class="w"> </span><span class="s">singletons</span><span class="w"> </span><span class="s">in</span><span class="w"> </span><span class="s">second:</span><span class="w"> </span><span class="s">906</span>
<span class="n">num</span><span class="w"> </span><span class="s">multi-sig</span><span class="w"> </span><span class="s">clusters</span><span class="w"> </span><span class="s">w/only</span><span class="w"> </span><span class="s">first:</span><span class="w"> </span><span class="s">6</span>
<span class="n">num</span><span class="w"> </span><span class="s">multi-sig</span><span class="w"> </span><span class="s">clusters</span><span class="w"> </span><span class="s">w/only</span><span class="w"> </span><span class="s">second:</span><span class="w"> </span><span class="s">0</span>
<span class="n">num</span><span class="w"> </span><span class="s">multi-sig</span><span class="w"> </span><span class="s">clusters</span><span class="w"> </span><span class="s">mixed:</span><span class="w"> </span><span class="s">8</span>
</code></pre></div>
<p>Basically this means that:</p>
<ul>
<li>when doing stringent clustering, there are 3520 different clusters;</li>
<li>43 of the clusters provide a 1-1 match between bins from the Delmont and Tully studies;</li>
<li>2557 of the Tully signatures don't cluster with anything else;</li>
<li>906 of the Delmont signatures don't cluster with anything else;</li>
<li>there are 6 clusters that contain more than one Tully signature, and no Delmont signatures;</li>
<li>there are 0 clusters that contain more than one Delmont signature, and no Tully signatures;</li>
<li>8 of the clusters have more than two signatures and contain at least
one Tully and at least one Delmont signature.</li>
</ul>
<p>I'll dig into some of these results in a separate blog post!</p>
<p>--titus</p>
<h2>Appendix: repeating the podar analysis</h2>
<p>This workflow will take about 1 minute to run, once the software is installed.</p>
<p>To repeat the analysis of 64 genomes above (see <a href="https://github.com/ctb/2017-sourmash-cluster/tree/master/podar-coclust">output</a>), do the following.</p>
<div class="highlight"><pre><span></span><code># create a new conda environment w/python 3.7
conda create -y -c bioconda -p /tmp/podar-coclust \
python=3.7.3 sourmash snakemake
# activate conda environment
conda activate /tmp/podar-coclust
# grab the cocluster script and podar workflow
git clone https://github.com/ctb/2017-sourmash-cluster/
cd 2017-sourmash-cluster/podar-coclust
# clean out the existing files & run!
snakemake clean
snakemake -j 4 -p all
</code></pre></div>
<p>This last step will download the necessary files, compute the signatures, and run cocluster.py.</p>How to encourage participation in teleconferences2019-06-24T00:00:00+02:002019-06-24T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-06-24:/blog/2019-encourage-participation-teleconferences.html<p>Participation is good!</p><p>(and/or how to run effective teleconferences!)</p>
<p>I participate in a lot of teleconferences, and some of them aren't very participatory, for various reasons. Recently a good friend asked for suggestions on how to open up the phone calls, and I came up with the below ideas. What am I missing? What did I get wrong?</p>
<hr>
<p>First, post a meeting agenda with a medium amount of detail, well in advance (> 24 hours).</p>
<ul>
<li>Posting an agenda in advance gives people time to think about things, if they are interested.</li>
<li>The medium amount of detail (up to a paragraph) lets people understand what it’s about, see what the major issues/questions are, and think of questions or comments they may have.</li>
<li>If the agenda is posted > 24 hours in advance, you can reasonably expect people to have read it, and if people want to add things to the agenda <em>on the call</em> you punt them to the next call instead.</li>
</ul>
<p>Basically, if you spring a skeleton agenda on a group with < 3 hours to spare, no one will read it, and even when they do, they won't have room to dig into it.</p>
<hr>
<p>Second, assign duties to multiple people and rotate.</p>
<ul>
<li>Typical meetings need a timekeeper (keeping an eye on the agenda), a facilitator (keeping conversation moving), and a note taker (recording notes and action items). </li>
<li>Assigning these roles is less about authority and more about making sure someone has been given the responsibility.</li>
<li>It also means that at least three different “voices” are heard - two on the call, one in the notes - each time.</li>
<li>Rotating means that you’re not giving someone permanent authority, and also ensures that if someone isn't good at or dislikes one role, they’re not stuck on it. Nor do they necessarily escape practicing :)</li>
<li>Rotating also means that the convener or nominal authority is not always the person driving the conversation.</li>
<li>Having these roles means that at least three people will be engaged in the conversation, even if nobody else is :)</li>
</ul>
<hr>
<p>Third, pause after questions until the silence becomes slightly uncomfortable before proceeding.</p>
<ul>
<li>People who are hesitant to speak will need the time to come forward.</li>
</ul>
<p>(This is an approach that was taught to me during interview training at UC Davis!)</p>
<hr>
<p>Fourth, provide a respectful way for people to indicate they are ready to speak.</p>
<ul>
<li>e.g. type “hand” in chat, or Raise Hand in zoom.</li>
<li>this means it’s not just “first to interrupt” that gets to speak, which biases towards certain types of personalities;</li>
<li>institute a rule that only the facilitator and timekeeper get to interrupt without a ‘hand’ (or maybe not even them).</li>
<li>encourage people to post low-key/non-urgent questions in the chat.</li>
</ul>
<hr>
<p>You can also circulate a set of rules and suggestions for how to participate effectively. Belinda Weaver wrote up <a href="https://software-carpentry.org/blog/2017/11/online-meetings.html">this really great list</a> from the Carpentries - it's a great starting point!</p>
<p>What am I missing? What did I get wrong?</p>
<p>--titus</p>Using GitHub for janky project reporting - some code2019-05-15T00:00:00+02:002019-05-15T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-05-15:/blog/2019-github-project-reporting.html<p>We scripted GitHub for lightweight project reporting</p><p>For the <a href="http://ivory.idyll.org/blog/2019-nih-data-commons-update.html">NIH Data Commons</a>, we needed a way for 10 distinct teams to do reporting at the level of about 50-100 milestones per team, on a monthly basis.</p>
<p>Each team was already using different project management software internally, and we didn't want to require them to switch to something new. We also didn't need a lot of innate functionality in the project reporting system - basically, for each milestone we needed two statuses, "started" and "finished".</p>
<p>So we decided to go with something lightweight and simple that would support programmatic update and automated reporting: GitHub!</p>
<p>We chose to use GitHub for project reporting for several reasons. We were already using GitHub for content stuff, and everyone had accounts. We were also using GitHub for authentication control on static Web sites via a Heroku app.</p>
<p>So what we did was use the <a href="https://pygithub.readthedocs.io/en/latest/index.html">PyGithub package</a> to write a script to take the project milestones (which were all in a spreadsheet) and load them into GitHub issues. There was a label for "this task has been started", and when complete, the issue was just closed.</p>
<p>Each issue had some metadata associated with it (this was basically regexp-friendly fields like "id: XYZ") that linked it back to information in the spreadsheet. Other metadata such as the team that "owned" the milestone was layered on with GitHub labels.</p>
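<p>The issue-creation side of that pattern looks roughly like the sketch below. To be clear, this is a minimal stand-in for <code>update-milestones.py</code>, not the actual script, and the CSV column names here are made up:</p>
<div class="highlight"><pre><span></span><code># Minimal sketch of the "milestones to issues" direction, using
# PyGithub. Not the actual update-milestones.py; the CSV columns
# (id, title, description, team) are hypothetical.
import csv
import os

from github import Github

gh = Github(os.environ["GITHUB_TOKEN"])
repo = gh.get_repo("ctb/example-milestones")   # use your own repo here

with open("example-milestones.csv") as fp:
    for row in csv.DictReader(fp):
        # machine-readable metadata goes in the issue body, so the
        # reporting script can recover it later with a regexp
        body = "id: {}\n\n{}".format(row["id"], row["description"])
        repo.create_issue(title=row["title"], body=body,
                          labels=["team-" + row["team"]])
</code></pre></div>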
<p>We then wrote another script that extracted the issue statuses and output a reporting spreadsheet that we could send to the NIH on a monthly basis.</p>
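<p>And the reporting direction looks roughly like this - again a sketch rather than the real <code>milestones-gh-to-csv.py</code>, and the "started" label name is an assumption:</p>
<div class="highlight"><pre><span></span><code># Minimal sketch of the "issues to report CSV" direction.
# Illustrative only; the "started" label name is an assumption.
import csv
import os
import re

from github import Github

gh = Github(os.environ["GITHUB_TOKEN"])
repo = gh.get_repo("ctb/example-milestones")

ID_RE = re.compile(r"^id: (\S+)", re.MULTILINE)

with open("report.csv", "w", newline="") as fp:
    w = csv.writer(fp)
    w.writerow(["milestone id", "status"])
    for issue in repo.get_issues(state="all"):
        match = ID_RE.search(issue.body or "")
        if not match:
            continue                 # skip issues without our metadata
        labels = {label.name for label in issue.labels}
        if issue.state == "closed":
            status = "finished"
        elif "started" in labels:
            status = "started"
        else:
            status = "not started"
        w.writerow([match.group(1), status])
</code></pre></div>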
<p>(Luiz Irber wrote the first version of the scripts as a proof of concept, and then I took over expansion and maintenance as our needs evolved.)</p>
<p>Using GitHub in this way had a number of advantages, some of which were unexpected.</p>
<ul>
<li>
<p>The main advantage was that the user interface for viewing and updating statuses was super easy. Finding issues could be done via GitHub search (and eventually via our project search engine, <a href="http://nih-data-commons.us/centillion/">centillion</a>). Permalinks could be bookmarked, too.</p>
</li>
<li>
<p>Linking between GitHub issues worked nicely: when you put a link from an issue in some other repo to a milestone, a back-link was automatically provided on GitHub.</p>
</li>
<li>
<p>The statuses of milestones were accessible to everyone, i.e. visible across the project.</p>
</li>
<li>
<p>People from any team could watch a milestone they were interested in.</p>
</li>
<li>
<p>Comments and questions could be posted on milestones, and (potentially) could be provided in the monthly rollup.</p>
</li>
<li>
<p>The GitHub Web and project interface went through churn during our project, but the issue API was not affected, so our scripts kept on working.</p>
</li>
<li>
<p>Unlike built-in GitHub projects functionality, this works easily across multiple repositories AND multiple organizations.</p>
</li>
</ul>
<h2>What if we had not used GitHub?</h2>
<p>Within the project, there was some pushback. Most of the pushback amounted to "but we are already using System X, can't we just use that?" But there was no consensus on what to use! Since it was all scriptable, we were expecting to write some status importers (but didn't need to within the first phase of the project). It would have been easy to auto-update issue labels using GitHub project management bridges (and I think at least one group did that without involving us).</p>
<p>GitHub enabled everyone to see each other's milestone statuses without having to give permissions beyond existing GitHub project memberships. I don't know how we would have done that another way.</p>
<p>Because we used a lightweight, informal format with some simple scripts, we could update reporting formats and details quickly. If we'd used a heavier-weight and/or closed-source system, we might have had to put more time into configuration and/or bug workarounds.</p>
<p>GitHub is pretty scriptable, which came in really handy for wonky status update situations, or custom reports. I'm not sure how scriptable and well documented other issue tracking software is.</p>
<h2>So where's the code?</h2>
<p>I've extracted the core code to <a href="https://github.com/ctb/2019-dcppc-bot">github.com/ctb/2019-dcppc-bot</a>, and made a small running example!</p>
<p>There are two scripts, <a href="https://github.com/ctb/2019-dcppc-bot/blob/master/update-milestones.py"><code>update-milestones.py</code></a> and <a href="https://github.com/ctb/2019-dcppc-bot/blob/master/milestones-gh-to-csv.py"><code>milestones-gh-to-csv.py</code></a>. The first script parsed the big CSV file full of milestones, and updated the GitHub issues from it. The second script exports the GitHub issues and statuses for monthly reporting.</p>
<p><a href="https://github.com/ctb/2019-dcppc-bot/blob/master/update-milestones.py#L45"><code>create_issue_body_milestone</code></a> is what created / updated the actual issues.</p>
<p><a href="https://github.com/ctb/2019-dcppc-bot/blob/master/milestones-gh-to-csv.py#L125"><code>extract_report</code></a> built the milestone output reports, which were then output in the <a href="https://github.com/ctb/2019-dcppc-bot/blob/master/milestones-gh-to-csv.py#L196"><code>main</code> function</a>.</p>
<h2>Running stuff</h2>
<p>Create a token by going to GitHub settings, Developer Settings, Personal Access Tokens, Generate New Token.</p>
<p>Copy / paste the string into an environment variable (you'll need to replace the hex string with your own token).</p>
<div class="highlight"><pre><span></span><code><span class="k">export</span><span class="w"> </span><span class="n">GITHUB_TOKEN</span><span class="o">=</span><span class="s1">'a6161b3288894522b8930b67231d833295e7d5ba'</span>
</code></pre></div>
<p>Check that the token works and the repo exists (you'll want to replace <code>ctb/example-milestones</code> with a repository you have write access to!)</p>
<div class="highlight"><pre><span></span><code>./update-milestones.py update example-milestones.csv -f -vv -m ctb/example-milestones
</code></pre></div>
<p>Actually create the issues now, by parsing the <a href="https://github.com/ctb/2019-dcppc-bot/blob/master/example-milestones.csv"><code>example-milestones.csv</code> file</a></p>
<div class="highlight"><pre><span></span><code>./update-milestones.py update example-milestones.csv -f -vv -m ctb/example-milestones --change-github
</code></pre></div>
<p>This will create and/or update issues, e.g. like <a href="https://github.com/ctb/example-milestones/issues/1">ctb/example-milestones #1</a>.</p>
<p>Now, run a report:</p>
<div class="highlight"><pre><span></span><code>./milestones-gh-to-csv.py -m ctb/example-milestones example-milestones.csv
</code></pre></div>
<p>This will generate reports by team, e.g. <a href="https://github.com/ctb/2019-dcppc-bot/blob/master/report-team-White.csv"><code>report-team-White.csv</code></a> and <a href="https://github.com/ctb/2019-dcppc-bot/blob/master/report-team-Brown.csv"><code>report-team-Brown.csv</code></a>.</p>
<p>You can see the final set of issues <a href="https://github.com/ctb/example-milestones/issues?utf8=%E2%9C%93&q=is%3Aissue">here</a>.</p>
<h2>Was this a good idea?</h2>
<p>The project only ran for ~6 months in the end, and I would argue that scripting our own solution was a good investment of time and effort because of the flexibility it gave us. In particular, it let us iterate and converge on an approach that met the needs of the funders without unduly burdening the project managers.</p>
<p>In the long term, we might have tried to identify commercial software that had built-in visualization and exploration functionality. But I wouldn't have wanted to do that on the timeline we had for phase 1.</p>
<p>The code was hideous because it was all done really fast at the last minute before the first reporting period. Changes were done carefully, mostly by me, because I was the one who would suffer the most if we screwed up. If we'd brought our infrastructure engineer into the project earlier, I probably would have asked him to put the time into unit testing and so on, but the code was working well enough for us to just leave it be.</p>
<p>The general idea of using GitHub issues to surface milestone statuses across multiple teams and integrate with individual project trackers is pretty nice and open-sourcey.</p>
<p>The existing code ignores issues without metadata. So while we did not do this, you could salt "issues for reporting" into an existing repository full of issues, and extract info from just the reporting issues just fine.</p>
<p>So: in this case a quick hack worked out ok, and I'm not ashamed of it.</p>
<p>And maybe there are now better ways to do all this with GitHub Projects, but there weren't then :)</p>
<p>Last but not least: you should always be wary of writing code so that you can write code. Before you know it, maintaining your project management system could become someone's full time job... #yakshaving</p>
<p>--titus</p>
<p>Thanks to Luiz Irber for starting the project, and Charles Reid, Matthew Turk, and Tracy Teal for comments on a draft of this post!</p>Some questions and thoughts on journal peer review.2019-04-16T00:00:00+02:002019-04-16T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-04-16:/blog/2019-questions-and-thoughts-peer-review.html<p>What's up with current peer review practice?</p><h2>Can I use comments from other people's prior reviews when reviewing a submission to a new journal?</h2>
<p>I just had the dispiriting experience of receiving a paper to review from Journal B, that was unchanged from a prior submission to Journal A. The "dispiriting" part of the experience was that the paper was <em>completely</em> unchanged, despite a host of minor and major comments on the paper from all three reviewers for Journal A.</p>
<p>I ended up writing that I was disappointed that the authors had not seen fit to confront the bigger issues in any way, much less correct even the smallest and easiest errors; and then pasted in my previous review. What I <em>wanted</em> to do was paste in the expert reviews from the other two reviewers for Journal A, but I didn't feel like that was OK.</p>
<p>(If I get the paper back with some revisions, I'll reevaluate it in light of the Journal A reviews, too.)</p>
<p>I think the behavior of the authors is very questionable, too, and I hope they rethink this strategy. If your paper is desk-rejected by a hoity-toity journal without review, that's one thing; if reviewers put in hours of effort and give you detailed comments, you goshdarn well should put in an hour or two of your own time before resubmitting.</p>
<h2>Why don't all journals always send all the reviews to all reviewers?</h2>
<p>David Koslicki visited my lab yesterday, and I was reminded of the mash and MetaPalette situation from a few years back. Briefly:</p>
<p>I was a reviewer on both the mash paper (Ondov et al. (2016)) and the MetaPalette paper (Koslicki and Falush (2016)) and in my final review of MetaPalette I mentioned the mash paper enthusiastically. (Both were already up on biorxiv.)</p>
<p>At some point later on I sent David an e-mail to follow up on some suggestions I'd had, and we realized that he'd never received the text from my review of MetaPalette. He later told me that he thought that receiving my comments would have accelerated his research by a few months, by pointing him at a new area.</p>
<p>So why didn't mSystems send him the review text?!</p>
<p>(There are plenty of journals that are guilty of this.
Nature Biotech is one that I've noted in the past.)</p>
<h2>Isn't it irresponsible not to make some portion of the reviews public when the paper is published?</h2>
<p>Peer reviews often provide important context that can help people understand why the paper is important and interesting. It's fine and dandy to say that that should all be in the final paper, but that's a hard task and often papers are space constrained (...for some reason).</p>
<p>I think journals should make reviews public along with the article.</p>
<p>The biggest argument against this is that it might take some work by someone to properly adjust reviews for fixes from earlier versions. A short term fix might be to have a box for "this is the part of the review that I would like to make public if this paper is accepted".</p>
<h2>Why don't journals behave as if reviews belong to the reviewer?</h2>
<p>I no longer review for PNAS, because they started including a provision that I couldn't make any part of my review available in any form, even anonymously. I can understand that they don't want reputation laundering (e.g. <a href="http://ivory.idyll.org/blog/2013-review-howison.html">my previous behavior in posting reviews</a>, which boosts my own reputation while also being <a href="https://www.americanscientist.org/article/open-science-isnt-always-open-to-all-scientists">a sign of my own privilege</a>), but I see little harm in allowing it to be posted anonymously.</p>
<p>Journals sure are proprietary about work they didn't pay for. That's a bigger theme here, I guess :)</p>
<h2>There is no conclusion other than that peer review seems really broken.</h2>
<p>Anyway. Those are my ranty off the cuff comments for today.</p>
<p>--titus</p>Things to think about when developing shotgun metagenome classifiers2019-04-11T00:00:00+02:002019-04-11T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-04-11:/blog/2019-developing-metagenome-classifiers.html<p>Thoughts on goals and tradeoffs in classifying shotgun metagenome data.</p><p>So I was talking to someone about how we think about benchmarking and
developing <a href="https://sourmash.readthedocs.io">sourmash</a>, and then it
got long and kind of interesting, so I decided to write it up as a
blog post.</p>
<p>(I asked Luiz Irber for comments, and he had the best reaction ever:
"many feels, no time to write them, mostly agree, publish")</p>
<hr>
<p>When benchmarking, often people end up comparing <em>their</em> tool to tools developed to tackle different problems. To no big surprise, the first tool ends up winning out.</p>
<p>Here are questions that we asked ourselves, or decisions we made implicitly, when developing sourmash. Many of these have direct or indirect implications for benchmarking.</p>
<p><strong>Are we developing a library, a command line application, or a Web site?</strong> It's hard to do more than one at a time well. We've decided to focus on command line with sourmash, as a light wrapping around a Python library (which was a light wrapping around a C++ library, and will soon be a light wrapping around a Rust library). I think after 3 years we've reached a level of maturity where we could also support a Web site (but we don't really have the focus in the lab to do a good job of it, and would prefer to support someone else if they want to do it).</p>
<p><strong>How sensitive to coverage do we want to be?</strong> Phillip Brooks showed that sourmash is really specific and very sensitive, until you have fewer than (approximately) 10,000 reads from your genome of interest. Once your data set has fewer than 10,000 reads from a genome in it, we can't really detect that genome. (This is of course a tradeoff in terms of speed, underlying approach, database size, etc., and we're happy with that tradeoff.)</p>
<p><strong>Do we envision our tool being used in isolation, or as one part of an exploratory pipeline?</strong> We are firmly in the camp of using sourmash to do hypothesis generation, following which more compute intensive approaches are probably appropriate. For example, sourmash can tell you <em>which</em> known species are in your metagenome, but we haven't focused too much on assessing <em>how much of those species' genomes</em> are there - after all, that's (fairly) easy to do once you narrow down the list of possible genomes. And again, there are tradeoffs with many of the other design considerations below. But if we wanted to have a single software package that did everything we would design it differently (and it would be a lot harder, since you'd probably want to use multiple methods).</p>
<p><strong>Do we envision our tool being used by programmers?</strong> We really like having scriptable tools in our lab. That means the tool has decent command line behavior, has a high level Python API, and consumes and emits standard formats. This may not be what everyone wants to focus on though!</p>
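<p>As a taste of what "scriptable" means here, below is a tiny sketch using the MinHash class from the sourmash Python API - method names as I remember them from the 2.x-era API, so check the docs for your version:</p>
<div class="highlight"><pre><span></span><code># Tiny sketch of the high-level Python API mentioned above: build two
# MinHash sketches and estimate their similarity. API names as of the
# sourmash 2.x era -- check the docs for your version.
import sourmash

# scaled=1 keeps every k-mer hash, which is sensible for a toy string;
# real genomes typically use something like scaled=1000
mh1 = sourmash.MinHash(n=0, ksize=31, scaled=1)
mh2 = sourmash.MinHash(n=0, ksize=31, scaled=1)

mh1.add_sequence("ATGGCATTAACGCATCGGACTAGCATCGAGCTACGGCAT")
mh2.add_sequence("ATGGCATTAACGCATCGGACTAGCATCGAGCTACGGCAT")

print("estimated similarity:", mh1.similarity(mh2))
</code></pre></div>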
<p><strong>Do we care about speed?</strong> Premature optimization can make your codebase ugly and complex. We've chosen (for now) to instead go with a fairly simple code base, which we then test the bejeezus out of. It supports optimization (Luiz Irber has done some amazing things with a profiler :) but we are against trading simple code for speed, because this is a research platform.</p>
<p><strong>What are our desired memory, disk, and time performance metrics?</strong> Do we care about one over the other? In general, we have chosen to prioritize low memory over performance, and performance over low disk space. But this isn't clear cut, and depends a lot on what methods we find interesting and implementable.</p>
<p><strong>What's our desired database resolution?</strong> Do we want ALL the genomes? Or just some genomes? We made the decision with sourmash to go for ALL the genomes. This causes problems when you think about the next few questions...</p>
<p><strong>What's our desired taxonomic resolution?</strong> We implicitly settled on strain-level resolution as our goal for <code>sourmash gather</code>, largely because of the algorithm we chose. (It works quite well for that!) But, unsurprisingly, <code>sourmash gather</code> performs quite poorly when looking at organisms from novel genera and families. It's actually quite hard to do both well.</p>
<p><strong>Who updates the database?</strong> And is it easy and straightforward to build new databases, or not? We worked hard on a friendly and flexible database building toolchain, because we expect new genomes to come out on a (very) regular basis (and we wanted to include them in our databases, based on our desired resolution).</p>
<p><strong>Do we want to support private databases, or not?</strong> We really like the idea of people using our tool in the privacy of their own lab to search their own collections of genomes. This means that we need to forego certain requirements (e.g. an NCBI taxid).</p>
<p><strong>Do we want a big centralized database, or not?</strong> One of our big concerns about models for database distribution & update that require one massive database, that can only realistically be updated by one group, is that they tend to go stale over time (as the group loses interest, etc.) Maintenance is not the strong suit of academic researchers :). So Luiz Irber has been working on IPFS and Dat-based models for database decentralization. This will (soon) permit incremental database updates without massive database download, among other things.</p>
<p><strong>What's our publication model?</strong> Do we want others to use our software for cool things? Or are we trying to publish our own innovative methods that we try and get into high impact-factor journals? Are we building a platform for others to build their own tools? Are we playing around with different methods and ideas and so on? We're not particularly interested in high-impact factor journals for sourmash, and we have a surprising number of people just using it do their own thing, so we've opted for providing citation handles via JOSS and F1000Research.</p>
<p><strong>How do we decide what functionality belongs in sourmash?</strong> Did we have explicit use cases that we decided up front? Or do we <a href="https://github.com/dib-lab/sourmash/issues/208">discover them as we go</a>? I'm much more comfortable doing iterations, finding users, and waiting for inspiration to strike, than I am in planning out sourmash years in advance. But then again, I'm an academic researcher and this fits our needs; we're not trying to serve a particular community.</p>
<p><strong>What's our contribution model?</strong> Are you interested in supporting collaboration and community development? Or are you interested in limiting external contributions to potential use cases? We are interested in both, but it adds a certain level of chaos and coordination challenges to the situation.</p>
<p>--titus</p>Our submission for the NHGRI Human Genome Reference Center call2019-04-10T00:00:00+02:002019-04-10T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-04-10:/blog/2019-nih-hgrc-proposal.html<p>We wrote a big grant proposal!</p><p>For the past month, I've been consumed in writing and submitting a grant for the NHGRI <a href="https://grants.nih.gov/grants/guide/rfa-files/rfa-hg-19-004.html">Human Genome Reference Center</a> funding opportunity. This is a planned $12.5m / 5 year effort to coordinate the new <a href="https://www.genome.gov/pages/about/nachgr/september2018agendadocuments/sept2018council_hg_reference_program.pdf">Human Genome Reference Program</a> (also see <a href="https://www.genome.gov/pages/research/sequencing/humangenomereferenceprogram/hgrp_webinar_faq.pdf">the Frequently Asked Questions</a>).</p>
<p>We submitted this grant proposal a week ago Tuesday! I joined with three collaborators on this grant proposal: <a href="https://curoverse.com/">Curii</a>, <a href="https://jimb.stanford.edu/giab">Genome in a Bottle</a>, and <a href="http://arep.med.harvard.edu/">the Church Lab</a>. We also partnered with the Personal Genome Project and the Open Humans platform.</p>
<p>Since we're all open-science-y people, we agreed to make the grant public after submission. I was thinking about how best to present it in a blog post, but then I remembered that grants are supposed to stand on their own with respect to the RFA. So ...here it is, with only a little bit of organization to make it more approachable!</p>
<p>The HGRC call asked for what was in effect one R01 and two R21 grants, along with another R01-sized grant on top. The first R01 was Component 1, a 12 page section discussing how the center would maintain, improve, and provide the Human Genome reference. The first R21 was a 6 page Component 2, describing the community outreach plans of the center, to do training and gather feedback. The second R21 was the 6 page Component 3, describing the logistical coordination of the rest of the Human Genome Reference Program (running meetings, providing materials, etc.) And the last R01 was an overarching summary of the three components, 12 pages in length.</p>
<p>The end PDF submission was over 300 pages in length. Good fun...</p>
<p>One last comment before I provide the links: just like reading someone else's submitted dissertation, your sole responsibility in reading someone else's ALREADY SUBMITTED GRANT is to make nice noises, like "Hey, that's great congrats on submitting it!" and "There are some great ideas in there!" You don't say "ooh, look, a typo on p3! (How unprofessional! Sucks that you can't fix it now!)" or "Gosh I would have written that completely differently." Basically, you should just be nice - we're going to go through a NIH review panel experience, and I'm sure they'll be properly critical :)</p>
<h2>The Actual Grant</h2>
<p><a href="https://github.com/dib-lab/2019-nih-hgrc-grant/blob/master/HGRCOverall.ResearchStrategy.pdf">Research Strategy - OVERALL</a> - a high level overview of the thing.</p>
<p><a href="https://github.com/dib-lab/2019-nih-hgrc-grant/blob/master/HGRCProject1.ResearchStrategy.pdf">Research Strategy - Component 1: Maintain, Improve, Provide the Human Reference Genome</a> - check out our cool validation strategy with Genome-in-a-Bottle-like data sets!</p>
<p><a href="https://github.com/dib-lab/2019-nih-hgrc-grant/blob/master/HGRCProject2.ResearchStrategy.pdf">Research Strategy - Component 2: Do Community Outreach and Needs</a> - here we proposed not just doing outreach but also building a community of practice!</p>
<p><a href="https://github.com/dib-lab/2019-nih-hgrc-grant/blob/master/HGRCProject3.ResearchStrategy.pdf">Research Strategy - Component 3: Provide Logistical Coordination for Human Genome Reference Program</a> - here we added a standardization effort!</p>
<p>Enjoy!</p>
<p>--titus</p>News from the NIH Data Commons Pilot Phase Consortium2019-04-09T00:00:00+02:002019-04-09T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-04-09:/blog/2019-nih-data-commons-update.html<p>The NIH Data Commons Pilot Phase Consortium is dead! (Long live the NIH Data Commons!)</p><p>You may recall that about a year and a half ago, <a href="http://ivory.idyll.org/blog/2017-commonspilot-kickoff.html">I got involved</a> in the <a href="https://commonfund.nih.gov/commons">NIH Data Commons</a>.</p>
<p>Between then and now, we built <a href="https://public.nihdatacommons.us/ProjectExecutionPlan/">a project execution plan</a>, <a href="https://public.nihdatacommons.us/deliverables/">ran Phase 1 for six months</a>, and then in October took a planned work moratorium for the purpose of doing future planning.</p>
<p>Then, in February, we received word that the NIH Data Commons Pilot Phase Consortium (DCPPC) would not continue in its current form. Here's what we received:</p>
<blockquote>
<p>The NIH <a href="https://datascience.nih.gov/">Office of Data Science Strategy</a> has been asked to lead the next phase of trans-NIH data ecosystem development as described in the NIH <a href="https://datascience.nih.gov/strategicplan">Strategic Plan for Data Science</a>. The deliverables from the DCPPC will inform next steps, but we will not pursue a second phase of the DCPPC. New initiatives may emerge from the ODSS and/or from the ICs in response to the Strategic Plan, but they will communicate their plans as they are established.</p>
</blockquote>
<p>My award finished at the end of March, and I thought it would be a good time to update y'all (especially since I've been receiving questions!)</p>
<h2>What did the NIH Data Commons Pilot Phase Consortium achieve?</h2>
<p>I think we achieved quite a lot in our fairly short stint! (And there's a fair amount of public material that was made available as part of it, although it's not well advertised.)</p>
<p>I'm going to focus on things my team helped with, because that's what I know best. There were lots of technical prototypes as well, but those were produced by other teams and are not mine to discuss. (See <a href="https://public.nihdatacommons.us/deliverables/">the list of deliverables and their reviews for more info</a>. Happy to connect you to the authors if you're interested - drop me a line at ctbrown@ucdavis.edu.)</p>
<p>First off, here is the <a href="https://public.nihdatacommons.us/">top link to the public site that we created for the end of the first Pilot Phase</a>. There are links and documents in there that I continue to find useful, and expect to find useful for many years to come.</p>
<p>I'm particularly happy with how the <a href="http://nih-data-commons.us/use-case-library/">Use Case Library</a> effort was proceeding. I think we set a good path for collaboratively developing use cases for Phase 2, and even without a Phase 2 I will be making use of this approach and this material for other projects.</p>
<p>The Centillion search engine that my team built was pretty cool!! <a href="https://public.nihdatacommons.us/KC9_Centillion/">See the October writeup of it, here</a> and also the public GitHub page, <a href="https://github.com/dcppc/centillion/">here</a>.</p>
<p>The <a href="https://public.nihdatacommons.us/OnCommonsing/">"On Commonsing"</a> document we wrote up after a workshop on "Data Commonses" is something that I will be coming back to regularly!</p>
<p>People interested in pragmatic standards development might be interested in <a href="https://public.nihdatacommons.us/WhyAreMultipleStacksNecessary/">Why Multiple Stacks are Necessary</a>.</p>
<p>I continue to think the <a href="https://fairshake.cloud/">FAIRshake</a> portal is unreasonably cool... check out <a href="https://fairshake.cloud/project/">the projects</a>.</p>
<p>Personally, I learned a lot about interoperability and creating and growing community from this experience, and I think the same is true of most of the other participants. Completely apart from the technical and infrastructure efforts, the coordination and community aspects of this Pilot Phase seem likely to have long-term positive impacts on how many of us deal with these kinds of projects in the future.</p>
<h2>So what's next?</h2>
<p>I'm not sure!</p>
<p>I think it's fair to say that the problems the NIH Data Commons effort was tackling are not going away (you can see more about these problems in <a href="https://osf.io/58uef/">my talk slides</a> from <a href="https://www.dtls.nl/2018/08/27/c-titus-brown-keynote-speaker-at-dtl-communitieswork-2018/">my 2018 talk at the Dutch Techcentre for Life Sciences</a>). And the NIH and broader biomedical research community will certainly be working on many things in this area. And I may not be involved but I'm sure to have opinions. So, stay tuned!</p>
<p>--titus</p>Critically assessing open science - the CAOS meeting.2019-04-08T00:00:00+02:002019-04-08T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-04-08:/blog/2019-themes-caos-open-science.html<p>A summary of the CAOS open science meeting</p><p>The "Critical Assessment of Open Science" meeting, or CAOS, was
convened by Sage Bionetworks in New Orleans in early February. About
30 open science practitioners and advocates were invited by Sage to a
day long meeting in New Orleans to consider the last 10 years of
progress and failures in open science. The meeting was attended by
scientists, policy experts, funders, and others. While the emphasis
was on the biosciences, many themes were discussed in a broader
context of all of science.</p>
<p>You can read more about the motivation for the meeting, and see a
series of summary blog posts,
<a href="http://sagebionetworks.org/in-the-news/a-critical-assessment-of-open-science/">here</a>.</p>
<p><em>This</em> post is my attempt to summarize the entire meeting, based on notes
I took during the meeting.</p>
<hr>
<p>The meeting was organized in a series of
<a href="https://en.wikipedia.org/wiki/Call_and_response_(music)">"call and response"</a>
engagements, in which two participants "called" for 5 minutes to one
of five broad themes, and then a responder summarized, contextualized,
and responded to their call. There were multiple such calls &
responses in each session, for about 5 sessions. Audience
participation was lively!</p>
<p>The meeting was held under
<a href="https://www.chathamhouse.org/chatham-house-rule">Chatham House rules</a>,
so below I am reporting <em>my</em> takeaways without reference to specific
individual comments or revealing details. There should be some form of
publication output in the future so you can see who attended and get a
more global view of the meeting; I'll link to that below when it is
out.</p>
<p>Thank you to Sage Bionetworks for coordinating this meeting & inviting me!</p>
<h2>Main themes that emerged (for me)</h2>
<p>We hoped that open science would lead to new and better practices; what
we too often got was practices that fed into the same broken system.</p>
<p>As the value of analytics and data becomes ever more apparent, there
is ever more commercial interest in capturing that value
in closed systems. Often, the data creators and/or owners seem to be
unaware of this capture, especially when the data is secondary to
their primary mission (e.g. in universities). This lack of awareness
Has Consequences.</p>
<p>Governance and sustainability of open institutions (especially open
source projects) is on a lot of people's minds. Sage has a large
team focused on this! (<a href="https://twitter.com/wilbanks">John Wilbanks</a> says "call me!")</p>
<p>We talked a fair bit about the challenge of convincing individuals and
groups that <strong>increased opportunity for unpredictable serendipity</strong>
was worth giving up <strong>predictable (but smaller) gains in
fame/power/money</strong>.</p>
<p>The invisibility of successful "open" came up repeatedly - the modern
data science ecosystem is built on R and Python, preprints in the life
sciences, open & FAIR data, and open source especially. That
successful open practices achieve near instant adoption is wonderful;
that they are not highlighted as successes of open in the open science
community is unfortunate; and their invisibility means that their
sustainability is often not strongly considered.
(You can see <a href="http://sagebionetworks.org/in-the-news/recognizing-the-successes-of-open-science/">a longer blog post by me on this topic, here.</a>)</p>
<p>It was great to see multiple statements about how the idea of one
consortium/community building THE platform for analysis in an area was
a non-starter. Functional interoperability, collaboration, and
ecosystem thinking within and across platforms is seen as critical,
even by the most senior researchers.</p>
<p>In concert with that, I see that every functional system is a
compromise between various requirements and design
considerations. Therefore building multiple differently functioning
systems is a good ecosystem bet.</p>
<p>Several different people referred to the <strong>increased attack surface</strong>
that open practices offer: e.g. by making your methods and data open,
you increase the ability of others to attack your conclusions. While
this is an important aspect of open science, it is also something that
discourages everyone, with disproportionate negative impact on already
marginalized populations. Sharing within "club" structures, or gated
communities, was seen as one possible solution.</p>
<p>We noted the need for & challenge of placing "do no harm" restrictions
on use and reuse of data; community codes of conduct were discussed as
one example of a governance structure that (combined with
not-entirely-open communities) could enforce such restrictions.</p>
<p>Diversity and inclusion was a frequently mentioned topic. Lack of
diversity in communities can be seen as empirical evidence of missing
structure in communities that is not clearly visible from within; I
think this is important when it comes to formal governance discussions
that can externalize internal culture (hopefully accurately).</p>
<p>Another interesting theme was the extent to which some saw that
grassroots communities of practice could be an antidote to the
<a href="https://en.wikipedia.org/wiki/The_Monkey%27s_Paw">"monkey's paw"</a> or "shitty genie" of requirements generation. Often,
engineers building infrastructure want detailed use cases and
requirements specification, which then leads to the wrong thing being
built (and the associated blame), while if the engineers are brought
into the community of practice they are more likely to build the right
thing due to shared understanding and iterative/continuous
participation.</p>
<p>The challenge of analyzing all the interesting data sets was
frequently mentioned. While not discussed at the meeting, in my view,
training is a way to bring prepared minds and hands to tackle the
analysis of interesting data sets. This training needs to be built in
rather than bolted on to projects, however.</p>
<h2>My own POV: the critical role of communities of practice</h2>
<p>Again and again, I saw that communities of practice presented a key
ingredient to solutions for problems in governance, training,
infrastructure, methods, etc. Communities of practice bring the people
to the problems! Fundamentally, I think open systems do not work
without a community of practice underpinning them.</p>
<p>Creating, growing, and sustaining these communities is, I think, one
of the most important tasks to be tackled. More on that as I have
time to write.</p>
<h2>Concluding thoughts</h2>
<p>One of the organizers closed out the meeting by asking everyone to
highlight one theme that surprised and/or dismayed them. This was a
productive if depressing way to extract essential takeaways!</p>
<p>"The cavalry isn't coming." One of the more sobering conclusions from
this part of meeting was that, given the seniority of the people in
the room, we had no one but ourselves to blame for failing at open in
the next decade. If we couldn't figure out how to coordinate and
incentivize open, then it was unlikely that someone else would step in
to help us out. <strong>We are the cavalry.</strong> (And existing, closed,
institutions are more resilient than we realized.)</p>
<p>Consumers are often very happy to trade data for convenience. This is a
challenge for open!</p>
<p>Open science can be weaponized by opponents of science, e.g. reproducibility
challenges can lead to the conclusion that all science is wrong; there
are many politicians eager to attack science. The dangers of further
delegitimizing science in the eyes of the world are real!</p>
<p>While scientists always start in and often revert to competitive mode,
they can also switch to cooperative mode with ease, given the proper
incentives and structure. (I personally recommend reading Kathleen
Fitzpatrick's book <a href="https://www.amazon.com/Generous-Thinking-Radical-Approach-University/dp/1421429462">Generous Thinking</a>, which focuses on this issue!)</p>
<p>A generational (?) concern was that DIY biology will eat all of biology,
and that this meeting could be viewed as a bunch of PDP-11 engineers
discussing the intricacies and importance of time sharing system design.
I personally think millennials are more sophisticated about data ownership,
more invested in sharing (and more sophisticated about its tradeoffs), and
are likely to seriously upset current apple carts, but I'm an optimist :).</p>
<p>There was a repeated concern that open biomedical science <em>has</em> to
translate into better outcomes, and a shared concern that open science
is an ideology built on practices that don't really work 80% of the time.</p>
<p>My own (depressing) conclusion was that it is not possible for open to
be truly open, and that completely open institutions are extremely
vulnerable to attack (for my previous thoughts on this in open source
projects, see
<a href="http://ivory.idyll.org/blog/2018-how-open-is-too-open.html">"How open is too open?"</a>). There
are gates that must be kept (hodor)! I'll expand on this theme in
another blog post when I have time!</p>
<hr>
<p>In general, I'm happy to expand on themes as time permits, if people
have questions!</p>
<hr>
<p>Immediately after writing this, I happened to revisit Denisse
Alejandra's article,
<a href="https://medium.com/@denalbz/reimagining-open-science-through-a-feminist-lens-546f3d10fa65">"Reimagining Open Science Through a Feminist Lens"</a>,
and I was encouraged by the overlap and relevance of a lot of what was
discussed at the CAOS meeting to this reimagination!</p>
<p>--titus</p>Sustaining open source: thinking about communities of effort2019-03-02T00:00:00+01:002019-03-02T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2019-03-02:/blog/2019-communities-of-effort.html<p>Thinking about how to sustain open source.</p><p>I just finished a day at the SIAM CSE 2019 conference, where I gave a talk
as part of a mini-symposium on software sustainability (<a href="http://ivory.idyll.org/blog/2018-siam-abstract.html">my abstract</a>,
and <a href="https://osf.io/2gzhy/">my talk slides</a>; see the <a href="http://ivory.idyll.org/blog/tag/cpr.html">'cpr' tag</a> for all my recent blog posts on this topic.)</p>
<p>When I was outlining the talk, I spent a fair amount of time noodling
about how I wanted to approach the subject. I have a lot of
disorganized thoughts that I think can be put together in interesting ways,
but for a 20 minute talk, I really needed to pick a narrow focus.</p>
<p>Here's what I ended up with. I'm curious for reactions!</p>
<h2>Defining a term, "communities of effort"</h2>
<p>I'll start by defining "communities of effort" as a community formed in
pursuit of a common goal. The goal can be definite or indefinite in
time, and may not be clearly defined, but it's something that (generally
speaking) the community is aligned on.</p>
<p>The term "effort" here refers to <a href="http://ivory.idyll.org/blog/2018-labor-and-engaged-effort.html">focused or engaged attention</a>,
and in this sense in particular, I mean the focused attention applied
towards the common goal.</p>
<p>One rational goal of such a community is to achieve the goal without
wasting effort through duplication or redundancy in work. This connects with
my earlier blog post on <a href="http://ivory.idyll.org/blog/2018-anti-sisyphean-league.html">the open source anti-Sisyphean League</a>, a
term coined by Cory Doctorow: the idea is that there are a number of
rocks to be rolled up hills, and (in an open community) there is no
reason for people to roll those rocks up the hill independently, since
they can take advantage of each other's efforts.</p>
<p>This community of effort directs itself towards achieving the goal,
applying the available effort to the task. Here, effort is a <em>finite</em>
resource that is consumable - you cannot apply the same effort to more
than one task, and the effort that is applied towards one task is not
available to be applied to another task. (Of course, the available
effort can be <em>renewed</em> or <em>increased</em> - more on that later.)</p>
<h2>Effort as a common pool resource</h2>
<p>The trickiest and most uncertain link is this: I think that the effort
applied towards the common goal is, to some extent, directed by the
community. That is, the available effort - which consists of work
by individuals towards the collective goal - is at the very least
loosely coordinated with the community, if not coordinated more closely.</p>
<p>(This may be because the community needs to be involved in order to
decrease redundancy. Not sure.)</p>
<p>If this is true - that effort is coordinated by the community rather
than the individual, and so is non-excludable; and that it is a finite,
consumable resource, and thus rivalrous - then effort becomes a common
pool resource.</p>
<p>Common pool resources are well known to anyone who has heard of the
tragedy of the commons: they are resources that are subject to this
tragedy, of being consumable by many in an unregulated way.</p>
<h2>What are some examples of these "communities of effort"?</h2>
<p>A prime example is open source projects like Python. They're rooted
in a community approach; they're not not run by a corporation or a
government agency; and any structure (like a nonprofit) is created
after they already exist (and usually after they are successful!)</p>
<p>I think the Carpentries training community is another good example.
This is a community of people interested in teaching and training in
data science and software engineering that essentially self-assembled,
and is aligned around their mission (of teaching and training). The
non-profit structure around it is, again, an ex post facto creation.</p>
<p>Data analysis commons, in which methods, data, compute resources, and
data analysis interfaces are coordinated to address the data analysis
needs of a community, would be another example.</p>
<p>(Wikipedia might be another, but I'm less familiar with how it works.)</p>
<h2>Why do we care about these communities?</h2>
<p>Well, these communities are amazingly <em>effective</em>, in at least
some cases. For example, Python and R between them are essentially
<em>the</em> modern data science languages - both are open source, both
are community coordinated.</p>
<p>More generally, it is probably not an exaggeration to say that the
products of open source communities of effort underlie the vast
majority of Silicon Valley software, as well as most research software.</p>
<p>Sustaining, growing, and supporting these communities is pretty
important!</p>
<h2>How do these communities get started, and why are they effective?</h2>
<p>One feature of successful communities of effort - those that seem to
succeed in growing their pool of available effort - is that they are
often very organic in their approach to tackling their mission. This
is probably an effect of the community-based approach, in that the
members of these communities are to a reasonably significant extent
self-motivated and self-directed to solve their problems, and so the
solutions are often bottom-up created with only a light level of
coordination on top. (I'll revisit this in terms of governance in a
bit.)</p>
<p>The other kind of fun thing is that these days it's pretty easy to bootstrap
a community of effort: with some enthusiasm and a site like GitHub, you can
spin up a new community project quite quickly.</p>
<p>Last but not least, many (most? all?) communities of effort have at
least one person who has placed their effort at the service of the
community mission. These are the leaders and/or maintainers of the
project.</p>
<h2>So what's the problem? It's all good, right?</h2>
<p>Well... there are a few things I don't really understand.</p>
<p>For one, the formation of large groups of people who sustain a collective
to pursue a common goal violates basic tenets of collective action - at least,
as I understand them. The idea here is that, if there is a large group
of people pursuing a common goal, then the smart (economically rational) thing
for someone to do is ...not do any work at all, because the individual will
reap the benefits of the group work. So, what's different with these
communities of effort?</p>
<p>Sustainability and in particular <em>maintenance</em> is a big question, too;
these communities often rely on one or a few core maintainers to make things
happen, and it is really unclear why these maintainers (who are often
unpaid or underpaid) would take on these tasks. Yes, they get kudos and
reputation, but kudos and reputation do not put food on the table... why do
they do it?</p>
<p>(One thought - perhaps the creation of a successful community
of effort really depends on there being at least one person who ignores
short-term economic rationality? So then you just don't see all the failed
attempts where someone decides not to be irrational and hence not bother?
Another thought is that perhaps the key aspect of many of these communities
being <em>open</em> means that the maintainer-type folk realize that
no one else is tackling the common goal, and since they need the goal met
as well, they might as well do it?)</p>
<h2>Does framing the problem as a common pool resource problem yield any solutions?</h2>
<p>I think it does.</p>
<p>First, once you recognize effort as the limiting resource, the
question of how to maintain and increase that resource comes to the
forefront. There are a number of possible mechanisms, including
investing in making the community easy or rewarding to join, welcoming
new contributors, and/or providing special methods or data or access
to community members. In this view, these activities become more central
than they are if you are thinking only about the overall goal or mission
of the community.</p>
<p>Second, Elinor Ostrom outlined some design principles for
sustainability of common pool resources based on empirical studies, in
<a href="https://www.amazon.com/Governing-Commons-Evolution-Institutions-Collective/dp/0521405998">Governing the Commons</a>. One of these principles is about making
collective choice arrangements that allow most of the appropriators
(members of the community) to participate in the decision making
process.</p>
<p>Basically, this boils down to rewarding people who invest effort with
some level of influence over how that effort is applied towards the
community goals. This both incentivizes participation by conferring a
sense of collective ownership, and also seems to enable a form of
organic communication in which the people applying effort feed the
results of their work back into the overall community direction. This
is, to my mind, one of the things that makes these communities so
effective.</p>
<p>This mode of governance <strong>by</strong> members of the community <strong>for</strong> the
community goal leads to another interesting thought. Funders
participate in these communities in indirect ways, by seeking to fund
(or being sought out to fund) effort within the community. Rarely is
the direction of this support directly dictated by the funder; it's
usually laundered through the community member(s) being supported.
This is both good and bad - it limits the degree to which funders (and
companies) can directly influence the project, but also means that
funders may not be able to easily identify the uses to which their
money will be put.</p>
<h2>Who is part of the community of effort?</h2>
<p>Anyone who contributes their effort is part of the community, and hence
should get some form of influence over governance (by the above design
principle).</p>
<p>Extractive contributors - people who draw on the community's effort
without contributing to it, especially to the <em>maintenance</em> effort - would not,
however, be considered part of the community. See
<a href="http://ivory.idyll.org/blog/2018-how-open-is-too-open.html">How open is too open?</a> for this argument.</p>
<p>People who are using the product of the community but not costing the
community any effort (e.g. consumers of the source code) would also
not be part of the community, unless they contribute in some way to the
project.</p>
<p>One interesting result of this kind of thinking is that, for data
analysis commons, people who provide data or methods, train others,
or contribute documentation are all contributing effort. This
provides a rational basis for including this kind of work within the
community, and within its governance; these people are, in a direct
sense, contributing to the sustainability of the community of effort.</p>
<h2>Is academia a good home for these communities of effort?</h2>
<p>I note that the leadership and governance model in basic research, at
least, is often not inclusive of the people who are doing the work,
and instead centers on reputation and hierarchy. I don't think
universities and colleges focused on basic research are likely to be
a good part of the support network for communities of effort, in
general.</p>
<p>I have been quite impressed with what I've seen of extension efforts at
universities, which are faculty-level investments of time and energy
in communities. I'm planning to look more into the idea of a
digital extension model.</p>
<h2>Some final thoughts</h2>
<p>I think it's important to recognize that (these days at least) there
are lots of competing projects in which people can invest their time
and effort, and it's probably not a bad thing to frame it as a
competition between these communities for people's time and
attention. Communities that do a good job of attracting contributors
and incentivizing the inclusion of effort can win out and potentially
be more sustainable than communities that do a lousy job. (This has
potentially dire implications for some scientific research communities,
which are not always very welcoming or inclusive. I'm not sad about this.)</p>
<p>This framing also puts <em>soft skills</em> front and center in the equation,
and I think this is also a good outcome.</p>
<h2>Open / unaddressed questions</h2>
<p>Two open questions.</p>
<p>First, what are communities of effort <em>not</em> good at? I would venture
that any boring or maintenance-level jobs would tend to be addressed
poorly by these communities, due to how human enthusiasm works.</p>
<p>Second, I want to return to the missing link mentioned above - that
these communities seemingly depend on one or more people placing their
effort in service of the community. What are the reasons why people do
this, and how do we support and maintain it? Inquiring minds want to
know... It would be nice if we had a reasonably comprehensive picture
of why this occurs, because it doesn't seem like rational behavior
on the face of it. (I'm very thankful that people do this, of course,
which is why I want to better support this path!)</p>
<h2>Acknowledgements</h2>
<p>I gratefully acknowledge Adam Resnick, Matt Trunnell, Josh Greenberg,
Nadia Eghbal, Luiz Irber, and Tracy Teal, with whom I've had inspiring
conversations on these fronts.</p>
<p>The NIH (via the Data Commons funding) and the Moore Foundation
provided funding to me to think about, read about, and explore these
issues.</p>
<p>Comments welcome!</p>
<p>--titus</p>
<h1>My recent reading re sustaining open communities</h1>
<p>2019-03-01, by C. Titus Brown</p>
<p>What has Titus been reading lately?</p>
<p>I've been interested in the sustainability of open communities for a
while, and with the NIH Data Commons effort, was finally able to start
connecting some dots and finding some reading material. As far as I
can tell, the relevant literature is fragmented across a bunch of
fields that include social and technical studies, sociology,
economics, and political science. This reading has led me in some interesting
directions (see <a href="http://ivory.idyll.org/blog/2018-oss-framework-cpr.html">a high level post</a> as well as <a href="http://ivory.idyll.org/blog/tag/cpr.html">my overall collection of CPR blog posts</a>).</p>
<p>Every time I talk to someone in depth about it, I get some more
reading. Notwithstanding that, some of the reading I <em>have</em> already
found is really interesting! And (at the encouragement of Josh
Greenberg, among others) I thought I'd post a few of the books and
links that I've found inspiring. I'll also give my "first read"
impressions of them - this is very different from what I'd do if I
were a true scholar, but hopefully those impressions will help motivate
people to look into these books more!</p>
<p>Acknowledgements are due first -
<a href="https://twitter.com/michael_nielsen/status/1009075233368596482">Michael Nielsen is a rich source of references</a>,
as are
<a href="https://twitter.com/CameronNeylon/status/1009238646044545024">Cameron Neylon</a>
and
<a href="https://twitter.com/nayafia/status/1028053008867676160">Nadia Eghbal</a>. Thank
you!!</p>
<hr>
<p><a href="https://www.amazon.com/Governing-Commons-Evolution-Institutions-Collective/dp/0521405998">Governing the Commons</a>,
by Dr. Elinor Ostrom. This is a classic book that covers the topic she
is most well known for, and for which she received the Nobel Prize in
Economics. The first third of the book is a bit of a heavy slog, but
the last two thirds contains a number of fascinating case studies
about how common pool resources can be managed sustainably and how
these "commons" can be governed in such a way to promote
sustainability.</p>
<hr>
<p><a href="https://www.amazon.com/Logic-Collective-Action-Printing-Appendix/dp/0674537513">Logic of Collective Action</a>,
by Dr. Mancur Olson. This is another classic book that talks about how
there is no incentive for large groups of people to act collectively
to reach a common goal. I have a long blog post on this in
waiting... but it needs a lot of editing first.</p>
<hr>
<p><a href="https://www.amazon.com/Social-Life-Things-Commodities-Anthropology/dp/0521357268">The Social Life of Things</a>,
edited by Dr. Arjun Appadurai. This is a collection of mind-blowing
studies of how commodities acquire and retain value. Two particularly
pertinent quotes out of (literally) a hundred I noted down:</p>
<p>First,</p>
<blockquote>
<p>Here Renfrew shows us very persuasively that the decisive factors in
technological innovation (which is critical to the development of new
commodities) are often social and political rather than simply
technical. (p34)</p>
</blockquote>
<p>Second,</p>
<blockquote>
<p>This circuit ensures barrenness and death instead of fertility and
prosperity. It is based on the transformation of reciprocity into
commodity exchange. (p53, referencing Taussig, 1980:224).</p>
</blockquote>
<hr>
<p><a href="https://www.amazon.com/Fractivism-Corporate-Chemical-Experimental-Futures/dp/0822369028">Fractivism</a>,
by Dr. Sara Wylie. This is a recent book by Professor Wylie, who is
at Northeastern, and it talks about her efforts to coalesce a
community around collecting reports of fracking. Among the themes of
this book that stuck with me are that people are hungry for community,
and that technology can be used in very intentional ways to build this
community. (Thanks to Gabriella Coleman for the introduction to this
work and to Dr. Wylie!)</p>
<hr>
<p><a href="https://www.amazon.com/Generous-Thinking-Radical-Approach-University/dp/1421429462">Generous Thinking</a>,
by Dr. Kathleen Fitzpatrick. Dr. Fitzpatrick is a professor at
Michigan State who ironically enough arrived just as I left for UC
Davis! Her book is about how to change the way we engage with each
other in the university to embrace a more generous kind of
engagement - one in which we take inspiration from each other, rather
than diving straight into a critical analysis.</p>
<hr>
<p>Josh Greenberg pointed me at
<a href="https://en.wikipedia.org/wiki/Club_good">"club goods"</a>, and in
particular the matrix of excludable/rivalrous types of
goods. <a href="https://twitter.com/ctitusbrown/status/1047409685001904134">See my tweet for a hand drawn version that some people liked and some people hated :)</a>.</p>
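<p>For reference, here's my rendering of the standard textbook 2x2 (not
necessarily Josh's exact framing):</p>
<table>
<thead>
<tr>
<th></th>
<th>excludable</th>
<th>non-excludable</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>rivalrous</strong></td>
<td>private goods</td>
<td>common pool resources</td>
</tr>
<tr>
<td><strong>non-rivalrous</strong></td>
<td>club goods</td>
<td>public goods</td>
</tr>
</tbody>
</table>
<p>In this framing, effort looks a lot like a common pool resource:
it's rivalrous (effort spent on one task isn't available for another),
and it's hard to exclude people from drawing on it.</p>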
<hr>
<p><a href="http://worldaftercapital.org/">World after Capital</a>, by Albert
Wenger. This (free!) book has a lot of great discussion about the role of
attention in the modern world.</p>
<hr>
<p>That's all for now. I have about 50 other links and books to look at
after this, too! More ...not soon.</p>
<p>Feedback welcome!</p>
<p>--titus</p>