A week ago, Erich Schwarz e-mailed our lab list asking,
I would like to be able to download a set of between 1,000 and
10,000 bacterial genome assembly sequences that are reasonably
representative of known bacteria. RefSeq's bacterial genome set is
easy to download, but absolutely freaking huge (the aggregate FASTA
file for its genome sequences is 410 GB).
After digging in a bit, Erich gave us his actual goal: to search for
potential microbial contaminants, like so:
Do MegaBlastN on new genome assemblies from PacBio data. With
PacBio one gets very few large contigs, so bacterial contamination
is really easy to filter out with a simple MegaBlastN. However, I
did my last big download of 3,000 microbial genomes from EBI in
2013. There's a lot more of them now!
My response:
I think sourmash gather on each contig would probably do the right
thing for you, actually; https://sourmash.readthedocs.io/en/latest/tutorials.html
if you have a "true positive" contaminated scaffold to share, I can test
that fairly quickly.
Also - I assume the contigs are never chimeric, so if you find contamination in
one it's ok to discard the whole thing?
Also - kraken should do a fine job of this albeit in a more memory intensive
way. MegaBlastN isn't much more sensitive than k-mer based approaches,
I think.
This would let Erich search all 100k+ bacterial genomes without
downloading the complete genomes. My recommendation was to use this
approach to identify candidate contaminant genomes, and then use something
like mashmap to do a more detailed alignment and remove the contaminants.
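In command-line terms, the sourmash half of that workflow might look
something like the following. This is just a sketch: the contig file name
is hypothetical, and the exact flags can vary between sourmash versions.
sourmash compute -k 31 --scaled 1000 contigs.fa -o contigs.sig
sourmash gather contigs.sig genbank-k31.sbt.json -o gather-results.csv
The first command builds a scaled MinHash signature for the assembly; the
second reports which genomes in the genbank index are present in it.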
Erich responded with some useful links.
In fact, I have what should be both positive and negative
controls for microbial contamination:
http://woldlab.caltech.edu/~schwarz/caeno_pacbio.previous/nigoni_mhap.decont_2015.11.11.fa.gz
http://woldlab.caltech.edu/~schwarz/caeno_pacbio.previous/nigoni_mhap.CONTAM_2015.11.11.fa.gz
which you are very welcome to try sourmashing!
After some other back and forth, I wrote a script to do the work; here's
a rough run protocol:
curl -O -L http://woldlab.caltech.edu/~schwarz/caeno_pacbio.previous/nigoni_mhap.CONTAM_2015.11.11.fa.gz
./gather-by-contig.py nigoni_mhap.CONTAM_2015.11.11.fa.gz genbank-k31.sbt.json \
    --output-match foo.txt --output-nomatch foo2.txt --csv summary.csv
which should take a minute or two to run on a modern SSD laptop, and requires
less than 1 GB of RAM (and about 18 GB of disk space for the genbank index).
A few comments before I go through the script in detail:
- this uses MinHash downsampling as implemented in sourmash, so you have to feed long contigs in (see the sketch after this list). This is appropriate for PacBio and Nanopore assemblies, but not for raw reads of any kind, and probably not for Illumina assemblies.
- sourmash will happily do contaminant estimation of an entire data set (genomes, reads, etc.) - the goal here was to go line by line through the contigs and split them into "match" and "no match".
- the databases are downloadable here
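To illustrate the first point, here's a minimal sketch using the sourmash
Python API (the sequence lengths and scaled=1000 are assumptions chosen for
illustration). With scaled MinHash downsampling, a sketch retains roughly
one hash per `scaled` distinct k-mers, so short sequences may retain very
few hashes, or none at all:
import random
import sourmash

random.seed(42)

# an empty scaled MinHash, analogous to what the script below
# builds via copy_and_clear()
mh = sourmash.MinHash(n=0, ksize=31, scaled=1000)

# a 100 kb random "contig" retains on the order of 100 hashes...
long_seq = ''.join(random.choice('ACGT') for _ in range(100000))
mh.add_sequence(long_seq)
print('long contig:', len(mh.get_mins()), 'hashes')

# ...but a 2 kb sequence retains only a handful, sometimes zero
short_mh = mh.copy_and_clear()
short_seq = ''.join(random.choice('ACGT') for _ in range(2000))
short_mh.add_sequence(short_seq)
print('short contig:', len(short_mh.get_mins()), 'hashes')
This is why multi-kilobase PacBio or Nanopore contigs work well here, while
a few-hundred-bp Illumina read would often yield an empty sketch.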
Last, but not least: this kind of ad hoc scripting functionality is
what we aspire to enable with all our software. A command line
program can't address all needs, but a default set of functionality provided
via the command line, wrapping a more general purpose library, can!
You can download the script here; an annotated version is below.
An annotated version of the script
First, import the necessary things:
#! /usr/bin/env python
import argparse
import csv

import screed
import sourmash
from sourmash.sbtmh import SearchMinHashesFindBestIgnoreMaxHash
In the main function, set up some arguments:
def main():
    p = argparse.ArgumentParser()
    p.add_argument('input_seqs')
    p.add_argument('sbt_database')
    p.add_argument('--threshold', type=float, default=0.05)
    p.add_argument('--output-nomatch', type=argparse.FileType('wt'))
    p.add_argument('--output-match', type=argparse.FileType('wt'))
    p.add_argument('--csv', type=argparse.FileType('wt'))
    args = p.parse_args()
Then, find the SBT database to load:
    tree = sourmash.load_sbt_index(args.sbt_database)
    print(f'found SBT database {args.sbt_database}')
Next, figure out the MinHash parameters used to construct this database, so we
can use them to construct MinHashes for each sequence in the input file:
    leaf = next(iter(tree.leaves()))
    mh = leaf.data.minhash.copy_and_clear()
    print(f'using ksize={mh.ksize}, scaled={mh.scaled}')
Give some basic info:
    print(f'loading sequences from {args.input_seqs}')
    if args.output_match:
        print(f'saving match sequences to {args.output_match.name}')
    if args.output_nomatch:
        print(f'saving nomatch sequences to {args.output_nomatch.name}')
    if args.csv:
        print(f'outputting CSV summary to {args.csv.name}')
In the main loop, we'll need to track found items (for CSV summary output),
and other basic stats:
    found_list = []
    total = 0
    matches = 0
Now, for each sequence in the input file of contigs:
    for record in screed.open(args.input_seqs):
        total += 1
        found = False
Set up a search function that finds the best match, and construct a new MinHash
for each query sequence:
        search_fn = SearchMinHashesFindBestIgnoreMaxHash().search
        query_mh = mh.copy_and_clear()
        query_mh.add_sequence(record.sequence)
        query = sourmash.SourmashSignature(query_mh)
If the sequence is too small to yield any hashes, skip it:
        # too small a sequence/not enough hashes? notify & skip
        if not query_mh.get_mins():
            print(f'note: skipping {record.name[:20]}, no hashes in sketch')
            continue
Now do the search, and pull off the first match:
        for leaf in tree.find(search_fn, query, args.threshold):
            found = True
            matches += 1
            similarity = query.similarity(leaf.data)
            found_list.append((record.name, leaf.data.name(), similarity))
            break
Nothing found? That's ok, just indicate empty.
        if not found:
            found_list.append((record.name, '', 0.0))
Output sequences appropriately:
        if found and args.output_match:
            args.output_match.write(f'>{record.name}\n{record.sequence}\n')
        if not found and args.output_nomatch:
            args.output_nomatch.write(f'>{record.name}\n{record.sequence}\n')
and update the user:
        print(f'searched {total}, found {matches}', end='\r')
At the end, print a newline (the progress line above ends in a carriage
return, so this just moves past it without overwriting it), and output the CSV:
    print('')
    if args.csv:
        w = csv.DictWriter(args.csv, fieldnames=['query', 'match', 'score'])
        w.writeheader()
        for (query, match, score) in found_list:
            w.writerow(dict(query=query, match=match, score=score))
Finally, ...call the main function if this is run as a script:
if __name__ == '__main__':
    main()
Comments and questions welcome, as always!
best,
--titus