Sourmash LCA databases now available for the GTDB taxonomy

I am happy to announce that we have made available prepared sourmash taxonomy ("LCA") databases for release 89 of the GTDB taxonomy.

The databases are available for download from the Open Science Framework in this project. Prepared databases are available for k=21, k=31, and k=51.

What is the GTDB taxonomy?

GTDB is a revised bacterial and archaeal taxonomy based on phylogenetic relations between proteins from approximately 25k genomes. You can read more about it here.

GTDB is an alternative to the NCBI taxonomy. It is used by (among others) MGnify, the EBI metagenomics resource.

What is sourmash?

Sourmash is a research platform and bioinformatics tool for searching and analyzing genomes, based on a MinHash-inspired approach that allows genome similarity searches, genome containment searches, and compositional analysis of k-mers in large sequence data sets. You can read more about it here.
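To make "similarity" and "containment" concrete, here is a toy Python sketch of the general idea: hash every k-mer, keep only a fraction of the hashes, and compare the resulting sets. This is not sourmash's implementation. The function names, the tiny k, the truncated SHA-1 hash, and the example sequences are all assumptions made for illustration, and the sketch ignores reverse complements.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def sketch(seq, k=5, scaled=1):
    """Keep the hashes of k-mers that land in the lowest 1/scaled of hash space.

    The hash here (first 8 bytes of SHA-1) is a stand-in chosen for the
    example, not the hash function sourmash uses.
    """
    max_hash = 2**64 // scaled
    kept = set()
    for kmer in kmers(seq, k):
        h = int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        if h < max_hash:
            kept.add(h)
    return kept


def jaccard(a, b):
    """Similarity: shared hashes divided by all hashes in either sketch."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Containment: fraction of a's hashes that are also in b."""
    return len(a & b) / len(a) if a else 0.0


# Made-up sequences: the fragment is a prefix of the "genome".
genome = "ACGTACGTTGCAAGCTTAGGCTA"
fragment = genome[:12]

g = sketch(genome)
f = sketch(fragment)

print(containment(f, g))  # 1.0, every k-mer of the fragment is in the genome
print(jaccard(f, g))      # lower, because the genome has k-mers the fragment lacks
```

With scaled=1 every hash is kept; larger values keep roughly 1/scaled of the distinct k-mers, which is what makes this kind of comparison cheap on large data sets.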

What do these databases let you do?

There are three immediate uses for these databases:

You can use the sourmash lca classify routine (and other LCA commands) to do taxonomic classification of genomes using the GTDB taxonomy. (See our tutorial on sourmash lca!)
You can do compositional analysis of metagenomes using sourmash lca summarize.
You can search for genomes in GTDB that are similar to genomes (or metagenomes) of interest, using sourmash search and sourmash gather.
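The "LCA" in these commands stands for lowest common ancestor. As a rough illustration of the idea behind classification and summarization, the toy Python below maps hashes to lineages and reports the deepest rank that all matching lineages share. It is a sketch under stated assumptions, not how sourmash implements lca classify or lca summarize: the integer hashes, the shortened three-rank lineages, and the function names are all hypothetical.

```python
from collections import Counter


def lca(lineages):
    """Return the longest prefix shared by every lineage tuple."""
    lineages = list(lineages)
    if not lineages:
        return ()
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return tuple(shared)


def classify(query_hashes, hash_to_lineages):
    """Toy classification: the LCA of every lineage hit by the query."""
    hits = []
    for h in query_hashes:
        hits.extend(hash_to_lineages.get(h, ()))
    return lca(hits)


def summarize(query_hashes, hash_to_lineages):
    """Toy summary: count hashes at each taxonomic node they support."""
    counts = Counter()
    for h in query_hashes:
        lineages = hash_to_lineages.get(h)
        if not lineages:
            continue  # hash not in the database
        node = lca(lineages)
        for depth in range(1, len(node) + 1):
            counts[node[:depth]] += 1
    return counts


# Illustrative lineages, shortened to three ranks for the example.
escherichia = ("d__Bacteria", "p__Proteobacteria", "g__Escherichia")
salmonella = ("d__Bacteria", "p__Proteobacteria", "g__Salmonella")

# Hypothetical database: hash 2 is shared by both genera.
db = {1: [escherichia], 2: [escherichia, salmonella], 3: [salmonella]}

print(classify({1}, db))     # hits one genus only, so the full lineage
print(classify({1, 2}, db))  # hits both genera, so only the shared phylum
print(summarize({1, 2}, db))
```

A hash found in only one lineage supports that lineage all the way down, while a hash shared between lineages only supports their common ancestor. That is why a query matching several genera is reported at a higher rank.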

How much memory does sourmash need to use these databases?

LCA databases take up less disk space than SBT databases, but are more memory-intensive. Using these databases requires about 5 GB of RAM.

--titus

Appendix: How are these databases built?

We build them with a fully automated snakemake workflow, available here. Building the databases from the genomes under release89/fastani/database/ takes about 12 hours and under 100 GB of RAM.

Living in an Ivory Basement Stochastic thoughts on science, testing, and programming.
