I am happy to announce that we have made available prepared sourmash taxonomy ("LCA") databases for release 89 of the GTDB taxonomy.
The databases are available for download from the Open Science Framework in this project. Prepared databases are available for k=21, k=31, and k=51.
What is the GTDB taxonomy?
GTDB is a revised bacterial and archaeal taxonomy based on phylogenetic relations between proteins from approximately 25k genomes. You can read more about it here.
GTDB is an alternative to the NCBI taxonomy. It is used by (among others) MGnify, the EBI metagenomics resource.
What is sourmash?
Sourmash is a research platform and bioinformatics tool for searching and analyzing genomes, based on a MinHash-inspired approach that allows genome similarity searches, genome containment searches, and compositional analysis of k-mers in large sequence data sets. You can read more about it here.
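To make the difference between similarity and containment concrete, here is a toy sketch in plain Python. It is not how sourmash is implemented (sourmash works on subsampled hashes of k-mers, not full k-mer sets), and the sequences, the k value, and the function names are all made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Similarity: shared k-mers divided by all k-mers seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Containment: fraction of a's k-mers that are also present in b."""
    return len(a & b) / len(a) if a else 0.0


# Hypothetical toy sequences; real genomes are millions of bases long.
genome = "ATGGCGTACGTTAGCCTAGGCTAACGTTAGC"
fragment = genome[5:20]  # a piece cut out of the "genome"
k = 5                    # toy value; the databases here use k=21, 31, or 51

genome_kmers = kmers(genome, k)
fragment_kmers = kmers(fragment, k)

print("similarity: ", jaccard(fragment_kmers, genome_kmers))
print("containment:", containment(fragment_kmers, genome_kmers))
```

The fragment is fully contained in the genome even though the two are not very similar overall, which is why containment is the more useful measure when one data set is much larger than the other.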
What do these databases let you do?
There are three immediate uses for these databases:
- you can use the sourmash lca classify routine (and other LCA commands) to do taxonomic classification of genomes using the GTDB taxonomy. (See our tutorial on sourmash lca!)
- you can do compositional analysis of metagenomes using sourmash lca summarize.
- you can search for genomes in GTDB that are similar to genomes (or metagenomes) of interest, using sourmash search and sourmash gather.
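"LCA" stands for lowest common ancestor: when the pieces of a query match genomes from several lineages, the most specific assignment that all of them agree on is the deepest rank they share. Here is a minimal sketch of that idea, assuming lineages are stored as tuples ordered from domain to species. The function name and the two example lineages (written with GTDB-style rank prefixes) are illustrative only, and this is not sourmash's own code.

```python
def lowest_common_ancestor(lineages):
    """Return the longest leading run of ranks shared by every lineage."""
    if not lineages:
        return ()
    shared = []
    # zip(*lineages) walks all lineages rank by rank, from domain downward.
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break  # lineages disagree at this rank, so stop here
        shared.append(names[0])
    return tuple(shared)


# Illustrative lineages only, ordered domain -> species.
lineage_a = ("d__Bacteria", "p__Proteobacteria", "c__Gammaproteobacteria",
             "o__Enterobacterales", "f__Enterobacteriaceae",
             "g__Escherichia", "s__Escherichia coli")
lineage_b = ("d__Bacteria", "p__Proteobacteria", "c__Gammaproteobacteria",
             "o__Enterobacterales", "f__Enterobacteriaceae",
             "g__Salmonella", "s__Salmonella enterica")

lca = lowest_common_ancestor([lineage_a, lineage_b])
print(";".join(lca))
```

With these two lineages the shared prefix ends at the family rank, so a query matching both would be classified only as far as the family.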
How much memory does sourmash need to use these databases?
LCA databases take up less disk space than SBT databases, but are more memory intensive. Using these databases requires about 5 GB of RAM.
--titus
Appendix: How are these databases built?
We use a fully automated snakemake workflow to build them, here. It takes about 12 hours and under 100 GB of RAM to build the databases from the genomes under release89/fastani/database/.