Recipe: Estimate genome size and coverage from shotgun sequencing data

In Extracting shotgun reads based on coverage in the data set, we showed how to get a read coverage spectrum for a shotgun data set. This is a useful diagnostic tool that can be used to estimate total genome size, average coverage, and repetitive content.

Uses for this recipe include estimating the sequencing depth of a particular library, estimating the repeat content, and generally characterizing your shotgun data.

See Estimating (meta)genome size from shotgun data for a recipe that estimates the single-copy genome size; that recipe can be used for any metagenomic or genomic data set, including ones subjected to whole genome amplification, while this recipe can only be used for data from unamplified/unbiased whole genome shotgun sequencing.

Let's assume you have a genome that you've sequenced to some unknown coverage, and you want to estimate the total genome size, repeat content, and sequencing coverage. If your reads are in reads.fa, you can generate a read coverage spectrum from your genome like so: -x 1e8 -k 20 reads.fa
~/dev/khmer/sandbox/ reads.fa reads-cov.dist
./ reads-cov.dist reads-cov.png --xmax=600 --ymax=500

and the result looks like this:

For this (simulated) data set, you can see three peaks: one on the far right, which contains the high-abundance reads from repeats; one in the middle, which contains the reads from the single-copy genome; and one all the way on the left at ~1, which contains all of the highly erroneous reads

Pick a coverage number near the peak that you think represents the single-copy genome - with sufficient coverage, and in the absence of significant heterozygosity, it should be the first peak that's well to the right of zero. Here, we'll pick 100.

Then run

./ reads.fa reads-cov.dist 100

and you should see:

Estimated total genome size is: 7666 bp
Estimated size > 5x repeats is: 587 bp
Estimated single-genome coverage is: 109.8

The true values (for this simulated genome) are 7500bp genome size, 500 bp > 5x repeat, and 150x coverage, so this isn't a terribly bad set of estimates. The single-genome coverage estimate is probably low because we're estimating read coverage with a k-mer based approach, which is sensitive to highly erroneous reads.

A few notes:

  • Yes, you could automate the selection of the first peak, but I think it's always a good idea to look at your data!
  • In the case of highly polymorphic diploid genomes, you would have to make sure to include the haploid reads but only count them 50% towards the final genome size.

Comments !

(Please check out the comments policy before commenting.)