I'm at the 2014 Marine Microbes Gordon Conference right now, and at
the end of my talk, I
brought up the point that the function of most genes is unknown. It's
not a controversial point in any community that does environmental
sequencing, but I feel it should be mentioned at least once during
every session on metagenome sequencing :).
The lack of functional information for the vast majority of genes is,
in my view, the broadest and biggest challenge facing environmental
microbiology. The problem is known colloquially as "microbial dark
matter" (ref and Nature News), and it is fair to say that we have
virtually no window into what the
majority of genes do. This is particularly a problem now that we
can readily access them with sequencing, and several fields are
racking up hundreds or thousands of data sets that are largely
uninterpretable from a functional perspective.
So what are our options? What can we do to characterize new genes?
There seem to be two poles of opinions: many experimentalists argue
that we need to devote significant efforts to doing more microbial
physiology, which is, after all, how we know most of what we already
know. People at the other pole seem to think that if we do enough
sequencing, eventually meaning will emerge - enough correlations will
turn into causation. (While these are obviously caricatures, I think
they capture most of the range of opinions I've heard, so I like 'em ;).
Neither course seems likely to me. Nobody is going to fund hundreds
or thousands of graduate student projects to characterize the
physiology of individual microbial species, which is more or less the
scale of effort needed. Similarly, while the sequencing folk have
clearly been "winning" (in the sense that there's a lot of sequencing
being done!) there's a growing backlash against large-scale
sequencing without a fairly pointed set of hypotheses behind it.
This backlash can be understood as a natural development -- the
so-called trough of disillusionment in the adoption and
understanding of new technologies -- but that makes it no less real.
Over the past few years, I've had the opportunity to discuss and
debate a variety of approaches to characterizing gene function in
microbes. Since I'm thinking about it a lot during this meeting, I
thought I'd take the time to write down as many of the ideas as I can
remember. There are two purposes to this -- first, I'm trawling for
new ideas, and second, maybe I can help inspire people to tackle these
Without further ado, here is my list, broken down into somewhat arbitrary
Experimental exploration and determination of gene function
Finding genetic systems and doing the hard work.
This is essentially the notion that we should focus in on a few
model systems that are genetically tractable (culturable,
transformable, and maybe even possessed of genome editing
techniques) and explore, explore, explore. I'm not sure which
microbes are tractable, or could be made tractable, but I gather we
are lacking model systems representative of a broad swath of marine
microbes, at least.
The upsides to this approach are that we know how to do it, and all
of the modern -omics tools can be brought to bear to accelerate
progress: genome, transcriptome, and proteome sequencing, as well
The downsides are that this approach is slow to start, time
consuming, and not particularly scalable. Because of that I'm not
sure there's much support for funding.
Transcriptome assisted culture.
A persistent challenge for studying microbes is that many of them
cannot be easily cultured, which is a soft prerequisite for
studying them in the lab. We can't culture them because often we
don't know what the right culture conditions are -- what do they
eat? Who do they like to hang out with?
One of the neater approaches to resolving this is the concept of
transcriptome assisted culture, which Irene Newton
pointed out to me in this neat PNAS paper on culturing Coxiella. Essentially,
Omsland et al. used transcriptome sequencing in conjunction with
repeated modifications to the culture medium to figure out what the
right growth medium was. In addition to converting an important
biomedical pathogen into something more easily culturable, the
authors gained important insights into its basic metabolism
and the interaction of Coxiella with its host cells.
Upsides? It's an approach that's readily enabled by modern -omics
tools, and it should be broadly applicable.
Downsides? Time consuming and probably not that scalable. However,
it's a rather sexy approach to the hard slog of understanding organisms
(and you can argue that it's basically the same as the model organism
approach) so it's marginally more fundable than just straight up
Another culture-based approach is the enrichment culture, in which
a complex microbial community (presumably capable of driving many
different biogeochemical processes) is grown in a more controlled
environment, usually one enriched for a particular kind of
precursor. This can be done with a flow reactor approach
where you feed in precursors and monitor the composition of the
outflow, or just by adding specific components to a culture mix and
seeing what grows.
For one example of this approach, see Oremland et al., 2005, in which the
authors isolated a microbe, Halarsenatibacter silvermanii, which
metabolized arsenic. They did this by serial transfer of the cells
into a fresh medium and then purifying the organism that
persistently grew through serial dilution at the end.
This is a bit of a reverse to the previous methods, where the focus
was on a specific organism and figuring out how it worked; here,
you can pick a condition that you're interested in and figure out
what grows in it. You can get both simplified communities and
potentially even isolates that function in specific conditions.
(Also see Winogradsky columns for a similar
environment that you could study.) You still need to figure out
what the organisms do and how they do it, but you start with quite
a bit more information and technology than you would otherwise -
most importantly, the ability to maintain a culture!
Pros: this is actually a lot more scalable than the model-organism
or culture-focused techniques above. You could imagine doing this on a
large scale with a fairly automated setup for serial transfer, and
the various -omics techniques could yield a lot of information for
relatively little per-organism investment. Someone would still need
to chase down the genes and functions involved, but I feel like this
could be a smallish part of a PhD at this point.
Cons: it's not terribly hypothesis driven, which grant agencies
don't always like; and you might find that you don't get that much
biological novelty out of the cultures.
You can also understand what genes do by putting them into
tractable model organisms. For example, one of the ways that Ed
DeLong's group showed that proteorhodopsin probably actually
engaged in phototrophy (light-driven proton pumping) was by putting
the gene in E. coli. At the
time, there was no way to investigate the gene (from an uncultured
SAR86) in its host organism, so this was the only way they could
"poke" at it.
A significant and important extension of this idea is to transfer
random fragments from metagenomic fosmid or BAC libraries into
large populations of (say) E. coli, and then do a selection
experiment to enrich for those critters that can now grow in new
conditions. For example, see this paper
on identifying the gene behind the production of certain
antibiotics (hat tip to Juan Ugalde (@JuanUgaldeC) for the reference). Also see
the "heterologous expression" paragraph in Handelsman (2004), or this other
antibiotic resistance paper from Riesenfeld
et al. (2004) (hat tips to Pat Schloss (@Pat Schloss), Jeff Gralnick (@bacteriality), and Daan Speth (@daanspeth) for
the references).
Pros: when it works, it's awesome!
Cons: most genes function in pathways, and unless you transfer in
the whole pathway, an individual gene might not do anything. This
has been addressed by transferring entire fosmids with whole
operons on them between microbes, and I think this is still worth
trying, but (to me) it seems like a low-probability path to
success. I could be wrong.
Why not just build a new critter genome using synthetic biology and
see how it works? This is a radical extension of the previous idea
of transferring genes between different organisms. Since we can
now print long stretches of DNA on demand,
why not engineer our own pathways and put them into tractable
organisms to study in more detail?
I think this is one of the more likely ideas to ultimately work
out, but it has a host of problems. For one thing, you need to
have strong and reliable predictions of gene function. For
another, not all microbes will be able to execute all pathways, for
various biochemical reasons. So I expect the failure rate of this approach
to be quite high, at least at first.
Pros: when it works, it'll be awesome! And, unlike the functional
metagenomics approach, you can really engineer anything you want -
you don't need to find all your components already assembled in a
PCR product or fosmid.
Cons: expensive at first, and likely to have a high failure rate.
Unknown scalability, but probably can be heavily automated, especially
if you use selection approaches to enrich for organisms that work
(see previous item).
Computational exploration and determination of gene function
Look at the genome, feed it into a model of metabolism, and try to
understand what genes are doing and what genes are
missing. Metabolic flux analysis provides
one way to quickly identify whether a given gene complement is
sufficient to "explain" observed metabolism, but I'm unsure of how
well it works for badly annotated genomes (my guess? badly ;).
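To make the "what genes are missing" idea a bit more concrete, here is a minimal sketch. Note that it is not metabolic flux analysis: it swaps in a much simpler reachability check (sometimes called network expansion) that only asks whether a target metabolite can be produced at all from a set of seed metabolites. The reaction names, metabolites, and the `gap_candidates` helper are all made up for illustration.

```python
def reachable_metabolites(reactions, seeds):
    """Expand the seed set until no further reaction can fire.

    reactions maps a reaction name to (substrates, products).
    """
    available = set(seeds)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            # A reaction fires once all of its substrates are available.
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available


def gap_candidates(annotated, reference, seeds, target):
    """Return reference reactions that, added one at a time, make target producible."""
    if target in reachable_metabolites(annotated, seeds):
        return []  # nothing is missing for this target
    fixes = []
    for name, reaction in reference.items():
        if name in annotated:
            continue
        trial = dict(annotated)
        trial[name] = reaction
        if target in reachable_metabolites(trial, seeds):
            fixes.append(name)
    return sorted(fixes)


# Hypothetical toy network: the "genome" lacks an annotation for R2.
reference = {
    "R1": (["glucose"], ["g6p"]),
    "R2": (["g6p"], ["pyruvate"]),
    "R3": (["pyruvate"], ["acetyl_coa"]),
    "R4": (["acetyl_coa", "oxaloacetate"], ["citrate"]),
}
annotated = {name: rxn for name, rxn in reference.items() if name != "R2"}
seeds = {"glucose", "oxaloacetate"}

missing = gap_candidates(annotated, reference, seeds, "citrate")
```

In this toy case `missing` points at the one un-annotated step. A real analysis would need stoichiometry, cofactors, and reversibility, and a badly annotated genome would likely produce many candidate gaps rather than one.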
You can marry this kind of metabolic analysis with the kind of
nifty fill-in-the-blank work that Valerie de Crecy-Lagard does -- I
met Valerie a few years back on a visit to UFL, and thought, hey,
we need hundreds of people like her! Valerie tracks down "missing"
pathway genes in bacterial genomes, using a mixture of bioinformatics
and experimental techniques. This is going to be important if you're
predicting metabolic activity based on the presence/absence of annotated
In practice, this is going to be much easier in organisms that are
phylogenetically closer to model systems, where we can make better
use of homology to identify likely mis-annotated or un-annotated
genes. It also doesn't help us identify completely new functions
except by locating missing energy budgets.
Pros: completely or largely computational and hence potentially quite
Cons: completely or largely computational, so unlikely to work that
well :). Critically dependent on prior information, which we
already know is lacking. And hard or impossible to validate; until
you get to the point where on balance the predictions are not wrong,
it will be hard to get people to consider the expense of validation.
Gene-centric metabolic modeling
Rather than trying to understand how a complete microbe works, you can
take your cue from geochemistry and try to understand how a set of genes
(and transcripts, and proteins) all cooperate to execute the given
biogeochemistry. The main example I know of this is from Reed et al.
2013, with Julie Huber (@JulesDeep) and Greg Dick.
Pros: completely or largely computational and hence potentially quite
Cons: requires a fair bit of prior information. But perhaps easier to
validate, because you get predictions that are tied closely to a
particular biogeochemistry that someone already cares about.
Sequence everything and look for correlations.
This is the quintessential Big Data approach: if we sequence everything,
and then correlate gene presence/absence/abundance with metadata and
(perhaps) a smattering of hypotheses and models, then we might be able
to guess at what genes are doing.
Aaron Garoutte (@AaronGoroutte)
made the excellent point that we could use these correlations as a
starting point to decide which genes to invest more time and energy
in analyzing. When confronted with hundreds of thousands of genes --
where do you start? Maybe with the ones that correlate best with
the environmental features you're most interested in ...
Pros: we're doing the sequencing anyway (although it's not clear to me
that the metadata is sufficient to follow through, and data availability
is a problem). Does not rely on prior information at all.
Cons: super unlikely to give very specific predictions; much more
likely to provide a broad range of hypotheses, and we don't have
the technology or scientific culture to do this kind of work.
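As a rough illustration of using correlations to decide where to start, here is a small sketch that ranks genes by the strength of their Spearman rank correlation with one environmental measurement. The gene names, abundances, and the nitrate values are invented, and Spearman is just one reasonable choice of statistic, not something the approach requires.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rank_genes(abundance, metadata):
    """Sort genes by absolute correlation with the metadata, strongest first."""
    scores = {gene: spearman(values, metadata) for gene, values in abundance.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical data: one value per sample, in the same sample order.
nitrate = [0.1, 0.5, 1.0, 2.0, 4.0]
abundance = {
    "gene_A": [1, 3, 4, 9, 20],
    "gene_B": [7, 7, 6, 8, 7],
    "gene_C": [30, 12, 9, 3, 1],
}
ranked = rank_genes(abundance, nitrate)
```

Here `gene_A` and `gene_C` track nitrate (in opposite directions) and float to the top, while `gene_B` sinks to the bottom. With real data sets you would be doing this across a huge number of genes, so the scores are better read as a prioritization than as evidence of function.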
Look for signatures of positive selection across different communities.
This is an approach suggested by Tom Schmidt and Barry Williams,
for which there is a paper soon to be submitted by Bjorn Ostman and
Tracy Teal et al. The basic idea is to look for signatures of
adaptive pressures on genes in complex metagenomes, in situations
where you believe you know what the overall selection pressure is.
For example, in nitrogen-abundant situations you would expect
different adaptive pressures on genes than in more nitrogen-limited
circumstances, so comparisons between fertilized and unfertilized
soils might yield something interesting.
Pros: can suggest gene function without relying on any functional
information at all.
Cons: unproven, and the multiple-comparison problem with statistics
might get you. Also, needs experimental confirmation!
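On the multiple-comparison worry: one standard way to handle it is to control the false discovery rate with the Benjamini-Hochberg procedure, which assumes the tests are independent or positively dependent. The sketch below is a plain-Python version; the p-values are invented, and I'm not claiming this is what any particular selection analysis uses.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per test: True if it passes BH control at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff = rank

    # Every test ranked at or below the cutoff is called significant.
    keep = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff:
            keep[index] = True
    return keep


# Hypothetical per-gene p-values from a selection test.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
significant = benjamini_hochberg(p_values)
naive_hits = sum(p < 0.05 for p in p_values)
```

In this example a naive `p < 0.05` cutoff calls four genes, while the corrected version keeps only the two strongest. With hundreds of thousands of genes that gap gets much bigger.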
My favorite idea - a forward evolutionary screen
Here's an idea that I've been kicking around for a while with
(primarily) Rich Lenski (@RELenski), based on some Campylobacter
work with JP
Jerome and Linda Mansfield.
Take fast evolving organisms (say, pathogens), and evolve them in
massive replicate on a variety of different carbon sources or other
conditions (plates vs liquid; different host genotypes; etc.) and
wait until they can't cross-grow. Then, sequence their genomes and
figure out what genes have been lost. You can now assume that
genes that are lost are not important for growing in the condition
the lineage was evolved in, and put them in a database for people to query when
they want to know what a gene might not be important for.
We saw just this behavior in Campylobacter when we did serial transfer
in broth, and then plated it on motility assay plates: Campy lost its
motility genes, first reversibly (regulation) and then irreversibly
(conversion to pseudogene).
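The bookkeeping for this kind of screen could be sketched roughly as follows: compare each evolved line's gene content to the ancestor, tally losses per condition, and keep genes that are lost in a reasonable fraction of replicates. The gene names, conditions, the `min_fraction` threshold, and the assumption that each genome is just a set of gene names are all simplifications for illustration.

```python
from collections import Counter


def lost_genes(ancestor, evolved):
    """Genes present in the ancestor but absent from an evolved genome."""
    return set(ancestor) - set(evolved)


def loss_table(ancestor, lines_by_condition, min_fraction=0.5):
    """For each condition, map gene -> fraction of replicate lines that lost it.

    Assumes every condition has at least one replicate line.
    """
    table = {}
    for condition, lines in lines_by_condition.items():
        counts = Counter()
        for genome in lines:
            counts.update(lost_genes(ancestor, genome))
        table[condition] = {
            gene: count / len(lines)
            for gene, count in counts.items()
            if count / len(lines) >= min_fraction
        }
    return table


def dispensable_in(table, gene):
    """Conditions in which the gene was repeatedly lost, i.e. probably not needed."""
    return sorted(condition for condition, genes in table.items() if gene in genes)


# Hypothetical toy data: three replicate lines in broth, two on plates.
ancestor = {"flaA", "motB", "glnA", "lacZ"}
lines_by_condition = {
    "broth": [{"glnA", "lacZ"}, {"glnA", "lacZ", "motB"}, {"glnA", "lacZ"}],
    "plate": [{"flaA", "motB", "glnA"}, {"flaA", "motB", "glnA", "lacZ"}],
}
table = loss_table(ancestor, lines_by_condition)
```

Querying `dispensable_in(table, "flaA")` then answers the "what might this gene not be important for" question. Real data would need to distinguish deletions from pseudogenes and from reversible regulatory changes, which this sketch ignores.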
Harriet Alexander (@nekton4plankton) pointed out to me that
this bears some similarity to the kinds of transposon mutagenesis
experiments that were done in many model organisms in the 90s -
basically, forward genetics. Absolutely!
I have to think through how useful forward genetics would be in
this field a bit more thoroughly, though.
Pros: can be automated and can scale; takes advantage of massive
sequencing; should find lots of genes.
Cons: potentially quite expensive; unlikely to discover genes specific
to particular conditions of interest; requires a lot of effort for
things to come together.
So that's my list.
Can't we all get along? A need for complementary approaches.
I doubt there's a single magical approach, a silver bullet, that will
solve the overall problem quickly. Years, probably decades, of blood,
sweat, and tears will be needed. I think the best hope, though, is to
find ways to take advantage of all the tools at our disposal -- the
-omics tools, in particular -- to tackle this problem with reasonably
close coordination between computational and experimental and
theoretical researchers. The most valuable approaches are going to be
the ones that accelerate experimental work by utilizing hypothesis
generation from large data sets, targeted data gathering in pursuit of
a particular question, and pointed molecular biology and biochemistry
experiments looking at what specific genes and pathways do.
How much would this all cost?
Suppose I was a program manager and somebody gave me \$5m a year for 10
years to make this happen. What would be my Fantasy Grants Agency
split? (Note that, to date, no one has offered to give me that much
money, and I'm not sure I'd want the gig. But it's a fun brainstorming
I would devote roughly a third of the money to culture-based efforts
(#1-#3), a third to building computational tools to support analysis
and modeling (#6-#9), and a third to developing out the crazy ideas
(#4, #5, and #10). I'd probably start by asking for a mix of 3 and 5
year grant proposals: 3 years of lots of money for the stuff that
needs targeted development, 5 years of steady money for the crazier
approaches. Then I'd repeat as needed, trying to balance the craziness
More importantly, I'd insist on pre-publication sharing of all the
data within a walled garden of all the grantees, together with regular
meetings at which all the grad students and postdocs could mix to talk
about how to make use of the data. (This is an approach that Sage
Bionetworks has been pioneering for biomedical research.) I'd
probably also try to fund one or two groups to facilitate the data
storage and analysis -- maybe at \$250k a year or so? -- so that all of
the technical details could be dealt with.
Is \$50m a lot of money? I don't think so, given the scale of the
problem. I note that a few years back, the NIH NIAID proposed to
devote 1-3 R01s (so \$2-4m total) to centers devoted to exploring the
function of 10-20 pathogen genes each, so that's in line with what I'm
proposing for tackling a much larger problem.