Yesterday I gave my third keynote address ever, at the Australasian
Genomics Technology Association's annual meeting in Melbourne (talk
slides here). On my
personal scale of talks, it was a 7 or 8 out of 10: I gave it a lot of
energy, and I think the main messages got across, but I ended up
conveying a few things that I regretted - in particular, the Twitter
feed pointed out that I'd slagged on biologists a few times (slag, v:
British informal. Criticize (someone) in an abusive and insulting
manner).
Absolutely not my intent!
I consider myself a biologist, albeit one who works primarily on
biological data analysis. So I'm starting to call myself a
data-intensive biologist. And I thought now would be a good time to
talk about the emerging discipline of Data Intensive Biology.
Before I go on, this post is dedicated to my Australian hosts. I've
had a wonderful time so far, and I've been tremendously impressed with
the Australasian genomics and bioinformatics research communities.
They've got something great going on here and I look forward to further
interactions!
Who are Data Intensive Biologists?
In my talk, I included the de rigueur picture of a tidal wave, reproduced
below.
This tidal wave is intended to represent the data deluge in biology,
the combination of lots of -omics and sensor data that is starting to
hit the field of biology. If you look closely at the picture, you'll
see three groups of researchers.
- the inland researchers, who are up on the shore well away from
the boardwalk. They are looking at the tidal wave, wondering if
the water's going to reach them; they're far away from it, though,
so they're not really too worried about it.
- the boardwalk researchers, who are on the little walkway at the
base of the crashing wave. Some of them are looking at the wave in
horror, aware that this is going to be painful; others are busy
putting on little life vests, in the vain hope that this will save
them; and a third group is looking the other way, wondering what
everyone is talking about.
- the surfer dudes and dudettes, who are having the time of their
lives, surfing down the face of the wave. Every now and then they
fall off, but the water's deep and they can get right back on the
board. (OK, it's an imperfect analogy.)
The surfers are the data intensive biologists: they love the data, they
love the thrill of data analysis, and they're going somewhere fast, although
maybe not in a forward direction.
The anatomy of a Data Intensive Biologist
In my ABiC 2014 keynote (keynote #2), I listed out five character types that participate in
bioinformatics. These were:
- Computer scientists
- Software engineers
- Data scientists
- Statisticians
- Biologists
(I missed a 6th, database maintainers, and possibly a 7th, data curators.)
In another miscommunication, I meant to say (but did not say during my
talk) that almost every effective bioinformatics researcher is
some linear combination of these seven characters. I think that data
intensive biologists can be defined on this set of axes, too.
So: data intensive biologists are biology researchers who are
focused on biological questions, but who have substantial footings
in many or all of the other fields above. That is, their focus is on
making biological progress, but they are using tools from computer
science, software engineering, data science, statistics, and databases
to study biology.
Some additional characteristics
Data Intensive Biologists:
- are usually well grounded in at least one field of biology;
- understand that data is not information, but that it's a darn good start;
- know that most of our current biological knowledge is limited or wrong,
but that we've got to rely on it anyway;
- are aware that investing in automation is sometimes repaid 10x in efficiency,
except when it's not;
- realize that reproducible computational analyses are a great idea;
- write software when they have to, but only because they have to;
- think data science training is neat, because it gives an enduring competitive
edge;
- constantly rebalance themselves between the CS, software engineering,
data science, and stats perspectives -- almost as if they were surfing
across those fields;
- get that open science is obvious if not always practical;
- are confused as to why biology graduate programs aren't teaching more
data analysis;
and, finally, data intensive biologists
- shouldn't worry about their career options.
Concluding thoughts
If you had to nail down a definition - people like to do that, for some
reason :) - I would go with:
Data intensive biologist: a researcher focused on addressing
biological research questions primarily through large scale data
analysis or integration.
I don't have any interest in being exclusionary with this definition.
If you're tackling biological questions in any way, shape, or form,
and you're using lots of data to do it, you fit my definition!
Oh, and by the way? There are already workshops and faculty
positions in this area.
Although they're all at Hopkins, so maybe the best you can say is
that James Taylor and I agree on terminology :).
--titus