The emerging discipline of Data Intensive Biology

Yesterday I gave my third keynote address ever, at the Australasian Genomics Technology Association's annual meeting in Melbourne (talk slides here). On my personal scale of talks, it was a 7 or 8 out of 10: I gave it a lot of energy, and I think the main messages got across, but I ended up conveying a few things that I regretted - in particular, the Twitter feed pointed out that I'd slagged on biologists a few times (slag, v: British informal. Criticize (someone) in an abusive and insulting manner). Absolutely not my intent!

I consider myself a biologist, albeit one who works primarily on biological data analysis. So I'm starting to call myself a data-intensive biologist. And I thought now would be a good time to talk about the emerging discipline of Data Intensive Biology.

Before I go on, this post is dedicated to my Australian hosts. I've had a wonderful time so far, and I've been tremendously impressed with the Australasian genomics and bioinformatics research communities. They've got something great going on here and I look forward to further interactions!

Who are Data Intensive Biologists?

In my talk, I included the de rigueur picture of a tidal wave, reproduced below.

This tidal wave is intended to represent the data deluge in biology, the combination of lots of -omics and sensor data that is starting to hit the field of biology. If you look closely at the picture, you'll see three groups of researchers.

  • the inland researchers, who are up on the shore well away from the boardwalk. They are looking at the tidal wave, wondering if the water's going to reach them; they're far away from it, though, so they're not really too worried about it.
  • the boardwalk researchers, who are on the little walkway at the base of the crashing wave. Some of them are looking at the wave in horror, aware that this is going to be painful; others are busy putting on little life vests, in the vain hope that this will save them; and a third group is looking the other way, wondering what everyone is talking about.
  • the surfer dudes and dudettes, who are having the time of their lives, surfing down the face of the wave. Every now and then they fall off, but the water's deep and they can get right back on the board. (OK, it's an imperfect analogy.)

The surfers are the data intensive biologists: they love the data, they love the thrill of data analysis, and they're going somewhere fast, although maybe not in a forward direction.

The anatomy of a Data Intensive Biologist

In my ABiC 2014 keynote (keynote #2), I listed out five character types that participate in bioinformatics. These were:

  1. Computer scientists
  2. Software engineers
  3. Data scientists
  4. Statisticians
  5. Biologists

(I missed a 6th, database maintainers, and possibly a 7th, data curators.)

In another miscommunication, I meant to say (but did not say during my talk) that almost every effective bioinformatics researcher is some linear combination of these seven characters. I think that data intensive biologists can be defined on this set of axes, too.

So: data intensive biologists are biology researchers who are focused on biological questions, but who have a substantial footing in many or all of the other fields above. That is, their focus is on making biological progress, but they are using tools from computer science, software engineering, data science, statistics, and databases to study biology.

Some additional characteristics

Data Intensive Biologists:

  • are usually well grounded in at least one field of biology;
  • understand that data is not information, but that it's a darn good start;
  • know that most of our current biological knowledge is limited or wrong, but that we've got to rely on it anyway;
  • are aware that investing in automation is sometimes repaid 10x in efficiency, except when it's not;
  • realize that reproducible computational analyses are a great idea;
  • write software when they have to, but only because they have to;
  • think data science training is neat, because it gives an enduring competitive edge;
  • constantly rebalance themselves between the CS, software engineering, data science, and stats perspectives -- almost as if they were surfing across those fields;
  • get that open science is obvious if not always practical;
  • are confused as to why biology graduate programs aren't teaching more data analysis;

and, finally, data intensive biologists

  • shouldn't worry about their career options.

Concluding thoughts

If you had to nail down a definition - people like to do that, for some reason :) - I would go with:

Data intensive biologist: a researcher focused on addressing biological research questions primarily through large-scale data analysis or integration.

I don't have any interest in being exclusionary with this definition. If you're tackling biological questions in any way, shape, or form, and you're using lots of data to do it, you fit my definition!

Oh, and by the way? There are already workshops and faculty positions in this area. Although they're all at Hopkins, so maybe the best you can say is that James Taylor and I agree on terminology :).

--titus
