Published: Sat 09 May 2015
By C. Titus Brown
I finally got a chance to more thoroughly read Mark Stalzer and Chris
Mentzel's arXiv preprint, "A Preliminary Review of Influential Works
in Data-Driven Discovery". This
is a short review paper that discusses concepts highlighted by the
1,000+ "influential works" lists submitted to the Moore Foundation's
Data Driven Discovery (DDD) Investigator Competition. (Note: I was
one of the awardees.)
The core of this arXiv preprint is the section on "Clusters of
Influential Works", in which Stalzer & Mentzel go through in detail
the eight different concept clusters that emerged from their analysis
of the submissions. This is a fascinating section that should be
at the top of everyone's reading list. The topics covered are, in
the order presented in the paper, as follows:
1. Foundational theory, including Bayes' Theorem, information theory, and
   Metropolis sampling;
2. Astronomy, and specifically the Sloan Digital Sky Survey;
3. Genomics, focused around the Human Genome Project and methods for
   searching and analyzing sequencing data;
4. Classical statistical methods, including the lasso, bootstrap methods,
   boosting, expectation-maximization, random forests, false discovery rate,
   and "isomap" (which I'd never heard of!);
5. Machine learning, including Support Vector Machines, artificial neural
   networks (and presumably deep learning?), logistic belief networks,
   and hidden Markov models;
6. The Google! Including PageRank, MapReduce, and "the overall anatomy"
   of how Google does things; specific implementations included Hadoop,
   BigTable, and Cloud Dataflow;
7. General tools, programming languages, and computational methods,
   including Numerical Recipes, the R language, the IPython Notebook
   (Project Jupyter), the Visual Display of Quantitative Information,
   and SQL databases;
8. Centrality of the Scientific Method (as opposed to specific tools or
   concepts). Here the discussion focused around the Fourth Paradigm
   book, which lays out the expansion of the scientific method from
   empirical observation to theory to simulation to "big data science";
   here, I thought the point that computers were used for both theory
   and observation was well-made. This section is particularly worth
   reading, in my opinion.
This collection of concepts is simply delightful - Stalzer and Mentzel
provide both a summary of the concepts and a fantastic curated set of
high-level references.
Since I don't know many of these areas that well (I've heard of most
of the subtopics, but I'm certainly not expert in ... any of them?
yikes) I evaluated the depth of their discussion by looking at the
areas I was most familiar with - genomics and tools/languages/methods.
My sense from this was that they covered the highlights of tools
better than the highlights of genomics, but this may well be because
genomics is a much larger and broader field at the moment.
Data-Driven Discovery vs Data Science
One interesting question that comes up frequently is what the
connections and overlaps are between data-driven discovery, data science,
big data, data analysis, computational science, etc. This paper
provides a lot of food for thought and helps me draw some
distinctions. For example, it's clear that computational science
includes or at least overlaps with all of the concepts above, but
computational science also includes things like modeling that I don't
think clearly fit with the "data-driven discovery" theme. Similarly,
in my experience "data science" encompasses tools and methods, along
with intelligent application of them to specific problems, but
practically speaking does not often integrate with theory and
prediction. Likewise, "big data", in the sense of methods and
approaches designed to scale to analysis and integration of large data
sets, is clearly one important aspect of data-driven discovery - but
only in the sense that in many cases more data seems to be better.
Ever since the "cage match" round of the Moore DDD competition, where
we discussed these issues in breakout groups, I've been working
towards the internal conclusion that data-driven discovery is the
exploration and acceleration of science through development of new
data science theory, methods, and tools. This paper certainly helps
nail that down by summarizing the components of "data driven
discovery" in the eyes of its practitioners.
Is this a framework for a class or graduate training theme?
I think a lot about research training, in several forms. I do a lot
of short-course peer instruction (e.g. Data Carpentry, Software
Carpentry, and my DIB efforts); I've been talking with people about
graduate courses and graduate curricula, with especial emphasis on
data science (e.g. the Data Science Initiative at UC Davis); and, most generally,
I'm interested in "what should graduate students know if they want to
work in data-driven discovery"?
From the training perspective, this paper lays out the central
concepts that could be touched on either in a survey course or in
an entire graduate program; while my sense is that a PhD would
require coupling to a specific domain, I could certainly imagine a
Master's program or a dual degree program that touched on the
theory and practice of data driven discovery.
For one example, I would love to run a survey course on these topics,
perhaps in the area of biology. Such a course could go through
each of the subsections above, and discuss them in relation to
biology - for example, how Bayes' Theorem is used in medicine,
or how concepts from the Sloan Digital Sky Survey could be applied
to genomics, or where Google-style infrastructure could be used
to support research.
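
As a rough sketch of what the Bayes' Theorem example might look like in
such a course (this is my own illustration, not something from the
paper), here is the classic diagnostic test calculation. The prevalence,
sensitivity, and specificity values are made up for illustration, and
`posterior_probability` is a hypothetical helper name.

```python
def posterior_probability(prior, sensitivity, specificity):
    """Return P(disease | positive test) using Bayes' Theorem.

    prior        -- P(disease), i.e. prevalence in the tested population
    sensitivity  -- P(positive test | disease)
    specificity  -- P(negative test | no disease)
    """
    # Probability of a positive test from someone who has the disease.
    true_positive = sensitivity * prior
    # Probability of a positive test from someone who does not.
    false_positive = (1.0 - specificity) * (1.0 - prior)
    # Bayes' Theorem: P(D | +) = P(+ | D) P(D) / P(+)
    return true_positive / (true_positive + false_positive)


if __name__ == "__main__":
    # Illustrative (invented) numbers: 1% prevalence, 95% sensitivity,
    # 90% specificity.
    p = posterior_probability(prior=0.01, sensitivity=0.95, specificity=0.90)
    print(f"P(disease | positive test) = {p:.3f}")
```

With these invented numbers the posterior comes out just under 9%, which
is the kind of counterintuitive result that makes for good discussion:
a fairly accurate test on a rare condition still yields mostly false
positives.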
There's more than enough meat in there to have a whole graduate
program, though. One or two courses could integrate theory and tools,
another course could focus on practical application in a specific
domain, a third course could talk about general practice and computing
tools, and a fourth course could discuss infrastructure and scaling.
The missing bits - "open science" and "training"
Something that I think was missing from the paper was an in-depth
perspective on the role that open source, open data, and open science
can play. While these concepts were directly touched on in a few of
the subsections - most of the tools described were open source, for
example, and Michael Nielsen's excellent book "Reinventing Discovery"
was mentioned briefly in the context of network effects in scientific
communication and access - I felt that "open science" was an
unacknowledged undercurrent throughout.
It's clear that progress in science has always relied on sharing
ideas, concepts, methods, theory, and data. What I think is not yet
as clear to many is the extent to which practical, efficient, and
widely available implementations of methods have become important in
the computer age. And, for data-driven discovery, an increasingly
critical aspect is the infrastructure to support data sharing,
collaboration, and the application of these methods to large data
sets. These two themes -- sharing of implementation and importance
of infrastructure -- cut across many of the subsections in this paper,
including the specific domains of astronomy and human genomics, as
well as the Google infrastructure and languages/tools/implementation
subsections. I think the paper could usefully add a section on this.
Interestingly, the Moore Foundation DDD competition implicitly
acknowledged this importance by enriching for open scientists in
their selection of the awardees -- a surprising fraction of the
Investigators are active in open science, including myself and Ethan
White, and virtually all the Investigators are openly distributing
their research methodology. In that sense, open science is a notable
omission from the paper.
It's also interesting to note that training is missing from the
paper. If you believe data-driven discovery is part of the future of
science, then training is important because there's a general lack of
researchers and institutions that cover these topics. I'd guess that
virtually no single researcher is well versed in a majority of the
topics, especially since many of the topics are entire scientific
super-fields, and the rest are vast technical domains. In academic
research we're kind of used to the idea that we have to work in
collaboration (practice may be different...), but here academia
really fails to cover the entire data-driven discovery spectrum
because of the general lack of emphasis on expert use of tools and
infrastructure in universities.
So I think that investment in training is where the opportunities lie
for universities that want to lead in data-driven discovery, and this
is the main chance for funders that want to enable the network effect.