Published: Sun 23 August 2015
By C. Titus Brown
In science.
tags: training nih bd2k
Just as I was moving to UC Davis, a funding call for a training
coordination center
came out. I got partway down the path of applying for it before
realizing that I was overwhelmed with the move, but I did generate
some text that I thought was OK. Here it is!
The increasing velocity, variety, and volume of data generated in
biomedical research are challenging the existing data management and
analysis skills of most researchers. Access to and application of
these forms of data are hindered not only by a lack of training and
training opportunities in data-intensive biomedical research, but also
by the heterogeneity of biomedical data as well as limited
discoverability of training materials relevant to different biomedical
fields. Many biomedical fields of study - genomics, informatics,
biostatistics, and epidemiology, among others - have invested in data
analysis methodologies and training materials, but no comprehensive
index of biomedical data analysis methodologies, trainers, or training
materials exists to address this training gap. Significant investments
in biomedical data science training by the Helmsley Trust and the
NIH BD2K program, as well as by foundations not devoted to human
health such as the Moore and Sloan Foundations, highlight the need for
and opportunities in peer knowledge coordination and training.
We propose to bridge the gap between the availability of biomedical
big data and the needs of biomedical researchers to make use of this
data by building a coordination center around proven principles of
open online collaboration. This coordination center will nucleate a
national and international community of expert trainers, together with
a catalog of openly available supporting materials developed by this
community, to enable the discoverability of resources and the training
of data-intensive biomedical researchers using modern, evidence-based
teaching practices.
Aim 1: Build an index to enable categorization, discovery, and review of open educational resources.
Subaim 1A: Create and maintain an index of open educational resources.
Subaim 1B: Create and maintain the software tooling behind the index, including categorization, discovery, and review of resources.
Subaim 1C: Support categorization, discovery, and personalization of educational resources through a controlled vocabulary, personalized search, and lesson tracks.
Aim 2: Coordinate with the existing biomedical and data science research and training communities.
Subaim 2A: Build a "matchmaking service" to help scientists identify potential collaborative partners for lab rotations.
Subaim 2B: Connect and coordinate training components for national and international biomedical data science initiatives, including BD2K Center awardees, BD2K R25 awardees, and Foundation funders.
Subaim 2C: Facilitate connections and communication with the larger Data Science training community.
Aim 3: Build a community of trainers and contributors to reuse, review, remix, and create training materials.
Subaim 3A: Initiate regional training centers (Davis, Harvard, and St. Louis, or Chicago/Florida) to coordinate trainers and carry out material discovery and curation, needs analysis, and assessment.
Subaim 3B: Build and maintain a catalog of trained instructors that enables discoverability, coordination and collaboration for training purposes.
Subaim 3C: Encourage and develop a diverse community of contributors through partnerships at regional training centers and Data Carpentry initiatives.
Introduction
As the volume, variety, and velocity of biomedical data increase, so
too has the variety of training needs in analyzing this data. Data
science training specific to biomedical research is still relatively
rare, and where it exists it is siloed, reflecting the bottom-up
emergence of training and education in response to research-area
specific needs. However, as the research community and funders
respond to the increasing need with increased effort and funding,
there is an opportunity to coordinate efforts to serve the broader
purpose of training so-called "pi-shaped" researchers - researchers
with deep backgrounds in biomedicine and data science both.
We propose to create a virtual training coordination center (TCC) to
organize and coordinate online training materials, facilitate
interactions and connections between the many biomedical and data
science communities, and nucleate the formation of a more cohesive and
more diverse biomedical data science training community.
This TCC will build and maintain a catalog of open educational
resources that can be personalized to researcher-specific career
goals, and coordinate software development and data management with
the ELIXIR-UK TeSS project (tess.oerc.ox.ac.uk), which serves the
European community. Our main effort will be to provide automated and
semi-automated gathering and classification of training materials,
integrated into a sustainable open system that can be used by others,
and served via a personalized curriculum system that can recommend
materials based on research interests and prior training.
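As a rough illustration of the personalized curriculum idea, the following Python sketch ranks catalog entries by how many controlled-vocabulary tags they share with a researcher's interests, skipping lessons already taken and lessons whose prerequisites have not been met. The `Resource` record, the lesson titles, the tag names, and the scoring rule are all hypothetical; the proposal does not specify a data model or a ranking method.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resource:
    """Hypothetical catalog entry for one open educational resource."""
    title: str
    tags: frozenset  # terms from an assumed controlled vocabulary
    prerequisites: frozenset = field(default_factory=frozenset)  # required lesson titles


def recommend(catalog, interests, completed, limit=3):
    """Rank untaken resources whose prerequisites are met by tag overlap."""
    interests = frozenset(interests)
    completed = frozenset(completed)
    scored = []
    for res in catalog:
        if res.title in completed:
            continue  # already taken
        if not res.prerequisites <= completed:
            continue  # prior training does not yet cover the prerequisites
        overlap = len(res.tags & interests)
        if overlap > 0:
            scored.append((overlap, res.title))
    # Highest overlap first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:limit]]


# Made-up catalog entries, for illustration only.
catalog = [
    Resource("Intro to the shell", frozenset({"unix", "automation"})),
    Resource("RNA-seq analysis", frozenset({"genomics", "statistics"}),
             frozenset({"Intro to the shell"})),
    Resource("Reproducible workflows", frozenset({"automation", "reproducibility"}),
             frozenset({"Intro to the shell"})),
    Resource("Survival analysis", frozenset({"statistics", "epidemiology"})),
]

suggestions = recommend(
    catalog,
    interests={"genomics", "statistics", "reproducibility"},
    completed={"Intro to the shell"},
)
print(suggestions)
```

A real system would presumably weight tags, use review scores, and draw on automatically gathered metadata; simple set overlap is used here only to keep the sketch short.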
We will also interact with the national and international biomedical
data science communities, including the newly funded BD2K centers, R25
workshop and material grants, the EU's ELIXIR, etc., and facilitate
connections between these communities and broader data science
training initiatives that include the Moore/Sloan Data Science
Environments at UW, NYU, and Berkeley, as well as Software Carpentry,
Data Carpentry, and the Mozilla Science Lab. A key component of this
will be a "matchmaking" service that seeks to identify and support
potential collaboration and "lab rotation" opportunities for
biomedical scientists looking for data science collaborators.
Finally, we will work to nucleate a diverse and expert community of
trainers who can use, reuse, remix, and build new materials. This
community will be built upon regional training coordination centers
and a recurrent Train-the-Trainers (T3) program to introduce trainers
to materials, training and assessment approaches, and technology
useful for training. We will also emphasize the inclusion of
underrepresented minorities in the T3 program.
Background
The growth of data in the natural sciences has been explosive, with a
simultaneous and dramatic increase in all "three Vs" of data - volume,
velocity, and variety - over the last two decades. This growth in
data has in turn led to an increasing interest in quantitative and
computational aspects of data analysis by academic and industry
researchers. Computational infrastructure, analysis software,
statistical methods for data analysis and integration, and research
into the fundamental methods underpinning data-driven discovery have
all grown apace.
The growth in data has led to a training gap, in which many
researchers suffer from the lack of a solid foundation in quantitative
and computational methods. This gap is especially large in basic
biology and biomedical research, where traditionally very few
researchers have received any training in data analysis beyond basic
mathematics and statistics. Moreover, as the volume and importance of
data grow and the pace of research into data-driven discovery
accelerates, the training gap is widening. Meanwhile, the
opportunities for careers in biology-specific data analysis and
data-driven discovery are increasing rapidly in both industry and
academia, further increasing this gap between the need for trained
researchers and the supply.
A number of training programs have stepped up to address this gap.
One of the largest and broadest is Software Carpentry, a global
non-profit which runs two-day intensive workshops on basic
computational practice for academic scientists; as of 2015, Software
Carpentry has trained over 10,000 students in 14 years. While not
limited to biology, in 2014 half of the Software Carpentry workshops
were biology focused, and approximately 1300 of the 2600 trainees were
from biology backgrounds. iPlant Collaborative, an NSF-funded center
focused on biological data analysis, has also run many training
workshops to address the training gap. Internationally, the EU's
ELIXIR program and the Australian Bioinformatics Network are focused
on biological information, and have significant training programs.
More biomedically focused workshops and training programs in data
science have also begun to be developed. Of particular note, NHGRI
has funded a variety of computational training over the last decade,
including T-32s, R25s, and K and F mechanisms. In the last year,
the BD2K Initiative - formed specifically in recognition of
cross-Institute opportunities and challenges in data science - has
funded a number of "Big Data" centers and R25 workshop and resource
development grants, with more to come in 2015. Most recently, the
Helmsley Trust has invested $1.7m in the Mozilla Science Lab to help
increase the capacity of biomedical scientists to integrate
computation into their research.
The landscape of data science training is much larger than biology and
biomedical science, of course. In the past few years, a tremendous
variety of online resources, including written tutorials, videos,
Massive Open Online Courses (MOOCs), and webinars, have emerged. Many
universities and institutes have started data science training
programs, with a notable investment by the Moore and Sloan foundations
in Data Science Environments at NYU, UW, and UC Berkeley focused on
data-driven discovery. Furthermore, an NSF investment in BIO Centers
led to the initiation of Data Carpentry, a sister non-profit to
Software Carpentry that is focused on linking domain-specific data
analysis methodology to the broader contexts of efficiency and
reproducibility; Data Carpentry is now funded by the Moore Foundation.
Thus, the training landscape in data science generally, and biomedical
data science specifically, is large, complex, and international.
Moreover, the number of training programs and initiatives is growing
fast.
Over the last few years, several themes in biomedical data science
training have emerged:
Many styles of training are needed, across many career levels. The
training breakout at the BD2K "ADDSup" meeting in September 2014
summarized biomedical training opportunities in 10 dimensions,
including formal vs informal, beginner to advanced, in-person vs
online, short course vs long, centralized vs physical, andragogy vs
pedagogy, project-based vs structured, individual vs group learning,
"just in time" vs background training, and basic to clinically
focused. Different types of materials, teaching approaches, and
assessment approaches are appropriate for each of these.
Training opportunities are increasingly oversubscribed. Both surveys
and anecdotes suggest that the perceived need for training in
biomedical data science is great. For example, the Australian
Bioinformatics Network survey on bioinformatics needs concluded that
access to training is by far the dominant concern for biologists.
The summer workshop on sequence analysis run by Dr. Brown
routinely has 5-10x more applicants (~200) than can be accommodated
(25). Software Carpentry and Data Carpentry workshops with a biology
focus typically fill up within two days of their announcement and
always have a waiting list. And online courses on data analysis and
statistics typically have tens of thousands of participants.
Common concerns of reproducibility, efficiency, automation, and
statistical correctness underpin every data science domain. At the
most domain-specific level, data science training inevitably must
focus on specific data types, data analysis problems, and data
analysis software. However, as trainees grow in expertise, the same
concerns consistently emerge no matter the domain: how do we make this
analysis reproducible? How can we most efficiently use available
compute resources? How do we run the current analysis on a new data
set? How do we assess significance and correctness of our results?
This convergence suggests a role for inter-domain coordination of
training, especially because some biomedical domains (such as
bioinformatics) have explored this area in more depth than others.
There is a great need for more instructors versed in both advanced
biomedical data science and evidence-based educational practice. The
gap in biomedical data science is nowhere more evident than in the
lack of instructors capable of teaching data science to biomedical
researchers! A consistent theme from universities is that faculty
working in this area are overwhelmed by existing teaching and research
opportunities, which limits the available instructor pool. A major
limiting factor in offering more biology-focused Software Carpentry
workshops has been a lack of instructors, although this is somewhat
ameliorated by the use of graduate students and postdocs.
A more diverse trainee and instructor pool is needed. By definition,
underrepresented minorities are underrepresented in faculty lines, but
in biomedical data science, this further intersects with the
significant underrepresentation of women and minorities in
quantitative and computational disciplines. However, we are still at
an early stage in biomedical data science where this
underrepresentation could be addressed by targeted initiatives.
There is a strong need (and attendant opportunity) for centralized
coordination in biomedical data science training. Cataloging of
training opportunities and materials could increase the efficiency and
reuse of existing training, as well as identify where materials and
training are lacking. Instructor training can increase the available
pool of trained educators as well as provide opportunities for
underrepresented populations to get involved in training. And
coordination across domains on more advanced training topics could
broaden the scope of these advanced materials.