Unsub grant text: A Coordination Center for Training in Biomedical Data Science

Just as I was moving to UC Davis, a funding call for a training coordination center came out. I got partway down the path of applying for it before realizing that I was overwhelmed with the move, but I did generate some text that I thought was OK. Here it is!

The increasing velocity, variety, and volume of data generated in biomedical research is challenging the existing data management and analysis skills of most researchers. Access to and application of these forms of data is hindered not only by a lack of training and training opportunities in data-intensive biomedical research, but also by the heterogeneity of biomedical data as well as limited discoverability of training materials relevant to different biomedical fields. Many biomedical fields of study - genomics, informatics, biostatistics, and epidemiology, among others - have invested in data analysis methodologies and training materials, but no comprehensive index of biomedical data analysis methodologies, trainers, or training materials exists to address this training gap. Significant investments in biomedical data science training by the Helmsley Trust and the NIH BD2K program, as well as by foundations not devoted to human health such as the Moore and Sloan Foundations, highlight the need for and opportunities in peer knowledge coordination and training.

We propose to bridge the gap between the availability of biomedical big data and the needs of biomedical researchers to make use of this data by building a coordination center around proven principles of open online collaboration. This coordination center will nucleate a national and international community of expert trainers, together with a catalog of openly available supporting materials developed by this community, to enable the discoverability of resources and the training of data-intensive biomedical researchers using modern, evidence-based teaching practices.

Aim 1: Build an index to enable categorization, discovery, and review of open educational resources.

Subaim 1A: Create and maintain an index of open educational resources.

Subaim 1B: Create and maintain the software tooling behind the index, including categorization, discovery, and review of resources.

Subaim 1C: Support categorization, discovery, and personalization of educational resources through a controlled vocabulary, personalized search, and lesson tracks.

Aim 2: Coordinate with existing biomedical/data science research and training community.

Subaim 2A: Build a "matchmaking service" to help scientists identify potential collaborative partners for lab rotations.

Subaim 2B: Connect and coordinate training components for national and international biomedical data science initiatives, including BD2K Center awardees, BD2K R25 awardees, and Foundation funders.

Subaim 2C: Facilitate connections and communication with the larger Data Science training community.

Aim 3: Build a community of trainers and contributors to reuse, review, remix, and create training materials.

Subaim 3A: Initiate regional training centers (Davis, Harvard, St. Louis (or Chicago/Florida)) for coordinating trainers and for material discovery and curation, needs analysis, and assessment.

Subaim 3B: Build and maintain a catalog of trained instructors that enables discoverability, coordination and collaboration for training purposes.

Subaim 3C: Encourage and develop a diverse community of contributors through partnerships at regional training centers and Data Carpentry initiatives.


As the volume, variety, and velocity of biomedical data increase, so too does the variety of training needs in analyzing this data. Data science training specific to biomedical research is still relatively rare, and where it exists it is siloed, reflecting the bottom-up emergence of training and education in response to research-area specific needs. However, as the research community and funders respond to the increasing need with increased effort and funding, there is an opportunity to coordinate efforts to serve the broader purpose of training so-called "pi-shaped" researchers - researchers with deep backgrounds in both biomedicine and data science.

We propose to create a virtual training coordination center to organize and coordinate online training materials, facilitate interactions and connections between the many biomedical and data science communities, and nucleate the formation of a more cohesive and more diverse biomedical data science training community.

This TCC will build and maintain a catalogue of open educational resources that can be personalized to researcher-specific career goals, and coordinate software development and data management with the ELIXIR-UK TeSS project (tess.oerc.ox.ac.uk), which serves the European community. Our main effort will be to provide automated and semi-automated gathering and classification of training materials, integrated into a sustainable open system that can be used by others, and served via a personalized curriculum system that can recommend materials based on research interests and prior training.
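To make the idea of a personalized curriculum system a bit more concrete, here is a minimal sketch of how recommendations based on research interests and prior training might work. This is only an illustration under stated assumptions, not a design from the proposal: the `Resource` fields, the tag names, the lesson titles, and the `recommend` function are all hypothetical, and a real system would presumably draw its tags from the controlled vocabulary and its entries from the catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One catalogue entry. All field names here are hypothetical."""
    title: str
    topics: frozenset                     # controlled-vocabulary tags
    prerequisites: frozenset = frozenset()  # titles that should be completed first


def recommend(catalogue, interests, completed):
    """Rank resources the learner has not yet taken, and is ready for,
    by how many of their topics overlap with the learner's interests."""
    interests = set(interests)
    completed = set(completed)
    scored = []
    for res in catalogue:
        if res.title in completed:
            continue  # already taken
        if not res.prerequisites <= completed:
            continue  # prior training is missing
        overlap = len(res.topics & interests)
        if overlap:
            scored.append((overlap, res.title, res))
    # Most overlap first; ties are broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [res for _, _, res in scored]


# Made-up example catalogue, for illustration only.
catalogue = [
    Resource("Shell basics", frozenset({"automation"})),
    Resource("RNA-seq analysis", frozenset({"genomics", "statistics"}),
             frozenset({"Shell basics"})),
    Resource("Reproducible workflows",
             frozenset({"reproducibility", "automation"}),
             frozenset({"Shell basics"})),
    Resource("Intro statistics", frozenset({"statistics"})),
]

for res in recommend(catalogue, {"genomics", "statistics"}, {"Shell basics"}):
    print(res.title)
```

Ranking by simple tag overlap is a stand-in chosen for brevity; anything more sophisticated (lesson tracks, career-stage weighting) would slot into the scoring step.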

We will also interact with the national and international biomedical data science communities, including the newly funded BD2K centers, R25 workshop and material grants, EU's ELIXIR, etc., and facilitate connections between these communities and broader data science training initiatives that include the Moore/Sloan Data Science Environments at UW, NYU, and Berkeley, as well as Software Carpentry, Data Carpentry, and the Mozilla Science Lab. A key component of this will be a "matchmaking" service that seeks to identify and support potential collaboration and "lab rotation" opportunities for biomedical scientists looking for data science collaborators.

Finally, we will work to nucleate a diverse and expert community of trainers who can use, reuse, remix, and build new materials. This community will be built upon regional training coordination centers and a recurrent Train-the-Trainers (T3) program to introduce trainers to materials, training and assessment approaches, and technology useful for training. We will also emphasize the inclusion of underrepresented minorities in the T3 program.


The growth of data in the natural sciences has been explosive, with a simultaneous and dramatic increase in all "three Vs" of data - volume, velocity, and variety - over the last two decades. This growth in data has in turn led to an increasing interest in quantitative and computational aspects of data analysis by academic and industry researchers. Computational infrastructure, analysis software, statistical methods for data analysis and integration, and research into the fundamental methods underpinning data-driven discovery have all grown apace.

The growth in data has led to a training gap, in which many researchers suffer from the lack of a solid foundation in quantitative and computational methods. This gap is especially large in basic biology and biomedical research, where traditionally very few researchers have received any training in data analysis beyond basic mathematics and statistics. Moreover, as the volume and importance of data grow and the pace of research into data-driven discovery accelerates, the training gap is widening. Meanwhile, the opportunities for careers in biology-specific data analysis and data-driven discovery are increasing rapidly in both industry and academia, further increasing this gap between the need for trained researchers and the supply.

A number of training programs have stepped up to address this gap. One of the largest and broadest is Software Carpentry, a global non-profit which runs two-day intensive workshops on basic computational practice for academic scientists; as of 2015, Software Carpentry has trained over 10,000 students in 14 years. While not limited to biology, in 2014 half of the Software Carpentry workshops were biology focused, and approximately 1300 of the 2600 trainees were from biology backgrounds. iPlant Collaborative, an NSF-funded center focused on biological data analysis, has also run many training workshops to address the training gap. Internationally, the EU's ELIXIR program and the Australian Bioinformatics Network are focused on biological information, and have significant training programs.

More biomedically focused workshops and training programs in data science have also begun to be developed. Of particular note, NHGRI has funded a variety of computational training over the last decade, including T-32s, R25s, and K and F mechanisms. In the last year, the BD2K Initiative - formed specifically in recognition of cross-Institute opportunities and challenges in data science - has funded a number of "Big Data" centers and R25 workshop and resource development grants, with more to come in 2015. Most recently, the Helmsley Trust has invested $1.7m in the Mozilla Science Lab to help increase the capacity of biomedical scientists to integrate computation into their research.

The landscape of data science training is much larger than biology and biomedical science, of course. In the past few years, a tremendous variety of online resources, including written tutorials, videos, Massive Open Online Courses (MOOCs), and webinars, has emerged. Many universities and institutes have started data science training programs, with a notable investment by the Moore and Sloan foundations in Data Science Environments at NYU, UW, and UC Berkeley focused on data-driven discovery. Furthermore, an NSF investment in BIO Centers led to the initiation of Data Carpentry, a sister non-profit to Software Carpentry that is focused on linking domain-specific data analysis methodology to the broader contexts of efficiency and reproducibility; Data Carpentry is now funded by the Moore Foundation.

Thus, the training landscape in data science generally, and biomedical data science specifically, is large, complex, and international. Moreover, the number of training programs and initiatives is growing fast.

Over the last few years, several themes in biomedical data science training have emerged:

Many styles of training are needed, across many career levels. The training breakout at the BD2K "ADDSup" meeting in September 2014 summarized biomedical training opportunities in 10 dimensions, including formal vs informal, beginner to advanced, in-person vs online, short course vs long, centralized vs physical, andragogy vs pedagogy, project-based vs structured, individual vs group learning, "just in time" vs background training, and basic to clinically focused. Different types of materials, teaching approaches, and assessment approaches are appropriate for each of these.

Training opportunities are increasingly oversubscribed. Both surveys and anecdotes suggest that the perceived need for training in biomedical data science is great. For example, the Australian Bioinformatics Network survey on bioinformatics needs concluded that access to training is by far the most dominant concern for biologists. The summer workshop on sequence analysis run by Dr. Brown routinely has 5-10x more applicants (~200) than can be accommodated (25). Software Carpentry and Data Carpentry workshops with a biology focus typically fill up within two days of their announcement and always have a waiting list. And online courses on data analysis and statistics typically have tens of thousands of participants.

Common concerns of reproducibility, efficiency, automation, and statistical correctness underpin every data science domain. At the most domain-specific level, data science training inevitably must focus on specific data types, data analysis problems, and data analysis software. However, as trainees grow in expertise, the same concerns consistently emerge no matter the domain: How do we make this analysis reproducible? How can we most efficiently use available compute resources? How do we run the current analysis on a new data set? How do we assess significance and correctness of our results? This convergence suggests a role for inter-domain coordination of training, especially because some biomedical domains (such as bioinformatics) have explored this area in more depth than others.

There is a great need for more instructors versed in both advanced biomedical data science and evidence-based educational practice. The gap in biomedical data science is nowhere more evident than in the lack of instructors capable of teaching data science to biomedical researchers! A consistent theme from universities is that faculty working in this area are overwhelmed by existing teaching and research opportunities, which limits the available instructor pool. A major limiting factor in offering more biology-focused Software Carpentry workshops has been a lack of instructors, although this is somewhat ameliorated by the use of graduate students and postdocs.

A more diverse trainee and instructor pool is needed. By definition, underrepresented minorities are underrepresented in faculty lines, but in biomedical data science, this further intersects with the significant underrepresentation of women and minorities in quantitative and computational disciplines. However, we are still at an early stage in biomedical data science where this underrepresentation could be addressed by targeted initiatives.

There is a strong need (and attendant opportunity) for centralized coordination in biomedical data science training. Cataloging of training opportunities and materials could increase the efficiency and reuse of existing training, as well as identify where materials and training are lacking. Instructor training can increase the available pool of trained educators as well as provide opportunities for underrepresented populations to get involved in training. And coordination across domains on more advanced training topics could broaden the scope of these advanced materials.
