Note: This is the second post in a mini-series of blog posts inspired
by the workshop Envisioning the Scientific Paper of the Future.
An important yet rarely articulated assumption of a lot of my work in
biological data analysis is that data implies software: it's not
much good gathering data if you don't have the ability to analyze it.
For some data, spreadsheet software is good enough. This was the situation
molecular biology was in up until the early 2000s - sure, we'd get numbers
from instruments and sequences from sequencers, but they'd all fit pretty
handily in whatever software we had lying around.
Once numerical data sets get big enough -- e.g. I did approximately
50,000 qPCRs in my last two years of grad school, which was unpleasant
to handle in Excel -- we need to invest in software like R or Python,
which can do bulk and batch processing of the data. Software like
OpenRefine can also help with "manual" cleaning
and rationalization of the data. But this requires skills that
are still relatively specialized.
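To make that concrete, here is a minimal sketch (in Python, with pandas) of the kind of bulk processing I mean. The file name and column names are invented for illustration; any real qPCR export will look different.

```python
# A minimal sketch of batch-processing qPCR results with pandas.
# The file name and the columns (plate, gene, Cq) are hypothetical.
import pandas as pd

# Load every measurement at once instead of scrolling through a spreadsheet.
df = pd.read_csv("qpcr_results.csv")

# Drop failed reactions and summarize technical replicates per plate and gene.
ok = df[df["Cq"].notna() & (df["Cq"] < 40)]
summary = (ok.groupby(["plate", "gene"])["Cq"]
             .agg(["mean", "std", "count"])
             .reset_index())

summary.to_csv("qpcr_summary.csv", index=False)
print(summary.head())
```

The point isn't the specific library; it's that a dozen lines of script can do, repeatably, what would take hours of error-prone clicking in Excel.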
For other data, we need custom software built specifically for that
data type. This is true of sequence analysis, where most of my work
is focused: when you get 200m DNA sequences, each of length 150 bp,
there's no simple, effective way to query or summarize that using general
computational tools. We need specialized code to parse, summarize,
explore, and investigate these data sets. Using this code doesn't
necessarily require serious programming knowledge, but data analysts
may need fortitude in dealing with potentially immature software, as well
as a duct-tape mentality when tying together software that
wasn't designed to integrate, or repurposing tools that were built
for something else entirely.
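As a toy illustration of what even the simplest "specialized code" looks like, here is a sketch that streams through a (hypothetical) compressed FASTQ file and reports basic summary statistics, something a spreadsheet simply cannot do at this scale. Real tools do far more, and use proper parsers, but the streaming pattern is the same.

```python
# A rough sketch: streaming summary statistics over a large FASTQ file
# without loading it into memory. The filename is hypothetical, and real
# pipelines would use an established FASTQ parser and more error checking.
import gzip

n_reads = 0
total_bp = 0

with gzip.open("reads.fastq.gz", "rt") as fp:
    for i, line in enumerate(fp):
        if i % 4 == 1:          # in FASTQ, every 4th line (offset 1) is sequence
            n_reads += 1
            total_bp += len(line.strip())

print(f"{n_reads} reads, {total_bp} bp total, "
      f"{total_bp / n_reads:.1f} bp average")
```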
There is at least one other category of data analysis software that I
can think of but haven't personally experienced - that's the kind of
stuff that CERN and Facebook and Google have to deal with, where the
data sets are so overwhelmingly large that you need to build deep
software and hardware infrastructure to handle them. This becomes (I
think) more a matter of systems engineering than anything else, but I
bet there is a really strong domain knowledge component that is required
of at least some of the systems engineers here. I think some of the
cancer sequencing folk are close to this stage, judging from a talk I heard
from Lincoln Stein two years ago.
Data-intensive research increasingly lives beyond the "spreadsheet" level
As data set sizes increase across the board, researchers are
increasingly finding that spreadsheets are insufficient. This is for
all the reasons articulated in the Data Carpentry spreadsheet lesson,
so I won't belabor the point any more, but what does this mean for us?
So, increasingly, our analysis results don't depend on spreadsheets;
they depend on custom data processing scripts (in R, MATLAB, Python,
etc.), on other people's programs (e.g., in bioinformatics, mappers
and assemblers), and on multiple steps of data handling, cleaning,
summation, integration, analysis, and summarization.
And, as is nicely laid out in Stodden et al. (2016), all of these
steps are critical components of data interpretation and belong in
the Methods section of any publication!
What's your point, Dr. Brown?
When we talk about "the scientific paper of the future", one of the facets
that people are most excited about - and I think this Caltech panel
will probably focus on this facet - is that we now possess the technology
to readily and easily communicate the details of this data analysis.
Not only that, we can communicate it in such a way that it becomes
repeatable and explorable and remixable, using virtual environments and
data analysis notebooks.
I want to highlight something else, though.
When I read excellent papers on research data management like "10
aspects of highly effective research data"
(or is this a blog post? I can't always tell any more),
I falter at section headings that say data should be "comprehensible"
and "reviewed" and especially "reusable". This is not because they
are incorrect, but rather because these are so dependent on having
methods (i.e. software) to actually analyze the data. And that
software seems to be secondary for many data-focused folk.
For me, however, they are one and the same.
If I don't have access to software customized to deal with the
data-type-specific nuances of a given data set (e.g. batch effects in
RNAseq data), the data set is much less useful.
If I don't know exactly what statistical cutoffs were used to extract
information from this data set by others, then the data set is much less
useful. (I can make my own determination as to whether those cutoffs
were good cutoffs, but if I don't know what they were, I'm stuck.)
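For example (a hypothetical sketch, not anyone's actual analysis): cutoffs that live in a script as named parameters are cutoffs I can inspect, second-guess, and change.

```python
# A sketch of why recording cutoffs matters: if the thresholds live in the
# script, anyone reusing the data can see them and revisit them.
# The file and column names here are hypothetical.
import pandas as pd

# The cutoffs that were actually used, stated explicitly.
MAX_FDR = 0.05
MIN_ABS_LOG2_FC = 1.0

results = pd.read_csv("differential_expression.csv")
significant = results[(results["fdr"] < MAX_FDR) &
                      (results["log2_fold_change"].abs() >= MIN_ABS_LOG2_FC)]

significant.to_csv("significant_genes.csv", index=False)
print(f"{len(significant)} of {len(results)} genes pass the cutoffs")
```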
If I don't have access to the custom software that was used to
remove noise, generate the interim results, and do the large-scale
data processing, I may not even be able to approximate the same final
results.
Where does this leave me? I think:
Archived data has diminished utility if we do not have the ability to
analyze it; for most data, this means we need software.
For each data set, we should aim to have at least one fully
articulated data processing pipeline (that takes us from data to
results). Preferably, this would be linked to the data somehow.
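To sketch what that might look like (everything below is a hypothetical placeholder, not a recommendation of specific tools), even a plain driver script that names every step from raw data to figures is a big improvement over a prose Methods paragraph:

```python
# A sketch of "at least one fully articulated pipeline": a single script
# that goes from archived raw data to reported results, with every step
# named. The step scripts and file names are hypothetical placeholders.
import subprocess

STEPS = [
    ("quality-trim raw reads",    "trim_reads.py raw_reads.fastq.gz trimmed.fastq.gz"),
    ("map reads to reference",    "map_reads.py trimmed.fastq.gz reference.fa counts.csv"),
    ("apply statistical cutoffs", "call_significant.py counts.csv results.csv"),
    ("make the paper's figures",  "make_figures.py results.csv figures/"),
]

for description, command in STEPS:
    print(f"=> {description}: {command}")
    subprocess.run(["python"] + command.split(), check=True)
```

A proper workflow tool (Make, Snakemake, and friends) would do this better, but even something this simple, archived alongside the data, tells the next person exactly how to get from the data to the results.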
What I'm most excited about when it comes to the scientific paper of
the future is that most visions for it offer an opportunity to do
exactly this! In the future, we will increasingly attach detailed (and
automated and executable) methods to new data sets. And along with
driving better (more repeatable) science, this will drive better data
reuse and better methods development, and thereby accelerate science
overall.
Fini.
--titus
p.s. For a recent bioinformatics effort in enabling large-scale data
reuse, see "The Lair" from
the Pachter Lab.
p.p.s. No, I did not attempt to analyze 50,000 qPCRs in a spreadsheet.
...but others in the lab did.