Introductory lecture for:
DTL Focus meeting: Metadata for data reusability: eNotebook standards
31 OCTOBER 2019 Holland Heart House, Utrecht
Organised by DTL/ELIXIR-NL in collaboration with COST CHARME
In this meeting, we will explore our experiences with data reusability and evaluate how we can make sure that the R in FAIR data really means reusability for purposes other than the original research questions. What existing metadata standards can be used? What kind of extensions of standards and tools do we need? Since the eNotebook and biological study database communities are largely separate, the question arises whether we should connect them better and, if so, how we can best do that.
Many of the minimal metadata standards currently used to describe study data are meant to help overcome the problem that published scientific studies often cannot be reproduced. Solving the reproducibility problem is indeed important. However, when talking about FAIR data, the R really stands for data reuse, and that goes beyond reproducibility. When reusing data, we want to use the same data to answer questions other than the ones originally answered. That typically asks for more detailed data descriptions, richer metadata and the publication of data that was not deemed relevant for the original study. For example, to reuse data collected in an epidemiological study, much more detail is needed about the composition of a cohort than the statement that it is “similar in composition” to a reference cohort.
Of course, asking for more data causes new problems. Study capture databases like Molgenis, the Phenotype database, and the ISA tools need to be able to capture the richer data, and high-level study repositories like BioStudies and BioSamples need to be able to store the information collected in such study databases or provided directly to them. The other problem, which is probably even bigger, is that researchers are not likely to provide information that they do not deem relevant for their own research questions. Improved data citation, and the credit that comes with it, may make this more rewarding. But we also need mechanisms that make it much easier to provide the richer data. eNotebooks might be especially relevant here: since they are increasingly used to record study designs, protocols, raw data and study results, they form an important source of information. To really benefit from this we need to look at eNotebook data standards and data export formats. Such standards are now being developed and we are glad that Klemen Zupancic will join this meeting as an expert on eNotebook standards. We will also need to evaluate how we can use such standards to connect eNotebook data to study capture databases and data repositories.
2. From reproducibility to reusability
Announcement: https://www.dtls.nl/events/dtl-focus-meeting-metadata-for-data-reusability-enotebook-standards/
4. Failing to reuse is expensive
Elon Musk: suppose you had to throw away a 747 every
time you fly to the US (well, two if you want to go back).
Suppose you had to redo the experiments every time you
want to use the data.
6. The difference
Reproducibility:
can I answer the same question again and get the same result?
Reuse:
can I, in addition, answer different questions using the same data?
7. Reuse is essential
More value for money
It is a core argument for funders to stimulate:
Ø Better data management
Ø FAIR data
Ø Research data infrastructure to allow that
Ø Compute infrastructure
It is typically argued that a substantial share of the total research budget (5-10%)
should be allocated for this.
8. Integrative Systems Biology
[Figure: integrative systems biology diagram. Recoverable labels: internal & external data repositories (e.g. dbNP, Sage, Atlas); knowledge resources & (semantic web) integration (e.g. Open PHACTS, WikiPathways); study capturing (ISA); models; study data; processing, statistics, storage (e.g. arrayanalysis.org); ontologies; modeling & data integration, network biology (extension), supervised statistics; curation, simulation; annotation & provenance; research applications; mapping (BridgeDb); extraction, SPARQLing, conversion.]
9. Reuse typically needs more metadata
Ø Hard to predict which data is needed
Ø Too much work
Ø No clear incentive to deposit data in repositories (not needed to publish)
Ø Even standards aim for the minimum (e.g. MIAME, MIAPPE)
Ø Solution? Can we:
Ø Facilitate collection of richer metadata
Ø Keep everything collected by default
Ø For that purpose connect data resources
10. Example 1
Multiple human studies look at effect of high fat vs low fat diets.
Typically these studies compare two groups of individuals
(often in a cross-over design).
Typically the groups are described as "otherwise the same",
e.g. same average age and comparable age range.
Can you reuse and combine these studies to study age effects?
Only when individual ages are stored.
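
The point of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field names (`age`, `effect`, `mean_age`) and numbers are invented, and `effect` stands for whatever diet response a study measured. Two studies that look identical at the summary level can only be pooled and regrouped by age when the individual ages were kept.

```python
from statistics import mean

# Invented example records; study names, field names and numbers are
# illustrative assumptions, not data from any real study.
summary_only = [
    {"study": "A", "n": 2, "mean_age": 45.0, "mean_effect": 2.0},
    {"study": "B", "n": 2, "mean_age": 45.0, "mean_effect": 3.5},
]

individual = [
    {"study": "A", "subject": "A1", "age": 30, "effect": 1.0},
    {"study": "A", "subject": "A2", "age": 60, "effect": 3.0},
    {"study": "B", "subject": "B1", "age": 35, "effect": 2.0},
    {"study": "B", "subject": "B2", "age": 55, "effect": 5.0},
]


def can_regroup_by_age(records):
    """True only if every record carries an individual-level age."""
    return all("age" in record for record in records)


def mean_effect_by_age_band(records, cutoff):
    """Pool records across studies and average the effect per age band."""
    if not can_regroup_by_age(records):
        raise ValueError("individual ages were not stored; cannot regroup by age")
    bands = {"younger": [], "older": []}
    for record in records:
        band = "younger" if record["age"] < cutoff else "older"
        bands[band].append(record["effect"])
    return {band: mean(values) for band, values in bands.items() if values}


if __name__ == "__main__":
    print(can_regroup_by_age(summary_only))   # summaries alone do not suffice
    print(mean_effect_by_age_band(individual, cutoff=45))
```

With the summary records both studies report the same mean age, so no age comparison is possible; the individual records allow a new grouping that the original studies never intended.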
11. Example 2
Some of these studies mention Vitamin E content of diet.
Can you study (or exclude) effects of added (vs naturally present)
Vitamin E?
Only if:
• It is clear whether the Vitamin E was added or natural
• The fat source is clear, so the natural content can be estimated
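
The two conditions above could be checked mechanically if the diet description were stored as structured metadata. The sketch below assumes a hypothetical record layout; the field names `vitamin_e_origin` and `fat_source`, the allowed values and the example diets are all invented for illustration and do not come from any existing standard.

```python
# Hypothetical controlled vocabulary for the origin of Vitamin E in a diet.
KNOWN_ORIGINS = {"added", "natural"}


def vitamin_e_reusable(diet):
    """True if a diet record allows separating added from natural Vitamin E.

    Both conditions from the slide must hold: the origin is stated, and the
    fat source is recorded so the natural content can be estimated.
    """
    origin_known = diet.get("vitamin_e_origin") in KNOWN_ORIGINS
    fat_source_known = bool(diet.get("fat_source"))
    return origin_known and fat_source_known


# Invented example records.
minimal_diet = {"vitamin_e_mg_per_day": 10}
partial_diet = {"vitamin_e_mg_per_day": 10, "vitamin_e_origin": "natural"}
rich_diet = {
    "vitamin_e_mg_per_day": 10,
    "vitamin_e_origin": "added",
    "fat_source": "sunflower oil",
}

if __name__ == "__main__":
    for name, diet in [("minimal", minimal_diet),
                       ("partial", partial_diet),
                       ("rich", rich_diet)]:
        print(name, vitamin_e_reusable(diet))
```

A check like this only works when the richer fields were captured at collection time, which is the argument for keeping everything by default.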
12. Data moves through funnel
Ø Collected on paper or not at all (and often lost)
Ø Collected in eNotebooks
Ø Uploaded to study databases like dbNP and Molgenis
Ø Uploaded to data repositories
13. eNotebooks
Ø Not often mentioned as essential by ELIXIR community
Ø Should follow ISA principles (facilitate collection of study design,
assays performed, sample descriptions)
Ø Export/import standards would facilitate data transfer:
- Between different eNotebook types
- To study databases (for combination and advanced analysis)
- To repositories directly
Ø As for study databases, (shared) templates could help to harmonise
data
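
As a rough illustration of what an export step along ISA lines might look like, the sketch below maps a hypothetical eNotebook entry onto a nested structure with study design, samples and assays. The input layout, the field names and the function `notebook_to_isa_like` are assumptions made for this example; it does not follow the actual ISA-JSON schema or any real eNotebook export format.

```python
import json


def notebook_to_isa_like(entry):
    """Map a (hypothetical) eNotebook entry to an ISA-style nested record."""
    samples = [
        {
            "name": sample["id"],
            # Keep every recorded attribute, not only a minimal set.
            "characteristics": {k: v for k, v in sample.items() if k != "id"},
        }
        for sample in entry.get("samples", [])
    ]
    assays = [
        {
            "measurement": assay["type"],
            "samples": list(assay.get("sample_ids", [])),
        }
        for assay in entry.get("assays", [])
    ]
    return {
        "study": {"title": entry["title"], "design": entry.get("design", "")},
        "samples": samples,
        "assays": assays,
    }


# Invented example entry, as an eNotebook might hold it internally.
example_entry = {
    "title": "High fat vs low fat diet",
    "design": "cross-over",
    "samples": [
        {"id": "S1", "subject_age": 34, "diet": "high_fat"},
        {"id": "S2", "subject_age": 34, "diet": "low_fat"},
    ],
    "assays": [{"type": "transcriptomics", "sample_ids": ["S1", "S2"]}],
}

if __name__ == "__main__":
    print(json.dumps(notebook_to_isa_like(example_entry), indent=2))
```

Because the output is plain JSON, the same structure could in principle be passed to another eNotebook, a study database or a repository, which is the transfer path the slide describes.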
14. Study capture databases
Ø More visibility in ELIXIR community
Ø Typically do follow ISA principles (facilitate collection of study design,
assays performed, sample descriptions)
Ø Export/import standards would facilitate data transfer:
- To other study databases
- To repositories directly (often implemented)
Ø Development of shared templates started.
Ø Modular software design would facilitate reuse of components
17. (ELIXIR core) data repositories
Ø Very visible in ELIXIR community
Ø Almost always follow ISA principles
Ø Typically facilitate upload of raw data in native format
Ø Support for upload from eNotebooks and study capture databases
varies
Ø Most are technology (transcriptomics, metabolomics, etc.) specific
Ø BioSamples de facto can be the place to describe studies (samples)
Ø BioStudies became the place where “other” studies are captured