From reproducibility to reusability: capturing metadata at the source

Chris Evelo

DTL Focus meeting: Metadata for data reusability: eNotebook standards.
31 October 2019. Holland Heart House, Utrecht
Announcement: https://www.dtls.nl/events/dtl-focus-meeting-metadata-for-data-reusability-enotebook-standards/

Failing to reuse is expensive
Elon Musk: suppose you had to throw away a 747 every
time you fly to the US (well, two if you want to come back).
Suppose you had to redo the experiments every time you
want to use the data.

The difference
Reproducibility:
Can I answer the same question again and get the same result?
Reuse:
Can I, in addition, answer different questions using the same data?

Reuse is essential
More value for money
It is a core argument for funders to stimulate:
Ø Better data management
Ø FAIR data
Ø Research data infrastructure to allow that
Ø Compute infrastructure
It is typically argued that a substantial share of the total research budget
(5-10%) should be allocated to this.

Integrative Systems Biology
[Diagram. Labels: internal & external data repositories (e.g. dbNP, Sage, Atlas); knowledge resources & (semantic web) integration (e.g. Open PHACTS, WikiPathways); study capturing (ISA); study data processing, statistics, storage (e.g. arrayanalysis.org); ontologies; models; modeling & data integration, network biology (extension), supervised statistics; curation, annotation & provenance; simulation; research applications; mapping (BridgeDb); extraction, SPARQLing, conversion]

Reuse typically needs more metadata
Ø Hard to predict which data is needed
Ø Too much work
Ø No clear incentive to put in repositories (not needed to publish)
Ø Even standards aim for minimal (e.g. MIAME, MIAPPE)
Ø Solution? Can we:
- Facilitate collection of richer metadata?
- Keep everything collected by default?
- For that purpose, connect data resources?

Example 1
Multiple human studies look at effect of high fat vs low fat diets.
Typically these studies compare two groups of individuals
(often in a cross-over design).
Typically the groups are described as "otherwise the same",
e.g. same average age and comparable age range.
Can you reuse and combine these studies to study age effects?
Only when individual ages are stored.
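The point of Example 1 can be sketched in a few lines of Python. This is only an illustration: the records, field names (`age`, `response`) and the age cutoff are invented for the example, not taken from any of the studies mentioned. It shows that a group-level summary (what a publication usually reports) cannot be turned back into an age analysis, whereas pooled per-participant records can.

```python
from statistics import mean

# Hypothetical per-participant records from two diet studies (invented values).
study_a = [
    {"study": "A", "age": 25, "response": 1.0},
    {"study": "A", "age": 45, "response": 2.0},
    {"study": "A", "age": 65, "response": 3.0},
]
study_b = [
    {"study": "B", "age": 30, "response": 1.5},
    {"study": "B", "age": 60, "response": 2.5},
]


def summarise(records):
    """Group-level age statistics only: enough to show groups are comparable."""
    ages = [r["age"] for r in records]
    return {"mean_age": mean(ages), "min_age": min(ages), "max_age": max(ages)}


def mean_response_by_age_band(records, cutoff=50):
    """Pool participants across studies and compare two age bands.

    This only works because every record carries an individual age.
    The cutoff of 50 is an arbitrary choice for the illustration.
    """
    younger = [r["response"] for r in records if r["age"] < cutoff]
    older = [r["response"] for r in records if r["age"] >= cutoff]
    return {"younger": mean(younger), "older": mean(older)}


summary_a = summarise(study_a)          # all that survives without individual ages
pooled = study_a + study_b              # reuse: combine studies at participant level
by_age = mean_response_by_age_band(pooled)
print(summary_a)
print(by_age)
```

From `summary_a` alone there is no way to compute `by_age`; the individual ages have to be captured at the source.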

Example 2
Some of these studies mention Vitamin E content of diet.
Can you study (or exclude) effects of added (vs naturally present)
Vitamin E?
Only if:
• It is clear whether the Vitamin E was added or naturally present
• The fat source is clear, so the natural content can be estimated
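The two conditions above amount to a simple completeness check on the diet description. A minimal sketch, assuming hypothetical field names (`vitamin_e_origin`, `fat_source`, `vitamin_e_mg_per_day`) and invented study records:

```python
# Hypothetical metadata fields needed before Vitamin E effects can be assessed.
REQUIRED_FIELDS = ("vitamin_e_origin", "fat_source")


def missing_for_vitamin_e_reuse(diet_metadata):
    """Return the required fields that are absent or empty in a diet description."""
    return [f for f in REQUIRED_FIELDS if not diet_metadata.get(f)]


# Invented example records: both mention Vitamin E content, only one says more.
diets = {
    "study_1": {
        "vitamin_e_mg_per_day": 12,
        "vitamin_e_origin": "added",
        "fat_source": "sunflower oil",
    },
    "study_2": {"vitamin_e_mg_per_day": 9},
}

reusable = [name for name, d in diets.items() if not missing_for_vitamin_e_reuse(d)]
print(reusable)
print(missing_for_vitamin_e_reuse(diets["study_2"]))
```

Both studies report a Vitamin E value, yet only the first can be reused for this question, because the second never recorded where the Vitamin E came from.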

Data moves through a funnel
Ø Collected on paper or not at all (and often lost)
Ø Collected in eNotebooks
Ø Uploaded to study databases like dbNP and Molgenis
Ø Uploaded to data repositories

eNotebooks
Ø Not often mentioned as essential by ELIXIR community
Ø Should follow ISA principle (facilitate collection of study design,
assays performed, sample descriptions)
Ø Export/import standards would facilitate data transfer:
- Between different eNotebook types
- To study databases (for combination and advanced analysis)
- To repositories directly
Ø As for study databases, (shared) templates could help to harmonise
data
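What an ISA-style export/import could look like is sketched below. This is a deliberately simplified, hypothetical structure (study design, samples, assays) serialised to JSON; it is not the official ISA-JSON schema and not the API of any existing eNotebook or ISA tool. All class and field names are assumptions made for the illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Sample:
    name: str
    characteristics: dict = field(default_factory=dict)  # e.g. age, diet


@dataclass
class Assay:
    measurement_type: str
    technology: str
    sample_names: list = field(default_factory=list)


@dataclass
class Study:
    title: str
    design: str
    samples: list = field(default_factory=list)
    assays: list = field(default_factory=list)


def export_study(study):
    """Serialise a study to JSON so another tool can pick it up."""
    return json.dumps(asdict(study), indent=2, sort_keys=True)


def import_study(text):
    """Rebuild a Study from the JSON produced by export_study."""
    raw = json.loads(text)
    return Study(
        title=raw["title"],
        design=raw["design"],
        samples=[Sample(**s) for s in raw["samples"]],
        assays=[Assay(**a) for a in raw["assays"]],
    )


# Invented example content.
study = Study(
    title="High fat vs low fat diet",
    design="cross-over",
    samples=[Sample("P01-baseline", {"age": 34, "diet": "low fat"})],
    assays=[Assay("transcription profiling", "microarray", ["P01-baseline"])],
)

round_tripped = import_study(export_study(study))
print(round_tripped == study)
```

A shared format of this kind is what would let the same record travel from one eNotebook to another, to a study database, or to a repository without re-entering the metadata.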

Study capture databases
Ø More visibility in ELIXIR community
Ø Typically do follow ISA principles (facilitate collection of study design,
assays performed, sample descriptions)
Ø Export/import standards would facilitate data transfer:
- To other study databases
- To repositories directly (often implemented)
Ø Development of shared templates started.
Ø Modular software design would facilitate reuse of components

(ELIXIR core) data repositories
Ø Very visible in ELIXIR community
Ø Almost always follow ISA principles
Ø Typically facilitate data upload of raw data in native format
Ø Support for upload from eNotebooks and study capture databases
varies
Ø Most are technology-specific (transcriptomics, metabolomics, etc.)
Ø BioSamples can de facto be the place to describe studies (samples)
Ø BioStudies became the place where "other" studies are captured
