Abstract
In this presentation, Susan Gregurick, Ph.D., Associate Director of Data Science and Director, Office of Data Science Strategy at the National Institutes of Health, will share the NIH’s vision for a modernized, integrated FAIR biomedical data ecosystem and the strategic roadmap that NIH is following to achieve this vision. Dr. Gregurick will highlight projects being implemented by team members across the NIH’s 27 institutes and centers and will describe ways that industry, academia, and other communities can help NIH enable a FAIR data ecosystem. Finally, she will weave in how this strategy is being leveraged to address the COVID-19 pandemic.
Presenter: Susan Gregurick, Ph.D., Associate Director of Data Science and Director, Office of Data Science Strategy at the National Institutes of Health
dkNET Webinar Information: https://dknet.org/about/webinar
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09/2020
1. Creating and Sustaining a FAIR
Biomedical Data Ecosystem
Susan Gregurick, Ph.D.
Associate Director for Data Science and
Director, Office of Data Science Strategy
October 9, 2020
2. Making Data FAIR
Findable: must have unique identifiers, effectively labeling it within searchable resources.
Accessible: must be easily retrievable via open systems and effective and secure authentication and authorization procedures.
Interoperable: should “use and speak the same language” via use of standardized vocabularies.
Reusable: must be adequately described to a new user, have clear information about data-usage licenses, and have a traceable “owner’s manual,” or provenance.
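As a rough illustration of the four properties above, the sketch below checks a metadata record for one or two fields per FAIR property. The field names (`identifier`, `access_url`, `vocabulary`, `license`, `provenance`) and the example record are hypothetical and are not an NIH schema; a real assessment would rely on community metadata standards.

```python
# Hypothetical field names, grouped by FAIR property (not an NIH schema).
FAIR_FIELDS = {
    "findable": ["identifier"],
    "accessible": ["access_url"],
    "interoperable": ["vocabulary"],
    "reusable": ["license", "provenance"],
}


def fair_gaps(record):
    """Return, per FAIR property, the expected fields that are missing or empty."""
    gaps = {}
    for prop, fields in FAIR_FIELDS.items():
        missing = [f for f in fields if not record.get(f)]
        if missing:
            gaps[prop] = missing
    return gaps


# Made-up example record: it carries no license, so it is not yet fully reusable.
example = {
    "identifier": "doi:10.0000/example",
    "access_url": "https://example.org/dataset/1",
    "vocabulary": "MeSH",
    "provenance": "Placeholder: derived from study X",
}
print(fair_gaps(example))
```

A check like this only confirms that fields are present; it says nothing about whether the identifier resolves or the vocabulary terms are valid.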
5. NIH supports many different biomedical
research communities with diverse sets
of data
6. The Rime of the Ancient Mariner,
Samuel Taylor Coleridge
(excerpted)
Day after day, day after day,
We stuck, nor breath nor motion;
As idle as a painted ship
Upon a painted ocean.
Water, water, every where,
And all the boards did shrink;
Water, water, every where,
Nor any drop to drink.
7. This proliferation of data, and the
accompanying computing resources
and new algorithms, brings new
opportunities for discovery, as well as
new challenges
8. Journal articles could link
to repository data sets
Metadata were computable so that a
search for similar datasets was possible
Analysis tools were linked to datasets,
via GitHub, Bioconductor, Galaxy or
other….
9. NIDDK
The mission of the National Institute of Diabetes and Digestive and Kidney
Diseases (NIDDK) is to conduct and support research on diabetes and other
endocrine and metabolic diseases; digestive diseases, nutritional disorders,
and obesity; and kidney, urologic, and hematologic diseases, to improve health
and quality of life.
NIDDK supports research
studies across a wide variety of
disease areas and in turn
supports a variety of platforms
to house and manage the data
they each generate.
These studies utilize a spectrum
of modern experimental
techniques, generating different
modalities of data about the
patient and their disease state.
Collecting, integrating and
working with all this data
presents a variety of
challenges.
Challenges
New consortia would like to
share or reuse existing data
platforms rather than having
to create them from scratch
Integrating data from the
same patient across different
studies currently requires
significant manual effort
Supporting analysis and
visualization tools for
imaging data being produced
by various projects
Image from https://www.niddk.nih.gov/health-information/kidney-disease
10. Integration of GUDMAP expression data with GTEx eQTLs
Core Motivations
● GUDMAP contains gene
expression data across
various parts of the kidney and
urogenital system.
● The GTEx database contains
expression QTL (eQTL) data
correlating gene expression
with specific genomic variants
● Integrating GUDMAP data
with GTEx may lead to
insights into gene regulation in
kidney development and renal
disease
Potential Data Sources
● NIDDK GUDMAP
Genitourinary data
repository
● Common Fund GTEx gene
expression database
Research Scientist
Icon made by Roundicons from www.flaticon.com
As a renal disease
researcher, I want to
combine gene data from
GUDMAP with eQTL data
from the Common Fund
GTEx resource in order to
investigate variants
involved in regulating
renal gene expression
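The user story above amounts to a join on a shared gene identifier. The sketch below shows that join on toy rows; the field names and every value are placeholders and do not reflect the actual GUDMAP or GTEx export formats.

```python
from collections import defaultdict

# Toy rows; every value is a placeholder, not real GUDMAP or GTEx data.
gudmap_expression = [
    {"gene": "GENE_A", "tissue": "kidney cortex", "expression": 12.5},
    {"gene": "GENE_B", "tissue": "ureter", "expression": 3.1},
]
gtex_eqtls = [
    {"gene": "GENE_A", "variant": "variant_1"},
    {"gene": "GENE_A", "variant": "variant_2"},
    {"gene": "GENE_C", "variant": "variant_3"},
]


def join_on_gene(expression_rows, eqtl_rows):
    """Attach to each expression row the eQTL variants reported for the same gene."""
    variants_by_gene = defaultdict(list)
    for row in eqtl_rows:
        variants_by_gene[row["gene"]].append(row["variant"])
    return [
        {**row, "variants": variants_by_gene.get(row["gene"], [])}
        for row in expression_rows
    ]


joined = join_on_gene(gudmap_expression, gtex_eqtls)
for row in joined:
    print(row["gene"], row["tissue"], row["variants"])
```

In practice the two resources may not use the same gene identifiers, so a harmonization step would likely be needed before a join like this.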
11. Data integration within the TEDDY T1D platform
Core Motivations
● Data submitted to TEDDY at
different times and locations
are independent data releases
with different subject
identifiers per release
● The same subject will likely
have data spread across
multiple data releases
● Recombining this data is a
very manual process; an
integrated data environment
would simplify this
significantly
Potential Data Sources
● Genomics
● Epigenomics
● Transcriptomics
● Proteomics
● Metabolomics
Research Scientist
As a T1 diabetes
researcher, I want to
combine data across
TEDDY releases in order
to bring together all the
different modalities of
data collected from the
same subject
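The recombination step described above could look roughly like the following, assuming a crosswalk exists that maps each release-specific subject identifier to a stable subject key. The crosswalk, the identifiers, and the records are all hypothetical; the slide does not describe how TEDDY identifiers are actually reconciled.

```python
from collections import defaultdict

# Hypothetical crosswalk: (release, release-specific ID) -> stable subject key.
crosswalk = {
    ("release_1", "A17"): "subject_001",
    ("release_2", "Q42"): "subject_001",
    ("release_2", "Q43"): "subject_002",
}

# Toy records; modality names follow the slide, identifiers are invented.
records = [
    {"release": "release_1", "subject_id": "A17", "modality": "genomics"},
    {"release": "release_2", "subject_id": "Q42", "modality": "proteomics"},
    {"release": "release_2", "subject_id": "Q43", "modality": "metabolomics"},
    {"release": "release_3", "subject_id": "Z99", "modality": "transcriptomics"},
]


def group_by_subject(rows, id_map):
    """Group modalities by stable subject key; return unmapped rows separately."""
    by_subject = defaultdict(list)
    unmapped = []
    for row in rows:
        key = id_map.get((row["release"], row["subject_id"]))
        if key is None:
            unmapped.append(row)  # no crosswalk entry: needs manual review
        else:
            by_subject[key].append(row["modality"])
    return dict(by_subject), unmapped


merged, unmapped = group_by_subject(records, crosswalk)
print(merged)
print(len(unmapped), "record(s) without a crosswalk entry")
```

Keeping unmapped records in a separate list, rather than dropping them, makes the remaining manual effort visible.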
12. This is the promise
of the NIH
Strategic Plan for
Data Science
…and here’s how we will get there.
13. [Figure: Percentage of NIH Supported PMC publications with data availability statement, by year (2010 to 2019); y-axis: percentage, 0% to 75%]
14. NIH Data Management and Sharing
Policy Development
• Researchers with NIH-funded or conducted research projects resulting in
the generation of scientific data will be required to submit a Plan
• Plans should explain how scientific data generated by a research study
will be managed and which of these scientific data will be shared
Development steps:
• Community input solicited: 189 submissions from national and international stakeholders
• Identified need for appropriate infrastructure: policy and implementation to go ‘hand-in-hand’
• Develop draft policy for data management and sharing and related guidance
• Released draft for community input
• Release final policy (2020)
15. Overview of Sharing Publication and Related Data
Options of scaled implementation for sharing datasets
PubMed Central (datasets up to 2 gigabytes)
• PMC stores publication-related supplemental materials and datasets directly associated with publications. Up to 2 GB.
• Generate Unique Identifiers for the stored supplementary materials and datasets.
Use of commercial and non-profit repositories (datasets up to 20* gigabytes)
• Assign Unique Identifiers to datasets associated with publications and link to PubMed.
• Store and manage datasets associated with publication, up to 20* GB.
STRIDES Cloud Partners (high priority datasets, petabytes)
• Store and manage large scale, high priority NIH datasets. (Partnership with STRIDES)
• Assign Unique Identifiers, implement authentication, authorization and access control.
NIH strongly encourages open access Data Sharing Repositories as a first choice.
https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
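The three size tiers on this slide can be read as a simple routing rule. The sketch below is one such reading, not an NIH tool: the function name, the decimal-gigabyte convention, and the handling of large datasets that are not high priority are assumptions, and the asterisk on the 20 GB limit is not explained in the slide text.

```python
GB = 10**9  # decimal gigabyte; the slide does not say which convention applies


def sharing_option(size_bytes, high_priority=False):
    """Pick one of the slide's three options from dataset size and priority."""
    if high_priority:
        return "STRIDES Cloud Partners"
    if size_bytes <= 2 * GB:
        return "PubMed Central"
    if size_bytes <= 20 * GB:
        return "Commercial or non-profit repository"
    # Larger datasets that are not high priority are not covered by the slide.
    return "No option listed on the slide"


print(sharing_option(500 * 10**6))
print(sharing_option(15 * GB))
print(sharing_option(3 * 10**15, high_priority=True))
```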
16. NIH supports many repositories for
biomedical data sharing
AphasiaBank
17. How to Find Data Repositories
• BMIC Data Repository Listing
https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
• SciCrunch/dkNET
• Organized by repository type and scientific area.
https://dknet.org/about/Suggested-data-repositories
• FAIRsharing
https://fairsharing.org/
• DataMed
https://datamed.org/
18. Optimized Funding for NIH Data
Repositories and Knowledgebases
• Data resources are important
research tools
• Historically funded through
research grants
• Funding mechanism should
be optimal for type of
resource
• End goal: researcher
confident in data and
information integrity
• Solution: New Funding
Announcement for data
repositories and knowledgebases
• Resource plan requirement
Scientific Impact
Community Engagement
Quality of Data and Services and Efficiency of Operations
Governance
19. Optimized Funding for NIH Data
Repositories and Knowledgebases
Funding Opportunities
• NIH released two funding
opportunities on Jan. 17 to
support biomedical data
repositories and
knowledgebases:
• Biomedical Data Repository
(PAR-20-089)
• Biomedical Knowledgebase
(PAR-20-097)
Scientific Impact
Community Engagement
Quality of Data and Services and Efficiency of Operations
Governance
20. Piloting a FAIR Generalist
Repository Using Figshare
https://nih.figshare.com
Existing Figshare features Pilot-specific features
22. NIH Figshare Pilot – Key Takeaways
• Generalist repositories are growing – more researchers are depositing data and more publications are linking to generalist repositories.
• Researchers need more education and guidance – where to publish data and how to describe datasets in metadata fields effectively.
• Metadata enhancement enables greater discoverability – metrics indicate greater access but need longer time scale to observe data reuse.
23. Guiding researchers on better metadata to
enhance data discoverability
(graphic credit: Ontotext)
25. NIH is Harnessing the Power of the
Cloud for Biomedical Research
• Cloud computing offers multiple opportunities NIH can
leverage to advance biomedical research, including:
• Computation on biomedical data at an unprecedented scale
• Broad access to cutting-edge cloud technology with, for example,
industry-leading security tools
• Storage of large, diverse data in a way that enables easier sharing,
access, and reuse of data with other researchers
• A community-driven approach to data science that breaks down
disciplinary silos
• Adopt and develop cloud-based tools from industry or academia for
biomedical research
26. Turning Research Data Into
Knowledge and Discovery
The Science and Technology Research Infrastructure for
Discovery, Experimentation, and Sustainability (STRIDES)
Initiative
• State-of-the-art data storage and computational capabilities
• Training and education for researchers
• Innovative technologies such as artificial intelligence and
machine learning
Partnerships with and other commercial providers
27. STRIDES by the numbers*
• 17 NIH ICs
• 37 extramural institutions
• 279 programs/projects
• >2400 people trained
• $9M cost savings to participating ICs
• $51.5M / $18M obligated by NIH / expended to date
• 30M compute hours
• 80 petabytes stored
*as of 8/31
29. Moving Data to the Cloud for Large-Scale
Analysis
36.4 PB of public and
controlled-access
Sequence Read Archive
data in two clouds
(GCP & AWS)
30. Benefits of the Cloud for Large-Scale Analysis
"We can now do this in 3-4 days instead of 12+ months directly as a result of the SRA data being available in the cloud. This means we can share this data with the CoV researchers today, when it can make a difference, not a year from now. This is important for COVID-19 now, and will be important in response to the next pandemic."
– Artem Babaian, Lead Developer at Serratus and corresponding author for publication,
“Petabase-scale sequence alignment catalyses viral discovery”
https://www.biorxiv.org/content/10.1101/2020.08.07.241729v1.full
32. Supplements to Enhance Software
Tools for Open Science
• New collaborations between biomedical researchers and
software engineers
Enhance software engineering of valuable
scientific tools
• Working with STRIDES Initiative is encouraged but not
required
Make research tools “cloud-ready”
NOT-OD-20-073
33. Topics Funded Across 12 Institutes
and Centers
FHIR Clinical Cloud Commons
Biomolecular
Simulation
Biophysics
Genomics Imaging Neuroimaging
35. AI in Biomedicine: Opportunities
• Extract medical information from text in EMRs/EHRs
• Interpret genomic sequence data to understand impact of mutations on protein function
• Read medical images and help diagnose diseases like pneumonia and cancer
• Monitor sleep and vitals to send information about health at home to doctors
• Determine which calls to child welfare systems warrant deployment of family support and prevention resources to protect at-risk children
Examples from Katabi, Ng, Putnam-Hornstein, Troyanskaya, and others
36. NIH NVIDIA COVID CT-AI Classification
CT images have been used in Asia to detect COVID-19 virus in patients
• Preprocessing: conversion of DICOMs to NIfTI with 1x1x1 resampling
• Segmentation: AH-Net architecture produces a lung segmentation mask
• Image Classification: apply mask, then 3D-DenseNet-121 classification (Likely COVID vs. Unlikely COVID)
Baris Turkbey, Sheng Xu, Tom Sanford, Stephanie Harmon, Mona Flores, Daguang Xu, Xiasong Wang, Ziyue Xu, Holger Roth, Dong Yang, Evrim Turkbey, Mike Kassin, Maxime Blain, Brad Wood
37. New Common Fund Initiative:
Artificial Intelligence for BiomedicaL
Excellence (AIBLE)
May 15, 2020 - NIH Council of Councils
• https://dpcpsi.nih.gov/council/may-15-2020-agenda
• AI Concept Clearance (start at 1:25min)
https://videocast.nih.gov/watch=36031
• NIH Artificial Intelligence Working Group Final Report
39. Support flagship efforts that generate large-scale
experimental data, with billions of data points
designed to:
i. be well-suited for ML analysis and inference
ii. address key biomedical challenges
iii. stimulate new approaches in machine learning
And that implement processes designed to:
i. develop improved criteria and technical
mechanisms for data access
ii. strengthen ethical criteria for dataset use
(consent, privacy, accountability, ...)
Support flagship data generation efforts to propel progress
by the scientific community.
data ethics people
Projects should:
▪ address key biomedical
challenges using ML methods
▪ advance ML methods for future
use in biomedicine
▪ produce transformative data
sets, designed with ML in mind
▪ propel new ways to gather
massive data in biomedicine
▪ involve strong engagement
from leading ML
researchers
Project review should:
▪ incorporate expertise in ML as
well as traditional biomedical
domains
40. Publish criteria for evaluating datasets based on their value for ML-based analysis.
▪ what makes a dataset most useful for ML-based analysis?
▪ what attributes are and aren’t addressed by existing datasets?
▪ start as guidelines; within two years recommend a subset as requirements
Develop and publish criteria for ML-friendly datasets.
Examples of potential criteria:
▪ clear provenance: as much metadata as possible, to detect & correct for batch effects
▪ well-described data: what does each variable mean? what’s the distribution of
values?
▪ accessible data: flexible data access policy, reasonable data access process
▪ large sample size: to allow training (and evaluation) without overfitting
▪ multimodal data: to study complex systems from multiple perspectives
▪ perturbation data: includes outcomes (“outputs”) as well as measurements (“inputs”)
▪ longitudinal data: to allow modeling and prediction of progression
▪ active learning: data grows over time, incorporates new data-gathering techniques,
and uses ML-based analysis of existing data to inform future data generation
41. Design and apply “datasheets” and “model cards” for
biomedical ML.
Potential datasheet best practices:
• demographics and UBR
characteristics
• privacy, consent, and copyright
issues
• known blind spots, which could
otherwise create hidden biases
Potential model card best practices:
• what training data was used
• how training and validation were
done
• known limitations on applicability
• intended use, and potential
harms of inappropriate use
• Develop and publish best practices for:
• “datasheets” that describe & evaluate training
datasets
• “model cards” that do the same for generated
models
• Test the best practices in the real world:
• build after-the-fact examples for existing datasets
• apply to new datasets, and update the best
practices
• Once best practices have been updated:
• require datasheets and model cards for all NIH
extramural grant applications and NIH intramural
projects that involve ML research
• encourage journals to do the same for paper
submission and publication
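To make the datasheet and model card idea concrete, the sketch below turns the slide's potential best practices into two small record types. The class and field names are hypothetical and only mirror the bullets on this slide; they are not a published NIH format, and the example values are placeholders.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Datasheet:
    """Fields mirror the slide's potential datasheet best practices."""
    demographics_and_ubr: str
    privacy_consent_copyright: str
    known_blind_spots: list = field(default_factory=list)


@dataclass
class ModelCard:
    """Fields mirror the slide's potential model card best practices."""
    training_data: str
    training_and_validation: str
    known_limitations: list = field(default_factory=list)
    intended_use: str = ""
    potential_harms: list = field(default_factory=list)

    def missing_sections(self):
        """Names of sections left empty, to flag an incomplete card."""
        return [name for name, value in asdict(self).items() if not value]


# Placeholder content only; nothing here describes a real model.
card = ModelCard(
    training_data="Placeholder: description of the training dataset",
    training_and_validation="Placeholder: how training and validation were done",
    known_limitations=["Placeholder: population not covered by training data"],
)
print(card.missing_sections())
```

A completeness check like `missing_sections` is one way a submission system could enforce the requirement stage described on the slide, though the slide does not specify any mechanism.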
44. Smart and Connected Health (SCH)
Accelerate innovations
in computer and
information science
and engineering to
support the
transformation of
health and medicine
45. Smart Health & Data Science Research Areas
Information Infrastructure
• Tools for interoperable, distributed, federated, & scalable digital infrastructure
• Novel ontological systems and knowledge representation approaches
• Methods for data integrity, provenance, security, privacy and reliability
Transformative Data Science
• Computational tools for fusion and analysis of multi-level and -scale data
• Knowledge representations, visualizations and reasoning algorithms
• Approaches for combining AI learning with mechanistic modeling
• Unstructured data interpretation
Novel Multimodal Sensor System Hardware
• Design & fabrication of novel multimodal sensor systems
• Synthesis of new biorecognition elements
Effective Usability
• New approaches to support individuals to effectively participate in their own health
• User-tailored and context-aware interfaces to reduce burden and increase autonomy
• Develop new methods for context-dependent selection, presentation and use of data
Automating Health
• Closed-loop or human-in-the-loop systems
• Technology platforms for optimizing delivery of health interventions
• Simulation and modeling methods and software tools
Medical Data Interpretation
• Modeling on-visual context information and perception of complex images.
• Methods to exploit experts’ implicit knowledge to improve perceptual decision making
• Develop models of how experts respond to changes in cognitive factors
47. Coding it Forward
• Student-led non-profit places tech-
savvy students in federal agencies
• 16 students for summer placed in
admin or funding offices across 11
host institutes, centers, offices
(ICOs) for 10-week summer
program
• 2 students extended until the start of
school, 1 hired as contractor
• 24 students will start a fall fellowship
across 14 host ICOs
https://www.codingitforward.com/
48. NIH Data and Technology Advancement
(DATA) National Service Scholar Program
https://datascience.nih.gov/data-scholars
8 Scholars will…
Catalyze neuroscience research
Unravel the Alzheimer’s Disease Genome
Support cancer knowledge extraction
Accelerate the clinical adoption of machine intelligence applications in
medical imaging
Harness data science for health discovery and innovation in Africa
Expand theories of brain circuits
Integrate NIH cloud-based platforms for genomics research
Architect search across petabyte-scale data
…in 2021
49. Strategic Plan for Data Science:
Goals and Objectives
Data Infrastructure
• Optimize data storage and security
• Connect NIH data systems
Modernized Data Ecosystem
• Modernize data repository ecosystems
• Support storage and sharing of individual datasets
• Better integrate clinical and observational data into biomedical data science
Data Management, Analytics, and Tools
• Support useful, generalizable, and accessible tools
• Broaden utility of, and access to, specialized tools
• Improve discovery and cataloging resources
Workforce Development
• Enhance the NIH data science workforce
• Expand the national research workforce
• Engage a broader community
Stewardship and Sustainability
• Develop policies for a FAIR data ecosystem
• Enhance stewardship
https://datascience.nih.gov
51. We’re putting COVID-19 data into
repositories and platforms so the data
will be USED by researchers!
52. What could researchers do with these
data?
• Better understand transmission and infectivity
• Evaluate Treatments & Interventions
• Predict Long-term Sequelae
• Link Social Determinants of Health with COVID-19 related data and exposures
• Examine the impact on Child & Maternal Health
• Resolve Technical & Implementation issues
53. COVID Clinical Platforms
Increasing the amount and quality of EHR data related
to individuals with COVID-19
Pilot a new enrollment partner model to efficiently target recruitment in
expanded regions of the country and collect EHR data from proven partners
Rapidly collect EHR-derived clinical, lab, and imaging data from hospitals and
health plans at the peak of the pandemic and as it evolves
Develop a robust, flexible collaborative analytics infrastructure to enable a
high frequency response to COVID-19 and the next emerging threats
Include data from underserved populations, roughly 9.3M unique patients
PETAL’s ORCHID Trial & PETAL’s CORAL registry
o RED CORAL: observational study of retrospective review of data collected on
hospitalized patients with COVID-19
o BLUE CORAL, a multicenter prospective observational study designed to
collect comprehensive data on hospitalized patients with COVID-19. This
study will gather imaging, biospecimens, and long-term outcomes.
54. Interoperability Across Clinical COVID Serving Data Platforms
Diagram Elements
• Honest Broker
• Policy Resources
• Workbenches / Tools
• Federated Data Platforms, each with an API: TBD* (MIDRC, RADx, NICHD, NIA, etc.)
• Information Systems
• Data Discovery
• Research Authentication System
• Hash
Interoperable Elements
• CDE Standards for Interoperability
• Data Discovery across Platforms (examples include GA4GH FASP, PIC-SURE)
• Research Authentication System
• Data Linkage Across Systems: Honest Broker
• FHIR to map and move data
55. Researcher Workflows Before Researcher Authentication Services (RAS)
[Diagram: the researcher repeats SEARCH/SELECT and ACCESS on Platform 1 and Platform 2, then COMPUTE and SHARE in a cloud-based analysis tool, with LOGIN required 5 times (steps 1 to 5)]
Researchers log in and/or give consent at least 5 times for each workflow in the Phase 1 interoperability use cases
56. Researcher Workflows After RAS August Deploy
[Diagram: a single LOGIN (1), followed by SEARCH/SELECT, ACCESS, COMPUTE, and SHARE]
AUTH N: ID Token: Who you are
AUTH Z: Passport and Visa: Which dbGaP studies/consent groups you are authorized to access and your role
Before provisioning data, the platform validates the passport/visa by calling RAS, so access information is always up to date within the last 30 minutes
Authentication and Authorization provided by a central NIH service. Auth tokens move with the user as they navigate to any of the four Phase 1 Data Platforms so that the researcher only logs in one time to RAS
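One way a platform could honor the 30-minute freshness window is to re-validate a passport whenever its last successful validation is older than that. The sketch below is an interpretation of the slide, not the RAS API: `validate_with_ras` is a hypothetical callable standing in for the real call to RAS, and the caching strategy is an assumption.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=30)  # freshness window stated on the slide


class PassportCache:
    """Re-validate a passport when its last validation is older than MAX_AGE."""

    def __init__(self, validate_with_ras):
        # validate_with_ras is a hypothetical callable: passport -> bool
        self._validate = validate_with_ras
        self._last_ok = {}  # passport -> time of last successful validation

    def may_provision(self, passport, now=None):
        now = now or datetime.now(timezone.utc)
        checked = self._last_ok.get(passport)
        if checked is not None and now - checked <= MAX_AGE:
            return True  # validated recently enough; no new call needed
        if self._validate(passport):
            self._last_ok[passport] = now
            return True
        self._last_ok.pop(passport, None)  # drop any stale approval
        return False


# Stub validator that records each call and accepts one placeholder passport.
calls = []
cache = PassportCache(lambda p: calls.append(p) or p == "valid-passport")
t0 = datetime(2020, 10, 9, 12, 0, tzinfo=timezone.utc)
first = cache.may_provision("valid-passport", now=t0)
second = cache.may_provision("valid-passport", now=t0 + timedelta(minutes=10))
third = cache.may_provision("valid-passport", now=t0 + timedelta(minutes=45))
print(first, second, third, len(calls))
```

A platform that calls RAS on every provisioning request would not need the cache at all; the sketch only shows how the stated window bounds how stale an access decision can be.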
57. Privacy-Preserving Tokens
Example: John Smith (03/27/1945, Male)
• Admitted to N3C Hospital
• Participates in Clinical Studies
• Lives in a Senior Living Facility
Patient Care: each source holds its own record for him: N3C Sites (Patient 123), NIH Clinical Studies (Patient 456), Senior Living EHR (Patient 789)
Tokenization: each source tokenizes its record and outputs de-id tokens
De-Duplication and Linkages: the N3C Linkage Honest Broker matches and de-duplicates Patient 123, Patient 456, and Patient 789 into one record (007) through a de-identified ‘Rosetta Stone’ process that unifies records
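The tokenize, match, and de-duplicate flow can be sketched with a keyed hash. This is a minimal illustration under stated assumptions, not the process N3C uses: HMAC-SHA256, the shared `SITE_KEY`, the field normalization, and the function names are all assumptions, since the slide does not describe the tokenization method.

```python
import hashlib
import hmac
from collections import defaultdict

# Assumption: sites share a secret key that is never given to the honest broker.
SITE_KEY = b"demo-key-do-not-use-in-production"


def tokenize(name, birth_date, sex, key=SITE_KEY):
    """Return a de-identified token from normalized identifying fields."""
    normalized = "|".join(part.strip().lower() for part in (name, birth_date, sex))
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_records(records):
    """Honest-broker step: group local patient IDs that share a token."""
    linked = defaultdict(list)
    for source, local_id, token in records:
        linked[token].append((source, local_id))
    return dict(linked)


# The slide's example person appears at three sources under different local IDs.
submissions = [
    ("N3C site", "Patient 123", tokenize("John Smith", "03/27/1945", "Male")),
    ("NIH clinical study", "Patient 456", tokenize("john smith ", "03/27/1945", "male")),
    ("Senior living EHR", "Patient 789", tokenize("John Smith", "03/27/1945", "Male")),
]
groups = link_records(submissions)
print(len(groups), "linked individual(s)")
```

The broker in this sketch sees only tokens and local IDs, never names or birth dates. Exact matching on normalized fields would miss typos or name changes, which a production linkage process would need to handle.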
59. NIH staff who deserve all the credit
• STRIDES: Andrea Norris, Nick Weber and NMDS team, and Fenglou Mao
• Connecting NIH Data Resources: Regina Bures, Ishwar Chandramouliswaran, Tanja Davidsen, Valentina Di
Francesco, Jeff Erickson, Tram Huyen, Rebecca Rosen, Steve Sherry, Alastair Thomson, Greg Farber, Dylan
Klomparens, Charles Schmitt, Susan Wright, Ken Wiley, Kristofor Langlais, James Coulombe, Lora Kutkat, Nick
Weber, Allen Dearry
• Data Repository and Knowledgebase Resources: Kim Pruitt, Valerie Florance, Valentina Di Francesco, Ajay
Pillai, Qi Duan, Dawei Lin, Christine Colvis, Jennie Larkin, Ravi Ravichandran, and James Coulombe
• FHIR Pilots: Teresa Zayas-Caban, Denise Warzel, Kerry Goetz, Ken Wiley, Alison Cernick, Kenneth Wilkins,
Carolina Mendoza-Puccini, Matt McAuliffe, and Belinda Seto
• Criteria for Open Access Data Sharing Repositories: Mike Huerta, Dawei Lin, Maryam Zaringhalam, Lisa
Federer and BMIC Team
• Pilot for Scaled Implementation for Sharing Datasets: Ishwar Chandramouliswaran, Lisa Federer, Maryam
Zaringhalam, and Jennie Larkin
• Software Sustainability: Heidi Sofia, Ishwar Chandramouliswaran, Mike Conway, Tony Kirilusha, Xujing Wang,
Andrew Weitz, Todd Merchak, Allissa Dillman and Jess Mazerik
• Smart and Connected Health: Haluk Resat, Dana Wolff-Hughes, Partha Bhattacharyya, Fenglou Mao
• Coding-it-Forward Fellows Summer Program & DATA Scholars Program: Jess Mazerik, Wynn Meyer
60. Office of Data
Science Strategy
www.datascience.nih.gov
A modernized, integrated, FAIR
biomedical data ecosystem
@NIHDataScience /NIH.DataScience datascience@nih.gov