This document summarizes a presentation given by Dr. Phil Bourne on the National Data Service (NDS) and the National Institutes of Health (NIH) office of the Associate Director for Data Science (ADDS). The presentation discusses how NDS can succeed by bringing together the right stakeholders, defining clear problems, starting with pilots, and developing sustainable applications. It then outlines the ADDS mission to accelerate biomedical research through an open digital ecosystem. The ADDS strategy covers discovery and innovation, workforce development, policy and process, leadership, and sustainability, with sustainability pursued through a shared "Commons" of digital research objects hosted in cloud and HPC environments. Pilot projects are evaluating this Commons framework and populating it with datasets and tools.
NDS Relevant Update from the NIH Data Science (ADDS) Office
1. AWS Government, Education, and Nonprofit Symposium
Washington, DC | June 25-26, 2015
Phil Bourne, Ph.D., FACMI
Associate Director for Data Science (ADDS)
2. How Can NDS Succeed?
• Be in the right place at the right time
• Bring together all the right stakeholders – there
are groups missing now, e.g., application scientists and
publishers
• Define very clearly the problem(s) you are trying to
solve
• Start with pilots, but proceed to a soup-to-nuts
application that has value and can be sustained
4. ADDS Mission Statement
To use data science
to foster an
open digital ecosystem
that will accelerate
efficient, cost-effective
biomedical research
to enhance health, lengthen
life, and reduce illness and
disability
8. ADDS Strategy
• Discovery and Innovation
Enabling major scientific discovery and innovation through the BD2K Initiative
• Workforce development
Strengthen the ability of a diverse biomedical workforce to develop and benefit from data science
• Policy and process
Contribute to policies & processes involving data that further the NIH mission
• Leadership
Further visibility of NIH leadership in data science by the public, DHHS, USG at large, and
international funders
• Sustainability
To foster a sustainable, efficient, and productive data science ecosystem
9. ADDS Strategy
• Discovery and Innovation
Enabling major scientific discovery and innovation through the BD2K Initiative
• Workforce development
Strengthen the ability of a diverse biomedical workforce to develop and benefit from data science
• Policy and process
Contribute to policies & processes involving data that further the NIH mission
• Leadership
Further visibility of NIH leadership in data science by the public, DHHS, USG at large, and
international funders
• Sustainability
To foster a sustainable, efficient, and productive data science
ecosystem: The Commons
10. Some Developments…
• Centers, standards, training coordination
centers off and running
• Looking at funding reference datasets
• Hackathons and more…
• NLM 2.0
12. What is The Commons?
• Treats products of research – data, methods, papers,
etc. – as digital objects
• These digital objects exist in a shared virtual space
• Digital objects conform to FAIR principles:
– Findable
– Accessible (and usable)
– Interoperable
– Reusable
13. The Commons: Components
• Computing environment
– cloud and/or HPC
– supports access, utilization, sharing and storage of digital objects.
• Methods for Interoperability
– enable connectivity, shareability and interoperability between digital objects.
– APIs, containers (Docker, etc.)
• Digital object compliance model
– describes the properties of digital objects that enable them to be discoverable and
shareable
– Metadata, UIDs, Clear access controls (human subject data)
• Indexing
– Means to find and catalog digital objects
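As a rough illustration of the indexing component, the sketch below builds a small keyword index over object metadata and looks objects up by keyword. It is only a minimal sketch: the object identifiers, the keywords, and the function names `build_index` and `find` are assumptions made for this example and are not part of the Commons.

```python
from collections import defaultdict

# Hypothetical catalog: identifiers and keywords are invented for illustration.
catalog = {
    "obj-001": ["microbiome", "sequence", "dataset"],
    "obj-002": ["microbiome", "pipeline", "container"],
    "obj-003": ["genomics", "dataset"],
}


def build_index(objects):
    """Map each lowercased metadata keyword to the set of object IDs carrying it."""
    index = defaultdict(set)
    for obj_id, keywords in objects.items():
        for keyword in keywords:
            index[keyword.lower()].add(obj_id)
    return index


def find(index, *keywords):
    """Return the sorted IDs of objects tagged with every given keyword."""
    if not keywords:
        return []
    matches = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*matches))


index = build_index(catalog)
print(find(index, "microbiome"))
print(find(index, "dataset", "microbiome"))
```

A production index would draw on richer, standardized metadata, but the lookup idea is the same: catalog once, then search by attribute.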
15. Computing Environment: Cloud
The ability to store, share and compute on digital
research objects
Especially useful for large data sets that are not easily computed
locally
Scalable and Elastic
Pay per use - Cost effective
An environment that fosters collaboration
16. The Commons: Cloud
Commercial
AWS, Google, Microsoft, IBM
Others
Academic
OSC (Open Science Cloud)
iDASH (HIPAA compliant)
The Broad
Others
17. The Commons: HPC
• Supercomputing Centers in the US
– Supported by DOE and NSF
• NERSC (San Francisco)
• ORNL (Oak Ridge)
• TACC (Texas)
• SDSC (San Diego)
• Argonne (Urbana-Champaign)
• Optimized, high performance systems with IT support
19. The Commons: Interoperability
• Software that supports connectivity and
interoperability between digital (data) objects
– APIs (Application Programming Interfaces)
• Expose and provide direct access to data
• Enable data to be passed to analysis tools or pipelines
– Containers
• Package and deploy software tools and pipelines to the cloud
21. The Commons
Digital Object Compliance: FAIR
• Attributes of digital objects in the Commons
– Initial Phase
• Unique digital object identifiers of some type
• A minimal set of searchable metadata
• Physically available in a cloud based Commons provider
• Clear access rules (especially important for human subjects data)
• An entry (with metadata) in one or more indices
– Future Phases
• Standard, community based unique digital object identifiers
• Conform to community approved standard metadata for enhanced searching
• Digital objects accessible via open standard APIs
• Physically and logically available to the Commons
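The initial-phase attributes listed on this slide can be read as a checklist. The sketch below shows one way such a check might look. The `DigitalObject` class, its field names, and `initial_phase_gaps` are hypothetical names chosen for this example, not an NIH specification, and the sample values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical record for one digital object; field names are illustrative."""
    identifier: str = ""                                # unique identifier of some type
    metadata: dict = field(default_factory=dict)        # minimal searchable metadata
    provider: str = ""                                  # cloud-based Commons provider
    access_rule: str = ""                               # e.g. "open" or "controlled"
    index_entries: list = field(default_factory=list)   # indices listing the object


def initial_phase_gaps(obj):
    """Return the names of the initial-phase attributes the object is missing."""
    checks = [
        ("unique identifier", bool(obj.identifier)),
        ("searchable metadata", bool(obj.metadata)),
        ("cloud-based Commons provider", bool(obj.provider)),
        ("clear access rules", bool(obj.access_rule)),
        ("index entry", bool(obj.index_entries)),
    ]
    return [name for name, ok in checks if not ok]


complete = DigitalObject(
    identifier="example-id-0001",
    metadata={"title": "Example dataset", "type": "dataset"},
    provider="Cloud Provider A",
    access_rule="controlled",
    index_entries=["example-index"],
)
draft = DigitalObject(identifier="example-id-0002", metadata={"title": "Draft"})

print(initial_phase_gaps(complete))  # no gaps: empty list
print(initial_phase_gaps(draft))     # lists the three missing attributes
```

Returning the list of gaps, rather than a single pass/fail flag, tells a data owner exactly what is still needed before an object can be deposited.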
23. Commons Pilot Projects
• Evaluating Commons Framework & Populating the
Commons
– NIH-funded large resource groups and BD2K groups (cloud)
– HMP data and tools available in the cloud (AWS)
• https://aws.amazon.com/datasets/1903160021374413
– NCI Cloud Pilots & Genomic Data Commons (AWS, Google)
• The Cloud Credits - business model for using cloud
resources
24. Commons Credits (business model)
[Diagram: NIH provides credits to the investigator, who uses the credits in the Commons (infrastructure) across Cloud Providers A, B, and C. A Discovery Index indexes the Commons and enables search. Option: direct funding.]
25. Cloud Credits: Pros and Cons
• Cost effective – Only pay for IT support used
• Drives competition – Better services at lower cost
• Supports data access and sharing by driving science into the Commons
• Can help determine metrics of data object usage
• Facilitates public-private partnership
• Never been tried, so we don’t have data about likelihood of success
• Cost Models: Predicated prices among providers
• Service Providers: Predicated on service providers willing to make the
investment to become conformant
• Persistence: The model is ‘Pay As You Go’ which means if you stop paying it
stops going
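The pay-as-you-go point on this slide can be made concrete with a toy ledger. This is only a sketch under stated assumptions: the `CreditAccount` class, the provider names, and the credit amounts are all hypothetical, and the slides do not describe how credits would actually be accounted for.

```python
class CreditAccount:
    """Toy ledger for an investigator's credits (hypothetical model)."""

    def __init__(self, credits):
        self.balance = credits
        self.usage = {}  # provider name -> total credits spent there

    def spend(self, provider, amount):
        """Spend credits at a provider; refuse once the balance cannot cover it."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        if amount > self.balance:
            return False  # pay as you go: no credits left, no service
        self.balance -= amount
        self.usage[provider] = self.usage.get(provider, 0) + amount
        return True


account = CreditAccount(100)
print(account.spend("Provider A", 60))  # accepted
print(account.spend("Provider B", 30))  # accepted
print(account.spend("Provider A", 20))  # refused: only 10 credits remain
print(account.balance, account.usage)
```

The per-provider `usage` totals hint at how a credits model could yield usage metrics, while the refused third request shows the persistence concern: once the credits run out, the service stops.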
26. NIH… Turning Discovery Into Health
philip.bourne@nih.gov
https://datascience.nih.gov/
@pebourne