NIH Data Summit - The NIH Data Commons
1. NIH Data Commons
NIH Data Storage Summit
October 20, 2017
Vivien Bonazzi Ph.D.
Senior Advisor for Data Science (NIH/OD)
Project Leader for the NIH Data Commons
3. Challenges with the current state of data
Generating large volumes of biomedical data
Cheap to generate, costly to store on local servers
Multiple copies of the same data in different locations
Building data resources that cannot be easily found by others
Data resources are not connected to each other and cannot
share data or tools
No standards and guidelines on how to share and access data
4. Convergence of factors
Increasing recognition of the need to support data sharing
Availability of digital technologies and infrastructures that
support Data at scale
Cloud: data storage, compute and sharing
FAIR – Findable, Accessible, Interoperable, Reusable
Understanding that data is a valuable resource that needs to be
sustained
5. https://gds.nih.gov/
Went into effect January 25, 2015
NCI guidance:
http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data
Requires public sharing of genomic data sets
12. The most successful organizations of the
future will be those that can
leverage their digital assets and transform
them into a digital enterprise
13. Data Commons
Enabling data driven science
Enable investigators to leverage all possible data and
tools in the effort to accelerate biomedical discoveries,
therapies and cures
by
driving the development of data infrastructure and data
science capabilities through collaborative research and
robust engineering
14. Developing a Data Commons
Treats products of research – data, methods, tools,
papers etc. as digital objects
For this presentation: Data = Digital Objects
These digital objects exist in a shared virtual space
Find, Deposit, Manage, Share, and Reuse data,
software, metadata and workflows
Digital object compliance through FAIR principles:
Findable
Accessible (and usable)
Interoperable
Reusable
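As a rough sketch of the "research products as digital objects" idea above (field names here are illustrative, not from any NIH specification), a FAIR digital object can be modeled as a record carrying a persistent identifier, a resolvable access location, an open format, and reuse metadata:

```python
from dataclasses import dataclass, field

@dataclass
class DigitalObject:
    """Minimal FAIR-style digital object: identifier, access info, metadata."""
    guid: str        # globally unique, persistent identifier (Findable)
    access_url: str  # resolvable location of the bytes (Accessible)
    media_type: str  # open, documented format (Interoperable)
    metadata: dict = field(default_factory=dict)  # provenance, license (Reusable)

# Example: registering a workflow output as a digital object.
# The identifier and URL below are hypothetical.
obj = DigitalObject(
    guid="doi:10.0000/example.12345",
    access_url="https://example.org/data/12345",
    media_type="text/csv",
    metadata={"license": "CC-BY-4.0", "derived_from": "doi:10.0000/example.11111"},
)
print(obj.guid, obj.media_type)
```

The same record shape works for data, software, metadata, and workflows alike, which is what lets one shared virtual space manage all of them.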
15. The Data Commons
is a platform
that allows transactions to occur
on FAIR data at scale
16. The Data Commons Platform
Compute Platform: Cloud
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
FAIR
App store/User Interface/Portal
PaaS
SaaS
IaaS
19. Interoperability with other Commons
Common goals – democratizing, collaborating & sharing data
Reuse of currently available open source tools which support
interoperability
GA4GH, UCSC, GDC, NYGC
May 2017 BioIT Commons Session
Shared open standard APIs for data access and computing
Ability to deploy and compute across multiple cloud environments
Docker containers – Dockerstore/Docker registry
Workflows management, sharing and deployment
Discoverability (indexing) objects across cloud commons
Global Unique identifiers
Common user authentication system
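One hedged sketch of how global unique identifiers plus a shared index could make the same object discoverable across cloud commons (the registry is an in-memory stand-in, and the bucket paths are invented; a real deployment would use a shared service such as a GA4GH-style object registry):

```python
import uuid

# In-memory stand-in for a cross-commons object index.
INDEX = {}

def mint_guid():
    """Mint a globally unique identifier for a new digital object."""
    return "commons:" + str(uuid.uuid4())

def register(guid, location):
    """Record one cloud location (replica) of a digital object."""
    INDEX.setdefault(guid, []).append(location)

def resolve(guid):
    """Return every known location of the object, across clouds."""
    return INDEX.get(guid, [])

guid = mint_guid()
register(guid, "s3://example-commons/topmed/sample.cram")  # hypothetical paths
register(guid, "gs://example-commons/topmed/sample.cram")
print(resolve(guid))  # both replicas, found via one identifier
```

The point of the sketch: a single identifier decouples "which object" from "which cloud", which is exactly what deploying and computing across multiple cloud environments requires.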
20. The Good News
Considerable agreement about the general approaches to
be taken
Many people are already addressing many of the problems:
Data architectures/platforms
Automated/semi-automated data access/authentication protocols
Common metadata standards and templates
Open tools and software
Instantiation and initial metrics of Findability, Accessibility,
Interoperability, and Reusability
Relationships/agreements with Cloud Service Providers that leverage
their interest in hosting NIH data
Moving data to the cloud and operating in a cloud environment
21. The Challenges
A need to “Bring it all Together” – Community endorsement of:
Metadata standards/tools/approaches
Crosswalks between equivalent terms/ontologies
Robust, shared approaches to data access/authentication
Best practices that will enable existing data to become FAIR and will
guide generation of future datasets
Rapidly evolving field makes approaches/tools/etc. subject to
change – approaches need to be adaptable
Effort is required to adapt data to community standards and move
data to the cloud
How much does that cost and how long does it take?
Lack of interoperability between cloud providers
22. The Challenges
Making data FAIR comes with a cost
How much does it actually cost?
How can we minimize the cost?
How do we determine whether any one set of data warrants the
expense?
What is the value added to the data by making it FAIR?
What new science can be achieved?
How can new derived data or new computational approaches be
added to the dataset to enrich it?
What are the limitations of FAIRness from dataset to dataset?
28. NIH Data Commons Pilot : Implementation
Storage, NIH Marketplace, Metrics and Costs
Leveraging and extending relationships established as part of BD2K
to provide access to cloud storage and compute
Supplements: TOPMed, GTEx, MODs groups
Prepare (and move) data sets to the cloud for storage, access and
scientific use
Work collaboratively with the OT awardees to build towards data access
Data Commons OT Solicitation: Other Transaction
ROA: Research Opportunity Announcement
Developing the fundamental FAIR computational components to
support access, use and sharing of the 3 data sets above
30. Establishing a new NIH Marketplace
access to a sustainable cloud infrastructure for data science at NIH
Over the next 18 months, NIH will establish its own NIH Cloud Marketplace
Give Data Commons Pilot Consortium awardees the ability to acquire cloud storage
and compute services
Enable ICs to easily acquire cloud storage and compute services from commercial
cloud providers, resellers, and integrators
Building on existing relationship with CSPs
Led by CIT with input from Multi-IC working group
Storage, NIH Marketplace, Metrics and Costs
31. Assessment and Evaluation
What are the costs associated with cloud storage and usage?
What are the business best practices?
How should costs be paid?
Who should pay them?
How should highly used data be managed vs less used data?
Are data producers supportive of this model?
Are users (of all experience levels) able to access and use data effectively?
How will we know if the Data Commons Pilot is successful?
How to adjust to changing needs?
Storage, NIH Marketplace, Metrics and Costs
32. Supplements to 3 Test Data Set Groups
Administrative Supplements to TOPMed, GTEx and MODs
PIs for each data set were requested to review the OT (ROA) and
determine appropriate ways to interact
Prepare (and move) data sets to the cloud for storage, access
and scientific use
Make community workflows and cloud based tools of popular
analysis pipelines from the 3 datasets accessible
Facilitate discovery and interpretation of the association of
human and model organism genotypes and phenotypes
33. NIH Data Commons: OT ROA
Key Capabilities – modular components
Development of Community Supported FAIR Guidelines and Metrics
Global Unique Identifiers (GUID) for FAIR biomedical data
Open Standard APIs (interoperability & connectivity)
Cloud Agnostic Architecture and Frameworks
Cloud User Workspaces
Research Ethics, Privacy, and Security (AUTH)
Indexing and Search
Scientific Use cases
Training, Outreach, Coordination
34. Stage 1: 180 day window
Develop MVPs (Minimum Viable Products)
Demonstrations of the Data Commons and its components
Have one copy of each test data set in each cloud provider
Understanding of the process required to achieve this
Draft version of a single standard access control system
be able to access and use the data through the access control system
Able to use a variety of analysis tools and pipelines on the 3 data sets in the
cloud – (driven by scientific use cases)
Have a rudimentary ability to query across test data sets
Display phenotype, expression and variant data aligned with a specific gene or
genomic location
Display model organism orthologs for a given set of human genes
Draft FAIR guidelines and metrics
Understand how each of the computational components that support the ability
to access data fit together and what standards are needed
Written plans of how and why these demonstrations should be extended into a full
Pilot
NIH Data Commons Pilot: Outcomes
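The cross-dataset query goals above (orthologs for a set of human genes, data aligned to a gene) can be illustrated with a toy in-memory table. The table below is invented for illustration; in the actual Pilot, MOD, TOPMed, and GTEx data would be reached through the standard APIs and indexing components:

```python
# Hypothetical human-gene -> model-organism ortholog table; real MOD data
# would come from indexed, API-accessible datasets in the Commons.
ORTHOLOGS = {
    "TP53": {"mouse": "Trp53", "zebrafish": "tp53"},
    "BRCA1": {"mouse": "Brca1", "zebrafish": "brca1"},
}

def model_orthologs(human_genes, organism):
    """Display model-organism orthologs for a given set of human genes."""
    return {g: ORTHOLOGS.get(g, {}).get(organism) for g in human_genes}

print(model_orthologs(["TP53", "BRCA1"], "mouse"))
# {'TP53': 'Trp53', 'BRCA1': 'Brca1'}
```

Even a rudimentary query like this depends on every capability in the Stage 1 list: shared identifiers for the genes, indexed datasets, and a common access-control layer in front of the controlled data.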
35. Stage 2: 4 year period
To extend and fully implement the Data Commons Pilot based on the
design strategies and capabilities developed as part of Stage 1
Review of MVP/demonstrations and written plans from Stage 1
Goals and Milestones with clear and specific outcomes
Evaluate, negotiate, and revise terms of existing awards
Award additional OTs
NIH Data Commons Pilot: Outcomes
36. Acknowledgments
DPCPSI: Jim Anderson, Betsy Wilder, Vivien Bonazzi, Marie Nierras, Rachel Britt,
Sonyka Ngosso, Lora Kutkat, Kristi Faulk, Jen Lewis, Kate Nicholson,
Chris Darby, Tonya Scott
NHLBI: Gary Gibbons, Alastair Thomson, Teresa Marquette, Jeff Snyder,
Melissa Garcia, Maarten Lerkes, Ann Gawalt, Cashell Jaquish,
George Papanicolaou
NHGRI: Eric Green, Valentina di Francesco, Ajay Pillai, Simona Volpi, Ken Wiley
NIAID: Nick Weber
CIT: Andrea Norris
NLM: Patti Brennan
NCBI: Steve Sherry
The Data Commons is a federated way to provide access to, and sharing of, large, high-value NIH data
The purpose of a Cloud based Data Commons is to make large data sets accessible and usable by the broader community.
Having one copy of a large data set on the cloud means it is accessible by many researchers and they don’t need to copy the data set from NCBI (or other repositories) to the cloud every time they want to use it.
One copy of a large data set on the cloud, accessed multiple times by many researchers who pay only for the ability to compute on that data, is more cost- and time-effective than moving the same large data set to the cloud multiple times
A cloud-based Data Commons becomes much more powerful when (community-based) standardized methods and systems are adopted. These standards apply to the way data and tools interact with each other, to the computing environment they sit within (i.e., the cloud), and to how data and tools are made accessible to the user.
Standards specifically relate to the FAIR guidelines, APIs to access data, workflows and tools, and Docker containers for deployment of tools to the cloud
Standards are what enable a federated Commons.
Standards create the basic ground rules and common language for interactions in the system.
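As one hedged sketch of what a shared open-standard API buys: any commons can serve, and any client can parse, the same minimal object description. The field names below are illustrative, loosely modeled on GA4GH-style object records, not an actual NIH schema:

```python
import json

# Illustrative record a commons might return from a standard data-access API.
response = json.dumps({
    "id": "commons:0001",
    "checksums": [{"type": "sha-256", "checksum": "deadbeef"}],
    "access_methods": [
        {"type": "s3", "access_url": "s3://example-bucket/object"},
        {"type": "https", "access_url": "https://example.org/object"},
    ],
})

def access_urls(record_json):
    """A client written once against the standard works with every commons."""
    record = json.loads(record_json)
    return [m["access_url"] for m in record["access_methods"]]

print(access_urls(response))
```

This is the "common language" in miniature: the client needs no knowledge of which federation member produced the record.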
The Data Commons Framework describes the ecosystem that the OT solicitation is building towards.
Each of the key capabilities described in the OT has a major role in the development of the ecosystem
Governance of the Commons can be found on slide XX
The purpose of this slide is to give a sense that to provide access to the data requires a series of modular reusable components
I won't describe each KC, but I want to give them a sense that there are modular components that fit together to permit access
Multi IC Working Group co-chairs for the Data Commons Pilot
Gary Gibbons, Eric Green, Patti Brennan, Jim Anderson, Andrea Norris