The Commons: Leveraging the Power of the Cloud for Big Data
1. The Commons: Leveraging the Power of the Cloud for Big Data
Philip Bourne PhD, FACMI
Associate Director for Data Science (ADDS)
Vivien Bonazzi PhD ADDS Office (OD)
George Komatsoulis PhD (NCBI)
The National Institutes of Health
2. Disclaimer
I am no longer a techie
Start here https://datascience.nih.gov/commons
6. New Science
“And that’s why we’re here today. Because something called precision
medicine … gives us one of the greatest opportunities for new medical
breakthroughs that we have ever seen.”
President Barack Obama
January 30, 2015
7. New Science
Precision Medicine Initiative
National Research Cohort
>1 million U.S. volunteers
Numerous existing cohorts (many funded by NIH)
New volunteers
Participants will be centrally involved in design and implementation
of the cohort
They will be able to share genomic data, lifestyle information,
biological samples – all linked to their electronic health records
12. What are the PRINCIPLES of The Commons?
Supports a digital biomedical ecosystem
Treats products of research – data, software, methods,
papers etc. as digital research objects
Digital research objects exist in a shared virtual space
Find, Deposit, Manage, Share and Reuse data,
software, metadata and workflows
Digital objects need to conform to FAIR principles:
Findable
Accessible (and usable)
Interoperable
Reusable
13. What is the Commons Framework?
Exploits new scalable computing technologies - Cloud
Provides physical or logical access to data
Simplifies access, sharing and interoperability of digital research
objects such as data, software, metadata and workflows
Makes digital research objects indexable and findable: FAIR
Provides understanding and accounting of usage patterns
Is potentially more cost effective given digital growth
Gives currency to digital objects and the people who develop and
support them
14. The Commons Framework
Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
15. The Commons Framework
Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
IaaS
PaaS
SaaS
16. Commons: Digital Object Compliance
Attributes of digital research objects in the Commons
Initial Phase
Unique digital object identifiers resolvable to the original authoritative
source
Machine readable
A minimal set of searchable metadata
Physically available in a cloud based Commons provider
Clear access rules (especially important for human subjects data)
An entry (with metadata) in one or more indices
Future Phases
Standard, community based unique digital object identifiers
Conform to community approved standard metadata for enhanced
searching
Digital objects accessible via open standard APIs
Are physically and logically available to the Commons
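The initial-phase attributes above can be read as a checklist. The following is a minimal sketch of such a check, not an NIH specification: the `DigitalObject` record, its field names, and the `initial_phase_gaps` helper are all hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical record for a digital research object (illustrative fields only)."""
    identifier: str = ""            # unique digital object identifier
    source_url: str = ""            # authoritative source the identifier resolves to
    machine_readable: bool = False
    metadata: dict = field(default_factory=dict)  # minimal searchable metadata
    cloud_provider: str = ""        # Commons provider physically holding the object
    access_rules: str = ""          # e.g. "open" or "controlled"
    indices: list = field(default_factory=list)   # indices that hold an entry


def initial_phase_gaps(obj: DigitalObject) -> list:
    """Return the initial-phase attributes the object does not yet satisfy."""
    checks = {
        "unique resolvable identifier": bool(obj.identifier and obj.source_url),
        "machine readable": obj.machine_readable,
        "minimal searchable metadata": bool(obj.metadata),
        "available in a cloud-based Commons provider": bool(obj.cloud_provider),
        "clear access rules": bool(obj.access_rules),
        "entry in one or more indices": bool(obj.indices),
    }
    return [name for name, ok in checks.items() if not ok]


# Made-up example object: identified and described, but not yet deposited or indexed.
example = DigitalObject(
    identifier="example:0001",
    source_url="https://example.org/0001",
    machine_readable=True,
    metadata={"title": "Example dataset"},
)
print(initial_phase_gaps(example))
```

An empty list would mean the object meets every initial-phase attribute under these assumed definitions.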
18. Commons Pilots - current
The Cloud Credits Model
Infrastructure building blocks: IaaS: accessing cloud services for NIH grantees
Eventual portal for academic and commercial PaaS and SaaS
Commons Supplements – Data, analysis tools, APIs, containers
BD2K Centers
MODs (Model Organism Databases)
Interoperability (some)
HMP (Human Microbiome Project)
NIH Affiliated Commons projects
NIAID/CF/ADDS - Microbiome/HMP Cloud Pilot
NCI Cloud Pilots & Genomic Data Commons
19. Cloud credits model (CCM)
Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
Mapping pilots to the Commons framework: Cloud Credits Model: George Komatsoulis
IaaS
PaaS
SaaS
20. Drivers of the Cloud Credits Model
Scalability
Exploiting new computing models
Cost Effectiveness
Simplified sharing of digital objects
FAIR: Findable, Accessible, Interoperable and
Reusable
Cloud computing supports many of these
objectives
21. The Cloud Credits Model
[Diagram components: The Commons, Cloud Provider A, Cloud Provider B, HPC Provider, Investigator, NIH, Discovery Index]
[Arrow labels: Provides credits; Uses credits in the Commons; Indexes; Enables Search; Option: Direct Funding]
22. Advantages of this Model
Supports simplified data sharing by driving science into publicly
accessible computing environments that still provide for
investigator-level access control
Scalable for the needs of the scientific community for the next 5
years
Democratizes access to data and computational tools
Cost effective
Competitive marketplace for biomedical computing services
Reduces redundancy
Uses resources efficiently
23. Potential Disadvantages of this Model
Novelty:
Never been tried, so we don’t have data about likelihood of success
Cost Models:
Assumes stable or declining prices among providers
True for the last several years, but we can’t guarantee that it will continue,
particularly if there is significant consolidation in industry
Service Providers:
Assumes that providers are willing to make the investment to become
conformant
Market research suggests 3-5 providers within 2-3 months of launch
Persistence:
The model is ‘Pay As You Go’, which means if you stop paying it stops
going
Gives investigators an unprecedented level of control over what lives (or
dies) in the Commons
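The persistence point can be made concrete with a little arithmetic. This is a minimal sketch, assuming a flat monthly charge and whole-number credits; the `months_funded` helper and the figures are hypothetical, and no actual credit amounts or provider prices are implied.

```python
def months_funded(credits: int, monthly_cost: int) -> tuple:
    """Return (whole months covered, credits left over) under pay-as-you-go.

    Both arguments are in the same hypothetical credit unit.
    """
    if credits < 0:
        raise ValueError("credits must not be negative")
    if monthly_cost <= 0:
        raise ValueError("monthly_cost must be positive")
    return divmod(credits, monthly_cost)


# Made-up numbers: 1000 credits against a charge of 300 credits per month.
months, leftover = months_funded(1000, 300)
print(months, leftover)  # 3 100
```

Once the covered months run out, the object stops being hosted unless more credits are applied, which is the control over "what lives (or dies)" noted on the slide.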
24. What does it mean for a provider to be conformant?
Minimum set of requirements for:
Business relationships (reseller, investigators)
Interfaces (upload, download, manage, compute)
Capacity (storage, compute)
Networking and Connectivity
Information Assurance
Authentication and authorization
25. Pilot of the Commons Cloud Credits Business Model
NIH intends to run a 3-year pilot to test the efficacy of this business
model in enhancing data sharing and reducing costs.
Pilot will not directly interact with the existing grant system;
rather, it is being modeled on the mechanisms being used to gain
access to NSF and DOE national resources (HPC, light sources,
etc.)
The only required qualification for applying for credits will be that the
investigator has an existing NIH grant
A major element will be the collection of metrics to assess
effectiveness of this model
26. Status and requests
NIH recently completed a contract with the CAMH Federally
Funded Research and Development Center (FFRDC) to act as the
coordinating center for this effort.
We need you to:
Identify what capabilities will be useful to investigators
Provide guidance on the conformance requirements
Help identify good metrics
Define the criteria that are used to decide if credit requests are
selected
27. BD2K Centers, MODs, HMP & Interoperability Supplements
Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
Mapping pilots to the Commons framework: BD2K centers, HMP, MODs & Interoperability
PaaS
SaaS
28. Public Beacons
Host: Content
AMPLab: 1000 Genomes Project
Broad Institute: ExAC
Curoverse: PGP, GA4GH Example Data
EBI: 1000 Genomes Project, UK10K, GoNL, EVS, GEUVADIS, UMCG Cardio GenePanel
Google: 1000 Genomes Project, Phase III, Illumina Platinum Genomes
ISB: Known VARiants
NCBI: NHLBI Exome Sequence Project
OICR: 55 cancer datasets
SolveBio: 56 public datasets
UCSC: ClinVar, LOVD, UniProt
University of Leicester: Cafe CardioKit, Cafe Variome Central
WTSI: IBD, Native American, Egyptian, UK10K
Over 120 public datasets beaconized across 12 institutions
Tens of thousands of individuals
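For readers unfamiliar with the term: a Beacon is generally described as a service that answers a yes/no question about whether a dataset contains a given allele at a given genomic position, without returning the underlying records. The following toy, in-memory sketch illustrates only that idea. The `ToyBeacon` class, its method name, and the variant tuples are hypothetical; this is not the GA4GH Beacon API or any host's implementation.

```python
class ToyBeacon:
    """In-memory stand-in for a Beacon: answers presence queries only."""

    def __init__(self, variants):
        # variants: iterable of (chromosome, position, allele) tuples (made-up data)
        self._variants = set(variants)

    def has_allele(self, chromosome: str, position: int, allele: str) -> bool:
        """Return True if the allele is present; no other details are revealed."""
        return (chromosome, position, allele) in self._variants


# Made-up variants for illustration only.
beacon = ToyBeacon([("1", 12345, "A"), ("X", 67890, "T")])
print(beacon.has_allele("1", 12345, "A"))
print(beacon.has_allele("1", 12345, "G"))
```

The design point is that the response is a single boolean, which is what allows many hosts to expose datasets for discovery without distributing the data itself.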
29. Testing the Commons framework
Facilitating connectivity, interoperability and access to digital
research objects
Interoperable (APIs, containers)
Digital object compliant: FAIR
Indexable
Publishable
Privacy/security (PHI)
Available on cloud platforms
Providing digital research objects to populate the Commons
Commons Supplements:
BD2K Centers, MODs, HMP & Interoperability
30. NIAID/CF/ADDS Microbiome* Cloud Pilot
HMP data and tools in the AWS cloud
Making Human Microbiome Project (HMP) data broadly accessible,
computable, and usable.
Moving ~20 TB of HMP data to AWS
Providing access to a suite of tools and APIs to facilitate data access and
use
Data and tools will follow FAIR principles (digital object compliance)
* In collaboration with Owen White (UM) – HMP DCC
31. NCI Cloud Pilots and Genomic Data Commons
Making cancer genomics data broadly accessible, computable, and
usable by researchers worldwide.
Genomic Data Commons (GDC) will store, analyze and distribute
~2.5 PB of cancer genomics data and associated clinical data
generated by the TCGA and TARGET (Therapeutically Applicable
Research to Generate Effective Treatments) initiatives
The NCI cloud pilots will make TCGA data available on the AWS
and Google clouds, along with a suite of tools and APIs to facilitate
their access and use
32. Thank you
ADDS Office
Phil Bourne, Michelle Dunn, Jennie Larkin, Mark Guyer, Sonynka Ngosso
NCBI: George Komatsoulis
NHGRI: Valentina di Francesco, Kevin Lee
CIT: Debbie Sinmao, Andrea Norris, Stacy Charland
Trans NIH BD2K Executive Committee & Working groups
NCI: Warren Kibbe, Tony Kerlavage, Tanja Davidsen
NIAID: Nick Weber, Darrell Hurt, Maria Giovanni, JJ McGowan
Many biomedical researchers, cloud providers, IT professionals
Detailed Notes:
National Research Cohort
>1 million U.S. volunteers committed to participating in research
Will combine a number of existing cohorts
Will include the Dept of Veterans Affairs Million Veteran Program (http://www.research.va.gov/MVP/)
There is not enough funding for every researcher to house all the data they need
Analyzing the data is more expensive than producing it
It can take weeks to download large datasets
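A rough calculation shows why download time matters at these scales. The sketch below uses the dataset sizes quoted in this deck (~20 TB of HMP data, ~2.5 PB for the GDC). The sustained link speed of 1 Gbit/s is purely an assumption for illustration, and the `transfer_days` helper is hypothetical. Real transfers are usually slower than the nominal link rate.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # bytes, decimal terabyte
PB = 10**15  # bytes, decimal petabyte


def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time in days, ignoring protocol overhead and contention."""
    return size_bytes * 8 / link_bits_per_second / SECONDS_PER_DAY


ASSUMED_LINK = 1e9  # 1 Gbit/s sustained; an assumption, not a measured figure

hmp_days = transfer_days(20 * TB, ASSUMED_LINK)
gdc_days = transfer_days(2.5 * PB, ASSUMED_LINK)
print(round(hmp_days, 1), round(gdc_days, 1))  # 1.9 231.5
```

Under this assumption the smaller dataset moves in about two days, while the larger one would take most of a year, which is the case for bringing computation to the data.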
Minimum Requirements:
Business relationship: Allows distribution and billing of credits and ensures that liability issues are resolved. The investigator who puts a digital object in the Commons retains the liability associated with its use.
Interfaces: Would need to be open, but not necessarily open source. Requires support for basic operations. In addition, the environment has to be open to all, so a private environment behind a university firewall won’t work.
Identifiers and metadata: Tied together, and together they enable researchers to search for and find resources.
Networking and Connectivity: Make sure that content is accessible; require connection to the commodity internet and Internet2. The key element from the investigator’s point of view is a free egress tier for academics.
Environment is secure
A&A: Must support InCommon because most NIH investigators have it. Minimizes the hassle of granting access to collaborators across multiple platforms.
Approval of clouds: Self-certify vs. NIH certify vs. third-party certify. In early test cases, may simply say ‘FedRAMPed’.
Cloud vs IaaS: Some IaaS providers (AWS comes to mind) may be uninterested in providing the ‘conformant’ layer but may support other companies that provide these services using an AWS backend. There are already exemplars of this: Seven Bridges Genomics and the Cancer Genomics Cloud Pilots are all software layers over an IaaS provider.
On this slide we have a list of Beacon providers and the content that they’re serving. To date we have over 120 public datasets that have been made available via Beacons at 12 different institutions. This represents data from tens of thousands of individuals, and these metrics, the numbers of datasets and individuals that they represent