1. BD2K & the Commons @ NIH
Vivien Bonazzi, Ph.D.
Senior Advisor for Data Science Technologies
Office of Data Science (ADDS)
National Institutes of Health
11. US Government Memo - Increasing Access to
Results of Federally Funded Scientific Research
In Feb 2013 the US OSTP issued a memo calling for
all US Federal Agencies to make digital assets
from federally funded research available
OSTP - Office of Science and Technology Policy at the White House
Public Access to Data Memo:
http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
12. US Government Memo - Increasing Access to
Results of Federally Funded Scientific Research
Each agency’s public access plan shall:
Maximize access, by the general public and without
charge, to digitally formatted scientific data
created with Federal funds while:
i) protecting confidentiality and personal privacy
ii) recognizing proprietary interests, business confidential information, and intellectual
property rights and avoiding significant negative impact on intellectual property
rights, innovation, and U.S. competitiveness, and
iii) preserving the balance between the relative value of long-term preservation and
access and the associated cost and administrative burden.
13. NIH Response
In response to the
incredible growth of large
biomedical (digital)
datasets, the Director of
NIH established a special
Data and Informatics
Working Group (DIWG)
http://acd.od.nih.gov/diwg.htm
14. NIH Response
Establish new data science research and training programs
Fulfilling the recommendation of the ACD WG report
Big Data to Knowledge (BD2K) - 2013
http://datascience.nih.gov/bd2k
Establish a new position:
NIH Associate Director of Data Science
(ADDS)
Phil Bourne – 2014
16. BD2K – Big Data to Knowledge
Expanding training programs in data science
Finding and Sharing Data & Software through Indexes
Targeted Software tools and methods
Data wrangling
Privacy and security of data
Data repurposing
Applications of metadata
Advance Big Data methods, tools and applications
(BD2K Centers of Excellence)
https://datascience.nih.gov/bd2k/funded-programs
17. To enable biomedical research as
a digital enterprise through which
new discoveries are made and
knowledge generated by
maximizing community
engagement and productivity.
18. NIH ADDS Mission Statement
To use data science
to foster an
Open Digital Ecosystem
that will accelerate
efficient, cost-effective
biomedical research
to enhance health, lengthen
life, and reduce illness and
disability
19. Enabling digital Ecosystems
via a Commons & BD2K
Leveraging BD2K efforts
Harnessing e-infrastructures
- Public-private partnerships & Interagency collaborations
Collaborating with external communities
20. Commons : Achieving a Balance
Biomedical Use Cases + Data Science + e-infrastructures
Supporting open biomedical science using robust, scalable
and flexible digital technologies
In collaboration with global communities
21. What are the PRINCIPLES of a Commons?
Supports a digital biomedical ecosystem
Treats products of research – data, software, methods,
papers etc. as digital objects
Digital objects exist in a shared virtual space
Find, Deposit, Manage, Share and Reuse data,
software, metadata and workflows
Digital objects need to conform to FAIR principles:
Findable
Accessible (and usable)
Interoperable
Reusable
22. Developing a Commons Framework
Exploits new scalable computing technologies - Cloud
Making digital objects : FAIR
Indexable/Findable, Accessible & Usable, Interoperable,
Reproducible
Simplifies access, sharing and interoperability of digital objects
such as data, software, metadata and workflows
Provides physical or logical access to digital objects
Provides understanding and accounting of usage patterns
Is potentially more cost effective given digital growth
Gives currency to digital objects and the people who develop and
support them
23. Commons Framework
Compute Platform: Cloud or SC Facilities
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
https://datascience.nih.gov/commons
24. Commons Framework
Compute Platform: Cloud or SC Facilities
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
IaaS
PaaS
SaaS
https://datascience.nih.gov/commons
25. Commons: Digital Object Compliance
Attributes of digital research objects in the Commons
Initial Phase
Unique digital object identifiers resolvable to the original authoritative
source
Machine readable
A minimal set of searchable metadata
Physically available in a cloud based Commons provider
Clear access rules (especially important for human subjects data)
An entry (with metadata) in one or more indices
Future Phases
Standard, community based unique digital object identifiers
Conform to community approved standard metadata and ontologies for
enhanced searching
Digital objects accessible via open standard APIs
Are physically and logically available to the Commons
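The initial-phase attributes above can be read as a checklist applied to each digital object. The following is a minimal sketch of such a check, not a Commons specification: the `DigitalObject` record, its field names, and the `MINIMAL_METADATA` keys are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical record layout; field names are illustrative, not a Commons schema.
@dataclass
class DigitalObject:
    identifier: str = ""          # unique digital object identifier
    source_url: str = ""          # original authoritative source it resolves to
    machine_readable: bool = False
    metadata: dict = field(default_factory=dict)
    cloud_location: str = ""      # location in a cloud-based Commons provider
    access_rules: str = ""        # e.g. "open" or "controlled"
    indices: list = field(default_factory=list)

# Assumed minimal metadata keys, for illustration only.
MINIMAL_METADATA = ("title", "creator", "description")

def initial_phase_gaps(obj):
    """Return the initial-phase attributes the object is still missing."""
    gaps = []
    if not (obj.identifier and obj.source_url):
        gaps.append("resolvable unique identifier")
    if not obj.machine_readable:
        gaps.append("machine readable")
    if any(not obj.metadata.get(key) for key in MINIMAL_METADATA):
        gaps.append("minimal searchable metadata")
    if not obj.cloud_location:
        gaps.append("available in a cloud-based Commons provider")
    if not obj.access_rules:
        gaps.append("clear access rules")
    if not obj.indices:
        gaps.append("entry in one or more indices")
    return gaps
```

An object that returns an empty list meets every initial-phase attribute in this sketch; anything else lists what still needs attention before deposit.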
27. Towards Data Commons
Co-locate data, storage and computing
infrastructure with commonly used tools for
accessing, analyzing, sharing data to create an
open interoperable resource for the research
community.
29. Current Commons Pilots
Explore feasibility of the Commons framework
Provide data objects to populate the Commons
Facilitate collaboration and interoperability
Provide access to cloud (IaaS) and PaaS/SaaS via credits
Connecting credits to NIH Grants
Making large and/or high value NIH funded data sets and tools
accessible in the cloud
Developing Data & Software Indexing methods
Leveraging BD2K efforts: bioCADDIE et al.
Collaborating with external groups
30. Other Commons Activities
Testing cloud environments to enable access, sharing, use and
reuse of large data sets and accompanying tools
The Cancer Genome Atlas (TCGA) - NCI
Human Microbiome Project (HMP) - NIAID
Providing portals to view representations and analyses of large
data sets (Genomic Data Commons – NCI)
32. Exploring feasibility of the Commons framework using
the BD2K Centers, MODs, and HMP groups
Facilitating connectivity, interoperability and access to
digital objects
Providing digital research objects to populate the
Commons
Enable biomedical science to happen more easily and
robustly
Connecting biology use cases with data science
Commons Framework Pilots
BD2K Centers, MODs, HMP
33. BD2K Centers,
MODs and HMP
Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
Mapping to the Commons framework:
Commons Framework Pilots
PaaS
SaaS
34. Does your work map to the Commons framework?
Good
Bad
Ugly
How does it enable science?
Using robust computational methods
Enable biomedical use cases
Commons Framework Pilots
BD2K Centers, MODs, HMP
35. Commons Framework Pilots
PI (parent grant's IC): project description
TOGA (NIBIB)
• Cloud-hosted data publication system
• Allows the automatic creation and publication of data to a personalized data repository
MUSEN (NIAID)
• Smart APIs – improved handling for metadata within APIs
• Ontological support for metadata within an API
• Improving smart API discoverability: a registry of APIs
HAN (NIGMS)
• Docker container hub for BD2K community
• Docker containers for genomic analysis applications and pipelines
• Benchmarks, evaluation & best practices
COOPER/KOHANE (NHGRI)
• Cloud based authenticated API access and exchange of causal modeling data, tools + genomic and phenomic data (PICI)
• Docker containers for CCD tools available in AWS
HAUSSLER (NHGRI)
• Secure sharing of germline genetic variations for a targeted panel of breast cancer susceptibility genes and variations
• (GA4GH) API: being able to query this data and metadata
Ohno-Machado (NHLBI)
• Development of an ecosystem for repeatable science
• Easy reuse of data AND software; tracking of provenance
• Use of container technologies for software and data reuse
Sternberg (NHGRI)
• Development of a cloud-based literature curation system for specific curation tasks of the collaborating sites
• An API to provide programmatic access to the relevant papers in PMC
White (NHGRI)
• The entire HMP1 data set made accessible on AWS
• Analysis tools for microbiome data in AWS
Westerfield (NHGRI)
• Development of a common data model for the MODs
• Development of APIs accessing data across the MODs
36. More specifically from a Data Science perspective
Open standards for APIs and Docker containers
Docker registry and best practices
Improved metadata handling in APIs
Data Object registry and indexing
Reusing what is currently available
bioCADDIE and schema.org
Publication
Preprint server with Links to all digital objects
Commons Framework Pilots
BD2K Centers, MODs, HMP
37. Example of a biomedical Use Case:
Develop a common gene model for all the MODs
Develop an open, well-structured, reusable and documented
API that can be used across the MOD data
Why?
• To be able to query a human gene against all MOD orthologs
• Improved understanding of health and disease states
• Improved understanding of genome structure & organization
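To make the use case concrete, here is a minimal sketch of what a common gene model and a cross-MOD ortholog query might look like. It is not the pilots' actual data model or API: the `Gene` fields, the `orthologs_of` function, and every record below (database names, identifiers, symbols) are hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical common gene model shared by all MODs; fields are illustrative.
@dataclass(frozen=True)
class Gene:
    mod: str             # source model organism database (placeholder names)
    gene_id: str         # identifier within that MOD (placeholder values)
    symbol: str
    species: str
    human_ortholog: str  # symbol of the human ortholog

# Toy records with placeholder identifiers; not real MOD data.
RECORDS = [
    Gene("MOD-A", "A:0001", "geneX", "mouse", "GENEX"),
    Gene("MOD-B", "B:0042", "genex", "zebrafish", "GENEX"),
    Gene("MOD-C", "C:0007", "gY", "fly", "GENEY"),
]

def orthologs_of(human_symbol, records=RECORDS):
    """Return every MOD gene whose human ortholog matches the given symbol."""
    wanted = human_symbol.upper()
    return [g for g in records if g.human_ortholog.upper() == wanted]
```

Because every MOD exposes the same record shape, one query such as `orthologs_of("GENEX")` covers all organisms at once instead of requiring a separate client per database.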
Commons Framework Pilots
BD2K Centers, MODs, HMP
38. The purpose of the Commons Framework is to support
BOTH
Biological use cases + Data Science methods
To allow biological research to happen at scale
Commons Framework Pilots
BD2K Centers, MODs, HMP
40. The Cloud Credits Model
The Commons
Cloud Provider
A
Cloud Provider
B
Investigator
NIH
Provides credits
HPC Provider
Uses credits in
Commons
Enabling search: Index
Commons Compliance
Commons Conformance
41. Drivers of the Cloud Credits Model
Scalability
Exploiting new computing models
Potential cost effectiveness
Simplified sharing of digital objects
Cloud computing supports many of these
objectives
42. Cloud credits
model (CCM)
Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
Mapping pilots to the Commons framework: Cloud
Credits Model:
IaaS
PaaS
SaaS
43. Advantages of this Model
Supports simplified data sharing by driving science into publicly
accessible computing environments that still provide for
investigator level access control
Scalable for the needs of the scientific community for the next 5
years
Democratize access to data and computational tools
Cost effective
Competitive marketplace for biomedical computing services
Reduces redundancy
Uses resources efficiently
44. Potential Disadvantages of this Model
Novelty:
Never been tried, so we don’t have data about likelihood of success
Cost Models:
Assumes stable or declining prices among providers
True for the last several years, but we can’t guarantee that it will continue,
particularly if there is significant consolidation in industry
Service Providers:
Assumes that providers are willing to make the investment to become
conformant
Market research suggests 3-5 providers within 2-3 months of launch
Persistence:
The model is ‘Pay As You Go’ which means if you stop paying it stops
going
Giving investigators an unprecedented level of control over what lives (or
dies) in the Commons
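The ‘Pay As You Go’ persistence point and the price-stability assumption can be illustrated with simple arithmetic. The sketch below is only an illustration: the function name, its parameters, and all figures in the usage note are hypothetical and do not reflect real credit amounts or provider prices.

```python
def months_funded(credits, monthly_cost, monthly_price_change=0.0, max_months=600):
    """Count the whole months a resource stays online before credits run out.

    monthly_price_change is a fractional change applied after each month,
    e.g. -0.01 for prices falling 1% per month, 0.1 for prices rising 10%.
    """
    if monthly_cost <= 0:
        raise ValueError("monthly_cost must be positive")
    months = 0
    cost = monthly_cost
    # Pay month by month; stop as soon as the remaining credits cannot cover a month.
    while months < max_months and credits >= cost:
        credits -= cost
        months += 1
        cost *= 1.0 + monthly_price_change
    return months
```

With hypothetical figures, `months_funded(1200, 100)` covers 12 months at a flat price, while the same credits last fewer months if prices rise. That is why the model's assumption of stable or declining prices matters for how long a data set persists.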
46. Data Sets in a Cloud Commons
Making High Value and/or High Volume NIH funded data
sets available in a cloud commons
Co-location of large datasets and compute power enables
access, use, reuse and sharing of data and tools
Data must adhere to FAIR/Commons compliance principles
Helps “seed” the Commons with FAIR/Commons compliant data
Provides indexable test data sets for bioCADDIE (and other
indexing efforts)
47. Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
Mapping pilots to the Commons framework :
Large, high value Data Sets
NIH defined data sets
48. Data Sets in the Cloud Commons
Preliminary possible data sets
GTEx (Genotype-Tissue Expression)
LINCS (Library of Integrated Network-Based Cellular Signatures)
Model Organism Databases (MODs)
UniProt
Neuroimaging Resource (NITRC)
Radiology Image Share
Epigenomics
GenPort
The Cancer Genome Atlas Project (TCGA): this data set is currently housed at the
GDC, but there are plans to move it to AWS and Google
BTRIS Data – NIH Clinical Center
NIAID AIDS Data
dbGaP
GEO
49. Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
Mapping pilots to the Commons framework :
Community Defined Data Sets
Community
defined data sets
50. Data Sets in a Cloud Commons: Opportunities
Ability to share data more easily
Ability to access and compute on data more easily
Reduced costs:
Cost is paid by NIH, not the individual PI
Stops continual uploads of the same data sets
FAIR/ Commons Compliance of data sets
51. Data Sets in a Cloud Commons: Challenges
Supporting sensitive (human) data in commercial clouds
Updating, versioning, maintaining
Consents for data
Can be very strict and only valid across 1 data set
Analysis across data sets may be constrained by consents
Optimizing for cloud environments: performance
Incentivizing data (and tool) generators to move and maintain
their data in the cloud
Data peering across clouds
Commercial clouds are resistant: cylinders of excellence
Peering and Virtualization of services
53. Commons Pilots: Search & Index
Indexing and Searching digital objects in a Commons
Leveraging indexing methods within BD2K
bioCADDIE
Other approaches within BD2K
Schema.org
Coexisting efforts
54. BD2K Indexing
e.g. bioCADDIE,
Other, schema.org
Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
Mapping pilots to the Commons framework :
Indexing & Searching
55. What is bioCADDIE?
biomedical and healthCAre
Data Discovery Index Ecosystem
University of California San Diego
PI Lucila Ohno-Machado
Development of a prototype of Data Discovery Index (DDI)
Aims – “PubMed” for Data
1. Help users find shared data
2. Build a prototype data discovery index
3. Evaluate requirements for next phase
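As a rough illustration of what "finding shared data" through an index involves, the sketch below builds a toy keyword index over metadata records. It is not bioCADDIE's implementation: the record layout (`id`, `title`, `description`) and both function names are assumptions made for this example.

```python
import re
from collections import defaultdict

def build_index(records):
    """Map each lowercase token in title/description to the ids of records containing it."""
    index = defaultdict(set)
    for rec in records:
        text = f"{rec['title']} {rec['description']}"
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(rec["id"])
    return index

def search(index, query):
    """Return the ids of records that contain every token in the query."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return set()
    # Intersect the posting sets so that all query terms must match.
    hits = [index.get(token, set()) for token in tokens]
    return set.intersection(*hits)
```

The index stores only metadata and identifiers, so the data sets themselves can stay in their home repositories while still being discoverable from one place.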
56. ecosystem components for finding data
Policies: criteria for inclusion, sustainability
Standards: metadata, data
Identifiers: reuse of existing ID issuing services
Metadata: minimal set, guidelines for mapping, accessibility information, provenance
Search engine: connection to other engines, repositories, data sets
57. Commons Pilots
Leveraging Schema.org
Marking up a biomedical resource using schema.org
Flexible and scalable
Developing a bioschema.org approach
Helps drive a community standard for reuse by other
groups
Harnesses the power of search engines to find digital objects
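Marking up a resource with schema.org typically means embedding a small JSON-LD description in its web page so that search engines can read it. The sketch below generates such markup for a data set using the schema.org `Dataset` type and a few of its standard properties; the helper names and any values passed in are hypothetical, and this is not a bioschema.org profile.

```python
import json

def dataset_jsonld(name, description, url, identifier, keywords):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": list(keywords),
    }

def as_script_tag(markup):
    """Wrap JSON-LD in the script element that web pages use to embed it."""
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"
```

A resource page would include the output of `as_script_tag(...)` in its HTML; crawlers that understand schema.org can then pick up the name, identifier and keywords without any resource-specific integration.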
58. Commons : Achieving a Balance
Biomedical Use Cases + Data Science + e-infrastructures
Supporting open biomedical science using robust, scalable
and flexible digital technologies
In collaboration with global communities
59. Thank you
ADDS Office
Phil Bourne, Michelle Dunn, Jennie Larkin, Mark Guyer, Sonynka Ngosso
NCBI: George Komatsoulis
NHGRI: Valentina di Francesco, Kevin Lee
CIT: Debbie Sinmao, Andrea Norris, Stacy Charland
Trans NIH BD2K Executive Committee & Working groups
NCI: Warren Kibbe, Tony Kerlavage, Lou Staudt, Tanja Davidsen, Ian Fore
NIAID: Nick Weber, Darrell Hurt, Maria Giovanni, JJ McGowan
Many biomedical researchers, cloud providers, IT professionals
There is not enough funding for every researcher to house all the data they need
Analyzing the data is more expensive than producing it
It can take weeks to download large datasets
OSTP Office of Science and Technology Policy
https://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
The ultimate objective is to provide the community with a fully functional DDI integrated into the digital commons.