BD2K & the Commons @ NIH
Vivien Bonazzi, Ph.D.
Senior Advisor for Data Science Technologies
Office of Data Science (ADDS)
National Institutes of Health
A Digital Story
NIH Data
US Government Memo - Increasing Access to
Results of Federally Funded Scientific Research
In Feb 2013 the US OSTP issued a memo calling for
all US Federal Agencies to make digital assets
from federally funded research available
OSTP - Office of Science and Technology Policy at the White House
Public Access to Data
Memo: http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
US Government Memo - Increasing Access to
Results of Federally Funded Scientific Research
Each agency’s public access plan shall:
Maximize access, by the general public and without
charge, to digitally formatted scientific data
created with Federal funds while:
i) protecting confidentiality and personal privacy
ii) recognizing proprietary interests, business confidential information, and intellectual
property rights and avoiding significant negative impact on intellectual property
rights, innovation, and U.S. competitiveness, and
iii) preserving the balance between the relative value of long-term preservation and
access and the associated cost and administrative burden.
NIH Response
• In response to the incredible growth of large biomedical (digital) datasets, the Director of NIH established a special Data and Informatics Working Group (DIWG)
http://acd.od.nih.gov/diwg.htm
NIH Response
Establish new data science research and training programs
Fulfilling the recommendation of the ACD WG report
Big Data to Knowledge (BD2K) - 2013
http://datascience.nih.gov/bd2k
Establish a new position:
NIH Associate Director of Data Science
(ADDS)
Phil Bourne – 2014
BD2K – Big Data to Knowledge
• Expanding training programs in data science
• Finding and sharing data & software through indexes
• Targeted software tools and methods
  - Data wrangling
  - Privacy and security of data
  - Data repurposing
  - Applications of metadata
• Advancing big data methods, tools and applications (BD2K Centers of Excellence)
https://datascience.nih.gov/bd2k/funded-programs
To enable biomedical research as a digital enterprise through which new discoveries are made and knowledge generated by maximizing community engagement and productivity.
NIH ADDS Mission Statement
To use data science to foster an Open Digital Ecosystem that will accelerate efficient, cost-effective biomedical research to enhance health, lengthen life, and reduce illness and disability
Enabling digital Ecosystems
via a Commons & BD2K
Leveraging BD2K efforts
Harnessing e-infrastructures
- Public-private partnerships & Interagency collaborations
Collaborating with external communities
Commons : Achieving a Balance
Biomedical Use Cases + Data Science + e-infrastructures
Supporting open biomedical science using robust, scalable
and flexible digital technologies
In collaboration with global communities
What are the PRINCIPLES of a Commons?
• Supports a digital biomedical ecosystem
• Treats products of research (data, software, methods, papers etc.) as digital objects
• Digital objects exist in a shared virtual space
  - Find, Deposit, Manage, Share and Reuse data, software, metadata and workflows
• Digital objects need to conform to FAIR principles:
  - Findable
  - Accessible (and usable)
  - Interoperable
  - Reusable
Developing a Commons Framework
• Exploits new scalable computing technologies: Cloud
• Making digital objects FAIR
  - Indexable/Findable, Accessible & Usable, Interoperable, Reusable
• Simplifies access, sharing and interoperability of digital objects such as data, software, metadata and workflows
• Provides physical or logical access to digital objects
• Provides understanding and accounting of usage patterns
• Is potentially more cost effective given digital growth
• Gives currency to digital objects and the people who develop and support them
Commons Framework
[Figure: the Commons Framework stack, with these components]
• Compute Platform: Cloud or SC Facilities
• Services: APIs, Containers, Indexing
• Software: Services & Tools (scientific analysis tools/workflows)
• Data: “Reference” Data Sets, User defined data
• Digital Object Compliance
• App store/User Interface
https://datascience.nih.gov/commons
Commons Framework
[Figure: the Commons Framework stack repeated, with its layers labelled IaaS, PaaS and SaaS]
https://datascience.nih.gov/commons
Commons: Digital Object Compliance
• Attributes of digital research objects in the Commons
Initial Phase
• Unique digital object identifiers resolvable to the original authoritative source
• Machine readable
• A minimal set of searchable metadata
• Physically available in a cloud-based Commons provider
• Clear access rules (especially important for human subjects data)
• An entry (with metadata) in one or more indices
Future Phases
• Standard, community-based unique digital object identifiers
• Conform to community-approved standard metadata and ontologies for enhanced searching
• Digital objects accessible via open standard APIs
• Physically and logically available to the Commons
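The initial-phase attributes listed for digital object compliance can be read as a checklist. A minimal sketch in Python, assuming a hypothetical `DigitalObject` record: the field names, the `REQUIRED_METADATA` set and the `initial_phase_gaps` helper are illustrative only and are not part of any Commons specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DigitalObject:
    """Hypothetical record for a digital research object (field names are illustrative)."""
    identifier: str = ""        # unique ID resolvable to the authoritative source
    machine_readable: bool = False
    metadata: Dict[str, str] = field(default_factory=dict)
    cloud_location: str = ""    # where the object lives at a Commons cloud provider
    access_rule: str = ""       # e.g. "open" or "controlled"
    indices: List[str] = field(default_factory=list)


# Assumed minimal searchable metadata set; the slides do not define one.
REQUIRED_METADATA = ("title", "creator", "description")


def initial_phase_gaps(obj: DigitalObject) -> List[str]:
    """Return the initial-phase attributes that the object is still missing."""
    gaps = []
    if not obj.identifier:
        gaps.append("unique resolvable identifier")
    if not obj.machine_readable:
        gaps.append("machine readable")
    missing = [key for key in REQUIRED_METADATA if not obj.metadata.get(key)]
    if missing:
        gaps.append("minimal metadata: " + ", ".join(missing))
    if not obj.cloud_location:
        gaps.append("available at a cloud-based Commons provider")
    if not obj.access_rule:
        gaps.append("clear access rules")
    if not obj.indices:
        gaps.append("entry in at least one index")
    return gaps
```

An empty list from `initial_phase_gaps` means the object meets every initial-phase attribute in this sketch; the future-phase attributes would need community standards that the checklist does not model.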
Towards Data Commons
Co-locate data, storage and computing infrastructure with commonly used tools for accessing, analyzing and sharing data, to create an open, interoperable resource for the research community.
NIH Commons PILOTS
Current Commons Pilots
• Explore feasibility of the Commons framework
• Provide data objects to populate the Commons
• Facilitate collaboration and interoperability
• Provide access to cloud (IaaS) and PaaS/SaaS via credits
  - Connecting credits to NIH grants
• Making large and/or high-value NIH-funded data sets and tools accessible in the cloud
• Developing data & software indexing methods
  - Leveraging BD2K efforts: bioCADDIE et al.
  - Collaborating with external groups
Other Commons Activities
• Testing cloud environments to enable access, sharing, use and reuse of large data sets and accompanying tools
  - The Cancer Genome Atlas (TCGA) - NCI
  - Human Microbiome Project (HMP) - NIAID
• Providing portals to view representations and analyses of large data sets (Genomic Data Commons – NCI)
Commons Framework Pilots
• Exploring feasibility of the Commons framework using the BD2K Centers, MODs, and HMP groups
• Facilitating connectivity, interoperability and access to digital objects
• Providing digital research objects to populate the Commons
• Enable biomedical science to happen more easily and robustly
• Connecting biology use cases with data science
Commons Framework Pilots
BD2K Centers, MODs, HMP
[Figure: mapping to the Commons framework: Commons Framework Pilots. The Commons Framework stack with the BD2K Centers, MODs and HMP pilots, labelled PaaS and SaaS]
Does your work map to the Commons framework?
Good
Bad
Ugly
How does it enable science?
Using robust computational methods
Enable biomedical use cases
Commons Framework Pilots
BD2K Centers, MODs, HMP
Commons Framework Pilots
PI (parent grant's IC): project description
Toga (NIBIB)
• Cloud-hosted data publication system
• Allows the automatic creation and publication of data in a personalized data repository
Musen (NIAID)
• Smart APIs: improved handling for metadata within APIs
• Ontological support for metadata within an API
• Improving smart API discoverability: a registry of APIs
Han (NIGMS)
• Docker container hub for BD2K community
• Docker containers for genomic analysis applications and pipelines
• Benchmark, evaluation & best practices
Cooper/Kohane (NHGRI)
• Cloud-based authenticated API access and exchange of causal modeling data, tools + genomic and phenomic data (PICI)
• Docker containers for CCD tools available in AWS
Haussler (NHGRI)
• Secure sharing of germline genetic variations for a targeted panel of breast cancer susceptibility genes and variations
• GA4GH API: being able to query this data and metadata
Ohno-Machado (NHLBI)
• Development of an ecosystem for repeatable science
• Easy reuse of data AND software; tracking of provenance
• Use of container technologies for software and data reuse
Sternberg (NHGRI)
• Development of a cloud-based literature curation system for specific curation tasks of the collaborating sites
• An API to provide programmatic access to the relevant papers in PMC
White (NHGRI)
• The entire HMP1 data set made accessible on AWS
• Analysis tools for microbiome data in AWS
Westerfield (NHGRI)
• Development of a common data model for the MODs
• Development of APIs accessing data across the MODs
• More specifically from a Data Science perspective
  - Open standards for APIs and Docker containers
  - Docker registry and best practices
  - Improved metadata handling in APIs
  - Data Object registry and indexing
• Reusing what is currently available
  - bioCADDIE and schema.org
• Publication
  - Preprint server with links to all digital objects
Commons Framework Pilots
BD2K Centers, MODs, HMP
• Example of a biomedical use case:
  - Develop a common gene model for all the MODs
  - Develop an open, well-structured, reusable and documented API that can be used across the MOD data
• Why?
  - To be able to query a human gene against all MOD orthologs
  - Improved understanding of health and disease states
  - Improved understanding of genome structure & organization
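The use case of querying a human gene against all MOD orthologs amounts to one query shape served by every MOD. A minimal sketch, assuming a hypothetical common gene record and in-memory stand-ins for the MOD APIs; the `GeneRecord` fields, the MOD names and every identifier below are placeholders, not real MOD data or a real API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GeneRecord:
    """Hypothetical common gene model shared by all MODs."""
    mod: str             # source database
    species: str
    gene_id: str         # identifier within that MOD (placeholder values below)
    symbol: str
    human_ortholog: str  # symbol of the human gene it is orthologous to


# In-memory stand-in for each MOD's data; a real system would call each MOD's API.
MOD_DATA: Dict[str, List[GeneRecord]] = {
    "MOD-A": [GeneRecord("MOD-A", "species A", "A:0001", "geneA1", "HUMAN1")],
    "MOD-B": [
        GeneRecord("MOD-B", "species B", "B:0042", "geneB7", "HUMAN1"),
        GeneRecord("MOD-B", "species B", "B:0043", "geneB8", "HUMAN2"),
    ],
}


def find_orthologs(human_symbol: str) -> List[GeneRecord]:
    """Query one human gene against every MOD and collect its orthologs."""
    hits: List[GeneRecord] = []
    for records in MOD_DATA.values():
        hits.extend(r for r in records if r.human_ortholog == human_symbol)
    return hits
```

Because every MOD returns the same record type, the caller never needs MOD-specific parsing, which is the practical payoff of a common gene model.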
Commons Framework Pilots
BD2K Centers, MODs, HMP
The purpose of the Commons Framework is to support
BOTH
Biological use cases + Data Science methods
To allow biological research to happen at scale
Commons Credits Model
The Cloud Credits Model
[Figure: the Cloud Credits Model. NIH provides credits to the investigator; the investigator uses the credits in the Commons, which contains Cloud Provider A, Cloud Provider B and an HPC provider. Other labels: Enabling search: Index; Commons Compliance; Commons Conformance]
Drivers of the Cloud Credits Model
• Scalability
• Exploiting new computing models
• Potential cost effectiveness
• Simplified sharing of digital objects
Cloud computing supports many of these objectives
Cloud Credits Model (CCM)
[Figure: mapping pilots to the Commons framework: Cloud Credits Model, shown on the Commons Framework stack with layers labelled IaaS, PaaS and SaaS]
Advantages of this Model
• Supports simplified data sharing by driving science into publicly accessible computing environments that still provide for investigator-level access control
• Scalable for the needs of the scientific community for the next 5 years
• Democratizes access to data and computational tools
• Cost effective
• Competitive marketplace for biomedical computing services
• Reduces redundancy
• Uses resources efficiently
Potential Disadvantages of this Model
• Novelty:
  - Never been tried, so we don’t have data about likelihood of success
• Cost Models:
  - Assumes stable or declining prices among providers
  - True for the last several years, but we can’t guarantee that it will continue, particularly if there is significant consolidation in industry
• Service Providers:
  - Assumes that providers are willing to make the investment to become conformant
  - Market research suggests 3-5 providers within 2-3 months of launch
• Persistence:
  - The model is ‘Pay As You Go’, which means if you stop paying it stops going
  - Giving investigators an unprecedented level of control over what lives (or dies) in the Commons
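The persistence point among the potential disadvantages can be made concrete with a little arithmetic. A minimal sketch, assuming a flat monthly cost and a fixed credit balance; the function names are hypothetical and the slides give no prices, so any numbers passed in are purely illustrative.

```python
from typing import List


def months_funded(credits: float, monthly_cost: float) -> int:
    """Whole months a digital object stays alive before the credits run out."""
    if monthly_cost <= 0:
        raise ValueError("monthly_cost must be positive")
    return int(credits // monthly_cost)


def balance_over_time(credits: float, monthly_cost: float, months: int) -> List[float]:
    """Remaining balance after each paid month; stops at the first month that cannot be paid."""
    history: List[float] = []
    for _ in range(months):
        if credits < monthly_cost:
            break  # pay-as-you-go: no payment, no persistence
        credits -= monthly_cost
        history.append(credits)
    return history
```

For example, `balance_over_time(1000, 300, 12)` stops after three entries even though twelve months were requested, which is the "if you stop paying it stops going" behaviour in miniature.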
Cloud Commons
Reference Data Sets
Data Sets in a Cloud Commons
• Making high-value and/or high-volume NIH-funded data sets available in a cloud commons
• Co-location of large datasets and compute power enables access, use, reuse and sharing of data and tools
• Data must adhere to FAIR/Commons compliance principles
• Helps “seed” the Commons with FAIR/Commons compliant data
• Provides indexable test data sets for bioCADDIE (and other indexing efforts)
[Figure: mapping pilots to the Commons framework: large, high value data sets (NIH defined data sets), shown on the Commons Framework stack]
Data Sets in the Cloud Commons
• Preliminary possible data sets
  - GTEx (Genotype-Tissue Expression)
  - LINCS (Library of Integrated Network-Based Cellular Signatures)
  - Model Organism Databases (MODs)
  - UniProt
  - Neuroimaging Resource (NITRC)
  - Radiology Image Share
  - Epigenomics
  - GenPort
  - The Cancer Genome Atlas (TCGA): this data set is currently housed at the GDC, but there are plans to move it to AWS and Google
  - BTRIS data – NIH Clinical Center
  - NIAID AIDS data
  - dbGaP
  - GEO
[Figure: mapping pilots to the Commons framework: community defined data sets, shown on the Commons Framework stack]
Data Sets in a Cloud Commons: Opportunities
• Ability to share data more easily
• Ability to access and compute on data more easily
• Reduced costs:
  - Cost is paid by NIH, not the individual PI
  - Stops continuous uploads of the same data sets
• FAIR/Commons compliance of data sets
Data Sets in a Cloud Commons: Challenges
• Supporting sensitive (human) data in commercial clouds
• Updating, versioning, maintaining
• Consents for data
  - Can be very strict and only valid across 1 data set
  - Analysis across data sets may be constrained by consents
• Optimizing for cloud environments: performance
• Incentivizing data (and tool) generators to move and maintain their data in the cloud
• Data peering across clouds
  - Commercial clouds are resistant: cylinders of excellence
  - Peering and virtualization of services
Making things Findable
Indexing & Search methods
Commons Pilots: Search & Index
• Indexing and searching digital objects in a Commons
  - Leveraging indexing methods within BD2K: bioCADDIE
  - Other approaches within BD2K: schema.org
[Figure: coexisting efforts. BD2K indexing, e.g. bioCADDIE; other, e.g. schema.org]
[Figure: mapping pilots to the Commons framework: Indexing & Searching, shown on the Commons Framework stack]
What is bioCADDIE?
biomedical and healthCAre Data Discovery Index Ecosystem
• University of California San Diego
• PI: Lucila Ohno-Machado
• Development of a prototype Data Discovery Index (DDI)
• Aims: a “PubMed” for data
  1. Help users find shared data
  2. Build a prototype data discovery index
  3. Evaluate requirements for next phase
Ecosystem components for finding data
• Policies: criteria for inclusion, sustainability
• Standards: metadata, data
• Identifiers: reuse of existing ID issuing services
• Metadata: minimal set, guidelines for mapping, accessibility information, provenance
• Search engine: connection to other engines, repositories, data sets
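As an illustration of how a minimal metadata set supports finding data, here is a sketch of a tiny inverted index over data set descriptions. It is not bioCADDIE's implementation; the record layout, the identifiers and the `build_index` and `search` helpers are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(records: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Map each lower-cased metadata word to the identifiers of the records containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for identifier, metadata in records.items():
        for value in metadata.values():
            for word in value.lower().split():
                index[word].add(identifier)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return identifiers whose metadata contains every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Placeholder records: identifiers and descriptions are invented for the example.
records = {
    "example:1": {"title": "Gut microbiome samples", "repository": "Repository A"},
    "example:2": {"title": "Tumor gene expression", "repository": "Repository B"},
    "example:3": {"title": "Skin microbiome expression", "repository": "Repository A"},
}
index = build_index(records)
```

Calling `search(index, "microbiome expression")` returns only the records that match both words. A production index would add the policies, standards and provenance handling listed among the ecosystem components, which this sketch leaves out.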
Commons Pilots
• Leveraging schema.org
  - Marking up a biomedical resource using schema.org
  - Flexible and scalable
• Developing a bioschema.org approach
  - Helps drive a community standard for reuse by other groups
  - Harnesses the power of search engines to find digital objects
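Marking up a resource with schema.org usually means embedding a JSON-LD description in the resource's web page. A minimal sketch that builds one with Python's standard library: `Dataset`, `DataDownload` and the property names come from schema.org, while the `dataset_jsonld` helper, the data set, the URLs and the identifier are placeholders invented for the example.

```python
import json


def dataset_jsonld(name: str, description: str, url: str,
                   identifier: str, download_url: str) -> str:
    """Serialize a schema.org Dataset description as a JSON-LD string."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,
            "encodingFormat": "text/csv",
        },
    }
    return json.dumps(doc, indent=2)


# Placeholder values only; no real data set is described here.
markup = dataset_jsonld(
    name="Example expression data set",
    description="Placeholder description.",
    url="https://example.org/datasets/1",
    identifier="example:1",
    download_url="https://example.org/datasets/1.csv",
)
```

The resulting string would typically be placed inside a `<script type="application/ld+json">` element on the data set's landing page, which is the form search engines read.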
Commons : Achieving a Balance
Biomedical Use Cases + Data Science + e-infrastructures
Supporting open biomedical science using robust, scalable
and flexible digital technologies
In collaboration with global communities
Thank you
• ADDS Office: Phil Bourne, Michelle Dunn, Jennie Larkin, Mark Guyer, Sonynka Ngosso
• NCBI: George Komatsoulis
• NHGRI: Valentina di Francesco, Kevin Lee
• CIT: Debbie Sinmao, Andrea Norris, Stacy Charland
• Trans NIH BD2K Executive Committee & Working groups
• NCI: Warren Kibbe, Tony Kerlavage, Lou Staudt, Tanja Davidsen, Ian Fore
• NIAID: Nick Weber, Darrell Hurt, Maria Giovanni, JJ McGowan
• Many biomedical researchers, cloud providers, IT professionals
The end

Advancing Science In A Collaborative Web 20 WorldAdvancing Science In A Collaborative Web 20 World
Advancing Science In A Collaborative Web 20 World
 

Recently uploaded

Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 

Recently uploaded (20)

Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 

BD2K and the Commons: ELIXIR All Hands

  • 1. BD2K & the Commons @ NIH Vivien Bonazzi, Ph.D. Senior Advisor for Data Science Technologies Office of Data Science (ADDS) National Institutes of Health
  • 11. US Government Memo - Increasing Access to Results of Federally Funded Scientific Research. In Feb 2013 the US OSTP (Office of Science and Technology Policy at the White House) issued a memo calling for all US Federal Agencies to make digital assets from federally funded research available. Public Access to Data Memo: http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
  • 12. US Government Memo - Increasing Access to Results of Federally Funded Scientific Research Each agency’s public access plan shall: Maximize access, by the general public and without charge, to digitally formatted scientific data created with Federal funds while: i) protecting confidentiality and personal privacy ii) recognizing proprietary interests, business confidential information, and intellectual property rights and avoiding significant negative impact on intellectual property rights, innovation, and U.S. competitiveness, and iii) preserving the balance between the relative value of long-term preservation and access and the associated cost and administrative burden.
  • 13. NIH Response: In response to the incredible growth of large biomedical (digital) datasets, the Director of NIH established a special Data and Informatics Working Group (DIWG). http://acd.od.nih.gov/diwg.htm
  • 14. NIH Response Establish new data science research and training programs Fulfilling the recommendation of the ACD WG report Big Data to Knowledge (BD2K) - 2013 http://datascience.nih.gov/bd2k Establish a new position: NIH Associate Director of Data Science (ADDS) Phil Bourne – 2014
  • 16. BD2K – Big Data to Knowledge
      - Expanding training programs in data science
      - Finding and Sharing Data & Software through Indexes
      - Targeted Software tools and methods
      - Data wrangling
      - Privacy and security of data
      - Data repurposing
      - Applications of metadata
      - Advance Big methods, tools and applications
      - BD2K Centers of Excellence
    https://datascience.nih.gov/bd2k/funded-programs
  • 17. To enable biomedical research as a digital enterprise through which new discoveries are made and knowledge generated by maximizing community engagement and productivity.
  • 18. NIH ADDS Mission Statement To use data science to foster an Open Digital Ecosystem that will accelerate efficient, cost-effective biomedical research to enhance health, lengthen life, and reduce illness and disability
  • 19. Enabling digital Ecosystems via a Commons & BD2K Leveraging BD2K efforts Harnessing e-infrastructures - Public-private partnerships & Interagency collaborations Collaborating with external communities
  • 20. Commons : Achieving a Balance Biomedical Use Cases + Data Science + e-infrastructures Supporting open biomedical science using robust, scalable and flexible digital technologies In collaboration with global communities
  • 21. What are the PRINCIPLES of a Commons?
      - Supports a digital biomedical ecosystem
      - Treats products of research (data, software, methods, papers etc.) as digital objects
      - Digital objects exist in a shared virtual space: Find, Deposit, Manage, Share and Reuse data, software, metadata and workflows
      - Digital objects need to conform to FAIR principles: Findable, Accessible (and usable), Interoperable, Reusable
  • 22. Developing a Commons Framework
      - Exploits new scalable computing technologies: Cloud
      - Making digital objects FAIR: Indexable/Findable, Accessible & Usable, Interoperable, Reproducible
      - Simplifies access, sharing and interoperability of digital objects such as data, software, metadata and workflows
      - Provides physical or logical access to digital objects
      - Provides understanding and accounting of usage patterns
      - Is potentially more cost effective given digital growth
      - Gives currency to digital objects and the people who develop and support them
  • 23. Commons Framework (diagram). Layers: Compute Platform (Cloud or SC Facilities); Services (APIs, Containers, Indexing); Software (Services & Tools: scientific analysis tools/workflows); Data (“Reference” Data Sets, User defined data); Digital Object Compliance; App store/User Interface. https://datascience.nih.gov/commons
  • 24. Commons Framework (diagram). Layers: Compute Platform (Cloud or SC Facilities); Services (APIs, Containers, Indexing); Software (Services & Tools: scientific analysis tools/workflows); Data (“Reference” Data Sets, User defined data); Digital Object Compliance; App store/User Interface. Layer labels: IaaS, PaaS, SaaS. https://datascience.nih.gov/commons
  • 25. Commons: Digital Object Compliance. Attributes of digital research objects in the Commons.
    Initial Phase
      - Unique digital object identifiers, resolvable to the original authoritative source
      - Machine readable
      - A minimal set of searchable metadata
      - Physically available in a cloud based Commons provider
      - Clear access rules (especially important for human subjects data)
      - An entry (with metadata) in one or more indices
    Future Phases
      - Standard, community based unique digital object identifiers
      - Conform to community approved standard metadata and ontologies for enhanced searching
      - Digital objects accessible via open standard APIs
      - Are physically and logically available to the Commons
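The initial-phase attributes on slide 25 could be checked mechanically. The following is only a minimal sketch under stated assumptions: the slides define no schema, so the class `DigitalObject`, its field names, the `REQUIRED_METADATA` keys and the function `compliance_gaps` are all hypothetical, and the "machine readable" attribute is not tested here.

```python
from dataclasses import dataclass, field

# Hypothetical minimal metadata keys; the slides name no concrete set.
REQUIRED_METADATA = ("title", "description", "creator")


@dataclass
class DigitalObject:
    """Illustrative record for a digital research object (assumed fields)."""
    identifier: str                                # unique ID resolvable to the source
    metadata: dict = field(default_factory=dict)   # searchable metadata
    cloud_location: str = ""                       # location at a Commons cloud provider
    access_rule: str = ""                          # e.g. "open" or "controlled"
    indices: list = field(default_factory=list)    # indices that list this object


def compliance_gaps(obj: DigitalObject) -> list:
    """Return the initial-phase attributes that the object is missing."""
    gaps = []
    if not obj.identifier:
        gaps.append("unique identifier")
    missing = [key for key in REQUIRED_METADATA if not obj.metadata.get(key)]
    if missing:
        gaps.append("minimal metadata: " + ", ".join(missing))
    if not obj.cloud_location:
        gaps.append("cloud location")
    if not obj.access_rule:
        gaps.append("access rule")
    if not obj.indices:
        gaps.append("index entry")
    return gaps
```

An object with every field filled in yields an empty list; each missing attribute adds one entry, so the result can serve as a simple report of what still blocks compliance.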
  • 27. Towards Data Commons: co-locate data, storage and computing infrastructure with commonly used tools for accessing, analyzing and sharing data to create an open, interoperable resource for the research community.
  • 29. Current Commons Pilots
      - Explore feasibility of the Commons framework
      - Provide data objects to populate the Commons
      - Facilitate collaboration and interoperability
      - Provide access to cloud (IaaS) and PaaS/SaaS via credits
      - Connecting credits to NIH Grants
      - Making large and/or high value NIH funded data sets and tools accessible in the cloud
      - Developing Data & Software Indexing methods
      - Leveraging BD2K efforts: bioCADDIE et al.
      - Collaborating with external groups
  • 30. Other Commons Activities
      - Testing cloud environments to enable access, sharing, use and reuse of large data sets and accompanying tools
      - The Cancer Genome Atlas (TCGA) - NCI
      - Human Microbiome Project (HMP) - NIAID
      - Providing portals to view representations and analyses of large data sets (Genomic Data Commons - NCI)
  • 32. Commons Framework Pilots: BD2K Centers, MODs, HMP
      - Exploring feasibility of the Commons framework using the BD2K Centers, MODs, and HMP groups
      - Facilitating connectivity, interoperability and access to digital objects
      - Providing digital research objects to populate the Commons
      - Enable biomedical science to happen more easily and robustly
      - Connecting biology use cases with data science
  • 33. Commons Framework Pilots: BD2K Centers, MODs and HMP. Mapping to the Commons framework (diagram). Layers: Compute Platform (Cloud or HPC); Services (APIs, Containers, Indexing); Software (Services & Tools: scientific analysis tools/workflows); Data (“Reference” Data Sets, User defined data); Digital Object Compliance; App store/User Interface. Layer labels: PaaS, SaaS.
  • 34. Commons Framework Pilots: BD2K Centers, MODs, HMP. Does your work map to the Commons framework? Good, Bad, Ugly. How does it enable science? Using robust computational methods. Enable biomedical use cases.
  • 35. Commons Framework Pilots (PI, parent grant’s IC: project description)
      - TOGA, NIBIB: Cloud-hosted data publication system. Allows the automatic creation and publication of data to a personalized data repository.
      - MUSEN, NIAID: Smart APIs, with improved handling for metadata within APIs. Ontological support for metadata within an API. Improving smart API discoverability: a registry of APIs.
      - HAN, NIGMS: Docker container hub for BD2K community. Docker containers for genomic analysis applications and pipelines. Benchmark, evaluation & best practices.
      - COOPER/KOHANE, NHGRI: Cloud based authenticated API access and exchange of causal modeling data, tools + genomic and phenomic data (PICI). Docker containers for CCD tools available in AWS.
      - HAUSSLER, NHGRI: Secure sharing of germline genetic variations for a targeted panel of breast cancer susceptibility genes and variations. (GA4GH) API: being able to query this data and metadata.
      - Ohno-Machado, NHLBI: Development of an ecosystem for repeatable science. Easy reuse of data AND software; tracking of provenance. Use of container technologies for software and data reuse.
      - Sternberg, NHGRI: Development of a cloud-based literature curation system for specific curation tasks of the collaborating sites. An API to provide programmatic access to the relevant papers in PMC.
      - White, NHGRI: The entire HMP1 data set made accessible on AWS. Analysis tools for microbiome data in AWS.
      - Westerfield, NHGRI: Development of a common data model for the MODs. Development of APIs accessing data across the MODs.
  • 36. Commons Framework Pilots: BD2K Centers, MODs, HMP. More specifically, from a Data Science perspective:
      - Open standards for APIs and Docker containers
      - Docker registry and best practices
      - Improved metadata handling in APIs
      - Data Object registry and indexing
      - Reusing what is currently available: bioCADDIE and schema.org
      - Publication: preprint server with links to all digital objects
  • 37. Commons Framework Pilots: BD2K Centers, MODs, HMP. Example of a biomedical Use Case:
      - Develop a common gene model for all the MODs
      - Develop an open, well structured, reusable and documented API that can be used across the MOD data
      - Why? To be able to query a human gene against all MOD orthologs; improved understanding of health and disease states; improved understanding of genome structure & organization
  • 38. The purpose of the Commons Framework is to support BOTH Biological use cases + Data Science methods To allow biological research to happen at scale Commons Framework Pilots BD2K Centers, MODs, HMP
  • 40. The Cloud Credits Model The Commons Cloud Provider A Cloud Provider B Investigator NIH Provides credits HPC Provider Uses credits in Commons Enabling search: Index Commons Compliance Commons Conformance
  • 41. Drivers of the Cloud Credits Model
      - Scalability
      - Exploiting new computing models
      - Potential cost effectiveness
      - Simplified sharing of digital objects
    Cloud computing supports many of these objectives
  • 42. Cloud credits model (CCM). Mapping pilots to the Commons framework (diagram). Layers: Compute Platform (Cloud or HPC); Services (APIs, Containers, Indexing); Software (Services & Tools: scientific analysis tools/workflows); Data (“Reference” Data Sets, User defined data); Digital Object Compliance; App store/User Interface. Layer labels: IaaS, PaaS, SaaS.
  • 43. Advantages of this Model
      - Supports simplified data sharing by driving science into publicly accessible computing environments that still provide for investigator level access control
      - Scalable for the needs of the scientific community for the next 5 years
      - Democratizes access to data and computational tools
      - Cost effective
      - Competitive marketplace for biomedical computing services
      - Reduces redundancy
      - Uses resources efficiently
  • 44. Potential Disadvantages of this Model
      - Novelty: Never been tried, so we don’t have data about likelihood of success
      - Cost Models: Assumes stable or declining prices among providers. True for the last several years, but we can’t guarantee that it will continue, particularly if there is significant consolidation in industry
      - Service Providers: Assumes that providers are willing to make the investment to become conformant. Market research suggests 3-5 providers within 2-3 months of launch
      - Persistence: The model is ‘Pay As You Go’, which means if you stop paying it stops going. Giving investigators an unprecedented level of control over what lives (or dies) in the Commons
  • 46. Data Sets in a Cloud Commons
      - Making High Value and/or High Volume NIH funded data sets available in a cloud commons
      - Co-location of large datasets and compute power enables access, use, reuse and sharing of data and tools
      - Data must adhere to FAIR/Commons compliance principles
      - Helps “seed” the Commons with FAIR/Commons compliant data
      - Provides indexable test data sets for bioCADDIE (and other indexing efforts)
  • 47. Mapping pilots to the Commons framework: Large, high value Data Sets (diagram). Layers: Compute Platform (Cloud or HPC); Services (APIs, Containers, Indexing); Software (Services & Tools: scientific analysis tools/workflows); Data (“Reference” Data Sets, User defined data); Digital Object Compliance; App store/User Interface. Highlighted: NIH defined data sets.
  • 48. Data Sets in the Cloud Commons. Preliminary possible data sets:
      - GTEx (Genotype-Tissue Expression)
      - LINCS (Library of Integrated Network-based Cellular Signatures)
      - Model Organism Databases (MODs)
      - UniProt
      - Neuroimaging Resource (NITRC)
      - Radiology Image Share
      - Epigenomics
      - GenPort
      - The Cancer Genome Atlas Project (TCGA): this data set is currently housed at the GDC but there ARE plans to move to AWS and Google
      - BTRIS Data - NIH Clinical Center
      - NIAID AIDS Data
      - dbGaP
      - GEO
  • 49. Mapping pilots to the Commons framework: Community Defined Data Sets (diagram). Layers: Compute Platform (Cloud or HPC); Services (APIs, Containers, Indexing); Software (Services & Tools: scientific analysis tools/workflows); Data (“Reference” Data Sets, User defined data); Digital Object Compliance; App store/User Interface. Highlighted: Community defined data sets.
  • 50. Data Sets in a Cloud Commons: Opportunities
      - Ability to share data more easily
      - Ability to access and compute on data more easily
      - Reduced costs: cost is paid by NIH, not the individual PI; stops continual uploads of the same data sets
      - FAIR/Commons Compliance of data sets
  • 51. Data Sets in a Cloud Commons: Challenges
      - Supporting sensitive (human) data in commercial clouds
      - Updating, versioning, maintaining
      - Consents for data
          - Can be very strict and only valid across 1 data set
          - Analysis across data sets may be constrained by consents
      - Optimizing for cloud environments: performance
      - Incentivizing data (and tool) generators to move and maintain their data in the cloud
      - Data peering across clouds
          - Commercial clouds are resistant: cylinders of excellence
          - Peering and Virtualization of services
  • 53. Commons Pilots: Search & Index
      - Indexing and searching digital objects in a Commons
      - Leveraging indexing methods within BD2K: bioCADDIE, other approaches within BD2K
      - Schema.org
      - Coexisting efforts
  • 54. Mapping pilots to the Commons framework: Indexing & Searching (diagram). Layers: Compute Platform (Cloud or HPC); Services (APIs, Containers, Indexing); Software (Services & Tools: scientific analysis tools/workflows); Data (“Reference” Data Sets, User defined data); Digital Object Compliance; App store/User Interface. Highlighted: BD2K Indexing, e.g. bioCADDIE, other, schema.org.
  • 55. What is bioCADDIE? biomedical and healthCAre Data Discovery Index Ecosystem
      - University of California San Diego
      - PI Lucila Ohno-Machado
      - Development of a prototype of Data Discovery Index (DDI)
      - Aims: “PubMed” for Data
          1. Help users find shared data
          2. Build a prototype data discovery index
          3. Evaluate requirements for next phase
  • 56. Ecosystem components for finding data
      - Policies: criteria for inclusion, sustainability
      - Standards: metadata, data
      - Identifiers: reuse of existing ID issuing services
      - Metadata: minimal set, guidelines for mapping, accessibility information, provenance
      - Search engine: connection to other engines, repositories, data sets
  • 57. Commons Pilots
      - Leveraging Schema.org
      - Marking up a biomedical resource using schema.org
      - Flexible and scalable
      - Developing a bioschema.org approach
      - Helps drive a community standard for reuse by other groups
      - Harnesses the power of search engines to find digital objects
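Marking up a biomedical resource with schema.org, as slide 57 describes, usually means publishing a small JSON-LD description alongside the resource. The sketch below is illustrative only: the helper name `dataset_jsonld` and the sample values are assumptions, not part of the pilots. It uses only standard schema.org `Dataset` properties (`name`, `description`, `url`, `identifier`, `keywords`).

```python
import json


def dataset_jsonld(name, description, url, identifier, keywords):
    """Build a schema.org Dataset description as a JSON-LD string."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": list(keywords),
    }
    return json.dumps(doc, indent=2)


# Placeholder values for illustration; not a real data set record.
markup = dataset_jsonld(
    name="Example expression data set",
    description="Hypothetical data set used to illustrate the markup.",
    url="https://example.org/datasets/1",
    identifier="https://example.org/id/1",
    keywords=["genomics", "example"],
)
```

Embedding the resulting string in a `<script type="application/ld+json">` element on the resource's landing page is the usual way to expose such markup to search engine crawlers.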
  • 58. Commons : Achieving a Balance Biomedical Use Cases + Data Science + e-infrastructures Supporting open biomedical science using robust, scalable and flexible digital technologies In collaboration with global communities
  • 59. Thank you
      - ADDS Office: Phil Bourne, Michelle Dunn, Jennie Larkin, Mark Guyer, Sonynka Ngosso
      - NCBI: George Komatsoulis
      - NHGRI: Valentina di Francesco, Kevin Lee
      - CIT: Debbie Sinmao, Andrea Norris, Stacy Charland
      - Trans NIH BD2K Executive Committee & Working groups
      - NCI: Warren Kibbe, Tony Kerlavage, Lou Staudt, Tanja Davidsen, Ian Fore
      - NIAID: Nick Weber, Darrell Hurt, Maria Giovanni, JJ McGowan
      - Many biomedical researchers, cloud providers, IT professionals

Editor's Notes

  1. There is not enough funding for every researcher to house all the data they need. Analyzing the data is more expensive than producing it. It can take weeks to download large datasets.
  2. OSTP Office of Science and Technology Policy https://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
  3. OSTP Office of Science and Technology Policy https://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
  4. The ultimate objective is to provide the community with a fully functional DDI integrated into the digital commons.