NIH Data Summit - The NIH Data Commons

  1. NIH Data Commons. NIH Data Storage Summit, October 20, 2017. Vivien Bonazzi, Ph.D., Senior Advisor for Data Science (NIH/OD), Project Leader for the NIH Data Commons
  2. What's driving the need for a Data Commons?
  3. Challenges with the current state of data
     • Generating large volumes of biomedical data
     • Cheap to generate, costly to store on local servers
     • Multiple copies of the same data in different locations
     • Building data resources that cannot be easily found by others
     • Data resources are not connected to each other and cannot share data or tools
     • No standards or guidelines on how to share and access data
  4. Convergence of factors
     • Increasing recognition of the need to support data sharing
     • Availability of digital technologies and infrastructures that support data at scale
     • Cloud: data storage, compute and sharing
     • FAIR: Findable, Accessible, Interoperable, Reusable
     • Understanding that data is a valuable resource that needs to be sustained
  5. https://gds.nih.gov/: went into effect January 25, 2015; requires public sharing of genomic data sets. NCI guidance: http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data
  6. Findable, Accessible, Interoperable, Reusable
  7. DATA has VALUE. DATA is CENTRAL to the Digital Economy, a signal of the coming Digital Economy.
  8. Scientific digital assets: data, software, workflows, documentation, journal articles. Organizations will be defined by their digital assets.
  9. The most successful organizations of the future will be those that can leverage their digital assets and transform them into a digital enterprise.
  10. Data Commons: enabling data-driven science. Enable investigators to leverage all possible data and tools in the effort to accelerate biomedical discoveries, therapies and cures by driving the development of data infrastructure and data science capabilities through collaborative research and robust engineering.
  11. Developing a Data Commons
     • Treats products of research (data, methods, tools, papers, etc.) as digital objects
     • For this presentation: Data = Digital Objects
     • These digital objects exist in a shared virtual space
     • Find, deposit, manage, share, and reuse data, software, metadata and workflows
     • Digital object compliance through FAIR principles: Findable, Accessible (and usable), Interoperable, Reusable
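To make the digital-object idea concrete, here is a minimal sketch in Python of what a FAIR digital-object record might carry. The field names and example values are illustrative assumptions, not an NIH specification.

```python
from dataclasses import dataclass, field

@dataclass
class DigitalObject:
    """Illustrative FAIR digital-object record (all field names are hypothetical)."""
    guid: str        # Findable: a globally unique, persistent identifier
    access_url: str  # Accessible: where the bytes can be retrieved
    media_type: str  # Interoperable: a declared standard format
    checksum: str    # Reusable: verify integrity after transfer
    metadata: dict = field(default_factory=dict)  # searchable descriptive metadata

obj = DigitalObject(
    guid="guid:example-topmed-sample1",            # hypothetical identifier
    access_url="https://cloud.example.org/data/sample1.bam",
    media_type="application/bam",
    checksum="sha256:ab12cd34",
    metadata={"study": "TOPMed", "assay": "WGS"},
)
```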
  12. The Data Commons is a platform that allows transactions to occur on FAIR data at scale.
  13. The Data Commons Platform (layered as IaaS / PaaS / SaaS):
     • Compute platform (IaaS): the cloud
     • Services (PaaS): APIs, containers, indexing
     • Software (SaaS): services and tools, scientific analysis tools/workflows
     • Data, made FAIR: "reference" data sets and user-defined data
     • App store / user interface / portal
  14. Other Data Commons
  15. Data Commons engagement: US government agencies and EU groups
  16. Interoperability with other Commons
     • Common goals: democratizing, collaborating on, and sharing data
     • Reuse of currently available open-source tools which support interoperability
        • GA4GH, UCSC, GDC, NYGC
        • May 2017 BioIT Commons session
     • Shared open-standard APIs for data access and computing
     • Ability to deploy and compute across multiple cloud environments
     • Docker containers: Dockstore / Docker registry
     • Workflow management, sharing and deployment
     • Discoverability (indexing) of objects across cloud commons
     • Globally unique identifiers
     • Common user authentication system
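As an illustration of how shared open-standard APIs and global identifiers could work together, the sketch below resolves a GUID through a hypothetical indexing service to per-cloud copies of an object. The endpoint path and response shape are assumptions in the spirit of the GA4GH-style APIs named above, not a published Commons API.

```python
import json
import urllib.request

# Hypothetical indexing service; the real Commons index API was still to be built.
INDEX_URL = "https://commons.example.org/index"

def resolve_guid(guid: str) -> dict:
    """Resolve a globally unique identifier to per-cloud copies of an object.

    Sketch of a GA4GH-style lookup: the endpoint path and the response shape
    (a list of {"provider", "url"} locations) are assumptions for illustration.
    """
    with urllib.request.urlopen(f"{INDEX_URL}/objects/{guid}") as resp:
        record = json.load(resp)
    # A cloud-agnostic record would list one URL per provider hosting a copy.
    return {loc["provider"]: loc["url"] for loc in record["locations"]}

# Example (would require a live service):
# resolve_guid("guid:9f1c") -> {"aws": "s3://...", "gcp": "gs://..."}
```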
  17. The Good News
     • Considerable agreement about the general approaches to be taken
     • Many people are already addressing many of the problems:
        • Data architectures/platforms
        • Automated/semi-automated data access/authentication protocols
        • Common metadata standards and templates
        • Open tools and software
        • Instantiation and initial metrics of Findability, Accessibility, Interoperability, and Reusability
        • Relationships/agreements with cloud service providers that leverage their interest in hosting NIH data
        • Moving data to the cloud and operating in a cloud environment
  18. The Challenges
     • A need to "bring it all together": community endorsement of:
        • Metadata standards/tools/approaches
        • Crosswalks between equivalent terms/ontologies
        • Robust, shared approaches to data access/authentication
        • Best practices that will enable existing data to become FAIR and will guide generation of future datasets
     • A rapidly evolving field makes approaches, tools, etc. subject to change; approaches need to be adaptable
     • Effort is required to adapt data to community standards and move data to the cloud. How much does that cost and how long does it take?
     • Lack of interoperability between cloud providers
  19. The Challenges (continued)
     • Making data FAIR comes with a cost. How much does it actually cost? How can we minimize it?
     • How do we determine whether any one set of data warrants the expense?
     • What is the value added to the data by making it FAIR? What new science can be achieved?
     • How can new derived data or new computational approaches be added to the dataset to enrich it?
     • What are the limitations of FAIRness from dataset to dataset?
  20. Development of a NIH Data Commons Pilot
  21. The NIH Data Commons Pilot allows access, use and sharing of large, high-value NIH data in the cloud.
  22. NIH Data Commons Pilot
  23. NIH Data Commons structure (layered as IaaS / PaaS / SaaS):
     • Cloud services (ACCESS): APIs, containers, GUIDs, indexing, search, auth
     • Scientific analysis tools/workflows
     • Data, made FAIR: "reference" data sets (TOPMed, GTEx, MODs)
     • App store / user interface / portal / workspace
  24. Operationalizing the NIH Data Commons Pilot
  25. NIH Data Commons Pilot: implementation
     • Storage, NIH Marketplace, metrics and costs: leveraging and extending relationships established as part of BD2K to provide access to cloud storage and compute
     • Supplements to the TOPMed, GTEx and MODs groups: prepare (and move) data sets to the cloud for storage, access and scientific use; work collaboratively with the OT awardees to build towards data access
     • Data Commons OT solicitation (OT: Other Transaction; ROA: Research Opportunity Announcement): developing the fundamental FAIR computational components to support access, use and sharing of the 3 data sets above
  26. NIH Data Commons Pilot Consortium
  27. Storage, NIH Marketplace, metrics and costs
     • Establishing a new NIH Marketplace: access to a sustainable cloud infrastructure for data science at NIH
     • Over the next 18 months, NIH will establish its own NIH Cloud Marketplace
     • Give Data Commons Pilot Consortium awardees the ability to acquire cloud storage and compute services
     • Enable ICs to easily acquire cloud storage and compute services from commercial cloud providers, resellers, and integrators
     • Building on existing relationships with CSPs
     • Led by CIT with input from a multi-IC working group
  28. Storage, NIH Marketplace, metrics and costs: assessment and evaluation
     • What are the costs associated with cloud storage and usage?
     • What are the business best practices? How should costs be paid, and who should pay them?
     • How should highly used data be managed vs. less-used data?
     • Are data producers supportive of this model?
     • Are users (of all experience levels) able to access and use data effectively?
     • How will we know if the Data Commons Pilot is successful? How do we adjust to changing needs?
  29. Supplements to the 3 test data set groups
     • Administrative supplements to TOPMed, GTEx and MODs
     • PIs for each data set were requested to review the OT (ROA) and determine appropriate ways to interact
     • Prepare (and move) data sets to the cloud for storage, access and scientific use
     • Make community workflows and cloud-based tools of popular analysis pipelines from the 3 datasets accessible
     • Facilitate discovery and interpretation of the association of human and model organism genotypes and phenotypes
  30. NIH Data Commons: OT ROA. Key capabilities (modular components):
     • Development of community-supported FAIR guidelines and metrics
     • Globally unique identifiers (GUIDs) for FAIR biomedical data
     • Open-standard APIs (interoperability and connectivity)
     • Cloud-agnostic architecture and frameworks
     • Cloud user workspaces
     • Research ethics, privacy, and security (AUTH)
     • Indexing and search
     • Scientific use cases
     • Training, outreach, coordination
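One simple way GUIDs for FAIR data can be minted is to derive them deterministically from content checksums, so identical bytes in different clouds share one identifier. The sketch below is an illustration of that idea only; the pilot's actual identifier scheme was to be developed under this ROA.

```python
import uuid

# Hypothetical namespace for the Commons; any stable UUID would serve.
COMMONS_NS = uuid.uuid5(uuid.NAMESPACE_URL, "https://commons.example.org")

def mint_guid(checksum: str) -> str:
    """Derive a stable GUID from a data object's content checksum.

    Deriving the identifier from the checksum (rather than minting a random
    UUID) means identical bytes always map to the same GUID, one simple way
    to recognize duplicate copies across clouds.
    """
    return f"guid:{uuid.uuid5(COMMONS_NS, checksum)}"

print(mint_guid("sha256:ab12cd34"))  # same input always yields the same GUID
```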
  31. NIH Data Commons Pilot: Outcomes. Stage 1 (180-day window):
     • Develop MVPs (Minimum Viable Products): demonstrations of the Data Commons and its components
     • Have one copy of each test data set in each cloud provider, and an understanding of the process required to achieve this
     • Draft version of a single standard access control system; be able to access and use the data through it
     • Be able to use a variety of analysis tools and pipelines on the 3 data sets in the cloud (driven by scientific use cases)
     • Have a rudimentary ability to query across test data sets (see the sketch below):
        • Display phenotype, expression and variant data aligned with a specific gene or genomic location
        • Display model organism orthologs for a given set of human genes
     • Draft FAIR guidelines and metrics
     • Understand how the computational components that support data access fit together and what standards are needed
     • Written plans for how and why these demonstrations should be extended into a full Pilot
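A rough sketch of what the cross-dataset query demonstration could look like from a user's perspective: one gene in, records from each test data set out. The `search` interface and the stub client are hypothetical; the pilot's real query API was among the components still to be built.

```python
DATASETS = ["TOPMed", "GTEx", "MODs"]  # the three pilot test data sets

class StubClient:
    """Stand-in for a real Commons query service (illustrative only)."""
    def search(self, dataset: str, gene: str) -> list:
        # A real implementation would call the indexing/search cloud service.
        return [{"dataset": dataset, "gene": gene, "record": "..."}]

def query_gene(gene: str, client) -> dict:
    """Aggregate phenotype/expression/variant records for one gene across data sets."""
    return {dataset: client.search(dataset=dataset, gene=gene) for dataset in DATASETS}

print(query_gene("BRCA1", StubClient()))
```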
  32. NIH Data Commons Pilot: Outcomes. Stage 2 (4-year period):
     • Extend and fully implement the Data Commons Pilot based on the design strategies and capabilities developed in Stage 1
     • Review of MVPs/demonstrations and written plans from Stage 1
     • Goals and milestones with clear and specific outcomes
     • Evaluate, negotiate, and revise terms of existing awards
     • Award additional OTs
  33. Acknowledgments. DPCPSI: Jim Anderson, Betsy Wilder, Vivien Bonazzi, Marie Nierras, Rachel Britt, Sonyka Ngosso, Lora Kutkat, Kristi Faulk, Jen Lewis, Kate Nicholson, Chris Darby, Tonya Scott. NHLBI: Gary Gibbons, Alastair Thomson, Teresa Marquette, Jeff Snyder, Melissa Garcia, Maarten Lerkes, Ann Gawalt, Cashell Jaquish, George Papanicolaou. NHGRI: Eric Green, Valentina di Francesco, Ajay Pillai, Simona Volpi, Ken Wiley. NIAID: Nick Weber. CIT: Andrea Norris. NLM: Patti Brennan. NCBI: Steve Sherry
  34. Stay in touch: QR business card, LinkedIn @Vivien.Bonazzi, SlideShare, blog (coming soon!)

Editor's notes

  • Current snapshot of Commons status

  • Development of FAIR-ness metrics
  • The Data Commons is a federated way to provide access to and sharing of large, high-value NIH data.
    The purpose of a cloud-based Data Commons is to make large data sets accessible and usable by the broader community.
    Having one copy of a large data set on the cloud means it is accessible by many researchers, and they don't need to copy the data set from NCBI (or other repositories) to the cloud every time they want to use it.
    One copy of a large data set on the cloud, accessed multiple times by many researchers who pay only for the ability to compute on that data, is more cost- and time-effective than moving the same large data set to the cloud multiple times (see the back-of-envelope sketch below).
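A back-of-envelope version of that argument, with entirely hypothetical prices and sizes, just to show its shape:

```python
# Compare N research groups each copying a large data set versus all of them
# computing against one hosted copy. All numbers below are made up.
DATASET_TB = 500                   # e.g., a large sequencing data set
GROUPS = 100                       # research groups wanting to analyze it
TRANSFER_COST_PER_TB = 50.0        # assumed network-transfer cost, $/TB
STORAGE_COST_PER_TB_MONTH = 20.0   # assumed cloud-storage cost, $/TB/month

every_group_copies = GROUPS * DATASET_TB * TRANSFER_COST_PER_TB
one_shared_copy_per_year = DATASET_TB * STORAGE_COST_PER_TB_MONTH * 12

print(f"Each group copies the data:   ${every_group_copies:,.0f}")       # $2,500,000
print(f"One shared cloud copy, 1 yr:  ${one_shared_copy_per_year:,.0f}") # $120,000
```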

    A cloud-based Data Commons becomes much more powerful when (community-based) standardized methods and systems are adopted. These standards apply to the way data and tools interact with each other, to the computing environment they sit within (i.e., the cloud), and to how data and tools are made accessible to the user.
    Standards specifically relate to the FAIR guidelines, APIs to access data, workflows and tools, and Docker containers for deployment of tools to the cloud.
    Standards are what enable a federated Commons.
    Standards create the basic ground rules and common language for interactions in the system (a toy illustration follows below).
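As a toy illustration of the kind of FAIR-ness metric mentioned above, one could score a digital-object record one point per principle it satisfies. The checks are deliberately naive; the community guidelines and metrics were still being drafted at the time of this talk.

```python
def fair_score(record: dict) -> int:
    """Naive FAIR-ness score: one point per principle a record satisfies."""
    checks = [
        bool(record.get("guid")),        # Findable: persistent identifier present
        bool(record.get("access_url")),  # Accessible: retrievable location listed
        bool(record.get("media_type")),  # Interoperable: standard format declared
        bool(record.get("metadata")),    # Reusable: descriptive metadata present
    ]
    return sum(checks)

print(fair_score({"guid": "guid:123", "metadata": {"study": "GTEx"}}))  # -> 2
```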

  • The Data Commons Framework describes the ecosystem that the OT solicitation is building towards.
    Each of the key capabilities described in the OT has a major role in the development of the ecosystem.
  • Governance of the Commons can be found on slide XX.
  • The purpose of this slide is to give a sense that providing access to the data requires a series of modular, reusable components.
    I won't describe each key capability, but I want to give them a sense that there are modular components that fit together to permit access.
  • Multi-IC Working Group co-chairs for the Data Commons Pilot:
    Gary Gibbons, Eric Green, Patti Brennan, Jim Anderson, Andrea Norris
