CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned from the Human Cell Atlas and other federated data projects

  1. Data Gravity in the Life Sciences: Lessons learned from the Human Cell Atlas and other federated data projects. Presenter: Tony Burdett (EMBL-EBI). Host: Marta Lloret Llinares (EMBL-EBI). This project has received funding from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No. 825775.
  2. This webinar is being recorded
  3. Audience Q&A Session: Please write your questions in the questions window of the GoToWebinar application.
  4. Common Infrastructure for National Cohorts in Europe, Canada and Africa (CINECA). The vision: Accelerating disease research and improving health by facilitating transcontinental human data exchange. The challenges: [figure]. Stay informed: @CinecaProject, www.cineca-project.eu. This project has received funding from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No. 825775 and from the Canadian Institutes of Health Research under grant agreement #404896.
  5. Today’s presenter: Tony Burdett leads the Archival Infrastructure and Technology team, which develops services and provides technology to support the activities of EMBL-EBI’s molecular archives, including data submission, storage, validation, coordination and presentation. Tony joined EMBL-EBI in 2005 and has personally built and led development teams for many resources such as the GWAS Catalog, ArrayExpress, the Expression Atlas and BioSamples. His team now develops the ingestion service for the Human Cell Atlas Data Coordination Platform, EMBL-EBI’s Unified Submission Interface, and the BioSamples database.
  6. Data Gravity in the Life Sciences: Lessons learned from the Human Cell Atlas and other federated data projects. Tony Burdett, EMBL-EBI. 12th November, 2020
  7. A bit about me… • I joined EBI in 2005 • I have a biological and medical background • My career has been heavily focused on service engineering in bioinformatics • I’ve built, helped develop, or run the development teams for… • ArrayExpress • Expression Atlas • BioSamples • Ontology tooling • GWAS Catalog • Human Cell Atlas DCP
  8. Data Gravity I didn’t coin the term... https://datagravitas.com/2010/12/07/data-gravity-in-the-clouds/
  9. Data Gravity: $G = \frac{V R}{B C}$. “Let data gravity of a given dataset, G, be the product of data volume, V, and the regulatory restrictions of the region in which the data was generated, R, over the bandwidth at the location of the data, B, and the cost of compute in that location, C.” Background photo created by rawpixel.com - www.freepik.com
  10. Data Gravity
  11. Data Gravity
  12. Data Gravity
  13. Why does “data gravity” matter?
  14. Who uses EMBL-EBI services?
  15. Changing Genomic Data Generation Landscape: Percentage of whole genomes and exomes that are funded solely by healthcare systems: 2012 ~1%; 2017 ~20%; 2022 >80%
  16. Data Gravity
  17. Big Data in Digital Biology: EMBL-EBI 2015-2019. Public Web Infrastructure • Web Requests: 27M → 40M/day • Unique Host IPs: 1.1M → 2.4M/month • Web Jobs: 138M → 145M/year • Search Requests: 272M → 551M/year. Further figures from the slide: 6.2PB → 22.7PB; 1600 VMs → 3100 VMs; 450TB → 973TB. Slide acknowledgment: Steven Newhouse
  18. Data Gravity
  19. Data Gravity
  20. Collating Data for Analysis: Data being analysed; Cohort datasets; Reference annotation datasets; Proprietary, firewalled datasets
  21. Bottlenecks and Barriers: FEDERATED DATA; FEDERATED WORKFLOW EXECUTION; GLOBAL FEDERATED RESEARCH PLATFORM
  22. Bottleneck: FAIR Data ● Data and Data Sciences are core elements of Health Research and Innovation and in all elements of Biopharma Research ● The impact and reuse of data is rapidly growing - but nearly 80% of investment is spent assembling and harmonizing data (Forbes article on 2016 Data Scientist Report)
  23. Impact on innovation: Cost of not having FAIR research data: €26bn/yr in Europe (https://dx.doi.org/10.2777/02999)
  24. Bottleneck: Data Federation • National genomics initiatives in most European countries • Primary goal healthcare diagnostics and personalised medicine • Federated EGA is a harmonised platform for human data discovery, access, distribution, coordinated via ELIXIR human data community • Central EGA: International submissions+helpdesk • Local EGA: Host data locally, share metadata, national node for submissions and/or helpdesk • EGA community: Host data locally, share metadata
  25. Bottleneck: Reproducible Research and Analysis Figure courtesy of: https://esciencelab.org.uk/projects/eosclife/
  26. CINECA - Federated Analysis. Data sources: EGA, Biobanks, CHILD, H3ABioNet, .. • WP1 Federated data discovery: Phenotype, Genotype, Data use • WP2 AAI: Europe, Canada, Africa interoperability • WP3 Cohort Level Meta Data Representation • WP4 Federated research: Federated GWAS, Federated Genomic Analyses
  27. Sending Compute to Data… Globally? • Global data storage and analysis infrastructures required • Generating truly portable analysis workflows is complex - and we don’t have good solutions yet • Some high powered spacecraft still need building!
  28. Overcoming Data Gravity DEPENDS ON... • Costs of compute • Network bandwidth • Data sharing regulations • Data volumes
  29. “Cloud native” is the answer!
  30. Human Cell Atlas - profiling millions of human cells Global effort requiring: • Hundreds of labs • Organ-specific data • Disparate experimental techniques and data types Integrating data at this scale requires next generation technology and infrastructure
  31. Human Cell Atlas Data Coordination Platform: To bridge disparate data, tools and research from all over the world, we must bring them together in a public platform (the “HCA DCP”) that is: Comprehensive, Inclusive, Organized, Dynamic, Accessible. Image credit: Tom Deerinck, NIGMS, NIH
  32. How it works: the DCP data flow • Labs contribute single-cell data • DCP pipelines upload and process authors’ data • Researchers access data on the portal • Researchers find community tools to work with the data
  33. HCA DCP Architecture
  34. Outcomes: HCA DCP Data Browser statistics from Q3 2020, from a total of 2671 data access requests. Chart categories: Downloads (Metadata); Downloads (Raw and Analysed Data); Checkout to Terra (to work on in analysis platform)
  35. Lessons Learned: “Cloud native” engineering is not enough to change behaviour • The DCP adopted a heavily “cloud native” engineering approach • Services are somewhat traditional • Data archive (both raw and summary results) • Analysis pipeline • Engineered with cloud technology (has no impact on users) • All the data lives in AWS or GCP, in US-East (expensive to download) • Analysis platform available (but underused)
  36. Strategic Implications: Data Gravity in the life sciences tells us we need a culture change
  37. Strategic Implications: Data Gravity in the life sciences tells us we need a culture change. Federating data and analysis requires: 1. Standards 2. Data provider adoption 3. Data consumer adoption 4. Understanding and considering data gravity
  38. Strategic Implications: Data Gravity in the life sciences tells us we need a culture change. Federating data and analysis requires: 1. Standards 2. Data provider adoption 3. Data consumer adoption 4. Understanding and considering data gravity. SKILLS
  39. Strategic Implications: Data Gravity in the life sciences tells us we need a culture change. Federating data and analysis requires: 1. Standards 2. Data provider adoption 3. Data consumer adoption 4. Understanding and considering data gravity. SKILLS, INCENTIVES
  40. Strategic Implications: Data Gravity in the life sciences tells us we need a culture change. Federating data and analysis requires: 1. Standards 2. Data provider adoption 3. Data consumer adoption 4. Understanding and considering data gravity. SKILLS, INCENTIVES, COSTS
  41. FAIR as enabler for the digital transformation ● Data providers improve their own returns by implementing the FAIR Principles - gathering traction in big pharma ● FAIR enables powerful new AI analytics to access data for machine learning and prediction ● Requirements ○ financial, technical, training ● Challenges ○ change the culture, show business value, achieve the ‘FAIR enough’ ○ Sustain FAIR solutions and activities. Credit to: Ian Harrow, FAIR & OM projects. Slide credit: Susanna Sansone
  42. COVID-19 Data Portals: https://www.covid19dataportal.org/ https://covidhub.psnc.pl/ https://covid19dataportal.se/sv/ https://covid19dataportal.jp/
  43. Top Tips: Driving Data Consumer Adoption 1. Identify good measures of value • What can I do faster, cheaper, better? • How many people are using your cloud platform vs downloading data? 2. Start small and expand • Big re-engineering efforts are costly, risky, and too slow to keep up with the rate of change in the field 3. Find some exemplars • Are there smaller sets of data that are high value? • Can you pilot approaches within communities? 4. Invest in training and outreach • Even if data is federated and the cloud platform exists, many bioinformaticians do not have the skills to exploit them
  44. Data Gravity
  45. Data Gravity: $G = \frac{V R}{B C}$. “Let data gravity of a given dataset, G, be the product of data volume, V, and the regulatory restrictions of the region in which the data was generated, R, over the bandwidth at the location of the data, B, and the cost of compute in that location, C.”
  46. Acknowledgements: The AIT Team at EMBL-EBI
  47. Questions?
  48. Questions? Title: Data Gravity in the Life Sciences: Lessons learned from the Human Cell Atlas and other federated data projects Presenter: Tony Burdett Please write your questions in the questions window of the GoToWebinar application
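
The data gravity definition quoted on slides 9 and 45 can be turned into a small worked example. The sketch below reads the definition as G = (V × R) / (B × C). The slides give no units or scale for any of the terms, so the `Dataset` class, its field names, the unitless regulation score, and all of the numbers are assumptions made purely for illustration. They are not values from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical description of a dataset and the site that hosts it."""
    name: str
    volume: float        # V: data volume (illustrative units, e.g. TB)
    regulation: float    # R: assumed unitless score, 1.0 = freely shareable
    bandwidth: float     # B: bandwidth at the data's location (e.g. Gbit/s)
    compute_cost: float  # C: relative cost of compute at that location


def data_gravity(ds: Dataset) -> float:
    """Return G = (V * R) / (B * C), as read from the slide's definition."""
    if ds.bandwidth <= 0 or ds.compute_cost <= 0:
        raise ValueError("bandwidth and compute_cost must be positive")
    return (ds.volume * ds.regulation) / (ds.bandwidth * ds.compute_cost)


# Made-up values for illustration only; they are not from the talk.
open_reference = Dataset("open reference annotations", 10.0, 1.0, 10.0, 1.0)
controlled_cohort = Dataset("controlled-access cohort", 500.0, 5.0, 1.0, 1.0)

for ds in (open_reference, controlled_cohort):
    print(f"{ds.name}: G = {data_gravity(ds):.1f}")
```

Under these assumed inputs the large, tightly regulated cohort on a slow link scores far higher than the small open dataset. That matches the talk's argument that such data is better served by sending compute to it than by moving it. Only the relative ordering is meaningful here, since the scale of each term is arbitrary.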