dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09/2020

  1. Creating and Sustaining a FAIR Biomedical Data Ecosystem. Susan Gregurick, Ph.D., Associate Director for Data Science and Director, Office of Data Science Strategy. October 9, 2020
  2. Making Data FAIR • Findable: must have unique identifiers, effectively labeling it within searchable resources. • Accessible: must be easily retrievable via open systems and effective and secure authentication and authorization procedures. • Interoperable: should “use and speak the same language” via use of standardized vocabularies. • Reusable: must be adequately described to a new user, have clear information about data-usage licenses, and have a traceable “owner’s manual,” or provenance.
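The four FAIR properties on this slide can be read as a checklist over a dataset's metadata. The following is a minimal sketch of that idea, not an NIH tool: the `DatasetRecord` class, its field names, and the `fair_gaps` helper are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; the field names are illustrative only."""
    identifier: str = ""   # unique identifier, e.g. a DOI (Findable)
    access_url: str = ""   # where the data can be retrieved (Accessible)
    vocabulary: str = ""   # standardized vocabulary in use (Interoperable)
    license: str = ""      # data-usage license (Reusable)
    provenance: str = ""   # the "owner's manual" for the data (Reusable)


def fair_gaps(record: DatasetRecord) -> list[str]:
    """Return the FAIR properties for which the record lacks information."""
    checks = {
        "Findable": bool(record.identifier),
        "Accessible": bool(record.access_url),
        "Interoperable": bool(record.vocabulary),
        "Reusable": bool(record.license and record.provenance),
    }
    return [name for name, ok in checks.items() if not ok]


# Example with made-up values: license and provenance are missing.
record = DatasetRecord(
    identifier="doi:10.0000/example",
    access_url="https://example.org/datasets/1",
    vocabulary="MeSH",
)
print(fair_gaps(record))
```

A non-empty result only flags missing fields; it says nothing about whether the metadata that is present is accurate or useful.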
  3. Is this what FAIR data looks like…
  4. Or is this FAIR Data…
  5. NIH supports many different biomedical research communities with diverse sets of data
  6. The Rime of the Ancient Mariner, Samuel Taylor Coleridge (excerpted): Day after day, day after day, / We stuck, nor breath nor motion; / As idle as a painted ship / Upon a painted ocean. / Water, water, every where, / And all the boards did shrink; / Water, water, every where, / Nor any drop to drink.
  7. This proliferation of data, and the accompanying computing resources and new algorithms, brings new opportunities for discovery, as well as new challenges
  8. • Journal articles could link to repository data sets • Metadata were computable so that a search for similar datasets was possible • Analysis tools were linked to datasets, via GitHub, Bioconductor, Galaxy, or others
  9. NIDDK. The mission of the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) is to conduct and support research on diabetes and other endocrine and metabolic diseases; digestive diseases, nutritional disorders, and obesity; and kidney, urologic, and hematologic diseases, to improve health and quality of life. NIDDK supports research studies across a wide variety of disease areas and in turn supports a variety of platforms to house and manage the data they each generate. These studies utilize a spectrum of modern experimental techniques, generating different modalities of data about the patient and their disease state. Collecting, integrating, and working with all this data presents a variety of challenges. Challenges: • New consortia would like to share or reuse existing data platforms rather than having to create them from scratch • Integrating data from the same patient across different studies currently requires significant manual effort • Supporting analysis and visualization tools for imaging data being produced by various projects. Image from
  10. Integration of GUDMAP expression data with GTEx eQTLs. User story: As a renal disease researcher, I want to combine gene data from GUDMAP with eQTL data from the Common Fund GTEx resource in order to investigate variants involved in regulating renal gene expression. Core Motivations: ● GUDMAP contains gene expression data across various parts of the kidney and urogenital system. ● The GTEx database contains expression QTL (eQTL) data correlating gene expression with specific genomic variants. ● Integrating GUDMAP data with GTEx may lead to insights into gene regulation in kidney development and renal disease. Potential Data Sources: ● NIDDK GUDMAP Genitourinary data repository ● Common Fund GTEx gene expression database. ResearchScientist Icon made by Roundicons from
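The integration described in this use case is, at its simplest, a join on gene between an expression table and an eQTL table. The sketch below shows only that join with made-up rows. The record layout, the gene and variant names, and the function `join_expression_with_eqtls` are assumptions for illustration; they do not reflect the actual GUDMAP or GTEx schemas or APIs.

```python
def join_expression_with_eqtls(expression_rows, eqtl_rows):
    """Attach to each expression row the variants reported as eQTLs for its gene.

    expression_rows: list of dicts with at least a "gene" key (hypothetical layout)
    eqtl_rows:       list of dicts with "gene" and "variant" keys (hypothetical layout)
    """
    variants_by_gene = {}
    for row in eqtl_rows:
        variants_by_gene.setdefault(row["gene"], []).append(row["variant"])
    return [
        {**row, "variants": variants_by_gene.get(row["gene"], [])}
        for row in expression_rows
    ]


# Made-up example data; none of these values come from GUDMAP or GTEx.
expression = [
    {"gene": "GENE_A", "tissue": "kidney cortex", "level": 12.5},
    {"gene": "GENE_B", "tissue": "ureter", "level": 3.1},
]
eqtls = [
    {"gene": "GENE_A", "variant": "variant_1"},
    {"gene": "GENE_A", "variant": "variant_2"},
]
for row in join_expression_with_eqtls(expression, eqtls):
    print(row["gene"], row["tissue"], row["variants"])
```

In practice the hard part is upstream of this join: agreeing on gene identifiers and tissue vocabularies across the two resources, which is the interoperability problem the talk is about.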
  11. Data integration within the TEDDY T1D platform. User story: As a T1 diabetes researcher, I want to combine data across TEDDY releases in order to bring together all the different modalities of data collected from the same subject. Core Motivations: ● Data submitted to TEDDY at different times and locations are independent data releases with different subject identifiers per release. ● The same subject will likely have data spread across multiple data releases. ● Recombining this data is a very manual process; having an integrated data environment would simplify this significantly. Potential Data Sources: ● Genomics ● Epigenomics ● Transcriptomics ● Proteomics ● Metabolomics. ResearchScientist Icon made by Roundicons from
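The recombination problem on this slide (the same subject appearing under a different identifier in each release) can be illustrated with an identifier crosswalk. This is a sketch under stated assumptions: the crosswalk table, the release names, the identifiers, and the function `combine_releases` are hypothetical, and the slide does not say how TEDDY actually maps identifiers.

```python
from collections import defaultdict


def combine_releases(releases, crosswalk):
    """Group per-release records under one stable subject key.

    releases:  {release_name: {release_local_id: {modality: value}}}
    crosswalk: {(release_name, release_local_id): subject_key}
    Returns (combined, unmatched), where unmatched lists the records
    that have no crosswalk entry and were therefore left out.
    """
    combined = defaultdict(dict)
    unmatched = []
    for release_name, records in releases.items():
        for local_id, modalities in records.items():
            subject = crosswalk.get((release_name, local_id))
            if subject is None:
                unmatched.append((release_name, local_id))
                continue
            combined[subject].update(modalities)
    return dict(combined), unmatched


# Made-up example: one subject seen in two releases under different IDs.
releases = {
    "release_1": {"A17": {"genomics": "g.vcf"}},
    "release_2": {"X03": {"metabolomics": "m.csv"}, "X99": {"proteomics": "p.csv"}},
}
crosswalk = {("release_1", "A17"): "subject_1", ("release_2", "X03"): "subject_1"}
combined, unmatched = combine_releases(releases, crosswalk)
print(combined)
print(unmatched)
```

Returning the unmatched records explicitly, rather than dropping them silently, keeps the manual review step visible for identifiers the crosswalk does not cover.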
  12. This is the promise of the NIH Strategic Plan for Data Science …and here’s how we will get there.
  13. [Chart: Percentage of NIH Supported PMC publications with data availability statement, by year, 2010 to 2019; y-axis: percentage, 0% to 75%]
  14. NIH Data Management and Sharing Policy Development • Researchers with NIH-funded or conducted research projects resulting in the generation of scientific data will be required to submit a Plan • Plans should explain how scientific data generated by a research study will be managed and which of these scientific data will be shared • Community Input Solicited: 189 submissions from national and international stakeholders • Identified need for appropriate infrastructure: policy and implementation to go ‘hand-in-hand’ • Develop draft policy for data management and sharing and related guidance • Released draft for community input • Release final policy (2020)
  15. Overview of Sharing Publication and Related Data: Options of scaled implementation for sharing datasets. NIH strongly encourages open access Data Sharing Repositories as a first choice. Datasets up to 2 gigabytes (PubMed Central): • PMC stores publication-related supplemental materials and datasets directly associated with publications. Up to 2 GB. • Generate Unique Identifiers for the stored supplementary materials and datasets. Datasets up to 20* gigabytes (use of commercial and non-profit repositories): • Assign Unique Identifiers to datasets associated with publications and link to PubMed. • Store and manage datasets associated with publication, up to 20* GB. High Priority Datasets, petabytes (STRIDES Cloud Partners): • Store and manage large scale, high priority NIH datasets. (Partnership with STRIDES) • Assign Unique Identifiers, implement authentication, authorization and access control.
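The size tiers on this slide can be read as a simple decision rule. The sketch below encodes one reading of the slide: up to 2 GB goes to PubMed Central, up to 20 GB to commercial or non-profit repositories, and high-priority datasets to STRIDES cloud partners. The function name, the use of decimal gigabytes, and the handling of datasets that fit no listed tier are assumptions, and the asterisk on the 20 GB figure is not explained in the transcript.

```python
GB = 10**9  # decimal gigabyte; the slide does not say which convention applies


def suggest_tier(size_bytes: int, high_priority: bool = False) -> str:
    """Map a dataset to one of the three tiers listed on the slide.

    This is an illustrative reading of the slide, not NIH policy.
    """
    if high_priority:
        return "STRIDES cloud partners"
    if size_bytes <= 2 * GB:
        return "PubMed Central"
    if size_bytes <= 20 * GB:
        return "commercial or non-profit repository"
    # The slide lists no tier for large datasets that are not high priority.
    return "not covered by the tiers on the slide"


for size in (500 * 10**6, 5 * GB, 50 * GB):
    print(size, "->", suggest_tier(size))
```

The slide also states that open access data sharing repositories are the preferred first choice, so a rule like this would apply only after that option has been considered.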
  16. NIH supports many repositories for biomedical data sharing (for example, AphasiaBank). • PMC stores publication-related supplemental materials and datasets directly associated with publications. Up to 2 GB. • Generate Unique Identifiers for the stored supplementary materials and datasets. Use of commercial and non-profit repositories. STRIDES Cloud Partners • Store and manage large scale, high priority NIH datasets. (Partnership with STRIDES) • Assign Unique Identifiers, implement authentication, authorization and access control. PubMed Central • Assign Unique Identifiers to datasets associated with publications and link to PubMed. • Store and manage datasets associated with publication, up to 20* GB.
  17. How to Find Data Repositories • BMIC Data Repository Listing ring_repositories.html • SciCrunch/dkNET • Organized by repository type and scientific area. repositories • FAIRsharing • DataMed
  18. Optimized Funding for NIH Data Repositories and Knowledgebases • Data resources are important research tools • Historically funded through research grants • Funding mechanism should be optimal for type of resource • End goal: researcher confident in data and information integrity • Solution: New Funding Announcement for data repositories and knowledgebases • Resource plan requirement: Scientific Impact; Community Engagement; Quality of Data and Services and Efficiency of Operations; Governance
  19. Optimized Funding for NIH Data Repositories and Knowledgebases: Funding Opportunities • NIH released two funding opportunities on Jan. 17 to support biomedical data repositories and knowledgebases: • Biomedical Data Repository (PAR-20-089) • Biomedical Knowledgebase (PAR-20-097). Review areas: Scientific Impact; Community Engagement; Quality of Data and Services and Efficiency of Operations; Governance
  20. Piloting a FAIR Generalist Repository Using Figshare Existing Figshare features Pilot-specific features
  21. Repository contains data funded by 21 different NIH ICOs NCATS NCCIH
  22. NIH Figshare Pilot: Key Takeaways • Generalist repositories are growing: more researchers are depositing data and more publications are linking to generalist repositories. • Researchers need more education and guidance on where to publish data and how to describe datasets in metadata fields effectively. • Metadata enhancement enables greater discoverability: metrics indicate greater access, but a longer time scale is needed to observe data reuse.
  23. Guiding researchers on better metadata to enhance data discoverability graphic credit: Ontotext
  24. Harnessing the power of the cloud
  25. NIH is Harnessing the Power of the Cloud for Biomedical Research • Cloud computing offers multiple opportunities NIH can leverage to advance biomedical research, including: • Computation on biomedical data at an unprecedented scale • Broad access to cutting-edge cloud technology with, for example, industry-leading security tools • Storage of large, diverse data in a way that enables easier sharing, access, and reuse of data with other researchers • A community-driven approach to data science that breaks down disciplinary silos • Adopt and develop cloud-based tools from industry or academia for biomedical research
  26. Turning Research Data Into Knowledge and Discovery. The Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative • State-of-the-art data storage and computational capabilities • Training and education for researchers • Innovative technologies such as artificial intelligence and machine learning. Partnerships with and other commercial providers
  27. STRIDES by the numbers* • 17 NIH ICs • 37 extramural institutions • 279 programs/projects • >2400 people trained • $9M cost savings to participating ICs • $51.5M / $18M obligated by NIH / expended to date • 30M compute hours • 80 petabytes stored. *as of 8/31
  29. Moving Data to the Cloud for Large-Scale Analysis 36.4 PB of public and controlled-access Sequence Read Archive data in two clouds (GCP & AWS)
  30. Benefits of the Cloud for Large-Scale Analysis: "We can now do this in 3-4 days instead of 12+ months directly as a result of the SRA data being available in the cloud. This means we can share this data with the CoV researchers today, when it can make a difference, not a year from now. This is important for COVID-19 now, and will be important in response to the next pandemic." – Artem Babaian, Lead Developer at Serratus and corresponding author for publication, “Petabase-scale sequence alignment catalyses viral discovery”
  31. Enhancing Software Tools for Open Science
  32. Supplements to Enhance Software Tools for Open Science (NOT-OD-20-073) • Enhance software engineering of valuable scientific tools: new collaborations between biomedical researchers and software engineers • Make research tools “cloud-ready”: working with STRIDES Initiative is encouraged but not required
  33. Topics Funded Across 12 Institutes and Centers FHIR Clinical Cloud Commons Biomolecular Simulation Biophysics Genomics Imaging Neuroimaging
  34. Advancing Artificial Intelligence (AI)
  35. AI in Biomedicine: Opportunities • Extract medical information from text in EMRs/EHRs • Interpret genomic sequence data to understand impact of mutations on protein function • Read medical images and help diagnose diseases like pneumonia and cancer • Monitor sleep and vitals to send information about health at home to doctors • Determine which calls to child welfare systems warrant deployment of family support and prevention resources to protect at-risk children. Examples from Katabi, Ng, Putnam-Hornstein, Troyanskaya, and others
  36. NIH NVIDIA COVID CT-AI. [Pipeline diagram: Preprocessing (conversion of DICOMs to NIfTI with 1x1x1 resampling); Segmentation (AH-Net architecture, producing a lung segmentation mask); Apply Mask; Image Classification (3D-DenseNet-121): Likely COVID vs. Unlikely COVID.] Baris Turkbey, Sheng Xu, Tom Sanford, Stephanie Harmon, Mona Flores, Daguang Xu, Xiasong Wang, Ziyue Xu, Holger Roth, Dong Yang, Evrim Turkbey, Mike Kassin, Maxime Blain, Brad Wood. CT images have been used in Asia to detect COVID-19 virus in patients
  37. New Common Fund Initiative: Artificial Intelligence for BiomedicaL Excellence (AIBLE). May 15, 2020: NIH Council of Councils • AI Concept Clearance (start at 1:25min) • NIH Artificial Intelligence Working Group Final Report
  38. Recommendations. Themes: data (collection, analysis, reuse); people (attract, train, convene); ethics/ELSI (accountable, informed, representative). R1: flagship data generation efforts. R2: criteria for ML-friendly datasets. R3: “datasheets” and “model cards”. R4: consent and data access standards. R5: ethical principles for ML in biomedicine. R6: curricula for ML-BioMed experts. R7: ML-focused trainees and fellows. R8: convene cross-disciplinary collaborators.
  39. Support flagship data generation efforts to propel progress by the scientific community. Support flagship efforts that generate large-scale experimental data, with billions of data points, designed to: i. be well-suited for ML analysis and inference; ii. address key biomedical challenges; iii. stimulate new approaches in machine learning. And that implement processes designed to: i. develop improved criteria and technical mechanisms for data access; ii. strengthen ethical criteria for dataset use (consent, privacy, accountability, ...). Projects should: ▪ address key biomedical challenges using ML methods ▪ advance ML methods for future use in biomedicine ▪ produce transformative data sets, designed with ML in mind ▪ propel new ways to gather massive data in biomedicine ▪ involve strong engagement from leading ML researchers. Project review should: ▪ incorporate expertise in ML as well as traditional biomedical domains. [Graphic labels: data, ethics, people]
  40. Develop and publish criteria for ML-friendly datasets. Publish criteria for evaluating datasets based on their value for ML-based analysis. ▪ what makes a dataset most useful for ML-based analysis? ▪ what attributes are and aren’t addressed by existing datasets? ▪ start as guidelines; within two years recommend a subset as requirements. Examples of potential criteria: ▪ clear provenance: as much metadata as possible, to detect & correct for batch effects ▪ well-described data: what does each variable mean? what’s the distribution of values? ▪ accessible data: flexible data access policy, reasonable data access process ▪ large sample size: to allow training (and evaluation) without overfitting ▪ multimodal data: to study complex systems from multiple perspectives ▪ perturbation data: includes outcomes (“outputs”) as well as measurements (“inputs”) ▪ longitudinal data: to allow modeling and prediction of progression ▪ active learning: data grows over time, incorporates new data-gathering techniques, and uses ML-based analysis of existing data to inform future data generation. [Graphic labels: data, ethics, people]
  41. Design and apply “datasheets” and “model cards” for biomedical ML. • Develop and publish best practices for: • “datasheets” that describe & evaluate training datasets • “model cards” that do the same for generated models • Test the best practices in the real world: • build after-the-fact examples for existing datasets • apply to new datasets, and update the best practices • Once best practices have been updated: • require datasheets and model cards for all NIH extramural grant applications and NIH intramural projects that involve ML research • encourage journals to do the same for paper submission and publication. Potential datasheet best practices: • demographics and UBR characteristics • privacy, consent, and copyright issues • known blind spots, which could otherwise create hidden biases. Potential model card best practices: • what training data was used • how training and validation were done • known limitations on applicability • intended use, and potential harms of inappropriate use. [Graphic labels: data, ethics, people]
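The model card items listed on this slide map naturally onto a small structured record. The sketch below is one hypothetical way to capture them and to flag sections left empty. The `ModelCard` class, its field names, and `missing_sections` are illustrative assumptions, not a format defined by NIH or by the working group.

```python
from dataclasses import asdict, dataclass


@dataclass
class ModelCard:
    """Hypothetical model card with one field per item named on the slide."""
    training_data: str = ""            # what training data was used
    training_and_validation: str = ""  # how training and validation were done
    known_limitations: str = ""        # known limitations on applicability
    intended_use: str = ""             # intended use
    potential_harms: str = ""          # potential harms of inappropriate use


def missing_sections(card: ModelCard) -> list:
    """Return the names of sections that are empty or whitespace only."""
    return [name for name, text in asdict(card).items() if not text.strip()]


# Made-up example card with two sections still to be written.
card = ModelCard(
    training_data="Chest CT scans from three hospitals (example text)",
    training_and_validation="80/20 split by patient (example text)",
    intended_use="Research use only (example text)",
)
print(missing_sections(card))
```

A completeness check like this is only a first gate; the slide's point about testing best practices on real datasets concerns whether the filled-in content is informative.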
  43. New Partnerships in Data Science and AI
  44. Smart and Connected Health (SCH) Accelerate innovations in computer and information science and engineering to support the transformation of health and medicine
  45. Smart Health & Data Science Research Areas. Information Infrastructure: • Tools for interoperable, distributed, federated, & scalable digital infrastructure • Novel ontological systems and knowledge representation approaches • Methods for data integrity, provenance, security, privacy and reliability. Transformative Data Science: • Computational tools for fusion and analysis of multi-level and -scale data • Knowledge representations, visualizations and reasoning algorithms • Approaches for combining AI learning with mechanistic modeling • Unstructured data interpretation. Novel Multimodal Sensor System Hardware: • Design & fabrication of novel multimodal sensor systems • Synthesis of new biorecognition elements. Effective Usability: • New approaches to support individuals to effectively participate in their own health • User-tailored and context-aware interfaces to reduce burden and increase autonomy • Develop new methods for context-dependent selection, presentation and use of data. Automating Health: • Closed-loop or human-in-the-loop systems • Technology platforms for optimizing delivery of health interventions • Simulation and modeling methods and software tools. Medical Data Interpretation: • Modeling on-visual context information and perception of complex images • Methods to exploit experts’ implicit knowledge to improve perceptual decision making • Develop models of how experts respond to changes in cognitive factors
  46. Let’s create a bright future
  47. Coding it Forward • Student-led non-profit places tech-savvy students in federal agencies • 16 students placed in admin or funding offices across 11 host institutes, centers, offices (ICOs) for a 10-week summer program • 2 students extended until the start of school, 1 hired as contractor • 24 students will start a fall fellowship across 14 host ICOs
  48. NIH Data and Technology Advancement (DATA) National Service Scholar Program. 8 Scholars will, in 2021: • Catalyze neuroscience research • Unravel the Alzheimer’s Disease Genome • Support cancer knowledge extraction • Accelerate the clinical adoption of machine intelligence applications in medical imaging • Harness data science for health discovery and innovation in Africa • Expand theories of brain circuits • Integrate NIH cloud-based platforms for genomics research • Architect search across petabyte-scale data
  49. Strategic Plan for Data Science: Goals and Objectives. Data Infrastructure: • Optimize data storage and security • Connect NIH data systems. Modernized Data Ecosystem: • Modernize data repository ecosystems • Support storage and sharing of individual datasets • Better integrate clinical and observational data into biomedical data science. Data Management, Analytics, and Tools: • Support useful, generalizable, and accessible tools • Broaden utility of, and access to, specialized tools • Improve discovery and cataloging resources. Workforce Development: • Enhance the NIH data science workforce • Expand the national research workforce • Engage a broader community. Stewardship and Sustainability: • Develop policies for a FAIR data ecosystem • Enhance stewardship
  50. Data Science to Address COVID-19
  51. We’re putting COVID-19 data into repositories and platforms so the data will be USED by researchers!
  52. What could researchers do with these data? • Better understand transmission and infectivity • Evaluate Treatments & Interventions • Predict Long-term Sequelae • Link Social Determinants of Health with COVID-19 related data and exposures • Examine the impact on Child & Maternal Health • Resolve Technical & Implementation issues
  53. COVID Clinical Platforms • Increasing the amount and quality of EHR data related to individuals with COVID-19 • Pilot a new enrollment partner model to efficiently target recruitment in expanded regions of the country and collect EHR data from proven partners • Rapidly collect EHR-derived clinical, lab, and imaging data from hospitals and health plans at the peak of the pandemic and as it evolves • Develop a robust, flexible collaborative analytics infrastructure to enable a high frequency response to COVID-19 and the next emerging threats • Include data from underserved populations, roughly 9.3M unique patients • PETAL’s ORCHID Trial & PETAL’s CORAL registry: o RED CORAL: observational study of retrospective review of data collected on hospitalized patients with COVID-19 o BLUE CORAL: a multicenter prospective observational study designed to collect comprehensive data on hospitalized patients with COVID-19. This study will gather imaging, biospecimens, and long-term outcomes.
  54. Interoperability Across Clinical COVID Serving Data Platforms. [Diagram: Federated Data Platforms (MIDRC, RADx, NICHD, NIA, etc.; others TBD*), each exposing an API. Layers: Policy; Resources; Workbenches/Tools; Information Systems; Data Discovery. Diagram elements: Research Authentication System, Honest Broker, Hash, CDE.] Interoperable Elements: • Standards for Interoperability: FHIR to map and move data • Data Discovery across Platforms: examples include GA4GH FASP, PIC-SURE • Research Authentication System • Data Linkage Across Systems: Honest Broker
  55. Researcher Workflows Before Researcher Authentication Services (RAS). [Diagram: Platform 1, Platform 2, and a Cloud-based Analysis Tool; workflow steps LOGIN (5), SEARCH/SELECT, ACCESS, COMPUTE, SHARE, with logins numbered 1 to 5.] Researchers log in and/or give consent at least 5 times for each workflow in the Phase 1 interoperability use cases
  56. Researcher Workflows After RAS August Deploy. Authentication and Authorization provided by a central NIH service. Auth tokens move with the user as they navigate to any of the four Phase 1 Data Platforms so that the researcher only logs in one time to RAS. Workflow: LOGIN (1), SEARCH/SELECT, ACCESS, COMPUTE, SHARE. AuthN: ID Token: who you are. AuthZ: Passport and Visa: which dbGaP studies/consent groups you are authorized to access and your role. Before provisioning data, the platform validates the passport/visa by calling RAS, so access information is always up to date within the last 30 minutes.
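The 30-minute freshness guarantee on this slide can be sketched as a cache that re-validates a passport before use once the last validation is too old. This is an illustration of the idea only: `VisaCache` and the injected `validate` callable are hypothetical stand-ins, the study identifier is made up, and nothing here reflects the real RAS interfaces.

```python
import time

MAX_AGE_SECONDS = 30 * 60  # the slide's "within the last 30 minutes"


class VisaCache:
    """Cache visas per passport and re-validate once they are 30 minutes old."""

    def __init__(self, validate, clock=time.monotonic):
        # validate: hypothetical callable standing in for the platform's call
        # to RAS; it takes a passport and returns the studies it grants.
        self._validate = validate
        self._clock = clock
        self._entries = {}  # passport -> (validated_at, frozenset of studies)

    def visas_for(self, passport):
        now = self._clock()
        entry = self._entries.get(passport)
        if entry is None or now - entry[0] >= MAX_AGE_SECONDS:
            entry = (now, frozenset(self._validate(passport)))
            self._entries[passport] = entry
        return entry[1]

    def may_access(self, passport, study):
        """Check access just before provisioning data."""
        return study in self.visas_for(passport)


# Example with a fake validator and a made-up study identifier.
cache = VisaCache(lambda passport: {"study-A.c1"})
print(cache.may_access("example-passport", "study-A.c1"))
print(cache.may_access("example-passport", "study-B.c1"))
```

Injecting the clock and the validator keeps the freshness rule testable without a network call; a stricter design would validate on every provisioning request and skip the cache entirely.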
  57. Privacy-Preserving Tokens: Patient Care, Tokenization, De-Duplication and Linkages. Example: John Smith (03/27/1945, Male) • Admitted to N3C Hospital • Participates in Clinical Studies • Lives in a Senior Living Facility. [Diagram: each source (N3C Sites, NIH Clinical Studies, Senior Living EHR) tokenizes its records and outputs de-id tokens, so John Smith appears as Patient 123, Patient 456, and Patient 789. The N3C Linkage Honest Broker matches and de-duplicates Patient 123, Patient 456, and Patient 789 into one record (007).] De-identified ‘Rosetta Stone’ process that unifies records
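The tokenize, match, and de-duplicate flow on this slide can be illustrated with a keyed hash over normalized identifying fields, followed by grouping on the resulting token. This is a minimal sketch, not the N3C method: the slide does not describe the tokenization algorithm, and the choice of HMAC-SHA256, the normalization rule, the shared key, and the function names are all assumptions.

```python
import hashlib
import hmac
from collections import defaultdict


def tokenize(name: str, birth_date: str, sex: str, secret_key: bytes) -> str:
    """Derive a de-identified token from normalized identifying fields.

    Assumes every site uses the same key and the same normalization.
    """
    normalized = "|".join(part.strip().lower() for part in (name, birth_date, sex))
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_records(records):
    """Group (source, local_patient_id, token) triples by token.

    The broker sees only tokens and local IDs, never names or birth dates.
    """
    linked = defaultdict(list)
    for source, local_id, token in records:
        linked[token].append((source, local_id))
    return dict(linked)


key = b"example-shared-key"  # placeholder; real key management is out of scope
records = [
    ("N3C site", "Patient 123", tokenize("John Smith", "03/27/1945", "Male", key)),
    ("Clinical study", "Patient 456", tokenize("john smith ", "03/27/1945", "male", key)),
    ("Senior living EHR", "Patient 789", tokenize("John Smith", "03/27/1945", "Male", key)),
]
for token, members in link_records(records).items():
    print(token[:12], members)
```

Exact-match tokens like these break on typos and name variants, so production record linkage typically uses multiple tokens per person and more careful normalization than shown here.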
  58. VISION: a modernized, integrated, FAIR biomedical data ecosystem
  59. NIH staff who deserve all the credit • STRIDES: Andrea Norris, Nick Weber and NMDS team, and Fenglou Mao • Connecting NIH Data Resources: Regina Bures, Ishwar Chandramouliswaran, Tanja Davidsen, Valentina Di Francesco, Jeff Erickson, Tram Huyen, Rebecca Rosen, Steve Sherry, Alastair Thomson, Greg Farber, Dylan Klomparens, Charles Schmitt, Susan Wright, Ken Wiley, Kristofor Langlais, James Coulombe, Lora Kutkat, Nick Weber, Allen Dearry • Data Repository and Knowledgebase Resources: Kim Pruitt, Valerie Florance, Valentina Di Francesco, Ajay Pillai, Qi Duan, Dawei Lin, Christine Colvis, Jennie Larkin, Ravi Ravichandran, and James Coulombe • FHIR Pilots: Teresa Zayas-Caban, Denise Warzel, Kerry Goetz, Ken Wiley, Alison Cernick, Kenneth Wilkins, Carolina Mendoza-Puccini, Matt McAuliffe, and Belinda Seto • Criteria for Open Access Data Sharing Repositories: Mike Huerta, Dawei Lin, Maryam Zaringhalam, Lisa Federer and BMIC Team • Pilot for Scaled Implementation for Sharing Datasets: Ishwar Chandramouliswaran, Lisa Federer, Maryam Zaringhalam, and Jennie Larkin • Software Sustainability: Heidi Sofia, Ishwar Chandramouliswaran, Mike Conway, Tony Kirilusha, Xujing Wang, Andrew Weitz, Todd Merchak, Allissa Dillman and Jess Mazerik • Smart and Connected Health: Haluk Resat, Dana Wolff-Hughes, Partha Bhattacharyya, Fenglou Mao • Coding-it-Forward Fellows Summer Program & DATA Scholars Program: Jess Mazerik, Wynn Meyer
  60. Office of Data Science Strategy: A modernized, integrated, FAIR biomedical data ecosystem. @NIHDataScience /NIH.DataScience