
No Free Lunch: Metadata in the life sciences

This presentation covers some challenges and makes suggestions to support the work of creating flexible, interoperable data systems for the life sciences.

Published in: Science

  1. Lightweight, Practical, Cross-domain Metadata. March 2020: Bio-IT World West. Chris Dwan (chris@dwan.org), https://dwan.org, @fdmts
  2. Conclusions. Organizing data is a human practice, not a technology choice: there is no free lunch. Start simple, with free technologies, and quick wins: NoSQL databases with headers and checksums; plan to invest in infrastructure about 18 months into the journey; don't start with the whole genomes. Good policy makes simple practice: make data somebody's job; enterprise data management has much to teach us.
  3. Geek Cred: My First Petabyte, 2008
  4. NIH circa 2008
  5. "Gene expression data … are meaningful only in the context of a detailed description of the conditions under which they were generated … including the particular state of the living system under study … and the perturbations to which it has been subjected." State of the art: 2001
  6. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. State of the art: 2019
  7. 2016: Data Quality Matters. Ask a computational biologist or data scientist what fraction of their time is spent fighting data quality, formatting, and similar issues.
  8. State of the Art: 2020. A miscommunication between the wet lab and the bioinformatics group resulted in an "embarrassing miscommunication" … as reported in the press.
  9. Genomic data production in context: genomic data production @ Broad. I did research computing at Broad from 2014 to 2017.
  10. ExAC / gnomAD: A powerful example. ExAC (the Exome Aggregation Consortium) and gnomAD (the Genome Aggregation Database) represent vast amounts of work to harmonize both phenotype and consent.
  11. The IT Services Perspective. Filesystem metadata: file attributes (size, format, creation and modification times); permissions (ownership, access control lists); access and usage patterns (which files are accessed, and by whom); compressibility and deduplication (lies; optimistic projections by vendors). One function of a research computing team is to bridge the gap between data storage (usable capacity as provided by enterprise IT) and data services (semantically usable data).
  12. The filesystem / directory tree is your default metadata database. It is what your team is using today. Any proposal less functional than descriptive filenames will fail.
  13. "If you have four groups working on a compiler, you'll get a four-pass compiler." Eric S. Raymond, The New Hacker's Dictionary, 1996
  14. Most primary data files (in bioinformatics) include valuable metadata, usually in the header. These can be quite verbose.
  15. Bioinformatics as a discipline is filled with duplication. "hg19" here is identical to "build 37" from the previous slide.
  16. Many file headers include the command line and parameters that were used to generate the file. This is the default method for storing experimental "provenance."
  17. 2002: Parsers are a pretty awful solution
  18. Container technology (Docker / Singularity) revolutionized software deployment. Instead of installers and configurators, we ship a whole operating system with the app pre-installed. We do not have a similar solution for experimental metadata: we have not found a way to package and ship domain experts and researchers. We don't have containers for metadata.
  19. NoSQL is a delightful prototyping tool. NoSQL and schema-flexible databases (e.g., MongoDB, or PostgreSQL with JSON columns) do not require a fully defined schema. You gain flexibility at the cost of consistency and possibly performance. This makes them ideal for prototyping. "Plan to throw one away; you will, anyhow." Fred Brooks, The Mythical Man-Month, 1975
  20. What goes in the NoSQL? A unique key to identify the file; the path to where it is stored; a checksum (to find duplicates later); the header (scrape and store wholesale); and whatever else the lab said was important, perhaps the column headers from that spreadsheet they use.
  21. Capture metadata at the time of creation. Metadata needs to be captured at the point of data creation. This amounts to putting more work on staff who are likely already overburdened and time-conscious. Very little of the benefit of rigorous data processes will be felt in the lab, at least at first.
  22. The Data Tzar. Candidate titles: Data Tzar (clearly empowered; the title sparks curiosity); Data Janitor (implies data are trash; a low-prestige job); Data Monkey (disrespectful, vaguely racist); Chief Data Officer (they mostly seem to work on licensing).
  23. The Data Tzar: Day 1. Engage with data generators: build tools to make their lives easier (dashboards, alerting systems, backups, routine analysis, QC checks); go "breadth first" across the enterprise; do not start with the whole genomes. Sneakily harvest metadata. Necessary resources on day 1: one or two early-career bioinformatics programmers; access to an infrastructure engineer; a modest budget on your cloud provider of choice.
  24. The Data Tzar: First Year. Do a lot of favors and build a lot of Shiny apps. Convene working groups around specific types of data. Create cross-cutting dashboards for leadership. Make friends with the heads of information security and compliance. Prepare a budget proposal.
  25. FAIR Data (within the enterprise). Findable: a NoSQL database of metadata and checksums; it's plenty for a good long time. Accessible: federated identity management; an architecture of S3 buckets and production "roles." Interoperable: "It's much easier to go FAR than to go FAIR." Reusable: data standards, ontologies, and a strong policy framework, including electronic consents for human subjects data.
  26. The Clinical Data Ecosystem: incredible opportunities here, and rapidly developing data silos. There is an incredible wealth of data available to support both clinical care and research. Unfortunately, it is carved up and isolated. The phrase I hear most frequently from hospital CIOs: "no upside." Data sources include electronic medical records (the possibility of a self-normal, N of 1, over time); clinical notes (natural language processing has strong potential); hospital telemetry (innovations in the basics of clinical observation); primary lab data (pressure to avoid incidental findings; prevent bias); diagnostic imaging; patient journals and consumer products; and longitudinal data from other providers.
  27. Appropriate Use and Consent. "We should be up front with participants that we can't protect their privacy completely, and we should ensure that the most appropriate legislation is in place to protect participants from being exploited in any way." Eric Schadt, CEO, Sema4
  28. Policies and Governance. Appropriate usage: a human-readable document setting expectations of privacy and standards of behavior. Data classification: a governance document defining the major categories of data (corporate sensitive, clinical, …) and standards for handling each. Written Information Security Policy (WISP): a technical document defining how systems must be configured to protect sensitive data and operations. Vendor qualification: a business SOP establishing practices for how vendor access and systems should be managed.
  29. Conclusions. Organizing data is a human practice, not a technology choice: there is no free lunch. Start simple, with free technologies, and quick wins: NoSQL databases with headers and checksums; plan to invest in infrastructure about 18 months into the journey; don't start with the whole genomes. Good policy makes simple practice: make data somebody's job; enterprise data management has much to teach us.
  30. Thank you! Chris Dwan (chris@dwan.org) https://dwan.org @fdmts
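
Slide 11's "filesystem metadata" can be made concrete with a small sketch using only the Python standard library: everything below is what storage-layer IT can see about a file, and nothing about what the data means. The function name is my own; the path used in testing is a temporary file.

```python
import os
import stat
import time

def filesystem_metadata(path):
    """Collect the attributes the filesystem alone can report: size,
    timestamps, and permissions. Note there is nothing here about the
    experiment, sample, or consent -- that is the gap between data
    storage and data services."""
    st = os.stat(path)
    return {
        "path": path,
        "size_bytes": st.st_size,
        "modified": time.ctime(st.st_mtime),
        "mode": stat.filemode(st.st_mode),  # e.g. "-rw-r--r--"
        "owner_uid": st.st_uid,
    }
```

Access and usage patterns (who reads which files) require audit logging beyond a plain `stat` call, which is one reason the storage layer's view stays so thin.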
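
The "hg19" vs. "build 37" duplication from slide 15 is often handled pragmatically with a small normalization table. This is an illustrative sketch, not a complete alias registry; the table and function names are my own.

```python
# Partial alias table mapping free-text reference-genome labels onto one
# canonical name (hg19 is the UCSC name for GRCh37). Illustrative only.
GENOME_ALIASES = {
    "hg19": "GRCh37",
    "b37": "GRCh37",
    "build 37": "GRCh37",
    "grch37": "GRCh37",
    "hg38": "GRCh38",
    "grch38": "GRCh38",
}

def normalize_build(name):
    """Return the canonical build name, passing unknown labels through
    unchanged so nothing is silently lost."""
    return GENOME_ALIASES.get(name.strip().lower(), name)
```

Passing unknown labels through unchanged keeps the normalizer safe to run over messy metadata: it only ever collapses known synonyms.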
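
The "quick win" recipe from slides 14, 16, 19, and 20 (scrape the header wholesale, checksum the file, store a small document per file) can be sketched as below. The in-memory dict stands in for a NoSQL collection, and the VCF-style header text and file paths are hypothetical examples, not real data.

```python
import hashlib
import uuid

def scrape_header(text):
    """Keep the '##' meta-information lines wholesale -- including any
    embedded command line, which is often the only provenance record."""
    return [line for line in text.splitlines() if line.startswith("##")]

def make_record(path, text):
    """Build the per-file document from slide 20: unique key, path,
    checksum (to find duplicates later), and the scraped header."""
    return {
        "id": str(uuid.uuid4()),
        "path": path,
        "sha256": hashlib.sha256(text.encode()).hexdigest(),
        "header": scrape_header(text),
    }

collection = {}  # stands in for a NoSQL collection

# Hypothetical VCF-style content: note the reference build and the
# command line embedded in the header.
example = (
    "##fileformat=VCFv4.2\n"
    "##reference=hg19\n"
    "##bcftools_viewCommand=view -f PASS calls.vcf\n"
    "#CHROM\tPOS\n"
    "chr1\t100\n"
)
rec = make_record("/data/calls.vcf", example)
collection[rec["id"]] = rec
```

Because identical content hashes to the same checksum regardless of path, a later pass over the collection can group records by `sha256` to find the duplicate copies scattered across the filesystem.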