Bio-IT Trends From The Trenches (digital edition)

Note: Contact me directly at dag@bioteam.net if you would like a PDF download of these slides.

This is Chris Dagdigian’s 10th year delivering his no-holds-barred, candid state-of-the-industry address at Bio-IT World, and we are not going to let a pandemic stop him.

Instead of his typical talk, five distinguished panelists will join Chris for a spirited discussion on Current Events and Scientific Computing and the impacts of the COVID-19 Pandemic:


  1. Trends from the Socially Distanced Trenches. Twitter: chris_dag | Email: <dag@bioteam.net> | LinkedIn: chrisdag | Slides: slideshare.net/chrisdag
  2. Live event only! Visit slido.com and enter event code TFT for live Q&A, polls, and panelist bios & URLs (or scan the QR code).
  3. About today’s event
  4. About today’s event: 10 years ago ...
  5. By accident we created an enduring thing ...
  6. About today’s event
      ● 10th Anniversary: “Trends From The Trenches @ Bio-IT World”
      ● Postponed conference? No problem!
          ○ We can’t be together physically
          ○ But we can bring our friends and colleagues together virtually
  7. And we have some amazing friends with us...
  8. Bio-IT World 2020 is postponed; not cancelled!
  9. Intro over … {mini} Trends time
  10. 2020 scientific drivers for research IT
  11. What is driving Bio-IT requirements in 2020?
      ● Science driver: Genomics and Bioinformatics
          ○ Context: Historically the dominant consumer of both storage and compute resources. This will continue as sequencing becomes less expensive and more widely used in both lab and clinical settings.
          ○ IT impact: storage capacity; non-GPU computing; large-memory computing; data ingest & movement
      ● Science driver: Image-based data acquisition and analysis
          ○ Context: The fastest-growing IT driver BioTeam observes “in the trenches” continues to be image capture and image-based storage, driven by the increasing importance of light microscopy (confocal and lattice light-sheet), 3D microscopy, CryoEM, and MRI and fMRI image analysis. A number of BioTeam clients deployed CryoEM in 2019-2020.
          ○ IT impact: storage capacity; storage performance; GPU computing; large-scale data ingest & movement
  12. What is driving Bio-IT requirements in 2020?
      ● Science driver: ML and AI
          ○ Context: ML and AI techniques are expected to make significant future contributions to Bio-IT requirements and platforms. These approaches may need hardware beyond the general-purpose GPU and may require advanced GPUs, FPGAs, and neural processors.
          ○ IT impact: storage performance; storage capacity; GPU computing; cloud workload migration
      ● Science driver: Chemistry and Molecular Dynamics
          ○ Context: Computational chemistry and MD simulation requirements differ significantly from bioinformatics/genomics requirements. It is worth noting that chemists are capable of consuming nearly infinite amounts of compute capacity -- if more power is available they simply run longer or more complex simulations.
          ○ IT impact: GPU computing; scratch storage performance; cloud workload migration
  13. What is driving Bio-IT requirements in 2020?
      ● Notice anything in the prior science/IT driver list?
          ○ Storage (and changing requirements for storage in 2020 ..) plays a big part
      ● Fortunate that this event is being sponsored by a very interesting storage player
      ● Since VAST won’t talk tech today I’ll leave this URL for the nerds who want a deeper dive. This was the original write-up by @glennklockwood that drove our interest: https://glennklockwood.blogspot.com/2019/02/vast-datas-storage-system-architecture.html
  14. What are vendors most likely to lie about in 2020?
  15. What are vendors most likely to lie about in 2020? ML & AI, of course!
      ● Every IT purchase cycle includes breathlessly overhyped tech, heavily stage-managed & subsidized reference projects and aggressive-to-the-point-of-misleading sales techniques
      ● ML and AI have been like this for a while now and this stuff is creeping into *every* product pitch
      ● In the hype zone you are also more likely to encounter people who say creepy sexualized stuff like “... open the Kimono… ” in purportedly professional meetings
  16. What are vendors most likely to lie about in 2020? ML & AI, of course!
      ● Fact 1: These methods are real, beneficial and currently driving significant transformation in life science & healthcare
      ● Fact 2: Within the hype zone of new tech it is essential to approach things cautiously, carefully and a bit cynically (test all claims!)
          ○ Be cautious when buying big into stuff your org may not be able to fully exploit, as this market innovates extremely fast
      ● Fact 3: Many end-users are still getting up to speed; the market knows this and sometimes relies on customer naivety or C-Suite bandwagon pressure
  17. Today's focus: Scientific Data
      ● Why 1: Petascale data storage has been easy for many years now; managing and understanding the billions of files we store at petascale is still a gnarly problem
      ● Why 2: Data Management, Data Movement and Data Federation/Access still make up a significant percentage of BioTeam informatics-focused consulting
      ● Why 3: Some folk are nearing the limits of what can be sensibly done with standard scale-out NAS
      ● Why 4: Image-based acquisition/analysis and ML/AI workloads are changing our baseline requirements for scientific data storage system capabilities in 2020
  18. Today's focus: Scientific Data is BIG
  19. Today's focus: Scientific Data is BIG, SILOED
  20. Today's focus: Scientific Data is BIG, SILOED, DIRTY
  21. Today's focus: Scientific Data is BIG, SILOED, DIRTY, BIASED
  22. Big data ...
  23. { Big } data ... (slide columns: “Me in years past” | “May 2020”)
      ● “Far easier to acquire or generate data faster than it can be effectively stored over its full lifecycle”
      ● “Storage pricing not decreasing fast enough to match our increase in consumption”
      ● CryoEM: “Hold my sensor …”
      ● This disconnect was OK when most of us were at the low end of peta-capable storage system limits
          ○ Rudderless expansion requires only ${money}; no leadership, no ownership and no difficult conversations with scientists
          ○ … now seeing limits of this style
  24. { Big } data ... (slide columns: “Me in years past” | “May 2020”)
      ● “At 1-petabyte level your scientific storage platform needs a human data curator and real governance”
      ● BioTeam has seen multiple 10+ petabyte orgs in 2019-2020 with little to no governance, standards or human-led curation
      ● “Data awareness” is a competitive differentiator; organizations can succeed or fail on this capability alone
      ● Human data wranglers now need sophisticated storage reporting and metadata-aware tooling to perform their role
  25. { Big } data ... (slide columns: “Me in years past” | “May 2020”)
      ● “Data triage is required; OK to delete some raw data if it is cheaper to go back to the -40F freezer and rerun the experiment”
      ● “IT can’t make deletion decisions; triage is always led by Science/Research”
      ● Collectively we’ve kinda failed at data management
      ● New and sterner methods are required
          ○ Unconstrained growth of unmanaged data will come back to haunt you in painful ways
  26. { Big } data ... Trends I’m Trying To Create In 2020
      1. Culture Change -- Storage Is a Lab/Group Consumable
          a. View, manage and treat data storage systems in the same way we handle laboratory consumables -- have a plan, defend your request and budget accordingly
          b. Scientists and IT must both actively manage this consumable
      2. No scientific data in $HOME and no more giant $HOME folders
          a. ALL storage allocations are now via Project or Group
          b. Few or no exceptions
  27. { Big } data ... NERSC $HOME policy! NERSC File System Quotas & Purging Overview: https://docs.nersc.gov/
  28. Siloed data ...
  29. { siloed } data ... More and more silos & sources
      ● ‘Data rich’ environments at the network edge are increasing
      ● Data sources and types are increasing
      ● Not just images, instrument data and genomes
          ○ Events, documents, IoT, time-series sensor streams, etc.
      ● Collaborative research efforts are increasing
      ● Petabytes of open access data available for access/download
  30. { siloed } data ... What we see ...
      ● Slow networks & lack of science DMZs can strand data @ edge
      ● Compute|transform happening at ingest/edge (emerging …)
      ● Data Lakes are effective but still *many* failures
      ● Data Commons methods increasingly attracting attention
          ○ Gen3 Commons from CTDS is our jam - https://gen3.org/ https://ctds.uchicago.edu/gen3
          ○ Gen3 COVID19 Data Commons: https://chicagoland.pandemicresponsecommons.org/
  31. Dirty data ...
  32. { dirty } data ... We are not great at data hygiene
      ● In one career generation we’ve gone from handwritten lab notebooks to petascale data wrangling
      ● Generally we’ve scaled capacity but ignored or underinvested in governance, curation, metadata, data cleaning, SOPs and standards
          ○ Blame lies with scientists and scientific leadership
          ○ IT can’t MAKE you clean up after yourself
          ○ Years (or decades) of data neglect are PAINFUL to handle
      ● Annoying in 2019-2020
      ● But this is gonna mess up a ton of ML & AI work in coming years
  33. Biased data ...
  34. { Biased } data ... Model & Data Bias -- Risks for the ML/AI Era
      ● Our responsibility to ensure ML and AI are handled responsibly
      ● Especially with our prior failures at ‘data hygiene’
      ● Need clean data from diverse and equitable sources to have any hope of applying machine methods broadly and across our many disciplines
      ● { our panelists may have a lot to say on this topic … }
  35. Join at slido.com #TFT: Audience Q&A Session
  36. Stuff I got wrong
  37. Stuff I got wrong: failed predictions from past trends talks ...
      1. “Compilers Matter Again!”
      2. “CPU benchmarking is back!”
      3. “We need policy driven auto-tiering storage”
      4. “Single global storage namespace should be the goal”
  38. Stuff I got wrong: 1. “Compilers Matter Again!” (failed predictions from past trends talks ...)
      ● BioTeam observed performance differences between GNU compilers and optimized commercial compilers in ’19
      ● Significant difference for ‘hot’ tools like the Relion CryoEM suite
      ● Expected more interest and more work building scientific tools for different compilers in 2019-2020
      ● I was wrong
  39. Stuff I got wrong: 2. “CPU benchmarking is back!” (failed predictions from past trends talks ...)
      ● Intel has serious CPU competition for the first time in a while
      ● We expected a ton of AMD vs Intel benchmarking
      ● … especially as the exascale supercomputers announced in 2019-2020 seem to have clearly made their architecture choices
      ● I was wrong. It never materialized in our enterprise work
  40. Stuff I got wrong: 3. “We need policy driven auto-tiering storage” (failed predictions from past trends talks ...)
      ● Changed my mind on this
      ● IT-managed auto-tiering based on “policy” is not ideal for our world
          ○ Let me tell you about that time the policy engine on a 12 petabyte filesystem decided to archive & stub all .bashrc files to tape :)
          ○ Making IT own or control the tiering process is wrong. Active partnership needed.
          ○ Scientists manage large data sets in ways that are not easily translated to generic “policy” like “last access time” or “file age” -- great for corporate, bad for science workflows
      ● What we ACTUALLY need
          ○ User self-service for tiering, movement and archive decisions
          ○ Let researchers tier/move/archive based on Project or Group paths|tags
  41. Stuff I got wrong: 4. “Single global storage namespace should be the goal” (failed predictions from past trends talks ...)
      ● Awesome idea in theory
      ● But ..
          ○ Scientific leadership has failed to drive data management as a priority
          ○ Years of data show that scientific end-users are not responsible stewards when the storage environment is charitably described as “wild west”
      ● Tired of scientists who build careers on “data intensive science” WHINING about having to … um … actively manage the data that drives their career
      ● I’m done coddling scientists as an IT person in this particular space
          ○ If big data is part of your research mission then do your darn job and take ownership
          ○ Have to move data between systems? Tiers? Archive? Tough cookies. Do your job.
  42. Thanks; panel time!
      Acknowledgements:
          ○ Stan Gloss
          ○ twitter.com/CircuitSwan
          ○ twitter.com/melrom
          ○ https://desertedislanddevops.com/
      Twitter: chris_dag | Email: <dag@bioteam.net> | LinkedIn: chrisdag | Slides: slideshare.net/chrisdag
  43. Join at slido.com #TFT: Audience Q&A Session
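
Slide 40 argues for researcher self-service tiering keyed on Project or Group paths, rather than IT-owned policy keyed on “last access time” or “file age”. The Python below is a minimal sketch of that idea as a dry-run planner. The directory layout (one directory per project under a storage root), the tier names and the `plan_tiering` helper are all assumptions made for illustration; none of them come from the talk or from any particular storage product.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TierPlan:
    project: str       # project directory name (hypothetical layout)
    target_tier: str   # tier chosen by the researcher, e.g. "archive"
    file_count: int    # number of regular files under the project
    total_bytes: int   # combined size of those files


def plan_tiering(storage_root: Path, requests: dict[str, str]) -> list[TierPlan]:
    """Build a dry-run plan from researcher requests (project -> target tier).

    Only projects named in `requests` are inspected, and nothing is moved.
    """
    plans = []
    for project, tier in sorted(requests.items()):
        project_dir = storage_root / project
        if not project_dir.is_dir():
            raise FileNotFoundError(f"no such project directory: {project_dir}")
        files = [p for p in project_dir.rglob("*") if p.is_file()]
        total = sum(p.stat().st_size for p in files)
        plans.append(TierPlan(project, tier, len(files), total))
    return plans


if __name__ == "__main__":
    # Throwaway demo tree; the project names are invented for the example.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "cryoem_2019" / "raw").mkdir(parents=True)
        (root / "cryoem_2019" / "raw" / "stack_001.mrc").write_bytes(b"\0" * 4096)
        (root / "rnaseq_active").mkdir()
        (root / "rnaseq_active" / "counts.tsv").write_text("gene\tcount\n")

        # The researcher asks to archive one finished project; the other is untouched.
        for plan in plan_tiering(root, {"cryoem_2019": "archive"}):
            print(plan)
```

Because the plan is driven by explicit project requests, files outside the named paths (a `.bashrc` in `$HOME`, say) are never considered. Any real tool would still need to hand the plan to whatever data mover the site actually uses.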