"The Cutting Edge Can Hurt You"

Slides from the 2008 Bio-IT World Conference

Published in: Technology

  1. 1. The Cutting Edge Can Hurt You Stories from real-world adopters of next generation sequencing technology Christopher Dwan The Bioteam
  2. 2. Bioteam • Consultancy, with a software business • Vendor Neutral, Technology Agnostic • “Bridge the gap” between high performance computing and life sciences • Founded 2003 Shameless Plugs: • We’re in Booth 113 • Next Generation Sequencing Workshop (yesterday, plus next year) • http://bioteam.net • We’re hiring.
  3. 3. cdwan@bioteam.net Disclaimer • Most BioTeam clients don’t have 7-figure IT budgets, Petabyte SANs, dedicated datacenters, and so on. • Many of these problems: – Are quite different for the largest Bio-HPC centers – Simply don’t matter to the nationally funded projects.
  4. 4. I offer no answers…
  5. 5. Review: 2007 Predictions • Multi-core commodity processors – Workstations are already insanely powerful. • Virtualization on the workstation – Why port code, when you can make a new machine? • Reconfigurable computing goes mainstream – Partnerships, Collaborations, and re-seller agreements. • Next generation DNA sequencing – Data tsunami
  6. 6. 2007 Predictions - Reviewed • Multi-core commodity processors – I touched a 16 core workstation with 10TB of disk. It “just worked.” • Virtualization on the workstation: Everywhere! Including Data! – I saw a workstation replace 6 legacy OS’s in one shop. • Reconfigurable computing goes mainstream: Seemingly not yet – Talk to the folks on the show floor to get strong contradictions. • Next generation DNA sequencing: Yup. – Data tsunami
  7. 7. NEXT GENERATION SEQUENCERS
  8. 8. $5M: If you get one of these …
  9. 9. You probably know about these
  10. 10. Next Generation Sequencing • Costs – Instrument cost: $5×10⁵ to $10⁶ – Reagent Cost: $3k - $10k – ~ 1 TB / machine / day – 4 or 5 vendors • Cost imbalance with IT components – $7k for an experiment – $3k for a new server • Opens high-throughput to much smaller labs.
  11. 11. Next Generation Sequencing • Naming is annoying: – “Next gen”, “new gen”, “now gen” • High Throughput DNA Sequencing – Helicos, Roche, Illumina, ABI, Church Lab – Also other domains: Confocal microscopes, mass spec, … “The old days, which were about two years ago”
  12. 12. Fundamentals • Standard facilities questions: – Heat / Power / Floor capacity, just as much as ever • Network: – Moving 1TB of data from instrument to the next room, much less to a collaborator – Still with the sneakernet. • Security • Data • Lab information management
  14. 14. Information Management • Day 0 – Simply catching the data and not dropping it is a challenge. • 3 months – Postdocs carrying data around on firewire disks – Data management with post-it notes. • 6 months – Instrument vendor updates their software. – Re-analysis? • 1 year – New machine from a different vendor.
  15. 15. Networking / Data Motion • Data motion can interfere with data acquisition (go on, ask the instrument vendor) • Software updates can interfere with attempts to automate data motion • Move 1TB of data from lab to network closet – Old building network, instrument offline for hours – Small $200 4 port gigabit switch (see “security”) – Excuse for a building-wide upgrade, 1 year horizon
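To put the 1TB move in perspective, here is a back-of-the-envelope sketch of the transfer time. The link speeds, the decimal definition of a terabyte, and the `efficiency` parameter are assumptions for illustration, not figures from the slides; real throughput will be lower once protocol overhead and disk speed are counted.

```python
ONE_TB = 10**12  # bytes, assuming decimal (vendor-style) terabytes


def transfer_hours(size_bytes, link_bits_per_second, efficiency=1.0):
    """Hours needed to move size_bytes over a link.

    efficiency is the assumed fraction of the nominal link rate that is
    actually achieved (1.0 is the ideal, never-seen-in-practice case).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Assumed link speeds: an older 100 Mbit/s building network
    # versus a small gigabit switch.
    for name, rate in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)]:
        print(f"{name}: {transfer_hours(ONE_TB, rate):.1f} hours per TB")
```

Under these ideal assumptions 1TB takes about 22 hours at 100 Mbit/s and a little over 2 hours at 1 Gbit/s, which is why an instrument producing ~1TB a day can keep an old network busy around the clock.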
  16. 16. Security • Network and IT Security: serious job. • Labs must propose workable solutions, not wait for security staff to provide them. • Common observations: – Mess with building wide network = security audit – No solution = system offline. “Stay out of the news”
  17. 17. Humans are the problem Lab Instruments Tape Backup Compute Cluster One-off Linux / Windows Machine RAID / Data Store Dev Workstation Management: Throughput? Status? Schedule? Security? Lab Staff Availability, quality, “did it work?” IT Staff Throughput? Status? Schedule Bioinformaticians Access to data, metadata,
  18. 18. Automation is the solution Lab Instruments Tape Backup Compute Cluster One-off Linux / Windows Machine RAID / Data Store Dev Workstation Management Machines need to write web pages IT Staff Offer levels of service, negotiate directly with scientists Bioinformaticians Access to data, metadata, LIMS
  19. 19. WIKILIMS
  20. 20. Wikipedia/ UC Berkeley
  21. 21. Automatic data capture • Most structured content can be captured and recorded by programs as it is generated
  22. 22. WikiLIMS: Next Gen Data Store • File data: RAID • Metadata: wiki
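A minimal sketch of the split described on these two slides: the files stay where they are on disk, and a program records a small metadata entry for each one as it appears. This is not WikiLIMS code; the function names and record fields are hypothetical, and the JSON string only stands in for whatever a real wiki would accept.

```python
import hashlib
import json
import time
from pathlib import Path


def capture_metadata(path):
    """Build one metadata record for a file without moving or altering it."""
    path = Path(path)
    stat = path.stat()
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large instrument files do not fill memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "name": path.name,
        "size_bytes": stat.st_size,
        "modified_utc": time.strftime(
            "%Y-%m-%dT%H:%M:%SZ", time.gmtime(stat.st_mtime)
        ),
        "sha256": digest.hexdigest(),
    }


def capture_run(run_dir):
    """Record metadata for every regular file under an instrument run directory."""
    return [
        capture_metadata(p)
        for p in sorted(Path(run_dir).rglob("*"))
        if p.is_file()
    ]


def to_wiki_payload(records):
    """Serialize records as JSON (a stand-in for posting to a metadata wiki)."""
    return json.dumps(records, indent=2)
```

Recording a checksum at capture time is a design choice, not something the slides prescribe: it gives later re-analysis a way to confirm that the file on the RAID is still the file the instrument wrote.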
  23. 23. Version Differences
  24. 24. Launch an assembly on the cluster
  25. 25. WikiLIMS • Still sold as a custom service – No long term license – Full source code access – Highly customizable • Variety of customers: – Navy Medical Research Labs – Cold Spring Harbor – Emory University – National Cancer Institute – … • Booth 113, we’ll talk your ear off. This is the semantic web
  26. 26. All updates happening at once
  27. 27. STORAGE AND BACKUPS “On their way to becoming a sick joke”
  28. 28. Storage • Storage: Same in 2008 as in 2006 – Unhappy technology tradeoffs – ‘Exotic’ vendors offer blazing speed and a few features – ‘Mainstream’ vendors exclusively focused on enterprise – What I need: Massive scaling, decent speed & grab bag of enterprise features • Real World Solution, early 2008: – 100TB disk, backup, small cluster, plus all infrastructure – Price range: $225k - $998k Cut the problem into pieces.
  29. 29. Archive vs. Resequence • 2007 – Shocking suggestion to delete primary data – Sanger suggested the MAID (Massive Array of Idle Disks) – Novartis reported that 97% of their files are never accessed 3 months after generation. • 2008 – Instrument vendors deleting large volumes of data inside the box. – Less shock, more “data lifecycle” “New instrument data would be different anyway”
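The “data lifecycle” idea can be made concrete with a small sketch that lists files nobody has read in the last three months. The 90-day default mirrors the Novartis observation above; the function name is hypothetical, and the approach assumes the filesystem actually records access times, which is often not the case on volumes mounted with `noatime` or `relatime`.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def stale_files(root, max_idle_days=90, now=None):
    """Return files under root whose last access is older than max_idle_days.

    now is a Unix timestamp; it defaults to the current time and is
    exposed mainly so the cutoff can be pinned in tests.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    return [
        p
        for p in sorted(Path(root).rglob("*"))
        # st_atime is only as trustworthy as the mount options allow.
        if p.is_file() and p.stat().st_atime < cutoff
    ]
```

A listing like this is only a candidate set for archiving or deletion; the decision itself stays with the lab.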
  30. 30. Data Storage • 1 TB – $200 @ Savers ($0.20 / GB) • 24 – 48TB – Commodity solutions, many vendors ($0.70 /GB) • 100TB+ – Interesting architectural tradeoffs – Decision should be based on support expectations – Below $3 / GB really scares me • 1 – 2PB – “Large”
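The per-gigabyte figures above are easier to compare once they are multiplied out. The sketch below does that arithmetic, assuming decimal units (1TB = 1000GB); the function names are hypothetical. It also converts the $225k - $998k range quoted earlier for a 100TB solution into dollars per gigabyte, with the caveat that those quotes bundled backup, a small cluster, and infrastructure, so they are not pure storage prices.

```python
GB_PER_TB = 1000  # assumption: decimal units, as storage vendors quote them


def storage_cost(capacity_tb, dollars_per_gb):
    """Total price in dollars for capacity_tb at a given price per GB."""
    return capacity_tb * GB_PER_TB * dollars_per_gb


def price_per_gb(total_dollars, capacity_tb):
    """Dollars per GB implied by a total price for capacity_tb."""
    return total_dollars / (capacity_tb * GB_PER_TB)


if __name__ == "__main__":
    # Per-GB rates taken from the slide.
    print(f"1TB at $0.20/GB:   ${storage_cost(1, 0.20):,.0f}")
    print(f"24TB at $0.70/GB:  ${storage_cost(24, 0.70):,.0f}")
    print(f"48TB at $0.70/GB:  ${storage_cost(48, 0.70):,.0f}")
    print(f"100TB at $3.00/GB: ${storage_cost(100, 3.00):,.0f}")
    # Quoted range for a complete 100TB solution, expressed per GB.
    for quote in (225_000, 998_000):
        print(f"${quote:,} for 100TB: ${price_per_gb(quote, 100):.2f}/GB")
```

On these numbers the commodity tier comes to roughly $16,800 to $33,600, the $3/GB threshold implies about $300,000 for 100TB, and the quoted solutions span $2.25 to $9.98 per GB.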
  31. 31. • 100+TB SAN • 50+ compute nodes • One rather warm closet.
  32. 32. Backups are legion • Archive: – 1TB Firewire disk - $150 – 800GB LTO4 Tape - $90 (plus a sizable machine) • Disaster Recovery: – Failover, redundancy, etc. – Just buy two of everything. • Incremental Rollback – Traditional “backups” – Daily, weekly, differentials Talk to Finance people about backups.
  33. 33. Data Ingest (instruments) Legacy Storage Architecture 4PB Tape Archive 24TB “hot” disk For analysis SGI SMP Machines Linux Cluster Caching problem Workstations Web / FTP access
  34. 34. Data Ingest (instruments) New Realities Allow Simplification 4PB Tape Backup 1PB “hot” disk For analysis SGI SMP Machines Linux Cluster Workstations Web / FTP access
  35. 35. NEW, COOL STUFF
  36. 36. Amazon Web Services • EC2 for virtualized computing – The economics are compelling • One month of serious experimentation: – $9.00 USD billed to credit card – Various money making approaches • Flexible pricing allows reselling & revenue sharing • Create an EC2 image and add my own fees on top to cover development and support costs – As a developer, I don’t need your credit card • Amazon handles all transactions & billing
  37. 37. Bioteam and Amazon EC2 • This is the grid: – Every Bioteam consultant independently deployed an EC2 solution in 2008. • Inquiry – Since 2004 - “bioinformatics on a cluster” – Apple, Microsoft CCS, Linux, etc. – May 1, 2008: Inquiry on Amazon EC2 – CPU Cost to customer: $10 / node day • Data service: 500GB, constantly updated: – $1400 / yr: downloads, maintenance, and storage – $17 / yr: cost to Bioteam to support a customer
  38. 38. Conclusion If scientists are wasting a bunch of time on IT, we’ve got more work to do.
  39. 39. Disturbing Observation I seem to have presented both a functional “grid” and an instance of a “semantic web” in the same talk.
  40. 40. Thank You • Cambridge Healthtech Institute – Cindy Crowninshield, Kevin Davies • Bioteam Customers – Ed Delong (MIT), Tim Read (NMRC), Yuri Kotliari (NIH), CSHL • Bioteam – Mike Cariaso, Chris Dagdigian, Stan Gloss, Brian Osborne, Bill Van Etten, Jiesheng Zhang • Community – Bioclusters, Sun Grid Engine, Bioinformatics.org
  42. 42. Questions cdwan@bioteam.net