
Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Slides from a talk @ Bio-IT World Asia

  1. Best Practices & Lessons Learned: Life Science Informatics & The Cloud (Tuesday, May 28, 13)
  2. I'm Chris. I'm an infrastructure geek. I work for the BioTeam. Twitter: @chris_dag
  3. Who, what & why: BioTeam
     ‣ Independent consulting shop
     ‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done
     ‣ 12+ years bridging the "gap" between science, IT & high performance computing
     ‣ www.bioteam.net
  4. Seriously. Listen to me at your own risk
     ‣ Clever people find multiple solutions to common issues
     ‣ I'm fairly blunt, burnt-out and cynical in my advanced age
     ‣ A significant portion of my work has been done in demanding production Biotech & Pharma environments
     ‣ Filter my words accordingly
  5. Other 2013 Presentations ... Bio-IT World Boston
  6. Bio-IT World Boston: "Multi-Tenant Research Clusters" (http://slideshare.net/chrisdag/)
  7. Bio-IT World Boston: "HPC Trends from the trenches." (http://slideshare.net/chrisdag/)
  8. Agenda: (1) Intro; (2) Meta: Why Cloud?; (3) What the sales & marketing folks won't tell you; (4) Getting Practical; (5) HPC Case Study
  9. The big picture: Why we need IaaS clouds ...
  10. Big Picture: Why life science needs infrastructure clouds
     ‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed
       • Example: CCD sensor upgrade on that confocal microscopy rig just doubled your storage requirements
       • Example: That 2D ultrasound imager is now a 3D imager
       • Example: Illumina HiSeq upgrade just doubled the rate at which you can acquire genomes. Massive downstream increase in storage, compute & data movement needs
  11. The Central Problem Is ...
     ‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure
       • The science is changing month-to-month ...
       • ... while our IT infrastructure only gets refreshed every 2-7 years
     ‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
  12. The Central Problem Is ...
     ‣ The easy period is over
     ‣ 5 years ago you could toss inexpensive storage and servers at the problem; even in a nearby closet or under a lab bench if necessary
     ‣ That does not work any more; real solutions required
  13. And a related problem ...
     ‣ It has never been easier to acquire vast amounts of data cheaply and easily
     ‣ The growth rate of data creation/ingest exceeds the rate at which the storage industry is improving disk capacity
     ‣ Not just a storage lifecycle problem. This data *moves* and often needs to be shared among multiple entities and providers
       • ... ideally without punching holes in your firewall or consuming all available internet bandwidth
  14. If you get it wrong ...
     ‣ Lost opportunity
     ‣ Missing capability
     ‣ Frustrated & very vocal scientific staff
     ‣ Problems in recruiting, retention, publication & product development
  15. IaaS to the Rescue
  16. Why Cloud? IaaS solves the current critical "Research IT" dilemma
     ‣ IaaS clouds let us react and respond to scientific requirements that change far faster than we can refresh local datacenters and enterprise IT platforms
     (Image: shanelin via Flickr)
  17. Why Cloud? Beyond capability and agility gains ...
     ‣ The economic benefits are real, inescapable and trending in the proper direction
     ‣ Internet-scale providers with millions of cores and exabytes of spinning disk spanning the globe leverage operational efficiencies you will never come close to matching internally
     ‣ ... be suspicious of people who claim otherwise
  18. Why Cloud? Also ...
     ‣ Clouds are becoming a natural place for data exchange & access
     ‣ "Scriptable everything" enables entirely new capabilities not possible internally*
     ‣ Finance people love converting CapEx to OpEx
  19. Agenda: (1) Intro; (2) Meta: Why Cloud?; (3) What the sales & marketing folks won't tell you; (4) Getting Practical; (5) HPC Case Study
  20. What the salesfolk won't tell you ...
     ‣ There is no one-size-fits-all research design pattern ...
     ‣ You are not going to toss everything and replace it with "Big Data"
     ‣ Very few of us have a single pipeline or workflow that we can devote endless engineering effort to
     ‣ We are not going to toss out hundreds of legacy codes and rewrite everything for GPUs or MapReduce
     ‣ For research HPC it's all about the building blocks { and how we can effectively use/deploy them }
  21. What the salesfolk won't tell you
     ‣ Your organization actually needs THREE tested cloud design patterns:
       (1) To handle 'legacy' scientific apps & workflows
       (2) The special stuff that is worth re-architecting
       (3) Hadoop & big data analytics
  22. Design Pattern #1 - Legacy: Legacy HPC on the Cloud
     ‣ There are many hundreds of existing algorithms and applications in the life science informatics space
     ‣ We'll be running/using these codes for years to come
     ‣ Many can't or will never be refactored or rewritten
     ‣ I call this the "legacy" design pattern
  23. One Easy Solution.
  24. Design Pattern #1 - Legacy: StarCluster
     ‣ MIT StarCluster: http://web.mit.edu/star/cluster/
     ‣ Infinite awesomeness. Worth a talk by itself.
     ‣ This is your baseline
     ‣ Extend as needed
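To make the "baseline" concrete: StarCluster drives a traditional batch cluster on EC2 from a single INI-style config file. A minimal sketch of what that looks like, following the layout of the StarCluster docs; the AMI ID, key name, cluster name and sizes below are illustrative placeholders, not values from this talk:

```ini
[aws info]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

[key mykey]
key_location = ~/.ssh/mykey.rsa

[cluster legacyhpc]
keyname = mykey
cluster_size = 8
node_image_id = ami-xxxxxxxx      # a StarCluster-provided AMI
node_instance_type = c1.xlarge
```

With that in `~/.starcluster/config`, `starcluster start legacyhpc` brings up the cluster and `starcluster sshmaster legacyhpc` logs into the head node; "extend as needed" typically means plugins and bigger node counts from there.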
  25. Design Pattern #2 - "Cloudy"
     ‣ Some of our research workflows are important enough to be rewritten for "the cloud" and the advantages that a truly elastic & API-driven infrastructure can deliver
     ‣ This is where you have the most freedom
     ‣ Many published best practices you can borrow
     ‣ Warning: Cloud vendor lock-in potential is strongest here
  26. Design Pattern #3 - Hadoop/BigData
     ‣ Hadoop and "big data" need to be on your radar
     ‣ Be careful though, you'll need a gas mask to avoid the smog of marketing and vapid hype
     ‣ The utility is real and this does represent one "future path" for analysis of large data sets
  27. Design Pattern #3 - Hadoop/BigData
     ‣ It's going to be a MapReduce world, get used to it
     ‣ Little need to roll your own Hadoop in 2013
     ‣ The ISV & commercial ecosystem is already healthy
     ‣ Multiple providers today; both onsite & cloud-based
     ‣ Often a slam-dunk cloud use case
  28. Design Pattern #3 - Hadoop/BigData: What you need to know
     ‣ "Hadoop" and "Big Data" are now general terms
     ‣ You need to drill down to find out what people actually mean
     ‣ We are still in the period where senior leadership may demand "Hadoop" or "BigData" capability without any actual business or scientific need
  29. Hadoop & "Big Data": What you need to know
     ‣ In broad terms you can break "Big Data" down into two very basic use cases:
       1. Compute: Hadoop can be used as a very powerful platform for the analysis of very large data sets. The google search term here is "map reduce"
       2. Data Stores: Hadoop is driving the development of very sophisticated "no-SQL" / non-relational databases and data query engines. The google search terms include "nosql", "couchdb", "hive", "pig" & "mongodb", etc.
     ‣ Your job is to figure out which type applies for the groups requesting "Hadoop" or "BigData" capability
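The "Compute" use case above follows a simple model: a map step turns each input record into (key, value) pairs, and a reduce step combines all values that share a key. A toy word-count in plain Python (no Hadoop involved; the function and variable names are just for illustration) makes the idea concrete:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Map: emit a (word, 1) pair for every word in one input record
    return [(word.lower(), 1) for word in record.split()]

def reduce_phase(pairs):
    # Reduce: combine all values sharing a key by summing them
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

records = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(chain.from_iterable(map_phase(r) for r in records))
# counts["the"] == 2; every other word appears once
```

Hadoop's contribution is running exactly this pattern across thousands of nodes, shuffling the intermediate pairs so each reducer sees all the values for its keys.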
  30. High Throughput Science: Hadoop vs traditional Linux clusters
     ‣ Hadoop is a very complex beast
     ‣ It's also the way of the future, so you can't ignore it
     ‣ Very tight dependency on moving the 'compute' as close as possible to the 'data'
     ‣ Hadoop clusters are just different enough that they do not integrate cleanly with traditional Linux HPC systems
     ‣ Often treated as a separate silo or punted to the cloud
  31. Hadoop & "Big Data": What you need to know
     ‣ Hadoop adoption is being driven by a small group of academics writing and releasing open source life science hadoop applications
     ‣ Your people will want to run these codes
     ‣ In some academic environments you may find people wanting to develop on this platform
  32. Agenda: (1) Intro; (2) Meta: Why Cloud?; (3) What the sales & marketing folks won't tell you; (4) Getting Practical; (5) HPC Case Study
  33. Practical Advice: Strategy
     ‣ Research-oriented IT organizations need a cloud strategy today, or risk being bypassed by employees
  34. Practical Advice: Design Patterns
     ‣ Remember the three design patterns on the cloud:
       • Legacy HPC systems (replicate traditional clusters in the cloud)
       • Hadoop
       • Cloudy (when you rewrite something to fully leverage cloud capability)
  35. Practical Advice: Policies and Procedures
     ‣ The cloud technology bits are easy; cloud process and policy discussions take forever
     ‣ Start these conversations sooner rather than later!
  36. Practical Advice: Core services that take time and advance planning
     ‣ A few key foundational cloud services take time and advance planning to deploy properly:
       • VPNs & subnet schemes
       • Identity Management & Access Control
       • Data Movement
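Subnet schemes in particular reward up-front planning, because renumbering later is painful. A small sketch of the kind of address plan worth agreeing on early, using Python's standard `ipaddress` module; the 10.20.0.0/16 range and the role names are made up for illustration:

```python
import ipaddress

# Hypothetical VPC address plan: carve one /16 into /24 subnets
vpc = ipaddress.ip_network("10.20.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))   # 256 usable /24 blocks

# Reserve blocks per role so growth is predictable
plan = {
    "vpn-gateway": subnets[0],   # 10.20.0.0/24
    "compute":     subnets[1],   # 10.20.1.0/24
    "storage":     subnets[2],   # 10.20.2.0/24
}
```

Writing the plan down as code like this also makes it easy to check for overlaps with on-premises ranges before the first VPN tunnel goes up.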
  37. Practical Advice: Data Movement
     ‣ A few words & pictures on data movement ...
  38. Physical data movement station 1
  39. Physical data movement station 2
  40. "Naked" Data Movement
  41. "Naked" Data Archive
  42. Cloud Data Movement
     ‣ Things changed pretty definitively in 2012
     ‣ And the next image shows why ...
  43. March 2012
  44. Cloud Data Movement: Network vs. Physical
     ‣ With a 1GbE internet connection ...
     ‣ ... and using Aspera software ...
     ‣ We sustained ~700 Mb/sec for more than 7 hours freighting genomes into Amazon Web Services
     ‣ This is fast enough for many use cases, including genome sequencing core facilities
     ‣ Chris Dwan's webinar on this topic: http://biote.am/7e
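As a sanity check on those figures: a 1 GbE link tops out at 1,000 megabits/sec, so a sustained rate in that range (assume ~700 megabits/sec here, decimal units) over a 7-hour run moves on the order of a couple of terabytes:

```python
# Back-of-the-envelope transfer volume at a sustained network rate
# (assumes ~700 megabits/sec; illustrative arithmetic only)
rate_bits_per_sec = 700e6
hours = 7

total_bytes = rate_bits_per_sec / 8 * hours * 3600
total_terabytes = total_bytes / 1e12   # ~2.2 TB over the run
```

That volume per working day is why a well-tuned network pipe can keep up with a sequencing core without resorting to shipping disks.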
  45. Cloud Data Movement: Network vs. Physical
     ‣ Results like this mean we now favor network-based data movement over physical media movement
     ‣ Large-scale physical data movement carries a high operational burden and consumes non-trivial staff time & resources
  46. Cloud Data Movement: There are three ways to do network data movement ...
     ‣ Buy software from Aspera and be done with it
     ‣ Attend the annual SuperComputing conference, see which student group wins the bandwidth challenge contest, and use their code
     ‣ Get GridFTP from the Globus folks
  47. Practical Advice: SysAdmin vs Programmer
     ‣ Recognize the blurring line between IT / Informatics / SW Engineer
     ‣ ... and how it may mix up your org chart
  48. Scientist/SysAdmin/Programmer: Very blurry lines in 2013 for all of these roles
     ‣ Radical change in the last ~2 years in how IT is provisioned, delivered, managed & supported
     ‣ Root cause (Technology): Virtualization & Cloud
     ‣ Root cause (Operations): Configuration Mgmt, Systems Orchestration & Infrastructure Automation
     ‣ SysAdmins & IT staff need to re-skill and retrain to stay relevant
  49. Scientist/SysAdmin/Programmer: Very blurry lines in 2013 for all of these roles
     ‣ When everything has an API ...
     ‣ ... anything can be 'orchestrated' or 'automated' remotely
     ‣ And by the way ... the APIs ('knobs & buttons') are accessible to all
  50. Scientist/SysAdmin/Programmer: Very blurry lines in 2013 for all of these roles
     ‣ IT jobs, roles and responsibilities are undergoing rapid upheaval
     ‣ SysAdmins must learn to program in order to harness automation tools
     ‣ Programmers & Scientists can now self-provision and control sophisticated IT resources
  51. Scientist/SysAdmin/Programmer: Very blurry lines in 2013 for all of these roles
     ‣ My take on the future ...
     ‣ Far more control is going into the hands of the research end user
     ‣ IT support roles will radically change -- no longer owners or gatekeepers
     ‣ IT will handle policies, procedures, reference patterns, security & best practices
     ‣ Researchers will control the "what", "when" and "how big"
  52. Thanks! Email: chris@bioteam.net (http://slideshare.net/chrisdag/)
  53. Cloud HPC Case Study (Time Permitting ...)
  54. NMR Probehead Simulation on AWS: Next Generation Nuclear Magnetic Resonance
     ‣ CAE Simulation Project
     ‣ via www.hpcexperiment.com
     ‣ Software: CST Studio 2012
     ‣ My role: Volunteer HPC Mentor
  55. Why this was an interesting project: Simulating next-generation NMR probeheads
     ‣ The frontend interface is graphics-heavy and requires Windows
     ‣ Studio 'solvers' run on Linux or Windows; they support GPUs and MPI task distribution
     ‣ Simultaneous use of local and cloud-based solvers actually works
     ‣ A flexLM license server is involved
     ‣ Non-trivial security and geo-location requirements
  56. When we ran at modest scale ... 16 large compute nodes + 22 GPU nodes at $30/hour on the AWS Spot Market. HPC on the cloud is real.
  57. Design Attempt #1
     ‣ Hybrid Linux/Windows cloud running in the AWS EU Region
     ‣ Failure:
       • No GPU nodes in the EU at the time
       • No cc2.4xlarge at the time
  58. Design Attempt #2
     ‣ Move the Hybrid Linux/Windows system to US-EAST
     ‣ ... with synthetic test data
     ‣ Best-practices VPC isolation & VPN access
     ‣ It looked like this ...
  59. Architecture #2
  60. Design Attempt #2
     ‣ Attempt #2 failed:
       • The CST FrontEnd Controller running at the end-user site could not tolerate the NAT translation used by the solvers
       • No GPU nodes were available within VPC at that time
  61. Design Attempt #3
     ‣ Design #3 finally works
     ‣ VPC shrunk to a single license server running in US-EAST
     ‣ All Windows/Linux/GPU solver nodes running in the EU
     ‣ No NAT, no VPC for solvers
     ‣ Extensive use of AWS spot instance servers
  62. At experiment end it looked like this ...
  63. Non-Trivial HPC on the Cloud: 16 large compute nodes + 22 GPU nodes at $30/hour on the AWS Spot Market.
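To put that spot-market figure in perspective, the implied blended rate across the whole cluster is well under a dollar per node-hour; simple arithmetic on the numbers above:

```python
# Implied blended cost per node-hour from the figures above
compute_nodes = 16
gpu_nodes = 22
cluster_hourly_cost = 30.0   # USD/hour on the AWS spot market

total_nodes = compute_nodes + gpu_nodes
cost_per_node_hour = cluster_hourly_cost / total_nodes   # ~$0.79/node-hour
```

That per-node-hour number, not the raw $30 figure, is the one to compare against owning and powering equivalent GPU hardware on-premises.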
  64. Nightmare on any other cloud: Why this work was 'easy' on Amazon AWS ...
     ‣ Let's discuss why this simulation workload would be much, much harder to do on some other cloud platform ...
  65. Nightmare on any other cloud: Why this work was 'easy' on Amazon AWS ...
     ‣ 'Brand X' Cloud:
       1. Virtual Servers
       2. Block Storage
       3. Object Storage
       4. ... and maybe some other stuff if I'm lucky
     ‣ AWS: EC2, S3, EBS, RDS, SNS, SQS, SWS, GPUs, SSDs, CloudFormation, VPC, ENIs, SecurityGroups, 10GbE, DirectConnect, Reserved Instances, Import/Export, Spot Market ... and ~25 other products and service features, with more added monthly
  66. Easy on AWS; much harder elsewhere: One very specific example
     ‣ The widely used FLEXlm license server uses NIC MAC addresses when generating license keys
     ‣ Different MAC? Science stops. Screwed.
     ‣ VPC ENIs allow separation of the MAC address from the network interface. Badass.
  67. Why this work was 'easy' on Amazon AWS ... a few other examples:
     ‣ VPC: Incredibly powerful. Actually useful. Approachable even if you are not an IPSEC or BGP routing god.
     ‣ Spot Market: Compelling economics. Once you start you'll likely never run anywhere else. The competition can't compete.
     ‣ cc* & cg* ec2 instance types: Fat nodes with bidirectional 10GbE bandwidth. And don't get me started on SSD or provisioned-performance EBS volumes.
  68. Thanks! Email: chris@bioteam.net