This is a talk I gave at a Northwestern University - Complete Genomics Workshop on April 21, 2011 about using clouds to support research in genomics and related areas.
Bionimbus - Northwestern CGI Workshop 4-21-2011
1. Bionimbus: A Cloud-Based Infrastructure for Managing, Analyzing and Sharing Genomics Data. April 21, 2011. Robert Grossman, Institute for Genomics & Systems Biology (IGSB), Computation Institute, University of Chicago and Open Cloud Consortium
5. The Challenge is to Support Cubes of High Throughput Sequence Data. Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set. Different developmental stages. Different pathologies. Perturb the environment.
6. We Have a Problem … vs More and more of your colleagues produce so much data that they cannot easily manage, move, analyze and share it. Centers and large projects build their own infrastructure. Everyone else is on their own.
13. Step 4. Log in to Bionimbus and view your data.
14. Step 5. Use Bionimbus to run standard and custom pipelines. Using the ability of Bionimbus to launch multiple virtual machines reduced this analysis from 25 days to 1 day.
15. Step 1. Get Bionimbus ID (BID), assign project, private/community, public cloud, etc. Step 2. Send sample to be sequenced. Step 3a. Return raw reads. Step 3b. Return variant calls, CNV, annotation… Step 4. Secure data routing to appropriate cloud based upon BID. Step 5. Cloud-based analysis using IGSB and 3rd party tools and applications. Diagram labels: Internal Sequencers, BID Generator, CGI, Bionimbus Private Cloud UC, Bionimbus Community Cloud, Bionimbus Private Cloud XY, Amazon, dbGaP
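The BID-based routing in Steps 1 and 4 can be illustrated with a minimal Python sketch. This is not Bionimbus code: the `BidRegistry` class, the `BID-000001` identifier format, and the `register`/`route` methods are all hypothetical names chosen for illustration. Only the cloud names come from the slide.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import count


class Cloud(Enum):
    # Destination names taken from the slide's diagram labels.
    PRIVATE_UC = "Bionimbus Private Cloud UC"
    PRIVATE_XY = "Bionimbus Private Cloud XY"
    COMMUNITY = "Bionimbus Community Cloud"
    AMAZON = "Amazon"


@dataclass(frozen=True)
class BidRecord:
    bid: str            # hypothetical identifier format
    project: str
    destination: Cloud


class BidRegistry:
    """Hypothetical stand-in for the BID generator (Step 1) and router (Step 4)."""

    def __init__(self) -> None:
        self._records: dict[str, BidRecord] = {}
        self._next_id = count(1)

    def register(self, project: str, destination: Cloud) -> str:
        # Step 1: issue a BID and record the project and target cloud.
        bid = f"BID-{next(self._next_id):06d}"
        self._records[bid] = BidRecord(bid, project, destination)
        return bid

    def route(self, bid: str) -> Cloud:
        # Step 4: decide where returned data goes, based only on the BID.
        # Unknown BIDs are rejected rather than routed to a default cloud.
        if bid not in self._records:
            raise KeyError(f"unknown BID: {bid}")
        return self._records[bid].destination
```

The design point the sketch tries to capture is that the destination is fixed when the BID is issued, so data coming back from a sequencer needs to carry only the BID to be routed.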
17. Clouds provide on-demand computing and storage resources at the scale and with the reliability of a data center. Computer scientists were caught by surprise.
18. What is a Cloud? Software as a Service (SaaS)
19. What Else Is a Cloud? Infrastructure as a Service (IaaS). Users get one or more virtual machines “on demand”.
20. Are There Other Types of Clouds? Hadoop was developed for processing Internet-scale data for ad targeting and related applications, but is now used for processing genomics data and many other applications.
23. Elastic, On-Demand Computing with Usage Based Pricing Is New. 120 computers in three racks for 1 hour costs the same as 1 computer in a rack for 120 hours. Data center scale computing often leverages virtualization technologies.
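The pricing claim on this slide is plain machine-hour arithmetic, sketched below in Python. The hourly rate is a made-up placeholder, not a figure from the talk, and the function names are illustrative only. The sketch assumes a flat per-machine-hour price with no minimum charge or volume discount.

```python
def machine_hours(machines: int, hours: int) -> int:
    """Total machine-hours consumed by a job."""
    return machines * hours


def usage_cost(machines: int, hours: int, rate_per_machine_hour: float) -> float:
    """Cost under flat usage-based pricing (assumed: no minimums, no discounts)."""
    return machine_hours(machines, hours) * rate_per_machine_hour


RATE = 0.10  # hypothetical price per machine-hour, for illustration only

burst_cost = usage_cost(machines=120, hours=1, rate_per_machine_hour=RATE)
steady_cost = usage_cost(machines=1, hours=120, rate_per_machine_hour=RATE)

# Both jobs consume 120 machine-hours, so they cost the same,
# but the burst job finishes in 1 hour instead of 120.
print(burst_cost == steady_cost)
```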
27. Case Study: modENCODE. Bionimbus is used to process the modENCODE data from the White lab (over 1000 experiments). Bionimbus VMs were used for some of the integrative analysis. Bionimbus is used as a backup for the modENCODE DCC.
34. TFs Predictions: 537 silencers, 2,307 new promoters, 12,285 enhancers, 14,145 insulators. www.modencode.org www.cistrack.org Negre et al. Nature 2011
35. Case Study: IGSB All samples processed by the Institute for Genomics & Systems Biology High-Throughput Genome Analysis Core (HGAC) at the University of Chicago use Bionimbus.
42. GWT-based Front End Elastic Cloud Services Database Services Analysis Pipelines & Re-analysis Services Intercloud Services Large Data Cloud Services Data Ingestion Services
43. (Eucalyptus, OpenStack) GWT-based Front End Elastic Cloud Services (PostgreSQL) Database Services Analysis Pipelines & Re-analysis Services Intercloud Services (IDs, etc.) Large Data Cloud Services (UDT, replication) Data Ingestion Services (Hadoop, Sector/Sphere)
45. A successful cloud will… 1. Provide long-term persistent storage services at the scale of a data center. 2. Provide compute services at the scale of a data center. 3. Provide high-performance ingestion and transport of data.
46. A successful cloud will… 4. Support the liberation of data. 5. Peer with public clouds. 6. Peer with private genomics clouds.
48. Bionimbus Road Map. Over the next 3 to 4 months, we will: launch Bionimbus (we are in a pre-launch); add Galaxy-based workflow to Bionimbus; add secure routing of genomes; add more public datasets; add more pipelines.