This document summarizes a talk on cloud computing and bioinformatics given by Angel Pizarro. The talk discusses 3 pillars of systems design for cloud computing, 3 storage implementations (S3, distributed file systems on EC2, memory grids), and 3 areas of bioinformatics affected by clouds (computational biology, software engineering, applications to life sciences). It also summarizes 3 internal projects at UPenn related to resource management, workflow management, and an RNA-Seq analysis pipeline called RUM that combines Bowtie and BLAT.
Everything comes in 3's
1. Everything Comes in 3’s Angel Pizarro Director, ITMAT Bioinformatics Facility University of Pennsylvania School of Medicine
2. Outline This talk looks at the practical aspects of Cloud Computing We will be diving into specific examples 3 pillars of systems design 3 storage implementations 3 areas of bioinformatics And how they are affected by clouds 3 interesting internal projects There are 2 hard problems in computer science: caching, naming, and off-by-1 errors
3. Pillars of Systems Design Provisioning API access (AWS, Microsoft, RackSpace, GoGrid, etc.) Not discussing further, since this is the WHOLE POINT of cloud computing. Configuration How to get a system up to the point you can do something with it Command and Control How to tell the system what to do
4. System Configuration with Chef Automatic installation of packages, service configuration and initialization Specifications use a real programming language with known behavior Bring the system to an idempotent state http://opscode.com/chef/ http://hotclub.files.wordpress.com/2009/09/swedish_chef_bork-sleeper-cell.jpg
5. Chef Recipes & Cookbooks The specification for installing and configuring a system component Able to support more than one platform Has access to system-wide information hostname, IP addr, RAM, # processors, etc. Contain templates, documentation, static files & assets Can define dependencies on other recipes Executed in order, execution stops at first failure
6. Simple Recipe : Rsync Install rsync to the system Meta data file states what platforms are supported Note that Chef is a Linux centric system BUT, the WikiWiki is MessyMessy Look at Chef Solo & Resources
7. More Complex Recipe: Heartbeat Installs the heartbeat package Registers the service, specifies that it can be restarted and that it provides a status message Finally it starts the service
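The recipe slides themselves were images, so no Chef code survives in this transcript. As a rough illustration of the two properties described above (resources are applied in order with the run stopping at the first failure, and the system ends in an idempotent state), here is a minimal Python sketch. It is not Chef's API: `ensure_package`, `ensure_service`, `converge`, and the dictionary standing in for a node are all hypothetical names invented for this example.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical model: a node is a dict of sets, a resource is (label, apply function).
Node = Dict[str, Set[str]]
Resource = Tuple[str, Callable[[Node], None]]


def ensure_package(name: str) -> Resource:
    """Resource that makes sure a package is recorded as installed."""
    def apply(node: Node) -> None:
        # Adding to a set is idempotent: a second run changes nothing.
        node.setdefault("packages", set()).add(name)
    return (f"package[{name}]", apply)


def ensure_service(name: str, requires: str) -> Resource:
    """Resource that starts a service, failing if its package is missing."""
    def apply(node: Node) -> None:
        if requires not in node.get("packages", set()):
            raise RuntimeError(f"service {name}: package {requires} is not installed")
        node.setdefault("running", set()).add(name)
    return (f"service[{name}]", apply)


def converge(node: Node, run_list: List[Resource]) -> List[str]:
    """Apply resources in order; an exception aborts the rest of the run."""
    applied = []
    for label, apply in run_list:
        apply(node)
        applied.append(label)
    return applied
```

Running `converge` twice with the same run list leaves the node unchanged after the first pass, which is the sense of "idempotent" used on the slide.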
8. Command and Control Traditional grid computing QSUB – SGE, PBS, Torque Usually requires tightly coupled and static systems Shared file systems, firewalls, user accounts, shared exe & lib locations Best for capability processes (e.g. MPI) Map-Reduce is the new hotness Best for data-parallel processes Assumes loosely coupled non-static components Job staging is a critical component
9. Map Reduce in a Nutshell Algorithm pioneered by Google for distributed data analysis Data-parallel analysis fit well into this model Split data, work on each part in parallel, then merge results Hadoop, Disco, CloudCrowd, …
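The split / work in parallel / merge pattern can be sketched in a few lines of Python. This is a single-machine toy using a thread pool, not Hadoop or any of the frameworks named above; the base-counting task and the names `split`, `map_chunk`, and `map_reduce` are assumptions chosen only for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List


def split(items: List[str], n_parts: int) -> List[List[str]]:
    """Split the input into at most n_parts roughly equal chunks."""
    size = max(1, -(-len(items) // n_parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_chunk(reads: List[str]) -> Counter:
    """Map step: count bases in one chunk, independently of all other chunks."""
    counts = Counter()
    for read in reads:
        counts.update(read)
    return counts


def map_reduce(reads: List[str], n_parts: int = 4) -> Counter:
    """Split the reads, count each chunk in parallel, then merge the partial counts."""
    chunks = split(reads, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce step: merging Counters adds the per-base totals.
    return reduce(lambda a, b: a + b, partials, Counter())
```

Because each chunk is processed without reference to the others, the map step is data-parallel, and the same shape carries over when the workers are separate machines.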
12. Roll-Your-Own MR on AWS Define small scripts to Split a FASTA file Run a BLAT search The first script defines the inputs of the second Submit the input FASTA to S3 Start a master node as the central communication hub Start slave nodes, configured to ask for work from master and save results back to S3 Press “Play”
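The talk's own splitting script is not reproduced in the transcript. A minimal sketch of the first step, splitting a FASTA file into smaller chunks that could each become the input of one BLAT job, might look like this in Python. The function names and the `records_per_chunk` parameter are hypothetical, and the BLAT invocation and S3 transfer are deliberately left out.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (header without the leading '>', sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(text: str, records_per_chunk: int) -> List[str]:
    """Return FASTA-formatted chunks holding at most records_per_chunk records each."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunks, current = [], []
    for header, seq in parse_fasta(text):
        current.append(f">{header}\n{seq}\n")
        if len(current) == records_per_chunk:
            chunks.append("".join(current))
            current = []
    if current:  # final, possibly smaller chunk
        chunks.append("".join(current))
    return chunks
```

Splitting on record boundaries keeps every sequence intact, so each chunk can be searched independently and the results concatenated afterwards.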
13. Workflow of Distributed BLAT [Diagram: PC, S3, one master node, four slave nodes] Boot master & slaves Upload inputs Submit the BLAT job Initial process splits the FASTA file. Subsequent jobs BLAT smaller files and save each result as it goes Download results
14. Master Node => Resque GitHub-developed background job processing framework Jobs attached to a class from your application, stored as JSON Uses the Redis key-value store Simple front end for viewing job queue status and failed jobs http://github.com/defunkt/resque Resque can invoke any class that has a class method “perform()”
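Resque itself is a Ruby library backed by Redis. To illustrate the idea on this slide (a job is a class name plus arguments stored as JSON, and a worker calls that class's `perform` method), here is a Python analogy with an in-memory queue. It does not use Resque or Redis; `BlatJob`, `JOB_CLASSES`, `enqueue`, and `work_one` are hypothetical names for this sketch.

```python
import json
from collections import deque


class BlatJob:
    """Hypothetical job class; a real one would run BLAT on the named chunk."""
    results = []

    @classmethod
    def perform(cls, chunk_key):
        cls.results.append(f"searched {chunk_key}")


# Registry mapping the class name stored in the payload back to the class.
JOB_CLASSES = {"BlatJob": BlatJob}

# In-memory stand-in for the key-value store that holds the queue.
queue = deque()


def enqueue(job_class, *args):
    """Serialize the job as JSON: the class name plus its arguments."""
    queue.append(json.dumps({"class": job_class.__name__, "args": list(args)}))


def work_one():
    """Pop one job, look up its class, and call perform(); False if the queue is empty."""
    if not queue:
        return False
    payload = json.loads(queue.popleft())
    JOB_CLASSES[payload["class"]].perform(*payload["args"])
    return True
```

Storing only a class name and plain arguments keeps the queue language-neutral and easy to inspect, which is what makes a simple status front end possible.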
16. Storage in the Cloud : S3 Permanent storage for your data Pay as you go for transmission and holding Eliminates backups Pretty good CDN Able to hook into better CDN SLA via CloudFront Can be slow at times Reports of 10 second delay, but average is 300ms response Your Data S3
18. Storage 2: Distributed FS on EC2 Hadoop HDFS, Gigaspaces, etc. Network latency may be an issue for traditional DFSs Gluster, GPFS, etc. Tighter integration with execution framework, better performance? Your Data EC2 Node EC2 Node EC2 Node EC2 Node EC2 Node Disk
19. DFS on EC2 m1.xlarge Costs * Does not take into account transmission fees or data redundancy. Final cost is probably >= S3
20. Storage 3: Memory Grids “RAM is the new Disk” Application level RAM clustering Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces Performance for capability jobs? Your Data EC2 RAM EC2 RAM EC2 RAM EC2 RAM EC2 RAM EC2 RAM * There are also “Disk is the new RAM” groups, where redundant disk is used to mitigate seek times on subsequent reads
21. Memory Grid Cost Take home message: Unless your needs are small, you may be better off procuring bare-metal resources
22. Cloud Influence on Bioinformatics Computational Biology Algorithms will need to account for large I/O latency Statistical tests will need to account for incomplete information, or incremental results Software Engineering Built-for-the-cloud algorithms are popping up CloudBurst is a featured example in AWS EMR! Application to Life Sciences Deploy ready-made images for use Cycle Computing, ViPDAC, others soon to follow
23. Algorithms need to be I/O centric Incur a slightly higher computational burden to reduce I/O across non-optimal networks P. Balaji, W. Feng, H. Lin 2008
24. Some Internal Projects Resource Manager Service for on-demand provisioning and release of EC2 nodes Utilizes Chef to define and apply roles (compute node, DB server, etc) Terminates idle compute nodes at 52 minutes Workflow Manager Defines and executes data analysis workflows Relies on RM to provision nodes Once appropriate worker nodes are available, acts as the central work queue RUM RNA-Seq Ultimate Mapper Map Reduce RNA-Seq analysis pipeline Combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads
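The slide gives only the rule "terminates idle compute nodes at 52 minutes". One plausible reading, and it is an assumption rather than something the talk states, is that nodes are billed per started hour, so an idle node is kept until minute 52 of its current billing hour and released just before the next hour begins. A Python sketch of that policy, with hypothetical names throughout:

```python
from datetime import datetime, timedelta

BILLING_PERIOD = timedelta(hours=1)   # assumption: nodes are billed per started hour
TERMINATE_AT = timedelta(minutes=52)  # threshold quoted on the slide


def should_terminate(launched_at: datetime, now: datetime, is_idle: bool) -> bool:
    """Return True when an idle node is at or past minute 52 of its current billing hour."""
    if not is_idle or now < launched_at:
        return False
    # Time elapsed inside the current billing period.
    into_period = (now - launched_at) % BILLING_PERIOD
    return into_period >= TERMINATE_AT
```

Under this reading, an idle node that has already been paid for stays available for new work until shortly before another hour would be charged.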
26. RUM (Bowtie + BLAT + processing) Significantly increases the confidence of your data
27. RUM Costs Computational cost ~$100 - $200 6-8 hours per lane on m2.4xlarge ($2.40 / hour) Cost of reagents ~= $10,000 1% of total
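The figures on this slide can be checked with a little arithmetic. The sketch below uses only the slide's numbers (6-8 hours per lane, $2.40 per hour, roughly $10,000 in reagents). The `lanes` parameter is a hypothetical input, since the slide does not say how many lanes make up the ~$100 - $200 total.

```python
HOURLY_RATE = 2.40      # m2.4xlarge price per hour, from the slide
REAGENT_COST = 10_000   # approximate reagent cost, from the slide


def run_cost(hours_per_lane: float, lanes: int, rate: float = HOURLY_RATE) -> float:
    """Compute cost for a run: hours per lane x number of lanes x hourly rate."""
    return hours_per_lane * lanes * rate


def share_of_total(compute_cost: float, reagent_cost: float = REAGENT_COST) -> float:
    """Fraction of the combined (compute + reagent) cost that is compute."""
    return compute_cost / (compute_cost + reagent_cost)


if __name__ == "__main__":
    for hours in (6, 8):
        print(f"{hours} h per lane: ${run_cost(hours, lanes=1):.2f} per lane")
    for cost in (100, 200):
        print(f"${cost} compute: {share_of_total(cost):.1%} of total")
```

A single lane costs about $14 to $19 at these rates, and a $100 - $200 compute bill works out to roughly 1-2% of the combined cost, consistent with the slide's "1% of total".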
28. Acknowledgements Garret FitzGerald Ian Blair John Hogenesch Greg Grant Tilo Grosser NIH & UPENN for support My Team David Austin Andrew Brader Weichen Wu Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s
Editor's notes
REFERENCE P. Balaji, W. Feng, H. Lin. Semantic-based Distributed I/O with the ParaMEDIC Framework. ACM/IEEE International Symposium on High-Performance Distributed Computing, April 2008. http://www.mpiblast.org/About/Publications