Everything comes in 3's

Angel Pizarro
Director, ITMAT Bioinformatics Facility
University of Pennsylvania School of Medicine

Outline
- This talk looks at the practical aspects of cloud computing
- We will be diving into specific examples
  - 3 pillars of systems design
  - 3 storage implementations
  - 3 areas of bioinformatics, and how they are affected by clouds
  - 3 interesting internal projects
- There are 2 hard problems in computer science: caching, naming, and off-by-1 errors

Pillars of Systems Design
- Provisioning
  - API access (AWS, Microsoft, RackSpace, GoGrid, etc.)
  - Not discussed further, since this is the WHOLE POINT of cloud computing
- Configuration
  - How to get a system up to the point where you can do something with it
- Command and Control
  - How to tell the system what to do

System Configuration with Chef
- Automatic installation of packages, service configuration, and initialization
- Specifications use a real programming language with known behavior
- Brings the system to the desired state idempotently: re-running a specification changes nothing once the system already matches it
- http://opscode.com/chef/
Image: http://hotclub.files.wordpress.com/2009/09/swedish_chef_bork-sleeper-cell.jpg
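To make the idempotency point concrete, here is a minimal Python sketch (not Chef's own DSL). The helper `ensure_line` and the file name `example.conf` are hypothetical and exist only for illustration. The operation describes a desired state and reports whether it had to change anything, so running it twice is harmless.

```python
import os
import tempfile


def ensure_line(path, line):
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already in
    the desired state. Calling it repeatedly is safe (idempotent).
    """
    existing = []
    if os.path.exists(path):
        with open(path) as handle:
            existing = handle.read().splitlines()
    if line in existing:
        return False  # already converged, nothing to do
    with open(path, "a") as handle:
        handle.write(line + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conf = os.path.join(tmp, "example.conf")  # hypothetical config file
        print(ensure_line(conf, "max_connections = 100"))  # first run changes the file
        print(ensure_line(conf, "max_connections = 100"))  # second run is a no-op
```

A configuration tool built from operations like this can be re-run on every boot or on a schedule without drifting the system.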

Chef Recipes & Cookbooks
- The specification for installing and configuring a system component
- Able to support more than one platform
- Has access to system-wide information
  - hostname, IP address, RAM, number of processors, etc.
- Contain templates, documentation, static files & assets
- Can define dependencies on other recipes
- Executed in order; execution stops at the first failure
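The "executed in order, stops at first failure" behavior can be sketched in a few lines of Python. This is an illustration of the control flow only, not Chef's implementation; the `converge` function, the `fail` helper, and the recipe names in the run list are assumptions made up for the example.

```python
def converge(run_list):
    """Run (name, action) pairs in order; stop at the first action that raises.

    Returns the names that completed and the name that failed (or None).
    """
    completed = []
    for name, action in run_list:
        try:
            action()
        except Exception:
            return completed, name  # later recipes are never attempted
        completed.append(name)
    return completed, None


def fail():
    raise RuntimeError("package not found")  # simulated failure


if __name__ == "__main__":
    log = []
    run_list = [
        ("rsync", lambda: log.append("rsync installed")),
        ("heartbeat", fail),
        ("app", lambda: log.append("app deployed")),
    ]
    print(converge(run_list))
    print(log)
```

Stopping early keeps a half-configured node from running later steps that assume the earlier ones succeeded.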

Simple Recipe: Rsync
- Installs rsync on the system
- Metadata file states which platforms are supported
- Note that Chef is a Linux-centric system
- BUT, the WikiWiki is MessyMessy
  - Look at Chef Solo & Resources

More Complex Recipe: Heartbeat
- Installs the heartbeat package
- Registers the service, and specifies that it can be restarted and that it provides a status message
- Finally, it starts the service

Command and Control
- Traditional grid computing
  - QSUB: SGE, PBS, Torque
  - Usually requires tightly coupled and static systems
    - Shared file systems, firewalls, user accounts, shared exe & lib locations
  - Best for capability processes (e.g. MPI)
- Map-Reduce is the new hotness
  - Best for data-parallel processes
  - Assumes loosely coupled, non-static components
  - Job staging is a critical component

Map Reduce in a Nutshell
- Algorithm pioneered by Google for distributed data analysis
- Data-parallel analyses fit well into this model
  - Split the data, work on each part in parallel, then merge the results
- Hadoop, Disco, CloudCrowd, …
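The split / work in parallel / merge pattern can be shown in miniature with the Python standard library. This is a single-process sketch in which threads stand in for separate machines; the residue-counting task and the function names (`split`, `map_chunk`, `map_reduce`) are assumptions chosen for illustration, not the API of any of the frameworks listed.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split(items, n_parts):
    """Split a list into at most n_parts contiguous chunks of near-equal size."""
    size = max(1, -(-len(items) // n_parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_chunk(sequences):
    """Map step: count residues in one chunk, independently of the others."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    return counts


def map_reduce(sequences, n_parts=4):
    """Split the input, map each chunk in parallel, then merge the partial results."""
    chunks = split(sequences, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce step: merge the partial counts into one result.
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    print(map_reduce(["ACGT", "AACC", "GG"], n_parts=2))
```

Because each chunk is processed without reference to the others, the map step can move to separate nodes without changing the merge logic.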

Serial Execution of Proteomics Search

Roll-Your-Own MR on AWS
- Define small scripts to
  - Split a FASTA file
  - Run a BLAT search
- The first script defines the inputs of the second
- Submit the input FASTA to S3
- Start a master node as the central communication hub
- Start slave nodes, configured to ask the master for work and save results back to S3
- Press “Play”
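A minimal sketch of what the splitting script might look like, assuming plain FASTA text input. The function names `read_fasta` and `split_fasta` and the `records_per_chunk` parameter are hypothetical; uploading chunks to S3 and invoking BLAT are deliberately left out.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(lines, records_per_chunk):
    """Group FASTA records into chunks; each chunk becomes one search job's input."""
    chunk = []
    for record in read_fasta(lines):
        chunk.append(record)
        if len(chunk) == records_per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly smaller, chunk


if __name__ == "__main__":
    fasta = [">a", "AC", "GT", ">b", "TT", ">c", "GG"]
    for index, chunk in enumerate(split_fasta(fasta, 2)):
        print(index, chunk)
```

Each yielded chunk would be written out as its own small FASTA file, which is what lets the first script define the inputs of the second.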

Workflow of Distributed BLAT
Diagram of a PC, S3, one master node, and several slave nodes, with these labeled steps:
- Upload inputs
- Boot master & slaves
- Submit the BLAT job
- Initial process splits the FASTA file. Subsequent jobs BLAT the smaller files and save each result as they go
- Download results

Master Node => Resque
- Background job processing framework developed by GitHub
- Jobs are attached to a class from your application and stored as JSON
- Uses the Redis key-value store
- Simple front end for viewing job queue status and failed jobs
- http://github.com/defunkt/resque
- Resque can invoke any class that has a class method “perform()”
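The job model described above (a class name plus arguments stored as JSON, and a worker that calls the class's `perform` method) can be imitated in Python. This is an analogy rather than Resque's actual API: the in-memory `deque` stands in for Redis, and `SplitFasta`, `JOB_CLASSES`, `enqueue`, and `work_one` are hypothetical names.

```python
import json
from collections import deque


class SplitFasta:
    """Hypothetical job class; a worker invokes its `perform` class method."""

    @classmethod
    def perform(cls, path, n_parts):
        return f"split {path} into {n_parts} parts"


JOB_CLASSES = {"SplitFasta": SplitFasta}  # registry of classes a worker may run
queue = deque()  # in-memory stand-in for the Redis-backed queue


def enqueue(job_class, *args):
    """Store the job as JSON: the class name plus its arguments."""
    queue.append(json.dumps({"class": job_class.__name__, "args": list(args)}))


def work_one():
    """Pop one job, look up its class by name, and call perform(*args)."""
    if not queue:
        return None
    job = json.loads(queue.popleft())
    return JOB_CLASSES[job["class"]].perform(*job["args"])


if __name__ == "__main__":
    enqueue(SplitFasta, "input.fa", 8)
    print(queue[0])
    print(work_one())
```

Storing only a class name and plain arguments keeps the payload small and lets any worker that has the application code loaded pick up the job.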

Storage in the Cloud: S3
- Permanent storage for your data
- Pay as you go for transmission and holding
- Eliminates backups
- Pretty good CDN
  - Able to hook into a better CDN SLA via CloudFront
- Can be slow at times
  - Reports of 10-second delays, but the average response is 300 ms
Diagram: your data stored in S3

Storage 2: Distributed FS on EC2
- Hadoop HDFS, Gigaspaces, etc.
- Network latency may be an issue for traditional DFSs
  - Gluster, GPFS, etc.
- Tighter integration with the execution framework, better performance?
Diagram: your data spread across the disks of several EC2 nodes

DFS on EC2: m1.xlarge Costs
* Does not take into account transmission fees or data redundancy. The final cost is probably >= S3

Storage 3: Memory Grids
- “RAM is the new Disk”
- Application-level RAM clustering
  - Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces
- Performance for capability jobs?
Diagram: your data spread across the RAM of several EC2 nodes
* There is also the “Disk is the new RAM” group, where redundant disk is used to mitigate seek times on subsequent reads

Memory Grid Cost
Take-home message: unless your needs are small, you may be better off procuring bare-metal resources

Cloud Influence on Bioinformatics
- Computational Biology
  - Algorithms will need to account for large I/O latency
  - Statistical tests will need to account for incomplete information or incremental results
- Software Engineering
  - Built-for-the-cloud algorithms are popping up
  - CloudBurst is a featured example in AWS EMR!
- Application to Life Sciences
  - Deploy ready-made images for use
  - Cycle Computing, ViPDAC, others soon to follow

Algorithms need to be I/O-centric
- Incur a slightly higher computational burden to reduce I/O across non-optimal networks
P. Balaji, W. Feng, H. Lin 2008

Some Internal Projects
- Resource Manager
  - Service for on-demand provisioning and release of EC2 nodes
  - Utilizes Chef to define and apply roles (compute node, DB server, etc.)
  - Terminates idle compute nodes at 52 minutes
- Workflow Manager
  - Defines and executes data analysis workflows
  - Relies on the Resource Manager (RM) to provision nodes
  - Once appropriate worker nodes are available, acts as the central work queue
- RUM: RNA-Seq Ultimate Mapper
  - Map Reduce RNA-Seq analysis pipeline
  - Combines Bowtie + BLAT and feeds the results into a decision tree for more accurate mapping of sequence reads
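The slide gives the 52-minute threshold but not the reasoning behind it. One plausible reading, assumed here, is that instances were billed per started hour, so an idle node is released shortly before its next billable hour begins. The sketch below encodes only that assumption; `should_terminate` and both constants' names are hypothetical.

```python
IDLE_CUTOFF_MINUTE = 52  # threshold from the slide
BILLING_PERIOD_MINUTES = 60  # assumption: instances are billed per started hour


def should_terminate(uptime_minutes, is_idle):
    """Return True when an idle node is late enough in its current billing hour."""
    if not is_idle:
        return False
    minute_in_period = uptime_minutes % BILLING_PERIOD_MINUTES
    return minute_in_period >= IDLE_CUTOFF_MINUTE


if __name__ == "__main__":
    for uptime, idle in [(30, True), (52, True), (55, False), (65, True), (112, True)]:
        print(uptime, idle, should_terminate(uptime, idle))
```

Under this reading, a node that goes idle early in an hour stays available for new work until minute 52, since that time is already paid for.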

RUM (Bowtie + BLAT + processing)
Significantly increases the confidence of your data

RUM Costs
- Computational cost: ~$100 - $200
  - 6-8 hours per lane on m2.4xlarge ($2.40 / hour)
- Cost of reagents: ~$10,000
- Computation is ~1% of the total
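The figures above can be checked with a little arithmetic. The sketch below uses only the numbers on the slide; the function names are hypothetical, and the implied instance-hours are a derived quantity, since the slide does not state how many nodes were used.

```python
HOURLY_RATE = 2.40  # USD per m2.4xlarge instance-hour (from the slide)
REAGENT_COST = 10_000  # USD per lane, approximate (from the slide)


def compute_share(compute_cost, reagent_cost=REAGENT_COST):
    """Fraction of the total (reagents + compute) spent on computation."""
    return compute_cost / (reagent_cost + compute_cost)


def implied_instance_hours(compute_cost, hourly_rate=HOURLY_RATE):
    """Instance-hours that a given compute bill buys at the quoted rate."""
    return compute_cost / hourly_rate


if __name__ == "__main__":
    for cost in (100, 200):
        share_pct = round(compute_share(cost) * 100, 1)
        hours = round(implied_instance_hours(cost), 1)
        print(f"${cost}: {share_pct}% of total, {hours} instance-hours")
```

A $100 to $200 bill corresponds to roughly 42 to 83 instance-hours at $2.40 per hour, and about 1 to 2 percent of the combined cost. Finishing in 6-8 wall-clock hours would therefore imply several instances running in parallel, although the exact count is not given.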

Acknowledgements
- Garret FitzGerald
- Ian Blair
- John Hogenesch
- Greg Grant
- Tilo Grosser
- NIH & UPENN for support
- My Team
  - David Austin
  - Andrew Brader
  - Weichen Wu
Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s
