Everything Comes in 3’s. Angel Pizarro, Director, ITMAT Bioinformatics Facility, University of Pennsylvania School of Medicine
Outline. This talk looks at the practical aspects of cloud computing. We will be diving into specific examples: 3 pillars of systems design; 3 storage implementations; 3 areas of bioinformatics, and how they are affected by clouds; 3 interesting internal projects. There are 2 hard problems in computer science: caching, naming, and off-by-1 errors.
Pillars of Systems Design. Provisioning: API access (AWS, Microsoft, RackSpace, GoGrid, etc.); not discussed further, since this is the WHOLE POINT of cloud computing. Configuration: how to get a system up to the point where you can do something with it. Command and Control: how to tell the system what to do.
System Configuration with Chef. Automatic installation of packages, service configuration and initialization. Specifications use a real programming language with known behavior. Applying a specification is idempotent: repeated runs bring the system to the same state. http://opscode.com/chef/ Image: http://hotclub.files.wordpress.com/2009/09/swedish_chef_bork-sleeper-cell.jpg
Chef Recipes & Cookbooks. The specification for installing and configuring a system component. Able to support more than one platform. Has access to system-wide information (hostname, IP address, RAM, number of processors, etc.). Contains templates, documentation, static files and assets. Can define dependencies on other recipes. Executed in order; execution stops at the first failure.
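The two properties named on these slides, idempotence and in-order execution that stops at the first failure, can be illustrated with a small Python sketch. This is not Chef's DSL or API (Chef recipes are written in Ruby); the names Resource, package and converge, and the dictionary standing in for a machine's state, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a machine's state: package name -> installed?
SystemState = Dict[str, bool]


@dataclass
class Resource:
    """One declared piece of desired state (illustrative, not Chef's class)."""
    name: str
    is_satisfied: Callable[[SystemState], bool]
    apply: Callable[[SystemState], None]


def package(name: str) -> Resource:
    """Declare that a package should be installed."""
    def install(state: SystemState) -> None:
        state[name] = True

    return Resource(
        name=f"package[{name}]",
        is_satisfied=lambda state: state.get(name, False),
        apply=install,
    )


def converge(recipe: List[Resource], state: SystemState) -> List[str]:
    """Apply resources in order and return the names of those that changed.

    A resource that is already satisfied is skipped, which is what makes a
    second run a no-op. An exception raised by any resource propagates, so
    the resources after it never run.
    """
    changed = []
    for resource in recipe:
        if resource.is_satisfied(state):
            continue
        resource.apply(state)
        changed.append(resource.name)
    return changed


state: SystemState = {}
recipe = [package("rsync"), package("heartbeat")]
first_run = converge(recipe, state)   # both packages get installed
second_run = converge(recipe, state)  # nothing left to change
```

Running the same recipe twice changes the state only the first time, which is the behavior the slides describe as idempotent.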
Simple Recipe: Rsync. Installs rsync on the system. The metadata file states which platforms are supported. Note that Chef is a Linux-centric system. BUT, the WikiWiki is MessyMessy. Look at Chef Solo & Resources.
More Complex Recipe: Heartbeat. Installs the heartbeat package. Registers the service, specifying that it can be restarted and that it provides a status message. Finally, it starts the service.
Command and Control. Traditional grid computing: QSUB (SGE, PBS, Torque). Usually requires tightly coupled and static systems: shared file systems, firewalls, user accounts, shared executable and library locations. Best for capability processes (e.g. MPI). Map-Reduce is the new hotness: best for data-parallel processes; assumes loosely coupled, non-static components; job staging is a critical component.
Map Reduce in a Nutshell. Algorithm pioneered by Google for distributed data analysis. Data-parallel analyses fit well into this model: split the data, work on each part in parallel, then merge the results. Frameworks: Hadoop, Disco, CloudCrowd, …
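The split / work-in-parallel / merge pattern can be sketched on a single machine with Python's standard library. This is only an illustration of the idea, not how Hadoop, Disco or CloudCrowd are implemented; the function names and the sequence-length example are assumptions made for the sketch, and a thread pool stands in for separate worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def split(items: Sequence[T], n_parts: int) -> List[Sequence[T]]:
    """Cut the input into at most n_parts contiguous pieces."""
    size = max(1, -(-len(items) // n_parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_reduce(
    items: Sequence[T],
    work: Callable[[Sequence[T]], R],
    merge: Callable[[List[R]], R],
    n_parts: int = 4,
) -> R:
    """Split the data, run `work` on each part in parallel, merge the results."""
    parts = split(items, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(work, parts))
    return merge(partials)


def total_length(sequences: Sequence[str]) -> int:
    """Per-part work: sum the lengths of the sequences in one part."""
    return sum(len(s) for s in sequences)


sequences = ["ACGT", "GG", "TTTAA", "C", "ACG"]
total = map_reduce(sequences, total_length, sum, n_parts=2)
```

The per-part function never sees the other parts, which is what makes the work data-parallel; only the merge step needs all partial results.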
Serial Execution of Proteomics Search
Parallel Proteomics Search
Roll-Your-Own MR on AWS. Define small scripts to (1) split a FASTA file and (2) run a BLAT search; the first script defines the inputs of the second. Submit the input FASTA to S3. Start a master node as the central communication hub. Start slave nodes, configured to ask the master for work and save results back to S3. Press “Play”.
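The talk's own scripts are not reproduced in this transcript, so the following is only a minimal sketch of what the first step, splitting a FASTA file into smaller inputs, might look like in Python. The function names and the records_per_chunk parameter are assumptions; uploading the chunks to S3 and running BLAT on them are left out.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    pieces: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(pieces)
            header, pieces = line[1:], []
        elif header is not None:
            pieces.append(line)
    if header is not None:
        yield header, "".join(pieces)


def split_fasta(lines: Iterable[str], records_per_chunk: int) -> List[str]:
    """Group records into FASTA-formatted chunks of a fixed record count.

    Each sequence is written back on a single line, so the original line
    wrapping is not preserved.
    """
    chunks: List[str] = []
    current: List[str] = []
    for header, sequence in read_fasta(lines):
        current.append(f">{header}\n{sequence}\n")
        if len(current) == records_per_chunk:
            chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))
    return chunks


example = """>seq1
ACGT
ACGT
>seq2
GGCC
>seq3
TTAA
"""
chunks = split_fasta(example.splitlines(), records_per_chunk=2)
```

Each chunk is itself a valid FASTA text, so it can serve as the input of one independent search job.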
Workflow of Distributed BLAT. [Diagram: PC, Master, S3 and four Slave nodes.] Upload inputs. Boot master & slaves. Submit the BLAT job. The initial process splits the FASTA file; subsequent jobs BLAT the smaller files and save each result as they go. Download results.
Master Node => Resque. Background job processing framework developed by GitHub. Jobs are attached to a class from your application and stored as JSON. Uses the Redis key-value store. Simple front end for viewing job queue status and failed jobs. Resque can invoke any class that has a class method “perform()”. http://github.com/defunkt/resque
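The queue model described here (a job is a class name plus arguments, serialized as JSON, and a worker calls that class's perform method) can be sketched in Python. This is an analogy, not Resque itself, which is a Ruby library backed by Redis: JobQueue, register and BlatChunk are hypothetical names, an in-memory deque stands in for the Redis list, and the result path is invented for illustration.

```python
import json
from collections import deque
from typing import Any, Dict, Optional

# Registry so a worker can look a job class up by the name stored in JSON.
JOB_CLASSES: Dict[str, type] = {}


def register(cls: type) -> type:
    """Class decorator that makes a job class discoverable by name."""
    JOB_CLASSES[cls.__name__] = cls
    return cls


class JobQueue:
    """In-memory stand-in for a Redis-backed queue (assumption)."""

    def __init__(self) -> None:
        self._jobs: deque = deque()

    def enqueue(self, job_class: type, *args: Any) -> None:
        """Store the job as JSON: the class name plus its arguments."""
        payload = {"class": job_class.__name__, "args": list(args)}
        self._jobs.append(json.dumps(payload))

    def work_one(self) -> Optional[Any]:
        """Pop one job and invoke the class method perform() on its class."""
        if not self._jobs:
            return None
        payload = json.loads(self._jobs.popleft())
        return JOB_CLASSES[payload["class"]].perform(*payload["args"])


@register
class BlatChunk:
    @classmethod
    def perform(cls, chunk_key: str) -> str:
        # A real worker would fetch the chunk, run BLAT on it and upload
        # the output; this sketch only returns a hypothetical result key.
        return f"results/{chunk_key}.psl"


queue = JobQueue()
queue.enqueue(BlatChunk, "chunk-0001")
result = queue.work_one()
```

Because the stored job is plain JSON, the process that enqueues work and the process that performs it only need to agree on class names and argument lists.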
The scripts
Storage in the Cloud: S3. Permanent storage for your data. Pay as you go for transmission and holding. Eliminates backups. Pretty good CDN; able to hook into a better CDN SLA via CloudFront. Can be slow at times: reports of 10-second delays, but the average response is 300 ms. [Diagram: Your Data, S3]
S3 Costs
Storage 2: Distributed FS on EC2. Hadoop HDFS, Gigaspaces, etc. Network latency may be an issue for traditional DFSs (Gluster, GPFS, etc.). Tighter integration with the execution framework, better performance? [Diagram: Your Data, five EC2 nodes, Disk]
DFS on EC2 m1.xlarge Costs. * Does not take into account transmission fees or data redundancy. Final cost is probably >= S3.
Storage 3: Memory Grids. “RAM is the new Disk”. Application-level RAM clustering: Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces. Performance for capability jobs? [Diagram: Your Data, six EC2 RAM nodes] * There are also the “Disk is the new RAM” groups, where redundant disk is used to mitigate seek times on subsequent reads.
Memory Grid Cost. Take-home message: unless your needs are small, you may be better off procuring bare-metal resources.
Cloud Influence on Bioinformatics. Computational Biology: algorithms will need to account for large I/O latency; statistical tests will need to account for incomplete information or incremental results. Software Engineering: built-for-the-cloud algorithms are popping up; CloudBurst is a featured example in AWS EMR! Application to Life Sciences: deploy ready-made images for use; Cycle Computing, ViPDAC, others soon to follow.
Algorithms need to be I/O centric. Incur a slightly higher computational burden to reduce I/O across non-optimal networks (P. Balaji, W. Feng, H. Lin 2008).
Some Internal Projects. Resource Manager: service for on-demand provisioning and release of EC2 nodes; utilizes Chef to define and apply roles (compute node, DB server, etc.); terminates idle compute nodes at 52 minutes. Workflow Manager: defines and executes data analysis workflows; relies on RM to provision nodes; once appropriate worker nodes are available, acts as the central work queue. RUM (RNA-Seq Ultimate Mapper): Map Reduce RNA-Seq analysis pipeline; combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads.
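The "terminates idle compute nodes at 52 minutes" rule could be expressed as a small decision function. The slide does not say what the 52 minutes are measured against; this Python sketch assumes they are minutes into the current billed hour, on the further assumption that nodes were billed by the whole hour. The function and constant names are hypothetical.

```python
BILLING_PERIOD_MINUTES = 60  # assumption: nodes are billed per whole hour
TERMINATE_AT_MINUTE = 52     # threshold quoted on the slide


def should_terminate(uptime_minutes: float, is_idle: bool) -> bool:
    """Decide whether an idle node should be released now.

    A busy node is never terminated. An idle node is kept until it is at
    least TERMINATE_AT_MINUTE minutes into its current billing period,
    since that time has been paid for and new work may still arrive.
    """
    if not is_idle:
        return False
    minutes_into_period = uptime_minutes % BILLING_PERIOD_MINUTES
    return minutes_into_period >= TERMINATE_AT_MINUTE


decisions = {
    "idle at 30 min": should_terminate(30, is_idle=True),
    "idle at 53 min": should_terminate(53, is_idle=True),
    "idle at 113 min": should_terminate(113, is_idle=True),
    "busy at 53 min": should_terminate(53, is_idle=False),
}
```

Under these assumptions, a node that goes idle early in an hour stays available as a free spare until shortly before the next hour would start being charged.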
Bowtie Alone
RUM (Bowtie + BLAT + processing). Significantly increases the confidence of your data.
RUM Costs. Computational cost ~$100 - $200. 6-8 hours per lane on m2.4xlarge ($2.40 / hour). Cost of reagents ~= $10,000. Computational cost is roughly 1% of the total.
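The figures on this slide can be checked with a few lines of Python. All inputs come from the slide; the function name is hypothetical, and the slide does not say how many nodes or lanes the $100 to $200 figure covers, so the per-node cost is computed separately and not reconciled with it.

```python
HOURLY_RATE = 2.40        # m2.4xlarge price per hour, from the slide
REAGENT_COST = 10_000.0   # approximate reagent cost, from the slide


def compute_share(compute_cost: float, reagent_cost: float) -> float:
    """Fraction of the total (compute + reagents) spent on computation."""
    return compute_cost / (compute_cost + reagent_cost)


# Cost of one node running for the quoted 6 to 8 hours.
node_cost_low = 6 * HOURLY_RATE    # about $14.40
node_cost_high = 8 * HOURLY_RATE   # about $19.20

# Share of the total for the quoted $100 to $200 computational cost.
share_low = compute_share(100.0, REAGENT_COST)    # about 1%
share_high = compute_share(200.0, REAGENT_COST)   # about 2%
```

With these numbers the computational share falls between about 1% and 2% of the total, which is the order of magnitude the slide reports.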
Acknowledgements. Garret FitzGerald, Ian Blair, John Hogenesch, Greg Grant, Tilo Grosser. NIH & UPENN for support. My Team: David Austin, Andrew Brader, Weichen Wu. Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s

  • 2. Outline This talk looks at the practical aspects of Cloud Computing We will be diving into specific examples 3 pillars of systems design 3 storage implementations 3 areas of bioinformatics (and how they are affected by clouds) 3 interesting internal projects There are 2 hard problems in computer science: caching, naming, and off-by-1 errors
  • 3. Pillars of Systems Design Provisioning API access (AWS, Microsoft, RackSpace, GoGrid, etc.) Not discussing further, since this is the WHOLE POINT of cloud computing. Configuration How to get a system up to the point you can do something with it Command and Control How to tell the system what to do
  • 4. System Configuration with Chef Automatic installation of packages, service configuration and initialization Specifications use a real programming language with known behavior Bring the system to an idempotent state http://opscode.com/chef/ http://hotclub.files.wordpress.com/2009/09/swedish_chef_bork-sleeper-cell.jpg
  • 5. Chef Recipes & Cookbooks The specification for installing and configuring a system component Able to support more than one platform Has access to system-wide information hostname, IP addr, RAM, # processors, etc. Contain templates, documentation, static files & assets Can define dependencies on other recipes Executed in order, execution stops at first failure
  • 6. Simple Recipe: Rsync Install rsync to the system Metadata file states what platforms are supported Note that Chef is a Linux-centric system BUT, the WikiWiki is MessyMessy Look at Chef Solo & Resources
  • 7. More Complex Recipe: Heartbeat Installs heartbeat package Registers the service and specifies that it can be restarted and provides a status message Finally it starts the service
  • 8. Command and Control Traditional grid computing QSUB – SGE, PBS, Torque Usually requires tightly coupled and static systems Shared file systems, firewalls, user accounts, shared exe & lib locations Best for capability processes (e.g. MPI) Map-Reduce is the new hotness Best for data-parallel processes Assumes loosely coupled non-static components Job staging is a critical component
  • 9. Map Reduce in a Nutshell Algorithm pioneered by Google for distributed data analysis Data-parallel analysis fits well into this model Split data, work on each part in parallel, then merge results Hadoop, Disco, CloudCrowd, …
  • 10. Serial Execution of Proteomics Search
  • 12. Roll-Your-Own MR on AWS Define small scripts to: split a FASTA file; run a BLAT search The first script defines the inputs of the second Submit the input FASTA to S3 Start a master node as the central communication hub Start slave nodes, configured to ask for work from master and save results back to S3 Press “Play”
  • 13. Workflow of Distributed BLAT (Diagram labels: PC, Master, S3, Slaves) Boot master & slaves Upload inputs Submit the BLAT job Initial process splits FASTA file. Subsequent jobs BLAT smaller files and save each result as they go Download results
  • 14. Master Node => Resque GitHub-developed background job processing framework Jobs attached to a class from your application, stored as JSON Uses Redis key-value store Simple front end for viewing job queue status, failed jobs http://github.com/defunkt/resque Resque can invoke any class that has a class method “perform()”
  • 16. Storage in the Cloud: S3 Permanent storage for your data Pay as you go for transmission and holding Eliminates backups Pretty good CDN Able to hook into better CDN SLA via CloudFront Can be slow at times Reports of 10 second delay, but average is 300ms response (Diagram labels: Your Data, S3)
  • 18. Storage 2: Distributed FS on EC2 Hadoop HDFS, Gigaspaces, etc. Network latency may be an issue for traditional DFSs Gluster, GPFS, etc. Tighter integration with execution framework, better performance? (Diagram labels: Your Data, EC2 Nodes, Disk)
  • 19. DFS on EC2 m1.xlarge Costs * Does not take into account transmission fees, or data redundancy. Final cost is probably >= S3
  • 20. Storage 3: Memory Grids “RAM is the new Disk” Application level RAM clustering Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces Performance for capability jobs? (Diagram labels: Your Data, EC2 RAM) * There are also the “Disk is the new RAM” groups, where redundant disk is used to mitigate seek times on subsequent reads
  • 21. Memory Grid Cost Take home message: Unless your needs are small, you may be better off procuring bare-metal resources
  • 22. Cloud Influence on Bioinformatics Computational Biology Algorithms will need to account for large I/O latency Statistical tests will need to account for incomplete information, or incremental results Software Engineering Built-for-the-cloud algorithms are popping up CloudBurst is a featured example in AWS EMR! Application to Life Sciences Deploy ready-made images for use Cycle Computing, ViPDAC, others soon to follow
  • 23. Algorithms need to be I/O centric Incur a slightly higher computational burden to reduce I/O across non-optimal networks P. Balaji, W. Feng, H. Lin 2008
  • 24. Some Internal Projects Resource Manager: Service for on-demand provisioning and release of EC2 nodes Utilizes Chef to define and apply roles (compute node, DB server, etc.) Terminates idle compute nodes at 52 minutes Workflow Manager: Defines and executes data analysis workflows Relies on RM to provision nodes Once appropriate worker nodes are available, acts as the central work queue RUM (RNA-Seq Ultimate Mapper): Map Reduce RNA-Seq analysis pipeline Combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads
  • 26. RUM (Bowtie + BLAT + processing) Significantly increases the confidence of your data
  • 27. RUM Costs Computational cost ~$100 - $200 6-8 hours per lane on m2.4xlarge ($2.40 / hour) Cost of reagents ~= $10,000 1% of total
  • 28. Acknowledgements Garret FitzGerald Ian Blair John Hogenesch Greg Grant Tilo Grosser NIH & UPENN for support My Team David Austin Andrew Brader Weichen Wu Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s
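
The split / work-in-parallel / merge pattern from slides 9 and 12 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scripts used in the talk: the function names are hypothetical, a thread pool stands in for the slave nodes, and a simple residue count stands in for the BLAT search.

```python
from concurrent.futures import ThreadPoolExecutor


def split_fasta(text, records_per_chunk):
    """Split FASTA text into chunks holding at most records_per_chunk records."""
    records, current = [], []
    for line in text.splitlines():
        # A header line closes the record collected so far.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return [
        "\n".join(records[i:i + records_per_chunk])
        for i in range(0, len(records), records_per_chunk)
    ]


def count_residues(chunk):
    """Stand-in 'map' step (hypothetical): sequence length per record ID."""
    totals, name = {}, None
    for line in chunk.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]
            totals[name] = 0
        elif name is not None:
            totals[name] += len(line.strip())
    return totals


def merge(partials):
    """'Reduce' step: combine the per-chunk dictionaries into one result."""
    merged = {}
    for part in partials:
        merged.update(part)
    return merged


def run(text, records_per_chunk=2, workers=4):
    """Split the input, process each chunk in parallel, then merge."""
    chunks = split_fasta(text, records_per_chunk)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_residues, chunks))
    return merge(partials)


if __name__ == "__main__":
    sample = ">seq1 demo\nACGT\nAC\n>seq2\nGGG\n>seq3\nT\n"
    print(run(sample))
```

In the setup described on slide 12, each chunk would instead be written to S3 and picked up by a slave node, so the chunk size controls how evenly work spreads across nodes.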
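
Slide 14 describes jobs that are attached to a class, stored as JSON, and run through a class method "perform()". The pattern can be sketched in Python as below. This is not Resque's API (Resque is a Ruby library backed by Redis): the `BlatJob` class, the `JOB_CLASSES` registry, and the in-memory queue are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import json
from collections import deque


class BlatJob:
    """Hypothetical job class; a real one would run BLAT on the named chunk."""

    @classmethod
    def perform(cls, chunk_name):
        return f"searched {chunk_name}"


# Registry mapping the class name stored in the payload back to the class.
JOB_CLASSES = {"BlatJob": BlatJob}

# In-memory stand-in for the Redis list a real deployment would use.
queue = deque()


def enqueue(job_class, *args):
    """Serialize the job as JSON: the class name plus its arguments."""
    queue.append(json.dumps({"class": job_class.__name__, "args": list(args)}))


def work_one():
    """Pop the oldest job, look up its class, and call perform() with the args."""
    payload = json.loads(queue.popleft())
    return JOB_CLASSES[payload["class"]].perform(*payload["args"])


if __name__ == "__main__":
    enqueue(BlatJob, "chunk_001.fa")
    print(work_one())
```

Because the payload holds only a class name and plain arguments, any worker that can load the same class can pick up the job, which suits the loosely coupled nodes described on slide 8.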
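
The Resource Manager rule on slide 24 ("Terminates idle compute nodes at 52 minutes") could look roughly like the check below. The slide does not say how the 52 minutes is measured, so this sketch assumes it refers to minutes into the current hour since launch, which would fit a provider that bills in whole instance-hours. That interpretation, and every name in the code, is an assumption rather than the facility's implementation.

```python
from datetime import datetime

# Cutoff taken from the slide; how it is measured is an assumption (see above).
IDLE_CUTOFF_MINUTE = 52


def minutes_into_billing_hour(launched_at, now):
    """Whole minutes elapsed within the current hour since launch (assumed billing unit)."""
    elapsed = now - launched_at
    return int(elapsed.total_seconds() // 60) % 60


def should_terminate(launched_at, now, is_idle):
    """Release a node only if it is idle and its current hour is nearly used up."""
    return is_idle and minutes_into_billing_hour(launched_at, now) >= IDLE_CUTOFF_MINUTE


if __name__ == "__main__":
    launched = datetime(2010, 1, 1, 9, 0)
    print(should_terminate(launched, datetime(2010, 1, 1, 9, 53), is_idle=True))
```

Under this reading, an idle node is kept until late in an hour that has already been paid for, in case new work arrives, and is released shortly before the next hour would start.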

Editor's notes

  1. REFERENCE Semantic-based Distributed I/O with the ParaMEDIC Framework
P. Balaji, W. Feng, H. Lin
ACM/IEEE International Symposium on High-Performance Distributed Computing,
April 2008. http://www.mpiblast.org/About/Publications