Supporting bioinformatics
applications with hybrid
multi-cloud services
Mohamed Abouelhoda
Joint work with
Ahmed Abdullah Ali and Mohamed Elkalioby
1
ElasticHPC Overview
ElasticHPC
Configuration
ElasticHPC
Web Interface
IaaS controller & Mapper
• ElasticHPC supports the creation and management of cloud computing resources
over multiple public cloud providers, including Amazon, Azure, Google, and clouds
supporting OpenStack.
Cluster Manager
Security & Networking
Storage Manager
Job&DataManager
Cloud
Provider
#1
Cloud
Provider
#2
Cloud
Provider
#3
2
Cloud Computing
Infrastructure
Virtual Servers
Business
Models
Software
Platform
Cloud Management Provider
• Cloud computing provides access to hardware, platforms, and software, and
takes care of hosting and storage.
• Users do not know where their data is physically stored.
3
Cloud Computing
Cloud deployment models
• Public Cloud
• Private Cloud
• Hybrid Cloud
• Community Cloud
Private Cloud
Public Cloud
Hybrid Cloud
4
Cloud Computing
Advantages
• Service automation and self-service models
• Easy to deploy
• Shift from CapEx to OpEx
• Data recovery and backup
Disadvantages
• Security issues
• Users do not know where their data is stored
• Incompatibility with legacy systems
• Higher operational cost for long-term usage
Advantages and disadvantages of cloud computing
5
Cloud Computing
Cloud Computing for Bioinformatics Applications
Some tools have already been developed for bioinformatics applications:
• Crossbow
• Myrna
• CloudBurst
• CloudBlast
• Cloud–RNA
• etc.
These tools were demonstrated on the cloud, but their techniques do not generalize
to other tools, and they support only Amazon Web Services.
6
Cloud Computing
Computer Cluster middleware packages over cloud
Middleware packages that support computer cluster management
over the cloud:
• StarCluster
• Vappio
• CloudMan
• etc.
These middleware packages do not support running a computer cluster over multiple
cloud providers.
7
Cloud Computing
Cloud computing providers
8
Our contribution
Cluster 1 Cluster 2 Cluster 3
Provider 1 Provider 2 Provider 1 Provider 2
Non-Federated Cloud Cluster Federated Cloud Cluster
Our contribution is to extend bioinformatics applications to run over multiple
clusters on different cloud service providers, supporting two types of compute cluster:
• Non-Federated Cloud Cluster
• Federated Cloud Cluster
9
Our contribution
ElasticHPC supports the creation and management of computer clusters for
bioinformatics solutions on:
– Amazon Web Services
– Microsoft Windows Azure
– Google Compute Engine
– OpenStack based clouds
10
Use cases
11
Use case scenarios
Simplified version of the variant analysis workflow based on NGS technology as an
example for our use case scenarios
12
The variant analysis workflow: the tools BWA, Picard, and GATK are typically used for
the three steps of the workflow. The arrows are labeled with the file formats of
the processed data.
Multiple clusters over multiple clouds
Multiple independent clusters over multiple clouds and each cluster
processes part of the input data
13
Multiple clusters over multiple clouds
Using this scenario depends on:
• Whether there is a time constraint
• Whether the cost must be reduced within a specific time frame (e.g., with spot instances)
14
[Figure: Input Files 1, 2, and 3; Cloud 1 and Cloud 2; Clusters 1, 2, and 3; output files stored on object storage or S3]
Multiple clusters over multiple clouds
Each cluster is created in one cloud and solves a step of the workflow.
15
Multiple clusters over multiple clouds
In the case of technical limitations:
some technical specification prevents a step from running in one cloud,
but the other steps can run in a cheaper cloud.
16
[Figure: Clouds 1, 2, and 3, each hosting one cluster (Clusters 1, 2, and 3); workflow steps: read mapping, mark duplicates, variant calling; output files stored on object storage or S3]
One cluster of federated cloud machines
One cluster composed of machines from different clouds, with a single
master job queue that dispatches the jobs among the nodes in the
different clouds.
17
[Figure: Cloud 1 and Cloud 2; master node; persistent process; communication layer]
One cluster of federated cloud machines
The master job queue dispatches the jobs among the nodes in the different clouds;
it works at the job level rather than at the level of the whole (sub-)workflow.
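The job-level dispatching described above can be sketched as follows. This is a minimal simulation, not the elasticHPC scheduler: the function `dispatch`, the node names, and the job durations are all hypothetical, and the policy (hand the next queued job to whichever node becomes free first, regardless of its cloud) is an assumption made for illustration.

```python
import heapq
from collections import deque


def dispatch(jobs, nodes):
    """Simulate one master queue feeding nodes that live in different clouds.

    jobs  -- list of (job_id, minutes) in submission order (hypothetical data)
    nodes -- list of node names, e.g. one per cloud (hypothetical names)
    Returns (assignment, makespan): job_id -> node, and total minutes elapsed.
    """
    queue = deque(jobs)                    # the single master job queue (FIFO)
    free_at = [(0, node) for node in nodes]  # (time the node becomes free, node)
    heapq.heapify(free_at)
    assignment = {}
    makespan = 0
    while queue:
        job_id, minutes = queue.popleft()
        start, node = heapq.heappop(free_at)  # earliest-free node, any cloud
        finish = start + minutes
        assignment[job_id] = node
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, node))
    return assignment, makespan


if __name__ == "__main__":
    jobs = [("j1", 30), ("j2", 10), ("j3", 10), ("j4", 10)]
    print(dispatch(jobs, ["aws-1", "gce-1"]))
```

Because jobs are assigned one at a time, a long job on one node does not hold back short jobs, which go to the other node as it frees up.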
18
One cluster of federated cloud machines
• Using this scenario depends on
• The processing time differs from one job to another.
• The characteristics of the processed data
• Internet connection among the cloud sites
• Good management of input data according to its
characteristics
19
Elastic-hpc package
20
Elastic-HPC
• Software library that facilitates the creation and use of high
performance cloud computing resources for bioinformatics
over multiple cloud service providers
• Basic Features
• Creation of multi-cloud clusters
• Management of cluster
• Data management options
» NFS
» S3FS
» GlusterFS
• Job submission options
» PBS Torque
» Sun Grid Engine (SGE)
• Bioinformatics tool set
» 200 sequence analysis tools coming from BioLinux
» EMBOSS
» NCBI Toolkit
» SHRiMP, Bowtie2, GATK, BWA, etc.
21
Implementation of multi-cloud
elasticHPC
22
Implementation of multi-cloud
elasticHPC
23
The three major commercial providers Amazon, Azure, and Google
Amazon Web Services (AWS)
Execution Model:
• Highest CPU virtual machine of type c3.8xlarge (32 cores, 60 GB
RAM, $1.68/hr)
Storage Model:
• EBS (Elastic Block Store): hard disks and other block devices
• S3 (Simple Storage Service): object storage
Pricing models:
• Pay as you go
• Reserved instances
• Spot instances
Implementation of multi-cloud
elasticHPC
24
Microsoft Windows Azure
Execution Model:
• Highest CPU virtual machine of type A9 (16 cores, 112 GB RAM,
$4.47/hr)
Storage Models:
• Page blobs: used like hard disks and block devices (as a file system), with a
maximum size of 1 TB
• Block blobs: maximum size of 200 GB
Pricing models:
• Pay as you go (pay per minute)
Implementation of multi-cloud
elasticHPC
25
Google Compute Engine /Google Cloud
Execution Model:
• Highest CPU virtual machine of type n1-highmem-16 (16 cores, 104 GB
RAM, $1.18/hr)
• Google also provides hard disks, snapshots, and images within the execution
model
Storage Models:
• Object storage
Pricing models:
• Pay as you go (pay per minute)
• Sustained use
Implementation of multi-cloud
elasticHPC
Comparing the features among Amazon Web Services, Windows Azure and Google
Compute Engine, including the business model
26
Implementation of multi-cloud
elasticHPC
Implementation Details
• elasticHPC follows a client-server architecture.
27
Implementation of multi-cloud
elasticHPC
elasticHPC Interface
1. Create federated/non-federated clusters
2. Upload configuration file
3. Upload cloud-specific credentials and start clusters
28
Implementation of multi-cloud
elasticHPC
IaaS Controller and Mapper
translates the request to the
corresponding APIs specific to
each cloud platform.
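A minimal sketch of this mapping idea, assuming a provider-neutral request object and one translator per cloud. Every name here (`CreateNodes`, `map_request`, the translator functions, and the fields of the returned dictionaries) is hypothetical; the translators only describe a call and do not use any real cloud SDK. The provider keys follow the PROVIDER values used in the configuration file sample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateNodes:
    """Provider-neutral request, as a front end might issue it (hypothetical)."""
    provider: str   # "AWS", "GCE", or "AZURE"
    image_id: str
    nodes: int


# Hypothetical translators: each returns a description of a provider-specific
# call instead of contacting a real cloud API.
def _to_aws(req):
    return {"backend": "aws", "image": req.image_id, "count": req.nodes}


def _to_gce(req):
    return {"backend": "gce", "image": req.image_id, "count": req.nodes}


def _to_azure(req):
    return {"backend": "azure", "image": req.image_id, "count": req.nodes}


TRANSLATORS = {"AWS": _to_aws, "GCE": _to_gce, "AZURE": _to_azure}


def map_request(req):
    """Route a generic request to the translator of the target cloud."""
    translate = TRANSLATORS.get(req.provider.upper())
    if translate is None:
        raise ValueError(f"unsupported provider: {req.provider}")
    return translate(req)


if __name__ == "__main__":
    print(map_request(CreateNodes("GCE", "tavaxy2", 2)))
```

Keeping the translators in a lookup table means that supporting another platform, such as an OpenStack-based cloud, would only require registering one more function.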
29
Implementation of multi-cloud
elasticHPC
Cluster Manager
handles all functions related to
the creation and management
of clusters at that cloud site
including security settings
and storage devices
30
Implementation of multi-cloud
elasticHPC
Job and Data Manager
handles job submission and
data transfer management
between the cluster's nodes and
the different storage types
(block/object).
31
Experiments
32
Experiments
Variant Analysis Workflow
• Input: exome dataset of size ≈ 9 GB
• Tools: BWA for read mapping, Picard for marking duplicates,
and GATK for variant calling
33
Experiments
Experiment 1
The workflow was executed three times, independently on each of:
• Google: n1-highmem-8 (8 cores, 52 GB RAM, $0.452/hour)
• AWS: m3.2xlarge (8 cores, 30 GB RAM, $0.56/hour)
• Azure: Standard A7 (8 cores, 56 GB RAM, $1.00/hour)
The 9 GB input data is divided into blocks that are processed in parallel over
the cluster nodes.
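One way to divide read data into blocks is sketched below. It assumes the input is in FASTQ format (four lines per read), which the slides do not state, and it distributes records round-robin; the function name `split_fastq` is hypothetical and this is not necessarily how elasticHPC partitions its input.

```python
from itertools import islice


def split_fastq(lines, n_blocks):
    """Distribute FASTQ records (4 lines each) round-robin over n_blocks lists."""
    if n_blocks < 1:
        raise ValueError("n_blocks must be at least 1")
    blocks = [[] for _ in range(n_blocks)]
    it = iter(lines)
    index = 0
    while True:
        record = list(islice(it, 4))       # one read = 4 consecutive lines
        if not record:
            break
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        blocks[index % n_blocks].extend(record)
        index += 1
    return blocks


if __name__ == "__main__":
    reads = []
    for i in range(3):
        reads += [f"@read{i}", "ACGT", "+", "IIII"]
    for block in split_fastq(reads, 2):
        print(len(block) // 4, "reads")
```

Keeping whole records together matters because each block must remain a valid input file for the read mapping step on its node.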
34
Experiments
Experiment 1
Google and Amazon have the same performance, whereas Azure has the
worst performance.
35
Running times in minutes. “MarkD” stands for the mark duplicates step. The numbers
in brackets are the cost in USD.
Experiments
Experiment 1
Note that the mark duplicates step shows no performance improvement when
more nodes are added (increasing computing power), because Picard
requires all reads as a single input set.
36
Experiments
Experiment 2
Using the same input dataset, but with a stronger machine (Amazon c3.8xlarge)
for the mark duplicates step.
[Figure: read mapping on a Google cluster of n1-highmem-8 nodes (8 cores, 52 GB RAM, $0.452/hour); transfer of the mapped BAM file; mark duplicates on an Amazon c3.8xlarge (32 cores, 60 GB RAM, $1.68/hour); variant calling on n1-highmem-8 nodes; upload of the VCF output file to object storage (S3/Google)]
37
Experiments
Experiment 2
Google always yields a lower cost when parallelization leads to run times that are
fractions of an hour, because it bills per minute. The best cost with comparable
performance for this three-step workflow is therefore obtained with a hybrid cloud
of Amazon and Google.
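The effect of billing granularity can be illustrated with a small calculation. The hourly rates are the ones quoted for Experiment 1; the 70-minute run time is a hypothetical value, and the model (usage rounded up to whole billing units, no minimum charge) is a simplifying assumption rather than either provider's exact billing rules.

```python
import math


def billed_cost(minutes, hourly_rate, granularity_min):
    """Cost of one machine when usage is rounded up to the billing unit."""
    units = math.ceil(minutes / granularity_min)
    billed_minutes = units * granularity_min
    return billed_minutes / 60 * hourly_rate


if __name__ == "__main__":
    runtime = 70  # hypothetical run time in minutes
    per_hour = billed_cost(runtime, 0.56, 60)    # m3.2xlarge rate, hourly billing
    per_minute = billed_cost(runtime, 0.452, 1)  # n1-highmem-8 rate, per-minute billing
    print(f"hourly billing:     ${per_hour:.2f}")
    print(f"per-minute billing: ${per_minute:.2f}")
```

Under hourly billing the 70-minute run is charged as two full hours, so the gap between the two providers is larger than the difference in their hourly rates alone.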
38
Running times in minutes using a single provider and a multi-cloud scenario with
two providers. The numbers in brackets are the cost in USD.
Conclusion
• We introduced ElasticHPC, which creates and manages computer clusters over multiple
cloud platforms for bioinformatics applications
• Google and Azure charge per minute
• Amazon charges per hour
• ElasticHPC enables the data analyst to use the cloud with the best offer at the time of
analysis
• ElasticHPC opens the way for the development of more advanced layers for task
scheduling and cost-time optimization
• In future work, we will explore ways to use storage from multiple clouds
as a shared file system
39
Thank you
40
Availability and requirements
• Project name: elasticHPC.
• Project home page: http://www.elastichpc.org.
• Operating system(s): Linux.
• Programming language: Python, C, JavaScript, HTML, Shell
script.
• Other requirements: Compatible with the browsers
Firefox, Chrome, Safari, and Opera. See the manual for
more details.
• License: Free for academics. Authorization license needed
for commercial usage (Please contact the corresponding
author for more details).
• Any restrictions to use by non-academics: No restrictions.
41
Configurations file
################## BASIC SETTING FOR CLOUD PLATFORMS ##############
[GCE]
# GOOGLE COMPUTE ENGINE CONFIGURATION
PROJECT_ID =
ZONE = us-central1-a
CLIENT_SECRET = config/client_secret.json
COMPUTE_SCOPE = https://www.googleapis.com/auth/compute
OAUTH_STORAGE = oauth2.dat
IMAGE_PROJECT =
SERVICE_EMAIL = default
NETWORK = default
SCOPES = https://www.googleapis.com/auth/devstorage.full_control
API_VERSION = v1
CLUSTER_CLIENT_KEY = keys/key
ROOT_DISK=disks
Configuration file sample
Google Specific configuration section
42
Configurations file
################## BASIC SETTING FOR CLOUD PLATFORMS ##############
[AZURE]
# MICROSOFT WINDOWS AZURE CONFIGURATIONS
SUBSCRIPTION_ID =
THUMBPRINT =
STORAGE_ACCOUNT =
STORAGE_KEY =
CERTIFICATE_PATH = mycert.pem
PKFILE = mycert.cer
CERT_DATA_PATH = mycert.pfx
CERT_PASSWORD =
REGION = WUS
CONTAINER=newcontainer
Configuration file sample
Azure Specific configuration section
43
Configurations file
########## BASIC SETTING FOR CLOUD PLATFORMS ########
[AWS]
# AMAZON WEB SERVICES CONFIGURATIONS
pkey= pk.pem
cert= cert.pem
accessKey=
secretKey=
keyPair= instance-key
securityGroup =
keyPairPath= instance-key.pem
INSTANCE_TYPE = m3.medium
MASTER_TYPE = m3.medium
REGION = USW1
ZONE = us-west-1c
Configuration file sample
Amazon Specific configuration section
44
Configurations file
###### DEFINE CLUSTERS #######
[CLUSTERS]
CLUSTERS_LIST= CLUSTER1, CLUSTER2
[CLUSTER1] ### CLUSTER 1 is a hybrid cluster over multi-cloud
# CLUSTER CONFIGURATION
CLUSTER_NAME= cluster1
CLUSTER_PREFIX = cluster1
MachineSets=MachineSet2,MachineSet3,MachineSet1
MASTER_NODE_LOCATION= MachineSet2
NFS = True
# NFS CONFIGURATION
NFS_MOUNTING_POINT=/home
NFS_DEVICE=/dev/xvdf
NFS_FSID=0
NFS_EBS_Mode=NEW_VOLUME
# attach new volume
NFS_NEW_VOLUME_SIZE=10
# in case of attaching an existing volume
GLUSTER=False
GLUSTER_MOUNT_POINT = /gluster/WGA/
GLUSTER_VOLUME_NAME = gv0
GLUSTER_STRIPE = 1
GLUSTER_REPLICATE = 1
GLUSTER_FORMAT_DISK = False
The clusters section defines multiple clusters, each of which has multiple machine sets;
every machine set represents a cluster on a different cloud service provider.
[MachineSet1]
NODES = 2
PROVIDER = GCE
# IMAGE CONFIGURATION
IMAGE_ID = tavaxy2
……
FIREWALL=ehpc,http2,apache2
FW_PORTS=5000,8080,80
FW_PROTOCOLS=tcp,tcp,tcp
[MachineSet2]
NODES = 0
PROVIDER = AWS
IMAGE_ID = ami-077d9a43
……..
FW_PORTS=5000,8080,80
FW_PROTOCOLS=tcp,tcp,tcp
[MachineSet3]
NODES=0
Provider = AZURE
IMAGE_ID = ehpc-generic26
OS_URL =
……..
FW_PORTS=5000,8080,80
FW_PROTOCOLS=tcp,tcp,tcp
45
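The relationship between a cluster and its machine sets can be read back with Python's standard `configparser`, as in the sketch below. It uses a trimmed excerpt of the sample above with the elided lines left out; the helper `nodes_per_provider` is hypothetical, and what exactly NODES counts (for example, whether the master node is included) is not stated in the slides.

```python
import configparser

# Trimmed excerpt of the configuration sample; elided settings are omitted.
SAMPLE = """
[CLUSTERS]
CLUSTERS_LIST = CLUSTER1

[CLUSTER1]
CLUSTER_NAME = cluster1
MachineSets = MachineSet2,MachineSet3,MachineSet1
MASTER_NODE_LOCATION = MachineSet2

[MachineSet1]
NODES = 2
PROVIDER = GCE

[MachineSet2]
NODES = 0
PROVIDER = AWS

[MachineSet3]
NODES = 0
PROVIDER = AZURE
"""


def nodes_per_provider(text, cluster):
    """Sum the NODES value of every machine set of a cluster, per provider."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    totals = {}
    for name in cfg[cluster]["MachineSets"].split(","):
        machine_set = cfg[name.strip()]
        provider = machine_set["PROVIDER"]
        totals[provider] = totals.get(provider, 0) + machine_set.getint("NODES")
    return totals


if __name__ == "__main__":
    print(nodes_per_provider(SAMPLE, "CLUSTER1"))
```

Option names are matched case-insensitively by `configparser`, so a key written as `Provider` in one machine set and `PROVIDER` in another is read the same way.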

Weitere ähnliche Inhalte

Was ist angesagt?

Robust Containers by Eric Brewer
Robust Containers by Eric BrewerRobust Containers by Eric Brewer
Robust Containers by Eric Brewer
Docker, Inc.
 

Was ist angesagt? (19)

Scaling Jakarta EE Applications Vertically and Horizontally with Jelastic PaaS
Scaling Jakarta EE Applications Vertically and Horizontally with Jelastic PaaSScaling Jakarta EE Applications Vertically and Horizontally with Jelastic PaaS
Scaling Jakarta EE Applications Vertically and Horizontally with Jelastic PaaS
 
Docker on Amazon ECS
Docker on Amazon ECSDocker on Amazon ECS
Docker on Amazon ECS
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetes
 
Kubernetes Basics
Kubernetes BasicsKubernetes Basics
Kubernetes Basics
 
kubernetes 101
kubernetes 101kubernetes 101
kubernetes 101
 
DevOps in AWS with Kubernetes
DevOps in AWS with KubernetesDevOps in AWS with Kubernetes
DevOps in AWS with Kubernetes
 
Robust Containers by Eric Brewer
Robust Containers by Eric BrewerRobust Containers by Eric Brewer
Robust Containers by Eric Brewer
 
Kubernetes for Beginners: An Introductory Guide
Kubernetes for Beginners: An Introductory GuideKubernetes for Beginners: An Introductory Guide
Kubernetes for Beginners: An Introductory Guide
 
Federated mesos clusters for global data center designs
Federated mesos clusters for global data center designsFederated mesos clusters for global data center designs
Federated mesos clusters for global data center designs
 
Containers kuberenetes
Containers kuberenetesContainers kuberenetes
Containers kuberenetes
 
Kubernetes Requests and Limits
Kubernetes Requests and LimitsKubernetes Requests and Limits
Kubernetes Requests and Limits
 
Getting started with kubernetes
Getting started with kubernetesGetting started with kubernetes
Getting started with kubernetes
 
Evolution of containers to kubernetes
Evolution of containers to kubernetesEvolution of containers to kubernetes
Evolution of containers to kubernetes
 
Kubernetes a comprehensive overview
Kubernetes   a comprehensive overviewKubernetes   a comprehensive overview
Kubernetes a comprehensive overview
 
Federated Kubernetes: As a Platform for Distributed Scientific Computing
Federated Kubernetes: As a Platform for Distributed Scientific ComputingFederated Kubernetes: As a Platform for Distributed Scientific Computing
Federated Kubernetes: As a Platform for Distributed Scientific Computing
 
Kubernetes
KubernetesKubernetes
Kubernetes
 
Learn kubernetes in 90 minutes
Learn kubernetes in 90 minutesLearn kubernetes in 90 minutes
Learn kubernetes in 90 minutes
 
Kubernetes Basics
Kubernetes BasicsKubernetes Basics
Kubernetes Basics
 
Containers kuberenetes
Containers kuberenetesContainers kuberenetes
Containers kuberenetes
 

Andere mochten auch

Kallio Chipster Bosc2009
Kallio Chipster Bosc2009Kallio Chipster Bosc2009
Kallio Chipster Bosc2009
bosc
 
Lt npsti process-and_forms_april_2011
Lt npsti process-and_forms_april_2011Lt npsti process-and_forms_april_2011
Lt npsti process-and_forms_april_2011
Mosab-Khayat
 
تسويق خدمات المعلومات
تسويق خدمات المعلوماتتسويق خدمات المعلومات
تسويق خدمات المعلومات
u083125
 
الثقافة المعلوماتية في الجامعات مكتبة جامعة 6 أكتوبر نوفمبر 2012م
الثقافة المعلوماتية في الجامعات   مكتبة جامعة 6 أكتوبر نوفمبر 2012مالثقافة المعلوماتية في الجامعات   مكتبة جامعة 6 أكتوبر نوفمبر 2012م
الثقافة المعلوماتية في الجامعات مكتبة جامعة 6 أكتوبر نوفمبر 2012م
Prof. Sherif Shaheen
 

Andere mochten auch (20)

Delivering Bioinformatics MapReduce Applications in the Cloud
Delivering Bioinformatics MapReduce Applications in the CloudDelivering Bioinformatics MapReduce Applications in the Cloud
Delivering Bioinformatics MapReduce Applications in the Cloud
 
Kallio Chipster Bosc2009
Kallio Chipster Bosc2009Kallio Chipster Bosc2009
Kallio Chipster Bosc2009
 
Caravane Bio [Mohammed Benbouida, AMBS, Morocco]
Caravane Bio [Mohammed Benbouida, AMBS, Morocco]Caravane Bio [Mohammed Benbouida, AMBS, Morocco]
Caravane Bio [Mohammed Benbouida, AMBS, Morocco]
 
الهوية الرقمية على مواقع التواصل الاجتماعي
الهوية الرقمية على مواقع التواصل الاجتماعيالهوية الرقمية على مواقع التواصل الاجتماعي
الهوية الرقمية على مواقع التواصل الاجتماعي
 
استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)
 استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I) استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)
استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)
 
مهارات+1
مهارات+1مهارات+1
مهارات+1
 
Dr Justin Schonfeld - Bioinformatics Applications
Dr Justin Schonfeld - Bioinformatics ApplicationsDr Justin Schonfeld - Bioinformatics Applications
Dr Justin Schonfeld - Bioinformatics Applications
 
Lt npsti process-and_forms_april_2011
Lt npsti process-and_forms_april_2011Lt npsti process-and_forms_april_2011
Lt npsti process-and_forms_april_2011
 
Present
PresentPresent
Present
 
e justice
e justice e justice
e justice
 
Dr. Dario Lijtmaer - Data Sharing/Collaboration and Publication using BOLD
Dr. Dario Lijtmaer - Data Sharing/Collaboration and Publication using BOLDDr. Dario Lijtmaer - Data Sharing/Collaboration and Publication using BOLD
Dr. Dario Lijtmaer - Data Sharing/Collaboration and Publication using BOLD
 
Visual Studio
Visual StudioVisual Studio
Visual Studio
 
Brin bws13 quiz mmc
Brin bws13 quiz mmcBrin bws13 quiz mmc
Brin bws13 quiz mmc
 
Bioinformatics lecture 1
Bioinformatics lecture 1Bioinformatics lecture 1
Bioinformatics lecture 1
 
تسويق خدمات المعلومات
تسويق خدمات المعلوماتتسويق خدمات المعلومات
تسويق خدمات المعلومات
 
الثقافة المعلوماتية في الجامعات مكتبة جامعة 6 أكتوبر نوفمبر 2012م
الثقافة المعلوماتية في الجامعات   مكتبة جامعة 6 أكتوبر نوفمبر 2012مالثقافة المعلوماتية في الجامعات   مكتبة جامعة 6 أكتوبر نوفمبر 2012م
الثقافة المعلوماتية في الجامعات مكتبة جامعة 6 أكتوبر نوفمبر 2012م
 
الثقافة التقنية والمواطنة الالكترونية
الثقافة التقنية والمواطنة الالكترونيةالثقافة التقنية والمواطنة الالكترونية
الثقافة التقنية والمواطنة الالكترونية
 
From Sunset To Sunrise
From Sunset To SunriseFrom Sunset To Sunrise
From Sunset To Sunrise
 
دور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفية
دور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفيةدور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفية
دور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفية
 
ABT 609 PPT
ABT 609 PPTABT 609 PPT
ABT 609 PPT
 

Ähnlich wie Supporting bioinformatics applications with hybrid multi-cloud services

Cloud computing(bit mesra kolkata extn.)
Cloud computing(bit mesra kolkata extn.)Cloud computing(bit mesra kolkata extn.)
Cloud computing(bit mesra kolkata extn.)
ASHUTOSH KUMAR
 

Ähnlich wie Supporting bioinformatics applications with hybrid multi-cloud services (20)

An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
 
Phil Basford - machine learning at scale with aws sage maker
Phil Basford - machine learning at scale with aws sage makerPhil Basford - machine learning at scale with aws sage maker
Phil Basford - machine learning at scale with aws sage maker
 
Microsoft Azure in HPC scenarios
Microsoft Azure in HPC scenariosMicrosoft Azure in HPC scenarios
Microsoft Azure in HPC scenarios
 
Machine learning at scale with aws sage maker
Machine learning at scale with aws sage makerMachine learning at scale with aws sage maker
Machine learning at scale with aws sage maker
 
Microservices , Docker , CI/CD , Kubernetes Seminar - Sri Lanka
Microservices , Docker , CI/CD , Kubernetes Seminar - Sri Lanka Microservices , Docker , CI/CD , Kubernetes Seminar - Sri Lanka
Microservices , Docker , CI/CD , Kubernetes Seminar - Sri Lanka
 
Introductio to Docker and usage in HPC applications
Introductio to Docker and usage in HPC applicationsIntroductio to Docker and usage in HPC applications
Introductio to Docker and usage in HPC applications
 
GREEN CLOUD COMPUTING
GREEN CLOUD COMPUTINGGREEN CLOUD COMPUTING
GREEN CLOUD COMPUTING
 
Cloud Roundtable at Microsoft Switzerland
Cloud Roundtable at Microsoft Switzerland Cloud Roundtable at Microsoft Switzerland
Cloud Roundtable at Microsoft Switzerland
 
Architecting .NET solutions in a Docker ecosystem - .NET Fest Kyiv 2019
Architecting .NET solutions in a Docker ecosystem - .NET Fest Kyiv 2019Architecting .NET solutions in a Docker ecosystem - .NET Fest Kyiv 2019
Architecting .NET solutions in a Docker ecosystem - .NET Fest Kyiv 2019
 
Task programming
Task programmingTask programming
Task programming
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
 
Cloud computing overview
Cloud computing overviewCloud computing overview
Cloud computing overview
 
Cloud computing: highlights
Cloud computing: highlightsCloud computing: highlights
Cloud computing: highlights
 
Cloud computing(bit mesra kolkata extn.)
Cloud computing(bit mesra kolkata extn.)Cloud computing(bit mesra kolkata extn.)
Cloud computing(bit mesra kolkata extn.)
 
Cluster Computers
Cluster ComputersCluster Computers
Cluster Computers
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
 
Cloudsim & greencloud
Cloudsim & greencloud Cloudsim & greencloud
Cloudsim & greencloud
 
High Performance Computing (HPC) and Engineering Simulations in the Cloud
High Performance Computing (HPC) and Engineering Simulations in the CloudHigh Performance Computing (HPC) and Engineering Simulations in the Cloud
High Performance Computing (HPC) and Engineering Simulations in the Cloud
 
High Performance Computing (HPC) and Engineering Simulations in the Cloud
High Performance Computing (HPC) and Engineering Simulations in the CloudHigh Performance Computing (HPC) and Engineering Simulations in the Cloud
High Performance Computing (HPC) and Engineering Simulations in the Cloud
 

Kürzlich hochgeladen

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Supporting bioinformatics applications with hybrid multi-cloud services

  • 1. Supporting bioinformatics applications with hybrid multi-cloud services Mohamed Abouelhoda Joint work with Ahmed Abdullah Ali and Mohamed Elkalioby 1
  • 2. ElasticHPC Overview ElasticHPC Configuration ElasticHPC Web Interface IaaS controller & Mapper • ElasticHPC Supports the creation and management of cloud computing resources over multiple public cloud Providers Including Amazon, Azure, Google and Clouds supporting OpenStack. Cluster Manager Security & Networking Storage Manager Job&DataManager Cloud Provider #1 Cloud Provider #2 Cloud Provider #3 2
  • 3. Cloud Computing Infrastructure Virtual Servers Business Models Software Platform Cloud Management Provider • Cloud Computing provides the access to hardware, platforms and software, it takes care of hosting and storage. • User has no clue where his/her data is. 3
  • 4. Cloud Computing Cloud deployment models  Public Cloud  Private Cloud  Hybrid Cloud  Community Cloud Private Cloud Public Cloud Hybrid Cloud 4
  • 5. Cloud Computing Advantages  Service automation and self- service models  Easy to deploy  It is an immigration from CapEx to OpEx  Data recovery and backup Disadvantages  Security Issues  User has no clue where his/her data is  Legacy systems incompatibility  Higher operational cost for long term usage Advantages and disadvantages of cloud computing 5
  • 6. Cloud Computing Cloud Computing for Bioinformatics Applications Some tools already developed for bioinformatics applications  Crossbow,  Myrna  CloudBrust,  CloudBlast,  Cloud–RNA,  etc. These tools are demonstrated on cloud computing and their techniques are not generic to other tools and supports only Amazon Web Services 6
  • 7. Cloud Computing Computer Cluster middleware packages over cloud Middleware packages support computer cluster management over cloud  StarCluster  Vappio  CloudMan  etc. These middleware packages do not support running computer cluster over multiple Cloud providers 7
  • 9. Our contribution Cluster 1 Cluster 2 Cluster 3 Provider 1 Provider 2 Provider 1 Provider 2 Non-Federated Cloud Cluster Federated Cloud Cluster Our contribution is to extend bioinformatics applications to run over multiple clusters on different cloud service providers and supporting two types of compute cluster  Non-Federated Cloud Cluster  Federated Cloud Cluster 9
  • 10. Our contribution ElasticHPC supports creation and management of computer cluster for bioinformatics solutions on: – Amazon Web Services – Microsoft Windows Azure – Google Compute Engine – OpenStack based clouds Provider 2 Provider 1 Provider 2 10
  • 12. Use case scenarios Provider 2 Provider 1 Provider 2 Simplified version of the variant analysis workflow based on NGS technology as an example for our use case scenarios 12 The variant analysis workflow: the tools BWA, Picard, GATK are usually used for the three steps of the workflow. On the arrows, we write the different file formats of the processed data
  • 13. Multiple clusters over multiple clouds Provider 2 Provider 1 Provider 2 Multiple independent clusters over multiple clouds and each cluster processes part of the input data 13
  • 14. Multiple clusters over multiple clouds Provider 2 Provider 1 Provider 2 Using this scenario depends on:  Time constraint or not.  Reducing the cost within specific time (Spot instances) 14 Input File 3 Cloud 1 Input File 1 Cluster 1 Cluster 2 Input File 2 Cloud 2 Cluster 3 Storing Output files On Object storage or S3
  • 15. Multiple clusters over multiple clouds: Each cluster is created in one cloud and solves one step of the workflow.
  • 16. Multiple clusters over multiple clouds: This scenario applies in the case of technical limitations: a technical specification may prevent one step from running in a given cloud, while the other steps can run in a cheaper cloud. [Figure: Clusters 1-3 in Clouds 1-3 run the read mapping, mark duplicates, and variant calling steps; output files are stored on object storage or S3.]
  • 17. One cluster of federated cloud machines: One cluster is composed of machines from different clouds, with a single master job queue that dispatches jobs among the nodes in the different clouds.
  • 18. One cluster of federated cloud machines: The master job queue dispatches jobs among the nodes in different clouds, working at the job level rather than at the level of the whole (sub-)workflow. [Figure: master node, persistent process communication layer, Cloud 1, Cloud 2.]
  • 19. One cluster of federated cloud machines: Whether to use this scenario depends on how much the processing time differs from one job to another, the characteristics of the processed data, the Internet connection among the cloud sites, and good management of the input data according to its characteristics.
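Job-level dispatching from a single master queue can be illustrated with a small simulation. This is a sketch under stated assumptions, not the scheduler used by elasticHPC (which relies on PBS Torque or SGE according to the feature list): the greedy "next job goes to the earliest free node" policy, the function name `dispatch`, and all job names and durations are hypothetical.

```python
import heapq


def dispatch(jobs, nodes):
    """Greedily hand each queued job to the node that becomes free first.

    jobs  -- list of (job_name, duration_in_minutes), in queue order
    nodes -- list of node names, possibly located in different clouds
    Returns (schedule, makespan) where schedule maps node -> job names.
    """
    # Min-heap of (time_when_free, node_name).
    free_at = [(0, node) for node in nodes]
    heapq.heapify(free_at)
    schedule = {node: [] for node in nodes}
    for job_name, duration in jobs:
        available, node = heapq.heappop(free_at)
        schedule[node].append(job_name)
        heapq.heappush(free_at, (available + duration, node))
    makespan = max(finish for finish, _ in free_at)
    return schedule, makespan


# Hypothetical jobs with uneven running times on two nodes in two clouds.
jobs = [("map_1", 30), ("map_2", 10), ("map_3", 10), ("map_4", 20)]
schedule, makespan = dispatch(jobs, ["cloud1-node1", "cloud2-node1"])
print(schedule, makespan)
```

Because jobs rather than whole sub-workflows are the unit of dispatch, a node that finishes a short job early immediately picks up the next one, which is why uneven job durations favour this scenario. The sketch ignores data transfer time between cloud sites, which the slide lists as a deciding factor.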
  • 21. Elastic-HPC: A software library that facilitates the creation and use of high-performance cloud computing resources for bioinformatics over multiple cloud service providers. Basic features: creation of multi-cloud clusters; management of clusters; data management options (NFS, S3FS, GlusterFS); job submission options (PBS Torque, Sun Grid Engine (SGE)); bioinformatics tool set (200 sequence analysis tools coming from BioLinux, EMBOSS, NCBI Toolkit, SHRiMP, Bowtie2, GATK, BWA, etc.).
  • 23. Implementation of multi-cloud elasticHPC: The three major commercial providers are Amazon, Azure, and Google. Amazon Web Services (AWS). Execution model: the highest-CPU virtual machine is of type c3.8xlarge (32 cores, 108 GB RAM, $1.68/hr). Storage models: EBS (Elastic Block Store), used like hard disks and block devices; S3 (Simple Storage Service), an object store. Pricing models: pay as you go; reserved instances; spot instances.
  • 24. Implementation of multi-cloud elasticHPC: Microsoft Windows Azure. Execution model: the highest-CPU virtual machine is of type A9 (16 cores, 112 GB RAM, $4.47/hr). Storage models: page blobs, used like hard disks and block devices to hold a file system, with a maximum size of 1 TB; block blobs, with a maximum size of 200 GB. Pricing models: pay as you go (pay per minute).
  • 25. Implementation of multi-cloud elasticHPC: Google Compute Engine / Google Cloud. Execution model: the highest-CPU virtual machine is of type n1-highmem-16 (16 cores, 104 GB RAM, $1.18/hr); Google also provides hard disks, snapshots, and images within the execution model. Storage models: object storage. Pricing models: pay as you go (pay per minute); sustained use.
  • 26. Implementation of multi-cloud elasticHPC: Comparison of the features of Amazon Web Services, Windows Azure, and Google Compute Engine, including the business model.
  • 27. Implementation of multi-cloud elasticHPC: Implementation details. elasticHPC follows a client-server architecture.
  • 28. Implementation of multi-cloud elasticHPC: elasticHPC interface. (1) Create federated/non-federated clusters. (2) Upload the configuration file. (3) Upload cloud-specific credentials and start the clusters.
  • 29. Implementation of multi-cloud elasticHPC: The IaaS Controller and Mapper translates each request into the corresponding API calls specific to each cloud platform.
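The translation step can be pictured as a dispatch table from a generic request to a provider-specific payload. The sketch below is purely illustrative: `CreateNodesRequest`, `translate`, and every payload field are hypothetical names, and the functions only build dictionaries instead of calling any real cloud SDK. The image identifiers are the ones that appear in the sample configuration file.

```python
from dataclasses import dataclass


@dataclass
class CreateNodesRequest:
    """Provider-independent request (hypothetical shape)."""
    provider: str
    nodes: int
    image_id: str


def _to_aws(req):
    # Placeholder payload; not the real AWS API.
    return {"target": "AWS", "action": "run_instances",
            "count": req.nodes, "image": req.image_id}


def _to_gce(req):
    # Placeholder payload; not the real Google Compute Engine API.
    return {"target": "GCE", "action": "insert_instances",
            "count": req.nodes, "image": req.image_id}


def _to_azure(req):
    # Placeholder payload; not the real Azure API.
    return {"target": "AZURE", "action": "create_deployment",
            "count": req.nodes, "image": req.image_id}


MAPPERS = {"AWS": _to_aws, "GCE": _to_gce, "AZURE": _to_azure}


def translate(req):
    """Map a generic request onto the payload for its provider."""
    try:
        mapper = MAPPERS[req.provider.upper()]
    except KeyError:
        raise ValueError(f"unsupported provider: {req.provider}") from None
    return mapper(req)


payload = translate(CreateNodesRequest(provider="gce", nodes=2, image_id="tavaxy2"))
print(payload)
```

Keeping the provider-specific code behind one table means a new provider, for example an OpenStack-based cloud, needs only one additional mapper function.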
  • 30. Implementation of multi-cloud elasticHPC: The Cluster Manager handles all functions related to the creation and management of clusters at a given cloud site, including security settings and storage devices.
  • 31. Implementation of multi-cloud elasticHPC: The Job and Data Manager handles job submission and manages data transfer between the cluster's nodes and the different storage types (block and object storage).
  • 33. Experiments: Variant analysis workflow. The input is an exome dataset of size ≈ 9 GB, using BWA for read mapping, Picard for marking duplicates, and GATK for variant calling.
  • 34. Experiments, Experiment 1: The workflow was executed 3 times independently on: Google n1-highmem-8 (8 cores, 52 GB RAM, $0.452/hour); AWS m3.2xlarge (8 cores, 30 GB RAM, $0.56/hour); Azure Standard A7 (8 cores, 56 GB RAM, $1.00/hour). The 9 GB input data is divided into blocks to be processed in parallel over the cluster nodes.
  • 35. Experiments, Experiment 1: Google and Amazon have the same performance; Azure, on the other hand, has the worst performance. [Table: running times in minutes. "MarkD" stands for the mark duplicates step. The numbers between brackets are the cost in USD.]
  • 36. Experiments, Experiment 1: Note that the mark duplicates step shows no performance improvement when more nodes are added (increasing computing power), because Picard requires all reads to be in a single input set.
  • 37. Experiments, Experiment 2: The same input dataset is used, but with a stronger machine for the mark duplicates step: Amazon c3.8xlarge, which has 32 cores and 108 GB RAM and costs $1.68. Read mapping and variant calling run on a Google cluster of n1-highmem-8 machines (8 cores, 52 GB RAM, $0.452). [Figure: read mapping and variant calling on the Google cluster, mark duplicates on Amazon c3.8xlarge, transfer of the mapped BAM file, and upload of the VCF output file to object storage (S3/Google objects), numbered as steps 1-4.]
  • 38. Experiments, Experiment 2: Google always yields the lower cost when parallelization leads to running times that are fractions of an hour. The best cost with comparable performance for this three-step workflow is therefore obtained with a hybrid cloud of Amazon and Google. [Table: running times in minutes using a single provider and the multi-cloud scenario of two providers. The numbers between brackets are the cost in USD.]
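The effect of fractional hours on cost follows from the billing granularity: the talk states that Amazon charges per hour while Google charges per minute. A minimal sketch of the arithmetic is given below, using the hourly rates quoted for Experiment 1. The 75-minute running time and the 4-node cluster are hypothetical values chosen for illustration, and minimum charges or discounts are ignored.

```python
import math


def cost_billed_per_hour(minutes, hourly_rate, nodes=1):
    """Every started hour is charged in full."""
    return math.ceil(minutes / 60) * hourly_rate * nodes


def cost_billed_per_minute(minutes, hourly_rate, nodes=1):
    """Only the minutes actually used are charged."""
    return minutes * hourly_rate / 60 * nodes


# Hypothetical run: 75 minutes on 4 nodes.
runtime_minutes = 75
node_count = 4
aws_cost = cost_billed_per_hour(runtime_minutes, 0.56, node_count)      # m3.2xlarge rate
gce_cost = cost_billed_per_minute(runtime_minutes, 0.452, node_count)   # n1-highmem-8 rate
print(f"AWS: ${aws_cost:.2f}  Google: ${gce_cost:.2f}")
```

With per-hour billing the 75-minute run is charged as two full hours per node, so adding nodes to shorten a step into a fraction of an hour saves time but not money, whereas per-minute billing passes the saving through.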
  • 39. Conclusion: We introduced ElasticHPC, which creates and manages computer clusters over multiple cloud platforms for bioinformatics applications. Google and Azure offer a per-minute pricing model, whereas Amazon charges per hour. ElasticHPC enables the data analyst to use the cloud with the best offer at the time of analysis. elasticHPC opens the way for the development of more advanced layers for task scheduling and cost-time optimization. In future work, we will explore different ideas for using shared storage across multiple clouds as a shared file system.
  • 41. Availability and requirements: Project name: elasticHPC. Project home page: http://www.elastichpc.org. Operating system(s): Linux. Programming language: Python, C, JavaScript, HTML, Shell script. Other requirements: Compatible with the browsers Firefox, Chrome, Safari, and Opera. See the manual for more details. License: Free for academics. Authorization license needed for commercial usage (please contact the corresponding author for more details). Any restrictions to use by non-academics: No restrictions.
  • 42. Configuration file: sample Google-specific configuration section
    ################## BASIC SETTING FOR CLOUD PLATFORMS ##############
    [GCE]
    # GOOGLE COMPUTE ENGINE CONFIGURATION
    PROJECT_ID =
    ZONE = us-central1-a
    CLIENT_SECRET = config/client_secret.json
    COMPUTE_SCOPE = https://www.googleapis.com/auth/compute
    OAUTH_STORAGE = oauth2.dat
    IMAGE_PROJECT =
    SERVICE_EMAIL = default
    NETWORK = default
    SCOPES = https://www.googleapis.com/auth/devstorage.full_control
    API_VERSION = v1
    CLUSTER_CLIENT_KEY = keys/key
    ROOT_DISK=disks
  • 43. Configuration file: sample Azure-specific configuration section
    ################## BASIC SETTING FOR CLOUD PLATFORMS ##############
    [AZURE]
    # MICROSOFT WINDOWS AZURE CONFIGURATIONS
    SUBSCRIPTION_ID =
    THUMBPRINT =
    STORAGE_ACCOUNT =
    STORAGE_KEY =
    CERTIFICATE_PATH = mycert.pem
    PKFILE = mycert.cer
    CERT_DATA_PATH = mycert.pfx
    CERT_PASSWORD =
    REGION = WUS
    CONTAINER=newcontainer
  • 44. Configuration file: sample Amazon-specific configuration section
    ########## BASIC SETTING FOR CLOUD PLATFORMS ########
    [AWS]
    # AMAZON WEB SERVICES CONFIGURATIONS
    pkey= pk.pem
    cert= cert.pem
    accessKey=
    secretKey=
    keyPair= instance-key
    securityGroup =
    keyPairPath= instance-key.pem
    INSTANCE_TYPE = m3.medium
    MASTER_TYPE = m3.medium
    REGION = USW1
    ZONE = us-west-1c
  • 45. Configuration file: the clusters section defines multiple clusters, each of which has multiple machine sets; every machine set represents a cluster on a different cloud service provider.
    ###### DEFINE CLUSTERS #######
    [CLUSTERS]
    CLUSTERS_LIST= CLUSTER1, CLUSTER 2
    [CLUSTER1]
    ### CLUSTER 1 is hybrid cluster over multi-cloud
    # CLUSTER CONFIGURATION
    CLUSTER_NAME= cluster1
    CLUSTER_PREFIX = cluster1
    MachineSets=MachineSet2,MachineSet3,MachineSet1
    MASTER_NODE_LOCATION= MachineSet2
    NFS = True
    # NFS CONFIGURATION
    NFS_MOUNTING_POINT=/home
    NFS_DEVICE=/dev/xvdf
    NFS_FSID=0
    NFS_EBS_Mode=NEW_VOLUME # attach new volume
    NFS_NEW_VOLUME_SIZE=10
    # in case of attach an exist volume
    GLUSTER=False
    GLUSTER_MOUNT_POINT = /gluster/WGA/
    GLUSTER_VOLUME_NAME = gv0
    GLUSTER_STRIPE = 1
    GLUSTER_REPLICATE = 1
    GLUSTER_FORMAT_DISK = False
    [MachineSet1]
    NODES = 2
    PROVIDER = GCE
    # IMAGE CONFIGURATION
    IMAGE_ID = tavaxy2
    ……
    FIREWALL=ehpc,http2,apache2
    FW_PORTS=5000,8080,80
    FW_PROTOCOLS=tcp,tcp,tcp
    [MachineSet2]
    NODES = 0
    PROVIDER = AWS
    IMAGE_ID = ami-077d9a43
    ……..
    FW_PORTS=5000,8080,80
    FW_PROTOCOLS=tcp,tcp,tcp
    [MachineSet3]
    NODES=0
    Provider = AZURE
    IMAGE_ID = ehpc-generic26
    OS_URL =
    ……..
    FW_PORTS=5000,8080,80
    FW_PROTOCOLS=tcp,tcp,tcp
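Since the configuration uses an INI-style layout, the relation between a cluster and its machine sets can be explored with Python's standard `configparser`. The sketch below is an assumption about how such a file could be read, not elasticHPC's own parser: the helper `nodes_per_provider` is hypothetical, and the embedded sample keeps only a few keys from the cluster section shown on the slide.

```python
import configparser

# Reduced sample with a subset of the keys from the slide.
SAMPLE = """
[CLUSTERS]
CLUSTERS_LIST = CLUSTER1

[CLUSTER1]
CLUSTER_NAME = cluster1
MachineSets = MachineSet2,MachineSet3,MachineSet1
MASTER_NODE_LOCATION = MachineSet2

[MachineSet1]
NODES = 2
PROVIDER = GCE

[MachineSet2]
NODES = 0
PROVIDER = AWS

[MachineSet3]
NODES = 0
Provider = AZURE
"""


def nodes_per_provider(text, cluster):
    """Sum the NODES value of every machine set of a cluster, per provider."""
    # Option names are case-insensitive by default; "#" starts an inline comment.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(text)
    machine_sets = [name.strip() for name in parser[cluster]["MachineSets"].split(",")]
    counts = {}
    for name in machine_sets:
        section = parser[name]
        provider = section["PROVIDER"]
        counts[provider] = counts.get(provider, 0) + section.getint("NODES")
    return counts


summary = nodes_per_provider(SAMPLE, "CLUSTER1")
print(summary)
```

Whether the NODES count includes the master node placed by `MASTER_NODE_LOCATION` is not stated on the slide, so the sketch only sums the values as written.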