ElasticHPC supports the creation and management of cloud computing resources over multiple public cloud providers, including Amazon, Azure, Google, and clouds supporting OpenStack.
4. Cloud Computing
Cloud deployment models
Public Cloud
Private Cloud
Hybrid Cloud
Community Cloud
5. Cloud Computing
Advantages
Service automation and self-service models
Easy to deploy
A migration from CapEx to OpEx
Data recovery and backup
Disadvantages
Security issues
The user has no clue where his/her data is
Legacy system incompatibility
Higher operational cost for long-term usage
Advantages and disadvantages of cloud computing
6. Cloud Computing
Cloud Computing for Bioinformatics Applications
Some tools have already been developed for bioinformatics applications:
Crossbow
Myrna
CloudBurst
CloudBlast
Cloud-RNA
etc.
These tools were demonstrated on the cloud, but their techniques are not generic to other tools, and they support only Amazon Web Services.
7. Cloud Computing
Computer cluster middleware packages over the cloud
Middleware packages that support computer cluster management over the cloud:
StarCluster
Vappio
CloudMan
etc.
These middleware packages do not support running a computer cluster over multiple cloud providers.
9. Our contribution
[Figure: Cluster 1 and Cluster 2, each on a single provider (Non-Federated Cloud Cluster), vs. Cluster 3 spanning Provider 1 and Provider 2 (Federated Cloud Cluster)]
Our contribution is to extend bioinformatics applications to run over multiple clusters on different cloud service providers, supporting two types of compute cluster:
Non-Federated Cloud Cluster
Federated Cloud Cluster
10. Our contribution
ElasticHPC supports the creation and management of computer clusters for bioinformatics solutions on:
– Amazon Web Services
– Microsoft Windows Azure
– Google Compute Engine
– OpenStack based clouds
12. Use case scenarios
A simplified version of the variant analysis workflow based on NGS technology is used as the example for our use case scenarios.
The variant analysis workflow: the tools BWA, Picard, and GATK are usually used for the three steps of the workflow. On the arrows, we write the different file formats of the processed data.
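As a sketch, the three steps of this workflow can be expressed as command lines. The file names and tool options below are illustrative assumptions, not the exact parameters used in the experiments; the SAM-to-BAM conversion and sorting between the steps are omitted.

```python
# Illustrative command lines for the three workflow steps.
# File names (ref.fa, reads_1.fastq, ...) and options are assumptions,
# not the exact invocations from the experiments.

def variant_workflow_commands(ref="ref.fa", r1="reads_1.fastq",
                              r2="reads_2.fastq"):
    """Return shell commands for read mapping (BWA), duplicate
    marking (Picard), and variant calling (GATK)."""
    return [
        # 1) Read mapping: FASTQ -> SAM
        f"bwa mem {ref} {r1} {r2} > aligned.sam",
        # 2) Mark duplicates: BAM -> deduplicated BAM
        "java -jar picard.jar MarkDuplicates I=aligned.bam"
        " O=dedup.bam M=dup_metrics.txt",
        # 3) Variant calling: BAM -> VCF (GATK 3-style invocation)
        f"java -jar gatk.jar -T HaplotypeCaller -R {ref}"
        " -I dedup.bam -o variants.vcf",
    ]

cmds = variant_workflow_commands()
```

Each command consumes the previous step's output, which is what allows the steps to be placed on different clusters as long as the intermediate files are transferred between them.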
13. Multiple clusters over multiple clouds
Multiple independent clusters over multiple clouds, where each cluster processes part of the input data.
14. Multiple clusters over multiple clouds
Using this scenario depends on:
Whether there is a time constraint or not
Reducing the cost within a specific time (spot instances)
[Figure: Input Files 1 and 2 are processed by Clusters 1 and 2 in Cloud 1, and Input File 3 by Cluster 3 in Cloud 2; the output files are stored on object storage such as S3]
15. Multiple clusters over multiple clouds
Each cluster is created in one cloud and solves a step of the workflow.
16. Multiple clusters over multiple clouds
In the case of technical limitations: some technical specification prevents a step from running in one cloud, but the other steps can run in a cheaper cloud.
[Figure: the Read Mapping, Mark Duplicates, and Variant Calling steps run on Cluster 1 in Cloud 1, Cluster 2 in Cloud 2, and Cluster 3 in Cloud 3; the output files are stored on object storage such as S3]
17. One cluster of federated cloud machines
One cluster composed of different machines from different clouds, with one master job queue that dispatches the jobs among the nodes in the different clouds.
18. One cluster of federated cloud machines
[Figure: the master node in Cloud 1 is connected to the nodes in Cloud 2 through a persistent process and a communication layer]
The master job queue dispatches the jobs among the nodes in the different clouds; it works on the job level rather than on the whole (sub-)workflow level.
19. One cluster of federated cloud machines
• Using this scenario depends on:
• Whether the processing time differs from one job to another
• The characteristics of the processed data
• The Internet connection among the cloud sites
• Good management of the input data according to its characteristics
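A minimal sketch of this job-level dispatch, with Python threads standing in for nodes in different clouds; real nodes would pull jobs over the network through the communication layer, but the queue discipline is the same.

```python
# Sketch of a master job queue dispatching independent jobs to nodes
# in different clouds. Threads are stand-ins for remote worker nodes.
import queue
import threading

def dispatch(jobs, nodes):
    """Job-level dispatch: each node takes the next job as soon as it
    becomes free, regardless of which cloud it belongs to."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)
    done, lock = [], threading.Lock()

    def worker(node):
        while True:
            try:
                job = q.get_nowait()
            except queue.Empty:
                return                      # no jobs left for this node
            result = (node, job)            # "run" the job on this node
            with lock:
                done.append(result)
            q.task_done()

    threads = [threading.Thread(target=worker, args=(n,)) for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

# Example: 6 jobs spread over nodes in two clouds
results = dispatch([f"job{i}" for i in range(6)],
                   ["aws-node1", "gce-node1", "gce-node2"])
```

Because dispatch happens per job, a slow node simply takes fewer jobs, which is why this scenario suits workloads where processing time differs from one job to another.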
23. Implementation of multi-cloud elasticHPC
The three major commercial providers: Amazon, Azure, and Google
Amazon Web Services (AWS)
Execution model:
• Highest-CPU virtual machine of type c3.8xlarge (32 cores, 60 GB RAM, $1.68/hr)
Storage model:
• EBS "Elastic Block Store": hard disks and block devices
• S3 "Simple Storage Service": object storage
Pricing models:
• Pay as you go
• Reserved instances
• Spot instances
24. Implementation of multi-cloud elasticHPC
Microsoft Windows Azure
Execution model:
• Highest-CPU virtual machine of type A9 (16 cores, 112 GB RAM, $4.47/hr)
Storage models:
• Page blobs: hard disks and block devices, usable as a file system with a maximum size of 1 TB
• Block blobs, with a maximum size of 200 GB
Pricing models:
• Pay as you go ("pay per minute")
25. Implementation of multi-cloud elasticHPC
Google Compute Engine / Google Cloud
Execution model:
• Highest-CPU virtual machine of type n1-highmem-16 (16 cores, 104 GB RAM, $1.18/hr)
• Google also provides hard disks, snapshots, and images within the execution model
Storage models:
• Object storage
Pricing models:
• Pay as you go ("pay per minute")
• Sustained use discounts
30. Implementation of multi-cloud elasticHPC
Cluster Manager
Handles all functions related to the creation and management of clusters at a given cloud site, including security settings and storage devices.
31. Implementation of multi-cloud elasticHPC
Job and Data Manager
Handles job submission and data-transfer management between the cluster's nodes and the different storage types (block/object storage).
33. Experiments
Variant Analysis Workflow
Input exome dataset of size ≈ 9 GB
Using BWA for read mapping, Picard for marking duplicates, and GATK for variant calling
34. Experiments
Experiment 1
The workflow was executed 3 times independently on:
Google: n1-highmem-8 (8 cores, 52 GB RAM, $0.452/hour)
AWS: m3.2xlarge (8 cores, 30 GB RAM, $0.56/hour)
Azure: Standard A7 (8 cores, 56 GB RAM, $1.00/hour)
The 9 GB input dataset is divided into blocks to be processed in parallel over the cluster nodes.
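A sketch of how such an input can be divided into blocks: FASTQ records span four lines each, so the split must respect record boundaries for every node to map its block independently. The in-memory list and record counts here are illustrative, not the actual splitting code.

```python
# Sketch of dividing a FASTQ input into equal blocks for parallel
# read mapping. Splitting must not cut through a 4-line record.

def split_fastq(lines, n_blocks):
    """Split a list of FASTQ lines into at most n_blocks chunks of
    whole records (4 lines each)."""
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    per_block = -(-len(records) // n_blocks)   # ceiling division
    blocks = []
    for b in range(0, len(records), per_block):
        chunk = records[b:b + per_block]
        blocks.append([line for rec in chunk for line in rec])
    return blocks

# Example: 8 fake records (32 lines) split into 3 blocks
reads = [line for i in range(8)
         for line in (f"@r{i}", "ACGT", "+", "IIII")]
blocks = split_fastq(reads, 3)
```

In practice the split would stream the 9 GB file from disk rather than hold it in memory, but the record-boundary logic is the same.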
35. Experiments
Experiment 1
Google and Amazon have the same performance; on the other hand, Azure has the worst performance.
Running times in minutes. "MarkD" stands for the mark duplicates step. The numbers between brackets are the cost in USD.
36. Experiments
Experiment 1
Note that the mark duplicates step shows no performance improvement when adding more nodes (increasing computing power), because Picard requires all reads as one set of input.
37. Experiments
Experiment 2
Using the same input dataset, but with a stronger machine for the Mark Duplicates step on Amazon: a c3.8xlarge instance with 32 cores and 60 GB RAM, costing $1.68/hour. The Read Mapping and Variant Calling steps run on a Google cluster of n1-highmem-8 machines (8 cores, 52 GB RAM, $0.452/hour).
[Figure: the mapped BAM file is transferred from the Google cluster to Amazon for the Mark Duplicates step and back for Variant Calling; the VCF output file is uploaded to object storage (S3/Google objects)]
38. Experiments
Experiment 2
Google will always achieve a better cost when the parallelization leads to fractions of an hour. So the best cost with comparable performance for this three-step workflow is obtained when we use a hybrid cloud of Amazon and Google.
Running times in minutes using a single provider and the multi-cloud scenario of two providers. The numbers between brackets are the cost in USD.
39. Conclusion
Introducing ElasticHPC, which creates and manages computer clusters over multiple cloud platforms for bioinformatics applications
Google and Azure offer a "charge per minute" pricing model
Amazon charges per hour as its pricing model
ElasticHPC enables the data analyst to use the cloud with the best offer at the time of analysis
elasticHPC opens the way for the development of more advanced layers for task scheduling and cost-time optimization
Future work: we will include different ideas to use shared storage from multiple clouds as a shared file system
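The per-minute vs. per-hour distinction can be made concrete with a small cost calculation; the rates are the historical on-demand prices quoted in Experiment 1, and the 70-minute running time is an illustrative assumption.

```python
# Why billing granularity matters when a step finishes in a
# fraction of an hour. Rates are the historical prices from
# Experiment 1; the 70-minute duration is illustrative.
import math

def cost(minutes, rate_per_hour, per_minute_billing):
    """Cost of one machine running for `minutes` under the given
    billing granularity."""
    if per_minute_billing:            # Google/Azure: charge per minute
        return rate_per_hour * minutes / 60.0
    # Amazon (at the time): round up to whole hours
    return rate_per_hour * math.ceil(minutes / 60.0)

# A 70-minute step: Amazon bills 2 full hours, Google bills 70 minutes
aws = cost(70, 0.56, per_minute_billing=False)   # 2 h at $0.56/h
gce = cost(70, 0.452, per_minute_billing=True)   # 70 min at $0.452/h
```

This is why parallelization that shortens each step below a whole hour favors the per-minute providers, as observed in Experiment 2.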
41. Availability and requirements
• Project name: elasticHPC.
• Project home page: http://www.elastichpc.org.
• Operating system(s): Linux.
• Programming languages: Python, C, JavaScript, HTML, shell script.
• Other requirements: compatible with the browsers Firefox, Chrome, Safari, and Opera. See the manual for more details.
• License: Free for academics. Authorization license needed
for commercial usage (Please contact the corresponding
author for more details).
• Any restrictions to use by non-academics: No restrictions.
42. Configurations file
################## BASIC SETTING FOR CLOUD PLATFORMS ##############
[GCE]
# GOOGLE COMPUTE ENGINE CONFIGURATION
PROJECT_ID =
ZONE = us-central1-a
CLIENT_SECRET = config/client_secret.json
COMPUTE_SCOPE = https://www.googleapis.com/auth/compute
OAUTH_STORAGE = oauth2.dat
IMAGE_PROJECT =
SERVICE_EMAIL = default
NETWORK = default
SCOPES = https://www.googleapis.com/auth/devstorage.full_control
API_VERSION = v1
CLUSTER_CLIENT_KEY = keys/key
ROOT_DISK=disks
Configuration file sample
Google-specific configuration section
43. Configurations file
################## BASIC SETTING FOR CLOUD PLATFORMS ##############
[AZURE]
# MICROSOFT WINDOWS AZURE CONFIGURATIONS
SUBSCRIPTION_ID =
THUMBPRINT =
STORAGE_ACCOUNT =
STORAGE_KEY =
CERTIFICATE_PATH = mycert.pem
PKFILE = mycert.cer
CERT_DATA_PATH = mycert.pfx
CERT_PASSWORD =
REGION = WUS
CONTAINER=newcontainer
Configuration file sample
Azure-specific configuration section
44. Configurations file
########## BASIC SETTING FOR CLOUD PLATFORMS ########
[AWS]
# AMAZON WEB SERVICES CONFIGURATIONS
pkey= pk.pem
cert= cert.pem
accessKey=
secretKey=
keyPair= instance-key
securityGroup =
keyPairPath= instance-key.pem
INSTANCE_TYPE = m3.medium
MASTER_TYPE = m3.medium
REGION = USW1
ZONE = us-west-1c
Configuration file sample
Amazon-specific configuration section
45. Configurations file
###### DEFINE CLUSTERS #######
[CLUSTERS]
CLUSTERS_LIST= CLUSTER1, CLUSTER2
[CLUSTER1] ### CLUSTER1 is a hybrid cluster over multi-cloud
# CLUSTER CONFIGURATION
CLUSTER_NAME= cluster1
CLUSTER_PREFIX = cluster1
MachineSets=MachineSet2,MachineSet3,MachineSet1
MASTER_NODE_LOCATION= MachineSet2
NFS = True
# NFS CONFIGURATION
NFS_MOUNTING_POINT=/home
NFS_DEVICE=/dev/xvdf
NFS_FSID=0
NFS_EBS_Mode=NEW_VOLUME
# attach new volume
NFS_NEW_VOLUME_SIZE=10
# in case of attach an exist volume
GLUSTER=False
GLUSTER_MOUNT_POINT = /gluster/WGA/
GLUSTER_VOLUME_NAME = gv0
GLUSTER_STRIPE = 1
GLUSTER_REPLICATE = 1
GLUSTER_FORMAT_DISK = False
The Clusters section defines multiple clusters, where each one has multiple machine sets; every machine set represents a sub-cluster on a different cloud service provider.
[MachineSet1]
NODES = 2
PROVIDER = GCE
# IMAGE CONFIGURATION
IMAGE_ID = tavaxy2
……
FIREWALL=ehpc,http2,apache2
FW_PORTS=5000,8080,80
FW_PROTOCOLS=tcp,tcp,tcp
[MachineSet2]
NODES = 0
PROVIDER = AWS
IMAGE_ID = ami-077d9a43
……..
FW_PORTS=5000,8080,80
FW_PROTOCOLS=tcp,tcp,tcp
[MachineSet3]
NODES=0
PROVIDER = AZURE
IMAGE_ID = ehpc-generic26
OS_URL =
……..
FW_PORTS=5000,8080,80
FW_PROTOCOLS=tcp,tcp,tcp
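As a sketch, a configuration file in this INI format can be read with Python's standard `configparser` module; the shortened sample below reuses the section and key names from the file above, and `load_machine_sets` is a hypothetical helper, not elasticHPC's actual loader.

```python
# Sketch of reading the cluster definitions with configparser.
# SAMPLE is a shortened copy of the configuration file above;
# load_machine_sets is an illustrative helper.
import configparser

SAMPLE = """
[CLUSTERS]
CLUSTERS_LIST = CLUSTER1

[CLUSTER1]
CLUSTER_NAME = cluster1
MachineSets = MachineSet2,MachineSet3,MachineSet1
NFS = True

[MachineSet1]
NODES = 2
PROVIDER = GCE
"""

def load_machine_sets(text):
    """Map each declared cluster to its list of machine-set names."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    clusters = [c.strip()
                for c in cfg["CLUSTERS"]["CLUSTERS_LIST"].split(",")]
    return {cluster: [m.strip()
                      for m in cfg[cluster]["MachineSets"].split(",")]
            for cluster in clusters}
```

Each machine-set name then indexes its own section ([MachineSet1], ...), whose PROVIDER key selects the cloud-specific settings ([GCE], [AWS], or [AZURE]) defined earlier in the file.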