HPC Clusters in the (almost) Infinite Cloud

HPC Clusters in the [almost]* Infinite cloud
Global Scientific Computing
2016-04-21
*Does not apply to mathematicians with specialties in Cantorian set theory who should immediately ask for a copy of my very long disclaimer.

Science is one of the greatest areas of
computation
Amazon
huge, really disruptive, impact
what we are about

Team
The Scientific Computing (SciCo) team is a global group
of scientists and specialists from Amazon Web Services.
We’re responsible for making the sure the cloud continually
innovates in ways that benefitthe global community of
researchers from whom we draw our inspiration.
Our aim is to bring the revolutionary benefits of agility and
extreme scale to this community so we can all keep making
the discoveries that will change the world and impact the
lives of everyone on our planet.
We have team members from physics, astronomy,
aeronautical engineering,and genomics and all have
extensive experience in research and high performance
computing.We even have a rocket scientist.

Peak: 58K cores
Valley: 12K cores

“… the online book and decorative pillowsellerAmazon.com
swooped in and, in 2006, launchedits own computer rental system—
the future Amazon WebServices. The once-fledgling servicehas
since turned cloud computinginto a mainstream phenomenon …”
Source: Bloomberg Business - April 22, 2015
$7B retail business
10,000 employees
A whole lotof
servers
2006 2015
Every day,AWS
adds enough server
capacity to power
this $7B enterprise

Existing
1. Oregon
2. California
3. Virginia
4. Dublin
5. Frankfurt
6. Singapore
7. Sydney
8. Seoul
9. Tokyo
10. Sao Paulo
11. Beijng
12. US GovCloud
1. Ohio
2. India
3. UK
4. Canada
5. China+1
AWS Region
Availability Zone
regions are sovereign your data never
leaves

Map of scientific collaboration between researchers - Olivier H. Beauchesne - http://bit.ly/e9ekP2
Science means Collaboration

More time spentcomputing the data than moving the data.

Disrupting science, wherever it’s happening.

9600 Strips (~80TB) to be delivered to GSFC
~1600 strips (~20TB) at GSFC
Area Of Interest (AOI) for Sub-Saharan Arid and Semi-arid Africa

Meeeeelions
cores
time
Collective action
When everyone comes
together in the cloud to
share the resource, and
only pays for what they
use, the efficiency is
huge.

Spot Market
cores
time
Spot Market
Our ultimate space
filler.
Spot Instancesallow you
to name your own price
for spare AWS EC2
computing capacity.
Great for workloads that
aren’t time sensitive, and
especially popular in
research (hint: it’s really
cheap).

Spot Bid Advisor
The Spot Bid Advisor
analyzes Spot price history
to help you determine a bid
price that suits your needs.
You should weigh your
application’s tolerance for
interruption and your cost
saving goals when selecting
a Spot instance and bid
price.
The lower your frequency of
being outbid,the longer
your Spot instances are
likely to run without
interruption.
https://aws.amazon.com/ec2/spot/bid-advisor/
Bid Price & Savings
Your bid price affects your
ranking when it comes to
acquiring resources in the
SPOT market, and is the
maximum price you will pay.
But frequently you’ll pay a
lot less.

https://aws.amazon.com/ec2/spot/bid-advisor/
Spot Bid Advisor
As usual with AWS,
anything you can do with
the web console,you can
do with an API or command
line.
bash-3.2$ python get_spot_duration.py
--region us-east-1
--product-description 'Linux/UNIX'
--bids c3.large:0.05,c3.xlarge:0.105,c3.2xlarge:0.21,c3.4xlarge:0.42,c3.8xlarge:0.84
Duration Instance Type Availability Zone
168.0 c3.8xlarge us-east-1a
168.0 c3.8xlarge us-east-1d
168.0 c3.8xlarge us-east-1e
168.0 c3.4xlarge us-east-1b
168.0 c3.xlarge us-east-1d
168.0 c3.xlarge us-east-1e
168.0 c3.large us-east-1b
168.0 c3.large us-east-1d
168.0 c3.large us-east-1e
117.7 c3.large us-east-1a
23.0 c3.xlarge us-east-1b
0.1 c3.xlarge us-east-1a

Wall clock time: ~1 hour Wall clock time: ~1 week
Cost: the same

When you only pay for what you use …
• If you’re only able to use your compute, say, 30%
of the time, you only pay for that time.
1
Pocket the savings
• Buy chocolate
• Buy a spectrometer
• Hire a scientist.
2 Go faster
• Use 3x the cores to
run your jobs at 3x
the speed.
3
Go Large
• Do 3x the science,
or consume 3x the
data.
…youhaveoptions.

39 years of computational chemistry in 9 hours
Novartis ran a project thatinvolved virtually screening10 million
compounds against a commoncancer target in less than a week. They
calculated that it wouldtake50,000 cores andclose toa $40million
investment if they wanted to runthe experimentinternally.
Partnering with Cycle Computing and Amazon Web
Services (AWS), Novartis built a platform thst ran
across 10,600 Spot Instances (~87,000 cores) and
allowed Novartis to conduct 39 years of
computational chemistry in 9 hours for a cost of
$4,232. Out of the 10 million compounds screened,
three were successfully identified.

CHILES will produce the first HI deep field, to be carried out with the VLA in B
array and covering a redshift rangefrom z=0 to z=0.45. The field is centered
at the COSMOS field. It will produce neutral hydrogen images of at least 300
galaxies spread over the entire redshiftrange.
The team at ICRAR in Australia have been able to implement theentire
processing pipeline in the cloud for around $2,000 per month by exploitingthe
SPOT market, which means the $1.75M they otherwise neededto spend on
an HPC cluster can be spent on way cooler things that impact their research
… like astronomers.

SKA
https://aws.amazon.com/blogs/aws/new-astrocompute-in-the-cloud-grants-program/
AWS is providing a grants pool of AWS credits and up
to one petabyteof storagefor an AWS Public Data Set
in order to support the development of an “Appstorefor
Astronomy”.
The data set will be initially provided by several of the
SKA’s precursor telescopes includingCSIRO’s ASKAP,
ICRAR’s MWA in Australia, and KAT-7 (pathfinder to
the SKA precursor telescope Meerkat) in South Africa.
The grants are open to anyone who is making use of
radio astronomical telescopes or radio astronomical
data resources aroundthe world.
The grants will be administered by the SKA. They will
be looking for innovating,cloud-basedalgorithms and
tools that will be able to handle andprocess this never
ending data stream. Youcan also read their post on
Seeing Stars Through theCloud.

not
http://blog.csiro.au/wtf-is-that-how-were-trawling-the-universe-for-the-unknown/
WTF’s cloud-based backendis hosted on
Amazon Web Services servers, where the
researchers are able to access software for
data reduction, calibration andviewing right
from their desktop. The team is currently
issuing a challenge using datapeppered
with “EMU (Easter) Eggs” – objects that
might pose a challenge to datamining
algorithms.
This way they hope to train the system to
recognise things that systematically depart
from known categories of astronomical
objects, to help better prepare for
unanticipateddiscoveries that would
otherwise remain hidden.

“The Zooniverse is heavily reliant on Amazon Web
Services (AWS), particularly Elastic Compute
Cloud (EC2) virtual private servers and Simple
Storage Service (S3) data storage. AWS is the
most cost-effective solution for the dynamic needs
of Zooniverse’s infrastructure …”
http://wwwconference.org/proceedings/www2014/companion/p1049.pdf
The World’s LargestCitizen Science Platform
… cost is a factor – running a central API means that whenthe Zooniverse is quiet and
there aren’t many people aboutwe can scale backthe number of servers we’re running
(automagically on Amazon Web Services) to a minimal level.

AWS Building blocks
TECHNICAL &
BUSINESS
SUPPORT
Account
Management
Support
Professional
Services
Solutions
Architects
Training &
Certification
Security
& Pricing
Reports
Partner
Ecosystem
AWS
MARKETPLACE
Backup
Big Data
& HPC
Business
Apps
Databases
Development
Industry
Solutions
Security
MANAGEMENT
TOOLS
Queuing
Notifications
Search
Orchestration
Email
ENTERPRISE
APPS
Virtual
Desktops
Storage
Gateway
Sharing &
Collaboration
Email &
Calendaring
Directories
HYBRID CLOUD
MANAGEMENT
Backups
Deployment
Direct
Connect
Identity
Federation
Integrated
Management
SECURITY &
MANAGEMENT
Virtual Private
Networks
Identity &
Access
Encryption
Keys
Configuration Monitoring Dedicated
INFRASTRUCTURE
SERVICES
Regions
Availability
Zones
Compute
Storage
Objects,
Blocks, Files
Databases
SQL, NoSQL,
Caching
CDNNetworking
PLATFORM
SERVICES
App
Mobile
& Web
Front-end
Functions
Identity
Data Store
Real-time
Development
Containers
Source
Code
Build
Tools
Deploymen
t
DevOps
Mobile
Sync
Identity
Push
Notifications
Mobile
Analytics
Mobile
Backend
Analytics
Data
Warehousing
Hadoop
Streaming
Data
Pipelines
Machine
Learning

EC2
http://aws.amazon.com/ec2/instance-types/
There’s a couple dozen
EC2 compute instance
types alone, each of
which is optimized for
different things.
One size does not fit
all.

C4Intel Xeon E5-2666 v3, custom built for AWS.
Intel Haswell, 16 FLOPS/tick
2.9 GHz, turbo to 3.5 GHz
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html
Feature Specification
Processor Number E5-2666 v3
Intel® Smart Cache 25 MiB
Instruction Set 64-bit
Instruction Set Extensions AVX 2.0
Lithography 22 nm
Processor Base Frequency 2.9 GHz
Max All Core Turbo Frequency 3.2 GHz
Max Turbo Frequency 3.5 GHz (available on c4.2xLarge)
Intel® Turbo Boost Technology 2.0
Intel® vPro Technology Yes
Intel® Hyper-Threading Technology Yes
Intel® Virtualization Technology (VT-x) Yes
Intel® Virtualization Technology for Directed
I/O (VT-d)
Yes
Intel® VT-x with Extended Page Tables (EPT) Yes
Intel® 64 Yes

API
Java
Python
Ruby
PHP
Shell
…
… and in most popular languages.
http://boto.readthedocs.org/en/latest/
Almost anything you can do in the GUI, you can do on the command line.

Anything you can do in the GUI, you can do on the command line.
as-create-launch-config spotlc-5cents
--image-id ami-e565ba8c
--instance-type m1.small
--spot-price “0.05”
. . .
as-create-auto-scaling-group spotasg
--launch-configuration spotlc-5cents
--availability-zones “us-east-1a,us-east-1b”
--max-size 16
--min-size 1
--desiredcapacity 3

cfnCluster - provision an HPC cluster in minutes
#cfncluster
https://github.com/awslabs/cfncluster
cfncluster is a sample code framework that deploys and maintains clusters on AWS. It is reasonably
agnostic to what the cluster is for and can easily be extended to support different frameworks. The
CLI is stateless, everything is done using CloudFormation or resources within AWS.
10 minutes
http://boofla.io/u/cfnCluster – (Boof’s HOWTO slides)

Head
node
Instance
Compute
node
Instance
Compute
node
Instance
Compute
node
Instance
Compute
node
Instance
10G Network
Auto-scaling group
Virtual Private Cloud
/shared
Head Instance
• 2 or more cores (as needed)
• CentOS 6.x, 7.x, Amazon Linux
• OpenMPI, gcc etc…
Choice of scheduler:
• Torque, SGE, OpenLava,
SLURM
Compute Instances
• 2 or more cores (as needed)
• CentOS 6.x, 7.x, Amazon Linux
Auto Scaling group driven by scheduler
queue length.
Can start with 0 (zero) nodes and only
scale when there are jobs.
Compute
Node
Instance
Compute
Node
Instance
Compute
Node
Instance
Compute
Node
Instance
Compute
Node
Instance
Clusters in the Cloud

Infrastructure as code
#cfncluster
The creation process might take a few minutes (maybe
up to 5 mins or so, depending on how you configuredit.
Because the API to Cloud Formation(the service that
does all the orchestration) is asynchronous, we can kill
the terminal session if we wanted to and watch thewhole
show from the AWS console (where you’ll find it all under
the “Cloud Formation”dashboard in the events tabfor this
stack.
$ cfnCluster create boof-cluster
Starting: boof-cluster
Status: cfncluster-boof-cluster - CREATE_COMPLETE Output:"MasterPrivateIP"="10.0.0.17"
Output:"MasterPublicIP"="54.66.174.113"
Output:"GangliaPrivateURL"="http://10.0.0.17/ganglia/"
Output:"GangliaPublicURL"="http://54.66.174.113/ganglia/"

Yes, it’s a real HPC cluster
#cfncluster
Now you have a cluster, probably running CentOS 6.x, with Sun Grid Engine as a default scheduler, and openMPI and
a bunch of other stuff installed. You also have a shared filesystem in /shared and an autoscaling group ready to
expand the number of compute nodes in the cluster when the existing ones get busy.
You can customize quite a lot via the .cfncluster/config file - check out the comments.
arthur ~ [26] $ cfnCluster create boof-cluster
Starting: boof-cluster
Status: cfncluster-boof-cluster - CREATE_COMPLETE
Output:"MasterPrivateIP"="10.0.0.17"
Output:"MasterPublicIP"="54.66.174.113"
Output:"GangliaPrivateURL"="http://10.0.0.17/ganglia/"
Output:"GangliaPublicURL"="http://54.66.174.113/ganglia/"
arthur ~ [27] $ ssh ec2-user@54.66.174.113
The authenticity of host '54.66.174.113 (54.66.174.113)' can't be established.
RSA key fingerprint is 45:3e:17:76:1d:01:13:d8:d4:40:1a:74:91:77:73:31.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '54.66.174.113' (RSA) to the list of known hosts.
[ec2-user@ip-10-0-0-17 ~]$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvda1 10185764 7022736 2639040 73% /
tmpfs 509312 0 509312 0% /dev/shm
/dev/xvdf 20961280 32928 20928352 1% /shared
[ec2-user@ip-10-0-0-17 ~]$ qhost
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
global - - - - - - - - - -
ip-10-0-0-136 lx-amd64 8 1 4 8 - 14.6G - 1024.0M -
ip-10-0-0-154 lx-amd64 8 1 4 8 - 14.6G - 1024.0M -
[ec2-user@ip-10-0-0-17 ~]$ qstat
[ec2-user@ip-10-0-0-17 ~]$
[ec2-user@ip-10-0-0-17 ~]$ ed hw.qsub
hw.qsub: No such file or directory
a
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -pe mpi 2
#$ -S /bin/bash
#
module load openmpi-x86_64
mpirun -np 2 hostname
.
w
110
q
[ec2-user@ip-10-0-0-17 ~]$ ll
total 4
-rw-rw-r-- 1 ec2-user ec2-user 110 Feb 1 05:57 hw.qsub
[ec2-user@ip-10-0-0-17 ~]$ qsub hw.qsub
Your job 1 ("hw.qsub") has been submitted
[ec2-user@ip-10-0-0-17 ~]$
job-ID prior name user state submit/start at queue
slots ja-task-ID
--------------------------------------------------------------------------
----------------------
1 0.55500 hw.qsub ec2-user r 02/01/2015 05:57:25
all.q@ip-10-0-0-44.ap-southeas 2
[ec2-user@ip-10-0-0-17 ~]$ ls -l
total 8
-rw-rw-r-- 1 ec2-user ec2-user 110 Feb 1 05:57 hw.qsub
-rw-r--r-- 1 ec2-user ec2-user 26 Feb 1 05:57 hw.qsub.o1
[ec2-user@ip-10-0-0-17 ~]$ cat hw.qsub.o1
ip-10-0-0-136
ip-10-0-0-154
[ec2-user@ip-10-0-0-17 ~]$

System-wide Upgrade from Ivy Bridge to Haswell
#cfncluster
$ ed ~/.cfncluster/config
/compute_instance_type/
compute_instance_type = c3.8xLarge
s/c3/c4/p
w
949
$ cfncluster update boof-cluster
Downgrading is just as easy. Honest.
really

cfnCluster options to explore …
#cfncluster
# (defaults to t2.micro for default template)
# Master Server EC2 instance type
# (defaults to t2.micro for default template
master_instance_type = c4.4xLarge
# Inital number of EC2 instances to launch as compute nodes in the cluster.
# (defaults to 2 for default template)
initial_queue_size = 0
# Maximum number of EC2 instances that can be launched in the cluster.
# (defaults to 10 for the default template)
max_queue_size = 64
# Boolean flag to set autoscaling group to maintain initial size and scale back
# (defaults to false for the default template)
maintain_initial_size = true
# Cluster scheduler
# (defaults to sge for the default template)
scheduler = sge
# Type of cluster to launch i.e. ondemand or spot
# (defaults to ondemand for the default template)
cluster_type = spot
# Spot price for the ComputeFleet
spot_price = 0.12
# Cluster placement group. This placement group must already exist.
# (defaults to NONE for the default template)
#placement_group = NONE

Spending & Budgeting
#cfncluster
It’s easy to see very quickly how you’ve spent, and on what services …

Spending Control 2/2
#cfncluster
… and even easier to set limits and alerts, based on your budget.

Adrian White
Adrian White leads our Scientific Computing
business across Asia. He joined AWS at it’s
inception in Australia in 2012 and spent several
years as a Solutions Architect, helping hundreds
of customers build highly optimized solutions in
the cloud, and teaching thousands more through
his workshops at our global summits across Asia.
Adrian comes from a background deep in
software development as well as application and
content delivery and deployment. This means
he's very happy talking about automation,
application deployment and software
development in the cloud with almost anyone
and can demystify almost anything.
He holds a double major in Computer Science
and Physics from the University of New South
Wales in Australia, and has never really fully
recovered.
BIO

Angel Pizarro
Angel Pizarro had 15 years of experience in
bioinformatics (the computational side of
biomedical research) prior to joining AWS in
2013. Angel is the global lead SciCo for genomics
and life science research computing initiatives at
AWS. The goal of the initiatives are to accelerate
the adoption of cloud computing in the global
community of biomedical researchers.
Prior to AWS, Angel led the Bioinformatics and
HPC resources at University of Pennsylvania
Medical School.
BIO

Anna Greenwood
Anna Greenwood joined AWS in 2015 to manage
the AWS Cloud Credits for Research Program.
Through this program, AWS provides credits to
enable customers to pilot their research
workloads on AWS and to develop tools that
advance scientific research.
Prior to joining AWS, Anna was a research
scientist at Fred Hutchinson Cancer Research
Center where she led research projects related to
genomics and neuroscience. In some circles she
is known by the prestigious title “The MacGyver
of Fish Neuroscience”. Anna has a PhD in
Neuroscience from Stanford University and a BS
from Rutgers University.
BIO

Brendan Bouffler (“boof”)
Brendan Bouffler has 20 years of experience in
the global IT industry dealing with very large
systems in high performance environments. He
has been responsible for designing and building
hundreds of HPC systems for commercial
enterprises as well as research and defence
sectors all around the world and has quite a
number of his efforts listed in the top500,
including some that have placed inthe top 5.
Brendan previously lead the HPC Organization for
Dell in Asia and joined Amazon in 2014 to help
accelerate the adoption of cloud computing in
the scientific community globally. He holds a
degree in Physics and an interest in testing
several of its laws as they apply to bicycles. This
has frequently resultedin hospitalization.
BIO

Jeff Layton
Dr. Jeffrey Layton is an official rocket scientist and
has been kicking round in the HPC industry for more
than 25 years.
Jeff has served in roles as a Professor, Engineer and
Scientist at Boeing, Lockheed Martin, NASA, and
Clarkson University, and has led technical efforts for
High Performance Computing companies such as
Intel, Panasas, and Dell.
He’s been a cluster builder, user and code monkey, a
cluster admin, as well as a systems engineer,
manager, and benchmark engineer for HPC vendors.
He is also an active contributor to multiple open
source projects and writes for technical publications,
magazines, books, and websites.
His Ph.D. is from Purdue in Aeronautical and
Astronautical Engineering.
BIO

Kevin Jorissen
Kevin Jorissen has 10 years of experience in
computational science. He developed software
solving the quantum physics equations for light
absorption by materials, taught workshops to
scientists worldwide, and wrote about high
performance computing in the cloud before it
was fashionable. He worked as a postdoctoral
researcher in Antwerp, Lausanne, Seattle, and
Zurich.
Kevin joined Amazon in 2015 to help accelerate
the adoption of cloud computing in the scientific
community globally. He holds a Ph.D. in Physics,
but feels equally proud of growing parsnips,
surviving a night bus from Darjeeling to Kolkata,
and adapting to medium-spicy Sichuan food. He
thinks a five-hour run in the Cascade mountains
is a fine wayto spend a free Saturday.
BIO

Michael Kokorowski
Dr. Michael Kokorowski is a space geek with
half a decade spent at NASA’s Jet Propulsion
Laboratory and another half a dozen years at the
University of Washington in Seattle studying
magnetospheric physics.
Michael is obsessed with getting the most
science bang for the buck. He has experience
managing complex scientific projects, delivering
results for NASA and NSF.
He has experiments that have flown to Earth’s
stratosphere, were lost on the Antarctic ice
sheet, sit atop a volcano in Hawaii, and send data
over a laser link from the International Space
Station.
BIO

Sanjay PadhiBIO
Dr. Sanjay Padhi High Energy
Physics
co-creator of the HTCondor
simulation activities by CMS
before joining CERN

Traci Ruthowski
Traci’s research interests are at the intersection
of applied technology and Earth Sciences.
Currently Funded by NSF, Traci is exploring the
use of disruptive technology, including the
Internet of Things (IoT), as means to enable
research inthe Alaskan Arctic.
Prior to AWS, Traci held several roles at The
University of Michigan including: Enterprise
Architect for the Office of the Vice President for
Research, HPC Lead for medical research, and
Solutions Developer of clinical treatment
systems. Traci is the author of the book, Google
VisualizationAPI Essentials.
Traci has a Graduate Certification in Business
Administration from Eastern Michigan University
and also holds a BS in Statistics as well as a BFA
in Performing Arts Technology from the
University of Michigan.
BIO

HPC Clusters in the (almost) Infinite Cloud

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (20)

Ähnlich wie HPC Clusters in the (almost) Infinite Cloud

Ähnlich wie HPC Clusters in the (almost) Infinite Cloud (20)

Mehr von Amazon Web Services

Mehr von Amazon Web Services (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

HPC Clusters in the (almost) Infinite Cloud