© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Nathan McGuirt, Manager, Solutions Architecture, AWS
Gabriele Garzoglio, HEP Cloud Facility Project Manager, Fermilab
December 2016
Building HPC Clusters as Code
in the (Almost) Infinite Cloud
CMP318
What to Expect from the Session
• Why customers are using AWS for HPC/HTC
• Leveraging Spot Instances for big compute at low cost
• Accelerating deployment with automation and managed
services
Agenda
• Why AWS for HPC?
• Automating cluster deployment
• Fermi National Accelerator Laboratory
• Demo of scaling jobs on a budget
High Performance Computing (HPC) vs.
High Throughput Computing (HTC)
HPC: High performance computing
(cluster computing)
- Tightly clustered
- Latency sensitive
HTC: High throughput computing
(grid computing)
- Less inter-node communication
- More horizontal scalability (pleasingly
parallel)
Why AWS for HPC?
Time to research
Innovation and performance
Scalability and flexibility
Data
AWS Snowball AWS Direct Connect
Cost
Cost – Spot market
Request  Bid Price  Spot Price
1        $1.00      $0.20
2        $0.55      $0.20
3        $0.50      $0.20
4        $0.33      $0.20
5        $0.20      $0.20
6        $0.18
7        $0.15
8        $0.10
9        $0.05
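The bid and Spot prices listed above can be read as a simple clearing rule. The following is a minimal sketch, assuming that every request whose bid is at or above the current Spot price is fulfilled and pays the Spot price rather than its own bid. The function name `fulfilled_requests` and the exact form of the rule are illustrative assumptions, not the actual Spot market implementation.

```python
def fulfilled_requests(bids, spot_price):
    """Return (request_number, price_paid) for bids at or above the spot price."""
    return [
        (number, spot_price)
        for number, bid in enumerate(bids, start=1)
        if bid >= spot_price
    ]


# Bid prices from the slide, in request order 1..9.
bids = [1.00, 0.55, 0.50, 0.33, 0.20, 0.18, 0.15, 0.10, 0.05]
winners = fulfilled_requests(bids, spot_price=0.20)
print(winners)
```

Under this assumption, requests 1 through 5 are fulfilled at $0.20 each, which matches the five Spot Price entries on the slide.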
Spot Bid Advisor
Spot Fleet
Clusters as code
Automation
• Fully custom
• APIs
• AWS CloudFormation
• Managed services
• Amazon EMR
• AWS Batch
• Software cluster management solutions
• CFNCluster
• Alces Flight
• Partner offerings
API - SDKs
• Java, Python, PHP, .NET, Ruby, Node.js
• iOS, Android
• AWS Toolkit for Visual Studio
• AWS Toolkit for Eclipse
• Tools for Windows PowerShell
• CLI
CloudFormation
Resources:
  Ec2Instance:
    Type: AWS::EC2::Instance
    Properties:
      SecurityGroups:
        - Ref: InstanceSecurityGroup
      KeyName: mykey
      ImageId: ''
  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable SSH access via port 22
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: '22'
          ToPort: '22'
          CidrIp: 0.0.0.0/0
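Because the template above is plain data, it can also be generated from code. The sketch below builds the same two resources as a Python dictionary and serializes them to JSON. The helper name `build_template` and the placeholder image ID `ami-EXAMPLE` are assumptions for illustration; the slide leaves `ImageId` empty.

```python
import json


def build_template(key_name, image_id):
    """Build the two-resource template from the slide as a Python dict."""
    return {
        "Resources": {
            "Ec2Instance": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "SecurityGroups": [{"Ref": "InstanceSecurityGroup"}],
                    "KeyName": key_name,
                    "ImageId": image_id,
                },
            },
            "InstanceSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "Enable SSH access via port 22",
                    "SecurityGroupIngress": [
                        {
                            "IpProtocol": "tcp",
                            "FromPort": "22",
                            "ToPort": "22",
                            "CidrIp": "0.0.0.0/0",
                        }
                    ],
                },
            },
        }
    }


# "ami-EXAMPLE" is a hypothetical placeholder, not a real image ID.
template_body = json.dumps(build_template("mykey", "ami-EXAMPLE"), indent=2)
print(template_body)
```

A string produced this way could then be submitted to CloudFormation as a template body, for example through one of the SDKs or the CLI.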
EMR
AWS Batch
AWS CFNCluster
$ pip install cfncluster
...
$ cfncluster configure
...
$ cfncluster create mycluster
Alces Flight
Alces Flight is software that offers self-service
supercomputers through the AWS Marketplace.
It creates self-scaling clusters with more than
750 popular scientific applications pre-installed,
complete with libraries and various compiler
optimizations, ready to run. The clusters use
AWS Spot Instances by default.
AWS Partners in the HPC Space
Gabriele Garzoglio, HEP Cloud Facility Co-Project Manager, Fermilab
The HEP Cloud Facility
Elastic Computing for High Energy Physics
Computing at the Fermi National Accelerator Laboratory
Lead United States particle physics laboratory
• Funded by the Department of Energy
• ~100 PB of data on tape
• High Throughput Computing characterized by:
• “Pleasingly parallel” tasks
• High CPU instruction / Bytes IO ratio
• But still lots of I/O. See Pfister: “In Search of
Clusters”
Focus on Neutrino Physics
• Including the NOvA Experiment
Strong collaborations with international
laboratories
• CERN / Large Hadron Collider (LHC)
Experiments
• Brookhaven National Laboratory (BNL)
• Lead institution (“Tier-1”) for the Compact Muon
Solenoid (CMS)
Drivers of Facility Evolution: Capacity / Cost / Elasticity
Price of one core-year on Commercial Clouds
HEP needs: 10-100x today's capacity
Facility size: 15k cores
NOvA experiment jobs in queue at FNAL
Usage is not steady-state
CMS Analysis Users – Yearly Cycle
Vision for Facility Evolution
• Strategic Plan for U.S. Particle Physics (P5 Report to the U.S. funding agencies)
Fermilab Facility
HTC, HPC Cores
68.7K
Disk Systems
37.6 PB
Tape
101 PB
10/100 Gbit
Networking
~5k internal
network ports
The Facility Today is “Fixed”
Rapidly evolving computer architectures
and increasing data volumes require
effective crosscutting solutions that are
being developed in other science
disciplines and in industry.
• HEP Cloud Vision Statement
– HEPCloud is envisioned as a portal to an ecosystem of diverse computing resources, commercial or
academic
– Provides “complete solutions” to users, with agreed upon levels of service
– The Facility routes to local or remote resources based on workflow requirements, cost, and efficiency of
accessing various resources
– Manages allocations of users to target compute engines
• Pilot project to explore feasibility, capability of HEPCloud
– Goal of moving into production during FY18
– Seed money provided by industry
HEP Cloud Architecture
Overview External Relationships
Basic idea: Add
disparate resources
(Cloud VM, HPC slots,
Grid nodes, local
resources) into a
central resource pool.
Fermilab HEPCloud: Expanding to the Cloud
Reference herein to any specific commercial product, process, or service by trade name, trademark,
manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or
favoring by the United States Government or any agency thereof.
• Where to start?
– Market leader:
Amazon Web Services (AWS)
• Integration challenges that need to
be managed to run at scale:
– Provisioning
– Performance
– Image portability
– On-demand services
– Networking
– Storage and data movement
– Monitoring and accounting
– Security
Integration Challenges: Provisioning – Create an Overlay Batch System with
GlideinWMS and HTCondor
[Diagram: jobs enter through condor submit to the HTCondor Schedulers. The VO Frontend and the GlideinWMS Factory (HTCondor-G) start Glideins on a Grid Site, Local Resources, High Performance Computers, and a Cloud Provider. Each Glidein VM runs an HTCondor Startd that pulls jobs via the HTCondor Central Manager.]
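The pull model in the diagram above can be illustrated with a toy simulation: pilots running on different kinds of resources each take the next job from one shared queue. This is only a sketch of the idea. The function `run_pilots` and the pilot names are hypothetical and do not correspond to any HTCondor or GlideinWMS API.

```python
from collections import deque


def run_pilots(jobs, pilots):
    """Each pilot repeatedly pulls the next job from one shared queue."""
    if not pilots:
        raise ValueError("at least one pilot is required")
    queue = deque(jobs)
    assignments = {name: [] for name in pilots}
    while queue:
        for name in pilots:
            if not queue:
                break
            # The pilot asks for work; the job does not care where it lands.
            assignments[name].append(queue.popleft())
    return assignments


pilots = ["grid-site", "local", "hpc", "cloud-vm"]
result = run_pilots([f"job{i}" for i in range(10)], pilots)
for name, assigned in result.items():
    print(name, assigned)
```

The point of the design is that the submit side sees a single pool, while the resources behind the pilots can be added or removed independently.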
Integration Challenges: Provisioning – Containing costs
• Using AWS Spot market to
contain costs
• Workflows are already engineered
to sustain preemption from the
Grid
– Jobs are “short”, i.e., killed jobs are
affordable w/o checkpointing
– Preempted jobs are automatically
resubmitted
– Data management systems
identify files in a dataset that were
not processed and allow recovery
CMS use case: histogram of the number of times each job started (a measure of preemption)
• 2.5M jobs with no preemption
• 400K jobs with one preemption
NOvA use case: number of VMs running (blue) and preempted (red) every hour
• 240 VM / h
• 60 VM preempted in 1 h
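A histogram of job starts like the one described for the CMS use case can be reproduced with a toy model of automatic resubmission. The sketch below assumes that every attempt is preempted independently with a fixed probability and that a preempted job restarts from scratch. The probability value and the function name `run_with_resubmission` are hypothetical inputs for illustration, not measured values.

```python
import random
from collections import Counter


def run_with_resubmission(n_jobs, p_preempt, seed=0):
    """Return a histogram {number_of_starts: job_count}."""
    if not 0 <= p_preempt < 1:
        raise ValueError("p_preempt must be in [0, 1)")
    rng = random.Random(seed)
    starts = Counter()
    for _ in range(n_jobs):
        attempts = 1
        while rng.random() < p_preempt:  # this attempt was preempted
            attempts += 1                # the job is resubmitted automatically
        starts[attempts] += 1
    return starts


histogram = run_with_resubmission(n_jobs=1000, p_preempt=0.1)
for n_starts in sorted(histogram):
    print(n_starts, histogram[n_starts])
```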
Integration Challenges: Provisioning – Containing costs
• The Decision Engine oversees costs
and optimizes VM placement using
the status of the facility, the
historical prices, and the job
characteristics
Bidding at 25% of the on-demand price has the lowest expected cost
• Based on preemption history,
calculate the probability that a
5-24 h job finishes within a week
even though it has to restart after
preemption, for various bidding
algorithms.
$0.25 / h
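The probability calculation described above can be approximated with a small Monte Carlo simulation. This sketch assumes a constant, hypothetical per-hour preemption probability and a job that restarts from scratch after each preemption; the analysis on the slide is based on real preemption history, which is not reproduced here. The function name and its parameters are illustrative.

```python
import random


def prob_finish_within(job_hours, hourly_preempt_prob, deadline_hours=168,
                       trials=2000, seed=0):
    """Estimate the chance that a job completes before the deadline."""
    rng = random.Random(seed)
    finished = 0
    for _ in range(trials):
        elapsed = 0
        progress = 0
        while elapsed < deadline_hours and progress < job_hours:
            elapsed += 1
            if rng.random() < hourly_preempt_prob:
                progress = 0   # preempted: all progress is lost
            else:
                progress += 1  # one more hour of useful work
        if progress >= job_hours:
            finished += 1
    return finished / trials


# Hypothetical preemption rates for two job lengths (5 h and 24 h).
for hours in (5, 24):
    for rate in (0.01, 0.05, 0.20):
        print(hours, rate, prob_finish_within(hours, rate))
```

A lower bid generally means a higher preemption rate, so a comparison of bidding algorithms would trade the lower hourly price against the lower completion probability.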
Integration Challenges: Performance
Benchmarks used to compare workflow duration on AWS (and $$) with local execution
[Benchmark plots. Annotations: "Need EBS"; "32 cores"; "scale w/ cores"; "Need parallel streams"; "c3.2xlarge"; "good candidate – want > 1".]
From AWS to FNAL: 7 Gbps
Access to S3 always saturates the 1 Gbps interface
Integration Challenges: Performance
CMS Use Case:
Wallclock distribution by AWS instance type
Integration Challenges: Image Portability
Build VM management tool,
considering:
• HVM virtualization (HW VM
+ Xen) on AWS: gives
access to all AWS
resources
• Contain VM size (saves
import time and cost)
• Import process covers
multiple AWS accounts and
regions
• Authentication with AWS uses
short-lived role-based tokens
rather than long-term keys
Build “Golden Image” from standard Fermilab Worker Node configuration VM.
Integration Challenges: On-demand Services
Jobs depend on software services to run
Automating the on-demand deployment of these services on AWS enables scalability and cost savings
• Services include data caching (e.g., Squid), WMS, submission service, data transfer, etc.
• As services are made deployable on demand, instantiate an ensemble of services together (e.g.,
through AWS CloudFormation)
Example: on-demand Squid
• Deploy Squid via
auto-scaling services.
A Squid server is deployed if the
group's average bandwidth utilization
is too high. A server is
deployed or destroyed in
30 seconds.
• Front Squids with a
load balancer.
• Name the load balancer for that
region via Route 53
Auto Scaling
group
CloudFormation
"SquidInstanceType" : { "Type" : "String", "Default" : "c3.xlarge", … },
"SquidLaunchConfiguration" : {
  "Type" : "AWS::AutoScaling::LaunchConfiguration",
  "Properties" : {
    "InstanceType" : { "Ref" : "SquidInstanceType" },
    "ImageId" : { "Fn::FindInMap" : [ "AMIRegionMap", { "Ref" : "AWS::Region" }, "SquidAMI" ] },
    "SecurityGroups" : [ { "Fn::FindInMap" :
      [ "SecurityGroupRegionMap", { "Ref" : "AWS::Region" }, "SquidSG" ] } ],
    … } }
"SquidAutoscalingGroup" : {
  "Type" : "AWS::AutoScaling::AutoScalingGroup",
  "Properties" : {
    "AvailabilityZones" : { "Ref" : "AvailabilityZones" },
    "LaunchConfigurationName" : { "Ref" : "SquidLaunchConfiguration" },
    "LoadBalancerNames" : [ { "Ref" : "SquidLoadBalancer" } ],
    … } },
"SquidAutoscaleUpPolicy" : {
  "Type" : "AWS::AutoScaling::ScalingPolicy",
  "Properties" : {
    "AdjustmentType" : "ChangeInCapacity",
    "AutoScalingGroupName" : { "Ref" : "SquidAutoscalingGroup" },
    "ScalingAdjustment" : "1"
    … } },
…
Integration Challenges: On-demand Services – CloudFormation
"SquidNetworkBandwidthHighAlarm" : {
  "Type" : "AWS::CloudWatch::Alarm",
  "Properties" : {
    "AlarmDescription" : "Scale up if average NetworkIn > for 5 minutes",
    "MetricName" : "NetworkOut",
    "Statistic" : "Average",
    "Period" : "300",
    "Threshold" : "1100000000",
    "AlarmActions" : [ { "Ref" : "SquidAutoscaleUpPolicy" } ],
    "ComparisonOperator" : "GreaterThanThreshold",
    … } }
…
"SecurityGroupRegionMap" : {
  "us-west-2" : { "SquidSG" : "sg-xxxxf6cb" },
  "us-east-1" : { "SquidSG" : "sg-xxxx70ca" },
  … }
"SquidLoadBalancer" : {
  "Type" : "AWS::ElasticLoadBalancing::LoadBalancer",
  "Properties" : {
    "CrossZone" : "false",
    "SecurityGroups" : [ { "Fn::FindInMap" :
      [ "SecurityGroupRegionMap", { "Ref" : "AWS::Region" }, "SquidSG" ] } ],
    "Listeners" : [ { "LoadBalancerPort" : "3128", "InstancePort" : "3128", "Protocol" : "TCP" } ],
    "HealthCheck" : { "Target" : "TCP:3128", "HealthyThreshold" : "3", … }
    … } }
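The relationship between the alarm and the scale-up policy in the template can be sketched as a single decision function. This is a simplified illustration, assuming the alarm fires when the average of the observed NetworkOut samples is strictly greater than the threshold, and that the policy then adds one instance. The function name `desired_capacity` is hypothetical, and the real CloudWatch evaluation (periods, datapoints, alarm states) is more involved than this.

```python
def desired_capacity(current, network_out_samples,
                     threshold=1_100_000_000, adjustment=1):
    """Add `adjustment` instances when the average sample exceeds the threshold."""
    if not network_out_samples:
        raise ValueError("at least one sample is required")
    average = sum(network_out_samples) / len(network_out_samples)
    if average > threshold:  # mirrors ComparisonOperator GreaterThanThreshold
        return current + adjustment  # mirrors ChangeInCapacity with adjustment 1
    return current


print(desired_capacity(2, [1.2e9, 1.3e9]))  # average above threshold
print(desired_capacity(2, [1.0e9, 1.1e9]))  # average below threshold
```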
Integration Challenges: On-demand Services – CloudFormation
"elbHostedZone" : {
  "Type" : "AWS::Route53::HostedZone",
  "Properties" : {
    "HostedZoneConfig" : {
      "Comment" : "auto-generated private hosting zone for ELB" },
    "Name" : { "Fn::Join" : [ "", [ { "Ref" : "AvailabilityZone" }, ".elb.fnaldata.org." ] ] },
    "VPCs" : [ {
      "VPCId" : { … },
      "VPCRegion" : { "Ref" : "AWS::Region" } } ]
  } }
"elbDNS" : {
  "Type" : "AWS::Route53::RecordSet",
  "Properties" : {
    "HostedZoneId" : { "Ref" : "elbHostedZone" },
    "Name" : { "Fn::Join" :
      [ "", [ "elb2.", { "Ref" : "AvailabilityZone" }, ".elb.fnaldata.org." ] ] },
    "ResourceRecords" : [ { "Fn::GetAtt" : [ "SquidLoadBalancer", "DNSName" ] } ]
    … } }
Clients call Squid as elb2.<AvailabilityZone>.elb.fnaldata.org
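On the client side, the naming convention above means a job only needs to know its own availability zone to find the nearest Squid. A minimal sketch follows, assuming the proxy is reached over HTTP on port 3128 (the listener port in the load balancer definition). The helper names and the example zone `us-west-2a` are illustrative assumptions.

```python
def squid_endpoint(availability_zone, domain="elb.fnaldata.org"):
    """Build the load balancer name following the elb2.<zone>.<domain> pattern."""
    return f"elb2.{availability_zone}.{domain}"


def squid_proxy_url(availability_zone, port=3128):
    """Build a proxy URL a client could use, assuming plain HTTP proxying."""
    return f"http://{squid_endpoint(availability_zone)}:{port}"


print(squid_proxy_url("us-west-2a"))
```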
Integration Challenges: On-demand Services – CloudFormation
Integration Challenges: Networking
Implement routing / firewall configuration
to use peered ESNet / AWS to route
data flow through ESNet
AWS / ESNet data egress cost waiver
• For data transferred through
ESNet, transfer charges are
waived for data costs up to 15%
of the total
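The effect of the waiver on a bill can be sketched with a short calculation. This assumes that "the total" means the total bill for the period and that egress charges above the 15% cap are billed normally; both are interpretations of the slide, not confirmed terms. The function name and the dollar amounts are hypothetical.

```python
def billed_egress(egress_cost, total_cost, waiver_fraction=0.15):
    """Return the egress charge left after applying the waiver cap."""
    waived = min(egress_cost, waiver_fraction * total_cost)
    return egress_cost - waived


# Hypothetical bills: egress within the cap, then egress above the cap.
print(billed_egress(egress_cost=100.0, total_cost=1000.0))
print(billed_egress(egress_cost=200.0, total_cost=1000.0))
```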
Integration Challenges: Storage and Data Movement
Integrate S3 storage stage-in/-out for AWS internal /
external access - enables flexibility on data
management
• Consider O(1000) jobs finishing on the cloud and
transferring output to remote storage
• Storage bandwidth capacity is limited
• Two main strategies for data transfers:
1. Fill the available network transfer by having some
jobs wait - Put the jobs on a queue and transfer
data from as many jobs as possible - idle VMs
have a cost
2. Store data on S3 almost concurrently (due to high
scalability) and transfer data back asynchronously
- data on S3 has a cost
• The cheapest strategy depends on the storage
bandwidth, number of jobs, etc.
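The trade-off between the two strategies can be made concrete with a rough cost model. In this sketch, draining all output through the limited storage bandwidth takes a fixed time, and on average each job's output waits half of that time, either on an idle VM (strategy 1) or on S3 (strategy 2). All prices and sizes below are hypothetical, and the model ignores S3 request charges and upload time.

```python
def strategy_costs(n_jobs, gb_per_job, bandwidth_gb_per_h,
                   vm_price_per_h, s3_price_per_gb_h):
    """Return (queue_cost, s3_cost) for the two transfer strategies."""
    drain_hours = n_jobs * gb_per_job / bandwidth_gb_per_h
    avg_wait = drain_hours / 2
    # Strategy 1: jobs queue for the transfer, so their VMs stay allocated.
    queue_cost = n_jobs * avg_wait * vm_price_per_h
    # Strategy 2: output is parked on S3 until it can be transferred.
    s3_cost = n_jobs * gb_per_job * avg_wait * s3_price_per_gb_h
    return queue_cost, s3_cost


queue_cost, s3_cost = strategy_costs(
    n_jobs=1000, gb_per_job=1.0, bandwidth_gb_per_h=100.0,
    vm_price_per_h=0.10, s3_price_per_gb_h=0.0001,
)
print(queue_cost, s3_cost)
```

With these made-up numbers the S3 route is far cheaper, but the balance shifts with output size, bandwidth, and prices, which is the point made in the last bullet.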
S3
Integration Challenges: Monitoring and Accounting
Monitor # GCloud VMs (S. Korea Priv. Cloud)
Monitor # AWS VMs
Accounting: $ by VO and VM Type
Monitor HEP Cloud Slots
NOvA Data Processing
Processing the 2014/2015 dataset
3 use cases: Particle ID, Monte Carlo,
Data Reconstruction
Received AWS research grant
Dark Energy Survey
Gravitational Waves
Search for optical
counterpart of events
detected by LIGO/VIRGO
gravitational wave detectors (FNAL LDRD)
Modest CPU needs, but want 5-10 hour turnaround
Burst activity driven entirely by physical phenomena
(gravitational wave events are transient)
Rapid provisioning to peak
CMS Monte Carlo Simulation
Generation (and detector simulation, digitization,
reconstruction) of simulated events in time for
Moriond conference.
58,000 compute cores, steady-state
Demonstrates scalability
Received AWS research grant
Initial HEPCloud Use Cases
Results from the CMS Use Case
• All CMS simulation requests fulfilled by the conference
deadline (Rencontres de Moriond 2016)
– 2.9 million jobs, 15.1 million wall hours
• 9.5% badput – includes preemption from spot pricing
• 87% CPU efficiency
– 518 million events generated
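A few per-job averages follow directly from the totals above. The short calculation below uses only the reported figures; treating badput as a fraction of total wall hours is an assumption about how the 9.5% was measured.

```python
jobs = 2.9e6            # jobs run
wall_hours = 15.1e6     # total wall hours
events = 518e6          # events generated
badput_fraction = 0.095 # assumed to be a fraction of wall hours

hours_per_job = wall_hours / jobs
events_per_job = events / jobs
badput_hours = wall_hours * badput_fraction

print(f"{hours_per_job:.1f} wall hours per job")
print(f"{events_per_job:.0f} events per job")
print(f"{badput_hours / 1e6:.2f} million wall hours of badput")
```

This gives roughly 5.2 wall hours and about 179 events per job, consistent with the "short" jobs that make preemption affordable without checkpointing.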
CMS Reaching ~60k slots on AWS with HEPCloud
10% Test 25%
60000 slots
10000 VM
Each color corresponds to a
different region / zone /machine type
HEPCloud AWS: 25% of CMS global capacity
Production
Analysis
Reprocessing
Production on AWS
via FNAL HEPCloud
On-premises vs. cloud cost comparison
Average cost per core-hour
• On-premises resource: 0.9 cents per
core-hour
• Includes power, cooling, staff,
but assumes 100% utilization
• Off-premises at AWS (CMS use case):
1.4 cents per core-hour
• Off-premises at AWS (NOvA use case):
3.0 cents per core-hour
• Use case demanded bigger VM
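Since the on-premises figure assumes 100% utilization, it is useful to ask at what utilization the two options cost the same. The sketch below assumes on-premises costs are fixed regardless of load, so the cost per used core-hour scales as one over the utilization. The function names are illustrative.

```python
def effective_on_prem_cost(cost_at_full_utilization, utilization):
    """Cost per used core-hour when only a fraction of capacity is busy."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return cost_at_full_utilization / utilization


def break_even_utilization(on_prem_cost, cloud_cost):
    """Utilization below which the cloud price per core-hour is lower."""
    return on_prem_cost / cloud_cost


# Costs in cents per core-hour, taken from the slide.
print(break_even_utilization(0.9, 1.4))  # CMS use case
print(break_even_utilization(0.9, 3.0))  # NOvA use case
print(effective_on_prem_cost(0.9, 0.5))
```

Under this assumption, on-premises resources running below about 64% utilization would cost more per used core-hour than the CMS use case on AWS; for the NOvA use case the break-even point is 30%.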
Benchmarks
• Specialized (“ttbar”) benchmark focused on HEP workflows
• On-premises: 0.0163 ttbar /s (higher = better)
• Off-premises: 0.0158 ttbar /s
Raw compute performance roughly equivalent
Cloud costs approaching equivalence
Amazon provisions/retires 60k cores for our system in ~1 hour
Acknowledgements
The support from the Computing Sector
The Fermilab HEPCloud Facility team
AWS and their engagement team, in particular Jamie Baker
The HTCondor team
The collaboration and contributions from KISTI, in particular Dr. Seo-Young Noh
The Illinois Institute of Technology (IIT) students and professors Ioan Raicu and
Shangping Ren
The Italian National Institute of Nuclear Physics (INFN) summer student program
For More Information:
• NOvA: http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=5774
• CMS: http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=5750
demonstration
Thank you!
Remember to complete
your evaluations!
Related Sessions
CMP201 - Auto Scaling – The Fleet Management Solution for Planet Earth

Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

AWS re:Invent 2016: Building HPC Clusters as Code in the (Almost) Infinite Cloud (CMP318)

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Nathan McGuirt, Manager, Solutions Architecture, AWS Gabriele Garzoglio, HEP Cloud Facility Project Manager, Fermilab December 2016 Building HPC Clusters as Code in the (Almost) Infinite Cloud CMP318
  • 2. What to Expect from the Session • Why customers are using AWS for HPC/HTC • Leveraging Spot Instances for big compute at low cost • Accelerating deployment with automation and managed services
  • 3. Agenda • Why AWS for HPC? • Automating cluster deployment • Fermi National Accelerator Laboratory • Demo of scaling jobs on a budget
  • 4. High Performance Computing (HPC) vs. High Throughput Computing (HTC) HPC: High performance computing (cluster computing) - Tightly clustered - Latency sensitive HTC: High throughput computing (grid computing) - Less inter-node communication - More horizontal scalability (pleasingly parallel)
  • 5. Why AWS for HPC?
  • 10. Data AWS Snowball AWS Direct Connect
  • 11. Cost
  • 12. Cost – Spot market
      Request   Bid Price   Spot Price
      1         $1.00       $0.20
      2         $0.55       $0.20
      3         $0.50       $0.20
      4         $0.33       $0.20
      5         $0.20       $0.20
      6         $0.18
      7         $0.15
      8         $0.10
      9         $0.05
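The bid table on this slide can be read as a simple clearing rule: requests whose bid is at or above the current Spot price run, and each pays the Spot price rather than its own bid. The following is a minimal Python sketch of that reading, assuming the five $0.20 Spot Price entries on the slide belong to requests 1 to 5. The function name `fulfilled_requests` is illustrative and is not part of any AWS API.

```python
from decimal import Decimal

# Bid prices from the slide, in USD per instance-hour, keyed by request number.
BIDS = {1: "1.00", 2: "0.55", 3: "0.50", 4: "0.33", 5: "0.20",
        6: "0.18", 7: "0.15", 8: "0.10", 9: "0.05"}


def fulfilled_requests(bids, spot_price):
    """Return {request: price_paid} for every bid at or above the Spot price."""
    spot = Decimal(spot_price)
    return {req: spot for req, bid in bids.items() if Decimal(bid) >= spot}


if __name__ == "__main__":
    for req, paid in fulfilled_requests(BIDS, "0.20").items():
        print(f"request {req}: bid ${BIDS[req]}, pays ${paid}")
```

Under this rule a high bid does not raise the price paid; it only lowers the chance of being preempted when the Spot price rises.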
  • 17. Automation • Fully custom: APIs, AWS CloudFormation • Managed services: Amazon EMR, AWS Batch • Software cluster management solutions: CFNCluster, Alces Flight • Partner offerings
  • 18. API - SDKs Java Python PHP .NET Ruby nodeJS iOS Android AWS Toolkit for Visual Studio AWS Toolkit for Eclipse Tools for Windows PowerShell CLI
  • 20. CloudFormation
      Resources:
        Ec2Instance:
          Type: AWS::EC2::Instance
          Properties:
            SecurityGroups:
              - Ref: InstanceSecurityGroup
            KeyName: mykey
            ImageId: ''
        InstanceSecurityGroup:
          Type: AWS::EC2::SecurityGroup
          Properties:
            GroupDescription: Enable SSH access via port 22
            SecurityGroupIngress:
              - IpProtocol: tcp
                FromPort: '22'
                ToPort: '22'
                CidrIp: 0.0.0.0/0
  • 21. EMR
  • 23. AWS CFNCluster $ pip install cfncluster ... $ cfncluster configure ... $ cfncluster run mycluster
  • 24. Alces Flight Alces Flight is a software offering self-service supercomputers via the AWS Marketplace. It creates self-scaling clusters with more than 750 popular scientific applications pre-installed, complete with libraries and various compiler optimizations, ready to run. The clusters use AWS Spot Instances by default.
  • 25. AWS Partners in the HPC Space
  • 26. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Gabriele Garzoglio, HEP Cloud Facility Co-Project Manager, Fermilab December 2016 The HEP Cloud Facility Elastic Computing for High Energy Physics
  • 27. Computing at the Fermi National Accelerator Laboratory Lead United States particle physics laboratory • Funded by the Department of Energy • ~100 PB of data on tape • High Throughput Computing characterized by: • “Pleasingly parallel” tasks • High CPU instruction / Bytes IO ratio • But still lots of I/O. See Pfister: “In Search of Clusters” Focus on Neutrino Physics • Including the NOvA Experiment Strong collaborations with international laboratories • CERN / Large Hadron Collider (LHC) Experiments • Brookhaven National Laboratory (BNL) • Lead institution (“Tier-1”) for the Compact Muon Solenoid (CMS)
  • 28. Drivers of Facility Evolution: Capacity / Cost / Elasticity [Figures: price of one core-year on commercial clouds; NOvA experiment jobs in queue at FNAL; CMS analysis users, yearly cycle] HEP needs: 10-100x today's capacity. Facility size: 15k cores. Usage is not steady-state.
  • 29. Vision for Facility Evolution • Strategic Plan for U.S. Particle Physics (P5 Report to the U.S. funding agencies): Rapidly evolving computer architectures and increasing data volumes require effective crosscutting solutions that are being developed in other science disciplines and in industry. • The Facility Today is “Fixed”. Fermilab Facility (HTC, HPC): Cores 68.7K; Disk Systems 37.6 PB; Tape 101 PB; 10/100 Gbit Networking, ~5k internal network ports • HEP Cloud Vision Statement – HEPCloud is envisioned as a portal to an ecosystem of diverse computing resources, commercial or academic – Provides “complete solutions” to users, with agreed upon levels of service – The Facility routes to local or remote resources based on workflow requirements, cost, and efficiency of accessing various resources – Manages allocations of users to target compute engines • Pilot project to explore feasibility, capability of HEPCloud – Goal of moving into production during FY18 – Seed money provided by industry
  • 30. HEP Cloud Architecture Overview External Relationships
  • 31. HEP Cloud Architecture Overview External Relationships Basic idea: Add disparate resources (Cloud VM, HPC slots, Grid nodes, local resources) into a central resource pool.
  • 32. Fermilab HEPCloud: Expanding to the Cloud • Where to start? – Market leader: Amazon Web Services (AWS) • Integration challenges that need to be managed to run at scale: – Provisioning – Performance – Image portability – On-demand services – Networking – Storage and data movement – Monitoring and accounting – Security Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof.
  • 33. Integration Challenges: Provisioning – Create an Overlay Batch System with GlideinWMS and HTCondor [Diagram labels: condor submit; VO Frontend; HTCondor Central Manager; HTCondor Schedulers; GlideinWMS Factory; HTCondor-G; Grid Site; Local Resources; High Performance Computers; Cloud Provider; Virtual Machine; Job; VM Glidein; HTCondor Startd; Pull Job]
  • 34. Integration Challenges: Provisioning – Containing costs • Using AWS Spot market to contain costs • Workflows are already engineered to sustain preemption from the Grid – Jobs are “short”, i.e., killed jobs are affordable w/o checkpointing – Preempted jobs are automatically resubmitted – Data management systems identify files in a dataset that were not processed and allow recovery • CMS use case: histogram of the number of times each job started (a measure of preemption): 2.5M jobs with no preemption, 400K jobs with one preemption • NOvA use case: number of VMs running (blue) and preempted (red) every hour: 240 VM / h, 60 VM preempted in 1h
  • 35. Integration Challenges: Provisioning – Containing costs • The Decision Engine oversees costs and optimizes VM placement using the status of the facility, the historical prices, and the job characteristics • Based on preemption history, it calculates the probability that a 5-24 h job finishes within a week even though it has to restart due to preemption, for various bidding algorithms • Bid at 25% x on-demand price has lowest expected cost [Figure annotation: $0.25 / h]
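The probability described on this slide (that a job finishes within a week even though preemption forces restarts) can be illustrated with a small dynamic program. This is only a sketch under simplifying assumptions that the slides do not state: time advances in whole hours, each hour carries the same independent preemption probability, and a preempted job restarts from scratch immediately. The Decision Engine itself works from observed preemption history; the function name `prob_finish` and the 5% hourly preemption rate below are hypothetical.

```python
def prob_finish(job_hours, deadline_hours, p_preempt):
    """Probability of one uninterrupted run of `job_hours` within `deadline_hours`.

    Assumes an independent, constant per-hour preemption probability and an
    immediate restart from zero after each preemption (illustrative model).
    """
    # streak[k]: probability the job has k uninterrupted hours so far, unfinished.
    streak = [0.0] * job_hours
    streak[0] = 1.0
    done = 0.0
    for _ in range(deadline_hours):
        nxt = [0.0] * job_hours
        for k, pr in enumerate(streak):
            if pr == 0.0:
                continue
            # Preempted during this hour: all progress is lost.
            nxt[0] += pr * p_preempt
            # Survived this hour: either the job completes or the streak grows.
            if k + 1 == job_hours:
                done += pr * (1.0 - p_preempt)
            else:
                nxt[k + 1] += pr * (1.0 - p_preempt)
        streak = nxt
    return done


if __name__ == "__main__":
    week = 7 * 24
    for hours in (5, 12, 24):
        print(hours, "h job:", round(prob_finish(hours, week, 0.05), 4))
```

Running the same function with preemption rates estimated for different bid levels gives one way to compare bidding strategies by completion probability.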
  • 36. Integration Challenges: Performance Benchmarks used to compare workflow duration on AWS (and $$) with local execution [Figure annotations: Need EBS; 32 cores; scale w/ cores; Need parallel streams; c3.2xlarge good candidate – want > 1] From AWS to FNAL: 7 Gbps. Access to S3 always saturates the 1 Gbps interface.
  • 37. Integration Challenges: Performance CMS Use Case: Wallclock distribution by AWS instance type
  • 38. Integration Challenges: Image Portability Build VM management tool, considering: • HVM virtualization (HW VM + Xen) on AWS: gives access to all AWS resources • Contain VM size (saves import time and cost) • Import process covers multiple AWS accounts and regions • AuthN with AWS uses short-lived role-based tokens, rather than long-term keys Build “Golden Image” from standard Fermilab Worker Node configuration VM.
  • 39. Integration Challenges: On-demand Services Jobs depend on software services to run. Automating the deployment of these services on AWS on-demand enables scalability and cost savings • Services include data caching (e.g., Squid), WMS, submission service, data transfer, etc. • As services are made deployable on-demand, instantiate an ensemble of services together (e.g., through AWS CloudFormation) Example: on-demand Squid • Deploy Squid via auto-scaling services. Squid is deployed if average group bandwidth utilization is too high. Server is deployed or destroyed in 30 seconds. • Front Squids with a load balancer. • Name the load balancer for that region via Route 53 [Diagram labels: Auto Scaling group; CloudFormation]
  • 40. Integration Challenges: On-demand Services – CloudFormation
      "SquidInstanceType" : { "Type" : "String", "Default" : "c3.xlarge", … },
      "SquidLaunchConfiguration" : {
        "Type" : "AWS::AutoScaling::LaunchConfiguration",
        "Properties" : {
          "InstanceType" : { "Ref" : "SquidInstanceType" },
          "ImageId" : { "Fn::FindInMap" : [ "AMIRegionMap", { "Ref" : "AWS::Region" }, "SquidAMI" ] },
          "SecurityGroups" : [ { "Fn::FindInMap" : [ "SecurityGroupRegionMap", { "Ref" : "AWS::Region" }, "SquidSG" ] } ],
          …
        }
      }
      "SquidAutoscalingGroup" : {
        "Type" : "AWS::AutoScaling::AutoScalingGroup",
        "Properties" : {
          "AvailabilityZones" : { "Ref" : "AvailabilityZones" },
          "LaunchConfigurationName" : { "Ref" : "SquidLaunchConfiguration" },
          "LoadBalancerNames" : [ { "Ref" : "SquidLoadBalancer" } ],
          …
        }
      },
      "SquidAutoscaleUpPolicy" : {
        "Type" : "AWS::AutoScaling::ScalingPolicy",
        "Properties" : {
          "AdjustmentType" : "ChangeInCapacity",
          "AutoScalingGroupName" : { "Ref" : "SquidAutoscalingGroup" },
          "ScalingAdjustment" : "1"
          …
        }
      },
      …
  • 41. Integration Challenges: On-demand Services – CloudFormation
      "SquidNetworkBandwidthHighAlarm" : {
        "Type" : "AWS::CloudWatch::Alarm",
        "Properties" : {
          "AlarmDescription" : "Scale up if average NetworkIn > for 5 minutes",
          "MetricName" : "NetworkOut",
          "Statistic" : "Average",
          "Period" : "300",
          "Threshold" : "1100000000",
          "AlarmActions" : [ { "Ref" : "SquidAutoscaleUpPolicy" } ],
          "ComparisonOperator" : "GreaterThanThreshold",
          …
        }
      }
      …
      "SecurityGroupRegionMap" : {
        "us-west-2" : { "SquidSG" : "sg-xxxxf6cb" },
        "us-east-1" : { "SquidSG" : "sg-xxxx70ca" },
        …
      }
      "SquidLoadBalancer" : {
        "Type" : "AWS::ElasticLoadBalancing::LoadBalancer",
        "Properties" : {
          "CrossZone" : "false",
          "SecurityGroups" : [ { "Fn::FindInMap" : [ "SecurityGroupRegionMap", { "Ref" : "AWS::Region" }, "SquidSG" ] } ],
          "Listeners" : [ { "LoadBalancerPort" : "3128", "InstancePort" : "3128", "Protocol" : "TCP" } ],
          "HealthCheck" : { "Target" : "TCP:3128", "HealthyThreshold" : "3", … }
          …
        }
      }
  • 42. Integration Challenges: On-demand Services – CloudFormation
      "elbHostedZone" : {
        "Type" : "AWS::Route53::HostedZone",
        "Properties" : {
          "HostedZoneConfig" : { "Comment" : "auto-generated private hosting zone for ELB" },
          "Name" : { "Fn::Join" : [ "", [ { "Ref" : "AvailabilityZone" }, ".elb.fnaldata.org." ] ] },
          "VPCs" : [ { "VPCId" : { … }, "VPCRegion" : { "Ref" : "AWS::Region" } } ]
        }
      }
      "elbDNS" : {
        "Type" : "AWS::Route53::RecordSet",
        "Properties" : {
          "HostedZoneId" : { "Ref" : "elbHostedZone" },
          "Name" : { "Fn::Join" : [ "", [ "elb2.", { "Ref" : "AvailabilityZone" }, ".elb.fnaldata.org." ] ] },
          "ResourceRecords" : [ { "Fn::GetAtt" : [ "SquidLoadBalancer", "DNSName" ] } ]
          …
        }
      }
      Clients call Squid as elb2.<AvailabilityZone>.elb.fnaldata.org
  • 43. Integration Challenges: Networking Implement routing / firewall configuration to use peered ESNet / AWS to route data flow through ESNet AWS / ESNet data egress cost waiver • For data transferred through ESNet, transfer charges are waived for data costs up to 15% of the total
  • 44. Integration Challenges: Storage and Data Movement Integrate S3 storage stage-in/-out for AWS internal / external access, which enables flexibility on data management • Consider O(1000) jobs finishing on the cloud and transferring output to remote storage • Storage bandwidth capacity is limited • Two main strategies for data transfers: 1. Fill the available transfer capacity by having some jobs wait: put the jobs on a queue and transfer data from as many jobs as possible (idle VMs have a cost) 2. Store data on S3 almost concurrently (due to high scalability) and transfer data back asynchronously (data on S3 has a cost) • The cheapest strategy depends on the storage bandwidth, number of jobs, etc.
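The trade-off between the two transfer strategies can be sketched with a toy cost model. Everything here is an assumption made for illustration: outputs drain to remote storage one after another at full bandwidth, writes to S3 are treated as instantaneous, request and egress charges are ignored, and the prices in the example are hypothetical placeholders rather than AWS list prices.

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600
SECONDS_PER_MONTH = 30 * 24 * 3600  # billing month approximated as 30 days


@dataclass
class Scenario:
    jobs: int                    # number of jobs finishing at about the same time
    output_gb: float             # output size per job
    bandwidth_gb_per_s: float    # bandwidth into the remote storage
    vm_usd_per_hour: float       # hypothetical VM price
    s3_usd_per_gb_month: float   # hypothetical S3 storage price


def total_drain_seconds(s):
    """Sum over jobs of the time until each job's output reaches remote storage."""
    per_job = s.output_gb / s.bandwidth_gb_per_s
    # The i-th output in the queue lands after i * per_job seconds (i = 1..jobs).
    return per_job * s.jobs * (s.jobs + 1) / 2


def cost_wait_on_vm(s):
    """Strategy 1: each VM stays up (idle or transferring) until its output is out."""
    return total_drain_seconds(s) / SECONDS_PER_HOUR * s.vm_usd_per_hour


def cost_stage_on_s3(s):
    """Strategy 2: outputs sit on S3 until drained; VMs are released at once."""
    gb_seconds = s.output_gb * total_drain_seconds(s)
    return gb_seconds / SECONDS_PER_MONTH * s.s3_usd_per_gb_month


if __name__ == "__main__":
    example = Scenario(jobs=1000, output_gb=2.0, bandwidth_gb_per_s=0.125,
                       vm_usd_per_hour=0.10, s3_usd_per_gb_month=0.03)
    print("wait on VM :", cost_wait_on_vm(example))
    print("stage on S3:", cost_stage_on_s3(example))
```

Varying `jobs`, `output_gb`, and `bandwidth_gb_per_s` shows how the cheaper option shifts with the inputs, which is the dependency the slide points out.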
  • 45. Integration Challenges: Monitoring and Accounting Monitor # GCloud VMs (S. Korea Priv. Cloud) Monitor # AWS VMs Accounting: $ by VO and VM Type Monitor HEP Cloud Slots
  • 46. Initial HEPCloud Use Cases • NOvA Data Processing: processing the 2014/2015 dataset; 3 use cases: Particle ID, Monte Carlo, Data Reconstruction; received AWS research grant • Dark Energy Survey, Gravitational Waves: search for optical counterpart of events detected by LIGO/VIRGO gravitational wave detectors (FNAL LDRD); modest CPU needs, but want 5-10 hour turnaround; burst activity driven entirely by physical phenomena (gravitational wave events are transient); rapid provisioning to peak • CMS Monte Carlo Simulation: generation (and detector simulation, digitization, reconstruction) of simulated events in time for Moriond conference; 58,000 compute cores, steady-state; demonstrates scalability; received AWS research grant
  • 47. Results from the CMS Use Case • All CMS simulation requests fulfilled by the conference deadline (Rencontres de Moriond 2016) – 2.9 million jobs, 15.1 million wall hours • 9.5% badput – includes preemption from spot pricing • 87% CPU efficiency – 518 million events generated
  • 48. CMS Reaching ~60k slots on AWS with HEPCloud 10% Test 25% 60000 slots 10000 VM Each color corresponds to a different region / zone /machine type
  • 49. HEPCloud AWS: 25% of CMS global capacity [Chart legend: Production; Analysis; Reprocessing; Production on AWS via FNAL HEPCloud]
  • 50. On-premises vs. cloud cost comparison Average cost per core-hour • On-premises resource: 0.9 cents per core-hour – Includes power, cooling, staff, but assumes 100% utilization • Off-premises at AWS (CMS use case): 1.4 cents per core-hour • Off-premises at AWS (NOvA use case): 3.0 cents per core-hour – Use case demanded bigger VM Benchmarks • Specialized (“ttbar”) benchmark focused on HEP workflows – On-premises: 0.0163 ttbar/s (higher = better) – Off-premises: 0.0158 ttbar/s Raw compute performance roughly equivalent. Cloud costs approaching equivalence. Amazon provisions/retires 60k cores for our system in ~1 hour.
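Because the on-premises figure assumes 100% utilization, a useful follow-up calculation is the utilization at which the two options cost the same per used core-hour. The sketch below uses only the numbers on this slide and assumes that on-premises costs stay fixed regardless of how busy the facility is, which is a simplification (power draw, for example, would vary). The function names are illustrative.

```python
# Figures from the slide.
ON_PREM_CENTS = 0.9                      # cents per core-hour at 100% utilization
AWS_CENTS = {"CMS": 1.4, "NOvA": 3.0}    # cents per core-hour on AWS
ON_PREM_TTBAR_PER_S = 0.0163             # ttbar benchmark, higher is better
AWS_TTBAR_PER_S = 0.0158


def break_even_utilization(cloud_cents, on_prem_cents=ON_PREM_CENTS):
    """Utilization below which on-premises costs more per used core-hour.

    Assumes on-premises cost is fixed, so the cost per used core-hour is
    on_prem_cents / utilization.
    """
    return on_prem_cents / cloud_cents


def relative_performance(cloud=AWS_TTBAR_PER_S, local=ON_PREM_TTBAR_PER_S):
    """Cloud benchmark score as a fraction of the on-premises score."""
    return cloud / local


if __name__ == "__main__":
    for use_case, cents in AWS_CENTS.items():
        print(f"{use_case}: break-even utilization "
              f"{break_even_utilization(cents):.1%}")
    print(f"AWS / on-premises ttbar score: {relative_performance():.3f}")
```

With the slide's numbers, the CMS case breaks even at roughly 64% on-premises utilization and the NOvA case at 30%, under the fixed-cost assumption above.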
  • 51. Acknowledgements • The support from the Computing Sector • The Fermilab HEPCloud Facility team • AWS and their engagement team, in particular Jamie Baker • The HTCondor team • The collaboration and contributions from KISTI, in particular Dr. Seo-Young Noh • The Illinois Institute of Technology (IIT) students and professors Ioan Raicu and Shangping Ren • The Italian National Institute of Nuclear Physics (INFN) summer student program For More Information: • NOvA: http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=5774 • CMS: http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=5750
  • 56. Related Sessions CMP201 - Auto Scaling – The Fleet Management Solution for Planet Earth