© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Scaling HPC applications in EC2
with Elastic Fabric Adapter
Brian Barrett
Principal Engineer
AWS
Agenda
HPC applications in AWS
What is EFA?
Getting started with EFA
EFA tech deep-dive
Related breakouts
Wednesday, Nov 28
High Performance Computing on AWS
3:15 – 4:15 | Aria East, Piazza Level, Orovada 2
Wednesday, Nov 28
Running High Performance Computing Workloads in the Cloud
4:00 – 5:00 | Aria West, Level 3, Starvine 3, Table 8
Wednesday, Nov 28
Deploying a Burstable and Event-Driven HPC Cluster on AWS
1:00 – 2:00 | Aria West, Level 3, Starvine 10, Table 6
Weather simulation
[Chart: c4.8xlarge run time (s) and scale-up vs. number of cores]
Structural simulation
Cloud for product design and engineering
Trek applications for engineering:
• Computational Fluid Dynamics
• Star-CCM+ and HEEDS software
Simulations for bicycle design:
• Execute multiple simulations in parallel
• Fully explore the design space to make informed decisions about drafting techniques related to competitive bicycling
Fluid dynamics – Ansys Fluent
c4.8xlarge instance type
140M cell model
F1 car CFD benchmark
HPC in aerospace
Boom leverages Rescale and AWS to enable supersonic travel
• Simulated vortex lift with 200M cell models on 512+ cores
• Increased simulation throughput: 100 jobs in parallel with 6x speedup per job → 600x speedup
• Elastic HPC capacity and pay-as-you-go AWS clusters allow business agility & ability to scale
“Rescale’s ScaleX cloud platform is a game-changer for engineering. It gives Boom computing resources comparable to building a large on-premise HPC center. Rescale lets us move fast with minimal capital spending and resources overhead.”
Josh Krall
CTO & Co-Founder
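As a worked check of the throughput figure on the Boom slide, here is a minimal Python sketch. It assumes the baseline is one job at a time at 1x speed, so that 100 parallel jobs at 6x each multiply out to 600x; the function name `aggregate_speedup` is hypothetical and not from the talk.

```python
def aggregate_speedup(parallel_jobs: int, per_job_speedup: float) -> float:
    """Throughput gain versus running the jobs one at a time at baseline speed.

    Assumption: jobs are independent, so the gains simply multiply.
    """
    if parallel_jobs < 1 or per_job_speedup <= 0:
        raise ValueError("need at least one job and a positive speedup")
    return parallel_jobs * per_job_speedup


print(aggregate_speedup(100, 6))  # 600
```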
Genomics processing on FPGA
Children’s Hospital of Philadelphia and Edico Genome Achieve Fastest-Ever Analysis of 1,000 Genomes
Orlando, Fla., Oct 19, 2018 – The Children’s Hospital of Philadelphia (CHOP) and Edico Genome today set a new scientific world standard in rapidly processing whole human genomes into data files usable for researchers aiming to bring precision medicine into mainstream clinical practice. Utilizing Edico Genome’s DRAGEN™ Genome Pipeline, deployed on 1,000 Amazon EC2 F1 instances on the Amazon Web Services (AWS) Cloud, 1,000 pediatric genomes were processed in two hours and 25 minutes.
Quick Amazon Elastic Compute Cloud (Amazon EC2) review
[Diagram: c5n.18xlarge instance. Hardware: EBS Volumes, Enhanced Networking. Software: NVMe, ENA.]
HPC software stack in Amazon EC2
HPC network performance
[Chart: EC2 MPI multi-stream bandwidth]
[Chart: EC2 MPI latency]
HPC software stack with EFA
Introducing EFA
[Diagram: c5n.18xlarge instance. Hardware: EBS Volumes, Enhanced Networking. Software: NVMe, ENA, EFA.]
HPC network performance with EFA
[Chart: EC2 MPI multi-stream bandwidth]
[Chart: EC2 MPI latency]
EFA getting started
• Supported platforms
  • c5n.18xlarge, c5n.9xlarge, p3dn.24xlarge
• EFA kernel module
  • Upstream in progress
  • https://github.com/amzn/amzn-drivers
• Libfabric network stack
  • AWS-custom version for first half 2019
• MPI implementation or NCCL
  • Open MPI 3.1.3 or later, or NCCL 2.3.8 or later
  • Intel MPI and MPICH in development
See https://aws.amazon.com/hpc/ for more details
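A minimal sketch of checking installed versions against the minimums listed above (Open MPI 3.1.3, NCCL 2.3.8). The helper names and dictionary keys are hypothetical, and the sketch assumes plain dotted numeric version strings; suffixes such as release-candidate tags are not handled.

```python
# Minimum versions taken from the slide; the dictionary keys are hypothetical labels.
MINIMUM_VERSIONS = {"openmpi": (3, 1, 3), "nccl": (2, 3, 8)}


def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as '3.1.3' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(library: str, installed: str) -> bool:
    """Compare an installed version against the minimum listed for EFA support."""
    return parse_version(installed) >= MINIMUM_VERSIONS[library]


print(meets_minimum("openmpi", "3.1.3"))  # True
print(meets_minimum("openmpi", "3.0.2"))  # False
print(meets_minimum("nccl", "2.4.2"))     # True
```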
EFA constraints
• Subnet-local communication
• Must have both an “allow all traffic within security group” ingress rule and egress rule
• 1 EFA ENI per instance
• EFA ENIs can only be added at instance launch or to a stopped instance
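The “allow all traffic within security group” constraint above can be expressed as a self-referencing permission. The sketch below only builds the data structure and does not call AWS. It assumes the `IpPermissions` shape accepted by boto3's `authorize_security_group_ingress` and `authorize_security_group_egress`; the function name is hypothetical and `sg-ABCD` is the placeholder ID used in the slides.

```python
def self_referencing_permissions(security_group_id: str) -> list:
    """Build a permission list that allows all traffic from members of the same group."""
    return [{
        "IpProtocol": "-1",  # -1 means every protocol, and therefore every port
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }]


rules = self_referencing_permissions("sg-ABCD")
print(rules)
```

Passing the same list to both the ingress and the egress call would give the pair of rules the slide asks for.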
Let’s build an EFA-enabled cluster
% aws ec2 run-instances --count=4 --region us-east-1 --image-id ami-ABCD \
    --instance-type c5n.18xlarge --placement GroupName=ABCD \
    --network-interfaces DeleteOnTermination=true,DeviceIndex=0,SubnetId=subnet-ABCD,InterfaceType=efa,Groups=sg-ABCD
What we just built…
Availability Zone #1
Subnet ABCD
Let’s build an EFA-enabled cluster
🕰
% ssh ec2-user@ec2-1-2-3-4.compute-1.amazonaws.com
% lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:06.0 Ethernet controller: Amazon.com, Inc. Device efa0
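The check on the `lspci` listing above can be scripted. This is a minimal sketch that scans `lspci` text for the EFA device; it assumes the device is shown as `Device efa0` exactly as on the slide (a newer PCI ID database might print a different name), and the function name is hypothetical.

```python
SAMPLE_LSPCI = """\
00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:06.0 Ethernet controller: Amazon.com, Inc. Device efa0
"""


def find_efa_devices(lspci_output: str) -> list:
    """Return the PCI addresses of lines that look like the EFA device."""
    addresses = []
    for line in lspci_output.splitlines():
        # Each line starts with the PCI address, followed by a description.
        address, _, description = line.partition(" ")
        if "Amazon.com" in description and description.rstrip().endswith("Device efa0"):
            addresses.append(address)
    return addresses


print(find_efa_devices(SAMPLE_LSPCI))  # ['00:06.0']
```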
And verify drivers…
% lsmod | grep efa
efa 81920 0
ib_core 266240 2 efa,ib_uverbs
% fi_info -p efa
provider: efa
fabric: EFA-fe80::883:afff:fed3:776c
domain: efa_0-rdm
version: 3.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::883:afff:fed3:776c
domain: efa_0-dgrm
version: 3.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxr
fabric: EFA-fe80::883:afff:fed3:776c
domain: efa_0-rdm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXR
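To check the `fi_info` output programmatically, the key/value lines can be grouped into one record per endpoint. This is a minimal sketch, assuming the output format shown above with a `provider:` line starting each record; the function name is hypothetical and the sample is a shortened copy of the slide output.

```python
SAMPLE_FI_INFO = """\
provider: efa
fabric: EFA-fe80::883:afff:fed3:776c
domain: efa_0-rdm
version: 3.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxr
fabric: EFA-fe80::883:afff:fed3:776c
domain: efa_0-rdm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXR
"""


def parse_fi_info(text: str) -> list:
    """Group 'key: value' lines into one dict per endpoint."""
    records = []
    for line in text.splitlines():
        # Split at the first colon only; fabric names contain more colons.
        key, separator, value = line.strip().partition(":")
        if not separator:
            continue
        key, value = key.strip(), value.strip()
        if key == "provider":
            records.append({})  # a provider line starts a new record
        if records:
            records[-1][key] = value
    return records


endpoints = parse_fi_info(SAMPLE_FI_INFO)
print([(e["provider"], e["type"], e["protocol"]) for e in endpoints])
```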
And verify MPI
% ompi_info | grep 'mtl: ofi'
MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v3.1.3)
% mpirun -np 2 -hostfile ~/h ./ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
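For readers unfamiliar with the ring test, the following is a sequential Python simulation of the token-passing logic suggested by the output above. It involves no MPI at all. It assumes rank 0 starts the token at 10 and decrements it on each lap, and that each rank exits after forwarding a zero. The function name is hypothetical, and a real multi-process run may interleave the exit lines differently.

```python
def simulate_ring(num_ranks: int, start_value: int = 10, tag: int = 201) -> list:
    """Return the log lines of a token passed around a ring of ranks (no MPI involved)."""
    if num_ranks < 1 or start_value < 1:
        raise ValueError("need at least one rank and a positive start value")
    first_hop = 1 % num_ranks
    log = [
        f"Process 0 sending {start_value} to {first_hop}, tag {tag} ({num_ranks} processes in ring)",
        f"Process 0 sent to {first_hop}",
    ]
    message = start_value
    rank = first_hop  # the rank that receives the token next
    while True:
        if rank == 0:
            message -= 1  # only rank 0 decrements the token
            log.append(f"Process 0 decremented value: {message}")
        # Every rank forwards the token; a rank exits once it has forwarded a zero.
        if message == 0:
            log.append(f"Process {rank} exiting")
            if rank == num_ranks - 1:
                break  # the last rank has exited, so the ring is done
        rank = (rank + 1) % num_ranks
    return log


for line in simulate_ring(2):
    print(line)
```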
How do I use EFA with…
AWS CloudFormation: Launch Templates
Amazon EC2 Auto Scaling Groups: Launch Templates
Spot/Spot Fleet: Launch Templates
AWS Batch: Launch Templates
Launch Templates: add InterfaceType : efa to the Network section
Creating a launch template with EFA
{
  "NetworkInterfaces": [{
    "AssociatePublicIpAddress": false,
    "DeviceIndex": 0,
    "SubnetId": "subnet-ABCD",
    "InterfaceType": "efa"
  }],
  "Placement": {
    "GroupName": "ABCD"
  },
  "ImageId": "ami-ABCD",
  "InstanceType": "c5n.18xlarge",
  "CpuOptions": {
    "CoreCount": 36,
    "ThreadsPerCore": 1
  }
}
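The same launch template data can be generated rather than written by hand. This is a minimal Python sketch; the function name and its parameters are hypothetical. It assumes two hardware threads per core, so that the 72 vCPUs of a c5n.18xlarge give the `CoreCount` of 36 shown above when `ThreadsPerCore` is set to 1.

```python
import json


def efa_launch_template_data(image_id, instance_type, subnet_id, placement_group,
                             vcpus, hw_threads_per_core=2):
    """Build launch template data with an EFA interface and one thread per core."""
    return {
        "NetworkInterfaces": [{
            "AssociatePublicIpAddress": False,
            "DeviceIndex": 0,
            "SubnetId": subnet_id,
            "InterfaceType": "efa",
        }],
        "Placement": {"GroupName": placement_group},
        "ImageId": image_id,
        "InstanceType": instance_type,
        "CpuOptions": {
            # Assumption: vCPUs are hardware threads, two per physical core.
            "CoreCount": vcpus // hw_threads_per_core,
            "ThreadsPerCore": 1,
        },
    }


data = efa_launch_template_data("ami-ABCD", "c5n.18xlarge", "subnet-ABCD", "ABCD", vcpus=72)
print(json.dumps(data, indent=2))
```

The printed JSON could be saved to a file and passed to `aws ec2 create-launch-template` through its `--launch-template-data` option.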
How we used to write HPC applications
Libfabric changes the picture
Libfabric Components
EFA and libfabric: endpoints
Two native endpoint types:
• RDM (Reliable DataGram)
• DGRM (unreliable DataGRaM)
One utility endpoint type:
• RxR (RDM over RDM)
EFA protocol custom to AWS
% fi_info -p efa
provider: efa
fabric: EFA-fe80::883:afff:fed3:776c
domain: efa_0-rdm
version: 3.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::883:afff:fed3:776c
domain: efa_0-dgrm
version: 3.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxr
fabric: EFA-fe80::883:afff:fed3:776c
domain: efa_0-rdm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXR
EFA native modes
• RDM
  • Reliable, unordered datagrams
  • ~8 KiB max message size
  • Send/receive interface, with no tag matching
  • Native multi-pathing; no “flow limit”
• DGRAM
  • Unreliable, unordered datagrams
  • ~8 KiB max message size
  • Send/receive interface
  • Subject to same “flow limit” as TCP/IP and UDP/IP over ENA
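Because the native modes cap messages at roughly 8 KiB and deliver them unordered, anything larger has to be split and put back together by a layer above. This is a minimal sketch of that idea, not EFA's wire format. It assumes exactly 8192 bytes per datagram (the slide only says “~8 KiB”) and ignores any header overhead; the function names are hypothetical.

```python
MAX_DATAGRAM_BYTES = 8 * 1024  # assumption: the slide gives only "~8 KiB"


def segment(payload: bytes, max_bytes: int = MAX_DATAGRAM_BYTES) -> list:
    """Split a payload into (offset, chunk) pairs no larger than max_bytes."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    return [(offset, payload[offset:offset + max_bytes])
            for offset in range(0, len(payload), max_bytes)]


def reassemble(segments) -> bytes:
    """Rebuild the payload from segments that may arrive in any order."""
    return b"".join(chunk for _, chunk in sorted(segments, key=lambda item: item[0]))


payload = bytes(range(256)) * 100      # 25,600 bytes
pieces = segment(payload)
print(len(pieces))                     # 4
arrived = list(reversed(pieces))       # simulate out-of-order delivery
print(reassemble(arrived) == payload)  # True
```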
Utility endpoint type
• RxR: build libfabric interface over RDM
  • Completion-ordered datagrams
  • Tag matching support (i.e., MPI)
  • Max message size > system memory size
  • Large iovecs
• RxR developed by AWS as part of EFA
  • Contributing back to Libfabric community shortly
  • Currently implemented to support MPI implementations
  • Future work includes supporting RMA and atomic transfer interfaces
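Tag matching, which the utility endpoint adds on top of RDM, pairs each arriving message with a receive posted for the same tag and holds early arrivals until such a receive is posted. This is a generic sketch of that concept, not the RxR implementation; the class and method names are hypothetical.

```python
class TagMatcher:
    """Pair arriving (tag, data) messages with receives posted for the same tag."""

    def __init__(self):
        self.posted = []      # tags of receives not yet matched, oldest first
        self.unexpected = []  # (tag, data) arrivals with no posted receive yet
        self.completed = []   # (tag, data) pairs in completion order

    def post_recv(self, tag):
        """Post a receive; complete it at once if a matching message already arrived."""
        for index, (message_tag, data) in enumerate(self.unexpected):
            if message_tag == tag:
                self.unexpected.pop(index)
                self.completed.append((tag, data))
                return
        self.posted.append(tag)

    def on_arrival(self, tag, data):
        """Handle an arriving message; queue it if no receive is waiting for its tag."""
        if tag in self.posted:
            self.posted.remove(tag)  # removes the oldest posted receive with this tag
            self.completed.append((tag, data))
        else:
            self.unexpected.append((tag, data))


matcher = TagMatcher()
matcher.post_recv(7)
matcher.on_arrival(9, b"early")  # no receive posted for tag 9 yet
matcher.on_arrival(7, b"hello")  # matches the posted receive
matcher.post_recv(9)             # picks up the queued message
print(matcher.completed)         # [(7, b'hello'), (9, b'early')]
```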
Scalable Reliable Datagram (SRD)
• New protocol designed for AWS’s unique datacenter network
  • Network-aware multipath routing
  • Guaranteed delivery
  • Orders of magnitude lower tail latency
  • No ordering guarantees
• Implemented as part of our 3rd generation Nitro chip
• EFA exposes SRD as a reliable datagram interface
SRD link failure handling
[Charts: link failure handling, TCP vs. SRD]
F.A.Q.s
• When will it reach general availability?
  • First half 2019
• How do I sign up for the preview?
  • https://pages.awscloud.com/elastic-fabric-adapter-preview.html
• What regions will EFA launch in?
  • Any region with C5n or P3dn support
• What is your MPI latency?
  • Less than 15 μs ½ RTT in placement group (osu_latency benchmark)
  • We will be constantly iterating behind the scenes to lower latency, including expanding .metal options
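The “½ RTT” figure is the one-way latency that a ping-pong benchmark such as osu_latency derives from round trips. This is a minimal sketch of that arithmetic; the function name is hypothetical, and the 0.28 s elapsed time is an illustrative number, not a measurement from the talk.

```python
def half_rtt_us(elapsed_seconds: float, iterations: int) -> float:
    """One-way latency in microseconds from a timed ping-pong loop."""
    if iterations < 1:
        raise ValueError("need at least one iteration")
    round_trip_seconds = elapsed_seconds / iterations
    return round_trip_seconds / 2 * 1e6


# Illustrative only: 10,000 round trips taking 0.28 s in total.
print(round(half_rtt_us(0.28, 10_000), 2))  # 14.0
```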
Thank you!
Brian Barrett
bbarrett@amazon.com
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

More Related Content

What's hot

Amazon EKS - Elastic Container Service for Kubernetes
Amazon EKS - Elastic Container Service for KubernetesAmazon EKS - Elastic Container Service for Kubernetes
Amazon EKS - Elastic Container Service for KubernetesAmazon Web Services
 
週末趣味のAWS Transit Gatewayでの経路制御
週末趣味のAWS Transit Gatewayでの経路制御週末趣味のAWS Transit Gatewayでの経路制御
週末趣味のAWS Transit Gatewayでの経路制御Namba Kazuo
 
서버리스 앱 배포 자동화 (김필중, AWS 솔루션즈 아키텍트) :: AWS DevDay2018
서버리스 앱 배포 자동화 (김필중, AWS 솔루션즈 아키텍트) :: AWS DevDay2018서버리스 앱 배포 자동화 (김필중, AWS 솔루션즈 아키텍트) :: AWS DevDay2018
서버리스 앱 배포 자동화 (김필중, AWS 솔루션즈 아키텍트) :: AWS DevDay2018Amazon Web Services Korea
 
Getting Started with AWS Compute Services
Getting Started with AWS Compute ServicesGetting Started with AWS Compute Services
Getting Started with AWS Compute ServicesAmazon Web Services
 
Designing security & governance via AWS Control Tower & Organizations - SEC30...
Designing security & governance via AWS Control Tower & Organizations - SEC30...Designing security & governance via AWS Control Tower & Organizations - SEC30...
Designing security & governance via AWS Control Tower & Organizations - SEC30...Amazon Web Services
 
AWS 상의 컨테이너 서비스 소개 ECS, EKS - 이종립 / Principle Enterprise Evangelist @베스핀글로벌
AWS 상의 컨테이너 서비스 소개 ECS, EKS - 이종립 / Principle Enterprise Evangelist @베스핀글로벌AWS 상의 컨테이너 서비스 소개 ECS, EKS - 이종립 / Principle Enterprise Evangelist @베스핀글로벌
AWS 상의 컨테이너 서비스 소개 ECS, EKS - 이종립 / Principle Enterprise Evangelist @베스핀글로벌BESPIN GLOBAL
 
AWS Black Belt Online Seminar AWS Key Management Service (KMS)
AWS Black Belt Online Seminar AWS Key Management Service (KMS) AWS Black Belt Online Seminar AWS Key Management Service (KMS)
AWS Black Belt Online Seminar AWS Key Management Service (KMS) Amazon Web Services Japan
 
Amazon Virtual Private Cloud (VPC): Networking Fundamentals and Connectivity ...
Amazon Virtual Private Cloud (VPC): Networking Fundamentals and Connectivity ...Amazon Virtual Private Cloud (VPC): Networking Fundamentals and Connectivity ...
Amazon Virtual Private Cloud (VPC): Networking Fundamentals and Connectivity ...Amazon Web Services
 
20190122 AWS Black Belt Online Seminar Amazon Redshift Update
20190122 AWS Black Belt Online Seminar Amazon Redshift Update20190122 AWS Black Belt Online Seminar Amazon Redshift Update
20190122 AWS Black Belt Online Seminar Amazon Redshift UpdateAmazon Web Services Japan
 
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵Amazon Web Services Korea
 
AWS Direct Connect: Deep Dive (NET403) - AWS re:Invent 2018
AWS Direct Connect: Deep Dive (NET403) - AWS re:Invent 2018AWS Direct Connect: Deep Dive (NET403) - AWS re:Invent 2018
AWS Direct Connect: Deep Dive (NET403) - AWS re:Invent 2018Amazon Web Services
 

What's hot (20)

AWS 101
AWS 101AWS 101
AWS 101
 
Amazon EKS - Elastic Container Service for Kubernetes
Amazon EKS - Elastic Container Service for KubernetesAmazon EKS - Elastic Container Service for Kubernetes
Amazon EKS - Elastic Container Service for Kubernetes
 
Migrating to the Cloud
Migrating to the CloudMigrating to the Cloud
Migrating to the Cloud
 
週末趣味のAWS Transit Gatewayでの経路制御
週末趣味のAWS Transit Gatewayでの経路制御週末趣味のAWS Transit Gatewayでの経路制御
週末趣味のAWS Transit Gatewayでの経路制御
 
Overview of Amazon Web Services
Overview of Amazon Web ServicesOverview of Amazon Web Services
Overview of Amazon Web Services
 
AWS IAM Introduction
AWS IAM IntroductionAWS IAM Introduction
AWS IAM Introduction
 
서버리스 앱 배포 자동화 (김필중, AWS 솔루션즈 아키텍트) :: AWS DevDay2018
서버리스 앱 배포 자동화 (김필중, AWS 솔루션즈 아키텍트) :: AWS DevDay2018서버리스 앱 배포 자동화 (김필중, AWS 솔루션즈 아키텍트) :: AWS DevDay2018
서버리스 앱 배포 자동화 (김필중, AWS 솔루션즈 아키텍트) :: AWS DevDay2018
 
What is AWS?
What is AWS?What is AWS?
What is AWS?
 
Getting Started with AWS Compute Services
Getting Started with AWS Compute ServicesGetting Started with AWS Compute Services
Getting Started with AWS Compute Services
 
Designing security & governance via AWS Control Tower & Organizations - SEC30...
Designing security & governance via AWS Control Tower & Organizations - SEC30...Designing security & governance via AWS Control Tower & Organizations - SEC30...
Designing security & governance via AWS Control Tower & Organizations - SEC30...
 
AWS 상의 컨테이너 서비스 소개 ECS, EKS - 이종립 / Principle Enterprise Evangelist @베스핀글로벌
AWS 상의 컨테이너 서비스 소개 ECS, EKS - 이종립 / Principle Enterprise Evangelist @베스핀글로벌AWS 상의 컨테이너 서비스 소개 ECS, EKS - 이종립 / Principle Enterprise Evangelist @베스핀글로벌
AWS 상의 컨테이너 서비스 소개 ECS, EKS - 이종립 / Principle Enterprise Evangelist @베스핀글로벌
 
AWS Security and SecOps
AWS Security and SecOpsAWS Security and SecOps
AWS Security and SecOps
 
Introduction to Amazon EC2
Introduction to Amazon EC2Introduction to Amazon EC2
Introduction to Amazon EC2
 
AWS Black Belt Online Seminar AWS Key Management Service (KMS)
AWS Black Belt Online Seminar AWS Key Management Service (KMS) AWS Black Belt Online Seminar AWS Key Management Service (KMS)
AWS Black Belt Online Seminar AWS Key Management Service (KMS)
 
Amazon Virtual Private Cloud (VPC): Networking Fundamentals and Connectivity ...
Amazon Virtual Private Cloud (VPC): Networking Fundamentals and Connectivity ...Amazon Virtual Private Cloud (VPC): Networking Fundamentals and Connectivity ...
Amazon Virtual Private Cloud (VPC): Networking Fundamentals and Connectivity ...
 
AWS EC2 Fundametals
AWS EC2 FundametalsAWS EC2 Fundametals
AWS EC2 Fundametals
 
20190122 AWS Black Belt Online Seminar Amazon Redshift Update
20190122 AWS Black Belt Online Seminar Amazon Redshift Update20190122 AWS Black Belt Online Seminar Amazon Redshift Update
20190122 AWS Black Belt Online Seminar Amazon Redshift Update
 
Aws route 53
Aws route 53Aws route 53
Aws route 53
 
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 
AWS Direct Connect: Deep Dive (NET403) - AWS re:Invent 2018
AWS Direct Connect: Deep Dive (NET403) - AWS re:Invent 2018AWS Direct Connect: Deep Dive (NET403) - AWS re:Invent 2018
AWS Direct Connect: Deep Dive (NET403) - AWS re:Invent 2018
 

Similar to [NEW LAUNCH!] Scaling Tightly-coupled HPC workloads on HPC with Elastic Fabric Adapter and High Bandwidth (Network Optimized) EC2 Instances. (ENT360) - AWS re:Invent 2018

[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...Amazon Web Services
 
Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...
Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...
Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...Amazon Web Services
 
Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...
Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...
Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...Amazon Web Services
 
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...Amazon Web Services
 
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...Amazon Web Services
 
Run Production Workloads on Spot, Save up to 90%
Run Production Workloads on Spot, Save up to 90%Run Production Workloads on Spot, Save up to 90%
Run Production Workloads on Spot, Save up to 90%Amazon Web Services
 
AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...
AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...
AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...Amazon Web Services
 
Semplificare la gestione dei container con i servizi AWS
Semplificare la gestione dei container con i servizi AWSSemplificare la gestione dei container con i servizi AWS
Semplificare la gestione dei container con i servizi AWSAmazon Web Services
 
Deploying Microservices using AWS Fargate (CON315-R1) - AWS re:Invent 2018
Deploying Microservices using AWS Fargate (CON315-R1) - AWS re:Invent 2018Deploying Microservices using AWS Fargate (CON315-R1) - AWS re:Invent 2018
Deploying Microservices using AWS Fargate (CON315-R1) - AWS re:Invent 2018Amazon Web Services
 
Module 2: Core AWS Compute and Storage Services - Virtual AWSome Day June 2018
Module 2: Core AWS Compute and Storage Services - Virtual AWSome Day June 2018Module 2: Core AWS Compute and Storage Services - Virtual AWSome Day June 2018
Module 2: Core AWS Compute and Storage Services - Virtual AWSome Day June 2018Amazon Web Services
 
Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018
Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018
Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018Amazon Web Services
 
Optimizing Application Performance and Costs with Auto Scaling - AWS Online T...
Optimizing Application Performance and Costs with Auto Scaling - AWS Online T...Optimizing Application Performance and Costs with Auto Scaling - AWS Online T...
Optimizing Application Performance and Costs with Auto Scaling - AWS Online T...Amazon Web Services
 
Getting-started-with-containers on AWS
Getting-started-with-containers on AWSGetting-started-with-containers on AWS
Getting-started-with-containers on AWSAmazon Web Services
 
Containers and mission-critical applications - SEP309-R - AWS re:Inforce 2019
Containers and mission-critical applications - SEP309-R - AWS re:Inforce 2019 Containers and mission-critical applications - SEP309-R - AWS re:Inforce 2019
Containers and mission-critical applications - SEP309-R - AWS re:Inforce 2019 Amazon Web Services
 
SRV318 Running Kubernetes with Amazon EKS
SRV318 Running Kubernetes with Amazon EKSSRV318 Running Kubernetes with Amazon EKS
SRV318 Running Kubernetes with Amazon EKSAmazon Web Services
 
Best Practices for Scalable Monitoring (ENT310-S) - AWS re:Invent 2018
Best Practices for Scalable Monitoring (ENT310-S) - AWS re:Invent 2018Best Practices for Scalable Monitoring (ENT310-S) - AWS re:Invent 2018
Best Practices for Scalable Monitoring (ENT310-S) - AWS re:Invent 2018Amazon Web Services
 

Similar to [NEW LAUNCH!] Scaling Tightly-coupled HPC workloads on HPC with Elastic Fabric Adapter and High Bandwidth (Network Optimized) EC2 Instances. (ENT360) - AWS re:Invent 2018 (20)

[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
 
Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...
Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...
Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...
 
Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...
Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...
Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...
 
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
 
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
 
Compute@Scale
Compute@ScaleCompute@Scale
Compute@Scale
 
Run Production Workloads on Spot, Save up to 90%
Run Production Workloads on Spot, Save up to 90%Run Production Workloads on Spot, Save up to 90%
Run Production Workloads on Spot, Save up to 90%
 
AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...
AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...
AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...
 
AWS Container services
AWS Container servicesAWS Container services
AWS Container services
 
Containers - State of the Union
Containers - State of the UnionContainers - State of the Union
Containers - State of the Union
 
Semplificare la gestione dei container con i servizi AWS
Semplificare la gestione dei container con i servizi AWSSemplificare la gestione dei container con i servizi AWS
Semplificare la gestione dei container con i servizi AWS
 
Deploying Microservices using AWS Fargate (CON315-R1) - AWS re:Invent 2018
Deploying Microservices using AWS Fargate (CON315-R1) - AWS re:Invent 2018Deploying Microservices using AWS Fargate (CON315-R1) - AWS re:Invent 2018
Deploying Microservices using AWS Fargate (CON315-R1) - AWS re:Invent 2018
 
Module 2: Core AWS Compute and Storage Services - Virtual AWSome Day June 2018
Module 2: Core AWS Compute and Storage Services - Virtual AWSome Day June 2018Module 2: Core AWS Compute and Storage Services - Virtual AWSome Day June 2018
Module 2: Core AWS Compute and Storage Services - Virtual AWSome Day June 2018
 
Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018
Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018
Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018
 
Optimizing Application Performance and Costs with Auto Scaling - AWS Online T...
Optimizing Application Performance and Costs with Auto Scaling - AWS Online T...Optimizing Application Performance and Costs with Auto Scaling - AWS Online T...
Optimizing Application Performance and Costs with Auto Scaling - AWS Online T...
 
Getting-started-with-containers on AWS
Getting-started-with-containers on AWSGetting-started-with-containers on AWS
Getting-started-with-containers on AWS
 
Containers and mission-critical applications - SEP309-R - AWS re:Inforce 2019
Containers and mission-critical applications - SEP309-R - AWS re:Inforce 2019 Containers and mission-critical applications - SEP309-R - AWS re:Inforce 2019
Containers and mission-critical applications - SEP309-R - AWS re:Inforce 2019
 
SRV318 Running Kubernetes with Amazon EKS
SRV318 Running Kubernetes with Amazon EKSSRV318 Running Kubernetes with Amazon EKS
SRV318 Running Kubernetes with Amazon EKS
 
Best Practices for Scalable Monitoring (ENT310-S) - AWS re:Invent 2018
Best Practices for Scalable Monitoring (ENT310-S) - AWS re:Invent 2018Best Practices for Scalable Monitoring (ENT310-S) - AWS re:Invent 2018
Best Practices for Scalable Monitoring (ENT310-S) - AWS re:Invent 2018
 
EKS Workshop
 EKS Workshop EKS Workshop
EKS Workshop
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
[NEW LAUNCH!] Scaling Tightly-coupled HPC workloads on HPC with Elastic Fabric Adapter and High Bandwidth (Network Optimized) EC2 Instances. (ENT360) - AWS re:Invent 2018

  • 1.
  • 2. Scaling HPC applications in EC2 with Elastic Fabric Adapter. Brian Barrett, Principal Engineer, AWS. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 3. Agenda
      • HPC applications in AWS
      • What is EFA?
      • Getting started with EFA
      • EFA tech deep-dive
  • 4. Related breakouts
      • High Performance Computing on AWS: Wednesday, Nov 28, 3:15 – 4:15 | Aria East, Piazza Level, Orovada 2
      • Running High Performance Computing Workloads in the Cloud: Wednesday, Nov 28, 4:00 – 5:00 | Aria West, Level 3, Starvine 3, Table 8
      • Deploying a Burstable and Event-Driven HPC Cluster on AWS: Wednesday, Nov 28, 1:00 – 2:00 | Aria West, Level 3, Starvine 10, Table 6
  • 5.
  • 6. Weather simulation
  • 7. Structural simulation [Chart: run time in seconds and scale-up versus core count on c4.8xlarge]
  • 8. Cloud for product design and engineering
      Trek applications for engineering:
      • Computational Fluid Dynamics
      • Star-CCM+ and HEEDS software
      Simulations for bicycle design:
      • Execute multiple simulations in parallel
      • Fully explore the design space to make informed decisions about drafting techniques related to competitive bicycling
  • 9. Fluid dynamics – Ansys Fluent
      • C4.8xlarge instance type
      • 140M cell model
      • F1 car CFD benchmark
  • 10. HPC in aerospace: Boom leverages Rescale and AWS to enable supersonic travel
      • Simulated vortex lift with 200M cell models on 512+ cores
      • Increased simulation throughput: 100 jobs in parallel with 6x speedup per job → 600x speedup
      • Elastic HPC capacity and pay-as-you-go AWS clusters allow business agility & ability to scale
      “Rescale’s ScaleX cloud platform is a game-changer for engineering. It gives Boom computing resources comparable to building a large on-premise HPC center. Rescale lets us move fast with minimal capital spending and resources overhead.” Josh Krall, CTO & Co-Founder
  • 11. Genomics processing on FPGA
      Children’s Hospital of Philadelphia and Edico Genome Achieve Fastest-Ever Analysis of 1,000 Genomes
      Orlando, Fla., Oct 19, 2018 – The Children’s Hospital of Philadelphia (CHOP) and Edico Genome today set a new scientific world standard in rapidly processing whole human genomes into data files usable for researchers aiming to bring precision medicine into mainstream clinical practice. Utilizing Edico Genome’s DRAGEN™ Genome Pipeline, deployed on 1,000 Amazon EC2 F1 instances on the Amazon Web Services (AWS) Cloud, 1,000 pediatric genomes were processed in two hours and 25 minutes.
  • 12.
  • 13. Quick Amazon Elastic Compute Cloud (Amazon EC2) review [Diagram of a c5n.18xlarge instance; labels: Hardware, Software, EBS Volumes, Enhanced Networking, NVMe, ENA]
  • 14. HPC software stack in Amazon EC2
  • 15. HPC software stack in Amazon EC2
  • 16. HPC network performance [Charts: EC2 MPI multi-stream bandwidth; EC2 MPI latency]
  • 17. HPC software stack with EFA
  • 18. HPC software stack with EFA
  • 19. Introducing EFA [Diagram of a c5n.18xlarge instance; labels: Hardware, Software, EBS Volumes, Enhanced Networking, NVMe, ENA, EFA]
  • 20. HPC network performance with EFA [Charts: EC2 MPI multi-stream bandwidth; EC2 MPI latency]
  • 21.
  • 22. EFA getting started
      • Supported platforms
          • C5n.18xlarge, C5n.9xlarge, P3dn.24xlarge
      • EFA kernel module
          • Upstreaming in progress
          • https://github.com/amzn/amzn-drivers
      • Libfabric network stack
          • AWS-custom version for first half 2019
      • MPI implementation or NCCL
          • Open MPI 3.1.3 or later, or NCCL 2.3.8 or later
          • Intel MPI and MPICH in development
      See https://aws.amazon.com/hpc/ for more details
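The minimum versions on this slide can be checked mechanically. The following is a minimal sketch, assuming GNU `sort -V` is available; `version_ge` and `check_ompi` are hypothetical helper names, and the version strings passed in are examples rather than output read from a real installation.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted version A >= B (hypothetical helper).
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum Open MPI version taken from the slide.
MIN_OMPI="3.1.3"

# check_ompi VERSION: report whether VERSION meets the minimum.
check_ompi() {
    if version_ge "$1" "$MIN_OMPI"; then
        echo "Open MPI $1: ok"
    else
        echo "Open MPI $1: too old, need >= $MIN_OMPI"
    fi
}

# Example version strings (assumed, not detected).
check_ompi "3.1.2"
check_ompi "3.1.3"
check_ompi "4.0.1"
```

On a real node the version string would have to be taken from the installed MPI, for example by parsing the output of `mpirun --version`; the exact output format is an assumption to verify locally.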
  • 23. EFA constraints
      • Subnet-local communication
      • Must have both an “allow all traffic within security group” ingress and egress rule
      • 1 EFA ENI per instance
      • EFA ENIs can only be added at instance launch or to a stopped instance
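The security group constraint above can be sketched with the AWS CLI as two self-referencing rules. This is a sketch under assumptions: `sg-ABCD` is the placeholder ID used on the slides, `efa_sg_rules` is a hypothetical helper name, and the script only prints the commands unless `RUN` is set to an empty string.

```shell
#!/bin/sh
# Dry run by default: RUN=echo prints each command instead of executing it.
# Set RUN= (empty) in the environment to actually call the AWS CLI.
RUN="${RUN-echo}"

# efa_sg_rules SG_ID: allow all traffic from the group to itself,
# in both directions (hypothetical helper).
efa_sg_rules() {
    # IpProtocol=-1 means "all protocols"; the source/destination is the group itself.
    perm="IpProtocol=-1,UserIdGroupPairs=[{GroupId=$1}]"
    $RUN aws ec2 authorize-security-group-ingress --group-id "$1" --ip-permissions "$perm"
    $RUN aws ec2 authorize-security-group-egress --group-id "$1" --ip-permissions "$perm"
}

# Placeholder security group ID, as on the slides.
efa_sg_rules "sg-ABCD"
```

Keeping the script in dry-run mode by default makes it safe to review the generated commands before touching a real account.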
  • 24. Let’s build an EFA-enabled cluster
        % aws ec2 run-instances --count=4 --region us-east-1 \
            --image-id ami-ABCD --instance-type c5n.18xlarge \
            --placement GroupName=ABCD \
            --network-interfaces DeleteOnTermination=true,DeviceIndex=0,SubnetId=subnet-ABCD,InterfaceType=efa \
            --security-group-ids sg-ABCD
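The launch command on this slide presupposes an existing placement group. The sketch below adds that step and shows a variant of the launch call. Two things here are assumptions rather than content from the talk: the `cluster` placement strategy, and passing the security group inside the interface specification (`Groups=`), which current AWS CLI versions generally expect when `--network-interfaces` is used. `launch_efa_cluster` is a hypothetical helper name, and all IDs are the slide's placeholders.

```shell
#!/bin/sh
# Dry run by default: RUN=echo prints each command instead of executing it.
# Set RUN= (empty) in the environment to actually call the AWS CLI.
RUN="${RUN-echo}"

# Placeholder identifiers, as on the slide.
REGION="us-east-1"
PG="ABCD"
AMI="ami-ABCD"
SUBNET="subnet-ABCD"
SG="sg-ABCD"

launch_efa_cluster() {
    # Assumption: a "cluster" placement group keeps the instances close together.
    $RUN aws ec2 create-placement-group --region "$REGION" \
        --group-name "$PG" --strategy cluster

    # Assumption: the security group is given per interface (Groups=...)
    # instead of through --security-group-ids.
    $RUN aws ec2 run-instances --count 4 --region "$REGION" \
        --image-id "$AMI" --instance-type c5n.18xlarge \
        --placement "GroupName=$PG" \
        --network-interfaces "DeleteOnTermination=true,DeviceIndex=0,SubnetId=$SUBNET,Groups=$SG,InterfaceType=efa"
}

launch_efa_cluster
```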
  • 25. What we just built… [Diagram: Availability Zone #1, Subnet ABCD]
  • 26. Let’s build an EFA-enabled cluster
        % aws ec2 run-instances --count=4 --region us-east-1 \
            --image-id ami-ABCD --instance-type c5n.18xlarge \
            --placement GroupName=ABCD \
            --network-interfaces DeleteOnTermination=true,DeviceIndex=0,SubnetId=subnet-ABCD,InterfaceType=efa \
            --security-group-ids sg-ABCD
        🕰
        % ssh ec2-user@ec2-1-2-3-4.compute-1.amazonaws.com
        % lspci
        00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
        00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
        00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
        00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
        00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
        00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
        00:06.0 Ethernet controller: Amazon.com, Inc. Device efa0
  • 27. And verify drivers…
        % lsmod | grep efa
        efa                    81920  0
        ib_core               266240  2 efa,ib_uverbs
        % fi_info -p efa
        provider: efa
            fabric: EFA-fe80::883:afff:fed3:776c
            domain: efa_0-rdm
            version: 3.0
            type: FI_EP_RDM
            protocol: FI_PROTO_EFA
        provider: efa
            fabric: EFA-fe80::883:afff:fed3:776c
            domain: efa_0-dgrm
            version: 3.0
            type: FI_EP_DGRAM
            protocol: FI_PROTO_EFA
        provider: efa;ofi_rxr
            fabric: EFA-fe80::883:afff:fed3:776c
            domain: efa_0-rdm
            version: 1.0
            type: FI_EP_RDM
            protocol: FI_PROTO_RXR
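The driver check on this slide can be scripted by scanning the `fi_info` output for the native reliable-datagram endpoint. This is a minimal sketch, assuming the output layout shown on the slide (one `key: value` pair per line, with `protocol:` last in each record); `has_efa_rdm` is a hypothetical helper name, and the sample input is copied from the slide rather than produced by a live command.

```shell
#!/bin/sh
# has_efa_rdm: read `fi_info -p efa` output on stdin and succeed if a native
# EFA reliable-datagram endpoint is listed (hypothetical helper).
has_efa_rdm() {
    awk '
        $1 == "provider:" { prov = $2 }
        $1 == "type:"     { ep = $2 }
        # protocol: is assumed to close each record, so prov and ep are current here.
        $1 == "protocol:" && prov == "efa" && ep == "FI_EP_RDM" && $2 == "FI_PROTO_EFA" { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Sample input copied from the slide; on a real node use: fi_info -p efa | has_efa_rdm
has_efa_rdm <<'EOF' && echo "EFA RDM endpoint found"
provider: efa
    fabric: EFA-fe80::883:afff:fed3:776c
    domain: efa_0-rdm
    version: 3.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
EOF
```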
  • 28. And verify MPI
        % ompi_info | grep 'mtl: ofi'
        MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v3.1.3)
        % mpirun -np 2 -hostfile ~/h ./ring_c
        Process 0 sending 10 to 1, tag 201 (2 processes in ring)
        Process 0 sent to 1
        Process 0 decremented value: 9
        Process 0 decremented value: 8
        Process 0 decremented value: 7
        Process 0 decremented value: 6
        Process 0 decremented value: 5
        Process 0 decremented value: 4
        Process 0 decremented value: 3
        Process 0 decremented value: 2
        Process 0 decremented value: 1
        Process 0 decremented value: 0
        Process 0 exiting
        Process 1 exiting
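The `mpirun` call above reads a hostfile (`~/h`) that the slide does not show. A minimal sketch of generating one follows, assuming Open MPI's `host slots=N` hostfile syntax; `make_hostfile` is a hypothetical helper name and the IP addresses are made-up examples.

```shell
#!/bin/sh
# make_hostfile SLOTS HOST...: print one Open MPI hostfile line per host
# (hypothetical helper).
make_hostfile() {
    slots="$1"
    shift
    for host in "$@"; do
        echo "$host slots=$slots"
    done
}

# Example private IPs (assumed). One slot per host places the two ranks of the
# ring test on different instances, so the messages cross the network.
make_hostfile 1 10.0.0.11 10.0.0.12

# To use it (not executed here):
#   make_hostfile 1 10.0.0.11 10.0.0.12 > ~/h
#   mpirun -np 2 -hostfile ~/h --mca pml cm --mca mtl ofi ./ring_c
```

The `--mca pml cm --mca mtl ofi` options ask Open MPI to use its libfabric (OFI) transport explicitly; whether they are needed depends on how the local Open MPI build selects components.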
  • 29. How do I use EFA with…
      • AWS CloudFormation: Launch Templates
      • Amazon EC2 Auto Scaling Groups: Launch Templates
      • Spot/Spot Fleet: Launch Templates
      • AWS Batch: Launch Templates
      • Launch Templates: add InterfaceType: efa to the Network section
  • 30. Creating a launch template with EFA
        {
          "NetworkInterfaces": [{
            "AssociatePublicIpAddress": false,
            "DeviceIndex": 0,
            "SubnetId": "subnet-ABCD",
            "InterfaceType": "efa"
          }],
          "Placement": { "GroupName": "ABCD" },
          "ImageId": "ami-ABCD",
          "InstanceType": "c5n.18xlarge",
          "CpuOptions": { "CoreCount": 36, "ThreadsPerCore": 1 }
        }
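The JSON on this slide is launch template data; registering it takes one more CLI call. This is a sketch under assumptions: the template name `efa-demo` and the file location are hypothetical, the IDs are the slide's placeholders, and the AWS command is only printed unless `RUN` is set to an empty string.

```shell
#!/bin/sh
# Dry run by default: RUN=echo prints the command instead of executing it.
# Set RUN= (empty) in the environment to actually call the AWS CLI.
RUN="${RUN-echo}"

TEMPLATE_NAME="efa-demo"                                # hypothetical name
DATA_FILE="${TMPDIR:-/tmp}/efa-launch-template.json"    # hypothetical location

# Write the launch template data from the slide to a file.
cat > "$DATA_FILE" <<'EOF'
{
  "NetworkInterfaces": [{
    "AssociatePublicIpAddress": false,
    "DeviceIndex": 0,
    "SubnetId": "subnet-ABCD",
    "InterfaceType": "efa"
  }],
  "Placement": { "GroupName": "ABCD" },
  "ImageId": "ami-ABCD",
  "InstanceType": "c5n.18xlarge",
  "CpuOptions": { "CoreCount": 36, "ThreadsPerCore": 1 }
}
EOF

# Register the template, reading the data from the file written above.
$RUN aws ec2 create-launch-template \
    --launch-template-name "$TEMPLATE_NAME" \
    --launch-template-data "file://$DATA_FILE"
```

Setting `ThreadsPerCore` to 1 with 36 cores, as on the slide, runs one thread per physical core, a common choice for MPI workloads.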
  • 31.
  • 32. How we used to write HPC applications
  • 33. Libfabric changes the picture
  • 34. Libfabric components
  • 35. EFA and libfabric: endpoints
      Two native endpoint types:
      • RDM (Reliable DataGram)
      • DGRM (unreliable DataGRaM)
      One utility endpoint type:
      • RxR (RDM over RDM)
      EFA protocol custom to AWS
        % fi_info -p efa
        provider: efa
            fabric: EFA-fe80::883:afff:fed3:776c
            domain: efa_0-rdm
            version: 3.0
            type: FI_EP_RDM
            protocol: FI_PROTO_EFA
        provider: efa
            fabric: EFA-fe80::883:afff:fed3:776c
            domain: efa_0-dgrm
            version: 3.0
            type: FI_EP_DGRAM
            protocol: FI_PROTO_EFA
        provider: efa;ofi_rxr
            fabric: EFA-fe80::883:afff:fed3:776c
            domain: efa_0-rdm
            version: 1.0
            type: FI_EP_RDM
            protocol: FI_PROTO_RXR
  • 36. EFA native modes
      • RDM
          • Reliable, unordered datagrams
          • ~8 KiB max message size
          • Send/receive interface, with no tag matching
          • Native multi-pathing; no “flow limit”
      • DGRAM
          • Unreliable, unordered datagrams
          • ~8 KiB max message size
          • Send/receive interface
          • Subject to same “flow limit” as TCP/IP and UDP/IP over ENA
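Because both native modes cap a message at roughly 8 KiB, a larger transfer has to be split into several datagrams by a layer above. The arithmetic can be sketched as a ceiling division. The sketch assumes exactly 8192 payload bytes per datagram and ignores any header overhead; both are simplifications, since the slide only says "~8 KiB". `datagrams_needed` is a hypothetical helper name.

```shell
#!/bin/sh
# datagrams_needed BYTES [MAX]: number of datagrams of at most MAX payload
# bytes needed to carry BYTES (ceiling division; hypothetical helper).
# MAX defaults to 8192, an assumed stand-in for the "~8 KiB" limit.
datagrams_needed() {
    max="${2:-8192}"
    echo $(( ($1 + max - 1) / max ))
}

datagrams_needed 8192       # exactly one full datagram
datagrams_needed 8193       # one byte over the limit needs a second datagram
datagrams_needed 1048576    # a 1 MiB message
```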
  • 37. Utility endpoint type
      • RxR: builds a libfabric interface over RDM
          • Completion-ordered datagrams
          • Tag matching support (i.e., MPI)
          • Max message size > system memory size
          • Large iovecs
      • RxR developed by AWS as part of EFA
          • Contributing back to the Libfabric community shortly
      • Currently implemented to support MPI implementations
      • Future work includes supporting RMA and atomic transfer interfaces
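Since the native RDM endpoint is reliable but unordered, a layer such as RxR has to restore order before completions are reported. The toy sketch below illustrates only the idea of reordering by sequence number. It is not how RxR is implemented: the `SEQ PAYLOAD` line framing and the `reorder` helper are hypothetical, and it works on a finished batch rather than a live stream.

```shell
#!/bin/sh
# reorder: read "SEQ PAYLOAD" lines in arrival order and print the payloads
# in sequence order (hypothetical helper, batch-only toy).
reorder() {
    # Sort numerically on the first field, then drop the sequence number.
    sort -n -k1,1 | cut -d' ' -f2-
}

# Three datagrams arriving out of order (made-up example data).
printf '%s\n' '2 world' '0 hello' '1 big' | reorder
```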
  • 38. Scalable Reliable Datagram (SRD)
      • New protocol designed for AWS’s unique datacenter network
          • Network-aware multipath routing
          • Guaranteed delivery
          • Orders of magnitude lower tail latency
          • No ordering guarantees
      • Implemented as part of our 3rd generation Nitro chip
      • EFA exposes SRD as a reliable datagram interface
  • 39. SRD link failure handling [Charts: TCP; SRD]
  • 40. F.A.Q.s
      • When will it reach general availability?
          • First half 2019
      • How do I sign up for the preview?
          • https://pages.awscloud.com/elastic-fabric-adapter-preview.html
      • What regions will EFA launch in?
          • Any region with C5n or P3dn support
      • What is your MPI latency?
          • Less than 15 µs ½ RTT in placement group (osu_latency benchmark)
          • We will be constantly iterating behind the scenes to lower latency, including expanding .metal options
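The latency figure above refers to the `osu_latency` benchmark, which reports one latency per message size. A minimal sketch of checking a run against the 15 µs figure follows. It assumes the usual two-column output (`#` comment lines, then `size latency` rows), the helper names are hypothetical, and the sample numbers are placeholders for illustration, not measurements.

```shell
#!/bin/sh
# small_msg_latency: read osu_latency output on stdin and print the latency
# (in microseconds) of the first, smallest message size (hypothetical helper).
small_msg_latency() {
    awk '!/^#/ && NF == 2 { print $2; exit }'
}

# under_threshold LATENCY_US LIMIT_US: succeed if LATENCY_US < LIMIT_US.
under_threshold() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit (l + 0 < t + 0) ? 0 : 1 }'
}

# Placeholder sample output (NOT a real measurement).
lat=$(printf '%s\n' \
    '# OSU MPI Latency Test' \
    '# Size          Latency (us)' \
    '0                      12.34' \
    '1                      12.40' | small_msg_latency)

if under_threshold "$lat" 15; then
    echo "$lat us: under 15 us"
else
    echo "$lat us: not under 15 us"
fi
```

On a real cluster the input would come from something like `mpirun -np 2 -hostfile ~/h ./osu_latency`, with the two ranks placed on different instances in the same placement group.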
  • 41. Thank you! Brian Barrett, bbarrett@amazon.com
  • 42.