© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
STORAGE FOR HPC IN THE CLOUD
Isaiah Weiner
Sr. Mgr. Solutions Architecture
GPSTEC324

HPC IS COMPLEX
[Diagram labels: I/O, Cost, TTR]
[Diagram labels: National Labs; Research, Energy & Utilities; Genomics; Analytics, AI/ML; EDA; M&E; Finance]

GROWTH IN CLOUD
CLOUD MARKET FORECAST
                2015  2016  2017  2018  2019  2020
On-Prem          70%   65%   61%   59%   58%   55%
Public Cloud     10%   15%   26%   26%   26%   28%
Private Cloud    10%   12%   13%   15%   16%   17%
Source: IDC Worldwide Quarterly Cloud IT Infrastructure Tracker

GROWTH IN STORAGE
[Chart: storage in exabytes (axis 0 to 80,000), 2016 to 2020; series: Enterprise, HPC]
Source: Gartner for Enterprise and IDC for High Performance

DIY – NFS
[Diagram: three NFS servers, each with two volumes; three groups of NFS clients]

AMAZON EFS ARCHITECTURE
[Diagram: three groups of clients, three mount targets, one single namespace]

HPC CLUSTER NODE ANATOMY
[Diagram labels: Data, Metadata, Tiering, Backend, Routing, Monitoring, VIPs, Clustering, Storage, Access, Protocols, Frontend, Network]

SEMICONDUCTOR CUSTOMER
Challenge: Local SSD performance with centralized management. Local SSDs required data to be copied around, and multiple copies took up space.
Solution delivers: Scalable, sharable, simplified; one copy of the data, on par with local SSD for performance.
[Chart: Elapsed Time (Lower is Better), axis 0 to 250; series: Local SSD, WekaIO, NFSv4]

GENOMICS CUSTOMER
Challenge: Small file workload with some large files in the mix. Pre-solution workaround: more jobs! All the jobs!
Solution delivers: Scalable, sharable, simplified; one copy of the data, on par with local SSD for performance.
[Chart: Elapsed Time (Lower is Better), axis 0 to 140; categories: WekaIO, On-Prem AFA; series: 1 conversion, 6 conversions]

Cluster Sizing

SOFTWARE ASSUMPTIONS
THEN
The hardware is reliable.
Trust the kernel, it is wise.
The MTBF is millions of hours.
Hardware is up until it dies.
NOW
…Is the hardware reliable, really?
DPDK + SR-IOV, SPDK, RoCE…
200K hours is more likely.
EC2 Spot could live for 15 minutes!

AVAILABILITY VS. DURABILITY
%               Downtime Per Year       Probability of Loss
99.999          5 minutes 15 seconds    1 in 100,000
99.9999         31 seconds              1 in 1,000,000
99.99999        3 seconds               1 in 10,000,000
99.999999999                            1 in 100,000,000,000
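The figures in this table follow from simple arithmetic on the percentage. The sketch below is illustrative only: the function names are hypothetical and a 365-day year is assumed. It reproduces both columns from the percentages listed on the slide.

```python
from fractions import Fraction

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year


def downtime_seconds_per_year(availability_percent: str) -> float:
    """Expected downtime per year for an availability given in percent."""
    unavailable = 1 - Fraction(availability_percent) / 100
    return float(unavailable * SECONDS_PER_YEAR)


def loss_odds(durability_percent: str) -> int:
    """Return N such that the probability of loss is 1 in N."""
    loss = 1 - Fraction(durability_percent) / 100
    return int(1 / loss)


# Percentages taken from the slide; passed as strings so Fraction stays exact.
for pct in ["99.999", "99.9999", "99.99999", "99.999999999"]:
    secs = downtime_seconds_per_year(pct)
    print(f"{pct}%: {secs:.2f} s downtime/year, loss odds 1 in {loss_odds(pct):,}")
```

Exact fractions are used instead of floats so that a value such as 99.999999999 does not pick up rounding error before the subtraction.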

MTBF WITH SINGLE NODE
1 x MTBF 200K hours
Failure every 22 years: 1/200,000 per hour
8.5 hour repair
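As a rough cross-check of the single-node numbers, the sketch below converts the MTBF to years and estimates availability. It assumes the common steady-state formula availability = MTBF / (MTBF + repair time), which the slide itself does not state, and a 365-day year.

```python
MTBF_HOURS = 200_000       # from the slide
REPAIR_HOURS = 8.5         # from the slide
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year

# Mean time between failures expressed in years (the slide rounds down to 22).
years_between_failures = MTBF_HOURS / HOURS_PER_YEAR

# Assumed steady-state availability: fraction of time the node is up.
availability = MTBF_HOURS / (MTBF_HOURS + REPAIR_HOURS)
downtime_minutes_per_year = (1 - availability) * HOURS_PER_YEAR * 60

print(f"failure every {years_between_failures:.1f} years")
print(f"availability {availability:.6%}")
print(f"about {downtime_minutes_per_year:.0f} minutes of downtime per year")
```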

MTBF WITH CLUSTER
12 x MTBF 200K hours
Failure every 1.9 years: 12/200,000
2nd failure probability: (11 x 8.5)/200,000
2nd failure frequency: 1.9 x 2,134
2nd failure every 4,060 years: 1 out of 2,134 repairs
5.4 hour 2nd repair
3rd failure probability: (10 x 5.4)/200,000
3rd failure frequency: 4,060 x 3,688
3rd failure every 14,977,000 years: 1 out of 3,688 2nd failures
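The chain of calculations on this slide can be sketched as a short function. This is an illustration under stated assumptions, not the presenter's tool: the function name is hypothetical, a 365-day year is assumed, and the inputs are the rounded values printed on the slide (8.5 and 5.4 hours). With those inputs the results come out near 4,070 and 15.1 million years, within about 1% of the slide's 4,060 and 14,977,000, presumably because the slide used unrounded repair times.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year


def failure_cascade(nodes, mtbf_hours, repair_windows_hours):
    """Estimate how often the 1st, 2nd, 3rd... overlapping failures occur.

    Returns a list of (years_between_events, one_in_n) tuples, where
    one_in_n is None for the first failure.
    """
    # First failure: any one of `nodes` nodes can fail.
    years = mtbf_hours / nodes / HOURS_PER_YEAR
    results = [(years, None)]
    survivors = nodes - 1
    for window in repair_windows_hours:
        # Chance that any surviving node fails before the repair completes.
        probability = survivors * window / mtbf_hours
        years /= probability
        results.append((years, 1 / probability))
        survivors -= 1
    return results


results = failure_cascade(nodes=12, mtbf_hours=200_000,
                          repair_windows_hours=[8.5, 5.4])
for ordinal, (years, one_in_n) in enumerate(results, start=1):
    odds = "" if one_in_n is None else f" (1 in {one_in_n:,.0f} of the previous events)"
    print(f"failure #{ordinal}: every {years:,.1f} years{odds}")
```

Passing `nodes=1` with an empty repair list gives the single-node case of roughly one failure every 22.8 years.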

FINDING METADATA LIMITS
srun -n $i -N 128 mdtest -i 5 -b 3 -z 3 -I 10 -w 1024 -y -d $PFS/testdir
$i = number of compute processes
$PFS = HPC storage mountpoint
[Chart: File creates/process/second (32 nodes); x-axis: Number of client processes (1 to 2048); y-axis: Creates/second (0 to 25000); series: Lustre (single MDS), WekaIO v3.1]
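The chart's x-axis sweeps the process count in powers of two, so the command has to be run once per value of $i. A minimal sketch of generating that sweep follows. The helper name and the mount point /mnt/pfs are hypothetical; the mdtest flags are copied from the slide, and the comments give the usual meaning of each flag as I understand it. The script only prints the commands and leaves it to the reader to submit them.

```python
import shlex


def mdtest_command(processes, pfs, nodes=128):
    """Build the slide's mdtest invocation for one process count."""
    return [
        "srun", "-n", str(processes), "-N", str(nodes),
        "mdtest",
        "-i", "5",      # iterations per test
        "-b", "3",      # branching factor of the directory tree
        "-z", "3",      # depth of the directory tree
        "-I", "10",     # items per directory
        "-w", "1024",   # bytes written to each file after creation
        "-y",           # sync each file after writing
        "-d", f"{pfs}/testdir",
    ]


# 1, 2, 4, ... 2048, matching the x-axis of the chart.
PROCESS_COUNTS = [2 ** k for k in range(12)]

PFS = "/mnt/pfs"  # hypothetical mount point; substitute the real $PFS

for count in PROCESS_COUNTS:
    print(shlex.join(mdtest_command(count, PFS)))
```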

FINDING DATA PATH LIMITS
srun -n $i -N 128 ior -a POSIX -o $PFS/iortest -z -w -F -b 1g -t 1m -i 8
$i = number of compute processes
$PFS = HPC storage mountpoint
[Chart: File-per-process throughput (32 nodes); x-axis: Number of client processes (1 to 2048); y-axis: Throughput (MB/sec) (0 to 20000); series: Lustre (single MDS), WekaIO v3.1]

FINDING DATA PATH LIMITS (DIRECT I/O)
Buffered I/O:
srun -n $i -N 128 ior -a POSIX -o $PFS/iortest -z -w -F -b 1g -t 1m -i 8
vs. direct I/O (-B added):
srun -n $i -N 128 ior -a POSIX -o $PFS/iortest -z -w -F -B -b 1g -t 1m -i 8
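The two commands differ only in the -B flag, which asks IOR to open files with O_DIRECT and so bypass the client page cache. A small sketch of generating both variants for a given process count is shown below. The helper name and the mount point /mnt/pfs are hypothetical; the flags are those on the slide, with comments reflecting their usual meaning.

```python
import shlex


def ior_command(processes, pfs, direct_io=False, nodes=128):
    """Build the slide's IOR write test, optionally with direct I/O."""
    cmd = [
        "srun", "-n", str(processes), "-N", str(nodes),
        "ior",
        "-a", "POSIX",              # POSIX I/O API
        "-o", f"{pfs}/iortest",     # test file path
        "-z",                       # random rather than sequential offsets
        "-w",                       # write test only
        "-F",                       # one file per process
    ]
    if direct_io:
        cmd.append("-B")            # O_DIRECT: bypass the page cache
    cmd += ["-b", "1g", "-t", "1m", "-i", "8"]  # block size, transfer size, repetitions
    return cmd


PFS = "/mnt/pfs"  # hypothetical mount point; substitute the real $PFS

buffered = ior_command(64, PFS)
direct = ior_command(64, PFS, direct_io=True)
print(shlex.join(buffered))
print(shlex.join(direct))
```

Running both variants at the same process count separates what the storage back end sustains from what client-side caching contributes.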

Demo

TECHNOLOGY SUMMARY
• Why RPO and RTO matter for HPC in the Cloud
• Lustre
  • Supports DNE (Distributed Namespace Environment)
  • Still no durability after all these years
  • EBS performance limitations
• WekaIO
  • Distributed Metadata
  • Scalable data plane
  • Durable, plus S3 persistence

EXTERNAL RESOURCES
AWS Competency Program
• https://aws.amazon.com/partners/competencies
AWS Quick Start
• https://aws.amazon.com/quickstart
AWS Marketplace
• https://aws.amazon.com/marketplace

THANK YOU!

GPSTEC324_STORAGE FOR HPC IN THE CLOUD

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT: STORAGE FOR HPC IN THE CLOUD. Isaiah Weiner, Sr. Mgr. Solutions Architecture. GPSTEC324
  • 2. HPC IS COMPLEX: I/O, Cost, TTR. National Labs; Research, Energy & Utilities; Genomics; Analytics, AI/ML; EDA; M&E; Finance
  • 3. GROWTH IN CLOUD: CLOUD MARKET FORECAST (series listed in legend order). Source: IDC Worldwide Quarterly Cloud IT Infrastructure Tracker
       Year            2015   2016   2017   2018   2019   2020
       On-Prem         70%    65%    61%    59%    58%    55%
       Public Cloud    10%    15%    26%    26%    26%    28%
       Private Cloud   10%    12%    13%    15%    16%    17%
  • 4. GROWTH IN STORAGE [Chart: Exabytes (0 to 80,000) by year, 2016 to 2020; series: Enterprise, HPC]. Source: Gartner for Enterprise and IDC for High Performance
  • 5. DIY – NFS [Diagram: three NFS Servers, each with two Volumes; three groups of NFS Clients]
  • 6. AMAZON EFS ARCHITECTURE [Diagram: three groups of Clients; three Mount Targets; Single Namespace]
  • 7. HPC CLUSTER NODE ANATOMY [Diagram labels: Data, Metadata, Tiering, Backend, Routing, Monitoring, VIPs, Clustering, Storage, Access Protocols, Frontend, Network]
  • 8. SEMICONDUCTOR CUSTOMER. Challenge: Local SSD performance with centralized management. Local SSDs required data to be copied around, and multiple copies took up space. Solution delivers: Scalable, sharable, simplified; one copy of the data, on par with local SSD for performance. [Chart: Elapsed Time (Lower is Better); bars: Local SSD, WekaIO, NFSv4]
  • 9. GENOMICS CUSTOMER. Challenge: Small file workload with some large files in the mix. Pre-solution workaround: more jobs! All the jobs! Solution delivers: Scalable, sharable, simplified; one copy of the data, on par with local SSD for performance. [Chart: Elapsed Time (Lower is Better); bars: WekaIO, On-Prem AFA; series: 1 conversion, 6 conversions]
  • 10. Cluster Sizing
  • 11. SOFTWARE ASSUMPTIONS
       THEN: The hardware is reliable. Trust the kernel, it is wise. The MTBF is millions of hours. Hardware is up until it dies.
       NOW: …Is the hardware reliable, really? DPDK + SR-IOV, SPDK, RoCE… 200K hours is more likely. EC2 Spot could live for 15 minutes!
  • 12. AVAILABILITY VS. DURABILITY
       %               Downtime Per Year       Probability of Loss
       99.999          5 minutes 15 seconds    1 in 100,000
       99.9999         31 seconds              1 in 1,000,000
       99.99999        3 seconds               1 in 10,000,000
       99.999999999                            1 in 100,000,000,000
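The table's two right-hand columns can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming a 365-day year and reading the percentage once as availability (for downtime) and once as durability (for loss odds); the function names are illustrative and not from the talk. `Decimal` is used so that values such as eleven nines are not distorted by floating-point rounding.

```python
from decimal import Decimal

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def downtime_seconds_per_year(availability_percent: str) -> Decimal:
    """Expected unavailable seconds per year for an availability such as "99.999"."""
    unavailable_fraction = 1 - Decimal(availability_percent) / 100
    return unavailable_fraction * SECONDS_PER_YEAR


def loss_odds(durability_percent: str) -> Decimal:
    """Return N such that the probability of loss is 1 in N."""
    return 1 / (1 - Decimal(durability_percent) / 100)


if __name__ == "__main__":
    for percent in ("99.999", "99.9999", "99.99999", "99.999999999"):
        minutes, seconds = divmod(downtime_seconds_per_year(percent), 60)
        print(f"{percent}%: {int(minutes)} min {float(seconds):.2f} s down per year, "
              f"loss odds 1 in {int(loss_odds(percent)):,}")
```

The slide appears to truncate the downtime figures to whole seconds, which is why this sketch yields slightly longer values than the table shows.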
  • 13. MTBF WITH SINGLE NODE: 1 x MTBF 200K hours, 8.5 hour repair. Failure every 22 years: 1/200,000
  • 14. MTBF WITH CLUSTER: 12 x MTBF 200K hours
       Failure every 1.9 years: 12/200,000
       2nd failure probability: (11 x 8.5)/200,000, i.e. 1 out of 2,134 repairs
       2nd failure frequency: 1.9 x 2,134, i.e. a 2nd failure every 4,060 years
       5.4 hour 2nd repair
       3rd failure probability: (10 x 5.4)/200,000, i.e. 1 out of 3,688 2nd failures
       3rd failure frequency: 4,060 x 3,688, i.e. a 3rd failure every 14,977,000 years
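The chain of figures on this slide follows one pattern: the chance of a further failure during a repair is (surviving nodes x repair hours) / MTBF, and the interval between such events is the previous interval divided by that chance. Below is a rough sketch of that arithmetic, assuming 8,760 hours per year and independent node failures; the function and key names are hypothetical. Because nothing is rounded along the way, the results land near the slide's figures rather than exactly on them.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year


def years_between_failures(mtbf_hours: float, nodes: int) -> float:
    """Mean years between failures across a group of identical nodes."""
    return mtbf_hours / nodes / HOURS_PER_YEAR


def overlap_probability(surviving_nodes: int, repair_hours: float, mtbf_hours: float) -> float:
    """Chance that any surviving node fails while a repair is still running."""
    return surviving_nodes * repair_hours / mtbf_hours


def cluster_failure_chain(mtbf_hours: float = 200_000, nodes: int = 12,
                          first_repair_hours: float = 8.5,
                          second_repair_hours: float = 5.4) -> dict:
    """Reproduce the slide's 1st, 2nd, and 3rd failure estimates."""
    first = years_between_failures(mtbf_hours, nodes)
    p_second = overlap_probability(nodes - 1, first_repair_hours, mtbf_hours)
    second = first / p_second
    p_third = overlap_probability(nodes - 2, second_repair_hours, mtbf_hours)
    third = second / p_third
    return {
        "years_to_first_failure": first,
        "repairs_per_second_failure": 1 / p_second,
        "years_to_second_failure": second,
        "second_failures_per_third_failure": 1 / p_third,
        "years_to_third_failure": third,
    }


if __name__ == "__main__":
    for name, value in cluster_failure_chain().items():
        print(f"{name}: {value:,.1f}")
```

Calling `years_between_failures(200_000, 1)` gives the single-node case from the previous slide, roughly one failure every 22 years.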
  • 15. FINDING METADATA LIMITS
       srun -n $i -N 128 mdtest -i 5 -b 3 -z 3 -I 10 -w 1024 -y -d $PFS/testdir
       $i = number of compute processes; $PFS = HPC storage mountpoint
       [Chart: File creates/process/second (32 nodes); x-axis: Number of client processes (1 to 2048); y-axis: Creates/second (0 to 25,000); series: Lustre (single MDS), WekaIO v3.1]
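The chart's x-axis doubles the process count from 1 to 2048, so the sweep amounts to running the same mdtest command once per value of `$i`. Here is a small sketch that only builds those command lines, with the flags copied verbatim from the slide; the mount point `/mnt/pfs` and the helper name are placeholders, and nothing is executed.

```python
import shlex

# Process counts on the chart's x-axis: 1, 2, 4, ... 2048.
PROCESS_COUNTS = [2 ** k for k in range(12)]


def mdtest_command(processes: int, pfs: str) -> str:
    """Build the slide's mdtest invocation for one value of $i."""
    args = [
        "srun", "-n", str(processes), "-N", "128",
        "mdtest", "-i", "5", "-b", "3", "-z", "3", "-I", "10",
        "-w", "1024", "-y", "-d", f"{pfs}/testdir",
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # "/mnt/pfs" is a hypothetical stand-in for $PFS.
    for count in PROCESS_COUNTS:
        print(mdtest_command(count, "/mnt/pfs"))
```

Printing the commands rather than running them keeps the sketch safe to try on a machine without Slurm; the output can be piped into a job script once the mount point is adjusted.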
  • 16. FINDING DATA PATH LIMITS
       srun -n $i -N 128 ior -a POSIX -o $PFS/iortest -z -w -F -b 1g -t 1m -i 8
       $i = number of compute processes; $PFS = HPC storage mountpoint
       [Chart: File-per-process throughput (32 nodes); x-axis: Number of client processes (1 to 2048); y-axis: Throughput (MB/sec) (0 to 20,000); series: Lustre (single MDS), WekaIO v3.1]
  • 17. FINDING DATA PATH LIMITS (DIRECT I/O)
       srun -n $i -N 128 ior -a POSIX -o $PFS/iortest -z -w -F -b 1g -t 1m -i 8
       vs.
       srun -n $i -N 128 ior -a POSIX -o $PFS/iortest -z -w -F -B -b 1g -t 1m -i 8
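The two ior commands on this slide differ only in the `-B` flag, which the slide title associates with direct I/O. As an illustrative sketch (helper names and the `/mnt/pfs` mount point are assumptions), the pair can be generated from a single template so that the buffered and direct runs stay otherwise identical:

```python
import shlex


def ior_command(processes: int, pfs: str, direct_io: bool = False) -> str:
    """Build the slide's ior invocation, optionally adding -B for the direct I/O run."""
    args = [
        "srun", "-n", str(processes), "-N", "128",
        "ior", "-a", "POSIX", "-o", f"{pfs}/iortest", "-z", "-w", "-F",
    ]
    if direct_io:
        args.append("-B")  # the only difference between the two slide commands
    args += ["-b", "1g", "-t", "1m", "-i", "8"]
    return shlex.join(args)


def ior_pair(processes: int, pfs: str) -> tuple:
    """Return (buffered, direct) command lines for the same process count."""
    return ior_command(processes, pfs), ior_command(processes, pfs, direct_io=True)


if __name__ == "__main__":
    # "/mnt/pfs" is a hypothetical stand-in for $PFS.
    buffered, direct = ior_pair(64, "/mnt/pfs")
    print(buffered)
    print(direct)
```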
  • 18. Demo
  • 19. TECHNOLOGY SUMMARY
       • Why RPO and RTO matter for HPC in the Cloud
       • Lustre: supports DNE (Distributed Namespace Environment); still no durability after all these years; EBS performance limitations
       • WekaIO: distributed metadata; scalable data plane; durable, plus S3 persistence
  • 20. EXTERNAL RESOURCES
       • AWS Competency Program: https://aws.amazon.com/partners/competencies
       • AWS Quick Start: https://aws.amazon.com/quickstart
       • AWS Marketplace: https://aws.amazon.com/marketplace
  • 21. THANK YOU!