In this deck from the Rice Oil & Gas Conference, Steve Scott from HPE presents: The Cray Shasta Architecture - Designed for the Exascale Era.
"With the announcement of multiple exascale systems, we’re now entering the Exascale Era, marked by several important trends. CMOS is nearing the end of its roadmap, leading to hotter and more diverse processors as architects chase performance through specialization. Organizations are dealing with ever larger volumes of data, stressing storage systems and interconnects, and are increasingly augmenting their simulation and modeling with analytics and AI to gain insight from this data. And users and administrators are demanding flexible, cloud-like software environments that let them flexibly manage their systems, and develop and run code anywhere. While these issues are most acute in extreme scale HPC systems, they are becoming increasingly relevant across the broader enterprise. This talk provides an overview of the Cray Shasta system architecture, which was motivated by these trends, and designed for this new heterogeneous, data-driven world."
Watch the video: https://wp.me/p3RLHQ-lDt
Learn more: https://www.cray.com/products/computing/shasta
and
https://rice2020oghpc.rice.edu/program-2/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The Cray Shasta Architecture - Designed for the Exascale Era
1. THE CRAY SHASTA ARCHITECTURE:
DESIGNED FOR THE EXASCALE ERA
Steve Scott
SVP, Senior Fellow, and CTO for HPC & AI
March 3, 2020
All three exascale systems announced worldwide are based on Cray Shasta
THE EXASCALE ERA IS UPON US
3. It’s not just a new machine,
IT’S A NEW ERA
4. MAJOR TRENDS MOTIVATING THE SHASTA ARCHITECTURE
• Hot and Heterogeneous processors
• Data-Intensive Computing: Simulation & Modeling, Big Data & Analytics, and Artificial Intelligence
5. Platform for the Exascale Era
HPE-CRAY SHASTA
• Workloads: HPC, Analytics, and AI
• Dynamic, cloud-like environment for hybrid workflows
• Wide diversity of processors
• Flexible, efficient, & extensible hardware infrastructure
• High-performance, tiered, integrated storage
• Slingshot HPC Ethernet interconnect
• Connectivity to cloud, IoT, and data management
6. HPE ACQUISITION OF CRAY
• Closed September 25, 2019
• Organizations fully integrated within one month (we are one team)
• Product roadmaps reconciled and integrated before SC’19 in November 2019
• Fully merged January 1, 2020 (Cray subsidiary dissolved; brand continues)
7. CRAY TECHNOLOGIES WITHIN HPE
"The reason we bought Cray is they have the foundational technology in the interconnect fabric and the software stack to manage these data-intensive workloads. That ultimately manifests itself in some sort of HPC cluster and in the future an Exascale supercomputer. You should expect us to take those technologies which are designed for scale, speed and latency into the commercial space."
Antonio Neri, HPE CEO, Best of Breed Conference 2019 (CRN, Oct 16, 2019): https://www.crn.com/slide-shows/data-center/antonio-neri-outposts-is-aws-bid-to-lock-data-in-public-cloud/1
8. SHASTA FLEXIBLE COMPUTE INFRASTRUCTURE
"Olympus": dense, scale-optimized cabinet
• Direct warm water cooling
• Supports high-powered processors with high density
• Flexible, high-density interconnect
"Apollo": standard 19" rack
• Air cooled, with liquid cooling options
• Wide range of available compute and storage
Same interconnect - same software environment
9. Architected for maximum performance, density, efficiency, and scale
SHASTA OLYMPUS INFRASTRUCTURE
• Up to 64 compute blades, and 512 GPUs + 128 CPUs, per cabinet (see the density check below)
• Flexible bladed architecture supports multiple generations of CPUs, GPUs, and interconnect
• 100% direct liquid cooling enables 300 kW capability per cabinet (later up to 400 kW)
• Up to 64 Slingshot switches per cabinet
• Scales from one to hundreds of cabinets
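A quick consistency check of those density figures (a minimal sketch; the 4-GPU + 1-CPU node split is an inference from the blade configurations on the next slide, not stated here):

```python
# Per-cabinet density check for the Olympus figures above.
blades_per_cabinet = 64
gpus_per_cabinet = 512
cpus_per_cabinet = 128

gpus_per_blade = gpus_per_cabinet // blades_per_cabinet   # 8 GPUs per blade
cpus_per_blade = cpus_per_cabinet // blades_per_cabinet   # 2 CPUs per blade

# With 2 nodes per GPU blade (next slide), that works out to 4 GPUs + 1 CPU per node.
print(gpus_per_blade, cpus_per_blade)
```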
10. [Blade diagrams: example node configurations, each with DDR memory channels and connections to Slingshot: AMD EPYC (NERSC), Nvidia GPU (NERSC), Intel Xe (ANL), and AMD GPU (ORNL); 2 nodes per blade for the GPU blades, 4 nodes per blade for the CPU-only blade]
11. SLINGSHOT OVERVIEW
Slingshot is Cray's 8th-generation scalable interconnect. Earlier, Cray pioneered:
• Adaptive routing
• High-radix switch design
• Dragonfly topology
Key attributes:
• 64 ports x 200 Gbps per switch; over 250K endpoints with a diameter of just three hops (see the scaling sketch below)
• Ethernet compliant: easy connectivity to datacenters and third-party storage; "HPC inside"
• World-class adaptive routing and QoS: high utilization at scale; flawless support for hybrid workloads
• Efficient congestion control: performance isolation between workloads
• Low, uniform latency: focus on tail latency, because real apps synchronize
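The "over 250K endpoints with a diameter of just three hops" figure is consistent with a maximally scaled dragonfly built from 64-port switches. Here is a minimal sketch of that arithmetic, assuming a textbook-balanced dragonfly port split (the split is an assumption for illustration, not necessarily Slingshot's actual configuration):

```python
# Dragonfly scaling estimate for a radix-64 switch (illustrative, assumed port split).
# Balanced split for radix k: p = k/4 endpoint ports, h = k/4 global ports,
# and a = k/2 switches per group.
k = 64                    # switch radix: 64 ports x 200 Gbps (per the slide)
p = k // 4                # endpoints attached to each switch
h = k // 4                # global links per switch
a = k // 2                # switches per group

groups = a * h + 1        # one global link between every pair of groups
endpoints = p * a * groups

print(f"{groups} groups, {endpoints:,} endpoints")   # 513 groups, 262,656 endpoints
# Worst-case route: local hop, global hop, local hop -> three switch-to-switch hops.
```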
12. SLINGSHOT PACKAGING
• Standard packaging: PCIe NIC and top-of-rack (TOR) switch; rack mounted, for Apollo and 3rd-party servers
• Custom packaging: dense, liquid-cooled switches with NIC mezzanine cards and cabling
13. SLINGSHOT IS RUNNING AT SCALE AND ACHIEVING HIGH EFFICIENCY
[Chart: Global Link Load – All-to-All Communication; bandwidth (GB/sec) per global link number]
"Shandy" in-house system:
• 8 groups
• 1024 nodes
• Dual CX5 injection per node
• 25 TB/s aggregate injection BW
• 50% global bandwidth taper
• 12.5 TB/s aggregate global BW (see the arithmetic check below)
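Those aggregate numbers follow from simple arithmetic. A minimal check, assuming each CX5 NIC injects at 100 Gb/s (12.5 GB/s); the per-NIC rate is an assumption, not stated on the slide:

```python
# Back-of-the-envelope check of the "Shandy" bandwidth figures above.
nodes = 1024
nics_per_node = 2          # dual CX5 injection per node
nic_gb_per_s = 12.5        # assumed 100 Gb/s = 12.5 GB/s per NIC
taper = 0.5                # 50% global bandwidth taper

injection_tb_s = nodes * nics_per_node * nic_gb_per_s / 1000
global_tb_s = injection_tb_s * taper

print(f"injection ~ {injection_tb_s:.1f} TB/s")   # ~25.6 TB/s, matching the ~25 TB/s on the slide
print(f"global    ~ {global_tb_s:.1f} TB/s")      # ~12.8 TB/s, matching the ~12.5 TB/s on the slide
```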
14. SLINGSHOT CONGESTION MANAGEMENT
• Hardware automatically tracks all outstanding packets
• Knows what is flowing between every pair of endpoints
• Quickly identifies and controls causes of congestion
• Pushes back on the sources… just enough (see the illustrative sketch below)
• Frees up buffer space for everyone else
• Other traffic is not affected and can pass stalled traffic
• Avoids head-of-line (HOL) blocking across the entire fabric
• Fundamentally different from traditional ECN-based congestion control
• Fast and stable across a wide variety of traffic patterns
• Suitable for dynamic HPC traffic
• Performance isolation between apps in the same QoS class
• Applications are much less vulnerable to other traffic on the network
• Predictable runtimes
• Lower mean and tail latency – a big benefit in apps with global synchronization
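The bullets above describe the mechanism only at a high level. As a purely illustrative sketch of the idea of per-endpoint-pair tracking with targeted pushback (the data structures, threshold, and function names here are assumptions for illustration, not Slingshot's actual hardware algorithm):

```python
# Toy illustration of per-source pushback congestion control
# (hypothetical sketch only; NOT Slingshot's actual algorithm or data structures).
# Idea from the slide: track what is in flight between every pair of endpoints,
# and when a destination becomes congested, push back only on the sources
# feeding it, so unrelated traffic keeps its buffers and is not HOL-blocked.
from collections import defaultdict

OUTSTANDING_LIMIT = 64 * 1024            # per-destination in-flight limit (arbitrary)

in_flight = defaultdict(int)             # (src, dst) -> bytes outstanding
per_dst_total = defaultdict(int)         # dst -> total bytes outstanding

def on_packet_sent(src, dst, nbytes):
    in_flight[(src, dst)] += nbytes
    per_dst_total[dst] += nbytes

def on_ack(src, dst, nbytes):
    in_flight[(src, dst)] -= nbytes
    per_dst_total[dst] -= nbytes

def may_inject(src, dst):
    # Throttle only sources targeting a congested destination;
    # flows to other destinations are unaffected.
    return per_dst_total[dst] < OUTSTANDING_LIMIT
```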
15. CONGESTION MANAGEMENT PROVIDES PERFORMANCE ISOLATION
[Charts: average egress bandwidth per endpoint (Gb/s) vs. simulation time (µs) over a 2 ms window, for all-to-all, global sync, and many-to-one traffic, with and without Slingshot congestion management]
Job interference in today's networks: congesting (green) traffic hurts well-behaved (blue) traffic, and really hurts latency-sensitive, synchronized (red) traffic. With Slingshot's advanced congestion management, the well-behaved traffic runs at essentially 100% of peak.
16. (Global Performance and Congestion Network Tests)
NEW BENCHMARK: GPCNET
• Developed in collaboration with NERSC and ANL
• Publicly available at https://github.com/netbench/GPCNET
• Goals:
• Proxy real-world communication patterns
• Measure network performance under load
• Look at both mean and tail latency
• Look at interference between workloads (how well does the network perform congestion management?)
• Highly configurable to explore workloads of interest
• Benchmark outputs a rich set of metrics, including absolute and relative performance
17. CONGESTION IMPACT IN REAL SYSTEMS
Congestion Impact: CI = Latency(congested) / Latency(baseline), reported for both the average and the 99% tail (see the computation sketch below).
[Chart: Random Ring Latency Congestion Impact by System, log scale, with a line marking no congestion impact]

System    Nodes    Processes per network port    Network          Global network BW
Crystal   696      20                            Aries            100%
Theta     4,096    16                            Aries            50%
Edison    5,575    24                            Aries            50%
Osprey    128      20                            EDR IB           100%
Sierra    4,200    20                            EDR IB           50%
Summit    4,500    21                            EDR IB           100%
Malbec    485      20                            Slingshot (SS10) 50%

• Impact worsens with scale and taper
• InfiniBand does somewhat better than Aries
• Slingshot does really well
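The Congestion Impact ratio above is straightforward to compute from latency samples taken with and without background congestors. A minimal sketch (the sample values are placeholders, not measurements from the slide):

```python
# Congestion Impact: CI = latency_congested / latency_baseline,
# reported for both the mean and the 99th-percentile ("tail") latency.
# Sample values below are placeholders, not data from the GPCNET runs on the slide.
import statistics

def congestion_impact(baseline_us, congested_us):
    mean_ci = statistics.mean(congested_us) / statistics.mean(baseline_us)
    p99 = lambda xs: statistics.quantiles(xs, n=100)[98]   # 99th percentile
    tail_ci = p99(congested_us) / p99(baseline_us)
    return mean_ci, tail_ci

baseline  = [2.1, 2.3, 2.2, 2.4, 2.2] * 20   # random-ring latencies (us), quiet network
congested = [4.8, 5.5, 9.7, 4.9, 40.0] * 20  # same test run alongside congestor workloads
mean_ci, tail_ci = congestion_impact(baseline, congested)
print(f"average CI = {mean_ci:.1f}, 99% tail CI = {tail_ci:.1f}")
```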
18. SHASTA PULLS STORAGE ONTO SLINGSHOT NETWORK
[Diagram: In the traditional model, compute nodes on the high-speed network reach tiered flash and HDD OSS servers through LNET routers and a separate storage area network. In Shasta, ClusterStor E1000 OSS & MDS servers (SSD, ~80 GB/s) and OSS servers (HDD, ~30 GB/s) attach directly to the Slingshot high-speed network.]
Benefits:
• Lower cost
• Lower complexity
• Lower latency
• Improved small I/O performance
19. CLUSTERSTOR E1000 FLEXIBILITY
• Extreme Performance (Flash): 80/60 GB/s SSD read/write, 55.3 TB usable SSD capacity (3.2 TB drives), 6 x 200 Gbps network ports, 2 rack units; 15x faster than 2 x L300N (10 RU)
• Hybrid Flexibility: 80/60 GB/s SSD read/write, 55.3 TB usable SSD capacity, 15 GB/s HDD, 1.07 PB usable HDD capacity (14 TB drives), 4 x 200 Gbps network ports, 6 rack units; 15x (flash) / 0.7x (HDD) vs. 2 x L300N
• HDD Performance: 30 GB/s HDD, 2.14 PB usable HDD capacity, 2 x 200 Gbps network ports, 10 rack units; 50% faster than 2 x L300N
• HDD Capacity: 30 GB/s HDD, 4.27 PB usable HDD capacity, 2 x 200 Gbps network ports, 18 rack units; 50% faster than 2 x L300N
Per rack: up to 120 GB/sec and up to 10 PB usable capacity, or up to 1,600 GB/sec (read) and up to 4.2 PB usable capacity
20. SHASTA: A MORE OPEN, CUSTOMIZABLE STACK
Cray XC stack: sleek… scalable… monolithic…
Cray Shasta system management stack:
• Open, documented, RESTful APIs
• Ability to substitute different components
• Buildable source
[Diagram: layered management stack from Shasta HW (storage, compute, networks) up through hardware support services, infrastructure support services, and platform support services to the consumer]
21. SHASTA SOFTWARE PLATFORM ARCHITECTURE
[Diagram: containerized software stack exposed to administrators and developers through open APIs]
• Cray Linux Environment: Linux + HPC extensions, HPC batch job management + orchestration (Kubernetes), network and I/O abstractions, parallel performance libraries
• Cray Programming Environment: developer environment and runtimes, alongside administrator system services and developer services
• Shasta Management Services and Shasta Monitoring Framework: delivered as containerized services behind open APIs
• Cray Urika AI/Analytics Suite: Urika manager, analytics microservices, and analytics libraries & frameworks on Linux with Kubernetes orchestration and parallel performance libraries
Expanding the power of supercomputing with the flexibility of cloud and full datacenter interoperability
22. OPTIONS TO MEET A FULL RANGE OF AS-A-SERVICE NEEDS
Strategic partner to manage an end-to-end, hybrid HPC and AI portfolio across deployment and consumption models:
• Consumption-based (GreenLake Flexible Capacity): HPE GreenLake Flexible Capacity
• Cloud architecture (HPC Platform as a Service): ready for BlueData, RedHat OCP, and Singularity; HPCM (and APIs), VMware, and Cray's software environment
• Managed service, off-premises (Managed HPC as a Service): as-a-service offerings (Advania, ScaleMatrix, Markley); data center offerings (Equinix, CyrusOne); SI partners (Accenture, DXC)
• Public cloud ecosystem (HPC as a Service in the Public Cloud): ClusterStor in Azure; Cray in Azure for Manufacturing; Cray in Azure for EDA
23. WE ARE ENTERING THE EXASCALE ERA
• HPC, Enterprise, and hyperscale are converging
• Data-centric, hybrid workflows: AI + analytics + HPC
• Growing complexity, but tremendous opportunity to extract value
• Need new infrastructure for these new workloads
• Shasta provides the infrastructure for the Exascale Era
• Chosen for all three announced exascale systems
• Flexibility (processors, storage, network, software)
• Extensibility (hardware and software)
• Scalability (Up to exascale, down to a single 19” rack)
• Standards-based (interoperable and open)
• Cloud-like software stack for dynamic, heterogeneous workloads