In this deck from the Rice Oil & Gas Conference, Steve Scott from HPE presents: The Cray Shasta Architecture - Designed for the Exascale Era.
"With the announcement of multiple exascale systems, we’re now entering the Exascale Era, marked by several important trends. CMOS is nearing the end of its roadmap, leading to hotter and more diverse processors as architects chase performance through specialization. Organizations are dealing with ever larger volumes of data, stressing storage systems and interconnects, and are increasingly augmenting their simulation and modeling with analytics and AI to gain insight from this data. And users and administrators are demanding flexible, cloud-like software environments that let them flexibly manage their systems, and develop and run code anywhere. While these issues are most acute in extreme scale HPC systems, they are becoming increasingly relevant across the broader enterprise. This talk provides an overview of the Cray Shasta system architecture, which was motivated by these trends, and designed for this new heterogeneous, data-driven world."
Watch the video: https://wp.me/p3RLHQ-lDt
Learn more: https://www.cray.com/products/computing/shasta
and
https://rice2020oghpc.rice.edu/program-2/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The Cray Shasta Architecture - Designed for the Exascale Era
1. THE CRAY SHASTA ARCHITECTURE:
DESIGNED FOR THE EXASCALE ERA
Steve Scott
SVP, Senior Fellow, and CTO for HPC & AI
March 3, 2020
All three exascale systems announced worldwide are based on Cray Shasta
THE EXASCALE ERA IS UPON US
3. It’s not just a new machine,
IT’S A NEW ERA
4. MAJOR TRENDS MOTIVATING THE SHASTA ARCHITECTURE
• Hot and Heterogeneous processors
• Data-Intensive Computing: Simulation & Modeling, Big Data & Analytics, and Artificial Intelligence
5. Platform for the Exascale Era
HPE-CRAY SHASTA
• Workloads: HPC, Analytics, and AI
• Dynamic, cloud-like environment for hybrid workflows
• Wide diversity of processors
• Flexible, efficient, & extensible hardware infrastructure
• High-performance, tiered, integrated storage
• Slingshot HPC Ethernet interconnect
• Connectivity to cloud, IoT, and data management
6. HPE ACQUISITION OF CRAY
• Closed September 25, 2019
• Organizations fully integrated within one month (we are one team)
• Product roadmaps reconciled and integrated before SC’19 in November 2019
• Fully merged January 1, 2020 (Cray subsidiary dissolved; brand continues)
7. CRAY TECHNOLOGIES WITHIN HPE
"The reason we bought Cray is they have the foundational technology in the interconnect fabric and the software stack to manage these data-intensive workloads. That ultimately manifests itself in some sort of HPC cluster and in the future an Exascale supercomputer. You should expect us to take those technologies which are designed for scale, speed and latency into the commercial space."
Antonio Neri, HPE CEO, Best of Breed Conference 2019 (CRN, Oct 16, 2019): https://www.crn.com/slide-shows/data-center/antonio-neri-outposts-is-aws-bid-to-lock-data-in-public-cloud/1
8. SHASTA FLEXIBLE COMPUTE INFRASTRUCTURE
"Olympus": dense, scale-optimized cabinet
• Direct warm water cooling
• Supports high-powered processors with high density
• Flexible, high-density interconnect
"Apollo": standard 19" rack
• Air cooled, with liquid cooling options
• Wide range of available compute and storage
Same interconnect - same software environment
9. Architected for maximum performance, density, efficiency, and scale
SHASTA OLYMPUS INFRASTRUCTURE
• Up to 64 compute blades, and 512 GPUs + 128 CPUs, per cabinet (see the density check below)
• Flexible bladed architecture supports multiple generations of CPUs, GPUs, and interconnect
• 100% direct liquid cooling enables 300 kW capability per cabinet (later up to 400 kW)
• Up to 64 Slingshot switches per cabinet
• Scales from one to hundreds of cabinets
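A quick consistency check of those density figures (a minimal sketch; the 4-GPU + 1-CPU node split is an inference from the blade configurations on the next slide, not stated here):

```python
# Per-cabinet density check for the Olympus figures above.
blades_per_cabinet = 64
gpus_per_cabinet = 512
cpus_per_cabinet = 128

gpus_per_blade = gpus_per_cabinet // blades_per_cabinet   # 8 GPUs per blade
cpus_per_blade = cpus_per_cabinet // blades_per_cabinet   # 2 CPUs per blade

# With 2 nodes per GPU blade (next slide), that works out to 4 GPUs + 1 CPU per node.
print(gpus_per_blade, cpus_per_blade)
```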
10. [Blade diagrams: example node configurations, each with DDR memory channels and connections to Slingshot: AMD EPYC (NERSC), Nvidia GPU (NERSC), Intel Xe (ANL), and AMD GPU (ORNL); 2 nodes per blade for the GPU blades, 4 nodes per blade for the CPU-only blade]
11. SLINGSHOT OVERVIEW
Slingshot is Cray's 8th-generation scalable interconnect. Earlier, Cray pioneered:
• Adaptive routing
• High-radix switch design
• Dragonfly topology
Key attributes:
• 64 ports x 200 Gbps per switch; over 250K endpoints with a diameter of just three hops (see the scaling sketch below)
• Ethernet compliant: easy connectivity to datacenters and third-party storage; "HPC inside"
• World-class adaptive routing and QoS: high utilization at scale; flawless support for hybrid workloads
• Efficient congestion control: performance isolation between workloads
• Low, uniform latency: focus on tail latency, because real apps synchronize
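The "over 250K endpoints with a diameter of just three hops" figure is consistent with a maximally scaled dragonfly built from 64-port switches. Here is a minimal sketch of that arithmetic, assuming a textbook-balanced dragonfly port split (the split is an assumption for illustration, not necessarily Slingshot's actual configuration):

```python
# Dragonfly scaling estimate for a radix-64 switch (illustrative, assumed port split).
# Balanced split for radix k: p = k/4 endpoint ports, h = k/4 global ports,
# and a = k/2 switches per group.
k = 64                    # switch radix: 64 ports x 200 Gbps (per the slide)
p = k // 4                # endpoints attached to each switch
h = k // 4                # global links per switch
a = k // 2                # switches per group

groups = a * h + 1        # one global link between every pair of groups
endpoints = p * a * groups

print(f"{groups} groups, {endpoints:,} endpoints")   # 513 groups, 262,656 endpoints
# Worst-case route: local hop, global hop, local hop -> three switch-to-switch hops.
```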
12. SLINGSHOT PACKAGING
• Standard packaging: PCIe NIC and top-of-rack (TOR) switch; rack mounted, for Apollo and 3rd-party servers
• Custom packaging: dense, liquid-cooled switches with NIC mezzanine cards and cabling
13. SLINGSHOT IS RUNNING AT SCALE AND ACHIEVING HIGH EFFICIENCY
[Chart: Global Link Load – All-to-All Communication; bandwidth (GB/sec) per global link number]
"Shandy" in-house system:
• 8 groups
• 1024 nodes
• Dual CX5 injection per node
• 25 TB/s aggregate injection BW
• 50% global bandwidth taper
• 12.5 TB/s aggregate global BW (see the arithmetic check below)
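Those aggregate numbers follow from simple arithmetic. A minimal check, assuming each CX5 NIC injects at 100 Gb/s (12.5 GB/s); the per-NIC rate is an assumption, not stated on the slide:

```python
# Back-of-the-envelope check of the "Shandy" bandwidth figures above.
nodes = 1024
nics_per_node = 2          # dual CX5 injection per node
nic_gb_per_s = 12.5        # assumed 100 Gb/s = 12.5 GB/s per NIC
taper = 0.5                # 50% global bandwidth taper

injection_tb_s = nodes * nics_per_node * nic_gb_per_s / 1000
global_tb_s = injection_tb_s * taper

print(f"injection ~ {injection_tb_s:.1f} TB/s")   # ~25.6 TB/s, matching the ~25 TB/s on the slide
print(f"global    ~ {global_tb_s:.1f} TB/s")      # ~12.8 TB/s, matching the ~12.5 TB/s on the slide
```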
14. SLINGSHOT CONGESTION MANAGEMENT
• Hardware automatically tracks all outstanding packets
• Knows what is flowing between every pair of endpoints
• Quickly identifies and controls causes of congestion
• Pushes back on the sources… just enough (see the illustrative sketch below)
• Frees up buffer space for everyone else
• Other traffic is not affected and can pass stalled traffic
• Avoids head-of-line (HOL) blocking across the entire fabric
• Fundamentally different from traditional ECN-based congestion control
• Fast and stable across a wide variety of traffic patterns
• Suitable for dynamic HPC traffic
• Performance isolation between apps in the same QoS class
• Applications are much less vulnerable to other traffic on the network
• Predictable runtimes
• Lower mean and tail latency – a big benefit in apps with global synchronization
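The bullets above describe the mechanism only at a high level. As a purely illustrative sketch of the idea of per-endpoint-pair tracking with targeted pushback (the data structures, threshold, and function names here are assumptions for illustration, not Slingshot's actual hardware algorithm):

```python
# Toy illustration of per-source pushback congestion control
# (hypothetical sketch only; NOT Slingshot's actual algorithm or data structures).
# Idea from the slide: track what is in flight between every pair of endpoints,
# and when a destination becomes congested, push back only on the sources
# feeding it, so unrelated traffic keeps its buffers and is not HOL-blocked.
from collections import defaultdict

OUTSTANDING_LIMIT = 64 * 1024            # per-destination in-flight limit (arbitrary)

in_flight = defaultdict(int)             # (src, dst) -> bytes outstanding
per_dst_total = defaultdict(int)         # dst -> total bytes outstanding

def on_packet_sent(src, dst, nbytes):
    in_flight[(src, dst)] += nbytes
    per_dst_total[dst] += nbytes

def on_ack(src, dst, nbytes):
    in_flight[(src, dst)] -= nbytes
    per_dst_total[dst] -= nbytes

def may_inject(src, dst):
    # Throttle only sources targeting a congested destination;
    # flows to other destinations are unaffected.
    return per_dst_total[dst] < OUTSTANDING_LIMIT
```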
15. CONGESTION MANAGEMENT PROVIDES PERFORMANCE ISOLATION
[Charts: average egress bandwidth per endpoint (Gb/s) vs. simulation time (µs) over a 2 ms window, for all-to-all, global sync, and many-to-one traffic, with and without Slingshot congestion management]
Job interference in today's networks: congesting (green) traffic hurts well-behaved (blue) traffic, and really hurts latency-sensitive, synchronized (red) traffic. With Slingshot's advanced congestion management, the well-behaved traffic runs at essentially 100% of peak.
16. (Global Performance and Congestion Network Tests)
NEW BENCHMARK: GPCNET
• Developed in collaboration with NERSC and ANL
• Publicly available at https://github.com/netbench/GPCNET
• Goals:
• Proxy real-world communication patterns
• Measure network performance under load
• Look at both mean and tail latency
• Look at interference between workloads (how well does the network perform congestion management?)
• Highly configurable to explore workloads of interest
• Benchmark outputs a rich set of metrics, including absolute and relative performance
17. CONGESTION IMPACT IN REAL SYSTEMS
Congestion Impact: CI = Latency(congested) / Latency(baseline), reported for both the average and the 99% tail (see the computation sketch below).
[Chart: Random Ring Latency Congestion Impact by System, log scale, with a line marking no congestion impact]

System    Nodes    Processes per network port    Network          Global network BW
Crystal   696      20                            Aries            100%
Theta     4,096    16                            Aries            50%
Edison    5,575    24                            Aries            50%
Osprey    128      20                            EDR IB           100%
Sierra    4,200    20                            EDR IB           50%
Summit    4,500    21                            EDR IB           100%
Malbec    485      20                            Slingshot (SS10) 50%

• Impact worsens with scale and taper
• InfiniBand does somewhat better than Aries
• Slingshot does really well
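The Congestion Impact ratio above is straightforward to compute from latency samples taken with and without background congestors. A minimal sketch (the sample values are placeholders, not measurements from the slide):

```python
# Congestion Impact: CI = latency_congested / latency_baseline,
# reported for both the mean and the 99th-percentile ("tail") latency.
# Sample values below are placeholders, not data from the GPCNET runs on the slide.
import statistics

def congestion_impact(baseline_us, congested_us):
    mean_ci = statistics.mean(congested_us) / statistics.mean(baseline_us)
    p99 = lambda xs: statistics.quantiles(xs, n=100)[98]   # 99th percentile
    tail_ci = p99(congested_us) / p99(baseline_us)
    return mean_ci, tail_ci

baseline  = [2.1, 2.3, 2.2, 2.4, 2.2] * 20   # random-ring latencies (us), quiet network
congested = [4.8, 5.5, 9.7, 4.9, 40.0] * 20  # same test run alongside congestor workloads
mean_ci, tail_ci = congestion_impact(baseline, congested)
print(f"average CI = {mean_ci:.1f}, 99% tail CI = {tail_ci:.1f}")
```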
18. SHASTA PULLS STORAGE ONTO SLINGSHOT NETWORK
[Diagram: In the traditional model, compute nodes on the high-speed network reach tiered flash and HDD OSS servers through LNET routers and a separate storage area network. In Shasta, ClusterStor E1000 OSS & MDS servers (SSD, ~80 GB/s) and OSS servers (HDD, ~30 GB/s) attach directly to the Slingshot high-speed network.]
Benefits:
• Lower cost
• Lower complexity
• Lower latency
• Improved small I/O performance
19. CLUSTERSTOR E1000 FLEXIBILITY
• Extreme Performance (Flash): 80/60 GB/s SSD read/write, 55.3 TB usable SSD capacity (3.2 TB drives), 6 x 200 Gbps network ports, 2 rack units; 15x faster than 2 x L300N (10 RU)
• Hybrid Flexibility: 80/60 GB/s SSD read/write, 55.3 TB usable SSD capacity, 15 GB/s HDD, 1.07 PB usable HDD capacity (14 TB drives), 4 x 200 Gbps network ports, 6 rack units; 15x (flash) / 0.7x (HDD) vs. 2 x L300N
• HDD Performance: 30 GB/s HDD, 2.14 PB usable HDD capacity, 2 x 200 Gbps network ports, 10 rack units; 50% faster than 2 x L300N
• HDD Capacity: 30 GB/s HDD, 4.27 PB usable HDD capacity, 2 x 200 Gbps network ports, 18 rack units; 50% faster than 2 x L300N
Per rack: up to 120 GB/sec and up to 10 PB usable capacity, or up to 1,600 GB/sec (read) and up to 4.2 PB usable capacity
20. SHASTA: A MORE OPEN, CUSTOMIZABLE STACK
Cray XC stack: sleek… scalable… monolithic…
Cray Shasta system management stack:
• Open, documented, RESTful APIs
• Ability to substitute different components
• Buildable source
[Diagram: layered management stack from Shasta HW (storage, compute, networks) up through hardware support services, infrastructure support services, and platform support services to the consumer]
21. SHASTA SOFTWARE PLATFORM ARCHITECTURE
[Diagram: containerized software stack exposed to administrators and developers through open APIs]
• Cray Linux Environment: Linux + HPC extensions, HPC batch job management + orchestration (Kubernetes), network and I/O abstractions, parallel performance libraries
• Cray Programming Environment: developer environment and runtimes, alongside administrator system services and developer services
• Shasta Management Services and Shasta Monitoring Framework: delivered as containerized services behind open APIs
• Cray Urika AI/Analytics Suite: Urika manager, analytics microservices, and analytics libraries & frameworks on Linux with Kubernetes orchestration and parallel performance libraries
Expanding the power of supercomputing with the flexibility of cloud and full datacenter interoperability
22. OPTIONS TO MEET A FULL RANGE OF AS-A-SERVICE NEEDS
Strategic partner to manage an end-to-end, hybrid HPC and AI portfolio across deployment and consumption models:
• Consumption-based (GreenLake Flexible Capacity): HPE GreenLake Flexible Capacity
• Cloud architecture (HPC Platform as a Service): ready for BlueData, RedHat OCP, and Singularity; HPCM (and APIs), VMware, and Cray's software environment
• Managed service, off-premises (Managed HPC as a Service): as-a-service offerings (Advania, ScaleMatrix, Markley); data center offerings (Equinix, CyrusOne); SI partners (Accenture, DXC)
• Public cloud ecosystem (HPC as a Service in the Public Cloud): ClusterStor in Azure; Cray in Azure for Manufacturing; Cray in Azure for EDA
23. WE ARE ENTERING THE EXASCALE ERA
• HPC, Enterprise, and hyperscale are converging
• Data-centric, hybrid workflows: AI + analytics + HPC
• Growing complexity, but tremendous opportunity to extract value
• Need new infrastructure for these new workloads
• Shasta provides the infrastructure for the Exascale Era
• Chosen for all three announced exascale systems
• Flexibility (processors, storage, network, software)
• Extensibility (hardware and software)
• Scalability (Up to exascale, down to a single 19” rack)
• Standards-based (interoperable and open)
• Cloud-like software stack for dynamic, heterogeneous workloads