THE WORLD’S FIRST PETASCALE ARM SUPERCOMPUTER
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a
wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National
Nuclear Security Administration under contract DE-AC04-94AL85000.
Presented by Michael Aguilar
SAND2019-1378 C
TOP500 Lists
TOP500, rank 203: Astra - Apollo 70, Cavium ThunderX2 CN9975-2000 28C 2 GHz, 4x EDR InfiniBand, HPE; Sandia National Laboratories, United States; 125,328 cores; Rmax 1,529.0 TFlop/s; Rpeak 2,005.2 TFlop/s
HPCG, rank 36 (TOP500 rank 203): same system and site; 125,328 cores; HPL 1,529.0 TFlop/s; HPCG 66.94 TFlop/s
Astra – the First Petascale Arm-based Supercomputer
• 2,592 HPE Apollo 70 compute nodes
• Cavium ThunderX2 Arm SoC, 28 cores, 2.0 GHz
• 5,184 CPUs, 145,152 cores, 2.3 PFLOPS system peak (see the quick arithmetic below)
• 128 GB DDR4 memory per node (8 memory channels per socket)
• Aggregate capacity: 332 TB; aggregate bandwidth: 885 TB/s
• Mellanox InfiniBand EDR, ConnectX-5
• HPE Apollo 4520 all-flash storage, Lustre parallel file system
• Capacity: 403 TB (usable)
• Bandwidth: 244 GB/s
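A quick sanity check on the quoted system peak (assuming 8 double-precision FLOPs per cycle per core, i.e., two 128-bit NEON FMA pipes; this per-core rate is my assumption, not stated on the slide):

    145,152 cores x 2.0 GHz x 8 FLOPs/cycle ≈ 2.32 PFLOPS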
Astra Architecture
Node diagram (HPE Apollo 70 Cavium TX2 node): two Cavium ThunderX2 sockets (Arm v8.1, 28 cores @ 2.0 GHz each), sixteen 8 GB DDR4-2666 dual-rank DIMMs (eight per socket), and a Mellanox ConnectX-5 OCP network interface attached over x8 + x8 PCIe, providing one EDR link at 100 Gbps.
Sandia's NNSA/ASC ARM Platform Evolution (timeline spanning Sept 2011, 2014, 2015, 2017, 2018/today, and future; some earlier systems now retired):
• Applied Micro X-Gene-1 (APM/HPE), 47 nodes
• Cavium ThunderX1 (Cavium/Penguin), 32 nodes
• Pre-GA Cavium ThunderX2 (Cavium, HPE/Comanche), 47 nodes
• Cavium ThunderX2 (Cavium, HPE/Apollo), FY18: Vanguard-Phase 1 Petascale Arm Platform
• Future ASC Platforms
• Deployed OpenHPC on the Mayer Arm-based testbed at Sandia
• 47 compute nodes, dual-socket Cavium ThunderX2, 28 cores @ 2.0 GHz
• Identified several gaps:
• OpenHPC focuses on providing the latest version of a given package
• E.g., OpenHPC 1.3.5 moved to gcc 7.3.0; SNL was validating with gcc 7.2.0 and needed to stay on that version
• Difficult to install multiple versions of a given package
• Lacks architecture-optimized builds
• HPL is 4.8x faster when compiled with an OpenBLAS targeting Cavium TX2 vs. the OpenHPC OpenBLAS (the DGEMM sketch below shows the kernel this tuning targets)
• Doesn't support static linking (build recipes actively remove static libraries)
• Many users like to build static binaries and ship a single binary to classified systems
• Hard to rebuild packages due to reliance on the Open Build Service
Engaging with the OpenHPC community to address gaps
ATSE: OpenHPC Evaluation
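To make the OpenBLAS point concrete, here is a minimal C sketch of the DGEMM call that dominates HPL's runtime, so the architecture target of the OpenBLAS build largely determines HPL throughput. The matrix size and the -lopenblas link line are illustrative assumptions, not taken from the slide.

/* Minimal DGEMM sketch; build against any CBLAS, e.g.: cc -O2 dgemm_demo.c -lopenblas */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    const int n = 2048;                       /* arbitrary size for illustration */
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* C = 1.0 * A * B + 0.0 * C : the kernel HPL's update step is built on */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %.1f (expect %.1f)\n", C[0], 2.0 * n);
    free(A); free(B); free(C);
    return 0;
}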
Vanguard Program
A proving ground for next-generation HPC technologies in support of the
NNSA mission
Vanguard Program: Advanced Technology Prototype Systems
• Prove viability of advanced technologies for NNSA integrated codes, at scale
• Expand the HPC ecosystem by developing emerging, yet-to-be-proven
technologies
• Is technology viable for future ATS/CTS platforms supporting ASC mission?
• Increase technology AND integrator choices
• Buy down risk and increase technology and vendor choices for future NNSA
production platforms
• Ability to accept higher risk allows for more/faster technology advancement
• Lowers/eliminates mission risk and significantly reduces investment
• Jointly address hardware and software technologies
• First Prototype platform targeting Arm Architecture
Success achieved through Tri-Lab involvement and collaboration
Spectrum (left to right): Test Beds → Vanguard → ATS/CTS Platforms; greater stability and larger scale to the right, higher risk and greater architectural choices to the left.
Test Beds
• Small testbeds (~10-100 nodes)
• Breadth of architectures
• Brave users
Vanguard
• Larger-scale experimental
systems
• Focused efforts to mature
new technologies
• Broader user-base
• Not Production
• Tri-lab resource but not for
ATCC runs
ATS/CTS Platforms
• Leadership-class systems
(Petascale, Exascale, ...)
• Advanced technologies,
sometimes first-of-kind
• Broad user-base
• PRODUCTION USE
Where Vanguard Fits
• Advanced Tri-lab Software Environment
• Accelerate maturity of ARM ecosystem for ASC computing
• Prove viability for NNSA integrated codes running at scale
• Harden compilers, math libraries, tools, communication libraries
• Heavily templated C++, Fortran 2003/2008, Gigabyte+ binaries, long compiles
• Optimize performance, verify expected results
• Build integrated software stack
• Programming env (compilers, math libs, tools, MPI, OMP, SHMEM, I/O, ...)
• Low-level OS (HPC-optimized Linux, network stack, filesystems, containers/VMs, ...)
• Job scheduling and management (WLM, scalable app launcher, user tools, ...)
• System management (OS image management, boot, system monitoring, ...)
• Leverage prototype aspect of system for scalable system software R&D
Improve the "0 to 60" time: from Vanguard-Astra arrival to useful work done
Vanguard Program: Tri-Lab Software Effort (ATSE)
Construction Completed in < 1 Year; Delivery and Integration Overlapped Construction
(Slide photos: celebrity groundbreaking; artist rendering)
• Support NNSA mission applications, test against realistic input problems
• Be able to operate disconnected from the public internet
• Operate stand-alone or integrated with existing vendor infrastructure
• Support multiple processor architectures (arm64, x86_64, ppc64le, ...)
• Support multiple system architectures (Linux clusters, Cray, IBM, ...)
• Support multiple compiler toolchains (gcc, arm, intel, xlc, ...)
• Support multiple versions+configs of software packages with user selection
• Support both static and dynamic linking of applications
• Build static libraries with -fPIC
• Provide architecture-optimized packages
• Distribute container and virtual machine images of each ATSE release
• Source included, easy to rebuild and replace any non-proprietary package
• Provide open version of ATSE that is distributed publicly
ATSE: Goals
ATSE Diagram (from the SNL Feb 12 TOSS meeting): the ATSE Packager pulls from TOSS, RHEL, EPEL, OpenHPC, and vendor software, and builds ATSE Packages using the Open Build Server and a Koji Build Server.
Layered stack (bottom to top): Vanguard Hardware; Base OS Layer (vendor OS, TOSS, or an open OS such as OpenSUSE); cluster middleware (e.g., Lustre, SLURM); ATSE Packaging delivering the user-facing ATSE Programming Environment "Product" for Vanguard (platform-optimized builds, common look-and-feel across platforms) as native installs, containers, and virtual machines; NNSA/ASC Application Portfolio on top.
Key: open source, limited distribution, closed source, integrator provided, ATSE activity.
ATSE: Pulling Components from Many Sources
The same layered stack, shown integrating with multiple base operating systems: the Base OS Layer can be a vendor OS, TOSS, or an open OS such as OpenSUSE, with cluster middleware (e.g., Lustre, SLURM), ATSE Packaging, the user-facing programming environment (native installs, containers, virtual machines), and the NNSA/ASC Application Portfolio above. (Key: open source, limited distribution, closed source, integrator provided, ATSE activity.)
ATSE: Integration with Multiple Base Operating Systems
Open Leadership Software
Stack (OLSS)
• HPE:
• HPE MPI (+ XPMEM)
• HPE Cluster Manager
• Arm:
• Arm HPC Compilers
• Arm Math Libraries
• Allinea Tools
• Mellanox-OFED & HPC-X
• RedHat 7.x for aarch64
ATSE: Collaboration with HPE OLSS Effort
• Set up a local Open Build Service (OBS) build farm at Sandia
• Built the set of software packages needed for Astra milestone 1
• When an OpenHPC recipe was available, we tried to use it, modifying as necessary
• Otherwise, we wrote a new build recipe in the same style as OpenHPC
• Installed on the Mayer testbed at Sandia, now in use as the default user environment
• Tested with STREAM, HPL, HPCG, and ASC mini-apps (a minimal STREAM-style sketch follows below)
• Compiler toolchain support
• GNU compilers 7.2.0
• Arm HPC compilers
ATSE modules interface mirrors OpenHPC
ATSE: Deployed Beta Stack
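As a point of reference for the STREAM testing mentioned above, here is a minimal STREAM-style triad sketch in C; the array size and OpenMP setup are illustrative assumptions, not the official STREAM benchmark or the ATSE configuration.

/* Memory-bandwidth-bound triad kernel; compile e.g.: cc -O2 -fopenmp stream_triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 26)          /* ~64M doubles per array, ~1.5 GB total */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];   /* triad: 24 bytes moved per element */
    double t1 = omp_get_wtime();

    printf("Triad bandwidth: %.1f GB/s\n",
           3.0 * N * sizeof(double) / (t1 - t0) / 1e9);
    free(a); free(b); free(c);
    return 0;
}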
Astra: Network Communications Become More Important Due to the Use of Multi-Core Sockets
Diagram: a Mellanox InfiniBand NIC (RDMA & SHArP) shared by two Cavium ThunderX2 B1 sockets (28 cores each), each socket with its own attached memory.
Astra-RDMA Network Improvements
• New NIC interfaces increase memory bandwidth and reduce memory
consumption
• Hardware Collectives
• Socket Direct allows cores from each socket to reach the RDMA fabric
• UCX API
Socket Direct
The Socket Direct feature enables a single NIC to be shared by multiple host processor sockets:
• Share a single physical link to reduce cabling
complexity and costs
• NIC arbitrates between host processors to
ensure a fair level of service
• Required some complex O/S patches early on
in test systems
Chart (OSU MPI multi-network bandwidth): Arm64 + EDR providing >12 GB/s between nodes and >75M messages/sec.
SHArP Hardware Collectives
• Reduce the amount of buffer space required for unexpected messages
• Perform work, when possible, in the NIC
• Increase computational performance by offloading work from the CPU scheduler
• Perform operations that would otherwise block computation in the NIC, to help speed up the work (a minimal MPI collective sketch follows below)
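For context, this is the kind of collective SHArP can offload. The sketch below is just a standard MPI_Allreduce; whether in-network offload actually engages depends on how the MPI library and HCOLL/SHArP are configured, which is not shown here.

/* Minimal MPI collective sketch; build with an MPI compiler wrapper, e.g.: mpicc allreduce_demo.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank, sum = 0.0;
    /* Small reductions like this are the target of in-network (SHArP) offload. */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of ranks: %.0f\n", sum);
    MPI_Finalize();
    return 0;
}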
Astra Hardware Collective Performance
UCX API
• Astra is providing Sandia with experience running a large-scale deployment of UCX
• ATSE OpenMPI is compiled to use the UCT UCX API
• UCP abstracts multiple devices and transport layers
• UCP provides message fragmentation and non-blocking operations
• UCP can use multiple connections for transport (a minimal UCP initialization sketch follows below)
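A hedged sketch of UCP initialization, assuming a UCX installation that provides ucp/api/ucp.h. It only brings up and tears down a context and worker; it is a generic example, not code from the ATSE OpenMPI configuration.

/* Minimal UCP setup/teardown; build e.g.: cc ucp_demo.c -lucp -lucs */
#include <ucp/api/ucp.h>
#include <stdio.h>

int main(void)
{
    ucp_params_t params = {0};
    params.field_mask = UCP_PARAM_FIELD_FEATURES;
    params.features   = UCP_FEATURE_TAG;      /* tag-matching messaging */

    ucp_config_t *config;
    ucp_config_read(NULL, NULL, &config);     /* honors UCX_* environment variables */

    ucp_context_h context;
    if (ucp_init(&params, config, &context) != UCS_OK) {
        fprintf(stderr, "ucp_init failed\n");
        return 1;
    }
    ucp_config_release(config);

    ucp_worker_params_t wparams = {0};
    wparams.field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE;
    wparams.thread_mode = UCS_THREAD_MODE_SINGLE;

    ucp_worker_h worker;
    ucp_worker_create(context, &wparams, &worker);

    /* ... create endpoints, post non-blocking sends/receives ... */

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}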
Extreme Efficiency:
• The total of 1.2 MW in the 36 compute racks is cooled by only 12 fan coils
• These coils are cooled without compressors year-round, with no evaporative water at all for almost 6,000 hours a year
• 99% of the compute racks' heat never leaves the cabinet, yet the system doesn't require the internal plumbing of liquid disconnects and cold plates running across all CPUs and DIMMs
Cooling diagram (Sandia Thermosyphon Cooler Hybrid System for Water Savings): an efficient tower and heat exchanger can take the hottest 36 hours of the year (18.5 °C wetbulb) and make 20.0 °C water for the fan coils (approach temperatures of 1.5 °C and 0.0 °C; 26.0 °C air).
Projected power of the system by component:
                per-constituent-rack (W)               total (kW)
                wall    peak   nominal  idle    racks  wall    peak    nominal  idle
Node racks      39888   35993  33805    6761     36    1436.0  1295.8  1217.0   243.4
MCS300          10500    7400   7400     170     12     126.0    88.8    88.8     2.0
Network         12624   10023   9021    9021      3      37.9    30.1    27.1    27.1
Storage         11520   10000  10000    1000      2      23.0    20.0    20.0     2.0
Utility          8640    5625   4500     450      1       8.6     5.6     4.5     0.5
Total                                                  1631.5  1440.3  1357.3   274.9
(nominal = Linpack)
Astra Advanced Power and Cooling
Early Results from Astra
Speedups vs. baseline (Trinity ASC Platform, current production, dual-socket Haswell):
• CFD Models: 1.60X
• Hydrodynamics: 1.45X
• Molecular Dynamics: 1.30X
• Monte Carlo: 1.42X
• Linear Solvers: 1.87X
Initial Large Scale Testing and Benchmarks (HPL)
Chart: HPL performance (in PetaFlops) vs. number of nodes for successive runs from 10/23 through 11/2, growing from roughly 0.32 PF on about 1,020 nodes to 1.565 PF on 2,238 nodes; the 1.529 PF run matches the Rmax reported on the TOP500 slide.
Initial Large Scale Testing and Benchmarks (HPCG)
Chart: HPCG performance (in TeraFlops) vs. number of nodes for runs from 10/25 through 10/31, growing from single-digit TF on 256-483 nodes to 67.83 TF on 2,233 nodes; the 66.94 TF value matches the HPCG result on the TOP500 slide.
R&D Areas
• Workflows leveraging containers and virtual machines
• Support for machine learning frameworks
• ARMv8.1 includes new virtualization extensions, SR-IOV
• Evaluating parallel filesystems + I/O systems @ scale
• GlusterFS, Ceph, BeeGFS, Sandia Data Warehouse, …
• Resilience studies over Astra lifetime
• Improved MPI thread support, matching acceleration (see the threading sketch after this slide)
• OS optimizations for HPC @ scale
• Exploring spectrum from stock distro Linux kernel to HPC-tuned Linux
kernels to non-Linux lightweight kernels and multi-kernels
• Arm-specific optimizations
ATSE: R&D Efforts
• Continue adding and optimizing packages and build test framework
• Package container and VM images
• Lab-internal version, hosted on Sandia Gitlab Docker registry
• Externally distributable versions, stripping out proprietary components
• Submit "trial-run" patches filling gaps to OpenHPC community
• Explore Spack build and packaging
• Continue collaboration with HPE OLSS team, first external ATSE customer
• LANL: CharlieCloud/Bee support, LLVM compiler work, application readiness
• LLNL: TOSS, “Spack Stacks” ATSE build, application readiness
ATSE: Next Steps
Questions?
Exceptional Service in the National Interest
SAND2019-1378 C
Editor's Notes
  1. Multi-physics production codes. Optimizing performance sometimes comes down to surprisingly simple things, like memcpy, and it takes a while to find them all. Optimized assembly routines are taken for granted on x86; glibc has many examples of generic but slow code paths that are optimized only for specific architectures such as x86.
  2. Zero-copy operations are a two-step operation.