Modular by Design: Supermicro’s New Standards-Based Universal GPU Server

Supermicro Universal GPU
12/2/2021 Better Faster Greener™ © 2021 Supermicro
Josh Grossman, Principal Product Manager
Waiming Mok, Director, Product Office
Steve Rudinsky, Director, Server Solutions

Supermicro Universal GPU
Open, Modular, Standards Based
[Diagram: Universal GPU Server built from interchangeable UBBs, OAMs, PCIe GPUs, processors, networking, and storage options, with multiple GPU and CPU vendors]
Supermicro Confidential

Benefits of Universal GPU Server
• Supports a Variety of Technologies
  o CPU MB Support
    • AMD H12 EPYC 7002/3
    • Intel X13 Sapphire Rapids series
  o GPU Support
    • AMD MI-200 OAM with GPU to GPU Infinity Fabric
    • NVIDIA Redstone with GPU to GPU NVLink
    • Intel Future GPU
    • Traditional PCIe Form Factor GPU
• Modular Design for Flexibility
• Improved Thermal Capability
  o Supports up to 500W/700W GPU, 280W AMD CPU and 350W/400W Intel CPU
• Future Proof Architecture
UBB/OAM
Redstone
PCIe

Specifications
CPU – Dual Socket
Dual AMD EPYC 7003 CPUs (Socket SP3), up to 280W, 128 Cores/256 Threads
Memory – 32 DIMM Slots
32 DIMMs, 8TB Reg. ECC DDR4 up to 3200MHz
Drives – 10 2.5” Drive Bays
Up to 10x hot-swap NVMe U.2 connected to PCIe Switch, or 10x hot-swap 2.5” SATA/SAS
1x M.2 NVMe/SATA onboard (max length of 110mm)
Expansion – 10 PCIe Slots
8x PCIe 4.0 x16 LP (via PLX switch for IB EDR)
2x PCIe 4.0 x16 LP from CPUs
I/O Ports
1x VGA, 1x COM Header, 2x USB 3.0, 1x Dedicated IPMI, and 2x 10GbE LAN
Power Supply
4x 3000W (2+2) Titanium Level efficiency power supplies
4U AMD EPYC 7003 Dual CPUs and Four GPUs
Universal GPU System with AMD CPU & MI-200 GPU: AS-4124GQ-TNMI
Subject to change without notice
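The figures in this spec list can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes 500 W per GPU (the deck quotes 500W/700W without tying either figure to this configuration), reads "2+2" as two active and two redundant supplies, and ignores fans, drives, and network cards. The function names are hypothetical.

```python
# Sanity checks on the "up to" figures from the spec list.
# Assumptions (not from the deck): 500 W per GPU, "2+2" = 2 active + 2 redundant
# supplies, and no allowance for fans, drives, memory, or NICs.

def per_dimm_capacity_gb(total_tb: float, dimm_slots: int) -> float:
    """Capacity each DIMM must provide to reach the stated maximum (1 TB = 1024 GB)."""
    return total_tb * 1024 / dimm_slots

def usable_psu_watts(psu_count: int, psu_watts: int, redundant: int) -> int:
    """Power available while `redundant` supplies are held in reserve."""
    return (psu_count - redundant) * psu_watts

def cpu_gpu_watts(cpus: int, cpu_watts: int, gpus: int, gpu_watts: int) -> int:
    """Combined peak draw of the processors and accelerators only."""
    return cpus * cpu_watts + gpus * gpu_watts

dimm_gb = per_dimm_capacity_gb(8, 32)      # 8 TB across 32 slots
usable = usable_psu_watts(4, 3000, 2)      # 4x 3000 W, 2+2 redundancy
load = cpu_gpu_watts(2, 280, 4, 500)       # 2 CPUs at 280 W, 4 GPUs at 500 W
cores_per_socket = 128 // 2                # 128 cores over two sockets

print(f"DIMM size needed for 8 TB: {dimm_gb:.0f} GB")
print(f"Usable PSU power: {usable} W, CPU+GPU load: {load} W, headroom: {usable - load} W")
print(f"Cores per socket: {cores_per_socket}")
```

Under these assumptions the CPUs and GPUs draw well under half of the redundant power budget; substituting 700 W per GPU in `cpu_gpu_watts` shows the effect of the higher thermal class.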
Key Features
Universal GPU Server Standards Based Design
Modular by Design for Flexibility/Future Proofed
Improved Thermal Capability
Key Applications
Perfect Platform for HPC applications
Data Center Infrastructure
System Rear View
System Front View
Supermicro Confidential/Internal Only

Universal Design and AMD Instinct MI 250 OAM
• Significant HPC performance increase over competition
• Also good for AI/ML workloads
• 128GB HBM2e ECC Memory per OAM
• GPU to GPU xGMI Infinity Fabric 2.5TB/s
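To put the 128GB per OAM figure in context, the sketch below estimates how many OAMs are needed just to hold a model's weights. It is a rough illustration, not a sizing guide: it assumes FP16 weights (2 bytes per parameter), treats GB as 10^9 bytes, and ignores activations, gradients, and optimizer state, which add substantially to training memory. The parameter counts are the ones quoted on the deck's model-growth slide; the function names are hypothetical.

```python
import math

HBM_GB_PER_OAM = 128  # from the slide: 128GB HBM2e per OAM

def weights_gb(params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal GB (FP16 = 2 bytes per parameter)."""
    return params * bytes_per_param / 1e9

def min_oams_for_weights(params: float, bytes_per_param: int = 2,
                         hbm_gb: float = HBM_GB_PER_OAM) -> int:
    """Smallest number of OAMs whose combined HBM can hold the weights."""
    return max(1, math.ceil(weights_gb(params, bytes_per_param) / hbm_gb))

# Parameter counts as quoted elsewhere in the deck.
models = {
    "BERT large": 330e6,
    "OpenAI GPT-2": 1.5e9,
    "Google T5": 11e9,
    "OpenAI GPT-3": 175e9,
}

for name, params in models.items():
    print(f"{name}: {weights_gb(params):.2f} GB of FP16 weights, "
          f"at least {min_oams_for_weights(params)} OAM(s)")
```

On these assumptions a four-OAM system offers 512 GB of HBM in total, so the largest model listed here fits for weights alone only when spread across several OAMs.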

Specifications
CPU – Dual Socket
Dual Sapphire Rapids CPUs (up to 350W/400W TDP)
Memory – 32 DIMM Slots
32 DIMMs, 8TB Reg. ECC DDR5 up to 4800MHz
Drives – 10 2.5” Drive Bays
Up to 10x hot-swap NVMe U.2 connected to PCIe Switch, or 10x hot-swap 2.5” SATA/SAS
1x M.2 NVMe/SATA onboard (max length of 110mm)
Expansion – 10 PCIe Slots
8x PCIe 5.0 x16 LP (via PLX switch for IB EDR)
2x PCIe 5.0 x16 LP from CPUs
I/O Ports
1x VGA, 1x COM Header, 2x USB 3.0, 1x Dedicated IPMI, and 2x 10GbE LAN
Power Supply
4x 3000W (2+2) Titanium Level efficiency power supplies
4U Intel Dual Sapphire Rapids CPUs and four GPUs
Universal GPU System with Intel X13 CPU & NVIDIA HGX A100 4-GPU
Subject to change without notice
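The main generational differences between the two spec lists are DDR4-3200 versus DDR5-4800 and PCIe 4.0 versus PCIe 5.0. The sketch below works out the theoretical peak rates implied by those labels. It assumes a 64-bit (8-byte) data path per DIMM, reads the slides' "MHz" figures as mega-transfers per second, and uses the standard per-lane rates of 16 GT/s (PCIe 4.0) and 32 GT/s (PCIe 5.0) with 128b/130b encoding. Real throughput is lower; the function names are hypothetical.

```python
def ddr_channel_gb_per_s(mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak of one 64-bit DDR data path: transfers/s x 8 bytes."""
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9

def pcie_gb_per_s(giga_transfers_per_s: float, lanes: int = 16) -> float:
    """Theoretical peak per direction, after 128b/130b line encoding."""
    return giga_transfers_per_s * lanes * (128 / 130) / 8

ddr4 = ddr_channel_gb_per_s(3200)   # AMD H12 system: DDR4 up to 3200
ddr5 = ddr_channel_gb_per_s(4800)   # Intel X13 system: DDR5 up to 4800
gen4 = pcie_gb_per_s(16)            # PCIe 4.0 x16
gen5 = pcie_gb_per_s(32)            # PCIe 5.0 x16

print(f"DDR4-3200: {ddr4:.1f} GB/s, DDR5-4800: {ddr5:.1f} GB/s ({ddr5 / ddr4:.2f}x)")
print(f"PCIe 4.0 x16: {gen4:.1f} GB/s, PCIe 5.0 x16: {gen5:.1f} GB/s ({gen5 / gen4:.2f}x)")
```

Multiplying the per-DIMM figure by the number of populated memory channels gives a per-socket estimate; the channel count is not stated on the slides and would have to be supplied.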
Key Features
Universal GPU Server Standards Based Design
Modular by Design for Flexibility/Future Proofed
Improved Thermal Capability (500W/700W GPU)
Key Applications
Perfect Platform for HPC applications
Data Center Infrastructure
System Rear View
System Front View

GPU Solution Topics
• Trends & Applications
• Software & Solutions for AMD GPUs
• Software & Solutions for NVIDIA GPUs

Key Drivers for Growth for Computing & GPUs
• AI
• HPC
• Enterprise Adoption
• Metaverse

AI: Trends in Deep Neural Networks
Timeline: 2012 to 2021, then ?
Different Deep Neural Networks (domain specific techniques)
• Image processing: CNN on GPU
• Text understanding: RNN on GPU
• Speech / audio: RNN / CNN on GPU
Transformer / Attention on multiple GPUs, networked
More data, more compute, more powerful AI!

AI Demand for More and More GPUs
Exponential Growth in Parameters in Transformer / Attention Models
• ResNet-50, 26M
• Inception v4, 43M
• SENet, 146M
• BERT large, 330M
• OpenAI GPT-2, 1.5B
• Google T5, 11B
• OpenAI GPT-3, 175B
• Google Switch Transformer, 1.6T
[Chart: millions of parameters, log scale from 10 to 10,000,000; 1 Billion marked at Feb 2019 and 1 Trillion at Feb 2021]
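The "exponential growth" claim can be made concrete from the two markers on the chart. The sketch below assumes the markers mean that the largest models reached 1 billion parameters in Feb 2019 and 1 trillion in Feb 2021, which is a reading of the chart rather than something the slide states, and derives the implied doubling time. The function names are hypothetical.

```python
import math

def doubling_time_months(start_params: float, end_params: float, months: float) -> float:
    """Months per doubling, assuming steady exponential growth over the interval."""
    return months / math.log2(end_params / start_params)

def growth_per_year(start_params: float, end_params: float, months: float) -> float:
    """Multiplicative growth factor over 12 months under the same assumption."""
    return (end_params / start_params) ** (12 / months)

# Assumed reading of the chart: 1 Billion at Feb 2019, 1 Trillion at Feb 2021.
months_elapsed = 24
doubling = doubling_time_months(1e9, 1e12, months_elapsed)
yearly = growth_per_year(1e9, 1e12, months_elapsed)

print(f"Doubling time: {doubling:.2f} months")
print(f"Growth per year: {yearly:.1f}x")
```

A doubling time of a few months is far shorter than the cadence of GPU memory or compute improvements, which is the slide's argument for scaling out across more GPUs.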

Increasing Demand for High Performance Computing (HPC)
• Exascale
• Next Top500: ISC 2022 or SC 2022
• Hybrid Computing: GPU and CPU
• AI Adoption
• Physics Simulation
• Healthcare / Biology
  e.g. AlphaFold 2 analysis of protein structures (Nov 2020, 90% accuracy)

Enterprise Adoption of AI
AI Applies Across Industries
Corporation Functions / Industry Specific
Use cases (columns): Language Models, Automatic Speech Recognition, Image Recognition, Object Detection, Recommendation Engine, Anomaly Detection
Energy ● ● ● ●
Finance ● ● ● ● ●
Government ● ● ● ● ● ●
Healthcare ● ● ●
Manufacturing ● ● ● ●
Media ● ● ● ● ●
Retail ● ● ● ● ● ●
Services ● ● ● ● ● ●
Technology ● ● ● ● ● ●
Telecom ● ● ● ● ● ●

Enterprise Adoption of AI
Recommender Systems
Process Automation
Cyber Security
Fraud Detection
Talent Acquisition
Recommendation Systems
Market Data Analytics
Digital Prototyping
Process Automation
Predictive Maintenance
Defect Detection
Supply Chain
Language Models

Metaverse: Digitized Reality and Novel Creations

AMD Tools & Solutions for AI/ML and HPC
RTM (Reverse Time Migration)
Datacenter Tools: Profilers & Debuggers, Comm & Math Libraries, Compiler
Code Reuse: ONNX Runtime, existing deep learning, HPC code
Cross Platform: Open source, supports AMD GPUs, CPUs, non-AMD GPUs
3RD GEN AMD INFINITY ARCHITECTURE
FIRST MULTI-CHIP GPU
• Highest performance
• Bigger GPU memory
• Higher Flops (FP64, FP32, FP16)

AMD ROCm 5.0 for AI and HPC
ROCm 5.0
• HPC Optimization
• Optimization for BERT and new models
• Scale-out libraries
• Exascale-ready system tools
• Top 25 HPC apps optimized

AMD ROCm Support for ONNX AI Models
Multiple Frameworks
Pre-Trained Models
https://cloudblogs.microsoft.com/opensource/2021/07/13/onnx-runtime-release-1-8-1-previews-support-for-accelerated-training-on-amd-gpus-with-the-amd-rocm-open-software-platform/

AMD ROCm Support for Multi-GPU Scaling
Multiple GPUs with Multiple Nodes
• PCIe Gen 4 (200Gbit/s)
• soon PCIe Gen 5 (400Gbit/s)
Multiple GPUs in Single Node
RCCL – ROCm Communication Collectives Library
Distributed Learning: all-reduce, all-gather, reduce, broadcast, reduce-scatter, gather, scatter, all-to-all
High bandwidth: PCIe, xGMI, InfiniBand Verbs or TCP/IP sockets
Applications: Single- or multi-process MPI
Ring and tree algorithms: Optimized for throughput and latency
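The ring algorithm mentioned above can be illustrated without any GPU library. The sketch below is a pure-Python simulation of a textbook ring all-reduce (a reduce-scatter pass followed by an all-gather pass); it does not call RCCL and is not a description of RCCL's internals. The cost helper uses the usual bandwidth-only estimate, in which each rank sends 2(N-1)/N of the payload, and ignores latency. All function names are hypothetical.

```python
def ring_all_reduce(buffers):
    """Sum equal-length lists held by N ranks so every rank ends with the total.

    Each buffer is split into N chunks; data only ever moves from rank r to
    rank (r + 1) % N, as it would around a ring of GPUs.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert length % n == 0, "buffer length must be divisible by the rank count"
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after N-1 steps rank r owns the full sum of chunk (r+1) % N.
    for step in range(n - 1):
        sends = [((r - step) % n, None) for r in range(n)]
        sends = [(c, [data[r][i] for i in span(c)]) for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, value in zip(span(c), payload):
                data[dst][i] += value

    # Phase 2, all-gather: circulate the finished chunks until every rank has all of them.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, None) for r in range(n)]
        sends = [(c, [data[r][i] for i in span(c)]) for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, value in zip(span(c), payload):
                data[dst][i] = value

    return data


def ring_all_reduce_seconds(payload_bytes, ranks, link_gbit_per_s):
    """Bandwidth-only time estimate: each rank sends 2(N-1)/N of the payload."""
    bytes_sent = 2 * (ranks - 1) / ranks * payload_bytes
    return bytes_sent / (link_gbit_per_s * 1e9 / 8)


result = ring_all_reduce([
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
])
print(result[0])  # every rank holds the same element-wise sum

# Illustrative only: 330M FP16 gradients (660 MB) over 4 ranks on a 200 Gbit/s link.
print(f"{ring_all_reduce_seconds(330e6 * 2, 4, 200) * 1e3:.1f} ms")
```

Because the per-rank traffic approaches twice the payload regardless of the number of ranks, the ring form is favored for throughput on large messages, while tree algorithms reduce the number of sequential steps and so help with latency.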

AMD ROCm Resources
https://www.amd.com/en/graphics/instinct-server-accelerators

NVIDIA Certified
One-Stop Shop for Supermicro NVIDIA Certified Systems
NVIDIA Certified
OS / NGC Software Installation Service
Operating System yearly subscriptions
Software
Hardware
Supermicro Hardware Support Service
Service
NVIDIA NGC Support Service (per GPU per year)
Kubernetes Setup, Rack Setup, …
Downloadable NGC Containers, Frameworks & AI Models
CUDA-X
CUDA Nvidia Driver
NGC Software
(future)

NVIDIA Solution: NGC Catalog…
Collections
Containers
Helm Charts
Models

NVIDIA Solution: GPU Aggregation Stack
Within System: NVLINK, GPUDirect Peer to Peer with PCIe
Across Systems: GPUDirect RDMA
MPP Programming: Horovod, NCCL, MPI
GPUDirect Storage
NVIDIA Solution: Parallel NVMe Fabrics for AI Training
AI Training
• Fast storage using NVMe Fabrics
• WEKA.io
• Excelero
• BeeGFS, Lustre, Hadoop HDFS, …
• Other storage partnerships (Pure, HDS, …)
• NVIDIA GPUDirect Storage and NVMe

AI/ML applications
Systems
Supermicro AI/ML Solution with Kubernetes
Enable, Automate, & Scale AI/ML
• Accelerate AI/ML Roll-out
  • Automation of AI/ML pipeline
  • Automation of bring-up and operation of cluster
  • Proven reference architecture
• Reduce Total Cost of Ownership
  • Cost optimized Supermicro modular systems, storage, network
  • Simplified integration of AI/ML into IT workflow
• Respond to Business Needs
  • Easy scaling of AI/ML use for multiple users
  • Quick scaling of AI/ML workflow
  • Routing of IT data into AI/ML pipeline
• One Stop Shop
  • Unified architecture with hardware and software
  • World-wide enterprise level support services
  • Strong relationship with NVIDIA
  • Broad set of software partners
Reference Architecture: Master Nodes, Worker Nodes, Automation & Scaling Nodes, Reliable Storage, Data Switches, MGMT Switches
System & Software Stack

Supermicro Solutions Addressing Growth Drivers
Growth Drivers
• AI
• HPC
• Enterprise Adoption
• Metaverse
Supermicro GPU Solutions
Supermicro System
AMD GPUs
Software Eco-System
NVIDIA NGC
Supermicro System
NVIDIA GPUs
Software Eco-System

Roadmap and Q&A
• Sample Availability
• Future Releases
• Contact Supermicro
• Questions

DISCLAIMER
Super Micro Computer, Inc. may make changes to specifications and product descriptions at any time, without notice. The
information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors. Any performance tests and ratings are measured using systems that reflect the approximate
performance of Super Micro Computer, Inc. products as measured by those tests. Any differences in software or hardware
configuration may affect actual performance, and Super Micro Computer, Inc. does not control the design or implementation of
third party benchmarks or websites referenced in this document. The information contained herein is subject to change and may
be rendered inaccurate for many reasons, including but not limited to any changes in product and/or roadmap, component and
hardware revision changes, new model and/or product releases, software changes, firmware changes, or the like. Super Micro
Computer, Inc. assumes no obligation to update or otherwise correct or revise this information.
SUPER MICRO COMPUTER, INC. MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE
CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT
MAY APPEAR IN THIS INFORMATION.
SUPER MICRO COMPUTER, INC. SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR
FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL SUPER MICRO COMPUTER, INC. BE LIABLE TO ANY
PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF
ANY INFORMATION CONTAINED HEREIN, EVEN IF SUPER MICRO COMPUTER, INC. IS EXPRESSLY ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2020 Super Micro Computer, Inc. All rights reserved.