In this webinar, members of the Server Solution Team and a member of Supermicro’s Product Office will discuss Supermicro’s Universal GPU Server: its modular, standards-based design, the important role of the OCP Accelerator Module (OAM) form factor and the Universal Baseboard (UBB) in the system, and AMD's next-generation HPC accelerator. In addition, we will share insights into trends in the HPC and AI/machine learning space, including the software platforms and best practices that are driving innovation in our industry and daily lives. In particular:
• Tools to enable use of the high-performance hardware for HPC and deep learning applications
• Tools to enable use of multiple GPUs, including RDMA, to run highly demanding HPC applications and deep learning models such as BERT
• Running applications in containers with AMD’s next-generation GPU system
3. Benefits of Universal GPU Server
• Supports a Variety of Technologies
  o CPU MB Support
    • AMD H12 (EPYC 7002/7003)
    • Intel X13 (Sapphire Rapids series)
  o GPU Support
    • AMD MI200 OAM with GPU-to-GPU Infinity Fabric
    • NVIDIA Redstone with GPU-to-GPU NVLink
    • Intel Future GPU
    • Traditional PCIe Form Factor GPU
• Modular Design for Flexibility
• Improved Thermal Capability
  o Supports up to 500 W/700 W GPUs, 280 W AMD CPUs, and 350 W/400 W Intel CPUs
• Future Proof Architecture
[Figure labels: UBB/OAM; Redstone; PCIe]
Supermicro Confidential
5. Universal Design and AMD Instinct MI250 OAM
• Significant HPC performance increase over competition
• Also good for AI/ML workloads
• 128GB HBM2e ECC Memory per OAM
• GPU to GPU xGMI Infinity Fabric 2.5TB/s
20. NVIDIA Certified
One-Stop Shop for Supermicro NVIDIA Certified Systems
[Diagram: Software + Hardware + Service]
• Software, Hardware, Service
• NGC Software; CUDA-X; CUDA; NVIDIA Driver
• Downloadable NGC Containers, Frameworks & AI Models
• OS / NGC Software Installation Service
• Operating System yearly subscriptions
• Supermicro Hardware Support Service
• NVIDIA NGC Support Service (per GPU per year)
• Kubernetes Setup, Rack Setup, …
• (future)
22. NVIDIA Solution: GPU Aggregation Stack
[Diagram: GPU aggregation stack, within system and across systems]
• MPP programming: Horovod, NCCL, MPI
• GPUDirect RDMA
• GPUDirect Peer to Peer with PCIe
• NVLink
• GPUDirect Storage
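The libraries named on this slide (Horovod, NCCL, MPI) all expose an all-reduce collective, which is how gradients are typically combined across GPUs in data-parallel training. The sketch below simulates a ring all-reduce in plain Python to show the data movement only. It is an illustration under simplifying assumptions (lists stand in for GPU buffers, list copies stand in for transfers), not how NCCL or any of these libraries is actually implemented, and the function name `ring_allreduce` is hypothetical.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce over a ring of len(buffers) workers.

    Each inner list stands in for one worker's (GPU's) buffer. Returns new
    buffers in which every worker holds the element-wise sum.
    """
    n = len(buffers)
    if n == 0:
        return []
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all buffers must have the same length")

    data = [list(b) for b in buffers]  # work on copies
    if n == 1:
        return data

    # Split each buffer into n contiguous chunks (some may be empty).
    bounds = [(k * length) // n for k in range(n + 1)]

    def chunk(k: int) -> range:
        return range(bounds[k], bounds[k + 1])

    # Phase 1, reduce-scatter: after n-1 steps, worker r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot payloads first so all sends in a step are simultaneous.
        sends = []
        for rank in range(n):
            k = (rank - step) % n
            sends.append((k, [data[rank][j] for j in chunk(k)]))
        for rank, (k, payload) in enumerate(sends):
            dst = (rank + 1) % n
            for off, j in enumerate(chunk(k)):
                data[dst][j] += payload[off]

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = []
        for rank in range(n):
            k = (rank + 1 - step) % n
            sends.append((k, [data[rank][j] for j in chunk(k)]))
        for rank, (k, payload) in enumerate(sends):
            dst = (rank + 1) % n
            for off, j in enumerate(chunk(k)):
                data[dst][j] = payload[off]

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    print(ring_allreduce(grads))
```

In each step every worker sends one chunk to its ring neighbor, so the per-link traffic stays roughly constant as workers are added. That is one reason fast GPU-to-GPU links within a system and RDMA across systems matter for this pattern.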
23. NVIDIA Solution: Parallel NVMe Fabrics for AI Training
• Fast storage using NVMe Fabrics
• WEKA.io
• Excelero
• BeeGFS, Lustre, Hadoop HDFS, …
• Other storage partnerships (Pure, HDS, …)
• NVIDIA GPUDirect Storage and NVMe
24. Supermicro AI/ML Solution with Kubernetes
Enable, Automate, & Scale AI/ML
[Diagram labels: AI/ML applications; Systems]
• Accelerate AI/ML Roll-out
  o Automation of AI/ML pipeline
  o Automation of bring-up and operation of cluster
  o Proven reference architecture
• Reduce Total Cost of Ownership
  o Cost optimized Supermicro modular systems, storage, network
  o Simplified integration of AI/ML into IT workflow
• Respond to Business Needs
  o Easy scaling of AI/ML use for multiple users
  o Quick scaling of AI/ML workflow
  o Routing of IT data into AI/ML pipeline
• One Stop Shop
  o Unified architecture with hardware and software
  o World-wide enterprise level support services
  o Strong relationship with NVIDIA
  o Broad set of software partners
[Diagram: Reference Architecture. Labels: Worker Nodes; Data Switches; MGMT Switches; Automation & Scaling Nodes; Reliable Storage; Master Nodes]
[Diagram: System & Software Stack]
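As a rough illustration of how a workload lands on GPU worker nodes in a Kubernetes cluster like the one outlined above, the sketch below builds a Pod manifest that requests GPUs through an extended resource limit. It assumes the cluster runs a GPU device plugin that advertises `nvidia.com/gpu` (the NVIDIA device plugin's resource name). The helper `gpu_pod_manifest`, the pod name, and the image are hypothetical placeholders and are not taken from the Supermicro solution.

```python
import json
from typing import Any, Dict


def gpu_pod_manifest(
    name: str,
    image: str,
    gpus: int,
    gpu_resource: str = "nvidia.com/gpu",  # assumes a device plugin exposes this
) -> Dict[str, Any]:
    """Return a Kubernetes Pod manifest (as a dict) requesting `gpus` GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested as extended resources under limits.
                    "resources": {"limits": {gpu_resource: gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, for illustration only.
    manifest = gpu_pod_manifest("train-job", "example.com/training:latest", 2)
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.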
In 1969, Cyrus Levinthal noted that it would take longer than the age of the known universe to enumerate all possible configurations of a typical protein by brute-force calculation. Levinthal estimated 10^300 possible conformations for a typical protein. Yet in nature, proteins fold spontaneously, some within milliseconds. This contradiction is sometimes referred to as Levinthal’s paradox.
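The scale of Levinthal's estimate can be checked with a short back-of-the-envelope calculation. The sketch below works in base-10 logarithms to avoid huge numbers. The sampling rate of 10^13 conformations per second and the 13.8-billion-year age of the universe are assumptions introduced here for illustration; only the 10^300 figure comes from the text above.

```python
import math

CONFORMATIONS_LOG10 = 300          # from the text: ~10^300 conformations
SAMPLES_PER_SECOND_LOG10 = 13      # assumption: 10^13 conformations tried per second
AGE_OF_UNIVERSE_YEARS = 13.8e9     # assumption: approximate age of the universe
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_estimate(
    conformations_log10: float = CONFORMATIONS_LOG10,
    rate_log10: float = SAMPLES_PER_SECOND_LOG10,
) -> tuple:
    """Return (log10 of seconds needed, log10 of universe ages needed)."""
    seconds_log10 = conformations_log10 - rate_log10
    universe_seconds = AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR
    ages_log10 = seconds_log10 - math.log10(universe_seconds)
    return seconds_log10, ages_log10


if __name__ == "__main__":
    seconds_log10, ages_log10 = levinthal_estimate()
    print(f"Brute-force search time: about 10^{seconds_log10:.0f} seconds")
    print(f"That is about 10^{ages_log10:.1f} times the age of the universe")
```

Under these assumptions, exhaustive enumeration would take on the order of 10^269 ages of the universe, which is why brute force is not a viable route to protein structure.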
Simulation spans many scales: quantum level, molecular, human size, Earth-size atmosphere, and galaxy/universe size
Predicting protein structure is a solved problem
Drug discovery
Protein design
Understand diseases
When this work is done with computers and GPUs, actual chemical and biological experiments can be narrowed significantly to high-probability candidates