Best practices for optimizing Red Hat
platforms for large scale datacenter
deployments on DGX systems
Charlie Boyle, NVIDIA
Andre Beausoleil and Jeremy Eder, Red Hat
NVIDIA GTC, Washington, DC, October, 2018
Agenda
● Relationship Overview
● Announcements / What’s New
● Tuned profile for DGX
● NGC Container Support overview
● RHEL, OpenShift, DGX-1 Integration Details
Summary of Announcements!
● NVIDIA DGX-1 is now CERTIFIED on Red Hat Enterprise Linux 7
● Support for using DGX nodes as workers in OpenShift 3.10 or later
● NGC containers can run on Red Hat Enterprise Linux and OpenShift
● Expanded Engineering Relationship
Red Hat/NVIDIA Partnership Timeline
Open Source Project Collaboration
NOUVEAU DRIVER
● Key Red Hat maintainer: Ben Skeggs
● Qualified with new NVIDIA architectures
● Part of complete OSS toolchain for HMM
HETEROGENEOUS MEMORY MGMT.
● Key Red Hat developer: Jerome Glisse
● Memory management between device & CPU
● Key developer simplification, not just NVIDIA
GPU AWARE GCC (LIBGOMP)
● Key Red Hat maintainer: Jakub Jelinek
● OpenMP common library
NVIDIA VGPU & RHV
● Multiple vGPUs for compute and graphics workloads
Joint Testing of Critical CVEs
Installing Red Hat Enterprise Linux 7
Tuned
Tuning profile delivery mechanism
Red Hat ships tuned profiles that improve performance for many workloads... hopefully yours!
Okay, but why do I care?
Tuned: Your Custom Profiles
Parents: latency-performance, throughput-performance, balanced
Children: network-latency, network-throughput, virtual-host, virtual-guest, desktop
Children/Grandchildren: Your Database Profile, Your Web Profile, Your Middleware Profile
Tuned: Profile Inheritance (throughput)
throughput-performance (parent)
  governor=performance
  energy_perf_bias=performance
  min_perf_pct=100
  readahead=4096
  kernel.sched_min_granularity_ns = 10000000
  kernel.sched_wakeup_granularity_ns = 15000000
  vm.dirty_background_ratio = 10
  vm.swappiness=10
dgx-performance (child)
  [bootloader]
  cmdline = ast.modeset=0 rd.driver.blacklist=nouveau nouveau.modeset=0 transparent_hugepage=madvise console=tty0 console=ttyS1,115200n8 intremap=no_x2apic_optout
Red Hat OpenShift Container Platform
OPENSHIFT IS GAINING MOMENTUM
OPENSHIFT CUSTOMER GROWTH IS ACCELERATING
Why OpenShift is the Best Choice
CODE
● Red Hat is the leading Kubernetes developer and contributor with Google.
● We make container development easy, reliable, and more secure.
CUSTOMERS
● Most reference customers running in production.
● Years of experience running OpenShift Online and OpenShift Dedicated services.
CLOUD PARTNERS
● Strong partnerships with cloud providers, ISVs, CCSPs.
● Extensive container catalog of certified partner images.
COMPREHENSIVE
● Comprehensive portfolio of container products and services, including developer tools, security, application services, storage, and management.
One Platform to...
OpenShift is the single platform
to run any application:
● Old or new
● Monolithic/Microservice
FSI
What does an OpenShift (OCP) Cluster look like?
DGX-1 server with Red Hat Enterprise Linux and OpenShift Container Platform (OCP)
Upstream First: Kubernetes Working Groups
● Resource Management Working Group
  ○ Features Delivered
    ■ Device Plugins (GPU/Bypass/FPGA)
    ■ CPU Manager (exclusive cores)
    ■ Huge Pages Support
  ○ Extensive Roadmap
● Intel, IBM, Google, NVIDIA, Red Hat, many more...
Upstream First: Kubernetes Working Groups
● Network Plumbing Working Group
  ○ Formalized Dec 2017
● Goal is to implement a pseudo-standard collection of CRDs for multiple networks, owned by sig-network and maintained *out of tree*
● Separate control- and data-plane, Overlapping IPs, Fast Data-plane
● IBM, Intel, Red Hat, Huawei, Cisco, Tigera... at least.
OpenShift Cluster Topology
● Control Plane: 3 × master and etcd
● Infrastructure: LB, 3 × registry and router
● Compute and GPU Nodes: 4 × DGX-1
OpenShift Cluster Topology: Compute and GPU Nodes (4 × DGX-1)
How to enable software to take advantage of “special” hardware
● Create Node Pools
  ○ Mark them as “special”
  ○ Taints/Tolerations
  ○ Priority/Preemption
  ○ ExtendedResourceToleration
● Tune/Configure the OS
  ○ Tuned Profiles
  ○ CPU Isolation
  ○ sysctls
● Optimize your workload
  ○ Dedicate CPU cores
  ○ Consume hugepages
● Enable the Hardware
  ○ Install drivers
  ○ Deploy Device Plugin
● Consume the Device
  ○ KubeFlow Template deployment
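The "Optimize your workload" step (dedicate CPU cores, consume hugepages) can be sketched as a pod spec. This is a minimal illustration, not a manifest from the talk: the pod name, image, and sizes are hypothetical, and it assumes the node has pre-allocated 2 MiB hugepages and that the kubelet runs with the static CPU Manager policy.

```yaml
# Hypothetical pod; name, image, and sizes are placeholders, not from the talk.
apiVersion: v1
kind: Pod
metadata:
  name: tuned-workload
spec:
  containers:
  - name: app
    image: registry.example.com/my-app:latest   # placeholder image
    resources:
      # requests == limits with a whole-number CPU count puts the pod in the
      # Guaranteed QoS class, which the static CPU Manager policy requires
      # before it assigns exclusive cores.
      requests:
        cpu: "4"
        memory: 8Gi
        hugepages-2Mi: 1Gi
      limits:
        cpu: "4"
        memory: 8Gi
        hugepages-2Mi: 1Gi
    volumeMounts:
    - name: hugepage
      mountPath: /dev/hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages   # backs the mount with pre-allocated hugepages
```

If the CPU request were fractional, or requests and limits differed, the container would run in the shared CPU pool instead of on exclusive cores.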
Soft or Hard Shared Cluster Partitioning?
Priority and Preemption
● Create PriorityClasses based on business goals
● Set priorityClassName in pod specs
● If all GPUs are in use, a high-priority pod is queued, and a low-priority pod is running:
  ○ Kube will preempt the low-priority pod
  ○ And schedule the high-priority pod
● Ensures optimal density
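As a rough sketch of the priority and preemption flow described above, a PriorityClass and a pod that references it might look like the following. The class name, value, pod name, and image are hypothetical. The API group version is also an assumption: `scheduling.k8s.io/v1` is served by newer clusters, while older releases may only offer `v1beta1` or `v1alpha1`.

```yaml
# Hypothetical PriorityClass; name and value are illustrative only.
apiVersion: scheduling.k8s.io/v1   # older clusters may need v1beta1 or v1alpha1
kind: PriorityClass
metadata:
  name: gpu-production
value: 1000000                     # higher value means higher priority
globalDefault: false
description: "Production GPU jobs; may preempt lower-priority pods."
---
# Hypothetical pod that asks for all 8 GPUs at the higher priority.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  priorityClassName: gpu-production
  containers:
  - name: train
    image: registry.example.com/train:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 8
```

With this in place, a queued `training-job` pod becomes a candidate to preempt running pods that belong to a lower-valued PriorityClass, or to none at all.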
Taints and Tolerations
● Taints are “node labels with policies”
  ○ You can taint a node with, for example, nvidia.com/gpu=value:NoSchedule
● A pod then has to “tolerate” the nvidia.com/gpu taint, otherwise it won’t run on that node.
● This allows you to create “node pools”
● Could lead to under-utilized resources
● Might make sense for security or business rules
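A pod that tolerates the example taint above could be written roughly as follows. The taint key, value, and effect are taken from the slide; the pod name and image are hypothetical. The taint itself would typically be applied with `oc adm taint nodes <node-name> nvidia.com/gpu=value:NoSchedule`.

```yaml
# Hypothetical pod; only the toleration mirrors the taint shown on the slide.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-tolerant-pod
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: value          # must match the taint's value when operator is Equal
    effect: NoSchedule
  containers:
  - name: app
    image: registry.example.com/my-gpu-app:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```

A toleration only permits scheduling onto the tainted nodes; it does not force it. Pods without the toleration are kept off those nodes, which is what makes this the "hard" partitioning option.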
OpenShift + NVIDIA Device Plugin on DGX
Diagram components:
● Pods: NGC-gpu-pod-1, NGC-gpu-pod-2, NGC-gpu-pod-3, nvidia-device-plugin
● OpenShift Container Platform
● Linux Container Runtime, nvidia-container-runtime-hook, libnvidia-container
● NVIDIA Driver
● Red Hat Enterprise Linux
OpenShift + NVIDIA Device Plugin on DGX
Diagram components: oc create, Kube Scheduler, Kubelet, Device Plugin (daemonset), 8 × Volta GPU, Benchmark (pod)
Benchmark (pod)
  resources:
    limits:
      nvidia.com/gpu: 8
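For context, the `resources` fragment from the slide sits inside a container spec. A complete manifest might look roughly like the sketch below, where the pod name and image are hypothetical placeholders rather than the benchmark used in the talk.

```yaml
# Hypothetical full manifest around the slide's resources fragment.
apiVersion: v1
kind: Pod
metadata:
  name: benchmark
spec:
  restartPolicy: Never
  containers:
  - name: benchmark
    image: registry.example.com/gpu-benchmark:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 8   # all eight GPUs of a DGX-1, advertised by the device plugin
```

Saved as, say, `benchmark-pod.yaml`, it could be submitted with `oc create -f benchmark-pod.yaml`. The pod stays Pending until a node advertises eight free `nvidia.com/gpu` resources.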
Demo
1. Log in to the OpenShift web console and land at the Service Catalog
2. Verify the NVIDIA device-plugin daemonset is running in the kube-system namespace
3. Show how you can get a console in any running container
4. Change to the nvidia namespace, and filter the catalog to only show NGC templates
5. Start a TensorRT Inference Server that uses 4 of the 8 GPUs in the DGX
6. Show the logs of the TensorRT pod: it is consuming 4 GPUs and the model server is ready (curl output)
7. Go back to the Service Catalog and again filter by NGC images
8. Start an NGC Caffe framework pod, and configure it to use the remaining 4 GPUs
9. Show the logs of the Caffe pod, show nvidia-smi, and show that this pod can access the inference server via curl
NVIDIA Driver Packaging
Red Hat/NVIDIA Expanded Collaboration
● Driver Packaging
● Expanded DGX Testing
● Monitoring
● Heterogeneous Clusters
○ Resource API
● Topology Awareness
● Resource Quota API
References
● radanalytics templates for ML-workflow on OpenShift
● How to use GPUs with DevicePlugin in OpenShift 3.10
● Machine-Learning OpenShift Commons
THANK YOU
