bit.ly/kubemaster1
GPU Enablement for Data Science on OpenShift
Pete MacKinnon
Red Hat AI Center of Excellence
@pdmackinnon
● pmackinn@redhat.com
● Principal Engineer in the Red Hat AI Center of Excellence
● Kubeflow committer since project formation
● Open Data Hub and NVIDIA GPU Operator contributor
● KubeCon, TensorFlow World, GTC, ODSC, OpenShift Commons, and SCaLE 17x presenter
● Technical Editor for upcoming Kubeflow publication
● Co-author of “Linux Unleashed”
● Thirty years of distributed computing consulting and engineering experience
Agenda
• Data science: data and models
• AI/ML lifecycle: training to inference
• Scalars, vectors, and tensors
• CPU and GPU
• Notebooks and frameworks
• The OpenShift GPU operator “family”
• The components of GPU enablement
• Installation and demo
Data
Models
The AI/ML lifecycle
[Diagram: Data, Training, and Inference/Serving stages, linked by Model and by Data and Model in Production]
• Data collection, feature extraction, labeling
• Analysis, transformation, validation, splitting
• Algorithm selection or development, hyperparameter tuning, model validation
• Monitoring, logging
Scalars, vectors, and tensors
Scalar - a real number with magnitude that measures something: volume, density, speed, energy, mass, time, etc.
Vector - a one-dimensional array of scalars: force, velocity, momentum, etc.
Tensor - a higher-order algebraic object that could be a scalar, a vector, a multidimensional array, a multilinear map, etc.
Modern CPUs have advanced instruction sets for vector algebra, but modern GPUs are built specifically to perform complex tensor operations with a high degree of parallelism.
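As a rough illustration of the definitions above, the sketch below treats nested Python lists as stand-ins for scalars, vectors, and tensors and reports the rank and shape of each. The helper name `shape` and the sample values are hypothetical. Real frameworks use dedicated array types instead of lists.

```python
def shape(x):
    """Return the shape of a non-empty, rectangular nested list as a tuple.

    A bare number has shape (), i.e. rank 0.
    """
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]  # assumes every level is non-empty and rectangular
    return tuple(dims)


speed = 12.5                   # scalar: rank 0
velocity = [3.0, 4.0, 0.0]     # vector: rank 1, three components
# rank-3 tensor: 2 images x 4 pixels x 3 color channels (made-up sizes)
image_batch = [[[0.0] * 3 for _ in range(4)] for _ in range(2)]

for name, value in [("speed", speed), ("velocity", velocity), ("image_batch", image_batch)]:
    s = shape(value)
    print(f"{name}: rank {len(s)}, shape {s}")
```

The rank is simply the number of axes, which is why a scalar and a vector can both be regarded as low-rank tensors.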
Scalars, vectors, and tensors
How many matrix multiplications can be done in one clock cycle?
Image: https://iq.opengenus.org/
10¹ 10⁴ 10⁵
So, in one clock cycle...
CPU (scalar)
CPU/GPU (vector)
GPU (tensor)
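To make the scalar, vector, and tensor comparison concrete, here is a minimal sketch that counts the scalar multiply-adds in a naive 4 x 4 matrix product and then estimates how many cycles that work would take at different levels of parallelism. The lane counts (1, 4, 64) are illustrative assumptions, not figures from the slide or from any vendor specification.

```python
def matmul_count(a, b):
    """Naive matrix product that also counts scalar multiply-adds."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    macs = 0
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
                macs += 1
    return out, macs


def cycles(macs, macs_per_cycle):
    """Cycles needed if the hardware retires macs_per_cycle multiply-adds per cycle."""
    return -(-macs // macs_per_cycle)  # ceiling division


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
values = [[4 * i + j for j in range(4)] for i in range(4)]
product, macs = matmul_count(identity, values)

print("multiply-adds:", macs)
print("scalar unit (1 per cycle):", cycles(macs, 1), "cycles")
print("4-wide vector unit:", cycles(macs, 4), "cycles")
print("64-wide matrix unit:", cycles(macs, 64), "cycle")
```

The same amount of arithmetic is done in every case. Only the number of operations retired per cycle changes.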
Or, DL with real world data...
Object (scalar)
Movement (vector)
Classification, velocity, bearing, and much more (tensor)
CPU and GPU
NVIDIA Ampere A100
• 6912 FP32 CUDA Cores
• 432 Gen3 Tensor Cores
but
• FP32 -> 19.5 TFLOPS
AMD EPYC 7702 (Rome)
• 64 CPU Cores
• 128 Threads
• 2.0GHz Base Clock
• FP32 -> 1-2 TFLOPS
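The A100 figure above can be roughly reproduced from the core count. This sketch assumes one fused multiply-add (2 FLOPs) per CUDA core per cycle and a boost clock of about 1.41 GHz. Neither value appears on the slide, so treat both as assumptions. The CPU range of 1-2 TFLOPS is taken from the slide as-is.

```python
def peak_tflops(cores, flops_per_core_per_cycle, clock_ghz):
    """Theoretical peak throughput in TFLOPS: cores x FLOPs/cycle x clock."""
    return cores * flops_per_core_per_cycle * clock_ghz / 1000.0


# 6912 FP32 CUDA cores is from the slide; 2 FLOPs/cycle and 1.41 GHz are assumptions.
a100_fp32 = peak_tflops(cores=6912, flops_per_core_per_cycle=2, clock_ghz=1.41)
print(f"A100 FP32 peak: {a100_fp32:.1f} TFLOPS")

# CPU range as quoted on the slide (1-2 TFLOPS for the EPYC 7702).
cpu_low, cpu_high = 1.0, 2.0
print(f"GPU/CPU ratio: {a100_fp32 / cpu_high:.0f}x to {a100_fp32 / cpu_low:.0f}x")
```

These are theoretical peaks. Measured speedups depend on the workload, the precision used, and how well the work maps onto the hardware.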
A GPU notebook
Profit
380x speedup over CPU in a basic CNN smoke test
(Intel Xeon E5-2686 vs. NVIDIA V100-SXM2-16GB)
GPU operators
Special Resource Operator (SRO)
● Community operator
● Reference implementation for other specialized hardware
○ NIC, FPGA
● Provided the code basis for the NVIDIA GPU Operator
● Deployed from OperatorHub
NVIDIA GPU Operator
● Certified and supported on OpenShift by NVIDIA and Red Hat
● Can be deployed from embedded OperatorHub or with Helm
Both operators require Node Feature Discovery (NFD)
NVIDIA also provides the GPU feature operator for enhanced labeling
Operator components
• Container-runtime-toolkit: The NVIDIA GPU Operator supports the Docker and CRI-O container runtimes. This daemonset ensures the correct runtime setup for the GPU hook.
• Driver: A container deployed as a daemonset that holds all userspace and kernelspace software needed to make the GPU device work.
• Device plugin: A daemonset that monitors the health and availability of the GPU on the node. Vital for pod scheduling.
• DCGM: Data Center GPU Manager - a node exporter that captures GPU metrics for use by Prometheus.
nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: "true"
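The nodeSelector above relies on the label that NFD sets when it finds a PCI device with vendor ID 10de (NVIDIA). As a minimal sketch, the snippet below builds a pod manifest that combines that selector with a request for the `nvidia.com/gpu` resource advertised by the device plugin, and mimics the label match the scheduler performs. The pod name, container name, image, and the `node_matches` helper are hypothetical.

```python
import json

NVIDIA_PCI_LABEL = "feature.node.kubernetes.io/pci-10de.present"


def node_matches(selector, node_labels):
    """True if every key/value pair in the selector is present in the node's labels."""
    return all(node_labels.get(key) == value for key, value in selector.items())


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-smoke-test"},  # hypothetical name
    "spec": {
        "restartPolicy": "Never",
        "nodeSelector": {NVIDIA_PCI_LABEL: "true"},
        "containers": [
            {
                "name": "cuda",  # hypothetical container name
                "image": "example.com/cuda-sample:latest",  # hypothetical image
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ],
    },
}

# Made-up label sets for a GPU node and a CPU-only node.
gpu_node = {NVIDIA_PCI_LABEL: "true", "kubernetes.io/os": "linux"}
cpu_node = {"kubernetes.io/os": "linux"}

selector = pod["spec"]["nodeSelector"]
print("GPU node eligible:", node_matches(selector, gpu_node))
print("CPU node eligible:", node_matches(selector, cpu_node))
print(json.dumps(pod, indent=2))
```

The printed JSON could be saved to a file and applied with `oc apply -f` or `kubectl apply -f`, since both accept JSON manifests as well as YAML.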
Installation
Demo
Thank You

GPU enablement for data science on OpenShift | DevNation Tech Talk
