©2018 VMware, Inc.
Accelerating & Optimizing HPC/ML on vSphere Leveraging NVIDIA GPU
Mohan Potheri, VMware, Inc
Justin Murray, VMware, Inc
Agenda
New Demands on IT
VMware Goal and Approach
Why Virtualize AI & ML
Machine Learning Landscape
Maximizing GPU Utilization
Extending GPU Sharing to Containers
Summary
New Demands on IT Infrastructure
[Figure: layered view of new demands on IT. Growth of apps: business-critical apps, desktop virtualization, graphics-intensive, cloud-native apps, SaaS, mobile, edge/IoT, analytics/AI/ML, custom/other. Infrastructure: hybrid cloud, public cloud, global infra and edge. Security. Specialized hardware: x86, GPU, FPGA, NVM, PMEM, SGX, QAT, IPU.]
Our Goal and Approach
• Increase agility and decrease time to discovery for researchers, data scientists, and
engineers
• Provide IT with the ability to efficiently provision, allocate, manage and ensure
compliance of research compute infrastructure across an increasingly broad range of
technical and business requirements
• By leveraging VMware’s proven, enterprise-class virtualization and cloud technologies to
meet the performance requirements of research computing, HPC, and ML workloads, and
• Bringing novel capabilities to bear that are not available in traditional
HPC/ML environments
Why Virtualize HPC AI/ML Infrastructure
vSphere can help data scientists get to answers faster
Operational Flexibility
• Simple cluster expansion and contraction
• Rapidly reproduce research environments
• Higher resiliency and less downtime with vMotion
• Fault isolation (hardware and software)
Reduced Complexity
• Cluster resource sharing
• Minimize setup and configuration time with centralized management capabilities
• Simultaneously support mixed software environments
• Industry-leading virtualization platform that your IT already knows
Secure Sensitive Workloads
• Easy, secure data access and sharing
• Security isolation
• Multi-tenant data security
Dispelling the Misunderstanding about GPUs on vSphere
• Hypervisor is not an intermediary when accessing the GPU
• GPU access is
  • Directly via passthrough to VM, or
  • NVIDIA GRID vGPU
• Near-zero performance impact
Machine Learning Infrastructure Landscape
[Figure: machine learning landscape. Labels: Big Data, Data Analytics, Machine Learning, Deep Learning, VDI; on-prem and off-prem; training data; inference; edge or IoT.]
Two Main Phases in ML
• Training / Model Building
  • Often very large data sets
  • Compute, storage, and network intensive
  • Server-class infrastructure
• Inference / Scoring
  • Apply existing models to new data
  • Used for prediction
  • Edge or core infrastructure
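The two phases can be illustrated with a deliberately tiny example. This is a minimal sketch, assuming a one-variable linear model fitted by ordinary least squares in plain Python; the function names `train` and `infer` and the data are illustrative and do not come from the slides. Training scans the whole data set to build the model, while inference only applies the stored parameters to new data.

```python
from typing import List, Tuple


def train(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Training / model building: fit y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def infer(model: Tuple[float, float], x: float) -> float:
    """Inference / scoring: apply the existing model to a new data point."""
    slope, intercept = model
    return slope * x + intercept


# Illustrative data only: points on the line y = 2x + 1.
model = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
prediction = infer(model, 4.0)
```

Real training jobs differ mainly in scale: the loop over the data set is what makes training compute, storage, and network intensive, whereas `infer` is cheap enough to run at the edge.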
Using GPUs with vSphere
VM DirectPath I/O for NVIDIA GPU
A Virtualized GPU: Passthrough on vSphere 6.5/6.7
[Figure: an ESXi host with a GPU and two VMs; the GPU is passed through to a VM whose software stack is Linux, the CUDA library and driver, and TensorFlow.]
GPU Acceleration on vSphere with DirectPath I/O
• Can provision VMs with one or more GPUs
• Easily reuse GPU infrastructure
• Same behavior as public cloud GPU instances
• Benefits:
  • HW isolation
  • Workload isolation
  • VM-level quality of service
  • Fast environment provisioning
  • Near bare-metal performance
• Passthrough device certification for vSphere not required
  • Server must be compatible with device as published by server OEM and GPU vendor
  • Server must be vSphere certified
• Caveats:
  • No vMotion
  • No Suspend and Resume
  • No DRS
  • No vSphere HA
[Figure: a VM with GPUs attached through DirectPath I/O, each GPU paired with an app.]
VM DirectPath I/O – Multiple GPUs Attached to a Virtual Machine
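One way to confirm from inside the guest that the passed-through GPUs are visible is to query `nvidia-smi`. The following is a minimal sketch, assuming a Linux guest with the NVIDIA driver installed; the helper names and the sample text are illustrative, and the parser expects the CSV layout that the query below requests.

```python
import subprocess
from typing import Callable, List, Tuple

# Ask nvidia-smi for one CSV row per GPU: index, product name, total memory in MiB.
QUERY = ["nvidia-smi", "--query-gpu=index,name,memory.total",
         "--format=csv,noheader,nounits"]


def run_nvidia_smi() -> str:
    """Run nvidia-smi in the guest; fails if the driver or tool is missing."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def parse_gpus(csv_text: str) -> List[Tuple[int, str, int]]:
    """Parse 'index, name, memory.total' rows into (index, name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, mem_mib = [field.strip() for field in line.split(",")]
        gpus.append((int(index), name, int(mem_mib)))
    return gpus


def list_gpus(runner: Callable[[], str] = run_nvidia_smi) -> List[Tuple[int, str, int]]:
    """List visible GPUs; the runner can be swapped out for testing."""
    return parse_gpus(runner())


# Illustrative text in the shape this query prints (values are made up).
sample = "0, Tesla P100-PCIE-16GB, 16280\n1, Tesla P100-PCIE-16GB, 16280\n"
gpus = parse_gpus(sample)
```

On a VM with GPUs attached, calling `list_gpus()` with no arguments runs the real query; the number of rows returned should match the number of passthrough devices added to the VM.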
vSphere GPU Sharing Mechanisms
Using GPUs with vSphere
VMware vSphere 6.7 and NVIDIA Quadro vDWS (GRID 7.0)
• Share a single GPU among multiple VMs
• Provision VMs with anywhere from a partial GPU up to one full GPU
• GRID vGPU VM Suspend and Resume support
• Quickly repurpose GPU infrastructure
  • VDI or data science by day
  • Compute (ML) by night
• Benefits:
  • HW isolation
  • Workload isolation
  • VM-level quality of service
  • GPU quality of service
  • Fast environment provisioning
  • Bare-metal comparable performance
[Figure: GPU and App blocks illustrating a GPU shared among multiple VMs.]
NVIDIA GRID – Two Layers of Software/Drivers
NVIDIA GRID Configuration – Choosing the vGPU Profile
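The vGPU profile fixes how much GPU memory each VM receives, which in turn bounds how many VMs can share one physical GPU. The arithmetic can be sketched as follows, assuming every VM on a given physical GPU uses the same profile size; the 24 GB card and the list of profile sizes are hypothetical placeholders, not values taken from the slides.

```python
from typing import List


def smallest_fitting_profile(required_gb: float, profiles_gb: List[int]) -> int:
    """Pick the smallest profile whose frame buffer covers the workload's need."""
    candidates = [p for p in sorted(profiles_gb) if p >= required_gb]
    if not candidates:
        raise ValueError("no profile is large enough for this workload")
    return candidates[0]


def vms_per_gpu(gpu_memory_gb: int, profile_memory_gb: int) -> int:
    """Number of same-profile VMs that fit on one physical GPU."""
    if profile_memory_gb <= 0 or profile_memory_gb > gpu_memory_gb:
        raise ValueError("profile must be positive and no larger than the GPU")
    return gpu_memory_gb // profile_memory_gb


# Hypothetical numbers for illustration only.
GPU_MEMORY_GB = 24
PROFILES_GB = [1, 2, 3, 4, 6, 8, 12, 24]

profile = smallest_fitting_profile(5, PROFILES_GB)
density = vms_per_gpu(GPU_MEMORY_GB, profile)
```

With these placeholder values a workload needing 5 GB lands on the 6 GB profile, and four such VMs share the card. A larger profile gives each VM more memory at the cost of lower density.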
Using GPUs with vSphere
Bitfusion Enables Remote GPU Sharing
• Dynamic GPU attach anywhere
• Fractional GPUs for efficiency
• Application run-time virtualization
• Standards-based GPU
[Figure: Bitfusion (BF) client VMs on ESX hosts access a vSphere GPU cluster of BF server VMs, each on an ESX host with GPU passthrough.]
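Fractional allocation raises utilization because several small requests can be packed onto one physical GPU. The sketch below illustrates the idea with a simple first-fit placement. It is an assumption-laden illustration and not Bitfusion's actual scheduler; the function name and the request sizes are hypothetical.

```python
from fractions import Fraction
from typing import List


def first_fit(requests: List[Fraction], gpu_count: int) -> List[int]:
    """Place each fractional request on the first GPU with enough free capacity.

    Returns the GPU index chosen for each request, or -1 if none can host it.
    """
    free = [Fraction(1)] * gpu_count
    placement = []
    for req in requests:
        for gpu, capacity in enumerate(free):
            if req <= capacity:
                free[gpu] -= req
                placement.append(gpu)
                break
        else:
            # No GPU had enough free capacity for this request.
            placement.append(-1)
    return placement


# Hypothetical requests: two half GPUs, two quarter GPUs, and one full GPU.
requests = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 2), Fraction(1, 4), Fraction(1)]
placement = first_fit(requests, gpu_count=2)
```

Here four of the five requests fit on two GPUs, while the final full-GPU request has to wait. Exact fractions are used instead of floats so that capacity checks do not suffer from rounding.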
Maximize GPU Utilization
vSphere 6.7 GPU Virtual Machine Suspend and Resume
Source: Enhancing Operations for NVIDIA Grid
Video Demo: https://youtu.be/PwVReRauY50
Blog Article: https://blogs.vmware.com/vsphere/2018/07/vsphere-6-7-suspend-and-resume-of-gpu-attached-virtual-machines.html
Deep Learning Virtualization Use Case: Cycle Harvesting
Challenge:
Data scientists submit jobs in traditional batches because of limited compute availability
• Submit jobs one day
• Wait until the next day for the job results
What if…
The VDI environment has unused cycles. Could HPC jobs run in that environment when it is not needed for VDI?
Will it blend?
Outcome and benefit:
• Go beyond traditional batch processing to viewing HPC resources as an engine for returning results in real time.
• Enable HPC compute jobs to harvest cycles from a VDI compute environment.
Cycle Harvesting
[Figure: three VMware ESXi hosts; resource share values (100 or 1) shown over the time of day: 8AM, Noon, 5PM, 10PM.]
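The share values of 100 and 1 in the figure suggest proportional-share scheduling. The following is a minimal sketch, assuming that under contention each workload is entitled to resources in proportion to its share value. Which workload holds which value at which time of day is not recoverable from the slide text, so the day and night schedule below is an assumption, as are all names.

```python
from typing import Dict


def entitlement(shares: Dict[str, int]) -> Dict[str, float]:
    """Fraction of a contended resource each workload is entitled to."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# Assumed schedule (not from the slides): VDI is favored during the working
# day, and HPC/ML jobs are favored at night.
SCHEDULE = {
    "day": {"vdi": 100, "hpc": 1},
    "night": {"vdi": 1, "hpc": 100},
}

day = entitlement(SCHEDULE["day"])
night = entitlement(SCHEDULE["night"])
```

Shares only matter when workloads compete. During the day an HPC job with a share of 1 can still consume cycles the desktops leave idle, which is the essence of cycle harvesting.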
Cycle Harvesting Case Study
https://bit.ly/2MrBngH
Extending GPGPU Sharing to Containers
Why Singularity Containers?
• Docker is not designed for HPC architectures
• Singularity is the best-suited container solution for HPC:
  • A Singularity container is encapsulated in a single file, making it highly portable and secure
  • Singularity is designed from the ground up for scientific computing
Combining Virtual Machines & Containers for GPU sharing
• Sharing GPUs in a container is difficult as there is no resource management
• A vSphere VM with NVIDIA GRID or Bitfusion can use a whole or partial GPU
• Containers are a great packaging mechanism for applications
• By enclosing one container per virtual machine, we get the best of both worlds
• GPU resources can be shared with other containers
• Machine and Deep Learning applications & platforms can be packaged and distributed effectively as a
container
Logical Schematic of Infrastructure components
• One Singularity Container per VM
• Containers leverage partial or full GPUs allocated
to the virtual machine
• Container packaged with TensorFlow, tools, etc.
• Bitfusion provides GPU sharing
[Figure: a vSphere generic cluster of virtual machines, each running one Singularity container on an ESX host, connected to a vSphere GPU cluster of Bitfusion (BF) server VMs on ESX hosts with GPU passthrough.]
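Inside each VM, the container needs access to the NVIDIA driver that the guest OS provides. A minimal sketch of launching a GPU workload in a Singularity container follows, assuming Singularity and the NVIDIA driver are installed in the guest; the image name `tensorflow.sif` and the script `train.py` are hypothetical. The `--nv` flag is Singularity's option for exposing the NVIDIA libraries and devices to the container.

```python
import shlex
from typing import List


def singularity_gpu_command(image: str, program: List[str]) -> List[str]:
    """Build a `singularity exec --nv` command line for a GPU workload."""
    return ["singularity", "exec", "--nv", image] + list(program)


# Hypothetical image and training script.
cmd = singularity_gpu_command("tensorflow.sif", ["python", "train.py"])
command_line = shlex.join(cmd)
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` inside the VM. Because the container is a single file, the same image can be copied to every VM in the cluster unchanged.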
Images/sec Throughput comparison for 1 GPU
2.5-3X more throughput with sharing
[Figure: bar chart, "Throughput comparison with and without GPU sharing". Y-axis: throughput ratios (0.00 to 3.50); x-axis: Resnet50, Alexnet, Inception3; series: total throughput, baseline no sharing.]
Runtime comparison for 1 GPU (with/without sharing)
[Figure: bar chart, "Runtime comparison for 1 GPU with and without sharing". Categories: Runtime (%), Average Run Time (Seconds); scale 0.00 to 200.00; series: Unshared, Shared; annotation: 17%.]
Only 17% slower for nearly 3X throughput
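The two headline metrics can be computed from raw measurements as sketched below. This is one plausible reading of the charts, assuming the throughput ratio is the summed images/sec of all jobs sharing the GPU divided by the unshared baseline, and the slowdown is the relative increase in average run time. The numbers are placeholders chosen for illustration and are not the measurements behind the slides.

```python
from typing import List


def throughput_ratio(shared_images_per_sec: List[float],
                     baseline_images_per_sec: float) -> float:
    """Aggregate throughput of all sharing jobs relative to one unshared job."""
    return sum(shared_images_per_sec) / baseline_images_per_sec


def slowdown_percent(shared_runtime_s: float, unshared_runtime_s: float) -> float:
    """How much longer a job runs when sharing, as a percentage."""
    return (shared_runtime_s / unshared_runtime_s - 1.0) * 100.0


# Placeholder measurements (not taken from the slides).
ratio = throughput_ratio([90.0, 90.0, 90.0], baseline_images_per_sec=100.0)
slower = slowdown_percent(shared_runtime_s=117.0, unshared_runtime_s=100.0)
```

The trade-off the slide describes appears whenever each sharing job loses only a little speed: per-job run time rises modestly while the GPU as a whole completes several times more work.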
Summary
• Sharing is key to enabling cloud-like capabilities on premises
• vSphere is the best platform to leverage the latest high-performance hardware
• Virtualization supports device sharing and delivers near bare-metal performance
• HW sharing through vSphere can increase utilization (cycle harvesting)

Accelerating & Optimizing Machine Learning on VMware vSphere leveraging NVIDIA GPUs

  • 1. 1©2018 VMware, Inc. Accelerating & Optimizing HPC/ML on vSphere Leveraging NVIDIA GPU Mohan Potheri, VMware, Inc Justin Murray, VMware, Inc
  • 2. Agenda 2©2018 VMware, Inc. New Demands on IT VMware Goal and Approach Why Virtualize AI & ML Machine Learning Landscape Maximizing GPU Utilization Extending GPU Sharing to Containers Summary
  • 3. 3©2018 VMware, Inc. New Demands on IT Infrastructure X86 SGXGPU NVM FPGAQAT IPU Specialized Hardware Security Hybrid Cloud Public Cloud Global Infra and Edge Growth of Apps Business Critical Apps Desktop Virtualization Graphic Intensive Cloud-Native Apps Edge/IOTSaaSMobile Custom/OtherAnalytics/ AI/ML PMEM
  • 4. Our Goal and Approach • Increase agility and decrease time to discovery for researchers, data scientists, and engineers • Provide IT with the ability to efficiently provision, allocate, manage and ensure compliance of research compute infrastructure across an increasingly broad range of technical and business requirements • By leveraging VMware’s proven, enterprise-class virtualization and cloud technologies to meet the performance requirements of research computing, HPC, and ML workloads, and • Bringing novel capabilities to bear to enable new capabilities not available in traditional HPC/ML environments
  • 5. 5©2018 VMware, Inc. • Simple cluster expansion and contraction • Rapidly reproduce research environments • Higher resiliency and less downtime with vMotion • Fault-isolation (hardware and software)  Cluster resource-sharing  Minimize setup and configuration time with centralized management capabilities  Simultaneously support mixed software environments  Industry-leading virtualization platform that your IT already knows • Easy, secure data access and sharing • Security Isolation • Multi-tenant data security Why Virtualize HPC AI/ML Infrastructure vSphere can help data scientists get to answers faster Operational Flexibility Reduced Complexity Secure Sensitive Workloads
  • 6. 6©2018 VMware, Inc. Dispelling the Misunderstanding about GPUs on vSphere • Hypervisor is not an intermediary when accessing the GPU • GPU access is • Directly via passthrough to VM or • NVIDIA Grid vGPU • Near Zero performance impact
  • 7. 7©2018 VMware, Inc. Machine Learning Deep LearningBig Data Edge or IoT ON-PREM OFF-PREM trainingdata inference inference Machine Learning Infrastructure Landscape Data Analytics Two Main Phases in ML • Training / Model Building • Often very large data sets • Compute, storage, and network intensive • Server-class infrastructure • Inference / Scoring • Apply existing models to new data • Used for prediction • Edge or core infrastructure V D I
  • 8. 8©2018 VMware, Inc. Using GPUs with vSphere
  • 9. 9©2018 VMware, Inc. VM Direct Path I/O for NVIDIA GPU
  • 10. 10©2018 VMware, Inc. A Virtualized GPU PassThrough v Sphere 6.5/6.7 ESXi Host GPU VM VM Linux CUDA Library & Driver TensorFlow
  • 11. GPU Acceleration on vSphere with DirectPath I/O
      • Can provision VMs with one or more GPUs
      • Easily reuse GPU infrastructure
      • Same behavior as public cloud GPU instances
      • Benefits:
          • HW isolation
          • Workload isolation
          • VM-level quality of service
          • Fast environment provisioning
          • Near bare-metal performance
      • Passthrough device certification for vSphere is not required
      • Server must be compatible with the device, as published by the server OEM and GPU vendor
      • Server must be vSphere certified
      • Caveats:
          • No vMotion
          • No Suspend and Resume
          • No DRS
          • No vSphere HA
      • [Diagram: a VM running multiple GPU apps]
  • 12. VM DirectPath I/O – Multiple GPUs Attached to a Virtual Machine
  • 13. vSphere GPU Sharing Mechanisms
  • 14. Using GPUs with vSphere
  • 15. VMware vSphere 6.7 and NVIDIA Quadro vDWS (GRID 7.0)
      • Share a single GPU among multiple VMs
      • Provision VMs with anything from a partial GPU up to one full GPU
      • GRID vGPU VM Suspend and Resume support
      • Quickly repurpose GPU infrastructure
          • VDI or data science by day
          • Compute (ML) by night
      • Benefits:
          • HW isolation
          • Workload isolation
          • VM-level quality of service
          • GPU quality of service
          • Fast environment provisioning
          • Bare-metal comparable performance
      • [Diagram: multiple GPU apps]
  • 16. NVIDIA GRID – Two Layers of Software/Drivers
  • 17. NVIDIA GRID Configuration – Choosing the vGPU Profile
  • 18. Using GPUs with vSphere
  • 19. Bitfusion Enables Remote GPU Sharing
      • Dynamic GPU attach anywhere
      • Fractional GPUs for efficiency
      • Application run time virtualization
      • Standard based GPU
      • [Diagram: BF Client VMs on ESX hosts connect to a vSphere GPU cluster of BF Server VMs, each on an ESX host with GPU passthrough]
  • 21. vSphere 6.7 GPU Virtual Machine Suspend and Resume
      • Source: Enhancing Operations for NVIDIA GRID
      • Video demo: https://youtu.be/PwVReRauY50
      • Blog article: https://blogs.vmware.com/vsphere/2018/07/vsphere-6-7-suspend-and-resume-of-gpu-attached-virtual-machines.html
  • 22. Deep Learning Virtualization Use Case: Cycle Harvesting
      • Challenge: data scientists submit jobs in traditional batches because of compute availability
          • Submit jobs one day
          • Wait until the next day for the job results
      • What if… The VDI environment has unused cycles. Could HPC jobs run in that environment when it is not needed for VDI? Will it blend?
      • Outcome and benefit:
          • Go beyond traditional batch processing to viewing HPC resources as an engine for returning results in real time
          • Enable HPC compute jobs to harvest cycles from a VDI compute environment
  • 23. Cycle Harvesting [Diagram: three VMware ESXi hosts; chart of VM share values (100 or 1) against time of day: 8AM, Noon, 5PM, 10PM]
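The share values on this slide can be read as proportional weights: when VMs compete for the same host, each active VM is entitled to capacity in proportion to its shares. The following is a minimal sketch of that arithmetic, not the scheduler itself. The VM names and the mapping of 100 shares to VDI and 1 share to the HPC job are assumptions made for illustration; the slide only shows the values 100 and 1.

```python
def contended_fractions(shares, active):
    """Split contended capacity among the active VMs in proportion to their shares.

    shares: dict mapping VM name -> share value
    active: iterable of VM names currently demanding the resource
    """
    demand = {vm: shares[vm] for vm in active}
    total = sum(demand.values())
    if total == 0:
        return {}
    return {vm: value / total for vm, value in demand.items()}


# Hypothetical assignment: VDI desktops get 100 shares, the HPC job gets 1.
shares = {"vdi-desktop": 100, "hpc-job": 1}

# Daytime: both are busy, so the HPC job is entitled to only 1/101 of capacity.
day = contended_fractions(shares, ["vdi-desktop", "hpc-job"])

# Night: the desktops are idle, so the HPC job can harvest the whole host.
night = contended_fractions(shares, ["hpc-job"])

print(day)
print(night)
```

Under this reading, the low share value keeps the HPC job from disturbing desktops during business hours while still letting it use whatever the desktops leave idle.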
  • 24. Cycle Harvesting Case Study https://bit.ly/2MrBngH
  • 25. Extending GPGPU Sharing to Containers
  • 26. Why Singularity Containers?
      • Docker is not designed for HPC architectures
      • Singularity is the best suited container solution for HPC:
          • A Singularity container is encapsulated in a single file, making it highly portable and secure
          • Singularity is designed from the ground up for scientific computing
  • 27. Combining Virtual Machines & Containers for GPU Sharing
      • Sharing GPUs in a container is difficult as there is no resource management
      • A vSphere VM with NVIDIA GRID or Bitfusion can use a whole or partial GPU
      • Containers are a great packaging mechanism for applications
      • By enclosing one container per virtual machine, we get the best of both worlds
      • GPU resources can be shared with other containers
      • Machine and deep learning applications and platforms can be packaged and distributed effectively as a container
  • 28. Logical Schematic of Infrastructure Components
      • One Singularity container per VM
      • Containers leverage partial or full GPUs allocated to the virtual machine
      • Container packaged with TensorFlow, tools, etc.
      • Bitfusion provides GPU sharing
      • [Diagram: a vSphere generic cluster of virtual machines, each on an ESX host and running one Singularity container, connected to a vSphere GPU cluster of BF Server VMs, each on an ESX host with GPU passthrough]
  • 29. Images/sec throughput comparison for 1 GPU: 2.5-3X more throughput with sharing [Chart: "Throughput comparison with and without GPU sharing"; throughput ratios on a 0.00 to 3.50 axis for Resnet50, Alexnet, and Inception3; series: total throughput, baseline no sharing]
  • 30. Runtime comparison for 1 GPU (with/without sharing): only 17% slower for nearly 3X throughput [Chart: "Runtime comparison for 1 GPU with and without sharing"; average run time in seconds on a 0 to 200 axis; series: unshared, shared]
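The two headline figures on slides 29 and 30 are simple ratios: aggregate images/sec with sharing divided by the unshared baseline, and the relative increase in average run time. A minimal sketch of that arithmetic follows. The input numbers are hypothetical placeholders chosen only to land near the quoted "nearly 3X" and "17%"; they are not the measured values behind the charts.

```python
def sharing_metrics(baseline_ips, shared_total_ips, unshared_runtime_s, shared_runtime_s):
    """Return (throughput ratio, run time slowdown in percent) for a shared GPU.

    baseline_ips:        images/sec of one job with the GPU to itself
    shared_total_ips:    combined images/sec of all jobs sharing the GPU
    unshared_runtime_s:  average run time of a job without sharing
    shared_runtime_s:    average run time of a job with sharing
    """
    throughput_ratio = shared_total_ips / baseline_ips
    slowdown_pct = (shared_runtime_s / unshared_runtime_s - 1.0) * 100.0
    return throughput_ratio, slowdown_pct


# Hypothetical measurements, for illustration only.
ratio, slowdown = sharing_metrics(
    baseline_ips=100.0,
    shared_total_ips=290.0,
    unshared_runtime_s=150.0,
    shared_runtime_s=175.5,
)
print(f"{ratio:.2f}x throughput, {slowdown:.0f}% slower")
```

Reading the two numbers together is the point of the slide: a modest per-job slowdown is traded for a much larger gain in total work completed per GPU.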
  • 31. Summary
      • Sharing is key to enabling cloud-like capabilities on premises
      • vSphere is the best platform to leverage the latest high-performance hardware
      • Virtualization supports device sharing and delivers near bare-metal performance
      • HW sharing through vSphere can increase utilization (cycle harvesting)