Increasing cluster performance by combining rCUDA with Slurm
Federico Silla
Technical University of Valencia, Spain
HPC Advisory Council Switzerland Conference 2016 2/56
Outline
rCUDA … what’s that?
HPC Advisory Council Switzerland Conference 2016 3/56
Basics of CUDA
GPU
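For context, a CUDA application allocates memory on the GPU, copies data across the PCIe bus, launches kernels, and copies results back. A minimal sketch of that call pattern (hypothetical kernel and sizes, not taken from the slides):

    #include <cuda_runtime.h>
    #include <cstdlib>

    __global__ void scale(float *v, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= a;                        // each thread scales one element
    }

    int main() {
        const int n = 1 << 20;
        float *h = (float *)malloc(n * sizeof(float));
        float *d;
        cudaMalloc(&d, n * sizeof(float));           // allocate device memory
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // H2D copy over PCIe
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                  // kernel launch
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // D2H copy over PCIe
        cudaFree(d); free(h);
        return 0;
    }

Every one of these runtime calls is exactly what rCUDA later forwards to a remote GPU server.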
HPC Advisory Council Switzerland Conference 2016 4/56
rCUDA … remote CUDA
No GPU
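In practice the rCUDA client is a library that stands in for the CUDA runtime on the GPU-less node, so unmodified binaries can be pointed at a remote GPU through environment variables. A sketch of a typical launch, assuming a server host named gpuserver (host and application names are illustrative; check the rCUDA user guide for the exact variable names of your release):

    export LD_LIBRARY_PATH=$RCUDA_HOME/lib:$LD_LIBRARY_PATH   # pick up the rCUDA client library
    export RCUDA_DEVICE_COUNT=1                               # number of remote GPUs to expose
    export RCUDA_DEVICE_0=gpuserver:0                         # use GPU 0 of host "gpuserver"
    ./my_cuda_app                                             # unmodified CUDA binary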
HPC Advisory Council Switzerland Conference 2016 5/56
rCUDA … remote CUDA
A software technology that enables a more flexible use of GPUs in computing facilities
No GPU
HPC Advisory Council Switzerland Conference 2016 6/56
Basics of rCUDA
HPC Advisory Council Switzerland Conference 2016 7/56
Basics of rCUDA
HPC Advisory Council Switzerland Conference 2016 8/56
Basics of rCUDA
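These three slides build up the client-server picture: the application runs on a node without a GPU, the rCUDA client library forwards each CUDA call over the network, and a server process on a GPU node executes it and returns the results. Operationally, that server is a daemon started on every node that donates its GPUs; a sketch, using the daemon name rCUDAd from the rCUDA distribution (the path and any flags are illustrative assumptions):

    # on each GPU server node
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH  # real CUDA libraries
    rCUDAd                                                         # serve local GPUs to remote clients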
HPC Advisory Council Switzerland Conference 2016 9/56
Cluster vision with rCUDA
 rCUDA allows a new vision of a GPU deployment, moving from the usual cluster configuration, in which every node attaches its own GPUs through the local PCIe bus, to a logical configuration in which all the GPUs and their memories form a pool that every node can reach through the interconnection network via logical connections.
[Figure: physical configuration (per-node CPU, main memory, network interface, PCIe-attached GPUs) versus logical configuration (shared GPU pool); diagram omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 10/56
Outline
Two questions:
• Why should we need rCUDA?
• rCUDA … slower CUDA?
HPC Advisory Council Switzerland Conference 2016 11/56
Outline
Two questions:
• Why should we need rCUDA?
• rCUDA … slower CUDA?
HPC Advisory Council Switzerland Conference 2016 12/56
Concern with rCUDA
The main concern with rCUDA is the reduced bandwidth to the remote GPU.
[Diagram: a node with no GPU accessing a remote GPU over the network; omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 13/56
Using InfiniBand networks
HPC Advisory Council Switzerland Conference 2016 14/56
Initial transfers within rCUDA
[Figure: host-to-device (H2D) and device-to-host (D2H) bandwidth for pageable and pinned memory, comparing rCUDA over EDR and FDR InfiniBand, original vs. optimized versions; plots omitted in this extraction]
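The pageable/pinned split in these plots matters because pinned (page-locked) host buffers can be moved by DMA at full link speed, while pageable buffers need an extra staging copy, both for CUDA over PCIe and for rCUDA over InfiniBand. A minimal sketch of the two allocation paths (buffer size illustrative):

    #include <cuda_runtime.h>
    #include <cstdlib>

    int main() {
        const size_t bytes = 64 << 20;               // 64 MB test buffer
        float *d, *pageable, *pinned;
        cudaMalloc(&d, bytes);
        pageable = (float *)malloc(bytes);           // ordinary pageable host memory
        cudaMallocHost(&pinned, bytes);              // page-locked (pinned) host memory
        cudaMemcpy(d, pageable, bytes, cudaMemcpyHostToDevice);  // staged copy
        cudaMemcpy(d, pinned,   bytes, cudaMemcpyHostToDevice);  // direct DMA copy
        cudaFreeHost(pinned); free(pageable); cudaFree(d);
        return 0;
    }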
HPC Advisory Council Switzerland Conference 2016 15/56
Performance depending on network
 CUDASW++: bioinformatics software for Smith-Waterman protein database searches
[Figure: execution time (s) and rCUDA overhead (%) for protein sequence lengths from 144 to 5478, comparing native CUDA with rCUDA over FDR InfiniBand, QDR InfiniBand and GbE; plot omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 16/56
Optimized transfers within rCUDA
[Figure: H2D and D2H bandwidth for pageable and pinned memory, rCUDA over EDR and FDR InfiniBand, original vs. optimized; with the optimized transfers, pinned-memory copies reach almost 100% of the available bandwidth; plots omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 17/56
rCUDA optimizations on applications
• Several applications executed with CUDA and rCUDA
• K20 GPU and FDR InfiniBand
• K40 GPU and EDR InfiniBand
[Figure: execution-time comparison, lower is better; plot omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 18/56
Outline
Two questions:
• Why should we need rCUDA?
• rCUDA … slower CUDA?
HPC Advisory Council Switzerland Conference 2016 19/56
Outline
rCUDA improves
cluster performance
HPC Advisory Council Switzerland Conference 2016 20/56
Test bench for studying rCUDA+Slurm
 Dual socket E5-2620v2 Intel Xeon + 32GB RAM + K20 GPU
 FDR InfiniBand based cluster
[Diagram: three cluster sizes, 4+1, 8+1 and 16+1 GPU nodes, each with one additional node running the Slurm scheduler; omitted in this extraction]
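On this test bench, jobs ask Slurm for GPUs as generic resources (gres). A sketch of a submission script (job, binary and resource names are illustrative; the rCUDA-enabled Slurm described in this talk grants remote GPUs, reported in the rCUDA publications as an rgpu gres, instead of node-local ones):

    #!/bin/bash
    #SBATCH --job-name=gpu-blast
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    # with the rCUDA scheduler this becomes a remote-GPU request, e.g. --gres=rgpu:1
    #SBATCH --gres=gpu:1
    srun ./gpu_blast_run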
HPC Advisory Council Switzerland Conference 2016 21/56
Applications for studying rCUDA+Slurm
 Applications used for tests:
 GPU-Blast (21 seconds; 1 GPU; 1599 MB)
 LAMMPS (15 seconds; 4 GPUs; 876 MB)
 MCUDA-MEME (165 seconds; 4 GPUs; 151 MB)
 GROMACS (2 nodes) (167 seconds)
 NAMD (4 nodes) (11 minutes)
 BarraCUDA (10 minutes; 1 GPU; 3319 MB)
 GPU-LIBSVM (5 minutes; 1 GPU; 145 MB)
 MUMmerGPU (5 minutes; 1 GPU; 2804 MB)
The slide groups these applications by the labels Non-GPU (GROMACS and NAMD run on CPUs only), Short execution time, and Long execution time, and splits them into Set 1 and Set 2 along these lines.
 Three workloads:
 Set 1
 Set 2
 Set 1 + Set 2
HPC Advisory Council Switzerland Conference 2016 22/56
Workloads for studying rCUDA+Slurm (I)
HPC Advisory Council Switzerland Conference 2016 23/56
Performance of rCUDA+Slurm (I)
HPC Advisory Council Switzerland Conference 2016 24/56
Workloads for studying rCUDA+Slurm (II)
HPC Advisory Council Switzerland Conference 2016 25/56
Performance of rCUDA+Slurm (II)
HPC Advisory Council Switzerland Conference 2016 26/56
Outline
Why does rCUDA improve
cluster performance?
HPC Advisory Council Switzerland Conference 2016 27/56
1st reason for improved performance
[Diagram: nodes 1 to n, each with CPUs, RAM, a network interface and a PCIe-attached GPU, joined by an interconnection network; omitted in this extraction]
• Non-accelerated applications keep GPUs idle in the nodes where they use all the cores
Hybrid MPI shared-memory non-accelerated applications usually span all the cores in a node (across n nodes). A CPU-only application spreading over these nodes will therefore make their GPUs unavailable for accelerated applications.
HPC Advisory Council Switzerland Conference 2016 28/56
2nd reason for improved performance (I)
[Diagram: same cluster layout; omitted in this extraction]
• Accelerated applications keep CPUs idle in the nodes where they execute
An accelerated application using just one CPU core may prevent other jobs from being dispatched to this node.
HPC Advisory Council Switzerland Conference 2016 29/56
2nd reason for improved performance (II)
[Diagram: same cluster layout; omitted in this extraction]
• Accelerated applications keep CPUs idle in the nodes where they execute
An accelerated MPI application using just one CPU core per node may keep part of the cluster busy.
HPC Advisory Council Switzerland Conference 2016 30/56
3rd reason for improved performance
• Do applications completely squeeze the GPUs available in the cluster?
• When a GPU is assigned to an application, the computational resources inside the GPU may not be fully used:
• Application presenting a low level of parallelism
• CPU code being executed (GPU assigned ≠ GPU working)
• GPU cores stalled due to lack of data
• etc.
[Diagram: same cluster layout; omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 31/56
GPU usage of GPU-Blast
[Figure: GPU utilization trace for GPU-Blast, with intervals where the GPU is assigned but not used; plot omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 32/56
GPU usage of CUDA-MEME
GPU utilization is far from the maximum
[Figure: utilization trace; omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 33/56
GPU usage of LAMMPS
[Figure: GPU utilization trace for LAMMPS, with intervals where the GPU is assigned but not used; plot omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 34/56
GPU allocation vs GPU utilization
[Figure: GPUs assigned but not used; plot omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 35/56
Sharing a GPU among jobs: GPU-Blast
Two concurrent instances of GPU-Blast; one instance required about 51 seconds.
[Figure: utilization trace of the shared GPU; plot omitted in this extraction]
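With rCUDA, running the two instances concurrently against one remote GPU needs no application support: both processes are simply pointed at the same device. A sketch, reusing the illustrative variables from the launch example earlier (binary and input names are hypothetical):

    export RCUDA_DEVICE_COUNT=1
    export RCUDA_DEVICE_0=gpuserver:0   # both instances target the same remote GPU
    ./gpu_blast input1 &                # first instance
    ./gpu_blast input2 &                # second instance, concurrent on the shared GPU
    wait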
HPC Advisory Council Switzerland Conference 2016 36/56
Sharing a GPU among jobs: GPU-Blast
Two concurrent instances of GPU-Blast
[Figure: trace highlighting the first instance; plot omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 37/56
Sharing a GPU among jobs: GPU-Blast
Two concurrent instances of GPU-Blast
[Figure: trace highlighting the first and second instances; plot omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 38/56
Sharing a GPU among jobs
• LAMMPS: 876 MB
• mCUDA-MEME: 151 MB
• BarraCUDA: 3319 MB
• MUMmerGPU: 2104 MB
• GPU-LIBSVM: 145 MB
K20 GPU (5 GB of device memory)
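Whether jobs can share a GPU this way comes down to whether their combined memory footprints fit on the device. A sketch of the basic check (plain CUDA; the scheduler makes this decision centrally, so this is only illustrative):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        size_t freeB, totalB;
        cudaMemGetInfo(&freeB, &totalB);    // free and total device memory, in bytes
        printf("free: %zu MB of %zu MB\n", freeB >> 20, totalB >> 20);
        // a further job with footprint F can be co-scheduled while F <= freeB
        return 0;
    }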
HPC Advisory Council Switzerland Conference 2016 39/56
Outline
Other reasons for
using rCUDA?
HPC Advisory Council Switzerland Conference 2016 40/56
Cheaper cluster upgrade
No GPU
• Let’s suppose that a cluster without GPUs needs to be upgraded to use GPUs
• GPUs require large power supplies: are the power supplies already installed in the nodes large enough?
• GPUs require large amounts of space: does the current form factor of the nodes allow GPUs to be installed?
The answer to both questions is usually “NO”
HPC Advisory Council Switzerland Conference 2016 41/56
Cheaper cluster upgrade
Approach 1: augment the cluster with some CUDA GPU-enabled nodes  only those GPU-enabled nodes can execute accelerated applications
[Diagram: original nodes without GPUs plus new GPU-enabled nodes; omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 42/56
Cheaper cluster upgrade
Approach 2: augment the cluster with some rCUDA servers  all nodes can execute accelerated applications
[Diagram: original nodes plus GPU-enabled rCUDA servers; omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 43/56
Cheaper cluster upgrade
 Dual socket E5-2620v2 Intel Xeon + 32GB RAM + K20 GPU
 FDR InfiniBand based cluster
16 nodes without GPU + 1 node with 4 GPUs
HPC Advisory Council Switzerland Conference 2016 44/56
More workloads for studying rCUDA+Slurm
HPC Advisory Council Switzerland Conference 2016 45/56
Performance
[Figure: workload results; annotations: -68%, -60%, -63%, -56%, +131%, +119%; plot omitted in this extraction]
Outline
Additional reasons for
using rCUDA?
HPC Advisory Council Switzerland Conference 2016 47/56
#1: More GPUs for a single application
64 GPUs!
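Because rCUDA exposes remote GPUs as ordinary CUDA devices, a single application can enumerate and drive far more GPUs than fit in one node, without source changes. A sketch of plain CUDA device enumeration (work distribution elided):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n;
        cudaGetDeviceCount(&n);     // with rCUDA this may report e.g. 64 remote GPUs
        for (int d = 0; d < n; ++d) {
            cudaSetDevice(d);       // select device d, local or remote alike
            // ... launch this device's share of the work ...
        }
        printf("%d GPUs visible\n", n);
        return 0;
    }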
HPC Advisory Council Switzerland Conference 2016 48/56
#1: More GPUs for a single application
 MonteCarlo Multi-GPU (from NVIDIA samples)
FDR InfiniBand + NVIDIA Tesla K20
[Figure: execution time (lower is better) and throughput (higher is better) as GPUs are added; plots omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 49/56
#2: Virtual machines can share GPUs
• With PCI passthrough, the GPU is exclusively assigned to a single virtual machine
• Concurrent usage of the GPU is not possible
HPC Advisory Council Switzerland Conference 2016 50/56
#2: Virtual machines can share GPUs
[Diagram: GPU sharing among virtual machines when a high-performance network is available and when only a low-performance network is available; omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 51/56
#3: GPU task migration
 Box A has 4 GPUs but only one is busy
 Box B has 8 GPUs but only two are busy
1. Move jobs from Box B to Box A and switch off Box B
2. Migration should be transparent to applications (decided by the global scheduler)
Migration is performed at GPU granularity
HPC Advisory Council Switzerland Conference 2016 52/56
#3: GPU task migration
Job granularity instead of GPU granularity
[Figure: migration example; plot omitted in this extraction]
HPC Advisory Council Switzerland Conference 2016 53/56
Outline
… in summary …
HPC Advisory Council Switzerland Conference 2016 54/56
Pros and cons of rCUDA
• Cons:
1. Reduced bandwidth to remote GPU (really a concern??)
• Pros:
1. Many GPUs for a single application
2. Concurrent GPU access to virtual machines
3. Increased cluster throughput
4. Similar performance with smaller investment
5. Easier (cheaper) cluster upgrade
6. Migration of GPU jobs
7. Reduced energy consumption
8. Increased GPU utilization
HPC Advisory Council Switzerland Conference 2016 55/56
Get a free copy of rCUDA at
http://www.rcuda.net
@rcuda_
More than 650 requests worldwide
rCUDA is a development by Technical University of Valencia
HPC Advisory Council Switzerland Conference 2016 56/56
Thanks!
Questions?
rCUDA is a development by Technical University of Valencia