UNIFIED MEMORY ON P9+V100
UNIFIED MEMORY FUNDAMENTALS
Single pointer
On-demand migration
GPU memory oversubscription
Concurrent CPU/GPU access
System-wide atomics
void *data;
data = malloc(N);
cpu_func1(data, N);
gpu_func2<<<...>>>(data, N);
cudaDeviceSynchronize();
cpu_func3(data, N);
free(data);
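The fundamentals listed above include GPU memory oversubscription; the snippet on this slide does not show it, so here is a minimal hedged sketch (sizes, launch configuration, and kernel name are illustrative, and error handling is mostly elided) that allocates more managed memory than the GPU physically has and still runs a kernel over all of it:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(char *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;  // page faults migrate pages to the GPU on demand
}

int main() {
    size_t free_b, total_b;
    cudaMemGetInfo(&free_b, &total_b);

    // Oversubscribe: request 1.5x the GPU's physical memory.
    size_t n = total_b + total_b / 2;
    char *data;
    if (cudaMallocManaged(&data, n) != cudaSuccess) return 1;

    // Pages migrate in on demand; cold pages are evicted back to CPU
    // memory as needed, so the kernel can walk the whole allocation.
    touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```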
P9+V100 CLARIFICATIONS
cudaMallocManaged still works as before (including performance hints)
cudaMalloc allocations are still not accessible from the CPU
void *data;
cudaMallocManaged(&data, N);
cpu_func1(data, N);
gpu_func2<<<...>>>(data, N);
cudaDeviceSynchronize();
cpu_func3(data, N);
cudaFree(data);
PERFORMANCE HINTS
cudaMemPrefetchAsync(ptr, size, processor, stream)
Similar to cudaMemcpyAsync
cudaMemAdvise(ptr, size, advice, processor)
ReadMostly: duplicate pages, writes possible but expensive
PreferredLocation: “resist” migrations from the preferred location
AccessedBy: establish mappings to avoid migrations and access directly
USER HINTS
Read Mostly
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
cudaMemPrefetchAsync(data, N, myGpuId, s);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
In this case prefetch creates a copy instead of moving the data.
Both processors can read the data simultaneously without faults.
Writes invalidate the copies on other processors (expensive).
[Timeline: CPU init_data → GPU my_kernel → CPU use_data]
USER HINTS
Preferred Location: resisting migrations
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
Here the kernel will page fault and generate a direct mapping to the data on the CPU.
The driver will "resist" migrating data away from the preferred location.
[Timeline: CPU init_data → GPU my_kernel → CPU use_data]
USER HINTS
Preferred Location: page population on first touch
char *data;
cudaMallocManaged(&data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
Here the kernel will page fault, populate the pages on the CPU, and generate a direct mapping to the data on the CPU.
Pages are populated at the preferred location if the faulting processor can access it.
[Timeline: GPU my_kernel → CPU use_data]
USER HINTS
Accessed By
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, myGpuId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
The GPU will establish a direct mapping to the data in CPU memory; no page faults will be generated.
The memory can move freely to other processors and the mapping will carry over.
[Timeline: CPU init_data → GPU my_kernel → CPU use_data]
P9+V100 NEW FEATURES
Available since CUDA 9.1:
NVLINK2: increased migration throughput
NVLINK2: hardware coherency (the CPU can access GPU memory)
Indirect peers: a GPU can access the memory of remote GPUs on a different socket
Native atomics support for all accessible memory
Work in progress:
cudaMallocManaged may use access counters to guide migrations (opt-in)
ATS: GPU can access all system memory (malloc, stack, mmap files)
NVLINK2: INCREASED THROUGHPUT
Copies, prefetches, and direct access can achieve roughly 80% of peak bandwidth.
On-demand migrations have lower bandwidth due to page-fault overheads.
*Indirect peers have lower bandwidth due to the P9 interconnect path.

System           GPUs per CPU   CPU-GPU   GPU-GPU*
2xP8+4xP100      2              40 GB/s   40 GB/s
2xP9+6xV100 (1)  3              50 GB/s   50 GB/s
2xP9+4xV100 (2)  2              75 GB/s   75 GB/s
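The achievable prefetch bandwidth can be measured with CUDA events; this is a minimal sketch, not from the original deck (the 1 GiB size and device id 0 are illustrative assumptions, and error checking is elided):

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1UL << 30;  // 1 GiB, large enough to amortize overheads
    char *data;
    cudaMallocManaged(&data, N);
    memset(data, 0, N);          // populate the pages on the CPU first

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemPrefetchAsync(data, N, 0 /* GPU 0 */, 0 /* default stream */);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("prefetch bandwidth: %.1f GB/s\n", (N / 1e9) / (ms / 1e3));

    cudaFree(data);
    return 0;
}
```

On the systems in the table above, the measured value should approach the listed CPU-GPU link bandwidth.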
NVLINK2: COHERENCY
The CPU can directly access and cache GPU memory; native CPU-GPU atomics are supported.
Works for both cudaMallocManaged and malloc.
[Diagram: page2 maps to CPU physical memory and page3 maps to GPU physical memory; each processor accesses its local page directly and the other's page remotely]
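Native CPU-GPU atomics on coherent memory can be sketched as follows; this is a hedged illustration, not from the original deck (the kernel name, iteration counts, and the GCC `__sync_fetch_and_add` builtin on the host side are assumptions):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// GPU side: system-scope atomic, visible to the CPU (requires sm_60+).
__global__ void gpu_increment(int *counter, int iters) {
    for (int i = 0; i < iters; ++i)
        atomicAdd_system(counter, 1);
}

int main() {
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;

    gpu_increment<<<1, 32>>>(counter, 1000);  // async launch

    // CPU side runs concurrently with the kernel; on P9+V100 this is
    // hardware-coherent, whereas pre-Pascal systems would fault here.
    for (int i = 0; i < 1000; ++i)
        __sync_fetch_and_add(counter, 1);

    cudaDeviceSynchronize();
    printf("counter = %d\n", *counter);  // 32*1000 + 1000 = 33000
    cudaFree(counter);
    return 0;
}
```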
ACCESS COUNTERS
If memory has AccessedBy set, migration can be triggered by access counters (opt-in).
Opt-in, and only for cudaMallocManaged at acceptance.
[Diagram: pages with many accesses migrate to GPU memory; pages with few accesses stay in CPU memory and are accessed remotely]
FAULTS VS ACCESS COUNTERS
Faults: eager migration. When using faults, all touched pages will be moved to the GPU.
[Diagram: every touched page is migrated from CPU memory to GPU memory and then accessed locally]
FAULTS VS ACCESS COUNTERS
Access counters: delayed migration. With access counters, only hot pages will be moved to the GPU.
[Diagram: hot pages are migrated to GPU memory and accessed locally; cold pages remain in CPU memory and are accessed remotely]
15
USER HINTS ON P9+V100
Preferred Location: CPU can directly access GPU memory
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, gpuId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
Here the kernel will page fault and migrate the data to the GPU.
The driver will "resist" migrating data away from the preferred location.
On non-P9+V100 systems, the driver will migrate the data back to the CPU.
[Timeline: CPU init_data → GPU my_kernel → CPU use_data]
16
USER HINTS ON P9+V100
Accessed By: using access counters on P9+V100
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, myGpuId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
The GPU will establish a direct mapping to the data in CPU memory; no page faults will be generated.
Access counters may eventually trigger migration of frequently accessed pages to the GPU.
On non-P9+V100 systems, all pages will stay in CPU memory.
[Timeline: CPU init_data → GPU my_kernel → CPU use_data]
17
cudaMallocManaged vs malloc
(on the system at acceptance)
18
FIRST TOUCH
ptr = cudaMallocManaged(size);
doStuffOnGpu<<<...>>>(ptr, size);
cudaMallocManaged: same behavior as x86
GPU page faults
Unified Memory driver allocates on GPU
GPU accesses GPU memory
Note: you may alter this behavior by using perf hints, e.g. PreferredLocation=CPU will allocate on the CPU on first touch.
19
FIRST TOUCH
ptr = malloc(size);
doStuffOnGpu<<<...>>>(ptr, size);
malloc: always allocate on CPU
GPU uses ATS, faults
OS allocates on CPU (by default)
GPU uses ATS to access CPU memory
20
ON-DEMAND MIGRATION
ptr = cudaMallocManaged(size);
fillData(ptr, size);
doStuffOnGpu<<<...>>>(ptr, size); // ptr migrates to GPU
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size); // ptr migrates to CPU
cudaMallocManaged: same behavior as x86
21
ON-DEMAND MIGRATION
ptr = malloc(size);
fillData(ptr, size);
doStuffOnGpu<<<...>>>(ptr, size); // GPU accesses CPU memory through ATS
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size); // CPU accesses CPU memory
malloc: no automatic migrations*
No on-demand data movement for malloc memory, except through explicit APIs
*In the future, access counters will be used to migrate malloc memory (not at acceptance time)
22
USER-DIRECTED MIGRATION
ptr = malloc_support ? malloc(size) : cudaMallocManaged(size);
fillData(ptr, size);
cudaMemPrefetchAsync(ptr, size, dest_gpu_id); // moves data to GPU
doStuffOnGpu<<<...>>>(ptr, size); // GPU accesses GPU memory
cudaMemPrefetchAsync(ptr, size, cudaCpuDeviceId); // moves data to CPU
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size); // CPU accesses CPU memory
Works for all allocators
23
USER-DIRECTED MIGRATION
cudaMallocManaged: no performance regressions, increased bandwidth (NVLINK2)
Performance considerations
GPU GPU
CPU
x3 NVLINK
24
USER-DIRECTED MIGRATION
malloc: for system acceptance migrations could be implemented through CPU (slow
path) but we're working on enabling the direct P2P transfer in the future (fast path)
Performance considerations
GPU GPU
CPU
x3 NVLINK
25
cudaMalloc cudaMallocManaged malloc
cudaMalloc No Yes No
cudaMallocManaged No Yes No
malloc No No No
EVICTION TABLE
Can [row] evict [column] from GPU to CPU?
Green: Working as intended Red: Want to change in future
26
USER HINTS: ATS
Advise hints have no impact on malloc allocations
When pages are populated they are automatically AccessedBy all GPUs in the system
PreferredLocation or ReadMostly are no-ops for malloc at acceptance
ReadMostly does not work by design (ATS uses single page table)
PreferredLocation will work in the future
27
GENERAL RECOMMENDATIONS
Take advantage of CPU access to GPU memory (including native atomics) so that data occasionally read by the CPU does not need to move off the GPU
cudaMallocManaged is currently the most performant way to use Unified Memory
Use perf hints to obtain more predictable memory access patterns and data movement behavior between CPU and GPU (only for cudaMallocManaged)
When using system malloc, consider the current performance implications (no on-demand migrations, possibly low prefetch bandwidth)
Use a pooled allocator to alleviate the cost of memory allocation
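The pooled-allocator recommendation above can be sketched as a minimal free-list over cudaMallocManaged; this is an illustrative sketch, not from the original deck (the class name and fixed block size are assumptions):

```cuda
#include <vector>
#include <cuda_runtime.h>

// Minimal fixed-size pool over cudaMallocManaged: blocks are allocated
// once and recycled, so steady-state acquire/release avoids the cost of
// repeated driver allocations.
class ManagedPool {
    size_t block_size_;
    std::vector<void*> free_;
public:
    explicit ManagedPool(size_t block_size) : block_size_(block_size) {}

    void* acquire() {
        if (!free_.empty()) {            // fast path: reuse a block
            void *p = free_.back();
            free_.pop_back();
            return p;
        }
        void *p = nullptr;               // slow path: grow the pool
        cudaMallocManaged(&p, block_size_);
        return p;
    }

    void release(void *p) { free_.push_back(p); }  // recycle, don't cudaFree

    ~ManagedPool() {
        for (void *p : free_) cudaFree(p);
    }
};
```

A recycled block keeps its pages wherever they last migrated, so pooling can also preserve data locality between uses.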
LEARN MORE AT GTC 2018
S8430 - Everything You Need to Know About Unified Memory
Tuesday, Mar 27, 4:30 PM - 5:20 PM – Room 211A
Look out for a “Connect with the Experts” session on Unified Memory

 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Kürzlich hochgeladen (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

cudaMallocManaged(&data, N);
cudaMemAdvise(data, N, ..PreferredLocation, cudaCpuDeviceId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
Here the kernel will page fault, populate pages on the CPU and generate a direct mapping to data on the CPU
Pages are populated on the preferred location if the faulting processor can access it
GPU: my_kernel
CPU: use_data
8
USER HINTS
Accessed By
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, ..SetAccessedBy, myGpuId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
GPU will establish a direct mapping of data in CPU memory, no page faults will be generated
Memory can move freely to other processors and the mapping will carry over
GPU: my_kernel
CPU: init_data
CPU: use_data
9
P9+V100 NEW FEATURES
Available since CUDA 9.1:
NVLINK2: increased migration throughput
NVLINK2: enabling HW coherency (CPU can access GPU memory)
Indirect peers: GPU can access memory of remote GPUs on a different socket
Native atomics support for all accessible memory
Work in progress:
cudaMallocManaged may use access counters to guide migrations (opt-in)
ATS: GPU can access all system memory (malloc, stack, mmap'd files)
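The "native atomics for all accessible memory" point means a single counter in managed memory can be updated atomically from both processors. A minimal sketch, assuming a cc 6.0+ GPU for `atomicAdd_system` and a P9+V100-class system with HW coherency for the concurrent CPU update (`bump` is an illustrative kernel, not from the deck):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// System-scope atomic: the increment is made visible to the CPU as well.
__global__ void bump(int *counter) {
    atomicAdd_system(counter, 1);
}

int main() {
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;

    bump<<<1, 256>>>(counter);
    // Concurrent CPU-side atomic increment; safe here because P9+V100
    // supports concurrent CPU/GPU access to managed memory. On systems
    // without it, move this after cudaDeviceSynchronize().
    __sync_fetch_and_add(counter, 1);

    cudaDeviceSynchronize();
    printf("%d\n", *counter);  // 256 GPU increments + 1 CPU increment
    cudaFree(counter);
    return 0;
}
```

On pre-Pascal or non-coherent systems the system-scope atomic and the concurrent CPU access are not available, which is exactly the gap the features above close.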
10
NVLINK2: INCREASED THROUGHPUT
Copies, prefetches and direct access can achieve roughly 80% of peak BW
On-demand migrations have lower BW due to page fault overheads

System             GPUS PER CPU   CPU-GPU   GPU-GPU*
2xP8+4xP100        2              40GB/s    40GB/s
2xP9+6xV100 (1)    3              50GB/s    50GB/s
2xP9+4xV100 (2)    2              75GB/s    75GB/s

*Indirect peers have lower BW due to the P9 interconnect path
11
NVLINK2: COHERENCY
CPU can directly access and cache GPU memory; native CPU-GPU atomics
Works for cudaMallocManaged and malloc
[Diagram: page2 maps to CPU physical memory and page3 maps to GPU physical memory; each processor accesses its local page directly and the other processor's page remotely]
12
ACCESS COUNTERS
If memory has AccessedBy set, migration can be triggered by access counters
Opt-in, and only for cudaMallocManaged at acceptance
[Diagram: pages with many GPU accesses become candidates for migration to GPU memory; pages with few accesses stay in CPU memory]
13
FAULTS VS ACCESS COUNTERS
Faults: eager migration
When using faults, all touched pages will be moved to the GPU (eager migration)
[Diagram: the GPU faults on page1-page3 and all three pages migrate from CPU memory to GPU memory, after which all accesses are local]
14
FAULTS VS ACCESS COUNTERS
Access counters: delayed migration
With access counters, only hot pages will be moved to the GPU (delayed migration)
[Diagram: the GPU first accesses page1-page3 remotely in CPU memory; only the frequently accessed page migrates to GPU memory]
15
USER HINTS ON P9+V100
Preferred Location: CPU can directly access GPU memory
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, ..PreferredLocation, gpuId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
Here the kernel will page fault and migrate data to the GPU
The driver will "resist" migrating data away from the preferred location
On non-P9+V100 systems the driver will migrate data back to the CPU
GPU: my_kernel
CPU: init_data
CPU: use_data
16
USER HINTS ON P9+V100
Accessed By: using access counters on P9+V100
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, ..SetAccessedBy, myGpuId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
GPU will establish a direct mapping of data in CPU memory, no page faults will be generated
Access counters may eventually trigger migration of frequently accessed pages to the GPU
On non-P9+V100 systems all pages will stay in CPU memory
GPU: my_kernel
CPU: init_data
CPU: use_data
17
cudaMallocManaged vs malloc
(on the system at acceptance)
18
FIRST TOUCH
cudaMallocManaged: same behavior as x86
ptr = cudaMallocManaged(size);
doStuffOnGpu<<<...>>>(ptr, size);
GPU page faults
Unified Memory driver allocates on GPU
GPU accesses GPU memory
Note: you may alter this behavior with perf hints, e.g. PreferredLocation=CPU will allocate on the CPU on first touch
19
FIRST TOUCH
malloc: always allocates on the CPU
ptr = malloc(size);
doStuffOnGpu<<<...>>>(ptr, size);
GPU uses ATS, faults
OS allocates on CPU (by default)
GPU uses ATS to access CPU memory
20
ON-DEMAND MIGRATION
cudaMallocManaged: same behavior as x86
ptr = cudaMallocManaged(size);
fillData(ptr, size);
doStuffOnGpu<<<...>>>(ptr, size); // ptr migrates to GPU
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size); // ptr migrates to CPU
21
ON-DEMAND MIGRATION
malloc: no automatic migrations*
ptr = malloc(size);
fillData(ptr, size);
doStuffOnGpu<<<...>>>(ptr, size); // GPU accesses CPU memory through ATS
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size); // CPU accesses CPU memory
No on-demand malloc data movement except by APIs
*In the future we'll use access counters to migrate malloc memory (not at acceptance time)
22
USER-DIRECTED MIGRATION
Works for all allocators
ptr = malloc_support ? malloc(size) : cudaMallocManaged(size);
fillData(ptr, size);
cudaMemPrefetchAsync(ptr, size, dest_gpu_id); // moves data to GPU
doStuffOnGpu<<<...>>>(ptr, size); // GPU accesses GPU memory
cudaMemPrefetchAsync(ptr, size, cudaCpuDeviceId); // moves data to CPU
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size); // CPU accesses CPU memory
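Filled out with the boilerplate the slide elides, the prefetch pattern might look like the sketch below (`doStuffOnGpu`, the size, and the launch shape are illustrative; error checking omitted for brevity):

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Stand-in kernel: increments every byte.
__global__ void doStuffOnGpu(char *p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

int main() {
    const size_t N = 1 << 20;
    int gpu;
    cudaGetDevice(&gpu);

    char *ptr;
    cudaMallocManaged(&ptr, N);
    memset(ptr, 0, N);                                 // first touch on the CPU

    cudaMemPrefetchAsync(ptr, N, gpu, 0);              // move data to the GPU
    doStuffOnGpu<<<(N + 255) / 256, 256>>>(ptr, N);    // GPU accesses GPU memory
    cudaMemPrefetchAsync(ptr, N, cudaCpuDeviceId, 0);  // move data back to the CPU
    cudaDeviceSynchronize();
    // ... CPU now accesses CPU memory without faults ...
    cudaFree(ptr);
    return 0;
}
```

Issuing the prefetches on the same stream as the kernel (here the default stream) serializes them correctly, so the second prefetch only runs after the kernel finishes.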
23
USER-DIRECTED MIGRATION
Performance considerations
cudaMallocManaged: no performance regressions, increased bandwidth (NVLINK2)
[Diagram: two GPUs and a CPU connected by x3 NVLINK links]
24
USER-DIRECTED MIGRATION
Performance considerations
malloc: for system acceptance, migrations could be implemented through the CPU (slow path), but we're working on enabling direct P2P transfers in the future (fast path)
[Diagram: two GPUs and a CPU connected by x3 NVLINK links]
25
EVICTION TABLE
Can [row] evict [column] from GPU to CPU?

                     cudaMalloc   cudaMallocManaged   malloc
cudaMalloc           No           Yes                 No
cudaMallocManaged    No           Yes                 No
malloc               No           No                  No

Green: Working as intended
Red: Want to change in future
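The "Yes" cells are what make GPU memory oversubscription work: when allocations need more GPU memory than physically exists, the driver evicts cudaMallocManaged pages to the CPU to make room. A hedged sketch of oversubscribing with managed memory (the 1.5x factor and `touch` kernel are illustrative, and the host must have enough RAM to absorb the evicted pages):

```cuda
#include <cuda_runtime.h>

// Grid-stride write so every page of the allocation is faulted in.
__global__ void touch(char *p, size_t n) {
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        p[i] = 1;
}

int main() {
    size_t freeB, totalB;
    cudaMemGetInfo(&freeB, &totalB);

    // Ask for more managed memory than the GPU physically has; as the
    // kernel faults new pages in, the driver evicts other managed pages
    // back to CPU memory (the cudaMallocManaged "Yes" cell above).
    size_t N = totalB + totalB / 2;
    char *p;
    if (cudaMallocManaged(&p, N) != cudaSuccess) return 1;

    touch<<<1024, 256>>>(p, N);
    cudaDeviceSynchronize();
    cudaFree(p);
    return 0;
}
```

The same program fails outright with cudaMalloc, which cannot be evicted and cannot exceed physical GPU memory.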
26
USER HINTS: ATS
Advise hints have no impact on malloc allocations
When pages are populated, they are automatically AccessedBy all GPUs in the system
PreferredLocation and ReadMostly are no-ops for malloc at acceptance
ReadMostly does not work by design (ATS uses a single page table)
PreferredLocation will work in the future
27
GENERAL RECOMMENDATIONS
Take advantage of CPU access to GPU memory (including native atomics), so that data occasionally read by the CPU does not need to move from the GPU
cudaMallocManaged is the most performant way to use Unified Memory now
Use perf hints to obtain more predictable memory access patterns and data movement behavior between CPU and GPU (only for cudaMallocManaged)
When using system malloc, consider the current performance implications (no on-demand migrations, possibly low prefetch bandwidth)
Use a pooled allocator to alleviate the cost of memory allocation
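The last recommendation can be as simple as a free list that recycles fixed-size cudaMallocManaged blocks instead of allocating and freeing per use. A minimal illustrative sketch (not a production allocator; single-threaded, one block size, no error handling):

```cuda
#include <cuda_runtime.h>
#include <vector>

// Fixed-size pool over cudaMallocManaged: the allocation cost is paid once
// per block, after which acquire/release are cheap pointer moves.
class ManagedPool {
    size_t blockSize_;
    std::vector<void *> free_;
public:
    explicit ManagedPool(size_t blockSize) : blockSize_(blockSize) {}

    void *acquire() {
        if (!free_.empty()) {
            void *p = free_.back();   // fast path: reuse, no driver call
            free_.pop_back();
            return p;
        }
        void *p = nullptr;
        cudaMallocManaged(&p, blockSize_);  // slow path: real allocation
        return p;
    }

    void release(void *p) { free_.push_back(p); }

    ~ManagedPool() {
        for (void *p : free_) cudaFree(p);
    }
};
```

After warm-up, each acquire/release pair avoids a cudaMallocManaged/cudaFree round trip; perf hints such as cudaMemAdvise still apply to pooled blocks, since they are ordinary managed allocations.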
28
LEARN MORE AT GTC 2018
S8430 - Everything You Need to Know About Unified Memory
Tuesday, Mar 27, 4:30 PM - 5:20 PM, Room 211A
Look out for a "Connect with the Experts" session on Unified Memory