SlideShare ist ein Scribd-Unternehmen logo
1 von 43
Downloaden Sie, um offline zu lesen
Jack Han | Solutions Architect | jahan@nvidia.com
PROFILING
DEEP LEARNING NETWORK
2
AGENDA
Profiling Deep Learning Network
Introduction to Nsight Systems
PyTorch NVTX Annotation
Examples and Benefits
Mixed Precision with BERT
TensorFlow NVTX Plugins
Plugin Usage Example
3
HOW TO SPEED-UP NETWORK
Use the latest GPU and more GPU?
Wait NVIDIA to release the new GPU or newer library?
Optimize Neural Network Computing or Something
Need to understand your network operation
We are talking about this
4
PROFILING NEURAL NETWORK
Profiling with cProfiler + Snakeviz
python -m cProfile -o 100_percent_gpu_utilization.prof train.py
snakeviz 100_percent_gpu_utilization.prof
https://sagivtech.com/2017/09/19/optimizing-pytorch-training-code/
Useful?
5
TENSORFLOW PROFILER
Show timeline and can trace network
Still difficult to understand
https://github.com/tensorflow/profiler-ui
Useful?
6
NVIDIA PROFILING TOOLS
CUPTI (CUDA Profiling Tools Interface)
Nsight Systems
Nsight Compute
Nsight Visual Studio Edition
Nsight Eclipse Edition
NVIDIA Profiler (nvprof)
NVIDIA Visual Profiler (nvvp)
7
NSIGHT PRODUCT FAMILY
Standalone Performance Tools
Nsight Systems system-wide application algorithm tuning
Nsight Compute Debug/optimize specific CUDA kernel
Nsight Graphics Debug/optimize specific graphics
IDE plugins
Nsight Eclipse Edicion/Visual Studio editor, debugger, some perf analysis
8
NSIGHT SYSTEMS
Profile System-wide application
Multi-process tree, GPU workload trace, etc
Investigate your workload across multiple CPUs and GPUs
CPU algorithms, utilization, and thread states
GPU streams kernels, memory transfers, etc
NVTX, CUDA & Library API, etc
Ready for Big Data
docker, user privilege (linux), cli, etc
Overview
9
Thread/core
migration
Thread state
Processes and
threads
CUDA and OpenGL
API trace
cuDNN and
cuBLAS trase
Kernel and
memory transfer
activities
Multi-GPU
10
NVTX Tracining
11
TRANSITIONING TO PROFILE A KERNEL
Dive into kernel analysis
12
NVIDIA NSIGHT COMPUTE
Interactive CUDA API debugging and kernel profiling
Fast Data Collection
Graphical and multiple kernel comparison reports
Improved Workflow and Fully Customizable
(Baselining, Programmable UI/Rules)
Command Line, Standalone, IDE Integration
Platform Support
OS: Linux(x86,ARM), Windows, OSX (host only)
GPUs: Pascal, Volta, Turing
Next Generation Kernel Profiler
Kernel Profile Comparisons with Baseline
Metric Data
Source Correlation
13
PROFILING GPU APPLICATION
How to measure
Focusing GPU Computing
Low GPU
Utilization
Low SM
Efficiency
Low Achieved
Occupancy
Memory
Bottleneck
Instructions
Bottleneck
GPU Profiling
CPU/GPU
Tracing
Application
Tracing
• Too few threads
• Register limit
• Large shared
memory
…
• Cache misses
• Bandwidth limit
• Access pattern
…
• Arithmetic
• Control flow
…
NVIDIA (Visual) Profiler / Nsight Compute
NVIDIA Supports them with cuDNN, cuBLAS, and so on
14
GPU Profiling
CPU/GPU
Tracing
Application
Tracing
PROFILING GPU APPLICATION
How to measure
Focusing System Operation
Low GPU
Utilization
Low SM
Efficiency
Low Achieved
Occupancy
Memory
Bottleneck
Instructions
Bottleneck
CPU-Only
Activities
Memcopy
Latency
Kernel Launch
Latency
Job Startup /
Checkpoints
CPU
Computation
I/O
Nsight System / Application Tracing
15
NSIGHT SYSTEMS PROFILE
Profile with CLI
$ nsys profile –t cuda,osrt,nvtx,cudnn,cublas 
-y 60 
-d 20 
–o baseline 
-f true 
–w true 
python main.py
cuda – GPU kernel
osrt – OS runtime
nvtx – NVIDIA Tools Extension
cudnn – CUDA Deep NN library
cublas– CUDA BLAS library
https://docs.nvidia.com/nsight-systems/#nsight_systems/2019.3.6-x86/06-cli-profiling.htm
ßAPIs to be traced
ßDelayed profile (sec)
ßProfiling duration (sec)
ßOutput filename
ßOverwrite when it’s true
ßDisplay
ßExecution command
16
DISABLING ASYNCHRONOUS OPERATION
Elimination of CPU interop effects
export CUDA_LAUNCH_BLOCKING=1
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution
17
PROFILING
PYTORCH MODEL
18
BASELINE PROFILE
MNIST Training: 89 sec, <5% utilization
CPU waits on a semaphore and starves the GPU!
GPU Starvation GPU Starvation
19
NVTX ANNOTATIONS
def train(args, model, device, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
20
NVTX ANNOTATIONS
import torch.cuda.nvtx as nvtx
def train(args, model, device, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
nvtx.range_push("Batch " + str(batch_idx))
nvtx.range_push("Copy to device")
data, target = data.to(device), target.to(device)
nvtx.range_pop()
nvtx.range_push("Forward pass")
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
nvtx.range_pop()
nvtx.range_pop()
https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
21
BASELINE PROFILE (WITH NVTX)
GPU is idle during data loading
Data is loaded using a single thread. This starves the GPU!
fwd/bwd fwd/bwd
5.1ms
22
OPTIMIZE SOURCE CODE
Data loader was configured to use 1 worker thread
Let’s switch to using 8 worker threads:
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
kwargs = {'num_workers’: 8, 'pin_memory': True} if use_cuda else {}
23
AFTER OPTIMIZATION
Time for data loading reduced for each bath
60us Reduced from 5.1ms to 60us for batch 50
24
REAL EXAMPLE
https://news.developer.nvidia.com/nvidia-achieves-4x-speedup-on-bert-neural-network/
How?
25
INITIAL PROFILE
No NVTX
Difficult to understand à no useful
26
loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
...
if args.fp16:
optimizer.backward(loss)
else:
loss.backward()
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.fp16:
# modify learning rate with special warm up BERT uses
# if args.fp16 is False, BertAdam is used and handles this automatically
lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion)
for param_group in optimizer.param_groups:
param_group['lr'] = lr_this_step
optimizer.step()
optimizer.zero_grad()
global_step += 1
Foward
Backward
NVTX TAGGING
27
nvtx.range_push("Batch " + str(step))
nvtx.range_push("Forward pass")
loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
nvtx.range_pop()
...
nvtx.range_push("Backward pass")
if args.fp16:
optimizer.backward(loss)
else:
loss.backward()
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.fp16:
# modify learning rate with special warm up BERT uses
# if args.fp16 is False, BertAdam is used and handles this automatically
lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion)
for param_group in optimizer.param_groups:
param_group['lr'] = lr_this_step
optimizer.step()
optimizer.zero_grad()
global_step += 1
nvtx.range_pop()
nvtx.range_pop()
Foward
Backward
NVTX TAGGING
28
BERT FP32 BENCHMARK
HuggingFace’s pretrained BERT
Getting API List
~460 ms
~60% GEMM
4 V100 GPUs w/ NVLINK, Batch size: 32, max_seq_length: 512
29
BERT FP16 BENCHMARK
HuggingFace’s pretrained BERT
Tensor Cores
APIs
~222 ms
2.1x Speed up
4 V100 GPUs w/ NVLINK, Batch size: 32, max_seq_length: 512
30
132 ms
8.5 ms
APEX OPTIMIZER OPTIMIZATION
APEX/csrc/adam_cuda_kernel()
Low Utilization
Higher Utilization
FP32
FP16
31
PROFILING GPU APPLICATION
You can investigate the performance bottleneck in each layer
NVIDIA provides such inefficient layers / graphs optimization support
For example,
Like this, adam_cuda_kernel in APEX
cudnnMultiHeadAttnForward() in cuDNN
BERT speed-up
Reminding
cudnnMultiHeadAttnForward
32
PROFILING
TENSORFLOW GRAPHS
33
NVTX WITH TENSORFLOW
The TF_DISABLE_NVTX_RANGES variable enables
and disables NVTX ranges in TensorFlow
NVTX collect is enabled by default
Available in NGC Container
Having NVTX plugin for TF
Import nvtx plugin
Annotation in python code
Get data from through Profiler
34
NVTX PLUGINS FOR DEEP LEARNING
Allows users to add their own NVIDIA Tools Extension (NVTX) events and time ranges to a
TensorFlow graph
Ranges are added by wrapping regions of the computation graph with start and end
operations
Profiling TensorFlow for the graphs
NVTX Context
35
SETUP
Install NVTX plugins
pip install nvtx-plugins-tf
No need to install this plugin since NGC 19.08
36
HOW TO USE
Adding Markers
Function Detector
import nvtx.plugins.tf as nvtx_tf
x, nvtx_context = nvtx_tf.ops.start(x, message='Dense 1-2', 
domain_name='Forward’, grad_domain_name='Gradient’, enabled=ENABLE_NVTX, trainable=True)
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1')
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2’)
x = nvtx_tf.ops.end(x, nvtx_context)
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_3')
import nvtx.plugins.tf as nvtx_tf
@nvtx_tf.ops.trace(message='Dense Block', domain_name='Forward',
grad_domain_name='Gradient', enabled=ENABLE_NVTX, trainable=True)
def dense_block(x):
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1')
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2’)
return x
37
WORKING WITH SESSION/TF.ESTIMATOR
from nvtx.plugins.tf.estimator import NVTXHook
nvtx_callback = NVTXHook(skip_n_steps=1, name='Train’)
training_hooks=[]
training_hooks.append(nvtx_callback)
with tf.train.MonitoredSession(hooks=[nvtx_callback]) as sess:
tf.estimator.Estimator(hooks=training_hooks, ...)
38
PROFIING MPI PROCESS
To profile everything, putting the data into one file
To profile everything, putting the data into each node into a separated file
nsys [nsys options] mpirun [mpi options]
mpirun [mpi options] nsys profile [nsys options]
39
COMMAND EXAMPLES
1 GPU profiling
4 GPU profiling
NGC TensorFlow ResNet50-1.5
nsys profile -y 40 -d 20 -f true -w true -o tf_amp_4gpu 
-t osrt,nvtx,cuda,cublas,cudnn 
mpiexec --allow-run-as-root --bind-to socket -np 4 
python ./main.py --mode=training_benchmark --warmup_steps 200 
--num_iter 500 --iter_unit batch --results_dir=results --batch_size 64
nsys profile -y 40 -d 20 -f true -w true -o tf_amp 
-t osrt,nvtx,cuda,cublas,cudnn 
python ./main.py --mode=training_benchmark --warmup_steps 200 
--num_iter 500 --iter_unit batch --results_dir=results --batch_size 64
40
TF WITH NVTX
41
AMP ENABLED
42
MULTI-GPU
Profiling deep learning network using NVIDIA nsight systems

Weitere ähnliche Inhalte

Was ist angesagt?

Intro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみたIntro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみたMITSUNARI Shigeo
 
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)Hiroki Nakahara
 
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat ModelsDeep Learning JP
 
高位合成でDeep learning
高位合成でDeep learning高位合成でDeep learning
高位合成でDeep learningMori Labo.
 
[Dl輪読会]dl hacks輪読
[Dl輪読会]dl hacks輪読[Dl輪読会]dl hacks輪読
[Dl輪読会]dl hacks輪読Deep Learning JP
 
【DL輪読会】The Forward-Forward Algorithm: Some Preliminary
【DL輪読会】The Forward-Forward Algorithm: Some Preliminary【DL輪読会】The Forward-Forward Algorithm: Some Preliminary
【DL輪読会】The Forward-Forward Algorithm: Some PreliminaryDeep Learning JP
 
x86とコンテキストスイッチ
x86とコンテキストスイッチx86とコンテキストスイッチ
x86とコンテキストスイッチMasami Ichikawa
 
1076: CUDAデバッグ・プロファイリング入門
1076: CUDAデバッグ・プロファイリング入門1076: CUDAデバッグ・プロファイリング入門
1076: CUDAデバッグ・プロファイリング入門NVIDIA Japan
 
いまさら聞けないarmを使ったNEONの基礎と活用事例
いまさら聞けないarmを使ったNEONの基礎と活用事例いまさら聞けないarmを使ったNEONの基礎と活用事例
いまさら聞けないarmを使ったNEONの基礎と活用事例Fixstars Corporation
 
FPGAを用いたEdge AIの現状
FPGAを用いたEdge AIの現状FPGAを用いたEdge AIの現状
FPGAを用いたEdge AIの現状Yukitaka Takemura
 
データ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラデータ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラNVIDIA Japan
 
ディープラーニングの2値化(Binarized Neural Network)
ディープラーニングの2値化(Binarized Neural Network)ディープラーニングの2値化(Binarized Neural Network)
ディープラーニングの2値化(Binarized Neural Network)Hideo Terada
 
SpectreとMeltdown:最近のCPUの深い話
SpectreとMeltdown:最近のCPUの深い話SpectreとMeltdown:最近のCPUの深い話
SpectreとMeltdown:最近のCPUの深い話LINE Corporation
 
Singularityで分散深層学習
Singularityで分散深層学習Singularityで分散深層学習
Singularityで分散深層学習Hitoshi Sato
 
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFA Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFoholiab
 
モデル高速化百選
モデル高速化百選モデル高速化百選
モデル高速化百選Yusuke Uchida
 
DoS and DDoS mitigations with eBPF, XDP and DPDK
DoS and DDoS mitigations with eBPF, XDP and DPDKDoS and DDoS mitigations with eBPF, XDP and DPDK
DoS and DDoS mitigations with eBPF, XDP and DPDKMarian Marinov
 

Was ist angesagt? (20)

CuPy解説
CuPy解説CuPy解説
CuPy解説
 
Intro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみたIntro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみた
 
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
 
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
 
高位合成でDeep learning
高位合成でDeep learning高位合成でDeep learning
高位合成でDeep learning
 
Apache spark 2.3 and beyond
Apache spark 2.3 and beyondApache spark 2.3 and beyond
Apache spark 2.3 and beyond
 
[Dl輪読会]dl hacks輪読
[Dl輪読会]dl hacks輪読[Dl輪読会]dl hacks輪読
[Dl輪読会]dl hacks輪読
 
【DL輪読会】The Forward-Forward Algorithm: Some Preliminary
【DL輪読会】The Forward-Forward Algorithm: Some Preliminary【DL輪読会】The Forward-Forward Algorithm: Some Preliminary
【DL輪読会】The Forward-Forward Algorithm: Some Preliminary
 
x86とコンテキストスイッチ
x86とコンテキストスイッチx86とコンテキストスイッチ
x86とコンテキストスイッチ
 
1076: CUDAデバッグ・プロファイリング入門
1076: CUDAデバッグ・プロファイリング入門1076: CUDAデバッグ・プロファイリング入門
1076: CUDAデバッグ・プロファイリング入門
 
いまさら聞けないarmを使ったNEONの基礎と活用事例
いまさら聞けないarmを使ったNEONの基礎と活用事例いまさら聞けないarmを使ったNEONの基礎と活用事例
いまさら聞けないarmを使ったNEONの基礎と活用事例
 
FPGAを用いたEdge AIの現状
FPGAを用いたEdge AIの現状FPGAを用いたEdge AIの現状
FPGAを用いたEdge AIの現状
 
第31回「今アツい、分散ストレージを語ろう」(2013/11/28 on しすなま!)
第31回「今アツい、分散ストレージを語ろう」(2013/11/28 on しすなま!)第31回「今アツい、分散ストレージを語ろう」(2013/11/28 on しすなま!)
第31回「今アツい、分散ストレージを語ろう」(2013/11/28 on しすなま!)
 
データ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラデータ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラ
 
ディープラーニングの2値化(Binarized Neural Network)
ディープラーニングの2値化(Binarized Neural Network)ディープラーニングの2値化(Binarized Neural Network)
ディープラーニングの2値化(Binarized Neural Network)
 
SpectreとMeltdown:最近のCPUの深い話
SpectreとMeltdown:最近のCPUの深い話SpectreとMeltdown:最近のCPUの深い話
SpectreとMeltdown:最近のCPUの深い話
 
Singularityで分散深層学習
Singularityで分散深層学習Singularityで分散深層学習
Singularityで分散深層学習
 
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFA Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
 
モデル高速化百選
モデル高速化百選モデル高速化百選
モデル高速化百選
 
DoS and DDoS mitigations with eBPF, XDP and DPDK
DoS and DDoS mitigations with eBPF, XDP and DPDKDoS and DDoS mitigations with eBPF, XDP and DPDK
DoS and DDoS mitigations with eBPF, XDP and DPDK
 

Ähnlich wie Profiling deep learning network using NVIDIA nsight systems

Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSDatabricks
 
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...E-Commerce Brasil
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfDLow6
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Volta (Tesla V100) の紹介
Volta (Tesla V100) の紹介Volta (Tesla V100) の紹介
Volta (Tesla V100) の紹介NVIDIA Japan
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSPeterAndreasEntschev
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUsiguazio
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018NVIDIA
 
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Universitat Politècnica de Catalunya
 
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsHow to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsAltoros
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9inside-BigData.com
 
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...Chris Fregly
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Training course lect1
Training course lect1Training course lect1
Training course lect1Noor Dhiya
 

Ähnlich wie Profiling deep learning network using NVIDIA nsight systems (20)

Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Volta (Tesla V100) の紹介
Volta (Tesla V100) の紹介Volta (Tesla V100) の紹介
Volta (Tesla V100) の紹介
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUs
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018
 
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
 
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsHow to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9
 
Nvidia at SEMICon, Munich
Nvidia at SEMICon, MunichNvidia at SEMICon, Munich
Nvidia at SEMICon, Munich
 
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Training course lect1
Training course lect1Training course lect1
Training course lect1
 

Kürzlich hochgeladen

Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spaintimesproduction05
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdfSuman Jyoti
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 

Kürzlich hochgeladen (20)

Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 

Profiling deep learning network using NVIDIA nsight systems

  • 1. Jack Han | Solutions Architect | jahan@nvidia.com PROFILING DEEP LEARNING NETWORK
  • 2. 2 AGENDA Profiling Deep Learning Network Introduction to Nsight Systems PyTorch NVTX Annotation Examples and Benefits Mixed Precision with BERT TensorFlow NVTX Plugins Plugin Usage Example
  • 3. 3 HOW TO SPEED-UP NETWORK Use the latest GPU and more GPU? Wait NVIDIA to release the new GPU or newer library? Optimize Neural Network Computing or Something Need to understand your network operation We are talking about this
  • 4. 4 PROFILING NEURAL NETWORK Profiling with cProfiler + Snakeviz python -m cProfile -o 100_percent_gpu_utilization.prof train.py snakeviz 100_percent_gpu_utilization.prof https://sagivtech.com/2017/09/19/optimizing-pytorch-training-code/ Useful?
  • 5. 5 TENSORFLOW PROFILER Show timeline and can trace network Still difficult to understand https://github.com/tensorflow/profiler-ui Useful?
  • 6. 6 NVIDIA PROFILING TOOLS CUPTI (CUDA Profiling Tools Interface) Nsight Systems Nsight Compute Nsight Visual Studio Edition Nsight Eclipse Edition NVIDIA Profiler (nvprof) NVIDIA Visual Profiler (nvvp)
  • 7. 7 NSIGHT PRODUCT FAMILY Standalone Performance Tools Nsight Systems system-wide application algorithm tuning Nsight Compute Debug/optimize specific CUDA kernel Nsight Graphics Debug/optimize specific graphics IDE plugins Nsight Eclipse Edicion/Visual Studio editor, debugger, some perf analysis
  • 8. 8 NSIGHT SYSTEMS Profile System-wide application Multi-process tree, GPU workload trace, etc Investigate your workload across multiple CPUs and GPUs CPU algorithms, utilization, and thread states GPU streams kernels, memory transfers, etc NVTX, CUDA & Library API, etc Ready for Big Data docker, user privilege (linux), cli, etc Overview
  • 9. 9 Thread/core migration Thread state Processes and threads CUDA and OpenGL API trace cuDNN and cuBLAS trase Kernel and memory transfer activities Multi-GPU
  • 11. 11 TRANSITIONING TO PROFILE A KERNEL Dive into kernel analysis
  • 12. 12 NVIDIA NSIGHT COMPUTE Interactive CUDA API debugging and kernel profiling Fast Data Collection Graphical and multiple kernel comparison reports Improved Workflow and Fully Customizable (Baselining, Programmable UI/Rules) Command Line, Standalone, IDE Integration Platform Support OS: Linux(x86,ARM), Windows, OSX (host only) GPUs: Pascal, Volta, Turing Next Generation Kernel Profiler Kernel Profile Comparisons with Baseline Metric Data Source Correlation
  • 13. 13 PROFILING GPU APPLICATION How to measure Focusing GPU Computing Low GPU Utilization Low SM Efficiency Low Achieved Occupancy Memory Bottleneck Instructions Bottleneck GPU Profiling CPU/GPU Tracing Application Tracing • Too few threads • Register limit • Large shared memory … • Cache misses • Bandwidth limit • Access pattern … • Arithmetic • Control flow … NVIDIA (Visual) Profiler / Nsight Compute NVIDIA Supports them with cuDNN, cuBLAS, and so on
  • 14. 14 GPU Profiling CPU/GPU Tracing Application Tracing PROFILING GPU APPLICATION How to measure Focusing System Operation Low GPU Utilization Low SM Efficiency Low Achieved Occupancy Memory Bottleneck Instructions Bottleneck CPU-Only Activities Memcopy Latency Kernel Launch Latency Job Startup / Checkpoints CPU Computation I/O Nsight System / Application Tracing
  • 15. 15 NSIGHT SYSTEMS PROFILE Profile with CLI $ nsys profile –t cuda,osrt,nvtx,cudnn,cublas -y 60 -d 20 –o baseline -f true –w true python main.py cuda – GPU kernel osrt – OS runtime nvtx – NVIDIA Tools Extension cudnn – CUDA Deep NN library cublas– CUDA BLAS library https://docs.nvidia.com/nsight-systems/#nsight_systems/2019.3.6-x86/06-cli-profiling.htm ßAPIs to be traced ßDelayed profile (sec) ßProfiling duration (sec) ßOutput filename ßOverwrite when it’s true ßDisplay ßExecution command
  • 16. 16 DISABLING ASYNCHRONOUS OPERATION Elimination of CPU interop effects export CUDA_LAUNCH_BLOCKING=1 https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution
  • 18. 18 BASELINE PROFILE MNIST Training: 89 sec, <5% utilization CPU waits on a semaphore and starves the GPU! GPU Starvation GPU Starvation
  • 19. 19 NVTX ANNOTATIONS def train(args, model, device, train_loader, optimizer, epoch): model.train() for batch_idx, (data, target) in enumerate(train_loader): data, target = data.to(device), target.to(device) optimizer.zero_grad() output = model(data) loss = F.nll_loss(output, target) https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
  • 20. 20 NVTX ANNOTATIONS import torch.cuda.nvtx as nvtx def train(args, model, device, train_loader, optimizer, epoch): model.train() for batch_idx, (data, target) in enumerate(train_loader): nvtx.range_push("Batch " + str(batch_idx)) nvtx.range_push("Copy to device") data, target = data.to(device), target.to(device) nvtx.range_pop() nvtx.range_push("Forward pass") optimizer.zero_grad() output = model(data) loss = F.nll_loss(output, target) nvtx.range_pop() nvtx.range_pop() https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
  • 21. 21 BASELINE PROFILE (WITH NVTX) GPU is idle during data loading Data is loaded using a single thread. This starves the GPU! fwd/bwd fwd/bwd 5.1ms
  • 22. 22 OPTIMIZE SOURCE CODE Data loader was configured to use 1 worker thread Let’s switch to using 8 worker threads: kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {} kwargs = {'num_workers’: 8, 'pin_memory': True} if use_cuda else {}
  • 23. 23 AFTER OPTIMIZATION Time for data loading reduced for each bath 60us Reduced from 5.1ms to 60us for batch 50
  • 25. 25 INITIAL PROFILE No NVTX Difficult to understand à no useful
  • 26. 26 loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions) ... if args.fp16: optimizer.backward(loss) else: loss.backward() if (step + 1) % args.gradient_accumulation_steps == 0: if args.fp16: # modify learning rate with special warm up BERT uses # if args.fp16 is False, BertAdam is used and handles this automatically lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion) for param_group in optimizer.param_groups: param_group['lr'] = lr_this_step optimizer.step() optimizer.zero_grad() global_step += 1 Foward Backward NVTX TAGGING
  • 27. 27 nvtx.range_push("Batch " + str(step)) nvtx.range_push("Forward pass") loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions) nvtx.range_pop() ... nvtx.range_push("Backward pass") if args.fp16: optimizer.backward(loss) else: loss.backward() if (step + 1) % args.gradient_accumulation_steps == 0: if args.fp16: # modify learning rate with special warm up BERT uses # if args.fp16 is False, BertAdam is used and handles this automatically lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion) for param_group in optimizer.param_groups: param_group['lr'] = lr_this_step optimizer.step() optimizer.zero_grad() global_step += 1 nvtx.range_pop() nvtx.range_pop() Foward Backward NVTX TAGGING
  • 28. 28 BERT FP32 BENCHMARK HuggingFace’s pretrained BERT Getting API List ~460 ms ~60% GEMM 4 V100 GPUs w/ NVLINK, Batch size: 32, max_seq_length: 512
  • 29. 29 BERT FP16 BENCHMARK HuggingFace’s pretrained BERT Tensor Cores APIs ~222 ms 2.1x Speed up 4 V100 GPUs w/ NVLINK, Batch size: 32, max_seq_length: 512
  • 30. 30 132 ms 8.5 ms APEX OPTIMIZER OPTIMIZATION APEX/csrc/adam_cuda_kernel() Low Utilization Higher Utilization FP32 FP16
  • 31. 31 PROFILING GPU APPLICATION You can investigate the performance bottleneck in each layer NVIDIA provides such inefficient layers / graphs optimization support For example, Like this, adam_cuda_kernel in APEX cudnnMultiHeadAttnForward() in cuDNN BERT speed-up Reminding cudnnMultiHeadAttnForward
  • 33. 33 NVTX WITH TENSORFLOW The TF_DISABLE_NVTX_RANGES variable enables and disables NVTX ranges in TensorFlow NVTX collect is enabled by default Available in NGC Container Having NVTX plugin for TF Import nvtx plugin Annotation in python code Get data from through Profiler
  • 34. 34 NVTX PLUGINS FOR DEEP LEARNING Allows users to add their own NVIDIA Tools Extension (NVTX) events and time ranges to a TensorFlow graph Ranges are added by wrapping regions of the computation graph with start and end operations Profiling TensorFlow for the graphs NVTX Context
  • 35. 35 SETUP Install NVTX plugins pip install nvtx-plugins-tf No need to install this plugin since NGC 19.08
  • 36. 36 HOW TO USE Adding Markers Function Detector import nvtx.plugins.tf as nvtx_tf x, nvtx_context = nvtx_tf.ops.start(x, message='Dense 1-2', domain_name='Forward’, grad_domain_name='Gradient’, enabled=ENABLE_NVTX, trainable=True) x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1') x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2’) x = nvtx_tf.ops.end(x, nvtx_context) x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_3') import nvtx.plugins.tf as nvtx_tf @nvtx_tf.ops.trace(message='Dense Block', domain_name='Forward', grad_domain_name='Gradient', enabled=ENABLE_NVTX, trainable=True) def dense_block(x): x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1') x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2’) return x
  • 37. 37 WORKING WITH SESSION/TF.ESTIMATOR from nvtx.plugins.tf.estimator import NVTXHook nvtx_callback = NVTXHook(skip_n_steps=1, name='Train’) training_hooks=[] training_hooks.append(nvtx_callback) with tf.train.MonitoredSession(hooks=[nvtx_callback]) as sess: tf.estimator.Estimator(hooks=training_hooks, ...)
  • 38. 38 PROFIING MPI PROCESS To profile everything, putting the data into one file To profile everything, putting the data into each node into a separated file nsys [nsys options] mpirun [mpi options] mpirun [mpi options] nsys profile [nsys options]
  • 39. 39 COMMAND EXAMPLES 1 GPU profiling 4 GPU profiling NGC TensorFlow ResNet50-1.5 nsys profile -y 40 -d 20 -f true -w true -o tf_amp_4gpu -t osrt,nvtx,cuda,cublas,cudnn mpiexec --allow-run-as-root --bind-to socket -np 4 python ./main.py --mode=training_benchmark --warmup_steps 200 --num_iter 500 --iter_unit batch --results_dir=results --batch_size 64 nsys profile -y 40 -d 20 -f true -w true -o tf_amp -t osrt,nvtx,cuda,cublas,cudnn python ./main.py --mode=training_benchmark --warmup_steps 200 --num_iter 500 --iter_unit batch --results_dir=results --batch_size 64