P R O F I L I N G P Y T O R C H F O R
E F F I C I E N C Y & S U S T A I N A B I L I T Y
N O V 1 7 , 2 0 2 1
G E E T A C H A U H A N
P Y T O R C H P A R T N E R E N G I N E E R I N G
M E T A A I
A G E N D A
0 1
G P U P E R F O R M A N C E T U N I N G
0 2
P Y T O R C H P R O F I L E R
0 3
T I M E L I N E T R A C I N G
0 4
O P T I M I Z A T I O N E X A M P L E S
0 5
F U T U R E : S U S T A I N A B L E A I
GPU Performance Tuning
Optimized for single thread performance
- Majority of chip area is control logic & caches
Complex and deep out-of-order pipelines
- Extract instruction level parallelism
The brain
- Job is to keep the accelerator busy
CPU GPU
Optimized for throughput of data-parallel problems
- Majority of chip area is functional units
Simple, relatively slow in-order pipelines
- Achieves much higher total throughput
Accelerator attached via PCIe
- Order of magnitude faster but off to the side
A DIFFERENT MENTAL MODEL REQUIRED
G P U P E R F O R M A N C E T U N I N G
Composed of Streaming
Multiprocessors (SMs)
Volta V100: 80 SMs
Ampere A100: 108 SMs
DGX A100 with 8 GPUs:
864 SMs vs 128 CPU cores
NVIDIA Volta V100 GPU
G P U P E R F O R M A N C E T U N I N G
G P U P E R F O R M A N C E T U N I N G
64x FP32 units
64x INT, 32x FP64, 32x LD/ST
8x Tensor Cores
5120 (6912 ON A100)
FP32 EXECUTION UNITS
PER GPU
Streaming Multiprocessor
• Excessive CPU/GPU interactions – e.g. a Python for loop launching GPU operations one at a time (see the sketch below)
- Dominated by launch overheads
• Short GPU kernel durations – e.g. small inputs
- Need enough data to feed tens of thousands of threads
• CPU overheads and I/O bottlenecks are starving the GPU
- Small operations on the CPU can quickly become dominant
• Framework inefficiencies
- E.g. unnecessary copies and hidden CPU-side overheads
VISIBILITY IS KEY
G P U P E R F O R M A N C E T U N I N G
Common Pitfalls
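As a hedged illustration of the first two pitfalls, the sketch below (assumed tensor sizes; a CUDA device is needed for representative numbers) times a Python loop that launches thousands of tiny kernels against a single batched operation doing the same arithmetic:

import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def sync():
    # GPU kernels launch asynchronously; synchronize before reading the clock.
    if device == "cuda":
        torch.cuda.synchronize()

xs = [torch.randn(64, 64, device=device) for _ in range(1000)]

# Anti-pattern: a Python loop launching one tiny kernel per tensor.
sync()
t0 = time.perf_counter()
out = [x * 2.0 + 1.0 for x in xs]   # ~2000 tiny kernel launches
sync()
print("looped :", time.perf_counter() - t0)

# Batched: one large tensor keeps tens of thousands of threads busy
# with only a couple of kernel launches.
big = torch.stack(xs)               # shape (1000, 64, 64)
sync()
t0 = time.perf_counter()
out = big * 2.0 + 1.0
sync()
print("batched:", time.perf_counter() - t0)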
PyTorch Profiler
W i t h I n t e g r a t e d G P U P r o f i l i n g L i b r a r y
CONTRIBUTED BY MICROSOFT &
FACEBOOK
• PyTorch and GPU level information
• Automatic bottleneck detection
• Actionable performance
recommendations
• Data Scientist friendly lifecycle and tools
• TensorBoard Plugin - Chrome trace visualization
• OSS Kineto library - built on CUPTI
• Easy-to-use Python API
• VS Code integration
[Architecture diagram: the PyTorch Profiler gathers Python events and CPU (aten) operator traces from the PyTorch process, and CUDA activities via the OSS libkineto library on top of libCUPTI, the NVIDIA driver, and the GPUs; queued GPU ops and the resulting traces are exported to the TensorBoard Profiler plugin.]
T H E P Y T O R C H P R O F I L E R
https://pytorch.org/tutorials/recipes/recipes/profiler.html
import torch
import torchvision.models as models
import torch.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profiler.profile(record_shapes=True) as prof:
    with profiler.record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
T H E P Y T O R C H P R O F I L E R
Profiling API: Base Usage
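The base recipe above profiles CPU operators only; the hedged extension below (assuming a CUDA-capable machine) also records GPU kernel activity and ranks operators by GPU time:

import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity

model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224, device="cuda")

# Record both CPU operators and CUDA kernel activities.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    with record_function("model_inference"):
        model(inputs)

# Sort the summary by time spent on the GPU instead of the CPU.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))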
T H E P Y T O R C H P R O F I L E R
Profiling API: TensorBoard Plugin

import torch
import torchvision.models as models
import torch.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profiler.profile(
    record_shapes=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler('results')
) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
• When to trigger
• How many steps to profile
• Which activities to profile
• Results callable handler
• Extra metadata, e.g. shapes, stacks, memory
• Output options, e.g. Chrome tracing, TensorBoard (all of these appear in the sketch below)
T H E P Y T O R C H P R O F I L E R
Advanced
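A hedged sketch of how these knobs map onto the torch.profiler API; the schedule values, log directory, and loop are placeholders:

import torch
import torchvision.models as models
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

model = models.resnet18().cuda()
inputs = torch.randn(32, 3, 224, 224, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],    # which activities
    schedule=schedule(wait=1, warmup=1, active=3, repeat=2),     # when to trigger, how many steps
    on_trace_ready=tensorboard_trace_handler("./log/resnet18"),  # results handler (TensorBoard)
    record_shapes=True,     # extra metadata: input shapes
    profile_memory=True,    # extra metadata: memory usage
    with_stack=True,        # extra metadata: Python stacks
) as prof:
    for step in range(10):
        model(inputs)
        prof.step()         # tell the scheduler one step has finished

Alternatively, on_trace_ready can be any callable that receives the profiler, for example one that calls its export_chrome_trace(...) method to write a Chrome trace.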
T H E P Y T O R C H P R O F I L E R
D I S T R I B U T E D T R A I N I N G V I E W
V S C O D E D A T A W R A N G L E R
Timeline Tracing
T I M E L I N E T R A C E S : C P U + G P U A C T I V I T I E S
T I M E L I N E T R A C I N G
Chrome Trace Viewer: CPU and GPU timelines
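One hedged way to produce such a timeline yourself (model and file name are placeholders): export the profile as a Chrome trace JSON and load it in chrome://tracing or Perfetto:

import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity

model = models.resnet18().cuda()
inputs = torch.randn(32, 3, 224, 224, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(inputs)

# Writes a JSON timeline showing CPU operator spans and the GPU kernels
# they launched; open it in chrome://tracing or ui.perfetto.dev.
prof.export_chrome_trace("resnet18_trace.json")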
• Can be left in permanently – no performance overhead
T I M E L I N E T R A C I N G
T I M E L I N E T R A C I N G
See how CPU and GPU ops are connected
nvidia-smi shows 86% utilization, but only a fraction of the SMs are actually used by these kernels!
T I M E L I N E T R A C I N G
Inspect stats for individual activities
Looks much better after increasing the input sizes
T I M E L I N E T R A C I N G
Inspect stats for individual activities
Trace Analysis
E x a m p l e s f r o m M e t a w o r k l o a d s
Thanks to Lei Tian, Natalia Gimelshein, Lingyi Liu, Feng Shi & Zhicheng Yan for the examples
Issue:
1. Large periods of GPU inactivity
2. Trace does not show why
Solution:
1. Use record_function to reveal
bottlenecks on CPU
2. Parallelize CPU operations
3. Overlap CPU and GPU operations (see the sketch below)
temp = ""
num_substr = len(emb[k])
with record_function("## join_string {} ##".format(num_substr)):
    temp = ",".join(str(x) for x in emb[k])  # string concatenation
with record_function("## append_record_in_else ##"):
    records.append(f"{input_df.id[i + k]}\t{temp}\n")  # list append
T R A C E A N A L Y S I S
Anti-pattern: Long GPU idle time
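A hedged sketch of the second and third remedies; the dataset, model, batch size, and worker count are placeholders. DataLoader workers parallelize CPU-side preparation, while pinned memory and non-blocking copies let host-to-device transfers overlap with GPU compute:

import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"
model = torch.nn.Linear(1024, 10).to(device)
dataset = TensorDataset(torch.randn(10_000, 1024), torch.randint(0, 10, (10_000,)))

# num_workers parallelizes CPU-side preprocessing; pin_memory enables
# asynchronous host-to-device copies.
loader = DataLoader(dataset, batch_size=256, num_workers=4, pin_memory=True)

for x, y in loader:
    # non_blocking=True lets the copy overlap with GPU work already in flight.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    out = model(x)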
A F T E R
def on_step(self, task) -> None:
    ...
    with torch.no_grad():
        torch._foreach_mul_(
            self.ema_model_state_list, self.decay)
        torch._foreach_add_(
            self.ema_model_state_list,
            self.param_list,
            alpha=(1 - self.decay))
First issue:
• Exponential moving average hook function has a Python loop – CPU bottleneck
• Can rewrite using torch._foreach ops – the loop now runs on the GPU (see the sketch after the code)
EMA HOOK 100X FASTER
ITERATION TIME: 860MS -> 770MS
T R A C E A N A L Y S I S
Anti-pattern: Excessive CPU/GPU
interactions
B E F O R E
def on_step(self, task) -> None:
    ...
    with torch.no_grad():
        it = model_state_iterator(task.base_model)
        # iterate over every name & param
        for name, param in it:
            s = self.state.ema_model_state
            s[name] = (self.decay * s[name] +
                       (1 - self.decay) *
                       param.to(device=self.device))
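A minimal, self-contained sketch of the fused EMA update above, using dummy tensor lists and an assumed decay value; each _foreach_ call launches one fused kernel over the whole list instead of one kernel (and one Python round trip) per parameter:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
decay = 0.999  # placeholder decay value

# Dummy parameter and EMA tensor lists standing in for the model state above.
params = [torch.randn(1_000, device=device) for _ in range(10)]
ema = [p.clone() for p in params]

with torch.no_grad():
    torch._foreach_mul_(ema, decay)                    # ema *= decay
    torch._foreach_add_(ema, params, alpha=1 - decay)  # ema += (1 - decay) * params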
Second issue:
• Optimizer step uses a naïve implementation of RMSProp
• PyTorch provides an optimized multi-tensor version – using torch._foreach
• Switch to the optimized version! (see the note after the code)
OPTIMIZER 12X FASTER
ITERATION TIME: 770MS -> 600MS
T R A C E A N A L Y S I S
Anti-pattern: Excessive CPU/GPU
interactions
B E F O R E
def prepare(self, param_groups):
    self.optimizer = RMSpropTFV2Optimizer(
        param_groups,
        …

A F T E R

import torch.optim._multi_tensor as optim_mt

def prepare(self, param_groups):
    self.optimizer = optim_mt.RMSprop(
        param_groups,
        …
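A hedged note for newer PyTorch releases: the multi-tensor path used above through the private torch.optim._multi_tensor module is now exposed via the foreach flag on the standard optimizers. The model and hyperparameters below are placeholders:

import torch

model = torch.nn.Linear(128, 10)

# foreach=True selects the fused multi-tensor implementation of the update.
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, foreach=True)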
Third issue:
• Forward & backward pass dominated by SyncBatchNorm
• 84x SyncBatchNorm in the forward pass
• 3x ncclAllGather per SyncBatchNorm
• Another 2x ncclAllReduce per SyncBatchNorm in the backward pass
T R A C E A N A L Y S I S
Anti-pattern: Excessive CPU/GPU interactions
FORWARD PASS 1.5X FASTER
BACKWARD PASS 1.3X FASTER
ITERATION TIME: 600MS -> 450MS
T R A C E A N A L Y S I S
BERT PERFORMANCE OPTIMIZATION CASE STUDY
• From 2.4 req/s to 1,400+ req/s
• CPU inference
• torch.set_num_threads(1)
• Intel IPEX
• Quantization
• GPU inference on 1 T4 GPU
• model.half()
• DistilBERT
• Increase batch size
• Do not overpad
• Faster Transformer (a sketch of the GPU-side settings follows the table below)
Configuration                                  Throughput     P99
BERT, unoptimized, bs=1                        70.67 seq/s    20.44 ms
BERT, model.half(), bs=8                       359 seq/s      23.58 ms
DistilBERT, model.half(), bs=16                689 seq/s      22.8 ms
BERT, Faster Transformer                       885 seq/s      19.83 ms
DistilBERT, no padding, model.half(), bs=32    1423 seq/s     19.7 ms
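A hedged sketch of the GPU-side settings from the list above: FP16 weights via model.half(), a larger batch, and dynamic padding to the longest sequence in the batch ("do not overpad"). It assumes a Hugging Face transformers model; the checkpoint name and request texts are placeholders, and the deck does not specify the actual serving stack:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
model = model.half().cuda().eval()  # FP16 weights on the GPU

requests = ["example request text"] * 32  # bs=32, as in the fastest row above
# padding="longest" pads only to the longest sequence in this batch,
# not to a fixed maximum length.
batch = tokenizer(requests, padding="longest", truncation=True, return_tensors="pt")
batch = {k: v.cuda() for k, v in batch.items()}

with torch.inference_mode():
    logits = model(**batch).logits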
FUTURE
S u s t a i n a b l e A I
A I M O D E L G R O W T H
M O D E L D E P L O Y M E N T P H A S E S – P O W E R C O N S U M P T I O N
• Platform-level caching – 6.7x improvement
• GPU acceleration – unlocks 10.1x energy efficiency
• Algorithmic optimizations – 10x improvement
O P T I M I Z A T I O N S F O R C A R B O N F O O T P R I N T O F L M
1. Data Utilization Efficiency:
Data Scaling & Sampling, Data perishability
2. Experimentation and Training Efficiency:
NAS, HPO, Multi-Objective Optimizations,
Resource Efficient Architectures
3. Efficient Environment Scalable Infrastructure:
Carbon efficient scheduling, On-device Learning, …
4. Develop easy-to-adopt telemetry:
Measure and publish,
Carbon impact statement & model cards
S U S T A I N A B I L I T Y M I N D S E T
https://arxiv.org/pdf/2111.00364.pdf
Source: https://docs.cohere.ai/environmental-impact
• What’s new in PyTorch Profiler 1.9: https://pytorch.org/blog/pytorch-profiler-1.9-released/
• Introducing PyTorch Profiler:
https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/
• Profiler: https://pytorch.org/docs/stable/profiler.html
• Profiler Recipes: https://pytorch.org/tutorials/recipes/recipes/profiler.html
• VSCode TensorBoard support: https://devblogs.microsoft.com/python/python-in-visual-studio-code-february-2021-release/
• PyTorch Profiler Talk – PROFILING PYTORCH MODELS FOR NVIDIA GPUS:
https://gtc21.event.nvidia.com/media/Profiling%20PyTorch%20Models%20for%20NVIDIA%20GPUs%20%5BS31644%5D/1_nuwnw731
• Optimizing PyTorch Performance batch size with PyTorch Profiler: https://opendatascience.com/optimizing-pytorch-performance-batch-size-with-pytorch-profiler/
• Kubeflow PyTorch Samples: https://github.com/kubeflow/pipelines/tree/master/samples/contrib/pytorch-samples
• PyTorch Lightning Profiler example: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/profiler_example.py
• Sustainable AI Paper: https://arxiv.org/pdf/2111.00364.pdf
• Cohere.ai Environmental Impact model cards: https://docs.cohere.ai/environmental-impact
R E F E R E N C E S
Questions?
Contact:
Email: gchauhan@fb.com
Linkedin: https://www.linkedin.com/in/geetachauhan/
Thank You