TensorRT Survey
issue.hsu@gmail.com
2017
TensorRT
• NVIDIA TensorRT™ is a high-performance deep learning inference optimizer and runtime that
delivers low latency, high-throughput inference for deep learning applications.
• NVIDIA released TensorRT last year with the goal of accelerating deep learning inference for
production deployment.
Deploying a model with TensorRT
• UFF stands for Universal Framework Format, which is TensorRT’s internal format used to represent the network graph before running optimizations.
• TensorRT performs optimizations for specified parameters such as batch size, precision, and workspace memory for the target deployment GPU.
• The output of the TensorRT optimization is a runtime inference engine that can be serialized to disk.
• At runtime, load and deserialize a saved plan file to create a TensorRT engine object.
• A plan file includes not only the weights, but also the schedule for the kernels to execute the network.
TensorRT supported layers
• TensorRT supported layers
• Convolution
• LSTM and GRU
• Activation: ReLU, tanh, sigmoid
• Pooling: max and average
• Scaling
• Element wise operations
• LRN
• Fully-connected
• SoftMax
• Deconvolution
• TensorRT provides a Custom Layer API to enable you
to define your own custom layers that aren’t natively
supported
• These custom layers are defined using C++ to make it easy
to leverage highly optimized CUDA libraries like cuDNN
and cuBLAS
TensorRT Optimizations
• TensorRT Optimizations
• Layer and tensor fusion and elimination of unused layers
• FP16 and INT8 reduced precision calibration
• Target-specific autotuning
• Efficient memory reuse
• Multi-Stream Execution
• TensorRT performs these optimizations automatically under the hood for you.
• All you need to specify is the UFF inference graph to optimize, the inference batch size, the
amount of workspace GPU memory (used for CUDA kernel scratch space), and the target
inference precision.
Optimization 1: Layer & Tensor Fusion
• TensorRT parses the network computational graph and looks for opportunities to
perform graph optimizations.
• These graph optimizations do not change the underlying computation in the
graph: instead, they look to restructure the graph to perform the operations
much faster and more efficiently.
• TensorRT can also eliminate the concatenation layers in “concat” by preallocating output
buffers and writing into them in a strided fashion.
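A minimal sketch of the concat-elimination idea in plain Python, not TensorRT internals. The helper names (`write_scaled`, `concat_by_copy`, `concat_eliminated`) are hypothetical; the only point is that each producer layer writes straight into its own region of a preallocated output buffer, so the separate concatenation copy disappears while the result stays the same.

```python
def write_scaled(x, scale, out, offset):
    """Hypothetical layer kernel: writes scale * x into out, starting at offset."""
    for i, v in enumerate(x):
        out[offset + i] = scale * v


def concat_by_copy(x):
    """Baseline: each branch produces its own list, then concat copies both."""
    a = [2 * v for v in x]
    b = [-1 * v for v in x]
    return a + b  # the extra copy that a concat layer would perform


def concat_eliminated(x):
    """Preallocate the concat output; each branch writes into its own slice."""
    out = [0] * (2 * len(x))
    write_scaled(x, 2, out, 0)         # branch A fills out[0:n]
    write_scaled(x, -1, out, len(x))   # branch B fills out[n:2n]
    return out


x = [1, 2, 3]
print(concat_eliminated(x))
```

Both functions compute the same values, which matches the statement above that graph optimizations do not change the underlying computation.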
Optimization 2: FP16 and INT8 Precision Calibration
• Most deep learning frameworks train neural networks in full 32-bit precision (FP32).
• Once the model is fully trained, inference computations can use half precision FP16 or even INT8 tensor operations, since
gradient backpropagation is not required for inference.
• Using lower precision results in smaller model size, lower memory utilization and latency, and higher throughput.
• TensorRT can deploy models in FP32, FP16 and INT8
• To quantize full-precision information into INT8 while minimizing accuracy loss, TensorRT must perform a
process called calibration to determine how best to represent the weights and activations as 8-bit integers.
• The calibration step requires you to provide TensorRT with a representative sample of the input training data.
• No additional fine tuning or retraining of the model is necessary, and you don’t need to have access to the entire training
dataset.
• Calibration is a completely automated and parameter-free method for converting FP32 to INT8.
Optimization 3: Kernel Auto-tuning
• During the optimization phase TensorRT also chooses from hundreds
of specialized kernels, many of them hand-tuned and optimized for a
range of parameters and target platforms.
• As an example, there are several different algorithms to do convolutions.
• TensorRT will pick the implementation from a library of kernels that delivers
the best performance for the target GPU, input data size, filter size, tensor
layout, batch size and other parameters.
• This ensures that the deployed model is performance tuned for the
specific deployment platform as well as for the specific neural
network being deployed.
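The selection step can be illustrated with a toy autotuner. This is only a sketch under stated assumptions: the two dot-product implementations and the `autotune` helper are hypothetical stand-ins for TensorRT's kernel library, and wall-clock timing on the real arguments stands in for whatever cost measurement TensorRT actually uses.

```python
import time


def dot_loop(a, b):
    """Candidate 1: explicit accumulation loop."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_builtin(a, b):
    """Candidate 2: generator expression fed to sum()."""
    return sum(x * y for x, y in zip(a, b))


def autotune(candidates, args, repeats=5):
    """Time every candidate on the actual arguments and return the fastest."""
    best_fn, best_time = None, float("inf")
    for fn in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_fn, best_time = fn, elapsed
    return best_fn


a = [float(i) for i in range(1000)]
b = [1.0] * 1000
best = autotune([dot_loop, dot_builtin], (a, b))
print(best.__name__, best(a, b))
```

Which candidate wins depends on the machine and the input size, which is why the choice is made on the target platform rather than fixed in advance.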
Optimization 4: Dynamic Tensor Memory
• TensorRT reduces memory footprint and improves memory reuse by
allocating memory for each tensor only for the duration of its usage,
avoiding memory allocation overhead for fast and efficient execution.
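One way to picture this is a greedy buffer planner driven by tensor lifetimes. The sketch below is an assumption about the general technique, not TensorRT's actual allocator; `plan_buffers` and the tensor names are hypothetical. Each tensor is described by its size and the first and last layer index that uses it, and a buffer is reused as soon as its previous occupant is no longer needed.

```python
def plan_buffers(tensors):
    """Assign tensors to buffers, reusing buffers whose tensor is already dead.

    tensors: list of (name, size, first_use, last_use) with layer indices.
    Returns (mapping name -> buffer index, list of buffer sizes).
    """
    buffers = []      # each entry: [size, last_use of the current occupant]
    assignment = {}
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        for idx, buf in enumerate(buffers):
            if buf[1] < first and buf[0] >= size:  # free and large enough
                buf[1] = last
                assignment[name] = idx
                break
        else:
            buffers.append([size, last])           # nothing reusable: allocate
            assignment[name] = len(buffers) - 1
    return assignment, [b[0] for b in buffers]


tensors = [
    ("conv1_out", 100, 0, 1),
    ("relu1_out", 100, 1, 2),
    ("conv2_out", 100, 2, 3),
    ("pool_out", 50, 3, 4),
]
assignment, sizes = plan_buffers(tensors)
print(assignment, sizes)
```

In this toy chain the four tensors need 350 units if allocated separately, but only two 100-unit buffers when lifetimes are taken into account.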
Optimization 5: Multi-Stream Execution
• Scales to multiple input streams, by processing them in parallel using
the same model and weights
TensorRT Run-Time Inference
• You’re now ready to deploy your application with TensorRT
• You’ve so far imported a trained TensorFlow model into TensorRT, and performed a number of
optimizations to generate a runtime engine.
• And you’ve serialized this engine to disk as an engine plan file.
• You performed all these steps offline, and only once prior to deployment.
• The next step is to load serialized models into your runtime environment and
perform inference on new data.
• TensorRT Lite API is a highly abstracted
interface that handles standard tasks like
creating the logger, deserializing the engine
from a plan file to create a runtime, and
allocating GPU memory for the engine.
• During inference, it also manages data
transfer to and from GPU automatically, so
you can just create an engine and start
processing data.
More about INT8
Quantization
• It’s always a tradeoff between range and precision of the INT8
representation.
• Minimize information loss, since FP32 → INT8 is just re-encoding information
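The range/precision tradeoff can be seen in a small sketch of symmetric linear quantization with saturation. The functions `quantize` and `dequantize` and the sample values are illustrative assumptions: values in [-threshold, threshold] are mapped to [-127, 127], and anything beyond the threshold is clipped.

```python
def quantize(x, threshold):
    """Symmetric INT8 quantization: map [-threshold, threshold] to [-127, 127]."""
    scale = 127.0 / threshold
    q = round(x * scale)
    return max(-127, min(127, q))  # saturate values beyond the threshold


def dequantize(q, threshold):
    """Map an INT8 code back to its approximate FP32 value."""
    return q * threshold / 127.0


values = [0.01, 0.5, 1.0, 8.0]
for threshold in (8.0, 1.0):
    restored = [dequantize(quantize(v, threshold), threshold) for v in values]
    print(threshold, restored)
```

With the large threshold the outlier 8.0 survives but the small value 0.01 collapses to zero; with the small threshold 0.01 is still representable but 8.0 saturates to 1.0.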
How to optimize threshold selection?
• “Relative Entropy” of two encodings
• INT8 model encodes the same information as the original FP32 model.
• We want to minimize loss of information.
• Loss of information is measured by Kullback-Leibler divergence (AKA relative
entropy or information divergence).
• P, Q - two discrete probability distributions.
• KL_divergence(P,Q):= SUM(P[i] * log(P[i] / Q[i] ), i)
• Intuition: KL divergence measures the amount of information lost when
approximating a given encoding.
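The formula above translates directly into a few lines of Python. This is a plain sketch using the natural logarithm; skipping terms where P[i] is zero is the usual convention and is an assumption here, since the slide does not spell it out.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_i P[i] * log(P[i] / Q[i]) for discrete distributions.

    Terms with P[i] == 0 contribute nothing; Q[i] must be > 0 wherever P[i] > 0.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += pi * math.log(pi / qi)
    return total


p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, p))  # identical distributions lose no information
print(kl_divergence(p, q))
```

Note that the measure is not symmetric: KL_divergence(P, Q) and KL_divergence(Q, P) generally differ, so P should be the reference (FP32) distribution.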
Solution: Calibration
• Calibration Dataset
  • Representative.
  • Diverse.
  • Ideally a subset of validation dataset.
  • 1000s of samples
• Calibration
  • Run FP32 inference on Calibration Dataset.
  • For each Layer:
    • collect histograms of activations.
    • generate many quantized distributions with different saturation thresholds.
    • pick threshold which minimizes KL_divergence(ref_distr, quant_distr).
  • Entire process takes a few minutes on a typical desktop workstation.
INT8 workflow in TensorRT
• You will need:
  • Model trained in FP32.
  • Calibration dataset.
• TensorRT will:
  • Run inference in FP32 on calibration dataset.
  • Collect required statistics.
  • Run calibration algorithm → optimal scaling factors.
  • Quantize FP32 weights → INT8.
  • Generate “CalibrationTable” and INT8 execution engine.
Entropy Calibration - pseudocode
Input: FP32 histogram H with 2048 bins: bin[ 0 ], …, bin[ 2047 ]
For i in range( 128 , 2048 ):
    P = [ bin[ 0 ] , ..., bin[ i-1 ] ] // reference_distribution
    outliers_count = sum( bin[ i ] , bin[ i+1 ] , … , bin[ 2047 ] )
    P[ i-1 ] += outliers_count
    P /= sum(P) // normalize distribution P
    Q = quantize [ bin[ 0 ], …, bin[ i-1 ] ] into 128 levels // candidate_distribution
    expand Q to ‘ i ’ bins
    Q /= sum(Q) // normalize distribution Q
    divergence[ i ] = KL_divergence( P, Q)
End For
Find index ‘m’ for which divergence[ m ] is minimal
threshold = ( m + 0.5 ) * ( width of a bin )
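A runnable sketch of the pseudocode above in plain Python. Several details are assumptions rather than TensorRT's implementation: the histogram size and the number of levels are parameters so the routine can be tried on a tiny input, bins are grouped with integer boundaries `j * n // levels` when the candidate length is not a multiple of the level count, and a candidate whose Q is zero where P is not is treated as infinitely divergent. All function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q); infinite if Q has no mass in a bin where P does."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi <= 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def normalize(dist):
    s = sum(dist)
    return [v / s for v in dist]


def quantize_and_expand(bins, levels):
    """Merge the bins into `levels` groups, then spread each group's total
    evenly over its originally non-empty bins (empty bins stay empty)."""
    n = len(bins)
    q = [0.0] * n
    for j in range(levels):
        lo, hi = j * n // levels, (j + 1) * n // levels
        nonzero = [k for k in range(lo, hi) if bins[k] > 0]
        for k in nonzero:
            q[k] = sum(bins[lo:hi]) / len(nonzero)
    return q


def entropy_threshold(hist, bin_width, levels=128):
    """Return the saturation threshold that minimizes KL(P || Q)."""
    best_i, best_div = None, math.inf
    for i in range(levels, len(hist)):
        ref = list(hist[:i])
        ref[i - 1] += sum(hist[i:])          # fold outliers into the last bin
        cand = quantize_and_expand(hist[:i], levels)
        if sum(cand) == 0:
            continue                         # nothing to quantize yet
        div = kl_divergence(normalize(ref), normalize(cand))
        if div < best_div:                   # keep the first minimum
            best_i, best_div = i, div
    if best_i is None:
        raise ValueError("no usable threshold found")
    return (best_i + 0.5) * bin_width


# Toy histogram: all activations fall into the first 8 of 16 bins.
hist = [10] * 8 + [0] * 8
print(entropy_threshold(hist, bin_width=1.0, levels=4))
```

On the toy histogram the search stops widening the range as soon as all the mass is covered, because cutting earlier forces outliers into the last bin and raises the divergence.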
Candidate distribution Q
• KL_divergence(P, Q) requires that len(P) == len(Q)
• Candidate distribution Q is generated after merging ‘ i ’ bins from bin[0] to bin[i-1] into 128 bins
• Afterwards Q has to be ‘expanded’ again into ‘i’ bins
• Here is a simple example:
The reference distribution P consists of 8 bins, and we want to quantize it into 2 bins:
P = [ 1, 0, 2, 3, 5, 3, 1, 7 ]
We merge into 2 bins (8 / 2 = 4 consecutive bins are merged into one bin):
[ 1 + 0 + 2 + 3, 5 + 3 + 1 + 7 ] = [ 6, 16 ]
Then we proportionally expand back to 8 bins, preserving the empty bins from the original distribution P:
Q = [ 6/3, 0, 6/3, 6/3, 16/4, 16/4, 16/4, 16/4 ] = [ 2, 0, 2, 2, 4, 4, 4, 4 ]
Now we normalize both distributions, after which we can compute KL_divergence:
P /= sum(P)
Q /= sum(Q)
result = KL_divergence(P, Q)
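The worked example can be checked with a short script. The helper `merge_and_expand` is a hypothetical name, and the sketch assumes the number of bins is an exact multiple of the number of quantized bins, as it is in the example.

```python
import math


def merge_and_expand(p, num_quantized_bins):
    """Merge consecutive bins of p into num_quantized_bins groups, then expand
    back to len(p) bins, sharing each group's total among its non-empty bins."""
    group = len(p) // num_quantized_bins  # assumes an exact multiple
    q = []
    for start in range(0, len(p), group):
        chunk = p[start:start + group]
        total = sum(chunk)
        nonzero = sum(1 for v in chunk if v != 0)
        q.extend(total / nonzero if v != 0 else 0 for v in chunk)
    return q


def kl_divergence(p, q):
    """KL(P || Q) over bins where P is non-zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [1, 0, 2, 3, 5, 3, 1, 7]
Q = merge_and_expand(P, 2)
print(Q)

P_norm = [v / sum(P) for v in P]
Q_norm = [v / sum(Q) for v in Q]
result = kl_divergence(P_norm, Q_norm)
print(result)
```

Preserving the empty bin matters: if the zero in P were given a share of the merged count, Q would carry mass where the reference distribution has none.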
INT8 conv kernel - pseudocode
// I8 input tensors: I8_input, I8_weights, I8 output tensors: I8_output
// F32 bias (original bias from the F32 model)
// F32 scaling factors: input_scale, output_scale, weights_scale[K]
I32_gemm_out = I8_input * I8_weights // Compute INT8 GEMM (DP4A)
F32_gemm_out = (float)I32_gemm_out // Cast I32 GEMM output to F32 float
// At this point we have F32_gemm_out which is scaled by ( input_scale * weights_scale[K] ),
// but to store the final result in int8 we need to have scale equal to "output_scale", so we have to rescale:
// (this multiplication is done in F32, *_gemm_out arrays are in NCHW format)
for i in 0, ... K-1:
    rescaled_F32_gemm_out[ :, i, :, :] = F32_gemm_out[ :, i, :, :] * [ output_scale / (input_scale * weights_scale[ i ] ) ]
// Add bias, to perform addition we have to rescale original F32 bias so that it's scaled with "output_scale"
rescaled_F32_gemm_out_with_bias = rescaled_F32_gemm_out + output_scale * bias
// Perform ReLU (in F32)
F32_result = ReLU(rescaled_F32_gemm_out_with_bias)
// Convert to INT8 and save to global
I8_output = Saturate( Round_to_nearest_integer( F32_result ) )
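The same steps can be traced on a tiny case in plain Python. This is a sketch, not the CUDA kernel: it handles a single output position of a 1x1 convolution (one dot product per output channel), assumes scales of the form 127 / threshold, and uses hypothetical names and made-up values throughout.

```python
def saturate(v):
    """Clamp to the symmetric INT8 range used in the slides."""
    return max(-127, min(127, v))


def to_int8(x, scale):
    """Quantize one FP32 value with the given scale."""
    return saturate(round(x * scale))


def int8_conv1x1_relu(i8_input, i8_weights, bias,
                      input_scale, weights_scale, output_scale):
    """i8_input: C int8 values; i8_weights: K rows of C int8 values."""
    i8_output = []
    for k, row in enumerate(i8_weights):
        # INT8 multiply with INT32 accumulation (the role DP4A plays on GPU)
        i32_acc = sum(x * w for x, w in zip(i8_input, row))
        f32 = float(i32_acc)
        # Accumulator is scaled by input_scale * weights_scale[k]; rescale
        rescaled = f32 * (output_scale / (input_scale * weights_scale[k]))
        with_bias = rescaled + output_scale * bias[k]  # bias in output scale
        relu = max(0.0, with_bias)
        i8_output.append(saturate(round(relu)))
    return i8_output


# Assumed thresholds: 1.0 for input and output, 1.0 and 0.5 for the two filters.
input_scale = 127.0 / 1.0
weights_scale = [127.0 / 1.0, 127.0 / 0.5]
output_scale = 127.0 / 1.0

i8_input = [to_int8(v, input_scale) for v in [0.5, -1.0]]
i8_weights = [
    [to_int8(w, weights_scale[0]) for w in [1.0, 0.5]],
    [to_int8(w, weights_scale[1]) for w in [-0.5, 0.25]],
]
bias = [0.25, 0.0]

i8_output = int8_conv1x1_relu(i8_input, i8_weights, bias,
                              input_scale, weights_scale, output_scale)
print(i8_output, [q / output_scale for q in i8_output])
```

For these inputs the FP32 reference result is 0.25 for the first channel and 0 for the second after ReLU, and the dequantized INT8 output lands within one quantization step of that.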
Results - Accuracy and Performance
• All optimizations enabled.
• ILSVRC2012 validation dataset, batch =
25 images.
• Accuracy was measured on 500 batches
which were not used for the calibration.
Open challenges / improvements
• Unsigned int8 for activations after
ReLU.
• Fine tuning of saturation thresholds.
• A better solution in tensorflow with
asymmetric quantization for above two?
• RNNs → open research problem.
• Dynamic Compute Graph
• Expose API for accepting custom, user
provided scale factors.
Reference
• TensorRT 3: Faster TensorFlow Inference and Volta Support
• 8-bit Inference with TensorRT
• Using TensorRT to Optimize Caffe Models in Python
• How to Quantize Neural Networks with TensorFlow
Summary of NN Compiler
Provider | Framework | Graph opt. | Backend opt. | INT8 support | Runtime inference | Format | Open source | Target
Nvidia | Caffe / Tensorflow | TensorRT | TensorRT | TensorRT Precision Calibration | TensorRT runtime engine | NCHW | No | GPU/NVDLA
Google | Tensorflow | TF lite (toco) | NNAPI ??? | Proper quantized training is necessary before conversion | TF lite interpreter | NHWC | Yes | CPU
Amazon | MxNet | NNVM | TVM | mxnet.ndarray.contrib.quantize | TVM runtime | Depends on Target | Yes | CPU/GPU/…
• Generally, [NHWC] is the default for most frameworks (like TensorFlow), and [NCHW] is the optimal format to use when training on NVIDIA GPUs using cuDNN.
• TF Lite quantized conversion expects the models to be annotated with "fake quantization" nodes that record the dynamic range of the tensors, which means that
proper quantized training is necessary before conversion.
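The difference between the two layouts is only the order in which the four indices are flattened into memory. A small sketch, with hypothetical helper names, that computes the linear offset of the same element under each layout:

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Linear offset of element (n, c, h, w) when W varies fastest."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Linear offset of the same element when the channel index varies fastest."""
    return ((n * H + h) * W + w) * C + c


# A tensor with 3 channels and a 2x2 spatial size.
C, H, W = 3, 2, 2
print(offset_nchw(0, 1, 0, 1, C, H, W))
print(offset_nhwc(0, 1, 0, 1, C, H, W))
```

In NCHW all values of one channel are contiguous, while in NHWC all channels of one pixel are contiguous, so converting between frameworks with different defaults requires a transpose.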
END
Backup
Model conversion
• https://github.com/ysh329/deep-learning-model-convertor
• https://github.com/Microsoft/MMdnn
• https://github.com/hahnyuan/nn_tools
Graph and Target Optimizations
• NNVM
• https://github.com/dmlc/nnvm/tree/master/src/pass
• https://github.com/dmlc/nnvm/tree/master/src/compiler
• TF lite
• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/toco
• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/optimize_for_inference.py
• https://www.tensorflow.org/performance/
• TensorRT
• Layer & Tensor Fusion
• FP16 and INT8 Precision Calibration
• Kernel Auto-tuning
• Dynamic Tensor Memory
• Multi-Stream Execution
• All are offline optimized
Quantization reference code
• Solver::Solve()
• net_->Forward(&loss);
• Net::ForwardFromTo()
• Solver::Test()
• test_net->Forward(&iter_loss);
• Net::ForwardFromTo()
Net::ForwardFromTo() {
  StartQuantization(); // Add and set QuantizationParams in each layer
  for (int i = start; i <= end; ++i) {
    float layer_loss = layers_[i]->Forward(bottom_vecs_[i], top_vecs_[i]);
    loss += layer_loss;
  }
  FinishQuantization(); // UpdateQuantizationRangeInLayers
  return loss;
}
NV TensorRT container
• https://ngc.nvidia.com/registry/nvidia-tensorrt
• https://github.com/NVIDIA/nvidia-docker
• https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/
• https://devblogs.nvidia.com/parallelforall/tensorrt-container/
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 

TensorRT survey

  • 2. TensorRT • NVIDIA TensorRT™ is a high-performance deep learning inference optimizer and runtime that delivers low-latency, high-throughput inference for deep learning applications. • NVIDIA released TensorRT last year with the goal of accelerating deep learning inference for production deployment.
  • 3. Deploying a model with TensorRT • UFF stands for Universal Framework Format, TensorRT’s internal format for representing the network graph before optimizations are run. • TensorRT performs optimizations for specified parameters such as batch size, precision, and workspace memory for the target deployment GPU. • The output of the TensorRT optimization is a runtime inference engine that can be serialized to disk. • At run time, a saved plan file is loaded and deserialized to create a TensorRT engine object. • A plan file includes not only the weights but also the schedule for the kernels that execute the network.
  • 4. TensorRT supported layers • Convolution • LSTM and GRU • Activation: ReLU, tanh, sigmoid • Pooling: max and average • Scaling • Element-wise operations • LRN • Fully-connected • SoftMax • Deconvolution • TensorRT provides a Custom Layer API that lets you define your own layers when they aren’t natively supported. • These custom layers are defined in C++ to make it easy to leverage highly optimized CUDA libraries like cuDNN and cuBLAS.
  • 5. TensorRT Optimizations • Layer and tensor fusion and elimination of unused layers • FP16 and INT8 reduced precision calibration • Target-specific autotuning • Efficient memory reuse • Multi-Stream Execution • TensorRT performs these optimizations automatically under the hood. • All you need to specify is the UFF inference graph to optimize, the inference batch size, the amount of workspace GPU memory (used for CUDA kernel scratch space), and the target inference precision.
  • 6. Optimization 1: Layer & Tensor Fusion • TensorRT parses the network computational graph and looks for opportunities to perform graph optimizations. • These graph optimizations do not change the underlying computation in the graph: instead, they restructure the graph so that the operations run faster and more efficiently. • TensorRT can also eliminate the concatenation layers (“concat”) by preallocating output buffers and writing into them in a strided fashion.
  • 7. Optimization 2: FP16 and INT8 Precision Calibration • Most deep learning frameworks train neural networks in full 32-bit precision (FP32). • Once the model is fully trained, inference computations can use half-precision FP16 or even INT8 tensor operations, since gradient backpropagation is not required for inference. • Using lower precision results in smaller model size, lower memory utilization and latency, and higher throughput. • TensorRT can deploy models in FP32, FP16 and INT8. • To quantize full-precision information into INT8 while minimizing accuracy loss, TensorRT must perform a process called calibration to determine how best to represent the weights and activations as 8-bit integers. • The calibration step requires you to provide TensorRT with a representative sample of the input training data. • No additional fine-tuning or retraining of the model is necessary, and you don’t need access to the entire training dataset. • Calibration is a completely automated and parameter-free method for converting FP32 to INT8.
  • 8. Optimization 3: Kernel Auto-tuning • During the optimization phase TensorRT also chooses from hundreds of specialized kernels, many of them hand-tuned and optimized for a range of parameters and target platforms. • As an example, there are several different algorithms for computing convolutions. • TensorRT picks the implementation from a library of kernels that delivers the best performance for the target GPU, input data size, filter size, tensor layout, batch size and other parameters. • This ensures that the deployed model is performance-tuned for the specific deployment platform as well as for the specific neural network being deployed.
  • 9. Optimization 4: Dynamic Tensor Memory • TensorRT reduces memory footprint and improves memory reuse by allocating memory for each tensor only for the duration of its usage, avoiding memory allocation overhead for fast and efficient execution.
  • 10. Optimization 5: Multi-Stream Execution • Scales to multiple input streams by processing them in parallel using the same model and weights.
  • 11. TensorRT Run-Time Inference • You’re now ready to deploy your application with TensorRT. • So far you’ve imported a trained TensorFlow model into TensorRT and performed a number of optimizations to generate a runtime engine. • You’ve serialized this engine to disk as an engine plan file. • You performed all these steps offline, and only once, prior to deployment. • The next step is to load serialized models into your runtime environment and perform inference on new data. • The TensorRT Lite API is a highly abstracted interface that handles standard tasks like creating the logger, deserializing the engine from a plan file to create a runtime, and allocating GPU memory for the engine. • During inference, it also manages data transfer to and from the GPU automatically, so you can just create an engine and start processing data.
  • 13. Quantization • There is always a tradeoff between the range and the precision of the INT8 representation. • The goal is to minimize information loss, since FP32 → INT8 is just re-encoding information.
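The range-versus-precision tradeoff can be illustrated with a minimal sketch. It assumes a symmetric mapping in which a saturation threshold T is mapped to ±127 and anything beyond it is clipped; the function name `quantize_int8` and the ±127 convention are assumptions for illustration, not something the slides spell out.

```python
def quantize_int8(values, threshold):
    """Symmetric linear quantization sketch (assumed convention: |threshold| -> 127).

    A larger threshold covers more range but gives coarser steps;
    a smaller threshold gives finer steps but saturates more values.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    scale = 127.0 / threshold  # INT8 units per FP32 unit
    quantized = []
    for v in values:
        q = round(v * scale)  # note: Python rounds halves to the nearest even integer
        quantized.append(max(-127, min(127, q)))  # saturate outliers
    return quantized, scale


# With threshold 1.0, the value 2.0 saturates at 127.
codes, scale = quantize_int8([0.0, 1.0, -1.0, 2.0], threshold=1.0)
```

Choosing the threshold is exactly the problem that calibration addresses.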
  • 14. How to optimize threshold selection? • “Relative Entropy” of two encodings • The INT8 model encodes the same information as the original FP32 model. • We want to minimize loss of information. • Loss of information is measured by the Kullback-Leibler divergence (also known as relative entropy or information divergence). • P, Q: two discrete probability distributions. • KL_divergence(P, Q) := SUM( P[i] * log( P[i] / Q[i] ), i ) • Intuition: KL divergence measures the amount of information lost when approximating a given encoding.
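The formula above translates directly into a few lines of Python. This is a minimal sketch: the handling of zero entries (terms with P[i] = 0 contribute nothing, and Q[i] = 0 with P[i] > 0 yields infinity) is a common convention assumed here, not something stated on the slide.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_i P[i] * log(P[i] / Q[i]) for normalized discrete distributions."""
    if len(p) != len(q):
        raise ValueError("P and Q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # assumed convention: 0 * log(0 / q) = 0
        if q_i == 0.0:
            return math.inf  # Q assigns no mass where P has some
        total += p_i * math.log(p_i / q_i)
    return total


same = kl_divergence([0.5, 0.5], [0.5, 0.5])     # identical encodings lose nothing
forward = kl_divergence([0.9, 0.1], [0.5, 0.5])
backward = kl_divergence([0.5, 0.5], [0.9, 0.1])  # differs: KL is not symmetric
```

Because the measure is asymmetric, the reference (FP32) distribution goes first and the quantized candidate second.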
  • 15. Solution: Calibration • Calibration Dataset • Representative. • Diverse. • Ideally a subset of the validation dataset. • 1000s of samples. • Calibration • Run FP32 inference on the Calibration Dataset. • For each layer: • collect histograms of activations. • generate many quantized distributions with different saturation thresholds. • pick the threshold which minimizes KL_divergence(ref_distr, quant_distr). • The entire process takes a few minutes on a typical desktop workstation.
  • 16. INT8 workflow in TensorRT • You will need: • Model trained in FP32. • Calibration dataset. • TensorRT will: • Run inference in FP32 on the calibration dataset. • Collect required statistics. • Run calibration algorithm → optimal scaling factors. • Quantize FP32 weights → INT8. • Generate “CalibrationTable” and INT8 execution engine.
  • 17. Entropy Calibration - pseudocode
      Input: FP32 histogram H with 2048 bins: bin[ 0 ], …, bin[ 2047 ]
      For i in range( 128 , 2048 ):
          P = [ bin[ 0 ] , ..., bin[ i-1 ] ]  // reference_distribution
          outliers_count = sum( bin[ i ] , bin[ i+1 ] , … , bin[ 2047 ] )
          P[ i-1 ] += outliers_count
          P /= sum(P)  // normalize distribution P
          Q = quantize [ bin[ 0 ], …, bin[ i-1 ] ] into 128 levels  // candidate_distribution
          expand Q to ‘ i ’ bins
          Q /= sum(Q)  // normalize distribution Q
          divergence[ i ] = KL_divergence( P, Q )
      End For
      Find index ‘m’ for which divergence[ m ] is minimal
      threshold = ( m + 0.5 ) * ( width of a bin )
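The pseudocode can be turned into a small runnable sketch. Several details are assumptions made for illustration rather than TensorRT's actual implementation: the helper names, the integer bin boundaries used when `i` is not a multiple of the number of levels, and the treatment of a zero in Q as infinite divergence. The histogram size and level count are parameters so the sketch can be tried on small inputs.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q); zero-probability handling is an assumed convention."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


def quantize_and_expand(bins, num_levels):
    """Merge bins into num_levels groups, then spread each group's mass
    evenly over its non-empty source bins (empty bins stay empty)."""
    n = len(bins)
    q = [0.0] * n
    for level in range(num_levels):
        start = level * n // num_levels        # assumed group boundaries
        stop = (level + 1) * n // num_levels
        chunk = bins[start:stop]
        nonzero = sum(1 for v in chunk if v != 0)
        if nonzero == 0:
            continue
        share = sum(chunk) / nonzero
        for j in range(start, stop):
            if bins[j] != 0:
                q[j] = share
    return q


def entropy_threshold(hist, bin_width, num_levels=128):
    """Return (m + 0.5) * bin_width for the m minimizing KL(P, Q)."""
    best_m, best_div = None, math.inf
    for i in range(num_levels, len(hist)):
        p = [float(v) for v in hist[:i]]       # reference distribution
        p[i - 1] += sum(hist[i:])              # fold outliers into the last bin
        q = quantize_and_expand(hist[:i], num_levels)  # candidate distribution
        p_sum, q_sum = sum(p), sum(q)
        if p_sum == 0 or q_sum == 0:
            continue
        p = [v / p_sum for v in p]
        q = [v / q_sum for v in q]
        div = kl_divergence(p, q)
        if div < best_div:
            best_m, best_div = i, div
    if best_m is None:
        raise ValueError("no candidate with finite divergence")
    return (best_m + 0.5) * bin_width


# Toy histogram: 16 bins quantized into 4 levels instead of 2048 into 128.
toy_hist = [10, 9, 8, 7, 6, 5, 4, 3, 2, 2, 1, 1, 1, 1, 1, 1]
threshold = entropy_threshold(toy_hist, bin_width=0.5, num_levels=4)
```

With the slide's settings the call would be `entropy_threshold(hist, bin_width)` on a 2048-bin histogram of absolute activation values.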
  • 18. Candidate distribution Q • KL_divergence(P, Q) requires that len(P) == len(Q). • Candidate distribution Q is generated by merging the ‘i’ bins from bin[0] to bin[i-1] into 128 bins. • Afterwards Q has to be ‘expanded’ again into ‘i’ bins. • Here is a simple example: a reference distribution P consisting of 8 bins, which we want to quantize into 2 bins:
      P = [ 1, 0, 2, 3, 5, 3, 1, 7 ]
    We merge into 2 bins (8 / 2 = 4 consecutive bins are merged into one bin):
      [ 1 + 0 + 2 + 3 , 5 + 3 + 1 + 7 ] = [ 6, 16 ]
    Then we proportionally expand back to 8 bins, preserving empty bins from the original distribution P:
      Q = [ 6/3, 0, 6/3, 6/3, 16/4, 16/4, 16/4, 16/4 ] = [ 2, 0, 2, 2, 4, 4, 4, 4 ]
    Now we normalize both distributions, after which we can compute the KL divergence:
      P /= sum(P)
      Q /= sum(Q)
      result = KL_divergence(P, Q)
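The merge-and-expand step in this example can be checked with a short sketch. The function name `merge_and_expand` is hypothetical, and the sketch assumes the number of bins divides evenly by the number of levels, as it does in the 8-into-2 example.

```python
def merge_and_expand(p, num_levels):
    """Merge len(p) bins into num_levels bins, then expand back to len(p) bins.

    Each merged total is split evenly across the non-empty source bins,
    so bins that were empty in P stay empty in Q.
    """
    if len(p) % num_levels != 0:
        raise ValueError("this sketch assumes len(p) is a multiple of num_levels")
    group = len(p) // num_levels
    q = []
    for start in range(0, len(p), group):
        chunk = p[start:start + group]
        total = sum(chunk)
        nonzero = sum(1 for v in chunk if v != 0)
        for v in chunk:
            q.append(total / nonzero if v != 0 else 0.0)
    return q


P = [1, 0, 2, 3, 5, 3, 1, 7]
Q = merge_and_expand(P, 2)  # [2.0, 0.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0]
```

Note that merging and expanding preserves the total mass: both P and Q sum to 22 before normalization.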
  • 19. INT8 conv kernel - pseudocode
      // I8 input tensors: I8_input, I8_weights, I8 output tensors: I8_output
      // F32 bias (original bias from the F32 model)
      // F32 scaling factors: input_scale, output_scale, weights_scale[K]
      I32_gemm_out = I8_input * I8_weights  // Compute INT8 GEMM (DP4A)
      F32_gemm_out = (float)I32_gemm_out    // Cast I32 GEMM output to F32 float
      // At this point we have F32_gemm_out which is scaled by ( input_scale * weights_scale[K] ),
      // but to store the final result in int8 we need to have scale equal to "output_scale", so we have to rescale:
      // (this multiplication is done in F32, *_gemm_out arrays are in NCHW format)
      for i in 0, ... K-1:
          rescaled_F32_gemm_out[ :, i, :, : ] = F32_gemm_out[ :, i, :, : ] * [ output_scale / ( input_scale * weights_scale[ i ] ) ]
      // Add bias, to perform addition we have to rescale original F32 bias so that it's scaled with "output_scale"
      rescaled_F32_gemm_out_with_bias = rescaled_F32_gemm_out + output_scale * bias
      // Perform ReLU (in F32)
      F32_result = ReLU( rescaled_F32_gemm_out_with_bias )
      // Convert to INT8 and save to global
      I8_output = Saturate( Round_to_nearest_integer( F32_result ) )
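The rescaling arithmetic can be followed on a tiny example. The sketch below replaces the convolution with a single dot product per output channel and uses plain Python integers in place of DP4A; the function name `int8_dense` and the sample values are illustrative assumptions, and each scale is taken to mean "INT8 units per FP32 unit".

```python
def int8_dense(i8_input, i8_weights, bias, input_scale, weights_scale, output_scale):
    """One INT8 dot product per output channel k, following the slide's steps."""
    i8_output = []
    for k, w_row in enumerate(i8_weights):
        # Integer multiply-accumulate (stands in for the I32 GEMM output).
        i32_acc = sum(x * w for x, w in zip(i8_input, w_row))
        # Cast to float and move from (input_scale * weights_scale[k]) to output_scale.
        rescaled = float(i32_acc) * (output_scale / (input_scale * weights_scale[k]))
        # The original FP32 bias must be expressed in output_scale units too.
        with_bias = rescaled + output_scale * bias[k]
        # ReLU in FP32, then round and saturate to the INT8 range.
        relu = max(0.0, with_bias)
        i8_output.append(max(-128, min(127, round(relu))))
    return i8_output


result = int8_dense(
    i8_input=[10, 20],
    i8_weights=[[1, 2], [-3, -1]],
    bias=[0.5, 0.0],
    input_scale=10.0,
    weights_scale=[2.0, 1.0],
    output_scale=4.0,
)
```

For channel 0 the accumulator is 50, rescaled to 10.0, and the bias adds 4.0 * 0.5 = 2.0; channel 1 is negative before ReLU and is clamped to zero.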
  • 20. Results - Accuracy and Performance • All optimizations enabled. • ILSVRC2012 validation dataset, batch = 25 images. • Accuracy was measured on 500 batches that were not used for calibration.
  • 21. Open challenges / improvements • Unsigned INT8 for activations after ReLU. • Fine-tuning of saturation thresholds. • Is asymmetric quantization, as in TensorFlow, a better solution for the two items above? • RNNs → open research problem. • Dynamic Compute Graph • Expose an API for accepting custom, user-provided scale factors.
  • 22. Reference • TensorRT 3: Faster TensorFlow Inference and Volta Support • 8-bit Inference with TensorRT • Using TensorRT to Optimize Caffe Models in Python • How to Quantize Neural Networks with TensorFlow
  • 23. Summary of NN Compiler
      Provider | Framework | Graph opt. | Backend opt. | INT8 support | Runtime inference | Format | Open source | Target
      Nvidia | Caffe / Tensorflow | TensorRT | TensorRT | TensorRT Precision Calibration | TensorRT runtime engine | NCHW | No | GPU/NVDLA
      Google | Tensorflow | TF lite (toco) | NNAPI ??? | Proper quantized training is necessary before conversion | TF lite interpreter | NHWC | Yes | CPU
      Amazon | MxNet | NNVM | TVM | mxnet.ndarray.contrib.quantize | TVM runtime | Depends on Target | Yes | CPU/GPU/…
    • Generally, [NHWC] is the default for most frameworks (like Tensorflow) and [NCHW] is the optimal format to use when training on NVIDIA GPUs using cuDNN. • TF lite quantized conversion expects the models to be annotated with "fake quantization" nodes that record the dynamic range of the tensors, which means that proper quantized training is necessary before conversion.
  • 26. Model conversion • https://github.com/ysh329/deep-learning-model-convertor • https://github.com/Microsoft/MMdnn • https://github.com/hahnyuan/nn_tools
  • 27. Graph and Target Optimizations • NNVM • https://github.com/dmlc/nnvm/tree/master/src/pass • https://github.com/dmlc/nnvm/tree/master/src/compiler • TF lite • https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/toco • https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/optimize_for_inference.py • https://www.tensorflow.org/performance/ • TensorRT • Layer & Tensor Fusion • FP16 and INT8 Precision Calibration • Kernel Auto-tuning • Dynamic Tensor Memory • Multi-Stream Execution • All of these optimizations are performed offline.
  • 28. Quantization reference code • Solver::Solve() • net_->Forward(&loss); • Net::ForwardFromTo() • Solver::Test() • test_net->Forward(&iter_loss); • Net::ForwardFromTo()
      Net::ForwardFromTo() {
        StartQuantization();   // Add and set QuantizationParams in each layer
        for (int i = start; i <= end; ++i) {
          float layer_loss = layers_[i]->Forward(bottom_vecs_[i], top_vecs_[i]);
          loss += layer_loss;
        }
        FinishQuantization();  // UpdateQuantizationRangeInLayers
        return loss;
      }
  • 29. NV TensorRT container • https://ngc.nvidia.com/registry/nvidia-tensorrt • https://github.com/NVIDIA/nvidia-docker • https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/ • https://devblogs.nvidia.com/parallelforall/tensorrt-container/

Editor's notes

  1. Python API interface: only one plan can be written at a time: trt.utils.write_engine_to_file("./data/mnist/new_mnist.engine", engine.serialize()) → You build the plans yourself with the utility helpers; this requires the API and is done by hand. So to pick one at runtime, does every plan have to be run once to find out which is best? → Yes, picking is done the naive way.
  2. Regarding "no INT8 for Winograd": is there data showing that INT8 gains nothing with Winograd? → A good INT8 version is not yet supported for all forms of Winograd. → Sometimes this is due to GPU clocks and memory bandwidth.
  3. I first ran 500 images in advance, dumped the feature map at every node, and analyzed them one by one. The images were picked at random. My impression is that the final accuracy is not very sensitive to the threshold; small fluctuations do no real harm.
  4. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
     https://github.com/apache/incubator-mxnet/blob/master/src/operator/contrib/quantize-inl.h
     https://github.com/tidsp/caffe-jacinto/blob/caffe-0.16/src/caffe/net.cpp