Xilinx Inference Solution for Deep Learning using OpenPOWER Systems
- 1. © Copyright 2019 Xilinx
Ashish Sirasao
Fellow, Accelerated Computing
ashish.sirasao@xilinx.com
Xilinx Inference Solution
for Deep Learning
- 2. © Copyright 2019 Xilinx
Deep Learning Models – A broad spectrum
Convolutional Neural Network
• Feature Extraction
• Object Detection
• Image Segmentation
Recurrent Neural Network
• Sequence and Temporal Data
• Speech to Text
• Language Translation
Multi-Layer Perceptron
• Classification
• Universal Function Approximator
• Autoencoder
[Figure: example images for Classification ("Dog"), Object Detection, and Segmentation]
- 4. © Copyright 2019 Xilinx
Deep Learning Resurgence – Through 2015
LeNet-5: 1998
AlexNet: 2012
VGG-Net: 2014
GoogLeNet: 2014
ResNet: 2015
- 6. © Copyright 2019 Xilinx
Deep Learning on Xilinx Adaptable Devices
Data Parallel
• 2D array of MACs
• Flexible on-chip memory access
• High bandwidth, multiple access ports
Custom Memory Hierarchy
• Near-memory compute
• Programmable routing for data & filter reuse
Compression & Sparsity
• Flexible data types: FP32/16, INT16/8/4/2, Binary/Ternary (see the XNOR/popcount sketch below)
• Sparsity-friendly compute
Broad Device Range
• Scalable device family for different applications
• Built-in system functions – Networking, Video, ARM
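The Binary/Ternary entry deserves a one-line illustration: with weights and activations constrained to {-1, +1}, a dot product collapses to XNOR plus popcount, which is why such networks map so cheaply onto FPGA LUTs. Below is a minimal NumPy sketch of that identity (a generic illustration, not Xilinx code).

import numpy as np

def binary_dot(a_bits: np.ndarray, w_bits: np.ndarray) -> int:
    """Dot product of {-1,+1} vectors encoded as {0,1} bits:
    XNOR + popcount replaces multiply-accumulate."""
    n = a_bits.size
    xnor = ~(a_bits ^ w_bits) & 1      # 1 where the signs agree
    popcount = int(xnor.sum())         # number of agreeing positions
    return 2 * popcount - n            # equals sum(a_i * w_i) over {-1,+1}

a = np.array([1, 0, 1, 1, 0], dtype=np.uint8)   # encodes [+1, -1, +1, +1, -1]
w = np.array([1, 1, 0, 1, 0], dtype=np.uint8)   # encodes [+1, +1, -1, +1, -1]
print(binary_dot(a, w))                          # -> 1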
- 7. © Copyright 2019 Xilinx
ALVEO Data Center Workloads
[Chart: Alveo data center workload acceleration; *GoogLeNet v1]
https://www.xilinx.com/products/boards-and-kits/alveo.html
- 8. © Copyright 2019 Xilinx
Variable Precision Compute Density – Peak TOPs on VU9P
MAX TOPs estimates at 700 MHz FMAX, shown on a log scale in the original chart (a back-of-envelope derivation sketch follows the table):

Weight/Activation           | VU9P MAX TOPs
MAX FLOAT (SP)/FLOAT (SP)   | 2.18538
MAX FP16/FP16               | 5.64
MAX 8b/8b                   | 17.48304
XFP8 (1,3,4)                | 23.60306
MAX 4b/4b - DSP             | 25.71035
XFP7 (1,3,3)                | 34.96608
MAX 4b/8b                   | 41.72917
MAX 4b/4b - LUT             | 81.65384
MAX 2b/8b                   | 92.10951
MAX T/8b                    | 92.10951
MAX B/8b                    | 92.10951
MAX B/4b                    | 160.7017
MAX 2b/2b                   | 314.7075
MAX B/B                     | 686.6345
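As a rough illustration of where such numbers come from (not from the slide): peak TOPs is roughly 2 operations per MAC, times the number of MACs that can run in parallel at a given precision, times FMAX. The MAC counts below are simply back-calculated from the table at 700 MHz, not VU9P resource specifications.

# Back-of-envelope peak-TOPs estimate: TOPs = 2 * MACs * FMAX.
# The MAC counts are back-calculated from the table above (TOPs / (2 * 0.7 GHz)),
# shown only to make the scaling with precision concrete.
FMAX_HZ = 700e6

parallel_macs = {
    "FLOAT (SP)/FLOAT (SP)": 1_561,
    "8b/8b": 12_488,
    "B/B": 490_453,
}

for precision, macs in parallel_macs.items():
    tops = 2 * macs * FMAX_HZ / 1e12  # two ops (multiply + add) per MAC per cycle
    print(f"{precision}: ~{tops:.2f} peak TOPs at 700 MHz")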
- 10. © Copyright 2019 Xilinx
Custom Processors Exploiting Xilinx FPGA Flexibility
Overlay Architecture
• Customized overlays with ISA architecture for optimized implementation
• Easy plug and play with the software stack
• MLP Engine – scalable sparse and dense implementation*
• xDNN – CNN engine for large 16 nm Xilinx devices**
• Deephi DPU – flexible CNN engine with embedded focus
• CHaiDNN – HLS-based open-source offering***
• Deephi ESE – LSTM speech-to-text engine
• Random Forest – configurable RF classification
*https://github.com/Xilinx/gemx
** https://github.com/Xilinx/ml-suite
*** https://github.com/Xilinx/CHaiDNN
- 11. © Copyright 2019 Xilinx
Inference Optimization Techniques
Hot Chips 2018 Tutorial – Michaela Blott, Xilinx Inc.
- 12. © Copyright 2019 Xilinx
Model Pruning and Integer Arithmetic - Mainstream
Model compression provides compute and memory gains
• RNN models – 5x to 20x
• CNN models – 30% to 10x
Increasing accuracy of reduced-precision CNNs & BNNs
• 8-bit solutions lose no significant accuracy (a minimal quantization sketch follows this list)
• BNNs are improving rapidly
• Near consensus that inference can run at very low precision
• Image / CNN: 2-bit (binary)
• Speech / RNN: 3-bit (ternary)
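To make the integer-arithmetic point concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization of the kind such flows typically apply to weights and activations. It is a generic NumPy illustration, not the Xilinx pruning/quantization tool's actual algorithm or API.

import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization (generic illustration only)."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Quantize a hypothetical weight tensor and check the reconstruction error.
w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
max_err = float(np.abs(w - dequantize_int8(q, scale)).max())
print(f"scale={scale:.5f}, max reconstruction error={max_err:.5f}")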
- 13. © Copyright 2019 Xilinx
Whole Application Acceleration
Example - Smart City/Surveillance
Efficient AI Deployment Requires Full Application Optimization
- 14. © Copyright 2019 Xilinx
Xilinx AI Development Stack – Edge to Cloud
Edge/Embedded vs. Cloud/DC:
Platforms      | Z7020 Board, Z7020 SOM, ZU2/3 SOM, ZU2/3 Card, ZU9 Card, ZCU102, ZCU104, Ultra96 | Xilinx U200, U250
FPGA IP        | Deephi DPU                                    | xDNN v2 and v3
Software Stack | Deephi Runtime, Deephi Compiler               | xfDNN Runtime, xfDNN Compiler
               | Pruning and Quantization (Caffe and TensorFlow EA)
Models         | 20+ pruned / customized / basic models, Deephi LSTM
Tools          | SDSoC                                         | SDAccel
- 15. © Copyright 2019 Xilinx
Xilinx Tensor Processor: An Inference Engine,
Network Compiler + Runtime for Xilinx FPGAs
- 17. © Copyright 2019 Xilinx
xDNN Performance Comparison – Batch of 1
https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf
- 18. © Copyright 2019 Xilinx
https://github.com/Xilinx/ml-suite
Server Platforms: Intel x86, AMD EPYC, POWER9, ARM
FaaS: AWS F1, Nimbix, Ali Cloud, Huawei
Xilinx SDx Boards: ALVEO U200, ALVEO U250, ALVEO U280
- 19. © Copyright 2019 Xilinx
Python interface to simplify xDNN usage
• Blocking execution
• Non-blocking execution across 8 FPGAs (sketched below)
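The slide's code screenshots are not reproduced here; the following is a minimal sketch of what a blocking versus non-blocking dispatch pattern looks like, using a hypothetical XdnnRunner wrapper rather than the actual xfDNN Python API (see https://github.com/Xilinx/ml-suite for the real interface).

from concurrent.futures import ThreadPoolExecutor

# Hypothetical wrapper; the real API lives in https://github.com/Xilinx/ml-suite.
class XdnnRunner:
    def __init__(self, fpga_id: int):
        self.fpga_id = fpga_id

    def infer(self, batch):
        # Placeholder for: transfer the batch, run the xDNN overlay, read back results.
        return f"result(batch={batch}, fpga={self.fpga_id})"

# Blocking: submit one batch and wait for the result.
runner = XdnnRunner(fpga_id=0)
print(runner.infer(batch=0))

# Non-blocking: fan batches out across 8 FPGAs and collect results later.
runners = [XdnnRunner(fpga_id=i) for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(runners[i].infer, batch=i) for i in range(8)]
    results = [f.result() for f in futures]
print(results)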
- 20. © Copyright 2019 Xilinx
Demo successfully shown at Xilinx Developer Forum – the most photographed demo
GoogLeNet performance: 30,000 images/sec, Int8, Batch 1, XDNN v3
Final softmax and FC layers run on the AMD CPU, overlapped with the FPGA, using optimized OpenBLAS (a pipelining sketch follows the table below)
Massive scale-out: EPYC BOXX + 8 ALVEO U250
[Chart: GoogLeNet performance (img/sec), Int8, Batch 1 – single FPGA vs. 8 FPGAs, XDNN v2 vs. XDNN v3, scale 0–35,000 img/sec]
Single U250 performance with XDNN v3 for GoogLeNet:
PL Kernel               | Peak TOP/s (Int8) | Latency (ms) | Images/sec
4 kernels – Throughput  | 19.088            | 1.82         | 4127
4 kernels – Low Latency | 19.088            | 1.18         | 3389
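The overlap of CPU-side FC + softmax with FPGA-side convolution described above can be pictured as a simple pipelined dispatch loop. This is a minimal sketch with hypothetical run_fpga_conv and run_cpu_fc_softmax stand-ins; the real flow uses the xfDNN runtime and an OpenBLAS-backed GEMM.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def run_fpga_conv(batch: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the xDNN convolutional layers on the FPGA."""
    return batch  # placeholder feature map

def run_cpu_fc_softmax(features: np.ndarray) -> np.ndarray:
    """FC + softmax on the CPU; the matmul picks up an optimized BLAS (e.g. OpenBLAS)."""
    w = np.random.randn(features.shape[1], 1000).astype(np.float32)  # hypothetical FC weights
    logits = features @ w
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

batches = [np.random.randn(1, 1024).astype(np.float32) for _ in range(4)]
results = []
with ThreadPoolExecutor(max_workers=1) as cpu:
    pending = None
    for batch in batches:
        features = run_fpga_conv(batch)              # FPGA works on the current batch
        if pending is not None:
            results.append(pending.result())         # collect the previous batch's CPU result
        pending = cpu.submit(run_cpu_fc_softmax, features)  # CPU tail overlaps the next FPGA batch
    results.append(pending.result())
print(len(results), "batches classified")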
- 21. © Copyright 2019 Xilinx
Ready-to-use Algorithms – Evaluate Baseline Models
• Face: object detection, landmarks, recognition and anti-spoofing
• People: object detection, pose estimation, re-identification
• Video Analytics: object detection, multi-object tracking, attributes (person, car), text (plate number)
• Segmentation: scene parsing, lane detection
• Medical Imaging: cervical cancer classification, guide-wire detection, cell segmentation
• Satellite Imaging: object detection, accelerated pre- and post-processing
- 22. © Copyright 2019 Xilinx
Model Compression – Enabling Next Level of Performance
Classification Networks (Top-5 accuracy)
Network             | Baseline Top-5 | Pruning Result 1 Top-5 | ΔTop-5 | Ratio | Pruning Result 2 Top-5 | ΔTop-5 | Ratio
Resnet50 [7.7G]     | 91.65%         | 91.23%                 | -0.42% | 40%   | 90.79%                 | -0.86% | 32%
Inception_v2 [4.0G] | 91.07%         | 90.37%                 | -0.70% | 60%   | 90.07%                 | -1.00% | 55%
SqueezeNet [778M]   | 83.19%         | 82.46%                 | -0.73% | 89%   | 81.57%                 | -1.62% | 75%

Detection Networks (mAP)
Network             | Baseline mAP | Pruning Result 1 mAP | ΔmAP  | Ratio | Pruning Result 2 mAP | ΔmAP  | Ratio
DetectNet [17.5G]   | 44.46        | 45.7                 | +1.24 | 63%   | 45.12                | +0.66 | 50%
SSD+VGG [117G]      | 61.5         | 62.0                 | +0.5  | 16%   | 60.4                 | -1.1  | 10%
[A] SSD+VGG [173G]  | 57.1         | 58.7                 | +1.6  | 40%   | 56.6                 | -0.5  | 12%
[B] Yolov2 [198G]   | 80.4         | 81.9                 | +1.5  | 28%   | 79.2                 | -1.2  | 7%

Segmentation Networks (mIoU)
Network             | Baseline mIoU | Pruning Result 1 mIoU | ΔmIoU  | Ratio | Pruning Result 2 mIoU | ΔmIoU  | Ratio
FPN [163G]          | 65.69%        | 65.21%                | -0.48% | 80%   | 64.07%                | -1.62% | 60%
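A quick consistency check on the table above, as a sketch only: the accuracy deltas follow directly from the pruned and baseline columns, and if the "ratio" column is read as the fraction of baseline compute that remains after pruning (my assumption, not stated on the slide), the compute saving can be worked out the same way.

# Resnet50 row from the table above.
baseline_top5, pruned1_top5 = 91.65, 91.23
baseline_gops, ratio1 = 7.7, 0.40  # 7.7G baseline; 40% ratio (assumed = remaining compute)

delta_top5 = pruned1_top5 - baseline_top5   # -> -0.42, matching the table
remaining_gops = baseline_gops * ratio1     # -> ~3.08 GOPs if the assumption holds
speedup = 1.0 / ratio1                      # -> ~2.5x compute reduction under that assumption
print(delta_top5, remaining_gops, speedup)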
- 23. © Copyright 2019 Xilinx
Xilinx VERSAL – Breakthrough AI Inference Performance
https://www.xilinx.com/products/silicon-devices/acap/versal.html