3. DNN prediction on end devices
• Disadvantage
• Limited computing resources
4. How to implement on end devices?
• Reduce Parameters (including approximation)
• Binary Net
• Parameter Quantization
• Learning Sparse Matrix
• Software
• Fast Convolution Algorithm / Fusion
• Hardware
• FPGA / ASIC
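As a concrete illustration of the parameter-quantization idea above, a minimal sketch of symmetric linear quantization. The 8-bit choice and function names are illustrative, not from the slides; real frameworks use per-channel scales and calibration.

```python
# Minimal sketch of symmetric linear parameter quantization (8-bit).
# Illustrative only; not the specific scheme used in this presentation.

def quantize(weights, bits=8):
    """Map float weights to signed integers with a single scale factor."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.5, -1.27, 0.03, 1.0]
q, s = quantize(w)
w_hat = dequantize(q, s)
# Each reconstructed weight lies within half a quantization step of the original.
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(w, w_hat))
```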
5. In this presentation
• Software Approach on Raspberry Pi Series
• Use Raspberry Pi GPU efficiently
• There is no..
• Hardware Approach
• Approximation and Re-training
7. Raspberry Pi CPU/GPU Spec.
         Pi 3                           Pi Zero/W
CPU      ARM Cortex-A53                 ARM1176JZF-S
         Quad Core 1.2 GHz              Single Core 1 GHz
GPU      Broadcom VideoCore IV 400 MHz  Broadcom VideoCore IV 250 MHz
8. Single Precision flops (theoretical)
         Pi 3                Pi Zero/W
CPU      38.4 Gflops         1 Gflops
         (1.1 Gflops/$)      (0.1-0.2 Gflops/$)
GPU      38.4 Gflops         24 Gflops
         (1.1 Gflops/$)      (2.4-4.8 Gflops/$)
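The theoretical peaks above can be reproduced as clock rate × parallelism. The per-cycle figures below are assumptions based on the published VideoCore IV and Cortex-A53 architectures (12 QPUs with 4-lane add and mul ALUs; 4 single-precision NEON FMAs per cycle per A53 core), not stated on the slide:

```python
# Reproducing the theoretical single-precision peaks (integer MHz, exact math).
# Assumptions: VideoCore IV = 12 QPUs x 8 flops/cycle;
# Cortex-A53 = 8 flops/cycle per core; ARM1176 VFP ~ 1 flop/cycle.
MHz = 10**6

pi3_cpu  = 4  * 8 * 1200 * MHz   # 4 cores @ 1.2 GHz
pi3_gpu  = 12 * 8 * 400  * MHz   # 12 QPUs @ 400 MHz
zero_cpu = 1  * 1 * 1000 * MHz   # single ARM1176 core @ 1 GHz
zero_gpu = 12 * 8 * 250  * MHz   # 12 QPUs @ 250 MHz

assert pi3_cpu == pi3_gpu == 38_400 * MHz   # 38.4 Gflops
assert zero_cpu == 1_000 * MHz              # 1 Gflops
assert zero_gpu == 24_000 * MHz             # 24 Gflops
```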
39. Rejected Algorithms
• im2col
• The TMU already performs an equivalent gather during load, so a separate im2col pass gains nothing
• Winograd
• The increase in data transfer outweighs the reduction in arithmetic operations
• Not enough registers
• Direct (NCHW → NHWC, im2col-equivalent TMU load)
• NHWC has poor data locality for the next layer
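For reference, im2col unrolls each input patch into a matrix row so that convolution becomes a matrix product — the same access pattern the TMU can gather during load. A minimal single-channel, stride-1, no-padding sketch (illustrative, not the rejected implementation):

```python
# Minimal im2col for a single-channel image and a 3x3 kernel (stride 1, no pad).
# Each row of the result is one flattened 3x3 patch; convolution then becomes
# a matrix-vector product with the flattened kernel.

def im2col(img, k=3):
    h, w = len(img), len(img[0])
    cols = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            cols.append([img[i + di][j + dj] for di in range(k) for dj in range(k)])
    return cols

def conv_via_im2col(img, kernel):
    flat_k = [v for row in kernel for v in row]
    return [sum(a * b for a, b in zip(patch, flat_k)) for patch in im2col(img)]

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12]]
box = [[1, 1, 1]] * 3                # 3x3 box filter
print(conv_via_im2col(img, box))     # → [54, 63]
```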
41. Accepted Algorithms
• Direct (NCHW → NCHW)
• DMA store with Transpose(C,HW)
• for small images (HxW < 2048)
• Direct (NCHW → NHCW)
• DMA store with Transpose(C,W)
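The layout names above are axis permutations of NCHW; a small numpy sketch of the permutations involved (shapes are illustrative, and numpy is assumed to be available):

```python
# Tensor layouts as axis permutations of NCHW.
import numpy as np

x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)   # N, C, H, W

nhwc = x.transpose(0, 2, 3, 1)   # Transpose(C, HW): channels become innermost
nhcw = x.transpose(0, 2, 1, 3)   # NCHW -> NHCW: C and H swapped

print(nhwc.shape)   # → (2, 4, 5, 3)
print(nhcw.shape)   # → (2, 4, 3, 5)
```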
42. Specialize for Kernel Size
• 3x3 with stride = 1
• Best performance
• Pi3: 18.5 Gflops
• 48% of the theoretical limit (96% of the practical limit)
• 1x1
• 1xK
• Kx1
• KxK
43. Specialize for Output Shape
• DMA Transfer Block Size
• HxWxC = 2x16x16
• HxWxC = 4x8x16
• for small images
• HxWxC = 2x14x16+2x16x16
• Overlap-Add Method
• 3x3 only
Both 2x16x16 and 4x8x16 blocks use 32 general-purpose registers for accumulation;
the other 32 registers hold the convolution parameters.
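The register accounting works out because a VideoCore IV QPU register file slot is 16 SIMD lanes wide (an architectural detail assumed here, not stated on the slide), so both block shapes hold exactly 512 output values = 32 registers:

```python
# Why both DMA block shapes need the same 32 accumulator registers:
# a block of H*W*C output values occupies H*W*C / LANES registers,
# with each register holding LANES (= 16) SIMD values on VideoCore IV.
LANES = 16

for h, w, c in [(2, 16, 16), (4, 8, 16)]:
    values = h * w * c
    assert values == 512
    assert values // LANES == 32   # 32 accumulators; the other 32 hold parameters
```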
44. Combination
• Using 22 specialized implementations
• NCHW→NCHW / NCHW→NHCW
        2x14x16+2x16x16   2x16x16   4x8x16
3x3     Use               Use       Use
1x1     -                 Use       Use
1xK     -                 Use       Use
Kx1     -                 Use       Use
KxK     -                 Use       Use
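A quick sanity check that the table enumerates the 22 implementations claimed above (each "Use" cell exists once per output layout):

```python
# Counting the specialized kernels: "Use" cells per kernel size, doubled
# across the two output layouts (NCHW->NCHW and NCHW->NHCW).
table = {
    "3x3": ["2x14x16+2x16x16", "2x16x16", "4x8x16"],  # overlap-add is 3x3 only
    "1x1": ["2x16x16", "4x8x16"],
    "1xK": ["2x16x16", "4x8x16"],
    "Kx1": ["2x16x16", "4x8x16"],
    "KxK": ["2x16x16", "4x8x16"],
}
layouts = ["NCHW->NCHW", "NCHW->NHCW"]

count = len(layouts) * sum(len(blocks) for blocks in table.values())
print(count)   # → 22
```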