Speaker: Dr. Hai (Helen) Li is the Clare Boothe Luce Associate Professor of Electrical and Computer Engineering and Co-director of the Duke Center for Evolutionary Intelligence at Duke University
In this talk, I will introduce a few recent research spotlights from the Duke Center for Evolutionary Intelligence. The talk will start with the structured sparsity learning (SSL) method, which learns a compact structure from a larger DNN to reduce computation cost. It generates a regularized structure with high execution efficiency. Our experiments on CPU, GPU, and FPGA platforms show an average 3–5× speedup of AlexNet's convolutional-layer computation. Then, the implementation and acceleration of DNN applications on mobile computing systems will be introduced. MoDNN is a local distributed system that partitions DNN models onto several mobile devices to accelerate computation. ApesNet is an efficient pixel-wise segmentation network that understands road scenes in real time and has achieved promising accuracy. Our prospects on the adoption of emerging technologies will be given at the end of the talk, offering the audience an alternative perspective on the future evolution and revolution of modern computing systems.
Improving Hardware Efficiency for DNN Applications
1. Improving Hardware Efficiency for
DNN Applications
Hai (Helen) Li
Electrical and Computer Engineering Department
Duke Center for Evolutionary Intelligence
2. Technology Landscape
• Desktop/Workstation: personal computing; wired internet
• Mobile Phone: sensing; display; wireless communication & internet; computing at the edge
• Data Center / Cloud: data integration; large-scale storage; large-scale computing
• Intelligent Systems: IoT; robotics; industrial internet; self-driving cars; smart grid; secure autonomous networks; real-time data to decision
From Program to Learn: from static, offline, virtual-world computing toward dynamic, online, real-world computing.
T. Hylton, "Perspectives on Neuromorphic Computing," 2016.
4. Data Explosion & Hardware Development
• Every minute, we send over 200 million emails, click almost 2 million likes on Facebook, send almost 300K tweets, and upload 200K photos to Facebook as well as 100 hours of video to YouTube.
• "Data centers consume up to 1.5% of all the world's electricity…"
• "Google's data centers draw almost 260 MW of power, which is more power than Salt Lake City uses…"
[Photo: Google's "Council Bluffs" data center facilities in Iowa.]
Meanwhile, each hardware generation gives 2× the transistor count, but only 40% higher speed and 50% better efficiency…
L. Ceze, “Approximate Overview of Approximate Computing”.
J. Glanz, “Google Details, and Defends, Its Use of Electricity”
5. Hardware Acceleration for DNNs
• GPUs
– Fast, but high power consumption (~200W)
– Training DNNs in back-end GPU clusters
• FPGAs
– Massively parallel + low-power (~25W) + reconfigurable
– Suitable for latency-sensitive real-time inference jobs
• ASICs
– Fast + energy efficient
– Long development cycle
• Novel architectures and emerging devices
The options trade off efficiency vs. flexibility.
6. Mismatch: Software vs. Hardware
Aspect                      Software                    Hardware
Model/component scale       Large                       Small/moderate
Reconfigurability           Easy                        Hard
Accuracy vs. power          Accuracy                    Tradeoff
Training implementation     Easy                        Hard
Precision                   Double (high) precision     Low precision (often a few bits); limited programmability
Connectivity realization    Easy                        Hard

Our work: Improve Efficiency for DNN Applications through a Software/Hardware Co-Design Framework
7. Outline
Our work: Improve Efficiency for DNN Applications
Through Software/Hardware Co-Design
• Introduction
• Research Spotlights
– Structured Sparsity Regularization (NIPS’16)
– Local Distributed Mobile System for DNN
– ApesNet for Image Segmentation
• Conclusion
8. Related Work
• State-of-the-art methods to reduce the number of parameters
– Weight regularization (L1-norm)
– Connection pruning
AlexNet layer-wise sparsity and theoretical speedup (B. Liu, et al., CVPR 2015):

Layer                 conv1   conv2   conv3   conv4   conv5
Sparsity              0.927   0.95    0.951   0.942   0.938
Theoretical speedup   2.61    7.14    16.12   12.42   10.77

[Chart: remaining parameters per layer after connection pruning on AlexNet, S. Han, et al., NIPS 2015.]
9. Theoretical Speedup ≠ Practical Speedup
• Forwarding speedups of AlexNet on GPU platforms and the corresponding sparsity.
• Baseline is the GEMM of cuBLAS.
• The sparse matrices are stored in compressed sparse row (CSR) format and accelerated by cuSPARSE.
[Chart: per-layer speedup and sparsity.]
B. Liu, et al. (CVPR'15): hardcoding nonzero weights in source code; S. Han, et al. (ISCA'16): the customized EIE chip accelerator for compressed DNNs.
Random sparsity → irregular memory access → poor data locality → no or trivial speedup without dedicated software or hardware customization.
Structured sparsity → regular memory access → good data locality → real speedup by combining SW and HW customization.
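To see why random sparsity rarely turns into wall-clock gains, here is a minimal sketch (assuming NumPy/SciPy on a CPU; the matrix sizes are hypothetical) that compares a dense GEMM against the same randomly sparsified weights stored in CSR:

```python
import time
import numpy as np
import scipy.sparse as sp

# Hypothetical sizes, roughly a lowered convolutional layer.
rows, cols, batch, sparsity = 384, 3456, 1024, 0.9

W = np.random.randn(rows, cols).astype(np.float32)
W[np.random.rand(rows, cols) < sparsity] = 0.0   # random (non-structured) zeros
W_csr = sp.csr_matrix(W)                         # compressed sparse row storage
X = np.random.randn(cols, batch).astype(np.float32)

t0 = time.perf_counter(); _ = W @ X;     t_dense = time.perf_counter() - t0
t0 = time.perf_counter(); _ = W_csr @ X; t_sparse = time.perf_counter() - t0
print(f"dense GEMM: {t_dense*1e3:.1f} ms   CSR product: {t_sparse*1e3:.1f} ms")
# Even at 90% sparsity the CSR product is often no faster than the dense GEMM,
# because the scattered non-zeros break memory locality.
```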
10. Computation-efficient Structured Sparsity
Example 1: Removing 2D filters in convolution (2D-filter-wise sparsity)
Example 2: Removing rows/columns in GEMM (row/column-wise sparsity)
[Figure (illustration: http://deeplearning.net/): a 3D filter is a stack of 2D filters; "lowering" unrolls the filters and feature maps so that convolution becomes a GEMM (GEneral Matrix-Matrix Multiplication) between a weight matrix and a feature matrix. Non-structured sparsity scatters zeros across the lowered weight matrix; structured sparsity removes whole rows/columns.]
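As an illustration of Example 2 (a sketch under assumed shapes, not the released Caffe code), removing all-zero rows and columns of the lowered weight matrix leaves a smaller dense GEMM:

```python
import numpy as np

def prune_lowered_gemm(W, X):
    """Row/column-wise structured sparsity in the lowered GEMM.

    W: lowered weight matrix (filters x unrolled filter elements).
    X: lowered feature matrix (unrolled filter elements x output pixels).
    Dropping all-zero rows of W removes output filters; dropping all-zero
    columns of W removes the matching rows of X, so the remaining product
    is a smaller *dense* GEMM instead of an irregular sparse one.
    """
    keep_rows = np.abs(W).sum(axis=1) > 0      # surviving filters
    keep_cols = np.abs(W).sum(axis=0) > 0      # surviving filter elements
    W_small = W[keep_rows][:, keep_cols]
    X_small = X[keep_cols]
    return W_small @ X_small, keep_rows
```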
11. Structured Sparsity Regularization
• Group Lasso regularization in ML model
\[
\operatorname*{argmin}_{w} E(w) = \operatorname*{argmin}_{w} E_D(w) \quad \text{s.t.} \quad R_g(w) \le \tau
\;\Longleftrightarrow\;
\operatorname*{argmin}_{w} E(w) = \operatorname*{argmin}_{w} \big\{ E_D(w) + \lambda_g \, R_g(w) \big\},
\qquad
R_g(w) = \sum_{g=1}^{G} \big\lVert w^{(g)} \big\rVert_2 .
\]

Example: with \(w = (w_0, w_1, w_2)\) split into group 1 \(= (w_0, w_1)\) and group 2 \(= (w_2)\),
\[
R_g(w_0, w_1, w_2) = \sqrt{w_0^2 + w_1^2} + \sqrt{w_2^2}.
\]
Penalizing each group by its \(\ell_2\) norm drives entire groups exactly to zero, e.g., \((w_0, w_1) = (0, 0)\): many groups will be zeros. [M. Yuan and Y. Lin, 2006]
12. SSL: Structured Sparsity Learning
• Group Lasso regularization in DNNs
• The learned structured sparsity is determined by the way of splitting groups
[Figure: grouping schemes over the 4-D weight tensor W^(l) of layer l —
• filter-wise groups W^(l)_{n_l,:,:,:} and channel-wise groups W^(l)_{:,c_l,:,:}: penalize unimportant filters and channels;
• shape-wise groups W^(l)_{:,c_l,m_l,k_l}: learn filter shapes;
• depth-wise group W^(l) (the whole layer, with a shortcut connection): learn the depth of layers.]
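The released SSL implementation is Caffe-based; purely as an illustration, a PyTorch-style sketch of adding filter-wise and channel-wise group-Lasso terms to the training loss might look like this (`model`, `loss_fn`, and `lambda_g` are hypothetical names):

```python
import torch

def group_lasso_penalty(weight, dim):
    """Group Lasso over a 4-D conv weight (N, C, H, W).

    Each group is one slice along `dim` (dim=0 -> filter-wise groups,
    dim=1 -> channel-wise groups); the penalty is the sum of the l2 norms
    of the groups, which drives whole groups to zero during training.
    """
    # Flatten every slice along `dim` into a row, then take row-wise l2 norms.
    groups = weight.transpose(0, dim).contiguous().view(weight.size(dim), -1)
    return groups.norm(p=2, dim=1).sum()

# Usage sketch: add the penalty to the data loss during training.
# loss = loss_fn(model(x), y)
# for m in model.modules():
#     if isinstance(m, torch.nn.Conv2d):
#         loss = loss + lambda_g * (group_lasso_penalty(m.weight, dim=0)    # filter-wise
#                                   + group_lasso_penalty(m.weight, dim=1))  # channel-wise
# loss.backward()
```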
14. Penalizing Unimportant Filters and Channels
LeNet on MNIST (data in the order conv1 – conv2):

LeNet #        Error   Filter #   Channel #   FLOP            Speedup
1 (baseline)   0.9%    20 – 50    1 – 20      100% – 100%     1.00× – 1.00×
2              0.8%    5 – 19     1 – 4       25% – 7.6%      1.64× – 5.23×
3              1.0%    3 – 12     1 – 3       15% – 3.6%      1.99× – 7.44×

[Figure: conv1 filters of LeNet 1, 2, and 3 (gray level 128 represents zero).]
SSL obtains FEWER but more natural patterns.
15. Learning Smaller Filter Shapes
LeNet on MNIST (filter sizes after removing zero shape fibers, in the order conv1 – conv2):

LeNet #        Error   Filter size   Channel #   FLOP           Speedup
1 (baseline)   0.9%    25 – 500      1 – 20      100% – 100%    1.00× – 1.00×
4              0.8%    21 – 41       1 – 2       8.4% – 8.2%    2.33× – 6.93×
5              1.0%    7 – 14        1 – 1       1.4% – 2.8%    5.19× – 10.82×

[Figure: learned conv1 filter shapes — 5×5 (LeNet 1), 21 weights (LeNet 4), 7 weights (LeNet 5). In LeNet 5 the 3D 20×5×5 conv2 filters are regularized to 2D filters, giving a smaller weight matrix.]
SSL obtains SMALLER filters w/o accuracy loss.
16. Learning Smaller Dense Weight Matrix
ConvNet on CIFAR-10 (row/column sparsity in the order conv1 – conv2 – conv3):

ConvNet #      Error   Row sparsity           Column sparsity      Speedup
1 (baseline)   17.9%   12.5% – 0% – 0%        0% – 0% – 0%         1.00× – 1.00× – 1.00×
4              17.9%   50.0% – 28.1% – 1.6%   0% – 59.3% – 35.1%   1.43× – 3.05× – 1.57×
5              16.9%   31.3% – 0% – 1.6%      0% – 42.8% – 9.8%    1.25× – 2.01× – 1.18×

[Figure: the learned conv1 filters.]
SSL can efficiently learn DNNs with smaller but dense weight matrices that have good locality.
17. Regularizing The Depth of DNNs
ResNet on CIFAR-10 (baseline: K. He, CVPR'16):

             # layers   Error    # layers   Error
ResNet       20         8.82%    32         7.51%
SSL-ResNet   14         8.54%    18         7.40%

[Figure: depth-wise SSL — the depth-wise group is the whole layer weight tensor W^(l), with shortcut connections so that regularized-away layers can be bypassed.]
18. Experiments – AlexNet@ImageNet
• SSL saves 30–40% of FLOPs with no accuracy loss, or 60–70% of FLOPs with <1.5% accuracy loss, by structurally removing 2D filters.
• Deeper layers have higher sparsity.
[Plot: per-layer 2D-filter sparsity (%) and FLOP reduction (%) for conv1–conv5 versus top-1 error (41.5%–44%), with the dense baseline marked.]
Learning 2D-filter-wise sparsity: a 3D convolution is the sum of 2D convolutions over its stacked 2D filters (illustration: http://deeplearning.net/).
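A quick sanity check of the "3D convolution = sum of 2D convolutions" identity, as a NumPy/SciPy sketch with made-up sizes (not the experiment code):

```python
import numpy as np
from scipy.signal import correlate2d

# A 3-D filter is a stack of per-channel 2-D filters: the 3-D convolution
# (cross-correlation, as in DNNs) equals the sum of per-channel 2-D
# convolutions, so an all-zero 2-D filter can be removed without changing
# the output.
channels, k = 3, 5
x = np.random.randn(channels, 32, 32)
w = np.random.randn(channels, k, k)
w[1] = 0.0                                    # a structurally removed 2-D filter

full = sum(correlate2d(x[c], w[c], mode="valid") for c in range(channels))
pruned = sum(correlate2d(x[c], w[c], mode="valid")
             for c in range(channels) if np.any(w[c]))
assert np.allclose(full, pruned)
```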
20. Open Source
• Source code on GitHub, and trained models in the model zoo
• https://github.com/wenwei202/caffe/tree/scnn
21. Collaborated Work with Intel – ICLR 2017
Jongsoo Park, et al, “Faster CNNs with Direct Sparse Convolutions and Guided
Pruning,” ICLR 2017.
Direct Sparse Convolution: an
efficient implementation of
convolution with sparse filters
Performance Model: A
predictor of the speedup vs.
sparsity level
Guided Sparsity Learning: a performance-model-supervised pruning process that avoids pruning layers where sparsity brings no speedup but would harm accuracy.
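For intuition, a hedged sketch of the direct-sparse-convolution idea, iterating only over a filter's non-zero weights instead of lowering to GEMM; the data layout and function here are assumptions, not the paper's actual kernel:

```python
import numpy as np
import scipy.sparse as sp

def direct_sparse_conv2d(x, w_sparse, k):
    """Sketch: direct convolution with one sparse filter.

    x: input of shape (C, H, W); w_sparse: a 1 x (C*k*k) CSR row holding the
    flattened filter. Only the non-zero weights are visited; each adds a
    shifted slice of the input to the output (cross-correlation, as in DNNs).
    """
    C, H, W = x.shape
    out = np.zeros((H - k + 1, W - k + 1), dtype=x.dtype)
    coo = w_sparse.tocoo()
    for flat_idx, val in zip(coo.col, coo.data):
        c, rest = divmod(int(flat_idx), k * k)   # channel index
        r, s = divmod(rest, k)                   # offsets inside the k x k window
        out += val * x[c, r:r + out.shape[0], s:s + out.shape[1]]
    return out

# Hypothetical usage: a 90%-sparse 3x3 filter over a 3-channel input.
x = np.random.randn(3, 8, 8).astype(np.float32)
w = np.random.randn(1, 3 * 3 * 3).astype(np.float32)
w[np.random.rand(*w.shape) < 0.9] = 0.0
y = direct_sparse_conv2d(x, sp.csr_matrix(w), k=3)
```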
22. Outline
Our work: Improve Efficiency for DNN Applications
Through Software/Hardware Co-Design
• Introduction
• Research Spotlights
– Structured Sparsity Regularization
– Local Distributed Mobile System for DNN (DATE’17)
– ApesNet for Image Segmentation
• Conclusion
23. Neural Network on Mobile Platforms
Advantages: data accessibility; mobility.
Applications: navigation; AR/VR; image analysis; language processing; object recognition; motion prediction; malware detection; wearable devices.
[Photos: (1) a brain-inspired chip from IBM, (2) Google Glass, (3) Nixie, (4) an Android Wear smart watch.]
24. Challenges and Preliminary Analysis
Challenges: security; battery capacity; hardware performance (CPU, GPU, memory).
Layer analysis of DNNs on smartphones: layers are either computing-intensive or memory-intensive.
25. Local Distributed Mobile System for DNN
[Figure: system overview of MoDNN — the original DNN model goes through a model processor (BODP, MSCC & FGCP); a middleware layer maps the partitioned model onto a computing cluster of one Group Owner and several Worker Nodes.]
27. Optimization of Fully-connected Layers
[Schematic: the partition scheme for fully-connected layers (sparsity, cluster, partition).]
[Plot: performance characterization of a Nexus 5 — execution time (ms, 0–350) versus amount of calculation (millions, 0–16) for the array-based (GET) and linked-list-based (SPT) implementations.]
Speed-oriented decision between SpMV and GEMV.
28. Optimization of Sparse Fully-connected Layers
• Apply spectral co-clustering to the original sparse matrix
• Decide the execution method for each cluster
• Initialize the estimated time with the execution time of each cluster and its corresponding outliers
Modified Spectral Co-Clustering (MSCC)
Target: Higher Computing Efficiency & Lower Transmission Amount
[Figure: a sparse FC weight matrix (input neurons × output neurons) — (a) original, (b) clustered (MSCC), (c) outliers (FGCP).]
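MSCC modifies spectral co-clustering; as a rough stand-in, the sketch below uses scikit-learn's stock SpectralCoclustering and then picks a dense (GEMV) or sparse (SpMV) kernel per cluster. The cluster count and density threshold are hypothetical, and outliers are ignored here (they are what FGCP handles on the next slide):

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

def plan_fc_partition(W, n_clusters=4, dense_threshold=0.3):
    """Sketch of an MSCC-like step for a sparse FC weight matrix W
    (output neurons x input neurons).

    Spectral co-clustering groups rows/columns into dense blocks that can be
    assigned to worker nodes; for each block we choose a dense GEMV kernel or
    a sparse SpMV kernel from its density (the threshold is hypothetical).
    """
    model = SpectralCoclustering(n_clusters=n_clusters, random_state=0)
    model.fit(np.abs(W) + 1e-12)          # cluster on the magnitude pattern
    plans = []
    for k in range(n_clusters):
        rows = np.where(model.rows_[k])[0]
        cols = np.where(model.columns_[k])[0]
        block = W[np.ix_(rows, cols)]
        if block.size == 0:
            continue
        density = np.count_nonzero(block) / block.size
        kernel = "GEMV" if density > dense_threshold else "SpMV"
        plans.append({"rows": rows, "cols": cols, "kernel": kernel})
    return plans
```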
29. Optimization of Sparse Fully-connected Layers
Fine-Grain Cross Partition (FGCP)
Target: Workload Balance & Lower Transmission Amount
• Choose the node with the highest estimated time
• Find its sparsest line (row)
• Offload it to the Group Owner (GO)
[Figure: a sparse FC weight matrix (input neurons × output neurons) — (a) original, (b) clustered (MSCC), (c) outliers (FGCP).]
[Chart: transmission size decrease ratio of MoDNN relative to the baseline (20%–70%) for FC layers FL_6 (25088×4096), FL_7 (4096×4096), and FL_8 (1000×4096), with 2, 3, and 4 workers and sparsity from 70% to 100%.]
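Reading the three FGCP bullets above as a greedy loop, a hedged sketch with a purely hypothetical nnz-proportional cost model (not MoDNN's measured timing) could be:

```python
def fgcp(worker_rows, go_cost_per_nnz=1.0):
    """Sketch of the FGCP loop: repeatedly take the sparsest remaining row
    ("line") from the worker with the highest estimated time and offload it
    to the Group Owner (GO), while this shortens the critical path.

    worker_rows: {worker_id: list of (row_id, nnz)}; a device's time is
    modeled as proportional to the total nnz it holds (stand-in cost model).
    """
    times = {w: float(sum(n for _, n in rows)) for w, rows in worker_rows.items()}
    go_rows, go_time = [], 0.0
    while True:
        slowest = max(times, key=times.get)              # node with the highest time
        if not worker_rows[slowest]:
            break
        row_id, nnz = min(worker_rows[slowest], key=lambda r: r[1])  # sparsest line
        new_go_time = go_time + nnz * go_cost_per_nnz
        candidate = dict(times)
        candidate[slowest] -= nnz
        if max(max(candidate.values()), new_go_time) >= max(max(times.values()), go_time):
            break                                        # offloading no longer helps
        worker_rows[slowest].remove((row_id, nnz))
        times, go_time = candidate, new_go_time
        go_rows.append(row_id)
    return go_rows
```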
30. Outline
Our work: Improve Efficiency for DNN Applications
Through Software/Hardware Co-Design
• Introduction
• Research Spotlights
– Structured Sparsity Regularization
– Local Distributed Mobile System for DNN
– ApesNet for Image Segmentation (ESWEEK’16)
• Conclusion
37. Outline
Our work: Improve Efficiency for DNN Applications
Through Software/Hardware Co-Design
• Introduction
• Research Spotlights
– Structured Sparsity Regularization
– Local Distributed Mobile System for DNN
– ApesNet for Image Segmentation
• Conclusion
38. Conclusion
• DNNs demonstrate great success and potential in various types of applications.
• Many software optimization techniques cannot achieve their theoretical speedup in real implementations.
• The mismatch between software requirements and hardware capability becomes more severe as the problem scale increases.
• New approaches that coordinate software and hardware co-design and optimization are necessary.
• New devices and novel architectures could play an important role, too.
39. Thanks to Our Sponsors, Collaborators, & Students
We welcome industrial collaborators.