3. NetAdapt: Platform Aware Neural Network Adaptation
for Mobile Applications (Google)
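• Typical optimization focuses on reducing indirect metrics such as MACs / FLOPs
• But does that also optimize direct metrics such as latency and energy consumption? (Not necessarily.)
This work optimizes with those direct metrics in mind.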
• Empirical measurements
• Contribution
• Automatically and progressively simplify a pre-trained network until the resource budget is
met while maximizing the accuracy
• Achieves better accuracy versus latency trade-offs on mobile CPU & GPU, compared with
the state-of-the-art automated network simplification algorithms
• Method
• Rather than meeting the given constraints in one shot, iteratively tighten the constraint
step by step while optimizing accuracy
• At each step, adjust the number of filters of the layer that satisfies the constraint with the smallest accuracy drop
• Slow
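The iterative procedure above can be sketched as a small loop. This is a toy sketch under stated assumptions, not the paper's implementation: `latency_of` and `accuracy_of` are hypothetical stand-ins for the measured latency and the short-term fine-tuned accuracy, and filters are removed one at a time from a single layer per proposal.

```python
from typing import Callable, List


def netadapt(filters: List[int],
             latency_of: Callable[[List[int]], float],
             accuracy_of: Callable[[List[int]], float],
             budget: float,
             step: float) -> List[int]:
    """Toy NetAdapt-style loop: tighten the latency constraint by `step`
    per iteration and keep the single-layer proposal with the best accuracy."""
    current = list(filters)
    while latency_of(current) > budget:
        target = max(budget, latency_of(current) - step)
        proposals = []
        for layer in range(len(current)):
            candidate = list(current)
            # Shrink only this layer until the tightened constraint is met.
            while candidate[layer] > 1 and latency_of(candidate) > target:
                candidate[layer] -= 1
            if latency_of(candidate) <= target:
                proposals.append(candidate)
        if not proposals:
            break  # no single layer can meet the constraint
        current = max(proposals, key=accuracy_of)
    return current


# Hypothetical cost models, for illustration only.
def toy_latency(f: List[int]) -> float:
    return float(sum(f))                  # pretend each filter costs 1 ms


def toy_accuracy(f: List[int]) -> float:
    importance = [3.0, 1.0, 2.0]          # pretend layer 0 matters most
    return sum(w * n ** 0.5 for w, n in zip(importance, f))


result = netadapt([8, 8, 8], toy_latency, toy_accuracy, budget=18.0, step=2.0)
```

Every iteration evaluates one proposal per layer, which is why the method is slow when each evaluation involves fine-tuning.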
6. Algorithm Details
• Empirical Measurements
• Build a per-layer look-up table in advance to save as much time as possible.
• Choose which Filter
• Remove filters in ascending order of L2-norm magnitude.
• Choosing filters by their joint influence would be another option*
• Fine-tuning
• Use short-term fine-tuning for a rough performance comparison, and run long-term fine-tuning only on the final result
• Short-term training: about 40k iterations, w/ ImageNet training set minus a 10,000-image holdout set
*Yang, Tien-Ju and Chen, Yu-Hsin and Sze, Vivienne: Designing energy-efficient convolutional neural networks using energy-aware pruning. In: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). (2017)
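The magnitude-based filter choice above can be illustrated with a minimal sketch. The function name `filters_to_remove` and the flattened-filter representation are assumptions made for this example only.

```python
import math
from typing import List


def filters_to_remove(filters: List[List[float]], num_remove: int) -> List[int]:
    """Return the indices of the `num_remove` filters with the smallest L2 norm."""
    norms = [math.sqrt(sum(w * w for w in f)) for f in filters]
    order = sorted(range(len(filters)), key=lambda i: norms[i])
    return sorted(order[:num_remove])


# Each row is one filter, flattened to a list of weights.
layer = [
    [0.5, -0.5, 0.1],    # norm ~0.71
    [0.01, 0.02, 0.0],   # norm ~0.02
    [1.0, 1.0, 1.0],     # norm ~1.73
    [0.1, 0.0, -0.1],    # norm ~0.14
]
removed = filters_to_remove(layer, 2)   # the two smallest-norm filters: 1 and 3
```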
8. ADC: Automated Deep Compression and Acceleration
with Reinforcement Learning (Song Han)
• NetAdapt’s competitor
• LPIRC (Google): NetAdapt achieves better accuracy than ADC and is practical
• Efficient DL workshop (Song Han): NetAdapt is slow!
• Reinforcement Learning based agent
• Efficient design space exploration
• Accuracy & compression rate
• Sampling the design space greatly improves
the model compression quality
• Even better than human expertise!
9. ADC Agent
• w/ continuous compression ratio control (DDPG*)
• Receive a reward with approximated model
performance without fine-tuning
• Accuracy & overall compression rate
• Further scenario: FLOPs-constrained compression &
accuracy-guaranteed compression
• Process a network in a layer by layer manner
• Input: Layer embedding state 𝑠𝑡 =
• Outputs a fine grained sparsity ratio for each layer
* N. Johnson, S. Kotz, and N. Balakrishnan. Continuous univariate probability distributions,(vol. 1), 1994.
10. Algorithm
• Specified Compression algorithm (reducing channels to c’): n x c x k x k ?
• Spatial decomposition[1]: n x c’ x k x 1, c’ x c x 1 x k - Data independent reconstruction
• Channel decomposition[2]: n x c’ x k x k, c’ x c x 1 x 1
• Channel pruning[3]: n x c’ x k x k - L2-norm(magnitude) based pruning
• Agent
• Each transition in an episode is $(s_t, a_t, R, s_{t+1})$
• The agent is trained with a reward proportional to the action error[4]
• FLOPs-Constrained Compression
• R = -Error
• First compress the network once, then use a heuristic to gradually push it under the given budget.
• Accuracy-Guaranteed Compression
• Observe that accuracy error is inversely-proportional to log(FLOPs)
• R = - Error * log(FLOPs)
[1] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014
[2] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and
machine intelligence, 38(10):1943–1955, 2016.
[3] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1389–1397, 2017
[4]B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning
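The two reward definitions on this slide can be written out directly. This is only a sketch: the function names are hypothetical, and the natural logarithm is an assumption since the slide does not state the base.

```python
import math


def reward_flops_constrained(error: float) -> float:
    """R = -Error. The FLOPs budget is handled outside the reward,
    by the heuristic that pushes the network under the budget."""
    return -error


def reward_accuracy_guaranteed(error: float, flops: float) -> float:
    """R = -Error * log(FLOPs): for the same error, fewer FLOPs give a higher reward."""
    return -error * math.log(flops)


r_small = reward_accuracy_guaranteed(error=0.10, flops=1e8)
r_large = reward_accuracy_guaranteed(error=0.10, flops=1e9)
```

With equal error, `r_small` is larger than `r_large`, so the agent is pushed toward the cheaper model.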
12. Quantization and Training of Neural Networks for Efficient
Integer-Arithmetic-Only Inference (Google, CVPR 2018)
• How to train Quantized Neural Networks?
• Previous quantization approaches:
• Tend to address only problems that are too easy (AlexNet, ResNet, VGG)
• All over-parameterized
• Considered only compression, not computational efficiency.
• Look-up table approach: performs poorly on common devices
• Methods built on bitwise operations such as shift / XOR bring little benefit on existing hardware.
• Fully XOR-Net style approaches suffer from performance degradation
• Quantization scheme
• Weights / Activations: 8-bit integers
• bias vectors: 32-bit integers
• Quantized inference / training Framework
• Adopted in TFLite (Inference)
• Inference: Integer-only arithmetic / training: floating-point arithmetic
13. Quantized Inference
• affine mapping
• 𝑟 = 𝑆(𝑞 − 𝑍)
• Maps integers q to real numbers r; S (scale) and Z (zero-point) are the quantization parameters
• Uses a single set of quantization parameters for all values within
each activations array and within each weights array
• Computation of Matrix multiplication
• When $r_3 = r_1 r_2$ (each $r_\alpha$ is an N x N matrix):
• $r_3^{(i,k)} = \sum_{j=1}^{N} r_1^{(i,j)} r_2^{(j,k)}$
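A short worked step, assuming only the affine mapping $r = S(q - Z)$ with one parameter pair $(S_\alpha, Z_\alpha)$ per matrix: substituting it into the matrix product shows why inference needs only integer arithmetic apart from a single real multiplier. The symbol $M$ is introduced here for that multiplier.

```latex
\begin{align*}
S_3\,(q_3^{(i,k)} - Z_3) &= \sum_{j=1}^{N} S_1\,(q_1^{(i,j)} - Z_1)\, S_2\,(q_2^{(j,k)} - Z_2) \\
q_3^{(i,k)} &= Z_3 + M \sum_{j=1}^{N} (q_1^{(i,j)} - Z_1)(q_2^{(j,k)} - Z_2),
\qquad M = \frac{S_1 S_2}{S_3}
\end{align*}
```

The sum involves integers only; $M$ is the one non-integer quantity left.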
14. Quantized Inference
• Bias quantization
• Bias quantization error acts as an overall bias
• 32-bit representation
• $Z_{bias} = 0$, $S_{bias} = S_1 S_2$
• Things left to do
• Scale down to the final scale (8-bit output activations)
• Cast down to uint8
• apply the activation function to yield the final 8-bit output activation
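The bias handling and the remaining steps can be sketched for a single output value. This is a simplified illustration, not the TFLite kernel: the helper names and all scale and zero-point values are made up, the down-scaling uses a Python float where a real kernel would use a fixed-point multiplier, and the activation function is omitted.

```python
from typing import List


def quantize(r: float, scale: float, zero_point: int) -> int:
    """Affine quantization q = round(r / S) + Z, clamped to the uint8 range."""
    q = round(r / scale) + zero_point
    return max(0, min(255, q))


def quantized_dot(q1: List[int], z1: int, s1: float,
                  q2: List[int], z2: int, s2: float,
                  bias: float, s3: float, z3: int) -> int:
    """Integer accumulation of one output value, then scale down and cast."""
    # The bias is stored as an integer with Z_bias = 0 and S_bias = S1 * S2,
    # so it can be added straight into the accumulator.
    acc = round(bias / (s1 * s2))
    acc += sum((a - z1) * (b - z2) for a, b in zip(q1, q2))
    # Scale down to the output scale, then clamp to 8 bits.
    q3 = round(acc * (s1 * s2 / s3)) + z3
    return max(0, min(255, q3))


# Made-up quantization parameters for the example.
s1, z1 = 0.02, 128
s2, z2 = 0.01, 128
s3, z3 = 0.05, 128
x = [0.5, -1.0, 0.24]
w = [0.3, 0.2, -0.4]
bias = 0.1

q_x = [quantize(v, s1, z1) for v in x]
q_w = [quantize(v, s2, z2) for v in w]
q_out = quantized_dot(q_x, z1, s1, q_w, z2, s2, bias, s3, z3)

real_out = sum(a * b for a, b in zip(x, w)) + bias   # floating-point reference
approx_out = s3 * (q_out - z3)                       # dequantized integer result
```

The dequantized result should agree with the floating-point reference to within about one output quantization step.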
15. Training with simulated quantization
• All weights & biases are stored in floating point
• Weights are quantized before they are convolved with the input
• Activations are quantized at points where they would be during inference
• Tuning quantization parameters
• Weights: quantization range set linearly from the min value to the max value
• Activation: Exponential moving averages
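A minimal sketch of simulated quantization, assuming a plain min/max range for weights and an exponential moving average for activation ranges. `fake_quantize` and `EmaRange` are hypothetical names, and details of the real scheme (such as making zero exactly representable) are left out.

```python
from typing import List, Optional, Tuple


def fake_quantize(x: float, lo: float, hi: float, bits: int = 8) -> float:
    """Quantize to `bits` levels and immediately dequantize, all in floating point."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    clamped = max(lo, min(hi, x))
    return lo + round((clamped - lo) / scale) * scale


class EmaRange:
    """Tracks an activation range with exponential moving averages."""

    def __init__(self, decay: float = 0.99) -> None:
        self.decay = decay
        self.lo: Optional[float] = None
        self.hi: Optional[float] = None

    def update(self, batch: List[float]) -> Tuple[float, float]:
        b_lo, b_hi = min(batch), max(batch)
        if self.lo is None or self.hi is None:
            self.lo, self.hi = b_lo, b_hi
        else:
            self.lo = self.decay * self.lo + (1 - self.decay) * b_lo
            self.hi = self.decay * self.hi + (1 - self.decay) * b_hi
        return self.lo, self.hi


weights = [-0.8, 0.1, 0.45, 1.2]
w_lo, w_hi = min(weights), max(weights)      # weights: plain min/max range
fq_weights = [fake_quantize(w, w_lo, w_hi) for w in weights]

act_range = EmaRange(decay=0.9)
act_range.update([0.0, 1.0])
lo, hi = act_range.update([0.0, 2.0])        # range drifts slowly toward the new batch
```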
18. SBNet: Sparse Blocks Network for Fast
Inference (Uber)
• A low-cost computation mask reduces computation in the
high-resolution main network
• Tiling-based sparse convolution algorithm
• Implements tiling-based GPU kernel
• LiDAR 3D object detection tasks
19. Sparse Blocks Network
• How to handle sparse input?
• Mask to indices
• Extract a list of active
location indices
• Sparse gather/scatter
• Extract data from the sparse
inputs
• Signal processing
• Overlap-save algorithm
• Repeating Gathering /
Scattering while processing
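The mask-to-indices, gather, and scatter steps above can be sketched on a small 2D grid. This is a simplification under stated assumptions: the blocks do not overlap (the overlap-save tiling needed for real convolutions is omitted), the per-block work is an element-wise doubling, and all function names are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]


def mask_to_indices(mask: Grid, block: int) -> List[Tuple[int, int]]:
    """Return the top-left corner of every block with at least one active cell."""
    h, w = len(mask), len(mask[0])
    return [(r, c)
            for r in range(0, h, block)
            for c in range(0, w, block)
            if any(mask[i][j] > 0
                   for i in range(r, min(r + block, h))
                   for j in range(c, min(c + block, w)))]


def gather(x: Grid, indices: List[Tuple[int, int]], block: int) -> List[Grid]:
    """Copy each active block out of the dense input."""
    return [[row[c:c + block] for row in x[r:r + block]] for r, c in indices]


def scatter(blocks: List[Grid], indices: List[Tuple[int, int]], out: Grid) -> Grid:
    """Write the processed blocks back into the dense output (in place)."""
    for (r, c), blk in zip(indices, blocks):
        for i, row in enumerate(blk):
            out[r + i][c:c + len(row)] = row
    return out


x = [[float(4 * i + j) for j in range(4)] for i in range(4)]
mask = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]

idx = mask_to_indices(mask, block=2)          # only the top-right block is active
blocks = gather(x, idx, block=2)
processed = [[[2 * v for v in row] for row in b] for b in blocks]
y = scatter(processed, idx, [[0.0] * 4 for _ in range(4)])
```

Only one of the four blocks is processed here; the saving grows with the sparsity of the mask.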
21. Shift: A Zero FLOP, Zero Parameter Alternative to
Spatial Convolutions (UC Berkeley, Kurt Keutzer)
• Shift-based module
• Use Shift operation to mix spatial
information across channels
• Let’s use simple shift operation
instead of depth-wise convolution!
• Series of memory operations that
adjusts channels of the input tensor in
certain directions
• Assign different shift kernels per
each channel
• $k^2$ different shift kernels
• Each group of $M/k^2$ channels adopts
one shift
• Results
• It does not look that efficient
• But it can be adapted to MIDAP easily
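The shift operation described on this slide can be sketched as pure data movement. This is an illustrative sketch with assumptions: borders are zero-filled, channels are nested lists, and any leftover channels (when M is not a multiple of k^2) simply reuse the last shift, which is an arbitrary choice made here.

```python
from typing import List

Channel = List[List[float]]


def shift_channel(ch: Channel, dy: int, dx: int) -> Channel:
    """Shift one channel by (dy, dx), filling vacated cells with zeros."""
    h, w = len(ch), len(ch[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            si, sj = i - dy, j - dx
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = ch[si][sj]
    return out


def shift_layer(x: List[Channel], k: int = 3) -> List[Channel]:
    """Give each group of M / k^2 channels one of the k^2 possible shifts."""
    half = k // 2
    offsets = [(dy, dx)
               for dy in range(-half, half + 1)
               for dx in range(-half, half + 1)]
    group = max(1, len(x) // len(offsets))
    return [shift_channel(ch, *offsets[min(i // group, len(offsets) - 1)])
            for i, ch in enumerate(x)]


# Nine 3x3 channels, each with a single 1.0 in the centre.
center = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
x = [[row[:] for row in center] for _ in range(9)]
y = shift_layer(x, k=3)
```

No multiplications or learned parameters are involved; the following 1x1 convolution mixes the shifted channels.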
22. Shift based modules
• (Shift-)Conv-Shift-Conv module
• 𝑆𝐶2 module / CSC module
• Shift Kernel
• Size $D_k$: $D_k^2$ possible shift matrices
• Dilation rate: similar to dilated convolution
• Expansion rate 𝜀: expand the channel size via 1x1
convolution kernel to gather sufficient information
with shift operation
• Only 1x1 convolutions
• Target
• Mobile / IOT applications
• Memory footprint reduction
24. Squeeze-and-Excitation Networks
(Momenta & Oxford)
• 1st place winner of ILSVRC 2017 classification
• Suggests SE block
• Feature recalibration
• Squeeze: Global average pooling (H x W → 1 x 1)
• Excitation: Adaptive Recalibration (capture channel-wise dependencies)
25. Squeeze & Excitation
• Excitation
• Gating mechanism with two fully connected
layers
• Acts similarly to an attention module
• Results
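The squeeze and excitation steps can be sketched as follows. This is a minimal illustration that assumes a ReLU between the two fully connected layers and a sigmoid gate, and omits biases; the weight matrices `w1` and `w2` are made-up values.

```python
import math
from typing import List, Tuple

Channel = List[List[float]]


def se_block(x: List[Channel],
             w1: List[List[float]],
             w2: List[List[float]]) -> Tuple[List[Channel], List[float]]:
    """x: C channels of H x W maps; w1: (C/r) x C weights; w2: C x (C/r) weights."""
    # Squeeze: global average pooling, H x W -> 1 x 1 per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
             for row in w2]
    # Recalibration: rescale every channel by its gate.
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, x)]
    return scaled, gates


x = [[[1.0, 1.0], [1.0, 1.0]],      # channel 0: mean 1
     [[3.0, 3.0], [3.0, 3.0]]]      # channel 1: mean 3
w1 = [[1.0, 1.0]]                   # reduce 2 channels to 1 hidden unit
w2 = [[0.0], [1.0]]                 # expand back to 2 gates
y, gates = se_block(x, w1, w2)
```

Each gate depends on the pooled statistics of all channels, which is how the block captures channel-wise dependencies.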
26. ShuffleNet: An Extremely Efficient Convolutional
Neural Network for Mobile Devices (Megvii Inc.)
• Simple idea
• State-of-the-art architectures
• 1x1 conv + DWconv + 1x1 conv
• Intuitive shuffling
• 1x1 group conv + shuffle +
DWconv + 1x1 group conv
• g x n outputs (g: # of groups)
→ reshape to (g, n) → transpose to (n, g)
→ flatten back to g x n
• Good results
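The shuffle step (reshape, transpose, flatten) can be sketched on a flat list of channel indices. The function name is an assumption for this example.

```python
from typing import List, TypeVar

T = TypeVar("T")


def channel_shuffle(channels: List[T], groups: int) -> List[T]:
    """View the g * n channels as (g, n), transpose to (n, g), and flatten."""
    assert len(channels) % groups == 0, "channel count must be divisible by groups"
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


# Channels 0-2 belong to group 0 and channels 3-5 to group 1.
shuffled = channel_shuffle(list(range(6)), groups=2)   # [0, 3, 1, 4, 2, 5]
```

After the shuffle, every group of the next group convolution sees channels from all previous groups.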
27. CondenseNet: An Efficient DenseNet using
Learned Group Convolutions (Cornell Univ.)
• Observation
• 1x1 group convolution usually leads to drastic
reductions in accuracy
• Learned group convolution
• Removing superfluous computation in
DenseNet architecture via group convolution
• Automatic input feature groupings during
training
28. CondenseNet Training
• Split the filters into G groups of equal size before training
• Random grouping for further condensation
• Condensation Criterion
• Averaged absolute value of weights between them across all outputs within the group
• Group Lasso
• Group-level sparsity
• Condensation procedure
• Condensation factor C
• C – 1 condensing stages
• Pruning 1/C of the filter weights at the end of each stage
• Re-index the layer
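The condensation criterion and procedure can be sketched for a single group. This is a toy sketch under assumptions: the weights stay fixed between stages (in actual training they keep changing), and the function name and weight layout are hypothetical.

```python
from typing import List


def condense_group(weights: List[List[float]], c_factor: int) -> List[int]:
    """weights: rows are the outputs of one group, columns are input features.
    Returns the input features that survive; pruned weights are zeroed in place."""
    num_in = len(weights[0])
    per_stage = num_in // c_factor
    kept = list(range(num_in))
    for _ in range(c_factor - 1):                      # C - 1 condensing stages
        # Criterion: mean |weight| of each input feature over the group's outputs.
        score = {j: sum(abs(row[j]) for row in weights) / len(weights)
                 for j in kept}
        for j in sorted(kept, key=score.get)[:per_stage]:
            kept.remove(j)                             # prune 1/C of the inputs
            for row in weights:
                row[j] = 0.0
    return kept


group_weights = [[0.9, -0.1, 0.3, 0.05],
                 [0.7, 0.2, -0.4, 0.0]]
kept = condense_group(group_weights, c_factor=4)       # 1/4 of the inputs remain
```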
29. Stochastic Downsampling for Cost-Adjustable Inference and
Improved Regularization in Convolutional Networks
(Nanyang Technological University & Adobe & Nvidia)
• Training the network w/ stochastic downsampling
31. Efficient video object segmentation via
Network Modulation (Snap)
• Semi-supervised video segmentation
• A human can easily segment an object in the whole
video without knowing its semantic meaning
• Typical scenario
• Given: First frame of a video along with an annotated object
mask
• Task: to accurately locate the object in all following frames
• Modulator + segmentation network
• Previous: FCN pre-training + fine-tuning the network for
a specific video sequence
• The fine-tuning process is inefficient
• Proposed: train the segmentation network only once,
and train a modulator that adapts it to the given task
• One-shot fine-tuning (One-shot learning == an application of meta-learning)
• Visual modulator(Attention), Spatial modulator
33. Mobile Video Object Detection with Temporally-Aware
Feature Maps (Georgia Tech, Google)
• Video object detection
• Imagenet VID 2015 dataset
• Single image object detector + LSTM
• LSTM layers to create an interweaved recurrent-
convolutional architecture
• Bottleneck-LSTM to reduce computational cost
• 15 FPS on a mobile CPU
• Smaller and faster than DFF(Deep Feature Flow)
• This work does not use optical flow estimation
34. Approach
• SSD + Convolutional LSTMs
• Mobilenet-SSD, Removing the final layer
• Inject convolutional LSTM layers directly into the single-
frame detector
• Allow the network to encode both spatial and temporal
information
• Feature refinement with LSTMs
• Place a single LSTM after the Conv13 layer
• Stack multiple LSTMs after the Conv13 layer
• Place one LSTM after each feature map
36. Towards High Performance Video Object
Detection (USTC, Microsoft Research)
• Recent works
• Motion estimation module is built into the network architecture
• Sparse feature propagation
• Expensive feature network on sparse key frames
• Motion field
• Dense feature aggregation
• Utilize every frame to enhance accuracy
• This paper suggests unified approach
• Sparsely recursive feature aggregation
• Spatially-adaptive partial feature updating
• To recompute features on non-key frames
• wherever propagated features have bad quality
• Temporally-adaptive key frame scheduling
• Dynamic key frame scheduling
38. Low-shot Learning with Imprinted Weights (UCLA)
• How to recognize novel visual categories?
• Given base classes w/ abundant samples for training
• Exposed to previously unseen novel classes with a limited amount of training data
for each category
• Directly set weights for a new category based on an appropriately
scaled copy of the embedding layer activations for that training
example
• Like the human ability to take in new visual categories, the learner grows its capability
as it encounters more categories and training samples
• A single imprinted weight vector is learned for each novel category
39. Metric Learning
• Proxy-based Embedding Training
• Previous works: Neighborhood components
analysis – learns a distance metric
• Comparison with all other classes
• Proxy-based training
• Comparison with other negative-correlated proxies
• Trainable proxies
• I cannot understand this concept exactly
• Imprinting
• Remembering the semantic embedding of low-
shot examples as the templates for new classes
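Imprinting can be sketched as appending a scaled copy of an embedding to the classifier. This is an illustration with assumptions: the scaling is taken to be L2 normalization, scores are dot products of normalized vectors, and the embeddings are made-up numbers.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def imprint(class_weights: List[List[float]], embedding: List[float]) -> List[List[float]]:
    """Add a novel class by storing its normalized embedding as the weight vector."""
    class_weights.append(l2_normalize(embedding))
    return class_weights


def predict(class_weights: List[List[float]], embedding: List[float]) -> int:
    """Pick the class whose weight template is most similar to the embedding."""
    e = l2_normalize(embedding)
    scores = [sum(w * x for w, x in zip(row, e)) for row in class_weights]
    return scores.index(max(scores))


weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # two base classes
imprint(weights, [0.1, 0.1, 2.0])                # one example of a novel class
novel_query = predict(weights, [0.0, 0.2, 1.5])  # a new sample of the novel class
```

No gradient step is needed to add the class; the stored template can still be fine-tuned later.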
41. Memory Matching Networks for One-Shot Image
Recognition (USTC, Microsoft)
• Writes the features of a set of labelled images into memory
• Reads from memory when performing inference
• A Contextual Learner employs the memory slots in a sequential manner to predict the parameters of
CNNs for unlabeled images
• MM-Net could output one unified model irrespective of the number of shots and
categories
42. One-Shot Image recognition
• Given an unlabeled image 𝑥, predict its class 𝑦
• $y = \arg\max_{y_n} P(y_n \mid x, S)$, where $P(y_n \mid x, S) = f(x \mid S)^{\top} \cdot g(x_n \mid S)$
• Different embedding function for unlabeled image and support image
• $x_n$: support sample of label $y_n$
• Design a memory module to encode the contextual information within
support set into the memory via write controller
• Memory: consist of M key-value pairs
• Key: 𝐷 𝑚-dimensional memory representation
• Value: class label
• Write controller
• Encode the sequence of N support images into M memory slots
• Aiming to distill the intrinsic characteristics of classes
• Contextual Embedding
• For support set / Unlabeled image
• bi-LSTM-based approach
43. Feature Generating Networks for Zero-Shot Learning
(Saarland Informatics Campus)
• How to cope with unseen classes? (Zero-shot learning task)
• Use GAN to synthesize features of unseen classes
• Use class-level semantic information
45. Dual Skipping Networks (Fudan Univ, Tencent AI)
• Inspired by neuroscience studies
• Coarse-to-fine object categorization
• Mimicking the behavior of human brain
• LH(Fine grain) & RH(Coarse grain)
• Propose a layer-skipping mechanism
• Learns a gating network to predict which layers to
skip
• E
46. Model
• Network has left-right subnets by referring to
LH and RH
• At first, both branches have roughly the same
initialized layers and structures
• Skip-Dense Block
• Dense Layer – Residual or DenseNet based block
• Gating network
• Path selection
• Learns from the training data whether or not to skip
the convolutional layer
• Threshold function of Gating network
• Performs as a binary classifier
• Training: act as a scale value
• Testing: discrete binary value (0: skip)
• Guide
• Faster coarse subnet can guide the slower fine/local
subnet
48. Deep Mutual Learning
(Dalian University of Technology, China)
• Model distillation
• A powerful large network teaches a small network
• Deep Mutual learning
• An ensemble of students learn collaboratively & teach each other
• Collaborative learning
• Dual learning[1]: two cross-lingual translation models teach each other
• Cooperative Learning[2]: Recognizing the same set of object categories but with
different inputs (ex: RGB + depth)
• This work: different models, but the same input and task
• No a priori powerful teacher network is necessary!
[1] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W. Ma. Dual learning for machine translation. In NIPS, pages 820– 828, 2016.
[2] T. Batra and D. Parikh. Cooperative learning with visual attributes. arXiv: 1705.05512, 2017.
49. Deep Mutual Learning
• Use KL divergence so that each network passes its training experience to the others
• $D_{KL}(p_2 \| p_1) = \sum_{i=1}^{N} \sum_{m=1}^{M} p_2^m(x_i) \log \frac{p_2^m(x_i)}{p_1^m(x_i)}$
($N$: # of samples, $M$: # of classes, $p_n$: output posterior of network $\theta_n$)
• Loss function: $L_{\theta_k} = L_{C_k} + \frac{1}{K-1} \sum_{l=1, l \neq k}^{K} D_{KL}(p_l \| p_k)$ and vice versa ($L_{C_k}$: classification loss)
• It can be extended to semi-supervised tasks
• (Label information is not required for posterior computation)
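The loss above can be evaluated for a single sample as a sanity check. This is a minimal sketch: the posteriors are made-up numbers, cross-entropy is assumed for the classification loss, and the function names are hypothetical.

```python
import math
from typing import List


def kl_div(p: List[float], q: List[float]) -> float:
    """D_KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p: List[float], label: int) -> float:
    return -math.log(p[label])


def mutual_loss(posteriors: List[List[float]], label: int, k: int) -> float:
    """Loss of network k: L_Ck + 1/(K-1) * sum over l != k of D_KL(p_l || p_k)."""
    others = [p for l, p in enumerate(posteriors) if l != k]
    mimicry = sum(kl_div(p, posteriors[k]) for p in others) / len(others)
    return cross_entropy(posteriors[k], label) + mimicry


p1 = [0.7, 0.2, 0.1]    # posterior of network 1 for one sample
p2 = [0.6, 0.3, 0.1]    # posterior of network 2 for the same sample
loss_1 = mutual_loss([p1, p2], label=0, k=0)
loss_2 = mutual_loss([p1, p2], label=0, k=1)
```

The KL term needs only the posteriors, not the label, which is why the method extends to unlabeled data.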
52. Interpret Neural Networks by Identifying Critical Data
Routing Paths (Tsinghua Univ.)
• Interpretable machine learning
algorithm
• To explain or present in terms
understandable to a human
• Distillation Guided Routing Method
• Discover the critical nodes on the
data routing paths for individual
input samples
• Scalar control gate
• Decide whether each layer’s output
channel is critical for the decision
53. Methodology
• Pretrained model + Channel-wise Control gates
• Control gates are learned to find the optimal routing decision in the network
• Scale value for each channel
• Distillation Guided Routing
• Perform SGD on the same input for T = 30 iterations
• Most scalar values of the gates should be close to zero
• Output of the new network should be similar to the original network
• $\arg\min_{\Lambda} L(f_\theta(x), f_\theta(x; \Lambda)) + \gamma \sum_k \|\lambda_k\|_1$
• Gradients for control gates: $\frac{\partial Loss}{\partial \Lambda} = \frac{\partial L}{\partial \Lambda} + \gamma \cdot \mathrm{sign}(\Lambda)$
• CDRPs representation
• $v$ for image $x$ = Concatenate(all $\Lambda$)
• Adversarial Samples Detection
• CDRPs comparison
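The gate update can be illustrated on a toy model. This sketch makes strong simplifying assumptions that are not in the source: the "network" is a scalar sum of fixed per-channel contributions, L is a squared error against the ungated output, gates are clipped at zero, and the learning rate and gamma are made-up values.

```python
from typing import List


def sign(v: float) -> float:
    return (v > 0) - (v < 0)


def learn_gates(contrib: List[float], gamma: float, lr: float, steps: int = 30) -> List[float]:
    """SGD on the gates with gradient dL/dLambda + gamma * sign(Lambda)."""
    target = sum(contrib)                  # output of the original network
    gates = [1.0] * len(contrib)           # start from the ungated network
    for _ in range(steps):
        out = sum(g * c for g, c in zip(gates, contrib))
        # dL/dlambda_k for L = (out - target)^2
        grad_l = [2.0 * (out - target) * c for c in contrib]
        gates = [max(0.0, g - lr * (gl + gamma * sign(g)))
                 for g, gl in zip(gates, grad_l)]
    return gates


# Channel 0 contributes most to the output, channel 2 contributes nothing.
gates = learn_gates([2.0, 1.0, 0.0], gamma=0.5, lr=0.1)
cdrp = gates                               # with one layer, the CDRP vector is just its gates
```

The sparsity term drives the gate of the irrelevant channel to zero, while the channel that matters most for the output keeps the largest gate.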
54. Deep Photo Enhancer: Unpaired Learning for Image Enhancement
from Photographs with GANs (National Taiwan Univ.)
• Problem
• Given a set of photographs w/
desired characteristics
• Transforms an input image
into an enhanced image with
those characteristics
• MIT-Adobe 5K dataset
• 5K images – original images &
several versions of retouched
images
• Competitive samples : retouched
images from photographer C
55. Network
• Define an enhancement by a set of examples Y
• Input X → U-Net based generator → Output (vs Y) → Discriminator
• Add Attention-based feature in the U-net
• To capture global features (such as the sky)
• Can use 2-way GAN for consistency checking
56. A2-RL: Aesthetics Aware Reinforcement
Learning for Image Cropping
• Cropping the image to improve
aesthetic quality
• AVA dataset*
• Traditional approach: sliding
window method
• Time consuming, fixed aspect ratio
• Weakly supervised Aesthetics
Aware Reinforcement Learning
• Train the agent using the actor-
critic architecture
• Sequential decision making
* N. Murray, L. Marchesotti, and F. Perronnin. Ava: A large- scale database for aesthetic visual analysis. In CVPR, 2012.
57. RL Agent
• 14 pre-defined actions
• Reward function: aesthetic score
• Output of the pretrained view finding network (aesthetic ranker) – Trained with same dataset
58. Distort-and-Recover: Color Enhancement using
Deep Reinforcement Learning (Lunit)
• Distort original image & use original image as a ground truth for
recovering
• Adobe-5K Training set, but only utilizes retouched images
• Training a reinforcement learning agent for color enhancement
• Compare the features & take an action
• Reduce the gap between two images
59. Neural Style Transfer via Meta Networks
(Peking Univ., National University of Singapore)
• Generate a dedicated network for
a specific style
• through one feed-forward in the meta
networks for neural style transfer
• No need for enormous training iterations
to adapt to a new style
• Small size neural style transfer
network is generated
60. Embodied Question Answering
(Georgia Institute of Technology, Facebook AI)
• New AI Task
• 3D environment
• Question → Navigate to
find the answer → Answer
61. Excluded papers
• NestedNet: Learning Nested Sparse Structures in Deep Neural Networks (SNU)
• Real-Time Monocular Depth Estimation using Synthetic Data with Domain Adaptation via Image
Style Transfer (Durham Univ.)
• Low-Latency Video Semantic Segmentation (CAS)
• Guided Proofreading of Automatic Segmentations for Connectomics (Harvard)
• Generative Adversarial Learning Towards Fast Weakly Supervised Detection (Xiamen Univ., Microsoft)
• Logo Synthesis and Manipulation with Clustered Generative Adversarial Networks (ETH Zurich)
• Neural Baby Talk(Georgia Institute of Technology, Facebook AI)
• Self-Supervised Feature Learning by Learning to Spot Artifacts(University of Bern)
• CleanNet: Transfer Learning for Scalable Image Classifier Training with Label Noise (Microsoft AI)
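Editor's notes
Empirical Measurements: Build a per-layer look-up table in advance to save as much time as possible.
Input image resolution does not seem to be part of this overall process. (The process is run separately for each resolution.)
Which filter? for k from 1 to K (also shown in the figure on the right)
Idea: an algorithm for adjusting the network could be built by combining empirical experiments with a scheduling problem.
Even if the early layers are left uncompressed, the later layers end up compressed more, so it is a loss.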
Plain-20, VGG16 4x, Mobilenet
Hard to support operations such as residual / concat.
Similar to our idea
Channel contribution: computed from the 1x1 weight values that are multiplied with that channel's values.