This document introduces convolutional neural networks (CNNs). It discusses how CNNs extract features with filters and pooling, building up representations of images while reducing the number of parameters. The key operations of a CNN, including convolution, nonlinear activation, pooling, and fully connected layers, are explained, and example applications are given. The evolution of CNNs is then reviewed, from LeNet and AlexNet to VGGNet, GoogLeNet, and ResNet, along with improvements such as ReLU, dropout, and batch normalization that helped CNNs train better and go deeper. The document closes with a brief look at invariance and equivariance, group ConvNets, and capsule networks.
2. Table of Contents
Convolutional Neural Nets?
Applications of Convolutional Neural Nets
How Convolutional Neural Nets Work
Evolution of Convolutional Neural Nets
Brief intro: Invariance and Equivariance
Limitations of CNN
Group ConvNet
Capsule Net
3.
[Figure: a 28x28 input image (x_image) is reshaped into a 784x1 vector and fed to a single-layer classifier, y = softmax(Wx + b), with 10 digit outputs.]
Neural Nets
# of unknown parameters to estimate = # of weights + # of biases = 784x10 + 10 = 7,850 !!! (see the sketch below)
• A conventional neural net starts directly from the pixel values of the input image
• High-resolution images therefore cannot be processed at high speed (the number of parameters explodes)
CONVOLUTIONAL NEURAL NETS ?
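To make the parameter count concrete, here is a minimal NumPy sketch (illustrative only, not code from the slides) of the single-layer softmax classifier y = softmax(Wx + b) on 28x28 inputs:

import numpy as np

# Single-layer softmax classifier for 28x28 images and 10 digit classes.
W = np.zeros((10, 784))   # weight matrix: one row per digit class
b = np.zeros(10)          # bias vector

def softmax(z):
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

x = np.random.rand(28, 28).reshape(784)      # flatten the image to a 784-vector
y = softmax(W @ x + b)                       # 10 class probabilities

print(W.size + b.size)                       # 784*10 + 10 = 7,850 parameters to estimate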
4. CONVOLUTIONAL NEURAL NETS ?
Networks for deep-learning-based visual perception
• A CNN extracts features in units of small patches (filters or kernels) with simple shapes
• Moving up the layers, these features are composed into the overall shape of the object
→ The number of parameters to estimate is reduced (see the comparison sketch below)
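As a rough, hypothetical comparison (the filter count is chosen only for illustration), counting the weights of a fully connected layer versus a small bank of 3x3 filters shows why patch-based feature extraction needs far fewer parameters:

# Fully connected softmax layer on a flattened 28x28 image (previous slide)
fc_params = 784 * 10 + 10            # 7,850

# Ten 3x3 filters applied to the same 1-channel image (weights shared across positions)
conv_params = 10 * (3 * 3 * 1) + 10  # 100

print(fc_params, conv_params)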
5.
• Color images are three-dimensional and so have a volume
• Time-domain speech signals are 1-D, while frequency-domain representations (e.g. MFCC vectors) take a 2-D form; they can also be viewed as a time sequence
• Medical images (such as CT/MR/etc.) are multi-dimensional
• Videos have an additional temporal dimension compared to still images
• Variable-length sequences and time-series data are again multi-dimensional
• Hence it makes sense to model these inputs as tensors instead of vectors
CONVOLUTIONAL NEURAL NETS ?
Types of inputs
6.
• Image retrieval from database
• Object Detection
• Self driving cars
• Semantic segmentation
• Face recognition (FB tagging)
• Pose estimation
• Detect diseases
• Speech Recognition
• Text processing
• Analysing satellite data
Applications of Convolutional Neural Nets
CNNs are everywhere
8.
Object detection and recognition
Uses visual perception to output the object class and bounding box (BB)
Applications of Convolutional Neural Nets
Various convolution layers
Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
10.
[Figure: sequential front-view camera frames → convolutional layers (CLs) → fully connected layers (FCLs) → action output A1]
End-to-end learning for a self-driving car
Learns visual perception and driving actions to perform autonomous driving
https://youtu.be/qhUvQiKec2U
Applications of Convolutional Neural Nets
11.
[Figure: 224x224-pixel camera input mapped to a 90x1 output vector]
Smart picking robot based on deep learning
Training an industrial robot through visual perception and reinforcement learning
Applications of Convolutional Neural Nets
12.
[Figure: feature extraction layers followed by classification layers]
A CNN consists of a feature extraction stage and a classification stage
Structure of Convolutional Neural Nets
19. 2X2 MAX POOLING WITH STRIDE=1
Example 1 (max pooling):
Input (3x3):        Output (2x2):
3 0 1               3 2
0 0 2               2 3
0 2 3

Example 2 (max pooling):
Input (3x3):        Output (2x2):
1 0 1               1 1
0 0 0               3 1
3 1 0
How Convolutional Neural Nets Work
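A minimal NumPy sketch of 2x2 max pooling with stride 1 (not from the slides) reproduces Example 1 above:

import numpy as np

def max_pool_2x2_stride1(x):
    """2x2 max pooling with stride 1 over a 2-D array."""
    h, w = x.shape
    out = np.empty((h - 1, w - 1))
    for i in range(h - 1):
        for j in range(w - 1):
            out[i, j] = x[i:i + 2, j:j + 2].max()   # max over each 2x2 window
    return out

a = np.array([[3, 0, 1],
              [0, 0, 2],
              [0, 2, 3]])
print(max_pool_2x2_stride1(a))   # [[3. 2.]
                                 #  [2. 3.]]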
20.
Why Pooling?
• Dimension reduction
• Adds spatial (translation & rotation) invariance to the feature maps
  – Able to recognize a feature regardless of angle, direction, or skew
  – Does not care where a feature is, as long as it keeps its relative position to other features
How Convolutional Neural Nets Work
24. Flattening takes the pooled feature maps and flattens them, in sequential order, into a single vector.
• This vector is used as the input to the classifier (see the sketch below)
Flattening
How Convolutional Neural Nets Work
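In NumPy terms, flattening is simply a reshape of the pooled feature maps into one long vector (sizes here are hypothetical):

import numpy as np

pooled = np.random.rand(8, 5, 5)   # e.g. 8 pooled feature maps of size 5x5
flat = pooled.reshape(-1)          # flattened, in sequential order, into one vector
print(flat.shape)                  # (200,) -- the input to the classifier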
26.
Evolution of Convolutional Neural Nets
LeNet to ResNet: A Deep Journey
LeNet-5 (1998): the origin of convolutional neural networks
Characteristics
• Repeats Convolution – Pooling – Nonlinearity
• Average pooling
• Sigmoid activation for the intermediate layers
• tanh activation at F6
• 5x5 convolution filters
• 7 layers and fewer than 1M parameters
Key Contributions
• Use of convolution to extract spatial features
• Subsampling using the spatial average of maps
• Sparse connection matrix between layers to avoid large computational cost
The Gap
• Slow to train
• Hard to train (neurons die quickly)
• Lack of data
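A LeNet-5-style network can be sketched in PyTorch roughly as follows (a simplified reconstruction based on the characteristics listed above; the original sparse C3 connection table and other details are omitted, and a 1x32x32 input is assumed):

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Rough LeNet-5-style sketch: conv -> pool -> conv -> pool -> 3 FC layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Sigmoid(), nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(), nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
            nn.Linear(120, 84), nn.Tanh(),          # tanh at F6, as in the original
            nn.Linear(84, num_classes),
        )

    def forward(self, x):                           # x: (N, 1, 32, 32)
        return self.classifier(self.features(x))

net = LeNet5()
print(sum(p.numel() for p in net.parameters()))     # well under 1M parameters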
27.
Evolution of Convolutional Neural Nets
• ImageNet is an image database organized according to the WordNet hierarchy; formally, it is a project aimed at (manually) labeling and categorizing images
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
• Training data: 1.2 million images, 1000+ categories
• Validation and test data: 150K images (50K validation, remaining test)
• ImageNet data: http://image-net.org/challenges/LSVRC/2010/browse-synsets
• Multiple challenges: object recognition, localization, etc.
28. IMAGENET CLASSIFICATION RESULTS
<2012 Result>
• Krizhevsky et al. – 16.4% error (top-5)
• Next best (non-ConvNet) – 26.2% error
<2013 Result>
• All top entries use deep learning (ConvNets)
Revolution of Depth!
AlexNet
Evolution of Convolutional Neural Nets
29.
Evolution of Convolutional Neural Nets
ALEXNET (2012)
Characteristics
• 11x11, 5x5, and 3x3 convolutions
• Max pooling
• 3 FC layers
• 60 million parameters
Key Contributions
• GPU training in parallel
• ReLU activation
• Dropout regularization
• Image augmentation
30.
Evolution of Convolutional Neural Nets
A 4-layer CNN with ReLUs is 6 times faster than an equivalent network with tanh in reaching a 25% error rate on the CIFAR-10 dataset
RELU NON-LINEARITY – SIMPLER ACTIVATION
31.
Deep learning – ReLU
How does the sigmoid function affect learning?
• It enables easy computation of the derivative but has negative effects:
  – The neuron saturates as its output approaches 0 or 1
  – The gradient reduces the magnitude of the error at every layer
• This leads to two problems:
  – Slow learning when neurons are saturated, i.e. at large |z| values
  – The vanishing gradient problem (the gradient is at most 25% of the error from the previous layer!)
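The 25% figure comes from the derivative of the sigmoid, σ(z)(1 − σ(z)), which peaks at 0.25 and vanishes for large |z|; a quick NumPy check:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 1001)
grad = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of the sigmoid
print(grad.max())                        # 0.25, reached at z = 0
print(grad[0], grad[-1])                 # ~0 for large |z|: saturated neurons learn slowly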
32.
Deep learning – ReLU
• AlexNet (Krizhevsky et al., 2012) popularized the Rectified Linear Unit (ReLU) in place of the sigmoid function
• Main purpose of ReLU: reduces the saturation and vanishing-gradient issues
• Still not perfect:
  – Stops learning at negative z values (can use a piecewise linear variant – Parametric ReLU, He 2015 from Microsoft)
  – Bigger risk of activations growing without bound (no upper saturation)
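For reference, a plain-NumPy sketch of ReLU and a leaky/parametric variant (the slope alpha is fixed here; in Parametric ReLU it is learned):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # zero gradient for z < 0: those units stop learning

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z) # small slope keeps a gradient for z < 0

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))                           # [0.  0.  0.  1.5]
print(leaky_relu(z))                     # [-0.02  -0.005  0.  1.5]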
33.
Deep learning – dropout
• Too many weights cause overfitting issues
• Weight decay (regularization) helps but is not perfect
  – It also adds another hyper-parameter to tune manually
• Srivastava et al. (2014) proposed a kind of "bagging" for deep nets (Alex Krizhevsky had already used it in AlexNet)
• Main point:
  – Robustify the network by disabling neurons
  – Each neuron has a probability, e.g. 0.4, of being disabled
  – The remaining neurons must adapt to work without them
• Applied only to the fully connected layers
  – Conv. layers are less susceptible to overfitting
Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
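An inverted-dropout sketch for a fully connected activation vector (NumPy, illustrative; p is the drop probability mentioned above):

import numpy as np

def dropout(a, p=0.4, training=True):
    """Disable each neuron with probability p; rescale survivors (inverted dropout)."""
    if not training:
        return a                              # inference: all neurons active
    mask = np.random.rand(*a.shape) >= p      # surviving neurons
    return a * mask / (1.0 - p)               # rescale so the expected activation is unchanged

a = np.random.rand(8)
print(dropout(a, p=0.4))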
34.
Deep learning – batch norm
• Input needs to be whitened, i.e. normalized (LeCun 1998, Efficient BackProp)
  – Usually done on the first layer's input only
• The same reason for normalizing the first layer's input exists for the other layers as well
• Ioffe and Szegedy, Batch Normalization, 2015
  – Normalize the input to each layer
  – Reduces internal covariate shift
  – Normalizing over all the input data (>1M samples) would be too slow
  – Instead, normalize within each mini-batch only
  – Training: normalize over the mini-batch statistics
  – Inference: normalize using statistics accumulated over the training data
Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015
Better results, while allowing a higher learning rate, faster decay, no dropout, and no LRN.
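A training-time sketch of batch normalization over one mini-batch (NumPy; gamma and beta are the learned scale and shift, and the running statistics used at inference are omitted):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                        # per-feature mean over the mini-batch
    var = x.var(axis=0)                        # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalized activations
    return gamma * x_hat + beta                # learned scale and shift

x = np.random.randn(32, 64)                    # mini-batch of 32 samples, 64 features
y = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])   # ~0 mean, ~1 std per feature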
35.
VGG (2014)
• Small 3x3 convolutions used throughout the net
• A sequence of 3x3 convolutions can emulate larger receptive fields, e.g. 5x5 or 7x7
• Use of 1x1 convolutions
• Decrease in spatial volume and increase in depth of the input
What is the advantage of using 3 layers of 3x3 instead of one layer of 7x7?
• 3 nonlinear rectification layers instead of 1
• Fewer parameters: 27C² as opposed to 49C²
Key Points
• Depth is important
• Simplify the network to go deep
• 140M parameters (mostly due to the FC layers)
Evolution of Convolutional Neural Nets
36.
VGG (2ND PLACE IN 2014)
Only 3x3 filters are used, repeatedly
Why??
Stacking convolution filters yields a larger effective receptive field:
• Two 3x3 filters = one 5x5 filter
• Three 3x3 filters = one 7x7 filter
The number of parameters is smaller than when using the larger filter (see the count below)
→ a regularization effect
“Very Deep Convolutional Networks for Large-Scale Image Recognition”
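The 27C² vs 49C² comparison, written out (C input and C output channels, biases ignored; C = 256 is an arbitrary example):

C = 256
three_3x3 = 3 * (3 * 3 * C * C)   # three stacked 3x3 conv layers: 27*C^2
one_7x7   = 7 * 7 * C * C         # a single 7x7 conv layer:       49*C^2
print(three_3x3, one_7x7)         # 1,769,472 vs 3,211,264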
37.
Evolution of Convolutional Neural Nets
GOOGLENET OR INCEPTION (2014)
• 22-layer CNN
• Heavy use of 1x1 'Network in Network' convolutions
• Use of average pooling before the classifier
• Auxiliary classifiers connected to intermediate layers
• During training, the losses of the auxiliary classifiers are added with a discount weight (0.3)
38.
GOOGLENET KEY IDEAS
• Which is better, 3x3 or 5x5?
• Try using all of them
→ the amount of computation grows
Naïve Version
Modified Idea
Way too many outputs!!! Use 1x1 convolutions for dimensionality reduction
Why 1x1 convolution?
• Introduced as "Network in Network" in 2014
• A way to increase nonlinearity and combine features across feature maps at each spatial position
Only 4M parameters compared to 60M in AlexNet
39.
GOOGLENET KEY IDEAS
Dimension reduction using 1x1 convolutions
Halving the number of feature maps keeps the total amount of computation about the same
40.
GOOGLENET KEY IDEAS
How 1x1 convolution performs dimension reduction
[Figure: input layer x and the k-th output feature map y. Each input position x_ij is a 1x256 vector, w_k is a 1x256 weight vector, and y_ij,k = f(x_ij · w_k), where f() is a nonlinear function.]
The principle is the same as feature dimension reduction with a fully connected NN
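A NumPy sketch of this view of 1x1 convolution (channel sizes are arbitrary): each spatial position's 256-vector is multiplied by the same C_in x C_out weight matrix, reducing 256 feature maps to 64 while keeping the spatial size:

import numpy as np

x = np.random.randn(28, 28, 256)        # input: H x W x C_in feature maps
W = np.random.randn(256, 64) * 0.01     # 64 filters, each of size 1x1x256
y = np.maximum(0.0, x @ W)              # y[i, j, k] = f(x[i, j, :] . w_k), f = ReLU
print(y.shape)                          # (28, 28, 64): same spatial size, fewer channels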
41.
RESNET (RESIDUAL NEURAL NETWORK) (2015)
Evolution of Convolutional Neural Nets
• Introduces shortcut connections (which exist in prior literature in various forms)
• The key invention is to skip 2 layers; skipping a single layer did not give much improvement for some reason
42.
RESNET
Are more layers always better?
Using 56 layers gave a larger training error than using 20 layers
43.
A deeper model should be able to reach a lower training error, but it was found that deep models are hard to optimize (even the identity mapping is hard to learn)
Causes: vanishing/exploding gradients and the growth in the number of parameters to learn
[Figure: a shallower model (18 layers) vs a deeper model (34 layers)]
“Deep Residual Learning for Image Recognition”
RESNET
44. RESNET'S KEY IDEA
The identity is passed unchanged to the upper layers through a shortcut, and only the remaining part is learned
The goal is not to obtain H(x) directly, but to learn F(x) = H(x) - x
Since F(x) ~ 0, convergence is fast
Effects of the identity shortcut:
- Very deep networks can also be optimized
- Accuracy improves in proportion to depth
“Deep Residual Learning for Image Recognition”
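A basic residual block sketched in PyTorch (a simplified version of the design in the paper; the channel count and layer details are illustrative): the shortcut carries x unchanged, so the stacked layers only have to learn F(x):

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 conv layers plus an identity shortcut: output = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))            # F(x)
        return self.relu(out + x)                  # identity shortcut: H(x) = F(x) + x

y = BasicBlock(64)(torch.randn(1, 64, 32, 32))
print(y.shape)   # torch.Size([1, 64, 32, 32])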
45. BOTTLENECK: A PRACTICAL DESIGN
• # parameters of the bottleneck: 256x64 + 64x3x3x64 + 64x256 = ~70K
• # parameters using just a 3x3x256x256 conv layer = ~600K
1x1 conv for dimension reduction → 3x3 conv → 1x1 conv for dimension expansion
The purpose is to reduce the amount of computation
RESNET'S KEY IDEA
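The two parameter counts above, written out (biases ignored):

# Bottleneck: 1x1 reduce (256 -> 64), 3x3 conv (64 -> 64), 1x1 expand (64 -> 256)
bottleneck = 256 * 64 + 64 * 3 * 3 * 64 + 64 * 256   # 69,632  (~70K)
# A single plain 3x3 conv with 256 input and 256 output channels
plain_3x3  = 3 * 3 * 256 * 256                        # 589,824 (~600K)
print(bottleneck, plain_3x3)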
46. Dilated convolutions
The goal of this layer is to increase the size of the receptive field (the set of input activations used to compute a given output) without using downsampling, in order to preserve local information.
Increasing the size of the receptive field allows more context (information spatially further away) to be used.
The idea is to spread the filter taps apart, filling the inserted positions with zeros, and then compute the convolution.
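In PyTorch this is just the dilation argument of a convolution (sizes here are arbitrary); a 3x3 kernel with dilation 2 covers a 5x5 receptive field with only 9 weights and no downsampling:

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)  # padding=2 preserves spatial size
x = torch.randn(1, 1, 16, 16)
print(conv(x).shape)   # torch.Size([1, 1, 16, 16]): same size, larger receptive field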
51. ConvNets are translation equivariant
[Figure: a demonstration of LeNet-5's invariance to small rotations (+/- 40 degrees)]
How about rotation?
Limitations of Conventional ConvNets
52. 2-D convolution is equivariant under translation, but not under rotation (see the check below)
Limitations of Conventional ConvNets
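A small NumPy/SciPy check of this claim (circular boundaries are used so that shifting and convolving commute exactly; the kernel is random, so rotation almost surely does not commute):

import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(8, 8)
k = np.random.rand(3, 3)
conv = lambda x: convolve2d(x, k, mode='same', boundary='wrap')

# Translation: convolving a shifted image equals shifting the convolved image.
a = conv(np.roll(img, shift=(2, 3), axis=(0, 1)))
b = np.roll(conv(img), shift=(2, 3), axis=(0, 1))
print(np.allclose(a, b))                                        # True: equivariant under translation

# Rotation: the two orders generally differ for a non-symmetric kernel.
print(np.allclose(conv(np.rot90(img)), np.rot90(conv(img))))    # False (in general)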
53. Invariance
[Figure: two images, X1 and X2 = T_g^1 X1, are mapped by the same function Φ(·) to a single feature Z.]
X2 = T_g^1 X1
Z = Z1 = Φ(X1) = Z2 = Φ(X2) = Φ(T_g^1 X1)
: the mapping is independent of the transformation T_g, for all T_g
54. To make a convolutional neural network (CNN) transformation-invariant, data augmentation of the training samples is generally used
Invariance
55. Equivariance
[Figure: a transformation T_g^1 in image space corresponds to a transformation T_g^2 in feature space under the mapping Φ(·).]
X2 = T_g^1 X1
Z2 = T_g^2 Z1
Z2 = T_g^2 Z1 = T_g^2 Φ(X1) = Φ(T_g^1 X1)
: the mapping preserves the algebraic structure of the transformation
Z1 ≠ Z2, but the relationship between them is kept
: invariance is the special case of equivariance where T_g^2 is the identity
56. Equivariance: Group ConvNet
To understand the rotation or proportion change of a given entity, a group of filters (a combination of rotated and mirror-reflected versions of a filter) is adopted.
For example, the group p4 contains translations and rotations by multiples of ninety degrees, while p4m additionally contains mirror reflections.
[Figure: rotated versions of a filter; mirror-reflected versions of a filter]
57. A filter in a G-CNN detects co-occurrences of features that have the preferred relative pose, and can match such a feature constellation in every global pose through an operation called the G-convolution.
Equivariance: Group ConvNet
[Figure: filter group 1, filter group 2, ..., filter group N]
58. [Figure: visualization of a classic 2-D convolution vs the G-convolution for the roto-translation group]
G-Convolution
Equivariance: Group ConvNet
60. Equivariance: Group ConvNet
Latent representations learned by a CNN and a G-CNN:
- The left part is the result of a typical CNN, while the right one is that of a G-CNN.
- In both parts, the outer cycle consists of rotated images, while the inner cycle consists of the learned representations.
- The features produced by a G-CNN are equivariant to rotation, while those produced by a typical CNN are not.
61. What we need: EQUIVARIANCE (not invariance)
"Equivariance lets a CNN understand the rotation or proportion change of an entity"
Equivariance: Capsule Net
62. “A capsule is a group of neurons whose activity vector represents
the instantiation parameters of a specific type of entity such as an
object or an object part.”
Equivariance: Capsule Net
63. Equivariance of Capsules
“A capsule is a group of neurons whose activity vector represents the
instantiation parameters of a specific type of entity such as an object or
an object part.”
[Figure: activity vector map and the corresponding object]
Equivariance: Capsule Net