This document introduces convolutional neural networks (CNNs). It discusses how CNNs extract features with filters and pooling, building up representations of images while reducing the number of parameters. The key operations of a CNN, including convolution, nonlinear activation, pooling, and fully connected layers, are explained, and example applications are given. The evolution of CNNs is then reviewed, from LeNet and AlexNet to VGGNet, GoogLeNet, and ResNet, along with improvements such as ReLU, dropout, and batch normalization that helped CNNs train better and go deeper. The document closes with a brief look at invariance and equivariance, group ConvNets, and capsule networks.
2. Table of Contents
Convolutional Neural Nets?
Applications of Convolutional Neural Nets
How Convolutional Neural Nets Work
Evolution of Convolutional Neural Nets
Brief intro: Invariance and Equivariance
Limitations of CNN
Group ConvNet
Capsule Net
3.
[Figure: a 28x28 input image (x_image) is reshaped into a 784x1 vector and fed to a single-layer classifier, y = softmax(Wx + b), with 10 digit outputs.]
Neural Nets
# of unknown parameters to estimate = # of weights + # of biases = 784x10 + 10 = 7,850 !!! (see the sketch below)
• A conventional neural net starts directly from the pixel values of the input image
• High-resolution images therefore cannot be processed at high speed (the number of parameters explodes)
CONVOLUTIONAL NEURAL NETS ?
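To make the parameter count concrete, here is a minimal NumPy sketch (illustrative only, not code from the slides) of the single-layer softmax classifier y = softmax(Wx + b) on 28x28 inputs:

import numpy as np

# Single-layer softmax classifier for 28x28 images and 10 digit classes.
W = np.zeros((10, 784))   # weight matrix: one row per digit class
b = np.zeros(10)          # bias vector

def softmax(z):
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

x = np.random.rand(28, 28).reshape(784)      # flatten the image to a 784-vector
y = softmax(W @ x + b)                       # 10 class probabilities

print(W.size + b.size)                       # 784*10 + 10 = 7,850 parameters to estimate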
4. CONVOLUTIONAL NEURAL NETS ?
Networks for deep-learning-based visual perception
• A CNN extracts features in units of small patches (filters or kernels) with simple shapes
• Moving up the layers, these features are composed into the overall shape of the object
→ The number of parameters to estimate is reduced (see the comparison sketch below)
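As a rough, hypothetical comparison (the filter count is chosen only for illustration), counting the weights of a fully connected layer versus a small bank of 3x3 filters shows why patch-based feature extraction needs far fewer parameters:

# Fully connected softmax layer on a flattened 28x28 image (previous slide)
fc_params = 784 * 10 + 10            # 7,850

# Ten 3x3 filters applied to the same 1-channel image (weights shared across positions)
conv_params = 10 * (3 * 3 * 1) + 10  # 100

print(fc_params, conv_params)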
5.
• Color images are three-dimensional and so have a volume
• Time-domain speech signals are 1-D, while frequency-domain representations (e.g. MFCC vectors) take a 2-D form; they can also be viewed as a time sequence
• Medical images (such as CT/MR/etc.) are multi-dimensional
• Videos have an additional temporal dimension compared to still images
• Variable-length sequences and time-series data are again multi-dimensional
• Hence it makes sense to model these inputs as tensors instead of vectors
CONVOLUTIONAL NEURAL NETS ?
Types of inputs
6.
• Image retrieval from database
• Object Detection
• Self driving cars
• Semantic segmentation
• Face recognition (FB tagging)
• Pose estimation
• Detect diseases
• Speech Recognition
• Text processing
• Analysing satellite data
Applications of Convolutional Neural Nets
CNNs are everywhere
8.
Object detection and recognition
Uses visual perception to output the object class and bounding box (BB)
Applications of Convolutional Neural Nets
Various convolution layers
Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
10.
[Figure: sequential front-view camera frames → convolutional layers (CLs) → fully connected layers (FCLs) → action output A1]
End-to-end learning for a self-driving car
Learns visual perception and driving actions to perform autonomous driving
https://youtu.be/qhUvQiKec2U
Applications of Convolutional Neural Nets
11.
[Figure: 224x224-pixel camera input mapped to a 90x1 output vector]
Smart picking robot based on deep learning
Training an industrial robot through visual perception and reinforcement learning
Applications of Convolutional Neural Nets
12.
[Figure: feature extraction layers followed by classification layers]
A CNN consists of a feature extraction stage and a classification stage
Structure of Convolutional Neural Nets
19. 2X2 MAX POOLING WITH STRIDE=1
Example 1 (max pooling):
Input (3x3):        Output (2x2):
3 0 1               3 2
0 0 2               2 3
0 2 3

Example 2 (max pooling):
Input (3x3):        Output (2x2):
1 0 1               1 1
0 0 0               3 1
3 1 0
How Convolutional Neural Nets Work
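A minimal NumPy sketch of 2x2 max pooling with stride 1 (not from the slides) reproduces Example 1 above:

import numpy as np

def max_pool_2x2_stride1(x):
    """2x2 max pooling with stride 1 over a 2-D array."""
    h, w = x.shape
    out = np.empty((h - 1, w - 1))
    for i in range(h - 1):
        for j in range(w - 1):
            out[i, j] = x[i:i + 2, j:j + 2].max()   # max over each 2x2 window
    return out

a = np.array([[3, 0, 1],
              [0, 0, 2],
              [0, 2, 3]])
print(max_pool_2x2_stride1(a))   # [[3. 2.]
                                 #  [2. 3.]]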
20.
Why Pooling?
• Dimension reduction
• Adds spatial (translation & rotation) invariance to the feature maps
  – Able to recognize a feature regardless of angle, direction, or skew
  – Does not care where a feature is, as long as it keeps its relative position to other features
How Convolutional Neural Nets Work
24. Flattening takes the pooled feature maps and flattens them, in sequential order, into a single vector.
• This vector is used as the input to the classifier (see the sketch below)
Flattening
How Convolutional Neural Nets Work
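In NumPy terms, flattening is simply a reshape of the pooled feature maps into one long vector (sizes here are hypothetical):

import numpy as np

pooled = np.random.rand(8, 5, 5)   # e.g. 8 pooled feature maps of size 5x5
flat = pooled.reshape(-1)          # flattened, in sequential order, into one vector
print(flat.shape)                  # (200,) -- the input to the classifier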
26.
Evolution of Convolutional Neural Nets
LeNet to ResNet: A Deep Journey
LeNet-5 (1998): the origin of convolutional neural networks
Characteristics
• Repeats Convolution – Pooling – Nonlinearity
• Average pooling
• Sigmoid activation for the intermediate layers
• tanh activation at F6
• 5x5 convolution filters
• 7 layers and fewer than 1M parameters
Key Contributions
• Use of convolution to extract spatial features
• Subsampling using the spatial average of maps
• Sparse connection matrix between layers to avoid large computational cost
The Gap
• Slow to train
• Hard to train (neurons die quickly)
• Lack of data
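A LeNet-5-style network can be sketched in PyTorch roughly as follows (a simplified reconstruction based on the characteristics listed above; the original sparse C3 connection table and other details are omitted, and a 1x32x32 input is assumed):

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Rough LeNet-5-style sketch: conv -> pool -> conv -> pool -> 3 FC layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Sigmoid(), nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(), nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
            nn.Linear(120, 84), nn.Tanh(),          # tanh at F6, as in the original
            nn.Linear(84, num_classes),
        )

    def forward(self, x):                           # x: (N, 1, 32, 32)
        return self.classifier(self.features(x))

net = LeNet5()
print(sum(p.numel() for p in net.parameters()))     # well under 1M parameters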
27.
Evolution of Convolutional Neural Nets
• ImageNet is an image database organized according to the WordNet hierarchy; formally, it is a project aimed at (manually) labeling and categorizing images
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
• Training data: 1.2 million images, 1000+ categories
• Validation and test data: 150K images (50K validation, remaining test)
• ImageNet data: http://image-net.org/challenges/LSVRC/2010/browse-synsets
• Multiple challenges: object recognition, localization, etc.
28. IMAGENET CLASSIFICATION RESULTS
<2012 Result>
• Krizhevsky et al. – 16.4% error (top-5)
• Next best (non-ConvNet) – 26.2% error
<2013 Result>
• All top entries use deep learning (ConvNets)
Revolution of Depth!
AlexNet
Evolution of Convolutional Neural Nets
29.
Evolution of Convolutional Neural Nets
ALEXNET (2012)
Characteristics
• 11x11, 5x5, and 3x3 convolutions
• Max pooling
• 3 FC layers
• 60 million parameters
Key Contributions
• GPU training in parallel
• ReLU activation
• Dropout regularization
• Image augmentation
30.
Evolution of Convolutional Neural Nets
A 4-layer CNN with ReLUs is 6 times faster than an equivalent network with tanh in reaching a 25% error rate on the CIFAR-10 dataset
RELU NON-LINEARITY – SIMPLER ACTIVATION
31.
Deep learning – ReLU
How does the sigmoid function affect learning?
• It enables easy computation of the derivative but has negative effects:
  – The neuron saturates as its output approaches 0 or 1
  – The gradient reduces the magnitude of the error at every layer
• This leads to two problems:
  – Slow learning when neurons are saturated, i.e. at large |z| values
  – The vanishing gradient problem (the gradient is at most 25% of the error from the previous layer!)
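The 25% figure comes from the derivative of the sigmoid, σ(z)(1 − σ(z)), which peaks at 0.25 and vanishes for large |z|; a quick NumPy check:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 1001)
grad = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of the sigmoid
print(grad.max())                        # 0.25, reached at z = 0
print(grad[0], grad[-1])                 # ~0 for large |z|: saturated neurons learn slowly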
32.
Deep learning – ReLU
• AlexNet (Krizhevsky et al., 2012) popularized the Rectified Linear Unit (ReLU) in place of the sigmoid function
• Main purpose of ReLU: reduces the saturation and vanishing-gradient issues
• Still not perfect:
  – Stops learning at negative z values (can use a piecewise linear variant – Parametric ReLU, He 2015 from Microsoft)
  – Bigger risk of activations growing without bound (no upper saturation)
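For reference, a plain-NumPy sketch of ReLU and a leaky/parametric variant (the slope alpha is fixed here; in Parametric ReLU it is learned):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # zero gradient for z < 0: those units stop learning

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z) # small slope keeps a gradient for z < 0

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))                           # [0.  0.  0.  1.5]
print(leaky_relu(z))                     # [-0.02  -0.005  0.  1.5]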
33.
Deep learning – dropout
• Too many weights cause overfitting issues
• Weight decay (regularization) helps but is not perfect
  – It also adds another hyper-parameter to tune manually
• Srivastava et al. (2014) proposed a kind of "bagging" for deep nets (Alex Krizhevsky had already used it in AlexNet)
• Main point:
  – Robustify the network by disabling neurons
  – Each neuron has a probability, e.g. 0.4, of being disabled
  – The remaining neurons must adapt to work without them
• Applied only to the fully connected layers
  – Conv. layers are less susceptible to overfitting
Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
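An inverted-dropout sketch for a fully connected activation vector (NumPy, illustrative; p is the drop probability mentioned above):

import numpy as np

def dropout(a, p=0.4, training=True):
    """Disable each neuron with probability p; rescale survivors (inverted dropout)."""
    if not training:
        return a                              # inference: all neurons active
    mask = np.random.rand(*a.shape) >= p      # surviving neurons
    return a * mask / (1.0 - p)               # rescale so the expected activation is unchanged

a = np.random.rand(8)
print(dropout(a, p=0.4))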
34.
Deep learning – batch norm
• Input needs to be whitened, i.e. normalized (LeCun 1998, Efficient BackProp)
  – Usually done on the first layer's input only
• The same reason for normalizing the first layer's input exists for the other layers as well
• Ioffe and Szegedy, Batch Normalization, 2015
  – Normalize the input to each layer
  – Reduces internal covariate shift
  – Normalizing over all the input data (>1M samples) would be too slow
  – Instead, normalize within each mini-batch only
  – Training: normalize over the mini-batch statistics
  – Inference: normalize using statistics accumulated over the training data
Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015
Better results, while allowing a higher learning rate, faster decay, no dropout, and no LRN.
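A training-time sketch of batch normalization over one mini-batch (NumPy; gamma and beta are the learned scale and shift, and the running statistics used at inference are omitted):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                        # per-feature mean over the mini-batch
    var = x.var(axis=0)                        # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalized activations
    return gamma * x_hat + beta                # learned scale and shift

x = np.random.randn(32, 64)                    # mini-batch of 32 samples, 64 features
y = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])   # ~0 mean, ~1 std per feature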
35.
VGG (2014)
• Small 3x3 convolutions used throughout the net
• A sequence of 3x3 convolutions can emulate larger receptive fields, e.g. 5x5 or 7x7
• Use of 1x1 convolutions
• Decrease in spatial volume and increase in depth of the input
What is the advantage of using 3 layers of 3x3 instead of one layer of 7x7?
• 3 nonlinear rectification layers instead of 1
• Fewer parameters: 27C² as opposed to 49C²
Key Points
• Depth is important
• Simplify the network to go deep
• 140M parameters (mostly due to the FC layers)
Evolution of Convolutional Neural Nets
36.
VGG (2ND PLACE IN 2014)
Only 3x3 filters are used, repeatedly
Why??
Stacking convolution filters yields a larger effective receptive field:
• Two 3x3 filters = one 5x5 filter
• Three 3x3 filters = one 7x7 filter
The number of parameters is smaller than when using the larger filter (see the count below)
→ a regularization effect
“Very Deep Convolutional Networks for Large-Scale Image Recognition”
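The 27C² vs 49C² comparison, written out (C input and C output channels, biases ignored; C = 256 is an arbitrary example):

C = 256
three_3x3 = 3 * (3 * 3 * C * C)   # three stacked 3x3 conv layers: 27*C^2
one_7x7   = 7 * 7 * C * C         # a single 7x7 conv layer:       49*C^2
print(three_3x3, one_7x7)         # 1,769,472 vs 3,211,264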
37.
Evolution of Convolutional Neural Nets
GOOGLENET OR INCEPTION (2014)
• 22-layer CNN
• Heavy use of 1x1 'Network in Network' convolutions
• Use of average pooling before the classifier
• Auxiliary classifiers connected to intermediate layers
• During training, the losses of the auxiliary classifiers are added with a discount weight (0.3)
38.
GOOGLENET KEY IDEAS
• Which is better, 3x3 or 5x5?
• Try using all of them
→ the amount of computation grows
Naïve Version
Modified Idea
Way too many outputs!!! Use 1x1 convolutions for dimensionality reduction
Why 1x1 convolution?
• Introduced as "Network in Network" in 2014
• A way to increase nonlinearity and combine features across feature maps at each spatial position
Only 4M parameters compared to 60M in AlexNet
39.
GOOGLENET KEY IDEAS
Dimension reduction using 1x1 convolutions
Halving the number of feature maps keeps the total amount of computation about the same
40.
GOOGLENET KEY IDEAS
How 1x1 convolution performs dimension reduction
[Figure: input layer x and the k-th output feature map y. Each input position x_ij is a 1x256 vector, w_k is a 1x256 weight vector, and y_ij,k = f(x_ij · w_k), where f() is a nonlinear function.]
The principle is the same as feature dimension reduction with a fully connected NN
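A NumPy sketch of this view of 1x1 convolution (channel sizes are arbitrary): each spatial position's 256-vector is multiplied by the same C_in x C_out weight matrix, reducing 256 feature maps to 64 while keeping the spatial size:

import numpy as np

x = np.random.randn(28, 28, 256)        # input: H x W x C_in feature maps
W = np.random.randn(256, 64) * 0.01     # 64 filters, each of size 1x1x256
y = np.maximum(0.0, x @ W)              # y[i, j, k] = f(x[i, j, :] . w_k), f = ReLU
print(y.shape)                          # (28, 28, 64): same spatial size, fewer channels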
41.
RESNET (RESIDUAL NEURAL NETWORK) (2015)
Evolution of Convolutional Neural Nets
• Introduces shortcut connections (which exist in prior literature in various forms)
• The key invention is to skip 2 layers; skipping a single layer did not give much improvement for some reason
42.
RESNET
Are more layers always better?
Using 56 layers gave a larger training error than using 20 layers
43.
A deeper model should be able to reach a lower training error, but it was found that deep models are hard to optimize (even the identity mapping is hard to learn)
Causes: vanishing/exploding gradients and the growth in the number of parameters to learn
[Figure: a shallower model (18 layers) vs a deeper model (34 layers)]
“Deep Residual Learning for Image Recognition”
RESNET
44. RESNET'S KEY IDEA
The identity is passed unchanged to the upper layers through a shortcut, and only the remaining part is learned
The goal is not to obtain H(x) directly, but to learn F(x) = H(x) - x
Since F(x) ~ 0, convergence is fast
Effects of the identity shortcut:
- Very deep networks can also be optimized
- Accuracy improves in proportion to depth
“Deep Residual Learning for Image Recognition”
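A basic residual block sketched in PyTorch (a simplified version of the design in the paper; the channel count and layer details are illustrative): the shortcut carries x unchanged, so the stacked layers only have to learn F(x):

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 conv layers plus an identity shortcut: output = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))            # F(x)
        return self.relu(out + x)                  # identity shortcut: H(x) = F(x) + x

y = BasicBlock(64)(torch.randn(1, 64, 32, 32))
print(y.shape)   # torch.Size([1, 64, 32, 32])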
45. BOTTLENECK: A PRACTICAL DESIGN
• # parameters of the bottleneck: 256x64 + 64x3x3x64 + 64x256 = ~70K
• # parameters using just a 3x3x256x256 conv layer = ~600K
1x1 conv for dimension reduction → 3x3 conv → 1x1 conv for dimension expansion
The purpose is to reduce the amount of computation
RESNET'S KEY IDEA
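The two parameter counts above, written out (biases ignored):

# Bottleneck: 1x1 reduce (256 -> 64), 3x3 conv (64 -> 64), 1x1 expand (64 -> 256)
bottleneck = 256 * 64 + 64 * 3 * 3 * 64 + 64 * 256   # 69,632  (~70K)
# A single plain 3x3 conv with 256 input and 256 output channels
plain_3x3  = 3 * 3 * 256 * 256                        # 589,824 (~600K)
print(bottleneck, plain_3x3)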
46. Dilated convolutions
The goal of this layer is to increase the size of the receptive field (the set of input activations used to compute a given output) without using downsampling, in order to preserve local information.
Increasing the size of the receptive field allows more context (information spatially further away) to be used.
The idea is to spread the filter taps apart, filling the inserted positions with zeros, and then compute the convolution.
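In PyTorch this is just the dilation argument of a convolution (sizes here are arbitrary); a 3x3 kernel with dilation 2 covers a 5x5 receptive field with only 9 weights and no downsampling:

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)  # padding=2 preserves spatial size
x = torch.randn(1, 1, 16, 16)
print(conv(x).shape)   # torch.Size([1, 1, 16, 16]): same size, larger receptive field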
51. ConvNets are translation equivariant
[Figure: a demonstration of LeNet-5's invariance to small rotations (+/- 40 degrees)]
How about rotation?
Limitations of Conventional ConvNets
52. 2-D convolution is equivariant under translation, but not under rotation (see the check below)
Limitations of Conventional ConvNets
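A small NumPy/SciPy check of this claim (circular boundaries are used so that shifting and convolving commute exactly; the kernel is random, so rotation almost surely does not commute):

import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(8, 8)
k = np.random.rand(3, 3)
conv = lambda x: convolve2d(x, k, mode='same', boundary='wrap')

# Translation: convolving a shifted image equals shifting the convolved image.
a = conv(np.roll(img, shift=(2, 3), axis=(0, 1)))
b = np.roll(conv(img), shift=(2, 3), axis=(0, 1))
print(np.allclose(a, b))                                        # True: equivariant under translation

# Rotation: the two orders generally differ for a non-symmetric kernel.
print(np.allclose(conv(np.rot90(img)), np.rot90(conv(img))))    # False (in general)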
53. Invariance
[Figure: two images, X1 and X2 = T_g^1 X1, are mapped by the same function Φ(·) to a single feature Z.]
X2 = T_g^1 X1
Z = Z1 = Φ(X1) = Z2 = Φ(X2) = Φ(T_g^1 X1)
: the mapping is independent of the transformation T_g, for all T_g
54. To make a convolutional neural network (CNN) transformation-invariant, data augmentation of the training samples is generally used
Invariance
55. Equivariance
[Figure: a transformation T_g^1 in image space corresponds to a transformation T_g^2 in feature space under the mapping Φ(·).]
X2 = T_g^1 X1
Z2 = T_g^2 Z1
Z2 = T_g^2 Z1 = T_g^2 Φ(X1) = Φ(T_g^1 X1)
: the mapping preserves the algebraic structure of the transformation
Z1 ≠ Z2, but the relationship between them is kept
: invariance is the special case of equivariance where T_g^2 is the identity
56. Equivariance: Group ConvNet
To understand the rotation or proportion change of a given entity, a group of filters (a combination of rotated and mirror-reflected versions of a filter) is adopted.
For example, the group p4 contains translations and rotations by multiples of ninety degrees, while p4m additionally contains mirror reflections.
[Figure: rotated versions of a filter; mirror-reflected versions of a filter]
57. A filter in a G-CNN detects co-occurrences of features that have the preferred relative pose, and can match such a feature constellation in every global pose through an operation called the G-convolution.
Equivariance: Group ConvNet
[Figure: filter group 1, filter group 2, ..., filter group N]
58. [Figure: visualization of a classic 2-D convolution vs the G-convolution for the roto-translation group]
G-Convolution
Equivariance: Group ConvNet
60. Equivariance: Group ConvNet
Latent representations learned by a CNN and a G-CNN:
- The left part is the result of a typical CNN, while the right one is that of a G-CNN.
- In both parts, the outer cycle consists of rotated images, while the inner cycle consists of the learned representations.
- The features produced by a G-CNN are equivariant to rotation, while those produced by a typical CNN are not.
61. What we need: EQUIVARIANCE (not invariance)
"Equivariance lets a CNN understand the rotation or proportion change of an entity"
Equivariance: Capsule Net
62. “A capsule is a group of neurons whose activity vector represents
the instantiation parameters of a specific type of entity such as an
object or an object part.”
Equivariance: Capsule Net
63. Equivariance of Capsules
“A capsule is a group of neurons whose activity vector represents the
instantiation parameters of a specific type of entity such as an object or
an object part.”
[Figure: activity vector map and the corresponding object]
Equivariance: Capsule Net