Recurrent Neural Networks: Theory and Explanation

Recurrent Neural Networks
2019. 3
김홍배

Outline
1. Sequence modeling
2. Feed-forward networks review
3. Vanilla RNN
4. Vanishing gradient
5. Gating methodology
6. Use cases

Sequence modeling
➢ Language Applications
• Language Modeling (probability)
• Machine Translation
• Speech Recognition

➢ Energy signal (Price)
[Figure: price signal over time up to the current time, together with external signals (e.g. weather, load, generation)]

Feed-forward networks review

➢ Where is the memory?
Given a sequence of samples, predict sample x[t+1] from the previous values {x[t], x[t-1], x[t-2], …, x[t-τ]}.

Feed-forward approach:
• static window of size L
• slide the window one time step at a time
[Figure: a window of length L over the series, used to predict x[t+1]]
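The static-window idea can be sketched in a few lines. This is a minimal illustration, not the deck's own code; the function name `make_windows` and the toy series are assumptions made for the example.

```python
def make_windows(series, window_size):
    """Return (inputs, targets): each input holds `window_size` consecutive
    samples and the target is the sample that immediately follows them."""
    inputs, targets = [], []
    for start in range(len(series) - window_size):
        inputs.append(series[start:start + window_size])
        targets.append(series[start + window_size])
    return inputs, targets


# Hypothetical toy series; the window slides one time step at a time.
series = [10, 11, 13, 12, 15, 17]
inputs, targets = make_windows(series, window_size=3)
for window, target in zip(inputs, targets):
    print(window, "->", target)
```

Each printed row is one training example for a feed-forward network: a window of L = 3 past values and the value x[t+1] to predict.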


➢ Problems with the FNN + static window approach
• Increasing L makes the number of parameters grow quickly.
• Decisions are independent between time steps: the network ignores what happened at the previous time step, and only the present window matters.
• It cannot handle variable sequence lengths.
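To make the parameter growth concrete, the sketch below counts weights and biases for a network with one fully connected hidden layer. The architecture and the hidden width of 64 are assumptions chosen for illustration; the deck does not specify them.

```python
def fnn_param_count(window_size, hidden_units, outputs=1):
    """Weights and biases of a one-hidden-layer fully connected network."""
    hidden_layer = window_size * hidden_units + hidden_units
    output_layer = hidden_units * outputs + outputs
    return hidden_layer + output_layer


# Hypothetical hidden width of 64; only the window size L changes.
for L in (10, 100, 1000):
    print(L, fnn_param_count(L, hidden_units=64))
```

With a fixed hidden width the count grows in proportion to L; if the hidden layer is widened along with the window, it grows faster still.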

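Vanilla RNN
➢ A Recurrent Neural Network (RNN) adds the "temporal" evolution: it allows building specific connections that capture "history".
[Figure: RNN cell with input x, hidden state h and output y; weights U (input to hidden), W (hidden to hidden) and V (hidden to output)]
$$y_t = \mathrm{softmax}(V h_t)$$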
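A single recurrent step can be sketched as follows, combining the hidden-state update $h_t = f(U x_t + W h_{t-1})$ used later in the deck with $y_t = \mathrm{softmax}(V h_t)$. The choice of tanh for f, the helper names and the toy weights are assumptions for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(x_t, h_prev, U, W, V):
    """h_t = tanh(U x_t + W h_{t-1}); y_t = softmax(V h_t)."""
    pre = [a + b for a, b in zip(matvec(U, x_t), matvec(W, h_prev))]
    h_t = [math.tanh(p) for p in pre]
    y_t = softmax(matvec(V, h_t))
    return h_t, y_t


# Hypothetical toy sizes: 2 inputs, 2 hidden units, 3 output classes.
U = [[0.5, -0.3], [0.1, 0.8]]
W = [[0.2, 0.0], [-0.4, 0.3]]
V = [[1.0, -1.0], [0.5, 0.5], [-0.7, 0.2]]

h = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h, y = rnn_step(x, h, U, W, V)  # the same U, W, V are reused at every step
    print([round(p, 3) for p in y])
```

Because the loop carries h from one step to the next, the same code works for any sequence length, which is what the static window could not do.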

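➢ RNN: parameters
[Figure: the RNN cell with its three weight matrices: U (input x to hidden h), W (hidden to hidden) and V (hidden h to output y)]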

➢ RNN: unfolding
➢ Beware: unfolding adds extra depth. Every time step is an extra level of depth, like a deeper stack of layers in a feed-forward network.

➢ RNN: depth 1
Forward propagation in space

➢ RNN: depth 2
Forward propagation in time

➢ Training an RNN: BPTT
➢ Backpropagation through time (BPTT): the training algorithm that updates the network weights to minimize the error, with the error propagated back through the time steps as well.
➢ Cross-entropy loss

Our goal is to calculate the gradients of the error with respect to the parameters U, W and V, and then learn good parameters using stochastic gradient descent. Just as we sum up the errors, we also sum up the gradients at each time step for one training example:
$$\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}$$

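➢ Training an RNN: BPTT, using the computation of $E_3$ as an example
$$\frac{\partial E_3}{\partial W} = \frac{\partial E_3}{\partial y_3}\frac{\partial y_3}{\partial h_3}\frac{\partial h_3}{\partial W}$$
Each hidden state depends on the previous one:
$$h_3 = f(U x_3 + W h_2), \quad h_2 = f(U x_2 + W h_1), \quad h_1 = f(U x_1 + W h_0)$$
so W reaches $h_3$ through every earlier step, and the chain rule sums over all of them:
$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial y_3}\frac{\partial y_3}{\partial h_3}\frac{\partial h_3}{\partial h_k}\frac{\partial h_k}{\partial W}$$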
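The sum over k can be checked numerically on a scalar RNN. The sketch below makes several simplifying assumptions that are not in the slides: the weights u and w are scalars, f is tanh, and the error is a squared error on $h_3$ rather than a softmax with cross-entropy. The term for k = 0 drops out because the initial state $h_0$ does not depend on w.

```python
import math


def forward(u, w, xs, h0=0.0):
    """Return [h_0, h_1, ..., h_T] for h_t = tanh(u*x_t + w*h_{t-1})."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(u * x + w * hs[-1]))
    return hs


def error(u, w, xs, target):
    """Assumed error for the example: 0.5 * (h_T - target)^2."""
    return 0.5 * (forward(u, w, xs)[-1] - target) ** 2


def bptt_grad_w(u, w, xs, target):
    hs = forward(u, w, xs)
    T = len(xs)
    dE_dhT = hs[T] - target
    grad = 0.0
    for k in range(1, T + 1):
        # dh_T/dh_k: product of dh_j/dh_{j-1} for j = k+1 .. T
        dhT_dhk = 1.0
        for j in range(k + 1, T + 1):
            dhT_dhk *= (1.0 - hs[j] ** 2) * w
        # immediate partial of h_k with respect to w, holding h_{k-1} fixed
        dhk_dw = (1.0 - hs[k] ** 2) * hs[k - 1]
        grad += dE_dhT * dhT_dhk * dhk_dw
    return grad


# Hypothetical toy values.
u, w, xs, target = 0.7, 0.9, [1.0, -0.5, 0.25], 0.3
analytic = bptt_grad_w(u, w, xs, target)
eps = 1e-6
numeric = (error(u, w + eps, xs, target) - error(u, w - eps, xs, target)) / (2 * eps)
print(analytic, numeric)
```

The two printed numbers should agree closely, which confirms that summing the contributions of every earlier time step gives the full gradient.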

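➢ Vanishing gradient
➢ During training, gradients explode or vanish easily because of the depth in time.
$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial y_3}\frac{\partial y_3}{\partial h_3}\frac{\partial h_3}{\partial h_k}\frac{\partial h_k}{\partial W}$$
The factor $\partial h_3 / \partial h_k$ is itself a chain of derivatives, for example
$$\frac{\partial h_3}{\partial h_1} = \frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}$$
so in general
$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial y_3}\frac{\partial y_3}{\partial h_3}\left(\prod_{j=k+1}^{3}\frac{\partial h_j}{\partial h_{j-1}}\right)\frac{\partial h_k}{\partial W}$$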
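The effect of the product over j can be seen on a scalar tanh RNN, where each factor is $\partial h_j / \partial h_{j-1} = w\,(1 - h_j^2)$. This is a sketch under assumed toy values (w = 0.5, constant input, 30 time steps), not a result from the deck.

```python
import math


def jacobian_products(w, u, xs, h0=0.0):
    """Return |dh_T/dh_k| for k = 1..T in a scalar RNN h_t = tanh(u*x_t + w*h_{t-1})."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(u * x + w * hs[-1]))
    T = len(xs)
    products = []
    for k in range(1, T + 1):
        prod = 1.0
        for j in range(k + 1, T + 1):
            prod *= w * (1.0 - hs[j] ** 2)  # dh_j/dh_{j-1}
        products.append(abs(prod))
    return products


# Hypothetical setup: 30 time steps with a constant input.
xs = [0.5] * 30
small = jacobian_products(w=0.5, u=1.0, xs=xs)
print("dh_30/dh_29:", small[-2])
print("dh_30/dh_1 :", small[0])
```

The contribution of the first time step to the gradient at step 30 is many orders of magnitude smaller than that of step 29, so early inputs barely influence the weight update. With factors larger than 1 in magnitude the same product grows instead, which is the exploding case.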

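[Figure: tanh and its derivative. Source: http://nn.readthedocs.org/en/rtd/transfer/]
The derivative of tanh is at most 1 and close to 0 when the unit saturates, so each factor $\partial h_j / \partial h_{j-1}$ tends to shrink the product in
$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial y_3}\frac{\partial y_3}{\partial h_3}\left(\prod_{j=k+1}^{3}\frac{\partial h_j}{\partial h_{j-1}}\right)\frac{\partial h_k}{\partial W}$$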

➢ Standard solutions
• Proper initialization of the weight matrix
• Regularization of outputs, or dropout
• ReLU activations, whose derivative is either 0 or 1

Gating method
➢ Standard RNN

Long Short-Term Memory (LSTM)

1. Change the way past information is kept: create the notion of a cell state, a memory unit that keeps long-term information more safely by protecting it from the recursive operations.
2. Make every RNN unit able to decide whether the information at the current time step matters, and to accept or discard it (optimized reading mechanism).
3. Make every RNN unit able to forget whatever is no longer useful by clearing that information from the cell state (optimized clearing mechanism).
4. Make every RNN unit able to output its decisions whenever it is ready to do so (optimized output mechanism).

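• Uses an internal memory (cell state, or data)
• Uses the current-step input (the input and the previous step's output) to decide:
- how much of the internal memory to add or remove
- whether to store the current input in the internal memory
- what output value to produce from the internal memory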

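➢ Mathematical differences between RNN and LSTM
[Figure: RNN and LSTM cells on a grid of depth (layers l-1, l) and time (steps t-1, t); the cell at layer l and time t receives $h_{t-1}^{l}$ from the previous time step and $h_t^{l-1}$ from the layer below, and produces $h_t^{l}$]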

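➢ Each LSTM cell looks as follows and is made up of several gates.
[Figure: two LSTM cells, at time t and at time t+1, each with gates f, i, g and o, element-wise multiplications (x), an addition (+) and a tanh. Inputs: $x_t$ (the input, or the output of the layer below), $c_{t-1}$ (cell data from the previous step t-1) and $h_{t-1}$ (output from the previous step t-1). Outputs: $h_t$ and the cell state $c_t$, both passed on to the cell at time t+1.]
Cell state: valuable information worth keeping long term.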

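➢ Understanding the LSTM gate functions
Sigmoid:
- The sigmoid output lies between 0 and 1
- It sets the relative importance of cell-state values and of input/output values
- "0" means the value is not needed and is discarded; "1" means it is important and is kept
Hyperbolic tangent:
- The hyperbolic tangent output lies between -1 and 1
- It normalizes the cell state and the input/output values
- It can therefore be ignored when first trying to understand the LSTM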

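Forget Gate
➢ Controls whether past sequence data is used or not
$$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1})$$
- The output of the sigmoid function lies between 0 and 1
➢ If $f_t$ is "1", the previous state value is kept
➢ If $f_t$ is "0", the previous state value is erased
- $W_{fx}$ and $W_{fh}$ are the variables being learned
- Inputs: $x_t$ (the input, or the output of the layer below), $c_{t-1}$ (cell data from the previous step t-1) and $h_{t-1}$ (output from the previous step t-1)
- ⨀ : element-wise multiplication
[Figure: the forget gate f multiplies the cell state $c_{t-1}$ element-wise]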

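Input Gate
➢ Controls whether the input data is used or not
$$g_t = \tanh(W_{gx} x_t + W_{gh} h_{t-1})$$
$$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1})$$
$$y_t = g_t ⨀ i_t$$
- $g_t$ is the output of the hyperbolic tangent function, so it lies between -1 and 1
➢ Normalization of the input data
- $i_t$ is the output of the sigmoid function, so it lies between 0 and 1
- $W_{gx}$, $W_{gh}$, $W_{ix}$ and $W_{ih}$ are the variables being learned
[Figure: $y_t$ is added (+) to the forget-gated cell state to give $c_t$, the cell data at the current step t]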

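Output Gate
➢ Controls whether the output data is used or not
$$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1})$$
$$h_t = o_t ⨀ \tanh(c_t)$$
- $W_{ox}$ and $W_{oh}$ are the variables being learned
[Figure: the cell state $c_t$ passes through tanh and is multiplied element-wise by the output gate o to give $h_t$, the output at the current step t]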
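The three gates can be put together into one LSTM step. This is a minimal sketch, not the deck's implementation: biases are omitted as in the formulas above, the cell update $c_t = f_t ⨀ c_{t-1} + i_t ⨀ g_t$ is read off the diagram (it is the standard LSTM update), and the parameter dictionary and toy weights are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, W_h, x, h):
    """Return W_x x + W_h h for small list-of-rows matrices."""
    return [sum(a * b for a, b in zip(rx, x)) + sum(a * b for a, b in zip(rh, h))
            for rx, rh in zip(W_x, W_h)]


def lstm_step(x_t, h_prev, c_prev, p):
    f = [sigmoid(z) for z in affine(p["Wfx"], p["Wfh"], x_t, h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(p["Wix"], p["Wih"], x_t, h_prev)]    # input gate
    g = [math.tanh(z) for z in affine(p["Wgx"], p["Wgh"], x_t, h_prev)]  # candidate input
    o = [sigmoid(z) for z in affine(p["Wox"], p["Woh"], x_t, h_prev)]    # output gate
    # Cell update assumed from the diagram: c_t = f * c_{t-1} + i * g
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t


# Hypothetical toy weights for n = 2 units and 2 inputs.
eye = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
params = {"Wfx": eye, "Wfh": zero, "Wix": eye, "Wih": zero,
          "Wgx": eye, "Wgh": eye, "Wox": zero, "Woh": zero}

h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0, -1.0], [0.5, 0.5]):
    h, c = lstm_step(x, h, c, params)
    print([round(v, 3) for v in h], [round(v, 3) for v in c])
```

Note that the cell state c is updated only by an element-wise multiplication and an addition, which is what protects long-term information from the repeated squashing that causes vanishing gradients in the vanilla RNN.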

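➢ Simplified into matrix and vector form
$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigmoid} \\ \mathrm{sigmoid} \\ \mathrm{sigmoid} \\ \tanh \end{pmatrix} \left( \begin{pmatrix} W_{ix} & W_{ih} \\ W_{fx} & W_{fh} \\ W_{ox} & W_{oh} \\ W_{gx} & W_{gh} \end{pmatrix} \begin{pmatrix} x_t \\ h_{t-1}^{l} \end{pmatrix} \right)$$
- i, f, o and g are each n x 1, so the stacked result has 4n entries
- The LSTM weight matrix, which is to be identified by training, is 4n x 2n
- The stacked input has 2n entries: $x_t$ (the input vector, or the output of the layer below) and $h_{t-1}^{l}$ (the output vector from the previous step t-1)
[Figure: the cell state $c_{t-1}$ is multiplied (x) by f, the product of i and g is added (+) to give $c_t$, and $c_t$ passes through tanh and is multiplied (x) by o to give $h_t$]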

Design Patterns for RNN
RNN Sequences
Source: A. Karpathy, blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" (2015)

Task                  Input                 Output
Image classification  fixed-sized image     fixed-sized class
Image captioning      image input           sentence of words
Sentiment analysis    sentence              positive or negative sentiment
Machine translation   sentence in English   sentence in French
Video classification  video sequence        label for each frame

RNN Implementation using TensorFlow
How do we design an RNN model for time series prediction?
➢ How do we arrange our time series data as input to the RNN?

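LAB-5) Connect input and recurrent layers
import tensorflow as tf  # assumes the TensorFlow 1.x API
# Build one cell object per layer; [rnn_cell] * depth would reuse a single cell.
cells = [tf.nn.rnn_cell.BasicLSTMCell(num_units) for _ in range(depth)]
stacked_lstm = tf.nn.rnn_cell.MultiRNNCell(cells)
# Assumes x_data is [batch_size, time_steps, input_dim]; split along the time axis.
x_split = tf.unstack(x_data, num=time_steps, axis=1)
output, state = tf.nn.static_rnn(stacked_lstm, x_split, dtype=tf.float32)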
[Figure: the inputs $x_{t-9}, x_{t-8}, x_{t-7}, \dots, x_t$ feed a stack of LSTM layers unrolled over time, producing the outputs $o_{t-9}, o_{t-8}, o_{t-7}, \dots, o_t$]

Long Short-Term Memory Network for Remaining Useful Life Estimation
Deep LSTM model for RUL estimation
Data: NASA C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) data set (Turbofan Engine Degradation Simulation Data Set)


Electricity Price Forecasting (EPF)
[Figure: energy signal (price) over time up to the current time, together with external signals (e.g. weather, load, generation)]

Experiment results
[Figure: predicted vs. test price (euro/MWh) by hour for the LSTM + DNN + LinearRegression model]

Experiment results

Model                       Mean Absolute Error (euro/MWh)
LinearRegression            4.04
RidgeRegression             4.04
LassoRegression             3.73
ElasticNet                  3.57
LeastAngleRegression        6.27
LSTM+DNN+LinearRegression   2.13
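For reference, the metric in the table can be computed as below. This is a generic sketch of mean absolute error; the price values are hypothetical and are not taken from the experiment.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all samples."""
    if len(actual) != len(predicted):
        raise ValueError("series must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical hourly prices in euro/MWh, not data from the experiment.
test_prices = [30.0, 32.5, 41.0, 38.0]
predicted_prices = [31.0, 30.5, 43.0, 38.0]
print(mean_absolute_error(test_prices, predicted_prices))
```

The result is in the same unit as the prices, so an error of 2.13 in the table means the forecast is off by about 2.13 euro/MWh on average.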

Show and Tell: A Neural Image Caption Generator
References
1. O. Vinyals, A. Toshev, S. Bengio, D. Erhan, "Show and Tell: A Neural Image Caption Generator"
2. 皆川卓也, presentation material, CV勉強会@関東 "CVPR2015読み会"
3. Andrej Karpathy, lecture note "Recurrent Neural Networks", CS231n
2017.
김홍배
Korea Aerospace Research Institute

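Overview
➢ Generates a description (caption) from a single still photo
➢ Feeds the image feature vector produced by a deep Convolutional Neural Network into a Recurrent Neural Network (RNN) of the kind used for machine translation
➢ Neural Image Caption (NIC)
➢ Accuracy far above that of previous methods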

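Neural Image Caption (NIC)
➢ Given a photo (I) as input,
➢ training maximizes the likelihood of producing the correct caption S,
➢ using the training data (I, S),
➢ by finding the network parameters (w).
$$w^* = \arg\max_w \sum_{(I,S)} \log p(S \mid I; w)$$
- S: caption; I: photo; w: parameters
- $p(S \mid I; w)$: probability; its negative logarithm is the loss function for one example
- The sum runs over the entire training data set and gives the loss function for the whole set
- Training is the task of finding the parameters $w^*$ that minimize this loss function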

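➢ Generating a caption from a photo
$$p(S \mid I; w) = \prod_{t=0}^{N} p(S_t \mid I, S_0, S_1, \dots, S_{t-1}; w)$$
- N: number of words
- Each word depends on the sequence of words before it.
- $S = \{S_0, S_1, \dots\}$: the words; the caption S is therefore sequence data of variable length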

➢ Generating a caption from a photo
$$p(S \mid I; w) = \prod_{t=0}^{N} p(S_t \mid I, S_0, S_1, \dots, S_{t-1}; w)$$
- w: the parameters found by training on the training data set (I, S)
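Taking the logarithm turns the product over words into a sum, which is the form used for training. The sketch below shows only this bookkeeping; the per-word probabilities are hypothetical stand-ins for what a trained model would assign to the correct words.

```python
import math


def caption_log_likelihood(step_probs):
    """log p(S|I;w) = sum over t of log p(S_t | I, S_0, ..., S_{t-1}; w)."""
    return sum(math.log(p) for p in step_probs)


# Hypothetical probabilities assigned to the correct word at each step.
step_probs = [0.9, 0.5, 0.8, 0.95]
log_p = caption_log_likelihood(step_probs)
loss = -log_p  # negative log-likelihood, the quantity minimized in training
print(log_p, math.exp(log_p), loss)
```

Exponentiating the sum recovers the product of the per-word probabilities, and a single badly predicted word (here 0.5) dominates the loss.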

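➢ Basic structure of the LSTM-based sentence generator
[Figure: an LSTM cell at step t]
- Word at t: $S_t$
- Input at t: $x_t = W_e S_t$ (the word embedding step)
- Output at t-1: $h_{t-1}$; output at t: $h_t$
- $p_{t+1}(S_{t+1}) = \mathrm{softmax}(h_t)$: computes the probability distribution over words
- $\log p(S_{t+1})$: used to compute the loss function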

➢ Word Embedding
Words are usually represented as "one hot" vectors, but the size of the dictionary of words tends to change, which makes modeling with the LSTM difficult. A step is therefore needed that converts the variable-size "one hot" vector into a vector of fixed length.

One hot vector representation:
dog  0 0 1 0 0 0 0 0 0 0
cat  0 0 0 0 0 0 1 0 0 0

Word embedding vector representation, $x_t = W_e S_t$:
dog  0.1 0.3 0.2 0.1 0.2 0.3
cat  0.2 0.1 0.2 0.2 0.1 0.1
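Multiplying a one-hot vector by the embedding matrix simply selects one row (or column) of it. The sketch below illustrates this with a hypothetical 4-word dictionary and a 3-dimensional embedding; here $W_e$ is stored with one row per word, which is a layout choice made for the example.

```python
def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]


def embed(one_hot_vector, W_e):
    """x_t = W_e S_t, with W_e stored as one row per dictionary word."""
    dim = len(W_e[0])
    return [sum(s * row[d] for s, row in zip(one_hot_vector, W_e))
            for d in range(dim)]


# Hypothetical dictionary and embedding values.
vocab = ["a", "the", "dog", "cat"]
W_e = [
    [0.3, 0.1, 0.0],  # a
    [0.0, 0.2, 0.4],  # the
    [0.1, 0.3, 0.2],  # dog
    [0.2, 0.1, 0.2],  # cat
]

S_t = one_hot(vocab.index("dog"), len(vocab))
x_t = embed(S_t, W_e)
print(S_t, "->", x_t)
```

In practice the multiplication is replaced by a direct row lookup, and the entries of $W_e$ are learned together with the LSTM weights.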

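➢ Loss function
The loss is defined as the cross entropy:
$$J(w) = -\sum_i y\_i \log y_i$$
- y: the probability estimated by the classifier
- y_: the correct answer
For the class with $y\_i = 1$ this reduces to $J(w) = -\log y_i$.
[Figure: J(w) plotted against $y_i$ on the interval up to 1; as $y_i$ approaches 1, J(w) goes to 0]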
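The behaviour of the loss can be checked directly. This is a minimal sketch with hypothetical probabilities; `y_true` plays the role of y_ (the correct answer) and `y_pred` the role of y (the classifier's estimate).

```python
import math


def cross_entropy(y_true, y_pred):
    """J = -sum_i y_true[i] * log(y_pred[i]); zero-weight terms are skipped."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


# The correct class is index 1; vary how confident the classifier is in it.
y_true = [0.0, 1.0, 0.0]
for confident in (0.1, 0.5, 0.9, 0.999):
    rest = (1.0 - confident) / 2
    print(confident, round(cross_entropy(y_true, [rest, confident, rest]), 4))
```

The printed loss falls towards 0 as the probability assigned to the correct class approaches 1, matching the curve described on the slide.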

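[Figure: the feature vector of the photo is taken from the deep CNN and becomes the first input to the LSTM ($x_{-1}$)]

[Figure: the word $S_0$ is fed in, and the output is the probability that the next word is $S_1$]

[Figure: the LSTM states $h_0$, $c_0$]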

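NIC training process
[Figure labels: pretraining with ImageNet + dropout; parameters initialized randomly]
➢ Training set of photos and captions
[Figure: the training data are fed forward to produce predicted probabilities, from which the loss function is computed]
[Figure: the error from the loss function is backpropagated through the network]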

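Prediction with NIC (sampling)
1. A photo is given, and its feature vector is taken from the deep CNN.
2. The special start word is fed in, and the most probable word $S_1$ is selected.
3. The selected word $S_1$ is fed in as the next input; this continues until the end-of-sentence token appears.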
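The sampling loop can be sketched as a greedy decoder. Everything model-related here is a stand-in: the table `NEXT_WORD_PROBS` is a hypothetical substitute for the CNN + LSTM, and it conditions only on the previous word, whereas the real model also carries the image features and the full history in its LSTM state.

```python
# Hypothetical next-word probabilities standing in for the trained model.
NEXT_WORD_PROBS = {
    "<start>": {"a": 0.7, "dog": 0.2, "<end>": 0.1},
    "a": {"dog": 0.6, "cat": 0.3, "<end>": 0.1},
    "dog": {"runs": 0.5, "<end>": 0.4, "a": 0.1},
    "cat": {"runs": 0.3, "<end>": 0.7},
    "runs": {"<end>": 0.9, "a": 0.1},
}


def greedy_caption(next_word_probs, start="<start>", end="<end>", max_len=20):
    """Pick the most probable next word until the end token appears."""
    words, current = [], start
    for _ in range(max_len):  # max_len guards against never reaching <end>
        probs = next_word_probs[current]
        current = max(probs, key=probs.get)
        if current == end:
            break
        words.append(current)
    return words


print(" ".join(greedy_caption(NEXT_WORD_PROBS)))
```

Choosing the single most probable word at each step is the simplest strategy; keeping several candidate sentences at once (beam search) is a common alternative.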
