26. Papers I have implemented so far
• Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
• Deep Visual Analogy-Making
• End-To-End Memory Networks
• Neural Variational Inference for Text Processing
• Character-Aware Neural Language Models
• Teaching Machines to Read and Comprehend
• Playing Atari with Deep Reinforcement Learning
• Human-level control through deep reinforcement learning
• Deep Reinforcement Learning with Double Q-learning
• Dueling Network Architectures for Deep Reinforcement Learning
• Language Understanding for Text-based Games Using Deep Reinforcement Learning
• Asynchronous Methods for Deep Reinforcement Learning
• Neural Turing Machines
Categories: Computer Vision, Algorithm Learning, Natural Language Processing, Reinforcement Learning
27. Papers I found fun
• Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
• Deep Visual Analogy-Making
• End-To-End Memory Networks # simple, easy, and the results are fun
• Neural Variational Inference for Text Processing
• Character-Aware Neural Language Models
• Teaching Machines to Read and Comprehend
• Playing Atari with Deep Reinforcement Learning # a good introduction to RL
• Human-level control through deep reinforcement learning
• Deep Reinforcement Learning with Double Q-learning
• Dueling Network Architectures for Deep Reinforcement Learning
• Language Understanding for Text-based Games Using Deep Reinforcement Learning
• Asynchronous Methods for Deep Reinforcement Learning # challenging to implement
• Neural Turing Machines # the model is interesting
36. Recurrent attention model
with external memory
Prerequisites
Recurrent Neural Network
Soft and hard attention in Neural Network
Internal and external memory access in Neural Network
37. Input: context sentences and a question
x_1 = Mary journeyed to the den.
x_2 = Mary went back to the kitchen.
x_3 = John journeyed to the bedroom.
x_4 = Mary discarded the milk.
Q: Where was the milk before the den?
39. A sentence: a sequence of words
x_i = (x_i1, x_i2, …, x_in)
x_i = Mary journeyed to the den.
x_i1 = mary
x_i2 = journeyed
x_i3 = to
x_i4 = the
x_i5 = den
40. A word: represented as a Bag-of-Words (BoW) vector
x_i = (x_i1, x_i2, …, x_in)
x_i = Mary journeyed to the den.
x_i1 = mary
Bag-of-Words (BoW): x_i1 = [1, 0, 0, 0, …, 0]^T (a one-hot vector with a 1 in the position for "mary")
41. A sentence: a set of BoW vectors
x_i = Mary journeyed to the den.
x_i = (x_i1, x_i2, …, x_in)
x_i1 = mary      → [1, 0, 0, 0, …, 0]^T
x_i2 = journeyed → [0, 1, 0, 0, …, 0]^T
x_i3 = to        → [0, 0, 1, 0, …, 0]^T
x_i4 = the       → [0, 0, 0, 1, …, 0]^T
x_i5 = den
42. Input: context sentences and the question, represented as BoW vectors
x_1 = Mary journeyed to the den.
x_2 = Mary went back to the kitchen.
x_3 = John journeyed to the bedroom.
x_4 = Mary discarded the milk.
Q: Where was the milk before the den?
Every context sentence and the question become sets of one-hot BoW vectors.
43. Memory: built from one sentence
m_i = Σ_j A x_ij
A: embedding matrix
m_1 = A x_11 + A x_12 + A x_13 + A x_14
      (mary)   (journeyed)   (to)   (the)
45. Memory: several memories are used for one task
m_2, m_4, m_5
x_2 = Mary journeyed to the den.      → m_2
x_4 = Mary went back to the kitchen.  → m_4
x_5 = John journeyed to the bedroom.  → m_5
46. Memory: only what is needed is used as input
Input memory
In fact, the real substance of the memory is the embedding matrix A, and it is A that is gradually learned.
m_i = Σ_j A x_ij
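To make the equation concrete, here is a minimal TensorFlow sketch of m_i = Σ_j A x_ij (not the deck's actual code; vocab_size, embed_dim, and sentences are illustrative names, and padding is ignored for brevity): each word id is looked up in the embedding matrix A and the word embeddings are summed per sentence.

import tensorflow as tf

vocab_size, embed_dim = 177, 20   # illustrative sizes
# A: the embedding matrix, i.e. the real substance of the memory
A = tf.get_variable('A', [vocab_size, embed_dim])
# sentences: [num_sentences, max_words] word ids of the context sentences
sentences = tf.placeholder(tf.int32, [None, None], name='sentences')

# m_i = Σ_j A x_ij : look up each word's embedding and sum over the words of the sentence
word_embeds = tf.nn.embedding_lookup(A, sentences)   # [num_sentences, max_words, embed_dim]
m = tf.reduce_sum(word_embeds, axis=1)                # [num_sentences, embed_dim]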
70. # define the optimizer
opt = tf.GradientDescentOptimizer(learning_rate=0.1)
# compute the gradients of the loss w.r.t. the given variables
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of (gradient, variable) tuples.
# now modify the gradients however you like and apply them with apply_gradients()
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
optim = opt.apply_gradients(capped_grads_and_vars)
Tip: how do you define a custom gradient in TensorFlow?
71. Optimization: apply gradient clipping
params = [A, B, C, W]
grads_and_vars = opt.compute_gradients(loss, params)
clipped_grads_and_vars = [(tf.clip_by_norm(gv[0], 50), gv[1])
                          for gv in grads_and_vars]
optim = opt.apply_gradients(clipped_grads_and_vars)
72. Train: just run the training loop and you're done
for idx in xrange(N):
    sess.run(optim, feed_dict={input: x,
                               target: target,
                               context: context})
73. Resources
• Paper “End-To-End Memory Networks”
• The authors’ code: https://github.com/facebook/MemNN
• If you are more interested in Question Answering
• Paper “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing” (Dynamic Memory Networks)
• Paper “Teaching machines to read and comprehend” (Attentive / Impatient Reader)
74. QA dataset: The bAbI tasks
> Can an AI infer information from a set of sentences?
https://research.facebook.com/research/babi
75. QA dataset: The Children’s Book Test
> Can an AI understand “Alice in Wonderland”?
https://research.facebook.com/research/babi/#cbt
79. Asynchronous Methods for
Deep Reinforcement Learning (A3C)
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Kavukcuoglu, K. (ICML 2016)
https://github.com/carpedm20/deep-rl-tensorflow
89. Definitions: State, Value
Demystifying Deep Reinforcement Learning
State: the state of the game (the screen, the ball’s position, the opponent’s position, …)
Value: the value (return) of taking action a_t in state s_t:
not just the immediate score r_{t+1} received after doing a_t,
but the expected value of all future rewards received from s_t, s_{t+1}, s_{t+2}, …, that is, r_{t+1} + r_{t+2} + r_{t+3} + ⋯
90. Definitions: State, Value
Demystifying Deep Reinforcement Learning
State: the state of the game (the screen, the ball’s position, the opponent’s position, …)
Value: the value (return) of taking action a_t in state s_t
Value matters because it lets us decide which action to take,
for example with an epsilon-greedy policy.
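For instance, an epsilon-greedy policy exploits the current value estimates most of the time and explores randomly with probability ε (a minimal sketch; q_values and epsilon are illustrative names, not from the deck):

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: Q(s_t, a) for every action a
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit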
91. Definitions: State, Value
Demystifying Deep Reinforcement Learning
State: the state of the game (the screen, the ball’s position, the opponent’s position, …)
Value: the value (return) of taking action a_t in state s_t
The simplest approach: record the values in a table and update them!
(Dynamic programming, value iteration, …)
State    Q(s_t, up)    Q(s_t, down)
s_1         1.2           -1
s_2        -1              0
s_3         2.3            1.2
s_4         0              0
…           …              …
s_n         0.2           -0.1
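In code, such a table is nothing more than a mapping from (state, action) to a value; a tiny sketch using the (hypothetical) numbers above:

# Q[state][action], a few entries from the table above
Q = {
    's1': {'up': 1.2, 'down': -1.0},
    's2': {'up': -1.0, 'down': 0.0},
    's3': {'up': 2.3, 'down': 1.2},
}

def best_action(state):
    # act greedily: pick the action with the largest recorded value
    return max(Q[state], key=Q[state].get)

print(best_action('s3'))  # 'up'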
92. Definitions: State, Value
Demystifying Deep Reinforcement Learning
State: the state of the game (the screen, the ball’s position, the opponent’s position, …)
Value: the value (return) of taking action a_t in state s_t
But when there are too many states, computing every table entry one by one becomes infeasible.
(The same Q-table as on the previous slide.)
93. Model: Deep Q (quality)-Network
Demystifying Deep Reinforcement Learning
Let’s learn it with a deep learning model!
Approximating Q(s_t) with a function:
value function approximation
Q(s_t, up)   Q(s_t, down)   Q(s_t, …)
94. Input: 4 consecutive frames
Human-level control through deep reinforcement learning
Stacking frames gives the network context,
for example whether the ball is moving up or down.
s_t
Q(s_t, a_1)   Q(s_t, a_2)   Q(s_t, a_3)
95. Output: the value of each action (Q-value)
Human-level control through deep reinforcement learning
Q(s_t): a vector whose size is the number of actions
Q(s_t, a_1)   Q(s_t, a_2)   Q(s_t, a_3)
96. Output: the value of each action (Q-value)
Human-level control through deep reinforcement learning
s_t
Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3) are each a scalar value,
e.g. -0.9, 0.8, -0.9
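Putting input and output together, a minimal Q-network sketch using the TF 1.x layers API (the layer sizes and names are illustrative, not the exact architecture from the paper): four stacked frames go in, one scalar Q-value per action comes out.

import tensorflow as tf

n_actions = 3
# s_t: 4 consecutive 84x84 grayscale frames stacked along the channel axis
s_t = tf.placeholder(tf.float32, [None, 84, 84, 4], name='s_t')

h = tf.layers.conv2d(s_t, 16, 8, strides=4, activation=tf.nn.relu)
h = tf.layers.conv2d(h, 32, 4, strides=2, activation=tf.nn.relu)
h = tf.layers.dense(tf.layers.flatten(h), 256, activation=tf.nn.relu)
# Q(s_t): a vector with one entry per action, e.g. [-0.9, 0.8, -0.9]
q = tf.layers.dense(h, n_actions, name='q_values')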
106. That is, update Q(s_t) by the amount of the loss
Q(s_t) = [Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3)]
s_t → (action a) → s_{t+1}, reward r_{t+1}
Expected value at s_t:     r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯
Expected value at s_{t+1}: r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ⋯
loss = {r_{t+1} + γ Q(s_{t+1})} − Q(s_t)
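A minimal TensorFlow sketch of this 1-step loss (all tensors here are illustrative placeholders; the slide writes the target simply as r_{t+1} + γQ(s_{t+1}), which in standard Q-learning is r_{t+1} + γ max_a Q(s_{t+1}, a)):

import tensorflow as tf

n_actions, gamma = 3, 0.99
q      = tf.placeholder(tf.float32, [None, n_actions], name='q')       # Q(s_t)
q_next = tf.placeholder(tf.float32, [None, n_actions], name='q_next')  # Q(s_{t+1})
action = tf.placeholder(tf.int32, [None], name='action')
reward = tf.placeholder(tf.float32, [None], name='reward')             # r_{t+1}

# Q(s_t, a) for the action that was actually taken
q_acted = tf.reduce_sum(q * tf.one_hot(action, n_actions), axis=1)
# 1-step target: r_{t+1} + γ max_a Q(s_{t+1}, a)
target = reward + gamma * tf.reduce_max(q_next, axis=1)
loss = tf.reduce_mean(tf.square(target - q_acted))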
107–112. n-step Q-learning
See David Silver’s course & Sutton’s book for more information
Q(s_t) = [Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3)]
Instead of bootstrapping after a single step, the update can accumulate 1, 2, 3, … rewards
before plugging in Q of the last visited state:
1-step update: {r_{t+1} + γ Q(s_{t+1})} − Q(s_t)
2-step update: {r_{t+1} + γ r_{t+2} + γ² Q(s_{t+2})} − Q(s_t)
3-step update: {r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ Q(s_{t+3})} − Q(s_t)
The longer the rollout, the higher the variance and the lower the bias.
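The n-step targets above can be computed by folding the discounted rewards backwards before bootstrapping; a plain-Python sketch (the function name and variables are illustrative):

def n_step_target(rewards, bootstrap_value, gamma=0.99):
    # rewards: [r_{t+1}, ..., r_{t+n}]; bootstrap_value: Q(s_{t+n}) (or V(s_{t+n}) in actor-critic)
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# 3-step example: r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ Q(s_{t+3})
print(n_step_target([1.0, 0.0, 2.0], 0.5))  # ≈ 3.445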
129. Example: 3-step actor-critic
Groups of recent frames (contexts) are each passed through a CNN with shared parameters,
producing the states s_{t−2}, s_{t−1}, s_t, s_{t+1}.
130. Look at s_{t−2}, compute π, and choose action a
π: the actor, a probability distribution over actions
V: the critic, the value of the state
(Each context passes through the shared CNN as before.)
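A minimal sketch of these two heads on top of the shared CNN features (TF 1.x style; features, the layer sizes, and the sampling choice are illustrative, not the deck's actual code):

import tensorflow as tf

n_actions = 3
# features: output of the shared context CNN for a state (placeholder for illustration)
features = tf.placeholder(tf.float32, [None, 256], name='features')

# π: actor, a probability distribution over actions
pi = tf.layers.dense(features, n_actions, activation=tf.nn.softmax, name='pi')
# V: critic, a single scalar value of the state
v = tf.squeeze(tf.layers.dense(features, 1, name='v'), axis=1)
# one way to sample the action a from π
a = tf.squeeze(tf.multinomial(tf.log(pi), 1), axis=1)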
131. The agent performs a
and receives the reward r_{t−1} while moving to the next state s_{t−1}.
132. Look at s_{t−1}, compute π, and choose action a
The agent performs it and receives r_t, moving to s_t.
133. Look at s_t, compute π, and choose action a
The agent performs it and receives r_{t+1}, moving to s_{t+1}.
134. Initialize the gradient to 0:
dθ_v ← 0
(accum_gradient = 0)
135. Then, using r_{t+1}, V(s_t), and V(s_{t+1}), …
dθ_v ← 0
(accum_gradient = 0)
136. … add the gradient of the TD error:
dθ_v ← dθ_v + ∂(r_{t+1} + γ V(s_{t+1}) − V(s_t))² / ∂θ'_v
(Temporal Difference error; the gradient is taken w.r.t. the global network’s variables θ'_v)
assign_add() onto accum_gradient
137. Then add the next TD gradient:
dθ_v ← dθ_v + ∂(r_t + γ r_{t+1} + γ² V(s_{t+1}) − V(s_{t−1}))² / ∂θ'_v
(again w.r.t. the global network’s variables θ'_v)
assign_add() onto accum_gradient
138. And add the next TD gradient:
dθ_v ← dθ_v + ∂(r_{t−1} + γ r_t + γ² r_{t+1} + γ³ V(s_{t+1}) − V(s_{t−2}))² / ∂θ'_v
(again w.r.t. the global network’s variables θ'_v)
assign_add() onto accum_gradient
139. Finally, update the global network’s variables θ_v
with the accumulated gradient: apply_gradient()
169. from threading import Thread
global_network = make_network(name='A3C_global')
Create the global network, and
170. for worker_id in range(n_worker):
    with tf.variable_scope('thread_%d' % worker_id) as scope:
        for step in range(n_step):
            network = Network(global_network, name='A3C_%d' % (worker_id))
            networks.append(network)
            tf.get_variable_scope().reuse_variables()
        A3C_FFs[worker_id] = A3C_FF(worker_id, networks, global_network)
Create the local networks, and
171. for worker_id in range(n_worker):
    A3C_FFs[worker_id].networks[0].copy_from_global()
    # the step networks share weights, so only the network for step 0 needs to copy
Copy the global network’s θ_π and θ_v
to the local networks (shared parameters)
172. class Network(object):
    def __init__(self, sess, global_network):
        with tf.variable_scope('copy_from_target'):
            copy_ops = []
            for name in self.w.keys():  # w: dict holding this network's variables
                copy_op = self.w[name].assign(global_network.w[name])
                copy_ops.append(copy_op)
            self.global_copy_op = tf.group(*copy_ops, name='global_copy_op')
Create an op that assign()s the global network’s variables into the local network’s, and
175. for grad, var in tuple(grads_and_vars):
    # name and shape come from the corresponding variable; global_v is the matching global-network variable
    self.accum_grads[name] = tf.Variable(tf.zeros(shape), trainable=False)
    new_grads_and_vars.append((tf.clip_by_norm(
        self.accum_grads[name].ref(), self.max_grad_norm), global_v))
Create Variables that accumulate ∂loss/∂θ'_v,
and add assign_add() ops onto them
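The assign_add() mentioned above might look roughly as follows, continuing the fragment; this is a sketch under the assumption that grads_and_vars pairs each local gradient with the matching global variable (names are illustrative, not the repo's exact code):

# accumulate ∂loss/∂θ'_v into the accumulator variables
add_accum_grads = []
for grad, var in grads_and_vars:
    name = var.name.split(':')[0]
    add_accum_grads.append(self.accum_grads[name].assign_add(grad))
self.add_accum_grad = tf.group(*add_accum_grads)

# reset the accumulators to 0 before the next n-step rollout
self.reset_accum_grad = tf.group(
    *[v.assign(tf.zeros_like(v)) for v in self.accum_grads.values()])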
176. targets = [network.sampled_action for network in self.networks]
targets.extend([network.log_policy_of_sampled_action for network in self.networks])
targets.extend([network.value for network in self.networks])
targets.extend(self.add_accum_grads.values())
targets.append(self.fake_apply_gradient)
inputs = [network.s_t for network in self.networks]
inputs.extend([network.R for network in self.networks])
inputs.extend([network.true_log_policy for network in self.networks])
inputs.append(self.learning_rate_op)
self.partial_graph = self.sess.partial_run_setup(targets, inputs)
Define the partial_graph
with sess.partial_run_setup, and
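Each step can then feed and fetch only the part of the graph it needs through that handle; a sketch of one such call (screen and step are illustrative names):

# fetch this step's sampled action, feeding only this step's state
action = self.sess.partial_run(
    self.partial_graph,
    self.networks[step].sampled_action,
    feed_dict={self.networks[step].s_t: screen})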
179. def worker_func(worker_id):
    model = A3C_FFs[worker_id]
    if worker_id == 0:
        model.train_with_log(global_t,
                             saver, writer, checkpoint_dir, assign_global_t_op)
    else:
        model.train(global_t)
Define the function each thread will run, and
180. workers = []
for idx in range(config.n_worker):
    workers.append(Thread(target=worker_func, args=(idx,)))
# Execute and wait for the end of the training
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
start() and join() the threads, and you’re done.
181. Resources
• Paper “Asynchronous Methods for Deep Reinforcement Learning”
• Chainer implementation: https://github.com/muupan/async-rl
• An easy-to-read introduction to Deep Q-Networks
• http://sanghyukchun.github.io/90/
• Demystifying Deep Reinforcement Learning
• If you want to study Reinforcement Learning seriously
• David Silver’s course
• John Schulman’s CS294 course
• Sutton’s “Reinforcement Learning: An Introduction”
182. Resources
• Problems proposed by OpenAI: Requests for Research
• Improved Q-learning with continuous actions
• Multitask RL with continuous actions.
• Parallel TRPO
• Keywords worth watching
• Continuous actions: [1], [2]
• Multiagents: [3]
• Hierarchical reinforcement learning: [4]
• Better exploration: [5]
183. Unfortunately..
OpenAI’s blog post. Generative Models
Deep Convolutional Generative Adversarial Networks
https://github.com/carpedm20/DCGAN-tensorflow