Deep Robotics
Intelligent Control and Systems Laboratory
J.hyeon Park
2018-05-15
SNU AI Study
What is deep robotics?
source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s

What is deep robotics?
• Think about computer vision
source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
What is deep robotics?
• Think about computer vision

Traditional computer vision pipeline (each stage trained separately):
image → features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM) → semantic label

source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
What is deep robotics?
• Think about computer vision

Traditional computer vision (each stage trained separately):
image → features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM) → semantic label

Deep learning (end-to-end training):
image → artificial neural network → semantic label

source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
What is deep robotics?
• Deep robotics is analogous to computer vision
source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
What is deep robotics?
• Deep robotics is analogous to computer vision

Traditional robotics pipeline (each stage trained separately):
observation (e.g. image) → state estimation → modeling & prediction → planning → low-level control → controls

source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
What is deep robotics?
• Deep robotics is analogous to computer vision

Traditional robotics pipeline (each stage trained separately):
observation (e.g. image) → state estimation → modeling & prediction → planning → low-level control → controls

Deep robotics (end-to-end training):
observation → artificial neural network → controls

source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
What is deep robotics?
• Connectionist model in robotics
• Benefits?
- General-purpose algorithm
- Combines perception and control
- Acquires complex skills with general-purpose representations
source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
Preliminary 1
• MDP set-up: $\langle s_t, a_t, s_{t+1}, r_t \rangle$
  $s_t \in S$, state
  $a_t \in A$, action
  $P_{ss'} = P(s_{t+1} \mid s_t, a_t)$, transition probability
  $r_t(s_t, a_t)$, reward
• Control objective: maximize $R = \sum_t r_t$ (or $\sum_t \gamma^t r_t$)
  $R$: accumulated reward
  $\gamma$: discount factor
Preliminary 2
• Reinforcement learning set-up
  $\pi(s_t)$ or $\pi(a_t \mid s_t)$, policy
  $V^\pi(s_t)$, value function
  $Q^\pi(s_t, a_t)$, Q-function
  $V^\pi(s_t) \equiv E_{s_{i>t} \sim P_{ss'},\, a_i \sim \pi}\left[ R_t \right]$
  $Q^\pi(s_t, a_t) \equiv E_{s_{i>t} \sim P_{ss'},\, a_{i>t} \sim \pi}\left[ R_t \right]$
  $Q^\pi(s_t, a_t) = E\left[ r(s_t, a_t) + \gamma E_{a_{t+1} \sim \pi}\left[ Q^\pi(s_{t+1}, a_{t+1}) \right] \right]$ (Bellman equation)
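To make the Bellman equation concrete, here is a minimal tabular sketch in Python (my own illustration, not from the slides): it evaluates a fixed policy on a toy two-state MDP by iterating the backup until Q converges. The transition table, rewards, and policy are made-up values.

```python
# A minimal sketch: tabular Q evaluation of a fixed policy by iterating
#   Q(s,a) = r(s,a) + gamma * E_{s'}[ E_{a'~pi}[ Q(s',a') ] ]
# All numbers are made up for illustration.
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_s, n_a, n_s))          # P[s, a, s'] transition probabilities
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
r = np.array([[1.0, 0.0],              # r[s, a] rewards
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],             # pi[s, a] fixed stochastic policy
               [0.5, 0.5]])

Q = np.zeros((n_s, n_a))
for _ in range(500):                   # fixed-point iteration of the backup
    V = (pi * Q).sum(axis=1)           # V(s') = E_{a'~pi}[Q(s', a')]
    Q = r + gamma * P @ V              # Bellman backup for every (s, a)
print(Q)
```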
Deep reinforcement learning (DQN)
1. Declare a neural-network function approximator
   $Q(\cdot \mid \theta^Q) : S \times A \to \mathbb{R}$ (Q-network)
2. Accumulate $(s_t, a_t, s_{t+1}, r_t)$ during roll-outs
3. Update parameters
   $L(\theta^Q) = \frac{1}{N} \sum_t \left( y_t - Q(s_t, a_t \mid \theta^Q) \right)^2$
   where $y_t = r_t + \gamma Q(s_{t+1}, \pi(s_{t+1}))$
4. Select the policy
   $\pi(s_t) = \arg\max_{a_t} Q(s_t, a_t)$
V. Mnih et al. (2015), "Human-level control through deep reinforcement learning"
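A minimal sketch of the update in step 3, assuming PyTorch; the network size, hyperparameters, and random batch are illustrative, and for brevity the target $y_t$ is computed from the same network rather than the separate frozen target network the paper uses.

```python
# Sketch of one DQN gradient step on L = 1/N * sum (y_t - Q(s_t, a_t))^2.
import torch
import torch.nn as nn

n_obs, n_actions, gamma = 4, 2, 0.99     # illustrative sizes, not the paper's
q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next):
    with torch.no_grad():                # the target y_t is held fixed
        y = r + gamma * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t)
    loss = ((y - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with a random mini-batch of N = 32 transitions:
N = 32
s = torch.randn(N, n_obs); a = torch.randint(n_actions, (N,))
r = torch.randn(N); s_next = torch.randn(N, n_obs)
print(dqn_update(s, a, r, s_next))
```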
Deep reinforcement learning (DQN)
source code: https://github.com/musyoku/deep-q-network

$L(\theta^Q) = \frac{1}{N} \sum_t \left( y_t - Q(s_t, a_t \mid \theta^Q) \right)^2$

pixel input → Q-network → action: "End-to-end"
Deep reinforcement learning (DQN)
source: https://github.com/musyoku/deep-q-network
Training: 8000 episodes
In the real world...
8000 episodes in the real world?

In the real world...
Several robots, several acceleration algorithms, and a lot of patience...
Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates (S. Gu et al., 2016)
(no pixel input)

In the real world...
Not enough!
Guided policy search
• Levine, Sergey, and Vladlen Koltun. "Guided Policy Search." Proceedings of the 30th International Conference on Machine Learning (ICML 2013).
• Levine, Sergey, and Pieter Abbeel. "Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics." NIPS 2014.
• Montgomery, William, and Sergey Levine. "Guided Policy Search via Approximate Mirror Descent." NIPS 2016.
Guided policy search
$p_i = \{s_1, a_1, \dots, s_T, a_T\}$ : a trajectory from trajectory optimization (optimal control); $\pi_\theta$ : the global policy network

training:
$\min_{\theta,\, p_1, \dots, p_N} \sum_{i=1}^{N} \sum_{t=1}^{T} E_{p_i(s_t, a_t)}\left[ r(s_t, a_t) \right]$
such that $p_i(a_t \mid s_t) = \pi_\theta(a_t \mid s_t) \quad \forall\, s_t, a_t, t, i$
What is optimal control?
$\min_{a_1, a_2, \dots, a_T} \sum_t L(s_t, a_t)$
under dynamics $s_{t+1} = f(s_t, a_t)$
optimal control example: iLQR
Assumptions:
dynamics is linear: $s_{t+1} = F_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + f_t$
cost is quadratic: $L(s_t, a_t) = \frac{1}{2} \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T C_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T c_t$
optimal control example: iLQR (at time $T$)

Q-function: $Q(s_T, a_T) = \text{const} + \frac{1}{2} \begin{bmatrix} s_T \\ a_T \end{bmatrix}^T C_T \begin{bmatrix} s_T \\ a_T \end{bmatrix} + \begin{bmatrix} s_T \\ a_T \end{bmatrix}^T c_T$

derivative of Q: $\nabla_{a_T} Q(s_T, a_T) = C_{a_T, s_T} s_T + C_{a_T, a_T} a_T + c_{a_T} = 0$

optimal action: $a_T = -C_{a_T, a_T}^{-1} \left( C_{a_T, s_T} s_T + c_{a_T} \right)$

with dynamics $s_T = F_T \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix} + f_T$, the optimal action becomes
$a_T = -C_{a_T, a_T}^{-1} \left( C_{a_T, s_T} \left( F_T \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix} + f_T \right) + c_{a_T} \right)$
optimal control example: iLQR (at time $T-1$)

Q-function: $Q(s_{T-1}, a_{T-1}) = Q(s_T, a_T) + \frac{1}{2} \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix}^T C_{T-1} \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix} + \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix}^T c_{T-1}$

Substituting the optimal $a_T$ and the dynamics, $Q(s_T, a_T)$ is itself a quadratic function of $s_{T-1}, a_{T-1}$, so the whole expression stays quadratic.

derivative of Q: $\nabla_{a_{T-1}} Q(s_{T-1}, a_{T-1}) = \dots = 0$

optimal action: $a_{T-1} = \dots$

and so on, backward from $T$ to $1$.

Of course, the real system is not linear and the real cost is not quadratic: iLQR iteratively linearizes the dynamics and quadratizes the cost around the current trajectory.
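The backward recursion above can be written compactly in code. The sketch below is my own illustration of a plain LQR backward pass on random linear-quadratic problem data (the inner loop that iLQR runs after linearizing and quadratizing); the dimensions and problem data are made up.

```python
# LQR backward pass: from t = T back to t = 1, solve grad_a Q = 0 for a
# linear feedback law a_t = K_t s_t + k_t, then fold the minimized Q into
# the value function for the previous step.
import numpy as np

ns, na, T = 3, 2, 10
rng = np.random.default_rng(0)

# Random linear dynamics s' = F [s; a] + f and random quadratic costs.
F = [rng.normal(size=(ns, ns + na)) * 0.3 for _ in range(T)]
f = [rng.normal(size=ns) * 0.1 for _ in range(T)]
C = []
for _ in range(T):
    M = rng.normal(size=(ns + na, ns + na))
    C.append(M @ M.T + np.eye(ns + na))       # make each cost matrix PD
c = [rng.normal(size=ns + na) for _ in range(T)]

V = np.zeros((ns, ns))                        # value Hessian after time T
v = np.zeros(ns)                              # value gradient after time T
K, k = [None] * T, [None] * T
for t in reversed(range(T)):
    # Q-function Hessian/gradient at time t (current cost plus cost-to-go)
    Q = C[t] + F[t].T @ V @ F[t]
    q = c[t] + F[t].T @ (V @ f[t] + v)
    Qss, Qsa = Q[:ns, :ns], Q[:ns, ns:]
    Qas, Qaa = Q[ns:, :ns], Q[ns:, ns:]
    qs, qa = q[:ns], q[ns:]
    # grad_a Q = Qas s + Qaa a + qa = 0  ->  a = K s + k
    K[t] = -np.linalg.solve(Qaa, Qas)
    k[t] = -np.linalg.solve(Qaa, qa)
    # Plug the optimal action back in to get the value function at time t
    V = Qss + Qsa @ K[t] + K[t].T @ Qas + K[t].T @ Qaa @ K[t]
    v = qs + Qsa @ k[t] + K[t].T @ qa + K[t].T @ Qaa @ k[t]

print("feedback gain at t=0:\n", K[0])
```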
Guided policy search

The algorithm alternates around a loop:
Learning dynamics → Solve optimal control (C-step) → (local trajectories $p_i$) → Training policy network (S-step) → (parameters $\theta$) → roll-out → (data) → Learning dynamics, and repeat.

C-step (solve optimal control): $p_i = \arg\min_{p_i} \sum_t L'(s_t, a_t)$, solved using iLQR,
where $L'(s_t, a_t) = L(s_t, a_t) + KL(p_i \,\|\, \pi_\theta)$.
This constraint is very important for convergence (it keeps the optimal-control solution from drifting far from the policy).

S-step (training policy network): supervised learning,
$\pi_\theta \leftarrow \arg\min_\theta \sum_i KL(\pi_\theta \,\|\, p_i)$

Roll-out: operate the robot and collect $(s_t, a_t, r_t, s_{t+1})$.

Learning dynamics: fit the dynamics model $P(s_{t+1} \mid s_t, a_t)$.

KL divergence between the local and global policy:
$L'(s_t, a_t) = L(s_t, a_t) + KL(p_i \,\|\, \pi_\theta)$
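The alternation can be sketched as a loop. The code below is my own schematic illustration, not the papers' implementation: every component is a toy stand-in (a least-squares dynamics fit, a crude trajectory "optimizer", linear regression as the supervised step), chosen only to show the data flow among the four stages.

```python
# Schematic GPS loop on a fake 1-D system: roll-out -> fit dynamics ->
# C-step (trajectory optimization) -> S-step (policy training), repeated.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                          # toy "policy network" parameters

def rollout(theta, T=20):
    """Run the current policy on a fake noisy 1-D system."""
    s, traj = 0.0, []
    for _ in range(T):
        a = theta[0] * s + theta[1] + rng.normal(0, 0.1)
        s_next = 0.9 * s + 0.5 * a + rng.normal(0, 0.01)
        traj.append((s, a, s_next))
        s = s_next
    return traj

def fit_dynamics(traj):
    """Least-squares fit s' ~ A s + B a (stand-in for GPS's dynamics fit)."""
    X = np.array([[s, a] for s, a, _ in traj])
    y = np.array([sn for _, _, sn in traj])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef                              # [A, B]

def c_step(dyn, traj):
    """Stand-in for iLQR on L'(s,a) = L(s,a) + KL(p_i || pi_theta):
    relabel each state with the action that drives s' toward 0."""
    A, B = dyn
    return [(s, -(A / B) * s) for s, _, _ in traj]

def s_step(p_i):
    """Stand-in for pi_theta <- argmin_theta KL(pi_theta || p_i):
    linear regression of the C-step actions on states."""
    X = np.array([[s, 1.0] for s, _ in p_i])
    y = np.array([a for _, a in p_i])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

for it in range(5):                          # the alternating GPS loop
    traj = rollout(theta)                    # roll-out
    dyn = fit_dynamics(traj)                 # learning dynamics
    p_i = c_step(dyn, traj)                  # C-step
    theta = s_step(p_i)                      # S-step
print("policy parameters:", theta)
```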
End-to-end training of deep visuomotor policies (2015)

System input: $S$, $O$, $A$
Training phase:
  $p_i \leftarrow \text{Optimal control}(S, A)$
  $\theta \leftarrow \text{Supervised learning}(O, p_i)$
Test phase:
  $A \leftarrow \pi_\theta(O)$
$S$: robot configuration, $O$: image, $A$: action
End-to-end training of deep visuomotor policies (2015)

The GPS loop above is unchanged except for the S-step: "Training policy network" becomes "Training end-to-end network", i.e. the supervised step now trains a full observation → action network on the optimal-control roll-outs.
Deep Spatial Autoencoder for Visuomotor Learning (2016)

System input: $O$, $A$
Training phase:
  $p_i \leftarrow \text{Optimal control}(O, A)$
  $\theta \leftarrow \text{Supervised learning}(O, p_i)$
Test phase:
  $A \leftarrow \pi_\theta(O)$
Uses a spatial-softmax position layer to encode the image.
$O$: image, $A$: action
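A minimal sketch of the spatial-softmax position layer, as I understand it (my own NumPy illustration): each feature map becomes a softmax distribution over pixel locations, and the layer outputs the expected (x, y) coordinate per channel, a compact "where" encoding of the image.

```python
# Spatial softmax: (C, H, W) feature maps -> (C, 2) expected coordinates.
import numpy as np

def spatial_softmax(features):
    C, H, W = features.shape
    flat = features.reshape(C, -1)
    flat = flat - flat.max(axis=1, keepdims=True)         # numerical stability
    probs = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    probs = probs.reshape(C, H, W)                        # softmax over pixels
    ys, xs = np.linspace(-1, 1, H), np.linspace(-1, 1, W) # normalized coords
    exp_x = (probs.sum(axis=1) * xs).sum(axis=1)          # E[x] per channel
    exp_y = (probs.sum(axis=2) * ys).sum(axis=1)          # E[y] per channel
    return np.stack([exp_x, exp_y], axis=1)

print(spatial_softmax(np.random.randn(16, 32, 32)).shape)  # (16, 2)
```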
Thank you
ka2hyeon@gmail.com