Deep Robotics
Intelligent Control and Systems Laboratory
J.hyeon Park
2018-05-15
SNU AI Study
What is deep robotics?
source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s

What is deep robotics?
• Think about computer vision
source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
What is deep robotics?
• Think about computer vision

Traditional computer vision pipeline (each stage trained separately):
image → features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM) → semantic label

source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
What is deep robotics?
• Think about computer vision

Traditional computer vision (each stage trained separately):
image → features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM) → semantic label

Deep learning (end-to-end training):
image → artificial neural network → semantic label

source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
What is deep robotics?
• Deep robotics is analogous to computer vision
source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
What is deep robotics?
• Deep robotics is analogous to computer vision

Traditional robotics pipeline (each stage trained separately):
observation (e.g. image) → state estimation → modeling & prediction → planning → low-level control → controls

source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
What is deep robotics?
• Deep robotics is analogous to computer vision

Traditional robotics pipeline (each stage trained separately):
observation (e.g. image) → state estimation → modeling & prediction → planning → low-level control → controls

Deep robotics (end-to-end training):
observation → artificial neural network → controls

source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
What is deep robotics?
• Connectionist model in robotics
• Benefits?
- General-purpose algorithm
- Combines perception and control
- Acquires complex skills with general-purpose representations
source: RI Seminar, Sergey Levine
https://www.youtube.com/watch?v=eKaYnXQUb2g&t=346s
Preliminary 1
• MDP set-up: $\langle s_t, a_t, s_{t+1}, r_t \rangle$
  $s_t \in S$, state
  $a_t \in A$, action
  $P_{ss'} = P(s_{t+1} \mid s_t, a_t)$, transition probability
  $r_t(s_t, a_t)$, reward
• Control objective: maximize $R = \sum_t r_t$ (or $\sum_t \gamma^t r_t$)
  $R$: accumulated reward
  $\gamma$: discount factor
Preliminary 2
• Reinforcement learning set-up
  $\pi(s_t)$ or $\pi(a_t \mid s_t)$, policy
  $V^\pi(s_t)$, value function
  $Q^\pi(s_t, a_t)$, Q-function
  $V^\pi(s_t) \equiv E_{s_{i>t} \sim P_{ss'},\, a_i \sim \pi}\left[ R_t \right]$
  $Q^\pi(s_t, a_t) \equiv E_{s_{i>t} \sim P_{ss'},\, a_{i>t} \sim \pi}\left[ R_t \right]$
  $Q^\pi(s_t, a_t) = E\left[ r(s_t, a_t) + \gamma E_{a_{t+1} \sim \pi}\left[ Q^\pi(s_{t+1}, a_{t+1}) \right] \right]$ (Bellman equation)
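To make the Bellman equation concrete, here is a minimal tabular sketch in Python (my own illustration, not from the slides): it evaluates a fixed policy on a toy two-state MDP by iterating the backup until Q converges. The transition table, rewards, and policy are made-up values.

```python
# A minimal sketch: tabular Q evaluation of a fixed policy by iterating
#   Q(s,a) = r(s,a) + gamma * E_{s'}[ E_{a'~pi}[ Q(s',a') ] ]
# All numbers are made up for illustration.
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_s, n_a, n_s))          # P[s, a, s'] transition probabilities
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
r = np.array([[1.0, 0.0],              # r[s, a] rewards
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],             # pi[s, a] fixed stochastic policy
               [0.5, 0.5]])

Q = np.zeros((n_s, n_a))
for _ in range(500):                   # fixed-point iteration of the backup
    V = (pi * Q).sum(axis=1)           # V(s') = E_{a'~pi}[Q(s', a')]
    Q = r + gamma * P @ V              # Bellman backup for every (s, a)
print(Q)
```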
Deep reinforcement learning (DQN)
1. Declare a neural-network function approximator
   $Q(\cdot \mid \theta^Q) : S \times A \to \mathbb{R}$ (Q-network)
2. Accumulate $(s_t, a_t, s_{t+1}, r_t)$ during roll-outs
3. Update parameters
   $L(\theta^Q) = \frac{1}{N} \sum_t \left( y_t - Q(s_t, a_t \mid \theta^Q) \right)^2$
   where $y_t = r_t + \gamma Q(s_{t+1}, \pi(s_{t+1}))$
4. Select the policy
   $\pi(s_t) = \arg\max_{a_t} Q(s_t, a_t)$
V. Mnih et al. (2015), "Human-level control through deep reinforcement learning"
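A minimal sketch of the update in step 3, assuming PyTorch; the network size, hyperparameters, and random batch are illustrative, and for brevity the target $y_t$ is computed from the same network rather than the separate frozen target network the paper uses.

```python
# Sketch of one DQN gradient step on L = 1/N * sum (y_t - Q(s_t, a_t))^2.
import torch
import torch.nn as nn

n_obs, n_actions, gamma = 4, 2, 0.99     # illustrative sizes, not the paper's
q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next):
    with torch.no_grad():                # the target y_t is held fixed
        y = r + gamma * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t)
    loss = ((y - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with a random mini-batch of N = 32 transitions:
N = 32
s = torch.randn(N, n_obs); a = torch.randint(n_actions, (N,))
r = torch.randn(N); s_next = torch.randn(N, n_obs)
print(dqn_update(s, a, r, s_next))
```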
Deep reinforcement learning (DQN)
source code: https://github.com/musyoku/deep-q-network

$L(\theta^Q) = \frac{1}{N} \sum_t \left( y_t - Q(s_t, a_t \mid \theta^Q) \right)^2$

pixel input → Q-network → action: "End-to-end"
Deep reinforcement learning (DQN)
source: https://github.com/musyoku/deep-q-network
Training: 8000 episodes
In the real world...
8000 episodes in the real world?

In the real world...
Several robots, several acceleration algorithms, and a lot of patience...
Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates (S. Gu et al., 2016)
(no pixel input)

In the real world...
Not enough!
Guided policy search
• Levine, Sergey, and Vladlen Koltun. "Guided Policy Search." Proceedings of the 30th International Conference on Machine Learning (ICML 2013).
• Levine, Sergey, and Pieter Abbeel. "Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics." NIPS 2014.
• Montgomery, William, and Sergey Levine. "Guided Policy Search via Approximate Mirror Descent." NIPS 2016.
Guided policy search
$p_i = \{s_1, a_1, \dots, s_T, a_T\}$ : a trajectory from trajectory optimization (optimal control); $\pi_\theta$ : the global policy network

training:
$\min_{\theta,\, p_1, \dots, p_N} \sum_{i=1}^{N} \sum_{t=1}^{T} E_{p_i(s_t, a_t)}\left[ r(s_t, a_t) \right]$
such that $p_i(a_t \mid s_t) = \pi_\theta(a_t \mid s_t) \quad \forall\, s_t, a_t, t, i$
What is optimal control?
$\min_{a_1, a_2, \dots, a_T} \sum_t L(s_t, a_t)$
under dynamics $s_{t+1} = f(s_t, a_t)$
optimal control example: iLQR
Assumptions:
dynamics is linear: $s_{t+1} = F_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + f_t$
cost is quadratic: $L(s_t, a_t) = \frac{1}{2} \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T C_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T c_t$
optimal control example: iLQR (at time $T$)

Q-function: $Q(s_T, a_T) = \text{const} + \frac{1}{2} \begin{bmatrix} s_T \\ a_T \end{bmatrix}^T C_T \begin{bmatrix} s_T \\ a_T \end{bmatrix} + \begin{bmatrix} s_T \\ a_T \end{bmatrix}^T c_T$

derivative of Q: $\nabla_{a_T} Q(s_T, a_T) = C_{a_T, s_T} s_T + C_{a_T, a_T} a_T + c_{a_T} = 0$

optimal action: $a_T = -C_{a_T, a_T}^{-1} \left( C_{a_T, s_T} s_T + c_{a_T} \right)$

with dynamics $s_T = F_T \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix} + f_T$, the optimal action becomes
$a_T = -C_{a_T, a_T}^{-1} \left( C_{a_T, s_T} \left( F_T \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix} + f_T \right) + c_{a_T} \right)$
optimal control example: iLQR (at time $T-1$)

Q-function: $Q(s_{T-1}, a_{T-1}) = Q(s_T, a_T) + \frac{1}{2} \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix}^T C_{T-1} \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix} + \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix}^T c_{T-1}$

Substituting the optimal $a_T$ and the dynamics, $Q(s_T, a_T)$ is itself a quadratic function of $s_{T-1}, a_{T-1}$, so the whole expression stays quadratic.

derivative of Q: $\nabla_{a_{T-1}} Q(s_{T-1}, a_{T-1}) = \dots = 0$

optimal action: $a_{T-1} = \dots$

and so on, backward from $T$ to $1$.

Of course, the real system is not linear and the real cost is not quadratic: iLQR iteratively linearizes the dynamics and quadratizes the cost around the current trajectory.
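The backward recursion above can be written compactly in code. The sketch below is my own illustration of a plain LQR backward pass on random linear-quadratic problem data (the inner loop that iLQR runs after linearizing and quadratizing); the dimensions and problem data are made up.

```python
# LQR backward pass: from t = T back to t = 1, solve grad_a Q = 0 for a
# linear feedback law a_t = K_t s_t + k_t, then fold the minimized Q into
# the value function for the previous step.
import numpy as np

ns, na, T = 3, 2, 10
rng = np.random.default_rng(0)

# Random linear dynamics s' = F [s; a] + f and random quadratic costs.
F = [rng.normal(size=(ns, ns + na)) * 0.3 for _ in range(T)]
f = [rng.normal(size=ns) * 0.1 for _ in range(T)]
C = []
for _ in range(T):
    M = rng.normal(size=(ns + na, ns + na))
    C.append(M @ M.T + np.eye(ns + na))       # make each cost matrix PD
c = [rng.normal(size=ns + na) for _ in range(T)]

V = np.zeros((ns, ns))                        # value Hessian after time T
v = np.zeros(ns)                              # value gradient after time T
K, k = [None] * T, [None] * T
for t in reversed(range(T)):
    # Q-function Hessian/gradient at time t (current cost plus cost-to-go)
    Q = C[t] + F[t].T @ V @ F[t]
    q = c[t] + F[t].T @ (V @ f[t] + v)
    Qss, Qsa = Q[:ns, :ns], Q[:ns, ns:]
    Qas, Qaa = Q[ns:, :ns], Q[ns:, ns:]
    qs, qa = q[:ns], q[ns:]
    # grad_a Q = Qas s + Qaa a + qa = 0  ->  a = K s + k
    K[t] = -np.linalg.solve(Qaa, Qas)
    k[t] = -np.linalg.solve(Qaa, qa)
    # Plug the optimal action back in to get the value function at time t
    V = Qss + Qsa @ K[t] + K[t].T @ Qas + K[t].T @ Qaa @ K[t]
    v = qs + Qsa @ k[t] + K[t].T @ qa + K[t].T @ Qaa @ k[t]

print("feedback gain at t=0:\n", K[0])
```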
Guided policy search

The algorithm alternates around a loop:
Learning dynamics → Solve optimal control (C-step) → (local trajectories $p_i$) → Training policy network (S-step) → (parameters $\theta$) → roll-out → (data) → Learning dynamics, and repeat.

C-step (solve optimal control): $p_i = \arg\min_{p_i} \sum_t L'(s_t, a_t)$, solved using iLQR,
where $L'(s_t, a_t) = L(s_t, a_t) + KL(p_i \,\|\, \pi_\theta)$.
This constraint is very important for convergence (it keeps the optimal-control solution from drifting far from the policy).

S-step (training policy network): supervised learning,
$\pi_\theta \leftarrow \arg\min_\theta \sum_i KL(\pi_\theta \,\|\, p_i)$

Roll-out: operate the robot and collect $(s_t, a_t, r_t, s_{t+1})$.

Learning dynamics: fit the dynamics model $P(s_{t+1} \mid s_t, a_t)$.

KL divergence between the local and global policy:
$L'(s_t, a_t) = L(s_t, a_t) + KL(p_i \,\|\, \pi_\theta)$
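The alternation can be sketched as a loop. The code below is my own schematic illustration, not the papers' implementation: every component is a toy stand-in (a least-squares dynamics fit, a crude trajectory "optimizer", linear regression as the supervised step), chosen only to show the data flow among the four stages.

```python
# Schematic GPS loop on a fake 1-D system: roll-out -> fit dynamics ->
# C-step (trajectory optimization) -> S-step (policy training), repeated.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                          # toy "policy network" parameters

def rollout(theta, T=20):
    """Run the current policy on a fake noisy 1-D system."""
    s, traj = 0.0, []
    for _ in range(T):
        a = theta[0] * s + theta[1] + rng.normal(0, 0.1)
        s_next = 0.9 * s + 0.5 * a + rng.normal(0, 0.01)
        traj.append((s, a, s_next))
        s = s_next
    return traj

def fit_dynamics(traj):
    """Least-squares fit s' ~ A s + B a (stand-in for GPS's dynamics fit)."""
    X = np.array([[s, a] for s, a, _ in traj])
    y = np.array([sn for _, _, sn in traj])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef                              # [A, B]

def c_step(dyn, traj):
    """Stand-in for iLQR on L'(s,a) = L(s,a) + KL(p_i || pi_theta):
    relabel each state with the action that drives s' toward 0."""
    A, B = dyn
    return [(s, -(A / B) * s) for s, _, _ in traj]

def s_step(p_i):
    """Stand-in for pi_theta <- argmin_theta KL(pi_theta || p_i):
    linear regression of the C-step actions on states."""
    X = np.array([[s, 1.0] for s, _ in p_i])
    y = np.array([a for _, a in p_i])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

for it in range(5):                          # the alternating GPS loop
    traj = rollout(theta)                    # roll-out
    dyn = fit_dynamics(traj)                 # learning dynamics
    p_i = c_step(dyn, traj)                  # C-step
    theta = s_step(p_i)                      # S-step
print("policy parameters:", theta)
```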
End-to-end training of deep visuomotor policies (2015)

System input: $S$, $O$, $A$
Training phase:
  $p_i \leftarrow \text{Optimal control}(S, A)$
  $\theta \leftarrow \text{Supervised learning}(O, p_i)$
Test phase:
  $A \leftarrow \pi_\theta(O)$
$S$: robot configuration, $O$: image, $A$: action
End-to-end training of deep visuomotor policies (2015)

The GPS loop above is unchanged except for the S-step: "Training policy network" becomes "Training end-to-end network", i.e. the supervised step now trains a full observation → action network on the optimal-control roll-outs.
Deep Spatial Autoencoder for Visuomotor Learning (2016)

System input: $O$, $A$
Training phase:
  $p_i \leftarrow \text{Optimal control}(O, A)$
  $\theta \leftarrow \text{Supervised learning}(O, p_i)$
Test phase:
  $A \leftarrow \pi_\theta(O)$
Uses a spatial-softmax position layer to encode the image.
$O$: image, $A$: action
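A minimal sketch of the spatial-softmax position layer, as I understand it (my own NumPy illustration): each feature map becomes a softmax distribution over pixel locations, and the layer outputs the expected (x, y) coordinate per channel, a compact "where" encoding of the image.

```python
# Spatial softmax: (C, H, W) feature maps -> (C, 2) expected coordinates.
import numpy as np

def spatial_softmax(features):
    C, H, W = features.shape
    flat = features.reshape(C, -1)
    flat = flat - flat.max(axis=1, keepdims=True)         # numerical stability
    probs = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    probs = probs.reshape(C, H, W)                        # softmax over pixels
    ys, xs = np.linspace(-1, 1, H), np.linspace(-1, 1, W) # normalized coords
    exp_x = (probs.sum(axis=1) * xs).sum(axis=1)          # E[x] per channel
    exp_y = (probs.sum(axis=2) * ys).sum(axis=1)          # E[y] per channel
    return np.stack([exp_x, exp_y], axis=1)

print(spatial_softmax(np.random.randn(16, 32, 32)).shape)  # (16, 2)
```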
Thank you
ka2hyeon@gmail.com