
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration


The material that I used to present the paper
"Continuous Deep Q-Learning with Model-based Acceleration", S. Gu, T. Lillicrap, I. Sutskever, S. Levine, ICML 2016



  1. HAYA! Continuous Deep Q-Learning with Model-based Acceleration, ICML 2016. S. Gu, T. Lillicrap, I. Sutskever, S. Levine. Presenter: Hyemin Ahn
  2. HAYA! Introduction 2016-12-02 CPSLAB (EECS) 2 • Yet another improved work in deep reinforcement learning. • Tries to incorporate the advantages of model-free reinforcement learning and model-based reinforcement learning.
  3. HAYA! Results: Preview 2016-12-02 CPSLAB (EECS) 3
  4. HAYA! Reinforcement Learning: overview 2016-12-02 CPSLAB (EECS) 4 • Agent: how can we formalize our behavior?
  5. HAYA! Reinforcement Learning: overview 2016-12-02 CPSLAB (EECS) 5 At each time $t$, the agent receives an observation $x_t$ from the environment $E$.
  6. HAYA! Reinforcement Learning: overview 2016-12-02 CPSLAB (EECS) 6 The agent takes an action $u_t \in \mathcal{U}$, and receives a scalar reward $r_t$.
  7. HAYA! Reinforcement Learning: overview 2016-12-02 CPSLAB (EECS) 7 The agent chooses an action according to its current policy $\pi(u_t \mid x_t)$, which maps states to a probability distribution over actions.
  8. HAYA! Reinforcement Learning: overview 2016-12-02 CPSLAB (EECS) 8 (MDP diagram: actions $u_t$ sampled from $\pi$, transitions $p(x_2 \mid x_1, u_1)$, $p(x_3 \mid x_2, u_2)$, rewards $r(x_t, u_t)$.) • $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(x_i, u_i)$: cumulative sum of rewards over sequences ($\gamma \in [0, 1]$: discounting factor). • $Q^{\pi}(x_t, u_t) = \mathbb{E}[R_t \mid x_t, u_t]$: state-action value function. • Objective of RL: find $\pi$ maximizing $\mathbb{E}(R_1)$!
  9. HAYA! Reinforcement Learning: overview 2016-12-02 CPSLAB (EECS) 9 $Q^{\pi_{\text{trinity}}}(x_t, u_t) < Q^{\pi_{\text{neo}}}(x_t, u_t)$
  10. HAYA! Reinforcement Learning: overview 2016-12-02 CPSLAB (EECS) 10 • From environment $E$: $x \in \mathcal{X}$ is a state, $u \in \mathcal{U}$ is an action. • $\pi(u_t \mid x_t)$: a policy defining the agent's behavior, mapping states to a probability distribution over the actions. • With $\mathcal{X}$, $\mathcal{U}$, and an initial state distribution $p(x_1)$, the agent experiences a transition to a new state sampled from the dynamics distribution $p(x_{t+1} \mid x_t, u_t)$. • $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(x_i, u_i)$: the sum of future rewards with a discounting factor $\gamma \in [0, 1]$. • Objective of RL: learning a policy $\pi$ maximizing the expected return $\mathbb{E}[R_1]$ (a minimal return-computation sketch follows the transcript).
  11. HAYA! Reinforcement Learning: Model Free? 2016-12-02 CPSLAB (EECS) 11 • When the system dynamics $p(x_{t+1} \mid x_t, u_t)$ are not known. • We define the Q-function $Q^{\pi}(x_t, u_t)$, corresponding to a policy $\pi$, as the expected return from $x_t$ after taking $u_t$ and following $\pi$ thereafter. • Q-learning learns a greedy deterministic policy, which corresponds to $\mu(x) = \arg\max_u Q(x, u)$. • The learning objective is to minimize the Bellman error $L(\theta^Q) = \mathbb{E}_{x_t \sim \rho^{\beta},\, u_t \sim \beta}\left[(Q(x_t, u_t \mid \theta^Q) - y_t)^2\right]$. • $\beta$: arbitrary exploration policy; $\rho^{\beta}$: resulting state visitation frequency of the policy $\beta$. • $\theta^Q$: parameter of the Q-function. • Assume that there is a fixed target $y_t = r(x_t, u_t) + \gamma\, Q(x_{t+1}, \mu(x_{t+1}))$ (a Q-learning step sketch follows the transcript).
  12. HAYA! Continuous Q-Learning with Normalized Advantage Functions 2016-12-02 CPSLAB (EECS) 12 How did the authors learn a parameterized Q-function with deep learning when the state-action domain is continuous? Given the value function $V^{\pi}(x)$ and the advantage function $A^{\pi}(x, u) = Q^{\pi}(x, u) - V^{\pi}(x)$ of a given policy $\pi$, they suggest using a neural network that separately outputs a value-function term and an advantage term: $Q(x, u \mid \theta^Q) = V(x \mid \theta^V) + A(x, u \mid \theta^A)$, with $A(x, u \mid \theta^A) = -\frac{1}{2}(u - \mu(x \mid \theta^{\mu}))^T P(x \mid \theta^P)(u - \mu(x \mid \theta^{\mu}))$. • $P(x \mid \theta^P) = L(x \mid \theta^P)\, L(x \mid \theta^P)^T$: state-dependent, positive-definite square matrix. • $L(x \mid \theta^P)$: lower-triangular matrix whose entries come from a linear output layer of a neural network. • The action that maximizes the Q-function is always given by $\mu(x \mid \theta^{\mu})$ (a minimal sketch of this network head follows the transcript).
  13. HAYA! Continuous Q-Learning with Normalized Advantage Functions 2016-12-02 CPSLAB (EECS) 13 Trick: assume that we have a target network. (Diagram: the target network $Q'(x, u \mid \theta^{Q'})$ is the SLOW-LEARNER, $Q(x, u \mid \theta^Q)$ is the EXPLORER, and $R$ is the experience container, i.e. the replay buffer.)
  14. HAYA! Accelerating Learning with Imagination Rollouts 2016-12-02 CPSLAB (EECS) 14 • The sample complexity of model-free algorithms tends to be high when using high-dimensional function approximators. • To reduce the sample complexity and accelerate the learning phase, how about using good exploratory behavior from trajectory optimization?
  15. HAYA! Accelerating Learning with Imagination Rollouts 2016-12-02 CPSLAB (EECS) 15 • How about using good exploratory behavior from trajectory optimization? (Diagram: a fitted dynamics model $f$ and an iLQG controller $\pi_t^{iLQG}$ generate imagined rollouts $\mathcal{M}$, which are mixed with real experience in the buffers $R$ and $B$ used to train $Q(x, u \mid \theta^Q)$, $\mu(x \mid \theta^{\mu})$, and the target $Q'(x, u \mid \theta^{Q'})$; a sketch follows the transcript.)
  16. HAYA! Experiment: Results 2016-12-02 CPSLAB (EECS) 16
  17. HAYA! Experiment: Results 2016-12-02 CPSLAB (EECS) 17
  18. HAYA! 2016-12-02 CPSLAB (EECS) 18
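To make the return definition on slides 8 and 10 concrete, here is a minimal Python/NumPy sketch (not taken from the slides): it computes the discounted return $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(x_i, u_i)$ for every step of one finite trajectory. The reward array and the value of `gamma` are illustrative assumptions; $Q^{\pi}(x_t, u_t)$ would be the average of `returns[t]` over many trajectories sampled under $\pi$.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{i=t}^{T} gamma^(i-t) * r(x_i, u_i) for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk the trajectory backwards so each step reuses the tail sum.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Illustrative rewards from a single 5-step episode.
print(discounted_returns([0.0, 0.0, 1.0, 0.0, 5.0], gamma=0.9))
```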
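Slides 11 and 13 together describe one training step: minimize the Bellman error against a target computed by a slowly updated target network. The sketch below is a hedged PyTorch illustration, not the authors' implementation; `q_net`, `target_net`, their `.q(x, u)` / `.mu(x)` methods, the `batch` dictionary, and the soft-update rate `tau` are all assumed placeholders.

```python
import torch
import torch.nn.functional as F

def q_learning_step(q_net, target_net, optimizer, batch, gamma=0.99, tau=0.001):
    """One Bellman-error minimization step with a slowly updated target network.

    Assumed interfaces (placeholders, not the authors' code):
      q_net.q(x, u) -> Q(x, u | theta^Q), shape [batch, 1]
      q_net.mu(x)   -> greedy action mu(x | theta^mu)
      batch         -> dict of tensors: x, u, r, x_next, done
    """
    x, u, r, x_next, done = (batch[k] for k in ("x", "u", "r", "x_next", "done"))

    # Fixed target y_t = r_t + gamma * Q'(x_{t+1}, mu'(x_{t+1})); no gradient
    # flows into the slow learner.
    with torch.no_grad():
        q_next = target_net.q(x_next, target_net.mu(x_next))
        y = r + gamma * (1.0 - done) * q_next

    # Bellman error E[(Q(x_t, u_t | theta^Q) - y_t)^2] for the explorer.
    loss = F.mse_loss(q_net.q(x, u), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Soft update: theta' <- tau * theta + (1 - tau) * theta'.
    with torch.no_grad():
        for p, p_target in zip(q_net.parameters(), target_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)

    return loss.item()
```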
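The decomposition on slide 12 can be written as a small network head: $Q(x, u) = V(x) + A(x, u)$ with $A(x, u) = -\frac{1}{2}(u - \mu(x))^T L(x) L(x)^T (u - \mu(x))$ and $L$ lower-triangular. The PyTorch sketch below is my own minimal rendering of that idea; the layer sizes, the Tanh body, and exponentiating the diagonal of $L$ to keep $P$ positive definite are implementation choices assumed here, not details quoted from the slides.

```python
import torch
import torch.nn as nn

class NAFHead(nn.Module):
    """Q(x, u) = V(x) + A(x, u), with
    A(x, u) = -1/2 (u - mu(x))^T P(x) (u - mu(x)) and P(x) = L(x) L(x)^T."""

    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.action_dim = action_dim
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.V = nn.Linear(hidden, 1)            # value term V(x | theta^V)
        self.mu = nn.Linear(hidden, action_dim)  # greedy action mu(x | theta^mu)
        # Linear output layer producing the lower-triangular entries of L(x).
        self.L_entries = nn.Linear(hidden, action_dim * (action_dim + 1) // 2)
        self.register_buffer("tril_idx", torch.tril_indices(action_dim, action_dim))

    def forward(self, x, u):
        h = self.body(x)
        V = self.V(h)
        mu = self.mu(h)

        # Fill a lower-triangular L(x); exponentiate its diagonal so that
        # P = L L^T is positive definite (an assumed, common choice).
        L = torch.zeros(x.shape[0], self.action_dim, self.action_dim, device=x.device)
        L[:, self.tril_idx[0], self.tril_idx[1]] = self.L_entries(h)
        L = L.tril(-1) + torch.diag_embed(L.diagonal(dim1=1, dim2=2).exp())
        P = L @ L.transpose(1, 2)

        # The advantage is a negative quadratic in u, maximized (= 0) at u = mu(x).
        d = (u - mu).unsqueeze(-1)                          # [batch, action_dim, 1]
        A = -0.5 * (d.transpose(1, 2) @ P @ d).squeeze(-1)  # [batch, 1]
        return V + A, mu, V
```

Because the advantage term is a negative quadratic in $u$, the Q-maximizing action is exactly $\mu(x)$, as slide 12 states; the returned `mu` can be used both for acting and for evaluating the target network at $x_{t+1}$ in the step sketch above.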
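Slides 14 and 15 propose accelerating learning by mixing imagined rollouts from a fitted dynamics model into the replay buffer. The sketch below is a simplified, hypothetical version of that idea: `dynamics_model`, `reward_fn`, `mu`, and `replay_buffer` are assumed interfaces, and exploration here is plain Gaussian noise around $\mu(x)$ rather than the iLQG controller used in the paper.

```python
import torch

def imagination_rollouts(dynamics_model, reward_fn, mu, replay_buffer,
                         start_states, horizon=10, noise_std=0.1):
    """Generate short synthetic rollouts from a fitted dynamics model and add
    them to the replay buffer alongside real experience.

    Assumed (hypothetical) interfaces:
      dynamics_model(x, u) -> predicted next state
      reward_fn(x, u)      -> reward tensor
      mu(x)                -> greedy action, e.g. the NAF mu(x | theta^mu)
      replay_buffer.add(x, u, r, x_next)
    """
    x = start_states  # states sampled from recent real experience
    with torch.no_grad():
        for _ in range(horizon):
            a = mu(x)
            # Simple Gaussian exploration around the policy; the paper instead
            # derives exploration from an iLQG controller under the fitted model.
            u = a + noise_std * torch.randn_like(a)
            x_next = dynamics_model(x, u)  # model prediction, not the real env
            r = reward_fn(x, u)
            replay_buffer.add(x, u, r, x_next)
            x = x_next
```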

Editor's Notes

  • Of these, the least novel are the value/advantage decomposition of Q(s,a) and the use of locally-adapted linear-Gaussian dynamics.
  • But we don’t know the target…!
  • But we don’t know the target…!
  • But we don’t know the target…!
