
Reinforcement Learning 13. Policy Gradient Methods


A summary of Chapter 13: Policy Gradient Methods of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book on Professor Sutton's website: http://incompleteideas.net/book/the-book-2nd.html

Check my website for more slides on books and papers!


  1. Chapter 13: Policy Gradient Methods (Seungjae Ryan Lee)
  2. Preview: Policy Gradients ● Action-value Methods ○ Learn values of actions and select actions based on the estimated action values ○ The policy is derived from the action-value estimates ● Policy Gradient Methods ○ Learn a parameterized policy that can select actions without a value function ○ Can still use a value function to learn the policy parameter
  3. Policy Gradient Methods ● Define a performance measure J(θ) to maximize ● Learn the policy parameter θ through approximate gradient ascent on a stochastic estimate of ∇J(θ)
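The gradient-ascent update the slide describes can be written out as follows (a reconstruction in the book's notation; the hat denotes the stochastic estimate):

```latex
% Gradient ascent on the performance measure J(theta),
% using a stochastic gradient estimate whose expectation
% approximates the true gradient
\theta_{t+1} = \theta_t + \alpha \,\widehat{\nabla J(\theta_t)},
\qquad
\mathbb{E}\big[\widehat{\nabla J(\theta_t)}\big] \approx \nabla J(\theta_t)
```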
  4. Soft-max in Action Preferences ● Numerical preference h(s, a, θ) for each state-action pair ● Action selection through a soft-max over the preferences
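A minimal sketch of soft-max action selection over preferences, assuming the preferences for the current state have already been computed as a vector (the function name is mine, not from the slides):

```python
import numpy as np

def softmax_policy(h):
    """Action probabilities from numerical preferences h(s, a, theta).

    h: 1-D array of preferences for each action in the current state.
    Subtracting the max before exponentiating is the usual
    numerically stable formulation; it does not change the result.
    """
    z = np.exp(h - np.max(h))
    return z / z.sum()

# Preferences h = 1 and h = 0.9 give probabilities of about 0.525 / 0.475,
# matching the example used later on slide 9.
probs = softmax_policy(np.array([1.0, 0.9]))
```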
  5. Soft-max in Action Preferences: Advantages 1. The approximate policy can approach a deterministic policy ○ No "limit" as in ε-greedy methods, which always explore with probability ε ○ Using soft-max on action values cannot approach a deterministic policy
  6. Soft-max in Action Preferences: Advantages 2. Allows a stochastic policy ○ The best approximate policy can be stochastic in problems with significant function approximation ● Consider a small corridor with -1 reward on each step ○ The states are indistinguishable ○ The action transition is reversed in the second state
  7. Soft-max in Action Preferences: Advantages 2. Allows a stochastic policy ○ ε-greedy methods (ε = 0.1) cannot find the optimal policy
  8. Soft-max in Action Preferences: Advantages 3. The policy may be simpler to approximate ○ Differs among problems http://proceedings.mlr.press/v48/simsek16.pdf
  9. Theoretical Advantage of Policy Gradient Methods ● Smooth transition of the policy as parameters change ● Allows for stronger convergence guarantees ○ Soft-max on preferences: h = 1 vs. h = 0.9 gives probabilities 0.525 / 0.475; h = 1 vs. h = 1.1 gives 0.475 / 0.525 ○ Greedy methods on values: Q = 1 vs. Q = 0.9 gives 0.95 / 0.05, but Q = 1 vs. Q = 1.1 jumps to 0.05 / 0.95
  10. Policy Gradient Theorem ● Define the performance measure as the value of the start state ● Want to compute the gradient of J(θ) with respect to the policy parameter θ
  11. Policy Gradient Theorem ● The proportionality constant is the average episode length in the episodic case and 1 in the continuing case ● μ is the on-policy state distribution
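The theorem itself appears on the slide as an image that was not captured; in the book's notation it states:

```latex
% Policy Gradient Theorem (Sutton & Barto, Section 13.2):
% mu is the on-policy state distribution under pi; the constant of
% proportionality is the average episode length (episodic case) or 1
% (continuing case).
\nabla J(\theta) \;\propto\; \sum_{s} \mu(s) \sum_{a} q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)
```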
  12.-21. Policy Gradient Theorem: Proof (slides 12-21; the equation images were not captured. The proof expands ∇v_π(s) with the product rule, substitutes the Bellman expression for q_π(s, a), and unrolls the resulting recursion over future states.)
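A reconstruction of the proof steps those slides walk through, following the book's derivation (the "product rule" and "unrolling" annotations survive in the transcript above):

```latex
% Expand the state value via the policy, apply the product rule,
% substitute the Bellman expression for q_pi, and unroll the recursion.
\nabla v_\pi(s)
  = \nabla \Big[ \sum_a \pi(a \mid s)\, q_\pi(s, a) \Big]
  = \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s, a)
      + \pi(a \mid s)\, \nabla q_\pi(s, a) \Big]                 % product rule
  = \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s, a)
      + \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s') \Big]
  = \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty}
      \Pr(s \to x, k, \pi) \sum_a \nabla \pi(a \mid x)\, q_\pi(x, a)  % unrolling
% Evaluating at the start state and normalizing the visit counts
% eta(s) into the distribution mu(s) gives the theorem:
\nabla J(\theta) = \nabla v_\pi(s_0)
  \;\propto\; \sum_{s} \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)
```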
  22. Stochastic Gradient Descent ● Need samples whose expectation is proportional to the true gradient
  23. Stochastic Gradient Descent ● μ is the on-policy state distribution of π, so states encountered while following π sample the outer sum in the correct proportions
  24.-25. Stochastic Gradient Descent (derivation continues; equation images not captured)
  26. Stochastic Gradient Descent: REINFORCE
  27. REINFORCE (1992) ● Sample the return as in Monte Carlo methods ● Increment proportional to the return ● Increment inversely proportional to the action probability ○ Prevents frequent actions from dominating due to frequent updates
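The update described above can be sketched for a linear soft-max policy. This is a minimal illustration, not the book's pseudocode: the function names, the linear parameterization, and the tiny episode format `(features, action, reward)` are all my assumptions.

```python
import numpy as np

def grad_ln_softmax(theta, x, a):
    """Eligibility vector grad ln pi(a|s,theta) for a linear soft-max policy.

    theta: (n_actions, n_features) weights; x: feature vector of state s.
    For this parameterization the gradient is x placed at row a, minus
    the policy-weighted average of x over all actions.
    """
    h = theta @ x
    pi = np.exp(h - h.max())
    pi /= pi.sum()
    grad = -np.outer(pi, x)
    grad[a] += x
    return grad

def reinforce_update(theta, episode, alpha=0.1, gamma=1.0):
    """One REINFORCE pass over an episode of (x, a, reward) steps:
    increment proportional to the return G_t, in the direction of the
    eligibility vector (which divides by the action probability)."""
    T = len(episode)
    rewards = [r for (_, _, r) in episode]
    for t, (x, a, _) in enumerate(episode):
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))  # return G_t
        theta = theta + alpha * gamma ** t * G * grad_ln_softmax(theta, x, a)
    return theta
```

After an episode in which an action earned a positive return, the policy's probability of that action increases, which is the qualitative behavior REINFORCE guarantees.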
  28. REINFORCE: Pseudocode ○ The update uses the eligibility vector ∇ ln π(A_t | S_t, θ)
  29. REINFORCE: Results
  30. REINFORCE with Baseline ● REINFORCE ○ Good theoretical convergence ○ Slow convergence in practice due to high variance ● Subtract a baseline b(s) from the return to reduce variance
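The baselined update, reconstructed in the book's notation, together with the reason a baseline adds no bias:

```latex
% REINFORCE with baseline: subtract b(S_t) from the return
\theta_{t+1} = \theta_t + \alpha \,\big( G_t - b(S_t) \big)\,
  \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}
% The baseline leaves the expected update unchanged because
% the policy's probabilities sum to one:
\sum_a b(s)\, \nabla \pi(a \mid s, \theta)
  = b(s)\, \nabla \sum_a \pi(a \mid s, \theta)
  = b(s)\, \nabla 1 = 0
```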
  31. REINFORCE with Baseline: Pseudocode ○ Learn the state value with Monte Carlo and use it as the baseline
  32. REINFORCE with Baseline: Results
  33. Actor-Critic Methods ● Learn approximations for both the policy (Actor) and the value function (Critic) ● Critic vs. baseline in REINFORCE ○ The Critic is used for bootstrapping ○ Bootstrapping introduces bias and relies on the state representation ○ Bootstrapping reduces variance and accelerates learning
  34. One-step Actor-Critic ● Replace the full return with the one-step return ● Replace the baseline with the approximated value function (Critic) ○ Learned with semi-gradient TD(0)
  35. One-step Actor-Critic ● Update the Critic (value function) parameters ● Update the Actor (policy) parameters
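The two updates those slides show as images, reconstructed from the book's one-step actor-critic:

```latex
% TD error from the Critic (one-step return minus current estimate)
\delta_t = R_{t+1} + \gamma\, \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w})
% Critic update (semi-gradient TD(0))
\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta_t\, \nabla \hat v(S_t, \mathbf{w})
% Actor update; the book's episodic pseudocode also carries a
% discounting factor I = gamma^t on this step
\theta \leftarrow \theta + \alpha^{\theta}\, I\, \delta_t\, \nabla \ln \pi(A_t \mid S_t, \theta)
```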
  36. Actor-Critic with Eligibility Traces
  37. Average Reward for Continuing Problems ● Measure performance in terms of the average reward
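The average-reward objective referred to here is, in the book's notation:

```latex
% Average reward rate of policy pi; mu is the steady-state
% distribution under pi
r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h}
    \mathbb{E}\big[ R_t \mid S_0,\, A_{0:t-1} \sim \pi \big]
  = \sum_s \mu(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r
```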
  38. Actor-Critic for Continuing Problems
  39. Policy Gradient Theorem Proof (Continuing Case) 1. Same procedure as the episodic case 2. Rearrange the equation
  40. Policy Gradient Theorem Proof (Continuing Case) 3. Sum over all states weighted by the state distribution μ(s) a. Nothing changes, since neither side depends on s
  41. Policy Gradient Theorem Proof (Continuing Case) 4. Use ergodicity
  42. Policy Parameterization for Continuous Actions ● Policy-based methods can handle continuous action spaces ● Learn the statistics of a probability distribution ○ e.g., the mean and variance of a Gaussian ● Choose actions from the learned distribution ○ e.g., a Gaussian distribution
  43. Policy Parameterization: Gaussian Distribution ● Divide the policy parameter vector into mean and variance parts ● Approximate the mean with a linear function ● Approximate the variance with the exponential of a linear function ○ Guaranteed positive ● All policy gradient algorithms can be applied to this parameter vector
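A minimal sketch of that parameterization (the function name is mine; following the book, I parameterize the standard deviation, rather than the variance, as the exponential of a linear function, which keeps it strictly positive):

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x, rng):
    """Sample a continuous action from a Gaussian policy.

    theta_mu, theta_sigma: parameter vectors for the mean and
    standard-deviation parts of the policy parameter vector.
    x: feature vector of the current state.
    """
    mu = theta_mu @ x                 # mean: linear in the features
    sigma = np.exp(theta_sigma @ x)   # std-dev: exp of a linear function
    return rng.normal(mu, sigma), mu, sigma
```

With `theta_sigma` all zeros the standard deviation is exp(0) = 1, so early in learning the policy explores broadly around its linear mean.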
  44. Summary ● Policy gradient methods have many advantages over action-value methods ○ Represent stochastic policies and can approach deterministic policies ○ Learn appropriate levels of exploration ○ Handle continuous action spaces ○ Compute the effect of the policy parameter on performance via the Policy Gradient Theorem ● Actor-Critic estimates a value function for bootstrapping ○ Introduces bias but is often desirable due to lower variance ○ Similar to preferring TD over MC ● "Policy gradient methods provide a significantly different set of strengths and weaknesses than action-value methods."
  45. Thank you! Original content from ● Reinforcement Learning: An Introduction by Sutton and Barto ● You can find more content at github.com/seungjaeryanlee and www.endtoend.ai