
Survey of Modern Reinforcement Learning



A review of the basic ideas and concepts in reinforcement learning, including discussion of Q-Learning and Sarsa methods. Includes a survey of modern RL methods, including Dyna-Q, DQN, REINFORCE, and A2C, and how they relate.



  1. 1. Survey of Modern Reinforcement Learning Julia Maddalena
  2. 2. What to expect from this talk Part 1 Introduce the foundations of reinforcement learning ● Definitions and basic ideas ● A couple algorithms that work in simple environments Part 2 Review some state-of-the-art methods ● Higher level concepts, vanilla methods ● Not a complete list of cutting edge methods Part 3 Current state of reinforcement learning
  3. 3. Part 1 Foundations of reinforcement learning
  4. 4. What is reinforcement learning? A type of machine learning where an agent interacts with an environment and learns to take actions that result in greater cumulative reward. For contrast: in unsupervised learning, X alone is analyzed for patterns (PCA, cluster analysis, outlier detection); in supervised learning, X is used to predict Y (classification, regression).
  5. 5. Definitions. Agent: the learner and decision maker. Environment: everything external to the agent used to make decisions. Actions: the set of possible steps the agent can take depending on the state of the environment. Reward: motivation for the agent; it is not always obvious what the reward signal should be (e.g. YOU WIN! +1, GAME OVER -1, stay alive +1/second, sort of).
  6. 6. The Problem with Rewards... Designing reward functions is notoriously difficult Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016. 1 Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018. Possible reward structure ● Total points ● Time to finish ● Finishing position Human player “I’ve taken to imagining deep RL as a demon that’s deliberately misinterpreting your reward and actively searching for the laziest possible local optima.” - Alex Irpan Reinforcement Learning Agent
  7. 7. More Definitions. Return: long-term, discounted reward (with a discount factor). Value: expected return; the value of a state → V(s), how good is it to be in state s; the value of a state-action pair → Q(s,a), how good is it to take action a from state s. Policy: how the agent should act from a given state → π(a|s).
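For reference, the return and value quantities named above can be written out explicitly. This is the standard Sutton & Barto notation, added here rather than taken from the slide:

```latex
% Discounted return: long-term reward, with discount factor gamma in [0, 1]
\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \]
% Value of a state and of a state-action pair under policy pi
\[ V_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right], \qquad
   Q_\pi(s,a) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s,\, A_t = a \right] \]
```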
  8. 8. Markov Decision Process. Markov Process: a random process whose future behavior only depends on the current state. (Diagram: transition probabilities between the states Sleepy, Energetic, and Hungry.)
  9. 9. Markov Decision Process. Markov Process + Actions + Reward = Markov Decision Process. (Diagram: the same three states, now with the actions nap, beg, and be good, transition probabilities, and rewards.)
  10. 10. To model or not to model. Model-based methods: (a) Transition model: we already know the dynamics of the environment; we simply need to plan our actions to optimize return (planning). (b) Sample model: we don't know the dynamics; we try to learn them by exploring the environment and use them to plan our actions to optimize return (planning and learning). Model-free methods: we don't know or care about the dynamics; we just want to learn a good policy by exploring the environment (learning).
  11. 11. Reinforcement Learning Methods Model-based Model-free Transition Model Sample Model Dynamic Programming
  12. 12. Bellman Equations. Value of each state under the optimal policy for Robodog. (The slide shows the Bellman equation and the Bellman optimality equation, with the policy, transition probabilities, reward, discount factor, value of the current state, and value of the next state labeled; the equations are reconstructed below.)
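The equations themselves appear only as an image on the slide; reconstructed in standard notation, with the pieces the slide labels (policy, transition probabilities, reward, discount factor, value of the current and next state), they are:

```latex
% Bellman equation for the value of a state under policy pi
\[ V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,\bigl[ r(s, a, s') + \gamma\, V_\pi(s') \bigr] \]
% Bellman optimality equation: value of a state under the optimal policy
\[ V_*(s) = \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[ r(s, a, s') + \gamma\, V_*(s') \bigr] \]
```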
  13. 13. Policy Iteration. Policy evaluation: makes the value function consistent with the current policy. Policy improvement: makes the policy greedy with respect to the current value function. Alternating the two converges to the optimal policy and the value under the optimal policy. (Slide example for Robodog: an initial stochastic policy with state values sleepy 19.88, energetic 20.97, hungry 20.63 improves to a deterministic policy with values sleepy 29.66, energetic 31.66, hungry 31.90. A sketch of the algorithm follows below.)
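A minimal tabular sketch of the two alternating steps. The transition-model format P[s][a] = list of (probability, next_state, reward) tuples is a hypothetical convention for illustration, and every action is assumed to be available in every state:

```python
def policy_iteration(states, actions, P, gamma=0.9, theta=1e-6):
    """Tabular policy iteration with a known transition model P."""
    V = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}
    while True:
        # Policy evaluation: make V consistent with the current policy
        while True:
            delta = 0.0
            for s in states:
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: make the policy greedy with respect to V
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```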
  14. 14. Reinforcement Learning Methods Model-based Model-free Transition Model On-policy Off-policy Sample Model Value-based Policy-based Dynamic Programming Sarsa Q-Learning Monte Carlo Temporal Difference
  15. 15. When learning happens (in this example, learning = updating the value of states). Monte Carlo: wait until the end of the episode before making updates to value estimates, then update the value for all states in the episode. Temporal difference, TD(0): update every step using estimates of the next states (bootstrapping), updating the value of the previous state after each step. (Slide illustrates both on a sequence of tic-tac-toe positions.)
  16. 16. Exploration vs exploitation 𝜀-greedy policy Exploration vs Exploitation, Will Evans, slideshare.net/willevans exploitation exploration
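As a concrete illustration of the 𝜀-greedy policy, a minimal sketch in Python, assuming Q is a dictionary keyed by (state, action) pairs:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit
    (take the action we currently believe has the most value)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```

Decreasing epsilon over time shifts the balance from exploration toward exploitation.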
  17. 17. Sarsa. Initialize Q(s,a). For each episode: • Start in a random state, S. • Choose action A from S using 𝛆-greedy policy from Q(s,a). • While S is not terminal: 1. Take action A, observe reward R and new state S'. 2. Choose action A' from S' using 𝛆-greedy policy from Q(s,a). 3. Update Q for state S and action A (update rule below). 4. S ← S', A ← A'. (Slide traces one update on the Robodog example, starting from a Q(S, A) table initialized to zero.)
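The update in step 3 appears only as an image; it is the standard Sarsa rule Q(S, A) ← Q(S, A) + α[R + γ Q(S', A') - Q(S, A)]. A sketch of one episode, reusing the epsilon_greedy helper above; the env interface (reset() → state, step(action) → (next_state, reward, done)) and the step size α are assumptions:

```python
def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
    """One episode of Sarsa (on-policy TD control); updates Q in place."""
    state = env.reset()
    action = epsilon_greedy(Q, state, actions, epsilon)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy(Q, next_state, actions, epsilon)
        # The TD target bootstraps from the action A' the policy actually chose in S'
        target = reward if done else reward + gamma * Q[(next_state, next_action)]
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state, action = next_state, next_action
```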
  18. 18. Q-Learning. Initialize Q(s,a). For each episode: • Start in a random state, S. • While S is not terminal: 1. Choose action A from S using 𝛆-greedy policy from Q(s,a). 2. Take action A, observe reward R and new state S'. 3. Update Q for state S and action A (update rule below). 4. S ← S'. (Slide traces one update on the Robodog example, starting from a Q(S, A) table initialized to zero.)
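Step 3 here is the standard Q-Learning rule Q(S, A) ← Q(S, A) + α[R + γ max_a' Q(S', a') - Q(S, A)]: unlike Sarsa, the target bootstraps from the greedy action in S' rather than the action the behavior policy takes next, which is what makes it off-policy. A sketch under the same assumptions as the Sarsa example:

```python
def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
    """One episode of Q-learning (off-policy TD control); updates Q in place."""
    state = env.reset()
    done = False
    while not done:
        action = epsilon_greedy(Q, state, actions, epsilon)
        next_state, reward, done = env.step(action)
        # The TD target uses the best action in S', not the one the policy will take
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
```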
  19. 19. Part 2 State-of-the-art methods
  20. 20. Reinforcement Learning Methods Model-based Model-free Transition Model On-policy Off-policy Sample Model Value-based Policy-based Dynamic Programming Dyna-Q Sarsa Q-Learning Monte Carlo Temporal Difference
  21. 21. Dyna-Q. Initialize Q(s,a) and Model(s,a). For each episode: • Start in a random state, S. • While S is not terminal: 1. Choose action A from S using 𝛆-greedy policy from Q(s,a). 2. Take action A, observe reward R and new state S'. 3. Update Q for state S and action A (ordinary Q-Learning). 4. Update Model for state S and action A. 5. "Hallucinate" n transitions and use them to update Q. (Slide traces an example transition for Robodog and the resulting Q(S, A) and Model(S, A) tables; a sketch of the planning step follows below.)
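A sketch of the planning loop in steps 4-5. The model format (a dict mapping each observed (state, action) pair to the last (reward, next_state) seen, as in step 4) and the number of hallucinated updates n are illustrative assumptions:

```python
import random

def dyna_q_planning(Q, model, actions, n=10, alpha=0.1, gamma=0.9):
    """Replay n transitions from the learned sample model and apply ordinary
    Q-learning updates to them ("hallucinated" experience)."""
    for _ in range(n):
        # Pick a previously observed state-action pair and ask the model what happened
        state, action = random.choice(list(model.keys()))
        reward, next_state = model[(state, action)]
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```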
  22. 22. Dyna-Q
  23. 23. Deep Reinforcement Learning. (Slide contrasts a tabular representation that lists Q(s, a) for every state-action pair with a black box that takes the state s as input and outputs Q(s, a) for each action a.)
  24. 24. Reinforcement Learning Methods Model-based Model-free Transition Model On-policy Off-policy Sample Model Value-based Policy-based Dynamic Programming Dyna-Q Monte Carlo Tree Search Sarsa Q-Learning Deep Q Networks* Monte Carlo Methods Temporal Difference Methods * Utilize deep learning
  25. 25. Deep Q Networks (DQN). The network is a black box that takes the state s (e.g. a tic-tac-toe board) and outputs Q(s, a) for each action a: how good is it to take this action from this state? 1. Initialize network. 2. Take one action under Q policy. 3. Add the new information (s, a, r, s') to the training data. 4. Use stochastic gradient descent to update the weights based on the loss between the prediction ŷ and the target y. Repeat steps 2 - 4 until convergence.
  26. 26. Deep Q Networks (DQN). 1. Initialize network. 2. Take one action under Q policy. 3. Add new information to training data. 4. Use stochastic gradient descent to update weights. Problem: the data are not i.i.d., and they are collected based on an evolving policy, not the optimal policy that we are trying to learn. Solution: create a replay buffer of size k to take small samples from. Problem: instability is introduced when updating Q(s, a) using Q(s', a'). Solution: have a secondary target network used to evaluate Q(s', a') and only sync it with the primary network after every n training iterations. (A sketch combining both fixes follows below.)
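A minimal PyTorch-style sketch of one training step combining both fixes (sampling from a replay buffer and bootstrapping from a frozen target network). The network objects, batch size, and the assumption that states are encoded as fixed-length float vectors are illustrative, not from the slides:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Replay buffer: sampling old transitions breaks the correlation in the data
buffer = deque(maxlen=100_000)  # holds (state, action, reward, next_state, done) tuples

def dqn_update(policy_net, target_net, optimizer, batch_size=32, gamma=0.99):
    """One DQN gradient step on a random minibatch from the replay buffer."""
    batch = random.sample(buffer, batch_size)
    s, a, r, s_next, done = map(torch.tensor, zip(*batch))
    # Q(s, a) from the primary network, for the actions that were actually taken
    q_sa = policy_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    # Bootstrapped target uses the frozen target network for stability
    with torch.no_grad():
        best_next = target_net(s_next.float()).max(1).values
        target = r.float() + gamma * best_next * (1 - done.float())
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every n training iterations, sync the target network with the primary one:
# target_net.load_state_dict(policy_net.state_dict())
```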
  27. 27. Reinforcement Learning Methods Model-based Model-free Transition Model On-policy Off-policy Sample Model Value-based Policy-based Dynamic Programming Dyna-Q Monte Carlo Tree Search Sarsa Q-Learning Deep Q Networks* REINFORCE* Monte Carlo Methods Temporal Difference Methods * Utilize deep learning
  28. 28. REINFORCE. The network is a black box that takes the state s and outputs 𝛑(a|s) for each action a: what is the probability of taking this action under policy 𝛑? 1. Initialize network. 2. Play out a full episode under 𝛑, collecting the rewards r1, r2, ..., r8. 3. For every step t, calculate the return from that state until the end. 4. Use stochastic gradient descent to update the weights based on the loss shown on the slide (a sketch follows below). Repeat steps 2 - 4 until convergence.
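A sketch in the same PyTorch style. The loss in step 4 is taken to be the usual REINFORCE objective, the negative of Σ_t G_t log 𝛑(a_t|s_t); the episode format and the assumption that policy_net outputs raw logits are illustrative:

```python
import torch

def reinforce_update(policy_net, optimizer, episode, gamma=0.99):
    """One REINFORCE (Monte Carlo policy gradient) update from a full episode.

    episode is a list of (state, action, reward), with states as float vectors.
    """
    states, actions, rewards = zip(*episode)
    # Step 3: discounted return G_t from each step until the end of the episode
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # Step 4: ascend sum_t G_t * log pi(a_t | s_t), i.e. minimize its negative
    log_probs = torch.log_softmax(policy_net(torch.tensor(states).float()), dim=1)
    chosen = log_probs.gather(1, torch.tensor(actions).unsqueeze(1)).squeeze(1)
    loss = -(returns * chosen).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```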
  29. 29. DQN vs REINFORCE
    ● Learning: DQN is off-policy; REINFORCE is on-policy
    ● Updates: DQN uses temporal difference; REINFORCE uses Monte Carlo
    ● Output: DQN outputs Q(s,a) (value-based); REINFORCE outputs 𝛑(a|s) (policy-based)
    ● Action spaces: DQN handles small discrete spaces only; REINFORCE handles large discrete or continuous spaces
    ● Exploration: DQN uses 𝛆-greedy; REINFORCE's exploration is built in due to the stochastic policy
    ● Convergence: DQN is slower to converge; REINFORCE is faster to converge
    ● Experience: DQN needs less experience; REINFORCE needs more experience
  30. 30. Reinforcement Learning Methods Model-based Model-free Transition Model On-policy Off-policy Sample Model Value-based Policy-based Dynamic Programming Dyna-Q Monte Carlo Tree Search Sarsa Q-Learning Deep Q Networks* REINFORCE* Monte Carlo Methods Temporal Difference Methods * Utilize deep learning Advantage Actor-Critic*
  31. 31. Q Actor-Critic. One network with common layers that take the state s, a policy net head that outputs 𝛑(a|s) for each action a, and a value net head that outputs Q(s,a) for each action a. Actor: policy-based like REINFORCE, but can now use temporal difference learning. Critic: value-based, works sort of like DQN.
  32. 32. Quick review: Q-Learning → DQN (adds the ability to generalize values in the state space) → REINFORCE (adds the ability to control in continuous action spaces using a stochastic policy) → Q Actor-Critic (adds one-step updates) → A2C (reduces variability in gradients).
  33. 33. Advantage vs action value: the advantage is A(S, A) = Q(S, A) - V(S), i.e. how much better taking action A is than the overall value of state S.
  34. 34. Advantage Actor-Critic (A2C). Same layout as the Q Actor-Critic (common layers, a policy net head for 𝛑(a|s), and a value net head), but the value head now outputs V(s). Actor: policy-based like REINFORCE; can now use temporal difference learning and a baseline. Critic: value-based, now learns the value of states instead of state-action pairs. (A sketch of the update follows below.)
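A sketch of a one-step update in the same style, where the critic's TD error serves as the advantage estimate. The combined model interface (returning action logits and V(s) from shared layers) matches the slide's picture, but the exact loss weighting is an assumption:

```python
import torch

def a2c_update(model, optimizer, s, a, r, s_next, done, gamma=0.99, value_coef=0.5):
    """One-step advantage actor-critic update for a single transition.

    model(state) is assumed to return (action logits, V(state)).
    """
    logits, value = model(s)
    with torch.no_grad():
        _, next_value = model(s_next)
        td_target = r + gamma * next_value * (1 - done)
    advantage = td_target - value                  # TD error = advantage estimate
    log_prob = torch.log_softmax(logits, dim=-1)[a]
    actor_loss = -advantage.detach() * log_prob    # push policy toward advantageous actions
    critic_loss = advantage.pow(2)                 # regress V(s) toward the TD target
    loss = (actor_loss + value_coef * critic_loss).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```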
  35. 35. Part 3 Current state of reinforcement learning
  36. 36. Current state of reinforcement learning. Mostly in academia or research-focused companies, e.g. DeepMind, OpenAI; the most impressive progress has been made in games. Driverless cars, robotics, etc. are still largely not using RL: "The rule-of-thumb is that except in rare cases, domain-specific algorithms work faster and better than reinforcement learning."1 Barriers to entry: ● Too much real-world experience required. "Reinforcement learning is a type of machine learning whose hunger for data is even greater than supervised learning. It is really difficult to get enough data for reinforcement learning algorithms. There's more work to be done to translate this to businesses and practice." - Andrew Ng ● Simulation is often not realistic enough ● Poor convergence properties ● There has not been enough development in transfer learning for RL models ○ Models do not generalize well outside of what they are trained on. 1 Irpan, Alex. "Deep Reinforcement Learning Doesn't Work Yet." Sorta Insightful, 14 Feb. 2018.
  37. 37. Promising applications of RL (that aren’t games) Energy Finance Healthcare Some aspects of robotics NLP Computer systems Traffic light control Assisting GANs Neural network architecture Computer vision Education Recommendation systems Science & Math
  38. 38. References
    Clark, Jack. "Faulty Reward Functions in the Wild." OpenAI, 21 Dec. 2016.
    Fridman, Lex (2015). MIT: Introduction to Deep Reinforcement Learning. https://www.youtube.com/watch?v=zR11FLZ-O9M
    Fullstack Academy (2017). Monte Carlo Tree Search Tutorial. https://www.youtube.com/watch?v=Fbs4lnGLS8M
    Irpan, Alex. "Deep Reinforcement Learning Doesn't Work Yet." Sorta Insightful, 14 Feb. 2018.
    Lapan, M. (2018). Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More. Birmingham, UK: Packt Publishing.
    Silver, David (2015). University College London Reinforcement Learning Course. Lecture 7: Policy Gradient Methods.
    Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press.
    Towards Data Science. "Applications of Reinforcement Learning in Real World", 1 Aug 2018.

Editor's Notes

  • Environment → states of the environment
  • ****Describe image!!****

    Discount factor prevents infinite return

    Value vs policy based methods
  • drama/sci-fi
  • State-action pair

    Dynamics
  • Sample model - by learning and planning we are often able to do better than we would with just learning alone
  • Only applies to single agent fully observable MDPs
  • Reward can propagate backwards
  • Almost all reinforcement learning methods are well described as generalized policy iteration

  • Monte Carlo = low bias, high variance
    Temporal difference methods = higher bias, lower variance (and don’t need complete episodes in order to learn)

    Lower variance is often better!
  • Major consideration in all RL algorithms

    Greedy action = action that we currently believe has the most value

    Decrease epsilon over time
  • Now, we no longer know the dynamics of Robodog
  • What is the advantage/disadvantage of off-policy vs on-policy?
  • Q-Learning and Sarsa were developed in the late 1980s. While not state-of-the-art, as they only work for small state and action spaces, they laid the foundation for some of the modern reinforcement learning methods covered in Part 2
  • ***Learning from experience can be expensive***

    It is not necessarily the best scheme to use the last observed reward and new state for our model
    Q updates from the sample model get more interesting once more state-action pairs have been observed
  • add reference

  • While tabular methods would be memory intensive for large state spaces, the bigger issue is the time it would take to visit all states and observe and update their values - we need the ability to generalize

    With deep RL, we can have some idea of the value of a state even if we’ve never seen it before
  • Developed by DeepMind in 2014
  • Stochastic gradient descent needs iid data

    A lot of work has been done since 2015 to make these networks even better and more efficient
  • G is an unbiased estimate of the true Q

    Loss function drives policy towards actions with positive reward and away from actions with negative reward

    Major issues: noisy gradients (due to randomness of samples), high variance ---> unstable learning and possibly suboptimal policy
  • for each learning step, we upgrade policy net towards actions that the critic says are good, and update the value net to match the change in the actor’s policy

    ---> policy iteration
  • We can swap A in for Q in our loss function without changing the direction of the gradients, while greatly reducing variance

    A2C was introduced by OpenAI, and the asynchronous version (A3C) was developed by DeepMind
  • DeepMind has supposedly reduced Google’s energy consumption by 50%
    NLP: Salesforce used RL among other text generation models to write high-quality summaries of long text.
    JPMorgan using RL robot to execute trades at opportune times
    Healthcare - optimization of treatment for patients with chronic disease, deciphering medical images
    Improving output of GANs by making output adhere to standard rules
