
Reinforcement Learning: A Beginner's Tutorial


  1. Reinforcement Learning: A Beginner's Tutorial. By: Omar Enayet (Presentation Version)
  2. The Problem
  3. Agent-Environment Interface
  4. Environment Model
  5. Goals & Rewards
  6. Returns
  7. Credit-Assignment Problem
  8. Markov Decision Process. An MDP is defined by ⟨S, A, p, r, γ⟩: S, the set of states of the environment; A(s), the set of actions possible in state s; p(s'|s, a), the probability of transitioning from s to s' when executing a; r(s, a), the expected reward when executing a in s; γ, the discount rate for expected reward. Assumption: discrete time t = 0, 1, 2, . . . The resulting trajectory is s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, . . .
  9. Value Functions
  10. Value Functions
  11. Value Functions
  12. Optimal Value Functions
  13. Exploration-Exploitation Problem
  14. Policies
  15. Elementary Solution Methods
  16. Dynamic Programming
  17. Perfect Model
  18. Bootstrapping
  19. Generalized Policy Iteration
  20. Efficiency of DP
  21. Monte-Carlo Methods
  22. Episodic Return
  23. Advantages over DP:
      - No model required
      - Simulation OR part of a model suffices
      - Can focus on a small subset of states
      - Less harmed by violations of the Markov Property
  26. First-Visit VS Every-Visit
  27. On-Policy VS Off-Policy
  28. Action-value instead of State-value
  29. Temporal-Difference Learning
  30. Advantages of TD Learning
  31. SARSA (On-Policy)
  32. Q-Learning (Off-Policy)
  33.
  34. Actor-Critic Methods (On-Policy)
  35. R-Learning (Off-Policy): maximizes average expected reward per time-step
  36. Eligibility Traces
  37.
  38.
  39. References:
      - Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Bradford Books, 1998.
      - Richard Crouch, Peter Bennett, Stephen Bridges, Nick Piper and Robert Oates. Monte Carlo. 2003.
      Slides intended for reading alongside:
      - Omar Enayet. Reinforcement Learning: A Beginner's Tutorial. 2009.
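Slide 8's definition of an MDP as ⟨S, A, p, r, γ⟩ can be written directly as data. The sketch below is illustrative only: the two-state environment, its transition probabilities, and its rewards are invented for the example, not taken from the slides.

```python
import random

# A tiny MDP <S, A, p, r, gamma> as plain data structures.
# States, actions, probabilities, and rewards are all made up for illustration.
S = ["s0", "s1"]
A = {"s0": ["stay", "go"], "s1": ["stay"]}
gamma = 0.9  # discount rate for expected reward

# p[(s, a)] -> list of (next_state, probability)
p = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
    ("s1", "stay"): [("s1", 1.0)],
}
# r[(s, a)] -> expected reward when executing a in s
r = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "stay"): 2.0}

def step(s, a, rng=random):
    """Sample one transition of the MDP: return (next_state, expected_reward)."""
    states, probs = zip(*p[(s, a)])
    s_next = rng.choices(states, weights=probs)[0]
    return s_next, r[(s, a)]
```

Calling `step` repeatedly from some start state produces exactly the trajectory s_t, a_t, r_{t+1}, s_{t+1}, . . . that the slide depicts.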
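Slides 31 and 32 name SARSA (on-policy) and Q-Learning (off-policy). A minimal sketch of the standard tabular updates they refer to, with `Q` as a plain dict keyed by (state, action) and default step-size/discount values chosen only for illustration:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD update: bootstrap from the BEST next action."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: bootstrap from the action ACTUALLY taken next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```

The only difference between the two updates is the bootstrap target, which is precisely the on-policy/off-policy distinction the slides draw.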

Editor's Notes

  • By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions. Given a state and an action, a model produces a prediction of the resultant next state and next reward. If the model is stochastic, then there are several possible next states and next rewards, each with some probability of occurring. Some models produce a description of all possibilities and their probabilities; these we call distribution models. Other models produce just one of the possibilities, sampled according to the probabilities; these we call sample models. For example, consider modeling the sum of a dozen dice. A distribution model would produce all possible sums and their probabilities of occurring, whereas a sample model would produce an individual sum drawn according to this probability distribution.
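The dozen-dice example above can be made concrete. The sketch below implements both kinds of model; the function names are my own, not from the source:

```python
import random

def sample_model(n_dice=12):
    """Sample model: return ONE possible sum, drawn with its true probability."""
    return sum(random.randint(1, 6) for _ in range(n_dice))

def distribution_model(n_dice=12):
    """Distribution model: return EVERY possible sum with its probability."""
    # Dynamic programming over dice, to avoid enumerating 6**12 outcomes.
    sums = {0: 1}  # sum -> number of ways to reach it
    for _ in range(n_dice):
        nxt = {}
        for s, ways in sums.items():
            for face in range(1, 7):
                nxt[s + face] = nxt.get(s + face, 0) + ways
        sums = nxt
    total = 6 ** n_dice
    return {s: ways / total for s, ways in sums.items()}
```

`sample_model()` answers "what might happen once", while `distribution_model()` answers "what can happen and how likely is each outcome", which is the distinction the note draws.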
  • Credit assignment problem: How do you distribute credit for success among the many decisions that may have been involved in producing it?
  • One of the challenges that arise in reinforcement learning and not in other kinds of learning is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been intensively studied by mathematicians for many decades.
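The "try a variety of actions and progressively favor those that appear to be best" strategy described above is commonly realized as epsilon-greedy action selection. A minimal sketch on a two-armed Gaussian bandit; the arm means, epsilon, and step count are arbitrary choices for the example:

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # incremental sample-average reward estimates
    counts = [0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:                          # explore
            arm = rng.randrange(n_arms)
        else:                                               # exploit
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental sample average: each pull refines the estimate, which is
        # why stochastic tasks need each action tried many times.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts
```

With epsilon = 0 the agent never discovers the better arm; with epsilon = 1 it never profits from what it has learned. Either extreme fails at the task, which is the dilemma the note describes.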