
# Reinforcement Learning: A Beginner's Tutorial


This is a presentation of a Reinforcement Learning tutorial for beginners that I worked on.


### Reinforcement Learning: A Beginner's Tutorial

1. Reinforcement Learning: A Beginner's Tutorial. By: Omar Enayet (Presentation Version)
2. The Problem
3. Agent-Environment Interface
4. Environment Model
5. Goals & Rewards
6. Returns
7. Credit-Assignment Problem
8. Markov Decision Process. An MDP is defined by ⟨S, A, p, r, γ⟩: S is the set of states of the environment; A(s) is the set of actions possible in state s; p is the probability of transition from s; r is the expected reward when executing a in s; γ is the discount rate for expected rewards. Assumption: discrete time t = 0, 1, 2, …, yielding the interaction trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, …
9. Value Functions
10. Value Functions (cont.)
11. Value Functions (cont.)
12. Optimal Value Functions
13. Exploration-Exploitation Problem
14. Policies
15. Elementary Solution Methods
16. Dynamic Programming
17. Perfect Model
18. Bootstrapping
19. Generalized Policy Iteration
20. Efficiency of DP
21. Monte-Carlo Methods
22. Episodic Return
23. Advantages over DP: no model needed; a simulation (or part of a model) suffices; can focus on a small subset of states; less harmed by violations of the Markov property
24. First-Visit vs. Every-Visit
25. On-Policy vs. Off-Policy
26. Action-Value Instead of State-Value
27. Temporal-Difference Learning
28. Advantages of TD Learning
29. SARSA (On-Policy)
30. Q-Learning (Off-Policy)
31. (Figure-only slide)
32. Actor-Critic Methods (On-Policy)
33. R-Learning (Off-Policy): average expected reward per time-step
34. Eligibility Traces
35. (Figure-only slide)
36. (Figure-only slide)
37. References: Richard S. Sutton and Andrew G. Barto, *Reinforcement Learning: An Introduction*, Bradford Books, 1998; Richard Crouch, Peter Bennett, Stephen Bridges, Nick Piper and Robert Oates, *Monte Carlo*, 2003. Slides intended for reading with: Omar Enayet, *Reinforcement Learning: A Beginner's Tutorial*, 2009.
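The SARSA and Q-Learning slides can be made concrete with a small tabular sketch. Everything below (a hypothetical three-state chain environment, its rewards, and the hyperparameters) is an illustrative assumption of mine, not taken from the deck:

```python
import random

# Hypothetical 3-state chain: states 0, 1, 2; state 2 is terminal.
# Reaching the terminal state yields reward 1.0, every other step 0.0.
ACTIONS = [0, 1]  # 0 = left, 1 = right

def step(state, action):
    """Sample model of the environment: return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    if next_state == 2:
        return next_state, 1.0, True
    return next_state, 0.0, False

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy behaviour policy: exploit the current best
            # estimate most of the time, explore at random otherwise.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            s2, r, done = step(s, a)
            # Off-policy TD target: max over next actions (Q-learning).
            # SARSA would instead use the action the policy actually takes next.
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
# From state 1, going right reaches the goal, so Q(1, right) approaches 1.0
# and Q(0, right) approaches gamma * 1.0 = 0.9.
print(Q[(1, 1)], Q[(0, 1)])
```

Switching the target line to use the successor action actually selected turns this into SARSA; because Q-learning's target ignores the exploratory behaviour policy, it is off-policy, as slide 32 notes.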

### Editor's Notes

• By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions. Given a state and an action, a model produces a prediction of the resultant next state and next reward. If the model is stochastic, then there are several possible next states and next rewards, each with some probability of occurring. Some models produce a description of all possibilities and their probabilities; these we call distribution models. Other models produce just one of the possibilities, sampled according to the probabilities; these we call sample models. For example, consider modeling the sum of a dozen dice. A distribution model would produce all possible sums and their probabilities of occurring, whereas a sample model would produce an individual sum drawn according to this probability distribution.
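The dozen-dice example can be sketched directly; function names here are mine, not from the text:

```python
import random

def distribution_model(n_dice=12):
    """Distribution model: every possible sum of n_dice fair dice,
    mapped to its exact probability of occurring (built by repeated
    convolution, so no enumeration of all 6**12 rolls is needed)."""
    dist = {0: 1.0}
    for _ in range(n_dice):
        nxt = {}
        for total, p in dist.items():
            for face in range(1, 7):
                nxt[total + face] = nxt.get(total + face, 0.0) + p / 6
        dist = nxt
    return dist  # keys run 12..72; probabilities sum to 1

def sample_model(n_dice=12, rng=random):
    """Sample model: a single sum drawn according to that same distribution."""
    return sum(rng.randint(1, 6) for _ in range(n_dice))
```

A planning method like dynamic programming needs the distribution model; Monte-Carlo methods, as the slides note, only need the sample model.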
• Credit assignment problem: How do you distribute credit for success among the many decisions that may have been involved in producing it?
• One of the challenges that arise in reinforcement learning and not in other kinds of learning is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been intensively studied by mathematicians for many decades.
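The epsilon-greedy rule used throughout the deck is the simplest resolution of this dilemma. A minimal multi-armed-bandit sketch, assuming hypothetical Gaussian arms with made-up means:

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=1):
    """Epsilon-greedy on a Gaussian bandit: exploit the arm with the best
    current estimate most of the time, explore a random arm otherwise."""
    rng = random.Random(seed)
    n = len(true_means)
    estimates = [0.0] * n   # incremental sample-average value estimates
    counts = [0] * n
    for _ in range(steps):
        if rng.random() < epsilon:                      # explore
            arm = rng.randrange(n)
        else:                                           # exploit
            arm = max(range(n), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Illustrative arm means; the best arm (0.9) should end up pulled most often.
est, counts = epsilon_greedy_bandit([0.1, 0.5, 0.9])
```

With epsilon = 0, the agent can lock onto a mediocre arm forever; with epsilon = 1, it never benefits from what it has learned, which is exactly the dilemma described above.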