Deep Q-Learning


Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.


  1. Deep Q-Learning: A Reinforcement Learning approach
  2. What is Reinforcement Learning? - Much like how biological agents behave - No supervisor, only a reward signal - Data is time-dependent (non-i.i.d.) - Feedback is delayed - The agent's actions affect the data it receives
  3. Examples - Play checkers (1959) - Defeat the world champion at Backgammon (1992) - Control a helicopter (2008) - Make a robot walk - RoboCup Soccer - Play ATARI games better than humans (2014) - Defeat the world champion at Go (2016) - Videos
  4. Reward Hypothesis: all goals can be described by the maximisation of expected cumulative reward - Defeat the world champion at Go: +R / -R for winning / losing a game - Make a robot walk: +R for moving forward, -R for falling over - Play ATARI games: +R / -R for increasing / decreasing the score - Control a helicopter: +R / -R for following the trajectory / crashing
  5. Agent and Environment
  6. Fully vs. Partially Observable Environments - Fully observable (agent state = environment state): the agent directly observes the environment, e.g. a chess board - Partially observable (agent state ≠ environment state): the agent indirectly observes the environment, e.g. a robot with a motion sensor or a camera; the agent must construct its own state representation
  7. RL components: Policy and Value Function - Policy is the agent's behaviour function - Maps from state to action - Can be deterministic or stochastic - Value function is a prediction of future reward - Used to evaluate states and select between actions (formulas reconstructed below)
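The policy and value-function formulas on this slide were images and did not survive the export; the definitions they most likely showed, in the standard Sutton & Barto / David Silver notation, are:

    a = \pi(s)                                                        (deterministic policy)
    \pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]                  (stochastic policy)
    v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]   (value function)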
  8. Model - Predicts what the environment will do next (see the reconstruction below)
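The model formulas were likewise images; in the same notation, a model has a state-transition part and a reward part:

    \mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]
    \mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]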
  9. Maze example: r = -1 per time-step, and policy [David Silver. Advanced Topics: RL]
  10. Maze example: Value function and Model [David Silver. Advanced Topics: RL]
  11. Exploration-Exploitation dilemma
  12. Math: Markov Decision Process (MDP) - Almost all RL problems can be formalised as MDPs - An MDP is a tuple (S, A, P, R, γ): - S is a finite set of states - A is a finite set of actions - P is the state transition probability matrix - R is a reward function - γ is a discount factor (written out below)
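Spelled out (a reconstruction in standard notation, since the slide's formulas are missing): the tuple is \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle with \gamma \in [0, 1], and the quantity being maximised is the discounted return

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}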
  13. State-Value and Action-Value functions, Bellman equations - State-value function: expected return starting from state s and then following policy π - Action-value function: expected return starting from state s, taking action a, and then following policy π (see the definitions below)
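The missing definitions and the Bellman expectation equations behind this slide are, in standard notation:

    v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]
    q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]
    v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s]
    q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]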
  14. Finding an Optimal Policy - There is always an optimal policy for any MDP - All optimal policies achieve the optimal value function - All optimal policies achieve the optimal action-value function - All you need is to find the optimal action-value function q*(s, a)
  15. Bellman Optimality Equation for the state-value function [David Silver. Advanced Topics: RL]
  16. Bellman Optimality Equation for the action-value function [David Silver. Advanced Topics: RL]
  17. Bellman Optimality Equation for the state-value function [David Silver. Advanced Topics: RL]
  18. Bellman Optimality Equation for the action-value function [David Silver. Advanced Topics: RL]
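Slides 15-18 show these equations as figures from David Silver's lectures; written out, the Bellman optimality equations are:

    v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'}\, v_*(s') \right)
    q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')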
  19. Policy Iteration Demo
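The demo itself is not part of the text export. Below is a minimal policy-iteration sketch on a tiny randomly generated MDP, meant only to illustrate the evaluate-then-improve loop; the MDP, its size and the discount factor are invented for illustration and are not the demo from the slide.

    # A minimal policy-iteration sketch on a tiny randomly generated MDP.
    import numpy as np

    n_states, n_actions, gamma = 4, 2, 0.9
    rng = np.random.default_rng(0)

    # P[a, s, s'] = transition probability, R[s, a] = expected immediate reward
    P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
    R = rng.normal(size=(n_states, n_actions))

    policy = np.zeros(n_states, dtype=int)        # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v as a linear system
        P_pi = P[policy, np.arange(n_states)]
        R_pi = R[np.arange(n_states), policy]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

        # Policy improvement: act greedily w.r.t. one-step lookahead q-values
        q = R.T + gamma * P @ v                    # q[a, s]
        new_policy = q.argmax(axis=0)
        if np.array_equal(new_policy, policy):     # policy stable => optimal
            break
        policy = new_policy

    print("optimal policy:", policy)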
  20. Q-Learning - a model-free, off-policy control algorithm - Model-free (vs model-based): the MDP model is unknown but experience can be sampled, or the MDP model is known but too big to use except by samples - Off-policy (vs on-policy): can learn about a policy from experience sampled from some other policy - Control (vs prediction): find the best policy
  21. Q-Learning [David Silver. Advanced Topics: RL]
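For reference, a minimal tabular implementation consistent with the Q-learning update Q(s,a) ← Q(s,a) + α (r + γ max_a' Q(s',a') - Q(s,a)) might look as follows. It assumes a Gymnasium-style environment with discrete observations and actions (e.g. FrozenLake-v1); that environment choice is an assumption for illustration, not something used in the slides.

    # A minimal tabular Q-learning sketch (assumes a Gymnasium-style env
    # with discrete observation and action spaces).
    import numpy as np

    def q_learning(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = np.zeros((env.observation_space.n, env.action_space.n))
        for _ in range(episodes):
            s, _ = env.reset()
            done = False
            while not done:
                # epsilon-greedy behaviour policy: explore with probability epsilon
                a = env.action_space.sample() if np.random.rand() < epsilon else int(Q[s].argmax())
                s2, r, terminated, truncated, _ = env.step(a)
                done = terminated or truncated
                # off-policy target: bootstrap from the greedy action in the next state
                target = r if terminated else r + gamma * Q[s2].max()
                Q[s, a] += alpha * (target - Q[s, a])
                s = s2
        return Q

    # Usage (hypothetical): Q = q_learning(gymnasium.make("FrozenLake-v1"))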
  22. DQN - Q-Learning with function approximation [Human-level control through deep reinforcement learning]
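Function approximation replaces the Q-table with a network Q(s, a; θ). The per-iteration loss minimised in the Nature paper is, up to notation:

    L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2 \right]

where D is the replay memory and \theta_i^- are the parameters of the frozen target network.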
  23. [Human-level control through deep reinforcement learning]
  24. Issues with Q-learning with a neural network - Data is sequential (non-i.i.d.) - Policy changes rapidly with slight changes to Q-values - Policy may oscillate - Experience flows from one extreme to another - Scale of rewards and Q-values is unknown - Unstable backpropagation due to large gradients
  25. DQN solutions - Use experience replay: breaks correlations in the data, and lets the agent learn from all past policies (possible because Q-learning is off-policy) - Freeze the target Q-network: avoids policy oscillations and breaks correlations between the Q-network and the target - Clip rewards and gradients (a sketch of these tricks follows)
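A compact sketch of how these tricks fit together, using PyTorch. The network size, optimiser settings and the flat-vector observation interface are assumptions made for illustration; this is not the simple_dqn / Neon code linked at the end.

    # Experience replay + frozen target network + reward clipping, in PyTorch.
    import random
    from collections import deque
    import numpy as np
    import torch
    import torch.nn as nn

    class ReplayBuffer:
        """Experience replay: store transitions, sample them uniformly later."""
        def __init__(self, capacity=100_000):
            self.buf = deque(maxlen=capacity)
        def push(self, s, a, r, s2, done):
            self.buf.append((s, a, r, s2, done))
        def sample(self, batch_size):
            s, a, r, s2, d = zip(*random.sample(self.buf, batch_size))
            return (torch.as_tensor(np.array(s), dtype=torch.float32),
                    torch.as_tensor(a, dtype=torch.int64),
                    torch.as_tensor(r, dtype=torch.float32),
                    torch.as_tensor(np.array(s2), dtype=torch.float32),
                    torch.as_tensor(d, dtype=torch.float32))

    def make_q_net(obs_dim, n_actions):
        return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    obs_dim, n_actions, gamma = 4, 2, 0.99
    q_net = make_q_net(obs_dim, n_actions)
    target_net = make_q_net(obs_dim, n_actions)
    target_net.load_state_dict(q_net.state_dict())      # frozen copy of the Q-network
    optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
    buffer = ReplayBuffer()

    def train_step(batch_size=32):
        s, a, r, s2, done = buffer.sample(batch_size)    # replay breaks correlations in the data
        r = r.clamp(-1.0, 1.0)                           # reward clipping
        with torch.no_grad():                            # bootstrap from the frozen target network
            target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
        q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.smooth_l1_loss(q, target)   # Huber loss keeps gradients bounded
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Every N training steps, refresh the frozen copy:
    #   target_net.load_state_dict(q_net.state_dict())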
  26. Neon Demo
  27. Links - Human-level control through deep reinforcement learning - Course: David Silver. Advanced Topics: RL - Tutorial: David Silver. Deep Reinforcement Learning - Book: Sutton, Barto. Reinforcement Learning - Source code: simple_dqn - Reinforcejs - The Arcade Learning Environment
