
Deep Reinforcement Learning


Deep Reinforcement Learning talk at PI School, covering the following contents:
1- Deep Reinforcement Learning
2- Q-Learning
3- Deep Q-Learning (DQN)
4- Google DeepMind paper (DQN for Atari)


Deep Reinforcement Learning

  1. Reinforcement Learning. By Usman Qayyum, 13 Nov 2018
  2. Machine Learning Expert? Supervised learning suffers from the underlying human bias present in the data.
  3. Machine Learning
     ● Supervised Learning: example → class (classification, regression)
     ● Reinforcement Learning: situation → reward, situation → reward, … (Q-learning, DQN, policy gradient, actor-critic)
     ● Un-Supervised Learning: example only (clustering, auto-encoder)
  4. Human Learning (Trial & Error) ● Achieves the goal vs. fails to achieve the goal ● Example: a baby starts walking and successfully reaches the couch
  5. Reinforcement Learning ● Trial-and-error learning ● Learning from interaction ● Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal
  6. How to Formulate an RL Problem ● Environment: the physical world in which the agent operates ● State: the current situation of the agent ● Action: how the agent interacts with the environment ● Reward: feedback from the environment ● Policy: the method that maps the agent's state to actions ● Value: the future reward an agent would receive by taking an action in a particular state
  7. RL Applications (Games / Networking)
     ● Objective: complete the game with the highest score; State: raw pixel inputs of the game screen; Action: game controls, e.g. left, right, up, down; Reward: score increase/decrease at each time step
     ● Objective: win the game; State: position of all pieces; Action: where to put the next piece down; Reward: 1 if the game is won at the end, 0 otherwise
     ● Objective: intelligent channel selection; State: occupation of each channel in the current time slot; Action: the channel to be used in the next time slot; Reward: +1 if there is no collision with an interferer, -1 otherwise
  8. Markov Decision Process
  9. Markov Decision Process ● An MDP is used to describe an environment for reinforcement learning ● Almost all RL problems can be formalized as MDPs ● The Markov property states that "the future is independent of the past given the present": P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t] ● Related concepts: Markov chain, transition matrix, Markov reward process
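As a small illustration of a transition matrix (not from the slides; the states and probabilities below are made up), the next-state distribution of a Markov chain is just a vector-matrix product:

```python
import numpy as np

# Hypothetical 3-state Markov chain: rows = current state, columns = next state.
# Each row sums to 1.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.0, 0.4, 0.6],
])

start = np.array([1.0, 0.0, 0.0])              # start deterministically in state 0
print(start @ P)                               # state distribution after one step
print(start @ np.linalg.matrix_power(P, 10))   # state distribution after ten steps
```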
  10. Model-Based / Model-Free Learning
  11. Environment (Taxi Game) ● Representations: WALL (can't pass through; the taxi stays in the same position), Yellow (taxi's current location), Blue (pick-up location), Purple (drop-off location), Green (the taxi turns green once the passenger boards)
  12. Q-Learning
     ● A Q-table is just a fancy name for a simple lookup table in which we store the maximum expected future reward for each action at each state. The questions are: how do we calculate the values of the Q-table, and are they available or predefined?
     ● States: 500
     ● Actions: 0 move south, 1 move north, 2 move east, 3 move west, 4 pick up passenger, 5 drop off passenger
     ● Rewards: +20 for successfully picking the passenger up and dropping them off at the desired location; -1 for each step; -10 every time the passenger is incorrectly picked up or dropped off
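As an illustration, the Q-table for this environment is just a 500 × 6 array of zeros. A minimal sketch, assuming the classic OpenAI Gym API and the Taxi-v3 environment id (the exact id may differ between Gym versions):

```python
import gym
import numpy as np

env = gym.make("Taxi-v3")   # 500 discrete states, 6 discrete actions

# One row per state, one column per action; every Q-value starts at 0.
q_table = np.zeros((env.observation_space.n, env.action_space.n))
print(q_table.shape)        # (500, 6)
```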
  13. Q-Learning (continued) ● Step 1: when the episode initially starts, every Q-value is 0
  14. Q-Learning (continued) ● Steps 2 & 3: choose and perform an action ● In the beginning the agent explores the environment and chooses actions at random; as it learns more about the environment, it increasingly exploits what it has learned (see the epsilon-greedy sketch below)
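A common way to implement this exploration/exploitation trade-off is an epsilon-greedy rule. A minimal sketch (the epsilon schedule mentioned in the comment is an illustrative choice, not a value from the talk):

```python
import random
import numpy as np

def choose_action(q_table, state, epsilon, n_actions):
    """Epsilon-greedy action selection."""
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore: pick a random action
    return int(np.argmax(q_table[state]))    # exploit: pick the best known action

# Epsilon is typically decayed over episodes (e.g. from 1.0 towards 0.05),
# so the agent explores early and exploits later.
```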
  15. Q-Learning (continued) ● Steps 4 & 5: measure the reward and update the Q-table ● The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a). The update is Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)], where α is the learning rate and γ is the discount factor weighting future reward
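For a single transition (state, action, reward, next_state), this update is one line of code. A sketch with illustrative values for α and γ:

```python
import numpy as np

alpha = 0.1   # learning rate
gamma = 0.9   # discount factor (weight on future reward)

def update_q(q_table, state, action, reward, next_state):
    """One Q-learning (Bellman) update for a single transition."""
    best_next = np.max(q_table[next_state])    # max_a' Q(s', a')
    td_target = reward + gamma * best_next     # r + gamma * max_a' Q(s', a')
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```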
  16. Q-Learning to DQN
  17. Google DeepMind (Deep Q-Network) ● "Human-level control through deep reinforcement learning", Nature, 2015
  18. Gym ● A library that can simulate a large number of reinforcement learning environments, including Atari games ● Motivation: the lack of standardization of environments used in publications and the need for better benchmarks
  19. Example: Taxi Game Problem (OpenAI Gym)
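Putting the previous pieces together, a compact tabular Q-learning loop for the Taxi environment could look like the sketch below. It assumes the classic Gym reset/step API (pre-0.26) and uses illustrative hyperparameters; the notebooks from the talk may differ:

```python
import random
import gym
import numpy as np

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))

alpha, gamma = 0.1, 0.9
epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.999

for episode in range(5000):
    state = env.reset()                      # classic Gym API: reset() returns the state
    done = False
    while not done:
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, done, info = env.step(action)

        # Bellman update
        best_next = np.max(q_table[next_state])
        q_table[state, action] += alpha * (reward + gamma * best_next
                                           - q_table[state, action])
        state = next_state

    epsilon = max(epsilon_min, epsilon * epsilon_decay)   # shift from exploring to exploiting
```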
  20. Example-1
  21. Example-2
  22. Example-2 (continued)
  23. Deep Q-Network ● "Human-level control through deep reinforcement learning", Nature, Vol. 518, 26 Feb 2015 ● By Usman Qayyum, 15 Nov 2018
  24. (image-only slide)
  25. Model-Free RL (Recap) ● Policy-based RL: search directly for the optimal policy π*, i.e. the policy achieving maximum future reward ● Value-based RL: estimate the optimal value function Q*(s, a), i.e. the maximum value achievable under any policy
  26. Q-Learning to DQN (Value-Based RL) ● The Q-table is like a "cheat sheet" that helps us find the maximum expected future reward of an action, given a current state ● This is a good strategy; however, it is not scalable (for raw-pixel inputs such as Atari screens, the state space is far too large to enumerate in a table)
  27. Playing Atari with Deep RL (Nature, 2015) ● Played seven Atari 2600 games ● Beat previous ML approaches on six ● Beat a human expert on three ● Aim: create a single neural-network agent that is able to successfully learn to play as many of the games as possible ● Learns strictly from experience, with no pre-training ● Inputs: game screen + score ● No game-specific tuning
  28. What's Next
  29. Atari ● Rules of the game are unknown ● Learn directly from interactive game play ● Pick an action on the joystick, see pixels and the score
  30. Preprocessing & Temporal Limitation
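The slide title refers to the standard DQN-style preprocessing: frames are converted to grayscale and resized (to 84×84 in the paper), and because a single frame cannot convey motion, the last few frames are stacked together as the network input. A rough sketch, assuming RGB frames and OpenCV for the image operations:

```python
from collections import deque

import cv2
import numpy as np

def preprocess(frame):
    """Convert a raw RGB Atari frame to an 84x84 grayscale image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

# Keep the last 4 preprocessed frames; stacking them lets the network see motion.
frame_stack = deque(maxlen=4)

def stack_frames(frame):
    frame_stack.append(preprocess(frame))
    while len(frame_stack) < 4:               # pad at the start of an episode
        frame_stack.append(frame_stack[-1])
    return np.stack(frame_stack, axis=-1)     # shape: (84, 84, 4)
```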
  31. Convolution Layers / Fully Connected ● Frames are processed by three convolution layers ● These layers allow the network to exploit spatial relationships in the images ● Because frames are stacked together, the network can also exploit properties across those frames, i.e. motion over time
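A sketch of the Nature-DQN architecture (three convolution layers followed by a fully connected layer), written here in Keras purely for illustration; the talk's own notebooks may use a different framework:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dqn(n_actions):
    """Nature-DQN style network: 3 conv layers + 1 hidden dense layer."""
    return tf.keras.Sequential([
        layers.Input(shape=(84, 84, 4)),                      # 4 stacked grayscale frames
        layers.Conv2D(32, 8, strides=4, activation="relu"),
        layers.Conv2D(64, 4, strides=2, activation="relu"),
        layers.Conv2D(64, 3, strides=1, activation="relu"),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(n_actions),                              # one Q-value per action
    ])
```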
  32. Experience Replay
     ● Experience replay helps us handle two things:
     ● Avoid forgetting previous experiences: the weights vary a lot because there is high correlation between consecutive actions and states. Solution: create a "replay buffer" that stores experience tuples while interacting with the environment, and then sample a small batch of tuples to feed the neural network.
     ● Reduce correlations between experiences: every action affects the next state, so interaction produces a sequence of highly correlated experience tuples. Solution: sampling from the replay buffer at random breaks this correlation and prevents the action values from oscillating or diverging catastrophically.
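A minimal replay-buffer sketch (the class name and capacity are illustrative, not from the talk):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive experiences.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```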
  33. Clipping Rewards ● Each game has a different score scale: in Pong, for example, a player gets +1 point for winning a rally and -1 otherwise, while in Space Invaders a player gets 10-30 points for defeating invaders ● This difference would make training unstable, so the reward-clipping technique clips scores: all positive rewards are set to +1 and all negative rewards to -1
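In code, clipping is a one-liner:

```python
import numpy as np

def clip_reward(reward):
    """Map any positive reward to +1, any negative reward to -1, and zero to 0."""
    return float(np.sign(reward))
```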
  34. DQN Algorithm
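The slide shows the full algorithm from the paper. Below is a rough sketch of a single training step that ties the previous pieces together: sample from the replay buffer, build TD targets from a periodically updated target network, and take one gradient step on the online network. It reuses the hypothetical build_dqn and ReplayBuffer sketches above and illustrative hyperparameters:

```python
import numpy as np
import tensorflow as tf

gamma = 0.99
online_net = build_dqn(n_actions=4)                 # hypothetical helper from the earlier sketch
online_net.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                   loss=tf.keras.losses.Huber())
target_net = build_dqn(n_actions=4)
target_net.set_weights(online_net.get_weights())
buffer = ReplayBuffer()                             # hypothetical class from the earlier sketch

def train_step(batch_size=32):
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    states = np.array(states, dtype=np.float32)
    next_states = np.array(next_states, dtype=np.float32)

    # TD target: r + gamma * max_a' Q_target(s', a'), with no bootstrapping on terminal states.
    next_q = target_net.predict(next_states, verbose=0)
    targets = online_net.predict(states, verbose=0)
    for i, (a, r, d) in enumerate(zip(actions, rewards, dones)):
        targets[i, a] = r if d else r + gamma * np.max(next_q[i])

    # One gradient step moving the online network towards the TD targets.
    online_net.train_on_batch(states, targets)

# Every N steps, copy the online weights into the target network:
# target_net.set_weights(online_net.get_weights())
```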
  35. Performance ● Recent graph from Google DeepMind, 2018 (current trend in RL gaming) ● Naïve DQN vs. replay-buffer-based DQN
  36. Strengths and Weaknesses ● Good at: quick-moving, complex, short-horizon games; semi-independent trials within the game; negative feedback on failure ● Bad at: long-horizon games that don't converge; any "walking around" game; Montezuma's Revenge ● Worldly knowledge helps humans play these games relatively easily
  37. Example Code ● DQN with an Atari game ● Colab Jupyter notebooks
  38. References
     ● Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction (2nd edition draft), 2017
     ● Deep Reinforcement Learning: An Overview, 2017, https://arxiv.org/pdf/1701.07274.pdf
     ● UCL course on Reinforcement Learning: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
     ● CS231n, Reinforcement Learning, Lecture 14, 2017, http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf
     ● Thomas Simonini, Medium post "An introduction to Reinforcement Learning", https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419
     ● Arthur Juliani, Medium post "Simple Reinforcement Learning in Tensorflow", https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-1-fd544fab149
