
Deep Q-learning explained

  1. Hi
  2. Deep Q-Network: Deep Reinforcement Learning. Azzeddine CHENINE (@azzeddineCH_)
  3. 15 October 2019
  4. Q-Learning: quick reminder
  5. Quick reminder: learning by experience
  6. Quick reminder: learning by experience; episodic tasks and continuing tasks
  7. Quick reminder: learning by experience; episodic tasks and continuing tasks; an action yields a reward and results in a new state
  8. Quick reminder: value-based methods try to reach an optimal policy by using a value function
  9. Quick reminder: value-based methods try to reach an optimal policy by using a value function; the Q function of an optimal policy maximises the expected return G
  10. Quick reminder: value-based methods try to reach an optimal policy by using a value function; the Q function of an optimal policy maximises the expected return G; epsilon-greedy policies address the exploration/exploitation trade-off (see the epsilon-greedy sketch after the slide list)
  11. Limitations of Q-Learning
  12. Limitation: impractical for environments with large or infinite state or action spaces (games, the stock market, robots)
  13. Modern problems require modern solutions
  14. Deep Q-Networks (DeepMind, 2015)
  15. No more given definitions…
  16. No more given definitions… figuring it out by ourselves
  17. 1 - How to replace the Q-table? Neural networks can be trained to fit a linear or a non-linear function.
  18. 1 - How to replace the Q-table? Neural networks can be trained to fit a linear or a non-linear function. Replace the Q-value function with a deep neural network.
  19. 1 - How to replace the Q-table? Neural networks can be trained to fit a linear or a non-linear function. Replace the Q-value function with a deep neural network. The output layer holds the Q-value of each possible action ⟹ the action space must be discrete.
  20. 1 - How to replace the Q-table? Neural networks can be trained to fit a linear or a non-linear function. Replace the Q-value function with a deep neural network. The output layer holds the Q-value of each possible action ⟹ the action space must be discrete. The input layer takes the desired representation of the state ⟹ a pipeline of convolutions, a Dense layer, or a sequence of LSTM/RNN cells (see the Q-network sketch after the slide list).
  21. 1 - How to replace the Q-table ✅
  22. 1 - How to replace the Q-table ✅ 2 - How to optimize the model. How did we update the Q-table? The agent took an action (a) in a state (S).
  23. 1 - How to replace the Q-table ✅ 2 - How to optimize the model. How did we update the Q-table? The agent took an action (a) in a state (S). The agent moved to the next state (S’) and received a reward (r).
  24. 1 - How to replace the Q-table ✅ 2 - How to optimize the model. How did we update the Q-table? The agent took an action (a) in a state (S). The agent moved to the next state (S’) and received a reward (r). The discount rate 𝞭 reflects how much we care about future rewards: 𝞭 ≈ 1 ⟹ we mostly care about future rewards; 𝞭 ≈ 0 ⟹ we mostly care about the immediate reward.
  25. 1 - How to replace the Q-table ✅ 2 - How to optimize the model. How did we update the Q-table? A good Q-value for an action (a) given a state (S) is a good approximation of the expected discounted cumulative reward ⟹ Q(S, a) ≈ r(S, a, S’) + 𝞭 max Q(S’, a'). Update rule: Q'(S, a) ⟸ Q(S, a) + 𝛼 [ r(S, a, S’) + 𝞭 max Q(S’, a') - Q(S, a) ] (see the tabular update sketch after the slide list).
  26-31. 1 - How to replace the Q-table ✅ 2 - How to optimize the model. How are we going to update the Q-network weights? ⟹ Q(S, a) ≈ r(S, a, S’) + 𝞭 max Q(S’, a'). (Diagram built up across slides 26-31: State ⟶ Q-network ⟶ Q-values ⟶ action selection ⟶ Environment, which returns a new state S’ and reward R; the target r(S, a, S’) + 𝞭 max Q(S’, a') is compared with Q(S, a) to form a loss, and the optimizer produces new network weights.)
  32. 1 - How to replace the Q-table ✅ 2 - How to optimize the model. How are we going to update the Q-network weights? ⟹ Q(S, a) ≈ r(S, a, S’) + 𝞭 max Q(S’, a'). Repeat for n episodes: take the state, update the weights, move to the new state.
  33. We are chasing a moving target
  34-37. (Slides 34-37 animate the point: the prediction Q(S, a) and the target r(S, a, S’) + 𝞭 max Q(S’, a') come from the same network, so every weight update also moves the target being chased.)
  38. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target. How are we going to update the Q-network weights? We use a local network and a target network.
  39. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target. We use a local network and a target network. The target network gets slightly updated based on the local network's weights.
  40. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target. We use a local network and a target network. The target network gets slightly updated based on the local network's weights. The local network predicts the Q-values for the current state.
  41. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target. We use a local network and a target network. The target network gets slightly updated based on the local network's weights. The local network predicts the Q-values for the current state. The target network predicts the Q-values for the next state.
  42-43. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target. (Diagram, slides 42-43: the local network maps the state to Q-values and selects the action sent to the Environment; from the new state S’ and reward R, the target network supplies r(S, a, S’) + 𝞭 max Q(S’, a'); the loss against Q(S, a) drives the optimizer, which produces new weights for the local network. See the training-step sketch after the slide list.)
  44. Someone to answer the question, please: in supervised learning, why does shuffling the dataset give better accuracy?
  45. Someone to answer the question, please: in supervised learning, why does shuffling the dataset give better accuracy? If the data is not shuffled, it may be sorted, or similar data points will lie next to each other, which leads to slow convergence.
  46. Someone to answer the question, please: in supervised learning, why does shuffling the dataset give better accuracy? If the data is not shuffled, it may be sorted, or similar data points will lie next to each other, which leads to slow convergence. In our case there is a strong correlation between consecutive states, which may lead to inefficient learning.
  47. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target ✅ 4 - Experience replay. How to break the correlation between states: store the experiences in a buffer; after the specified number of steps, pick N random experiences; proceed with optimizing the loss on that learning batch (see the replay-buffer sketch after the slide list).
  48. 1 - How to replace the Q-table ✅ 2 - How to optimize the model ✅ 3 - T-target ✅ 4 - Experience replay ✅ Store the experiences in a buffer; after the specified number of steps, pick N random experiences; proceed with optimizing the loss on that learning batch.
  49. Build a DQL agent: putting it all together (see the full training-loop sketch after the slide list).
  50. Saha ftourkoum (enjoy your iftar!)
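
The sketches below expand on the slides above; each is a minimal illustration, not the presenter's own code. First, slide 10's epsilon-greedy policy, assuming Python/NumPy and a tabular Q function; the table shape and the epsilon value in the usage lines are illustrative.

```python
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon, rng=np.random.default_rng()):
    """With probability epsilon explore (random action), otherwise exploit
    the action with the highest Q-value for this state."""
    n_actions = q_table.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(q_table[state]))     # exploit

# Illustrative usage: 5 states, 3 actions, 10% exploration.
q_table = np.zeros((5, 3))
action = epsilon_greedy_action(q_table, state=2, epsilon=0.1)
```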
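
Slides 17-20 replace the Q-table with a network whose output layer holds one Q-value per discrete action. A minimal sketch assuming PyTorch and a flat vector state; for image states the first layers would be the convolution pipeline mentioned on slide 20, and the layer sizes here are illustrative.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # output layer: Q(S, a) for every action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action = argmax over the output layer, mirroring a Q-table lookup.
q_net = QNetwork(state_dim=8, n_actions=4)
state = torch.zeros(1, 8)
action = q_net(state).argmax(dim=1).item()
```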
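
Slide 25's tabular update rule Q'(S, a) ⟸ Q(S, a) + 𝛼 [ r(S, a, S’) + 𝞭 max Q(S’, a') - Q(S, a) ] as a one-function sketch; `gamma` stands for the discount rate written 𝞭 in the slides, and the alpha/gamma defaults and table shape are illustrative.

```python
import numpy as np

def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular Q-learning: move Q(S, a) toward the TD target
    r + gamma * max_a' Q(S', a')."""
    td_target = r + gamma * q_table[s_next].max()
    q_table[s, a] += alpha * (td_target - q_table[s, a])

# Illustrative usage: one transition (S=0, a=1, r=1.0, S'=2) on a 5x3 table.
q_table = np.zeros((5, 3))
q_learning_update(q_table, s=0, a=1, r=1.0, s_next=2)
```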
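
Slides 26-43 turn that update into a regression problem: the local network predicts Q(S, a), the target network supplies r(S, a, S’) + 𝞭 max Q(S’, a') for the next state, the loss between them is minimised, and the target network is slowly updated toward the local one. A minimal sketch assuming PyTorch and the QNetwork above; the soft-update rate `tau` and the terminal-state masking via `dones` are illustrative details the slides do not spell out.

```python
import torch
import torch.nn.functional as F

def dqn_training_step(local_net, target_net, optimizer, batch, gamma=0.99, tau=1e-3):
    """One optimisation step on a batch of transitions (S, a, r, S', done)."""
    states, actions, rewards, next_states, dones = batch

    # Q(S, a) predicted by the local network for the actions actually taken.
    q_pred = local_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target r + gamma * max_a' Q_target(S', a'), held fixed for this step.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1.0 - dones)

    # The loss between prediction and target drives the optimizer.
    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Soft update: the target network drifts slightly toward the local network.
    for t_param, l_param in zip(target_net.parameters(), local_net.parameters()):
        t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)
```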
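
Slides 44-48 motivate experience replay: consecutive states are strongly correlated, so transitions are stored in a buffer and learning happens on randomly sampled batches. A minimal sketch built on a deque; the default capacity is an illustrative assumption.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (S, a, r, S', done) transitions and samples them uniformly at
    random, breaking the correlation between consecutive states."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```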
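
Slide 49's "putting it all together", sketched as an outer loop that wires the previous pieces together. It assumes a classic Gym-style environment whose reset() returns a state and whose step(action) returns (next_state, reward, done, info); the episode count, batch size, learning rate and epsilon schedule are all illustrative choices, not values from the presentation.

```python
import numpy as np
import torch

def train(env, state_dim, n_actions, episodes=500, batch_size=64,
          gamma=0.99, eps=1.0, eps_min=0.01, eps_decay=0.995):
    """DQN training loop using the QNetwork, ReplayBuffer and dqn_training_step sketches above."""
    local_net = QNetwork(state_dim, n_actions)
    target_net = QNetwork(state_dim, n_actions)
    target_net.load_state_dict(local_net.state_dict())
    optimizer = torch.optim.Adam(local_net.parameters(), lr=5e-4)
    buffer = ReplayBuffer()

    for episode in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy over the local network's Q-values.
            if np.random.rand() < eps:
                action = np.random.randint(n_actions)
            else:
                with torch.no_grad():
                    q = local_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q.argmax(dim=1).item())

            next_state, reward, done, _ = env.step(action)
            buffer.add(state, action, reward, next_state, float(done))
            state = next_state

            # Learn from a random batch once enough experience is stored.
            if len(buffer) >= batch_size:
                batch = buffer.sample(batch_size)
                states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
                batch_t = (torch.as_tensor(states, dtype=torch.float32),
                           torch.as_tensor(actions, dtype=torch.int64),
                           torch.as_tensor(rewards, dtype=torch.float32),
                           torch.as_tensor(next_states, dtype=torch.float32),
                           torch.as_tensor(dones, dtype=torch.float32))
                dqn_training_step(local_net, target_net, optimizer, batch_t, gamma)

        eps = max(eps_min, eps * eps_decay)  # explore less as training progresses
```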
