Deep Reinforcement Learning


  1. Deep Reinforcement Learning: a gentle and (almost) math-free introduction. simone@ai-academy.com
  2. Outline: What, why and where it stands in ML? General framework. Q-Learning. Deep + ... Code and demo (hopefully working...)
  3. (image slide)
  4. Atari [Nature, 2015]
  5. AlphaGo [Nature, 2016]
  6. 6. "RL tries to understand the optimal way to make decisions." David Silver - Research Scientist, Google DeepMind 6
  7. How does it work? H_t = (s_1, a_1, r_1, s_2, ..., s_t, a_t, r_t, s_T)
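
      As a rough illustration of that history, here is a sketch that records H_t as a list of (s, a, r) tuples, assuming the classic gym API used later in the deck (reset returns a state, step returns a 4-tuple) and a random agent standing in for a real policy:

      import gym

      # Collect the history H_t = (s_1, a_1, r_1, s_2, ...) of one episode.
      env = gym.make('SpaceInvaders-v0')
      state = env.reset()
      history = []
      terminal = False
      while not terminal:
          action = env.action_space.sample()        # random stand-in policy
          next_state, reward, terminal, _ = env.step(action)
          history.append((state, action, reward))   # one (s_t, a_t, r_t) triple per step
          state = next_state
      history.append((state,))                      # the terminal state s_T closes the history
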
  8. Why is it different from other ML settings? No supervisor. Delayed feedback. Time matters. Data depend on the Agent's policy.
  9. What can we model? Environment & Agent
  10. Universe
  11. (image slide)
  12. Environment

      import gym

      env = gym.make('SpaceInvaders-v0')
      state = env.reset()                      # start a new episode
      terminal = False
      while not terminal:
          action = env.action_space.sample()   # take an action
          next_state, reward, terminal, _ = env.step(action)

  13. s_1 -> s_2 -> s_3
  14. Key assumptions: 1. The probability of the next state depends only on the current state. 2. Each state contains all the relevant information.
  15. It's Me, Mario! Mario wants to break bricks and free the princess!
  16. Expected future rewards. Any goal can be represented as a sum of intermediate rewards: E[ Σ_{t=0}^∞ γ^t R_t ∣ S_t ] = E[ R_0 + γ R_1 + γ² R_2 + ... ∣ S_t ]
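
      A minimal numeric sketch of that sum, with made-up rewards and an arbitrary discount factor gamma:

      # Discounted return: R_0 + gamma*R_1 + gamma^2*R_2 + ...
      rewards = [1.0, 0.0, 2.0, 5.0]    # hypothetical rewards R_0..R_3
      gamma = 0.9

      expected_return = sum(gamma ** t * r for t, r in enumerate(rewards))
      print(expected_return)            # 1.0 + 0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
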
  17. Tools: 1. Policy: π(a|s). 2. Value function: Q(s, a). 3. Model: (P, R). We have to pick at least 1 of the 3.
  18. Policy. A policy defines how the agent behaves. It takes a state as input and outputs an action. It can be stochastic or deterministic.
  19. Value function. A value function estimates how much reward the agent can achieve. It takes a state as input and outputs values, one for each possible action.
  20. Model. A model is the Agent's representation of the environment. It takes a state as input and outputs (next_state, reward).
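
      Side by side, a toy sketch of what the three tools map to in code; all sizes, dynamics and rewards below are made up for illustration:

      import numpy as np

      n_states, n_actions = 5, 3

      # 1. Policy pi(a|s): state in, action out (stochastic here).
      policy = np.ones((n_states, n_actions)) / n_actions     # uniform probabilities
      def act(state):
          return np.random.choice(n_actions, p=policy[state])

      # 2. Value function Q(s, a): one value per action for a given state.
      Q = np.zeros((n_states, n_actions))

      # 3. Model (P, R): given a state and an action, return (next_state, reward).
      def model(state, action):
          next_state = (state + action) % n_states    # invented dynamics
          reward = 1.0 if next_state == 0 else 0.0    # invented reward
          return next_state, reward
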
  21. Design choices: Balance learning and planning. Explore new actions and exploit good ones. Assign credit for correct actions.
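
      One standard way to trade off exploring new actions against exploiting good ones is epsilon-greedy action selection; a sketch over a Q table like the one above (the epsilon value is arbitrary):

      import numpy as np

      def epsilon_greedy(Q, state, epsilon=0.1):
          # With probability epsilon explore a random action,
          # otherwise exploit the action with the highest estimated value.
          if np.random.rand() < epsilon:
              return np.random.randint(Q.shape[1])
          return int(np.argmax(Q[state]))
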
  22. 22. Quick-Q&A 21
  23. How to solve it? The goal is to find the optimal policy that maximizes the future expected rewards.
  24. Repeat: 1. Prediction: compute the value of the expected reward from s_t until the terminal state. 2. Control: act greedily with respect to the predicted values.
  25. Approximation of the value function: Monte Carlo (used in AlphaGo), Temporal Difference (used in Atari)
  26. Temporal Difference
  27. Pavlovian conditioning
  28. Update rule. In rabbits, humans and machines we get the same algorithm:

      while True:
          Q[t] = Q[t-1] + alpha * (Q_target - Q[t-1])

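      A tiny numeric illustration of the same rule: each step moves the estimate a fraction alpha of the remaining error toward Q_target (the values here are arbitrary):

      Q_estimate, Q_target, alpha = 0.0, 10.0, 0.5

      for step in range(5):
          Q_estimate = Q_estimate + alpha * (Q_target - Q_estimate)
          print(step, Q_estimate)   # 5.0, 7.5, 8.75, 9.375, 9.6875 -> approaches 10
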
  29. Q-Learning [Watkins, 1989]. The agent does not have a model of the environment. It performs actions following a behaviour policy and predicts using the target policy, which makes it an "off-policy", model-free method.
  30. Loss function. Building on what we learned from the rabbit, the learning goal is to minimize the following loss function. Putting it all together we get:

      Q_target = r + gamma * np.max(Q(s_next, A))       # best value at the next state (max, not argmax)
      Loss = 1/n * np.sum((Q_target - Q(s, a)) ** 2)    # mean squared error

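      Putting the target and the update rule together, a minimal tabular Q-learning sketch; the environment name, episode count and hyperparameters are placeholders, and the classic gym API is assumed:

      import gym
      import numpy as np

      env = gym.make('FrozenLake-v0')      # any small, discrete-state environment
      Q = np.zeros((env.observation_space.n, env.action_space.n))
      alpha, gamma, epsilon = 0.1, 0.99, 0.1

      for episode in range(1000):
          state = env.reset()
          terminal = False
          while not terminal:
              # Behaviour policy: epsilon-greedy over the current estimates.
              if np.random.rand() < epsilon:
                  action = env.action_space.sample()
              else:
                  action = int(np.argmax(Q[state]))
              next_state, reward, terminal, _ = env.step(action)
              # Target policy: greedy max over the next state's actions (off-policy).
              Q_target = reward + gamma * np.max(Q[next_state])
              Q[state, action] += alpha * (Q_target - Q[state, action])
              state = next_state
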
  31. Deep Q-Learning. Let's add neural networks and we are good to go, right?
  32. (image slide)
  33. Notice... 1. Data are highly correlated. 2. The target values are not robust. 3. Wild rewards make the value function freak out.
  34. We wish... A stable Q_target, a robust Q and predictable rewards. But how?
  35. DeepMind ideas: 1. Different neural networks for Q and Q_target. 2. Estimate Q_target using past experiences. 3. Update Q_target every C steps. 4. Clip rewards between -1 and 1.
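
      A rough sketch of how these ideas are usually wired together: a replay buffer of past experiences, a separate frozen target network, and a copy of the online weights every C steps. The names, sizes and the Keras-style get_weights/set_weights calls are assumptions, not the DeepMind code:

      import random
      from collections import deque

      import numpy as np

      replay_buffer = deque(maxlen=100000)   # 2. store past experiences
      C = 1000                               # 3. target-network copy period

      def remember(state, action, reward, next_state, terminal):
          reward = float(np.clip(reward, -1.0, 1.0))   # 4. clip rewards to [-1, 1]
          replay_buffer.append((state, action, reward, next_state, terminal))

      def sample_batch(batch_size=32):
          # Training on random past experiences breaks the correlation between samples.
          return random.sample(replay_buffer, batch_size)

      def maybe_sync_target(step, q_network, target_network):
          # 1. Q and Q_target live in two different networks;
          #    every C steps the online weights are copied into the frozen target.
          if step % C == 0:
              target_network.set_weights(q_network.get_weights())
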
  36. Network. Input: an image of shape [None, 42, 42, 4]. 4 Conv2D layers with 32 filters, 4x4 kernel. 1 hidden layer of size 256. 1 fully connected layer of size action_size.
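
      One possible reading of that stack in Keras; the strides, activations and action count are guesses and may differ from the original network:

      import tensorflow as tf

      action_size = 6   # e.g. the number of discrete Atari actions

      model = tf.keras.Sequential([
          tf.keras.layers.Conv2D(32, 4, activation='relu', input_shape=(42, 42, 4)),
          tf.keras.layers.Conv2D(32, 4, activation='relu'),
          tf.keras.layers.Conv2D(32, 4, activation='relu'),
          tf.keras.layers.Conv2D(32, 4, activation='relu'),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(256, activation='relu'),   # hidden layer of size 256
          tf.keras.layers.Dense(action_size),              # one Q-value per action
      ])
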
  37. Hyperparams. Learning rate: 0.001. Reward clip: (-1, 1). Gradient clip: 40. Optimizer: AdamOptimizer.
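
      The same hyperparameters expressed with a Keras optimizer; clipnorm is used here as a stand-in for the gradient clip of 40, and the original code may have clipped the global norm instead:

      import tensorflow as tf

      optimizer = tf.keras.optimizers.Adam(learning_rate=0.001,   # learning rate 0.001
                                           clipnorm=40.0)         # gradient clip at 40

      def clip_reward(reward):
          # Rewards are clipped into (-1, 1) before they reach the loss.
          return max(-1.0, min(1.0, float(reward)))
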
  38. (image slide)
  39. Tools / Challenges / Demo: OpenAI, TensorFlow, General AI Challenge, Stanford
  40. Resources: Papers, RL - David Silver, Introduction to RL, Patacchiola Blog, Human Level Control, Async Methods for DRL
  41. 41. Q&A 40
  42. Thanks to the Machine Learning / Data Science Meetup. simone@ai-academy.com
