July 1, 2017
Create a Bot to Play FlappyBird
Introduction to Reinforcement Learning
Nguyen Luong An Phu
anphunl@gmail.com
 What is Reinforcement Learning?
 Markov Decision Process
 Introduction to OpenAI Gym
 Demo: Bot to play FlappyBird
Agenda
What is RL?
RL examples
 No supervisor, only the reward signal.
 Feedback is delayed, not instantaneous.
 Sequential data: time really matters.
 Agent’s actions affect the subsequent data it receives.
Difficulties of RL
Agent and Environment
[Diagram: the agent–environment loop. At each step t the agent takes action At, then receives observation Ot and reward Rt from the environment.]
 History: Ht = O1, R1, A1, O2, R2, A2, …, At−1, Ot, Rt
 State is the information used to determine what happens next
 St = f(Ht)
 Agent state vs environment state (S^a_t vs S^e_t)
 Fully observable and partially observable environments.
State
 Policy
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a | St = s]
 Value function
vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + … | St = s]
 Model
P^a_ss' = P[St+1 = s' | St = s, At = a]
R^a_s = E[Rt+1 | St = s, At = a]
Major components of an agent
 Value based
Value function
No policy (Implicit)
 Policy based
No value function
Policy
 Actor Critic
Value function
Policy
Categorizing RL agents
 Model free
Value function and/or policy
No model
 Model based
Value function and/or policy
Model
Categorizing RL agents
 Exploration finds more information about the environment
 Exploitation exploits known information to maximize reward
Exploration vs Exploitation
import numpy as np

eps = 0.1  # exploration probability
# epsilon-greedy action selection
if np.random.uniform() < eps:
    action = random_action()      # explore: try a random action
else:
    action = get_best_action()    # exploit: take the current best action
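The editor's notes at the end of the deck mention reducing epsilon over the course of training (explore a lot early, mostly exploit later). A minimal sketch of such a schedule, with illustrative constants that are not from the slides:

eps_start, eps_end, eps_decay = 1.0, 0.1, 0.995   # illustrative hyperparameters
eps = eps_start
for episode in range(1000):
    # ... run one episode, selecting actions epsilon-greedily as above ...
    eps = max(eps_end, eps * eps_decay)           # explore less as training progresses

At test time, set eps to 0 and always take the best action.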
 Markov state contains all useful information from the history.
 P[St+1 | St] = P[St+1 | S1, …, St]
 Some examples:
The environment state S^e_t is Markov.
The history Ht is Markov.
Markov state (Information state)
 A Markov Decision Process is a tuple (S, A, P, R, γ).
 S: a finite set of states.
 A: a finite set of actions.
 P: a state transition probability matrix
P^a_ss' = P[St+1 = s' | St = s, At = a]
 R: a reward function
R^a_s = E[Rt+1 | St = s, At = a]
 γ: a discount factor, γ ∈ [0, 1].
Markov Decision Process (MDP)
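To make the tuple concrete, here is a minimal sketch of how a small MDP can be written down in Python using plain dictionaries. The two states, two actions, and all numbers are invented for illustration; this is not the Student MDP on the next slide.

# (S, A, P, R, gamma) for a tiny hypothetical MDP
S = ["s0", "s1"]
A = ["stay", "move"]
gamma = 0.9

# P[s][a][s'] = probability of ending in s' after taking action a in state s
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.8, "s0": 0.2}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 1.0}},
}

# R[s][a] = expected immediate reward for taking action a in state s
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}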
Example: Student MDP
Picture from David Silver’s course.
 The state-value function vπ(s) is the expected return starting from state s, and then following policy π.
 The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π.
 vπ(s) = Eπ[Gt | St = s]
 qπ(s, a) = Eπ[Gt | St = s, At = a]
 Gt = Rt+1 + γRt+2 + γ²Rt+3 + …
Value function of MDP
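Given a sampled trajectory of rewards, the return Gt is a straightforward sum. A small sketch (the reward values are made up); rewards[0] plays the role of Rt+1:

def discounted_return(rewards, gamma):
    # G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62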
Bellman Expectation Equation for vπ
Picture from David Silver’s course.
Bellman Expectation Equation for qπ
Picture from David Silver’s course.
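A standard way to turn the Bellman expectation equation into an algorithm is iterative policy evaluation: repeatedly replace vπ(s) by the right-hand side of the equation until the values stop changing. A minimal, self-contained sketch; the tiny two-state MDP and the uniform random policy are invented for illustration (this is not the Student MDP):

# P[s][a] = list of (probability, next_state); R[s][a] = expected immediate reward
P = {
    "s0": {"a": [(1.0, "s1")], "b": [(1.0, "s0")]},
    "s1": {"a": [(1.0, "s0")], "b": [(1.0, "s1")]},
}
R = {
    "s0": {"a": 1.0, "b": 0.0},
    "s1": {"a": 0.0, "b": 2.0},
}
pi = {s: {"a": 0.5, "b": 0.5} for s in P}    # uniform random policy
gamma = 0.9

v = {s: 0.0 for s in P}
for _ in range(200):                         # enough synchronous sweeps to converge here
    v = {s: sum(pi[s][a] * (R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a]))
                for a in P[s])
         for s in P}
print(v)                                     # v_pi for each state under the random policy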
State-Value Function for Student MDP
7.4 = 0.5 * (1 + 0.4*7.4 + 0.4*2.7 + 0.2*(-1.3)) + 0.5 * 10
Picture from David Silver’s course.
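The figure itself is not reproduced here, but the backup on the slide can be checked directly: with γ = 1 and a uniform random policy (each of the two actions taken with probability 0.5),

v = 0.5 * (1 + 0.4 * 7.4 + 0.4 * 2.7 + 0.2 * (-1.3)) + 0.5 * 10
print(round(v, 2))   # 7.39, i.e. the 7.4 shown on the slide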
 State-value function
v∗(s) = maxπ vπ(s)
 Action-value function
q∗(s, a) = maxπ qπ(s, a)
 Optimal policy
π∗(a|s) = 1 if a = argmaxa q∗(s, a), 0 otherwise
Optimal value function and policy
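In code, recovering the optimal policy from q∗ is just an argmax over actions in each state. A sketch, assuming the action values are stored in a nested dict (the states, actions and numbers are illustrative):

q_star = {
    "s0": {"left": 0.3, "right": 1.2},
    "s1": {"left": 0.9, "right": 0.1},
}

def greedy_policy(q):
    # pi*(s) = argmax_a q*(s, a)
    return {s: max(actions, key=actions.get) for s, actions in q.items()}

print(greedy_policy(q_star))   # {'s0': 'right', 's1': 'left'}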
Bellman equation for optimal value function
Picture from David Silver’s course.
Optimal policy for Student MDP
Picture from David Silver’s course.
 Value Iteration
 Policy Iteration
 Q-learning
 Sarsa
 …
Solving the Bellman Optimality Equation
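Q-learning, one of the methods listed above, solves the Bellman optimality equation from sampled experience: after every step it nudges Q(s, a) toward the target r + γ·maxa' Q(s', a'). A minimal tabular sketch; it assumes discrete, hashable states, a Gym-style env whose reset()/step() follow the classic 4-tuple API, and illustrative hyperparameters:

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                       # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done, _ = env.step(a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q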
Deep Q-Learning
https://arxiv.org/pdf/1511.06581.pdf
Deep Q-Learning
http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
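In Deep Q-Learning the table is replaced by a neural network that maps a state to one Q-value per action, but the regression target is still the Bellman optimality target. The network itself is omitted here; a minimal sketch of how the targets for a minibatch of transitions could be formed, assuming q_next holds the (target) network's Q-values for the next states:

import numpy as np

def dqn_targets(q_next, rewards, dones, gamma=0.99):
    # q_next: array of shape (batch, n_actions) with Q-values of the next states
    # target = r                              if the episode ended at this transition
    #        = r + gamma * max_a' Q(s', a')   otherwise
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

# Illustrative numbers: a batch of 2 transitions with 2 actions
q_next = np.array([[0.5, 1.5], [2.0, 0.0]])
rewards = np.array([1.0, -1.0])
dones = np.array([0.0, 1.0])
print(dqn_targets(q_next, rewards, dones))   # [2.485, -1.0]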
Demo FlappyBird & Discussion
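The agent–environment loop from the earlier slides maps directly onto OpenAI Gym's API. A sketch of that loop with a random agent; the environment id 'FlappyBird-v0' and the gym_ple import assume the PyGame Learning Environment wrapper for Gym is installed, and step() is the classic 4-tuple Gym API of that time:

import gym
import gym_ple   # assumption: registers 'FlappyBird-v0' when gym-ple is installed

env = gym.make("FlappyBird-v0")
for episode in range(5):
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()          # random agent; swap in the trained bot here
        obs, reward, done, info = env.step(action)
        total_reward += reward
    print("episode", episode, "total reward", total_reward)
env.close()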
 https://www.coursera.org/learn/machine-learning
 https://www.coursera.org/learn/neural-networks
 NLP: https://web.stanford.edu/class/cs224n/
 CNN: http://cs231n.stanford.edu/
 RL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
 http://www.deeplearningbook.org/
 Reinforcement Learning: An Introduction (Richard S. Sutton and
Andrew G. Barto)
Courses and books
Editor's notes
  1. Real-world reinforcement learning: learn from experience to maximize rewards. A dog watches the trainer's actions, hears her command, and reacts based on that information. If the reaction is good, the dog receives a reward (a treat, praise…); if not, it receives no reward. The dog learns from experience to find the way to get as many rewards as possible.
  2. AlphaGo defeated Ke Jie (other game playing: Atari, chess…); Waymo: self-driving cars (Google); DeepMind AI reduced Google's data centre cooling bill by 40% (https://goo.gl/JbcH5n); robotics; SpaceX landing and reusing rockets; finance (investment).
  3. How is this different from supervised and unsupervised learning? We usually don't receive the reward immediately: when playing chess we win or lose because of moves made in the past, and in the self-driving-car setting the driver often hits the brake right before the accident. Observation -> action -> reward -> new observation -> new action -> new reward. The agent's actions can change the environment and affect future observations.
  4. At step t: do action At, see new observation Ot and receive reward Rt
  5. History is the series of observations, rewards and actions from the beginning up to the current time. State is a function of the history. The environment state is the environment's private representation, usually not visible to the agent; even when it is visible, it may contain irrelevant information. In a fully observable environment the agent directly observes the environment state (Sa = Se). In a partially observable environment the agent observes it only indirectly (Sa != Se).
  6. A policy is the agent's behavior: it maps from state to action. A value function is a prediction of future reward, used to evaluate how good or bad states are and hence to choose actions. A model predicts what the environment will do next: P predicts the next state, R predicts the next immediate reward (not Rt+1 itself, just its expected value). If gamma = 0 we only care about the immediate reward; if gamma = 1 we don't discount at all.
  7. Categorizing : value based, policy based, actor critic
  8. Categorizing : model free, model based
  9. Reinforcement learning is like trial-and-error learning: the agent discovers a good policy from its experience of the environment without losing too much reward along the way. Reduce epsilon during training; in test mode, just choose the best action. Epsilon is a small number (e.g. decayed from 1 to 0.1).
  10. When the Markov state is known, the history can be thrown away. A Markov state can often be constructed by adding more information. Some more examples: a chess board together with whose turn it is to move; driving a car, where only the current conditions (position, speed, …) matter and the history can be ignored.
  11. Why do we need the discount factor gamma? The discount γ gives the present value of future rewards. It avoids infinite returns in cyclic Markov processes and reflects uncertainty about the future. Like money in a bank, a reward today is worth more than one tomorrow, and animal/human behavior shows a preference for immediate reward.
  12. The example is from David Silver's course. Circles and squares are states (the square is the terminal state). Some actions: Facebook, Quit, Study… From the 3rd state, if we choose the action Pub, we may end up in different states.
  13. From state s we can take many actions; the probability of each action is π(a|s). After that we receive a reward, and the environment can move to another state s' with probability P^a_ss'.
  14. From state s we choose action a and receive reward R^a_s; we can then move to many new states. After that, we can take further actions according to π(a'|s').
  15. The optimal state-value function v∗(s) is the maximum value function over all policies. The optimal action-value function q∗(s, a) is the maximum action-value function over all policies. An MDP is "solved" when we know the optimal value function, which specifies the best possible performance in the MDP. If we know q∗(s, a), we immediately have the optimal policy.
  16. By acting greedily with respect to q∗, we obtain the optimal policy.
  17. Input: the state. Output: a vector of Q-values (size: nb_actions). Dueling DQN: the first stream is the value function V(s), which says simply how good it is to be in a given state; the second is the advantage function A(a), which tells how much better taking a certain action would be compared to the others. We can then think of Q as the combination of V and A.