July 1, 2017
Create a Bot to Play FlappyBird
Introduction to Reinforcement Learning
Nguyen Luong An Phu
anphunl@gmail.com
 What is Reinforcement Learning?
 Markov Decision Process
 Introduction to OpenAI Gym
 Demo: Bot to play FlappyBird
Agenda
What is RL?
RL examples
 No supervisor, only the reward signal.
 Feedback is delayed, not instantaneous.
 Sequential data: time really matters.
 Agent’s actions affect the subsequent data it receives.
Difficulties of RL
Agent and Environment
[Diagram: the agent–environment loop. At each step t the agent takes action At, then receives observation Ot and reward Rt from the environment.]
 History: Ht = O1, R1, A1, O2, R2, A2, …, At−1, Ot, Rt
 State is the information used to determine what happens next
 St = f(Ht)
 Agent state vs environment state (S^a_t vs S^e_t)
 Fully observable and partially observable environments.
State
 Policy
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a | St = s]
 Value function
vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + … | St = s]
 Model
P^a_ss' = P[St+1 = s' | St = s, At = a]
R^a_s = E[Rt+1 | St = s, At = a]
Major components of an agent
 Value based
Value function
No policy (Implicit)
 Policy based
No value function
Policy
 Actor Critic
Value function
Policy
Categorizing RL agents
 Model free
Value function and/or policy
No model
 Model based
Value function and/or policy
Model
Categorizing RL agents
 Exploration finds more information about the environment
 Exploitation exploits known information to maximize reward
Exploration vs Exploitation
import numpy as np

eps = 0.1  # exploration probability
# epsilon-greedy action selection
if np.random.uniform() < eps:
    action = random_action()      # explore: try a random action
else:
    action = get_best_action()    # exploit: take the current best action
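The editor's notes at the end of the deck mention reducing epsilon over the course of training (explore a lot early, mostly exploit later). A minimal sketch of such a schedule, with illustrative constants that are not from the slides:

eps_start, eps_end, eps_decay = 1.0, 0.1, 0.995   # illustrative hyperparameters
eps = eps_start
for episode in range(1000):
    # ... run one episode, selecting actions epsilon-greedily as above ...
    eps = max(eps_end, eps * eps_decay)           # explore less as training progresses

At test time, set eps to 0 and always take the best action.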
 Markov state contains all useful information from the history.
 P[St+1 | St] = P[St+1 | S1, …, St]
 Some examples:
The environment state S^e_t is Markov.
The history Ht is Markov.
Markov state (Information state)
 A Markov Decision Process is a tuple (S, A, P, R, γ).
 S: a finite set of states.
 A: a finite set of actions.
 P: a state transition probability matrix
P^a_ss' = P[St+1 = s' | St = s, At = a]
 R: a reward function
R^a_s = E[Rt+1 | St = s, At = a]
 γ: a discount factor, γ ∈ [0, 1].
Markov Decision Process (MDP)
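To make the tuple concrete, here is a minimal sketch of how a small MDP can be written down in Python using plain dictionaries. The two states, two actions, and all numbers are invented for illustration; this is not the Student MDP on the next slide.

# (S, A, P, R, gamma) for a tiny hypothetical MDP
S = ["s0", "s1"]
A = ["stay", "move"]
gamma = 0.9

# P[s][a][s'] = probability of ending in s' after taking action a in state s
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.8, "s0": 0.2}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 1.0}},
}

# R[s][a] = expected immediate reward for taking action a in state s
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}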
Example: Student MDP
Picture from David Silver’s course.
 The state-value function vπ(s) is the expected return starting from state s, and then following policy π.
 The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π.
 vπ(s) = Eπ[Gt | St = s]
 qπ(s, a) = Eπ[Gt | St = s, At = a]
 Gt = Rt+1 + γRt+2 + γ²Rt+3 + …
Value function of MDP
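Given a sampled trajectory of rewards, the return Gt is a straightforward sum. A small sketch (the reward values are made up); rewards[0] plays the role of Rt+1:

def discounted_return(rewards, gamma):
    # G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62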
Bellman Expectation Equation for vπ
Picture from David Silver’s course.
Bellman Expectation Equation for qπ
Picture from David Silver’s course.
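A standard way to turn the Bellman expectation equation into an algorithm is iterative policy evaluation: repeatedly replace vπ(s) by the right-hand side of the equation until the values stop changing. A minimal, self-contained sketch; the tiny two-state MDP and the uniform random policy are invented for illustration (this is not the Student MDP):

# P[s][a] = list of (probability, next_state); R[s][a] = expected immediate reward
P = {
    "s0": {"a": [(1.0, "s1")], "b": [(1.0, "s0")]},
    "s1": {"a": [(1.0, "s0")], "b": [(1.0, "s1")]},
}
R = {
    "s0": {"a": 1.0, "b": 0.0},
    "s1": {"a": 0.0, "b": 2.0},
}
pi = {s: {"a": 0.5, "b": 0.5} for s in P}    # uniform random policy
gamma = 0.9

v = {s: 0.0 for s in P}
for _ in range(200):                         # enough synchronous sweeps to converge here
    v = {s: sum(pi[s][a] * (R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a]))
                for a in P[s])
         for s in P}
print(v)                                     # v_pi for each state under the random policy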
State-Value Function for Student MDP
7.4 = 0.5 * (1 + 0.4*7.4 + 0.4*2.7 + 0.2*(-1.3)) + 0.5 * 10
Picture from David Silver’s course.
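The figure itself is not reproduced here, but the backup on the slide can be checked directly: with γ = 1 and a uniform random policy (each of the two actions taken with probability 0.5),

v = 0.5 * (1 + 0.4 * 7.4 + 0.4 * 2.7 + 0.2 * (-1.3)) + 0.5 * 10
print(round(v, 2))   # 7.39, i.e. the 7.4 shown on the slide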
 State-value function
v∗(s) = maxπ vπ(s)
 Action-value function
q∗(s, a) = maxπ qπ(s, a)
 Optimal policy
π∗(a|s) = 1 if a = argmaxa q∗(s, a), 0 otherwise
Optimal value function and policy
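In code, recovering the optimal policy from q∗ is just an argmax over actions in each state. A sketch, assuming the action values are stored in a nested dict (the states, actions and numbers are illustrative):

q_star = {
    "s0": {"left": 0.3, "right": 1.2},
    "s1": {"left": 0.9, "right": 0.1},
}

def greedy_policy(q):
    # pi*(s) = argmax_a q*(s, a)
    return {s: max(actions, key=actions.get) for s, actions in q.items()}

print(greedy_policy(q_star))   # {'s0': 'right', 's1': 'left'}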
Bellman equation for optimal value function
Picture from David Silver’s course.
Optimal policy for Student MDP
Picture from David Silver’s course.
 Value Iteration
 Policy Iteration
 Q-learning
 Sarsa
 …
Solving the Bellman Optimality Equation
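Q-learning, one of the methods listed above, solves the Bellman optimality equation from sampled experience: after every step it nudges Q(s, a) toward the target r + γ·maxa' Q(s', a'). A minimal tabular sketch; it assumes discrete, hashable states, a Gym-style env whose reset()/step() follow the classic 4-tuple API, and illustrative hyperparameters:

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                       # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done, _ = env.step(a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q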
Deep Q-Learning
https://arxiv.org/pdf/1511.06581.pdf
Deep Q-Learning
http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
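In Deep Q-Learning the table is replaced by a neural network that maps a state to one Q-value per action, but the regression target is still the Bellman optimality target. The network itself is omitted here; a minimal sketch of how the targets for a minibatch of transitions could be formed, assuming q_next holds the (target) network's Q-values for the next states:

import numpy as np

def dqn_targets(q_next, rewards, dones, gamma=0.99):
    # q_next: array of shape (batch, n_actions) with Q-values of the next states
    # target = r                              if the episode ended at this transition
    #        = r + gamma * max_a' Q(s', a')   otherwise
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

# Illustrative numbers: a batch of 2 transitions with 2 actions
q_next = np.array([[0.5, 1.5], [2.0, 0.0]])
rewards = np.array([1.0, -1.0])
dones = np.array([0.0, 1.0])
print(dqn_targets(q_next, rewards, dones))   # [2.485, -1.0]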
Demo FlappyBird & Discussion
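The agent–environment loop from the earlier slides maps directly onto OpenAI Gym's API. A sketch of that loop with a random agent; the environment id 'FlappyBird-v0' and the gym_ple import assume the PyGame Learning Environment wrapper for Gym is installed, and step() is the classic 4-tuple Gym API of that time:

import gym
import gym_ple   # assumption: registers 'FlappyBird-v0' when gym-ple is installed

env = gym.make("FlappyBird-v0")
for episode in range(5):
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()          # random agent; swap in the trained bot here
        obs, reward, done, info = env.step(action)
        total_reward += reward
    print("episode", episode, "total reward", total_reward)
env.close()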
 https://www.coursera.org/learn/machine-learning
 https://www.coursera.org/learn/neural-networks
 NLP: https://web.stanford.edu/class/cs224n/
 CNN: http://cs231n.stanford.edu/
 RL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
 http://www.deeplearningbook.org/
 Reinforcement Learning: An Introduction (Richard S. Sutton and
Andrew G. Barto)
Courses and books
Editor's notes
  1. Real-world reinforcement learning: learn from experience to maximize rewards. A dog watches the trainer's actions, hears her command, and reacts based on that information. If the reaction is good, the dog receives a reward (a treat, praise…); if not, it receives no reward. The dog learns from experience to find the way to get as many rewards as possible.
  2. AlphaGo defeated Ke Jie (other game playing: Atari, chess…); Waymo: self-driving cars (Google); DeepMind AI reduced Google's data centre cooling bill by 40% (https://goo.gl/JbcH5n); robotics; SpaceX landing and reusing rockets; finance (investment).
  3. How is this different from supervised and unsupervised learning? We usually don't receive the reward immediately: when playing chess we win or lose because of moves made in the past, and in the self-driving-car setting the driver often hits the brake right before the accident. Observation -> action -> reward -> new observation -> new action -> new reward. The agent's actions can change the environment and affect future observations.
  4. At step t: do action At, see new observation Ot and receive reward Rt
  5. History is the series of observations, rewards and actions from the beginning up to the current time. State is a function of the history. The environment state is the environment's private representation, usually not visible to the agent; even when it is visible, it may contain irrelevant information. In a fully observable environment the agent directly observes the environment state (Sa = Se). In a partially observable environment the agent observes it only indirectly (Sa != Se).
  6. A policy is the agent's behavior: it maps from state to action. A value function is a prediction of future reward, used to evaluate how good or bad states are and hence to choose actions. A model predicts what the environment will do next: P predicts the next state, R predicts the next immediate reward (not Rt+1 itself, just its expected value). If gamma = 0 we only care about the immediate reward; if gamma = 1 we don't discount at all.
  7. Categorizing : value based, policy based, actor critic
  8. Categorizing : model free, model based
  9. Reinforcement learning is like trial-and-error learning: the agent discovers a good policy from its experience of the environment without losing too much reward along the way. Reduce epsilon during training; in test mode, just choose the best action. Epsilon is a small number (e.g. decayed from 1 to 0.1).
  10. When the Markov state is known, the history can be thrown away. A Markov state can often be constructed by adding more information. Some more examples: a chess board together with whose turn it is to move; driving a car, where only the current conditions (position, speed, …) matter and the history can be ignored.
  11. Why do we need the discount factor gamma? The discount γ gives the present value of future rewards. It avoids infinite returns in cyclic Markov processes and reflects uncertainty about the future. Like money in a bank, a reward today is worth more than one tomorrow, and animal/human behavior shows a preference for immediate reward.
  12. The example is from David Silver's course. Circles and squares are states (the square is the terminal state). Some actions: Facebook, Quit, Study… From the 3rd state, if we choose the action Pub, we may end up in different states.
  13. From state s we can take many actions; the probability of each action is π(a|s). After that we receive a reward, and the environment can move to another state s' with probability P^a_ss'.
  14. From state s we choose action a and receive reward R^a_s; we can then move to many new states. After that, we can take further actions according to π(a'|s').
  15. The optimal state-value function v∗(s) is the maximum value function over all policies. The optimal action-value function q∗(s, a) is the maximum action-value function over all policies. An MDP is "solved" when we know the optimal value function, which specifies the best possible performance in the MDP. If we know q∗(s, a), we immediately have the optimal policy.
  16. By acting greedily with respect to q∗, we obtain the optimal policy.
  17. Input: the state. Output: a vector of Q-values (size: nb_actions). Dueling DQN: the first stream is the value function V(s), which says simply how good it is to be in a given state; the second is the advantage function A(a), which tells how much better taking a certain action would be compared to the others. We can then think of Q as the combination of V and A.