Lecture 1: Introduction to Reinforcement Learning
Lecture 1: Introduction to Reinforcement
Learning
David Silver
Lecture 1: Introduction to Reinforcement Learning
Outline
1 Admin
2 About Reinforcement Learning
3 The Reinforcement Learning Problem
4 Inside An RL Agent
5 Problems within Reinforcement Learning
Lecture 1: Introduction to Reinforcement Learning
Admin
Class Information
Thursdays 9:30 to 11:00am
Website:
http://www.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
Group:
http://groups.google.com/group/csml-advanced-topics
Contact me: d.silver@cs.ucl.ac.uk
Lecture 1: Introduction to Reinforcement Learning
Admin
Assessment
Assessment will be 50% coursework, 50% exam
Coursework
Assignment A: RL problem
Assignment B: Kernels problem
Assessment = max(assignment1, assignment2)
Examination
A: 3 RL questions
B: 3 kernels questions
Answer any 3 questions
Lecture 1: Introduction to Reinforcement Learning
Admin
Textbooks
An Introduction to Reinforcement Learning, Sutton and
Barto, 1998
MIT Press, 1998
∼ 40 pounds
Available free online!
http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
Algorithms for Reinforcement Learning, Szepesvari
Morgan and Claypool, 2010
∼ 20 pounds
Available free online!
http://www.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf
Lecture 1: Introduction to Reinforcement Learning
About RL
Many Faces of Reinforcement Learning
[Venn diagram: reinforcement learning sits at the intersection of many fields — machine learning (computer science), optimal control (engineering), operations research (mathematics), the reward system (neuroscience), classical/operant conditioning (psychology) and bounded rationality (economics)]
Lecture 1: Introduction to Reinforcement Learning
About RL
Branches of Machine Learning
[Diagram: machine learning branches into supervised learning, unsupervised learning and reinforcement learning]
Lecture 1: Introduction to Reinforcement Learning
About RL
Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine
learning paradigms?
There is no supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential, non i.i.d data)
Agent’s actions affect the subsequent data it receives
Lecture 1: Introduction to Reinforcement Learning
About RL
Examples of Reinforcement Learning
Fly stunt manoeuvres in a helicopter
Defeat the world champion at Backgammon
Manage an investment portfolio
Control a power station
Make a humanoid robot walk
Play many different Atari games better than humans
Lecture 1: Introduction to Reinforcement Learning
About RL
Helicopter Manoeuvres
Lecture 1: Introduction to Reinforcement Learning
About RL
Bipedal Robots
Lecture 1: Introduction to Reinforcement Learning
About RL
Atari
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
Reward
Rewards
A reward Rt is a scalar feedback signal
Indicates how well agent is doing at step t
The agent’s job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis
Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected
cumulative reward
Do you agree with this statement?
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
Reward
Examples of Rewards
Fly stunt manoeuvres in a helicopter
+ve reward for following desired trajectory
−ve reward for crashing
Defeat the world champion at Backgammon
+/−ve reward for winning/losing a game
Manage an investment portfolio
+ve reward for each $ in bank
Control a power station
+ve reward for producing power
−ve reward for exceeding safety thresholds
Make a humanoid robot walk
+ve reward for forward motion
−ve reward for falling over
Play many different Atari games better than humans
+/−ve reward for increasing/decreasing score
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
Reward
Sequential Decision Making
Goal: select actions to maximise total future reward
Actions may have long term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more
long-term reward
Examples:
A financial investment (may take months to mature)
Refuelling a helicopter (might prevent a crash in several hours)
Blocking opponent moves (might help winning chances many
moves from now)
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
Environments
Agent and Environment
[Diagram: the agent-environment interaction loop — the agent emits action At, the environment returns observation Ot and reward Rt]
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
Environments
Agent and Environment
[Agent-environment loop diagram, as on the previous slide]
At each step t the agent:
Executes action At
Receives observation Ot
Receives scalar reward Rt
The environment:
Receives action At
Emits observation Ot+1
Emits scalar reward Rt+1
t increments at env. step
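A minimal sketch of this loop in code (the env/agent interfaces below are assumptions for illustration, not part of the lecture):

# One episode of the agent-environment loop described above.
# `env` and `agent` are hypothetical objects with reset/step/act methods.
def run_episode(env, agent, max_steps=1000):
    observation, reward = env.reset(), 0.0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation, reward)        # agent executes At
        observation, reward, done = env.step(action)   # env emits Ot+1 and Rt+1
        total_reward += reward                         # cumulative reward to maximise
        if done:
            break
    return total_reward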
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State
History and State
The history is the sequence of observations, actions, rewards
Ht = O1, R1, A1, ..., At−1, Ot, Rt
i.e. all observable variables up to time t
i.e. the sensorimotor stream of a robot or embodied agent
What happens next depends on the history:
The agent selects actions
The environment selects observations/rewards
State is the information used to determine what happens next
Formally, state is a function of the history:
St = f (Ht)
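As a toy illustration (not from the slides), one simple choice of f keeps only the last few observations of the history — compare the rat example on a later slide:

from collections import deque

# State as a function of history: here f(Ht) = the last k observations.
def make_state_fn(k=3):
    window = deque(maxlen=k)
    def f(observation):
        window.append(observation)
        return tuple(window)   # St: a fixed-size summary of Ht
    return f

state_fn = make_state_fn(k=3)
for obs in ["light", "bell", "lever", "light"]:
    state = state_fn(obs)
print(state)   # ('bell', 'lever', 'light')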
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State
Environment State
[Agent-environment loop diagram, with the environment state Sᵉt shown inside the environment]
The environment state Sᵉt is the environment's private representation
i.e. whatever data the environment uses to pick the next observation/reward
The environment state is not usually visible to the agent
Even if Sᵉt is visible, it may contain irrelevant information
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State
Agent State
[Agent-environment loop diagram, with the agent state Sᵃt shown inside the agent]
The agent state Sᵃt is the agent's internal representation
i.e. whatever information the agent uses to pick the next action
i.e. it is the information used by reinforcement learning algorithms
It can be any function of history: Sᵃt = f(Ht)
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State
Information State
An information state (a.k.a. Markov state) contains all useful
information from the history.
Definition
A state St is Markov if and only if
P[St+1 | St] = P[St+1 | S1, ..., St]
“The future is independent of the past given the present”
H1:t → St → Ht+1:∞
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
The environment state Sᵉt is Markov
The history Ht is Markov
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State
Rat Example
What if agent state = last 3 items in sequence?
What if agent state = counts for lights, bells and levers?
What if agent state = complete sequence?
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State
Fully Observable Environments
[Diagram: the agent observes the environment state St directly, receives reward Rt, and emits action At]
Full observability: agent directly
observes environment state
Ot = Sᵃt = Sᵉt
Agent state = environment
state = information state
Formally, this is a Markov
decision process (MDP)
(Next lecture and the
majority of this course)
Lecture 1: Introduction to Reinforcement Learning
The RL Problem
State
Partially Observable Environments
Partial observability: agent indirectly observes environment:
A robot with camera vision isn’t told its absolute location
A trading agent only observes current prices
A poker playing agent only observes public cards
Now agent state ≠ environment state
Formally this is a partially observable Markov decision process
(POMDP)
Agent must construct its own state representation Sᵃt, e.g.
Complete history: Sᵃt = Ht
Beliefs of environment state: Sᵃt = (P[Sᵉt = s¹], ..., P[Sᵉt = sⁿ])
Recurrent neural network: Sᵃt = σ(Sᵃt−1 Ws + Ot Wo)
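A minimal numpy sketch of the recurrent update Sᵃt = σ(Sᵃt−1 Ws + Ot Wo); the dimensions and random weights are illustrative assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 8, 4
rng = np.random.default_rng(0)
Ws = rng.normal(scale=0.1, size=(state_dim, state_dim))  # recurrent weights
Wo = rng.normal(scale=0.1, size=(obs_dim, state_dim))    # observation weights

def update_agent_state(prev_state, observation):
    # Sa_t = sigma(Sa_{t-1} Ws + Ot Wo)
    return sigmoid(prev_state @ Ws + observation @ Wo)

state = np.zeros(state_dim)
for observation in rng.normal(size=(5, obs_dim)):   # a stream of observations Ot
    state = update_agent_state(state, observation)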
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent
Major Components of an RL Agent
An RL agent may include one or more of these components:
Policy: agent’s behaviour function
Value function: how good is each state and/or action
Model: agent’s representation of the environment
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent
Policy
A policy is the agent’s behaviour
It is a map from state to action, e.g.
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a|St = s]
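Both forms are easy to write down directly; the states and action names below are made up for illustration:

import random

# Deterministic policy: a = pi(s), a plain lookup from state to action.
deterministic_policy = {"s0": "E", "s1": "N"}

# Stochastic policy: pi(a|s) = P[At = a | St = s], a distribution over actions.
stochastic_policy = {"s0": {"N": 0.1, "E": 0.7, "S": 0.1, "W": 0.1}}

def sample_action(policy, state):
    probs = policy[state]
    actions, weights = list(probs), list(probs.values())
    return random.choices(actions, weights=weights)[0]

print(deterministic_policy["s0"])              # E
print(sample_action(stochastic_policy, "s0"))  # E most of the time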
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent
Value Function
Value function is a prediction of future reward
Used to evaluate the goodness/badness of states
And therefore to select between actions, e.g.
vπ(s) = Eπ[ Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s ]
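For example, the discounted return inside that expectation can be computed from a sequence of sampled rewards (a small self-contained sketch, not from the lecture):

# Discounted return Gt = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# v_pi(s) is the expectation of this quantity over trajectories starting in s;
# a Monte-Carlo estimate simply averages the return over many sampled episodes.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))   # ≈ 0.81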
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent
Example: Value Function in Atari
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent
Model
A model predicts what the environment will do next
P predicts the next state
R predicts the next (immediate) reward, e.g.
Pᵃss′ = P[St+1 = s′ | St = s, At = a]
Rᵃs = E[Rt+1 | St = s, At = a]
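A tabular model can simply be estimated from counts of observed transitions; the sketch below (illustrative only) builds arrays for Pᵃss′ and Rᵃs:

import numpy as np

n_states, n_actions = 4, 2
counts = np.zeros((n_actions, n_states, n_states))   # visits of (a, s, s')
reward_sum = np.zeros((n_actions, n_states))          # summed rewards for (a, s)

def record(s, a, r, s_next):
    counts[a, s, s_next] += 1
    reward_sum[a, s] += r

def model():
    n_sa = counts.sum(axis=2, keepdims=True)
    # P^a_{ss'}: empirical next-state distribution (zero where (s, a) never seen)
    P = np.divide(counts, n_sa, out=np.zeros_like(counts), where=n_sa > 0)
    # R^a_s: empirical mean immediate reward
    visits = counts.sum(axis=2)
    R = np.divide(reward_sum, visits, out=np.zeros_like(reward_sum), where=visits > 0)
    return P, R

record(0, 1, -1.0, 1)
P, R = model()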
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent
Maze Example
[Maze diagram with Start and Goal cells]
Rewards: -1 per time-step
Actions: N, E, S, W
States: Agent’s location
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent
Maze Example: Policy
[Maze diagram with Start and Goal cells, an arrow drawn in each open cell]
Arrows represent policy π(s) for each state s
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent
Maze Example: Value Function
[Maze diagram: each cell labelled with its value under the policy, from −24 in the farthest cell down to −1 next to the Goal]
Numbers represent value vπ(s) of each state s
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent
Maze Example: Model
[Maze diagram: the cells captured by the agent's internal model, each labelled with its immediate reward of −1]
Agent may have an internal
model of the environment
Dynamics: how actions
change the state
Rewards: how much reward
from each state
The model may be imperfect
Grid layout represents transition model Pᵃss′
Numbers represent immediate reward Rᵃs from each state s (same for all a)
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent
Categorizing RL agents (1)
Value Based
No Policy (Implicit)
Value Function
Policy Based
Policy
No Value Function
Actor Critic
Policy
Value Function
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent
Categorizing RL agents (2)
Model Free
Policy and/or Value Function
No Model
Model Based
Policy and/or Value Function
Model
Lecture 1: Introduction to Reinforcement Learning
Inside An RL Agent
RL Agent Taxonomy
[Venn diagram: policy, value function and model overlap; the regions give value-based, policy-based and actor-critic agents, and model-free versus model-based agents]
Lecture 1: Introduction to Reinforcement Learning
Problems within RL
Learning and Planning
Two fundamental problems in sequential decision making
Reinforcement Learning:
The environment is initially unknown
The agent interacts with the environment
The agent improves its policy
Planning:
A model of the environment is known
The agent performs computations with its model (without any
external interaction)
The agent improves its policy
a.k.a. deliberation, reasoning, introspection, pondering,
thought, search
Lecture 1: Introduction to Reinforcement Learning
Problems within RL
Atari Example: Reinforcement Learning
[Agent-environment loop diagram: actions At on the joystick, observations Ot as screen pixels, rewards Rt as score changes]
Rules of the game are
unknown
Learn directly from
interactive game-play
Pick actions on
joystick, see pixels
and scores
Lecture 1: Introduction to Reinforcement Learning
Problems within RL
Atari Example: Planning
Rules of the game are known
Can query emulator
perfect model inside agent’s brain
If I take action a from state s:
what would the next state be?
what would the score be?
Plan ahead to find optimal policy
e.g. tree search
[Diagram: a search tree over left/right joystick actions, expanded from the current emulator state]
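A depth-limited lookahead over such a model might look like this sketch, where model(s, a) -> (next state, reward) is an assumed emulator interface, not the lecture's code:

# Exhaustive depth-limited tree search over a known, queriable model.
def lookahead_value(model, state, actions, depth, gamma=0.99):
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        next_state, reward = model(state, a)   # "if I take action a from state s..."
        value = reward + gamma * lookahead_value(model, next_state, actions, depth - 1, gamma)
        best = max(best, value)
    return best

def plan(model, state, actions, depth=3, gamma=0.99):
    # Pick the action whose subtree has the highest backed-up value.
    def q(a):
        next_state, reward = model(state, a)
        return reward + gamma * lookahead_value(model, next_state, actions, depth - 1, gamma)
    return max(actions, key=q)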
Lecture 1: Introduction to Reinforcement Learning
Problems within RL
Exploration and Exploitation (1)
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy
From its experiences of the environment
Without losing too much reward along the way
Lecture 1: Introduction to Reinforcement Learning
Problems within RL
Exploration and Exploitation (2)
Exploration finds more information about the environment
Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
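One standard way to balance the two is ε-greedy action selection; the sketch below is generic and not tied to the lecture:

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: dict mapping action -> current estimate of its value.
    if random.random() < epsilon:
        return random.choice(list(q_values))     # explore: try something else
    return max(q_values, key=q_values.get)       # exploit: best known action

print(epsilon_greedy({"favourite": 4.2, "new_place": 0.0}))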
Lecture 1: Introduction to Reinforcement Learning
Problems within RL
Examples
Restaurant Selection
Exploitation Go to your favourite restaurant
Exploration Try a new restaurant
Online Banner Advertisements
Exploitation Show the most successful advert
Exploration Show a different advert
Oil Drilling
Exploitation Drill at the best known location
Exploration Drill at a new location
Game Playing
Exploitation Play the move you believe is best
Exploration Play an experimental move
Lecture 1: Introduction to Reinforcement Learning
Problems within RL
Prediction and Control
Prediction: evaluate the future
Given a policy
Control: optimise the future
Find the best policy
Lecture 1: Introduction to Reinforcement Learning
Problems within RL
Gridworld Example: Prediction
[Gridworld diagram: actions from special states A and B yield rewards +10 and +5 and move the agent to A′ and B′; the companion grid shows the value of each state under the uniform random policy, ranging from 8.8 down to −2.0]
What is the value function for the uniform random policy?
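Numerically, the question is answered by iterative policy evaluation: repeatedly apply the Bellman expectation backup until the values stop changing. A generic sketch (array shapes are assumptions, not the lecture's code):

import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9, iters=1000, tol=1e-8):
    # P[a, s, s'] = P^a_{ss'}, R[a, s] = R^a_s, policy[s, a] = pi(a|s).
    # Backup: v(s) <- sum_a pi(a|s) ( R^a_s + gamma * sum_s' P^a_{ss'} v(s') )
    n_states = R.shape[1]
    v = np.zeros(n_states)
    for _ in range(iters):
        v_new = np.einsum("sa,as->s", policy, R) + gamma * np.einsum("sa,ast,t->s", policy, P, v)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v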
Lecture 1: Introduction to Reinforcement Learning
Problems within RL
Gridworld Example: Control
[Gridworld diagram (a), the optimal value function v∗ (b) and the optimal policy π∗ (c); optimal values range from 24.4 down to 11.7, with the +10 and +5 jumps from A to A′ and B to B′ as before]
What is the optimal value function over all possible policies?
What is the optimal policy?
Lecture 1: Introduction to Reinforcement Learning
Course Outline
Course Outline
Part I: Elementary Reinforcement Learning
1 Introduction to RL
2 Markov Decision Processes
3 Planning by Dynamic Programming
4 Model-Free Prediction
5 Model-Free Control
Part II: Reinforcement Learning in Practice
1 Value Function Approximation
2 Policy Gradient Methods
3 Integrating Learning and Planning
4 Exploration and Exploitation
5 Case study - RL in games
