Lecture	2:	Sampling-based	Approximations
And
Function	Fitting
Yan	(Rocky)	Duan
Berkeley	AI	Research	Lab
Many	slides	made	with	John	Schulman,	Xi	(Peter)	Chen	and	Pieter	Abbeel
• Optimal Control = given an MDP (S, A, P, R, γ, H), find the optimal policy π*
Quick	One-Slide	Recap
• Exact Methods:
  • Value Iteration
  • Policy Iteration
Limitations:
• Update equations require access to the dynamics model
• Iteration over / storage for all states and actions requires a small, discrete state-action space
-> sampling-based approximations
-> Q/V function fitting
• Q Value Iteration
• Value Iteration?
• Policy Iteration
  • Policy Evaluation
  • Policy Improvement?
Sampling-Based	Approximation
Recap Q-Values
Q*(s, a) = expected utility starting in s, taking action a, and (thereafter) acting optimally

Bellman Equation:
Q*(s,a) = Σ_{s′} P(s′|s,a) [ R(s,a,s′) + γ max_{a′} Q*(s′,a′) ]

Q-Value Iteration:
Q_{k+1}(s,a) ← Σ_{s′} P(s′|s,a) [ R(s,a,s′) + γ max_{a′} Q_k(s′,a′) ]
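To make the recap concrete, here is a minimal numpy sketch of exact Q-value iteration for a small tabular MDP, assuming the transition model P and reward R are available as arrays (the function name, array layout, and hyperparameters are illustrative, not from the slides):

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, num_iters=100):
    """Exact Q-value iteration for a tabular MDP.
    P: transition probabilities P(s'|s,a), shape (S, A, S')
    R: rewards R(s,a,s'),                  shape (S, A, S')
    Returns Q of shape (S, A).
    """
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        V = Q.max(axis=1)                                      # max_a' Q_k(s', a')
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)   # Bellman backup
    return Q
```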
• Q-value iteration:
  Q_{k+1}(s,a) ← Σ_{s′} P(s′|s,a) [ R(s,a,s′) + γ max_{a′} Q_k(s′,a′) ]
• Rewrite as expectation:
  Q_{k+1}(s,a) ← E_{s′∼P(s′|s,a)} [ R(s,a,s′) + γ max_{a′} Q_k(s′,a′) ]
• (Tabular) Q-Learning: replace the expectation by samples
  • For a state-action pair (s,a), receive: s′ ∼ P(s′|s,a)
  • Consider your old estimate: Q_k(s,a)
  • Consider your new sample estimate: target(s′) = R(s,a,s′) + γ max_{a′} Q_k(s′,a′)
  • Incorporate the new estimate into a running average:
    Q_{k+1}(s,a) ← (1−α) Q_k(s,a) + α [target(s′)]
(Tabular) Q-Learning
(Tabular)	Q-Learning
Algorithm:
Start with Q_0(s,a) for all s, a.
Get initial state s
For k = 1, 2, … till convergence:
    Sample action a, get next state s′
    If s′ is terminal:
        target = R(s,a,s′)
        Sample new initial state s′
    else:
        target = R(s,a,s′) + γ max_{a′} Q_k(s′,a′)
    Q_{k+1}(s,a) ← (1−α) Q_k(s,a) + α [target]
    s ← s′
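A minimal runnable sketch of this loop, assuming a classic Gym-style environment interface (env.reset() → s, env.step(a) → (s′, r, done, info)); the ε-greedy behavior policy and all hyperparameters here are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def tabular_q_learning(env, num_states, num_actions,
                       alpha=0.5, gamma=0.99, epsilon=0.1, num_steps=100_000):
    """Tabular Q-learning; assumes a classic Gym-style env interface."""
    Q = np.zeros((num_states, num_actions))     # Q_0(s, a) for all s, a
    s = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy action selection (see the next slide)
        if np.random.rand() < epsilon:
            a = np.random.randint(num_actions)
        else:
            a = int(np.argmax(Q[s]))
        s2, r, done, _ = env.step(a)
        if done:
            target = r                           # terminal: no bootstrap term
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = env.reset()                      # sample a new initial state
        else:
            target = r + gamma * np.max(Q[s2])   # R(s,a,s') + gamma * max_a' Q(s',a')
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s2
    return Q
```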
• Choose random actions?
• Choose the action that maximizes Q_k(s,a) (i.e., act greedily)?
• ε-Greedy: choose a random action with prob. ε, otherwise choose the action greedily
How to sample actions?
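A hedged sketch of ε-greedy action selection over one row of a Q-table (names are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, s, num_actions, epsilon=0.1):
    """With probability epsilon act randomly, otherwise act greedily w.r.t. Q[s]."""
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    return int(np.argmax(Q[s]))
```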
• Amazing result: Q-learning converges to the optimal policy -- even if you’re acting suboptimally!
• This is called off-policy learning
• Caveats:
  • You have to explore enough
  • You have to eventually make the learning rate small enough
  • … but not decrease it too quickly
Q-Learning	Properties
• Technical requirements:
  • All states and actions are visited infinitely often
    • Basically, in the limit, it doesn’t matter how you select actions (!)
  • Learning rate schedule such that for all state-action pairs (s,a):
Q-Learning	Properties
For details, see Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative
dynamic programming algorithms. Neural Computation, 6(6), November 1994.
Σ_{t=0}^{∞} α_t(s,a) = ∞        Σ_{t=0}^{∞} α_t²(s,a) < ∞
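For example, a per-pair schedule such as α_t(s,a) = 1/N_t(s,a), where N_t(s,a) counts visits to (s,a), satisfies both conditions (the harmonic series diverges while the sum of its squares is finite), whereas a constant learning rate satisfies the first condition but not the second.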
Q-Learning	Demo:	Gridworld
• States:	11	cells
• Actions:	{up,	down,	left,	right}
• Deterministic	transition	function
• Learning	rate:	0.5
• Discount:	1
• Reward:	+1	for	getting	diamond,	-1	for	falling	into	trap
Q-Learning	Demo:	Crawler
• States:	discretized	value	of	2d	state:	(arm	angle,	hand	angle)
• Actions:	Cartesian	product	of	{arm	up,	arm	down}	and	{hand	up,	hand	down}
• Reward:	speed	in	the	forward	direction
Sampling-Based	Approximation
• Q Value Iteration → (Tabular) Q-learning
• Value Iteration?
• Policy Iteration
  • Policy Evaluation
  • Policy Improvement?
• Value Iteration:
  V*_{i+1}(s) ← max_{a} E_{s′∼P(s′|s,a)} [ R(s,a,s′) + γ V*_i(s′) ]
• Unclear how to draw samples through the max…
Value Iteration w/ Samples?
• Q Value Iteration → (Tabular) Q-learning
• Value Iteration?
• Policy Iteration
  • Policy Evaluation
  • Policy Improvement?
Sampling-Based	Approximation
Recap:	Policy	Iteration
One	iteration	of	policy	iteration:
• Policy evaluation for current policy π_k:
  • Iterate until convergence:
    V^{π_k}_{i+1}(s) ← E_{s′∼P(s′|s,π_k(s))} [ R(s, π_k(s), s′) + γ V^{π_k}_i(s′) ]
  • Can be approximated by samples -- this is called Temporal Difference (TD) Learning
• Policy improvement: find the best action according to a one-step look-ahead:
    π_{k+1}(s) ← arg max_{a} E_{s′∼P(s′|s,a)} [ R(s,a,s′) + γ V^{π_k}(s′) ]
  • Unclear what to do with the max (for now)
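A minimal sketch of the sample-based policy-evaluation step above (TD(0)) for a fixed policy, assuming the same Gym-style environment interface as before; names and hyperparameters are illustrative:

```python
import numpy as np

def td0_policy_evaluation(env, policy, num_states,
                          alpha=0.1, gamma=0.99, num_steps=100_000):
    """TD(0): sample-based approximation of policy evaluation.
    policy(s) -> action; env follows a classic Gym-style interface."""
    V = np.zeros(num_states)
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)
        s2, r, done, _ = env.step(a)
        # the TD target replaces the expectation over s' by a single sample
        target = r if done else r + gamma * V[s2]
        V[s] = (1 - alpha) * V[s] + alpha * target
        s = env.reset() if done else s2
    return V
```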
• Q Value Iteration → (Tabular) Q-learning
• Value Iteration?
• Policy Iteration
  • Policy Evaluation → (Tabular) TD-learning
  • Policy Improvement (for now)
Sampling-Based	Approximation
• Optimal Control = given an MDP (S, A, P, R, γ, H), find the optimal policy π*
Quick	One-Slide	Recap
• Exact Methods:
  • Value Iteration
  • Policy Iteration
Limitations:
• Update equations require access to the dynamics model
• Iteration over / storage for all states and actions requires a small, discrete state-action space
-> sampling-based approximations
-> Q/V function fitting
• Discrete environments (approximate number of states):
  • Gridworld: ~10^1
  • Tetris: ~10^60
  • Atari: ~10^308 (RAM), ~10^16992 (pixels)
Can tabular methods scale?
• Continuous environments (by crude discretization):
  • Crawler: ~10^2
  • Hopper: ~10^10
  • Humanoid: ~10^100
Can	tabular	methods	scale?
Generalizing	Across	States
• Basic Q-Learning keeps a table of all Q-values
• In realistic situations, we cannot possibly learn about every single state!
  • Too many states to visit them all in training
  • Too many states to hold the Q-tables in memory
• Instead, we want to generalize:
  • Learn about some small number of training states from experience
  • Generalize that experience to new, similar situations
  • This is a fundamental idea in machine learning, and we’ll see it over and over again
• Instead of a table, we have a parametrized Q function: Q_θ(s,a)
  • Can be a linear function in features:
    Q_θ(s,a) = θ_0 f_0(s,a) + θ_1 f_1(s,a) + … + θ_n f_n(s,a)
  • Or a complicated neural net
• Learning rule:
  • Remember: target(s′) = R(s,a,s′) + γ max_{a′} Q_{θ_k}(s′,a′)
  • Update: θ_{k+1} ← θ_k − α ∇_θ [ ½ (Q_θ(s,a) − target(s′))² ] |_{θ=θ_k}
Approximate Q-Learning
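A minimal sketch of one approximate Q-learning update with a linear-in-features Q function; the feature function, argument names, and hyperparameters are illustrative assumptions:

```python
import numpy as np

def approx_q_update(theta, features, s, a, r, s_next, actions,
                    alpha=0.01, gamma=0.99, done=False):
    """One gradient step on 1/2 * (Q_theta(s,a) - target(s'))^2 where
    Q_theta(s,a) = theta . features(s,a); features(s, a) -> np.ndarray."""
    if done:
        target = r
    else:
        target = r + gamma * max(theta @ features(s_next, a2) for a2 in actions)
    phi = features(s, a)
    q_sa = theta @ phi
    # gradient of the squared error w.r.t. theta (target held fixed) is (Q_theta(s,a) - target) * phi
    return theta - alpha * (q_sa - target) * phi
```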
Connection to Tabular Q-Learning
• Suppose θ ∈ R^{|S|×|A|}, with Q_θ(s,a) ≡ θ_{sa}
• Plug into the update:
  ∇_{θ_{sa}} [ ½ (Q_θ(s,a) − target(s′))² ] = ∇_{θ_{sa}} [ ½ (θ_{sa} − target(s′))² ] = θ_{sa} − target(s′)
  so the update becomes
  θ_{sa} ← θ_{sa} − α (θ_{sa} − target(s′)) = (1−α) θ_{sa} + α [target(s′)]
• Compare with the Tabular Q-Learning update:
  Q_{k+1}(s,a) ← (1−α) Q_k(s,a) + α [target(s′)]
• state: naïve board configuration + shape of the falling piece (~10^60 states!)
• action: rotation and translation applied to the falling piece
• 22 features aka basis functions φ_i (see the sketch after this slide):
  • Ten basis functions, 0, …, 9, mapping the state to the height h[k] of each column
  • Nine basis functions, 10, …, 18, each mapping the state to the absolute difference between heights of successive columns: |h[k+1] − h[k]|, k = 1, …, 9
  • One basis function, 19, that maps the state to the maximum column height: max_k h[k]
  • One basis function, 20, that maps the state to the number of ‘holes’ in the board
  • One basis function, 21, that is equal to 1 in every state
[Bertsekas &	Ioffe,	1996	(TD);	Bertsekas &	Tsitsiklis 1996	(TD);	Kakade 2002	(policy	gradient);	Farias &	Van	Roy,	2006	(approximate	LP)]
V̂(s) = Σ_{i=0}^{21} θ_i φ_i(s) = θ^⊤ φ(s)
Engineered	Approximation	Example:	Tetris
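A hedged sketch of computing these 22 basis functions from a binary board array (the 10-column board representation and helper names are assumptions for illustration); the value estimate is then θ @ tetris_features(board):

```python
import numpy as np

def tetris_features(board):
    """board: binary array of shape (rows, 10), 1 = occupied cell, row 0 at the top.
    Returns the 22-dimensional feature vector phi(s) described above."""
    rows, cols = board.shape
    # column heights h[k]: distance from the top-most occupied cell to the floor
    heights = np.array([rows - np.argmax(board[:, k]) if board[:, k].any() else 0
                        for k in range(cols)])
    diffs = np.abs(np.diff(heights))              # |h[k+1] - h[k]|, 9 values
    max_height = heights.max()
    # holes: empty cells with at least one occupied cell above them
    holes = sum(int(board[r, k] == 0 and board[:r, k].any())
                for k in range(cols) for r in range(rows))
    return np.concatenate([heights, diffs, [max_height, holes, 1.0]])
```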
Deep	Reinforcement	Learning
Pong, Enduro, Beamrider, Q*bert
• From	pixels	to	actions
• Same	algorithm	(with	effective	tricks)
• CNN	function	approximator,	w/	3M	free	parameters
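For concreteness, a hedged sketch of a small convolutional Q-network in PyTorch, in the spirit of the DQN-style architecture; the layer sizes, the 84×84×4 frame-stack input, and the resulting parameter count are illustrative assumptions, not taken from the slides:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, x):          # x: (batch, 4, 84, 84), values in [0, 1]
        return self.net(x)         # (batch, num_actions) of Q-values
```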
• We have now covered enough material for Lab 1.
• It will be released on Piazza by this afternoon.
• It covers value iteration, policy iteration, and tabular Q-learning.
Lab	1
• The bad news: approximate Q-learning is not guaranteed to converge…
  • Even if the function approximation is expressive enough to represent the true Q function
Convergence	of	Approximate	Q-Learning
[Figure: two states x1 and x2, all rewards 0; values represented by the linear function approximator [1 2]·θ, i.e. V(x1) = θ, V(x2) = 2θ]
Simple	Example**
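A hedged numeric illustration of how fitted value iteration can diverge on a two-state example of this kind, assuming (as in the classic version of this example) that x1 transitions to x2, x2 transitions to itself, all rewards are 0, and each iteration fits θ to the Bellman targets by least squares:

```python
import numpy as np

# features: V_theta(x1) = 1*theta, V_theta(x2) = 2*theta
phi = np.array([1.0, 2.0])
gamma = 0.9
theta = 1.0                      # any nonzero initialization

for k in range(10):
    V = phi * theta              # current value estimates [V(x1), V(x2)]
    # Bellman targets with reward 0: x1 -> x2, x2 -> x2 (assumed dynamics)
    targets = np.array([gamma * V[1], gamma * V[1]])
    # least-squares fit of theta to the targets: min_theta ||phi*theta - targets||^2
    theta = (phi @ targets) / (phi @ phi)
    print(k, theta)              # grows by a factor 1.08 per iteration, even though V* = 0
```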
• Definition. An operator G is a non-expansion with respect to a norm ||·|| if
    ||G Q − G Q′|| ≤ ||Q − Q′||
• Fact. If the operator F is a γ-contraction with respect to a norm ||·|| and the operator G is a non-expansion with respect to the same norm, then the sequential application of the operators G and F is a γ-contraction, i.e.,
    ||F(G Q) − F(G Q′)|| ≤ γ ||Q − Q′||
• Corollary. If the supervised learning step is a non-expansion, then iteration in value iteration with function approximation is a γ-contraction, and in this case we have a convergence guarantee.
Composing	Operators**
• Examples:
  • nearest neighbor (aka state aggregation)
  • linear interpolation over triangles (tetrahedrons, …)
Averager Function Approximators Are Non-Expansions**
Example	taken	from	Gordon,	1995
Linear Regression ☹ **
• I.e., if we pick a non-expansion function approximator that can approximate J* well, then we obtain a good value function estimate.
• To apply this to discretization: use continuity assumptions to show that J* can be approximated well by the chosen discretization scheme.
Guarantees	for	Fixed	Point**
