Lecture notes

Reinforcement Learning
Michael L. Littman
Slides from http://www.cs.vu.nl/~elena/ml_13light.ppt, which appear to have been adapted from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/l20.ps

Reinforcement Learning
[Read Ch. 13] [Exercise 13.2]

Control Learning

One Example: TD-Gammon

Reinforcement Learning Problem

Markov Decision Processes

Agent’s Learning Task

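Value Function
For each policy $\pi$, define the discounted cumulative reward $V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$, with discount factor $0 \le \gamma < 1$, where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state $s$.
Restated, the task is to learn the optimal policy $\pi^*$: $\pi^* \equiv \arg\max_{\pi} V^{\pi}(s)$ for all $s$.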
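As a small illustration of the discounted sum $\sum_i \gamma^i r_{t+i}$, the sketch below evaluates it for a finite reward sequence. The function name `discounted_return` and the example rewards are hypothetical, chosen only for illustration; an infinite-horizon value would need the full reward stream of the policy.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: two zero-reward steps, then a reward of 100.
example_value = discounted_return([0.0, 0.0, 100.0], gamma=0.9)
print(example_value)
```

With $\gamma = 0.9$ the reward of 100 arriving two steps late is worth about 81 at the start state, which is the sense in which later rewards count for less.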

What to Learn

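Q Function
Define $Q(s, a) \equiv r(s, a) + \gamma V^*(\delta(s, a))$, where $\delta(s, a)$ is the state that results from taking action $a$ in state $s$ and $V^*$ is the value function of the optimal policy.
If the agent learns Q, it can choose the optimal action even without knowing $\delta$! The optimal policy is $\pi^*(s) = \arg\max_a Q(s, a)$.
Q is the evaluation function the agent will learn [Watkins 1989].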
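A minimal sketch of why a learned Q is enough to act: the greedy choice only looks up Q values for the current state and never consults the transition or reward function. The table layout (a dict keyed by `(state, action)`) and the numbers in it are assumptions made for this example.

```python
def greedy_action(q_table, state, actions):
    """Pick argmax_a Q(state, a) using only the table, with no model of the world."""
    return max(actions, key=lambda a: q_table[(state, a)])


# Hypothetical Q values for a single state with two actions.
q_table = {("s1", "left"): 72.0, ("s1", "right"): 90.0}
best = greedy_action(q_table, "s1", ["left", "right"])
print(best)
```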

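Training Rule to Learn Q
Since $V^*(s) = \max_{a'} Q(s, a')$, we can write Q recursively as $Q(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$. Nice!
Let $\hat{Q}$ denote the learner’s current approximation to Q. Consider the training rule $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$, where s’ is the state resulting from applying action a in state s.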

Q Learning for Deterministic Worlds
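A minimal sketch of tabular Q learning in a deterministic world, using the update $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$ with the table initialized to zero. The four-state chain, the reward of 100 for entering the goal, and the purely random action choice are all assumptions made for this example, not details taken from the slides.

```python
import random

GAMMA = 0.9
GOAL = 3  # absorbing goal state of a hypothetical chain 0 - 1 - 2 - 3
ACTIONS = ("left", "right")


def step(state, action):
    """Deterministic transition and reward for the chain world."""
    if action == "right":
        next_state = min(state + 1, GOAL)
    else:
        next_state = max(state - 1, 0)
    reward = 100.0 if next_state == GOAL else 0.0
    return next_state, reward


def q_learning(episodes=200, seed=0):
    rng = random.Random(seed)
    # Initialize every table entry to zero.
    q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        state = rng.randrange(GOAL)  # start in a random non-goal state
        while state != GOAL:
            action = rng.choice(ACTIONS)  # random exploration
            next_state, reward = step(state, action)
            # Deterministic training rule: overwrite with r + gamma * max Q(s', a').
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] = reward + GAMMA * best_next
            state = next_state
    return q


learned_q = q_learning()
```

Because the world is deterministic, each entry is simply overwritten rather than averaged. Once every state-action pair has been visited often enough, moving right from states 2, 1, and 0 should be valued at about 100, 90, and 81.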

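Updating Q
Notice that if rewards are non-negative (and $\hat{Q}$ is initialized to zero), then $(\forall s, a, n)\ \hat{Q}_{n+1}(s, a) \ge \hat{Q}_n(s, a)$ and $(\forall s, a, n)\ 0 \le \hat{Q}_n(s, a) \le Q(s, a)$.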

Convergence Theorem

Note we used the general fact that $|\max_a f_1(a) - \max_a f_2(a)| \le \max_a |f_1(a) - f_2(a)|$. This works with operators other than max that satisfy this non-expansion property [Szepesvári & Littman, 1999].
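The non-expansion property of max, $|\max_a f_1(a) - \max_a f_2(a)| \le \max_a |f_1(a) - f_2(a)|$, can be spot-checked numerically. The sketch below is only a sanity check on random vectors, not a proof; the helper names `max_gap` and `sup_gap` are hypothetical.

```python
import random


def max_gap(f1, f2):
    """Left-hand side: |max_a f1(a) - max_a f2(a)|."""
    return abs(max(f1) - max(f2))


def sup_gap(f1, f2):
    """Right-hand side: max_a |f1(a) - f2(a)|."""
    return max(abs(x - y) for x, y in zip(f1, f2))


rng = random.Random(0)
violations = 0
for _ in range(1000):
    f1 = [rng.uniform(-10.0, 10.0) for _ in range(5)]
    f2 = [rng.uniform(-10.0, 10.0) for _ in range(5)]
    if max_gap(f1, f2) > sup_gap(f1, f2) + 1e-12:
        violations += 1
print(violations)
```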

Nondeterministic Case (1)

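Nondeterministic Case (2)
Q learning generalizes to nondeterministic worlds. Alter the training rule to $\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n)\,\hat{Q}_{n-1}(s, a) + \alpha_n [r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a')]$, where $\alpha_n = \frac{1}{1 + \mathit{visits}_n(s, a)}$.
Can still prove convergence of $\hat{Q}$ to Q [Watkins and Dayan, 1992]. Standard properties: $\sum_n \alpha_n = \infty$, $\sum_n \alpha_n^2 < \infty$.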
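A minimal sketch of one update step for the nondeterministic case, assuming the decaying-average rule $\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n)\hat{Q}_{n-1}(s, a) + \alpha_n [r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a')]$ with $\alpha_n = 1 / (1 + \mathit{visits}_n(s, a))$. The function name `q_update`, the dict-based tables, and the sample transitions are assumptions made for this example.

```python
from collections import defaultdict


def q_update(q, visits, state, action, reward, next_state, actions, gamma=0.9):
    """Blend the old estimate with the new sample using a decaying learning rate."""
    visits[(state, action)] += 1
    # The learning rate shrinks as the pair (state, action) is visited more often.
    alpha = 1.0 / (1.0 + visits[(state, action)])
    target = reward + gamma * max(q[(next_state, a)] for a in actions)
    q[(state, action)] = (1.0 - alpha) * q[(state, action)] + alpha * target
    return q[(state, action)]


q = defaultdict(float)     # unseen entries default to 0.0
visits = defaultdict(int)  # visit counts per (state, action)

# Two hypothetical noisy samples for the same state-action pair.
first = q_update(q, visits, "s", "a", reward=10.0, next_state="t", actions=["a"])
second = q_update(q, visits, "s", "a", reward=0.0, next_state="t", actions=["a"])
```

Unlike the deterministic rule, the old estimate is not overwritten: each new sample moves it by a fraction $\alpha_n$, so a single unlucky transition cannot erase what was learned earlier.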

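Temporal Difference Learning (1)
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference: $Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_a \hat{Q}(s_{t+1}, a)$
Why not two steps? $Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_a \hat{Q}(s_{t+2}, a)$
Or n? $Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a)$
Blend all of these: $Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda)[Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots]$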
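The n-step estimates and their blend can be sketched for a recorded trajectory as follows. This assumes the n-step estimate is the first n discounted rewards plus $\gamma^n \max_a \hat{Q}(s_{t+n}, a)$, and that the blend weights the n-step estimate by $(1 - \lambda)\lambda^{n-1}$. Since a recorded trajectory is finite, the blend here is truncated at the trajectory length, which is a simplification; the function names and input layout are hypothetical.

```python
def n_step_estimate(rewards, bootstrap, n, gamma):
    """n discounted rewards plus gamma**n times the bootstrapped value at step n.

    rewards[i] is r_{t+i}; bootstrap[k] is max_a Q_hat(s_{t+k}, a), so
    bootstrap[0] is unused and len(bootstrap) must be at least n + 1.
    """
    discounted = sum(gamma**i * rewards[i] for i in range(n))
    return discounted + gamma**n * bootstrap[n]


def lambda_estimate(rewards, bootstrap, lam, gamma):
    """Truncated blend: (1 - lam) * sum over n of lam**(n-1) * n-step estimate."""
    horizon = len(rewards)
    return (1 - lam) * sum(
        lam ** (n - 1) * n_step_estimate(rewards, bootstrap, n, gamma)
        for n in range(1, horizon + 1)
    )


# Hypothetical two-step trajectory.
rewards = [1.0, 2.0]
bootstrap = [0.0, 10.0, 20.0]
one_step = n_step_estimate(rewards, bootstrap, n=1, gamma=0.5)
two_step = n_step_estimate(rewards, bootstrap, n=2, gamma=0.5)
blended = lambda_estimate(rewards, bootstrap, lam=0.5, gamma=0.5)
```

Setting `lam=0` recovers the one-step estimate used by plain Q learning, while larger values put more weight on longer lookaheads.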

Temporal Difference Learning (2)
Equivalent expression: $Q^{\lambda}(s_t, a_t) = r_t + \gamma[(1 - \lambda) \max_a \hat{Q}(s_{t+1}, a) + \lambda Q^{\lambda}(s_{t+1}, a_{t+1})]$

Subtleties and Ongoing Research
