3. Background
Supervised Learning: given labeled data, predict labels
Unsupervised Learning: given unlabeled data, learn structure in the data
Reinforcement Learning: given reward feedback, choose actions to maximize expected long-term reward
9. Why is it hard?
● No one-shot decisions: the Credit Assignment Problem
○ Example: losing chess on the 60th move - which earlier move was the mistake?
○ A helicopter crashing
○ A car crashing - was braking at fault?
● The Explore-Exploit Problem
○ Example: Brick Game
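The explore-exploit trade-off is easiest to see in a bandit setting: a purely greedy player keeps pulling the best arm seen so far and may never discover a better one. The sketch below is an invented illustration, not from the slides; the three arms, their pay-off means, and the ε value are all arbitrary choices:

```python
import random

# Hypothetical 3-armed bandit; the true means are unknown to the agent.
TRUE_MEANS = [0.3, 0.5, 0.7]

def pull(arm):
    """Stochastic reward: 1 with the arm's true mean probability, else 0."""
    return 1.0 if random.random() < TRUE_MEANS[arm] else 0.0

def epsilon_greedy(n_pulls=10_000, epsilon=0.1):
    counts = [0] * len(TRUE_MEANS)    # times each arm was pulled
    values = [0.0] * len(TRUE_MEANS)  # running mean reward per arm
    for _ in range(n_pulls):
        if random.random() < epsilon:  # explore: try a random arm
            arm = random.randrange(len(TRUE_MEANS))
        else:                          # exploit: best arm seen so far
            arm = max(range(len(TRUE_MEANS)), key=lambda a: values[a])
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean
    return values, counts

print(epsilon_greedy())
```

With ε = 0 the player can lock onto a mediocre arm forever; a small ε keeps exploration alive.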
15. Formalize: Markov Decision Processes (MDPs)
(S, A, {Psa}, γ, R)
S: set of states
A: set of actions
{Psa}: state transition distributions
γ: discount factor
R: reward function
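A minimal sketch of how this five-tuple can be written down for a toy two-state problem. Only the structure matches the slide; the states, actions, probabilities, and rewards below are invented for illustration:

```python
# A tiny finite MDP: (S, A, {Psa}, gamma, R) as plain Python data.
S = ["s0", "s1"]       # set of states
A = ["left", "right"]  # set of actions
gamma = 0.9            # discount factor

# {Psa}: P[s][a][s'] = probability of landing in s' after taking a in s.
P = {
    "s0": {"left": {"s0": 1.0},            "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.8, "s1": 0.2}, "right": {"s1": 1.0}},
}

# R: reward function over states.
R = {"s0": 0.0, "s1": 1.0}
```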
36. Definitions
1. For any policy π, the value function Vπ: S → ℝ is the expected total pay-off from starting at state s and executing π:
Vπ(s) = E[R(s0) + γR(s1) + ... + γⁿR(sn) | s0 = s, π]
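One way to make that expectation concrete is a Monte Carlo estimate: sample many rollouts under π and average their discounted pay-offs. This sketch assumes the toy S, A, P, R, gamma from the previous block are in scope; the policy pi and horizon n are arbitrary choices:

```python
import random

# Assumes S, A, P, R, gamma from the MDP sketch above are in scope.

def sample_next(s, a):
    """Draw s' ~ Psa from the transition dict P."""
    r, acc = random.random(), 0.0
    for s2, p in P[s][a].items():
        acc += p
        if r < acc:
            return s2
    return s2  # guard against floating-point round-off

def monte_carlo_V(pi, s, n=50, n_rollouts=5000):
    """Estimate Vpi(s) = E[R(s0) + g*R(s1) + ... + g^n R(sn) | s0 = s, pi]."""
    total = 0.0
    for _ in range(n_rollouts):
        state, payoff = s, 0.0
        for t in range(n + 1):
            payoff += (gamma ** t) * R[state]
            state = sample_next(state, pi[state])
        total += payoff
    return total / n_rollouts

pi = {"s0": "right", "s1": "right"}  # an arbitrary fixed policy
print(monte_carlo_V(pi, "s0"))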
38. More concretely...
Given π (one action arrow per cell of a 3×4 grid; the blank cell is a wall):
→    →    →    +1
↓         →    -1
→    →    ↑    ←
Compute Vπ(s) for every state:
.52   .73   .77   +1
-.9         -.8   -1
-.8   -.8   -.8   -1
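As the next slides derive, Vπ satisfies Vπ(s) = R(s) + γ∑s’ P(s’)Vπ(s’); over a finite state set that is a linear system V = R + γPπV, so values like the grid's can be computed exactly with one matrix solve. A sketch with NumPy, using placeholder two-state numbers rather than the grid world's actual dynamics:

```python
import numpy as np

# Exact policy evaluation: solve (I - gamma * P_pi) V = R.
# P_pi[i, j] = P(s_j | s_i, pi(s_i)); these numbers are placeholders.
gamma = 0.99
P_pi = np.array([[0.2, 0.8],
                 [0.8, 0.2]])
R = np.array([0.0, 1.0])

V = np.linalg.solve(np.eye(len(R)) - gamma * P_pi, R)
print(V)  # V_pi for each state
```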
43. Given any policy, the value function can be written recursively:
Vπ(s) = E[R(s0) + γR(s1) + ... + γⁿR(sn) | s0 = s, π]
Vπ(s) = E[R(s0) + γ(R(s1) + ... + γⁿ⁻¹R(sn)) | s0 = s, π]
With s0 = s and s1 = s’, the inner sum is just the pay-off of starting at s’:
Vπ(s) = R(s) + γVπ(s’)
But s’ is a random variable drawn from Psπ(s), so take the expectation over it:
Vπ(s) = R(s) + γ ∑s’ Psπ(s)(s’) Vπ(s’)
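This fixed-point form also suggests an iterative evaluation scheme: use the right-hand side as an update rule and sweep over the states until V stops changing. A sketch in the dict representation assumed earlier; the function name and tolerance are my choices:

```python
def evaluate_policy(S, P, R, pi, gamma, tol=1e-8):
    """Iterate V(s) <- R(s) + gamma * sum_s' P_{s,pi(s)}(s') V(s') to a fixed point."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = R[s] + gamma * sum(p * V[s2] for s2, p in P[s][pi[s]].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# Example with the toy MDP and policy from the earlier sketches:
# print(evaluate_policy(S, P, R, {"s0": "right", "s1": "right"}, gamma))
```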
49. 2. Bellman Equation for the Optimal Value Function
V*(s) = R(s) + maxa γ ∑s’ Psa(s’) V*(s’)
R(s): the immediate reward
maxa γ ∑s’ Psa(s’) V*(s’): choose the action that maximizes the expected future pay-off
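Swapping the fixed policy for a max over actions turns the same sweep into value iteration, which converges to V* and lets an optimal policy be read off greedily. A sketch under the same assumed dict representation:

```python
def value_iteration(S, A, P, R, gamma, tol=1e-8):
    """Iterate V(s) <- R(s) + max_a gamma * sum_s' Psa(s') V(s')."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = R[s] + max(
                gamma * sum(p * V[s2] for s2, p in P[s][a].items()) for a in A
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Read off an optimal policy greedily from V*.
    pi = {
        s: max(A, key=lambda a: sum(p * V[s2] for s2, p in P[s][a].items()))
        for s in S
    }
    return V, pi
```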