3. Background
Supervised Learning: given labeled data, predict labels
Unsupervised Learning: given unlabeled data, learn structure in the data
Reinforcement Learning: given reward feedback, choose actions to maximize expected long-term reward
9. Why is it hard?
● No one-shot decisions: the Credit Assignment Problem
○ Example: losing chess on the 60th move - which earlier move was the mistake?
○ A helicopter crashing
○ A car crashing - was braking at fault?
● The Explore-Exploit Problem
○ Example: Brick Game
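The explore-exploit trade-off is easiest to see in a bandit setting: a purely greedy player keeps pulling the best arm seen so far and may never discover a better one. The sketch below is an invented illustration, not from the slides; the three arms, their pay-off means, and the ε value are all arbitrary choices:

```python
import random

# Hypothetical 3-armed bandit; the true means are unknown to the agent.
TRUE_MEANS = [0.3, 0.5, 0.7]

def pull(arm):
    """Stochastic reward: 1 with the arm's true mean probability, else 0."""
    return 1.0 if random.random() < TRUE_MEANS[arm] else 0.0

def epsilon_greedy(n_pulls=10_000, epsilon=0.1):
    counts = [0] * len(TRUE_MEANS)    # times each arm was pulled
    values = [0.0] * len(TRUE_MEANS)  # running mean reward per arm
    for _ in range(n_pulls):
        if random.random() < epsilon:  # explore: try a random arm
            arm = random.randrange(len(TRUE_MEANS))
        else:                          # exploit: best arm seen so far
            arm = max(range(len(TRUE_MEANS)), key=lambda a: values[a])
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean
    return values, counts

print(epsilon_greedy())
```

With ε = 0 the player can lock onto a mediocre arm forever; a small ε keeps exploration alive.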
15. Formalize: Markov Decision Processes (MDPs)
(S, A, {Psa}, γ, R)
S: set of states
A: set of actions
{Psa}: state transition distributions
γ: discount factor
R: reward function
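A minimal sketch of how this five-tuple can be written down for a toy two-state problem. Only the structure matches the slide; the states, actions, probabilities, and rewards below are invented for illustration:

```python
# A tiny finite MDP: (S, A, {Psa}, gamma, R) as plain Python data.
S = ["s0", "s1"]       # set of states
A = ["left", "right"]  # set of actions
gamma = 0.9            # discount factor

# {Psa}: P[s][a][s'] = probability of landing in s' after taking a in s.
P = {
    "s0": {"left": {"s0": 1.0},            "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.8, "s1": 0.2}, "right": {"s1": 1.0}},
}

# R: reward function over states.
R = {"s0": 0.0, "s1": 1.0}
```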
36. Definitions
1. For any policy π, the value function Vπ: S → ℝ is the expected total pay-off from starting at state s and executing π:
Vπ(s) = E[R(s0) + γR(s1) + ... + γⁿR(sn) | s0 = s, π]
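One way to make that expectation concrete is a Monte Carlo estimate: sample many rollouts under π and average their discounted pay-offs. This sketch assumes the toy S, A, P, R, gamma from the previous block are in scope; the policy pi and horizon n are arbitrary choices:

```python
import random

# Assumes S, A, P, R, gamma from the MDP sketch above are in scope.

def sample_next(s, a):
    """Draw s' ~ Psa from the transition dict P."""
    r, acc = random.random(), 0.0
    for s2, p in P[s][a].items():
        acc += p
        if r < acc:
            return s2
    return s2  # guard against floating-point round-off

def monte_carlo_V(pi, s, n=50, n_rollouts=5000):
    """Estimate Vpi(s) = E[R(s0) + g*R(s1) + ... + g^n R(sn) | s0 = s, pi]."""
    total = 0.0
    for _ in range(n_rollouts):
        state, payoff = s, 0.0
        for t in range(n + 1):
            payoff += (gamma ** t) * R[state]
            state = sample_next(state, pi[state])
        total += payoff
    return total / n_rollouts

pi = {"s0": "right", "s1": "right"}  # an arbitrary fixed policy
print(monte_carlo_V(pi, "s0"))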
38. More concretely...
Given π (one action arrow per cell of a 3×4 grid; the blank cell is a wall):
→    →    →    +1
↓         →    -1
→    →    ↑    ←
Compute Vπ(s) for every state:
.52   .73   .77   +1
-.9         -.8   -1
-.8   -.8   -.8   -1
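As the next slides derive, Vπ satisfies Vπ(s) = R(s) + γ∑s’ P(s’)Vπ(s’); over a finite state set that is a linear system V = R + γPπV, so values like the grid's can be computed exactly with one matrix solve. A sketch with NumPy, using placeholder two-state numbers rather than the grid world's actual dynamics:

```python
import numpy as np

# Exact policy evaluation: solve (I - gamma * P_pi) V = R.
# P_pi[i, j] = P(s_j | s_i, pi(s_i)); these numbers are placeholders.
gamma = 0.99
P_pi = np.array([[0.2, 0.8],
                 [0.8, 0.2]])
R = np.array([0.0, 1.0])

V = np.linalg.solve(np.eye(len(R)) - gamma * P_pi, R)
print(V)  # V_pi for each state
```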
43. Given any policy, the value function can be written recursively:
Vπ(s) = E[R(s0) + γR(s1) + ... + γⁿR(sn) | s0 = s, π]
Vπ(s) = E[R(s0) + γ(R(s1) + ... + γⁿ⁻¹R(sn)) | s0 = s, π]
With s0 = s and s1 = s’, the inner sum is just the pay-off of starting at s’:
Vπ(s) = R(s) + γVπ(s’)
But s’ is a random variable drawn from Psπ(s), so take the expectation over it:
Vπ(s) = R(s) + γ ∑s’ Psπ(s)(s’) Vπ(s’)
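This fixed-point form also suggests an iterative evaluation scheme: use the right-hand side as an update rule and sweep over the states until V stops changing. A sketch in the dict representation assumed earlier; the function name and tolerance are my choices:

```python
def evaluate_policy(S, P, R, pi, gamma, tol=1e-8):
    """Iterate V(s) <- R(s) + gamma * sum_s' P_{s,pi(s)}(s') V(s') to a fixed point."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = R[s] + gamma * sum(p * V[s2] for s2, p in P[s][pi[s]].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# Example with the toy MDP and policy from the earlier sketches:
# print(evaluate_policy(S, P, R, {"s0": "right", "s1": "right"}, gamma))
```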
49. 2. Bellman Equation for the Optimal Value Function
V*(s) = R(s) + maxa γ ∑s’ Psa(s’) V*(s’)
R(s): the immediate reward
maxa γ ∑s’ Psa(s’) V*(s’): choose the action that maximizes the expected future pay-off
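Swapping the fixed policy for a max over actions turns the same sweep into value iteration, which converges to V* and lets an optimal policy be read off greedily. A sketch under the same assumed dict representation:

```python
def value_iteration(S, A, P, R, gamma, tol=1e-8):
    """Iterate V(s) <- R(s) + max_a gamma * sum_s' Psa(s') V(s')."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = R[s] + max(
                gamma * sum(p * V[s2] for s2, p in P[s][a].items()) for a in A
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Read off an optimal policy greedily from V*.
    pi = {
        s: max(A, key=lambda a: sum(p * V[s2] for s2, p in P[s][a].items()))
        for s in S
    }
    return V, pi
```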