Lecture 2: Markov Decision Processes
David Silver
1 Markov Processes
2 Markov Reward Processes
3 Markov Decision Processes
4 Extensions to MDPs
Markov Processes
Introduction to MDPs
Markov decision processes formally describe an environment
for reinforcement learning
Where the environment is fully observable
i.e. The current state completely characterises the process
Almost all RL problems can be formalised as MDPs, e.g.
Optimal control primarily deals with continuous MDPs
Partially observable problems can be converted into MDPs
Bandits are MDPs with one state
Markov Property
“The future is independent of the past given the present”
Definition
A state St is Markov if and only if
P [St+1 | St] = P [St+1 | S1, ..., St]
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
State Transition Matrix

For a Markov state s and successor state s', the state transition probability is defined by

$$P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$$

The state transition matrix P defines transition probabilities from all states s to all successor states s',

$$P = \begin{bmatrix} P_{11} & \dots & P_{1n} \\ \vdots & & \vdots \\ P_{n1} & \dots & P_{nn} \end{bmatrix}$$
where each row of the matrix sums to 1.
Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property.

Definition
A Markov Process (or Markov Chain) is a tuple ⟨S, P⟩
S is a (finite) set of states
P is a state transition probability matrix, $P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
Example: Student Markov Chain

[Figure: Student Markov chain: a transition graph over the states Class 1, Class 2, Class 3, Pass, Pub, Facebook and Sleep, with edges labelled by the transition probabilities given in the transition matrix below; Sleep is the terminal state.]
Example: Student Markov Chain Episodes

[Figure: Student Markov chain transition graph, as on the previous slide.]

Sample episodes for the Student Markov Chain starting from S1 = C1:

S1, S2, ..., ST

C1 C2 C3 Pass Sleep
C1 FB FB C1 C2 Sleep
C1 C2 C3 Pub C2 C3 Pass Sleep
C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
Example: Student Markov Chain Transition Matrix

[Figure: Student Markov chain transition graph, as above.]

P (rows = from, columns = to):

        C1    C2    C3    Pass  Pub   FB    Sleep
C1            0.5                     0.5
C2                  0.8                     0.2
C3                        0.6   0.4
Pass                                        1.0
Pub     0.2   0.4   0.4
FB      0.1                           0.9
Sleep                                       1.0

Blank entries are zero; each row sums to 1.
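To make the matrix concrete, here is a minimal sketch (my own, not part of the original slides; the state ordering and the `sample_episode` helper are assumptions for illustration) that encodes the Student Markov chain in NumPy and samples episodes like the ones shown above.

```python
import numpy as np

# States of the Student Markov chain, in the same order as the matrix above.
STATES = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]

# P[i, j] = probability of moving from STATES[i] to STATES[j]; each row sums to 1.
P = np.array([
    #  C1   C2   C3  Pass  Pub   FB  Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep (terminal, modelled as absorbing)
])
assert np.allclose(P.sum(axis=1), 1.0)  # every row is a probability distribution


def sample_episode(rng, start="C1", terminal="Sleep", max_steps=100):
    """Sample S1, S2, ..., ST from the chain until the terminal state is reached."""
    s = STATES.index(start)
    episode = [STATES[s]]
    while STATES[s] != terminal and len(episode) < max_steps:
        s = rng.choice(len(STATES), p=P[s])
        episode.append(STATES[s])
    return episode


rng = np.random.default_rng(0)
for _ in range(3):
    print(" ".join(sample_episode(rng)))
```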
Markov Reward Processes

Markov Reward Process

A Markov reward process is a Markov chain with values.

Definition
A Markov Reward Process is a tuple ⟨S, P, R, γ⟩
S is a finite set of states
P is a state transition probability matrix, $P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
R is a reward function, $R_s = E[R_{t+1} \mid S_t = s]$
γ is a discount factor, γ ∈ [0, 1]
Example: Student MRP

[Figure: Student MRP: the Student Markov chain with a reward attached to each state: R = -2 for Class 1, Class 2 and Class 3, R = -1 for Facebook, R = +1 for Pub, R = +10 for Pass and R = 0 for Sleep.]
Return

Definition
The return Gt is the total discounted reward from time-step t,

$$G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount γ ∈ [0, 1] is the present value of future rewards
The value of receiving reward R after k + 1 time-steps is $\gamma^k R$
This values immediate reward above delayed reward
γ close to 0 leads to "myopic" evaluation
γ close to 1 leads to "far-sighted" evaluation
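As a quick illustration (my own sketch, not from the slides), the return can be computed by folding the reward sequence backwards, since $G_t = R_{t+1} + \gamma G_{t+1}$.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards = [R_{t+1}, R_{t+2}, ...] and discount gamma."""
    g = 0.0
    for r in reversed(rewards):  # G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g


# Rewards along the Student MRP episode C1 C2 C3 Pass Sleep with gamma = 1/2
# (compare the "Student MRP Returns" example below):
print(discounted_return([-2, -2, -2, 10], gamma=0.5))  # -2.25
```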
Why discount?
Most Markov reward and decision processes are discounted. Why?
Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future may not be fully represented
If the reward is financial, immediate rewards may earn more
interest than delayed rewards
Animal/human behaviour shows preference for immediate
reward
It is sometimes possible to use undiscounted Markov reward
processes (i.e. γ = 1), e.g. if all sequences terminate.
Value Function
The value function v(s) gives the long-term value of state s
Definition
The state value function v(s) of an MRP is the expected return
starting from state s
v(s) = E [Gt | St = s]
Example: Student MRP Returns

Sample returns for the Student MRP, starting from S1 = C1 with γ = 1/2:

$$G_1 = R_2 + \gamma R_3 + \ldots + \gamma^{T-2} R_T$$

C1 C2 C3 Pass Sleep:
  v1 = -2 - 2*(1/2) - 2*(1/4) + 10*(1/8) = -2.25
C1 FB FB C1 C2 Sleep:
  v1 = -2 - 1*(1/2) - 1*(1/4) - 2*(1/8) - 2*(1/16) = -3.125
C1 C2 C3 Pub C2 C3 Pass Sleep:
  v1 = -2 - 2*(1/2) - 2*(1/4) + 1*(1/8) - 2*(1/16) ... = -3.41
C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep:
  v1 = -2 - 1*(1/2) - 1*(1/4) - 2*(1/8) - 2*(1/16) ... = -3.20
Example: State-Value Function for Student MRP (1)

[Figure: Student MRP annotated with v(s) for γ = 0: v = -2 for Class 1, Class 2 and Class 3, v = -1 for Facebook, v = +1 for Pub, v = +10 for Pass and v = 0 for Sleep.]
Example: State-Value Function for Student MRP (2)

[Figure: Student MRP annotated with v(s) for γ = 0.9: v(Class 1) = -5.0, v(Class 2) = 0.9, v(Class 3) = 4.1, v(Pass) = 10, v(Pub) = 1.9, v(Facebook) = -7.6, v(Sleep) = 0.]
Example: State-Value Function for Student MRP (3)

[Figure: Student MRP annotated with v(s) for γ = 1: v(Class 1) = -13, v(Class 2) = 1.5, v(Class 3) = 4.3, v(Pass) = 10, v(Pub) = +0.8, v(Facebook) = -23, v(Sleep) = 0.]
Bellman Equation for MRPs

The value function can be decomposed into two parts:
immediate reward $R_{t+1}$
discounted value of successor state $\gamma v(S_{t+1})$

$$\begin{aligned}
v(s) &= E[G_t \mid S_t = s] \\
     &= E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s] \\
     &= E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \ldots) \mid S_t = s] \\
     &= E[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
     &= E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]
\end{aligned}$$
Bellman Equation for MRPs (2)

$$v(s) = E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$$

[Figure: one-step backup diagram: root node s with value v(s), reward r on the transition, successor states s' with value v(s').]

$$v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'}\, v(s')$$
Example: Bellman Equation for Student MRP

[Figure: Student MRP with v(s) for γ = 1, as above. For Class 3: 4.3 = -2 + 0.6*10 + 0.4*0.8]
Bellman Equation in Matrix Form

The Bellman equation can be expressed concisely using matrices,

$$v = R + \gamma P v$$

where v is a column vector with one entry per state,

$$\begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix}
= \begin{bmatrix} R_1 \\ \vdots \\ R_n \end{bmatrix}
+ \gamma \begin{bmatrix} P_{11} & \dots & P_{1n} \\ \vdots & & \vdots \\ P_{n1} & \dots & P_{nn} \end{bmatrix}
\begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix}$$
Solving the Bellman Equation

The Bellman equation is a linear equation
It can be solved directly:

$$v = R + \gamma P v$$
$$(I - \gamma P)\, v = R$$
$$v = (I - \gamma P)^{-1} R$$

Computational complexity is O(n^3) for n states
Direct solution only possible for small MRPs
There are many iterative methods for large MRPs, e.g.
Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning
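As a concrete check, the following sketch (mine, not from the slides; the reward vector ordering matches the state order C1, C2, C3, Pass, Pub, FB, Sleep used earlier) solves the Student MRP directly. Note that for γ = 1 the matrix I - γP is singular here because Sleep is absorbing, so the terminal state has to be handled separately; the direct solve is shown for γ < 1.

```python
import numpy as np

STATES = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep
])
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])  # state rewards of the Student MRP


def mrp_values(P, R, gamma):
    """Solve (I - gamma P) v = R exactly; O(n^3), fine for small MRPs."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P, R)


for gamma in (0.0, 0.9):
    v = mrp_values(P, R, gamma)
    print(gamma, dict(zip(STATES, np.round(v, 1))))  # should match the v(s) figures above
```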
Markov Decision Processes

Markov Decision Process

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

Definition
A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩
S is a finite set of states
A is a finite set of actions
P is a state transition probability matrix, $P^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
R is a reward function, $R^a_s = E[R_{t+1} \mid S_t = s, A_t = a]$
γ is a discount factor, γ ∈ [0, 1].
Example: Student MDP

[Figure: Student MDP: the student example with actions (Study, Facebook, Quit, Sleep, Pub) rather than fixed transitions. Action rewards are R = -2 for the first two Study actions, R = +10 for the final Study action, R = -1 for Facebook, R = 0 for Quit and Sleep, and R = +1 for Pub, which returns to the three class states with probabilities 0.2, 0.4 and 0.4.]
Policies (1)
Definition
A policy π is a distribution over actions given states,
π(a|s) = P [At = a | St = s]
A policy fully defines the behaviour of an agent
MDP policies depend on the current state (not the history)
i.e. Policies are stationary (time-independent),
At ∼ π(·|St), ∀t > 0
Policies (2)

Given an MDP M = ⟨S, A, P, R, γ⟩ and a policy π
The state sequence S1, S2, ... is a Markov process ⟨S, P^π⟩
The state and reward sequence S1, R2, S2, ... is a Markov reward process ⟨S, P^π, R^π, γ⟩
where

$$P^{\pi}_{s,s'} = \sum_{a \in A} \pi(a|s)\, P^a_{ss'} \qquad R^{\pi}_s = \sum_{a \in A} \pi(a|s)\, R^a_s$$
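A small sketch of this averaging step (my own illustration; the array shapes and the toy numbers are assumptions, not from the slides): given tensors `P[a, s, s']` and `R[a, s]` for a finite MDP and a policy matrix `pi[s, a]`, the induced MRP is obtained by summing over actions.

```python
import numpy as np


def induced_mrp(P, R, pi):
    """Average MDP dynamics and rewards over a policy.

    P  : shape (A, S, S), P[a, s, t] = P[S_{t+1}=t | S_t=s, A_t=a]
    R  : shape (A, S),    R[a, s]    = E[R_{t+1} | S_t=s, A_t=a]
    pi : shape (S, A),    pi[s, a]   = pi(a | s)
    Returns P_pi with shape (S, S) and R_pi with shape (S,).
    """
    P_pi = np.einsum("sa,ast->st", pi, P)  # P_pi[s, t] = sum_a pi(a|s) P[a, s, t]
    R_pi = np.einsum("sa,as->s", pi, R)    # R_pi[s]    = sum_a pi(a|s) R[a, s]
    return P_pi, R_pi


# Tiny made-up 2-state, 2-action MDP and a uniform random policy, for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # dynamics under action 0
              [[0.5, 0.5], [0.0, 1.0]]])  # dynamics under action 1
R = np.array([[1.0, 0.0],                 # rewards under action 0
              [2.0, -1.0]])               # rewards under action 1
pi = np.full((2, 2), 0.5)
P_pi, R_pi = induced_mrp(P, R, pi)
print(P_pi)  # rows still sum to 1
print(R_pi)
```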
Value Function
Definition
The state-value function vπ(s) of an MDP is the expected return
starting from state s, and then following policy π
vπ(s) = Eπ [Gt | St = s]
Definition
The action-value function qπ(s, a) is the expected return
starting from state s, taking action a, and then following policy π
qπ(s, a) = Eπ [Gt | St = s, At = a]
Example: State-Value Function for Student MDP

[Figure: Student MDP annotated with v_π(s) for the uniform random policy π(a|s) = 0.5 and γ = 1: v_π = -1.3, 2.7 and 7.4 for the three class states, -2.3 for the Facebook state and 0 for the terminal state.]
Bellman Expectation Equation
The state-value function can again be decomposed into immediate
reward plus discounted value of successor state,
vπ(s) = Eπ [Rt+1 + γvπ(St+1) | St = s]
The action-value function can similarly be decomposed,
qπ(s, a) = Eπ [Rt+1 + γqπ(St+1, At+1) | St = s, At = a]
Bellman Expectation Equation for V^π

[Figure: one-step backup diagram: state s with value v_π(s) at the root, actions a with value q_π(s, a) at the leaves.]

$$v_{\pi}(s) = \sum_{a \in A} \pi(a|s)\, q_{\pi}(s, a)$$
Bellman Expectation Equation for Q^π

[Figure: one-step backup diagram: state-action pair (s, a) with value q_π(s, a) at the root, reward r, successor states s' with value v_π(s') at the leaves.]

$$q_{\pi}(s, a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}\, v_{\pi}(s')$$
Bellman Expectation Equation for v_π (2)

[Figure: two-step backup diagram: state s at the root, actions a, reward r, successor states s' with value v_π(s') at the leaves.]

$$v_{\pi}(s) = \sum_{a \in A} \pi(a|s) \left( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}\, v_{\pi}(s') \right)$$
Bellman Expectation Equation for q_π (2)

[Figure: two-step backup diagram: state-action pair (s, a) at the root, reward r, successor states s', successor actions a' with value q_π(s', a') at the leaves.]

$$q_{\pi}(s, a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} \sum_{a' \in A} \pi(a'|s')\, q_{\pi}(s', a')$$
Example: Bellman Expectation Equation in Student MDP

[Figure: Student MDP with v_π(s) for the uniform random policy and γ = 1, as above. For the state with value 7.4:
7.4 = 0.5 * (1 + 0.2*(-1.3) + 0.4*2.7 + 0.4*7.4) + 0.5 * 10]
Bellman Expectation Equation (Matrix Form)

The Bellman expectation equation can be expressed concisely using the induced MRP,

$$v_{\pi} = R^{\pi} + \gamma P^{\pi} v_{\pi}$$

with direct solution

$$v_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}$$
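Continuing the sketch from the Policies slide (same assumed array shapes, not from the slides), policy evaluation in matrix form is then a single linear solve; as before this needs γ < 1 or terminal states handled explicitly.

```python
import numpy as np


def evaluate_policy(P, R, pi, gamma):
    """v_pi = (I - gamma P_pi)^{-1} R_pi, using the MRP induced by the policy."""
    P_pi = np.einsum("sa,ast->st", pi, P)  # induced transition matrix
    R_pi = np.einsum("sa,as->s", pi, R)    # induced reward vector
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```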
Optimal Value Function

Definition
The optimal state-value function v∗(s) is the maximum value function over all policies,

$$v_*(s) = \max_{\pi} v_{\pi}(s)$$

The optimal action-value function q∗(s, a) is the maximum action-value function over all policies,

$$q_*(s, a) = \max_{\pi} q_{\pi}(s, a)$$

The optimal value function specifies the best possible performance in the MDP.
An MDP is “solved” when we know the optimal value function.
Example: Optimal Value Function for Student MDP

[Figure: Student MDP annotated with v∗(s) for γ = 1: v∗ = 6, 8 and 10 for the three class states, 6 for the Facebook state and 0 for the terminal state.]
Example: Optimal Action-Value Function for Student MDP

[Figure: Student MDP annotated with q∗(s, a) for γ = 1: q∗ = 6 for Study and 5 for Facebook in the first class state, q∗ = 8 for Study and 0 for Sleep in the second, q∗ = 10 for Study and 8.4 for Pub in the third, and q∗ = 6 for Quit and 5 for Facebook in the Facebook state.]
Optimal Policy

Define a partial ordering over policies,

$$\pi \geq \pi' \;\; \text{if} \;\; v_{\pi}(s) \geq v_{\pi'}(s), \; \forall s$$

Theorem
For any Markov Decision Process
There exists an optimal policy π∗ that is better than or equal to all other policies, π∗ ≥ π, ∀π
All optimal policies achieve the optimal value function, $v_{\pi_*}(s) = v_*(s)$
All optimal policies achieve the optimal action-value function, $q_{\pi_*}(s, a) = q_*(s, a)$
Finding an Optimal Policy

An optimal policy can be found by maximising over q∗(s, a),

$$\pi_*(a|s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$$

There is always a deterministic optimal policy for any MDP
If we know q∗(s, a), we immediately have the optimal policy
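A minimal sketch of reading off this greedy policy from a q∗ table (my own illustration; representing q∗ as an S × A array is an assumption).

```python
import numpy as np


def greedy_policy(q_star):
    """pi*(a|s) = 1 for a = argmax_a q*(s, a), 0 otherwise (ties broken by argmax)."""
    n_states, n_actions = q_star.shape
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), q_star.argmax(axis=1)] = 1.0
    return pi


# Made-up 2-state, 2-action table: the best action is 1 in state 0 and 0 in state 1.
q_star = np.array([[5.0, 6.0],
                   [8.0, 0.0]])
print(greedy_policy(q_star))  # [[0. 1.] [1. 0.]]
```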
Example: Optimal Policy for Student MDP

[Figure: Student MDP with the optimal policy π∗(a|s) for γ = 1 highlighted: in each state the action with maximal q∗ is taken (Study in each class state, Quit in the Facebook state), with the same q∗ values (5, 6, 6, 5, 8, 0, 10, 8.4) as on the previous slide.]
Bellman Optimality Equation for v∗

The optimal value functions are recursively related by the Bellman optimality equations:

[Figure: backup diagram: state s with value v∗(s) at the root, actions a with value q∗(s, a) at the leaves; the backup takes a max over actions.]

$$v_*(s) = \max_a q_*(s, a)$$
Bellman Optimality Equation for Q∗

[Figure: backup diagram: state-action pair (s, a) with value q∗(s, a) at the root, reward r, successor states s' with value v∗(s') at the leaves.]

$$q_*(s, a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}\, v_*(s')$$
Bellman Optimality Equation for V∗ (2)

[Figure: two-step backup diagram: state s at the root, max over actions a, then reward r and successor states s' with value v∗(s').]

$$v_*(s) = \max_a \left( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}\, v_*(s') \right)$$
Bellman Optimality Equation for Q∗ (2)

[Figure: two-step backup diagram: state-action pair (s, a) at the root, reward r, successor states s', then a max over successor actions a' with value q∗(s', a').]

$$q_*(s, a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} \max_{a'} q_*(s', a')$$
Example: Bellman Optimality Equation in Student MDP

[Figure: Student MDP with v∗(s) for γ = 1, as above. For the state with value 6:
6 = max {-2 + 8, -1 + 6}]
Solving the Bellman Optimality Equation
Bellman Optimality Equation is non-linear
No closed form solution (in general)
Many iterative solution methods
Value Iteration
Policy Iteration
Q-learning
Sarsa
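As a taste of the first method on this list (value iteration is covered in detail in a later lecture), here is a minimal sketch for a finite MDP given as arrays `P[a, s, s']` and `R[a, s]`; the shapes and stopping rule are my own assumptions, not from the slides.

```python
import numpy as np


def value_iteration(P, R, gamma, max_iters=10_000, tol=1e-8):
    """Iterate the Bellman optimality backup v <- max_a (R^a + gamma P^a v)."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(max_iters):
        q = R + gamma * np.einsum("ast,t->as", P, v)  # q[a, s] under the current v
        v_new = q.max(axis=0)                         # greedy backup over actions
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return v, q.argmax(axis=0)  # optimal values and a deterministic greedy policy
```

Q-learning and Sarsa, listed above, work from sampled transitions rather than the model P and R; they are covered in later lectures.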
Extensions to MDPs
Extensions to MDPs (no exam)
Infinite and continuous MDPs
Partially observable MDPs
Undiscounted, average reward MDPs
Infinite MDPs (no exam)
The following extensions are all possible:
Countably infinite state and/or action spaces
Straightforward
Continuous state and/or action spaces
Closed form for linear quadratic model (LQR)
Continuous time
Requires partial differential equations
Hamilton-Jacobi-Bellman (HJB) equation
Limiting case of Bellman equation as time-step → 0
POMDPs (no exam)

A Partially Observable Markov Decision Process is an MDP with hidden states. It is a hidden Markov model with actions.

Definition
A POMDP is a tuple ⟨S, A, O, P, R, Z, γ⟩
S is a finite set of states
A is a finite set of actions
O is a finite set of observations
P is a state transition probability matrix, $P^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
R is a reward function, $R^a_s = E[R_{t+1} \mid S_t = s, A_t = a]$
Z is an observation function, $Z^a_{s'o} = P[O_{t+1} = o \mid S_{t+1} = s', A_t = a]$
γ is a discount factor, γ ∈ [0, 1].
Belief States (no exam)

Definition
A history Ht is a sequence of actions, observations and rewards,

$$H_t = A_0, O_1, R_1, \ldots, A_{t-1}, O_t, R_t$$

Definition
A belief state b(h) is a probability distribution over states, conditioned on the history h,

$$b(h) = (P[S_t = s^1 \mid H_t = h], \ldots, P[S_t = s^n \mid H_t = h])$$
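For intuition, here is a sketch of the standard Bayesian belief-state update (not shown on the slides; the array shapes follow the definitions above but are otherwise my own assumption).

```python
import numpy as np


def belief_update(b, a, o, P, Z):
    """b'(s') proportional to Z[a, s', o] * sum_s P[a, s, s'] b(s): Bayes' rule on the hidden state.

    b : shape (S,), current belief over states
    a : index of the action taken, o : index of the observation received
    P : shape (A, S, S), P[a, s, t] = P[S_{t+1}=t | S_t=s, A_t=a]
    Z : shape (A, S, O), Z[a, t, o] = P[O_{t+1}=o | S_{t+1}=t, A_t=a]
    """
    predicted = b @ P[a]                   # predictive distribution over the next state
    unnormalised = Z[a, :, o] * predicted  # weight by the likelihood of the observation
    return unnormalised / unnormalised.sum()
```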
Reductions of POMDPs (no exam)

The history Ht satisfies the Markov property
The belief state b(Ht) satisfies the Markov property

[Figure: a history tree (root, then a1, a2; then a1o1, a1o2, a2o1, a2o2; then a1o1a1, a1o1a2, ...) alongside the corresponding belief tree (P(s), then P(s|a1), P(s|a2); then P(s|a1o1), ...; then P(s|a1o1a1), ...), both branching over actions and observations.]

A POMDP can be reduced to an (infinite) history tree
A POMDP can be reduced to an (infinite) belief state tree
Ergodic Markov Process (no exam)

An ergodic Markov process is
Recurrent: each state is visited an infinite number of times
Aperiodic: each state is visited without any systematic period

Theorem
An ergodic Markov process has a limiting stationary distribution d^π(s) with the property

$$d^{\pi}(s) = \sum_{s' \in S} d^{\pi}(s')\, P_{s' s}$$
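A quick sketch (my own, not from the slides) of computing this stationary distribution by repeatedly applying d ← dP until it stops changing, which converges for an ergodic chain.

```python
import numpy as np


def stationary_distribution(P, max_iters=100_000, tol=1e-12):
    """Fixed point of d = d P for an ergodic transition matrix P (rows sum to 1)."""
    n = P.shape[0]
    d = np.full(n, 1.0 / n)  # start from the uniform distribution
    for _ in range(max_iters):
        d_next = d @ P
        if np.max(np.abs(d_next - d)) < tol:
            return d_next
        d = d_next
    return d


# Made-up 2-state ergodic chain for illustration.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(stationary_distribution(P))  # approximately [0.833, 0.167]
```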
Ergodic MDP (no exam)

Definition
An MDP is ergodic if the Markov chain induced by any policy is ergodic.

For any policy π, an ergodic MDP has an average reward per time-step ρ^π that is independent of start state,

$$\rho^{\pi} = \lim_{T \to \infty} \frac{1}{T}\, E\left[ \sum_{t=1}^{T} R_t \right]$$
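By ergodicity, ρ^π can also be estimated by simply averaging rewards along one long trajectory of the induced chain; a small sketch of my own (the toy chain and rewards are made up, not from the slides).

```python
import numpy as np


def estimate_rho(P_pi, R_pi, steps=100_000, seed=0):
    """Monte-Carlo estimate of the average reward per time-step under a policy.

    P_pi : shape (S, S) transition matrix of the induced chain
    R_pi : shape (S,) expected reward in each state
    """
    rng = np.random.default_rng(seed)
    s, total = 0, 0.0
    for _ in range(steps):
        total += R_pi[s]
        s = rng.choice(len(R_pi), p=P_pi[s])
    return total / steps


# Made-up 2-state ergodic chain: stationary distribution is about [5/6, 1/6], so rho is about 5/6.
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])
R_pi = np.array([1.0, 0.0])
print(estimate_rho(P_pi, R_pi))
```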
Average Reward Value Function (no exam)

The value function of an undiscounted, ergodic MDP can be expressed in terms of average reward.
$\tilde{v}_{\pi}(s)$ is the extra reward due to starting from state s,

$$\tilde{v}_{\pi}(s) = E_{\pi}\left[ \sum_{k=1}^{\infty} (R_{t+k} - \rho^{\pi}) \,\middle|\, S_t = s \right]$$

There is a corresponding average reward Bellman equation,

$$\begin{aligned}
\tilde{v}_{\pi}(s) &= E_{\pi}\left[ (R_{t+1} - \rho^{\pi}) + \sum_{k=1}^{\infty} (R_{t+k+1} - \rho^{\pi}) \,\middle|\, S_t = s \right] \\
&= E_{\pi}\left[ (R_{t+1} - \rho^{\pi}) + \tilde{v}_{\pi}(S_{t+1}) \mid S_t = s \right]
\end{aligned}$$
Questions?
The only stupid question is the one you were afraid to
ask but never did.
-Rich Sutton
