3. Objective
Markov Decision Processes (Sequences of decisions)
– Introduction to MDPs
– Computing optimal policies for MDPs
4. Markov Decision Process (MDP)
Sequential decision problems under uncertainty
– Not just the immediate utility, but the longer-term utility as well
– Uncertainty in outcomes
Roots in operations research
Also used in economics, communications engineering, ecology, performance modeling and, of course, AI!
– Also referred to as stochastic dynamic programs
5. Markov Decision Process (MDP)
Defined as a tuple: <S, A, P, R>
– S: Set of states
– A: Set of actions
– P: Transition function
Table P(s’| s, a), prob of s’ given action “a” in state “s”
– R: Reward
R(s, a) = cost or reward of taking action a in state s
Choose a sequence of actions (not just one decision or one action)
– Utility based on a sequence of decisions
6. Example: What SEQUENCE of actions
should our agent take?
[Figure: 4 x 3 grid world. A Start cell, one blocked cell, and terminal cells with reward +1 and reward -1. For an intended action (e.g., N), the agent moves in that direction with probability 0.8 and slips to either perpendicular direction with probability 0.1 each.]
• Each action costs -1/25
• Agent can take action N, E, S, W
• Faces uncertainty in every state
7. MDP Tuple: <S, A, P, R>
S: State of the agent on the grid, e.g., (4,3)
– Note that cells are denoted by (x, y)
A: Actions of the agent, i.e., N, E, S, W
P: Transition function
– Table P(s’| s, a), prob of s’ given action “a” in state “s”
– E.g., P( (4,3) | (3,3), N) = 0.1
– E.g., P((3, 2) | (3,3), N) = 0.8
– (Robot movement, uncertainty of another agent’s actions,…)
R: Reward (more comments on the reward function later)
– R( (3, 3), N) = -1/25
– R (4,1) = +1
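As a rough sketch (not from the slides), the tuple above might be encoded in Python as follows. The two listed transition probabilities, the +1 reward at (4,1), and the -1/25 step cost come from this slide; the blocked-cell location, the third transition entry, and the sparse-dictionary layout are assumptions made for illustration.

```python
# A minimal sketch of the <S, A, P, R> tuple for the 4x3 grid example.
ACTIONS = ["N", "E", "S", "W"]

# States are (x, y) cells; the blocked cell (assumed here to be (2, 2)) is omitted.
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]

# Sparse transition table: P[(s, a)] = {s': P(s' | s, a)}.
# The 0.8 and 0.1 entries are from the slide; the (2, 3) entry is assumed
# in order to complete the 0.8/0.1/0.1 pattern.
P = {
    ((3, 3), "N"): {(3, 2): 0.8, (4, 3): 0.1, (2, 3): 0.1},
}

def R(s, a):
    """Reward for taking action a in state s."""
    if s == (4, 1):          # +1 terminal cell, as given on the slide
        return +1.0
    return -1.0 / 25.0       # every other action costs -1/25
```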
8. Terminology
• Before describing policies, let's go through some terminology
• Terminology useful throughout this set of lectures
• Policy: Complete mapping from states to actions
9. MDP Basics and Terminology
An agent must make decisions to control a probabilistic system
Goal is to choose a sequence of actions for optimality
Defined as <S, A, P, R>
MDP models:
– Finite horizon: Maximize the expected reward over the next n steps
– Infinite horizon: Maximize the expected discounted reward (both criteria are sketched below)
– Average reward: Maximize the average expected reward per transition
– Goal state: Maximize the expected reward (minimize the expected cost) to reach some target state G
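As a small illustration (not from the slides) of the first two criteria in the list above, the Python sketch below computes a finite-horizon return and a discounted return; the reward sequence and the discount factor 0.9 are made-up example values.

```python
def finite_horizon_return(rewards, n):
    """Finite-horizon criterion: sum of rewards over the next n steps."""
    return sum(rewards[:n])

def discounted_return(rewards, gamma=0.9):
    """Infinite-horizon criterion: discounted sum of rewards, gamma < 1."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Made-up reward sequence: three step costs followed by reaching the +1 cell.
rs = [-1/25, -1/25, -1/25, +1]
print(finite_horizon_return(rs, n=4))        # 0.88
print(discounted_return(rs, gamma=0.9))      # ~0.62
```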
10. Reward Function
According to Chapter 2, the reward is directly associated with the state
– Denoted R(I)
– Simplifies computations seen later in the algorithms presented
Sometimes, the reward is assumed associated with (state, action)
– R(S, A)
– We could also assume a mix of R(S, A) and R(S)
Sometimes, the reward is associated with (state, action, destination state) (see the sketch below)
– R(S, A, J)
– R(S, A) = Σ_J R(S, A, J) · P(J | S, A)
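A small sketch of the conversion above from R(S, A, J) to R(S, A): the expected reward is the average of the destination-dependent rewards weighted by the transition probabilities. The successor distribution and per-transition rewards below are illustrative numbers, not values from the slides.

```python
def expected_reward(R_saj, P_sa):
    """R(S, A) = sum_J P(J | S, A) * R(S, A, J).

    R_saj: dict mapping successor state J -> R(S, A, J)
    P_sa:  dict mapping successor state J -> P(J | S, A)
    """
    return sum(P_sa[j] * R_saj[j] for j in P_sa)

# Illustrative (made-up) numbers for one (state, action) pair:
P_sa  = {(3, 2): 0.8, (4, 3): 0.1, (2, 3): 0.1}
R_saj = {(3, 2): -1/25, (4, 3): -1/25, (2, 3): -1/25}
print(expected_reward(R_saj, P_sa))   # -0.04
```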
11. Markov Assumption
Markov Assumption: Transition probabilities (and rewards) from any given state depend only on that state, and not on the previous history
Where you end up after an action depends only on the current state
– Named after the Russian mathematician A. A. Markov (1856-1922)
– (He did not come up with Markov decision processes, however)
– Transitions from state (1,2) do not depend on prior states such as (1,1), or on earlier visits to (1,2)
12. MDPs vs. POMDPs
Accessibility: The agent's percepts in any given state identify the state it is in, e.g., distinguish state (4,3) from (3,3)
– Given observations, we can uniquely determine the state
– Hence, we will not explicitly consider observations, only states
Inaccessibility: The agent's percepts in any given state DO NOT identify the state it is in, e.g., it may be in (4,3) or (3,3)
– Given observations, we cannot uniquely determine the state
– POMDP: Partially observable MDP, for inaccessible environments
We will focus on MDPs in this presentation.
15. Policy
A policy is like a plan, but not quite
– Certainly, it is generated ahead of time, like a plan
Unlike traditional plans, it is not a sequence of actions that an agent must execute
– If there are failures in execution, the agent can continue to execute a policy
Prescribes an action for all the states (see the sketch below)
Maximizes expected reward, rather than just reaching a goal state
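To make "complete mapping from states to actions" concrete, a policy can be stored as a plain lookup table; the particular state-action choices below are illustrative only, not an optimal policy for the grid.

```python
# A policy is just a lookup table: state -> action (illustrative choices only).
policy = {(1, 1): "N", (1, 2): "N", (1, 3): "E", (2, 3): "E", (3, 3): "E"}

def act(state):
    # After any transition, including a slip to an unintended neighbor, the
    # agent simply looks up the prescribed action for its current state.
    return policy[state]
```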
16. MDP problem
The MDP problem consists of:
– Finding the optimal control policy for all possible states;
– Finding the sequence of optimal control functions for a specific initial state;
– Finding the best control action (decision) for a specific state.
17. Non-Optimal Vs Optimal Policy
[Figure: the 4 x 3 grid world with the +1 and -1 cells and the Start cell, overlaid with candidate policies drawn in red, yellow, and blue.]
• Choose Red policy or Yellow policy?
• Choose Red policy or Blue policy?
Which is optimal (if any)?
• Value iteration: One popular algorithm to determine optimal policy
18. Value Iteration: Key Idea
• Iterate: update the utility of state I using the old utilities of neighbor states J, given actions A
– U_{t+1}(I) = max_A [ R(I,A) + Σ_J P(J|I,A) · U_t(J) ]
– P(J|I,A): Probability of reaching J if A is taken in state I
– max_A F(A): returns the highest value of F(A) over actions A
– Both the immediate reward and the longer-term reward are taken into account
19. Value Iteration: Algorithm
• Initialize: U_0(I) = 0
• Iterate:
U_{t+1}(I) = max_A [ R(I,A) + Σ_J P(J|I,A) · U_t(J) ]
– Until close-enough(U_{t+1}, U_t)
At the end of the iteration, calculate the optimal policy (a Python sketch follows):
Policy(I) = argmax_A [ R(I,A) + Σ_J P(J|I,A) · U_{t+1}(J) ]
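A runnable Python sketch of the algorithm above, reusing the sparse P table and R function from the earlier sketch. The convergence threshold, iteration cap, and treatment of (state, action) pairs with no listed transitions are assumptions; the update itself is undiscounted, exactly as written on the slide.

```python
def value_iteration(states, actions, P, R, epsilon=1e-6, max_iters=1000):
    """Value iteration: U_{t+1}(I) = max_A [ R(I,A) + sum_J P(J|I,A) * U_t(J) ].

    P[(s, a)] is a dict {s': prob}; R(s, a) is the immediate reward.
    (A discount factor, if used, would multiply the summation term.)
    """
    U = {s: 0.0 for s in states}                                   # U_0(I) = 0
    for _ in range(max_iters):
        U_next = {
            s: max(R(s, a) + sum(p * U[j] for j, p in P.get((s, a), {}).items())
                   for a in actions)
            for s in states
        }
        done = max(abs(U_next[s] - U[s]) for s in states) < epsilon  # close-enough test
        U = U_next
        if done:
            break
    # Extract the policy from the final utilities: Policy(I) = argmax_A [...].
    policy = {
        s: max(actions,
               key=lambda a: R(s, a) + sum(p * U[j]
                                           for j, p in P.get((s, a), {}).items()))
        for s in states
    }
    return U, policy
```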
21. Markov Chain
Given a fixed policy, you get a Markov chain from the MDP
– Markov chain: The next state depends only on the previous state
– Next state: Not dependent on the action (the policy leaves only one action per state)
– Next state: History dependence only via the previous state
– P(S_{t+1} | S_t, S_{t-1}, S_{t-2}, ...) = P(S_{t+1} | S_t)
How do we evaluate the Markov chain?
• Could we try simulations? (see the sketch below)
• Are there other, more sophisticated methods?
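As a partial answer to "could we try simulations?", here is a hedged sketch of Monte Carlo evaluation of the Markov chain induced by a fixed policy. The dictionary format for the induced chain, the horizon, and the episode count are assumptions for illustration.

```python
import random

def simulate_chain(P_pi, R_pi, start, horizon=50, episodes=1000):
    """Estimate the expected total reward of the policy-induced Markov chain.

    P_pi[s] is a dict {s': prob} (the policy fixes one action per state, so no
    action argument is needed); R_pi[s] is the per-step reward under the policy.
    """
    total = 0.0
    for _ in range(episodes):
        s, ret = start, 0.0
        for _ in range(horizon):
            ret += R_pi[s]
            succ, probs = zip(*P_pi[s].items())
            s = random.choices(succ, weights=probs)[0]   # next state depends only on s
        total += ret
    return total / episodes
```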
26. Dynamic Construction
of the Decision Tree
Incremental-expansion(MDP, γ, sI, ε, VL, VU)
  initialize tree T with sI and ubound(sI), lbound(sI) using VL, VU;
  repeat until (a single action remains for sI, or ubound(sI) - lbound(sI) <= ε)
    call Improve-tree(T, MDP, γ, VL, VU);
  return the action with the greatest lower bound as the result;

Improve-tree(T, MDP, γ, VL, VU)
  if root(T) is a leaf
  then expand root(T);
       set bounds lbound, ubound of the new leaves using VL, VU;
  else for all decision subtrees T' of T
       do call Improve-tree(T', MDP, γ, VL, VU);
  recompute bounds lbound(root(T)), ubound(root(T)) for root(T);
  when root(T) is a decision node,
    prune suboptimal action branches from T;
  return;
27. Incremental expansion function:
Basic Method for the Dynamic Construction of the Decision Tree
[Flowchart: Incremental-expansion takes inputs MDP, γ, sI, ε, VL, VU; initializes the leaf nodes of the partially built decision tree; repeatedly calls Improve-tree(T, MDP, γ, VL, VU) until a single action remains for sI or ubound(sI) - lbound(sI) <= ε; then returns and terminates.]
31. If You Want to Read More on MDPs
Book:
– Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley Series in Probability and Statistics
– Available on Amazon.com