MDP Presentation
CS594
Automated Optimal Decision Making
Sohail M Yousof
Advanced Artificial Intelligence
Topic
Planning and Control in
Stochastic Domains
With
Imperfect Information
Objective
 Markov Decision Processes (Sequences of decisions)
– Introduction to MDPs
– Computing optimal policies for MDPs
Markov Decision Process (MDP)
 Sequential decision problems under uncertainty
– Not just the immediate utility, but the longer-term utility
as well
– Uncertainty in outcomes
 Roots in operations research
 Also used in economics, communications engineering,
ecology, performance modeling and of course, AI!
– Also referred to as stochastic dynamic programs
Markov Decision Process (MDP)
 Defined as a tuple: <S, A, P, R>
– S: State
– A: Action
– P: Transition function
 Table P(s’| s, a), prob of s’ given action “a” in state “s”
– R: Reward
 R(s, a) = cost or reward of taking action a in state s
 Choose a sequence of actions (not just one decision or one action)
– Utility based on a sequence of decisions
Example: What SEQUENCE of actions
should our agent take?
[Figure: a 4×3 grid world with a start cell, one blocked cell, a +1 reward cell, and a −1 reward cell. The intended move succeeds with probability 0.8 and slips to each perpendicular direction with probability 0.1.]
• Each action costs –1/25
• Agent can take action N, E, S, W
• Faces uncertainty in every state
MDP Tuple: <S, A, P, R>
 S: State of the agent on the grid, e.g., (4,3)
– Note that cells are denoted by (x, y)
 A: Actions of the agent, i.e., N, E, S, W
 P: Transition function
– Table P(s’| s, a), prob of s’ given action “a” in state “s”
– E.g., P( (4,3) | (3,3), N) = 0.1
– E.g., P((3, 2) | (3,3), N) = 0.8
– (Robot movement, uncertainty of another agent’s actions,…)
 R: Reward (more comments on the reward function later)
– R( (3, 3), N) = -1/25
– R (4,1) = +1
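To make the tuple concrete, below is a minimal Python sketch of such a grid-world MDP. The exact layout (which cells are blocked or terminal) and all helper names are illustrative assumptions, not taken from the slides; only the 0.8/0.1/0.1 noise model and the −1/25 step cost come from the example.

```python
# Minimal grid-world MDP sketch (layout and helper names are assumptions).
ACTIONS   = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
LEFT_OF   = {'N': 'W', 'W': 'S', 'S': 'E', 'E': 'N'}
RIGHT_OF  = {'N': 'E', 'E': 'S', 'S': 'W', 'W': 'N'}

BLOCKED   = {(2, 2)}                          # assumed blocked cell
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}      # assumed +1 / -1 cells
STEP_COST = -1.0 / 25                         # each action costs -1/25

def move(s, a):
    """Deterministic effect of action a in cell s = (x, y), clipped to the 4x3 grid."""
    x, y = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    if not (1 <= x <= 4 and 1 <= y <= 3) or (x, y) in BLOCKED:
        return s                              # bumping into a wall leaves the agent in place
    return (x, y)

def P(s, a):
    """Transition function: list of (s', prob) pairs -- 0.8 intended, 0.1 to each side."""
    outcomes = {}
    for direction, prob in [(a, 0.8), (LEFT_OF[a], 0.1), (RIGHT_OF[a], 0.1)]:
        s2 = move(s, direction)
        outcomes[s2] = outcomes.get(s2, 0.0) + prob
    return list(outcomes.items())

def R(s, a):
    """Reward for taking action a in state s: terminal payoff or the small step cost."""
    return TERMINALS.get(s, STEP_COST)
```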
Terminology
• Before describing policies, let's go through some terminology
• Terminology useful throughout this set of lectures
• Policy: Complete mapping from states to actions
MDP Basics and Terminology
An agent must make a decision or control a probabilistic
system
 Goal is to choose a sequence of actions for optimality
 Defined as <S, A, P, R>
 MDP models:
– Finite horizon: Maximize the expected reward for the
next n steps
– Infinite horizon: Maximize the expected discounted
reward.
– Transition model: Maximize average expected reward
per transition.
– Goal state: maximize expected reward (minimize expected
cost) to some target state G.
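For reference, the first three optimality criteria above can be written compactly in standard notation (this block is an added illustration, not from the slides), with n the horizon length, γ ∈ [0, 1) the discount factor, and r_t the reward received at step t:

```latex
\text{Finite horizon:}\quad \max\; E\!\left[\sum_{t=0}^{n-1} r_t\right]
\qquad
\text{Infinite horizon (discounted):}\quad \max\; E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]
\qquad
\text{Average reward:}\quad \max\; \lim_{n\to\infty} \tfrac{1}{n}\, E\!\left[\sum_{t=0}^{n-1} r_t\right]
```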
Reward Function
 According to Chapter 2, the reward is directly associated with the state
– Denoted R(I)
– Simplifies computations seen later in the algorithms presented
 Sometimes, the reward is assumed associated with a state and action
– R(S, A)
– We could also assume a mix of R(S, A) and R(S)
 Sometimes, the reward is associated with state, action, and destination state
– R(S, A, J)
– R(S, A) = Σ_J R(S, A, J) * P(J | S, A)
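A one-line illustration of that expectation in Python, assuming P(s, a) returns (successor, probability) pairs as in the grid sketch above and R_saj is a destination-dependent reward function (both names are assumptions):

```python
def expected_reward(R_saj, P, s, a):
    """Collapse R(s, a, j) into R(s, a) by averaging over successors j ~ P(j | s, a)."""
    return sum(prob * R_saj(s, a, j) for j, prob in P(s, a))
```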
Markov Assumption
 Markov Assumption: Transition probabilities (and rewards) from
any given state depend only on the state and not on previous
history
 Where you end up after action depends only on current state
– Named after the Russian mathematician A. A. Markov (1856-1922)
– (He did not, however, invent Markov decision processes)
– Transitions out of state (1,2) do not depend on having previously
visited (1,1) or (1,2)
MDP vs POMDPs
 Accessibility: The agent's percepts in any given state identify the
state that it is in, e.g., state (4,3) vs (3,3)
– Given the observations, the agent can uniquely determine the state
– Hence, we will not explicitly consider observations, only states
 Inaccessibility: The agent's percepts in any given state DO NOT
identify the state that it is in, e.g., it may be (4,3) or (3,3)
– Given the observations, the agent cannot uniquely determine the state
– POMDP: Partially observable MDP for inaccessible environments
 We will focus on MDPs in this presentation.
MDP vs POMDP
[Figure: agent-environment loops. In an MDP the agent observes the world's state directly and emits actions. In a POMDP the agent receives only observations; a state estimator (SE) maintains a belief state b, and the policy P maps beliefs to actions.]
Stationary and Deterministic Policies
 Policy denoted by the symbol π
Policy
 Policy is like a plan, but not quite
– Certainly, generated ahead of time, like a plan
 Unlike traditional plans, it is not a sequence of
actions that an agent must execute
– If there are failures in execution, the agent can simply continue
executing the policy
 Prescribes an action for every state
 Maximizes expected reward, rather than just
reaching a goal state
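A minimal sketch of this point, assuming the transition helper P(s, a) from the earlier grid example and a policy stored as a plain dict (all names are assumptions): because the policy prescribes an action for every state, execution simply looks up whatever state the agent actually lands in, so a slip never strands it.

```python
import random

def execute(policy, P, s0, terminals=frozenset(), steps=50):
    """Follow a policy (dict: state -> action) under stochastic transitions.
    Even when an action 'slips', the next lookup simply uses the state we landed in."""
    s, trajectory = s0, [s0]
    for _ in range(steps):
        if s in terminals:
            break
        a = policy[s]                                   # the policy covers every state
        successors, probs = zip(*P(s, a))
        s = random.choices(successors, weights=probs)[0]
        trajectory.append(s)
    return trajectory
```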
MDP problem
 The MDP problem consists of:
– Finding the optimal control policy for all possible states;
– Finding the sequence of optimal control functions for a specific
initial state;
– Finding the best control action (decision) for a specific state.
Non-Optimal Vs Optimal Policy
[Figure: the same 4×3 grid world (start cell, −1 and +1 reward cells), with red, yellow, and blue candidate policies drawn as arrows over the grid.]
• Choose Red policy or Yellow policy?
• Choose Red policy or Blue policy?
Which is optimal (if any)?
• Value iteration: One popular algorithm to determine optimal policy
Value Iteration: Key Idea
• Iterate: update utility of state “I” using old utility of
neighbor states “J”; given actions “A”
– U_{t+1}(I) = max_A [ R(I,A) + Σ_J P(J|I,A) * U_t(J) ]
– P(J|I,A): Probability of J if A is taken in state I
– max F(A) returns highest F(A)
– Immediate reward & longer term reward taken into
account
Value Iteration: Algorithm
• Initialize: U_0(I) = 0
• Iterate:
U_{t+1}(I) = max_A [ R(I,A) + Σ_J P(J|I,A) * U_t(J) ]
– Until close-enough(U_{t+1}, U_t)
 At the end of the iteration, calculate the optimal policy:
Policy(I) = argmax_A [ R(I,A) + Σ_J P(J|I,A) * U_{t+1}(J) ]
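The update above translates almost line for line into Python. A minimal sketch, assuming P(s, a) returns (successor, probability) pairs and R(s, a) the immediate reward as in the earlier grid example; the discount factor γ and the convergence tolerance are added here as assumptions (with γ = 1 this is exactly the update on the slide).

```python
def value_iteration(states, actions, P, R, gamma=0.95, eps=1e-6):
    """Iterate U_{t+1}(I) = max_A [ R(I,A) + gamma * sum_J P(J|I,A) * U_t(J) ]
    until successive value functions are close enough, then extract the policy."""
    U = {s: 0.0 for s in states}                      # U_0(I) = 0
    while True:
        U_next = {s: max(R(s, a) + gamma * sum(p * U[j] for j, p in P(s, a))
                         for a in actions(s))
                  for s in states}
        done = max(abs(U_next[s] - U[s]) for s in states) < eps   # close-enough test
        U = U_next
        if done:
            break
    policy = {s: max(actions(s),
                     key=lambda a: R(s, a) + gamma * sum(p * U[j] for j, p in P(s, a)))
              for s in states}
    return U, policy
```

With the grid sketch from earlier, calling value_iteration(states, lambda s: 'NESW', P, R) (where states is an assumed list of the non-blocked cells) returns both the converged utilities and a complete policy, one action per state.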
Forward Method
for Solving MDP
Decision Tree
Markov Chain
 Given a fixed policy, the MDP induces a Markov chain
– Markov chain: The next state depends only on the previous state
– Next state: Not dependent on an action (there is only one action per state)
– Next state: History dependence only via the previous state
– P(S_{t+1} | S_t, S_{t-1}, S_{t-2}, …) = P(S_{t+1} | S_t)
 How do we evaluate the Markov chain?
• Could we try simulations?
• Are there other sophisticated methods around?
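Both evaluation routes raised above can be sketched briefly. Assuming the fixed-policy chain is given as P_pi(s), returning (successor, probability) pairs, and r_pi(s) is the per-step reward under that policy (names and the discount factor are assumptions): the first function estimates a state's discounted value by Monte Carlo simulation, the second solves the linear system U = r + γ·P·U exactly.

```python
import random
import numpy as np

def simulate_value(P_pi, r_pi, s0, gamma, episodes=1000, horizon=200):
    """Monte Carlo estimate of the discounted value of s0 under the fixed policy."""
    total = 0.0
    for _ in range(episodes):
        s, discount, ret = s0, 1.0, 0.0
        for _ in range(horizon):
            ret += discount * r_pi(s)
            successors, probs = zip(*P_pi(s))
            s = random.choices(successors, weights=probs)[0]
            discount *= gamma
        total += ret
    return total / episodes

def solve_value(states, P_pi, r_pi, gamma):
    """Exact evaluation of the Markov chain: solve (I - gamma * P) U = r."""
    idx = {s: i for i, s in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    r = np.array([r_pi(s) for s in states])
    for s in states:
        for j, p in P_pi(s):
            P[idx[s], idx[j]] = p
    U = np.linalg.solve(np.eye(len(states)) - gamma * P, r)
    return {s: U[idx[s]] for s in states}
```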
Influence Diagram
Expanded Influence Diagram
Relation between
time & steps-to-go
Decision Tree
Dynamic Construction
of the Decision Tree
 Incremental-expansion(MDP, γ, s_I, ε, VL, VU)
   initialize tree T with s_I and ubound(s_I), lbound(s_I) using VL, VU;
   repeat until (a single action remains for s_I, or ubound(s_I) − lbound(s_I) <= ε)
     call Improve-tree(T, MDP, γ, VL, VU);
   return the action with the greatest lower bound as the result;

Improve-tree(T, MDP, γ, VL, VU)
   if root(T) is a leaf
     then expand root(T);
          set bounds lbound, ubound of the new leaves using VL, VU;
     else for all decision subtrees T′ of T
          do call Improve-tree(T′, MDP, γ, VL, VU);
   recompute the bounds lbound(root(T)), ubound(root(T)) for root(T);
   when root(T) is a decision node
     prune suboptimal action branches from T;
   return;
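Below is a compact, illustrative Python rendering of the pseudocode above. It is a sketch only: the MDP interface (actions(s), successors(s, a) returning (state, probability) pairs, R(s, a)) and the bound functions VL, VU are assumptions, every state is assumed to have at least one action, and refinements of the original method such as selective leaf expansion are omitted.

```python
class ChanceNode:                       # the branch taken when action a is chosen in a state
    def __init__(self, state, action):
        self.state, self.action = state, action
        self.children = {}              # successor state -> DecisionNode

class DecisionNode:                     # a state in which the agent must pick an action
    def __init__(self, state, VL, VU):
        self.state = state
        self.branches = {}              # action -> ChanceNode; empty dict => leaf
        self.lb, self.ub = VL(state), VU(state)

def improve_tree(node, mdp, gamma, VL, VU):
    """Expand leaves, recurse into subtrees, then recompute bounds and prune actions."""
    if not node.branches:                               # leaf: expand one level
        for a in mdp.actions(node.state):
            branch = ChanceNode(node.state, a)
            for j, _p in mdp.successors(node.state, a):
                branch.children[j] = DecisionNode(j, VL, VU)
            node.branches[a] = branch
    else:                                               # interior: improve every subtree
        for branch in node.branches.values():
            for child in branch.children.values():
                improve_tree(child, mdp, gamma, VL, VU)
    # back up lower/upper bounds through the chance nodes to this decision node
    lbs, ubs = {}, {}
    for a, branch in node.branches.items():
        lbs[a] = mdp.R(node.state, a) + gamma * sum(
            p * branch.children[j].lb for j, p in mdp.successors(node.state, a))
        ubs[a] = mdp.R(node.state, a) + gamma * sum(
            p * branch.children[j].ub for j, p in mdp.successors(node.state, a))
    best_lb = max(lbs.values())
    for a in list(node.branches):       # prune actions that can no longer be optimal
        if ubs[a] < best_lb:
            del node.branches[a]
    node.lb = best_lb
    node.ub = max(ubs[a] for a in node.branches)

def incremental_expansion(mdp, gamma, s_I, eps, VL, VU):
    """Grow the tree until one action remains at the root or its bound gap is <= eps."""
    root = DecisionNode(s_I, VL, VU)
    while True:
        improve_tree(root, mdp, gamma, VL, VU)
        if len(root.branches) == 1 or root.ub - root.lb <= eps:
            break
    # return the root action with the greatest lower bound
    def root_lb(a):
        branch = root.branches[a]
        return mdp.R(s_I, a) + gamma * sum(
            p * branch.children[j].lb for j, p in mdp.successors(s_I, a))
    return max(root.branches, key=root_lb)
```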
Incremental expansion function:
Basic Method for the Dynamic Construction of the Decision Tree
[Flowchart: start with inputs MDP, γ, s_I, ε, VL, VU; initialize the leaf node of the partially built decision tree; repeatedly call Improve-tree(T, MDP, γ, ε, VL, VU); once a single action remains for s_I or ubound(s_I) − lbound(s_I) ≤ ε, return the chosen action and terminate.]
Computing Decisions
using Bound Iteration
(This slide repeats the Incremental-expansion / Improve-tree pseudocode and the flowchart from the previous two slides.)
Solving Large MDP Problems
If You Want to Read More
on MDPs
 Book:
– Martin L. Puterman
 Markov Decision Processes
 Wiley Series in Probability
– Available on Amazon.com