Markov Decision Process
Hamed Abdi
PhD Candidate in Computational Cognitive Modeling
Institute for Cognitive & Brain Science (ICBS)
Outline
Introduction
Decision Theory
Intelligent Agents
Simple Decisions
Complex Decisions
Value Iteration
Policy Iteration
Partially Observable MDP
Dopamine-based Learning
Introduction
 How should we make decisions so as to maximize payoff (Reward)?
 How should we do this when others may not go along?
 How should we do this when the payoff (Reward) may be far in the future?
“Preferred Outcomes” or “Utility”
Decision Theories
Decision Theory = Probability Theory + Utility Theory
 Maximize reward → Utility Theory
 Other agents involved → Game Theory
 Sequence of actions → Markov Decision Process
Properties of Task Environments
 Fully observable vs. partially observable
 Single-agent vs. multi-agent
 Deterministic vs. stochastic
 Episodic vs. sequential
 Static vs. dynamic
 Discrete vs. continuous
 Known vs. unknown
Agents
An agent is anything that can be viewed as perceiving its environment through
sensors and acting upon that environment through actuators.
The Structure of Agents
Agent = Architecture + Program
 Simple reflex agents
 Model-based reflex agents
 Goal-based agents
 Utility-based agents
Maximum Expected Utility (MEU)
A rational agent should choose the action that maximizes the agent’s expected utility:
action = argmax_a EU(a | e)
EU(a | e) = Σ_s' P(RESULT(a) = s' | a, e) U(s')
The Value of Information
VPI(E_j) = [Σ_k P(E_j = e_jk | e) EU(a_ejk | e, E_j = e_jk)] − EU(a | e)
where E_j is a random variable and a_ejk is the best action once E_j = e_jk is known.
These ideas apply to nondeterministic, partially observable environments.
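To make the MEU rule concrete, here is a minimal Python sketch; the outcome
probabilities and utilities are hypothetical stand-ins, not values from these slides.

```python
# Hypothetical outcome model: action -> {outcome_state: P(RESULT(a) = s' | a, e)}
P_result = {
    "go_left":  {"safe": 0.8, "crash": 0.2},
    "go_right": {"safe": 0.5, "crash": 0.5},
}
U = {"safe": 10.0, "crash": -100.0}  # hypothetical utilities U(s')

def expected_utility(action):
    # EU(a | e) = sum over outcomes of P(outcome) * U(outcome)
    return sum(p * U[s] for s, p in P_result[action].items())

best = max(P_result, key=expected_utility)  # action = argmax_a EU(a | e)
print(best, expected_utility(best))
```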
Sequential Decision Problems
Policy and Optimal Policy
Optimal policies depend on the reward function and on the horizon:
 Finite horizon
 Infinite horizon
Markov Decision Process (MDP)
A sequential decision problem for a fully observable, stochastic environment with a
Markovian transition model and additive rewards is called a Markov decision process and
consists of four components:
 S: A set of states (with an initial state S0)
 A: A set ACTIONS(s) of actions available in each state
 T: A transition model P(s' | s, a)
 R: A reward function R(s) (more generally, R(s, a, s'))
[Figure: dynamic decision network for an MDP, with states St, St+1, St+2,
actions At, At+1, and rewards Rt, Rt+1, Rt+2.]
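A minimal sketch of this four-component definition as a Python container; the
states, dynamics, and rewards below are made-up illustrations, not part of the slides.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list        # S: set of states
    actions: dict       # A: state -> available actions
    transition: dict    # T: (s, a) -> {s': P(s' | s, a)}
    reward: dict        # R: state -> immediate reward R(s)
    gamma: float = 0.9  # discount factor (used in later slides)

# Tiny two-state example with hypothetical numbers:
mdp = MDP(
    states=["s0", "s1"],
    actions={"s0": ["stay", "go"], "s1": ["stay"]},
    transition={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
    },
    reward={"s0": 0.0, "s1": 1.0},
)
```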
Assumptions
First-Order Markovian Dynamics (history independence)
 P(St+1 | At, St, At−1, St−1, ..., S0) = P(St+1 | At, St)
First-Order Markovian Reward Process
 P(Rt+1 | At, St, At−1, St−1, ..., S0) = P(Rt+1 | At, St)
Stationary Dynamics and Reward
 P(St+1 | At, St) = P(Sk+1 | Ak, Sk) for all t, k
 The world dynamics do not depend on the absolute time
Full Observability
Utilities of Sequences
1. Additive rewards:
Uh([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ···
2. Discounted rewards:
Uh([s0, s1, s2, ...]) = R(s0) + γR(s1) + γ²R(s2) + ···
 With discounted rewards (γ < 1), the utility of an infinite sequence is finite:
Uh([s0, s1, s2, ...]) = Σt γ^t R(st) ≤ Σt γ^t Rmax = Rmax / (1 − γ)
The discount factor γ is a number between 0 and 1.
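A short sketch of both utility definitions on a truncated reward sequence; the
reward values are hypothetical.

```python
def additive_utility(rewards):
    return sum(rewards)

def discounted_utility(rewards, gamma=0.9):
    # U_h = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]          # hypothetical R(s_t) values
print(additive_utility(rewards))        # 4.0
print(discounted_utility(rewards))      # 1 + 0.9 + 0.81 + 0.729 = 3.439
# With gamma < 1, even an infinite all-Rmax sequence is bounded by Rmax/(1-gamma).
```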
The Bellman Equation for Utilities
The utility of a state is the immediate reward for that state plus the expected discounted
utility of the next state, assuming that the agent chooses the optimal action:
U(s) = R(s) + γ max_{a∈A(s)} Σ_s' P(s' | s, a) U(s')
The Value Iteration Algorithm
Value iteration turns the Bellman equation into an update rule, applying
Ui+1(s) ← R(s) + γ max_{a∈A(s)} Σ_s' P(s' | s, a) Ui(s') to every state until the
utilities converge, as sketched below.
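A sketch of value iteration over the hypothetical MDP container introduced
earlier (the field names are assumptions of this write-up, not a fixed API):

```python
def value_iteration(mdp, epsilon=1e-6):
    U = {s: 0.0 for s in mdp.states}
    while True:
        delta, U_new = 0.0, {}
        for s in mdp.states:
            # Bellman update: U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')
            best = max(
                sum(p * U[s2] for s2, p in mdp.transition[(s, a)].items())
                for a in mdp.actions[s]
            )
            U_new[s] = mdp.reward[s] + mdp.gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < epsilon:   # stop once no state's utility changes much
            return U

print(value_iteration(mdp))
```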
Policy Iteration Algorithm
The policy iteration algorithm alternates the following two steps, beginning from some
initial policy π0:
 Policy evaluation: given a policy πi, calculate Ui = U^πi, the utility of each state if πi
were to be executed.
 Policy improvement: calculate a new MEU policy πi+1, using one-step look-ahead based
on Ui:
πi+1(s) = argmax_{a∈A(s)} Σ_s' P(s' | s, a) Ui(s')
Variants (the sketch below uses the modified-evaluation idea):
 Modified Policy Iteration Algorithm (MPI)
 Asynchronous Policy Iteration Algorithm (API)
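A policy iteration sketch over the same hypothetical MDP container; evaluation is
approximated with a fixed number of sweeps, in the spirit of modified policy iteration.

```python
import random

def policy_iteration(mdp, eval_sweeps=50):
    pi = {s: random.choice(mdp.actions[s]) for s in mdp.states}  # arbitrary pi_0
    U = {s: 0.0 for s in mdp.states}
    while True:
        # Policy evaluation: iterate the fixed-policy Bellman equation for U^pi
        for _ in range(eval_sweeps):
            U = {s: mdp.reward[s] + mdp.gamma *
                    sum(p * U[s2] for s2, p in mdp.transition[(s, pi[s])].items())
                 for s in mdp.states}
        # Policy improvement: one-step look-ahead based on U
        unchanged = True
        for s in mdp.states:
            best_a = max(mdp.actions[s], key=lambda a: sum(
                p * U[s2] for s2, p in mdp.transition[(s, a)].items()))
            if best_a != pi[s]:
                pi[s], unchanged = best_a, False
        if unchanged:          # policy stable -> optimal under this model
            return pi, U
```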
POMDP
Markov Decision Processes: the environment was fully observable
 The agent always knows which state it is in
Partially Observable MDP: the environment is only partially observable
 The agent does not necessarily know which state it is in
 It cannot execute the action π(s) recommended for that state
The utility of a state s and the optimal action in s depend not just on s, but
also on how much the agent knows when it is in s.
Definition of POMDP
A POMDP has the same components as an MDP:
 Transition model P(s' | s, a)
 Actions A(s)
 Reward function R(s)
plus a sensor model P(e | s).
Belief State
A belief state b is a probability distribution over the possible actual states. If b(s) was
the previous belief state, and the agent does action a and then perceives
evidence e, then the new belief state is given by
b'(s') = α P(e | s') Σ_s P(s' | s, a) b(s)
where α is a normalizing constant.
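A sketch of this belief-state update in Python; the table formats mirror the
hypothetical MDP container above, and the sensor table keyed by (e, s') is an
assumption of this write-up.

```python
def update_belief(b, a, e, transition, sensor, states):
    # b'(s') = alpha * P(e | s') * sum_s P(s' | s, a) * b(s)
    b_new = {}
    for s2 in states:
        predicted = sum(transition.get((s, a), {}).get(s2, 0.0) * b[s]
                        for s in states)
        b_new[s2] = sensor[(e, s2)] * predicted
    alpha = 1.0 / sum(b_new.values())   # normalizing constant
    return {s2: alpha * p for s2, p in b_new.items()}
```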
POMDP (continued)
The optimal action depends only on the agent’s current belief state.
The decision cycle of a POMDP agent:
1. Given the current belief state b, execute the action a = π∗(b)
2. Receive percept e
3. Set the current belief state to b'(b, a, e) and repeat, where
b'(s') = α P(e | s') Σ_s P(s' | s, a) b(s)
Define a reward function for belief states:
ρ(b) = Σ_s b(s) R(s)
Solving a POMDP on a physical state space can thus be reduced to solving an MDP on the
corresponding belief-state space.
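The decision cycle written out as a loop; π∗, the environment interface, and
update_belief are the assumed, hypothetical pieces sketched above.

```python
def pomdp_agent_loop(b, pi_star, env, transition, sensor, states, steps=100):
    for _ in range(steps):
        a = pi_star(b)       # 1. act on the current belief state
        e = env.step(a)      # 2. receive a percept from the environment
        b = update_belief(b, a, e, transition, sensor, states)  # 3. update b
    return b
```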
Dynamic Bayesian Network (DBN)
In the DBN, the single state St becomes a set of state variables Xt, and there may be
multiple evidence variables Et.
Known values: At−2, Et−1, Rt−1, At−1, Et, Rt
[Figure: DBN unrolled over time, with state variables Xt−1 ... Xt+2, actions
At−2 ... At+1, evidence variables Et−1 ... Et+2, rewards Rt−1 ... Rt+2, and a
utility node Ut+2.]
Dopaminergic System
 Prediction error in humans
 Reinforcement learning
 Reward-based learning
 Decision making
 Action selection (what to do next)
 Time perception
Dopamine functions:
 Motor control
 Reward behavior
 Addiction
 Synaptic plasticity
 Nausea
 ...
References
• Feinberg, E. A., & Shwartz, A. (Eds.). Handbook of Markov Decision Processes. Kluwer, Boston, MA, 2002.
• Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
• Gurney, K., Prescott, T. J., & Redgrave, P. (2001). A computational model of action selection in the basal ganglia. Biological Cybernetics, 84(6), 401-423.
• Russell, S., & Norvig, P. Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River, NJ, 1995.
Thanks for your attention!