Outline
Introduction
Decision Theory
Intelligent Agents
Simple Decisions
Complex Decisions
Value Iteration
Policy Iteration
Partially Observable MDP
Dopamine-based Learning
Introduction
How should we make decisions so as to maximize payoff (Reward)?
How should we do this when others may not go along?
How should we do this when the payoff (Reward) may be far in the future?
“Preferred Outcomes” or “Utility”
Decision Theories
Decision theory = probability theory + utility theory.

Maximize reward → Utility Theory
Other agents → Game Theory
Sequence of actions → Markov Decision Process

Properties of Task Environments
Fully observable vs. partially observable
Single agent vs. multi-agent
Deterministic vs. stochastic
Episodic vs. sequential
Static vs. dynamic
Discrete vs. continuous
Known vs. unknown
Agents
An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.

The Structure of Agents
Agent = Architecture + Program
Simple reflex agents
Model-based reflex agents
Goal-based agents
Utility-based agents
Maximum Expected Utility (MEU)
A rational agent should choose the action that maximizes the agent's expected utility:
action = argmax_a EU(a | e)
EU(a | e) = ∑_{s'} P(RESULT(a) = s' | a, e) U(s')
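To make the formula concrete, here is a minimal Python sketch of MEU action selection; the `transition` and `utility` dictionaries are hypothetical stand-ins for a problem's outcome model P(RESULT(a) = s' | a, e) and utility function U:

```python
def expected_utility(action, transition, utility):
    """EU(a | e) = sum over s' of P(RESULT(a) = s' | a, e) * U(s')."""
    return sum(p * utility[s_next]
               for s_next, p in transition[action].items())

def meu_action(actions, transition, utility):
    """Choose the action with maximum expected utility."""
    return max(actions, key=lambda a: expected_utility(a, transition, utility))

# Illustrative toy numbers (assumed, not from the slides):
transition = {"dig": {"gold": 0.3, "dirt": 0.7}, "walk": {"dirt": 1.0}}
utility = {"gold": 10.0, "dirt": 0.0}
print(meu_action(["dig", "walk"], transition, utility))  # -> dig
```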
The Value of Information
The value of perfect information about a random variable E_j (applicable in nondeterministic, partially observable environments) is the expected improvement from observing E_j before acting:
VPI_e(E_j) = [∑_k P(E_j = e_jk | e) EU(α_{e_jk} | e, E_j = e_jk)] − EU(α | e)
where α is the best action under the current evidence e and α_{e_jk} is the best action once E_j = e_jk has been observed.
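As a sketch, the same quantities combine directly in code; the dictionary arguments below (P(E_j = e_jk | e) and the best attainable expected utilities) are assumed inputs that a concrete model would have to supply:

```python
def vpi(p_ej, best_eu_after, best_eu_now):
    """Value of perfect information about Ej.

    p_ej[v]          : P(Ej = v | e)
    best_eu_after[v] : EU of the best action once Ej = v is observed
    best_eu_now      : EU of the best action under the current evidence e
    """
    return sum(p * best_eu_after[v] for v, p in p_ej.items()) - best_eu_now
```

Information is worth gathering when its VPI exceeds the cost of obtaining it; for perfect information, VPI is never negative.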
Markov Decision Process (MDP)
A sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards is called a Markov decision process. It consists of four components (a toy encoding follows the diagram below):
S: a set of states (with an initial state s_0)
A: a set ACTIONS(s) of actions available in each state
T: a transition model P(s' | s, a)
R: a reward function R(s) (or, more generally, R(s, a, s'))
[Diagram: the MDP agent–environment loop over time, alternating states S_t, actions A_t, and rewards R_t.]
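As a concrete illustration of the four components, here is one way to encode a toy MDP in Python dictionaries; the states, probabilities, and rewards are invented for the example and are reused by the value- and policy-iteration sketches below:

```python
# S: states (s0 is the initial state); A: actions available in each state;
# T: transition model P(s' | s, a); R: reward function R(s).
mdp = {
    "S": ["s0", "s1"],
    "A": {"s0": ["stay", "go"], "s1": ["stay"]},
    "T": {
        ("s0", "go"):   {"s1": 0.9, "s0": 0.1},
        ("s0", "stay"): {"s0": 1.0},
        ("s1", "stay"): {"s1": 1.0},
    },
    "R": {"s0": 0.0, "s1": 1.0},
}
```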
Assumptions
First-order Markovian dynamics (history independence):
P(S_{t+1} | A_t, S_t, A_{t−1}, S_{t−1}, ..., S_0) = P(S_{t+1} | A_t, S_t)
First-order Markovian reward process:
P(R_{t+1} | A_t, S_t, A_{t−1}, S_{t−1}, ..., S_0) = P(R_{t+1} | A_t, S_t)
Stationary dynamics and reward:
P(S_{t+1} | A_t, S_t) = P(S_{k+1} | A_k, S_k) for all t, k
(the world dynamics do not depend on absolute time)
Full observability
Utilities of Sequences
1. Additive rewards:
U_h([s_0, s_1, s_2, ...]) = R(s_0) + R(s_1) + R(s_2) + ···
2. Discounted rewards, with a discount factor γ between 0 and 1:
U_h([s_0, s_1, s_2, ...]) = R(s_0) + γ R(s_1) + γ^2 R(s_2) + ···
With discounted rewards (γ < 1), the utility of an infinite sequence is finite:
U_h([s_0, s_1, s_2, ...]) = ∑_t γ^t R(s_t) ≤ ∑_t γ^t R_max = R_max / (1 − γ)
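A quick numeric check of this bound, as a Python sketch (the constant reward stream and γ = 0.9 are illustrative choices):

```python
def discounted_return(rewards, gamma):
    """U_h = sum_t gamma^t * R(s_t) over a (finite prefix of a) reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With R_max = 1 and gamma = 0.9, the bound is R_max / (1 - gamma) = 10;
# a long constant-reward sequence approaches it from below.
print(discounted_return([1.0] * 500, gamma=0.9))  # ~10.0
```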
The Bellman Equation for Utilities
The utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action:
U(s) = R(s) + γ max_{a∈A(s)} ∑_{s'} P(s' | s, a) U(s')

The Value Iteration Algorithm
Value iteration turns the Bellman equation into an update rule, U'(s) ← R(s) + γ max_{a∈A(s)} ∑_{s'} P(s' | s, a) U(s'), applied to every state until the utilities converge (see the sketch below).
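A minimal value-iteration sketch over the toy `mdp` encoding from the earlier sketch; the stopping test uses the standard Bellman-error bound, and γ and ε are assumed parameters:

```python
def value_iteration(mdp, gamma=0.9, epsilon=1e-6):
    """Repeat the Bellman update until no state's utility changes much."""
    U = {s: 0.0 for s in mdp["S"]}
    while True:
        delta, U_new = 0.0, {}
        for s in mdp["S"]:
            # U'(s) = R(s) + gamma * max_a sum_{s'} P(s' | s, a) * U(s')
            best = max(sum(p * U[s2] for s2, p in mdp["T"][(s, a)].items())
                       for a in mdp["A"][s])
            U_new[s] = mdp["R"][s] + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta <= epsilon * (1 - gamma) / gamma:
            return U

print(value_iteration(mdp))  # utilities of s0 and s1 for the toy MDP above
```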
Policy Iteration Algorithm
The policy iteration algorithm alternates the following two steps, beginning from some initial policy π_0:
Policy evaluation: given a policy π_i, calculate U_i = U^{π_i}, the utility of each state if π_i were to be executed.
Policy improvement: calculate a new MEU policy π_{i+1}, using one-step look-ahead based on U_i:
π_{i+1}(s) = argmax_{a∈A(s)} ∑_{s'} P(s' | s, a) U_i(s')
Variants include the Modified Policy Iteration Algorithm (MPI) and the Asynchronous Policy Iteration Algorithm (API); a sketch of the basic scheme follows.
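A sketch of the scheme on the same toy `mdp` encoding; using a fixed number of evaluation sweeps is an assumed simplification in the spirit of MPI (exact policy evaluation would solve a linear system instead):

```python
def policy_iteration(mdp, gamma=0.9, eval_sweeps=50):
    pi = {s: mdp["A"][s][0] for s in mdp["S"]}   # arbitrary initial policy pi_0
    U = {s: 0.0 for s in mdp["S"]}
    while True:
        # Policy evaluation: approximate U_i = U^{pi_i} by repeated sweeps.
        for _ in range(eval_sweeps):
            U = {s: mdp["R"][s] + gamma *
                    sum(p * U[s2] for s2, p in mdp["T"][(s, pi[s])].items())
                 for s in mdp["S"]}
        # Policy improvement: one-step look-ahead based on U_i.
        unchanged = True
        for s in mdp["S"]:
            best = max(mdp["A"][s],
                       key=lambda a: sum(p * U[s2]
                                         for s2, p in mdp["T"][(s, a)].items()))
            if best != pi[s]:
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U
```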
POMDP
In a Markov decision process, the environment is fully observable: the agent always knows which state it is in.
In a partially observable MDP (POMDP), the environment is only partially observable: the agent does not necessarily know which state it is in, so it cannot execute the action π(s) recommended for that state.
The utility of a state s and the optimal action in s therefore depend not just on s, but also on how much the agent knows when it is in s.
Definition of POMDP
A POMDP is an MDP plus a sensor model: it has the same elements as an MDP (a transition model P(s' | s, a), actions A(s), and a reward function R(s)) together with a sensor model P(e | s).
Since the true state is hidden, the agent maintains a belief state b, a probability distribution over the possible states. If b(s) was the previous belief state, and the agent does action a and then perceives evidence e, then the new belief state is given by
b'(s') = α P(e | s') ∑_s P(s' | s, a) b(s)
where α is a normalizing constant.
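The update transcribes directly into Python; the dictionary conventions for the transition and sensor models are assumptions of this sketch (and the percept is assumed to have nonzero probability):

```python
def belief_update(b, a, e, T, sensor, states):
    """b'(s') = alpha * P(e | s') * sum_s P(s' | s, a) * b(s)."""
    b_new = {s2: sensor[(e, s2)] *
                 sum(T[(s, a)].get(s2, 0.0) * b[s] for s in states)
             for s2 in states}
    alpha = 1.0 / sum(b_new.values())   # normalizing constant
    return {s2: alpha * p for s2, p in b_new.items()}
```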
POMDP (continued)
The optimal action depends only on the agent's current belief state. The decision cycle of a POMDP agent is:
1. Given the current belief state b, execute the action a = π*(b).
2. Receive percept e.
3. Set the current belief state to b'(b, a, e), using the update rule above, and repeat.

A reward function can also be defined for belief states:
ρ(b) = ∑_s b(s) R(s)
With these elements, solving a POMDP on a physical state space can be reduced to solving an MDP on the corresponding belief-state space; a sketch of the cycle follows.
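Putting the pieces together, a minimal sketch of the cycle and the belief-state reward, reusing `belief_update` from the previous sketch; `policy` and `env` are hypothetical stand-ins (in practice the policy would come from solving the belief-state MDP):

```python
def rho(b, R):
    """Belief-state reward: rho(b) = sum_s b(s) * R(s)."""
    return sum(b[s] * R[s] for s in b)

def pomdp_decision_cycle(b, policy, env, T, sensor, states, steps=100):
    for _ in range(steps):
        a = policy(b)      # execute the action a = pi*(b)
        e = env.step(a)    # receive percept e
        b = belief_update(b, a, e, T, sensor, states)  # b <- b'(b, a, e)
    return b
```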
Dynamic Bayesian Network (DBN)
In the DBN, the single state S_t becomes a set of state variables X_t, and there may be multiple evidence variables E_t.
Known values: A_{t−2}, E_{t−1}, R_{t−1}, A_{t−1}, E_t, R_t
[Diagram: a DBN unrolled over time, with state variables X_{t−1}...X_{t+2}, actions A_{t−2}...A_{t+1}, evidence E_{t−1}...E_{t+2}, rewards R_{t−1}...R_{t+2}, and a utility node U_{t+2}.]
Dopaminergic System
Prediction error in humans (see the TD sketch below)
Reinforcement learning
Reward-based learning
Decision making
Action selection (what to do next)
Time perception

Dopamine functions:
Motor control
Reward behavior
Addiction
Synaptic plasticity
Nausea
and more
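The "prediction error" above is commonly modeled as a temporal-difference error. A minimal TD(0) sketch of that signal follows; the learning rate and discount are illustrative assumptions, and the analogy to dopamine firing is a modeling claim, not code from the slides:

```python
def td_error_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Nudge the value estimate V[s] by the reward-prediction error delta.
    delta > 0: outcome better than predicted (a burst, in the dopamine analogy);
    delta < 0: outcome worse than predicted (a dip)."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta
```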
Thanks for your attention.