This document summarizes an efficient use of temporal difference techniques in computer game learning. It discusses reinforcement learning and some key concepts including the agent-environment interface, types of reinforcement learning tasks, elements of reinforcement learning like policy, reward functions, and value functions. It also describes algorithms like dynamic programming, policy iteration, value iteration, and temporal difference learning. Finally, it mentions some applications of reinforcement learning in benchmark problems, games, and real-world domains like robotics and control.
An efficient use of temporal difference technique in Computer Game Learning
1. An Efficient Use of Temporal Difference Technique in Computer Game Learning
Indian Institute of Technology (Indian School of Mines), Dhanbad
Project guide: Dr. Rajendra Pamula, Department of Computer Science and Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad
Presented by: Prabhu Kumar (15MT000624), Computer Science and Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad
2. Outline
1. Introduction of reinforcement learning
2. Agent-Environment interface
3. Types of reinforcement learning
4. Elements of the reinforcement learning
5. Types of selection of state
6. Algorithms of reinforcement learning
References
3. Introduction of reinforcement learning
Reinforcement learning is a part of machine learning, the field of computer science that gives
computers the ability to learn without being explicitly programmed.
Reinforcement learning is a framework in which computational learning agents use experience
from their interaction with an environment to improve performance over time.
In a reinforcement learning task, the agent observes the state of the environment and
always tries to maximize the long-term return, which is based on real-valued rewards.
It is learning what to do, that is, how to map situations to actions, so as to maximize the total
numerical reward and minimize the penalty.
4. Introduction of reinforcement learning cont.…
• If there is no explicit teacher to guide the learning agent, the agent must learn the behavior
through trial-and-error interaction with an unknown environment.
• The learning agent senses the environment, takes actions on it, and receives a numeric reward or
punishment from some reward function.
• Saying that the agent learns means that it sometimes modifies its own code or modifies its database,
where the database holds its experiences, information, events, etc.
• The agent is responsible for making decisions.
• The main goal of reinforcement learning is to build a good model, such as an algorithm, that
generates a sequence of decisions leading to the highest long-term reward.
5. Agent environment interface
o At each time step t, the reinforcement learning agent
receives some representation of the environment’s current state
s(t) ∈ S, where S is the set of possible states, and then chooses
some action a(t) ∈ A(s(t)), where A(s(t)) is the set of actions that can
be executed in state s(t).
o The agent then receives a reward r(t+1) and moves to the next state
s(t+1).
o The reward function can be used to specify a wide range
of planning goals; that is, the designer can tell the agent
what it has to achieve.
o The reward function must be unalterable by the
agent.
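The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CorridorEnv` class, its five states, its action names and its reward of 1.0 at the goal are all hypothetical and do not come from the slides.

```python
class CorridorEnv:
    """Hypothetical toy environment: states 0..4, the episode ends at state 4."""

    def __init__(self):
        self.state = 0

    def actions(self, state):
        # A(s): the set of actions that can be executed in state s
        return ["left", "right"] if state > 0 else ["right"]

    def step(self, action):
        # Returns (next state s(t+1), reward r(t+1), done flag)
        self.state += 1 if action == "right" else -1
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, choose_action, max_steps=1000):
    """Run the agent-environment loop once and return the rewards received."""
    rewards = []
    state = env.state
    for _ in range(max_steps):
        action = choose_action(state, env.actions(state))
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


# A trivial agent that always moves right
rewards = run_episode(CorridorEnv(), lambda state, available: "right")
```

Note that the reward is computed inside the environment, so the agent can observe it but cannot alter it.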
6. Types of reinforcement learning
There are two types of reinforcement learning tasks:
1. Episodic task: The interaction with the environment is divided into independent episodes.
“Independent” means that performance in each episode depends only on the actions taken in that
episode.
In an episodic task, the return is the sum of all rewards received from the beginning of the episode until it ends,
where T denotes the terminal state, i.e. the end of the episode,
S0 is the starting state,
R denotes the total return,
r(k) denotes the reward received in the kth state.
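With the symbols defined above, the episodic return can be written out explicitly. The indexing is an assumption here: rewards are taken to be numbered r(1), ..., r(T), with T treated as the final time step of the episode.

```latex
R = r(1) + r(2) + \dots + r(T) = \sum_{k=1}^{T} r(k)
```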
7. Types of the reinforcement learning contd..
2. Continuing task: It consists of an infinite sequence of states, actions and rewards. In this task, the
agent-environment interaction does not break down into separate episodes. The performance depends
on the current action.
In a continuing task, the return depends on a discount factor γ,
which adjusts the relative importance of
long-term versus short-term consequences.
The discount factor lies between 0 and 1.
The discount factor reflects how far into the future the agent looks when evaluating its actions.
If γ = 0, the agent is only concerned with maximizing the immediate reward.
As γ approaches 1, the agent takes future rewards into account more strongly.
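The effect of the discount factor can be checked numerically. The sketch below assumes the standard discounted form of the return, R = r(1) + γ r(2) + γ² r(3) + ..., applied to a short, made-up reward sequence; the function name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]                       # made-up reward sequence
myopic = discounted_return(rewards, 0.0)        # only the immediate reward counts
balanced = discounted_return(rewards, 0.5)      # later rewards count for less
farsighted = discounted_return(rewards, 1.0)    # every reward counts equally
```

For a truly infinite reward sequence, γ must stay below 1 for the sum to remain finite when rewards are bounded, which is one reason continuing tasks use discounting.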
8. Element of the reinforcement learning
1. Policy:
It defines the learning agent’s way of behaving at a given time.
It might be a function or a simple lookup table.
Its only use in reinforcement learning is to determine the behavior.
2. Reward function:
It is the function that defines which events are good and which are bad for the agent.
It maps each state-action pair of the environment to a single real number.
It must necessarily be unalterable by the agent.
9. Elements in the reinforcement learning contd..
3. Value function:
It specifies what is good in the long run. The value of a state is the total amount of reward an
agent can expect to accumulate over the future, starting from that state.
Whereas a reward indicates the immediate desirability of an environmental state, a value indicates the
long-term desirability of a state.
4. Model:
It is used for planning. It mimics the behavior of the environment; e.g. given a state and an action, the model
might predict the next state and the next reward.
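As a rough illustration, a policy, a reward function and a model can each be represented with plain Python data structures. The two-state example below (states "cold" and "warm", actions "heat" and "wait") and every number in it are hypothetical and chosen only to show the shape of each element.

```python
# Policy as a simple lookup table: state -> action
policy = {"cold": "heat", "warm": "wait"}


def reward(state, action):
    """Reward function: maps a state-action pair to a single real number."""
    return 1.0 if (state, action) == ("cold", "heat") else 0.0


# Model: predicts (next state, next reward) for a given state and action
model = {
    ("cold", "heat"): ("warm", 1.0),
    ("cold", "wait"): ("cold", 0.0),
    ("warm", "heat"): ("warm", 0.0),
    ("warm", "wait"): ("cold", 0.0),
}


def plan_one_step(state):
    """Use the model to pick the action with the highest predicted next reward."""
    candidates = [a for (s, a) in model if s == state]
    return max(candidates, key=lambda a: model[(state, a)][1])
```

A one-step lookahead like `plan_one_step` only looks at immediate reward; a value function would be needed to judge long-run desirability.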
10. Algorithm of reinforcement learning
1. Markov decision processes:
• It is the standard, general formalism for sequential decision problems.
• It consists of a tuple <S, A, P, R>,
where S is the set of states,
A is the set of actions available to the agent,
P is the state transition function, P(s, a, s′) = Pr{s(t+1) = s′ | s(t) = s, a(t) = a}, which
defines the probability of transitioning to state s′ at time t + 1 after action a is taken when the agent is
in state s at time t,
R is the reward function, which gives the expected reward received after choosing action a
in state s and moving to the next state s′.
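The tuple <S, A, P, R> maps naturally onto a small data structure. The following is a minimal sketch; the `MDP` class, the two states and the transition probabilities are hypothetical, and storing P and R as dictionaries is just one possible encoding.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list   # S
    actions: list  # A
    P: dict        # P[(s, a)] -> {next state: probability}
    R: dict        # R[(s, a, next state)] -> expected reward


# Hypothetical two-state MDP used only for illustration
mdp = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    P={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "go"): {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"): {"s0": 1.0},
    },
    R={("s0", "go", "s1"): 1.0},  # transitions not listed give reward 0
)


def is_valid(mdp, tol=1e-9):
    """Check that every P[(s, a)] is a probability distribution over S."""
    return all(
        abs(sum(dist.values()) - 1.0) < tol and set(dist) <= set(mdp.states)
        for dist in mdp.P.values()
    )
```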
11. Algorithms in reinforcement learning
2. Dynamic programming (DP)
• It is a method for solving a Markov decision process, i.e. finding an optimal policy, when full
knowledge of the model is available.
• For dynamic programming, all the transition probabilities and reward expectations must be known.
• The algorithm updates the estimates of state values based on the estimates of the values of the next states.
• There are two basic DP methods for computing an optimal policy:
1. Policy iteration
2. Value iteration
12. Policy Iteration:
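• It forms a sequence of policies Ω0, Ω1, Ω2, …, Ωk, Ωk+1, where Ωk+1 is an improvement of Ωk.
• The policy evaluation task is concerned with computing the state-value function for a given policy Ω.
• Policy evaluation is done by an iterative algorithm that repeatedly updates the value of each state from the current values of its successor states.
• Estimating value functions is particularly useful for finding a better policy.
• The policy improvement algorithm uses the action-value function to improve the current policy. If
the action value of some action a in state s is higher than the value of s under Ω, then it is better to select
action a in state s than to follow policy Ω there.
• If Ω and Ω′ are two policies and this condition holds in every state for the actions chosen by Ω′, then Ω′ is a better policy than Ω.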
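The evaluation and improvement steps can be put together as follows. This is a sketch on a hypothetical three-state MDP with deterministic policies, using the standard backup from Sutton and Barto, V(s) ← Σ p · (r + γ V(s′)); the MDP, the discount factor 0.9 and the stopping threshold are assumptions made for the example.

```python
# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward)
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {"stay": [(1.0, 2, 0.0)], "go": [(1.0, 2, 0.0)]},  # absorbing goal state
}
GAMMA = 0.9


def q_value(V, s, a):
    """Action value: expected reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def evaluate(policy, theta=1e-8):
    """Iterative policy evaluation: sweep until the largest change is below theta."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(V, s, policy[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


def policy_iteration():
    policy = {s: "stay" for s in P}  # arbitrary initial policy
    while True:
        V = evaluate(policy)
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: q_value(V, s, a))
            # Switch only if the new action is strictly better than the current one
            if q_value(V, s, best) > q_value(V, s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


policy, V = policy_iteration()
```

The loop ends when no state's action changes, at which point the policy is greedy with respect to its own value function.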
13. Value iteration
• In value iteration, the optimal policy is not computed directly.
• Instead, the optimal value function is computed, and a greedy policy with respect to that function is
then an optimal policy.
• It stops when the changes introduced by backups/updates become
sufficiently small.
• A threshold value is chosen in advance, and the change in the value function after each sweep is compared with this threshold.
• Once the change is smaller than the threshold, the greedy policy with respect to the current value function is taken as the optimal policy.
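The stopping rule described above can be sketched as follows. The three-state MDP, the discount factor and the threshold `THETA` are hypothetical values picked for the example, not taken from the slides.

```python
# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward)
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {"stay": [(1.0, 2, 0.0)], "go": [(1.0, 2, 0.0)]},  # absorbing goal state
}
GAMMA = 0.9
THETA = 1e-8  # threshold on the largest change in one sweep


def backup(V, s, a):
    """Expected reward plus discounted value of the next state for action a."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration():
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(backup(V, s, a) for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < THETA:  # changes are sufficiently small: stop
            break
    # Greedy policy with respect to the computed value function
    policy = {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}
    return policy, V


policy, V = value_iteration()
```

Unlike policy iteration, no explicit policy is kept during the sweeps; it is derived once at the end.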
14. 3. Temporal difference
• The idea of temporal differences is taken from dynamic programming.
• Temporal difference learning and dynamic programming are both used for estimating value functions.
• In this method, learning takes place after every time step, which is beneficial as it makes for efficient
learning.
• The agent can revise its policy after every action and state it experiences.
• TD algorithms update the estimated state values based on each state transition and on the
immediate reward received from the environment on this transition.
• The simplest temporal difference algorithm is called TD(0); in tabular form it estimates the value function V of a policy Ω.
It updates by the following rule:
V(s) ← V(s) + α (r + γ V(s′) − V(s)), where α is the positive step-size parameter,
r is the immediate reward received on the transition and γ is the discount factor,
V(s) is the value function for the current state,
V(s′) is the value function for the next state.
r + γ V(s′) − V(s) is called the temporal difference error, which the updates are designed to move toward 0.
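A tabular TD(0) learner fits in a few lines. The sketch below uses the standard form of the update from Sutton and Barto, V(s) ← V(s) + α (r + γ V(s′) − V(s)), on a hypothetical five-state chain in which a fixed policy always moves one state to the right; the chain, the reward of 1.0 at the end, and the values of α and γ are assumptions for the example.

```python
GAMMA = 0.9    # discount factor
ALPHA = 0.5    # step-size parameter
TERMINAL = 4   # last state of the chain


def step(state):
    """Hypothetical fixed policy on a 5-state chain: always move one state right."""
    next_state = state + 1
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def td0(episodes=200):
    V = [0.0] * (TERMINAL + 1)  # the terminal state's value stays 0
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            s_next, r = step(s)
            td_error = r + GAMMA * V[s_next] - V[s]
            V[s] += ALPHA * td_error  # update after every single transition
            s = s_next
    return V


V = td0()
```

Because this chain is deterministic, the estimates settle on the exact values 0.9³, 0.9², 0.9 and 1 for states 0 to 3; in a stochastic environment the TD error would only average out to zero.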
15. Type of selection of states
1. Greedy
2. Exploration process
a) Providing initial knowledge
b) Deriving a policy from demonstration
c) Asking for help
d) Teacher provides advice
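One common way to combine greedy selection with exploration is the ε-greedy rule. It is not named in the slides and is shown here only as an assumed illustration; the function name `select_action` and the action-value numbers are hypothetical.

```python
import random


def select_action(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))       # exploration
    return max(q_values, key=q_values.get)      # greedy selection


q_values = {"left": 0.2, "right": 0.7}  # hypothetical action-value estimates
rng = random.Random(0)
greedy_choice = select_action(q_values, epsilon=0.0, rng=rng)
explored_choice = select_action(q_values, epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the rule is purely greedy; larger values trade immediate reward for more information about the other actions.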
16. Application of reinforcement learning
1. Benchmark Problems
a) Mountain car
b) Cart-pole balancing
c) Pendulum swing-up, etc.
2. Games
a) Tic-Tac-Toe
b) Chess, etc.
3. Real world applications
a) Robotics
b) Control of helicopter
c) Prediction of stock prices
17. References
• R. S. Sutton. Reinforcement learning: past, present and future [online]. 1999. Available from
http://www-anw.cs.umass.edu/~rich/Talks/SEAL98/SEAL98.html [accessed December 2005].
• R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA: The MIT
Press, 1998.
• M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John
Wiley and Sons, Inc., New York, NY, 1994.