1. MOUNTAIN CAR PROBLEM USING TEMPORAL DIFFERENCE (TD) & VALUE ITERATION (VI) REINFORCEMENT LEARNING ALGORITHMS
By Muzammil Abdulrahman & Yusuf Garba Dambatta
Mevlana University, Konya, Turkey
2013
2. INTRODUCTION
The aim of the mountain car problem is for the car to learn over two continuous state variables,
• position and
• velocity,
so that it can reach the top of the mountain in a minimum number of steps.
Starting from rest, the car's engine power alone is not strong enough to carry it over the hill in front.
4. INTRODUCTION CONT.
Instead, the car must accelerate forward and backward in order to gather momentum.
The agent receives a negative reward at every time step in which the goal is not reached.
The agent has no information about the goal until its first success, so it must use reinforcement learning methods.
In this project, we employed the TD Q-learning and value iteration algorithms.
5. REINFORCEMENT LEARNING
Reinforcement learning is a distinct learning paradigm in the field of machine learning, in which an estimate of the correctness of the answer is provided to the system.
It deals with how an agent should take actions in an environment so as to maximize a cumulative reward.
It is learning from interaction, and it is goal-oriented learning.
6. CHARACTERISTICS
No direct training examples – (delayed) rewards instead
Goal-oriented learning
Learning about, from, and while interacting with an external environment
Need to balance exploration of the environment against exploitation
The environment might be stochastic and/or unknown
The learning actions of the agent affect future rewards
9. UNSUPERVISED LEARNING
[Block diagram] Training info = evaluations (rewards/penalties); input → RL system → output (actions).
Objective: get as much reward as possible.
10. SUPERVISED LEARNING
[Block diagram] Training info = desired (target) outputs; input → supervised learning system → output.
Training example = {input (state), target output}
Error = target output − actual output
11. TEMPORAL DIFFERENCE(TD)
Temporal difference (TD) learning is a prediction method.
It has mostly been used for solving the reinforcement learning problem.
TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas.
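As a concrete illustration (this is the standard one-step TD(0) formulation; the slides themselves do not show it), the update for the state-value function is

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$

where α is the learning rate and γ the discount factor. The bracketed temporal-difference error combines a Monte Carlo-style sampled reward with a DP-style bootstrapped estimate V(s_{t+1}).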
13. TD Q-LEARNING ALGORITHM
Initialize the Q-values for all states s and actions a
Obtain the current state
Select an action according to the current state
Implement the selected action and obtain an immediate reward and the next state
Update the Q-function according to the Q-learning update rule

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Update the system state
Stop the algorithm if the maximum number of iterations is reached. A Python sketch of the full loop follows.
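A minimal, self-contained sketch of this loop in Python. The dynamics follow the classic Sutton & Barto mountain-car formulation, adapted to the position range (−1.5 to 0.55) and velocity range (−0.07 to 0.07) given on slide 15; the grid resolution, α, and episode counts are illustrative assumptions, not the authors' settings.

import math
import random

N_POS, N_VEL = 40, 40                  # discretization grid for (position, velocity)
ACTIONS = [-1, 0, 1]                   # backward, neutral, forward
ALPHA, GAMMA, EPSILON = 0.5, 1.0, 0.01

def step(pos, vel, action):
    # One step of the car dynamics; returns the next state and reward.
    vel = min(max(vel + 0.001 * action - 0.0025 * math.cos(3 * pos), -0.07), 0.07)
    pos = min(max(pos + vel, -1.5), 0.55)
    if pos <= -1.5:
        vel = 0.0                      # hitting the left wall kills the speed
    reward = 0.0 if pos >= 0.55 else -1.0
    return pos, vel, reward

def discretize(pos, vel):
    # Map the continuous state onto Q-table indices.
    i = int((pos + 1.5) / (0.55 + 1.5) * (N_POS - 1))
    j = int((vel + 0.07) / 0.14 * (N_VEL - 1))
    return i, j

Q = {}                                 # Q[(i, j, a)] -> value, default 0.0

def q(s, a):
    return Q.get((s[0], s[1], a), 0.0)

for episode in range(200):
    pos, vel = -0.5, 0.0               # start at rest in the valley
    for t in range(10000):
        s = discretize(pos, vel)
        # epsilon-greedy action selection (see slide 14)
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: q(s, act))
        pos, vel, r = step(pos, vel, a)
        s2 = discretize(pos, vel)
        target = r + GAMMA * max(q(s2, act) for act in ACTIONS)
        # Q-learning update: move Q(s,a) toward the bootstrapped target
        Q[(s[0], s[1], a)] = q(s, a) + ALPHA * (target - q(s, a))
        if r == 0.0:                   # goal reached
            break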
14. ε-GREEDY SELECTION (Q, S, EPSILON)
The agent selects an action from the Q-table according to the ε-greedy strategy: with probability ε, a random action is chosen.
Initially, ε = 0.01.
ε becomes approximately zero once the car agent has fully learned how to climb the front hill (no randomness, because it has learned the best action). A sketch of this rule is given below.
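The selection rule as a standalone sketch. The slides only say that ε starts at 0.01 and approaches zero once the car has learned, so the decay schedule shown here is a hypothetical choice:

import random

def epsilon_greedy(q_row, epsilon):
    # q_row maps each action to its current Q-value for the present state.
    if random.random() < epsilon:
        return random.choice(list(q_row))   # explore: random action
    return max(q_row, key=q_row.get)        # exploit: best-known action

epsilon = 0.01                              # initial exploration probability (slide value)
for episode in range(1000):
    # ... run one episode, calling epsilon_greedy(q_row, epsilon) at each step ...
    epsilon *= 0.995                        # hypothetical decay toward zero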
15. STATE, ACTION & REWARD
State: the state consists of position and velocity. Position lies in the range −1.5 to 0.55 and velocity in the range −0.07 to 0.07.
Action: at every time step the agent takes one of three actions: forward, backward, or neutral (forward acceleration = +1 m/s², backward acceleration = −1 m/s², neutral = 0 m/s²).
Reward: the agent receives a reward of −1 for every action, except when it reaches the goal state, where it receives a reward of 0. This specification is captured in the sketch below.
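The specification above translates directly into code. A sketch in Python: the constants are taken from the slide, while the goal test is our reading of "goal state":

POS_MIN, POS_MAX = -1.5, 0.55      # position range (slide 15)
VEL_MIN, VEL_MAX = -0.07, 0.07     # velocity range (slide 15)
ACCELERATION = {"forward": +1.0, "backward": -1.0, "neutral": 0.0}  # m/s^2

def reward(position):
    # -1 at every step; 0 once the goal state (top of the front hill) is reached.
    return 0.0 if position >= POS_MAX else -1.0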
16. VALUE ITERATION
The value iteration algorithm, which is also called backward induction, combines policy improvement and a truncated policy evaluation into a single update step:

$$V(s) \leftarrow R(s) + \gamma \max_{a} \sum_{s'} T(s,a,s')\, V(s')$$
17. VALUE ITERATION ALGORITHM
Inputs: (S, A, T, R, γ) and a threshold value ε
Initialize V₀(s) = 0 for every state s
For each state, compute the next approximation using the Bellman backup equation

$$V_{k+1}(s) \leftarrow R(s) + \gamma \max_{a} \sum_{s'} T(s,a,s')\, V_k(s')$$

$$\delta \leftarrow \max_{s} \lvert V_{k+1}(s) - V_k(s) \rvert$$

Repeat until δ < ε
Return V. A sketch of this loop in code follows.
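A compact sketch of this loop for a generic finite MDP given as explicit tables; in the mountain-car project the continuous state would first be discretized (our assumption about their setup), and the variable names are ours:

def value_iteration(states, actions, T, R, gamma, eps=1e-4):
    # T[s][a] is a list of (next_state, probability) pairs; R[s] is the state reward.
    V = {s: 0.0 for s in states}                       # V0 = 0 everywhere
    while True:
        V_new = {
            # Bellman backup: R(s) + gamma * max_a sum_s' T(s,a,s') * V(s')
            s: R[s] + gamma * max(
                sum(p * V[s2] for s2, p in T[s][a]) for a in actions
            )
            for s in states
        }
        delta = max(abs(V_new[s] - V[s]) for s in states)  # convergence measure
        V = V_new
        if delta < eps:                                 # stop once below threshold
            return V

The greedy policy is then read off by choosing, in each state, the action whose backup attains the maximum.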
18. GRAPHICAL RESULTS
The graph shows the relation between the RMS value (also called the policy loss) and the number of episodes.
The RMS value is the root-mean-square error between the current Q-values and the previous Q-values.
With some small probability, the agent chooses an action randomly; if the chosen action happens to be bad, it causes an instant rise in the error.
At convergence, the error is approximately zero.
In our case, convergence is reached when 3 or more successive RMS values equal 0.0001 or less, as in the helper sketched below.
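The convergence test the slide describes can be written as a small helper; a sketch, assuming the Q-tables are dicts as in the earlier sketch:

import math

def rms_error(q_new, q_old):
    # Root-mean-square difference between successive Q-tables.
    return math.sqrt(sum((q_new[k] - q_old.get(k, 0.0)) ** 2 for k in q_new) / len(q_new))

def converged(rms_history, threshold=0.0001, run=3):
    # True once 3 or more successive RMS values are at or below the threshold.
    return len(rms_history) >= run and all(v <= threshold for v in rms_history[-run:])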
19. The car in the mountain will be displayed at the 11th iteration to visualize how the car agent learns.
22. RESULT CONT.
The car in the mountain will be displayed at the 11th iteration to visualize how the car agent learns.
After the 11th iteration, the display is stopped to reduce the time it takes to converge.
After 3 or more successive RMS values of 0.001 or less, the car is displayed again to show that it has fully learned how to reach the goal state in any episode while maintaining a constant number of steps.
23. VI RESULTS
The graph below shows the convergence error over iterations.
24. VI CONT.
Figure 6 shows the graph of optimal positions and velocities over time on top, while the bottom one displays the car learning in the mountain.
25. VI CONT.
The first episode records the highest error.
This is because the error is the difference between the current value function and the previous one, i.e. error = V_{k+1}(s) − V_k(s).
But initially the previous value function is 0, hence error = V_{k+1}(s).
26. VI CONT.
At subsequent episodes, the error keeps decreasing as successive value functions approach one another.
At convergence, the error approaches 0 and falls below the threshold value (ε = 0.0001), which is the termination criterion for this project.
Finally, the optimal policy is returned.
27. VI CONT.
The graphs below show the optimal positions and velocities over time.
The first graph is that of the optimal positions over time.
It simply shows the optimal positions attained by the car as it attempts to reach the goal state at different times.
28. CONT.
The second graph likewise shows the optimal velocities attained by the car as it attempts to reach the goal state at different times.
The car initially accelerates from its rest position to attain a position of −0.2; it then swings back to gather enough momentum, attaining a position of −0.95; finally, it accelerates forward again and reaches the goal state.
29. CONCLUSION
In this project, the temporal difference and value iteration learning algorithms were implemented for the mountain car problem. Both algorithms converged, determining the optimal policy for reaching the goal state.
Editor's notes
Slide 19: figure shows the graph of RMS vs. episode at the 11th episode at the top, while the bottom one displays the car learning in the mountain.
Slide 20: figure shows the graph of RMS vs. episode at the 1000th episode.