MOUNTAIN CAR
PROBLEM
USING TEMPORAL DIFFERENCE (TD)
& VALUE ITERATION (VI)
REINFORCEMENT LEARNING
ALGORITHMS
By
Muzammil Abdulrahman
&
Yusuf Garba Dambatta
Mevlana University Konya, Turkey
2013
INTRODUCTION
 The aim of the mountain car problem is for the car to learn, from two continuous variables,
• position and
• velocity,
 how to reach the top of the mountain in a minimum number of steps.
 Starting the car from rest, its engine power alone is not strong enough to carry it over the hill in front.
2
INTRODUCTION CONT.
To climb up the hill, the car would need to swing back and forth inside the valley,
3
INTRODUCTION CONT.
 accelerating forward and backward in order to gather momentum.
 The agent receives a negative reward at every time step in which the goal is not reached.
 The agent has no information about the goal until an initial success, so it uses reinforcement learning methods.
 In this project, we employed the TD Q-learning and value iteration algorithms.
4
REINFORCEMENT LEARNING
 Reinforcement learning is a distinct learning paradigm in the field of machine learning
 in which only an estimate of the correctness (reward) of an answer is provided to the system.
 It deals with how an agent should take actions in an environment so as to maximize a cumulative reward.
 It is learning from interaction
 and goal-oriented learning.
5
CHARACTERISTICS
 No direct training examples – (delayed) rewards instead
 Goal-oriented learning
 Learning about, from, and while interacting with an external environment
 Need to balance exploration of the environment with exploitation
 The environment might be stochastic and/or unknown
 The actions the agent takes while learning affect future rewards
6
EXAMPLES
 Robot moving in an environment
7
EXAMPLES
 Chess Master
8
UNSUPERVISED LEARNING
Training info = evaluation (rewards/penalties)
Input → RL System → Output (actions)
Objective: get as much reward as possible
9
SUPERVISED LEARNING
Training info = desired (target) outputs
Input → Supervised Learning System → Output
Training example = {input (state), target output}
Error = (target output – actual output)
10
TEMPORAL DIFFERENCE (TD)
 Temporal difference (TD) learning is a prediction method.
 It has mostly been used for solving the reinforcement learning problem.
 TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas.
11
TD Q-LEARNING
 The TD Q-learning update rule is as follows:

Q(s, a) ← Q(s, a) + α [ R(s) + γ max_a′ Q(s′, a′) − Q(s, a) ]

where α is the learning rate, γ the discount factor, s′ the next state, and a′ ranges over the actions available in s′.
12
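A minimal Python sketch of this backup, assuming a tabular Q stored as a NumPy array indexed by a (discretized) state index and an action index; the table layout and the alpha/gamma defaults are illustrative assumptions, not the project's actual code.

import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.5, gamma=0.99):
    # Q(s, a) <- Q(s, a) + alpha * [R(s) + gamma * max_a' Q(s', a') - Q(s, a)]
    td_target = reward + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error  # the Q-value change, handy for an RMS convergence check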
TD Q-LEARNING ALGORITHM
 Initialize the Q values for all states ‘s’ and actions ‘a’
 Obtain the current state
 Select an action according to the current state
 Apply the selected action and obtain an immediate reward and the next state
 Update the Q function according to the above equation
 Update the system state
 Stop the algorithm when the maximum number of iterations is reached (a sketch of this loop follows below)
13
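Roughly, the loop above could be organized as follows; env_step, discretize and select_action are hypothetical helpers standing in for the project's environment, state discretization and ε-greedy selection, and q_update is the sketch from the previous slide.

import numpy as np

def td_q_learning(env_step, discretize, select_action, n_states, n_actions,
                  start_state, max_iterations=1000):
    Q = np.zeros((n_states, n_actions))      # initialize Q for all states and actions
    s = discretize(start_state)              # obtain the current (discretized) state
    for _ in range(max_iterations):          # stop at the maximum number of iterations
        a = select_action(Q, s)              # select an action for the current state
        reward, next_state = env_step(a)     # apply it; get immediate reward and next state
        s_next = discretize(next_state)
        q_update(Q, s, a, reward, s_next)    # update Q with the TD rule above
        s = s_next                           # update the system state
    return Q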
ε-GREEDY SELECTION (Q, S, EPSILON)
 The agent selects an action from the Q table based on the ε-greedy strategy.
 Initially, epsilon = 0.01, which is the probability of selecting a random action.
 It becomes approximately equal to zero when the car agent has fully learned how to climb the front hill (no randomness, because it has learned the best action).
14
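One possible implementation of this selection rule; the random-number generator argument and the way epsilon would be decayed toward zero are assumptions.

import numpy as np

def epsilon_greedy(Q, s, epsilon=0.01, rng=None):
    # With probability epsilon pick a random action, otherwise the greedy one.
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: random action
    return int(np.argmax(Q[s]))               # exploit: best known action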
STATE, ACTION & REWARD
State: the states are position and speed. Position lies in the range -1.5 to 0.55 and speed lies in the range -0.07 to 0.07.
Action: the agent takes one of 3 actions at every time step: forward, backward, or neutral (forward acceleration = +1 m/s², backward acceleration = -1 m/s², neutral = 0 m/s²).
Reward: the agent receives a reward of -1 for all actions except when it reaches the goal state, where it receives a reward of 0.
15
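For reference, a sketch of one simulation step that is consistent with the ranges and rewards above. The dynamics follow the classic mountain-car formulation (force 0.001 per unit action, gravity term 0.0025·cos(3x)); the project's exact dynamics, and how they relate to the ±1 m/s² accelerations quoted above, are assumptions here.

import math

POS_MIN, POS_MAX = -1.5, 0.55      # position range from the slide
VEL_MIN, VEL_MAX = -0.07, 0.07     # velocity range from the slide
GOAL_POS = 0.55                    # top of the front hill
ACTIONS = {0: -1, 1: 0, 2: +1}     # backward, neutral, forward

def step(position, velocity, action):
    a = ACTIONS[action]
    velocity += 0.001 * a - 0.0025 * math.cos(3 * position)
    velocity = max(VEL_MIN, min(VEL_MAX, velocity))
    position += velocity
    position = max(POS_MIN, min(POS_MAX, position))
    if position <= POS_MIN:         # hitting the back wall kills the velocity
        velocity = 0.0
    done = position >= GOAL_POS
    reward = 0.0 if done else -1.0  # -1 every step, 0 on reaching the goal
    return position, velocity, reward, done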
VALUE ITERATION
 The value iteration algorithm, which is also called backward induction,
 combines policy improvement and a truncated policy evaluation into a single update step:

V(s) ← R(s) + γ max_a ∑_s′ T(s, a, s′) V(s′)
16
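In code, a single backup of this kind could look as follows, assuming a tabular model where T is a transition array T[s, a, s′] and R a reward vector; this storage layout is an assumption, not necessarily how the project represents the model.

import numpy as np

def bellman_backup(V, T, R, s, gamma=0.99):
    # V(s) <- R(s) + gamma * max_a sum_{s'} T(s, a, s') * V(s')
    return R[s] + gamma * np.max(T[s] @ V)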
VALUE ITERATION ALGORITHM
 Inputs: (S, A, T, R, γ) and ε, a threshold value
 Initialize V0(s) for all states ‘s’
 For each state, compute the next approximation using the Bellman backup equation:
V(s) ← R(s) + γ max_a ∑_s′ T(s, a, s′) V(s′)
δ ← max_s |V_new(s) − V_old(s)|
 Repeat until δ < ε
 Return V (a sketch of this loop follows below)
17
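Putting the steps together, a sketch of the loop with the threshold test, reusing the hypothetical T and R arrays and the bellman_backup helper above.

import numpy as np

def value_iteration(T, R, gamma=0.99, eps=1e-4):
    n_states = R.shape[0]
    V = np.zeros(n_states)                        # initialize V0
    while True:
        V_new = np.array([bellman_backup(V, T, R, s, gamma)
                          for s in range(n_states)])
        delta = float(np.max(np.abs(V_new - V)))  # delta <- max_s |V_new(s) - V_old(s)|
        V = V_new
        if delta < eps:                           # until delta < eps
            return V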
GRAPHICAL RESULTS
 The graph shows the relation between the RMS value (also called policy loss) and the number of episodes.
 The RMS value is the error between the current Q value and the previous Q value.
 With some probability the agent chooses an action randomly; if the chosen action happens to be bad, it causes an instant rise in the error.
 At convergence, the error is approximately zero.
 In our case, convergence is reached when 3 or more successive RMS values equal 0.0001 or less (a possible implementation of this check follows below).
18
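One way such a stopping rule might be implemented; the RMS here is computed over the whole Q table between successive episodes, which is an assumption about how the project defines it.

import numpy as np

def rms_change(Q_new, Q_old):
    # Root-mean-square difference between successive Q tables.
    return float(np.sqrt(np.mean((Q_new - Q_old) ** 2)))

def has_converged(rms_history, tol=1e-4, needed=3):
    # True once 3 or more successive RMS values are 0.0001 or less.
    return len(rms_history) >= needed and all(r <= tol for r in rms_history[-needed:])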
 The car in the mountain will be displayed at the 11th iteration to visualize how the car agent learns.
19
GRAPH
20
CONT.
Figure shows the graph of Total Reward vs. Episode at the 1000th episode.
21
RESULT CONT.
 The car in the mountain will be displayed at the 11th iteration to visualize how the car agent learns.
 After the 11th iteration, the display is stopped to reduce the time it takes to converge.
 After 3 or more successive RMS values equal 0.001 or less, the car is displayed again to show that it has fully learned how to reach the goal state in any episode, maintaining a constant number of steps.
22
VI RESULTS
 The graph below shows the convergence error over iterations.
23
VI CONT.
Figure 6 shows the graph of Optimal Positions and Velocities over time on top, while the bottom one displays the car learning in the mountain.
24
VI CONT.
 The first episode records the highest error.
 This is because the error is the difference between the current value function and the previous value function, i.e. Error = V_new(s) − V_old(s).
 But initially the previous value function is 0,
 hence Error = V_new(s).
25
VI CONT.
 At subsequent episodes, the error keeps decreasing as the updated value functions increase.
 At convergence, the error is approximately zero and is less than the threshold value (ε = 0.0001), which is the termination criterion for this project.
 Finally, the optimal policy is returned.
26
VI CONT.
 The graphs below show the optimal positions and velocities over time.
 The first graph is that of the optimal positions over time.
 It simply shows the optimal positions attained by the car as it attempts to reach the goal state at different times.
27
CONT.
 The second graph shows the optimal velocities attained by the car as it attempts to reach the goal state at different times.
 The car initially accelerates from its rest position to reach a position of -0.2; it then swings back to gather enough momentum, reaching a position of -0.95; it finally accelerates forward again and reaches the goal state.
28
CONCLUSION
In this project, the temporal difference and value iteration learning algorithms were implemented for the mountain car problem. Both algorithms converged and determined the optimal policy for reaching the goal state.

29

Speaker notes
19: Figure shows the graph of RMS vs. Episode at the 11th episode at the top, while the bottom one displays the car learning in the mountain.
20: Figure shows the graph of RMS vs. Episode at the 1000th episode.