Reinforcement Learning
for Self Driving Cars
Vinay Sameer Kadi and Mayank Gupta, with Prof. Jeff Schneider
Motivation
Self driving cars today
Credits: NVIDIA Drive
Credits: Prof Jeff Schneider – RI Seminar Talk
Motivation
Credits: YouTube
Credits: Prof Jeff Schneider – RI Seminar Talk
Goal – to make self-driving …
• Scalable to new domains.
• Robust to rare long-tail events.
• Verifiable in performance through simulation.
Motivation
• A good policy exists!
Credits: Chen et al.,“Learning by cheating”
(https://arxiv.org/pdf/1912.12294.pdf)
Motivation
• A good policy exists!
• RL should in theory
outperform imitation
learning.
Credits: OpenAI Five
(Berner et al., “Dota 2 with Large Scale Deep Reinforcement Learning”)
Motivation
• Given a good policy, it can be
optimized further every time a
safety driver intervenes.
• RL could, in theory, exceed human performance.
Credits: Wayve
Types of RL
algorithms
Types of RL algorithms
• On-policy algorithms
• Use actions from the current policy to obtain training data and update values.
• Off-policy algorithms
• Use actions from a separate “behavior” policy to obtain training data and update values.
Brief Recap of RL
• Reward - R(s,a)
• State Value Function - V(s)
• State-Action Value Function - Q(s,a)
• Discount Factor – γ
• Tabular Q-Learning (a minimal sketch follows below)
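To make the tabular update concrete, here is a minimal sketch of one Q-learning step in Python; the gym-style environment interface and the hyperparameters are illustrative assumptions, not part of the original slides.

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))          # the Q table
alpha, gamma, epsilon = 0.1, 0.99, 0.1       # learning rate, discount factor, exploration rate

def q_learning_step(env, s):
    # Epsilon-greedy action selection from the current table
    a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(Q[s].argmax())
    s_next, r, done, _ = env.step(a)         # assumed gym-like step() signature
    # Tabular Q-learning update: bootstrap from the greedy value of the next state
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])
    return s_next, done
```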
Deep Q-Networks (DQN)
• First to use deep neural networks for learning Q functions [1]
• Main contributions:
• Uses target networks
• Uses replay buffer
[1] Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
• Pros:
• Off-policy, hence sample efficient (the replay buffer reuses past experience; sketch below)
• Cons:
• Maximization bias from the bootstrapped max over estimated Q-targets
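As a rough sketch of the two contributions named above, the update below samples a minibatch from a replay buffer and bootstraps against a frozen target network; `q_net`, `target_net`, and the buffer layout are assumptions for illustration, not the exact Atari setup.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

replay_buffer = deque(maxlen=100_000)  # stores (s, a, r, s_next, done) tuples of tensors

def dqn_update(q_net, target_net, optimizer, batch_size=32, gamma=0.99):
    # Sampling uniformly from the buffer breaks the correlation between consecutive samples
    s, a, r, s_next, done = map(torch.stack, zip(*random.sample(replay_buffer, batch_size)))
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target network parameters stay fixed between periodic copies, stabilizing the target
        target = r + gamma * (1 - done.float()) * target_net(s_next).max(dim=1).values
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Note the `max` in the target: because it is taken over estimated values, it is the source of the maximization bias listed under cons.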
Policy Gradients
• Why policy gradients?
• Direct method to compute optimal policy
• Parametrize policies and optimize them using loss functions [1]
• Advantageous in large/continuous action domains
[1] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
The policy gradient theorem:

\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta)

where \mu(s) is the state distribution, q_\pi(s, a) the state-action value function, and \nabla_\theta \pi(a \mid s, \theta) the gradient of the policy function.
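A Monte-Carlo (REINFORCE-style) estimate of this gradient can be sketched as follows; the episode data layout is illustrative, and `log_probs` is assumed to hold the differentiable log-probabilities collected during the rollout.

```python
import torch

def reinforce_update(policy_optimizer, log_probs, rewards, gamma=0.99):
    # Discounted returns G_t serve as an empirical estimate of q_pi(s_t, a_t)
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Gradient ascent on sum_t log pi(a_t | s_t, theta) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    policy_optimizer.zero_grad(); loss.backward(); policy_optimizer.step()
```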
Trust Region Policy Optimization
• Pros
• Introduced the idea that a large shift in policy is bad!
• Constraining the update thus reduces sample complexity.
• Cons
• It is an on-policy algorithm.
Schulman, John, et al. "Trust region policy optimization." International conference on
machine learning. 2015.
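For reference, the constrained problem TRPO solves can be written as

\max_\theta \; \hat{\mathbb{E}}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t \right]
\quad \text{subject to} \quad
\hat{\mathbb{E}}_t\!\left[ \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta,

i.e., maximize the importance-weighted advantage while keeping the new policy inside a KL ball (the trust region) around the old one.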
Proximal Policy Optimization (PPO)
Â_t is the advantage estimate (roughly Q(s,a) − V(s)) and plays the role of Q within the expectation
• PPO was an improvement on TRPO.
• The hard KL constraint can be rearranged into the softer KL-penalty loss written below.
• But their main contribution is…
Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
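In LaTeX, the KL-penalized surrogate from the paper is

L^{\mathrm{KLPEN}}(\theta) = \hat{\mathbb{E}}_t\!\left[ r_t(\theta) \hat{A}_t - \beta \, \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.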
Proximal Policy Optimization (PPO)
• The clip loss function (shown below)!
• They clip the probability ratio in the objective instead of imposing a KL constraint.
• Gains from good actions are capped, while penalties for bad actions are never reduced.
Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
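The clipped surrogate objective from the paper:

L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta) \hat{A}_t,\; \mathrm{clip}\!\left( r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_t \right) \right].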
Proximal Policy Optimization (PPO)
Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
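This slide showed the PPO algorithm box from the paper; a rough Python sketch of the clipped update it describes is below, with `policy`, the rollout tensors, and the hyperparameters as assumptions for illustration.

```python
import torch

def ppo_update(policy, optimizer, states, actions, old_log_probs, advantages,
               clip_eps=0.2, epochs=10):
    for _ in range(epochs):  # several epochs of SGD on the same rollout
        dist = policy(states)                     # assumed to return a torch distribution
        ratio = torch.exp(dist.log_prob(actions) - old_log_probs)  # r_t(theta)
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
        # L^CLIP: take the pessimistic minimum of the two surrogates
        loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```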
Actor Critic Algorithms
• What if the gradient estimator in policy gradients has too
much variance?
• What does that mean?
• It takes too many interactions with the environment to learn the optimal policy parameters
Actor Critic Algorithms
• It turns out that we can control this variance using value functions.
• If we have some information about the current state, the gradient estimate can be better (see the sketch after this list).
• Actor
• Policy network
• Critic
• Value function
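A one-step actor-critic update can be sketched as below; `actor` and `critic` are assumed torch modules matching the two roles listed above, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def actor_critic_update(actor, critic, opt_actor, opt_critic,
                        s, a, r, s_next, done, gamma=0.99):
    v = critic(s).squeeze(-1)                     # V(s) from the critic
    v_next = critic(s_next).squeeze(-1).detach()  # V(s') as a fixed bootstrap target
    td_target = r + gamma * (1 - done) * v_next
    advantage = (td_target - v).detach()          # state information reduces gradient variance
    critic_loss = F.mse_loss(v, td_target)        # critic catches up with the changing policy
    actor_loss = -(actor(s).log_prob(a) * advantage).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```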
Soft Actor Critic
• Uses the maximum-entropy RL framework [1] (objective written below)
• Uses the clipped double-Q trick to avoid maximization bias
[1] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).
Credits: BAIR
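The two ingredients above, written out: the maximum-entropy objective from the paper,

J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\!\left( \pi(\cdot \mid s_t) \right) \right],

and the clipped double-Q target for the critics (in its commonly used form),

y = r + \gamma \left( \min_{i=1,2} Q_{\bar{\theta}_i}(s', a') - \alpha \log \pi(a' \mid s') \right), \quad a' \sim \pi(\cdot \mid s'),

where taking the minimum of two Q estimates counters the maximization bias.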
Soft Actor Critic
• Advantages:
• Off-policy algorithm
• Exploration is inherently handled
[1] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).
Credits: BAIR
Soft Actor Critic
Performance
[1] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).
Experimental Setting (Past Work)
• State Space
• Semantically segmented bird's-eye-view images
• An autoencoder is then trained on them.
• Waypoints!
• Action Space
• Speed – continuous, controlled using a simulated PID controller
• Steering angle – continuous
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
Experimental Setting (Past Work)
• Inputs include waypoint features describing the route to follow
• Uses the CARLA Simulator
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
Experimental Setting (Past Work)
• Rewards (a hypothetical combination is sketched below)
• Speed reward: assuming we are following the waypoints, this rewards progress toward the goal
• Deviation penalty: penalize deviating from the trajectory/waypoints
• Collision penalty: avoid collisions; if a collision is unavoidable, collide at low speed
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
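A hypothetical combination of these three terms, purely for illustration; the weights and exact functional forms are assumptions, not the values used in the cited work.

```python
def reward(speed, deviation, collided, w_speed=1.0, w_dev=1.0, w_col=10.0):
    r = w_speed * speed               # speed reward: progress along the waypoints toward the goal
    r -= w_dev * deviation            # deviation penalty: stay close to the trajectory/waypoints
    if collided:
        r -= w_col * (1.0 + speed)    # collision penalty: colliding at low speed hurts less
    return r
```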
Complete Pipeline
(Past Work)
• AE network for state representation
• Shallow policy network (pipeline shape sketched below)
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
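An illustrative sketch of the pipeline shape described above: a pretrained, frozen autoencoder compresses the bird's-eye-view image, and a shallow MLP maps the latent code plus waypoint features to continuous controls. All names and sizes are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class WaypointPolicy(nn.Module):
    def __init__(self, encoder, latent_dim=64, waypoint_dim=10):
        super().__init__()
        self.encoder = encoder.eval()            # frozen AE encoder for state representation
        self.policy = nn.Sequential(             # shallow policy network
            nn.Linear(latent_dim + waypoint_dim, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Tanh(),         # continuous steer and speed targets in [-1, 1]
        )

    def forward(self, bev_image, waypoints):
        with torch.no_grad():                    # the state representation is not trained by RL here
            z = self.encoder(bev_image)
        return self.policy(torch.cat([z, waypoints], dim=-1))
```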
Past Work
Good navigation in empty lanes
Crashes into stationary cars
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
• Uses PPO at the
moment
• DQN is being
tried
• We want to use
SAC for this task
Future Work
• Next steps:
• Focus on settings with dynamic actors.
• Improve exploration in current settings using SAC.
• Train in dense environments, possibly also through self-play RL.
Thank you
Hit us with questions!
We’d appreciate any useful suggestions.
Editor's Notes
1–2. NOTES: Self-driving cars today process sensor inputs through tens of learned systems. NEXT: The existing approach works well and will eventually yield fully autonomous cars, at least in specific domains, but we need a less engineering-intensive version of the current approach: less hand-engineering for tail scenarios, and transferability across cities. Perception, prediction, mapping and localization, and planning all include several subsystems that do various tasks. Assuming perception is solved, we want to use RL to replace the other components.
3. Learning by Cheating recently achieved 100% performance on CARLA's benchmark, which shows that a good policy exists. NEXT: how it works; it needs expert driving trajectories.
4. RL has repeatedly shown itself capable of outperforming humans on highly complex tasks with large branching factors: Dota 2's branching factor is about 10^4, while chess has about 35 and Go about 250. However, transferring this performance to the real world in a noisy environment is a big challenge for RL.
5. This is the first car to learn self-driving using reinforcement learning. Explain the points on screen.
6. Model-based RL methods can suffer from error propagation: it is difficult to fit a model to the real world, unlike in chess or AlphaGo, so we want model-free RL. There are two major classes: optimize the policy or the value function, or use a method that combines both.
7. TO EXPLAIN: off-policy vs. on-policy methods. Off-policy advantages: replay buffers help with highly correlated data; off-policy methods are more sample efficient because they can reuse past experiences for training later on, and important experiences can be saved and replayed. In green – off-policy; in purple – on-policy.
  8. As we learned in 10-601
9. As we learned in 10-601: 1) Tabular Q-learning works well when the state space is finite, but as the state space grows we need function approximation, which takes state and action as input and returns a Q value. 2) We update the parameters until we get the Q values of all state-action pairs correct. 3) But the update equation assumes i.i.d. samples, while in RL tasks the states are correlated; one of the major contributions of DQN is using a replay buffer to break the correlations between samples. 4) Also, the target changes (it is non-stationary), which causes instability in learning; target networks hold their parameters fixed and avoid the changing-targets problem. 5) But one problem remains: the target is estimated. What if the target is wrong? That leads to maximization bias.
10. 1) Q-learning is good for discrete actions, but what if the action space is large or continuous? 2) Policy gradients are an alternative to Q-learning: instead of first fitting a Q function and then deriving a policy, they learn policies directly by parametrizing them and updating the parameters according to a loss function. 3) We don't want to go too deep into the math, so the final gradient of the loss w.r.t. the policy parameters looks like the equation on the slide. 4) q_pi is estimated from experience. 5) We start with some parameters and a state, take an action, collect the reward and next state, and update (interaction with the environment).
11. TO EXPLAIN: TRPO takes more optimal steps than plain policy gradients. It maximizes the expected value under the NEW policy but with the old value function; the denominator comes from importance sampling, with q(a|s) as the old policy. BUT we do it with a constraint, forming a “trust region” with the help of the KL divergence.
12–14. We can rearrange the constraint to form a loss. A_t is like Q(s,a) − V(s), but their main objective is L^CLIP; r is the ratio in the function above. On the left, the update does not get too confident, so the loss is clipped from above. On the right, if the policy becomes worse, the change is reverted, proportionately more so as the loss worsens.
15. We want to learn the optimal policy using the minimum number of interactions with the environment.
16. Furthermore, as the policy changes, a new gradient is estimated independently of past estimates. The basic idea is that if we know about the state, the variance can be lower. 1) The actor uses the critic to update itself. 2) The critic improves itself to catch up with the changing policy. These steps continue and complement each other until convergence.
17. 1) Maximum-entropy framework: a balance between exploring and collecting rewards, which requires changing the value-function definition. 2) Improves the critic by using the clipped double-Q trick.
  18. CAR EXAMPLE
19–20. 1) SAC belongs to the class of actor-critic algorithms. 2) Before SAC, the major effort to reduce sample complexity was DDPG, but it is brittle and uses deterministic policies. 3) Maximum-entropy framework: a balance between exploring and collecting rewards, which requires changing the value-function definition. 4) Improves the critic by using the clipped double-Q trick.
21–23. NOTES: Experimental setting – problem statement. State space (AE on semantically segmented images generated by CARLA); action space (speed and steer); rewards (ask Hitesh); training and testing towns; 4 test scenarios, each with several test cases; photos for everything.
24. TO EXPLAIN: the policy input could include current speed and steer; the encoder-decoder could use a stack of frames.