3. Motivation
Credits: YouTube
Credits: Prof Jeff Schneider – RI Seminar Talk
Goal – To make self-driving …
• Scalable to new domains.
• Robust to rare, long-tail events.
• Verifiable in performance through simulation.
4. Motivation
• A good policy exists!
Credits: Chen et al., "Learning by Cheating"
(https://arxiv.org/pdf/1912.12294.pdf)
5. Motivation
• A good policy exists!
• RL should in theory
outperform imitation
learning.
Credits: OpenAI Five
(Berner et al., “Dota 2 with Large Scale Deep Reinforcement Learning”)
6. Motivation
• Given a good policy, it can be
optimized further every time a
safety driver intervenes.
• RL could, in theory, outperform
human performance.
Credits: Wayve
8. Types of RL algorithms
• On-Policy Algorithms
• Use actions from the current policy to
obtain training data and update
value estimates.
• Off-Policy Algorithms
• Use actions from a separate
"behavior" policy to obtain training
data and update value estimates.
9. Brief Recap of RL
• Reward - R(s,a)
• State Value Function - V(s)
• State-Action Value Function - Q(s,a)
• Discount Factor - γ
• Tabular Q Learning
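As a concrete recap, here is a minimal tabular Q-learning sketch on a toy five-state corridor; the environment, rewards, and hyperparameters are illustrative assumptions, not anything from the talk.

```python
import random

# Tabular Q-learning on a toy corridor: states 0..4, reward 1 on
# reaching the rightmost state. All numbers here are illustrative.

N_STATES = 5
ACTIONS = (-1, +1)        # move left / move right
GAMMA = 0.9               # discount factor
ALPHA = 0.1               # learning rate
EPSILON = 0.1             # epsilon-greedy exploration rate

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Deterministic transition; reward 1 only on entering the goal."""
    s2 = max(0, min(N_STATES - 1, s + a))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def choose(s):
    """Epsilon-greedy action with random tie-breaking."""
    if random.random() < EPSILON or Q[(s, -1)] == Q[(s, +1)]:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

random.seed(0)
for _ in range(500):                      # episodes
    s, steps = 0, 0
    while s != N_STATES - 1 and steps < 200:
        steps += 1
        a = choose(s)
        s2, r = step(s, a)
        # Tabular Q-learning update: bootstrap from the greedy next value.
        target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

print(round(Q[(3, +1)], 2))   # stepping right next to the goal, near 1.0
```

The table works here because the state space is tiny; the next slide's point is exactly that this breaks down as the space grows.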
10. Deep Q-Networks (DQN)
• First to use deep neural networks for learning Q functions [1]
• Main contributions:
• Uses target networks
• Uses replay buffer
[1] Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
• Pros:
• Off-policy – sample efficient
• Cons:
• Maximization bias
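The two contributions can be shown structurally in a few lines. This sketch stands in a toy linear model for the deep Q-network, and the bandit-like "environment" is an illustrative assumption:

```python
import random
from collections import deque

# Skeleton of DQN's two main contributions: a replay buffer that breaks
# correlations between samples, and a periodically synced target network
# that keeps the bootstrap target fixed between updates.

GAMMA = 0.99

class LinearQ:
    """Toy Q(s, a) = w[a] * s for a scalar state and two actions."""
    def __init__(self):
        self.w = [0.0, 0.0]
    def q(self, s, a):
        return self.w[a] * s
    def copy_from(self, other):
        self.w = list(other.w)

online, target = LinearQ(), LinearQ()
replay = deque(maxlen=10_000)             # stores (s, a, r, s2, done)

def train_step(batch_size=4, lr=0.01):
    # Sampling uniformly from the buffer decorrelates the minibatch.
    batch = random.sample(list(replay), min(batch_size, len(replay)))
    for s, a, r, s2, done in batch:
        # The bootstrap target uses the *target* network, frozen
        # between syncs, so the regression target does not move.
        y = r if done else r + GAMMA * max(target.q(s2, b) for b in (0, 1))
        td_error = y - online.q(s, a)
        online.w[a] += lr * td_error * s  # gradient step, linear model

random.seed(0)
for t in range(200):
    s = random.uniform(0.0, 1.0)
    a = random.randint(0, 1)
    r = s if a == 1 else 0.0              # action 1 is the rewarding one
    done = random.random() < 0.3
    replay.append((s, a, r, s, done))
    train_step()
    if t % 50 == 0:
        target.copy_from(online)          # periodic target-network sync

print(online.w[1] > online.w[0])          # action 1 should look better
```

Note the `max` in the target: because the target is itself an estimate, taking the max over it is what produces the maximization bias listed under Cons.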
11. Policy Gradients
• Why policy gradients?
• Direct method to compute optimal policy
• Parametrize policies and optimize using loss functions [1]
• Advantageous in large/continuous action domains
[1] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
∇_θ J(θ) ∝ Σ_s μ(s) Σ_a Q^π(s, a) ∇_θ π(a | s, θ)
(μ: state distribution; Q^π: state-action value function; ∇_θ π: gradient of the policy function)
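A minimal instance of this estimator is REINFORCE on a two-armed bandit: a softmax policy updated along the sampled return times the score function, grad log pi. The reward values and step size are illustrative assumptions.

```python
import math, random

# REINFORCE sketch on a two-armed bandit: a softmax policy over two
# actions, updated with the score-function estimator
#     grad J = E[ G * grad log pi(a) ].

random.seed(0)
theta = [0.0, 0.0]            # one logit per action
ALPHA = 0.1                   # learning rate
MEAN_REWARD = (0.2, 1.0)      # action 1 pays more in expectation

def pi():
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [p / total for p in z]

for _ in range(2000):
    probs = pi()
    a = 0 if random.random() < probs[0] else 1
    g = MEAN_REWARD[a] + random.gauss(0.0, 0.1)   # noisy sampled return
    for i in range(2):
        # grad of log softmax: indicator(a == i) - pi(i), scaled by G.
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += ALPHA * g * grad_log

print(round(pi()[1], 2))   # probability of the better action, near 1.0
```

Because the update uses raw sampled returns, the gradient estimate is noisy; this variance is exactly what the actor-critic slides later address.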
12. Trust Region Policy Optimization
• Pros
• Introduced the idea that a large
shift in policy is bad!
• Thus, reduces sample
complexity.
• Cons
• It is an on-policy algorithm.
Schulman, John, et al. "Trust region policy optimization." International conference on
machine learning. 2015.
13. Proximal Policy Optimization (PPO)
• A_t (the advantage estimate, roughly Q(s,a) − V(s)) plays the same role as Q within the expectation.
• PPO was an improvement on
TRPO.
• We can rearrange the hard KL
constraint into the softer loss
described here.
• But, their main contribution is…
Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
14. Proximal Policy Optimization (PPO)
• The clipped loss function!
• The probability ratio is clipped
instead of enforcing a KL constraint.
• Gains from good actions are
capped, but bad actions still
incur their full penalty.
Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
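The clipped objective for a single sample fits in a few lines; the ratio and advantage values below are illustrative.

```python
# PPO's clipped surrogate for one (state, action) sample: ratio is
# pi_new(a|s) / pi_old(a|s), advantage is the A_t estimate.

EPS = 0.2   # clip range, as in the PPO paper

def ppo_clip_objective(ratio, advantage, eps=EPS):
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    # Pessimistic minimum of the two surrogates: gains from good
    # actions are capped, losses from bad actions are not.
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: improvement is capped at (1 + eps) * A.
print(ppo_clip_objective(1.5, 1.0))    # -> 1.2
# Negative advantage: the unclipped, more negative term wins.
print(ppo_clip_objective(1.5, -1.0))   # -> -1.5
```

The asymmetry is the point: the min keeps the update from getting overconfident when the ratio grows, while a worsening policy is penalized in full.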
16. Actor Critic Algorithms
• What if the gradient estimator in policy gradients has too
much variance?
• What does that mean?
• It takes too many interactions with the environment to learn the optimal policy parameters
17. Actor Critic Algorithms
• Turns out that we can control this variance using value functions.
• If we have some information about the current state, gradient estimation can
be better.
• Actor
• Policy network
• Critic
• Value function
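The actor-critic idea can be sketched by adding a learned baseline to a policy-gradient update on a two-armed bandit; rewards and step sizes are illustrative assumptions.

```python
import math, random

# One-step actor-critic sketch on a two-armed bandit. The actor is a
# softmax policy; the critic is a single state-value estimate V used
# as a baseline, which is what reduces the gradient variance.

random.seed(0)
theta = [0.0, 0.0]                 # actor: softmax logits
v = 0.0                            # critic: value of the single state
ALPHA_PI, ALPHA_V = 0.1, 0.1
MEAN_REWARD = (0.2, 1.0)

def pi():
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [p / total for p in z]

for _ in range(2000):
    probs = pi()
    a = 0 if random.random() < probs[0] else 1
    r = MEAN_REWARD[a] + random.gauss(0.0, 0.1)
    adv = r - v                    # advantage estimate from the critic
    v += ALPHA_V * adv             # critic catches up with the policy
    for i in range(2):
        # Actor steps along adv * grad log pi, not raw return * grad.
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += ALPHA_PI * adv * grad_log

print(round(pi()[1], 2), round(v, 2))
```

Centering the update around `v` means a merely average reward produces almost no update, so the two networks improve each other as described above.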
18. Soft Actor Critic
• Uses Maximum Entropy
RL framework [1]
• Uses clipped double-Q trick to
avoid maximization bias
[1] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).
Credits: BAIR
19. Soft Actor Critic
• Advantages :
• Off Policy algorithm
• Exploration is inherently handled
[1] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).
Credits: BAIR
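How SAC forms its critic target can be shown in one small function: the minimum over two target Q-values (the clipped double-Q trick against maximization bias) plus the entropy bonus. The numeric inputs below are illustrative assumptions.

```python
import math

# Soft (maximum-entropy) bootstrap target as used by SAC's critics.

GAMMA = 0.99       # discount factor
ALPHA_ENT = 0.2    # entropy temperature

def sac_target(r, q1_next, q2_next, logp_next, done=False):
    if done:
        return r
    # min() ignores whichever critic overestimates; the -alpha * logp
    # term rewards keeping the policy stochastic (max-entropy RL).
    soft_value = min(q1_next, q2_next) - ALPHA_ENT * logp_next
    return r + GAMMA * soft_value

# The inflated second critic (9.0) is ignored by the min:
print(round(sac_target(1.0, 5.0, 9.0, math.log(0.5)), 3))   # -> 6.087
```

The entropy term is also why exploration is "inherently handled": a less certain action (more negative log-probability) yields a larger target.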
22. Experimental Setting (Past Work)
• State Space
• Semantically segmented bird's-eye-view images
• An autoencoder is then trained on them.
• Waypoints!
• Action Space
• Speed – continuous, controlled using a simulated PID controller
• Steering angle – continuous
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
23. Experimental Setting (Past Work)
• Inputs include waypoint features
as route to follow
• Uses the CARLA Simulator
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
24. Experimental Setting (Past Work)
• Rewards
• (Speed Reward)
• Assuming that we are following waypoints, this is the distance to the goal
• (Deviation Penalty)
• Penalize deviation from the trajectory/waypoints
• (Collision Penalty)
• Avoid collisions. Even when a collision is unavoidable, collide at low speed
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
25. Complete Pipeline
(Past Work)
• AE Network for state
representation
• Shallow policy network
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
26. Past Work
Good Navigation in
Empty Lanes
Crashes with
stationary cars
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
• Uses PPO at the
moment
• DQN is being
tried
• We want to use
SAC for this task
27. Future Work
• Next steps -
• To focus on settings with dynamic actors.
• Improve exploration on current settings using SAC.
• Training in dense environments, possibly also through self-play RL.
28. Thank you
Hit us with questions!
We’d appreciate any useful suggestions.
Editor's notes
NOTES:
Self-driving cars today process sensor inputs through dozens of learned subsystems.
NEXT
The existing approach works well and will eventually yield fully autonomous cars, at least in specific domains. But we need a less engineering-intensive version of it.
Less engineering-heavy – tail scenarios, and transferability across cities.
Perception, Prediction, Mapping and Localization, Planning – each includes several subsystems performing various tasks.
Assuming perception is solved, we want to use RL to replace the other components.
Learning by Cheating recently achieved 100% performance on CARLA's benchmark.
This shows that a good policy exists.
CARLA
NEXT - How it works
Needs expert driving trajectories
RL has repeatedly shown itself to be capable of outperforming humans on highly complex tasks with large branching factors
Branching factor is 10^4. Chess has 35 and Go has 250.
However, transferring this performance to the real world in a noisy environment is a big challenge for RL.
This is the first car to learn self driving using reinforcement learning.
Explain Points on Screen.
Model-based RL methods can suffer from error propagation. It is difficult to fit a model to the real world, unlike in chess or Go.
So, we want model-free RL.
Two major classes, either optimize policy or value function, or have a method that uses both.
TO EXPLAIN
Off Policy vs On Policy methods
Off policy - Advantages of replay buffers in highly correlated data
Off policy methods are more sample efficient as they can reuse past experiences for training later on
Important experiences can be saved and reused later for training
In Green – Off Policy, In Purple – On Policy
As we learned in 10-601
1) Tabular Q-learning works well when the state space is finite. But as the state space grows, we need function approximation: a function that takes state and action as input and returns a Q-value.
2) We update the parameters until the Q-values of all state-action pairs are correct.
3) But the update equation assumes i.i.d. samples, while in RL tasks the states are correlated. One major contribution is the replay buffer, which breaks the correlations between samples.
4) Also, the target changes (it is non-stationary), which causes instability in learning. Target networks hold their parameters fixed and avoid the moving-target problem.
5) But one problem remains: the target is an estimate. What if the target is wrong? This leads to maximization bias.
1) Q-learning is good for discrete actions. But what if the action space is large or continuous?
2) Policy gradients are an alternative to Q-learning. Q-learning first fits a Q-function and then derives a policy; policy gradients learn the policy directly by parametrizing it and updating the parameters according to a loss function.
3) We don't want to go too deep into the math, so the final gradient of the loss w.r.t. the policy parameters looks like …
4) Q_pi is estimated from experience.
5) We start with some parameters and a state, take an action, collect the reward and next state, and update (interaction with the environment).
TO EXPLAIN
Takes better-sized steps than vanilla policy gradients.
Maximizes the expected value under the NEW policy but the old value function; the denominator comes from importance sampling.
q(a|s) is the old policy.
BUT we do it under a constraint.
This forms a "TRUST REGION" with the help of the KL divergence.
We can rearrange the constraint to form a loss.
A_t is like Q(s,a) − V(s).
But their main objective is L^CLIP.
R is the ratio in the above function.
For positive advantages (left), it doesn't get overconfident in its update: the objective is clipped from above.
For negative advantages (right), if the policy becomes worse, the penalty is not clipped, so bad updates are discouraged in full.
We want to learn the optimal policy using a minimum number of interactions with the environment.
Furthermore, as the policy changes, a new gradient is estimated independently of past estimates.
The basic idea: if we have information about the state, the variance can be reduced.
1) The actor uses the critic to update itself.
2) The critic improves itself to catch up with the changing policy.
These steps alternate and complement each other until convergence.
1) Maximum entropy framework – balances exploration against collecting rewards; requires changing the value-function definition.
2) Improves the critic by using the clipped double-Q trick.
CAR EXAMPLE
1) SAC belongs to the class of actor-critic algorithms.
2) Before SAC, the major effort to reduce sample complexity was DDPG, but it is brittle and uses deterministic policies.
3) Maximum entropy framework – balances exploration against collecting rewards; requires changing the value-function definition.
4) Improves the critic by using the clipped double-Q trick.
NOTES: Experiment Setting - Problem Statement
State Space (AE on Semantically Segmented images generated by CARLA)
Action Space (Speed and Steer)
NOTES: Experiment Setting - Problem Statement
State Space (AE on Semantically Segmented images generated by CARLA)
Action Space (Speed and Steer)
Rewards (Ask Hitesh)
Training and Testing Towns
4 Test Scenarios - Each has several test cases
Photos for everything
To Explain
The policy input could include current speed and steering.
The encoder-decoder could use a stack of frames.