3. Motivation
Credits: YouTube
Credits: Prof Jeff Schneider – RI Seminar Talk
Goal – To make self-driving …
• Scalable to new domains.
• Robust to rare, long-tail events.
• Verifiable in performance through simulation.
4. Motivation
• A good policy exists!
Credits: Chen et al., "Learning by Cheating"
(https://arxiv.org/pdf/1912.12294.pdf)
5. Motivation
• A good policy exists!
• RL should in theory
outperform imitation
learning.
Credits: OpenAI Five
(Berner et al., “Dota 2 with Large Scale Deep Reinforcement Learning”)
6. Motivation
• Given a good policy, it can be
optimized further every time a
safety driver intervenes.
• RL could, in theory, outperform
human performance.
Credits: Wayve
8. Types of RL algorithms
• On-Policy Algorithms
• Use actions from the current policy to
obtain training data and update
value estimates.
• Off-Policy Algorithms
• Use actions from a separate
"behavior" policy to obtain training
data and update value estimates.
9. Brief Recap of RL
• Reward - R(s,a)
• State Value Function - V(s)
• State-Action Value Function - Q(s,a)
• Discount Factor - γ
• Tabular Q Learning
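As a concrete recap, here is a minimal tabular Q-learning sketch on a toy five-state corridor; the environment, rewards, and hyperparameters are illustrative assumptions, not anything from the talk.

```python
import random

# Tabular Q-learning on a toy corridor: states 0..4, reward 1 on
# reaching the rightmost state. All numbers here are illustrative.

N_STATES = 5
ACTIONS = (-1, +1)        # move left / move right
GAMMA = 0.9               # discount factor
ALPHA = 0.1               # learning rate
EPSILON = 0.1             # epsilon-greedy exploration rate

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Deterministic transition; reward 1 only on entering the goal."""
    s2 = max(0, min(N_STATES - 1, s + a))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def choose(s):
    """Epsilon-greedy action with random tie-breaking."""
    if random.random() < EPSILON or Q[(s, -1)] == Q[(s, +1)]:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

random.seed(0)
for _ in range(500):                      # episodes
    s, steps = 0, 0
    while s != N_STATES - 1 and steps < 200:
        steps += 1
        a = choose(s)
        s2, r = step(s, a)
        # Tabular Q-learning update: bootstrap from the greedy next value.
        target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

print(round(Q[(3, +1)], 2))   # stepping right next to the goal, near 1.0
```

The table works here because the state space is tiny; the next slide's point is exactly that this breaks down as the space grows.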
10. Deep Q-Networks (DQN)
• First to use deep neural networks for learning Q functions [1]
• Main contributions:
• Uses target networks
• Uses replay buffer
[1] Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
• Pros:
• Off-policy – sample efficient
• Cons:
• Maximization bias
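The two contributions can be shown structurally in a few lines. This sketch stands in a toy linear model for the deep Q-network, and the bandit-like "environment" is an illustrative assumption:

```python
import random
from collections import deque

# Skeleton of DQN's two main contributions: a replay buffer that breaks
# correlations between samples, and a periodically synced target network
# that keeps the bootstrap target fixed between updates.

GAMMA = 0.99

class LinearQ:
    """Toy Q(s, a) = w[a] * s for a scalar state and two actions."""
    def __init__(self):
        self.w = [0.0, 0.0]
    def q(self, s, a):
        return self.w[a] * s
    def copy_from(self, other):
        self.w = list(other.w)

online, target = LinearQ(), LinearQ()
replay = deque(maxlen=10_000)             # stores (s, a, r, s2, done)

def train_step(batch_size=4, lr=0.01):
    # Sampling uniformly from the buffer decorrelates the minibatch.
    batch = random.sample(list(replay), min(batch_size, len(replay)))
    for s, a, r, s2, done in batch:
        # The bootstrap target uses the *target* network, frozen
        # between syncs, so the regression target does not move.
        y = r if done else r + GAMMA * max(target.q(s2, b) for b in (0, 1))
        td_error = y - online.q(s, a)
        online.w[a] += lr * td_error * s  # gradient step, linear model

random.seed(0)
for t in range(200):
    s = random.uniform(0.0, 1.0)
    a = random.randint(0, 1)
    r = s if a == 1 else 0.0              # action 1 is the rewarding one
    done = random.random() < 0.3
    replay.append((s, a, r, s, done))
    train_step()
    if t % 50 == 0:
        target.copy_from(online)          # periodic target-network sync

print(online.w[1] > online.w[0])          # action 1 should look better
```

Note the `max` in the target: because the target is itself an estimate, taking the max over it is what produces the maximization bias listed under Cons.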
11. Policy Gradients
• Why policy gradients?
• Direct method to compute optimal policy
• Parametrize policies and optimize using loss functions [1]
• Advantageous in large/continuous action domains
[1] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
∇_θ J(θ) ∝ Σ_s μ(s) Σ_a Q^π(s, a) ∇_θ π(a | s, θ)
(μ: state distribution; Q^π: state-action value function; ∇_θ π: gradient of the policy function)
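A minimal instance of this estimator is REINFORCE on a two-armed bandit: a softmax policy updated along the sampled return times the score function, grad log pi. The reward values and step size are illustrative assumptions.

```python
import math, random

# REINFORCE sketch on a two-armed bandit: a softmax policy over two
# actions, updated with the score-function estimator
#     grad J = E[ G * grad log pi(a) ].

random.seed(0)
theta = [0.0, 0.0]            # one logit per action
ALPHA = 0.1                   # learning rate
MEAN_REWARD = (0.2, 1.0)      # action 1 pays more in expectation

def pi():
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [p / total for p in z]

for _ in range(2000):
    probs = pi()
    a = 0 if random.random() < probs[0] else 1
    g = MEAN_REWARD[a] + random.gauss(0.0, 0.1)   # noisy sampled return
    for i in range(2):
        # grad of log softmax: indicator(a == i) - pi(i), scaled by G.
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += ALPHA * g * grad_log

print(round(pi()[1], 2))   # probability of the better action, near 1.0
```

Because the update uses raw sampled returns, the gradient estimate is noisy; this variance is exactly what the actor-critic slides later address.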
12. Trust Region Policy Optimization
• Pros
• Introduced the idea that a large
shift in policy is bad!
• Thus, reduces sample
complexity.
• Cons
• It is an on-policy algorithm.
Schulman, John, et al. "Trust region policy optimization." International conference on
machine learning. 2015.
13. Proximal Policy Optimization (PPO)
• A_t (the advantage estimate, roughly Q(s,a) − V(s)) plays the same role as Q within the expectation.
• PPO was an improvement on
TRPO.
• We can rearrange the hard KL
constraint into the softer loss
described here.
• But, their main contribution is…
Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
14. Proximal Policy Optimization (PPO)
• The clipped loss function!
• The probability ratio is clipped
instead of enforcing a KL constraint.
• Gains from good actions are
capped, but bad actions still
incur their full penalty.
Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
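The clipped objective for a single sample fits in a few lines; the ratio and advantage values below are illustrative.

```python
# PPO's clipped surrogate for one (state, action) sample: ratio is
# pi_new(a|s) / pi_old(a|s), advantage is the A_t estimate.

EPS = 0.2   # clip range, as in the PPO paper

def ppo_clip_objective(ratio, advantage, eps=EPS):
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    # Pessimistic minimum of the two surrogates: gains from good
    # actions are capped, losses from bad actions are not.
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: improvement is capped at (1 + eps) * A.
print(ppo_clip_objective(1.5, 1.0))    # -> 1.2
# Negative advantage: the unclipped, more negative term wins.
print(ppo_clip_objective(1.5, -1.0))   # -> -1.5
```

The asymmetry is the point: the min keeps the update from getting overconfident when the ratio grows, while a worsening policy is penalized in full.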
16. Actor Critic Algorithms
• What if the gradient estimator in policy gradients has too
much variance?
• What does that mean?
• It takes too many interactions with the environment to learn the optimal policy parameters
17. Actor Critic Algorithms
• Turns out that we can control this variance using value functions.
• If we have some information about the current state, gradient estimation can
be better.
• Actor
• Policy network
• Critic
• Value function
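The actor-critic idea can be sketched by adding a learned baseline to a policy-gradient update on a two-armed bandit; rewards and step sizes are illustrative assumptions.

```python
import math, random

# One-step actor-critic sketch on a two-armed bandit. The actor is a
# softmax policy; the critic is a single state-value estimate V used
# as a baseline, which is what reduces the gradient variance.

random.seed(0)
theta = [0.0, 0.0]                 # actor: softmax logits
v = 0.0                            # critic: value of the single state
ALPHA_PI, ALPHA_V = 0.1, 0.1
MEAN_REWARD = (0.2, 1.0)

def pi():
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [p / total for p in z]

for _ in range(2000):
    probs = pi()
    a = 0 if random.random() < probs[0] else 1
    r = MEAN_REWARD[a] + random.gauss(0.0, 0.1)
    adv = r - v                    # advantage estimate from the critic
    v += ALPHA_V * adv             # critic catches up with the policy
    for i in range(2):
        # Actor steps along adv * grad log pi, not raw return * grad.
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += ALPHA_PI * adv * grad_log

print(round(pi()[1], 2), round(v, 2))
```

Centering the update around `v` means a merely average reward produces almost no update, so the two networks improve each other as described above.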
18. Soft Actor Critic
• Uses Maximum Entropy
RL framework [1]
• Uses clipped double-Q trick to
avoid maximization bias
[1] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).
Credits: BAIR
19. Soft Actor Critic
• Advantages :
• Off Policy algorithm
• Exploration is inherently handled
[1] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).
Credits: BAIR
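How SAC forms its critic target can be shown in one small function: the minimum over two target Q-values (the clipped double-Q trick against maximization bias) plus the entropy bonus. The numeric inputs below are illustrative assumptions.

```python
import math

# Soft (maximum-entropy) bootstrap target as used by SAC's critics.

GAMMA = 0.99       # discount factor
ALPHA_ENT = 0.2    # entropy temperature

def sac_target(r, q1_next, q2_next, logp_next, done=False):
    if done:
        return r
    # min() ignores whichever critic overestimates; the -alpha * logp
    # term rewards keeping the policy stochastic (max-entropy RL).
    soft_value = min(q1_next, q2_next) - ALPHA_ENT * logp_next
    return r + GAMMA * soft_value

# The inflated second critic (9.0) is ignored by the min:
print(round(sac_target(1.0, 5.0, 9.0, math.log(0.5)), 3))   # -> 6.087
```

The entropy term is also why exploration is "inherently handled": a less certain action (more negative log-probability) yields a larger target.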
22. Experimental Setting (Past Work)
• State Space
• Semantically segmented bird's-eye-view images
• An autoencoder is then trained on them.
• Waypoints!
• Action Space
• Speed – continuous, controlled using a simulated PID controller
• Steering angle – continuous
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
23. Experimental Setting (Past Work)
• Inputs include waypoint features
as route to follow
• Uses the CARLA Simulator
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
24. Experimental Setting (Past Work)
• Rewards
• (Speed Reward)
• Assuming that we are following waypoints, this is the distance to the goal
• (Deviation Penalty)
• Penalize deviation from the trajectory/waypoints
• (Collision Penalty)
• Avoid collisions. Even when a collision is unavoidable, collide at low speed
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
25. Complete Pipeline
(Past Work)
• AE Network for state
representation
• Shallow policy network
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
26. Past Work
Good Navigation in
Empty Lanes
Crashes with
stationary cars
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
• Uses PPO at the
moment
• DQN is being
tried
• We want to use
SAC for this task
27. Future Work
• Next steps -
• To focus on settings with dynamic actors.
• Improve exploration on current settings using SAC.
• Training in dense environments, possibly also through self-play RL.
28. Thank you
Hit us with questions!
We’d appreciate any useful suggestions.
Editor's notes
NOTES:
Self-driving cars today process sensor inputs through dozens of learned subsystems.
NEXT
The existing approach works well and will eventually yield fully autonomous cars, at least in specific domains. But we need a less engineering-intensive version of it.
Less engineering-heavy – tail scenarios, and transferability across cities.
Perception, Prediction, Mapping and Localization, Planning – each includes several subsystems performing various tasks.
Assuming perception is solved, we want to use RL to replace the other components.
Learning by Cheating recently achieved 100% performance on CARLA's benchmark.
This shows that a good policy exists.
CARLA
NEXT - How it works
Needs expert driving trajectories
RL has repeatedly shown itself to be capable of outperforming humans on highly complex tasks with large branching factors
Branching factor is 10^4. Chess has 35 and Go has 250.
However, transferring this performance to the real world in a noisy environment is a big challenge for RL.
This is the first car to learn self driving using reinforcement learning.
Explain Points on Screen.
Model-based RL methods can suffer from error propagation. It is difficult to fit a model to the real world, unlike in chess or Go.
So, we want model-free RL.
Two major classes, either optimize policy or value function, or have a method that uses both.
TO EXPLAIN
Off Policy vs On Policy methods
Off policy - Advantages of replay buffers in highly correlated data
Off policy methods are more sample efficient as they can reuse past experiences for training later on
Important experiences can be saved and reused later for training
In Green – Off Policy, In Purple – On Policy
As we learned in 10-601
1) Tabular Q-learning works well when the state space is finite. But as the state space grows, we need function approximation: a function that takes state and action as input and returns a Q-value.
2) We update the parameters until the Q-values of all state-action pairs are correct.
3) But the update equation assumes i.i.d. samples, while in RL tasks the states are correlated. One major contribution is the replay buffer, which breaks the correlations between samples.
4) Also, the target changes (it is non-stationary), which causes instability in learning. Target networks hold their parameters fixed and avoid the moving-target problem.
5) But one problem remains: the target is an estimate. What if the target is wrong? This leads to maximization bias.
1) Q-learning is good for discrete actions. But what if the action space is large or continuous?
2) Policy gradients are an alternative to Q-learning. Q-learning first fits a Q-function and then derives a policy; policy gradients learn the policy directly by parametrizing it and updating the parameters according to a loss function.
3) We don't want to go too deep into the math, so the final gradient of the loss w.r.t. the policy parameters looks like …
4) Q_pi is estimated from experience.
5) We start with some parameters and a state, take an action, collect the reward and next state, and update (interaction with the environment).
TO EXPLAIN
Takes better-sized steps than vanilla policy gradients.
Maximizes the expected value under the NEW policy but the old value function; the denominator comes from importance sampling.
q(a|s) is the old policy.
BUT we do it under a constraint.
This forms a "TRUST REGION" with the help of the KL divergence.
We can rearrange the constraint to form a loss.
A_t is like Q(s,a) − V(s).
But their main objective is L^CLIP.
R is the ratio in the above function.
For positive advantages (left), it doesn't get overconfident in its update: the objective is clipped from above.
For negative advantages (right), if the policy becomes worse, the penalty is not clipped, so bad updates are discouraged in full.
We want to learn the optimal policy using a minimum number of interactions with the environment.
Furthermore, as the policy changes, a new gradient is estimated independently of past estimates.
The basic idea: if we have information about the state, the variance can be reduced.
1) The actor uses the critic to update itself.
2) The critic improves itself to catch up with the changing policy.
These steps alternate and complement each other until convergence.
1) Maximum entropy framework – balances exploration against collecting rewards; requires changing the value-function definition.
2) Improves the critic by using the clipped double-Q trick.
CAR EXAMPLE
1) SAC belongs to the class of actor-critic algorithms.
2) Before SAC, the major effort to reduce sample complexity was DDPG, but it is brittle and uses deterministic policies.
3) Maximum entropy framework – balances exploration against collecting rewards; requires changing the value-function definition.
4) Improves the critic by using the clipped double-Q trick.
NOTES: Experiment Setting - Problem Statement
State Space (AE on Semantically Segmented images generated by CARLA)
Action Space (Speed and Steer)
NOTES: Experiment Setting - Problem Statement
State Space (AE on Semantically Segmented images generated by CARLA)
Action Space (Speed and Steer)
Rewards (Ask Hitesh)
Training and Testing Towns
4 Test Scenarios - Each has several test cases
Photos for everything
To Explain
The policy input could include current speed and steering.
The encoder-decoder could use a stack of frames.