1. RL for Self-Driving Cars
Under the guidance of Prof. Srikanth Krishnamurthy
Sneha Ravikumar
Dhanshri More
Shweta Srinivasan
2. Components of Self-Driving Cars
● Obstacle detection: sensors detect the presence of obstacles and update the game state
● Implemented in a Pygame environment
● The second component is an implementation of the Deep Deterministic Policy Gradient (DDPG) algorithm to control acceleration and braking in self-driving cars
● Implemented using a custom-built Gym-TORCS environment
3. Obstacle detection
● The goal here is to identify the presence of obstacles in the path of the car using reinforcement learning with a neural network.
● This is enabled with the help of sensors that track the obstacles, together with a reward-based system under which the car learns to maneuver around such obstacles and change course when obstacles are encountered.
● The car and obstacles are represented in a Pygame environment.
● Pymunk is the physics engine used by the simulation; together with Pygame it renders the environment in which the game runs and the car detects obstacles.
5. Sensors
● The sensors used in this game are a set of sonars, each of which returns a distance reading.
● There are 3 sonars in this environment: one at the center and one on either side of the center, at an angle of 45 degrees.
● Instead of a grid of boolean sensors, the sonar array returns N distance readings, one for each sonar we are simulating.
● The distance is the number of steps along the sonar arm before the first non-background reading, i.e. the first object encountered.
● In simple words, the sensor input is a reading of three distances from the car to any object it detects.
● At any given point in time, a distance of 1 indicates an obstacle in the immediate vicinity of the car.
● These sensors are updated and the distances recomputed every frame from the current state of the car, as sketched below.
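A minimal sketch of how one sonar reading could be computed each frame in the Pygame environment (the function name and constants are illustrative, not the project's actual code):

import math
import pygame
from pygame.color import THECOLORS

def get_sonar_reading(screen, x, y, angle, max_dist=40):
    # Step outward along the sonar arm one pixel-step at a time; the
    # reading is the number of steps before we leave the screen or hit
    # a non-background pixel (a wall or an obstacle).
    for dist in range(1, max_dist):
        px = int(x + dist * math.cos(angle))
        py = int(y - dist * math.sin(angle))
        off_screen = (px < 0 or py < 0 or
                      px >= screen.get_width() or py >= screen.get_height())
        if off_screen or screen.get_at((px, py)) != THECOLORS['black']:
            return dist  # a reading of 1 means an obstacle right next to the car
    return max_dist      # nothing detected within range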
6. Key components
● Game environment that renders the screen, the car and the obstacles, and manages the speed, direction and control of the car.
● Learning component of the game, where the heart of the Q-learning process resides.
● Neural network model that takes the sensor input and outputs an action for the current game state.
7. Game Environment
● The car automatically moves itself forward, faster as the game progresses. If it runs into a wall or an obstacle, the game ends.
● There are three available actions at each frame: turn left (0), turn right (1), do nothing (2).
● At every frame, the game returns both a state and a reward.
● The state is a one-dimensional array of the three sensor readings described above.
● The reward is -500 if the car runs into something, and the average of the sensor readings if it doesn't. The higher the sensor readings, the farther the car is from running into an obstacle, so the larger the reward (see the sketch below).
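A sketch of this reward rule (the helper name is illustrative):

import numpy as np

def get_reward(readings, crashed):
    # Crashing is heavily penalized; otherwise the reward is the average
    # sonar distance, so staying far from obstacles pays more.
    if crashed:
        return -500
    return int(np.mean(readings))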
8. Neural Network
● The input to the neural network is the readings from the three sensors.
● An input layer of 3 units (one per sensor).
● 2 hidden layers of 164 and 150 units; each hidden layer is followed by a dropout layer with dropout rate 0.2 to avoid overfitting.
● An output layer of 3 units, one for each of our possible actions (left, right, do nothing), in that order. A sketch of this model follows.
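A Keras sketch of the described architecture (layer sizes are from the slide; the activations and optimizer are assumptions):

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(164, activation='relu', input_shape=(3,)),  # 3 sensor inputs
    Dropout(0.2),                                     # avoid overfitting
    Dense(150, activation='relu'),
    Dropout(0.2),
    Dense(3, activation='linear'),  # Q value per action: left, right, nothing
])
model.compile(optimizer='rmsprop', loss='mse')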
9. Reasons for choosing this architecture
Google DeepMind's Atari work explains in detail the advantages of this architecture over the traditional architecture for Q-learning problems.
Traditional approach:
The input is the state and an action, and the output is the value of that single state-action pair.
DeepMind approach:
The input is just the state, and the output layer produces a separate Q value for each possible action.
In Q-learning we need max Q(S', A'), the maximum of the Q values over every action in the new state. Instead of running the network forward once per action, we run it forward just once, as below.
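Concretely, using the model sketched earlier (with new_state assumed to be the sensor array for S'), one forward pass yields Q values for all three actions at once:

import numpy as np

q_values = model.predict(new_state.reshape(1, 3))[0]  # Q for left, right, nothing
max_q = np.max(q_values)                              # max Q(S', A') in one pass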
10. Implementation
● Move the car forward one frame once the game starts.
● Get the sensor readings.
● Based on the readings, predict Q values expressing the car's confidence in each of the three actions.
● Use an epsilon-greedy methodology, exploring with a random action 10% of the time.
● Execute the action and get the new sensor reading and the reward.
● Store the original reading, the action we took, the reward and the new reading in a replay buffer.
● Randomly sample the buffer to generate the training data fed to the neural network.
● Set the target y for the iteration to the prediction based on the original reading.
● Make a new prediction based on the new reading.
● Observe the reward: -500 indicates a crash, so set y for the chosen action to -500. Otherwise, set it to the reward plus the predicted maximum Q value of the new reading multiplied by gamma, as in the sketch below.
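A sketch of this replay-buffer update as a single training step (gamma, the batch size and the buffer layout are assumptions):

import random
import numpy as np

GAMMA = 0.9   # discount factor (assumed value)
BATCH = 40    # minibatch size (assumed value)

def train_step(model, replay_buffer):
    minibatch = random.sample(replay_buffer, BATCH)
    X, y = [], []
    for old_state, action, reward, new_state in minibatch:
        target = model.predict(old_state.reshape(1, 3))[0]  # y from old reading
        new_q = model.predict(new_state.reshape(1, 3))[0]   # prediction for new reading
        if reward == -500:                    # crash: terminal target
            target[action] = reward
        else:                                 # otherwise bootstrap with gamma * maxQ
            target[action] = reward + GAMMA * np.max(new_q)
        X.append(old_state)
        y.append(target)
    model.fit(np.array(X), np.array(y), batch_size=BATCH, verbose=0)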
17. TORCS
TORCS, The Open Racing Car Simulator, is a highly portable, multi-platform car racing simulation. It is used as an ordinary car racing game, as an AI racing game and as a research platform. It runs on Linux (x86, AMD64 and PPC), FreeBSD, OpenSolaris, MacOSX and Windows.
Why TORCS?
You can visualize how the neural network learns over time and inspect its learning process, rather than just looking at the final result.
TORCS helps us simulate and understand machine learning techniques for automated driving, which is important for self-driving car technologies.
18. Gym Torcs
OpenAI Gym is a toolkit for building reinforcement learning (RL) algorithms.
Gym does not ship with an environment for TORCS, so the process starts with building the environment and defining the rewards, and then training the agent through reinforcement learning.
There are three pieces needed to have this agent running:
A server for TORCS
A client for TORCS
An environment, built like the Gym environments, that gives the observations and rewards based on the agent's state.
19. Server and Client
vtorcs
This is an all-in-one package of TORCS. The link below gives a complete overview of how it can be installed and set up on a Linux machine:
https://github.com/giuse/vtorcs
It captures the various sensor information that can be used to train the agent once we build the environment.
SnakeOil
SnakeOil is a Python library for interfacing with the TORCS race car simulator. Using it is as simple as creating the client and implementing a custom drive function, as in the sketch below. The drive function involves only the mechanics of driving the car, not its implementation.
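A minimal client in the style of the stock snakeoil.py demo (the drive logic here is illustrative; the port and attribute names follow the standard snakeoil script):

from snakeoil import Client

def drive(c):
    S, R = c.S.d, c.R.d                 # sensor dict and response dict
    R['steer'] = S['angle'] / 0.785     # steer back toward the track axis
    R['accel'] = 0.2 if S['speedX'] < 50 else 0.0
    R['gear'] = 1

C = Client(p=3001)                      # port the TORCS server listens on
for step in range(C.maxSteps, 0, -1):
    C.get_servers_input()               # receive sensor data from the server
    drive(C)                            # fill in the actuator values
    C.respond_to_server()               # send them back
C.shutdown()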
20. More about SnakeOil
These client objects contain a member dictionary "d" (for data dictionary) which contains key-value pairs based on the server's syntax.
We can read the following:
angle, curLapTime, damage, distFromStart, distRaced, focus, fuel, gear, lastLapTime, opponents, racePos, rpm, speedX, speedY, speedZ, track, trackPos, wheelSpinVel
We can set the following:
accel, brake, clutch, gear, steer, focus, meta
https://www.youtube.com/watch?v=Bg4t16TVXew
21. Defining Step() function for Gym Environment
The environment's step() function returns exactly the four values we need:
Observation
Reward
Done
Info
We have written functions to map the dictionary of values we get from the client to the Gym environment, roughly as sketched below.
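A sketch of how that mapping could look as the environment's step() method (make_observation and compute_reward are our hypothetical helpers wrapping the SnakeOil client):

def step(self, action):
    # Push the agent's action to the SnakeOil client and advance one tick
    self.client.R.d['steer'] = action[0]
    self.client.R.d['accel'] = action[1]
    self.client.R.d['brake'] = action[2]
    self.client.respond_to_server()
    self.client.get_servers_input()

    obs = self.make_observation(self.client.S.d)   # sensor dict -> observation
    reward = self.compute_reward(self.client.S.d)  # sensor dict -> scalar reward
    done = self.client.S.d['damage'] > 0           # e.g. terminate on damage
    return obs, reward, done, {}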
22. Design of the rewards
With a naive reward (e.g. raw speed), the AI accelerates very hard (to get maximum reward), hits the edge, and the episode terminates very quickly; the neural network gets stuck in a very poor local minimum.
Instead, we want to maximize longitudinal velocity, minimize transverse velocity, and also penalize the AI if it constantly drives far off the center of the track (see the sketch below).
We found the new reward function greatly improves the stability and the learning time in TORCS.
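A reward of this shape, as popularized by the well-known gym_torcs DDPG tutorial that this design follows (the exact weighting is an assumption), looks like:

import numpy as np

def compute_reward(S):
    Vx = S['speedX']          # longitudinal speed
    angle = S['angle']        # angle between car heading and track axis
    trackPos = S['trackPos']  # normalized distance from track center
    # Reward speed along the track; penalize transverse speed and off-center driving
    return Vx * np.cos(angle) - np.abs(Vx * np.sin(angle)) - Vx * np.abs(trackPos)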
25. Choice of algorithm:
DQN solves problems with high-dimensional observation spaces, but it can only handle discrete, low-dimensional action spaces, so it cannot be straightforwardly applied to continuous domains.
An obvious way to adapt DQN to continuous domains is to simply discretize the action space, but this results in an explosion of dimensionality. For example, if the steering wheel angle from -90 to +90 degrees is discretized in 5-degree steps and the speed from 0 km/h to 200 km/h in 5 km/h steps, the output has 36 steering states times 40 velocity states, i.e. 1440 possible combinations.
Solution: Google DeepMind innovated a new algorithm to tackle the continuous action space problem, the DDPG algorithm.
26. DDPG algorithm:
Google DeepMind developed this new algorithm to tackle the continuous action space problem by combining 3 techniques:
1. Deterministic Policy-Gradient Algorithms
2. Actor-Critic Methods
3. Deep Q-Network
DDPG is a policy gradient algorithm that uses a stochastic behavior policy for good exploration but estimates a deterministic target policy, which is much easier to learn.
Policy gradient algorithms utilize a form of policy iteration: they evaluate the policy, and then follow the policy gradient to maximize performance.
27. Self driving reinforcement learning in Torcs
The code receives the sensor input as an array from the gym_torcs environment.
Input: the network takes the state of the game, i.e. the speed along the X and Y axes, the angle between the car and the track, the position of the car, and so on, as explained earlier.
The sensor input is fed into our neural network, and the network outputs 3 real numbers: the values of the steering, acceleration and brake.
The network is trained many times, via the Deep Deterministic Policy Gradient, to maximize the future expected reward.
Output: the action, such as steer left or right, hit the gas pedal or hit the brake.
(Demo: self-driving in the TORCS environment.)
28. Policy objective function:
A reinforcement learning technique can be used to find the policy πθ(s,a), the probability of taking action a in state s under parameters θ.
The total discounted future reward is
R = r₁ + γr₂ + γ²r₃ + ⋯
An intuitive policy objective function is the expectation of this total discounted reward:
L(θ) = E[ r₁ + γr₂ + γ²r₃ + ⋯ | πθ(s,a) ] = E_{x∼p(x∣θ)}[ R ]
where the expectation of the total reward R is calculated under the probability distribution p(x∣θ).
29. Actor-Critic Algorithm
The Actor-Critic Algorithm is essentially a hybrid method that combines the policy gradient method and the value function method.
Actor: policy function
Critic: value function
Essentially, the actor produces the action given the current state of the environment s, while the critic produces a signal that criticizes the actions made by the actor.
30. Actor Network
We used 2 hidden layers with 300 and 600 hidden units respectively.
The output consists of 3 continuous actions:
1. Steering: a single unit with tanh activation function (where -1 means max right turn and +1 means max left turn)
2. Acceleration: a single unit with sigmoid activation function (where 0 means no gas, 1 means full gas)
3. Brake: another single unit with sigmoid activation function (where 0 means no brake, 1 means full brake)
A Keras sketch of this actor follows.
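The layer sizes below are from the slide; the 29-dimensional TORCS state vector and the wiring are assumptions, not the project's verbatim code:

from keras.layers import Input, Dense, concatenate
from keras.models import Model

state = Input(shape=(29,))                        # TORCS sensor vector (assumed size)
h = Dense(300, activation='relu')(state)
h = Dense(600, activation='relu')(h)
steering = Dense(1, activation='tanh')(h)         # in [-1, 1]
acceleration = Dense(1, activation='sigmoid')(h)  # in [0, 1]
brake = Dense(1, activation='sigmoid')(h)         # in [0, 1]
actor = Model(inputs=state, outputs=concatenate([steering, acceleration, brake]))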
31. Critic Network
The critic network takes both the state and the action as inputs.
Following the DDPG paper, the actions are not included until the 2nd hidden layer of the Q-network, as in the sketch below.
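A Keras sketch of the critic, with the action entering at the 2nd hidden layer (the layer sizes mirror the actor and are assumptions):

from keras.layers import Input, Dense, add
from keras.models import Model

state = Input(shape=(29,))
action = Input(shape=(3,))
s = Dense(300, activation='relu')(state)
s = Dense(600, activation='linear')(s)
a = Dense(600, activation='linear')(action)  # action joins at the 2nd hidden layer
h = Dense(600, activation='relu')(add([s, a]))
q_value = Dense(1, activation='linear')(h)   # scalar Q(s, a)
critic = Model(inputs=[state, action], outputs=q_value)
critic.compile(optimizer='adam', loss='mse')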
32. Target Network
1. Directly implementing Q-learning with neural networks is unstable in environments like TORCS. DQN is able to learn value functions with such function approximators in a stable and robust way due to two innovations: the network is trained off-policy with samples from a replay buffer to minimize correlations between samples, and it is trained against a target Q-network to give consistent targets during temporal-difference backups.
2. The DeepMind team's solution to the instability is a target network: we created a copy of the actor and critic networks respectively, which are used for calculating the target values. The weights of these target networks are then updated by having them slowly track the learned networks:
θ′ ← τθ + (1 − τ)θ′, where τ ≪ 1, here 0.0001.
This means that the target values are constrained to change slowly, greatly improving the stability of learning. A sketch of this soft update follows.
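A sketch of that soft update for a pair of Keras models (τ is from the slide):

TAU = 0.0001

def soft_update(target, source, tau=TAU):
    # theta_target <- tau * theta_source + (1 - tau) * theta_target
    target.set_weights([tau * w + (1.0 - tau) * tw
                        for w, tw in zip(source.get_weights(),
                                         target.get_weights())])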
33. Policy
Now we can feed the inputs above into the neural network:
for j in range(max_steps):
    # The actor maps the current state s_t to [steer, accel, brake]
    a_t = actor.model.predict(s_t.reshape(1, s_t.shape[0]))
    ob, r_t, done, info = env.step(a_t[0])
35. Design of the exploration algorithm
In RL problems like Pac-Man or Atari Breakout, the usual exploration strategy is an ϵ-greedy policy, where the agent tries a random action some percentage of the time.
That approach does not work very well in TORCS, because we have 3 continuous actions (steering, acceleration, brake). If we just choose actions from a uniform random distribution, we generate combinations such as a brake value greater than the acceleration value, and the car simply does not move.
Therefore, we add noise using the Ornstein-Uhlenbeck process to do the exploration. The Ornstein-Uhlenbeck process is a stochastic process with mean-reverting properties (see the sketch below).
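A sketch of Ornstein-Uhlenbeck noise (the θ, σ and per-action means are assumptions; mean reversion keeps exploration centered on sensible values):

import numpy as np

def ou_noise(x, mu, theta, sigma):
    # Discretized OU process: drift back toward mu, plus Gaussian noise
    return theta * (mu - x) + sigma * np.random.randn()

# e.g. center the acceleration noise on 0.5 so the car keeps moving:
# a_t[1] += epsilon * ou_noise(a_t[1], mu=0.5, theta=1.0, sigma=0.1)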
36. Braking Mechanism
Training the AI to brake is much harder than steering or acceleration, because:
1. Braking decreases the reward.
2. The exploration phase can produce brake and acceleration at the same time.
3. The chances of getting stuck in a local minimum are higher.
Stochastic brake: apply the brake exploration noise only occasionally, which allows the AI agent to accelerate very fast on a straight line and brake properly before the turn. This driving behavior is much closer to a human's (see the sketch below).
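A sketch of the stochastic brake, reusing ou_noise from above (the 10% figure and the noise parameters are assumptions):

import random

if random.random() < 0.10:  # apply brake exploration noise only occasionally
    a_t[2] += epsilon * ou_noise(a_t[2], mu=0.2, theta=1.0, sigma=0.1)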
37. Training
We first update the critic by minimizing the loss
L = (1/N) Σ_i ( y_i − Q(s_i, a_i | θ^Q) )², with y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1})).
Then the actor policy is updated using the sampled policy gradient
∇_θ J ≈ (1/N) Σ_i ∇_a Q(s, a)|_{s=s_i, a=μ(s_i)} ∇_θ μ(s)|_{s_i}.
Then we update the target networks with the soft update shown earlier. A sketch of one training step follows.
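A sketch of one such update (sample_batch, the replay buffer and the target models are hypothetical names following the structure above; GAMMA, BATCH and soft_update are as defined earlier):

import numpy as np

# Sample a minibatch: arrays of states, actions, rewards, next states, done flags
states, actions, rewards, new_states, dones = sample_batch(replay_buffer, BATCH)

# 1. Critic: regress Q(s, a) toward y = r + gamma * Q'(s', mu'(s'))
target_q = critic_target.predict([new_states, actor_target.predict(new_states)])
y = rewards + GAMMA * target_q.flatten() * (1.0 - dones)
critic.train_on_batch([states, actions], y)

# 2. Actor: follow the sampled policy gradient grad_a Q(s, a) * grad_theta mu(s)
#    (computed with the framework's gradient machinery, omitted in this sketch)

# 3. Let the target networks slowly track the learned networks
soft_update(actor_target, actor)
soft_update(critic_target, critic)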
38. Conclusion
We implemented obstacle detection using reinforcement learning in a Pygame environment.
We implemented the Deep Deterministic Policy Gradient algorithm to train a car to drive in the TORCS environment.
We used the Ornstein-Uhlenbeck process to perform the exploration. This helped to stabilize the policy in a continuous domain like driving a vehicle.
39. Future Work:
Build all the modules needed to deploy the RL model.
We have successfully built the obstacle detection module. We aim to build a module to detect the exact position of the car on the road, and road-edge detection to get the angle between the car and the road.