Ben Lau, Quantitative Researcher, Hobbyist, at MLconf NYC 2017

Deep Reinforcement Learning
Using deep learning to play self-driving car games
Ben Lau
Ben Lau - Deep Learning and Reinforcement
MLConf 2017, New York City

What is Reinforcement Learning?
3 classes of learning
Supervised Learning
 Labeled data
 Direct feedback
Unsupervised Learning
 No labeled data
 No feedback
 Find hidden structure
Reinforcement Learning
 Using reward as feedback
 Learn a series of actions
 Trial and error

RL: Agent and Environment
[Diagram: the Agent sends action $A_t$ to the Environment; the Environment returns observation $O_t$ and reward $R_t$ to the Agent]
At each step t, the Agent
• Receives observation $O_t$
• Executes action $A_t$
• Receives reward $R_t$
and the Environment
• Receives action $A_t$
• Sends observation $O_{t+1}$
• Sends reward $R_{t+1}$
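
The loop above can be sketched in a few lines of Python. This is only an illustration: `CoinFlipEnvironment`, `RandomAgent` and `run_episode` are hypothetical names invented here, not part of the talk or of any library.

```python
import random


class CoinFlipEnvironment:
    """Hypothetical toy environment: the agent guesses a hidden coin flip."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.coin = self.rng.randint(0, 1)

    def step(self, action):
        # Receive action A_t, then send observation O_{t+1} and reward R_{t+1}.
        reward = 1.0 if action == self.coin else 0.0
        observation = self.coin  # reveal the coin that was just guessed
        self.coin = self.rng.randint(0, 1)  # flip a new hidden coin
        return observation, reward


class RandomAgent:
    """Hypothetical agent that ignores its observation and acts at random."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, observation):
        return self.rng.randint(0, 1)


def run_episode(env, agent, num_steps=10):
    observation, total_reward, history = None, 0.0, []
    for _ in range(num_steps):
        action = agent.act(observation)         # agent executes A_t
        observation, reward = env.step(action)  # environment answers
        total_reward += reward
        history.append((observation, reward, action))
    return total_reward, history
```

The `history` list is the raw experience (observations, rewards, actions) that the next slide summarizes into a state.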

RL: State
Experience is a sequence of observations, rewards, and actions
$o_1, r_1, a_1, \ldots, o_{t-1}, r_{t-1}, a_{t-1}, o_t, r_t, a_t$
The state is a summary of experience
$s_t = f(o_1, r_1, a_1, \ldots, o_{t-1}, r_{t-1}, a_{t-1}, o_t, r_t, a_t)$
Note: not every state is fully observable
[Figures: a fully observable state vs. a not fully observable state]

Approaches to Reinforcement Learning
Value-Based RL
 Estimate the optimal value function 𝑄∗(𝑠, 𝑎)
 This is the maximum value achievable under any policy
Policy-Based RL
 Search directly for the optimal policy 𝜋∗
 This is the policy achieving maximum future reward
Model-based RL
 Build a model of the environment
 Plan (e.g. by lookahead) using model

Deep Learning + RL → AI
[Diagram: game input → deep convolutional network → outputs (Steer, Gas Pedal, Brake), with reward as feedback]

Policies
A deterministic policy is the agent’s behavior
 It is a map from state to action:
 $a_t = \pi(s_t)$
In Reinforcement Learning, the agent’s goal is to
choose each action such that it maximizes the sum
of future rewards
Choose $a_t$ to maximize $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
$\gamma \in [0, 1]$ is a discount factor: rewards further in the future are less certain, so they count for less
State (s)        Action (a)
Obstacle         Brake
Corner           Left/Right
Straight line    Acceleration
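
A minimal sketch of the two ideas on this slide, assuming the policy is stored as a plain lookup table and the rewards of one episode are already known. The names `POLICY`, `act` and `discounted_return` are illustrative only.

```python
# Deterministic policy as a map from state to action (the table above).
POLICY = {
    "obstacle": "brake",
    "corner": "left/right",
    "straight line": "acceleration",
}


def act(state):
    """a_t = pi(s_t): look up the action for the current state."""
    return POLICY[state]


def discounted_return(future_rewards, gamma):
    """R_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...

    `future_rewards` is the list [r_{t+1}, r_{t+2}, ...].
    """
    total = 0.0
    for reward in reversed(future_rewards):
        total = reward + gamma * total
    return total
```

For example, `discounted_return([1.0, 1.0, 1.0], 0.5)` evaluates to 1 + 0.5 + 0.25 = 1.75.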

Approaches to Reinforcement Learning
Value-Based RL
 Estimate the optimal value function 𝑄∗(𝑠, 𝑎)
 This is the maximum value achievable under any policy

Value Function
 A value function is a prediction of future reward
 How much reward will I get from action a in state s?
 A Q-value function gives the expected total reward
 From state-action pair (s, a)
 Under policy $\pi$
 With discount factor $\gamma$
$Q^\pi(s, a) = E[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s, a]$
 An optimal value function is the maximum achievable value
$Q^*(s, a) = \max_\pi Q^\pi(s, a)$
 Once we have $Q^*$ we can act optimally
$\pi^*(s) = \arg\max_a Q^*(s, a)$
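
Acting optimally from a known $Q^*$ is just an argmax over actions. The sketch below assumes a small tabular Q-function; the values in `Q_STAR` are made up for illustration and do not come from the talk.

```python
ACTIONS = ("brake", "steer", "accelerate")

# Hypothetical Q*(s, a) values, invented for illustration only.
Q_STAR = {
    ("obstacle", "brake"): 5.0,
    ("obstacle", "steer"): 2.0,
    ("obstacle", "accelerate"): -10.0,
    ("straight line", "brake"): -1.0,
    ("straight line", "steer"): 0.0,
    ("straight line", "accelerate"): 4.0,
}


def greedy_policy(q, state, actions=ACTIONS):
    """pi*(s) = argmax_a Q*(s, a)."""
    return max(actions, key=lambda action: q[(state, action)])
```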

Understanding Q Function
 The best way to understand the Q-function is to think of it as a “strategy guide”
 Suppose you are playing a difficult game (DOOM)
 If you have a strategy guide, it’s pretty easy → just follow the guide
 Suppose you are in state s and need to make a decision. If you have this m…
Q-function (strategy guide), then it is easy: just pick the action with the highest Q
Doom Strategy Guide

How to find Q-function
 Discounted future reward: $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n$
which can be written as:
 $R_t = r_t + \gamma R_{t+1}$
Recall the definition of the Q-function (the maximum reward if we choose action a in state s)
 $Q(s_t, a_t) = \max R_{t+1}$
Therefore, we can rewrite the Q-function as below
 $Q(s, a) = r + \gamma \max_{a'} Q(s', a')$
In plain English, the maximum future reward for (s, a) is the
immediate reward r plus the discounted maximum future reward from the next state s’ over actions a’
It can be solved by dynamic programming or by an iterative method
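
As a sketch of the iterative solution, the recursion $Q(s, a) = r + \gamma \max_{a'} Q(s', a')$ can be applied repeatedly to a table until it stops changing. The three-state problem below (`TRANSITIONS`, `TERMINAL`) is a hypothetical toy example with known, deterministic transitions; it is not from the talk.

```python
GAMMA = 0.9
TERMINAL = 2
ACTIONS = ("stay", "go")

# Hypothetical deterministic problem: (state, action) -> (reward, next_state).
TRANSITIONS = {
    (0, "stay"): (0.0, 0),
    (0, "go"): (0.0, 1),
    (1, "stay"): (0.0, 1),
    (1, "go"): (1.0, 2),  # reaching the terminal state pays 1
}


def q_iteration(num_sweeps=50):
    """Repeatedly apply Q(s, a) <- r + gamma * max_a' Q(s', a')."""
    q = {key: 0.0 for key in TRANSITIONS}
    for _ in range(num_sweeps):
        new_q = {}
        for (state, action), (reward, next_state) in TRANSITIONS.items():
            if next_state == TERMINAL:
                best_next = 0.0  # no future reward after the episode ends
            else:
                best_next = max(q[(next_state, a)] for a in ACTIONS)
            new_q[(state, action)] = reward + GAMMA * best_next
        q = new_q
    return q
```

In this toy problem the table settles at Q(1, go) = 1, Q(0, go) = 0.9 and Q(0, stay) = 0.81, so the greedy action in both states is "go".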

Deep Q-Network (DQN)
 The action-value function (Q-function) is often very big
 DQN idea: use a neural network to compress this Q-table, using
the weights (w) of the neural network
 $Q(s, a) \approx Q(s, a, w)$
 Training becomes finding a set of optimal weights w instead
 In the literature this is often called “non-linear function approximation”
State    Action    Value
A        1         140.11
A        2         139.22
B        1         145.89
B        2         140.23
C        1         123.67
C        2         135.27
≈ [Figure: neural network with weights w]
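
To make $Q(s, a, w)$ concrete, here is a minimal sketch of a one-hidden-layer network that maps a state vector to one Q-value per action. The layer sizes, the tanh activation and all function names are assumptions made for illustration; the talk does not specify an architecture here, and no training is shown.

```python
import math
import random


def init_weights(state_dim, hidden_dim, num_actions, seed=0):
    """Create small random weights w (hypothetical layer sizes)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(hidden_dim, state_dim),
        "b1": [0.0] * hidden_dim,
        "W2": matrix(num_actions, hidden_dim),
        "b2": [0.0] * num_actions,
    }


def affine(weights, bias, inputs):
    """Compute weights @ inputs + bias with plain lists."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def q_values(state, w):
    """Q(s, ., w): one forward pass returns a Q-value for every action."""
    hidden = [math.tanh(z) for z in affine(w["W1"], w["b1"], state)]
    return affine(w["W2"], w["b2"], hidden)


def num_parameters(w):
    return (sum(len(row) for row in w["W1"]) + len(w["b1"])
            + sum(len(row) for row in w["W2"]) + len(w["b2"]))
```

The number of weights is fixed by the layer sizes, not by the number of distinct states, which is the sense in which the network "compresses" a large Q-table.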

DQN Demo: using a Deep Q-Network to play Doom

Approaches to Reinforcement Learning
Policy-Based RL
 Search directly for the optimal policy 𝜋∗
 This is the policy achieving maximum future reward

Deep Policy Network
Review: A policy is the agent’s behavior
 It is a map from state to action:
 $a_t = \pi(s_t)$
 We can search for the policy directly
 Let’s parameterize the policy by some model parameters $\theta$
 $a = \pi(s, \theta)$
 This is called Policy-Based reinforcement learning because we
will adjust the model parameters $\theta$ directly
 The goal is to maximize the total discounted reward from the beginning
maximize total $R = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$

Policy Gradient
How to make good actions more likely?
 Define the objective function as the total discounted reward
$L(\theta) = E[r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid \pi_\theta(s, a)]$
or
$L(\theta) = E[R \mid \pi_\theta(s, a)]$
where the expectation of the total reward R is calculated under some
probability distribution $p(a \mid \theta)$ parameterized by $\theta$
 The goal becomes maximizing the total reward by
computing the gradient $\frac{\partial L(\theta)}{\partial \theta}$

Policy Gradient (II)
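Recall: the Q-function is the maximum discounted future reward in state s, action a
$Q(s_t, a_t) = \max R_{t+1}$
 In the continuous case it can be written as
$Q(s_t, a_t) = R_{t+1}$
Therefore, we can compute the gradient as
$\frac{\partial L(\theta)}{\partial \theta} = E_{p(a \mid \theta)}\left[\frac{\partial Q}{\partial \theta}\right]$
 Using the chain rule, we can rewrite this as
$\frac{\partial L(\theta)}{\partial \theta} = E_{p(a \mid \theta)}\left[\frac{\partial Q^\theta(s, a)}{\partial a} \frac{\partial a}{\partial \theta}\right]$
No dynamics model required!
1. Only requires that Q is differentiable w.r.t. a
2. As long as a can be parameterized
as a function of $\theta$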
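
The chain rule above can be checked on a toy problem. The sketch below assumes a hand-written, differentiable critic `q_value` and a one-parameter deterministic policy $a = \theta s$; both are hypothetical stand-ins chosen so the derivatives are easy to write out, and the expectation is replaced by an average over a few sample states.

```python
def q_value(state, action):
    """Hypothetical critic: the value peaks when action == 2 * state."""
    return -(action - 2.0 * state) ** 2


def dq_daction(state, action):
    """Partial derivative of the critic w.r.t. the action."""
    return -2.0 * (action - 2.0 * state)


def policy(state, theta):
    """Deterministic policy a = pi(s, theta) = theta * s."""
    return theta * state


def daction_dtheta(state, theta):
    """Partial derivative of the action w.r.t. the policy parameter."""
    return state


def train(states, theta=0.0, learning_rate=0.1, num_steps=200):
    """Gradient ascent on theta using dL/dtheta = mean(dQ/da * da/dtheta)."""
    for _ in range(num_steps):
        grads = [dq_daction(s, policy(s, theta)) * daction_dtheta(s, theta)
                 for s in states]
        theta += learning_rate * sum(grads) / len(grads)
    return theta
```

With sample states `[0.5, 1.0, 1.5]`, `theta` moves from 0 towards 2, the value at which this toy critic is maximized. No model of how the environment moves from one state to the next is used anywhere.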

The power of Policy Gradient
Because the policy gradient does not require a dynamics model,
no prior domain knowledge is required
AlphaGo is not pre-programmed with any domain knowledge
It keeps playing many games (via self-play) and adjusts the policy parameters $\theta$
to maximize the reward (winning probability)

Intuition: Value vs Policy RL
 Value-Based RL is similar to a driving instructor: a score is
given for any action taken by the student
 Policy-Based RL is similar to a driver: it is the actual policy
for how to drive a car

The car racing game TORCS
 TORCS is a state-of-the-art open-source simulator written in C++
 Main Features
 Sophisticated dynamics
 Provided with several tracks and controllers
 Sensors
 Rangefinder
 Speed
 Position on track
 Rotation speed of wheels
 RPM
 Angle with tracks
Quite realistic for self-driving cars…
[Figure: Track sensors]

Deep Learning Recipe
[Diagram: game input state s → deep neural network → outputs (Steer, Gas Pedal, Brake), with reward as feedback]
 Rangefinder
 Speed
 Position on track
 Rotation speed of wheels
 RPM
 Angle with tracks
Compute the optimal policy 𝜋 via policy gradient

Design of the reward function
 Obvious choice: the highest velocity of the car, $R = V_{car} \cos\theta$
 However, in practice learning was found to be not very stable
 Use a modified reward function $R = V_x \cos\theta - V_x \sin\theta - V_x |\text{track pos}|$
Encourages the car to stay in the center of the track
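
The two reward functions can be written out directly. This sketch follows the slide's formulas literally; the argument names, the use of radians for the angle, and the convention that `track_pos` is 0 at the center of the track are assumptions made here.

```python
import math


def naive_reward(speed, angle):
    """R = V_car * cos(theta): speed along the track direction."""
    return speed * math.cos(angle)


def shaped_reward(speed_x, angle, track_pos):
    """R = V_x cos(theta) - V_x sin(theta) - V_x |track pos|.

    Assumed conventions: `angle` is in radians between the car and the
    track direction, and `track_pos` is 0 at the center of the track.
    """
    return (speed_x * math.cos(angle)
            - speed_x * math.sin(angle)
            - speed_x * abs(track_pos))
```

At the same speed, a car aligned with the track and centered (`angle = 0`, `track_pos = 0`) gets the full reward, while drifting to `track_pos = 0.5` halves it.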

Source code available here:
Google: DDPG Keras

Validation Set: Alpine Tracks
Recall from basic machine learning: make sure you test the
model
on the validation set, not the training set

Learning how to brake
Since we try to maximize the velocity of the car,
the AI agent doesn’t want to hit the brake at all! (It goes against the reward function)
Use the Stochastic Brake idea

Final Demo – Car does not stay in the center of the track

Future Application
Self driving cars:

Thank you!
Twitter: @yanpanlau

How to find Q-function (II)
 $Q(s, a) = r + \gamma \max_{a'} Q(s', a')$
We could use an iterative method to solve for the Q-function, given a transition (s, a, …
 We want $r + \gamma \max_{a'} Q(s', a')$ to be the same as $Q(s, a)$
 Treating the search for the Q-function as a regression task, we can define a loss function
 Loss function $= \frac{1}{2}\left[\underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{target}} - \underbrace{Q(s, a)}_{\text{prediction}}\right]^2$
 Q is optimal when the loss function is at its minimum
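
A minimal sketch of this regression view for a single transition, assuming the Q-values of the next state are already available as a list. The function names and the choice to hold the target fixed during the update are assumptions made for illustration.

```python
def td_target(reward, next_q_values, gamma):
    """target = r + gamma * max_a' Q(s', a')."""
    return reward + gamma * max(next_q_values)


def td_loss(prediction, target):
    """loss = 1/2 * (target - prediction)^2, with prediction = Q(s, a)."""
    return 0.5 * (target - prediction) ** 2


def sgd_step(prediction, target, learning_rate):
    """One gradient step on Q(s, a), treating the target as a constant.

    d(loss)/d(prediction) = -(target - prediction), so descending the
    gradient moves the prediction towards the target.
    """
    return prediction + learning_rate * (target - prediction)
```

For example, with `reward = 1.0`, `next_q_values = [2.0, 3.0]` and `gamma = 0.5`, the target is 2.5; a prediction of 2.0 has loss 0.125, and one step with learning rate 0.5 moves it to 2.25.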
