8. Two core components in an RL system
Agent: represents the “solution”
A computer program with the single role of making decisions to solve complex decision-making problems under uncertainty.
Environment: represents the “problem”
Everything that comes after the Agent’s decision.
9. Notations:
State = s = x
Action = control = a = u
Policy π(a|s) is defined as a probability distribution over actions, not as one concrete action; like the weights in a Deep Learning model, it is parameterized by θ
Gamma (γ): we discount rewards, i.e. lower their estimated value, the further they lie in the future (see the sketch below)
Human intuition: “In the long run, we are all dead.”
If γ = 1: we care about all rewards equally
If γ = 0: we care only about the immediate reward
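To make the effect of γ concrete, here is a minimal sketch; the reward sequence and the γ values are illustrative assumptions, not taken from the slides:

    def discounted_return(rewards, gamma):
        # Sum of rewards, each weighted by gamma**t
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    rewards = [1.0, 1.0, 1.0, 1.0]
    print(discounted_return(rewards, 1.0))   # 4.0   -> gamma = 1: all rewards count equally
    print(discounted_return(rewards, 0.0))   # 1.0   -> gamma = 0: only the immediate reward counts
    print(discounted_return(rewards, 0.9))   # ~3.44 -> future rewards count, but less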
11. Intuition: why humans?
If you are the agent, the environment could be the laws of physics and the rules of society that process your actions and determine their consequences.
Were you ever in the wrong place at the wrong time?
That’s a state
12. There is no training data here
Like humans learning how to live (and survive!) as kids
By trial and error
With positive or negative rewards
The reward-and-punishment method
15. Google's artificial intelligence company, DeepMind, has developed an AI that has managed to learn how to walk, run, jump, and climb without any prior guidance. The result is as impressive as it is goofy.
Watch Video
18. Reward vs Value
Reward is an immediate signal received in a given state, while value is the sum of all rewards you can anticipate from that state onward.
Value is a long-term expectation, while reward is an immediate pleasure.
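In standard notation (consistent with this definition, though not written out on the slide), the value of a state under policy π is the expected discounted sum of future rewards:

    $$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right]$$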
20. Tasks
Natural ending: episodic tasks -> games
Episode: a sequence of time steps
The sum of rewards collected in a single episode is called the return. Agents are often designed to maximize the return (see the sketch after this list).
Without natural ending: continuing tasks -> learning forward motion
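A minimal sketch of collecting the return over one episode; env is assumed to follow a Gym-style reset/step interface and policy is a hypothetical function mapping states to actions:

    def run_episode(env, policy):
        state = env.reset()
        episode_return = 0.0
        done = False
        while not done:                      # episodic task: stops at a terminal state
            action = policy(state)
            state, reward, done = env.step(action)
            episode_return += reward         # return = sum of rewards in one episode
        return episode_return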
21. How the environment reacts to certain actions is defined by a model, which may or may not be known by the Agent
22. Approaches
Analyze how good it is to reach a certain state or to take a specific action (i.e. value learning)
Measures the total reward you get from a particular state when following a specific policy
Uses the V or Q value to derive the optimal policy
Q-learning
Use the model to find actions that have the maximum rewards (model-based learning)
Model-based RL uses the model and the cost function to find the optimal path
Derive a policy directly to maximize rewards (policy gradient)
For actions with better rewards, we make them more likely to happen (and vice versa).
23. For model-based learning, watch this →
Watch Video
25. How can we mathematically formalize the RL problem?
• Markov decision processes formalize the reinforcement learning problem setting
• Q-learning and policy gradients are two major algorithms in this area
26. MDP
An attempt to model a complex probability distribution of rewards in relation to a very large number of state-action pairs.
A Markov decision process is a method to sample from a complex distribution to infer its properties, even when we do not understand the mechanism by which states, actions, and rewards relate.
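For reference, the standard formalization (not spelled out on the slide) writes an MDP as a tuple

    $$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$

with states S, actions A, transition probabilities P(s'|s, a), reward function R, and discount factor γ.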
32. MDP
• Genes on a chromosome are states. To read them (and create amino acids) is to go through their transitions.
• Emotions are states in a psychological system. Mood swings are the transitions.
33. Markov chains have a particular property: oblivion, or forgetting.
They assume the entirety of the past is encoded in the present.
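In symbols, this is the standard Markov property (added here for clarity): the next state depends only on the present state and action, not on the rest of the history:

    $$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)$$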
35. Q-learning
The Q stands for the "quality" of an action taken in a given state.
Q-learning is a model-free reinforcement learning algorithm to learn a
policy telling an agent what action to take under what circumstances.
For any finite Markov decision process (FMDP), Q-learning finds an optimal
policy in the sense of maximizing the expected value of the total reward
over any and all successive steps, starting from the current state.
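A minimal tabular sketch of the algorithm; the environment interface (env.reset(), env.step(), env.actions) and the hyperparameter values are illustrative assumptions, not from the slides:

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = defaultdict(float)               # Q[(state, action)] -> estimated value
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # epsilon-greedy: mostly exploit the current Q, sometimes explore
                if random.random() < epsilon:
                    action = random.choice(env.actions)
                else:
                    action = max(env.actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # move Q(s, a) toward reward + discounted best value of the next state
                best_next = max(Q[(next_state, a)] for a in env.actions)
                target = reward if done else reward + gamma * best_next
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state = next_state
        return Q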
37. Q
A value for each state-action pair, which is called the action-value function, also known as the Q-function.
It is usually denoted by Q^π(s, a) and refers to the expected return G when the Agent is at state s and takes action a, following the policy π.
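Written out (the standard definition, matching the slide's wording):

    $$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s,\; A_t = a \,\right]$$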
40. Bellman Equation
It writes the "value" of a decision problem at a certain point in time in terms of the payoff from some initial choices and the "value" of the remaining decision problem that results from those initial choices.
This means that if we know the value of s_{t+1}, we can very easily calculate the value of s_t.
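For the state-value function this takes the familiar recursive form (standard notation, added for clarity): the value at time t is the immediate payoff plus the discounted value of what remains:

    $$V^{\pi}(s_t) = \mathbb{E}_{\pi}\left[\, r_{t+1} + \gamma \, V^{\pi}(s_{t+1}) \,\right]$$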
46. Experience Replay
Experience replay stores the last million state-action-reward transitions in a replay buffer. We train Q with batches of random samples from this buffer, enabling the RL agent to sample from and train on previously observed data offline.
This massively reduces the amount of interaction needed with the environment, and sampling batches of experience reduces the variance of learning updates.
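A minimal sketch of such a buffer; the capacity and batch size are illustrative:

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=1_000_000):    # "the last million" transitions
            self.buffer = deque(maxlen=capacity)   # oldest entries are evicted automatically

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # uniform random batch: breaks the correlation between consecutive steps
            return random.sample(self.buffer, batch_size)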
51. REINFORCE rule
The estimator of the gradient:

    $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \,\right]$$

We change the policy in the direction of the steepest reward increase.
This means that for actions with better rewards, we make them more likely to happen.
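A minimal sketch of one REINFORCE update; the helper grad_log_pi, the learning rate, and the episode format are hypothetical placeholders:

    def reinforce_update(theta, episode, grad_log_pi, lr=0.01, gamma=0.99):
        # episode: list of (state, action, reward) tuples from one rollout
        # grad_log_pi(theta, s, a): gradient of log pi_theta(a|s) w.r.t. theta
        rewards = [r for _, _, r in episode]
        for t, (state, action, _) in enumerate(episode):
            # G_t: discounted return collected from step t onward
            G_t = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
            # gradient ascent: make actions with larger returns more likely
            theta = theta + lr * G_t * grad_log_pi(theta, state, action)
        return theta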