8. Two core components in an RL system
Agent: represents the “solution”
A computer program with the single role of making decisions to solve complex decision-making problems under uncertainty.
Environment: represents the “problem”
Everything that comes after the Agent’s decision.
9. Notations:
State = s = x
Action = control = a = u
Policy π(a|s) is defined as a probability distribution over actions, not as one concrete action; like the weights in a Deep Learning model, it is parameterized by θ
Gamma (γ): we discount rewards, i.e. lower their estimated value, the further they lie in the future (see the sketch below)
Human intuition: “In the long run, we are all dead.”
If γ = 1: we care about all rewards equally
If γ = 0: we care only about the immediate reward
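To make the effect of γ concrete, here is a minimal sketch; the reward sequence and the γ values are illustrative assumptions, not taken from the slides:

    def discounted_return(rewards, gamma):
        # Sum of rewards, each weighted by gamma**t
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    rewards = [1.0, 1.0, 1.0, 1.0]
    print(discounted_return(rewards, 1.0))   # 4.0   -> gamma = 1: all rewards count equally
    print(discounted_return(rewards, 0.0))   # 1.0   -> gamma = 0: only the immediate reward counts
    print(discounted_return(rewards, 0.9))   # ~3.44 -> future rewards count, but less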
11. Intuition: why humans?
If you are the agent, the environment could be the laws of physics and the rules of society that process your actions and determine their consequences.
Were you ever in the wrong place at the wrong time?
That’s a state
12. There is no training data here
Like humans learning how to live (and survive!) as kids
By trial and error
With positive or negative rewards
The reward-and-punishment method
15. Google's artificial intelligence company, DeepMind, has developed an AI that has managed to learn how to walk, run, jump, and climb without any prior guidance. The result is as impressive as it is goofy.
Watch Video
18. Reward vs Value
Reward is an immediate signal received in a given state, while value is the sum of all rewards you can anticipate from that state onward.
Value is a long-term expectation, while reward is an immediate pleasure.
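In standard notation (consistent with this definition, though not written out on the slide), the value of a state under policy π is the expected discounted sum of future rewards:

    $$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right]$$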
20. Tasks
Natural ending: episodic tasks -> games
Episode: a sequence of time steps
The sum of rewards collected in a single episode is called the return. Agents are often designed to maximize the return (see the sketch after this list).
Without natural ending: continuing tasks -> learning forward motion
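A minimal sketch of collecting the return over one episode; env is assumed to follow a Gym-style reset/step interface and policy is a hypothetical function mapping states to actions:

    def run_episode(env, policy):
        state = env.reset()
        episode_return = 0.0
        done = False
        while not done:                      # episodic task: stops at a terminal state
            action = policy(state)
            state, reward, done = env.step(action)
            episode_return += reward         # return = sum of rewards in one episode
        return episode_return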
21. How the environment reacts to certain actions is defined by a model, which may or may not be known by the Agent
22. Approaches
Analyze how good it is to reach a certain state or to take a specific action (i.e. value learning)
Measures the total reward you get from a particular state when following a specific policy
Uses the V or Q value to derive the optimal policy
Q-learning
Use the model to find actions that have the maximum rewards (model-based learning)
Model-based RL uses the model and the cost function to find the optimal path
Derive a policy directly to maximize rewards (policy gradient)
For actions with better rewards, we make them more likely to happen (and vice versa).
23. For model-based learning, watch this →
Watch Video
25. How can we mathematically formalize the RL problem?
• Markov decision processes formalize the reinforcement learning problem setting
• Q-learning and policy gradients are two major algorithms in this area
26. MDP
An attempt to model a complex probability distribution of rewards in relation to a very large number of state-action pairs.
A Markov decision process is a method to sample from a complex distribution to infer its properties, even when we do not understand the mechanism by which states, actions, and rewards relate.
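For reference, the standard formalization (not spelled out on the slide) writes an MDP as a tuple

    $$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$

with states S, actions A, transition probabilities P(s'|s, a), reward function R, and discount factor γ.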
32. MDP
• Genes on a chromosome are states. To read them (and create amino acids) is to go through their transitions.
• Emotions are states in a psychological system. Mood swings are the transitions.
33. Markov chains have a particular property: oblivion, or forgetting.
They assume the entirety of the past is encoded in the present.
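In symbols, this is the standard Markov property (added here for clarity): the next state depends only on the present state and action, not on the rest of the history:

    $$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)$$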
35. Q-learning
The Q stands for the "quality" of an action taken in a given state.
Q-learning is a model-free reinforcement learning algorithm to learn a
policy telling an agent what action to take under what circumstances.
For any finite Markov decision process (FMDP), Q-learning finds an optimal
policy in the sense of maximizing the expected value of the total reward
over any and all successive steps, starting from the current state.
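A minimal tabular sketch of the algorithm; the environment interface (env.reset(), env.step(), env.actions) and the hyperparameter values are illustrative assumptions, not from the slides:

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = defaultdict(float)               # Q[(state, action)] -> estimated value
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # epsilon-greedy: mostly exploit the current Q, sometimes explore
                if random.random() < epsilon:
                    action = random.choice(env.actions)
                else:
                    action = max(env.actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # move Q(s, a) toward reward + discounted best value of the next state
                best_next = max(Q[(next_state, a)] for a in env.actions)
                target = reward if done else reward + gamma * best_next
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state = next_state
        return Q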
37. Q
A value for each state-action pair, which is called the action-value function, also known as the Q-function.
It is usually denoted by Q^π(s, a) and refers to the expected return G when the Agent is at state s and takes action a, following the policy π.
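Written out (the standard definition, matching the slide's wording):

    $$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s,\; A_t = a \,\right]$$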
40. Bellman Equation
It writes the "value" of a decision problem at a certain point in time in terms of the payoff from some initial choices and the "value" of the remaining decision problem that results from those initial choices.
This means that if we know the value of s_{t+1}, we can very easily calculate the value of s_t.
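For the state-value function this takes the familiar recursive form (standard notation, added for clarity): the value at time t is the immediate payoff plus the discounted value of what remains:

    $$V^{\pi}(s_t) = \mathbb{E}_{\pi}\left[\, r_{t+1} + \gamma \, V^{\pi}(s_{t+1}) \,\right]$$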
46. Experience Replay
Experience replay stores the last million state-action-reward transitions in a replay buffer. We train Q with batches of random samples from this buffer, enabling the RL agent to sample from and train on previously observed data offline.
This massively reduces the amount of interaction needed with the environment, and sampling batches of experience reduces the variance of learning updates.
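A minimal sketch of such a buffer; the capacity and batch size are illustrative:

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=1_000_000):    # "the last million" transitions
            self.buffer = deque(maxlen=capacity)   # oldest entries are evicted automatically

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # uniform random batch: breaks the correlation between consecutive steps
            return random.sample(self.buffer, batch_size)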
51. REINFORCE rule
The estimator of the gradient:

    $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \,\right]$$

We change the policy in the direction of the steepest reward increase.
This means that for actions with better rewards, we make them more likely to happen.
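A minimal sketch of one REINFORCE update; the helper grad_log_pi, the learning rate, and the episode format are hypothetical placeholders:

    def reinforce_update(theta, episode, grad_log_pi, lr=0.01, gamma=0.99):
        # episode: list of (state, action, reward) tuples from one rollout
        # grad_log_pi(theta, s, a): gradient of log pi_theta(a|s) w.r.t. theta
        rewards = [r for _, _, r in episode]
        for t, (state, action, _) in enumerate(episode):
            # G_t: discounted return collected from step t onward
            G_t = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
            # gradient ascent: make actions with larger returns more likely
            theta = theta + lr * G_t * grad_log_pi(theta, state, action)
        return theta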