Demystifying deep reinforcement learning

Demystifying Reinforcement Learning
Slides by JaeyeunYoon

IDS Lab.
What is Reinforcement Learning?
• Learning by trial-and-error, in real-time.
• Improves with experience
• Inspired by psychology
- Agent + Environment
- Agent selects actions to maximize a utility function.

When to use RL?
•Data comes in the form of trajectories.
•Need to make a sequence of (related) decisions.
•Feedback on the chosen actions is observed, but may be partial and noisy.
•Tasks that require both learning and planning.

Supervised Learning VS RL

Markov Decision Process (MDP)
•Defined by:
S = \{s_1, s_2, \dots, s_n\}: the set of states (can be infinite / continuous)
A = \{a_1, a_2, \dots, a_n\}: the set of actions (can be infinite / continuous)
T(s, a, s') = \Pr(s' \mid s, a): the dynamics of states (can be infinite / continuous)
R(s, a): the reward function
\mu(s): the initial state distribution
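
As a rough illustration, the five ingredients above can be written down as a small tabular structure. The `MDP` class, the two-state "cool"/"hot" machine, and all of its probabilities and rewards are invented for this sketch and do not come from the slides.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list    # S
    actions: list   # A
    T: dict         # T[(s, a)] -> {s_next: Pr(s_next | s, a)}
    R: dict         # R[(s, a)] -> reward
    mu: dict        # mu[s] -> probability of starting in s

    def reset(self, rng):
        """Draw an initial state from mu."""
        states, probs = zip(*self.mu.items())
        return rng.choices(states, weights=probs)[0]

    def step(self, s, a, rng):
        """Sample s' from T(s, a, .) and return (s', R(s, a))."""
        next_states, probs = zip(*self.T[(s, a)].items())
        s_next = rng.choices(next_states, weights=probs)[0]
        return s_next, self.R[(s, a)]


# Hypothetical toy problem: a machine that is either "cool" or "hot".
toy = MDP(
    states=["cool", "hot"],
    actions=["slow", "fast"],
    T={
        ("cool", "slow"): {"cool": 1.0},
        ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
        ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
        ("hot", "fast"): {"hot": 1.0},
    },
    R={("cool", "slow"): 1.0, ("cool", "fast"): 2.0,
       ("hot", "slow"): 1.0, ("hot", "fast"): -10.0},
    mu={"cool": 1.0},
)

rng = random.Random(0)
s = toy.reset(rng)                    # always "cool" here, since mu puts all mass on it
s_next, r = toy.step(s, "fast", rng)  # one sampled transition
```

Because `step` looks only at the current `(s, a)` pair, any process simulated this way has the Markov property by construction.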

The Markov Property
•The distribution over future states depends only on the
present state and action, not on any other previous event.
\Pr(s_{t+1} \mid s_0, \dots, s_t, a_0, \dots, a_t) = \Pr(s_{t+1} \mid s_t, a_t)

The goal of RL? Maximize return!
•The return U_t of a trajectory is the sum of rewards starting from step t.
•Episodic task: consider the return over a finite horizon (e.g. games, maze).
→ U_t = r_t + r_{t+1} + r_{t+2} + \dots + r_T
•Continuing task: consider the return over an infinite horizon (e.g. juggling, balancing).
→ U_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}
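
Both kinds of return can be computed with one small helper. This is a sketch that follows the summation form above, where the reward at step t carries weight γ^0 = 1; the function name and the reward list are made up for illustration.

```python
def discounted_return(rewards, gamma=1.0):
    """Return sum_k gamma**k * rewards[k], i.e. U_t for rewards r_t, r_{t+1}, ..."""
    total = 0.0
    # Work backwards so that each step applies U_t = r_t + gamma * U_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]                             # made-up r_t, r_{t+1}, r_{t+2}
episodic = discounted_return(rewards)                 # gamma = 1: plain sum of rewards
discounted = discounted_return(rewards, gamma=0.5)    # 1 + 0.5*0 + 0.25*2
```

With γ < 1 and bounded rewards the infinite sum stays finite, which is why continuing tasks use the discounted form.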

The discount factor, γ
•Discount factor γ ∈ [0, 1] (usually close to 1).
•It values immediate reward above delayed reward.
- γ close to 0 leads to "myopic" evaluation
- γ close to 1 leads to "far-sighted" evaluation
•Intuition:
- Receiving $80 today is worth the same as $100 tomorrow, assuming a discount factor of γ = 0.8
- At each time step, there is a (1 − γ) chance that the agent dies and receives no further rewards
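
Both intuitions can be checked with a few lines of arithmetic. The sketch below assumes γ = 0.8 as in the example; the variable names are illustrative.

```python
gamma = 0.8

# Intuition 1: a reward of 100 one step in the future is valued at gamma * 100 now.
present_value = gamma * 100.0

# Intuition 2: if the agent survives each step with probability gamma,
# the chance that it is still alive after k steps is gamma ** k ...
survival = [gamma ** k for k in range(4)]

# ... and the expected number of steps at which it collects a reward is
# 1 + gamma + gamma**2 + ... = 1 / (1 - gamma).
expected_steps = 1.0 / (1.0 - gamma)
```

The quantity 1 / (1 − γ) is a useful rule of thumb for how far ahead the agent effectively looks: about 5 steps for γ = 0.8, about 100 steps for γ = 0.99.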

Major Components of an RL Agent
•An RL agent may include one or more of these components:
- Policy: agent's behavior function
- Value function: how good is each state and/or action
- Model: agent's representation of the environment

Defining behavior: The policy
•A policy π defines the action-selection strategy at every state:
\pi(s, a) = P(a_t = a \mid s_t = s)   (stochastic policy)
\pi : S \to A   (deterministic policy)
Goal: Find the policy that maximizes expected total reward.
(But there are many policies!)
\arg\max_\pi E_\pi[r_0 + r_1 + \dots + r_T \mid s_0]
???
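
The two notations above correspond to two ways of storing a policy: as a probability table over actions (stochastic) or as a direct state-to-action map (deterministic). A minimal sketch, with made-up states, actions, and probabilities:

```python
import random

# Stochastic policy: pi[s][a] = P(a_t = a | s_t = s).
stochastic_pi = {
    "s1": {"left": 0.9, "right": 0.1},
    "s2": {"left": 0.5, "right": 0.5},
}

# Deterministic policy: pi maps each state in S directly to an action in A.
deterministic_pi = {"s1": "left", "s2": "right"}


def sample_action(pi, s, rng):
    """Draw an action from the distribution pi[s]."""
    actions, probs = zip(*pi[s].items())
    return rng.choices(actions, weights=probs)[0]


rng = random.Random(0)
a = sample_action(stochastic_pi, "s1", rng)   # "left" about 90% of the time
```

A deterministic policy is the special case in which one action per state has probability 1.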

Example: Career Options

Value functions
•The expected return of a policy (for every state) is called the value function:
V^\pi(s) = E_\pi[r_t + r_{t+1} + \dots + r_T \mid s_t = s]
* Simple strategy to find the best policy:
1. Enumerate the space of all possible policies.
2. Estimate the expected return of each one.
3. Keep the policy that has maximum expected return.
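
The three steps above can be sketched as a brute-force search. The two-state problem below is hypothetical and deterministic, so a single rollout gives the exact return; in a stochastic problem step 2 would need averaging over many rollouts.

```python
from itertools import product

states = ["A", "B"]
actions = ["stay", "move"]
# Hypothetical deterministic toy problem: next_state[(s, a)] and reward[(s, a)].
next_state = {("A", "stay"): "A", ("A", "move"): "B",
              ("B", "stay"): "B", ("B", "move"): "A"}
reward = {("A", "stay"): 0.0, ("A", "move"): 1.0,
          ("B", "stay"): 2.0, ("B", "move"): 0.0}


def rollout_return(policy, start, horizon):
    """Total reward from following a deterministic policy for `horizon` steps."""
    s, total = start, 0.0
    for _ in range(horizon):
        a = policy[s]
        total += reward[(s, a)]
        s = next_state[(s, a)]
    return total


# 1. Enumerate all |A|^|S| deterministic policies.
all_policies = [dict(zip(states, choice))
                for choice in product(actions, repeat=len(states))]
# 2. Evaluate the return of each one from start state "A".
scored = [(rollout_return(pi, "A", horizon=5), pi) for pi in all_policies]
# 3. Keep the policy with the maximum return.
best_return, best_policy = max(scored, key=lambda pair: pair[0])
```

The number of deterministic policies is |A|^|S|, so this only works for tiny problems; that blow-up is the motivation for value-based methods such as Q-learning.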

Getting confused with terminology?
•Reward: One-step numerical feedback.
•Return: Sum of rewards over the agent’s trajectory.
•Value: Expected sum of rewards over the agent’s trajectory.
•Utility: Numerical function representing preferences.
* In RL, we assume Utility = Return.

Q-learning: Model-Free RL
•In Q-learning we define a function Q(s, a) representing the
maximum discounted future reward when we perform action a in
state s and continue optimally from that point on:
Q(s_t, a_t) = \max R_{t+1}
where R_{t+1} is the discounted future reward from step t+1 onward.
• The way to think about Q(s, a) is that it is “the best possible score
at the end of the game after performing action a in state s”. It is
called the Q-function because it represents the “quality” of a certain
action in a given state.
• Given Q, we can choose the policy that picks the action with the highest Q-value:
\pi(s) = \arg\max_a Q(s, a)
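
A minimal sketch of reading a greedy policy off a Q-table. The table values and state/action names are invented for illustration.

```python
# Hypothetical Q-table: Q[s][a] holds the current estimate of Q(s, a).
Q = {
    "s1": {"left": 0.2, "right": 1.5},
    "s2": {"left": 0.7, "right": 0.1},
}


def greedy_policy(Q, s):
    """pi(s) = argmax_a Q(s, a)."""
    return max(Q[s], key=Q[s].get)


policy = {s: greedy_policy(Q, s) for s in Q}
```

No model of the environment is needed at decision time: the agent only looks up the Q-values of the current state, which is what makes the approach model-free.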

Q-learning: Bellman equation
•How do we get that Q-function then? Let’s focus on just one
transition <s, a, r, s’>. Just like with discounted future rewards in
the previous section, we can express the Q-value of state s and
action a in terms of the Q-value of the next state s’.
Q(s, a) = r + \gamma \max_{a'} Q(s', a')   (Bellman equation)
• The main idea in Q-learning
- we can iteratively approximate the Q-function using the Bellman equation.
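
One common way to turn the Bellman equation into an iterative update is tabular Q-learning, sketched below. The learning rate `alpha`, the four-state corridor environment, and the purely random behavior policy are assumptions made for this example; none of them appear in the slides.

```python
import random
from collections import defaultdict

ACTIONS = ["left", "right"]
GOAL = 3  # hypothetical corridor with states 0..3; reaching state 3 ends the episode


def step(s, a):
    """Toy deterministic environment: returns (next_state, reward, done)."""
    s_next = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done


def q_learning(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS)  # act randomly; Q-learning is off-policy
            s_next, r, done = step(s, a)
            # Bellman target: r + gamma * max_a' Q(s', a'); no bootstrap at the end.
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in ACTIONS)
            target = r + gamma * best_next
            # Move Q(s, a) a fraction alpha of the way towards the target.
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q


Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

In this toy problem the estimates approach Q(2, right) = 1, Q(1, right) = γ, and Q(0, right) = γ², so the greedy policy moves right in every state.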

Q-learning: Atari Breakout
• For example, take the ‘Breakout’ game screens as in the DeepMind paper:
-> take the last four screen images, resize them to 84×84 and
convert them to grayscale with 256 gray levels
-> we would have 256^{84 \times 84 \times 4} \approx 10^{67970} possible game states.
-> This means 10^{67970} rows in our imaginary Q-table,
more than the number of atoms in the known universe!
Atari Breakout game. Image credit: DeepMind.
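
The size of that imaginary Q-table can be checked with logarithms, since the number itself is far too large to print comfortably. A small sketch (variable names are illustrative):

```python
import math

pixels = 84 * 84 * 4                    # four stacked 84x84 frames
exponent = pixels * math.log10(256)     # log10 of 256 ** pixels
order_of_magnitude = math.floor(exponent)
```

Here `order_of_magnitude` is the power of ten in 256^(84×84×4) ≈ 10^67970, which is why a table with one row per state is out of the question.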

Deep Q Network: Atari Breakout
•The Q-function can be approximated using a neural network
model.

* No pooling layer? Why?

•Experience Replay
- During gameplay all the experiences < s, a, r, s’ > are stored in a replay
memory. When training the network, random minibatches from the replay
memory are used instead of the most recent transition.
•Exploration-Exploitation
- ε-greedy exploration – with probability ε choose a random action, otherwise
go with the “greedy” action with the highest Q-value. In their system
DeepMind actually decreases ε over time from 1 to 0.1
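
Both tricks are simple to sketch in plain Python. The class and function names, the buffer capacity, and the linear shape of the ε schedule with its `decay_steps` value are assumptions for illustration; the slides only state that ε decreases from 1 to 0.1.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of <s, a, r, s'> transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng):
        """Uniformly sample a minibatch instead of using the latest transition."""
        return rng.sample(list(self.buffer), batch_size)


def epsilon_at(step, start=1.0, end=0.1, decay_steps=1000):
    """Linearly anneal epsilon from `start` to `end` (assumed schedule)."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_values, epsilon, rng):
    """q_values: dict mapping action -> Q(s, action) for the current state."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))       # explore: random action
    return max(q_values, key=q_values.get)      # exploit: highest Q-value


rng = random.Random(0)
memory = ReplayMemory(capacity=100)
for t in range(150):
    memory.push(t, "noop", 0.0, t + 1)          # dummy transitions
batch = memory.sample(32, rng)
action = epsilon_greedy({"left": 0.1, "right": 0.9}, epsilon_at(5000), rng)
```

Sampling random minibatches breaks the correlation between consecutive transitions, and annealing ε shifts the agent from mostly exploring early on to mostly exploiting its learned Q-values later.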
