2. Outline
Basics and examples
Setup and notation
Problems in RL
How do we get an optimal policy given data?
How do we balance exploration and exploitation?
RL in Laber Labs
3. Basics and examples
4. Basic idea
Reinforcement learning (RL): an agent interacts with an environment, which provides rewards
Goal: learn how to take actions so as to maximize cumulative reward
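This loop is the whole setup in miniature. A minimal Python sketch of it is below; the env and agent objects are hypothetical stand-ins for any environment and any decision-maker, not a particular library's API.

# A minimal sketch of the agent-environment interaction loop.
# `env` and `agent` are hypothetical stand-ins, not a specific library API.
def run_episode(env, agent):
    state = env.reset()                         # environment starts in some state
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)               # agent chooses an action
        state, reward, done = env.step(action)  # environment responds with a reward
        agent.observe(state, reward)            # agent can learn from the feedback
        total_reward += reward
    return total_reward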
8. RL in the news
Advances in computing power and algorithms in recent years have led to great interest in using RL for artificial intelligence
RL has now been used to achieve superhuman performance in a number of difficult games
9. Example: Atari
Figure 4: Deep Q-Network playing Breakout (Mnih et al. 2015).
States: Pixels on screen
Actions: Move paddle
Rewards: Points
10. Example: AlphaZero (Silver et al. 2017)
Figure 5: The game of Go.
States: Positions of stones
Actions: Stone placement
Rewards: Win/lose
11. Basics and examples
12. Setup: MDPs
We formalize the reinforcement learning problem using a Markov
decision process (MDP) (S, A, T, r, γ):
S is the set of states the environment can be in;
A is the set of actions available to the decision-maker;
T : S × A × S → R+ is the transition function, which gives the probability distribution of the next state given the current state and action;
r : S → R is the reward function;
γ is a discount factor, 0 ≤ γ < 1.
Data: at each time t we observe the current state, action, reward, and next state (St, At, Rt, St+1).
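To make the definition concrete, a small finite MDP can be written out directly as Python data structures. The two-state example below is invented purely for illustration.

# A tiny hypothetical MDP (S, A, T, r, gamma) with two states and two actions.
S = ["healthy", "sick"]     # state set S
A = ["treat", "wait"]       # action set A
gamma = 0.9                 # discount factor

# T[s][a][s2] = probability of moving to state s2 from state s under action a
T = {
    "healthy": {"treat": {"healthy": 0.95, "sick": 0.05},
                "wait":  {"healthy": 0.80, "sick": 0.20}},
    "sick":    {"treat": {"healthy": 0.60, "sick": 0.40},
                "wait":  {"healthy": 0.10, "sick": 0.90}},
}

# r[s] = reward for being in state s (r : S -> R, as on the slide)
r = {"healthy": 1.0, "sick": -1.0}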
13. Setup: Policies
Policies tell us which action to take in each state
π : S → A
Goal: choose a policy to maximize expected cumulative discounted reward:
Eπ[ Σ_{t=0}^∞ γ^t Rt ]
14. Setup: Value functions
Value functions tell us the long-term rewards we can expect under
a given policy, starting from a given state and/or action.
The “V-function” measures expected cumulative reward from a given state:
V^π(s) = Eπ[ Σ_{t=0}^∞ γ^t Rt | S0 = s ]
The “Q-function” measures expected cumulative reward from a given state and action:
Q^π(s, a) = Eπ[ Σ_{t=0}^∞ γ^t Rt | S0 = s, A0 = a ]
          = Σ_{s′∈S} [ r(s′) + γ V^π(s′) ] T(s′ | s, a)
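Continuing the toy MDP sketched earlier, V^π can be computed by iterating exactly this expectation until it stops changing (iterative policy evaluation). The fixed policy below is arbitrary, chosen only to have something to evaluate.

# Iterative policy evaluation for the toy MDP defined above.
# Repeatedly applies V(s) <- Σ_{s2} [ r(s2) + gamma * V(s2) ] * T(s2|s,a),
# with a = pi(s), until the values converge.
pi = {"healthy": "wait", "sick": "treat"}   # an arbitrary fixed policy

V = {s: 0.0 for s in S}
for _ in range(10_000):
    V_new = {s: sum(T[s][pi[s]][s2] * (r[s2] + gamma * V[s2]) for s2 in S)
             for s in S}
    if max(abs(V_new[s] - V[s]) for s in S) < 1e-8:
        V = V_new
        break
    V = V_new

print(V)   # long-run discounted value of each state under pi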
15. Basics and examples
16. Problem 1: Estimating optimal policy
Two ways of getting at the optimal policy π∗:
Try to improve π directly
Try to estimate Q^{π∗}
Example: Q-learning
Q_new(St, At) ← (1 − α) Q(St, At) + α [ Rt + γ max_a Q(St+1, a) ],
where α is the learning rate, 0 ≤ α ≤ 1.
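A tabular version of this update fits in a few lines. The sketch below assumes a hypothetical env with the reset()/step() interface used earlier, and uses ε-greedy exploration (previewing the next slide).

# Tabular Q-learning sketch implementing the update rule above.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)              # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy: mostly exploit, sometimes explore
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, reward, done = env.step(a)
            # Q(St,At) <- (1 - alpha) Q(St,At) + alpha [ Rt + gamma max_a Q(St+1,a) ]
            best_next = 0.0 if done else max(Q[(s2, act)] for act in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + gamma * best_next)
            s = s2
    return Q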
17. Problem 2: Exploration-exploitation tradeoff
Tradeoff between gaining information (exploration) and following the current estimate of the optimal policy (exploitation)
Restaurant example
Exploitation: Go to your favorite restaurant
Exploration: Try a new place
Need to balance both to maximize cumulative deliciousness
Different strategies:
Occasionally do something completely random (ε-greedy)
Act based on optimistic estimates of each action’s value
Sample an action according to its posterior probability of being optimal (Thompson sampling; sketched below)
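As one concrete instance of the last strategy, here is a minimal Thompson-sampling sketch for a bandit version of the restaurant example. The three restaurants and their success probabilities are invented; the agent keeps a Beta posterior per restaurant and acts greedily with respect to one posterior sample.

# Thompson sampling for a Bernoulli bandit: each restaurant serves a
# good meal (reward 1) with some probability unknown to the agent.
import random

true_p = [0.7, 0.5, 0.9]    # hypothetical true qualities, unknown to the agent
wins, losses = [0, 0, 0], [0, 0, 0]

for t in range(1000):
    # draw one plausible quality per restaurant from its Beta posterior
    samples = [random.betavariate(wins[k] + 1, losses[k] + 1)
               for k in range(len(true_p))]
    k = samples.index(max(samples))   # exploration and exploitation in one step
    if random.random() < true_p[k]:   # eat there and observe the outcome
        wins[k] += 1
    else:
        losses[k] += 1

print(wins, losses)   # pulls should concentrate on the best restaurant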
18. A small task to solve with RL (CartPole)
Action space A = {0, 1}, representing {left, right}
State space (S1, S2, S3, S4) ∈ R4, representing (position, velocity, angle, angular velocity)
Goal: keep the pole standing for 200 timesteps in each episode (too large an angle or too far a distance from center ends the episode)
Define Rt ∈ {−1, 1}
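The loop below runs CartPole with a random policy, assuming the Gymnasium API (pip install gymnasium); a learning agent would replace env.action_space.sample() with its own choice. Note that the stock CartPole-v1 gives +1 per surviving step, so the ±1 reward on the slide would be a custom redefinition.

# Random policy on CartPole, assuming the Gymnasium API.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)   # obs = (position, velocity, angle, angular velocity)
total_reward = 0.0
for t in range(200):
    action = env.action_space.sample()    # 0 = left, 1 = right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:           # pole fell over or cart left the track
        break
print(f"episode lasted {t + 1} steps, total reward {total_reward}")
env.close()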
20. Basics and examples
21. RL in Laber Labs
At Laber Labs we apply reinforcement learning to interesting and important real-world problems:
Controlling the spread of disease
Dynamic medical treatment
Education
Sports decision-making
22. Stopping the spread of disease
Figure 6: The spread of white-nose syndrome in bats, 2006–2014.
States: Which locations are infected
Actions: Locations to treat
Rewards: Number of uninfected locations
24. Dynamic medical treatment
Figure 8: RL can help us customize medical treatment to individual patients’ characteristics.
States: Current health status (exercise levels, food intake, blood pressure, blood sugar, many more)
Actions: Recommend treatment
Rewards: Health outcomes