2. Outline
Basics and examples
Setup and notation
Problems in RL
How do we get an optimal policy given data?
How do we balance exploration and exploitation?
RL in Laber Labs
3. Basics and examples
4. Basic idea
Reinforcement learning (RL): an agent interacts with an environment, which provides rewards
Goal: learn how to take actions so as to maximize cumulative reward
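This loop is the whole setup in miniature. A minimal Python sketch of it is below; the env and agent objects are hypothetical stand-ins for any environment and any decision-maker, not a particular library's API.

# A minimal sketch of the agent-environment interaction loop.
# `env` and `agent` are hypothetical stand-ins, not a specific library API.
def run_episode(env, agent):
    state = env.reset()                         # environment starts in some state
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)               # agent chooses an action
        state, reward, done = env.step(action)  # environment responds with a reward
        agent.observe(state, reward)            # agent can learn from the feedback
        total_reward += reward
    return total_reward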
8. RL in the news
Advances in computing power and algorithms in recent years have led to great interest in using RL for artificial intelligence
RL has now been used to achieve superhuman performance in a number of difficult games
9. Example: Atari
Figure 4: Deep Q-Network playing Breakout (Mnih et al. 2015).
States: Pixels on screen
Actions: Move paddle
Rewards: Points
10. Example: AlphaZero (Silver et al. 2017)
Figure 5: The game of Go.
States: Positions of stones
Actions: Stone placement
Rewards: Win/lose
11. Basics and examples
12. Setup: MDPs
We formalize the reinforcement learning problem using a Markov
decision process (MDP) (S, A, T, r, γ):
S is the set of states the environment can be in;
A is the set of actions available to the decision-maker;
T : S × A × S → R+ is the transition function, which gives the probability distribution of the next state given the current state and action;
r : S → R is the reward function;
γ is a discount factor, 0 ≤ γ < 1.
Data: at each time t we observe the current state, action, reward, and next state (St, At, Rt, St+1).
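To make the definition concrete, a small finite MDP can be written out directly as Python data structures. The two-state example below is invented purely for illustration.

# A tiny hypothetical MDP (S, A, T, r, gamma) with two states and two actions.
S = ["healthy", "sick"]     # state set S
A = ["treat", "wait"]       # action set A
gamma = 0.9                 # discount factor

# T[s][a][s2] = probability of moving to state s2 from state s under action a
T = {
    "healthy": {"treat": {"healthy": 0.95, "sick": 0.05},
                "wait":  {"healthy": 0.80, "sick": 0.20}},
    "sick":    {"treat": {"healthy": 0.60, "sick": 0.40},
                "wait":  {"healthy": 0.10, "sick": 0.90}},
}

# r[s] = reward for being in state s (r : S -> R, as on the slide)
r = {"healthy": 1.0, "sick": -1.0}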
13. Setup: Policies
Policies tell us which action to take in each state
π : S → A
Goal: choose a policy to maximize expected cumulative discounted reward:
Eπ[ Σ_{t=0}^∞ γ^t Rt ]
14. Setup: Value functions
Value functions tell us the long-term rewards we can expect under
a given policy, starting from a given state and/or action.
The “V-function” measures expected cumulative reward from a given state:
V^π(s) = Eπ[ Σ_{t=0}^∞ γ^t Rt | S0 = s ]
The “Q-function” measures expected cumulative reward from a given state and action:
Q^π(s, a) = Eπ[ Σ_{t=0}^∞ γ^t Rt | S0 = s, A0 = a ]
          = Σ_{s′∈S} [ r(s′) + γ V^π(s′) ] T(s′ | s, a)
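Continuing the toy MDP sketched earlier, V^π can be computed by iterating exactly this expectation until it stops changing (iterative policy evaluation). The fixed policy below is arbitrary, chosen only to have something to evaluate.

# Iterative policy evaluation for the toy MDP defined above.
# Repeatedly applies V(s) <- Σ_{s2} [ r(s2) + gamma * V(s2) ] * T(s2|s,a),
# with a = pi(s), until the values converge.
pi = {"healthy": "wait", "sick": "treat"}   # an arbitrary fixed policy

V = {s: 0.0 for s in S}
for _ in range(10_000):
    V_new = {s: sum(T[s][pi[s]][s2] * (r[s2] + gamma * V[s2]) for s2 in S)
             for s in S}
    if max(abs(V_new[s] - V[s]) for s in S) < 1e-8:
        V = V_new
        break
    V = V_new

print(V)   # long-run discounted value of each state under pi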
15. Basics and examples
16. Problem 1: Estimating optimal policy
Two ways of getting at the optimal policy π∗:
Try to improve π directly
Try to estimate Q^{π∗}
Example: Q-learning
Q_new(St, At) ← (1 − α) Q(St, At) + α [ Rt + γ max_a Q(St+1, a) ],
where α is the learning rate, 0 ≤ α ≤ 1.
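A tabular version of this update fits in a few lines. The sketch below assumes a hypothetical env with the reset()/step() interface used earlier, and uses ε-greedy exploration (previewing the next slide).

# Tabular Q-learning sketch implementing the update rule above.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)              # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy: mostly exploit, sometimes explore
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, reward, done = env.step(a)
            # Q(St,At) <- (1 - alpha) Q(St,At) + alpha [ Rt + gamma max_a Q(St+1,a) ]
            best_next = 0.0 if done else max(Q[(s2, act)] for act in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + gamma * best_next)
            s = s2
    return Q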
17. Problem 2: Exploration-exploitation tradeoff
Tradeoff between gaining information (exploration) and following the current estimate of the optimal policy (exploitation)
Restaurant example
Exploitation: Go to your favorite restaurant
Exploration: Try a new place
Need to balance both to maximize cumulative deliciousness
Different strategies:
Occasionally do something completely random (ε-greedy)
Act based on optimistic estimates of each action’s value
Sample an action according to its posterior probability of being optimal (Thompson sampling; sketched below)
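As one concrete instance of the last strategy, here is a minimal Thompson-sampling sketch for a bandit version of the restaurant example. The three restaurants and their success probabilities are invented; the agent keeps a Beta posterior per restaurant and acts greedily with respect to one posterior sample.

# Thompson sampling for a Bernoulli bandit: each restaurant serves a
# good meal (reward 1) with some probability unknown to the agent.
import random

true_p = [0.7, 0.5, 0.9]    # hypothetical true qualities, unknown to the agent
wins, losses = [0, 0, 0], [0, 0, 0]

for t in range(1000):
    # draw one plausible quality per restaurant from its Beta posterior
    samples = [random.betavariate(wins[k] + 1, losses[k] + 1)
               for k in range(len(true_p))]
    k = samples.index(max(samples))   # exploration and exploitation in one step
    if random.random() < true_p[k]:   # eat there and observe the outcome
        wins[k] += 1
    else:
        losses[k] += 1

print(wins, losses)   # pulls should concentrate on the best restaurant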
18. A small task to solve with RL (CartPole)
Action space A = {0, 1}, representing {left, right}
State space (S1, S2, S3, S4) ∈ R4, representing (position, velocity, angle, angular velocity)
Goal: keep the pole standing for 200 timesteps in each episode (too large an angle or too far a distance from center ends the episode)
Define Rt ∈ {−1, 1}
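The loop below runs CartPole with a random policy, assuming the Gymnasium API (pip install gymnasium); a learning agent would replace env.action_space.sample() with its own choice. Note that the stock CartPole-v1 gives +1 per surviving step, so the ±1 reward on the slide would be a custom redefinition.

# Random policy on CartPole, assuming the Gymnasium API.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)   # obs = (position, velocity, angle, angular velocity)
total_reward = 0.0
for t in range(200):
    action = env.action_space.sample()    # 0 = left, 1 = right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:           # pole fell over or cart left the track
        break
print(f"episode lasted {t + 1} steps, total reward {total_reward}")
env.close()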
20. Basics and examples
21. RL in Laber Labs
At Laber Labs we apply reinforcement learning to interesting and important real-world problems:
Controlling the spread of disease
Dynamic medical treatment
Education
Sports decision-making
22. Stopping the spread of disease
Figure 6: The spread of white-nose syndrome in bats, 2006–2014.
States: Which locations are infected
Actions: Locations to treat
Rewards: Number of uninfected locations
24. Dynamic medical treatment
Figure 8: RL can help us customize medical treatment to individual patients’ characteristics.
States: Current health status (exercise levels, food intake, blood pressure, blood sugar, many more)
Actions: Recommend treatment
Rewards: Health outcomes