# Practical Reinforcement Learning with TensorFlow

How to build reinforcement learning agents in TensorFlow, from Q-Learning to Policy Gradient and A3C.

### Practical Reinforcement Learning with TensorFlow

1. Practical RL with TensorFlow. Illia Polosukhin, XIX.ai
2. Reinforcement Learning Problem
3. OpenAI Gym
   - Library of environments: Control, Atari, Doom, etc.
   - Same API
   - Provides a way to share and compare results
   - https://gym.openai.com/
4. Acting in an Environment
5. Random Agent
6. Let’s review some theory
7. Markov Decision Process: MDP $\langle S, A, P, R, \gamma \rangle$
   - $S$: set of states
   - $A$: set of actions
   - $P(s, a, s')$: probability of transition
   - $R(s)$: reward function
   - $\gamma$: discount factor
   - Trace: $\{\langle s_0, a_0, r_0 \rangle, \dots, \langle s_n, a_n, r_n \rangle\}$
8. Definitions
   - Return: total discounted reward, $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
   - Policy: the agent’s behavior
     - Deterministic policy: $\pi(s) = a$
     - Stochastic policy: $\pi(a \mid s) = P[A_t = a \mid S_t = s]$
   - Value function: expected return starting from state $s$
     - State-value function: $V^{\pi}(s) = E_{\pi}[R_t \mid S_t = s]$
     - Action-value function: $Q^{\pi}(s, a) = E_{\pi}[R_t \mid S_t = s, A_t = a]$
9. Deep Q Learning
   - Model-free, off-policy technique to learn the optimal $Q(s, a)$:
     - $Q_{i+1}(s, a) \leftarrow Q_i(s, a) + \alpha \left( r + \gamma \max_{a'} Q_i(s', a') - Q_i(s, a) \right)$
     - The optimal policy is then $\pi(s) = \arg\max_{a'} Q(s, a')$
   - Requires exploration (ε-greedy) to explore various transitions from the states.
     - Take a random action with probability ε; start ε high and decay it to a low value as training progresses.
   - Deep Q Learning: approximate $Q(s, a)$ with a neural network $Q(s, a, \theta)$
     - Do stochastic gradient descent using the loss, the squared difference between the target $r + \gamma \max_{a'} Q(s', a', \theta)$ and the prediction $Q(s, a, \theta)$
10. Q-network
11. Run Optimization. Full example: https://github.com/ilblackdragon/tensorflow-rl/blob/master/examples/atari-rl.py
12. Monitored Session
   - Handles pitfalls of distributed training.
   - Saving and restoring checkpoints.
   - Hooks are a general interface for injecting computation into the TensorFlow training loop.
13. Original Results on Atari Games (Mnih et al., 2013)
14. Beating Human Level (Mnih et al., 2015)
15. Policy Gradient
   - Given a policy $\pi_{\theta}(a \mid s)$, find the $\theta$ that maximizes the expected return: $J(\theta) = \sum_{s} d^{\pi}(s) V(s)$
   - In Deep RL, we approximate $\pi_{\theta}(a \mid s)$ with a neural network.
     - Usually with a softmax layer on top to estimate the probability of each action.
   - We can estimate $J(\theta)$ from samples of observed behavior: $\sum_{k=0}^{T} p_{\theta}(\tau_k \mid \pi) R(\tau_k)$
   - Do stochastic gradient descent using the update: $\theta_{i+1} = \theta_i + \alpha \frac{1}{T} \sum_{k=0}^{T} \nabla \log p_{\theta}(\tau_k \mid \pi) R(\tau_k)$
16. Policy Network
17. Run Optimization
18. Async Advantage Actor-Critic (A3C)
   - Asynchronous: uses multiple instances of environments and networks.
   - Actor-Critic: uses both a policy and an estimate of the value function.
   - Advantage: estimates how different the outcome was from what was expected.
   - Image by Arthur Juliani
19. Policy and Value Networks
20. Run optimization
21. A3C Results on Atari Games (Mnih et al., 2016)
22. Mnih et al., 2016
23. Practical use cases
   - Robotics
   - Finance
   - Industrial optimization
   - Predictive assistant
24. Illia Polosukhin, XIX.ai, @ilblackdragon, illia@xix.ai. Questions? Full code will be available soon at https://github.com/ilblackdragon/tensorflow-rl/

### Editor's notes

• Let’s start by defining a problem that we are trying to solve.

...

Agents divide into model-based and model-free agents.
A model-based agent tries to simulate the environment internally and makes decisions based on that simulation.
A model-free agent just takes an observation and chooses an action.

This is interesting because it is very close to how animals and people learn: from limited feedback given by the environment or a teacher. Animals get positive reinforcement when developing reflexes, and children get positive or negative reinforcement from their parents for their behaviour.
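The observe, act, receive-reward loop described above can be sketched in a few lines. This is a minimal illustration only: `CoinFlipEnv` is a hypothetical toy environment written for this example, and its `reset`/`step` methods merely imitate the style of the Gym interface rather than use the real library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment with a Gym-style reset/step interface."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0
        self.coin = 0

    def reset(self):
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin  # observation: the current coin face

    def step(self, action):
        # Reward 1 when the action matches the coin face that was shown.
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)
        done = self.t >= self.episode_length
        return self.coin, reward, done, {}


def run_random_agent(env, rng):
    """Run one episode with an agent that ignores observations."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = rng.choice([0, 1])  # random agent: no learning at all
        observation, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward


env = CoinFlipEnv(episode_length=10, seed=0)
episode_reward = run_random_agent(env, random.Random(1))
print(episode_reward)
```

A learning agent would replace the `rng.choice` line with a decision based on `observation`; everything else in the loop stays the same.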
• Let’s review some theory around RL.

The set of states and actions, together with rules for transitioning from one state to another, make up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards.

An additional term: the sequence [(s, a), ...] visited during an episode is called a trajectory.
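As a small worked example of the total discounted reward of one episode, the sketch below sums the rewards of a trace of (state, action, reward) triples. The trace values and the discount factor of 0.5 are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Total discounted reward: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    total = 0.0
    # Accumulate from the last reward backwards so each step is one multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical trace of (state, action, reward) triples from one episode.
trace = [("s0", "a0", 1.0), ("s1", "a1", 0.0), ("s2", "a2", 2.0)]
rewards = [r for (_, _, r) in trace]
episode_return = discounted_return(rewards, gamma=0.5)
print(episode_return)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```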
• Model-free means there is no MDP approximation or learning inside the agent.
Observations are stored in a replay buffer and used as training data for the model.

Off-policy means that learning the optimal policy is independent of the actions the agent actually takes.
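A tabular version of the Q-learning update makes the off-policy property visible: the target uses the maximum over next actions, regardless of which action the agent takes next. This is a minimal sketch with made-up states and actions, not the deck's TensorFlow code; the helper name `q_learning_update` is an assumption, and terminal states are not handled.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one Q-learning step in place and return the TD error."""
    # Off-policy target: best next action, whatever the agent does next.
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


q = defaultdict(float)  # unseen (state, action) pairs start at 0
actions = ["left", "right"]
q[("s1", "right")] = 2.0  # pretend this value was learned earlier

td = q_learning_update(
    q, "s0", "left", reward=1.0, next_state="s1",
    actions=actions, alpha=0.5, gamma=0.9,
)
print(td, q[("s0", "left")])  # roughly 2.8 and 1.4
```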

Because the greedy policy is deterministic, we force the agent to explore by taking a random action with probability ε, where ε starts high and slowly decays as training progresses.
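The exploration rule can be sketched as follows. The linear decay schedule and its default values (start 1.0, end 0.05, 1000 steps) are assumptions chosen for illustration; the notes only say that ε starts high and decays.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Linearly anneal epsilon from start to end, then hold it at end."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # hypothetical Q(s, a) for three actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
explore_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)
print(greedy_action, decayed_epsilon(0), decayed_epsilon(500))
```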

For example, an Atari game has a huge number of possible states (the number of colors raised to the power of the number of pixels).
E.g. Breakout with an 84x84 pixel screen and 256 colors has 256^(84*84) possible screen configurations.
It would take a very long time to even visit each state once. Instead, approximate Q with a neural network, which can learn how to deal with states based on their similarity.
Deep Q Learning, popularized by DeepMind, was the first Deep RL model that worked.
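To get a feel for the size of that number, the arithmetic can be written out. It counts raw screen configurations only, which is an assumption about what "state" means here.

```latex
256^{84 \times 84} = 256^{7056} = \left(2^{8}\right)^{7056} = 2^{56448} \approx 10^{16992.5}
```

A table with one entry per state is therefore out of the question, which is the motivation for a function approximator.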
• Expected return can be defined in a few ways.
One way is to define it as the sum of the state-value function over all states, each weighted by how often we end up in that state under the current policy (this weighting is also called the stationary distribution).

This can be estimated from observations, i.e. trajectories, as the sum over trajectories of the probability of each trajectory under the policy multiplied by the reward from that trajectory.
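The sample-based estimate above leads to the usual score-function update: increase the log-probability of sampled behavior in proportion to the reward it earned. The sketch below applies that update to a softmax policy on a two-armed bandit, where each "trajectory" is a single action. The bandit, the learning rate, and the function names are assumptions for illustration; the deck's own policy network is a neural network in TensorFlow.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, alpha=0.1, seed=0):
    """Train a softmax policy with one-sample policy-gradient updates."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)  # one logit per action
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        ret = arm_rewards[action]  # return of this one-step trajectory
        for j in range(len(theta)):
            # d/d theta_j of log pi(action) for a softmax policy.
            grad_log = (1.0 if j == action else 0.0) - probs[j]
            theta[j] += alpha * grad_log * ret
    return softmax(theta)


# Hypothetical bandit: arm 1 always pays 1, arm 0 pays nothing.
final_probs = reinforce_bandit(arm_rewards=[0.0, 1.0])
print(final_probs)
```

After training, the policy should put most of its probability on the rewarding arm.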
• Asynchronous: Unlike DQN, where a single agent represented by a single neural network interacts with a single environment, A3C utilizes multiple incarnations of the above in order to learn more efficiently. In A3C there is a global network, and multiple worker agents which each have their own set of network parameters. Each of these agents interacts with its own copy of the environment at the same time as the other agents are interacting with their environments. The reason this works better than having a single agent (beyond the speedup of getting more work done) is that the experience of each agent is independent of the experience of the others. In this way the overall experience available for training becomes more diverse.
Actor-Critic: Actor-Critic combines the benefits of both approaches (value-based methods such as Q-learning and policy-based methods such as policy gradient). In the case of A3C, our network will estimate both a value function V(s) (how good a certain state is to be in) and a policy π(s) (a set of action probability outputs). These will each be separate fully-connected layers sitting at the top of the network. Critically, the agent uses the value estimate (the critic) to update the policy (the actor) more intelligently than traditional policy gradient methods.

The insight of using advantage estimates rather than just discounted returns is to allow the agent to determine not just how good its actions were, but how much better they turned out to be than expected.
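In its simplest form, the advantage is the discounted return actually observed from a state minus the critic's value estimate for that state. The sketch below uses made-up rewards and value estimates; A3C implementations typically use bootstrapped n-step returns instead, so treat this Monte Carlo variant as an illustrative assumption.

```python
def discounted_returns(rewards, gamma):
    """Discounted return from every time step to the end of the episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, gamma):
    """Advantage at each step: observed return minus the critic's estimate."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]  # hypothetical critic estimates V(s_t)
adv = advantages(rewards, values, gamma=0.5)
print(adv)  # [-0.25, 0.0, 0.5]
```

A positive advantage means the outcome was better than the critic expected, so the action taken gets reinforced; a negative one pushes its probability down.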
• Mean and median human-normalized scores on 57 Atari games using the human starts evaluation metric.

D-DQN - double DQN.
A3C paper - https://arxiv.org/pdf/1602.01783.pdf