Diese Präsentation wurde erfolgreich gemeldet.
Die SlideShare-Präsentation wird heruntergeladen. ×

# Dueling network architectures for deep reinforcement learning

Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Wird geladen in …3
×

1 von 30 Anzeige

# Dueling network architectures for deep reinforcement learning

Dueling network architectures for deep reinforcement learning

Dueling network architectures for deep reinforcement learning

Anzeige
Anzeige

## Weitere Verwandte Inhalte

Anzeige

Anzeige

### Dueling network architectures for deep reinforcement learning

1. 1. Dueling Network Architectures for Deep Reinforcement Learning 2016-06-28 Taehoon Kim
2. 2. Motivation • Recent advances • Design improved control and RL algorithms • Incorporate existing NN into RL methods • We, • focus on innovating a NN that is better suited for model-free RL • Separate • the representation of state value • (state-dependent) action advantages 2
3. 3. Overview 3 state value function advantage function sharing convolutional feature learning module aggregating layer state-action value function
4. 4. Dueling network • Single Q network with two streams • Produce separate estimations of state value func and advantage func • without any extra supervision • which states are valuable? • without having to learn the effect of each action for each state 4
5. 5. Saliency map on the Atari game Enduro 5 1. Focus on horizon, where new cars appear 2. Focus on the score Not pay much attention when there are no cars in front Attention on car immediately in front making its choice of action very relevant
6. 6. Definitions • Value 𝑉(𝑠), how good it is to be in particular state 𝑠 • Advantage 𝐴(𝑠, 𝑎) • Policy 𝜋 • Return 𝑅* = ∑ 𝛾./* 𝑟. ∞ .1* , where 𝛾 ∈ [0,1] • Q function 𝑄8 𝑠, 𝑎 = 𝔼 𝑅* 𝑠* = 𝑠, 𝑎* = 𝑎, 𝜋 • State-value function 𝑉8 𝑠 = 𝔼:~8(<)[𝑄8 𝑠, 𝑎 ] 6
7. 7. Bellman equation • Recursively with dynamic programming • 𝑄8 𝑠, 𝑎 = 𝔼<` 𝑟 + 𝛾𝔼:`~8(<`) 𝑄8 𝑠`, 𝑎` |𝑠, 𝑎, 𝜋 • Optimal Q∗ 𝑠, 𝑎 = max 8 𝑄8 𝑠, 𝑎 • Deterministic policy a = arg max :`∈𝒜 Q∗ 𝑠, 𝑎` • Optimal V∗ 𝑠 = max : 𝑄∗ 𝑠, 𝑎 • Bellman equation Q∗ 𝑠, 𝑎 = 𝔼<` 𝑟 + 𝛾 max :` 𝑄∗ 𝑠`, 𝑎` |𝑠, 𝑎 7
8. 8. Advantage function • Bellman equation Q∗ 𝑠, 𝑎 = 𝔼<` 𝑟 + 𝛾 max :` 𝑄∗ 𝑠`, 𝑎` |𝑠, 𝑎 • Advantage function A8 𝑠, 𝑎 = 𝑄8 𝑠, 𝑎 − 𝑉8 (𝑠) • 𝔼:~8(<`) 𝐴8 𝑠, 𝑎 = 0 8
9. 9. Advantage function • Value 𝑉(𝑠), how good it is to be in particular state 𝑠 • Q(𝑠, 𝑎), the value of choosing a particular action 𝑎 when in state 𝑠 • A = 𝑉 − 𝑄 to obtain a relative measure of importance of each action 9
10. 10. Deep Q-network (DQN) • Model Free • states and rewards are produced by the environment • Off policy • states and rewards are obtained with a behavior policy (epsilon greedy) • different from the online policy that is being learned 10
11. 11. Deep Q-network: 1) Target network • Deep Q-network 𝑄 𝑠, 𝑎; 𝜽 • Target network 𝑄 𝑠, 𝑎; 𝜽/ • 𝐿O 𝜃O = 𝔼<,:,Q,<` 𝑦O STU − 𝑄 𝑠, 𝑎; 𝜽𝒊 W • 𝑦O STU = 𝑟 + 𝛾 max :` 𝑄 𝑠`, 𝑎`; 𝜽/ • Freeze parameters for a fixed number of iterations • 𝛻YZ 𝐿O 𝜃O = 𝔼<,:,Q,<` 𝑦O STU − 𝑄 𝑠, 𝑎; 𝜽𝒊 𝛻YZ 𝑄 𝑠, 𝑎; 𝜽𝒊 11 𝑠` 𝑠
12. 12. Deep Q-network: 2) Experience memory • Experience 𝑒* = (𝑠*, 𝑎*, 𝑟*, 𝑠*]) • Accumulates a dataset 𝒟* = 𝑒], 𝑒W, … , 𝑒* • 𝐿O 𝜃O = 𝔼 <,:,Q,<` ~𝒰(𝒟) 𝑦O STU − 𝑄 𝑠, 𝑎; 𝜽𝒊 W 12
13. 13. Double Deep Q-network (DDQN) • In DQN • the max operator uses the same values to both select and evaluate an action • lead to overoptimistic value estimates • 𝑦O STU = 𝑟 + 𝛾 max :` 𝑄 𝑠`, 𝑎`; 𝜽/ • To migrate this problem, in DDQN • 𝑦O SSTU = 𝑟 + 𝛾𝑄 𝑠`,arg max :` 𝑄(𝑠`, 𝑎`; 𝜃O); 𝜽/ 13
14. 14. Prioritized Replay (Schaul et al., 2016) • To increase the replay probability of experience tuples • that have a high expected learning progress • use importance sampling weight measured via the proxy of absolute TD-error • sampling transitions with high absolute TD-errors • Led to faster learning and to better final policy quality 14
15. 15. Dueling Network Architecture : Key insight • For many states • unnecessary to estimate the value of each action choice • For example, move left or right only matters when a collision is eminent • In most of states, the choice of action has no affect on what happens • For bootstrapping based algorithm • the estimation of state value is of great importance for every state • bootstrapping : update estimates on the basis of other estimates. 15
16. 16. Formulation • A8 𝑠, 𝑎 = 𝑄8 𝑠, 𝑎 − 𝑉8 (𝑠) • 𝑉8 𝑠 = 𝔼:~8(<) 𝑄8 𝑠, 𝑎 • A8 𝑠, 𝑎 = 𝑄8 𝑠, 𝑎 − 𝔼:~8(<) 𝑄8 𝑠, 𝑎 • 𝔼:~8(<) 𝐴8 𝑠, 𝑎 = 0 • For a deterministic policy, 𝑎∗ = argmax a`∈𝒜 𝑄 𝑠, 𝑎` • 𝑄 𝑠, 𝑎∗ = 𝑉(𝑠) and 𝐴 𝑠, 𝑎∗ = 0 16
17. 17. Formulation • Dueling network = CNN + fully-connected layers that output • a scalar 𝑉 𝑠; 𝜃, 𝛽 • an 𝒜 -dimensional vector 𝐴(𝑠, 𝑎; 𝜃, 𝛼) • Tempt to construct the aggregating module • 𝑄 𝑠, 𝑎; 𝜃, 𝛼, 𝛽 = 𝑉 𝑠; 𝜃, 𝛽 + 𝐴(𝑠, 𝑎; 𝜃, 𝛼) 17
18. 18. Aggregation module 1: simple add • But 𝑄 𝑠, 𝑎; 𝜃, 𝛼, 𝛽 is only a parameterized estimate of the true Q- function • Unidentifiable • Given 𝑄, 𝑉 and 𝐴 can’t uniquely be recovered • We force the 𝐴 to have zero at the chosen action • 𝑄 𝑠, 𝑎; 𝜃, 𝛼, 𝛽 = 𝑉 𝑠; 𝜃, 𝛽 + 𝐴 𝑠, 𝑎; 𝜃, 𝛼 − max :`∈𝒜 𝐴(𝑠, 𝑎`; 𝜃, 𝛼) 18
19. 19. Aggregation module 2: subtract max • For 𝑎∗ = argmax a`∈𝒜 𝑄(𝑠, 𝑎`; 𝜃, 𝛼, 𝛽) = argmax a`∈𝒜 𝐴 𝑠, 𝑎; 𝜃, 𝛼 • obtain 𝑄 𝑠, 𝑎∗ ; 𝜃, 𝛼, 𝛽 = 𝑉 𝑠; 𝜃, 𝛽 • 𝑄 𝑠, 𝑎∗ = 𝑉(𝑠) • An alternative module replace max operator with an average • 𝑄 𝑠, 𝑎; 𝜃, 𝛼, 𝛽 = 𝑉 𝑠; 𝜃, 𝛽 + 𝐴 𝑠, 𝑎; 𝜃, 𝛼 − ] 𝒜 ∑ 𝐴(𝑠, 𝑎`; 𝜃, 𝛼):` 19
20. 20. Aggregation module 3: subtract average • An alternative module replace max operator with an average • 𝑄 𝑠, 𝑎; 𝜃, 𝛼, 𝛽 = 𝑉 𝑠; 𝜃, 𝛽 + 𝐴 𝑠, 𝑎; 𝜃, 𝛼 − ] 𝒜 ∑ 𝐴(𝑠, 𝑎`; 𝜃, 𝛼):` • Now loses the original semantics of 𝑉 and 𝐴 • because now off-target by a constant, ] 𝒜 ∑ 𝐴(𝑠, 𝑎`; 𝜃, 𝛼):` • But increases the stability of the optimization • 𝐴 only need to change as fast as the mean • Instead of having to compensate any change to the optimal action’s advantage 20 max :`∈𝒜 𝐴(𝑠, 𝑎`; 𝜃, 𝛼)
21. 21. Aggregation module 3: subtract average • Subtracting mean is the best • helps identifiability • not change the relative rank of 𝐴 (and hence Q) • Aggregation module is a part of the network not a algorithmic step • training of dueling network requires only back-propagation 21
22. 22. Compatibility • Because the output of dueling network is Q function • DQN • DDQN • SARSA • On-policy, off-policy, whatever 22
23. 23. Definition: Generalized policy iteration 23
24. 24. Experiments: Policy evaluation • Useful for evaluating network architecture • devoid of confounding factors such as choice of exploration strategy, and interaction between policy improvement and policy evaluation • In experiment, employ temporal difference learning • optimizing 𝑦O = 𝑟 + 𝛾𝔼:`~8(<`) 𝑄 𝑠`, 𝑎`, 𝜃O • Corridor environment • exact 𝑄8 (𝑠, 𝑎) can be computed separately for all 𝑠, 𝑎 ∈ 𝒮×𝒜 24
25. 25. Experiments: Policy evaluation • Test for 5, 10, and 20 actions (first tackled by DDQN) • The stream 𝑽 𝒔; 𝜽, 𝜷 learn a general value shared across many similar actions at 𝑠 • Hence leading to faster convergence 25 Performance gap increasing with the number of actions
26. 26. Experiments: General Atari Game-Playing • Similar to DQN (Mnih et al., 2015) and add fully-connected layers • Rescale the combined gradient entering the last convolutional layer by 1/ 2, which mildly increases stability • Clipped gradients to their norm less than or equal to 10 • clipping is not a standard practice in RL 26
27. 27. Performance: Up-to 30 no-op random start • Duel Clip > Single Clip > Single • Good job Dueling network 27
28. 28. Performance: Human start 28 • Not necessarily have to generalize well to play the Atari games • Can achieve good performance by simply remembering sequences of actions • To obtain a more robust measure that use 100 starting points sampled from a human expert’s trajectory • from each starting points, evaluate up to 108,000 frames • again, good job Dueling network
29. 29. Combining with Prioritized Experience Replay • Prioritization and the dueling architecture address very different aspects of the learning process • Although orthogonal in their objectives, these extensions (prioritization, dueling and gradient clipping) interact in subtle ways • Prioritization interacts with gradient clipping • Sampling transitions with high absolute TD-errors more often leads to gradients with higher norms so re-tuned the hyper parameters 29
30. 30. References 1. [Wang, 2015] Wang, Z., de Freitas, N., & Lanctot, M. (2015). Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581. 2. [Van, 2015] Van Hasselt, H., Guez, A., & Silver, D. (2015). Deep reinforcement learning with double Q-learning. CoRR, abs/1509.06461. 3. [Schaul, 2015] Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952. 4. [Sutton, 1998] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction(Vol. 1, No. 1). Cambridge: MIT press. 30