
# Discrete sequential prediction of continuous actions for deep RL


A value-based RL method for continuous-action problems, proposed by Google Brain.

### Discrete sequential prediction of continuous actions for deep RL

1. Discrete Sequential Prediction of Continuous Actions for Deep RL. Luke Metz*, Julian Ibarz, James Davidson (Google Brain); Navdeep Jaitly (NVIDIA Research). Presented by Jie-Han Chen. (Under review as a conference paper at ICLR 2018.)
2. My current challenge: the action space of pysc2 is complicated, and different types of actions need different parameters.
3. Outline ● Introduction ● Method ● Experiments ● Discussion
4. Introduction. There are two kinds of action space: ● discrete action space ● continuous action space
5. Introduction - Continuous action space (figure: six example actions, action 1 through action 6)
6. Introduction - Continuous action space. Continuous actions can be handled well by policy-gradient-based algorithms (NOT value-based ones).
7. Introduction - Discrete action space
8. Introduction - Discrete action space. The network outputs one value per action: a1 with Q(s, a1), a2 with Q(s, a2), a3 with Q(s, a3), a4 with Q(s, a4).
9. Introduction - Discretized continuous actions. If we want to use a discrete-action method on a continuous-action problem, we need to discretize the continuous action values.
10. Introduction - Discretized continuous actions. If we split a continuous angle into 3.6° steps (0°, 3.6°, 7.2°, ...), a 1-D action already needs 100 output neurons!
11. Introduction - Discretized continuous actions. For a 2-D action, we need 10,000 neurons to cover all combinations!
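The blow-up on this slide is easy to quantify: joint discretization needs one output per bin *combination*, so the output layer grows exponentially with the number of action dimensions. A minimal sketch (the function name is illustrative, not from the paper):

```python
def num_joint_outputs(bins_per_dim: int, action_dims: int) -> int:
    """Naive joint discretization: one output neuron per bin combination."""
    return bins_per_dim ** action_dims
```

With 3.6° bins (100 per dimension), a 1-D action needs 100 outputs and a 2-D action already needs 10,000, matching the slides.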
12. Introduction. In this paper, the authors focus on: ● off-policy algorithms ● value-based methods (which usually cannot solve continuous-action problems). They want to transform DQN so that it can solve continuous-action problems.
13. Method ● Inspired by the sequence-to-sequence model ● They call the method SDQN (S for sequential) ● It outputs a 1-D action at each step ○ reducing N-D action selection to a series of 1-D action-selection problems
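The sequential idea can be sketched in a few lines. This is a hypothetical sketch, not the paper's code: `q_fn` stands in for the per-dimension inner Q network, and `BINS` for the per-dimension discretization. The point is that each step scores only `len(BINS)` candidates, instead of `len(BINS) ** N` joint combinations.

```python
# Hypothetical sketch of SDQN-style action selection: an N-dimensional
# continuous action is built one discretized dimension at a time.
BINS = [round(i * 0.2 - 1.0, 10) for i in range(11)]  # 11 bins in [-1, 1]

def select_action(state, q_fn, action_dims):
    partial = []
    for i in range(action_dims):
        # Greedily pick the bin maximizing the inner Q for dimension i,
        # conditioned on the state and the dimensions chosen so far.
        best = max(BINS, key=lambda b: q_fn(state, partial + [b], i))
        partial.append(best)
    return partial
```

Each call to `q_fn` conditions on the previously chosen dimensions, which is exactly what makes the decomposition a sequence prediction problem.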
14. Method - seq2seq
15. Method - seq2seq
16. Method - seq2seq (figure: actions a1, a2, a3 are predicted sequentially; the inputs St, St + a1, and St + a1 + a2 produce Q1, Q2, Q3)
17. Does it make sense?
18. Define the agent-environment boundary. Before defining the set of states, we should define the boundary between agent and environment. According to Richard Sutton's textbook: 1. "The agent-environment boundary represents the limit of the agent's absolute control, not of its knowledge." 2. "The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment."
19. Method - seq2seq (same figure as slide 16, now with each sequential Q-step marked as an agent)
20. Method - transformed MDP. The original MDP is transformed into an inner MDP plus an outer MDP (inputs: St, then St + a1, then St + a1 + a2).
21. Method - transformed MDP (figure: inputs St, St + a1, St + a1 + a2)
22. Method - Action Selection
23. Method - Training. There are 3 update stages: ● the outer MDP (St -> St+1) ● the inner Q, which is updated by Q-learning ● the last inner Q, which needs to match Q(St+1)
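The three update stages can be sketched as targets for a squared-error loss. This is a minimal sketch under stated assumptions: function and argument names are illustrative, and the inner MDP is treated as reward-free with discount 1, which is the natural choice for this kind of action decomposition.

```python
def outer_target(r, gamma, q_next, done):
    """Outer MDP update: the standard Bellman target r + gamma * Q(s', a')."""
    return r if done else r + gamma * q_next

def inner_target(next_dim_q_values):
    """Inner Q-learning update: bootstrap from the best value available at
    the next action dimension (no reward inside the inner MDP)."""
    return max(next_dim_q_values)

def last_inner_target(q_outer_value):
    """The last inner Q is regressed directly onto the outer Q estimate."""
    return q_outer_value
```

Each network is then trained to minimize the squared difference between its prediction and the corresponding target.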
28. Method - Training. There are 2 kinds of neural network (possibly not literally 2, depending on the implementation): ● an outer Q network ● inner Q networks, one per action dimension, where the i-th gives that dimension's action value and the last inner Q is treated specially
30. Method - Training. Updating the outer Q network: ○ it is only used to evaluate state-action values, not to select actions ○ it is updated via the Bellman equation
31. Method - Training. Updating the inner Q networks: ○ the i-th network gives the i-th dimension's action value ○ it is updated by Q-learning
32. Method - Training. Updating the last inner Q network: ○ it is updated by regression
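Pulling slides 30-32 together, a plausible rendering of the three updates as equations (the slides' own formulas did not survive extraction, so the notation here is assumed: Q for the outer network, Q^i for the inner network at dimension i, and N action dimensions):

```latex
\begin{aligned}
\text{outer (Bellman):}\quad & Q(s_t, a_t) \leftarrow r_t + \gamma \max_{a'} Q(s_{t+1}, a') \\
\text{inner (Q-learning):}\quad & Q^i(s_t, a_{1:i}) \leftarrow \max_{a_{i+1}} Q^{i+1}(s_t, a_{1:i+1}) \\
\text{last inner (regression):}\quad & Q^N(s_t, a_{1:N}) \leftarrow Q(s_t, a_t)
\end{aligned}
```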
33. Implementation of the inner Q network: 1. A recurrent model with shared weights, using an LSTM a. input: state + the previously selected action 2. Multiple separate feedforward models a. input: state + the concatenated selected actions b. more stable than the recurrent variant
34. Method - Exploration
35. Experiments ● Multimodal example environment ○ compared with other state-of-the-art models to test its effectiveness ○ DDPG: a state-of-the-art off-policy actor-critic algorithm ○ NAF: another value-based algorithm that can solve continuous-action problems ● MuJoCo environments ○ testing SDQN on common continuous-control tasks ○ 5 tasks
36. Experiments - Multimodal Example Environment 1. Single-step MDP a. only 2 states: the initial state and the terminal state 2. Deterministic environment a. fixed transitions 3. 2-D action space (2 continuous actions) 4. Multimodal reward function a. used to test whether the algorithm converges to a local optimum or the global optimum
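An environment with these four properties fits in a few lines. This is an entirely hypothetical sketch (the paper's actual reward surface differs); two Gaussian bumps simply create one local and one global optimum:

```python
import math

def multimodal_reward(a1: float, a2: float) -> float:
    """2-D continuous action -> reward with two modes: a local optimum
    near (-0.5, -0.5) and a global optimum near (0.5, 0.5)."""
    local = 0.5 * math.exp(-8.0 * ((a1 + 0.5) ** 2 + (a2 + 0.5) ** 2))
    best = 1.0 * math.exp(-8.0 * ((a1 - 0.5) ** 2 + (a2 - 0.5) ** 2))
    return local + best

def step(action):
    """Single-step, deterministic MDP: every action ends the episode."""
    a1, a2 = action
    return multimodal_reward(a1, a2), True  # (reward, done)
```

An algorithm that latches onto the first mode it discovers can get stuck on the 0.5-reward bump; the experiment checks whether the method instead finds the 1.0-reward global mode.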
37. Experiments - Multimodal Example Environment (figures: reward surface and final policy)
38. Experiments - Multimodal Example Environment
39. Experiments - MuJoCo environments ● hopper (3-D action) ● swimmer (2-D) ● half cheetah (6-D) ● walker2D (6-D) ● humanoid (17-D)
41. Experiments - MuJoCo environments ● performed a hyperparameter search and selected the best configuration to evaluate performance ● ran 10 random seeds for each environment
42. Experiments - MuJoCo environments
43. Experiments - MuJoCo environments. Trained for 2M steps; the reported values are the average best performance over 10 random seeds.
44. Recap: DeepMind pysc2 - the network architecture
45. Recap: DeepMind pysc2
46. Discussion