Discrete sequential prediction of continuous actions for deep RL

  1. Discrete Sequential Prediction of Continuous Actions for Deep RL Luke Metz∗, Julian Ibarz, James Davidson - Google Brain Navdeep Jaitly - NVIDIA Research presented by Jie-Han Chen (under review as a conference paper at ICLR 2018)
  2. My current challenge The action space of pysc2 is complicated: different types of actions need different parameters.
  3. Outline ● Introduction ● Method ● Experiments ● Discussion
  4. Introduction Two kinds of action space: ● Discrete action space ● Continuous action space
  5. Introduction - Continuous action space (figure: example actions 1-6 illustrating a continuous action space)
  6. Introduction - Continuous action space This can be handled well by policy-gradient-based algorithms (NOT value-based ones). (figure: actions a1, a2, a3)
  7. Introduction - Discrete action space
  8. Introduction - Discrete action space (figure: a1, Q(s, a1); a2, Q(s, a2); a3, Q(s, a3); a4, Q(s, a4))
  9. Introduction - Discretized continuous actions If we want to use a discrete-action method on a continuous-action problem, we need to discretize the continuous action values.
  10. Introduction - Discretized continuous actions If we split a continuous angle into 3.6° bins, a 1-D action already needs 100 output neurons! (figure: 0°, 3.6°, 7.2°, …)
  11. Introduction - Discretized continuous actions For a 2-D action, we need 10,000 neurons to cover all combinations.
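To make the combinatorial blow-up concrete, here is a tiny illustrative sketch (the bin count of 100 comes from the 3.6° example above; the dimensionalities are just examples):

```python
# Illustrative sketch: a network that outputs one value per *joint* discretized
# action needs B**N outputs for an N-D action with B bins per dimension.
bins_per_dim = 100  # e.g. splitting an angle into 3.6-degree bins

for n_dims in (1, 2, 6, 17):
    joint_outputs = bins_per_dim ** n_dims
    print(f"{n_dims}-D action -> {joint_outputs:,} joint outputs")

# 1-D  -> 100
# 2-D  -> 10,000
# 6-D  -> 1,000,000,000,000
# 17-D -> 10**34  (hopeless to enumerate, which motivates a sequential approach)
```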
  12. Introduction In this paper, they focus on: ● Off-policy algorithms ● Value-based methods (which usually cannot solve continuous-action problems) They want to extend DQN so it can solve continuous-action problems.
  13. Method ● Inspired by sequence-to-sequence models ● They call this method SDQN (S for sequential) ● Output one 1-D action at each step ○ this reduces N-D action selection to a series of 1-D action-selection problems (see the decomposition below)
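One way to write the idea behind this decomposition (notation assumed here rather than copied from the paper): the maximization over the full N-D discretized action is unrolled into nested 1-D maximizations,

```latex
\max_{a} Q(s, a)
  = \max_{a_1} \max_{a_2} \cdots \max_{a_N} Q\bigl(s, (a_1, \dots, a_N)\bigr),
\qquad
Q^{i}(s, a_{1:i}) := \max_{a_{i+1:N}} Q\bigl(s, (a_{1:i}, a_{i+1:N})\bigr),
```

so that choosing the next dimension only requires an argmax over the bins of a single 1-D action, given the prefix of dimensions already selected.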
  14. Method - seq2seq
  15. Method - seq2seq
  16. Method - seq2seq (figure: given inputs St, St + a1, St + a1 + a2, the network outputs a1, a2, a3 with values Q1, Q2, Q3)
  17. Does it make sense?
  18. Define the agent-environment boundary Before defining the set of states, we should define the boundary between agent and environment. According to Richard Sutton's textbook: 1. "The agent-environment boundary represents the limit of the agent's absolute control, not of its knowledge." 2. "The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment."
  19. Method - seq2seq (figure: the same sequential diagram, with the per-dimension Q-steps marked as agents)
  20. Method - transformed MDP Original MDP -> Inner MDP + Outer MDP (figure: the inner steps take inputs St, St + a1, St + a1 + a2)
  21. Method - transformed MDP (figure: the inner steps take inputs St, St + a1, St + a1 + a2)
  22. Method - Action Selection
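As a rough sketch of what greedy sequential action selection might look like (the interface `inner_q[i](state, prefix)`, returning one Q-value per bin of the i-th dimension, is assumed for illustration and is not the authors' code):

```python
import numpy as np

def select_action(state, inner_q, num_bins=51, low=-1.0, high=1.0):
    """Greedy sequential selection: pick one discretized 1-D action at a time.

    inner_q[i](state, prefix) is assumed to return a length-num_bins array of
    Q-values for the i-th action dimension, conditioned on the already chosen
    prefix of earlier dimensions.
    """
    bin_values = np.linspace(low, high, num_bins)  # map bins back to [low, high]
    chosen = []
    for q_i in inner_q:                            # one pass per action dimension
        q_vals = q_i(state, np.array(chosen))      # Q over this dimension's bins
        chosen.append(float(bin_values[np.argmax(q_vals)]))
    return np.array(chosen)                        # full N-D continuous action
```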
  23-27. Method - Training There are 3 update stages (built up across slides 23-27): ● Outer MDP (St -> St+1) ● The inner Q is updated by Q-Learning ● The last inner Q needs to match Q(St+1)
  28-29. Method - Training There are 2 kinds of neural network (possibly not literally 2, depending on the implementation): ● an outer Q network ● inner Q networks, one giving the i-th dimension's action value, plus a separately denoted last inner Q
  30. Method - Training Update the outer Q network ● The outer Q network ○ is only used to evaluate state-action values, not to select actions ○ is updated via the Bellman equation
  31. Method - Training Update the inner Q networks ● The inner Q network for the i-th action dimension ○ gives the i-th dimension's action value ○ is updated by Q-Learning
  32. Method - Training Update the last inner Q network ● The last inner Q network ○ is updated by regression
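The update equations on slides 30-32 were shown as images; the sketch below is one hedged reconstruction of the three losses, with all notation assumed (outer network Q, inner networks Q^1 … Q^N, transition (s_t, a_t, r_t, s_{t+1}), a_t = (a_1, …, a_N)) rather than the authors' exact formulation:

```latex
% Outer Q: Bellman regression on sampled transitions; the max over a' is
% computed with the greedy sequential selection procedure.
L_{\text{outer}} = \Bigl( Q(s_t, a_t) - \bigl[\, r_t + \gamma \max_{a'} Q(s_{t+1}, a') \,\bigr] \Bigr)^2

% Inner Q, dimensions i < N: Q-learning between consecutive inner steps.
L_{\text{inner}}^{\,i} = \Bigl( Q^{i}(s_t, a_{1:i}) - \max_{b} Q^{i+1}\bigl(s_t, (a_{1:i}, b)\bigr) \Bigr)^2

% Last inner Q: plain regression onto the outer target, so that the end of the
% inner MDP matches Q at s_{t+1}, as the earlier training slides describe.
L_{\text{last}} = \Bigl( Q^{N}(s_t, a_{1:N}) - \bigl[\, r_t + \gamma \max_{a'} Q(s_{t+1}, a') \,\bigr] \Bigr)^2
```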
  33. Implementation of the inner Q network 1. A recurrent model with shared weights, using an LSTM a. input: state + the previously selected action (NOT …) 2. Multiple separate feedforward models a. input: state + the concatenation of the actions selected so far b. more stable than the recurrent variant
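A minimal sketch of the second (feedforward) variant, with architecture details such as hidden sizes and bin count assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn

class InnerQHead(nn.Module):
    """Q^i head for one action dimension: (state, chosen prefix) -> Q per bin."""

    def __init__(self, state_dim, prefix_dim, num_bins, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + prefix_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_bins),
        )

    def forward(self, state, prefix):
        # prefix holds the already-selected (continuous) values of earlier
        # dimensions; it has width 0 for the first action dimension.
        return self.net(torch.cat([state, prefix], dim=-1))

# One separate feedforward model per action dimension (no weight sharing),
# as opposed to the single weight-shared LSTM of variant 1.
state_dim, n_action_dims, num_bins = 17, 6, 51
inner_q = [InnerQHead(state_dim, i, num_bins) for i in range(n_action_dims)]
```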
  34. Method - Exploration
  35. Experiments ● Multimodal Example Environment ○ compared with other state-of-the-art models to test its effectiveness ○ DDPG: a state-of-the-art off-policy actor-critic algorithm ○ NAF: another value-based algorithm that can solve continuous-action problems ● MuJoCo environments ○ test SDQN on common continuous-control tasks ○ 5 tasks
  36. Experiments - Multimodal Example Environment 1. Single-step MDP a. only 2 states: the initial state and the terminal state 2. Deterministic environment a. fixed transitions 3. 2-D action space (2 continuous actions) 4. Multimodal reward function a. used to test whether the algorithm converges to a local or the global optimum (see the sketch below)
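A hypothetical stand-in for such an environment (the paper's actual reward surface is not reproduced here; this only illustrates "single-step, deterministic, 2-D continuous action, multimodal reward"):

```python
import numpy as np

def multimodal_bandit_step(action):
    """Single-step, deterministic toy environment with a 2-D continuous action.

    The reward is a mixture of two bumps, so a greedy learner can get stuck on
    the smaller (local) mode instead of finding the global one.
    NOTE: this reward surface is a made-up illustration, not the paper's.
    """
    a = np.asarray(action, dtype=float)            # shape (2,), each in [-1, 1]
    local_mode  = 0.6 * np.exp(-np.sum((a - np.array([-0.5, -0.5])) ** 2) / 0.05)
    global_mode = 1.0 * np.exp(-np.sum((a - np.array([ 0.5,  0.5])) ** 2) / 0.05)
    reward = local_mode + global_mode
    done = True                                    # episode ends after one action
    return reward, done
```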
  37. Experiments - Multimodal Example Environment (figure panels: reward landscape, final policy)
  38. Experiments - Multimodal Example Environment
  39-40. Experiments - MuJoCo environments ● hopper (3-D action) ● swimmer (2-D) ● half cheetah (6-D) ● walker2D (6-D) ● humanoid (17-D)
  41. Experiments - MuJoCo environments ● Perform a hyperparameter search and select the best configuration to evaluate performance ● Run 10 random seeds for each environment
  42. Experiments - MuJoCo environments
  43. Experiments - MuJoCo environments Training for 2M steps. The reported value is the average best performance over 10 random seeds.
  44. Recap: DeepMind pysc2 - The Network Architecture
  45. Recap: DeepMind pysc2
  46. Discussion
