2. Disclaimer
• Equations in the slides are notationally inconsistent; many are adapted from the textbook by Sutton and Barto, while others are taken from other documents.
3. Outline
Framework of reinforcement learning: Markov decision process, value function, objective
Reinforcement learning methods: action value function, function approximation, parametrized policy, actor-critic method, advanced learning methods
Python modules: OpenAI Gym, PyBullet, TensorFlow Agents, OpenAI Baselines
Example: installation with Python venv, running PyBullet, training with TensorFlow Agents on a pendulum environment
5. Markov Decision Process
• Agent’s policy: stochastic action selection conditioned on the current state, $\pi(a \mid s)$
• Environment: stochastic state transition and reward conditioned on the state-action pair, $p(s', r \mid s, a)$
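To make the loop concrete, here is a minimal sketch (not from the slides) of agent-environment interaction; sample_action and env_step are hypothetical stand-ins for the policy $\pi(a \mid s)$ and the distribution $p(s', r \mid s, a)$:

import random

def sample_action(state):
    # Hypothetical stochastic policy pi(a|s): uniform over two actions
    return random.choice([0, 1])

def env_step(state, action):
    # Hypothetical environment p(s', r | s, a): noisy transition and reward
    next_state = state + (1 if action == 1 else -1) + random.gauss(0, 0.1)
    reward = -abs(next_state)  # e.g., penalize distance from the origin
    return next_state, reward

state = 0.0
for t in range(100):
    action = sample_action(state)            # agent samples a_t
    state, reward = env_step(state, action)  # environment returns s_{t+1}, r_{t+1}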
6. State value function and expected reward
• State value function: expectation of the discounted sum of future rewards under the MDP, starting from an initial state:
$v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$
• Expected reward: expectation of the value function over initial states, $J(\pi) = \mathbb{E}_{S_0}[v_\pi(S_0)]$
• Training of the agent: maximize the expected reward $J(\pi)$ with respect to the policy $\pi$ (a sketch of the discounted sum follows)
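For instance, the discounted sum inside the expectation can be computed from a sampled reward sequence (a minimal sketch; gamma is the discount factor):

def discounted_return(rewards, gamma=0.99):
    # G = r_1 + gamma * r_2 + gamma^2 * r_3 + ...
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.99 + 0.99**2 = 2.9701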
7. Example: pole balancing
• State:
• angle of the pole
• angular velocity of the pole
• vision input (camera image shown on the slide; figure omitted)
• Action: force to move the cart right or left
• Reward: +1 if the pole is nearly vertical (within a threshold angle)
8. Reinforcement learning objective
• Optimize the policy of the agent through interaction with the environment
• Reward function is given (e.g., game) or designed (e.g., robot)
• The MDP model might be (partially) known (e.g., Go) or unknown (e.g., robot)
9. Outline (repeated; next section: Reinforcement learning methods)
10. Action value function
• Action value function $q_\pi(s, a)$: expected return given the current state-action pair
• Can be used to select an action by maximizing it with respect to the action
• Implicit representation of the policy (value iteration)
• Brute-force maximization for a (small) discrete action space (see the sketch below)
• Gradient-free optimization such as CEM for larger or continuous spaces
• Can be used as a guide to improve the policy (policy iteration, later)
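For a small discrete action space, the brute-force choice is a single argmax over the action values (a sketch; q is a hypothetical table of values per state):

import numpy as np

q = {0: np.array([0.1, 0.7, 0.2])}   # hypothetical action values for state 0

def greedy_action(state):
    return int(np.argmax(q[state]))  # a* = argmax_a q(s, a)

print(greedy_action(0))  # 1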
11. Bellman equation
• Recursive relationship of the value function under a fixed policy $\pi$:
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$
• For the optimal policy:
$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$
• Basis of learning the value function (DP, Q-learning, actor-critic)
12. Bellman equation for the action value function
• For the optimal policy:
$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]$
• Relationship with the state value function: $v_*(s) = \max_a q_*(s, a)$
13. Updating the action value function
• SARSA: update using the sampled next state and next action (on-policy):
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$
• Q-learning: update using the greedy action under the current value estimates (off-policy):
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$
(a tabular sketch of both updates follows)
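The two rules differ only in the bootstrap target; a minimal tabular sketch (Q is a hypothetical numpy array indexed by state and action):

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: uses the action actually sampled in s_next
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy target: uses the greedy action in s_next
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])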
14. Function approximation for the value function
• DQN: train a deep network that represents the action value function
https://towardsdatascience.com/deep-double-q-learning-7fca410b193a
• Select an action based on the output of the deep network (typically epsilon-greedy; see the sketch below)
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/deep_q_learning.html
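Epsilon-greedy selection exploits the argmax of the predicted action values most of the time and explores randomly otherwise (a sketch; q_values stands for the network output for the current state):

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore: random action
    return int(np.argmax(q_values))              # exploit: greedy action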
15. Parametrized policy
• Representation of the policy by a parametric function
Example: linear function + softmax, $\pi_\theta(a \mid s) = \exp(\theta_a^\top \phi(s)) / \sum_b \exp(\theta_b^\top \phi(s))$
• Policy gradient theorem (PGT): $\nabla_\theta J(\theta) \propto \mathbb{E}_\pi \left[ q_\pi(S, A) \, \nabla_\theta \log \pi_\theta(A \mid S) \right]$
• Applicable to continuous action spaces (a sketch of the softmax case follows)
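The linear + softmax example takes only a few lines (a sketch; phi is a hypothetical feature vector of the state):

import numpy as np

def softmax_policy(theta, phi):
    # theta: (num_actions, num_features); phi: state feature vector
    logits = theta @ phi
    probs = np.exp(logits - logits.max())  # subtract max for stability
    return probs / probs.sum()             # pi_theta(. | s)

theta = np.zeros((2, 3))
phi = np.array([1.0, 0.5, -0.2])
action = np.random.choice(2, p=softmax_policy(theta, phi))  # sample a ~ pi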
16. Actor-critic methods
• Plug a parametric value function $v_w$, instead of the sampled reward, into the policy gradient equation, with $G_t^{(n)}$ the n-step cumulative reward:
$\nabla_\theta J \approx \mathbb{E} \left[ \left( G_t^{(n)} - v_w(S_t) \right) \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]$
• $G_t^{(n)} - v_w(S_t)$ can be read as the improvement in value from taking the sampled action over the value of the current policy (advantage function)
• The above update rule is used in advantage actor-critic (A2C); a numeric sketch follows
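A small sketch of the n-step return and the resulting advantage estimate (v_t and v_tn are hypothetical critic estimates of $v_w(S_t)$ and $v_w(S_{t+n})$):

def n_step_advantage(rewards, v_t, v_tn, gamma=0.99):
    # G_t^(n) = r_{t+1} + gamma r_{t+2} + ... + gamma^n v(s_{t+n})
    n = len(rewards)
    g = sum(gamma**k * r for k, r in enumerate(rewards)) + gamma**n * v_tn
    return g - v_t  # advantage estimate: G_t^(n) - v(s_t)

print(n_step_advantage([1.0, 1.0, 1.0], v_t=2.5, v_tn=2.0))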
17. Deterministic policy gradient
• More efficient than training a stochastic policy with the PGT
• Deterministic policy gradient theorem (DPGT) (Silver et al., 2014):
$\nabla_\theta J(\theta) = \mathbb{E}_s \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a q^\mu(s, a) \big|_{a = \mu_\theta(s)} \right]$
• Applied in an extensive study of controlling robots with deep networks (DDPG; Lillicrap et al., 2015)
• Exploration via action noise or parameter noise, since the policy itself is deterministic
18. TRPO and PPO
• Ensure improvement of the policy by maximizing a lower bound on the objective
• Trust region policy optimization (TRPO): maximize the lower bound within a trust region, i.e., close to the current policy parameters (Schulman et al., 2015)
• Proximal policy optimization (PPO): use a simple clipped objective instead of the KL-divergence constraint (Schulman et al., 2017); see below
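For reference, PPO's clipped surrogate objective, with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and advantage estimate $\hat{A}_t$ (Schulman et al., 2017):

$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]$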
19. TRPO derivation (slide equations omitted): expected reward; its change after a policy update; rewritten with the state distribution of the current policy; lower bound on the expected reward; constrained maximization; importance sampling; the TRPO objective
20. Outline (repeated; next section: Python modules)
21. OpenAI Gym
• Provides the interface of an RL environment through a base class Env (a minimal custom environment follows the listing)

class Env(object):
    """The main OpenAI Gym class."""

    def step(self, action):
        """Run one timestep of the environment's dynamics.
        Returns observation, reward, done and info.
        """
        raise NotImplementedError

    def reset(self):
        """Resets the state of the environment and returns an initial observation."""
        raise NotImplementedError

    def render(self, mode='human'):
        """Renders the environment."""
        raise NotImplementedError

    def seed(self, seed=None):
        """Sets the seed for this env's random number generator(s)."""
        logger.warn("Could not seed environment %s", self)
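A minimal sketch of a custom environment implementing this interface (RandomWalkEnv is hypothetical, not part of Gym: the agent walks left or right on a line and is rewarded for reaching the right end):

import gym
import numpy as np
from gym import spaces

class RandomWalkEnv(gym.Env):
    def __init__(self, size=10):
        self.size = size
        self.action_space = spaces.Discrete(2)  # 0: left, 1: right
        self.observation_space = spaces.Box(
            low=0, high=size, shape=(1,), dtype=np.float32)
        self.pos = size // 2

    def reset(self):
        self.pos = self.size // 2
        return np.array([self.pos], dtype=np.float32)

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        done = self.pos <= 0 or self.pos >= self.size
        reward = 1.0 if self.pos >= self.size else 0.0
        return np.array([self.pos], dtype=np.float32), reward, done, {}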
22. • Collection of environments for comparing RL algorithms
• Minimal example interacting with the CartPole-v0 environment
• python scripts/cartpole.py

# https://gym.openai.com/docs/#environments
import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action
23. • Show the list of registered environments
from gym import envs
print(envs.registry.all())
#> [EnvSpec(DoubleDunk-v0), EnvSpec(InvertedDoublePendulum-v0),
# EnvSpec(BeamRider-v0), EnvSpec(Phoenix-ram-v0), EnvSpec(Asterix-v0),
# EnvSpec(TimePilot-v0), EnvSpec(Alien-v0), EnvSpec(Robotank-ram-v0),
# EnvSpec(CartPole-v0), EnvSpec(Berzerk-v0), EnvSpec(Berzerk-ram-v0),
# EnvSpec(Gopher-ram-v0), ...
24. OpenAI Baselines
• Set of high-quality implementations of RL algorithms: A2C, ACER, ACKTR, DDPG, DQN, GAIL, HER, PPO, TRPO
• There is a fork, "Stable Baselines": unified structure across algorithms, PEP8-compliant, documented, with more tests and higher code coverage
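Typical usage trains an algorithm on an environment through the run entry point; an example, assuming the command-line interface of the Baselines version from this period:

$ python -m baselines.run --alg=ppo2 --env=CartPole-v0 --num_timesteps=1e5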
25. TensorFlow Agents
• Optimized infrastructure for reinforcement learning
• Multiple parallel environments, batch PPO
• Validated on benchmark environments
26. PyBullet
• Python wrapper of the Bullet physics simulator
• URDF/SDF support, forward dynamics, inverse kinematics, collision checking, 2D/depth cameras, virtual reality
• Used in simulation-to-real transfer of a controller for a quadruped robot (Tan et al., 2018); an inverse kinematics sketch follows
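A small sketch of the inverse kinematics feature (assumptions: the kuka_iiwa model ships with pybullet_data, and link index 6 is the arm's end effector):

import pybullet as p
import pybullet_data

p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
kuka = p.loadURDF("kuka_iiwa/model.urdf")
# Joint angles that place the end effector at the target position
joint_angles = p.calculateInverseKinematics(kuka, 6, [0.5, 0.0, 0.5])
print(joint_angles)
p.disconnect()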
27. Outline (repeated; next section: Example)
28. Installation and testing
• Python 3.6 with venv (the commands below assume macOS with Homebrew)
$ brew install cmake openmpi
$ cd $WORKDIR
$ python3 -m venv pybullet-env
$ source pybullet-env/bin/activate
$ pip install tensorflow==1.12
$ pip install gym==0.10.9
$ git clone https://github.com/openai/baselines.git
$ cd baselines
$ pip install -e .
$ cd ..
$ pip install pybullet==2.3.8
$ pip install ruamel-yaml==0.15.76
$ pip install stable-baselines==2.2.1
$ brew install ffmpeg # for making video
• Test pybullet and gym
$ cd $WORKDIR/pybullet-env/lib/python3.6/site-packages/pybullet_envs/examples
$ python kukaGymEnvTest.py
$ python kukaCamGymEnvTest.py # much slower
29. import pybullet as p
import time
import pybullet_data

physicsClient = p.connect(p.GUI)  # or p.DIRECT for non-graphical version
p.setAdditionalSearchPath(pybullet_data.getDataPath())  # optionally
p.setGravity(0, 0, -10)
planeId = p.loadURDF("plane.urdf")
cubeStartPos = [0, 0, 1]
cubeStartOrientation = p.getQuaternionFromEuler([0, 0, 0])
boxId = p.loadURDF("r2d2.urdf", cubeStartPos, cubeStartOrientation)
for i in range(10000):
    p.stepSimulation()
    time.sleep(1. / 240.)
cubePos, cubeOrn = p.getBasePositionAndOrientation(boxId)
print(cubePos, cubeOrn)
p.disconnect()

• python scripts/hello_pybullet.py
34. # Gym environment
environment = KukaGymEnv(renders=True, isDiscrete=False)
# GUI sliders for manual control
motorsIds = []
dv = 0.01
motorsIds.append(environment._p.addUserDebugParameter("posX", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("posY", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("posZ", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("yaw", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("fingerAngle", 0, 0.3, .3))
done = False
while not done:
    action = []
    for motorId in motorsIds:
        action.append(environment._p.readUserDebugParameter(motorId))
    # 1 step forward
    state, reward, done, info = environment.step(action)
    obs = environment.getExtendedObservation()  # Get more state info

kukaGymEnvTest.py
Note: code modified from the original for comparison
35. # Gym environment with cameras
environment = KukaCamGymEnv(renders=True, isDiscrete=False)
# GUI sliders for manual control
motorsIds = []
dv = 1
motorsIds.append(environment._p.addUserDebugParameter("posX", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("posY", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("posZ", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("yaw", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("fingerAngle", 0, 0.3, .3))
done = False
while not done:
    action = []
    for motorId in motorsIds:
        action.append(environment._p.readUserDebugParameter(motorId))
    # 1 step forward
    state, reward, done, info = environment.step(action)
    obs = environment.getExtendedObservation()  # Get more state info

kukaCamGymEnvTest.py
Note: code modified from the original for comparison
36. Run PPO for the pendulum environment
$ python -m pybullet_envs.agents.train_ppo --config=pybullet_pendulum --logdir=pendulum
# In another terminal
$ tensorboard --logdir=pendulum --port=2222
• PyBullet + OpenAI Gym + TensorFlow Agents
• Trains a policy for the pendulum environment with PPO
• Saves the result as TensorFlow output files (logs, model)
38. Create video for an episode with the trained policy
$ python -m pybullet_envs.agents.visualize_ppo
--logdir=pendulum/xxxx-pybullet_pendulum/ --outdir=pendulum_video
42. # Modified from the original code to use the PyBullet API directly,
# in order to explain the basic idea
import numpy as np

class InvertedPendulumBulletEnv:
    def __init__(self):
        # Load robot model, wrap interfaces
        self.robot = InvertedPendulum()

    def _step(self, a):
        # self.robot.apply_action(a)
        # index_slider: index of the cart's slider joint (defined elsewhere)
        self._p.setJointMotorControl2(self.robot.objects[0], jointIndex=index_slider,
                                      controlMode=self._p.TORQUE_CONTROL, force=a)
        self._p.stepSimulation()
        # Return value is ((x, y, z), (a, b, c, d))
        state = self._p.getBasePositionAndOrientation(self.robot.objects[0])
        if self.robot.swingup:
            reward = np.cos(self.robot.theta)
            done = False
        else:
            reward = 1.0
            done = np.abs(self.robot.theta) > .2
        return state, reward, done, {}
43. • python scripts/hello_stable_baselines.py

import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2

env = gym.make('CartPole-v1')
# The algorithms require a vectorized environment to run
env = DummyVecEnv([lambda: env])
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
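To keep the trained policy for later use, Stable Baselines models can be saved and reloaded (the file name here is arbitrary):

model.save("ppo2_cartpole")
model = PPO2.load("ppo2_cartpole")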