2. Disclaimer
• Equations in the slides are notationally inconsistent; many are adapted from the textbook by Sutton and Barto, while others are taken from other documents.
3. Outline
Framework of reinforcement learning: Markov decision process, value function, objective
Reinforcement learning methods: action value function, function approximation, parametrized policy, actor-critic method, advanced learning methods
Python modules: OpenAI Gym, PyBullet, TensorFlow Agents, OpenAI Baselines
Example: installation with Python venv, running PyBullet, training with TensorFlow Agents on a pendulum environment
5. Markov Decision Process
• Agent’s policy: stochastic action selection conditioned on the current state, $\pi(a \mid s)$
• Environment: stochastic state transition and reward conditioned on the state-action pair, $p(s', r \mid s, a)$
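To make the loop concrete, here is a minimal sketch (not from the slides) of agent-environment interaction; sample_action and env_step are hypothetical stand-ins for the policy $\pi(a \mid s)$ and the distribution $p(s', r \mid s, a)$:

import random

def sample_action(state):
    # Hypothetical stochastic policy pi(a|s): uniform over two actions
    return random.choice([0, 1])

def env_step(state, action):
    # Hypothetical environment p(s', r | s, a): noisy transition and reward
    next_state = state + (1 if action == 1 else -1) + random.gauss(0, 0.1)
    reward = -abs(next_state)  # e.g., penalize distance from the origin
    return next_state, reward

state = 0.0
for t in range(100):
    action = sample_action(state)            # agent samples a_t
    state, reward = env_step(state, action)  # environment returns s_{t+1}, r_{t+1}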
6. State value function and expected reward
• State value function: expectation of the discounted sum of future rewards under the MDP, starting from an initial state:
$v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$
• Expected reward: expectation of the value function over initial states, $J(\pi) = \mathbb{E}_{S_0}[v_\pi(S_0)]$
• Training of the agent: maximize the expected reward $J(\pi)$ with respect to the policy $\pi$ (a sketch of the discounted sum follows)
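For instance, the discounted sum inside the expectation can be computed from a sampled reward sequence (a minimal sketch; gamma is the discount factor):

def discounted_return(rewards, gamma=0.99):
    # G = r_1 + gamma * r_2 + gamma^2 * r_3 + ...
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.99 + 0.99**2 = 2.9701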
7. Example: pole balancing
• State:
• angle of the pole
• angular velocity of the pole
• vision input (camera image shown on the slide; figure omitted)
• Action: force to move the cart right or left
• Reward: +1 if the pole is nearly vertical (within a threshold angle)
8. Reinforcement learning objective
• Optimize the policy of the agent through interaction with the environment
• Reward function is given (e.g., game) or designed (e.g., robot)
• The MDP model might be (partially) known (e.g., Go) or unknown (e.g., robot)
9. Outline (repeated; next section: Reinforcement learning methods)
10. Action value function
• Action value function $q_\pi(s, a)$: expected return given the current state-action pair
• Can be used to select an action by maximizing it with respect to the action
• Implicit representation of the policy (value iteration)
• Brute-force maximization for a (small) discrete action space (see the sketch below)
• Gradient-free optimization such as CEM for larger or continuous spaces
• Can be used as a guide to improve the policy (policy iteration, later)
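For a small discrete action space, the brute-force choice is a single argmax over the action values (a sketch; q is a hypothetical table of values per state):

import numpy as np

q = {0: np.array([0.1, 0.7, 0.2])}   # hypothetical action values for state 0

def greedy_action(state):
    return int(np.argmax(q[state]))  # a* = argmax_a q(s, a)

print(greedy_action(0))  # 1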
11. Bellman equation
• Recursive relationship of the value function under a fixed policy $\pi$:
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$
• For the optimal policy:
$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$
• Basis of learning the value function (DP, Q-learning, actor-critic)
12. Bellman equation for the action value function
• For the optimal policy:
$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]$
• Relationship with the state value function: $v_*(s) = \max_a q_*(s, a)$
13. Updating the action value function
• SARSA: update using the sampled next state and next action (on-policy):
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$
• Q-learning: update using the greedy action under the current value estimates (off-policy):
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$
(a tabular sketch of both updates follows)
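The two rules differ only in the bootstrap target; a minimal tabular sketch (Q is a hypothetical numpy array indexed by state and action):

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: uses the action actually sampled in s_next
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy target: uses the greedy action in s_next
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])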
14. Function approximation for the value function
• DQN: train a deep network that represents the action value function
https://towardsdatascience.com/deep-double-q-learning-7fca410b193a
• Select an action based on the output of the deep network (typically epsilon-greedy; see the sketch below)
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/deep_q_learning.html
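Epsilon-greedy selection exploits the argmax of the predicted action values most of the time and explores randomly otherwise (a sketch; q_values stands for the network output for the current state):

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore: random action
    return int(np.argmax(q_values))              # exploit: greedy action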
15. Parametrized policy
• Representation of the policy by a parametric function
Example: linear function + softmax, $\pi_\theta(a \mid s) = \exp(\theta_a^\top \phi(s)) / \sum_b \exp(\theta_b^\top \phi(s))$
• Policy gradient theorem (PGT): $\nabla_\theta J(\theta) \propto \mathbb{E}_\pi \left[ q_\pi(S, A) \, \nabla_\theta \log \pi_\theta(A \mid S) \right]$
• Applicable to continuous action spaces (a sketch of the softmax case follows)
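The linear + softmax example takes only a few lines (a sketch; phi is a hypothetical feature vector of the state):

import numpy as np

def softmax_policy(theta, phi):
    # theta: (num_actions, num_features); phi: state feature vector
    logits = theta @ phi
    probs = np.exp(logits - logits.max())  # subtract max for stability
    return probs / probs.sum()             # pi_theta(. | s)

theta = np.zeros((2, 3))
phi = np.array([1.0, 0.5, -0.2])
action = np.random.choice(2, p=softmax_policy(theta, phi))  # sample a ~ pi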
16. Actor-critic methods
• Plug a parametric value function $v_w$, instead of the sampled reward, into the policy gradient equation, with $G_t^{(n)}$ the n-step cumulative reward:
$\nabla_\theta J \approx \mathbb{E} \left[ \left( G_t^{(n)} - v_w(S_t) \right) \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]$
• $G_t^{(n)} - v_w(S_t)$ can be read as the improvement in value from taking the sampled action over the value of the current policy (advantage function)
• The above update rule is used in advantage actor-critic (A2C); a numeric sketch follows
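A small sketch of the n-step return and the resulting advantage estimate (v_t and v_tn are hypothetical critic estimates of $v_w(S_t)$ and $v_w(S_{t+n})$):

def n_step_advantage(rewards, v_t, v_tn, gamma=0.99):
    # G_t^(n) = r_{t+1} + gamma r_{t+2} + ... + gamma^n v(s_{t+n})
    n = len(rewards)
    g = sum(gamma**k * r for k, r in enumerate(rewards)) + gamma**n * v_tn
    return g - v_t  # advantage estimate: G_t^(n) - v(s_t)

print(n_step_advantage([1.0, 1.0, 1.0], v_t=2.5, v_tn=2.0))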
17. Deterministic policy gradient
• More efficient than training a stochastic policy with the PGT
• Deterministic policy gradient theorem (DPGT) (Silver et al., 2014):
$\nabla_\theta J(\theta) = \mathbb{E}_s \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a q^\mu(s, a) \big|_{a = \mu_\theta(s)} \right]$
• Applied in an extensive study of controlling robots with deep networks (DDPG; Lillicrap et al., 2015)
• Exploration via action noise or parameter noise, since the policy itself is deterministic
18. TRPO and PPO
• Ensure improvement of the policy by maximizing a lower bound on the objective
• Trust region policy optimization (TRPO): maximize the lower bound within a trust region, i.e., close to the current policy parameters (Schulman et al., 2015)
• Proximal policy optimization (PPO): use a simple clipped objective instead of the KL-divergence constraint (Schulman et al., 2017); see below
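For reference, PPO's clipped surrogate objective, with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and advantage estimate $\hat{A}_t$ (Schulman et al., 2017):

$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]$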
19. TRPO derivation (slide equations omitted): expected reward; its change after a policy update; rewritten with the state distribution of the current policy; lower bound on the expected reward; constrained maximization; importance sampling; the TRPO objective
20. Outline (repeated; next section: Python modules)
21. OpenAI Gym
• Provides the interface of an RL environment through a base class Env (a minimal custom environment follows the listing)

class Env(object):
    """The main OpenAI Gym class."""

    def step(self, action):
        """Run one timestep of the environment's dynamics.
        Returns observation, reward, done and info.
        """
        raise NotImplementedError

    def reset(self):
        """Resets the state of the environment and returns an initial observation."""
        raise NotImplementedError

    def render(self, mode='human'):
        """Renders the environment."""
        raise NotImplementedError

    def seed(self, seed=None):
        """Sets the seed for this env's random number generator(s)."""
        logger.warn("Could not seed environment %s", self)
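A minimal sketch of a custom environment implementing this interface (RandomWalkEnv is hypothetical, not part of Gym: the agent walks left or right on a line and is rewarded for reaching the right end):

import gym
import numpy as np
from gym import spaces

class RandomWalkEnv(gym.Env):
    def __init__(self, size=10):
        self.size = size
        self.action_space = spaces.Discrete(2)  # 0: left, 1: right
        self.observation_space = spaces.Box(
            low=0, high=size, shape=(1,), dtype=np.float32)
        self.pos = size // 2

    def reset(self):
        self.pos = self.size // 2
        return np.array([self.pos], dtype=np.float32)

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        done = self.pos <= 0 or self.pos >= self.size
        reward = 1.0 if self.pos >= self.size else 0.0
        return np.array([self.pos], dtype=np.float32), reward, done, {}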
22. • Collection of environments for comparing RL algorithms
• Minimal example interacting with the CartPole-v0 environment
• python scripts/cartpole.py

# https://gym.openai.com/docs/#environments
import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action
23. • Show the list of registered environments
from gym import envs
print(envs.registry.all())
#> [EnvSpec(DoubleDunk-v0), EnvSpec(InvertedDoublePendulum-v0),
# EnvSpec(BeamRider-v0), EnvSpec(Phoenix-ram-v0), EnvSpec(Asterix-v0),
# EnvSpec(TimePilot-v0), EnvSpec(Alien-v0), EnvSpec(Robotank-ram-v0),
# EnvSpec(CartPole-v0), EnvSpec(Berzerk-v0), EnvSpec(Berzerk-ram-v0),
# EnvSpec(Gopher-ram-v0), ...
24. OpenAI Baselines
• Set of high-quality implementations of RL algorithms: A2C, ACER, ACKTR, DDPG, DQN, GAIL, HER, PPO, TRPO
• There is a fork, "Stable Baselines": unified structure across algorithms, PEP8-compliant, documented, with more tests and higher code coverage
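Typical usage trains an algorithm on an environment through the run entry point; an example, assuming the command-line interface of the Baselines version from this period:

$ python -m baselines.run --alg=ppo2 --env=CartPole-v0 --num_timesteps=1e5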
25. TensorFlow Agents
• Optimized infrastructure for reinforcement learning
• Multiple parallel environments, batch PPO
• Validated on benchmark environments
26. PyBullet
• Python wrapper of the Bullet physics simulator
• URDF/SDF support, forward dynamics, inverse kinematics, collision checking, 2D/depth cameras, virtual reality
• Used in simulation-to-real transfer of a controller for a quadruped robot (Tan et al., 2018); an inverse kinematics sketch follows
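A small sketch of the inverse kinematics feature (assumptions: the kuka_iiwa model ships with pybullet_data, and link index 6 is the arm's end effector):

import pybullet as p
import pybullet_data

p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
kuka = p.loadURDF("kuka_iiwa/model.urdf")
# Joint angles that place the end effector at the target position
joint_angles = p.calculateInverseKinematics(kuka, 6, [0.5, 0.0, 0.5])
print(joint_angles)
p.disconnect()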
27. Outline (repeated; next section: Example)
28. Installation and testing
• Python 3.6 with venv (the commands below assume macOS with Homebrew)
$ brew install cmake openmpi
$ cd $WORKDIR
$ python3 -m venv pybullet-env
$ source pybullet-env/bin/activate
$ pip install tensorflow==1.12
$ pip install gym==0.10.9
$ git clone https://github.com/openai/baselines.git
$ cd baselines
$ pip install -e .
$ cd ..
$ pip install pybullet==2.3.8
$ pip install ruamel-yaml==0.15.76
$ pip install stable-baselines==2.2.1
$ brew install ffmpeg # for making video
• Test pybullet and gym
$ cd $WORKDIR/pybullet-env/lib/python3.6/site-packages/pybullet_envs/examples
$ python kukaGymEnvTest.py
$ python kukaCamGymEnvTest.py # much slower
29. import pybullet as p
import time
import pybullet_data

physicsClient = p.connect(p.GUI)  # or p.DIRECT for non-graphical version
p.setAdditionalSearchPath(pybullet_data.getDataPath())  # optionally
p.setGravity(0, 0, -10)
planeId = p.loadURDF("plane.urdf")
cubeStartPos = [0, 0, 1]
cubeStartOrientation = p.getQuaternionFromEuler([0, 0, 0])
boxId = p.loadURDF("r2d2.urdf", cubeStartPos, cubeStartOrientation)
for i in range(10000):
    p.stepSimulation()
    time.sleep(1. / 240.)
cubePos, cubeOrn = p.getBasePositionAndOrientation(boxId)
print(cubePos, cubeOrn)
p.disconnect()

• python scripts/hello_pybullet.py
34. # Gym environment
environment = KukaGymEnv(renders=True, isDiscrete=False)
# GUI sliders for manual control
motorsIds = []
dv = 0.01
motorsIds.append(environment._p.addUserDebugParameter("posX", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("posY", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("posZ", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("yaw", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("fingerAngle", 0, 0.3, .3))
done = False
while not done:
    action = []
    for motorId in motorsIds:
        action.append(environment._p.readUserDebugParameter(motorId))
    # 1 step forward
    state, reward, done, info = environment.step(action)
    obs = environment.getExtendedObservation()  # Get more state info

kukaGymEnvTest.py
Note: code modified from the original for comparison
35. # Gym environment with cameras
environment = KukaCamGymEnv(renders=True, isDiscrete=False)
# GUI sliders for manual control
motorsIds = []
dv = 1
motorsIds.append(environment._p.addUserDebugParameter("posX", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("posY", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("posZ", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("yaw", -dv, dv, 0))
motorsIds.append(environment._p.addUserDebugParameter("fingerAngle", 0, 0.3, .3))
done = False
while not done:
    action = []
    for motorId in motorsIds:
        action.append(environment._p.readUserDebugParameter(motorId))
    # 1 step forward
    state, reward, done, info = environment.step(action)
    obs = environment.getExtendedObservation()  # Get more state info

kukaCamGymEnvTest.py
Note: code modified from the original for comparison
36. Run PPO for the pendulum environment
$ python -m pybullet_envs.agents.train_ppo --config=pybullet_pendulum --logdir=pendulum
# In another terminal
$ tensorboard --logdir=pendulum --port=2222
• PyBullet + OpenAI Gym + TensorFlow Agents
• Trains a policy for the pendulum environment with PPO
• Saves the result as TensorFlow output files (logs, model)
38. Create video for an episode with the trained policy
$ python -m pybullet_envs.agents.visualize_ppo
--logdir=pendulum/xxxx-pybullet_pendulum/ --outdir=pendulum_video
42. # Modified from the original code to use the PyBullet API directly,
# in order to explain the basic idea
import numpy as np

class InvertedPendulumBulletEnv:
    def __init__(self):
        # Load robot model, wrap interfaces
        self.robot = InvertedPendulum()

    def _step(self, a):
        # self.robot.apply_action(a)
        # index_slider: index of the cart's slider joint (defined elsewhere)
        self._p.setJointMotorControl2(self.robot.objects[0], jointIndex=index_slider,
                                      controlMode=self._p.TORQUE_CONTROL, force=a)
        self._p.stepSimulation()
        # Return value is ((x, y, z), (a, b, c, d))
        state = self._p.getBasePositionAndOrientation(self.robot.objects[0])
        if self.robot.swingup:
            reward = np.cos(self.robot.theta)
            done = False
        else:
            reward = 1.0
            done = np.abs(self.robot.theta) > .2
        return state, reward, done, {}
43. • python scripts/hello_stable_baselines.py

import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2

env = gym.make('CartPole-v1')
# The algorithms require a vectorized environment to run
env = DummyVecEnv([lambda: env])
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
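To keep the trained policy for later use, Stable Baselines models can be saved and reloaded (the file name here is arbitrary):

model.save("ppo2_cartpole")
model = PPO2.load("ppo2_cartpole")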