Reinforcement learning with bullet simulator
25 Nov 2018
Taku Yoshioka
Disclaimer
• Equations in these slides are notationally inconsistent: many are adapted
from the textbook of Sutton and Barto, while others are taken from other
documents.
Framework of reinforcement learning
Markov decision process, value function, objective
Reinforcement learning methods
Action value function, function approximation, parametrized policy,
actor-critic method, advanced learning methods
Python modules
OpenAI Gym, PyBullet, TensorFlow Agents, OpenAI Baselines
Example
Installation with Python venv, running PyBullet, training with
TensorFlow Agents on a pendulum environment
Framework of reinforcement learning
Markov decision process, value function, objective
Reinforcement learning methods
Action value function, function approximation, parametrized policy,
actor-critic method, advanced learning methods
Python modules
OpenAI Gym, PyBullet, TensorFlow Agents, OpenAI Baselines
Example
Installation with Python venv, running PyBullet, training with
TensorFlow Agents on a pendulum environment
Markov Decision Process
• Agent’s policy: stochastic action selection conditioned on current state
• Environment: stochastic state transition and stochastic reward
conditioned on state-action pair
Distribution of state transition and reward; policy (notation sketched below)
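A sketch of the standard notation for these two quantities (Sutton-and-Barto
style; S_t, A_t, R_t denote state, action, and reward at time t and are
introduced here for reference):
\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)
p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)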
State value function and expected reward
• State value function: expectation of the discounted sum of future rewards
obtained by following the MDP from an initial state
• Expected reward: expectation of the value function over initial states
State value function and expected reward (sketched below)
• Training of the agent: maximize the expected reward with respect to the
policy
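A sketch of the two definitions and of the training objective (standard
forms; gamma denotes the discount factor and rho_0 the initial-state
distribution, both introduced here):
v_\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]
J(\pi) = \mathbb{E}_{s_0 \sim \rho_0}\left[ v_\pi(s_0) \right], \qquad \pi^\ast = \arg\max_\pi J(\pi)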
Example: pole balancing
• State:
• angle of the pole
• angular velocity of the pole
• vision as shown above
• Action: force to move the cart right or left
• Reward: +1 as long as the pole stays within an angle threshold of vertical
Reinforcement learning objective
• Optimize the policy of the agent through interaction with the environment
• Reward function is given (e.g., game) or designed (e.g., robot)
• MDP model might be (partially) given (e.g., Go) or not given (e.g., robot)
Framework of reinforcement learning
Markov decision process, value function, objective
Reinforcement learning methods
Action value function, function approximation, parametrized policy,
actor-critic method, advanced learning methods
Python modules
OpenAI Gym, PyBullet, TensorFlow Agents, OpenAI Baselines
Example
Installation with Python venv, running PyBullet, training with
TensorFlow Agents on a pendulum environment
Action value function
• Expected discounted future reward given the current state-action pair
(sketched below)
• Can be used to select an action by maximizing it with respect to the
action
• Implicit representation of the policy (value iteration)
• Brute-force search for a (small) discrete action space
• Gradient-free optimization such as CEM (cross-entropy method)
• Can be used as a guide to improve the policy (policy iteration, later)
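A sketch of the action value function and of action selection by maximizing
over actions (standard definitions, same notation as above):
q_\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]
a^\ast(s) = \arg\max_a q_\pi(s, a)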
Bellman equation
• Recursive relationship of the value function under a fixed policy π
• For the optimal policy
• Basis of learning the value function (DP, Q-learning, actor-critic);
both forms are sketched below
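A sketch of the Bellman equation for a fixed policy and for the optimal
policy (standard forms, same notation as above):
v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]
v_\ast(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\ast(s') \right]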
Bellman equation for action value function
• For the optimal policy
• Relationship with the state value function (both sketched below)
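A sketch of the optimal Bellman equation for the action value function and
its relation to the state value function (standard forms):
q_\ast(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_\ast(s', a') \right]
v_\ast(s) = \max_a q_\ast(s, a)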
Updating action value function
• SARSA: update using the next state and action actually sampled (on-policy)
• Q-learning: update using the greedy action under the current value
estimate (off-policy); both update rules are sketched below
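A sketch of the two tabular update rules (standard forms; alpha denotes the
learning rate and is introduced here):
SARSA:       Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
Q-learning:  Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]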
Function approximation for value function
• DQN: training a deep network that represents the action value function
(loss sketched below)
https://towardsdatascience.com/deep-double-q-learning-7fca410b193a
• Select an action based on the output of the deep network
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/deep_q_learning.html
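A sketch of the DQN training loss with a separate target network (standard
form; theta^- denotes the target-network parameters and D the replay buffer,
both introduced here):
L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]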
Parametrized policy
• Representation of policy by parametric function
Example: linear function + softmax
• Policy gradient theorem (PGT), sketched below
• Applicable to continuous action spaces
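A sketch of the linear-softmax policy and of the policy gradient theorem
(standard forms; phi(s, a) denotes a state-action feature vector introduced
here for illustration):
\pi_\theta(a \mid s) = \frac{\exp\left( \theta^\top \phi(s, a) \right)}{\sum_{a'} \exp\left( \theta^\top \phi(s, a') \right)}
\nabla_\theta J(\theta) \propto \mathbb{E}_\pi\left[ q_\pi(S_t, A_t)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]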
Actor-critic methods
• Plugging a parametric value function, instead of the sampled reward, into
the policy gradient equation, with the n-step cumulative reward as the target
• The difference between the two is interpreted as the improvement in value
from taking the sampled action compared to the value under the current
policy (the advantage function)
• This update rule is used in advantage actor-critic (A2C); a sketch is
given below
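A sketch of an n-step advantage estimate and the corresponding actor update
(as in A2C; V_w denotes the parametric critic and alpha the step size, both
introduced here for illustration):
A_t = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V_w(S_{t+n}) - V_w(S_t)
\theta \leftarrow \theta + \alpha\, A_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)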
Deterministic policy gradient
• More sample-efficient than training a stochastic policy with the PGT
• Deterministic policy gradient theorem (DPGT) (Silver et al., 2014),
sketched below
• Applied to an extensive study of robot control with deep networks
(Lillicrap et al., 2015)
• Exploration with action noise or parameter noise for the deterministic
policy
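A sketch of the deterministic policy gradient theorem (Silver et al., 2014;
mu_theta denotes the deterministic policy and rho^mu its state distribution,
notation introduced here):
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a) \big|_{a = \mu_\theta(s)} \right]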
TRPO and PPO
• Ensuring improvement of the policy via a lower bound of the objective
• Trust region policy optimization (TRPO): maximize the lower bound within a
trust region around the current policy parameters (Schulman et al., 2015)
• Proximal policy optimization (PPO): a simple clipped objective instead of
a KL-divergence constraint (Schulman et al., 2017)
Derivation outline: expected reward; difference after a policy update;
rewritten with the state distribution under the current policy; lower bound
of the expected reward; constrained maximum; importance sampling; TRPO
objective (sketched below)
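A sketch of the resulting surrogate objectives (following Schulman et al.,
2015 and 2017; the ratio r_t(theta), the advantage estimate A_t, and the
constants delta and epsilon are notation introduced here):
TRPO:  \max_\theta\; \mathbb{E}_t\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\mathrm{old}}(a_t \mid s_t)}\, A_t \right]
       \quad \text{s.t.} \quad \mathbb{E}_t\left[ \mathrm{KL}\left( \pi_{\theta_\mathrm{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta
PPO:   \max_\theta\; \mathbb{E}_t\left[ \min\left( r_t(\theta)\, A_t,\; \mathrm{clip}\left( r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon \right) A_t \right) \right],
       \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\mathrm{old}}(a_t \mid s_t)}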
Framework of reinforcement learning
Markov decision process, value function, objective
Reinforcement learning methods
Action value function, function approximation, parametrized policy,
actor-critic method, advanced learning methods
Python modules
OpenAI Gym, PyBullet, TensorFlow Agents, OpenAI Baselines
Example
Installation with Python venv, running PyBullet, training with
TensorFlow Agents on a pendulum environment
OpenAI gym
• Provides the RL environment interface through a base class Env
class Env(object):
    """The main OpenAI Gym class."""

    def step(self, action):
        """Run one timestep of the environment's dynamics.
        Returns observation, reward, done and info.
        """
        raise NotImplementedError

    def reset(self):
        """Resets the state of the environment and returns an initial observation."""
        raise NotImplementedError

    def render(self, mode='human'):
        """Renders the environment."""
        raise NotImplementedError

    def seed(self, seed=None):
        """Sets the seed for this env's random number generator(s)."""
        logger.warn("Could not seed environment %s", self)
• Collection of environments to compare RL algorithms
• Minimal example interacting with CartPole-v0 environment
• python scripts/cartpole.py
# https://gym.openai.com/docs/#environments
import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action
• Show list of environments
from gym import envs
print(envs.registry.all())
#> [EnvSpec(DoubleDunk-v0), EnvSpec(InvertedDoublePendulum-v0),
# EnvSpec(BeamRider-v0), EnvSpec(Phoenix-ram-v0), EnvSpec(Asterix-v0),
# EnvSpec(TimePilot-v0), EnvSpec(Alien-v0), EnvSpec(Robotank-ram-v0),
# EnvSpec(CartPole-v0), EnvSpec(Berzerk-v0), EnvSpec(Berzerk-ram-v0),
# EnvSpec(Gopher-ram-v0), ...
OpenAI baselines
• Set of high-quality implementations of RL algorithms: A2C, ACER,
ACKTR, DDPG, DQN, GAIL, HER, PPO, TRPO
• There is a fork “Stable Baselines”: unified structure for algorithms,
PEP8 compliant, documented, more tests & more code coverage
TensorFlow Agents
• Optimized infrastructure for reinforcement learning
• Multiple parallel environments, batch PPO
• Validated with environments
PyBullet
• Python wrapper of Bullet physics simulator
• URDF/SDF support, forward dynamics, inverse kinematics, collision
check, 2D/depth cameras, virtual reality
• Used in simulation-to-real transfer
of controller for quadruped robot
(Tan et al., 2018)
Framework of reinforcement learning
Markov decision process, value function, objective
Reinforcement learning methods
Action value function, function approximation, parametrized policy,
actor-critic method, advanced learning methods
Python modules
OpenAI Gym, PyBullet, TensorFlow Agents, OpenAI Baselines
Example
Installation with Python venv, running PyBullet, training with
TensorFlow Agents on a pendulum environment
Installation and testing
• Python 3.6, venv
$ brew install cmake openmpi
$ cd $WORKDIR
$ python3 -m venv pybullet-env
$ source pybullet-env/bin/activate
$ pip install tensorflow==1.12
$ pip install gym==0.10.9
$ git clone https://github.com/openai/baselines.git
$ cd baselines
$ pip install -e .
$ cd ..
$ pip install pybullet==2.3.8
$ pip install ruamel-yaml==0.15.76
$ pip install stable-baselines==2.2.1
$ brew install ffmpeg # for making video
• Test pybullet and gym
$ cd $WORKDIR/pybullet-env/lib/python3.6/site-packages/pybullet_envs/examples
$ python kukaGymEnvTest.py
$ python kukaCamGymEnvTest.py # much slower
import pybullet as p
import time
import pybullet_data
physicsClient = p.connect(p.GUI)# or p.DIRECT for non-graphical version
p.setAdditionalSearchPath(pybullet_data.getDataPath()) #optionally
p.setGravity(0,0,-10)
planeId = p.loadURDF("plane.urdf")
cubeStartPos = [0,0,1]
cubeStartOrientation = p.getQuaternionFromEuler([0,0,0])
boxId = p.loadURDF("r2d2.urdf",cubeStartPos, cubeStartOrientation)
for i in range(10000):
    p.stepSimulation()
    time.sleep(1. / 240.)
cubePos, cubeOrn = p.getBasePositionAndOrientation(boxId)
print(cubePos,cubeOrn)
p.disconnect()
• python scripts/hello_pybullet.py
kukaGymEnvTest.py
kukaCamGymEnvTest.py
# Gym environment
environment = KukaGymEnv(renders=True, isDiscrete=False)
# GUI
motorsIds=[]
dv = 0.01
motorsIds.append(environment._p.addUserDebugParameter("posX",-dv,dv,0))
motorsIds.append(environment._p.addUserDebugParameter("posY",-dv,dv,0))
motorsIds.append(environment._p.addUserDebugParameter("posZ",-dv,dv,0))
motorsIds.append(environment._p.addUserDebugParameter("yaw",-dv,dv,0))
motorsIds.append(environment._p.addUserDebugParameter("fingerAngle",0,0.3,.3))
done = False
while not done:
    action = []
    for motorId in motorsIds:
        action.append(environment._p.readUserDebugParameter(motorId))
    # 1 step forward
    state, reward, done, info = environment.step(action)
    obs = environment.getExtendedObservation()  # Get more state info
kukaGymEnvTest.py
Note: code modified from the original for comparison
# Gym environment with cameras
environment = KukaCamGymEnv(renders=True, isDiscrete=False)
# GUI
motorsIds=[]
dv = 1
motorsIds.append(environment._p.addUserDebugParameter("posX",-dv,dv,0))
motorsIds.append(environment._p.addUserDebugParameter("posY",-dv,dv,0))
motorsIds.append(environment._p.addUserDebugParameter("posZ",-dv,dv,0))
motorsIds.append(environment._p.addUserDebugParameter("yaw",-dv,dv,0))
motorsIds.append(environment._p.addUserDebugParameter("fingerAngle",0,0.3,.3))
done = False
while not done:
    action = []
    for motorId in motorsIds:
        action.append(environment._p.readUserDebugParameter(motorId))
    # 1 step forward
    state, reward, done, info = environment.step(action)
    obs = environment.getExtendedObservation()  # Get more state info
kukaCamGymEnvTest.py
Note: code modified from the original for comparison
Run PPO for pendulum environment
$ python -m pybullet_envs.agents.train_ppo --config=pybullet_pendulum --logdir=pendulum
# In another terminal
$ tensorboard --logdir=pendulum --port=2222
• PyBullet + OpenAI Gym + TensorFlow Agents
• Train a policy for the pendulum environment with PPO
• Save the results as TensorFlow output files (logs, model)
• Learning curve (mean score)
Create video for an episode with the trained policy
$ python -m pybullet_envs.agents.visualize_ppo \
    --logdir=pendulum/xxxx-pybullet_pendulum/ --outdir=pendulum_video
Configurations
• Environment parameters
pybullet-env/lib/python3.6/site-packages/pybullet_envs/agents/configs.py
def pybullet_pendulum():
    locals().update(default())
    env = 'InvertedPendulumBulletEnv-v0'
    max_length = 200
    steps = 5e7  # 50M
    return locals()
• Register gym-compatible pybullet environment
pybullet-env/lib/python3.6/site-packages/pybullet_envs/__init__.py
register(
    id='InvertedPendulumBulletEnv-v0',
    entry_point='pybullet_envs.gym_pendulum_envs:InvertedPendulumBulletEnv',
    max_episode_steps=1000,
    reward_threshold=950.0,
)
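Once registered, the environment can be created through the standard Gym
factory; a minimal usage sketch (assuming the packages installed earlier, and
that importing pybullet_envs triggers the registration shown above):
import gym
import pybullet_envs  # importing the package registers the Bullet environments

env = gym.make('InvertedPendulumBulletEnv-v0')
obs = env.reset()
# One step with a random action; step() returns (observation, reward, done, info)
obs, reward, done, info = env.step(env.action_space.sample())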
Configurations
pybullet-env/lib/python3.6/site-packages/pybullet_envs/agents/configs.py
def default():
    """Default configuration for PPO."""
    # General
    algorithm = ppo.PPOAlgorithm
    num_agents = 30
    eval_episodes = 30
    use_gpu = False
    # Network
    network = networks.feed_forward_gaussian
    weight_summaries = dict(
        all=r'.*',
        policy=r'.*/policy/.*',
        value=r'.*/value/.*')
    policy_layers = 200, 100
    value_layers = 200, 100
    init_mean_factor = 0.1
    init_logstd = -1
    # Optimization
    update_every = 30
    update_epochs = 25
    optimizer = tf.train.AdamOptimizer
    update_epochs_policy = 64
    update_epochs_value = 64
    learning_rate = 1e-4
    # Losses
    discount = 0.995
    kl_target = 1e-2
    kl_cutoff_factor = 2
    kl_cutoff_coef = 1000
    kl_init_penalty = 1
    return locals()
• Algorithm (PPO) parameters
Gym-compatible pendulum environment
pybullet-env/lib/python3.6/site-packages/pybullet_envs/gym_pendulum_envs.py
• Make a subclass of Env
# Modified from the original code to use the PyBullet API directly,
# to explain the basic idea
class InvertedPendulumBulletEnv:

    def __init__(self):
        # Load robot model, wrap interfaces
        self.robot = InvertedPendulum()

    def _step(self, a):
        # self.robot.apply_action(a)
        self._p.setJointMotorControl2(self.robot.objects[0], jointIndex=index_slider,
                                      controlMode=self._p.TORQUE_CONTROL, force=a)
        self._p.stepSimulation()
        # Return value is (x, y, z), (a, b, c, d)
        pos, orn = self._p.getBasePositionAndOrientation(self.robot.objects[0])
        state = pos + orn  # concatenate position and orientation into one state tuple
        if self.robot.swingup:
            reward = np.cos(self.robot.theta)
            done = False
        else:
            reward = 1.0
            done = np.abs(self.robot.theta) > .2
        return state, reward, done, {}
• python scripts/hello_stable_baselines.py
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2
env = gym.make('CartPole-v1')
# The algorithms require a vectorized environment to run
env = DummyVecEnv([lambda: env])
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()