OpenAI has used Reinforcement Learning to train a humanoid robotic hand to rotate a cube into any desired orientation. The work is described in arXiv:1808.00177 and in the blog post <openai.com/blog/learning-dexterity/>. These slides present results from the paper along with a few important concepts in reinforcement learning that I learned from several other sources.
3. Introduction
▪ Research in the control of robotic devices has broad application
across many sectors
▪ Prior methods have trained and tested either entirely in
simulation or entirely on physical robots
▪ However, policies trained in simulation do not transfer to the
real world with sufficient accuracy, while training on physical
robots requires years of experience to perform satisfactorily
▪ In this study, training is carried out on simulated robots, and the
policies learned in the process are deployed on a physical robot
▪ Since the robot is given no explicit instructions on how to
perform an action, the problem of completing pre-defined tasks is
well-suited to Reinforcement Learning (RL)
4. Goal
▪ To train a robotic hand, the ShadowHand, in dexterous manipulation
of an object such as a block
▪ 24 joints involving 20 actuated degrees of
freedom and 4 under-actuated movements
▪ PhaseSpace sensors capture fingertip motion
▪ Sensors record relative angles between joints
▪ RGB cameras used for pose estimation
▪ Touch sensors in the hand not used
▪ Simulation of the Hand done with MuJoCo physics engine
▪ The model of the Hand is based on the robotics environments of
OpenAI Gym, a toolkit for developing Reinforcement Learning (RL)
algorithms
▪ Rendering of simulations carried out with Unity
[Figures: ShadowHand holding a bulb; all the joints of the ShadowHand]
5. Reinforcement Learning
▪ RL trains an agent in some environment to take an action in a given
state resulting in a new state and a reward from the environment,
with the aim to maximize the cumulative reward
For the ShadowHand robot:
▪ State is a 60D space describing the angles and velocities of all
Hand joints and the position, orientation, and velocities of the
object in hand.
▪ Goal is to achieve the desired orientation with an accuracy of 23°
▪ Action is a 20D space corresponding to desired angles of Hand
joints. Each coordinate is discretized and specified relative to
current joint angle, and rescaled to the range [-1,1]
▪ Reward at time-step $t$ is $r_t = d_t - d_{t+1}$, where $d_t$ is the
rotation angle between the desired and current orientation before
the transition and $d_{t+1}$ is the angle after the transition
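To make the loop concrete, here is a minimal sketch of one training episode with this reward; `env`, `policy`, and the quaternion field names are hypothetical stand-ins, not OpenAI's actual interfaces:

```python
import numpy as np

def rotation_distance(q_a, q_b):
    """Rotation angle (radians) between two unit quaternions."""
    dot = abs(np.dot(q_a, q_b))                   # angle = 2 * arccos(|<q_a, q_b>|)
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))

def run_episode(env, policy, max_steps=400):
    """Roll out one episode; reward is the decrease in rotation distance."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                    # 20-D relative joint targets in [-1, 1]
        next_state, done = env.step(action)
        d_t  = rotation_distance(state["object_quat"], state["goal_quat"])
        d_t1 = rotation_distance(next_state["object_quat"], next_state["goal_quat"])
        total_reward += d_t - d_t1                # r_t = d_t - d_{t+1}
        state = next_state
        if done:                                  # goal sequence finished or object dropped
            break
    return total_reward
```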
6. ▪ Policy is a function that maps the
agent's state to an action
▪ Value Function describes how good
is the agent’s state or action, and is
used to predict future rewards
▪ Model is the agent's representation
of the environment
▪ To choose the actions that yield the highest possible reward,
RL agents are typically categorized as value-based (dynamic
programming), where they follow a value function without an
explicit policy, or policy-based (policy optimization), where they
follow a policy without an explicit value function
▪ The Actor-Critic approach combines the two and tries to get the
best of both
7. Actor-Critic Approach
▪ The Hand is trained in simulation, where there is full access to
the Hand's state and environment
▪ Ideally, for the physical robot to do as well as in simulation, it
would need the same full access to the Hand's state and
environment, which is infeasible in a real-world setup
▪ Thus we cannot rely on a policy that requires this full state
▪ Therefore we have the asymmetric Actor-Critic approach, where
▪ in simulation, the Critic takes the full state as input, so its
value estimates (and hence training) converge much faster
▪ in the real world, the Actor acts from partial observations only
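A minimal sketch of this asymmetric setup, assuming illustrative dimensions and a plain feed-forward stack rather than the paper's LSTM architecture:

```python
import torch.nn as nn

FULL_STATE_DIM = 60   # full simulator state (slide 5); available to the critic only
OBS_DIM = 30          # hypothetical partial observation available on the real robot
ACTION_DIM = 20       # desired joint angles

# Actor: sees only what the physical robot can sense
actor = nn.Sequential(
    nn.Linear(OBS_DIM, 128), nn.ReLU(),
    nn.Linear(128, ACTION_DIM), nn.Tanh(),   # outputs rescaled to [-1, 1]
)

# Critic: sees the full simulator state, so value estimates are learned faster
critic = nn.Sequential(
    nn.Linear(FULL_STATE_DIM, 128), nn.ReLU(),
    nn.Linear(128, 1),                       # scalar value estimate
)
```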
▪ To make the policy and vision generalize to reality, Domain
Randomization exposes training to a large variety of randomized
experiences instead of an accurate model of the real world
▪ Randomizations over mass, dimensions, friction, noise, colour,
motor backlash, vision, etc., are carried out
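A sketch of how per-episode randomization might look; the parameter names and ranges are illustrative, not the paper's calibrated values:

```python
import numpy as np

def sample_randomization(rng: np.random.Generator) -> dict:
    """Draw one 'possible world' for the simulator at the start of an episode."""
    return {
        "object_mass_scale": rng.uniform(0.5, 1.5),
        "object_size_scale": rng.uniform(0.95, 1.05),
        "friction_scale":    rng.uniform(0.7, 1.3),
        "motor_backlash":    rng.uniform(0.0, 0.1),
        "obs_noise_std":     rng.uniform(0.0, 0.02),
        "light_position":    rng.uniform(-1.0, 1.0, size=3),
    }

rng = np.random.default_rng(0)
episode_params = sample_randomization(rng)  # applied to the simulator before each episode
```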
8. Generalized Advantage Estimator (GAE)
▪ In policy gradient (PG) methods, the aim is to maximize the expected
return; the gradient of the objective takes the form
$\mathbb{E}[\nabla \log \pi(a_t \mid s_t)\, f(x)]$, where $f(x)$ is a
value function and $\mathbb{E}$ denotes the expectation operator
▪ To simplify the calculation of future rewards under policy $\pi$, we use a
discount factor $\gamma$ ($0 < \gamma < 1$) and define the value functions as
state-value function: $V^\pi(s) = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i r_i \,\middle|\, s_0 = s\right]$
action-value function: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i r_i \,\middle|\, s_0 = s, a_0 = a\right]$
▪ The advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ then tells us
how much better an action is than the one prescribed by the policy alone
▪ Often, the value function at time $t$ must be estimated as
$V_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_i = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$, which can be written as
$V_t = r_t + \gamma V_{t+1}$ or $V_t = r_t + \gamma r_{t+1} + \gamma^2 V_{t+2}$;
in general, $V_t^{(k)} = \sum_{i=t}^{t+k-1} \gamma^{i-t} r_i + \gamma^k V(s_{t+k}) \approx Q^\pi(s_t, a_t)$
is the $k$-step return estimator
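A small numeric sketch of the $k$-step return, with illustrative reward and value arrays:

```python
import numpy as np

def k_step_return(rewards, values, t, k, gamma=0.99):
    """Sum k discounted rewards from step t, then bootstrap with V(s_{t+k})."""
    ret = sum(gamma ** (i - t) * rewards[i] for i in range(t, t + k))
    return ret + gamma ** k * values[t + k]

rewards = np.array([1.0, 0.5, 0.2, 0.1])
values  = np.array([2.0, 1.5, 1.0, 0.5, 0.2])    # one extra entry for bootstrapping
print(k_step_return(rewards, values, t=0, k=3))  # 1.0 + 0.99*0.5 + 0.99^2*0.2 + 0.99^3*0.5
```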
9. ▪ Now, the $k$-step advantage estimator is defined as
$\hat{A}_t^{(k)} = \sum_{i=t}^{t+k-1} \gamma^{i-t} r_i + \gamma^k V(s_{t+k}) - V(s_t) = V_t^{(k)} - V(s_t)$
where $V(s_t)$ is the baseline, which lowers the estimate in the event of
bad actions
▪ The Generalized Advantage Estimator (GAE) is then defined as the
exponentially weighted average of the $k$-step estimators,
$\hat{A}_t^{GAE} = (1 - \lambda)\left(\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \cdots\right)$, simplified to
$\hat{A}_t^{GAE} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^V$
where $\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual term
▪ Using $\hat{A}_t^{GAE}$, advantage (and hence value) estimates can be
formed for all the states in an episode.
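A sketch of computing GAE backwards over one episode from the TD residuals; the $\gamma$ and $\lambda$ values are illustrative defaults, not the paper's exact hyperparameters:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """A_t = delta_t + (gamma*lam) * A_{t+1}, delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).

    `values` must hold T+1 entries: one per state plus a bootstrap value
    for the state after the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = np.array([0.10, 0.30, -0.05, 0.20])
values  = np.array([1.0, 0.9, 0.8, 0.7, 0.6])
adv = compute_gae(rewards, values)
value_targets = adv + values[:-1]   # regression targets for the value network
```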
10. Proximal Policy Optimization (PPO)
▪ A standard PG method typically performs one gradient update in
the policy direction for every data sample
▪ The maximization objective can be represented as a loss function,
$L^{PG}(\theta) = \mathbb{E}\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t^{GAE}\right]$, where the policy $\pi$ is
parameterized by $\theta$ (e.g. the weights of a neural network)
▪ If $\theta_{old}$ is the vector of policy parameters before an update, then
$r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$
is the probability ratio of taking a given action under the current
policy to taking the same action under the old policy.
▪ The loss function can now be modified as
$L^{PPO}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\, \hat{A}_t^{GAE},\ \mathrm{clip}\left(r_t(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\right) \hat{A}_t^{GAE}\right)\right]$
where the clip function keeps $r_t(\theta)$ between $1 - \varepsilon$ and $1 + \varepsilon$
to prevent an excessively large update to the policy, with $\varepsilon$ being a
hyperparameter, usually about 0.2
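A sketch of this clipped objective for one batch of transitions; the tensor names are illustrative, and the log-probabilities are assumed to have been stored at rollout time:

```python
import torch

def ppo_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Negative PPO surrogate: maximize min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    ratio = torch.exp(log_probs_new - log_probs_old)          # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # minimize the negation
```

The elementwise `min` means a gradient step can never profit from pushing the ratio far outside $[1 - \varepsilon, 1 + \varepsilon]$, which is what keeps each policy update "proximal".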
11. Methodology
▪ A pool of 384 rollout workers, each with 16 CPU cores, is used,
while optimization is performed on a single machine with 8 GPUs
▪ Each worker runs the current version of the policy on a sample
drawn from the distribution of randomizations
▪ States are observed and actions determined by the policy network,
while returns are predicted by the value network. Together these
make up the PPO agent. The two networks have the same architecture
(LSTM) but independent parameters.
▪ An episode ends when 50 successive orientations are achieved,
when the policy fails to achieve the desired orientation within 8 s,
or when the object is dropped
▪ For better transfer to the real world, the simulated object pose is
determined from rendered images by a pose-estimator CNN;
3 RGB cameras are used on the physical robot for this
12. ▪ A distributed infrastructure is used
during training with the rollout workers
▪ Workers randomly connect to a
Redis server through which the policy
parameters are communicated
▪ Experiences are sent from Redis to
the GPUs through a buffer
▪ Gradients are computed locally on each
GPU before MPI averages them
across all threads to update the
network parameters
▪ The policy network (left) and value
network (right) determine actions and
predict returns, respectively
▪ A normalization block ensures uniform
mean and std. dev. for all observations
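A minimal sketch of such a normalization block, tracking running statistics online (a Welford-style update; the implementation details are assumptions, not the paper's code):

```python
import numpy as np

class RunningNorm:
    """Track per-coordinate running mean/variance and whiten observations."""
    def __init__(self, dim, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps          # avoids division by zero before the first update

    def update(self, x):
        self.count += 1.0
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)
```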
13. Results
▪ The ShadowHand policy learns several grasping and manipulation
strategies without any incentivization or demonstration
▪ Grasps observed in human adults were rediscovered and adapted to
the Hand's limitations and strengths
▪ PhaseSpace trackers on fingers perform better than vision-based
pose estimation in both simulation and real world
▪ A policy learned on a cube, when applied to a differently shaped
object, performs much better in simulation than in the real world
14. ▪ Randomized training performs better in
the real world, with a median of 13 rotations
▪ Without any randomization, the median
number of rotations achieved drops to 0
▪ Median rotations with PhaseSpace tracking (13)
and vision tracking (11.5) are
comparable after randomized training
▪ Training the Hand with all randomizations
requires more time
▪ Training with memory enables the Hand to
achieve more rotations faster
▪ Keeping the batch size per GPU
fixed, 16 GPUs and 12,288
rollout CPU cores is found to be the optimum
▪ Markers for object orientation are
not always feasible in the real world
▪ However, the prediction error in
orientation in the real world is still
smaller than the noise in the observations
15. Conclusions
▪ The success is mainly due to (1) domain randomizations, (2) policy
with memory (LSTM), and (3) large scale distributed RL
▪ Although the Hand is equipped with tactile and pressure sensors,
these were used neither in simulation nor in the real world,
because a lower-dimensional state space is easier to model
▪ Only a solid cube was used in simulation, but the policies were
general enough to be applied to other objects in the real world,
though with lower accuracy
▪ This work demonstrates that current RL algorithms can be used
effectively for real-world problems
16. References
Literature
▪ OpenAI, Andrychowicz M., et al., ‘Learning Dexterous In-Hand Manipulation’, arXiv
preprint arXiv:1808.00177, 2018
▪ Schulman J., Moritz P., Levine S., Jordan M. & Abbeel P., ‘High-Dimensional Continuous
Control using Generalized Advantage Estimation’, arXiv preprint arXiv:1506.02438, 2015
▪ Schulman J., Wolski F., Dhariwal P., Radford A. & Klimov O., ‘Proximal Policy
Optimization Algorithms’, arXiv preprint arXiv:1707.06347, 2017
▪ Mnih V., Kavukcuoglu K., et al., ‘Human-level Control through Deep Reinforcement
Learning’, Nature, 2015, 518, p. 529
Blogs
▪ openai.com/blog/learning-dexterity/
▪ karpathy.github.io/2016/05/31/rl/
▪ openai.com/blog/openai-baselines-ppo/
▪ openai.com/five/
YouTube
▪ RL course by David Silver (youtu.be/2pWv7GOvuf0)
▪ John Schulman: Deep Reinforcement Learning (youtu.be/aUrX-rP_ss4)
▪ Arxiv Insights (youtu.be/JgvyzIkgxF0)