Dexterous In-hand Manipulation by OpenAI
presentation by Anand D Joshi
Outline
▪ Introduction
▪ Goal
   – the humanoid hand, ShadowHand
▪ Reinforcement Learning
   – Actor-Critic Approach
   – Proximal Policy Optimization
   – Generalized Advantage Estimator
▪ Methodology
▪ Results
▪ Conclusions
Introduction
▪ Research in control of robotic devices is a subject with great
application across a number of sectors
▪ Prior methods have trained and tested either entirely in simulation or
entirely on physical robots
▪ However, simulations do not transfer with sufficient accuracy to the
real world, while training on physical robots requires years of
experience to perform satisfactorily
▪ In this study, training is carried out on simulated robots, and the
policies learned in the process are deployed on a physical robot
▪ Without explicit instructions to the robot on how to perform an
action, the problem of completing pre-defined tasks is well-suited
for Reinforcement Learning (RL)
Goal
▪ To train a robotic hand, ShadowHand, in dexterous manipulation of
an object, like a block
▪ 24 joints with 20 actuated degrees of
freedom and 4 under-actuated joints
▪ PhaseSpace sensors capture fingertip motion
▪ Sensors record relative angles between joints
▪ RGB cameras used for pose estimation
▪ Touch sensors in the hand not used
▪ Simulation of the Hand done with MuJoCo physics engine
▪ The model of the Hand is based on the robotics environments of OpenAI Gym,
a toolkit for developing Reinforcement Learning (RL) algorithms
▪ Rendering of simulations carried out with Unity
[Figures: ShadowHand holding a bulb; all the joints of the ShadowHand]
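As a concrete point of reference, the simulated Hand can be exercised through the standard Gym interface. The snippet below is a minimal sketch only: it assumes an older Gym release that ships the robotics suite, that the environment ID is "HandManipulateBlock-v0", and that step() returns the classic 4-tuple; all of these may differ between versions.

```python
# Minimal interaction loop with the simulated Hand (sketch; environment ID
# and the (obs, reward, done, info) return signature are assumptions).
import gym

env = gym.make("HandManipulateBlock-v0")
obs = env.reset()
for _ in range(200):
    action = env.action_space.sample()          # random 20-D action in [-1, 1]
    obs, reward, done, info = env.step(action)  # new state and reward
    if done:
        obs = env.reset()
env.close()
```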
Reinforcement Learning
▪ RL trains an agent that, in a given state of its environment, takes an
action; the environment returns a new state and a reward, and the agent
aims to maximize the cumulative reward
For the ShadowHand robot:
▪ State is a 60D space describing angles and velocities of all Hand
joints and position, orientation and velocities of object in hand.
▪ Goal is to achieve the desired orientation with an accuracy of 23°
▪ Action is a 20D space corresponding to desired angles of Hand
joints. Each coordinate is discretized and specified relative to
current joint angle, and rescaled to the range [-1,1]
▪ Reward at time-step t is r_t = d_t − d_{t+1}, where d_t is the rotation
angle between the desired and current orientation before the transition
and d_{t+1} is the angle after the transition (see the sketch below)
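A minimal sketch of this reward, assuming object orientations are given as unit quaternions and using a hypothetical helper rotation_angle for the angular distance:

```python
import numpy as np

def rotation_angle(q_a, q_b):
    """Smallest rotation angle (radians) between two unit quaternions
    (hypothetical helper; any angular-distance routine would do)."""
    dot = abs(float(np.dot(q_a, q_b)))
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))

def orientation_reward(goal_q, obj_q_before, obj_q_after):
    """r_t = d_t - d_{t+1}: positive when the transition brings the object
    closer to the desired orientation, negative when it moves away."""
    d_t  = rotation_angle(goal_q, obj_q_before)   # angle before the transition
    d_t1 = rotation_angle(goal_q, obj_q_after)    # angle after the transition
    return d_t - d_t1
```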
▪ Policy is a function that maps a state
to an action (the environment then produces the next state and reward)
▪ Value Function describes how good
is the agent’s state or action, and is
used to predict future rewards
▪ Model is the agent’s representation
of environment
▪ To choose the actions that yield the highest reward, RL agents are
typically categorized as value-based (dynamic programming), which follow
a value function without an explicit policy, or policy-based (policy
optimization), which follow a policy without an explicit value function
▪ The Actor-Critic approach combines the two and tries to get the best
of both
Actor-Critic Approach
▪ The Hand is trained in simulation, where the trainer has full access to
the Hand's state and the environment
▪ Ideally, for the physical robot to do as well as in simulation, it would
need the same full access to state and environment, which is infeasible
in a real-world setup
▪ Thus a policy that depends on the full simulated state cannot be deployed directly
▪ Hence the Actor-Critic approach, where
▪ in simulation, the Critic (value network) takes the full state as
input, which lets it learn value estimates much faster
▪ in the real world, the Actor (policy) relies only on the partial
observations that are also available there
▪ To help the policy and the vision model generalize to reality, Domain
Randomization exposes training to a large variety of randomized
experiences rather than attempting an accurate model of the real world
▪ Randomizations over mass, dimensions, friction, observation noise, colour,
motor backlash, vision, etc. are carried out (see the sketch below)
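A toy sketch of how such a randomization sample might be drawn; the parameter names and ranges below are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def sample_randomization(rng: np.random.Generator):
    """Draw one randomized configuration for a rollout worker's simulator.
    All names and ranges here are illustrative, not the paper's values."""
    return {
        "object_mass_scale":  rng.uniform(0.5, 1.5),    # scale nominal mass
        "object_size_scale":  rng.uniform(0.95, 1.05),  # scale dimensions
        "friction_scale":     rng.uniform(0.7, 1.3),
        "motor_backlash":     rng.uniform(0.0, 0.1),
        "observation_noise":  rng.uniform(0.0, 0.02),   # added to sensor readings
        "camera_hue_shift":   rng.uniform(-0.1, 0.1),   # vision randomization
    }

# Each rollout worker applies a fresh sample to its simulator before an episode.
params = sample_randomization(np.random.default_rng(0))
```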
Generalized Advantage Estimator (GAE)
▪ In policy gradient (PG) methods, the aim is to maximize the expected
return by ascending its gradient, E[ ∇ log π(a_t | s_t) · f(x) ], where f(x) is a
value-like function and E denotes the expectation operator
▪ To simplify the calculation of future rewards under policy π, we use a
discount factor γ (0 < γ < 1) and define the value functions as
     state-value function:   V^π(s) = E[ Σ_{i=0}^{∞} γ^i r_i | s_0 = s ]
     action-value function:  Q^π(s, a) = E[ Σ_{i=0}^{∞} γ^i r_i | s_0 = s, a_0 = a ]
▪ The advantage function, A^π(s, a) = Q^π(s, a) − V^π(s), then tells us how
much better taking action a is than acting according to the policy alone
▪ Often, the value function at time t needs to be estimated as
     V̂_t = Σ_{i=t}^{∞} γ^{i−t} r_i = r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯
  which can be written recursively as V̂_t = r_t + γ V̂_{t+1}, or
  V̂_t = r_t + γ r_{t+1} + γ^2 V̂_{t+2}, and in general
     V̂_t^{(k)} = Σ_{i=t}^{t+k−1} γ^{i−t} r_i + γ^k V(s_{t+k}) ≈ V^π(s_t)
  is the k-step return estimator (sketched below)
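A direct transcription of the k-step return estimator, as a sketch; rewards[i] holds r_i and values[i] the critic's estimate V(s_i).

```python
def k_step_return(rewards, values, t, k, gamma):
    """V_hat_t^(k) = sum_{i=t}^{t+k-1} gamma^(i-t) * r_i + gamma^k * V(s_{t+k})."""
    discounted = sum(gamma ** (i - t) * rewards[i] for i in range(t, t + k))
    return discounted + gamma ** k * values[t + k]
```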
▪ Now, the k-step advantage estimator is defined as
     Â_t^{(k)} = Σ_{i=t}^{t+k−1} γ^{i−t} r_i + γ^k V(s_{t+k}) − V(s_t) = V̂_t^{(k)} − V(s_t)
  where V(s_t) is the baseline; actions that return less than the baseline
  receive a negative advantage
▪ The Generalized Advantage Estimator (GAE) is then defined as the
exponentially weighted average of the k-step estimators,
     Â_t^{GAE} = (1 − λ) ( Â_t^{(1)} + λ Â_t^{(2)} + λ^2 Â_t^{(3)} + ⋯ )
  which simplifies to
     Â_t^{GAE} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}^V
  where δ_t^V = r_t + γ V(s_{t+1}) − V(s_t) is the TD residual
▪ Using Â_t^{GAE}, it is possible to estimate the advantages, and hence
value targets, for all the states in an episode (see the sketch below)
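A compact sketch of computing Â_t^{GAE} for every step of one episode by running the recursion backwards. The γ and λ values are illustrative, and values is assumed to carry one extra entry for the state after the last reward (0 if the episode terminated).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimator over a single episode.
    rewards: length-T array of r_t; values: length-(T+1) array of V(s_t)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # (gamma*lam)-weighted sum
        adv[t] = running
    value_targets = adv + values[:T]    # usable as targets for the value network
    return adv, value_targets
```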
Proximal Policy Optimization (PPO)
▪ A standard PG method typically performs one gradient update in
the policy direction for every data sample
▪ The maximization objective can be represented as a loss function,
     L^{PG}(θ) = E[ log π_θ(a_t | s_t) Â_t^{GAE} ]
  where the policy π is parameterized by θ (e.g. the weights of a neural
  network) and the policy gradient is ∇_θ L^{PG}
▪ If θ_old is the vector of policy parameters before an update, then
     r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)
  is the probability ratio of taking a given action under the current
  policy to taking it under the old policy
▪ The loss function can now be modified as
     L^{PPO}(θ) = E[ min( r_t(θ) Â_t^{GAE},  clip(r_t(θ), 1 − ε, 1 + ε) Â_t^{GAE} ) ]
  where the clip function keeps r_t(θ) between 1 − ε and 1 + ε to prevent
  an excessively large policy update, with ε being a hyperparameter,
  usually about 0.2 (see the sketch below)
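A sketch of evaluating the clipped surrogate on a batch of timesteps. It uses NumPy only, so it illustrates the objective rather than a trainable implementation; in practice the same expression would be written in an autodiff framework.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate L^PPO (to be maximized).
    logp_new / logp_old: log pi(a_t|s_t) under the current and old policies."""
    ratio = np.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()               # empirical expectation
```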
Methodology
▪ A pool of 384 rollout workers, each with 16 CPU cores, is used,
while optimization is performed on a single machine with 8 GPUs
▪ Each worker runs the current version of the policy on a sample from the
distribution of randomizations
▪ States are observed and actions determined by the policy network,
while returns are predicted by the value network; together these two
make up the PPO learner. The two networks have the same architecture
(LSTM) but independent parameters.
▪ An episode ends when 50 consecutive goal orientations are achieved,
when the policy fails to reach the desired orientation within 8 s, or
when the object is dropped (sketched below)
▪ For better transfer to the real world, the simulated object pose is
determined from rendered images by a pose-estimator CNN;
three RGB cameras are used on the physical robot for this
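The termination rule, written out as a small predicate; this is a sketch, the variable names are assumptions, and the thresholds are the ones stated above.

```python
def episode_done(consecutive_successes, seconds_on_current_goal, object_dropped,
                 max_successes=50, timeout_s=8.0):
    """True when the episode should end: 50 consecutive goals reached,
    8 s elapsed without reaching the current goal, or the object dropped."""
    return (consecutive_successes >= max_successes
            or seconds_on_current_goal > timeout_s
            or object_dropped)
```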
▪ Distributed infrastructure during
training of rollout workers
▪ Workers connect to a randomly chosen
Redis server, through which the current
policy parameters are communicated
▪ Experiences are sent from Redis to
GPU through a buffer
▪ Gradients are computed locally on each
GPU and then averaged across all threads
with MPI to update the network parameters
▪ The policy network (left) and value
network (right) determine actions and
predict returns, respectively
▪ A normalization block ensures uniform mean
and standard deviation for all observations (see the sketch below)
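A sketch of what such a normalization block could look like: a running mean/variance tracker updated from batches of observations. The update rule is the standard parallel-variance formula; the class and method names are assumptions.

```python
import numpy as np

class RunningNormalizer:
    """Tracks a running mean/std per observation dimension and rescales
    incoming observations so all inputs share a comparable scale."""
    def __init__(self, dim, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps

    def update(self, batch):                       # batch shape: (N, dim)
        b_mean, b_var, n = batch.mean(0), batch.var(0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + n
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + b_var * n
                    + delta ** 2 * self.count * n / total) / total
        self.count = total

    def normalize(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)
```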
Results
▪ The ShadowHand policy learns several grasping and manipulation
strategies without any incentivization or demonstrations
▪ Grasps observed in human adults were rediscovered and adapted to
the Hand's limitations and strengths
▪ PhaseSpace trackers on the fingers perform better than vision-based
pose estimation in both simulation and the real world
▪ A policy learned on a cube, when applied to differently shaped objects,
performs much better in simulation than in the real world
▪ Randomized training performs better in
the real world, with a median of 13 rotations
▪ Without any randomization, the median
number of rotations achieved drops to 0
▪ Median rotations with PhaseSpace (13)
and vision tracking (11.5) are
comparable after randomized training
[Plots: training the Hand with all randomizations requires more time;
training with memory enables the Hand to achieve more rotations faster]
▪ Keeping the batch size per GPU
fixed, having 16 GPUs and 12,288
rollout CPU cores is the optimum
▪ Markers for tracking object orientation
are not always practical in the real world
▪ However, the prediction error in
orientation in the real world is still
smaller than the noise in the observations
Conclusions
▪ The success is mainly due to (1) domain randomizations, (2) policy
with memory (LSTM), and (3) large scale distributed RL
▪ Although the Hand is equipped with tactile and pressure sensors, they
were used neither in simulation nor in the real world, because a
lower-dimensional state space is easier to model
▪ Only a solid cube was used in simulation, yet the policies were
general enough to be applied to other objects in the real world,
albeit with lower accuracy
▪ This work demonstrates that current RL algorithms can be used
effectively for real-world problems
References
Literature
▪ OpenAI, Andrychowicz M., et al., ‘Learning Dexterous In-Hand Manipulation’, arXiv
preprint arXiv:1808.00177, 2019
▪ Schulman J., Moritz P., Levine S., Jordan M. & Abbeel P., ‘High-Dimensional Continuous
Control using Generalized Advantage Estimation’, arXiv preprint arXiv:1506.02438, 2015
▪ Schulman J., Wolski F., Dhariwal P., Radford A. & Klimov O., ‘Proximal Policy
Optimization Algorithms’, arXiv preprint arXiv:1707.06347, 2017
▪ Mnih V., Kavukcuoglu K., et al., ‘Human-level Control through Deep Reinforcement
Learning’, Nature, 2015, 518, p. 529
Blogs
▪ openai.com/blog/learning-dexterity/
▪ karpathy.github.io/2016/05/31/rl/
▪ openai.com/blog/openai-baselines-ppo/
▪ openai.com/five/
YouTube
▪ RL course by David Silver (youtu.be/2pWv7GOvuf0)
▪ John Schulman: Deep Reinforcement Learning (youtu.be/aUrX-rP_ss4)
▪ Arxiv Insights (youtu.be/JgvyzIkgxF0)