Autonomy Incubator Seminar Series: Tractable Robust Planning and Model Learning Under Uncertainty
1. Tractable Robust Planning and Model Learning Under Uncertainty
Jonathan P. How
Aerospace Controls Laboratory
MIT
jhow@mit.edu
March 17th, 2014
2. Autonomous Systems: Opportunity
• New era of information and data availability
– Need efficient data interpretation and information extraction:
"big data" and "data-to-decisions"
– Applies in many domains, including transportation, environment,
ocean exploration, and healthcare
• Maturing vehicle GNC raises new challenges in mission
design for heterogeneous manned and autonomous assets
• Cost savings and throughput demands are driving rapid infusion
of robotic technologies into airframe
manufacturing/maintenance
– DARPA driving a paradigm shift in rapid prototyping and manufacturing
• Rapid progress on the policy issues of integrating
autonomous systems into society
– Google car, manufacturing
3. Example: Driving with Uncertainty
• Goal: Improve road safety for urban driving
• Challenge: World is complex & dynamic
– Must safely avoid many types of uncertain
static and dynamic obstacles
– Must accurately anticipate other vehicles'
intents and assess the danger involved
• Reliable Autonomy for Transportation Systems:
– Inference/navigation in dynamic and unstructured
environments — GPS-denied navigation
– Provably-correct, real-time planning
– Safety & probabilistic risk assessment
– Learning — model and policy learning
– Shaping autonomy for use by human operators
[Images: navigating busy intersections; DGC '07 MIT/Cornell accident]
4. Planning Without Learning
J. Leonard, J. How, S. Teller, M. Berger, S. Campbell, G. Fiore, L. Fletcher, E. Frazzoli, A. Huang,
S. Karaman, et al., "A perception-driven autonomous urban vehicle," Springer, 2009.
5. Example: UAV Turing Test
• Challenge: Autonomous operation at an uncontrolled airport
– The UAV must approach an uncontrolled airport, integrate into the
traffic pattern, and land in a way that is indistinguishable from a
human pilot as observed by other aircraft
• The problem is interesting because, while the general structure of
the traffic is known, the specifics must be sensed and the behavior
of other traffic inferred
6. Challenges
• Goal: Automate mission planning to improve performance
for multiple UAVs in a dynamic, uncertain world
– Real-time planning
– Exploration & exploitation — data fusion
– Planning/inference over contested
communication networks
– Human-autonomy interaction
• Challenges:
– Uncertainty: the world model is not fully known
– Dynamic: the objective, world, or world model may change
– Stochastic: the same behavior in the same situation may result in
different outcomes
– Safety: arbitrary behaviors can be detrimental to the mission/system
7. Similar Challenges in Many Domains
Civil UAVs
Military UAVs
Space Vehicles
Manufacturing
8. Planning Challenges
• Issue: Most planners are model-based, which enables
anticipation
• But models are often approximate and/or wrong
– Model parameter uncertainties
– Modeling errors
• Can yield sub-optimal planner output with a large
performance mismatch
– Possibly catastrophic mission impact
9. Planning and Learning
• Two standard approaches:
• Baseline Control Algorithms (BCA)
– Fast solutions, but based on simplified models, so sub-optimal
– Can provide a good foundation to bootstrap learning
• Mitigates catastrophic mistakes
• Online Adaptation/Learning Algorithms
– Handle stochastic systems/unknown models
– Computational and sample complexity issues
– Exploration can be dangerous
– Can improve on the BCA by adapting to a time-varying environment
and mission, and by generating new strategies that are most beneficial
• Issue: how to develop an architecture that realizes this
synergistic combination
10. Planning and Learning
• Intelligent Cooperative Control
Architecture (iCCA)
– Synergistic integration of planning and
safe learning to improve performance
– Sand-boxing for planning and learning
• Example: 2-UAV, 6-target simulation (~10^8 state-action pairs)
– Cooperative learners perform well with respect to overall reward
and risk levels when compared with the baseline planner (CBBA)
and non-cooperative learning algorithms
[Figure: mission scenario graph with numbered targets, visit windows (e.g., [2,3]), success probabilities (.5–.7), and rewards (+100 to +300); plot of optimality (40%–90%) comparing Learner, Planner-Conservative, Planner-Aggressive, iCCA, and iCCA+AdaptiveModel]
iCCA can improve baseline planner performance, but how can the
learning problems be solved in real time? (A sketch of the iCCA
risk-filtering idea follows the citation below.)
A. Geramifard et al., "Intelligent cooperative control architecture: A framework for performance improvement
using safe learning," Journal of Intelligent and Robotic Systems, vol. 72, pp. 83–103, October 2013.
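To make the sand-boxing idea concrete, here is a minimal Python sketch of a risk-filtered decision step in the iCCA spirit. All interface names (propose, act, estimate) and the threshold value are hypothetical placeholders, not the published implementation:

def icca_step(state, learner, baseline_planner, risk_model, risk_threshold=0.2):
    """One decision step of an iCCA-style safe-learning loop (illustrative only).

    Hypothetical interfaces:
      learner.propose(state)      -> candidate action from the RL agent
      baseline_planner.act(state) -> action from the baseline planner (e.g., CBBA)
      risk_model.estimate(s, a)   -> estimated probability of constraint violation
    """
    candidate = learner.propose(state)
    if risk_model.estimate(state, candidate) <= risk_threshold:
        return candidate                # deemed safe: let the learner explore
    return baseline_planner.act(state)  # too risky: fall back to the baseline plan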
11. Reinforcement Learning
• Vision: Agents that learn desired behavior from
demonstrations or environment signals.
• Challenge: Continuous/high-dimensional
environments make learning intractable
Algorithm | Properties
Bayesian Nonparametric Inverse Reinforcement Learning (BNIRL) | Efficient inference of subgoals from human demonstrations in continuous domains
Incremental Feature Dependency Discovery (iFDD) | Computationally cheap feature expansion & online learning
Multi-Fidelity Reinforcement Learning (MFRL) | Efficient use of simulators to explore areas where real-world samples are not needed
B. Michini, M. Cutler, and J. P. How, "Scalable reward learning from demonstration," in IEEE International Conference on Robotics and Automation (ICRA), 2013.
A. Geramifard, F. Doshi, J. Redding, N. Roy, and J. How, "Online discovery of feature dependencies," in International Conference on Machine Learning (ICML), pp. 881–888, June 2011.
M. Cutler, T. J. Walsh, and J. P. How, "Reinforcement learning with multi-fidelity simulators," in IEEE International Conference on Robotics and Automation (ICRA), June 2014.
12. Learning from Demonstration (LfD)
• LfD is an intuitive method for teaching an autonomous system
• Reward vs. policy learning: the reward is a succinct and
transferable representation, but
– Ill-posed (many potential solutions exist)
– Must assume a model of rationality for the demonstrator
– Many demonstrations contain multiple tasks
• Current methods (e.g., IRL, Ng '00) have limitations
– Parametric rewards; scalability; single reward per demonstration
• Developed Bayesian Nonparametric Inverse RL (BNIRL)
– Learns multiple subgoal rewards from a single demonstration
– Number of rewards is learned, not specified (see the CRP sketch below)
– Strategies given for scalability (approximations, parallelizable)
B. Michini, M. Cutler, and J. P. How, "Scalable reward learning from demonstration," in IEEE International Conference on Robotics and Automation (ICRA), 2013.
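The "number of rewards learned, not specified" property comes from placing a Bayesian nonparametric prior, a Chinese restaurant process (CRP), over partitions of the demonstration. A minimal sketch of CRP partition sampling, with the concentration parameter eta an assumed value:

import random

def crp_partition(n_points, eta=1.0, seed=0):
    """Sample a partition of n_points from a Chinese restaurant process.

    The CRP lets the number of clusters (here: subgoal rewards) grow with
    the data instead of being fixed in advance, which is the key property
    BNIRL exploits. eta is the concentration parameter (assumed value).
    """
    rng = random.Random(seed)
    assignments, counts = [], []              # counts[k] = size of cluster k
    for _ in range(n_points):
        weights = counts + [eta]              # existing clusters, plus a new one
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)                  # open a new cluster (new subgoal)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

print(crp_partition(20))  # e.g. [0, 0, 1, 0, 2, ...]: cluster count not pre-set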
13. Experiment: BNIRL for Learning Quadrotor Flight Maneuvers
14. Experimental Results: GPSRL for Learning RC Car Driving Maneuvers
15. Experimental Results: GPSRL for Learning RC Car Driving Maneuvers
• Continuous, unsegmented
demonstration captured
and downsampled
• GPSRL partitions
demonstration and learns
corresponding subgoal
reward functions
16. Scaling Reinforcement Learning
• Vision: Use learning methods to improve
UAV team performance over time
– Typically very high-dimensional state space
– Computationally challenging
• Steps:
– Developed incremental Feature
Dependency Discovery (iFDD), a
novel adaptive function approximator
(see the sketch below)
• Results:
– iFDD has low computational
complexity and asymptotic
convergence guarantees
– iFDD outperforms other methods
A. Geramifard, F. Doshi, J. Redding, N. Roy, and J. How, “Online discovery of feature dependencies,” in
International Conference on Machine Learning (ICML), pp. 881–888, June 2011.
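For intuition, here is a hedged sketch of the iFDD expansion rule: accumulate the TD error attributable to candidate feature conjunctions, and promote a conjunction to a full feature once its accumulated relevance crosses a threshold. Class and parameter names are illustrative, and the bookkeeping is simplified relative to the published method:

from collections import defaultdict
from itertools import combinations

class IFDDSketch:
    """Illustrative sketch of incremental Feature Dependency Discovery (iFDD)."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.features = set()                 # discovered conjunctions
        self.relevance = defaultdict(float)   # accumulated |TD error| per candidate

    def update(self, active_base_features, td_error):
        """Credit the current TD error to co-active feature pairs; expand as needed."""
        for pair in combinations(sorted(active_base_features), 2):
            if pair in self.features:
                continue
            self.relevance[pair] += abs(td_error)
            if self.relevance[pair] > self.threshold:
                self.features.add(pair)       # promote: expand the representation
        return self.features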
17. RLPy: RL for Education & Research
• Provides a growing library of fine-grained
modules for experiments (example below)
– (5) Agents, (4) Policies, (10)
Representations, (20) Domains
– Modules can be recombined, freeing
researchers from reimplementation
• Reproducible, parallel, platform-independent
experiments
– Rapid prototyping (Python), support for
optimized C code (Cython)
• Tools to automate all parts of the
experiment pipeline
– Domain visualization for troubleshooting
– Automatic hyperparameter tuning
http://acl.mit.edu/RLPy/
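For a flavor of the module recombination, here is a small experiment in the composition style of the rlpy 1.x tutorial; exact class signatures may differ by version, so treat the arguments as assumptions:

# Compose a Domain, Representation, Policy, and Agent into an Experiment
# (class names per the rlpy 1.x tutorial).
from rlpy.Domains import GridWorld
from rlpy.Representations import Tabular
from rlpy.Policies import eGreedy
from rlpy.Agents import Q_Learning
from rlpy.Experiments import Experiment

domain = GridWorld()                            # one of the (20) Domains
representation = Tabular(domain)                # one of the (10) Representations
policy = eGreedy(representation, epsilon=0.2)   # one of the (4) Policies
agent = Q_Learning(policy=policy, representation=representation,
                   discount_factor=domain.discount_factor,
                   initial_learn_rate=0.1)      # one of the (5) Agents
experiment = Experiment(agent=agent, domain=domain, exp_id=1,
                        max_steps=2000, path="./Results")
experiment.run()
experiment.plot()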
18. Multi-Fidelity Reinforcement Learning
• Vision: Leverage simulators to learn optimal behavior with few
real-world samples
• Challenges:
– What knowledge should be
shared between agents learning
on different simulators?
– Choosing which simulator to sample:
low-fidelity simulators are less costly
but less accurate
• Contributions: Developed MFRL (sketched below)
– Lower-fidelity agents send values up
to guide exploration
– Higher-fidelity agents send learned
parameters down
– Rules for switching levels guarantee a
limited number of simulator changes
and efficient exploration
[Figure: chain of simulators, from lowest to highest fidelity]
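A minimal sketch of the two information flows MFRL uses between fidelity levels, under an assumed dictionary-based representation of Q-values and model parameters (the published algorithm adds careful switching rules and optimism guarantees on top of this):

def pass_values_up(q_low, q_high):
    """Seed a higher-fidelity agent with lower-fidelity Q-values (sketch).

    Values learned in a cheap simulator serve as the exploration heuristic
    one level up; dict-of-(state, action) storage is an assumed simplification.
    """
    for sa, value in q_low.items():
        q_high.setdefault(sa, value)    # inherit unless already learned here
    return q_high

def pass_params_down(model_high, model_low):
    """Send higher-fidelity learned model parameters back down (sketch)."""
    model_low.update(model_high)        # lower-fidelity agent replans with them
    return model_low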
19. Bayesian Nonparametric Models for Robotics
• Often significant uncertainty about the behaviors
and intents of other agents in the environment
– Bayesian nonparametric models (BNPs) uniquely
provide the flexibility to learn model size & parameters
– Important because it is often very difficult
to pre-specify model size
• Example: Gaussian Process (GP) BNP
for continuous functions
– Can learn the number of motion models and
their velocity fields using a Dirichlet process
GP mixture (DP-GP); a prediction sketch follows the citations below
– Can also capture temporally evolving
behaviors using DDP-GP
• Application: threat assessment
– Model, classify & assess the intent/behavior
of other drivers and pedestrians
– Embed in a robust planner (CC-RRT*)
– Driver aid and/or autonomous car
T. Campbell, S. S. Ponda, G. Chowdhary, and J. P. How, "Planning under uncertainty using nonparametric Bayesian models," in AIAA Guidance, Navigation, and Control Conference (GNC), August 2012.
G. S. Aoude, B. D. Luders, J. M. Joseph, N. Roy, and J. P. How, "Probabilistically safe motion planning to avoid dynamic obstacles with uncertain motion patterns," Autonomous Robots, vol. 35, no. 1, pp. 51–76, 2013.
D. Lin, E. Grimson, and J. Fisher, "Construction of dependent Dirichlet processes based on Poisson processes," in Neural Information Processing Systems (NIPS), 2010.
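To make the DP-GP prediction step concrete, here is a hedged sketch of mixing per-pattern GP forecasts by their posterior weights. The `patterns` interface (count, log_likelihood, predict) is a hypothetical stand-in for a trained DP-GP model, and alpha is an assumed concentration parameter:

import numpy as np

def dpgp_predict(trajectory, patterns, alpha=1.0):
    """Sketch of DP-GP prediction: mix per-pattern GP forecasts by posterior weight.

    Hypothetical interface for each pattern:
      .count                -> number of trajectories assigned to the pattern
      .log_likelihood(traj) -> GP marginal log-likelihood of the observed traj
      .predict(traj)        -> predicted future positions under this pattern
    """
    log_w = np.array([np.log(p.count) + p.log_likelihood(trajectory)
                      for p in patterns])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                               # posterior pattern weights
    preds = np.array([p.predict(trajectory) for p in patterns])
    return np.tensordot(w, preds, axes=1)      # probability-weighted forecast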
20. Fast BNP Learning
• Vision: Flexible learning for temporally evolving data
without sacrificing speed (real-time robotic systems)
• Challenges:
– Flexible models are computationally demanding
(e.g., Gibbs sampling for DP-GP, DDP-GP)
– Computationally cheap models are rigid
• Results: Dynamic Means (sketched below)
– Derived from a low-variance asymptotic analysis of the DDP mixture
– Handles cluster birth, death, and transitions
– Guaranteed monotonic convergence in clustering cost
[Plot: label accuracy (%) vs. log10 CPU time]
T. Campbell, M. Liu, B. Kulis, J. P. How, and L. Carin, "Dynamic clustering via asymptotics of the dependent Dirichlet process," in Advances in Neural Information Processing Systems (NIPS), 2013.
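A minimal sketch of one Dynamic Means time step, with the birth penalty lam, the dormancy penalty q, and the data layout all chosen here for illustration (the published algorithm's revival and transition weights are more refined):

import numpy as np

def dynamic_means_step(points, clusters, lam=5.0, q=0.1):
    """One time step of a Dynamic Means-style clustering pass (simplified sketch).

    clusters: list of dicts {"center": float ndarray, "age": int, "n": int}
    carried over from earlier time steps. Per-point options: pay lam to start
    a new cluster (birth), or pay squared distance plus a dormancy penalty
    q*age to join/revive an existing one (death and transition terms are
    simplified relative to the paper).
    """
    for x in points:
        costs = [np.sum((x - c["center"]) ** 2) + q * c["age"] for c in clusters]
        costs.append(lam)                       # option: open a new cluster
        k = int(np.argmin(costs))
        if k == len(clusters):
            clusters.append({"center": x.astype(float), "age": 0, "n": 1})
        else:
            c = clusters[k]
            c["n"] += 1
            c["center"] = c["center"] + (x - c["center"]) / c["n"]  # running mean
            c["age"] = 0                        # cluster is active again
    for c in clusters:
        c["age"] += 1                           # clusters age between time steps
    return clusters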
21. Experimental Implementation
• [Video demonstration]
22. Example: Driving with Uncertainty
• Goal: Improve road safety for urban driving
• Challenge: World is complex & dynamic
– Must safely avoid many types of uncertain
static and dynamic obstacles
– Must accurately anticipate other vehicles'
intents and assess the danger involved
• Objective: Develop probabilistic models
of the environment (cars, pedestrians,
cyclists, ...) and a robust path planner
that uses the models to safely navigate
urban environments
– Distributions over possible intents, and
trajectories for each intent
– Efficient enough for real-time use
[Images: navigating busy intersections; DGC '07 MIT/Cornell accident]
23. Approach
• Simultaneous trajectory prediction and robust avoidance
of multiple obstacle classes (static and dynamic)
• DP-GP: automatically classifies trajectories into behavior
patterns; uses a GP mixture model to compute
– Probability of being in each motion pattern,
given the observed trajectory
– Position distribution within each pattern at future timesteps
→ probabilistic models for propagated (intent, path) uncertainty
• RR-GP: refines predictions based on dynamics and the
environment
• CC-RRT*: optimized, robust motion planning
B. D. Luders, S. Karaman, and J. P. How, "Robust sampling-based motion planning with asymptotic optimality guarantees," in AIAA Guidance, Navigation, and Control Conference (GNC), August 2013.
G. S. Aoude, B. D. Luders, J. M. Joseph, N. Roy, and J. P. How, "Probabilistically safe motion planning to avoid dynamic obstacles with uncertain motion patterns," Autonomous Robots, vol. 35, no. 1, pp. 51–76, 2013.
24. CC-RRT* for Robust Motion Planning
• Real-time optimizing algorithm with guaranteed
probabilistic robustness to internal/external uncertainty
– Leverages RRT: anytime algorithm; quickly explores large state
spaces; dynamic feasibility; trajectory-wise constraint checking
• CC-RRT: efficient online risk evaluation (see the sketch below)
– Well-suited to real-time planning/updates with DP-GP motion
models
• RRT*: asymptotic optimality
• CC-RRT* is a very
scalable algorithm
S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal motion
planning,” International Journal of Robotics Research, vol. 30, pp. 846–894,
June 2011.
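The online risk evaluation is cheap because, for Gaussian state uncertainty and linear constraints, the violation probability has a closed form. A sketch under those assumptions (the per-node risk bound via Boole's inequality and the delta value are illustrative choices):

import numpy as np
from math import erf, sqrt

def halfplane_risk(mu, sigma, a, b):
    """P(a^T x > b) for a Gaussian state x ~ N(mu, sigma).

    For a linear constraint a^T x <= b, the violation probability is
    Phi((a^T mu - b) / sqrt(a^T Sigma a)); chance-constrained planners
    use closed-form bounds like this for fast online risk evaluation.
    """
    std = sqrt(a @ sigma @ a)
    z = (a @ mu - b) / std
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def node_is_safe(mu, sigma, constraints, delta=0.05):
    """Sketch: accept a tree node if its total risk bound stays below delta.

    constraints: list of (a, b) half-planes the state must satisfy; summing
    the individual risks gives a conservative bound (Boole's inequality).
    """
    return sum(halfplane_risk(mu, sigma, a, b) for a, b in constraints) < delta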
25. Robust Planning Examples
26. Reliable Autonomy for Transportation
• Vision: Safe, reliable autonomy is a crucial component of the
future acceptance and deployment of autonomous
systems
• Objective: Develop reliable autonomous
systems that can operate safely and
effectively for long durations in complex
and dynamic environments
– Control theory, verification and validation,
autonomous systems, and software safety
• Currently developing a Mobility on
Demand system on campus
– Builds on SMART (Frazzoli)
27. Multiagent Planning With Learning
• Mission: Visually detect target vehicles, then persistently
track/surveil them using a UGV and UAVs
– Online planning and learning
– Sensor-failure transition model
learned using iFDD
– Policy is re-computed
online using Dec-MMDP
• Cumulative cost decreases
during the mission
– Improved performance
due to learning
• Number of swaps per time
period decreases
– Team learns that the initial probability
of sensor failure was too pessimistic
[Plots: intermediate cumulative cost vs. time (hours); number of swaps per 30 minutes vs. time (hours)]
N. K. Ure, G. Chowdhary, Y. F. Chen, J. P. How, and J. Vian,
“Distributed learning for planning under uncertainty problems
with heterogeneous teams,” Journal of Intelligent and Robotic
Systems, pp. 1–16, 2013.
28. Conclusions
• New era of information and data availability
– Many new opportunities in guidance/control & robotics
• Learning and adaptation are key to reliable autonomy
– Must overcome sample and computational complexity
– Move toward more realistic applications
• Discussed model learning, but similar strategies apply to
policy learning
• Very exciting times: autonomous cars and UAS in the NAS
(National Airspace System) in our lifetime??
• Many references available at http://acl.mit.edu