1. An introduction to cognitive robotics
EMJD ICE Summer School - 2013
Lucio Marcenaro – University of Genova (Italy)
2. Cognitive robotics?
• Robots with intelligent behavior
– Learn and reason
– Complex goals
– Complex world
• Robots are ideal vehicles for developing and testing cognitive capabilities:
– Learning
– Adaptation
– Classification
3. Cognitive robotics
• Traditional behavior modeling approaches are problematic and untenable.
• Perception, action and the notion of symbolic representation are the central issues to be addressed in cognitive robotics.
• Cognitive robotics views animal cognition as a
starting point for the development of robotic
information processing.
4. Cognitive robotics
• “Immobile” Robots and Engineering
Operations
– Robust space probes, ubiquitous computing
• Robots That Navigate
– Hallway robots, Field robots, Underwater
explorers, stunt air vehicles
• Cooperating Robots
– Cooperative Space/Air/Land/Underwater vehicles,
distributed traffic networks, smart dust.
15. NXT Sound Sensor
• Sound sensor can measure in dB and dBA
– dB: in detecting standard [unadjusted]
decibels, all sounds are measured with
equal sensitivity. Thus, these sounds may
include some that are too high or too low
for the human ear to hear.
– dBA: in detecting adjusted decibels, the
sensitivity of the sensor is adapted to the
sensitivity of the human ear. In other words,
these are the sounds that your ears are able
to hear.
• Sound Sensor readings on the NXT are
displayed in percent [%]. The lower the percent
the quieter the sound.
http://mindstorms.lego.com/Overview/Sound_Sensor.aspx
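A minimal leJOS reading loop for this sensor (a sketch: it assumes leJOS's SoundSensor class, its readValue() percent reading, and port S2; adjust to the actual API and wiring):
import lejos.nxt.*;

public class SoundDemo {
  public static void main(String[] args) throws InterruptedException {
    // second constructor argument: true = dBA (human-ear adjusted), false = dB
    SoundSensor sound = new SoundSensor(SensorPort.S2, true);
    while (!Button.ESCAPE.isDown()) {
      LCD.clear();
      LCD.drawInt(sound.readValue(), 0, 0);  // current reading in percent
      Thread.sleep(200);
    }
  }
}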
16. NXT Ultrasonic/Distance Sensor
• Measures
distance/proximity
• Range: 0-255 cm
• Precision: +/- 3cm
• Can report in
centimeters or
inches
http://mindstorms.lego.com/Overview/Ultrasonic_Sensor.aspx
18. LEGO Mindstorms for NXT (NXT-G)
• NXT-G graphical programming language
• Based on the LabVIEW programming language G
• Program by drawing a flow chart
19. NXT-G PC program interface
• Main areas of the annotated interface screenshot: Toolbar, Workspace, Configuration Panel, Help & Navigation, Controller, Palettes, Tutorials Web Portal, Sequence Beam
20. Issues of the standard firmware
• Only one data type
• Unreliable Bluetooth communication
• Limited multi-tasking
• Complex motor control
• Simplistic memory management
• Not suitable for large programs
• Not suitable for development of own tools or
blocks
21. Other programming languages and
environments
– leJOS (Java)
– Microsoft Robotics Studio
– RobotC
– NXC - Not eXactly C
– NXT Logo
– Lego NXT Open source firmware and software
development kit
22. leJOS
• A Java Virtual Machine for NXT
• Freely available
– http://lejos.sourceforge.net/
• Replaces the NXT-G firmware
• A leJOS plug-in is available for the free Eclipse development environment
• Faster than NXT-G
23. Example leJOS Program
UltrasonicSensor sonar = new UltrasonicSensor(SensorPort.S4);
Motor.A.forward();
Motor.B.forward();
while (true) {
  if (sonar.getDistance() < 25) {
    // obstacle closer than 25 cm: spin in place
    Motor.A.forward();
    Motor.B.backward();
  } else {
    // path clear: keep driving straight
    Motor.A.forward();
    Motor.B.forward();
  }
}
24. Event-driven Control in leJOS
• The Behavior interface
– boolean takeControl()
– void action()
– void suppress()
• Arbitrator class
– Constructor gets an array of Behavior objects
• takeControl() checked for highest index first
– start() method begins event loop
25. Event-driven example
class Go implements Behavior {
private UltrasonicSensor sonar =
new UltrasonicSensor(SensorPort.S4);
public boolean takeControl() {
return sonar.getDistance() > 25;
}
26. Event-driven example
public void action() {
Motor.A.forward();
Motor.B.forward();
}
public void suppress() {
Motor.A.stop();
Motor.B.stop();
}
}
27. Event-driven example
class Spin implements Behavior {
private UltrasonicSensor sonar =
new UltrasonicSensor(SensorPort.S4);
public boolean takeControl() {
return sonar.getDistance() <= 25;
}
28. Event-driven example
public void action() {
Motor.A.forward();
Motor.B.backward();
}
public void suppress() {
Motor.A.stop();
Motor.B.stop();
}
}
29. Event-driven example
public class FindFreespace {
public static void main(String[] a) {
Behavior[] b = new Behavior[]
{new Go(), new Spin()};
Arbitrator arb =
new Arbitrator(b);
arb.start();
}
}
30. Simple Line Follower
• Use light-sensor as a switch
• If measured value > threshold: ON state (white
surface)
• If measured value < threshold: OFF state
(black surface)
31. Simple Line Follower
• The robot does not travel inside the line but along its edge
• Turn left until an “OFF” to “ON” transition is detected
• Turn right until an “ON” to “OFF” transition is detected
32. Simple Line Follower
NXTMotor rightM = new NXTMotor(MotorPort.A);
NXTMotor leftM = new NXTMotor(MotorPort.C);
ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED);
while (!Button.ESCAPE.isDown()) {
  int currentColor = cs.getLightValue();
  LCD.drawInt(currentColor, 5, 11, 3);
  if (currentColor < 30) {
    // dark reading (sensor over the black line)
    rightM.setPower(50);
    leftM.setPower(10);
  } else {
    // bright reading (sensor over the white surface)
    rightM.setPower(10);
    leftM.setPower(50);
  }
}
34. Advanced Line Follower
• Use the light sensor as an analog sensor
• Sensor value ranges between 0 and 100
• Takes the average light detected over a small area
35. Advanced Line Follower
• Subtract the current reading of the sensor from what the sensor should be reading
– Use this value to directly control the direction and power of the wheels
• Multiply this value by a constant: how strongly should the wheels turn to correct the path?
• Add a base value to make sure that the robot is always moving forward
36. Advanced Line Follower
NXTMotor rightM = new NXTMotor(MotorPort.A);
NXTMotor leftM = new NXTMotor(MotorPort.C);
int targetValue = 30;   // desired light reading on the line edge
int amplify = 7;        // how strongly to correct deviations
int targetPower = 50;   // base forward power
ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED);
rightM.setPower(targetPower);
leftM.setPower(targetPower);
while (!Button.ESCAPE.isDown()) {
  int currentColor = cs.getLightValue();
  int difference = currentColor - targetValue;  // error with respect to the target reading
  int ampDiff = difference * amplify;           // amplified correction
  int rightPower = ampDiff + targetPower;
  int leftPower = targetPower;
  rightM.setPower(rightPower);
  leftM.setPower(leftPower);
}
38. Learn how to follow
• Goal
– Make robots do what we want
– Minimize/eliminate programming
• Proposed Solution: Reinforcement Learning
– Specify desired behavior using rewards
– Express rewards in terms of sensor states
– Use machine learning to induce desired actions
• Target Platform
– Lego Mindstorms NXT
39. Example: Grid World
• A maze-like problem
– The agent lives in a grid
– Walls block the agent’s path
• Noisy movement: actions do not always go as planned:
– 80% of the time, the preferred action is taken (if there is no wall there)
– 10% of the time, the action North takes the agent West; 10% of the time, East
– If there is a wall in the direction the agent would have been taken, the agent stays put
• The agent receives a reward at each time step
– Small “living” reward each step (can be
negative)
– Big rewards come at the end (good or
bad)
• Goal: maximize sum of rewards
40. Markov Decision Processes
• An MDP is defined by:
– A set of states s ∈ S
– A set of actions a ∈ A
– A transition function T(s,a,s’)
• Prob that a from s leads to s’
• i.e., P(s’ | s,a)
• Also called the model (or
dynamics)
– A reward function R(s, a, s’)
• Sometimes just R(s) or R(s’)
– A start state
– Maybe a terminal state
• MDPs are non-deterministic
search problems
– Reinforcement learning: MDPs
where we don’t know the
transition or reward functions
41. What is Markov about MDPs?
• “Markov” generally means that given the
present state, the future and the past are
independent
• For Markov decision processes, “Markov” means that action outcomes depend only on the current state:
P(S_{t+1} = s’ | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, …, S_0 = s_0) = P(S_{t+1} = s’ | S_t = s_t, A_t = a_t)
Andrej Andreevič Markov (1856-1922)
42. Solving MDPs: policies
• In deterministic single-agent search problems, want an
optimal plan, or sequence of actions, from start to a goal
• In an MDP, we want an optimal policy π*: S → A
– A policy gives an action for each state
– An optimal policy maximizes expected utility if followed
– An explicit policy defines a reflex agent
Optimal policy when
R(s, a, s’) = -0.03 for all
non-terminals s
44. MDP Search Trees
• Each MDP state gives an expectimax-like search tree
[Diagram: from a state s, an action a leads to the q-state (s, a); an outcome s’ completes a transition (s, a, s’), which occurs with probability T(s, a, s’) = P(s’ | s, a) and yields reward R(s, a, s’)]
45. Utilities of Sequences
• In order to formalize
optimality of a policy,
need to understand
utilities of sequences of
rewards
• What preferences should
an agent have over
reward sequences?
• More or less?
– [1,2,2] or [2,3,4]
• Now or later?
– [1,0,0] or [0,0,1]
46. Discounting
• It’s reasonable to maximize the sum of
rewards
• It’s also reasonable to prefer rewards now to
rewards later
• One solution: values of rewards decay exponentially
47. Discounting
• Typically discount rewards by γ < 1 each time step
– Sooner rewards have higher
utility than later rewards
– Also helps the algorithms
converge
• Example: discount of 0.5:
– U([1,2,3])=1*1+0.5*2+0.25*3
– U([1,2,3])<U([3,2,1])
48. Stationary Preferences
• Theorem: assume stationary preferences, i.e. [a_1, a_2, …] ≻ [b_1, b_2, …] ⟺ [r, a_1, a_2, …] ≻ [r, b_1, b_2, …]
• Then there are only two ways to define utilities:
– Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + …
– Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + …
49. Quiz: Discounting
• Given:
– Actions: East, West and Exit (available in exit states a, e)
– Transitions: deterministic
• Quiz 1: For γ = 1, what is the optimal policy?
• Quiz 2: For γ = 0.1, what is the optimal policy?
• Quiz 3: For which γ are East and West equally good when in state d?
[Figure: states a, b, c, d, e in a row; exiting at a yields reward 10, exiting at e yields reward 1]
50. Infinite Utilities?!
• Problem: infinite state sequences have infinite rewards
• Solutions:
– Finite horizon:
• Terminate episodes after a fixed T steps (e.g. life)
• Gives nonstationary policies (π depends on the time left)
– Discounting: use 0 < γ < 1
• Smaller γ means smaller “horizon” – shorter term focus
• Absorbing state: guarantee that for every policy, a terminal
state will eventually be reached
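A standard geometric-series bound shows why discounting keeps utilities finite: if every reward is bounded by R_max, then
U([r_0, r_1, r_2, …]) = Σ_{t ≥ 0} γ^t r_t ≤ R_max / (1 - γ)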
51. Recap: Defining MDPs
• Markov decision processes:
– States S
– Start state s0
– Actions A
– Transitions P(s’|s,a) (or T(s,a,s’))
– Rewards R(s,a,s’) (and discount γ)
• MDP quantities so far:
– Policy = Choice of action for each state
– Utility (or return) = sum of discounted rewards
52. Optimal Quantities
• Why? Optimal values define
optimal policies!
• Define the value (utility) of a
state s:
V*(s) = expected utility starting in s
and acting optimally
• Define the value (utility) of a
q-state (s,a):
Q*(s,a) = expected utility starting in
s, taking action a and thereafter
acting optimally
• Define the optimal policy:
π*(s) = optimal action from state s
55. Values of States
• Fundamental operation: compute the value of
a state
– Expected utility under optimal action
– Average sum of (discounted) rewards
• Recursive definition of value
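In standard Bellman form, this recursive definition is:
V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]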
56. Why Not Search Trees?
• We’re doing way too much work with
search trees
• Problem: States are repeated
– Idea: Only compute needed quantities once
• Problem: Tree goes on forever
– Idea: Do depth-limited computations, but with increasing depths until the change is small
– Note: deep parts of the tree eventually don’t matter if γ < 1
57. Time-limited Values
• Key idea: time-limited values
• Define Vk(s) to be the optimal value of s if the
game ends in k more time steps
– Equivalently, it’s what a depth-k search tree would
give from s
67. Value Iteration
• Problems with the recursive computation:
– Have to keep all the Vk*(s) around all the time
– Don’t know which depth k(s) to ask for when planning
• Solution: value iteration
– Calculate values for all states, bottom-up
– Keep increasing k until convergence
68. Value Iteration
• Idea:
– Start with V0*(s) = 0, which we know is right (why?)
– Given Vi*, calculate the values for all states for depth i+1:
V_{i+1}(s) ← max_a Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V_i(s’) ]
– This is called a value update or Bellman update
– Repeat until convergence
• Complexity of each iteration: O(S²A)
• Theorem: will converge to unique optimal values
– Basic idea: approximations get refined towards optimal values
– Policy may converge long before values do
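A minimal Java sketch of this loop over a tabular MDP (the array-based representation, the names, and the convergence test are illustrative assumptions, not taken from the slides):
// T[s][a][sp] = P(sp | s, a); R[s][a][sp] = reward for that transition
public class ValueIteration {
  public static double[] solve(double[][][] T, double[][][] R, double gamma, double tol) {
    int nS = T.length;
    double[] V = new double[nS];            // V_0(s) = 0 for all s
    double delta;
    do {
      delta = 0.0;
      double[] Vnext = new double[nS];
      for (int s = 0; s < nS; s++) {
        double best = Double.NEGATIVE_INFINITY;  // assumes every state has at least one action
        for (int a = 0; a < T[s].length; a++) {
          double q = 0.0;                   // Q_{i+1}(s, a)
          for (int sp = 0; sp < nS; sp++) {
            q += T[s][a][sp] * (R[s][a][sp] + gamma * V[sp]);
          }
          best = Math.max(best, q);
        }
        Vnext[s] = best;
        delta = Math.max(delta, Math.abs(Vnext[s] - V[s]));
      }
      V = Vnext;
    } while (delta > tol);                  // keep increasing the depth until the change is small
    return V;
  }
}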
69. Practice: Computing Actions
• Which action should we choose from state s?
– Given optimal values V: one-step look-ahead, π*(s) = argmax_a Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
– Given optimal q-values Q: simply π*(s) = argmax_a Q*(s,a)
– Lesson: actions are easier to select from Q’s!
70. Utilities for Fixed Policies
• Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π
• Define the utility of a state s, under a fixed policy π:
Vπ(s) = expected total discounted rewards (return) starting in s and following π
• Recursive relation (one-step look-ahead / Bellman equation):
Vπ(s) = Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ Vπ(s’) ]
71. Policy Evaluation
• How do we calculate the Vπ’s for a fixed policy π?
• Idea one: modify the Bellman updates to use the fixed policy (no max over actions):
Vπ_{k+1}(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ Vπ_k(s’) ]
• Efficiency: O(S²) per iteration
• Idea two: without the maxes it’s just a linear system; solve with Matlab (or whatever)
72. Policy Iteration
• Problem with value iteration:
– Considering all actions at each iteration is slow: it takes |A| times longer than policy evaluation
– But the policy rarely changes between iterations, so that time is wasted
• Alternative to value iteration:
– Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal
utilities!) until convergence (fast)
– Step 2: Policy improvement: update policy using one-step look-ahead with
resulting converged (but not optimal!) utilities (slow but infrequent)
– Repeat steps until policy converges
• This is policy iteration
– It’s still optimal!
– Can converge faster under some conditions
73. Policy Iteration
• Policy evaluation: with the current policy π fixed, find values with simplified Bellman updates:
Vπ_{k+1}(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ Vπ_k(s’) ]
– Iterate until values converge
• Policy improvement: with fixed utilities, find the best action according to one-step look-ahead:
π_new(s) = argmax_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ Vπ(s’) ]
74. Comparison
• In value iteration:
– Every pass (or “backup”) updates both utilities (explicitly, based on
current utilities) and policy (possibly implicitly, based on current
policy)
• In policy iteration:
– Several passes to update utilities with frozen policy
– Occasional passes to update policies
• Hybrid approaches (asynchronous policy iteration):
– Any sequences of partial updates to either policy entries or utilities
will converge if every state is visited infinitely often
75. Reinforcement Learning
• Basic idea:
– Receive feedback in the form of rewards
– Agent’s utility is defined by the reward function
– Must learn to act so as to maximize expected rewards
– All learning is based on observed samples of outcomes
76. Reinforcement Learning
• Reinforcement learning:
– Still assume an MDP:
• A set of states s ∈ S
• A set of actions (per state) A
• A model T(s,a,s’)
• A reward function R(s,a,s’)
– Still looking for a policy π(s)
– New twist: don’t know T or R
• I.e. don’t know which states are good or what the actions do
• Must actually try actions and states out to learn
77. Model-Based Learning
• Model-Based Idea:
– Learn the model empirically through experience
– Solve for values as if the learned model were correct
• Step 1: Learn empirical MDP model
– Count outcomes for each s,a
– Normalize to give estimate of T(s,a,s’)
– Discover R(s,a,s’) when we experience (s,a,s’)
• Step 2: Solve the learned MDP
– Iterative policy evaluation, for example
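A minimal Java sketch of Step 1 (count outcomes for each s,a and normalize); the integer state/action encoding and the names are illustrative assumptions:
import java.util.HashMap;
import java.util.Map;

public class EmpiricalModel {
  private final Map<String, Integer> countSAS = new HashMap<>();   // N(s, a, s')
  private final Map<String, Integer> countSA = new HashMap<>();    // N(s, a)
  private final Map<String, Double> rewardSAS = new HashMap<>();   // observed R(s, a, s')

  // Record one experienced transition (s, a, s', r)
  public void observe(int s, int a, int sPrime, double r) {
    countSAS.merge(s + "," + a + "," + sPrime, 1, Integer::sum);
    countSA.merge(s + "," + a, 1, Integer::sum);
    rewardSAS.put(s + "," + a + "," + sPrime, r);
  }

  // Estimated T(s, a, s') = N(s, a, s') / N(s, a)
  public double estimateT(int s, int a, int sPrime) {
    int n = countSA.getOrDefault(s + "," + a, 0);
    if (n == 0) return 0.0;
    return countSAS.getOrDefault(s + "," + a + "," + sPrime, 0) / (double) n;
  }
}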
78. Example: Model-Based Learning
• Discount γ = 1; exit rewards are +100 and -100
• Model estimated from the two episodes below:
T(<3,3>, right, <4,3>) = 1 / 3
T(<2,3>, right, <3,3>) = 2 / 2
• Episodes (grid positions written as (x, y)):
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
79. Model-Free Learning
• Want to compute an expectation weighted by P(x): E[f(x)] = Σ_x P(x) f(x)
• Model-based: estimate P(x) from samples, then compute the expectation using the estimated P(x)
• Model-free: estimate the expectation directly from samples: E[f(x)] ≈ (1/N) Σ_i f(x_i)
• Why does this work? Because samples appear with the right frequencies!
80. Example: Direct Estimation
• Discount γ = 1, living reward R = -1; exit rewards +100 and -100
• Episodes (grid positions written as (x, y)):
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
V(2,3) ≈ (96 + -103) / 2 = -3.5
V(3,3) ≈ (99 + 97 + -102) / 3 = 31.3
81. Sample-Based Policy Evaluation?
• Who needs T and R? Approximate the expectation with samples (drawn from T!):
sample_k = R(s, π(s), s’_k) + γ Vπ_i(s’_k), for observed successors s’_1, s’_2, …
Vπ_{i+1}(s) ← (1/N) Σ_k sample_k
Almost! But we only
actually make progress
when we move to i+1.
82. Temporal-Difference Learning
• Big idea: learn from every experience!
– Update V(s) each time we experience (s,a,s’,r)
– Likely outcomes s’ will contribute updates more often
• Temporal difference learning
– Policy still fixed!
– Move values toward value of whatever successor
occurs: running average!
Sample of V(s): sample = R(s, π(s), s’) + γ Vπ(s’)
Update to V(s): Vπ(s) ← (1-α) Vπ(s) + α · sample
Same update: Vπ(s) ← Vπ(s) + α (sample - Vπ(s))
83. Exponential Moving Average
• Exponential moving average
– Makes recent samples more important
– Forgets about the past (distant past values were wrong anyway)
– Easy to compute from the running average
• Decreasing learning rate can give converging averages
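The underlying running-average formula (standard form): x̄_n = (1 - α) · x̄_{n-1} + α · x_n, so a sample that is k steps old contributes with weight α (1 - α)^k; recent samples matter more and the distant past is forgotten.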
84. Example: TD Policy Evaluation
Take γ = 1, α = 0.5
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
85. Problems with TD Value Learning
• TD value learning is a model-free way to do policy evaluation
• However, if we want to turn values into a (new) policy, we’re sunk: the greedy policy π(s) = argmax_a Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V(s’) ] needs T and R, which we don’t know
• Idea: learn Q-values directly
• Makes action selection model-free too!
86. Active Learning
• Full reinforcement learning
– You don’t know the transitions T(s,a,s’)
– You don’t know the rewards R(s,a,s’)
– You can choose any actions you like
– Goal: learn the optimal policy
– … what value iteration did!
• In this case:
– Learner makes choices!
– Fundamental tradeoff: exploration vs. exploitation
– This is NOT offline planning! You actually take actions in the world and
find out what happens…
87. Detour: Q-Value Iteration
• Value iteration: find successive approximations of the optimal values
– Start with V0*(s) = 0, which we know is right (why?)
– Given Vi*, calculate the values for all states for depth i+1:
V_{i+1}(s) ← max_a Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V_i(s’) ]
• But Q-values are more useful!
– Start with Q0*(s,a) = 0, which we know is right (why?)
– Given Qi*, calculate the q-values for all q-states for depth i+1:
Q_{i+1}(s,a) ← Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ max_{a’} Q_i(s’,a’) ]
88. Q-Learning
• Q-Learning: sample-based Q-value iteration
• Learn Q*(s,a) values
– Receive a sample (s,a,s’,r)
– Consider your old estimate: Q(s,a)
– Consider your new sample estimate: sample = r + γ max_{a’} Q(s’,a’)
– Incorporate the new estimate into a running average: Q(s,a) ← (1-α) Q(s,a) + α · sample
89. Q-Learning Properties
• Amazing result: Q-learning converges to optimal policy
– If you explore enough
– If you make the learning rate small enough
– … but not decrease it too quickly!
– Basically doesn’t matter how you select actions (!)
• Neat property: off-policy learning
– learn optimal policy without following it (some caveats)
90. Q-Learning
• Discrete sets of states and actions
– States form an N-dimensional array
• Unfolded into one dimension in practice
– Individual actions selected on each time step
• Q-values
– 2D array (indexed by state and action)
– Expected rewards for performing actions
91. Q-Learning
• Table of expected rewards (“Q-values”)
– Indexed by state and action
• Algorithm steps
– Calculate state index from sensor values
– Calculate the reward
– Update previous Q-value
– Select and perform an action
• Q(s,a) ← (1 - α) Q(s,a) + α (r + γ max_{a’} Q(s’,a’))
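A minimal Java sketch of this table and update rule; the ε-greedy action selection is an illustrative assumption (the slides do not specify a selection strategy at this point):
import java.util.Random;

public class QLearner {
  private final double[][] q;               // Q-values indexed by [state][action]
  private final double alpha, gamma, epsilon;
  private final Random rng = new Random();

  public QLearner(int nStates, int nActions, double alpha, double gamma, double epsilon) {
    this.q = new double[nStates][nActions];
    this.alpha = alpha;
    this.gamma = gamma;
    this.epsilon = epsilon;
  }

  // Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
  public void update(int s, int a, double r, int sPrime) {
    double maxNext = q[sPrime][0];
    for (double v : q[sPrime]) maxNext = Math.max(maxNext, v);
    q[s][a] = (1 - alpha) * q[s][a] + alpha * (r + gamma * maxNext);
  }

  // Pick a random action with probability epsilon, otherwise the current best one
  public int selectAction(int s) {
    if (rng.nextDouble() < epsilon) return rng.nextInt(q[s].length);
    int best = 0;
    for (int a = 1; a < q[s].length; a++) if (q[s][a] > q[s][best]) best = a;
    return best;
  }
}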
92. Q-Learning and Robots
• Certain sensors provide continuous values
– Sonar
– Motor encoders
• Q-Learning requires discrete inputs
– Group continuous values into discrete “buckets”
– [Mahadevan and Connell, 1992]
• Q-Learning produces discrete actions
– Forward
– Back-left/Back-right
93. Creating Discrete Inputs
• Basic approach
– Discretize continuous values into sets
– Combine each discretized tuple into a single index
• Another approach
– Self-Organizing Map
– Induces a discretization of continuous values
– [Touzet 1997] [Smith 2002]
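A small Java sketch of the basic approach, combining a discretized tuple into a single table index by mixed-radix encoding (method and variable names are illustrative):
public class StateIndex {
  // values[i] must lie in [0, sizes[i]) for every discretized component
  public static int combine(int[] values, int[] sizes) {
    int index = 0;
    for (int i = 0; i < values.length; i++) {
      index = index * sizes[i] + values[i];
    }
    return index;                           // unique index in [0, product of all sizes)
  }
}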
94. Q-Learning Main Loop
• Select action
• Change motor speeds
• Inspect sensor values
– Calculate updated state
– Calculate reward
• Update Q values
• Set “old state” to be the updated state
95. Calculating the State (Motors)
• For each motor:
– 100% power
– 93.75% power
– 87.5% power
• Six motor states
96. Calculating the State (Sensors)
• No disparity: STRAIGHT
• Left/Right disparity
– 1-5: LEFT_1, RIGHT_1
– 6-12: LEFT_2, RIGHT_2
– 13+: LEFT_3, RIGHT_3
• Seven total sensor states
• 63 states overall
97. Calculating Reward
• No disparity => highest value
• Reward decreases with increasing disparity
98. Action Set for Line Follow
• MAINTAIN
– Both motors unchanged
• UP_LEFT, UP_RIGHT
– Accelerate motor by one motor state
• DOWN_LEFT, DOWN_RIGHT
– Decelerate motor by one motor state
• Five total actions
100. Conclusions
• Lego Mindstorms NXT as a convenient platform for “cognitive robotics”
• Executing a task with “rules”
• Learning how to execute a task
– MDP
– Reinforcement learning
• Q-learning applied to Lego Mindstorms