Learning and imitation in heterogeneous robot groups


Wilhelm Richert
richert@c-lab.de

Faculty of Electrical Engineering, Computer Science and Mathematics,
Universität Paderborn

22 December 2009



Motivation
Why do we need learning and imitation?
State of the art
- Off-line learning (mostly population-based)
- Behavior is fixed afterwards

[Figures: Swarmanoid [Dorigo et al., 2006]; Symbrion [Baele et al., 2009]]
Desired
- On-line learning to react intelligently to unforeseeable events/problems
- Means to benefit from the “redundancy” in group behavior
- Robustness to arbitrary robot groups



The five big challenges in imitation
[Dautenhahn and Nehaniv, 2002]

Five big challenges governing successful imitation in multi-robot systems:

- whom: heterogeneous robot groups
- when: concentrate on salient behavior
- what: the results, the actions, or the hidden goals of the imitatee?
- how: correspondence problem
- how to evaluate: What should be counted as successful imitation?



Thesis objectives

Robots in a group shall be able to

1. combine learning with imitation,
2. recognize and learn observed behavior non-obtrusively, and
3. choose potential imitatees wisely,
also in heterogeneous robot groups.



Robot architecture

[Figure: three-layer robot architecture (motivation layer, strategy layer, skill layer) with an interaction example. Labels: perception, action, current motivation, request, result, imitation, choice of the imitatee.]



Strategy layer
- Inspired by AMPS [Kochenderfer, 2006]

[Figure: processing pipeline of the strategy layer within the three-layer architecture]
- raw perception and motivation: $I$, $\mu_i$
- perception filtering: $o_t \in I_s$
- experience: $\langle (o, a, d, \mu_i, f)_{t-N}, \ldots, (o, a, d, \mu_i, f)_t \rangle$
- abstraction (maintained by heuristics): $s = \xi(o)$
- model: $T, R, \gamma$
- reinforcement learning yields the policy $\pi$
- action selection: $a = \pi(s) \in A$



Strategy layer
State abstraction

- State abstraction function $\xi$ might use any abstraction method supporting
  - insertion of new state observations
  - deletion of old state observations
  - querying the most similar state observation to a new state observation
- Experiments use nearest neighbor

[Figure: strategy layer pipeline from the filtered observation $o_t \in I_s$ via the abstraction $s = \xi(o)$ to the model $T, R, \gamma$, the policy $\pi$, and the action selection $a = \pi(s) \in A$; a state observation (raw state) is mapped to a region (abstract state).]
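The three operations required of an abstraction method (insert, delete, query the most similar observation) can be illustrated with a brute-force nearest-neighbor store. This is only a minimal sketch: the class name `NearestNeighborAbstraction`, the use of Euclidean distance, and the region labels are assumptions made for illustration and are not taken from the slides.

```python
import math


class NearestNeighborAbstraction:
    """Hypothetical abstraction xi: raw observation -> region of its nearest stored observation."""

    def __init__(self):
        # Maps each stored state observation (tuple of floats) to its region label.
        self._points = {}

    def insert(self, observation, region):
        """Insert a new state observation together with the region it belongs to."""
        self._points[tuple(observation)] = region

    def delete(self, observation):
        """Delete an old state observation (no error if it is unknown)."""
        self._points.pop(tuple(observation), None)

    def nearest(self, observation):
        """Query the stored observation most similar to a new one (Euclidean distance, an assumption)."""
        if not self._points:
            raise ValueError("no state observations stored")
        return min(self._points, key=lambda p: math.dist(p, observation))

    def abstract(self, observation):
        """s = xi(o): return the region of the nearest stored observation."""
        return self._points[self.nearest(observation)]


xi = NearestNeighborAbstraction()
xi.insert((0.0, 0.0), "near_ball")
xi.insert((5.0, 5.0), "near_goal")
region_a = xi.abstract((0.5, 0.2))   # closest to (0.0, 0.0)
region_b = xi.abstract((4.0, 4.5))   # closest to (5.0, 5.0)
xi.delete((0.0, 0.0))
region_c = xi.abstract((0.5, 0.2))   # only (5.0, 5.0) is left
```

A linear scan keeps the sketch short; any index structure that supports the same three operations could be substituted.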



Strategy layer
Heuristics

- Heuristics maintain the models so that the same action feels similar in all observations of the same state
- Heuristics may split or merge regions: transition, failure, reward, simplification, experience
- Example: transition heuristic

[Figure: strategy layer pipeline; the heuristics act on the abstraction $s = \xi(o)$, which maps a state observation (raw state) to a region (abstract state).]



Building a policy

- Reinforcement learning with SMDP
  - $Q(s, a) = R(s, a) + \sum_{s' \in S} P(s' \mid s, a)\, \gamma(s, a, s')\, V^{\pi}(s')$
- Determine current best policy
  - $V^{\pi}(s) = \max_{a \in A} Q(s, a)$
  - $\pi(s) = \arg\max_{a \in A} Q(s, a)$

[Figure: strategy layer pipeline; reinforcement learning turns the model $T, R, \gamma$ into the policy $\pi$ used for action selection $a = \pi(s) \in A$.]
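The equations for $Q$, $V^{\pi}$, and $\pi$ can be evaluated by repeated sweeps over all state-action pairs. The sketch below assumes a small tabular model; the function name `solve_smdp`, the dictionary layout, the fixed number of sweeps, and the toy states are all illustrative assumptions rather than the thesis implementation.

```python
def solve_smdp(states, actions, P, R, gamma, sweeps=200):
    """Iterate Q(s,a) = R(s,a) + sum_s' P(s'|s,a) * gamma(s,a,s') * V(s') for a fixed number of sweeps.

    P maps (s, a) to {s': probability}, R maps (s, a) to a reward,
    gamma is a callable giving the transition-specific discount.
    """
    V = {s: 0.0 for s in states}
    Q = {}
    for _ in range(sweeps):
        Q = {
            (s, a): R.get((s, a), 0.0)
            + sum(p * gamma(s, a, s2) * V[s2] for s2, p in P.get((s, a), {}).items())
            for s in states
            for a in actions
        }
        # V(s) = max_a Q(s, a)
        V = {s: max(Q[s, a] for a in actions) for s in states}
    # pi(s) = argmax_a Q(s, a)
    policy = {s: max(actions, key=lambda a: Q[s, a]) for s in states}
    return Q, V, policy


# Toy model (hypothetical): approaching twice leads from "far" via "near" to the absorbing state "done".
states = ["far", "near", "done"]
actions = ["approach", "wait"]
P = {
    ("far", "approach"): {"near": 1.0},
    ("far", "wait"): {"far": 1.0},
    ("near", "approach"): {"done": 1.0},
    ("near", "wait"): {"near": 1.0},
    ("done", "approach"): {"done": 1.0},
    ("done", "wait"): {"done": 1.0},
}
R = {("near", "approach"): 10.0}
Q, V, policy = solve_smdp(states, actions, P, R, gamma=lambda s, a, s2: 0.9)
```

In an SMDP the discount depends on the duration of the transition, which is why `gamma` is a function of $(s, a, s')$ here; a constant is used only to keep the example easy to check by hand.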



Strategy layer

- Strategy layer requests symbolic actions
- Execution of these actions is up to the skill layer

[Figure: the strategy layer pipeline shown inside the three-layer architecture; the strategy layer sends a request to the skill layer and receives a result.]



Skill layer
Tasks
1. discover and learn a set of skills that are useful to the strategy layer (ground symbols $\in A$)
2. execute them when requested and optimize at runtime
Skill
- skill $s = (f_{e_1}, \ldots, f_{e_N})$, where
  - error function $f_e : I_a \times I_a \to \mathbb{R}$ assigns an error value to a pair of perceptions $(I(t_i), I(t_j))$
Example: “approach the ball and orient towards it”
- $f_{e_1}(I(t_i), I(t_j)) = d_{ball}(I(t_j))$: minimize the ball distance
- $f_{e_2}(I(t_i), I(t_j)) = |\alpha_{ball}(I(t_j))|$: minimize the ball angle
- $s = (f_{e_1}, f_{e_2})$: approach the ball and orient towards it
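A skill defined as a tuple of error functions maps directly onto a tuple of callables. The following sketch encodes the “approach the ball and orient towards it” example; representing a perception as a dictionary with the keys `ball_dist` and `ball_angle` is an assumption made for illustration.

```python
def ball_distance_error(start, current):
    """f_e1: the distance to the ball in the current perception (to be minimized)."""
    return current["ball_dist"]


def ball_angle_error(start, current):
    """f_e2: the absolute angle to the ball in the current perception (to be minimized)."""
    return abs(current["ball_angle"])


# s = (f_e1, f_e2): approach the ball and orient towards it.
approach_and_orient = (ball_distance_error, ball_angle_error)


def total_error(skill, start, current):
    """Sum of all error functions of a skill for the perception pair (start, current)."""
    return sum(f_e(start, current) for f_e in skill)


start = {"ball_dist": 2.0, "ball_angle": -0.5}      # perception when the skill was started
current = {"ball_dist": 0.5, "ball_angle": -0.25}   # current perception
error_at_start = total_error(approach_and_orient, start, start)
error_now = total_error(approach_and_orient, start, current)
```

Both error functions ignore the start perception in this example, but the two-argument signature is kept because an error function is defined on a pair of perceptions.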



Skill layer
Measuring a skill’s progress
- Progress function $f_p : I_a \times I_a \to [0, 1]$ measures a skill’s progress
- For a skill $s = (f_{e_1}, \ldots, f_{e_N})$ it is defined as

$$f_p(I(t_i), I(t_j)) = \begin{cases} 0 & \text{if } C_a \le W(I(t_i), I(t_j)) \\ \dfrac{C_a - W(I(t_i), I(t_j))}{C_a - C_s} & \text{if } C_s < W(I(t_i), I(t_j)) < C_a \\ 1 & \text{if } W(I(t_i), I(t_j)) \le C_s \end{cases}$$

$f_{e_i}$: error function, $I(t_i)$: perception when the skill has been started, $I(t_j)$: current perception, success and abort thresholds $C_s \in \mathbb{R}$ and $C_a \in \mathbb{R}$ ($C_s < C_a$)

- $W(I(t_i), I(t_j)) = \sum_{k=1}^{N} f_{e_k}(I(t_i), I(t_j))$
- [Figure: example graph of $f_p$ for given thresholds $C_s$ and $C_a$]
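The progress function interpolates linearly between the abort threshold $C_a$ (progress 0) and the success threshold $C_s$ (progress 1). A minimal sketch, assuming the accumulated error $W$ has already been computed; the function name `progress` and the example threshold values are illustrative only.

```python
def progress(w, c_s, c_a):
    """Map the accumulated error w to a progress value in [0, 1].

    c_s is the success threshold and c_a the abort threshold, with c_s < c_a.
    """
    if not c_s < c_a:
        raise ValueError("success threshold must be below abort threshold")
    if w >= c_a:
        return 0.0  # error at or above the abort threshold: no progress
    if w <= c_s:
        return 1.0  # error at or below the success threshold: skill succeeded
    return (c_a - w) / (c_a - c_s)


# Hypothetical thresholds chosen for illustration.
aborted = progress(3.0, c_s=0.5, c_a=2.5)
halfway = progress(1.5, c_s=0.5, c_a=2.5)
finished = progress(0.25, c_s=0.5, c_a=2.5)
```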



Imitation
Overview of the approach
- Robots observe each other permanently
- Moving window of observations and well-being states for each observed robot
- Imitation process starts when well-being improvement is detected

Processing steps:
1. observed episode $\langle (o^I_1, e^I_1), \ldots, (o^I_N, e^I_N) \rangle$
2. transform observations into subjective observation data $\langle (o^D_1, e_1), \ldots, (o^D_N, e_N) \rangle$
3. interpret behavior to obtain recognized episodes $\langle \ldots, ((t, o^D, s), a_t, (t', o'^D, s')), \ldots \rangle$
4. estimate rewards to obtain observed interpreted experience $\langle \ldots, ((t, o^D, s), a_t, r_t, (t', o'^D, s')), \ldots \rangle$
5. integrate into experience, update SMDP
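The trigger for imitation (a moving window per observed robot, started when a well-being improvement is detected) might be sketched as follows. Several details are assumptions and not stated on the slide: well-being is treated as a single number where higher is better, an improvement is measured as newest minus oldest value in the window, and the class name `ObservationWindows` and its parameters are hypothetical.

```python
from collections import defaultdict, deque


class ObservationWindows:
    """Keeps a moving window of (observation, well-being) pairs for each observed robot."""

    def __init__(self, size, min_improvement):
        self.min_improvement = min_improvement
        # One bounded window per observed robot; old entries drop out automatically.
        self._windows = defaultdict(lambda: deque(maxlen=size))

    def observe(self, robot_id, observation, well_being):
        """Record one observation; return the observed episode if imitation should start, else None."""
        window = self._windows[robot_id]
        window.append((observation, well_being))
        # Assumption: improvement = newest well-being minus oldest well-being in the window.
        improvement = window[-1][1] - window[0][1]
        if len(window) > 1 and improvement >= self.min_improvement:
            episode = list(window)
            window.clear()
            return episode
        return None


windows = ObservationWindows(size=3, min_improvement=0.5)
first = windows.observe("robot_2", (1.0,), 0.1)
second = windows.observe("robot_2", (0.8,), 0.2)
third = windows.observe("robot_2", (0.1,), 0.9)   # well-being jumped by 0.8
```

The returned episode would then be passed on to the transformation and interpretation steps.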



Imitation
HMM and the Viterbi connection [Viterbi, 1967]

[Figure: hidden Markov model with hidden states $s_a$, $s_b$, $s_c$ and observations $o_x$, $o_y$, $o_z$; transitions are labeled $P(s_b \mid s_a)$ and $P(s_c \mid s_a)$, emissions $P(o_x \mid s_a)$, $P(o_y \mid s_a)$, $P(o_z \mid s_a)$.]

$o_1 o_2 \ldots o_T \rightarrow \text{Viterbi} \rightarrow s_1 s_2 \ldots s_T$

$V(s, t) = P(o_t \mid s_t = s) \max_{s'} \left[ P(s_t = s \mid s_{t-1} = s')\, V(s', t-1) \right]$



Imitation
Interpreting observed behavior with the imitator’s own knowledge

Knowledge in strategy layer
- The imitator’s own transition probabilities $T(s, a, s')$ are used instead of “foreign” HMM transition probabilities

Knowledge in skill layer
- Each skill $a$ (in the example: approach ball, approach goal, lift ball) is associated with a perceptual change $\Delta o$ over the features ball dist, goal dist, and ball height, giving $P(\Delta o \mid a)$
- Skills vote on perceptual changes via $f_p^a$, plus the following heuristics ...

[Figure: HMM whose state transitions are labeled with $T(s, a, s')$ and whose emissions are labeled with $P(\Delta o \mid a)$ for the three example skills.]



Recognition
1. Recognize observation changes $o_{t-1} \to o_t$
   a) Prefer nearer goals
      (Ambiguous situation: the robot might drive either to the red or the yellow goal base)
   b) Ignore skills that “seem to have finished”
   c) Clip votes to $[0, 1]$

   The vote $P_a(o_t \mid o_{t-1})$ of action $a$ is built from its progress function $f_p^a$:

   $$P_a(o_t \mid o_{t-1}) = \begin{cases} \min(\max(\ldots, 0), 1) & \text{if } f_p^a(\ldots) < \epsilon \\ 0 & \text{otherwise} \end{cases}$$

2. Recognize actions in the sequence $o_{t_1}^{t_2} = o_{t_1}, o_{t_1 + \Delta}, \ldots, o_{t_2}$

   $$a_{ml} = \arg\max_a \sum_{t = t_1}^{t_2} P_a(o_t \mid o_{t-1})$$

3. Recognize state transitions

   $$P(s_t \mid s_{t-1}) = T(s_{t-1}, a_{ml}, s_t)$$


Evaluation
Recognition scenario: description

- Demonstrator (right robot) has to transport the yellow ball onto the base
- Imitator (left robot) tries to “understand” its observations
- Two scenarios:
  1. Imitator is only able to drive (and thereby push the ball)
  2. Imitator is also able to lift the ball

[Figure: fig/lifting.png]



Evaluation
Recognition scenario: results

1. Without lifting capabilities
   [Figure: plot of distance [m], segmented into “move to ball”, “???”, and “move to goal”]
   - Recognized “drive to ball” (B) and “drive to goal” (G) correctly
   - Detected “missing behavior” in between
2. With lifting capabilities
   [Figure: plot of distance [m], segmented into “move to ball”, “lift the ball”, and “move to goal”]
   - Recognized “drive to ball” (B), “lift the ball” (L), and “drive to goal” (G) correctly



Evaluation
Multi-robot scenario “three bases”

- Task: transport objects to goal bases
- Reward for reaching an object: 10
- Goal bases provide different rewards
- State space consists of
  - distance to closest object
  - distance of closest object to closest goal
  - ID of closest goal



Conclusion
Objectives achieved in this thesis

1. Combination of learning and imitation
2. Non-obtrusive recognition and learning
of observed behavior
3. Support for heterogeneous robot
groups

Thank you for your attention!


Architecture
- State of the art
- Overview
- Layer interaction
- Motivation layer
  - Excitation
  - Prioritizing goals
- Strategy layer
  - State abstraction
  - Heuristics
  - Policy
  - Sample frequency
  - Strategy example
- Skill layer
  - Overview of the approach (explore, exploit)
  - Skill manager
  - Model manager
  - Error minimizer
  - Configuration
  - Skill example

Imitation in robot groups
- Overview of the approach
- Recognizing behavior
  - Viterbi
  - Interpreting observed behavior
  - Recognition example
  - Integrating recognized behavior
- Evaluation
  - CTF with three bases: performance, state abstraction, group homogeneity
  - CTF with five bases: performance, state abstraction, group homogeneity

Choice of the imitatee
- Affordance detection
- Affordance network generation
- Comparing ANs
- Choice of the imitatee
- Evaluation
  - Parameterization of the environment
  - Robustness experiment
  - Clustering experiment



State of the art
[Takahashi et al., 2008] use imitation to learn robotic soccer behaviors (approaching, shooting a ball)
- combines learning with imitation
- requires the robot group to stop whenever a robot imitates
- needs multiple presentations of the same behavior
- needs sufficient prior knowledge of the task to imitate
[Priesterjahn, 2008] evolves game bots with performance similar to that of the human player
- shows that imitation-based adaptation is able to outperform the purely evolutionary approach
- targeted at computer game scenarios, not stochastic real-world applications
- assumes group homogeneity
- [Figure: The Rule-Based Operation Cycle of an Agent]
[Inamura et al., 2003] combine top-down teaching with bottom-up learning from the robot’s side
- exclusive approach (cannot be combined with other learning techniques)
- HMM is learned and then fixed throughout the robot’s lifetime
- [Figures: Motion capturing system: motion for learning data; A result of motion generation on a humanoid robot]



Layer interaction

[Figure: sequence diagram with the lifelines clock, motivation layer, strategy layer, skill layer, perception, and action. A “next strategy step” event starts a strategy step: request $I_m$, processed perception, set next motivation; request $I_s$, processed perception, determine next strategy step, set next skill. Each “next skill step” event starts a skill step: request $I_a$, processed perception, calculate best actuator command, set next low-level action.]

(1) Strategy step is triggered
- The current motivation and the corresponding next strategy action are determined.
- The strategy layer requires the most current motivation as feedback regarding its last chosen action; both are synchronous.
(2) Skill step is triggered
- The strategy step does not have to be finished yet.
- The skill layer simply executes according to the action most recently delivered by the strategy layer.
(3), (4) Strategy step has finished
- It signals the next action to execute to the skill layer.
- Subsequent skill steps then perform this action accordingly.



Motivation layer
Motivation system example

- The current motivation $\mu$ is the vector to the current drive state, dependent on
  - time
  - perception
- Each drive measures the status of accomplishing a sub-goal (0 = fully accomplished)
- A drive $i$ is called satisfied (goal achieved) if the corresponding motivation is below its threshold: $\mu_i < \mu_i^{\theta}$

[Figure: drive space spanned by drive 1, drive 2, and drive 3, showing the well-being region, the current drive state, the current motivation, and $\mu_p$, the shortest vector to the desired drive area, used for prioritization.]



A sub-goal subjected to an excitation

[Figure: excitation (0 to 1) plotted over time $t$, with the well-being region and the threshold that triggers behavior.]

- Excitation describes the force to which the current drive state is subjected.
- By specifying it dependent on the perception and on the internal state of the robot, the user is “programming” the final behavior.



Prioritizing goals

- At each time step, the motivation layer provides the current motivation vector to the strategy layer.
- With $\mu_p$ the strategy layer prioritizes which of the sub-goals are to be handled first:

$$\mu_p = \begin{pmatrix} \max(0, \mu_1 - \mu_1^{\theta}) \\ \max(0, \mu_2 - \mu_2^{\theta}) \\ \vdots \\ \max(0, \mu_n - \mu_n^{\theta}) \end{pmatrix}$$

- Different drives can be prioritized by scaling them accordingly, which models a hierarchy of needs
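The prioritization vector and the satisfaction test can be computed component-wise. In this sketch the per-drive scaling is modeled as a simple multiplicative weight, which is an assumption about how the "according scaling" is applied; the function names and the numbers are illustrative.

```python
def satisfied(motivation, thresholds):
    """A drive i is satisfied if its motivation is below its threshold."""
    return [m < th for m, th in zip(motivation, thresholds)]


def priority_vector(motivation, thresholds, scale=None):
    """mu_p: component i is max(0, mu_i - threshold_i), optionally scaled per drive."""
    if scale is None:
        scale = [1.0] * len(motivation)
    return [w * max(0.0, m - th) for m, th, w in zip(motivation, thresholds, scale)]


motivation = [0.75, 0.125, 0.5]
thresholds = [0.25, 0.25, 0.25]

state = satisfied(motivation, thresholds)
mu_p = priority_vector(motivation, thresholds)

# Hypothetical hierarchy of needs: drive 3 counts four times as much as the others.
mu_p_scaled = priority_vector(motivation, thresholds, scale=[1.0, 1.0, 4.0])
most_urgent = max(range(len(mu_p_scaled)), key=lambda i: mu_p_scaled[i])
```

Without scaling, drive 1 is the most urgent; with the weight on drive 3, the order flips, which is the effect a hierarchy of needs is meant to produce.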



Strategy layer
Sample frequency

A new interaction is created if one of the following conditions holds (the index $t-1$ denotes the previous interaction):
- Sufficiently different perception (measured by some scenario-specific distance metric $d$): $d(o_t, o_{t-1}) > \theta_o$
- Sufficiently interesting motivation change: $|\mu_t - \mu_{t-1}| > \theta_r$
- Enough time has passed: the time elapsed since the previous interaction exceeds $\theta_t$

$\theta_o$, $\theta_r$, and $\theta_t$ are application specific and have to be determined empirically.
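The three sampling conditions combine with a logical OR. A minimal sketch, assuming Euclidean distance both as the scenario-specific metric $d$ and as the norm of the motivation change; the function name `should_record` and all threshold values are illustrative.

```python
import math


def should_record(o_t, o_prev, mu_t, mu_prev, t, t_prev,
                  theta_o, theta_r, theta_t, distance=math.dist):
    """Return True if a new interaction should be created."""
    perception_changed = distance(o_t, o_prev) > theta_o
    # Assumption: the motivation change is measured with the Euclidean norm.
    motivation_changed = math.dist(mu_t, mu_prev) > theta_r
    time_elapsed = (t - t_prev) > theta_t
    return perception_changed or motivation_changed or time_elapsed


thresholds = dict(theta_o=0.5, theta_r=0.2, theta_t=2.0)
by_perception = should_record((1.0, 0.0), (0.0, 0.0), (0.5,), (0.5,), 1.0, 0.5, **thresholds)
by_motivation = should_record((0.0, 0.0), (0.0, 0.0), (0.75,), (0.5,), 1.0, 0.5, **thresholds)
by_time = should_record((0.0, 0.0), (0.0, 0.0), (0.5,), (0.5,), 3.5, 0.5, **thresholds)
nothing_new = should_record((0.0, 0.0), (0.0, 0.0), (0.5,), (0.5,), 1.0, 0.5, **thresholds)
```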



Strategy example

[Figure: two grid-world panels with start S and goals G; cells are labeled with coordinates from (1, 1) to (6, 6).]


Skill layer

1. discover and learn a set of skills that are useful to the strategy layer (ground symbols $\in A$)
2. execute them when requested and optimize at runtime

Data flow in exploration mode
[Figure: the strategy layer puts the skill layer into training mode and is notified of new skills. Inside the skill layer, the skill manager explores actions $O$ and creates skills, the model manager creates models from the perception $I_a$, and the error minimizer fetches skills.]

Data flow in exploitation mode
[Figure: the strategy layer puts the skill layer into execution mode and requests a skill. Inside the skill layer, the skill manager sets the current skill, the model manager updates the models from the perception $I_a$, and the error minimizer fetches the current skill and outputs the action $O$.]



Skill definition

- extraction function $f_{ext} : I_a \to \mathbb{R}$ extracts information from a perception $I(t) \in I_a$
- control function $f_c : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ associates an error value to the tuple $(v_{t_i}, v_{t_j})$
  - decrease: $f_c(v_{t_i}, v_{t_j}) = |v_{t_j}|$
  - increase: $f_c(v_{t_i}, v_{t_j})$ falls as $|v_{t_j}|$ grows
  - keep value: $f_c(v_{t_i}, v_{t_j})$ is the absolute deviation of $v_{t_j}$ from $v_{t_i}$, with an offset $\delta$
- error function $f_e : I_a \times I_a \to \mathbb{R}$ assigns an error value to a perception pair
- progress function $f_p : I_a \times I_a \to [0, 1]$ measures a skill’s progress between two time points
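An error function can be read as the composition of an extraction function with a control function. The sketch below shows that composition. Only the "decrease" form is taken directly from the slide; the concrete "increase" and "keep value" expressions used here (negated magnitude, and absolute difference with an offset `delta`) are assumptions, as are the helper name `make_error_function` and the dictionary-based perception.

```python
def decrease(v_start, v_now):
    """Error is the magnitude of the current value, so minimizing it drives the value to 0."""
    return abs(v_now)


def increase(v_start, v_now):
    """Assumed form: error falls as the magnitude of the current value grows."""
    return -abs(v_now)


def keep_value(v_start, v_now, delta=0.0):
    """Assumed form: deviation of the current value from the start value shifted by delta."""
    return abs(v_start + delta - v_now)


def make_error_function(extract, control):
    """Compose f_e(I(t_i), I(t_j)) = f_c(f_ext(I(t_i)), f_ext(I(t_j)))."""
    def error(perception_start, perception_now):
        return control(extract(perception_start), extract(perception_now))
    return error


def ball_distance(perception):
    """Hypothetical extraction function."""
    return perception["ball_dist"]


minimize_ball_distance = make_error_function(ball_distance, decrease)
hold_ball_distance = make_error_function(ball_distance, keep_value)

start = {"ball_dist": 2.0}
now = {"ball_dist": 1.5}
err_decrease = minimize_ball_distance(start, now)
err_keep = hold_ball_distance(start, now)
```

With a fixed set of extraction and control functions, the space of candidate error functions is just their cross product, which is what keeps skill exploration tractable.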



Skill manager

- exploration phase
  - generate skills that enable the robot to control the perceived properties
  - assign a priority to each skill dependent on its execution priority
  - determine the skills the robot can reliably perform and notify them as new skills to the strategy layer
- exploitation phase
  - manage the execution of requested skills

[Figures: data flow of the skill layer in exploration mode (training mode, notify new skill, explore actions, create skills, create models) and in exploitation mode (set current skill, update models, fetch current skill).]



Model manager

- creating prediction models for each perceived property
  - a prediction model is the tuple $(id_p, \tilde{S}, \tilde{M}, m)$
    - $id_p \in ID_p$: perception feature to be predicted
    - $\tilde{S}$: subset of the perceptual features (drawn from $ID_o$ and $ID_p$)
    - $\tilde{M} \subseteq O$: subset of the actuators to control
    - $m : \mathbb{R}^{|\tilde{S}| + |\tilde{M}|} \to \mathbb{R}$ predicts the value for the perceptual feature $id_p$ at the next input perception given the values of $\tilde{S}$ and $\tilde{M}$
  - $m$ in experiments: Poly, RBF
- updating prediction models to reflect new experiences
- scoring each model dependent on its prediction accuracy: $score(m)$ aggregates, over the $n$ most recent interactions $t_i$, the deviation between the prediction $m(\tilde{S}(t_i), \tilde{M}(t_i))$ and the observed value $v$

[Figures: data flow of the skill layer in exploration mode (create models) and in exploitation mode (update models).]



Error minimizer
1. $I_c(t)$: only those perceptual features on which the error functions of the current skill $s$ depend, at the current time $t$
2. Estimate the next perception $I_c^M(t^*)$, dependent on the motor action $M$, as predicted by $m_j^{best} = \arg\max_{\tilde{m}_j} score(\tilde{m}_j)$:

   $$I_c^M(t^*) = \{ m_j^{best}(I_c(t), M(t)) \mid p_j \in I_c(t) \}$$

3. For each error function $f_{e_k}$: calculate the expected next error $e_k^M(t^*)$, with $I_c(t_i)$ being the perception when the skill has been started:

   $$e_k^M(t^*) = f_{e_k}(I_c(t_i), I_c^M(t^*))$$

4. Determine the best actuator command $M_{next}(t)$ by finding the one that minimizes the accumulated expected error:

   $$M_{next}(t) = \arg\min_M \sum_{k=1}^{N} e_k^M(t^*)$$

$t^*$ is the time point of the next interaction after time $t$
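Steps 2 to 4 amount to a one-step lookahead: predict the next perception for each candidate command, sum the expected errors, and keep the command with the smallest sum. The sketch assumes a finite list of candidate commands and a hand-written toy prediction model; `best_command`, `toy_model`, and the numbers are hypothetical.

```python
def best_command(candidates, predict_next, error_functions, start, current):
    """Return the candidate command with the smallest accumulated expected error."""
    def accumulated_error(command):
        # Step 2: predicted next perception for this command.
        predicted = predict_next(current, command)
        # Steps 3 and 4: sum of the expected errors of all error functions.
        return sum(f_e(start, predicted) for f_e in error_functions)

    return min(candidates, key=accumulated_error)


def toy_model(perception, speed):
    """Hypothetical learned model: driving forward reduces the ball distance by 0.5 * speed."""
    return {"ball_dist": perception["ball_dist"] - 0.5 * speed}


def ball_distance_error(start, predicted):
    """Error function of a skill that minimizes the ball distance."""
    return abs(predicted["ball_dist"])


start = {"ball_dist": 2.0}     # perception when the skill was started
current = {"ball_dist": 1.0}   # current perception
chosen = best_command([-1.0, 0.0, 1.0, 2.0, 3.0], toy_model,
                      [ball_distance_error], start, current)
```

Here a speed of 2.0 is chosen because it is predicted to bring the ball distance exactly to zero; faster commands would overshoot.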


Skill layer configuration

Greater universality leads to a bigger exploration space. It is wise to limit the
exploration space by specifying non-changing parameters beforehand. This can be
achieved by configuring the following parameters:
- Degrees of freedom specify the number of actuators the skill layer has to control.
- Extraction functions define the language that can be used to specify the error
  functions.
- Control functions specify the functions that the error minimizer will minimize by
  means of the error functions.
- Regression models are used by the model manager to build predictions for the
  environment interaction. A regression model consists of two methods: one that fits
  a model to an experience trace and one that predicts the value of the modeled
  property.
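The two-method interface of a regression model (fit to an experience trace, predict a value) can be shown with a one-dimensional least-squares line. This is a stand-in only: the slides name polynomial and RBF models, while the class `LinearModel`, the trace layout as `(input, observed value)` pairs, and the negative mean squared error used as the score are assumptions.

```python
class LinearModel:
    """Stand-in regression model with the two required methods: fit and predict."""

    def __init__(self):
        self.slope = 0.0
        self.intercept = 0.0

    def fit(self, trace):
        """Fit a least-squares line to an experience trace of (input, observed value) pairs."""
        n = len(trace)
        mean_x = sum(x for x, _ in trace) / n
        mean_y = sum(y for _, y in trace) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in trace)
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in trace)
        self.slope = cov_xy / var_x if var_x else 0.0
        self.intercept = mean_y - self.slope * mean_x

    def predict(self, x):
        """Predict the value of the modeled property for input x."""
        return self.slope * x + self.intercept


def score(model, trace):
    """Assumed score: negative mean squared prediction error (higher is better)."""
    return -sum((model.predict(x) - y) ** 2 for x, y in trace) / len(trace)


trace = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
model = LinearModel()
model.fit(trace)
prediction = model.predict(3.0)
model_score = score(model, trace)
```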



Skill example
“Minimize angle to object” learned with radial basis functions

Controlling speed dependent on angle and Controlling rotational speed dependent on
distance to the object angle and distance to the object



Imitation
Viterbi [Viterbi, 1967]

Problem description
- Given the observation sequence $o^N = \langle o_1, o_2, \ldots, o_N \rangle$ ($o_i \in \mathbb{R}^d$)
- Find the most likely hidden state sequence $s^N = \langle s_1, s_2, \ldots, s_N \rangle$ ($s_i \in S$)

Approach
- Maximizing the probability $P(s^N \mid o^N)$: $s^{N*} = \arg\max_{s^N} P(s^N \mid o^N)$
  by recursively calculating the probability $V(s, t) = \max_{s^{t-1}} P(o^t, s_1 \ldots s_{t-1}, s_t = s)$ that $s \in S$ is the hidden state at time $t$ given the observations $o^t$:
  - $V(s, 1) = P(o_1 \mid s_1 = s)\, P(s_1 = s) \quad \forall s \in S$
  - $V(s, t) = P(o_t \mid s_t = s) \max_{s'} \left[ P(s_t = s \mid s_{t-1} = s')\, V(s', t-1) \right]$
  - $\phi(s, t) = \arg\max_{s'} \left[ P(s_t = s \mid s_{t-1} = s')\, V(s', t-1) \right]$
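The recursion for $V(s, t)$ and the back-pointers $\phi(s, t)$ translate into a short dynamic program. The sketch below is the textbook Viterbi algorithm on a discrete toy HMM; the state names, probabilities, and the dictionary-based model layout are assumptions for illustration and not the adapted variant used for behavior recognition.

```python
def viterbi(observations, states, prior, transition, emission):
    """Return the most likely hidden state sequence and its joint probability.

    prior[s] = P(s_1 = s), transition[s_prev][s] = P(s_t = s | s_{t-1} = s_prev),
    emission(s, o) = P(o | s).
    """
    # V(s, 1) = P(o_1 | s) * P(s_1 = s)
    V = [{s: emission(s, observations[0]) * prior[s] for s in states}]
    back = []
    for o in observations[1:]:
        row, pointers = {}, {}
        for s in states:
            # phi(s, t): best predecessor of s
            best_prev = max(states, key=lambda sp: transition[sp][s] * V[-1][sp])
            pointers[s] = best_prev
            # V(s, t) = P(o_t | s) * max_s' [P(s | s') * V(s', t-1)]
            row[s] = emission(s, o) * transition[best_prev][s] * V[-1][best_prev]
        V.append(row)
        back.append(pointers)
    # Backtrack from the most likely final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, V[-1][last]


# Toy HMM (hypothetical numbers).
states = ["a", "b"]
prior = {"a": 0.5, "b": 0.5}
transition = {"a": {"a": 0.8, "b": 0.2}, "b": {"a": 0.2, "b": 0.8}}
emission_table = {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.1, "y": 0.9}}
path, probability = viterbi(["x", "x", "y", "y"], states, prior, transition,
                            lambda s, o: emission_table[s][o])
```

For longer sequences the products underflow, so practical implementations usually work with log-probabilities; the plain products are kept here to mirror the formulas.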



Imitation
Recognition

Problem description
- Given the observation sequence $o^N = \langle o_1, o_2, \ldots, o_N \rangle$ ($o_i \in \mathbb{R}^d$)
- Find the most likely behavior sequence ($t \in \mathbb{R}$, $o \in \mathbb{R}^d$, $s \in S$, $a \in A$)
  $\Gamma = (\ldots, ((t_k, o_k, s_k), a_k, (t_{k+1}, o_{k+1}, s_{k+1})), \ldots)$

Approach
- Maximizing the probability $P(s^n, a^n \mid o^N)$, where the behavior sequence has length $n$ and the observation sequence has length $N$
- Adapting $V(s, 1)$ and $V(s, t)$:
  - Use own state and action space for $S$ and $A$
  - Support bootstrapping of probabilities
  - Let actions recognize themselves: technical realization of the mirror neuron system

