In these slides, we present a common gesture and speech production framework for both virtual agents (ECAs, IVAs, virtual humans) and physical agents such as humanoid robots. The framework is designed for different embodiments, so that its processes are independent of any specific agent.
1. A Common Gesture and Speech Production Framework for Virtual and Physical Agents
Quoc Anh Le - Jing Huang - Catherine Pelachaud
CNRS, LTCI
Telecom-ParisTech, France
Workshop on Speech and Gesture Production, ICMI 2012, Santa Monica, CA, USA
2. Introduction
Motivations
• Similar approaches between virtual agents and humanoid robots
• Limits of existing systems: agent-dependent
Objectives
• A common co-verbal gesture generation framework for both virtual and physical agents
Methodology
• Based on the GRETA system
• Uses:
- the same representation languages
- the same algorithm for selecting and planning gestures
- different algorithms for creating the animation
3. Architecture Overview
[Architecture diagram: input data (text, audio, video, etc.) feeds three common modules, the Intent Planner, Behavior Planner and Behavior Realizer, which exchange FML-APML and BML messages and draw on an intent lexicon and behavior lexicons (baselines and gestuaries for Nao and Greta). The resulting keyframes are sent through the ActiveMQ messaging central system to agent-specific Animation Realizer modules: for Greta, FAP-BAP values go to the FAP-BAP player; for Nao, joint values go to the robot's built-in proprietary procedures. Each agent has its own animation lexicon.]
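To make the common/specific split concrete, here is a minimal sketch in Python of how the common Behavior Realizer could hand keyframes to agent-specific Animation Realizers. The class and function names (AnimationRealizer, behavior_realizer_output, etc.) are illustrative assumptions, not the actual GRETA or Nao APIs.

# Minimal sketch of the common / agent-specific split (hypothetical names, not the real APIs).
from abc import ABC, abstractmethod
from typing import List

class Keyframe:
    def __init__(self, time: float, description: str):
        self.time = time                  # absolute time in seconds
        self.description = description    # symbolic description of the configuration

class AnimationRealizer(ABC):
    """Agent-specific module: turns symbolic keyframes into animation parameters."""
    @abstractmethod
    def realize(self, keyframes: List[Keyframe]) -> None: ...

class GretaAnimationRealizer(AnimationRealizer):
    def realize(self, keyframes):
        # Would produce FAP-BAP values for the Greta player.
        print(f"Greta: computing FAP-BAP values for {len(keyframes)} keyframes")

class NaoAnimationRealizer(AnimationRealizer):
    def realize(self, keyframes):
        # Would produce joint values for Nao's built-in proprietary procedures.
        print(f"Nao: computing joint values for {len(keyframes)} keyframes")

def behavior_realizer_output(keyframes: List[Keyframe], realizers: List[AnimationRealizer]):
    # In the real system the keyframes travel over the ActiveMQ messaging central system;
    # here they are simply dispatched to every registered realizer.
    for realizer in realizers:
        realizer.realize(keyframes)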
4. Behavior Realizer
[The same architecture diagram as on the previous slide, here highlighting the Behavior Realizer common module, which receives BML from the Behavior Planner and produces keyframes.]
5. Behavior Realizer: Outline
Processes common to all agents
1. Create the gesture from the agent's gestuary
2. Schedule the timing of the gesture phases
3. Generate keyframes: pairs (absolute time, symbolic description of the hand configuration at this time); see the sketch after this list
Different databases
For Nao
- Gestuary (for instance, pointing with a fully stretched arm)
- Velocity profile (empirically determined from Nao)
For Greta
- Gestuary (for instance, pointing with one finger)
- Velocity profile (empirically determined from real humans)
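A minimal sketch of steps 1-3 above, in Python. The gestuary data format and the function names are assumptions for illustration, not the actual GRETA code: a gesture template is looked up in the gestuary, its phases are scheduled around the stroke, and one keyframe (absolute time, symbolic description) is emitted per phase boundary.

# Sketch of the three Behavior Realizer steps (hypothetical data format, not GRETA code).
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Keyframe:
    time: float        # absolute time in seconds
    hand_shape: str    # symbolic hand configuration at this time
    arm_position: str  # symbolic arm position, e.g. "XEP YUpperEP ZNear"

# 1. Gestuary: symbolic gesture templates; each phase has an offset from the stroke start.
Gestuary = Dict[str, List[Tuple[str, str, float]]]  # id -> [(hand_shape, arm_position, offset)]

def realize(gesture_id: str, stroke_start: float, gestuary: Gestuary) -> List[Keyframe]:
    template = gestuary[gesture_id]
    keyframes = []
    for hand_shape, arm_position, offset in template:
        # 2. Schedule the phase boundary relative to the planned stroke start.
        # 3. Emit the keyframe: (absolute time, symbolic description of the configuration).
        keyframes.append(Keyframe(stroke_start + offset, hand_shape, arm_position))
    return keyframes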
7. BR: Synchronization with speech
Algorithm
• Compute the preparation phase
• Do not perform gesture i if there is not enough time (strokeEnd(i-1) + duration > strokeStart(i))
• Add a hold phase to fit the planned gesture duration
• Co-articulation between consecutive gestures:
- If there is enough time, add a retraction phase (i.e. go back to the rest position)
- Otherwise, go from the end of the stroke directly to the preparation phase of the next gesture
[Timeline sketches illustrate both cases, marking gesture start/end and stroke start (S-start) / stroke end (S-end).]
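The scheduling decision above can be sketched as follows in Python. The variable names (prep_duration, retraction_duration) and the exact structure are assumptions for illustration, not the published algorithm's code.

# Sketch of the synchronization rules (hypothetical names).
def schedule(prev_stroke_end, stroke_start, stroke_end,
             prep_duration, retraction_duration, next_stroke_start=None):
    # Skip the gesture if there is not enough time for its preparation phase.
    if prev_stroke_end + prep_duration > stroke_start:
        return None
    phases = {"preparation_start": stroke_start - prep_duration,
              "stroke_start": stroke_start,
              "stroke_end": stroke_end}
    if next_stroke_start is None or stroke_end + retraction_duration < next_stroke_start:
        # Enough time: retract to the rest position.
        phases["retraction_end"] = stroke_end + retraction_duration
    else:
        # Co-articulation: hold the stroke end position until the next preparation starts.
        phases["hold_end"] = next_stroke_start
    return phases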
8. BR: Velocity profiles
Gesture velocity
• Predict a movement duration using Fitts' law:
Movement Time = a + b * log2(Distance + 1)
• Thresholds on maximal speeds (empirically determined)
• The stroke phase differs from the other phases in velocity and acceleration (Quek, 1995)
Add expressivity
• Temporal extent (TMP): modulate the duration of the whole gesture
=> change the coefficients of Fitts' law (see the sketch below)
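A small sketch of how a movement duration could be predicted with Fitts' law and modulated by the temporal-extent (TMP) parameter. The coefficients a and b and the TMP mapping are illustrative assumptions, not the values used in the system.

import math

def movement_time(distance: float, tmp: float = 0.0, a: float = 0.1, b: float = 0.15) -> float:
    """Fitts'-law prediction MT = a + b * log2(distance + 1), with the TMP
    expressivity parameter scaling the coefficients: higher TMP -> faster gesture."""
    scale = 1.0 - 0.5 * tmp   # illustrative mapping of TMP to a coefficient change
    return scale * (a + b * math.log2(distance + 1.0))

# Example: the same reach with neutral vs. high temporal extent.
print(movement_time(0.4))             # neutral
print(movement_time(0.4, tmp=1.0))    # faster gesture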
10. Animation Realizer
[The same architecture diagram, here highlighting the agent-specific Animation Realizer modules: the Greta Animation Realizer outputs FAP-BAP values, the Nao Animation Realizer outputs joint values, each drawing on its own animation lexicon.]
11. Implemented expressivity parameters
EXP | Definition | Nao | Greta
TMP (temporal extent) | Velocity of movement | Change coefficients of Fitts' law | Change coefficients of Fitts' law
SPC (spatial extent) | Amplitude of movement | Limited to predefined key positions | Change gesture space scales
PWR (power) | Acceleration of movement | Modulate stroke duration | Modulate stroke acceleration
REP (repetition) | Number of stroke repetitions | Yes | Yes
FLD (fluidity) | Smoothness and continuity | No | No
OPN (openness) | Spatial extent relative to the body | No | Elbow swivel angle
TEN (tension) | Muscular tension | No | No
Create animation parameters
• Joint values for Nao
• BAP values for Greta
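One way to carry the parameters listed in the table above through the pipeline is a simple container. The sketch below (Python) is an assumption about the representation; the value range and field comments are illustrative, and the table above describes what each agent actually does with the values.

# Illustrative container for the expressivity parameters (hypothetical representation).
from dataclasses import dataclass

@dataclass
class Expressivity:
    tmp: float = 0.0  # temporal extent: velocity of movement (Fitts'-law coefficients)
    spc: float = 0.0  # spatial extent: amplitude of movement
    pwr: float = 0.0  # power: acceleration of the stroke
    rep: int = 0      # repetition: number of stroke repetitions
    fld: float = 0.0  # fluidity: smoothness and continuity (not yet implemented)
    opn: float = 0.0  # openness: spatial extent relative to the body (Greta: elbow swivel)
    ten: float = 0.0  # tension: muscular tension (not yet implemented)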
12. Create animation parameters
Discretization of the gestural space of McNeill (1992)
Each symbolic position is translated into concrete joint values of the agent (for instance, the 6 joints of Nao in the table below)
Code ArmX ArmY ArmZ Joint values (LShoulderPitch, LShoulderRoll, LElbowYaw, LElbowRoll, LWristYaw, Hand)
000 XEP YUpperEP ZNear (-54.4953, 22.4979, -79.0171, -5.53477, -0.00240423, 1.0)
001 XEP YUpperEP ZMiddle (-65.5696, 22.0584, -78.7534, -8.52309, -0.178188, 1.0)
002 XEP YUpperEP ZFar (-79.2807, 22.0584, -78.6655, -8.4352, -0.178188, 1.0)
010 XEP YUpperP ZNear (-21.0964, 24.2557, -79.4565, -26.8046, 0.261271, 1.0)
... ... ... ... ...
Translate symbolic keyframes into joint values
Animation is obtained by interpolating between joint values:
- for Nao, with the robot's built-in proprietary procedures
- for Greta, with Slerp (spherical linear interpolation) and time warping (easing in/out functions)
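A minimal sketch of the interpolation used on the Greta side as described above: spherical linear interpolation (Slerp) between two joint quaternions with an ease-in/ease-out time-warping function. The quaternion representation and the specific easing curve below are assumptions, not Greta's actual implementation.

import math

def ease_in_out(t: float) -> float:
    # Smoothstep-like time warping: slow start and end, faster in the middle.
    return t * t * (3.0 - 2.0 * t)

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                       # take the shorter arc
        q1, dot = [-c for c in q1], -dot
    if dot > 0.9995:                    # nearly parallel: fall back to linear interpolation
        return [a + t * (b - a) for a, b in zip(q0, q1)]
    theta = math.acos(dot)
    s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(q0, q1)]

def interpolate_joint(q_start, q_end, t):
    # Combine Slerp with the easing function to warp time between two keyframes.
    return slerp(q_start, q_end, ease_in_out(t))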
13. Greta: Full Body IK
Torso IK
• Analytic method: arm to torso
• Torso target depends on the hand positions
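A minimal sketch of the idea in Python: the torso target is derived analytically from the midpoint of the two hand targets and decomposed into a horizontal (yaw) and a vertical (pitch) component. Axis conventions, the gain factor and the function name are assumptions, not Greta's actual IK code.

import math

def torso_target(left_hand, right_hand, shoulder_center, gain=0.2):
    """Hypothetical sketch: derive a torso orientation from the midpoint of both
    hand targets, decomposed into a horizontal (yaw) and a vertical (pitch) angle."""
    # Midpoint of the two hand targets, relative to the shoulder center.
    mx = (left_hand[0] + right_hand[0]) / 2.0 - shoulder_center[0]
    my = (left_hand[1] + right_hand[1]) / 2.0 - shoulder_center[1]
    mz = (left_hand[2] + right_hand[2]) / 2.0 - shoulder_center[2]
    # Analytic decomposition: yaw around the vertical axis, pitch around the lateral axis.
    yaw = math.atan2(mx, mz)
    pitch = -math.atan2(my, math.hypot(mx, mz))
    # Only a fraction of the rotation is applied to the torso (the rest stays on the arm).
    return gain * yaw, gain * pitch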
16. Perceptive Evaluation
Objective
• Evaluate how the robot's gestures are perceived by human users
Procedure
• Participants (63 French speakers) rated videos of a Nao storyteller
• Versions were displayed to participants in random order:
- gestures with expressivity vs. gestures without expressivity
- gesture-speech synchronization vs. gesture-speech asynchrony
Results (using the ANOVA method)
• Synchronization:
- F(1, 124) = 4.94, p < .05
- 76% agreed that gestures were synchronized with speech for the synchronized version
• Expressivity:
- F(1, 124) = 4.43, p < .05
- 70% agreed that gestures were expressive for the expressive version
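For illustration only, a one-way ANOVA of this kind can be computed as below in Python with scipy. The ratings here are made-up placeholders, not the study's data.

# Illustrative only: how a one-way ANOVA like the one reported above can be run.
from scipy.stats import f_oneway

sync_ratings  = [4, 5, 4, 3, 5, 4, 4, 5]   # hypothetical ratings of the synchronized videos
async_ratings = [3, 2, 4, 3, 3, 2, 4, 3]   # hypothetical ratings of the asynchronous videos

f_value, p_value = f_oneway(sync_ratings, async_ratings)
print(f"F = {f_value:.2f}, p = {p_value:.3f}")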
17. State of the art
Most similar work: Salem et al. (2012)
• Same idea (based on the existing Max virtual agent system)
Main differences:
• Our system: re-designed GRETA as a common framework
• Salem et al.'s system: adapted Max's ACE to the ASIMO robot
Features | Our model | Salem et al.'s system
Gesture production | Online, from templates, regardless of a specific domain | Automatically generated from a trained, domain-specific data corpus
Gesture shapes | Agent-specific parameter | Original for Max, mapped to ASIMO configurations
Gesture timing | Agent-specific parameter | Original for Max, adapted to ASIMO by feedback
Expressivity | Yes | No
Synchronization | Adapt gesture to speech | Cross-modal adjustment
18. Future work
Short-term plan
• Human-like gestures: enhance velocity profiles
• Expressivity: implement fluidity and tension
Long-term plan
• Feedback mechanism
• Study of the coherence between consecutive gestures in a G-Unit (Kendon, 2004)
Editor's Notes
Give a description of the keyframes: what information do they contain?
Add the missing definitions. "Power": acceleration simulated through Slerp (frame interpolation) or trajectory interpolation, using time-variation functions (easing in/out functions). Expressive posture: volume editing; with the Power parameter, the relative torso rotation varies with time and with the gesture target positions, due to inertia. Expressive animated sequence: sequential editing; "fluidity" and "tension" using TCB splines and noise functions (for the trajectory).
Joint rotation interpolation: use Slerp (spherical linear interpolation) with time warping (easing in/out functions). Definition of trajectory parameters: various trajectory paths (line, circle, spiral, etc.). Expressivity: Kochanek-Bartels splines (TCB splines).
For posture generation, we use forward kinematics (FK): FK defines the initial states and IK retargets the postures. Relative torso movement is first generated by using a potential torso target that depends on the positions of both hand gestures (vt1, vl5). We decompose the torso movement into horizontal and vertical movements depending on the center of the two hand targets, and solve it directly with an analytical method. Head direction is generated by FK, and a trigonometric function is used for gaze. For arm gestures we use a mass-spring solver, which can apply lightweight shoulder movements by defining the arm chain from the sternoclavicular joint to the wrist. This allows us to model passive shoulder movement.
The system of Salem et al. produces gesture parameters that may result in mistimed synchronization with the speech affiliate, due to physical joint velocity limits. In Max, gesture shapes are designed for the virtual agent, hence the mapping solution.
Long-term plan: mutual synchronization, adapting phoneme durations to gestures.