Self-talk discrimination in Human-Robot Interaction Situations For Engagement Characterization
Jade LE MAITRE · Mohamed CHETOUANI
Received: date / Accepted: date
Abstract The estimation of engagement is a fundamental issue in Human-Robot Interaction and assistive applications. In this paper, we describe (1) the design of a triadic situation for cognitive stimulation for elderly users; (2) the characterization of social signals describing engagement: system directed speech (SDS) and self-talk (ST); (3) a framework for estimating an interaction effort measure revealing the engagement of users. The proposed triadic situation is formed by a user, a computer providing cognitive exercises and a robot providing encouragement and help through verbal and non-verbal signals. The methodology followed for the design of this situation is presented. Wizard-of-Oz experiments have been carried out and analyzed through eye-contact behaviors and dialogue acts (SDS and ST). An automatic recognition system for these dialogue acts is proposed with k-NN, decision tree and SVM classifiers trained with pitch, energy and rhythm-based features. The best recognition system achieved an accuracy of 71%. Durations of both manually and automatically labelled SDS and ST were combined to estimate the Interaction Effort (IE) measure. Experiments on collected data prove the effectiveness of the IE measure in capturing the engagement of elderly patients during the cognitive stimulation task.

Keywords Social signal processing · Measuring engagement · Prosodic cues

Jade Le Maitre
ISIR UMR 7222, Université Pierre et Marie Curie
E-mail: lemaitre@isir.upmc.fr

Mohamed Chetouani
ISIR UMR 7222, Université Pierre et Marie Curie
E-mail: mohamed.chetouani@upmc.fr

1 Introduction

During the past decades, there has been growing interest in service robotics, partially due to human assistive applications. The proposed robotic systems are designed to address various supports: physical, cognitive or social. Human-Robot Interaction (HRI) plays a major role in these applications, identifying more precisely Socially Assistive Robotics (SAR) [1] as a promising field. Indeed, SAR aims to aid patients through social interaction, with several applications including motivation and encouragement during exercises [1–3]. Providing social signals during interaction is continuously done during human-human interaction [4], and a lack of them is identified in pathologies such as autism [5]. Interpretation and generation of social signals allow sustaining and enriching interactions with conversational agents and/or robots. In [6], an early system that realizes the full action-reaction cycle of communication by interpreting multimodal user input and generating multimodal agent behaviors is presented. The importance of feedback for the regulation of interaction has been highlighted in several situations [7,8].

The ROBADOM project [9] is devoted to the design of a robot-based solution for assistive daily living aids: management of shopping lists, meetings, medicines, reminders of appointments. Within the project, we are developing a specific robot to provide verbal and non-verbal help such as encouragement and coaching during cognitive stimulation exercises. Cognitive stimulation is identified as one of the methodologies alleviating the elderly decline in some cognitive functions (memory, attention) [10]. The robot would be dedicated to MCI patients (Mild Cognitive Impairment, i.e. the presence of cognitive impairment that is not severe enough to meet the criteria of dementia). Cognitive impairment
is one of the major health problems facing elderly people in the new millennium. This does not only refer to dementia, but also to lesser degrees of cognitive deficit that are associated with a decreased quality of life and, in many cases, progress to dementia.

In the work described in this paper, an engagement metric is developed for the estimation of interaction efforts during cognitive stimulation exercises. Engagement is considered as the process by which partners establish, maintain and end interactions [13]. Engagement detection is identified as a key element for the design of socially assistive robots. We propose to study engagement in a triadic framework: user - computer (providing cognitive exercises) - robot (providing encouragements and backchannels). We identified specific social signals such as system directed speech and self-talk as indicators of engagement during interaction. In this work, engagement is not considered as an all-or-none phenomenon; rather, a continuous characterization is proposed. To our knowledge, this is perhaps the first study that attempts to automatically estimate a metric, termed Interaction Effort, by exploiting dialogue acts. Specifically, our contributions are:

– An automatic recognition system for the detection of both system directed speech and self-talk. Self-talk provides insights about the cognitive load of the patient with MCI. We also propose relevant rhythmic features for the characterization.
– The definition and evaluation of a measure of engagement based on the previously detected acts. This measure is employed to understand the strategy of the patients during cognitive stimulation exercises.

The remainder of this paper is organized as follows: Section 2 describes the related works in human-human and human-machine interactions. Sections 4 and 5 give an overview of the cognitive stimulation situation, including the design of the robot and the Wizard-of-Oz experiment. Section 6 describes the analysis of the manually labelled data for the extraction of dialogue acts from the Wizard-of-Oz experiment: self-talk (ST) and system directed speech (SDS). Section 7 shows and discusses the engagement characterization framework. The experiments carried out for the evaluation of the proposed metric are described in section 8. Finally, section 9 concludes our work.

2 Related work

2.1 Social cues of engagement

The problem of detecting engagement has been studied using verbal and non-verbal cues, but many existing approaches attempt to estimate engagement from gaze [12,24,13–15] by considering eye-contact as a prominent social signal. Eye-contact is usually employed to regulate the communication between humans [21]: initial contact, turn-taking, triggering backchannels...

Mutual gaze has been shown to contribute to smooth turn-taking [16,15]. Goffman [17] mentioned that eye-contact during interaction tends to signal to each partner that they agree to engage in social interaction. Deficiency or failure in gaze during interaction may be interpreted as lack of interest and attention, as noticed by Argyle and Cook [18]. In face-to-face communication, initiation, regulation and/or disambiguation can be achieved by eye-gaze behaviors. Efficiency of an interaction is based on the ability to shift roles, which is again possible via eye-gaze behaviors [19,20]. During interaction, gaze might be combined with speech. Kendon [21] analyzed these situations and, for instance, identified the fact that speakers look away from their partners at the beginning of an utterance, and look at their partners at the end of an utterance. This procedure might be useful since it serves to avoid cognitive load (i.e. planning of the utterance) as well as shifting roles with the partner.

In HRI situations, robots are required to estimate the engagement of the addressee for efficient communication. Estimation of gaze is a difficult task in HRI due to greater distances between the robot and the addressee; consequently, other cues such as head orientation, body posture, and pointing might also be used to indicate at least the direction of attention. Most of the proposed techniques can be seen as based on the concept of face engagement proposed by Goffman [17] to describe the process in which people employ eye contact, gaze and facial gestures to interact with or engage each other. Basically, the engagement detection framework is based on (1) face detection and (2) facial/head gesture classification. In [22], in order to understand behaviors of the potential addressee in a human-robot interaction task, the authors proposed to combine multiple cues. A set of utterances is defined and used to start an interaction. In addition to the detection step, the authors estimate the visual focus of attention of users. They compute probabilities that the partner is looking at a pre-defined list of possible focus targets. Since the focus targets include the robot itself and other potential users, engagement estimation is reinforced and benefits from the eye-gaze functions without an explicit modeling. A similar work done in [12] in a multi-robot interaction framework, based on face detection and gesture classification, makes it possible to select and command individual robots.
Robots could also use eye-gaze behaviors for the improvement of the interaction. Mutlu et al. [23] conducted experiments where their robot, Robovie, employed various strategies of eye-gaze behaviors to signal specific roles during interaction: addressee, bystander, and overhearer. The authors show that gaze direction serves as a moderator, since gaze cues support the conversation by reinforcing the roles and participation of human subjects.

All the previously mentioned works have shown the benefit of using eye-gaze behaviors for measuring engagement during interaction. However, this social signal should be more precisely characterized. Rich et al. [24] investigated some relevant cues for recognizing engagement, firstly in human-human interaction, and then proposed an automatic system for HRI situations. Regarding eye-gaze behaviors, they identified directed gaze and mutual facial gaze. Directed gaze characterizes events when one person looks at some object following which the other person looks. Mutual gaze refers to events when one person looks at the other person's face. As a result, various features are employed to describe communicative functions of eye-gaze behaviors. Ishii et al. [15] also extract various features from eye-gaze behaviors such as gaze transitions, occurrence of mutual gaze, gaze duration, distance of eye movement, and pupil size. These features are employed to predict the user's conversational engagement. The statistical modeling and combination of the features have shown the relevance of all of them in the recognition of the user's attitudes (engagement vs disengagement).

Other cues can be employed to detect engagement in social interactions. Castellano et al. [14] proposed to combine eye-contact with smiling, which is considered in their specific game scenario as an indicator of engagement. The authors enriched the characterization by adding contextual features such as the game state and the behavior of the robot used (iCat facial expressions). A Bayesian network is employed for the modeling of cause-effect relationships between the social signals and contextual features. The evaluation makes it possible to identify a set of actions done by users correlated with engagement. Spatial movements have been used for initiating conversations [25] and more generally for engagement characterization (see [26] for an interesting discussion on relevant social cues).

All these results show that in various tasks eye-gaze behaviors could be combined with other cues in order to improve the detection rates. Other strategies can be followed by avoiding the estimation of eye-gaze behaviors. For instance in [27], the authors employ physiological signals such as skin response and skin temperature for the estimation of engagement. The work was motivated by the fact that, in a rehabilitative context, the extraction of implicit information on the patient is of seminal importance. Physiological signals are more correlated to the internal state of a user and consequently can be employed to infer emotional information [28]. Engagement detection from these signals can be used by a robot to alter the interaction scenario, such as coaching, assistance...

Peters et al. [29] have highlighted the importance and the various elements related to engagement. They proposed a simplified model based on an action-cognition-perception loop, which makes it possible to differentiate between the several aspects of engagement: perception (e.g. detection of cues), cognition (e.g. internal state: motivation), action (e.g. display of interest). They also identified a dimension termed experience, which aims at covering subjective experiences felt by individuals. This work shows that engagement is not a simple concept and, as for other social signals, investigations on its characterization, detection and understanding are still required for the design of adaptive interfaces, including robots.

In this paper, we deal with the characterization of engagement detection in an assistive context and more precisely in a cognitive stimulation situation with elderly people. Research works on related topics are usually devoted to acceptability [30–32]; however, maintaining engagement is identified as a key component of socially assistive robots [33,34]. The use of interactive technology may be more challenging for elderly users. Xiao et al. [35] have investigated how seniors deal with multimodal interfaces and found that elderly people require more time and make errors, as can be expected. But, more interestingly, they identified that older users employ an audible speech register termed self-talk, which is a kind of think-aloud process produced during difficult tasks. Discriminating self-talk (ST) from system/robot directed speech (SDS) is of great importance for two main reasons: (1) intended interactions are produced during SDS, (2) production of ST can be used as an indicator of engagement. The next section aims at describing more precisely this specific speech register.

2.2 Self-talk in machine interaction

Following the definition of Oppermann [11], self-talk, or private speech, refers to audible or visible talk people use to communicate with themselves. This register can be considered as a part of off-talk, which is a special dialogue act characterizing "every utterance that is not directed to the system as a question, a feedback utterance or an instruction". Off-talk is identified as a problem
for automatic speech recognition systems, and distinguishing it from on-talk (or system directed speech) will clearly improve recognition rates. However, the characteristics of off-talk make the task difficult. For instance, in case the user is reading instructions, lexical information is not discriminant and other features should be employed. One relevant strategy is to try to combine audio and visual features, as proposed in [36,37]. Batliner et al. formulated the problem by defining on-talk vs off-talk and on-view vs off-view strategies. Their combination leads to on-focus (on-talk + on-view) and off-focus, where on-view is not discriminant: one can listen to someone while looking away. The authors employed an audio-visual framework for the classification of on-talk from various off-talk elements (read, paraphrasing and spontaneous off-talk). Prosodic, part-of-speech (POS) features and visual features (a simple face detection system) are employed. The detection of the user's focus on interaction yields 76.6% by using prosodic features, and the combination with linguistic and visual features allows achieving 80.8% and 84.5%, respectively.
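This kind of dialogue-act discrimination from prosody (our own system, cf. the abstract, uses k-NN, decision tree and SVM classifiers over pitch, energy and rhythmic features) can be sketched in miniature. The feature values, the TRAIN set and the knn_predict helper below are invented for illustration only; this is a hedged sketch of a k-NN classifier over prosodic features, not the actual recognition system:

```python
import math

# Toy prosodic feature vectors: (mean pitch in Hz, mean energy, speech rate).
# All values and labels below are invented; they are not from the paper's corpus.
TRAIN = [
    ((210.0, 0.80, 4.5), "SDS"),  # system directed speech: addressed, louder
    ((205.0, 0.75, 4.2), "SDS"),
    ((150.0, 0.30, 2.1), "ST"),   # self-talk: softer, slower
    ((145.0, 0.35, 2.4), "ST"),
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(features, k=3):
    """Label an utterance ST or SDS by majority vote over the k nearest
    training utterances in the prosodic feature space."""
    nearest = sorted(TRAIN, key=lambda item: euclidean(item[0], features))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

print(knn_predict((200.0, 0.70, 4.0)))  # SDS-like prosody -> "SDS"
```

In practice the feature vectors would be extracted per utterance from the audio signal, and the toy Euclidean distance would be computed on normalized features.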
From a conceptual point of view, on-talk (robot-directed speech) is considered as social speech because it is produced with the objective of communication, while self-talk is known to be a means for thinking, planning and self-regulation of behavior [38]. Lunsford et al. [37] investigated audio-visual cues and reviewed some functions of self-talk. Among the most interesting ones, the authors reported that self-talk supports task performance and self-regulation. Potential benefits of improving the estimation of engagement are (1) the design of adaptive social interfaces (including robots), (2) the improvement of the impact of assistive devices and (3) the understanding of the strategies and behaviors of individuals.

3 Objectives

The purpose of this paper is engagement characterization in an interaction between an MCI patient and a robot, with the help of the classification of verbal utterances into Self-Talk or System-Directed Speech.

As seen in [58], during the completion of a spatial task by seniors, a high amount of self-talk is observed: 80% of the subjects of their study engaged in ST at some point during their session. This amount increases with the difficulty level of the task, which is in strong correlation with the cognitive load of the person: the ST amount increased from low to high difficulty tasks (26.9% versus 43.7%, respectively). A similar situation is proposed through the cognitive stimulation experiment (section 4), and the key idea is to identify phases where the patients are less engaged in this activity, which is reflected by the production of self-talk. By being aware of these phases, the coach robot will be able to produce useful feedback, encouragement or help.

In order to identify the interaction phases between a therapist and a patient, we conducted experiments to acquire interaction data. In section 4 we describe the actions of a therapist during a cognitive stimulation exercise, and how the robot should be adapted. The robot's interactions are described in section 5, and the technical setup of the Wizard of Oz experiments is detailed in section 5.2.

After the completion of all the experiments, the corpus analysis begins, as described in section 6. The very first analysis is the manual annotation of all the records taken during the experiments. After a gaze and keyword annotation, the next step is the pooling of the annotated keywords into clusters using a Latent Semantic Analysis method. This method groups semantically similar keywords together in meaningful clusters, structuring the dialogue acts in a non-supervised way and giving them a semantic signification, as explained in figure 1.

Within these clusters, another manual annotation takes place, splitting the keywords into two categories, depending on whether they were spoken to the robot and/or the computer (System-Directed Speech) or spoken by the patient to himself (Self-Talk). With all these labelled keywords (cluster, ST or SDS), we are able to perform an engagement characterization, as described in section 7.

Fig. 1 Steps of engagement characterization

4 Patient-Therapist Interaction

Before conducting experiments between a robot and patients, we collected data about how patients interacted with a therapist. The patient had to solve exercises on a tactile screen, while the therapist, seated near the patient, observed the situation and provided help whenever the patient needed it. The therapist could help with the technical setup, indicating how to deal with the tactile screen, or provide help for a particular exercise: how to correct an answer, or simply whether the patient answered correctly. The interaction between the patient, the therapist and the cognitive stimulation exercise is a triadic situation, as shown in Figure 2. Backchannels of the therapist are important for these
patients to gain confidence and solve these exercises correctly, even if the therapist does not have to say anything - the mere presence of the therapist matters. Psychologists at the Broca Hospital therefore organized recorded cognitive stimulation sessions, which led on our side to the analysis of the interaction between the patient and the therapist. Thanks to these sessions, we were able to determine the interaction phases of the therapist, and later transfer them to the robot. The therapist is always at the side of the patient, measuring and evaluating the attention and engagement of the patient. The presence of the therapist is important for the patient to gain confidence. As shown in Figure 2, during the cognitive stimulation exercises, the patient sits in front of the tactile screen with the therapist at his right. Because the acceptance of the robot is our second concern, besides ensuring that the robot reacts appropriately, we also investigated with our colleagues which kind of robot might be acceptable to the targeted end-users with Mild Cognitive Impairment.

Fig. 2 Triadic situation either with robot or therapist

5 Patient-Robot Interaction

5.1 Designing a Robot for MCI patients

Focus group sessions were conducted at the Broca Hospital to identify how the elderly perceive a robot's appearance. 15 adults over the age of 65, divided into three groups, took part in the sessions. 13 of them were recruited from the Memory Clinic at the Broca Hospital; two were recruited from an association for the elderly. Seven of the participants suffered from Mild Cognitive Impairment, according to the definition criteria of [39]. From the results of the focus group, the robots defined as attractive to them were small robots, often with a modern design, shaped like animals or objects they could use in their daily life [40], like Mamoru or Paro. Various robots fitting these characteristics could be employed, but we selected the rabbit-shaped Violet Nabaztag, type Nabaztag:tag. This electronic device has Wi-Fi enabled and can connect to the internet to process specific services via a distant server located at http://www.nabaztag.com. The Nabaztag has motorized ears, 4 color LEDs, a speaker and a microphone. As will be described in section 5.2, the ears and LEDs are employed to enhance the expressiveness of the Nabaztag. Regarding the acceptability of this robot, experimental results can be found in other projects such as SERA (Social Engagement with Robots and Agents), where social engagement is investigated [30,31].

5.2 Description of the Wizard of Oz Experiments

5.2.1 Technical and Experimental Design

Human-robot communication differs from human-human communication. Therefore, to gather reliable information about human-robot communication, it is important to observe human behaviour in a situation in which humans believe they are interacting with a real robotic system. It is important that the user thinks he or she is communicating with the system, not a human, as noted for example by [57]. The purpose of our Wizard of Oz experiments is to record interactions between the patient and the robot, using the interaction schemes observed between the patient and the therapist. After analyzing the videos of the sessions between the therapist and the patient, patterns of interaction were detected (therapist encouraging, therapist answering a question, backchannels) and adapted to the Nabaztag. The purpose was to give the Nabaztag an interaction panel, leaving the Wizard the responsibility of choosing the right couple [answer+backchannel] for each situation.

The patient sits in front of the tablet-PC, with the Nabaztag at his left, as shown in Figure 2. During the sequence of cognitive exercises solved on the tablet PC, the robot interacts with the patient. The Wizard gathers information about the situation with the help of two cameras and a screen capture of the tactile screen. The Wizard can hear the patient, but the reverse is impossible. The Nabaztag is remotely controlled by the Wizard, who activates the couple [answer+backchannel] simultaneously. The mean duration of a WOZ experiment is 7min30s, for a total of 96min.
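The Wizard's interaction panel described above can be illustrated with a minimal sketch. The situation names, utterance texts and the Couple/activate helpers below are invented for illustration, since the paper does not detail the panel's implementation; only the notion of a triggered couple [answer+backchannel] comes from the text:

```python
from dataclasses import dataclass

@dataclass
class Couple:
    answer: str       # verbal utterance played by the robot
    backchannel: str  # nonverbal behavior (ear movement + LED pattern)

# Hypothetical mapping from observed situations to couples; the real panel
# was derived from the therapist-patient interaction patterns.
PANEL = {
    "encourage": Couple("You are doing well, keep going!", "fast green/yellow blink"),
    "help_drag": Couple("Press on the image, then drag it to the box.", "ears up"),
    "wrong_box": Couple("Try another box.", "slow blue/violet blink"),
}

def activate(situation: str) -> Couple:
    """Return the couple the Wizard triggers for the observed situation."""
    return PANEL[situation]

chosen = activate("encourage")
print(chosen.answer, "|", chosen.backchannel)
```

In the experiments, the Wizard played the role of this lookup manually, watching the two camera feeds and the screen capture before triggering a couple.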
5.2.2 Verbal and Nonverbal Behaviors for the Nabaztag

Regarding the Nabaztag, nonverbal behaviors had to be defined. Similar to other robots, such as Paro [41], Emotirob [42] or Aibo [8], the Nabaztag can exploit movements and sounds as social communicative signals. According to the work of Lee and Nam [43] on the relation between physical movement and the augmentation of emotional interaction, the expressions of the Nabaztag are correlated with both the speed of movements and LED blinking. As shown in Figure 3, slow movements or slowly blinking LEDs express unpleasant expressions such as sadness or annoyance, displayed when the user is getting lost or does not know how to solve the exercise. Positive expressions are related to active movements and blinking, which are employed to encourage the user, for instance.

The nonverbal behaviors for the Nabaztag were implemented using the NabazFrame, developed by the University of Bretagne Sud¹. Nonverbal choreographies contain ear movements and different sets of blinking LEDs, the color depending on the mood or feedback we want to transmit. Green and yellow fast-blinking LEDs express pleasant expressions, while slow movements with blue and violet LEDs express unpleasant signals (Fig. 3). These behaviors are currently being tested by end-users in another work.

Fig. 3 Relationship between movements and expressions

¹ www-valoria.univ-ubs.fr

5.3 Participants Analysis

A total of eight participants (seven females and one male, aged from 64 to 82), chosen by the therapists at the Broca Hospital, took part in the experiments. Two of them had a slight MCI; the remaining six did not have any communication, hearing, or vision impairments. Examples of interaction between the participants and the Nabaztag are shown in the following two paragraphs, the first dedicated to SDS, the second to ST.

Example Dialogue Between a User and the Nabaztag: System-Directed Speech

Nab.: Good morning, my name is Carole. What's your name?
User: My name is Bob.
Nab.: Hello, Bob. I'm here to help you solve your exercises. Do you want to start them now?
User: Yes!
Nab.: Let's go! First, you have to drag the images into the box corresponding to their name.
User: How do I do that?
Nab.: Press on the image, then drag it to the box.

Example Dialogue Between a User and the Nabaztag: Self-Talk

User: And I drag the little image with the tree to the autumn box... oh, it's not coming! She's gone, uh, found, I drag it to the box, oh, it escaped...
Nab.: When dragging the image to the correct box, you must always touch the screen.
User: Uh, yes, I drag it, I drag the image, the tree is in the box. The image with the flowers can't be the summer, these flowers grow in the spring, oh, I don't know, I drag the image to the spring, she's not coming, ah, yes, let's go to the next image.

6 Analysis of the annotation of WOZ experiments

6.1 Manual Annotation of the Content of the WOZ Experiments

For each of the 8 participants of the recorded experiments, the first step was the manual annotation of the videos taken during the Wizard of Oz experimentations. Before the annotation began, the videos from the two cameras were synchronized and edited together, so that the annotator could view the patient from the two different angles: computer and robot. First, the gaze was annotated, without paying attention to the
speech. The annotator carefully annotated when the patient was looking at the robot (the setup made it clearly visible when the patient was looking at the robot: from the computer's point of view, the person moves her head, and from the robot's point of view the person's gaze is focused on the camera), and then when the patient was looking directly at the computer (the same, but reversed). The second step was to perform the speech annotation. The annotator first listened, then annotated the relevant keywords found in the patient's utterances. A keyword can be a single word, but also an expression of words with close signification: "I'm doing well" will be annotated as one keyword, for example. Filled pauses were not annotated. The spoken keywords were first divided into eight simple categories, defined by the annotator after viewing all eight films. A general structure was defined, as below:

– Agreement: "Let's start", "Yes", "You're right"
– Technical Question (often about the use of the tactile screen)
– Contextual Question (about the cognitive exercise itself, such as "What should I do with this picture?", "Should I put it here, or there?")
– Non-Obligatory Turn-Taking: comment without the need of an answer by the robot (users with MCI often comment on what they are doing: "And I move the picture over there...")
– Obligatory Turn-Taking: comment with the need of an answer by the robot (such as "I hope I am doing right"); these indicate the difficulty felt by the user
– Support Needed: the user is confused and needs more support ("I don't remember", "I don't know what I should do")
– Thanks (some users thanked the Nabaztag when they received help, and one complimented it on its color and shape)
– Disagreement ("No, I don't want")

The gaze annotation shows that 89.76% of the time users are looking at the computer. This is explained by the fact that they were asked to complete the exercises. Another explanation can be found in the specific design of the triadic situation: eye-contact with the robot is not required for effective engagement, as the robot is only there to help and encourage, and to solve an exercise the patient has to gaze at the computer. As previously discussed, eye-gaze behaviors might not be discriminant for engagement detection in some tasks, and more specifically for seniors. Similar results have been obtained in [37], where 99.5% of the self-talk utterances were associated with a gaze behavior directed to the system, and 98.1% for system directed speech. Eye-contact is not always discriminant for engagement/disengagement detection, and evaluation through dialogue acts such as self-talk and on-talk is relevant. Eye-contact is not discriminant in our case, but we found that the verbal production of the patients could help to characterize their difficulties. Because the robot needs to know when the patient encounters difficulties, in order to produce the appropriate feedback and help the patient, we decided to focus our work on the verbal utterances.

6.2 Latent semantic analysis of the content of annotation

Understanding the content of the annotation of interactive databases is made difficult by the strategies of individuals in a cognitive stimulation task. Indeed, several keywords or utterances correspond to one of the 8 categories. To provide insights on the annotation databases, and consequently on the behaviors exhibited, we performed a latent semantic analysis (LSA) [44]. LSA is based on a Singular Value Decomposition (SVD) of the term-document matrix, and results in a reduced dimension of the feature space. The interpretation of the reduced space makes it possible to identify semantic concepts. Although LSA is usually performed on texts, we decided to use it here because the verbal production of the patients is not spontaneous. The verbal utterances are structured (by the exercise, or their technical difficulties). Even in the Self-Talk parts of the interaction, the verbal production is structured and limited. The term-by-document matrix becomes, in our case, a keyword-by-cluster matrix. To train the LSA we estimated the occurrence of each one of the 247 different keywords or utterances produced by the 8 participants with regard to the 8 categories. Using a Kaiser criterion (80% of the information), the reduced rank was set to 5. We carefully analyzed the content (keywords) of the obtained clusters and they were interpreted as:

– Positive Feeling: positive feeling expressed toward the robot, or toward the exercises.
– Comments: useful keywords, but expressing nothing more than a simple statement about the exercise.
– Social Etiquette: the keywords put in this cluster were all expressing agreement with the robot ("Okay", "I'm in", "Thanks"). This cluster is based on J. Light's original work, analyzing communication and characterizing communication messages into different categories [59].
– Request Information: all the keywords expressing a question on the exercises were put in this cluster. The questions are context based; they depend on which part of the exercises the patient experiences difficulties with.
– Others: the last cluster, in which the words are relevant but too few in number to form a specific cluster. This last cluster covers various comments related to the cognitive exercise and remains relevant for engagement because of the amount of self-talk it contains (see table 1).

6.3 Speech corpus

Our speech corpus consists of 543 utterances or keywords, each of which corresponds to one of the semantic clusters. Each utterance or keyword has been carefully annotated as (1) self-talk or (2) system/robot directed speech. The annotator simply answered these two questions: "According to you, does the person speak to herself?" (ST) or "speak to the robot or the computer?" (SDS).

We evaluated the subjectivity of the annotation process by evaluating inter-judge agreement. A second, naive annotator was chosen who had not had any contact with the participants and did not previously know the differences between ST and SDS. This annotator watched the videos and had to choose, based on the same questions, whether the verbal utterances were ST or SDS. We then computed an inter-annotator score between the very first labelling and that of the naive annotator. With the kappa method, the result was a score of 0.68, showing substantial agreement.

The distribution of the utterances over the clusters is given in table 1. Most of the self-talk verbalizations are present in the most heterogeneous semantic cluster (Others), which shows that the amount of self-talk plays a role in engagement. Positive Feeling is mostly composed of self-talk elements. This can be explained by the fact that we are dealing with the expression of feelings, or positive comments, which appear when the robot gives satisfaction to the user. As the patient is solving a problem thanks to the robot, he is grateful but concentrated on the exercise. He will speak to himself about his feeling of joy or gratitude because he is concentrated on the exercise; after the exercises are complete, the cognitive load of the patient is lower, and the patients then often express their gratitude, but this time directly to the robot. As one can expect, system directed speech is more present in the semantic clusters characterized by a direct relationship to the cognitive exercise: Comments, Social Etiquette and Request Information. In fact, these clusters express direct speech to the robot because the person is asking, or requesting, something very precise: as in a discussion between two persons, the patient instinctively directs his speech to the system. A high amount of ST does not mean a low engagement in the interaction: the two clusters with the highest amounts of ST are respectively Positive Feeling, with 91 utterances, and the Others cluster, with 130 utterances. These two clusters are in direct relationship with the exercises: the users express their feelings toward the exercise in the first cluster, and various utterances about the exercises, expressed by the patients, are labelled in the last one. In fact, all the patients were totally engaged in the interaction and did not perform any other polluting task.

Table 1 Distribution of self-talk and system directed speech over the clusters

Semantic cluster      Self-Talk   System Directed Speech
Positive Feeling      91          9
Comments              21          24
Social Etiquette      43          69
Request Information   27          33
Others                130         96

In table 2, patients 3 and 7 have the highest number of self-talk verbalizations; they are in fact the two MCI patients, and this result is well explained by their pathology. Patient 3 was the most talkative patient, and expressed a wide range of positive feelings. A measure of the importance of ST per patient could trace the evolution of a given patient.

Table 2 Number of self-talk and system directed speech utterances per user

Users   Self-Talk   System Directed Speech
1       20          19
2       1           7
3       106         85
4       14          2
5       30          6
6       37          20
7       58          37
8       49          55

After an acoustic analysis of all the keywords and utterances annotated as self-talk and system directed speech, we selected 293 utterances for ST and 223 for SDS. The removed utterances were discarded mainly because of their short duration (less than 1 s). The durations of the retained utterances are between 1 and 2.5 s.

7 Using self-talk detection for engagement

In this section we describe the system developed to measure engagement from self-talk. Figure 4 depicts the proposed system. It requires discriminating self-talk from system directed speech. Then, we combine the durations of both dialogue acts in a measure of
engagement, which aims at characterizing the degree of interaction effort.

As previously mentioned, self-talk is produced with a lower intensity than open-talk or robot-directed speech. In this paper, we propose to investigate prosodic features including pitch, energy and rhythm. This is motivated by the fact that most classification frameworks for specific speech registers, such as emotions [45], infant-directed speech [46] and robot-directed speech [47], are mainly based on the characterization of supra-segmental features. In addition, Batliner et al. [48] have shown the relevance of prosodic features, and more precisely duration features, to discriminate on-talk from read off-talk. However, research on speech characterisation and feature extraction shows that it is difficult to reach a consensus on relevant features for the characterization of emotions, intentions or personality.

Fig. 4 Description of the proposed system for engagement measure

7.1 Feature extraction

Several studies have shown the relevance of both fundamental frequency (F0) and energy based features for emotion recognition applications [49]. F0 and energy were estimated every 20 ms using Praat [50], and we computed several statistics for each voiced segment (segment based method) [51]: maximum, minimum, mean, standard deviation, interquartile range, mean absolute deviation, and the quantiles corresponding to the cumulative probability values 0.25 and 0.75, resulting in a 16-dimensional vector.

Rhythmic features were obtained by applying to the audio signal a set of perceptual filters dedicated to the characterization of prominent events in speech, termed p-centers [52]. Then, we estimated the spectrum of the prominent signal to characterize the speaking rate. We estimated three features: the mean frequency, entropy and barycenter of the spectrum. Differences in rhythm are indicators of the efficiency and clarity of the interaction (fluidity).

7.2 Classification

After feature extraction, supervised classification is used to categorise the data into classes corresponding to (1) self-talk and (2) system/robot directed speech. This can be implemented using standard machine learning methods. In this study, three different classifiers were investigated: decision trees, k-nearest neighbour (k-NN, with k experimentally set to 3) [53] and Support Vector Machines (SVM) with a Radial Basis Function kernel [54].

A k-fold cross-validation scheme was used for the experimental setup (k is set to 10) and the performance is expressed in terms of the overall accuracy: the average of the accuracies obtained over the k partitions.

7.3 Evaluating engagement

Engagement is an interactive process [13], in which two participants decide when to establish, maintain and end their connection. Thus, it is important to evaluate the engagement and to estimate the interaction effort caused by maintaining the interaction. Once the system directed speech and the self-talk sequences are detected, we propose to combine them to estimate a dimension termed interaction effort. The interaction effort is based on the dialogue: in our experiments, the interaction goes hand in hand with the dialogue. Because this is a verbal interaction, it is important to analyze the differences in the addressee (self or system) to evaluate the interaction effort. In [55], interaction effort (IE) is defined as a unitless measure of how much effort a user puts into interacting with a robot. The authors show that IE cannot be measured easily because it requires advanced tools such as eye-tracking and/or brain activity sensors. In our application, we argue that IE is related to the quantity of robot/system directed speech, which is characterized by (1) intended communication and (2) an on-view state (gazing at the system). While self-talk might be an indicator of planning, cognitive load is related to the difficulty of the task. We propose to estimate the interaction effort of a given human-robot interaction during a cognitive stimulation situation by the following expression:

IE = SDS / (SDS + ST)    (1)

SDS and ST refer to the durations of system directed speech and self-talk (in seconds). The numerator is the amount of time of intended interaction, and the denominator is the total speaking time. IE is a unitless measure (0 ≤ IE ≤ 1). If SDS is small relative to ST then IE is quite small. Typically, efficient interactions, which do not require self-regulatory
behaviors from elderly people, will obtain an IE close to 1. In some interactions, self-regulatory speech can be a positive element, improving the interaction, but in our case it reflects the cognitive load of the patient. The robot has to be aware of the patient's difficulties, linked to his cognitive load; we therefore describe the interaction as efficient when the cognitive load (the amount of ST) is low.

The IE measure proposed in this paper makes it possible to evaluate the efficiency of the interaction. In future work, this measure will allow the verbal and non-verbal behaviors of the robot to be changed and, more interestingly, both the cognitive exercises and the encouragements provided to be adapted.

8 Experimental results

This section describes the experiments performed for the characterization of engagement based on the detection of self-talk. Firstly, the performance of our self-talk detection system is presented; then we derive a measure of engagement.

8.1 Detection of self-talk

Table 3 shows the recognition rates of all the classifiers trained with the different feature sets. In [37], energy is found to discriminate system directed speech from self-talk. Compared to pitch based features, classifiers trained with energy are more efficient, and the best results are obtained by SVM. One possible explanation is that the extraction of pitch might be more complex for self-talk, which is produced by the users for themselves and consequently with a lower energy and intelligibility.

In addition to what has been described in the literature concerning energy, we argued that rhythm should be relevant for our application because of the change of speaking rate observed during self-talk. The experimental results show that using only rhythm based features achieves an acceptable performance (56.97%), between that obtained by energy (59.51%) and that of pitch features (52.16%). Energy and rhythmic features are more robust. Rhythmic features are related to the vocalic energy of the signal [56] and have characteristics similar to those of short-term energy, except that perceptual filters are employed before the computation of energy (for acoustic prominence enhancement). The SVM classifier trained with the three sets of features outperforms (71.62%) all the other configurations. One should note that adding features does not exhibit the same performance for all the classifiers: adding features decreases the performance of the decision tree and k-NN classifiers.

Table 3 Accuracy of classifiers using 10-fold cross-validation

Features                  Decision Tree   k-NN     SVM
Pitch-based               49.8%           53.35%   52.16%
Energy-based              55.54%          54.29%   59.51%
Rhythm-based              52.78%          56.58%   56.97%
Pitch + Energy            57.42%          59.28%   64.31%
Pitch + Energy + Rhythm   55.46%          58.20%   71.62%

8.2 Engagement characterization

This section describes the estimation of the Interaction Effort measure (equation 1). To evaluate the IE measure alone, we exploited the manually labelled data. The results are presented in table 4. The best IE measure is obtained for user 2 (0.83), but one should be careful with this result since he produced only 1 self-talk and 7 system directed speech utterances (table 2). For the most talkative user (patient 3), the IE measure provides insights about his behavior: a relative balance between ST and SDS.

The interaction effort estimates of patients 4 and 5 are under 0.20. This is easily explained: these patients did not talk directly to the robot. They addressed the system directly at the beginning of the exercise, showing that they understood the instructions; the other verbal utterances were only comments addressed to themselves. Patient 5 had a few difficulties, for which she had to address the robot directly to obtain a proper answer, but the amount of her self-talk utterances was too large to be balanced by these SDS utterances.

For the automatic estimation, we followed the framework described in figure 4. A Vocal Activity Detector (VAD), suitable for real-time detection in robotics [8], is employed for the segmentation of speech. The self-talk / system directed speech discrimination system is based on the SVM classifier trained with pitch, energy and rhythmic features, as previously designed. Once the speech utterances are classified, we extract their durations and, at the end of the experiment, we estimate the IE measure. Table 4 shows that the IE measures computed by the automatic approach capture the individual behaviors of each user. In addition, high and low IE measures are efficiently characterized. However, for very low IE measures such as that of user 4, the automatic approach underestimates the measure. This may be due to factors such as the small amount of verbalizations of this given user. Due to the imperfect classifications, some of the IE measures are over- or underestimated, but they always allow a trend to be characterized.
Table 4 Interaction effort (IE) measure estimation

Users   From Annotation   From Automatic Self-talk Detection
1       0.50              0.62
2       0.83              0.78
3       0.45              0.53
4       0.13              0.08
5       0.20              0.26
6       0.43              0.46
7       0.42              0.38
8       0.57              0.63

9 Conclusions

In this paper, we have demonstrated promising results on automatically estimating the interaction effort of patients during a coaching experiment with cognitive stimulation exercises. After an analysis of the WOZ experiments (triadic situation), we identified relevant social signals characterizing the engagement of elderly patients: system directed speech and self-talk. The latter is employed by the users during difficult tasks for planning, thinking and self-regulation. We proposed a system to automatically detect these two social signals based on the extraction of relevant features (pitch, energy and rhythm) with three different classifiers. The experimental results have shown the discriminative function of energy, as described in the literature. In addition, the performance achieved by the proposed rhythmic features demonstrates that users employ a different speech register for intended communicative acts.

Future work in this area should investigate multimodal cues for the extension to off-talk situations. Off-talk is defined as the act of not speaking to an addressee, and it includes self-talk but also talking to a third addressee. In this case, the automatic detection of on-view states could be of great importance. Among the automatic cues that should be developed, eye-gaze social signal detection remains one of the challenges: eye-tracking systems are not really suitable for complex assistive applications and will change the behaviors during the interaction. Interaction effort can also include touching and/or manipulation, and a more general definition of multimodal and integrative engagement characterization should be proposed.

Furthermore, we intend to use the IE measure for the characterization of users, giving the opportunity to adapt the robot's behaviors and, in our specific task, the encouragements and potentially the difficulty of the cognitive exercises. As future work, we will exploit questionnaires in order to understand and estimate the engagement awareness of the users during interaction (experience of engagement).

Acknowledgements This work has been supported by the French National Research Agency (ANR) through the TecSan program (project Robadom ANR-09-TECS-012). The authors would like to thank the Broca hospital team for their work: Ya-Huei Wu, Christine Fassert, Victoria Cristancho-Lacroix and Anne-Sophie Rigaud.

References

1. Feil-Seifer D.J., Mataric M.J. (2005) Defining Socially Assistive Robotics. In International Conference on Rehabilitation Robotics, pages 465-468, Chicago, IL.
2. Fasola J., Mataric M.J. (2010) Robot Exercise Instructor: A Socially Assistive Robot System to Monitor and Encourage Physical Exercise for the Elderly. In 19th IEEE International Symposium on Robot and Human Interactive Communication (Ro-Man 2010), Viareggio, Italy.
3. Mataric M.J., Tapus A., Winstein C.J., and Eriksson J. (2009) Socially Assistive Robotics for Stroke and Mild TBI Rehabilitation. In Advanced Technologies in Rehabilitation, Andrea Gaggioli, Emily A. Keshner, Patrice L. (Tamar) Weiss, Giuseppe Riva (eds.), IOS Press, 249-262.
4. Vinciarelli A., Pantic M., and Bourlard H. (2009) Social Signal Processing: Survey of an Emerging Domain. Image and Vision Computing Journal, Vol. 27, No. 12, pp. 1743-1759.
5. Saint-Georges C., Cassel R.S., Cohen D., Chetouani M., Laznik M.-C., Maestro S., and Muratori F. (2010) What studies of family home movies can teach us about autistic infants: A literature review. Research in Autism Spectrum Disorders, Vol. 4, No. 3, pages 355-366.
6. Cassell J., Bickmore J., Billinghurst M., Campbell L., Chang K., Vilhjálmsson H., and Yan H. (1999) Embodiment in conversational interfaces: Rea. In CHI'99, pages 520-527, Pittsburgh.
7. Wrede B., Kopp S., Rohlfing K., Lohse M., and Muhl C. (2010) Appropriate feedback in asymmetric interactions. Journal of Pragmatics, Vol. 42, No. 9, pp. 2369-2384.
8. Al Moubayed S., Baklouti M., Chetouani M., Dutoit T., Mahdhaoui A., Martin J.-C., Ondas S., Pelachaud C., Urbain J., Yilmaz M. (2009) Generating Robot/Agent Backchannels During a Storytelling Experiment. In ICRA'09, IEEE International Conference on Robotics and Automation, Kobe, Japan.
9. Chetouani M., Wu Y.H., Jost C., Le Pevedic B., Fassert C., Cristancho-Lacroix V., Lassiaille S., Granata, Tapus A., Duhaut D., Rigaud A.S. (2010) Cognitive Services for Elderly People: The ROBADOM project. ECCE 2010 Workshop: Robots that Care, European Conference on Cognitive Ergonomics.
10. Yanguas J., Buiza C., Etxeberria I., Urdaneta E., Galdona N., González M.F. (2008) Effectiveness of a non-pharmacological cognitive intervention on elderly: factorial analysis of the Donostia Longitudinal Study. Adv. Gerontol. 3, 30-41.
11. Oppermann D., Schiel F., Steininger S., Beringer N. (2001) Off-talk: a problem for human-machine interaction? In EUROSPEECH-2001, 2197-2200.
12. Couture-Beil A., Vaughan R., and Mori G. (2010) Selecting and Commanding Individual Robots in a Vision-Based Multi-Robot System. Seventh Canadian Conference on Computer and Robot Vision (CRV).
13. Sidner C.L., Kidd C.D., Lee C., and Lesh N. (2004) Where to look: a study of human-robot engagement. Proceedings of the 9th International Conference on Intelligent User Interfaces (IUI'04).
14. Castellano G., Pereira A., Leite I., Paiva A., McOwan P.W. (2009) Detecting user engagement with a robot companion using task and social interaction-based features. Proceedings of the 2009 International Conference on Multimodal Interfaces (ICMI-MLMI'09), pages 119-126.
15. Ishii R., Shinohara Y., Nakano T., and Nishida T. (2011) Combining Multiple Types of Eye-gaze Information to Predict User's Conversational Engagement. 2nd Workshop on Eye Gaze in Intelligent Human Machine Interaction.
16. Nakano Y.I., Ishii R. (2010) Estimating User's Engagement from Eye-gaze Behaviors in Human-Agent Conversations. In 2010 International Conference on Intelligent User Interfaces (IUI2010).
17. Goffman E. (1963) Behavior in Public Places: Notes on the Social Organization of Gatherings. New York: The Free Press.
18. Argyle M. and Cook M. (1976) Gaze and Mutual Gaze. Cambridge: Cambridge University Press.
19. Duncan S. (1972) Some signals and rules for taking speaking turns in conversations. Journal of Personality and Social Psychology, Vol. 23, No. 2, pp. 283-292.
20. Goodwin C. (1986) Gestures as a resource for the organization of mutual attention. Semiotica, Vol. 62, No. 1/2, pp. 29-49.
21. Kendon A. (1967) Some Functions of Gaze Direction in Social Interaction. Acta Psychologica, 26, pp. 22-63.
22. Klotz D., Wienke J., Peltason J., Wrede B., Wrede S., Khalidov V., Odobez J.M. (2011) Engagement-based multi-party dialog with a humanoid Robot. Proceedings of SIGDIAL 2011: the 12th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 341-343.
23. Mutlu B., Shiwa T., Kanda T., Ishiguro H., Hagita N. (2009) Footing in human-robot conversations: How robots might shape participant roles using gaze cues. In Proc. of ACM Conf. Human Robot Interaction.
24. Rich C., Ponsler B., Holroyd A., Sidner C.L. (2010) Recognizing engagement in human-robot interaction. In Proc. of ACM Conf. Human Robot Interaction.
25. Shi C., Shimada M., Kanda T., Ishiguro H., Hagita N. (2011) Spatial Formation Model for Initiating Conversation. Proceedings of Robotics: Science and Systems.
26. Michalowski M.P., Sabanovic S., Simmons R. (2006) A Spatial Model of Engagement for a Social Robot. IEEE International Workshop on Advanced Motion Control, pp. 762-767.
27. Mower E., Mataric M.J., and Narayanan S. (2011) A Framework for Automatic Human Emotion Classification Using Emotional Profiles. IEEE Transactions on Audio, Speech, and Language Processing.
28. Zong C. and Chetouani M. (2009) Hilbert-Huang transform based physiological signals analysis for emotion recognition. IEEE Symposium on Signal Processing and Information Technology (ISSPIT'09).
29. Peters C., Castellano G., de Freitas S. (2009) An exploration of user engagement in HCI. Proceedings of AFFINE'09.
30. Payr S., Wallis P., Cunningham S., Hawley M. (2009) Research on Social Engagement with a Rabbitic User Interface. In Tscheligi M., de Ruyter B., Soldatos J., Meschtscherjakov A., Buiza C., Streitz N., Mirlacher T. (eds.), Roots for the Future of Ambient Intelligence. Adjunct Proceedings, 3rd European Conference on Ambient Intelligence (AmI09), ICT&S Center, Salzburg.
31. Klamer T., Ben Allouch S. (2010) Acceptance and use of a social robot by elderly users in a domestic environment. ICST PERVASIVE Health 2010.
32. Heerink M., Krose B.J.A., Wielinga B.J., Evers V. (2006) The Influence of a Robot's Social Abilities on Acceptance by Elderly Users. Proceedings RO-MAN, Hertfordshire, September 2006, pp. 521-526.
33. Mataric M.J. (2005) The Role of Embodiment in Assistive Interactive Robotics for the Elderly. AAAI Fall Symposium on Caring Machines: AI for the Elderly, Arlington, VA.
34. Tapus A., Tapus C., and Mataric M.J. (2009) The Use of Socially Assistive Robots in the Design of Intelligent Cognitive Therapies for People with Dementia. Proceedings, International Conference on Rehabilitation Robotics (ICORR-09), Kyoto, Japan.
35. Xiao B., Lunsford R., Coulston R., Wesson M., Oviatt S. (2003) Modeling multimodal integration patterns and performance in seniors: Toward adaptive processing of individual differences. Proceedings of the 5th International Conference on Multimodal Interfaces.
36. Batliner A., Hacker C., Kaiser M., Mogele H., Noth E. (2007) Taking into account the user's focus of attention with the help of audio-visual information: towards less artificial human-machine communication. Auditory-Visual Speech Processing (AVSP 2007).
37. Lunsford R., Oviatt S., Coulston R. (2005) Audio-visual cues distinguishing self- from system-directed speech in younger and older adults. Proceedings of the 7th International Conference on Multimodal Interfaces (ICMI'05), pages 167-174.
38. Diaz R. & Berk L.E. (eds.) (1992) Private speech: From social interaction to self regulation. Erlbaum, New Jersey, NJ.
39. Petersen R.C., Doody R., Kurtz A., Mohs R.C., Morris J.C., Rabins P.V., Ritchie K., Rossor M., Thal L., Winblad B. (2001) Current concepts in mild cognitive impairment. Arch. Neurol. 58, 1985-1992.
40. Wu Y.H., Fassert C., Rigaud A.S. (2011) Designing robots for the elderly: appearance issue and beyond. Archives of Gerontology and Geriatrics.
41. Shibata T., Wada K., Saito T., and Tanie K. (2001) Mental Commit Robot and its Application to Therapy of Children. In IEEE/ASME International Conference on AIM'01.
42. Saint-Aime S., Le Pevedic B., Duhaut D. (2008) EmotiRob: an emotional interaction model. In IEEE RO-MAN 2008, 17th International Symposium on Robot and Human Interactive Communication.
43. Lee J., Nam T-J. (2006) Augmenting Emotional Interaction Through Physical Movement. UIST2006, the 19th Annual ACM Symposium on User Interface Software and Technology.
44. Steinberger J. (2004) Using Latent Semantic Analysis in Text Summarization. Evaluation 93-100.
45. Schuller B., Batliner A., Seppi D., Steidl S., Vogt T., Wagner J., Devillers L., Vidrascu L., Amir N., Kessous L., Aharonson V. (2007) The relevance of feature type for the automatic classification of emotional user states: low level descriptors and functionals. Proceedings of Interspeech, pages 2253-2256.
46. Mahdhoui A. and Chetouani M. (2011) Supervised and semi-supervised infant-directed speech classification for parent-infant interaction analysis. Speech Communication.
47. Breazeal C. and Aryananda L. (2002) Recognizing affective intent in robot directed speech. Autonomous Robots, 12:1, pp. 83-104.
48. Hacker C., Batliner A., and Noth E. (2006) Are you looking at me, are you talking with me: multimodal classification of the focus of attention. In Sojka P., Kopcek I., Pala K. (eds.): TSD 2006, LNAI 4188, pp. 581-588.
49. Truong K., van Leeuwen D. (2007) Automatic discrimination between laughter and speech. Speech Communication 49 (2007) 144-158.
50. Boersma P., Weenink D. (2005) Praat, doing phonetics by computer. Tech. rep., Institute of Phonetic Sciences, University of Amsterdam, The Netherlands, URL www.praat.org.
51. Shami M., Verhelst W. (2007) An Evaluation of the Robustness of Existing Supervised Machine Learning Approaches to the Classification of Emotions in Speech. Speech Communication, Vol. 49, issue 3, pages 201-212.
52. Tilsen S. and Johnson K. (2008) Low-frequency Fourier analysis of speech rhythm. Journal of the Acoustical Society of America, 124:2, pp. EL34-39.
53. Duda R., Hart P., Stork D. (2000) Pattern Classification, second edition.
54. Vapnik V. (1995) The Nature of Statistical Learning Theory. Springer-Verlag.
55. Olsen D.R., Goodrich M. (2003) Metrics for Evaluating Human-Robot Interaction. PERMIS 2003.
56. Delaherche E. and Chetouani M. (2010) Multimodal coordination: exploring relevant features and measures. Second International Workshop on Social Signal Processing, ACM Multimedia 2010.
57. Dahlbaeck N., Joensson A., and Ahrenberg L. (1993) Wizard of Oz Studies: Why and How. Proceedings of the 1993 International Workshop on Intelligent User Interfaces (IUI'93), ACM Press, 193-200.
58. Xiao B., Lunsford R., Coulston R., Wesson M., Oviatt S. (2003) Modeling multimodal integration patterns and performance in seniors: toward adaptive processing of individual differences. Proceedings of the 5th International Conference on Multimodal Interfaces, Vancouver, British Columbia, Canada.
59. Light J. (1997) Communication is the essence of human life: Reflections on communicative competence. AAC Augmentative and Alternative Communication, 61-70.