1. From Queries to Dialogues:
Predicting User Satisfaction with
Intelligent Assistants
Julia Kiseleva, Kyle Williams, Ahmed Hassan Awadallah,
Aidan C. Crook, Imed Zitouni, Tasos Anastasakos
Eindhoven University of Technology
Pennsylvania State University
Microsoft
8. From Queries to Dialogues
User’s dialogue with Cortana. Task is “Finding a hotel in Chicago”
Q1: how is the weather in Chicago
Q2: how is it this weekend
Q3: find me hotels
Q4: which one of these is the cheapest
Q5: which one of these has at least 4 stars
Q6: find me directions from the Chicago airport to number one
9. From Queries to Dialogues
User’s dialogue with Cortana. Task is “Finding a pharmacy”
Q1: find me a pharmacy nearby
Q2: which of these is highly rated
Q3: show more information about number 2
Q4: how long will it take me to get there
Q5: Thanks
10. From Queries to Dialogues
User: “show restaurants near me”
Cortana: “Here are ten restaurants near you”
User: “show the best ones”
Cortana: “Here are ten restaurants near you that have good reviews”
User: “show directions to the second one”
Cortana: “Getting you direction to the Mayuri Indian Cuisine”
11. Main Research Question
How can we automatically predict user
satisfaction with search dialogues on
intelligent assistants using
click, touch, and voice interactions?
12. Single Task Search Dialogue
User: “Do I need to have a jacket tomorrow?”
Cortana: “You could probably go without one. The forecast shows …”
13. Multi-Task Search Dialogues
User: “show restaurants near me”
Cortana: “Here are ten restaurants near you”
User: “show the best ones”
Cortana: “Here are ten restaurants near you that have good reviews”
User: “show directions to the second one”
Cortana: “Getting you direction to the Mayuri Indian Cuisine”
14. How to define user satisfaction
with search dialogues?
15. User: “show restaurants near me”
Cortana: “Here are ten restaurants near you”
User: “show the best ones”
Cortana: “Here are ten restaurants near you that have good reviews”
User: “show directions to the second one”
Cortana: “Getting you direction to the Mayuri Indian Cuisine”
No clicks: ???
16. User: “show restaurants near me”
Cortana: “Here are ten restaurants near you”
User: “show the best ones”
Cortana: “Here are ten restaurants near you that have good reviews”
User: “show directions to the second one”
Cortana: “Getting you direction to the Mayuri Indian Cuisine”
Per turn: SAT? SAT? SAT?
Overall: SAT?
17. User Frustration
Dialog with Intelligent Assistant. Task is “Planning a weekend”
Q1: what's the weather like in San Francisco
Q2: what's the weather like in Mountain View
Q3: can you find me a hotel close to Mountain View
Q4: can you show me the cheapest ones
Q5: show me the third one
Q6: show me the directions from SFO to this hotel
Q7: go back to first hotel (misrecognition)
Q8: show me hotels in Mountain View
Q9: show me cheap hotels in Mountain View
Q10: show me more about the third one
Annotations: “Restart search”; “A user is satisfied”
19. Tracking User Interaction:
Click Signals
• Number of queries in a dialogue
• Number of clicks in a dialogue
• Number of SAT clicks (> 30 sec. dwell time) in a dialogue
• Number of DSAT clicks (< 15 sec. dwell time) in a dialogue
• Time (seconds) until the first click in a dialogue
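The click signals above can be sketched as a small feature extractor. This is only an illustration, not the authors' implementation: the `Click` record, its field names, and the `click_features` helper are hypothetical; only the 30-second and 15-second dwell thresholds come from the slide.

```python
from dataclasses import dataclass
from typing import Optional

SAT_DWELL_SEC = 30.0   # dwell time above this counts as a SAT click (slide threshold)
DSAT_DWELL_SEC = 15.0  # dwell time below this counts as a DSAT click (slide threshold)


@dataclass
class Click:
    # Hypothetical log record; the field names are illustrative only.
    offset_sec: float  # seconds from the start of the dialogue to the click
    dwell_sec: float   # seconds spent on the clicked result


def click_features(num_queries: int, clicks: list[Click]) -> dict[str, Optional[float]]:
    """Compute the five click signals for one dialogue."""
    # None when the dialogue contains no click at all.
    first_click = min((c.offset_sec for c in clicks), default=None)
    return {
        "num_queries": num_queries,
        "num_clicks": len(clicks),
        "num_sat_clicks": sum(c.dwell_sec > SAT_DWELL_SEC for c in clicks),
        "num_dsat_clicks": sum(c.dwell_sec < DSAT_DWELL_SEC for c in clicks),
        "time_to_first_click_sec": first_click,
    }


# Made-up dialogue: three queries, three clicks with different dwell times.
example = click_features(3, [Click(12.0, 45.0), Click(40.0, 5.0), Click(70.0, 20.0)])
```

Clicks with a dwell time between 15 and 30 seconds count toward neither SAT nor DSAT in this sketch, which follows the thresholds as stated.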
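22. Tracking User Interaction
Attributed viewing time for an answer (time on screen × share of viewport height it occupies):
• 3 seconds at 33% of viewport = 1s
• 6 seconds at 66% of viewport = 4s
• 2 seconds at 20% of viewport = 0.4s
• Total: 1s + 4s + 0.4s = 5.4s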
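The arithmetic on this slide can be reproduced with a few lines of code. The sketch assumes that attributed time is the time an answer stays on screen weighted by the share of the viewport it occupies, which is how the slide's numbers combine; the function names are hypothetical.

```python
def attributed_time(exposures):
    """Sum seconds_visible * viewport_fraction over all exposures of one answer.

    exposures: iterable of (seconds_visible, viewport_fraction) pairs,
    with viewport_fraction between 0 and 1.
    """
    return sum(seconds * fraction for seconds, fraction in exposures)


def time_per_unit_height(attributed_sec, height_px):
    """Attributed time normalized by the answer's height in pixels."""
    if height_px <= 0:
        raise ValueError("height_px must be positive")
    return attributed_sec / height_px


# The slide's example: 3 s at one third, 6 s at two thirds, 2 s at 20% of the viewport.
slide_example = attributed_time([(3, 1 / 3), (6, 2 / 3), (2, 0.20)])
print(round(slide_example, 1))
```

The slide rounds 33% and 66% to whole percentages; exact thirds are used here so that the total matches the slide's 5.4 seconds.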
23. Tracking User Interaction:
Touch Signals
• Number of swipes
• Number of up-swipes
• Number of down-swipes
• Total distance swiped (pixels)
• Number of swipes normalized by time
• Total distance divided by number of swipes
• Total swiped distance divided by time
• Number of swipe direction changes
• SERP answer duration (seconds) which is shown on screen (even partially)
• Fraction of visible pixels belonging to SERP answer
• Attributed time (seconds) to viewing a particular element (answer) on SERP
• Attributed time (seconds) per unit height (pixels) associated with a particular element on SERP
• Attributed time (milliseconds) per unit area (square pixels) associated with a particular element on SERP
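The swipe-based signals in this list could be derived from a touch log roughly as follows. This is a minimal sketch under assumptions: the `Swipe` record, its fields, and the feature names are hypothetical, and zero is returned where a ratio is undefined.

```python
from dataclasses import dataclass


@dataclass
class Swipe:
    # Hypothetical touch-log record; the field names are illustrative only.
    direction: str      # "up" or "down"
    distance_px: float  # distance covered by the swipe, in pixels


def swipe_features(swipes: list[Swipe], duration_sec: float) -> dict[str, float]:
    """Compute the swipe signals for one SERP view lasting duration_sec seconds."""
    n = len(swipes)
    total = sum(s.distance_px for s in swipes)
    # A direction change is counted whenever two consecutive swipes differ.
    changes = sum(1 for a, b in zip(swipes, swipes[1:]) if a.direction != b.direction)
    return {
        "num_swipes": n,
        "num_up_swipes": sum(s.direction == "up" for s in swipes),
        "num_down_swipes": sum(s.direction == "down" for s in swipes),
        "total_distance_px": total,
        "swipes_per_sec": n / duration_sec if duration_sec > 0 else 0.0,
        "distance_per_swipe_px": total / n if n else 0.0,
        "distance_per_sec": total / duration_sec if duration_sec > 0 else 0.0,
        "num_direction_changes": changes,
    }


# Made-up view: two down-swipes followed by one up-swipe within 10 seconds.
example = swipe_features([Swipe("down", 300), Swipe("down", 200), Swipe("up", 100)], 10.0)
```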
25. User Study Participants
• 60 participants
• Age: 25.53 +/- 5.42 years
• Gender: 75% male, 25% female
• Language: 55% English, 45% other
• Education (in legend order: Computer Science, Electrical Engineering, Mathematics, Other): 82%, 8%, 2%, 8%
26. You are planning a
vacation. Pick a place.
Check if the weather is
good enough for the
period you are planning
the vacation. Find a hotel
that suits you. Find the
driving directions to this
place.
28. Questionnaire
• Were you able to complete the task?
o Yes/No
• How satisfied are you with your experience in this task?
o If the task has sub-tasks, participants indicate their graded satisfaction with each, e.g.:
o a. How satisfied are you with your experience in finding a hotel?
o b. How satisfied are you with your experience in finding directions?
• How well did Cortana recognize what you said?
o 5-point Likert scale
• Did you put in a lot of effort to complete the task?
o 5-point Likert scale
29. Questionnaire
• 8 tasks: 1 simple, 4 with 2 subtasks, 3 with 3 subtasks
• Duration: ~30 minutes
30. Search Dialog Dataset
• Total number of queries: 2,040
• Number of unique queries: 1,969
• Average query length: 7.07
31. Search Dialog Dataset
• The simple task generated 130 queries
• Tasks with 2 context switches generated 685 queries
• Tasks with 3 context switches generated 1,355 queries
32. How can we predict user
satisfaction
with search dialogues using
interaction signals?
33. User’s dialogue about the ‘stomach ache’
Q1: what do you have medicine for the stomach ache
Q2: stomach ache medicine over the counter
Result type: General Web SERP
34. User’s dialogue about the ‘stomach ache’
Q1: what do you have medicine for the stomach ache
Q2: stomach ache medicine over the counter
Q3: show me the nearest pharmacy
Q4: more information on the second one
Result types: General Web SERP, Structured SERP
40. Quality of Interaction Model
Method                Accuracy (%)       Average F1 (%)
Baseline              70.62              61.38
Interaction Model 1   78.78* (+11.55)    83.59* (+35.90)
Interaction Model 2   80.21* (+13.58)    83.31* (+35.44)
Interaction Model 3   80.81* (+14.43)    79.08* (+28.83)
* Statistically significant improvement (p < 0.05)
41. Which interaction signals have
the highest impact on predicting
user satisfaction with search
dialogues?
42. Predicting User Satisfaction
• F1: Because the SERP for a query is ordered by relevance as determined by the
system, additional exploration is unlikely to lead to user satisfaction; it is
more likely an indication that the best-provided results (i.e. the SERP top)
are insufficient to address the user intent
43. Predicting User Satisfaction
• F2: In the converse case of F1, when users find content that satisfies their
intent, their likelihood of scrolling is reduced, and they dwell for an extended
period on the top viewport
44. Predicting User Satisfaction
• F3: When users are involved in a complex task, they are dissatisfied when
redirected to a general web SERP. Unlike F2, the absence of scrolling on this
landing page is an indication of dissatisfaction
45. Conclusion
How can we define user satisfaction with search dialogues?
• User satisfaction with a search dialogue is defined in a generalized form: as
an aggregation of satisfaction with each of the dialogue’s tasks, not as
satisfaction with each of the dialogue’s queries separately
How can we predict user satisfaction with search dialogues using
interaction signals?
• We showed that features derived from voice and especially from touch and
voice interactions add a significant gain in accuracy over the baseline
Which interaction signals have the highest impact on predicting user
satisfaction with search dialogues?
• Our analysis showed a strong negative correlation between user satisfaction
and swipe actions
46. Thank you!
Questions?
Editor's notes
We use acoustic features to characterize the voice interaction
in search dialogues. More specifically, we use the phonetic similarity
between consecutive requests to identify patterns of repetition. The
Metaphone representation [39] indexes words by their pronunciation,
which lets us represent words by how they are pronounced rather than
how they are written.
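The idea of comparing consecutive requests by sound rather than spelling can be illustrated as follows. This sketch does not implement Metaphone: `rough_phonetic_key` is a deliberately crude, hypothetical stand-in (a few digraph substitutions plus dropping non-leading vowels), and the similarity score comes from Python's standard `difflib.SequenceMatcher`. A real system would swap in a proper Metaphone encoder.

```python
import re
from difflib import SequenceMatcher


def rough_phonetic_key(word: str) -> str:
    """Crude pronunciation key for one word. NOT the Metaphone algorithm."""
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return ""
    # A handful of digraph substitutions, loosely inspired by phonetic encoders.
    for src, dst in (("ph", "f"), ("ck", "k"), ("sh", "x"), ("th", "0")):
        w = w.replace(src, dst)
    # Keep the first symbol; drop vowels and weak consonants from the rest.
    key = w[0] + re.sub(r"[aeiouyhw]", "", w[1:])
    # Collapse runs of the same symbol.
    return re.sub(r"(.)\1+", r"\1", key).upper()


def query_key(query: str) -> str:
    """Join the per-word keys of a query, skipping words that produce no key."""
    keys = (rough_phonetic_key(token) for token in query.split())
    return " ".join(k for k in keys if k)


def phonetic_similarity(q1: str, q2: str) -> float:
    """Similarity in [0, 1] between the phonetic keys of two queries."""
    return SequenceMatcher(None, query_key(q1), query_key(q2)).ratio()


# A near-repetition should score higher than a change of topic.
repeat = phonetic_similarity("show me hotels in Mountain View",
                             "show me cheap hotels in Mountain View")
switch = phonetic_similarity("show me hotels in Mountain View",
                             "what's the weather like in San Francisco")
```

A high score between consecutive requests suggests the user is repeating or rephrasing the same request, which is the repetition pattern the note describes.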