Understanding and Predicting User Satisfaction with Intelligent Assistants
1. Understanding and Predicting User Satisfaction with Intelligent Assistants
Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C. Crook, Imed Zitouni, Tasos Anastasakos
Eindhoven University of Technology, Pennsylvania State University, University of Massachusetts Amherst, Microsoft
2. Why do we care?
Chart: percentage of traffic over time, Desktop vs Mobile (source: http://gs.statcounter.com)
6. User's dialogue with Cortana; task: "Finding a hotel in Chicago"
Q1: how is the weather in Chicago
Q2: how is it this weekend
Q3: find me hotels
Q4: which one of these is the cheapest
Q5: which one of these has at least 4 stars
Q6: find me directions from the Chicago airport to number one
7. User's dialogue with Cortana; task: "Finding a pharmacy"
Q1: find me a pharmacy nearby
Q2: which of these is highly rated
Q3: show more information about number 2
Q4: how long will it take me to get there
Q5: Thanks
17. Search Dialogue
User: "show restaurants near me" → Cortana: "Here are ten restaurants near you"
User: "show the best restaurants near me" → Cortana: "Here are ten restaurants near you that have good reviews"
18. Search Dialogue
User: "show restaurants near me" → Cortana: "Here are ten restaurants near you"
User: "show the best restaurants near me" → Cortana: "Here are ten restaurants near you that have good reviews"
User: "show directions to the second one" → Cortana: "Getting you directions to the Mayuri Indian Cuisine"
19. Research Questions
• RQ1: What are characteristic types of scenarios of use?
• RQ2: How can we measure different aspects of user satisfaction?
• RQ3: What are key factors determining user satisfaction for the
different scenarios?
• RQ4: How to characterize abandonment in the web search
scenario?
• RQ5: How does query-level satisfaction relate to overall user
satisfaction for the search dialogue scenario?
20. Research Questions → USER STUDY
23. User Study Participants
• 60 participants
• Age: 25.53 ± 5.42 years
• Gender: 75% male, 25% female
• Language: 55% English, 45% other
• Education: 82% Computer Science, 8% Electrical Engineering, 2% Mathematics, 8% other
24. User Study Design
• Video instructions (same for all participants)
• Tasks are realistic, mined from Cortana logs:
o Controlling-device tasks
o Queries where users don't click
o Search dialogue tasks (mostly localization-type queries)
25. Task example: "Find out what is the hair color of your favorite celebrity"
26. Task example: "You are planning a vacation. Pick a place. Check if the weather is good enough for the period you are planning the vacation. Find a hotel that suits you. Find the driving directions to this place."
28. Questionnaire: Controlling Device
• Were you able to complete the task?
o Yes/No
• How satisfied are you with your experience in this task?
o 5-point Likert scale
• How well did Cortana recognize what you said?
o 5-point Likert scale
• Did you put in a lot of effort to complete the task?
o 5-point Likert scale
29. Questionnaire: Controlling Device (5 tasks, 20 minutes)
30. Questionnaire: Good Abandonment
• Were you able to complete the task?
o Yes/No
• Where did you find the answer?
o Answer Box, Image, SERP, Visited Website
• Which query led you to finding the answer?
o First, Second, Third, >= Fourth
• How satisfied are you with your experience in this task?
o 5-point Likert scale
• Did you put in a lot of effort to complete the task?
o 5-point Likert scale
31. Questionnaire: Good Abandonment (5 tasks, 20 minutes)
32. Questionnaire: Search Dialogue
• Were you able to complete the task?
o Yes/No
• How satisfied are you with your experience in this task?
o If the task has sub-tasks, participants indicate graded satisfaction per sub-task, e.g.:
a. How satisfied are you with your experience in finding a hotel?
b. How satisfied are you with your experience in finding directions?
• How well did Cortana recognize what you said?
o 5-point Likert scale
• Did you put in a lot of effort to complete the task?
o 5-point Likert scale
33. Questionnaire: Search Dialogue (8 tasks: 1 simple, 4 with 2 subtasks, 3 with 3 subtasks; 30 minutes)
34. Search Dialogue Dataset
• 540 tasks, comprising
• 2,040 queries, of which 1,969 were unique
• Average query length: 7.07 words
• The simple task generated 130 queries in total
• Tasks with 2 context switches generated 685 queries
• Tasks with 3 context switches generated 1,355 queries
39. Search Dialogue Satisfaction
RQ5: How does query-level satisfaction relate to overall
user satisfaction for the structured search dialogue
scenario?
40. Search Dialogue: per-query and overall satisfaction
User: "show restaurants near me" → Cortana: "Here are ten restaurants near you" (SAT?)
User: "show the best restaurants near me" → Cortana: "Here are ten restaurants near you that have good reviews" (SAT?)
User: "show directions to the second one" → Cortana: "Getting you directions to the Mayuri Indian Cuisine" (SAT?)
Overall SAT: ?
45. User's dialogue with Cortana related to the "stomach ache" problem: a combination of scenarios (general search + search dialogue)
Q1: what do you have medicine for the stomach ache
Q2: stomach ache medicine over the counter
Q3: show me the nearest pharmacy
Q4: more information on the second one
Q5: do they have a stool softener
Q6: does Fred Meyer have stool softeners
46. Conclusions (1)
• RQ1: What are characteristic types of scenarios of use?
• We proposed three main types of scenarios
• RQ2: How can we measure different aspects of user
satisfaction?
• We designed a series of user studies tailored to the three
scenarios
• RQ3: What are key factors determining user satisfaction for
the different scenarios?
• Effort is a key component of user satisfaction across the different intelligent-assistant scenarios
47. Conclusions (2)
• RQ4: How to characterize abandonment in the web search
scenario?
• We concluded that measuring good abandonment requires interaction signals that are not based on clicks or reformulations
• RQ5: How does query-level satisfaction relate to overall user
satisfaction for the search dialogue scenario?
• We looked at user satisfaction as ‘a user journey towards an
information goal where each step is important,’ and showed
the importance of session context
49. Evaluating User Satisfaction
• We need metrics to evaluate user satisfaction
• Good abandonment [Huffman et al., 2009]:
Mobile: 36% of abandoned queries were likely good
Desktop: 14.3%
• Traditional methods use implicit signals: clicks and dwell time
50. Evaluating User Satisfaction
• Traditional implicit signals (clicks and dwell time) don't work for abandoned queries
51. Our Main Research Problem
In the absence of clicks, what is the relationship between a user's gestures and satisfaction, and can we use gestures to detect satisfaction and good abandonment?
52. Research Questions
• RQ1: What SERP elements are the sources of good
abandonment in mobile search?
• RQ2: Do a user's gestures provide signals that can be used
to detect satisfaction and good abandonment in mobile
search?
• RQ3: Which user gestures provide the strongest signals for
satisfaction and good abandonment?
53. Research Questions → USER STUDY
54. Research Questions → USER STUDY + CROWDSOURCING
55. Crowdsourcing Procedure
Random sample of abandoned queries from the search logs of a
personal digital assistant during one week in June 2015 (no query
suggestion)
60. Query and Session Features
Session features:
• Session duration
• Number of queries in session
61. Query and Session Features
Session features:
• Session duration
• Number of queries in session
Query features:
• Index of query within session
• Time to next query
• Query length (number of words)
• Is this query a reformulation?
• Was this query reformulated?
62. Query and Session Features
Session features:
• Session duration
• Number of queries in session
Query features:
• Index of query within session
• Time to next query
• Query length (number of words)
• Is this query a reformulation?
• Was this query reformulated?
Click features:
• Click count
• Number of SAT clicks (dwell > 30 sec)
• Number of back clicks (dwell < 30 sec)
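These session, query, and click features can be derived from a raw query log. A minimal sketch in Python, assuming a simple hypothetical log schema (the `Query`/`Click` classes and field names are our invention; the 30-second SAT-click threshold is the one from the slides):

```python
from dataclasses import dataclass, field
from typing import List, Optional

SAT_DWELL_SEC = 30  # dwell threshold separating SAT clicks from back clicks

@dataclass
class Click:
    dwell_sec: float

@dataclass
class Query:
    text: str
    timestamp: float                 # seconds since session start
    is_reformulation: bool = False   # does this query rephrase the previous one?
    was_reformulated: bool = False   # was this query followed by a rephrasing?
    clicks: List[Click] = field(default_factory=list)

def extract_features(session: List[Query], i: int) -> dict:
    """Session-, query-, and click-level features for the i-th query in a session."""
    q = session[i]
    next_gap: Optional[float] = (
        session[i + 1].timestamp - q.timestamp if i + 1 < len(session) else None
    )
    return {
        # session features
        "session_duration": session[-1].timestamp - session[0].timestamp,
        "n_queries_in_session": len(session),
        # query features
        "query_index": i,
        "time_to_next_query": next_gap,
        "query_length_words": len(q.text.split()),
        "is_reformulation": q.is_reformulation,
        "was_reformulated": q.was_reformulated,
        # click features
        "click_count": len(q.clicks),
        "n_sat_clicks": sum(c.dwell_sec > SAT_DWELL_SEC for c in q.clicks),
        "n_back_clicks": sum(c.dwell_sec < SAT_DWELL_SEC for c in q.clicks),
    }
```

This is only a schematic of the feature groups on the slide, not the authors' extraction code.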
63. Baseline 1: Click & Dwell with no Reformulation
• Predict SAT when the query has a click with dwell time > 30 sec and was not reformulated
64. Baseline 2: Optimistic
• Predict SAT when the query has no click and was not reformulated
65. Baseline 3: Query-Session Model
• Random Forest trained on the session, query, and click features above
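Baselines B1 and B2 are simple rules over the features above. A sketch of the two rules as we read them from the slides (the feature-dict keys are our naming; exact tie-breaking details may differ in the paper):

```python
def b1_click_dwell(f: dict) -> bool:
    """B1: predict SAT iff there is a click with dwell > 30 sec and no reformulation."""
    return f["n_sat_clicks"] > 0 and not f["was_reformulated"]

def b2_optimistic(f: dict) -> bool:
    """B2 (optimistic): predict SAT iff the query is abandoned (no clicks)
    and the user did not reformulate it."""
    return f["click_count"] == 0 and not f["was_reformulated"]
```

B2 is "optimistic" because it treats every non-reformulated abandonment as good abandonment.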
66. Gesture Features (1)
• Viewport swipe-related features:
o Number of up swipes and down swipes
o Number of changes in swipe direction
o Swiped distance in pixels and average swiped distance
o Swiped distance divided by time spent on the SERP
67. Gesture Features (1)
• Viewport swipe-related features (as above)
• Time to focus:
o Time to focus on the Answer
o Time to focus on organic search results
68. GF(2): Attributed Reading Time
Figure: reading time is attributed to a SERP element in proportion to its visible share of the viewport over time, e.g. 3 sec at 33% of the viewport + 6 sec at 66% + 2 sec at 20% = 1s + 4s + 0.4s = 5.4s
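The attribution in the figure can be sketched as a weighted sum: each interval's dwell time multiplied by the element's visible share of the viewport. A minimal sketch, assuming this weighting is what the figure depicts:

```python
def attributed_reading_time(intervals):
    """Sum of dwell_sec * visible_fraction over the viewport intervals
    during which (part of) the element was on screen."""
    return sum(dwell_sec * fraction for dwell_sec, fraction in intervals)

# The slide's example: 3s at 1/3 of the viewport, 6s at 2/3, 2s at 20%
t = attributed_reading_time([(3, 1 / 3), (6, 2 / 3), (2, 0.20)])  # ~5.4 seconds
```

This matches the slide's 1s + 4s + 0.4s = 5.4s decomposition.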
70. Models: Detecting Good Abandonment
M1: Gesture Model: Random Forest trained on gesture features
M2: Gesture Model + Query and Session Features: Random Forest trained on gesture, query, and session features
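A minimal sketch of M1 and M2 with scikit-learn, using synthetic stand-in features (the study's data and exact hyperparameters are not given in the slides):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins, one row per query:
# 4 gesture columns, 3 query/session columns
X_gesture = rng.normal(size=(200, 4))
X_query_session = rng.normal(size=(200, 3))
y_sat = rng.integers(0, 2, size=200)  # 1 = SAT, 0 = DSAT

# M1: Random Forest on gesture features only
m1 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_gesture, y_sat)

# M2: Random Forest on gesture + query and session features
X_all = np.hstack([X_gesture, X_query_session])
m2 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_all, y_sat)

pred = m2.predict(X_all[:5])  # per-query SAT/DSAT predictions
```

Comparing M1 against M2 on held-out data is how the slides assess whether query and session context adds signal beyond gestures alone.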
71. RQ2: Are gestures useful? (1)
On abandoned user-study data only:
148 SAT queries and 313 DSAT queries
72. RQ2: Are gestures useful? (2)
On crowdsourced data:
1565 SAT queries and 1924 DSAT queries
73. RQ2: Are gestures useful? (3)
On all user study data:
179 SAT queries and 384 DSAT queries
Gesture features are useful for detecting user satisfaction in general!
74. Conclusions
• RQ1: What SERP elements are the sources of good abandonment in mobile search?
Answers, Images, and Snippets
• RQ2: Do a user's gestures provide signals that can be used to detect satisfaction and good abandonment in mobile search?
Yes
• RQ3: Which user gestures provide the strongest signals for satisfaction and good abandonment?
Time spent interacting with Answers is positively correlated with satisfaction; swipe actions and time spent on the SERP are negatively correlated
75.
• Answers, Images, and Snippets are potential sources of good abandonment
• User gestures provide useful signals for detecting good abandonment
• Time spent interacting with Answers is positively correlated with satisfaction; swipe actions and time spent on the SERP are negatively correlated
Questions?
Speaker Notes
Search online for Cortana screenshots
We find a strong significant negative correlation of -0.65 between satisfaction and effort, and a negative correlation of -0.08 between completion and effort, indicating that less effort leads to more satisfaction and higher completion rates.
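The correlations in the note are plain Pearson coefficients between Likert ratings. A minimal sketch with illustrative ratings (not the study's data):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative 5-point ratings: satisfaction tends to drop as reported effort rises
satisfaction = [5, 4, 4, 3, 2, 1]
effort       = [1, 2, 1, 3, 4, 5]
r = pearson(satisfaction, effort)  # negative, as in the reported -0.65
```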