1. Evaluating Helpdesk Dialogues: Initial Considerations from an Information Access Perspective
Tetsuya Sakai (Waseda University)
Zhaohao Zeng (Waseda University)
Cheng Luo (Tsinghua University/Waseda University)
tetsuyasakai@acm.org
September 29, 2016
@IPSJ SIGNL (unrefereed), Osaka.
2. TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
5. Motivation (3)
• We cannot conduct a subjective evaluation for every dialogue that we
want to evaluate. We want an automatic evaluation method that
approximates subjective evaluation.
• Build a human-human helpdesk dialogue test collection with both
subjective annotations (target variables) and clues for automatic
evaluation (explanatory variables).
• Using the test collection, design and verify automatic evaluation
measures that approximate subjective evaluation.
• One step beyond: human-system dialogue evaluation based on the
human-human dialogue test collection.
7. Evaluating non-task-oriented dialogues (1)
• Evaluating conversational responses
Discriminative BLEU [Galley+15] extends the machine-translation
measure BLEU to incorporate +/- weights for human references (=gold
responses) to reflect different subjective views.
• Dialogue Breakdown Detection Challenge [Higashinaka+16]
Find the point in a dialogue where it becomes impossible to continue due
to the system's inappropriate utterances.
System’s output: a probability distribution over NB (not a breakdown),
PB (possible breakdown), or B (breakdown), which is compared against
a gold distribution.
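The distribution comparison in the last step can be illustrated with a small sketch. Jensen-Shannon divergence is one plausible way to compare a system's (NB, PB, B) distribution with the gold distribution (the challenge uses several such metrics); the base-2 logarithm here is an assumption, not necessarily the challenge's official setting.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two discrete
    distributions given as equal-length lists of probabilities."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero probability vanish.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2

# E.g. system says "surely not a breakdown", gold says "surely a breakdown":
# js_divergence([1, 0, 0], [0, 0, 1]) gives the maximum value 1.0.
```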
8. Evaluating non-task-oriented dialogues (2)
• Evaluating the Short Text Conversation Task [Sakai+15AIRS,Shang+16]
Human-system single-turn dialogues by searching a repository of past
tweets. Ranked lists evaluated with information retrieval measures.
[Figure: a repository of old post-comment pairs is split into training and test data; for each new post, the system retrieves and ranks old comments, each carrying a graded relevance label (L0-L2).]
9. Evaluating task-oriented dialogues (1)
• PARADISE [Walker+97]
Task: Train timetable lookup
User satisfaction = f(task success, cost)
Attribute-value matrix (depart-city=?, arrival-city=?, depart-time=?...)
• Spoken Dialogue Challenge [Black+09]
Task: Bus timetable lookup
Live evaluation by calling systems on the phone
• Dialogue State Tracking Challenge [Williams+13,Kim+16]
Task: Bus timetable lookup
Evaluation: at each time t, the system outputs a probability distribution over
possible dialogue states (e.g. different bus routes), which is compared with a gold
label.
Closed-domain, slot-filling tasks
10. Evaluating task-oriented dialogues (2)
• Subjective Assessment of Speech System Interfaces (SASSI) [Hone+00]
Task: In-car speech interface
Factor analysis of questionnaires revealed the following as key factors for subjective
assessment:
- system response accuracy
- likeability
- cognitive demand
- annoyance
- habitability
- speed
• SERVQUAL [Hartikainen+04]
Task: Phone-based email application
Closed-domain, slot-filling tasks
11. Evaluating task-oriented dialogues (3)
• Response Selection [Lowe+15]
Ubuntu corpus containing “artificial” dyadic dialogues.
Task: Ubuntu Q&A (most similar to ours, with no pre-defined slot-filling scheme)
Response selection task:
[Figure: each test instance pairs the previous dialogue context with the correct response from the original dialogue and several incorrect responses taken from other dialogues.]
Given the context, can the system choose the correct response from 10 choices?
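The 1-in-10 selection accuracy described above can be sketched as follows; `score` is a stand-in for any context-response scorer, not an actual component of [Lowe+15].

```python
def recall_at_1(instances, score):
    """instances: list of (context, candidates) where candidates[0] is
    the correct response; score(context, response) -> float, higher is
    better. Ties favour the lower index (i.e. the correct response)."""
    hits = sum(
        1 for context, cands in instances
        if max(range(len(cands)), key=lambda i: score(context, cands[i])) == 0
    )
    return hits / len(instances)

# Toy scorer (word overlap) just to exercise the function:
overlap = lambda c, r: len(set(c.split()) & set(r.split()))
```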
12. Evaluating textual information access (1)
[Sakai15book]
• ROUGE for summarisation evaluation [Lin04]
Recall and F-measure based on n-grams and skip bigrams.
Requires multiple reference summaries.
• Nugget pyramids and POURPRE for QA [Lin+06]
• Nugget definition at TREC QA: “a fact for which the assessor could
make a binary decision as to whether a response contained that
nugget.”
Nugget recall, allowance-based nugget precision, nugget F-measure.
POURPRE: replaces manual nugget matching with automatic nugget
matching based on unigrams.
Text is regarded as a set of small textual units
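As a sketch of the nugget-based scores above: the allowance constant (100 characters per matched nugget) and the F-measure's β are assumptions in the spirit of the TREC definitions, not the official settings verbatim.

```python
def nugget_scores(matched_vital, total_vital, matched_all, answer_length,
                  allowance_per_nugget=100, beta=3.0):
    """Return (recall, precision, F) for one system response.

    matched_vital : number of vital nuggets found in the response
    total_vital   : number of vital nuggets in the gold list
    matched_all   : all nuggets (vital + okay) found in the response
    answer_length : response length in non-whitespace characters
    """
    recall = matched_vital / total_vital if total_vital else 0.0
    # Allowance-based precision: the response is "free" up to a length
    # budget proportional to the number of matched nuggets.
    allowance = allowance_per_nugget * matched_all
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length
    if precision + recall == 0:
        return recall, precision, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return recall, precision, f
```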
13. Evaluating textual information access (2)
[Sakai15book]
• S-measure [Sakai+11CIKM]
A measure for query-focussed summaries, introduces a decay function
over text, just as nDCG uses a decay function over ranks.
• T-measure [Sakai+12AIRS]
Nugget-precision that can handle different allowances for different
nuggets.
• U-measure [Sakai+13SIGIR]
A generalisation of S, which works for any textual information access
tasks, including web search, summaries, sessions etc.
14. Building trailtexts for U-measure (1)
Trailtext: <Sentence A> <Sentence Z>
Trailtext: <Rank 1 snippet> <Rank 2 snippet> <Rank 2 full text> <Rank 1 full text>
Nonlinear traversal
15. Building trailtexts for U-measure (2)
Trailtext: <News 1> <Ad 2> <Blog 1>
Trailtext: <Rank 1 snippet> <Rank 2 snippet> <Rank 1’ snippet> <Rank 1’ full text>
17. Advertisement:
http://sigir.org/sigir2017/
Jan 17: full paper abstracts due
Jan 24: full papers due
Feb 28: short papers and demo proposals due
Aug 7: tutorials and doctoral consortium
Aug 8-10: main conference
Aug 11: workshops
The first ever SIGIR in Japan!
19. Project overview
(1) Construct a pilot Chinese human-human dialogue test collection
with subjective labels, nuggets and English translations.
(2) Design nugget-based evaluation measures and investigate the
correlation with subjective measures.
(3) Revise the criteria for subjective and nugget annotations, as well as
the measures.
(4) Construct a larger test collection with subjective labels, nuggets and
English translations. Re-investigate the correlations.
(5) Release the finalised test collection with code for computing the
measures.
[Figure labels: φ1 covers steps (1)-(3); φ2 covers steps (4)-(5).]
20. Subjective labels = target variables
Possible axes for subjective annotation:
- Is the task clearly stated, and is it actually accomplished?
- How efficiently is the task accomplished through the dialogue?
- Is Customer satisfied with the dialogue, and to what degree?
Interlocutor viewpoints:
Customer’s viewpoint: Solve my problem efficiently, but I’m giving you
minimal information about it.
Helpdesk’s viewpoint: Solve Customer’s problem efficiently, as time is
money for the company.
The two viewpoints may be weighted depending on practical needs.
21. Why nuggets?
• Subjective labels tell us about the quality of the entire dialogue, but
not about why.
• Helpdesk dialogues lack pre-defined slot filling schemes.
• Subjective scores (gold standard) = f(nuggets) ?
• Parts-Make-The-Whole Hypothesis: The overall quality of a helpdesk
dialogue is governed by the quality of its parts.
[Figure: the overall (subjective) quality of a Customer-Helpdesk dialogue is modelled as f(nuggets), a function of its parts.]
22. Nugget annotation vs subjective annotation
• Consistency Hypothesis: Nugget annotation achieves higher
inter-annotator consistency. (Smaller units reduce subjectivity and
variation in the annotation procedure.)
• Sensitivity Hypothesis: Nugget annotation enables finer distinctions
among different dialogues. (Nuggets = details)
• Reusability Hypothesis: Nugget annotation enables us to predict the
quality of unannotated dialogues more accurately.
[Figure: nugget annotations from one dialogue for a task are reused to predict the quality of unannotated dialogues for the same task.]
23. Unique features of nuggets for dialogue
evaluation
• A dialogue involves Customer and Helpdesk (not one search engine
user) – two types of nuggets
• Within each nugget type, nuggets are not homogeneous
- Special nuggets that identify the task (trigger nuggets)
- Special nuggets that accomplish the task (goal nuggets)
- Regular nuggets
25. Possible requirements for nugget-based
evaluation measures (1)
(a) Highly correlated with subjective labels. (Validates the Parts-Make-The-Whole Hypothesis)
(b) Easy to compute and to interpret.
(c) Accommodate Customer’s and Helpdesk’s viewpoints and change the
balance if required.
(d) Accommodate nugget weights (i.e., importance).
(e) For a given task, prefer a dialogue that accomplishes it over one that
does not.
(f) Given two dialogues containing the same set of nuggets for the same
task, prefer the shorter one.
(g) Given two dialogues that accomplish the same task, prefer the one that
reaches task accomplishment more quickly.
[Figure: for the same task, a dialogue that contains a goal nugget is preferred over one that contains none.]
29. After completing the project...
Evaluating human-system dialogues:
• Utilise the human-human dialogue test collection, with its subjective and nugget annotations, as an unstructured knowledge base.
• Given the task from a sampled dialogue with subjective and nugget annotations, Participant initiates a human-system dialogue for the same task, using Participant's own expressions.
• Participant terminates the dialogue as soon as he receives an incoherent or a breakdown-causing utterance from System.
• Can System still provide the goal nuggets?
• How does human-system UCH compare with human-human UCH?
31. Dialogue mining (100% done)
Pilot data containing 234 Customer-Helpdesk dialogues, obtained as
follows:
1. Collect an initial set of Weibo accounts A0 by searching account
names with keywords such as "assistant" and "helper" (in Chinese).
2. For each account in A0, crawl the 200 most recent posts that mention
that account using "@". Discard accounts that responded to 50% or fewer
of these posts; let the set of remaining "active" accounts be A.
3. For each account in A, crawl the 2000 most recent posts that mention
that account, and extract the dialogues with at least 5 Customer posts
AND at least 5 Helpdesk posts.
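The filtering in steps 2-3 can be sketched over in-memory data; the crawling itself is omitted, and the data shapes and function names are assumptions, not part of the actual pipeline.

```python
def active_accounts(posts_by_account):
    """Step 2's filter: keep accounts that replied to more than 50% of
    the crawled posts mentioning them. posts_by_account maps an account
    name to a list of booleans, one per post, True if the account replied."""
    return {
        account
        for account, replied in posts_by_account.items()
        if replied and sum(replied) / len(replied) > 0.5
    }

def long_enough(dialogue, min_posts=5):
    """Step 3's filter: at least min_posts posts from each interlocutor.
    dialogue is a list of (speaker, text) with speaker 'C' or 'H'."""
    c = sum(1 for speaker, _ in dialogue if speaker == "C")
    h = len(dialogue) - c  # assumes only 'C' and 'H' appear
    return c >= min_posts and h >= min_posts
```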
34. Nugget definition (for annotators)
• A post: a piece of text input by Customer/Helpdesk who presses
ENTER to upload it on Weibo.
• A nugget:
(I) is a post, or a sequence of consecutive posts by the same
interlocutor.
(II) can neither partially nor wholly overlap with another nugget.
(III) should be minimal: it should not contain irrelevant posts at the
start, end, or middle.
(IV) helps Customer transition from Current State towards Target State.
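Constraint (II) can be checked mechanically. This sketch assumes each nugget is recorded as a (start, end) range of post indices within the dialogue; that representation is an assumption, not the annotation tool's actual format.

```python
def valid_nuggets(nuggets, n_posts):
    """Check constraint (II): nuggets may neither partially nor wholly
    overlap. Each nugget is a (start, end) range of post indices,
    end exclusive; n_posts is the number of posts in the dialogue."""
    spans = sorted(nuggets)
    in_bounds = all(0 <= s < e <= n_posts for s, e in spans)
    disjoint = all(a_end <= b_start
                   for (_, a_end), (b_start, _) in zip(spans, spans[1:]))
    return in_bounds and disjoint
```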
35. Nugget types (for annotators)
CNUG0: Customer trigger nuggets. Define Customer’s initial problem.
CNUG: Customer regular nuggets.
HNUG: Helpdesk regular nuggets.
CNUG*: Customer goal nuggets. Customer tells Helpdesk that the
problem has been solved.
HNUG*: Helpdesk goal nuggets. Helpdesk provides customer with a
solution to the problem.
Nuggets annotated for 40/234 ≈ 17% of the dialogues
37. Pilot measures for dialogue evaluation
• U-measure [Sakai+13SIGIR]
Trailtext = concatenation of all texts that the search engine user has
read
• UCH (U computed based on Customer’s and Helpdesk’s nuggets)
Trailtext = dyadic dialogue
- UC = U computed based on Customer’s nuggets (Helpdesk’s
viewpoint)
- UH = U computed based on Helpdesk’s nuggets (Customer’s
viewpoint)
UCH = (1-α) UC + α UH
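A minimal sketch of UCH, under assumptions: each nugget contributes its gain discounted by a linear decay over its trailtext position, in the spirit of S-/U-measure. The gain values, the position convention, and the absence of normalisation are simplifications, not the measure's definitive form.

```python
def u_measure(nuggets, L):
    """nuggets: list of (gain, position), position being the offset (in
    characters) of the nugget's end within the trailtext; linear decay
    d(pos) = max(0, 1 - pos/L), with L the maximum tolerable length."""
    return sum(gain * max(0.0, 1.0 - pos / L) for gain, pos in nuggets)

def uch(customer_nuggets, helpdesk_nuggets, L, alpha=0.5):
    """UCH = (1 - alpha) * UC + alpha * UH."""
    u_c = u_measure(customer_nuggets, L)  # Helpdesk's viewpoint
    u_h = u_measure(helpdesk_nuggets, L)  # Customer's viewpoint
    return (1 - alpha) * u_c + alpha * u_h
```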
40. UCH = (1-α) UC + α UH
• When α = 0.5, UCH is the U-measure that places the two (Customer and Helpdesk) gain graphs on top of each other.
• The weight of the goal nugget is set higher than the sum of the other nuggets' weights.
• Normalisation? Unnecessary if score standardisation is applied [Sakai16ICTIR,Sakai16AIRS].
• The decay function is bounded by a maximum tolerable dialogue length.
41. Possible variants
• Use different decay functions for Customer and Helpdesk.
• Use time rather than trailtext as the basis for discounting, as in
Time-Biased Gain [Smucker+12], with a maximum tolerable dialogue duration.
  +: the gap between the timestamps of two posts can be quantified.
  -/+: cannot quantify the amount of information conveyed in each post
  expressed in a particular language, but is therefore language-independent.
But remember Requirement (b): measures should be easy to compute and to interpret.
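The time-based variant could look like the sketch below. Time-Biased Gain does use an exponential decay over time, but the half-life parameterisation and the nugget representation here are assumptions.

```python
import math

def u_time(nuggets, half_life):
    """nuggets: list of (gain, seconds since the dialogue started);
    each gain is halved every half_life seconds of elapsed time."""
    return sum(gain * math.exp(-math.log(2) * t / half_life)
               for gain, t in nuggets)

# A nugget appearing one half-life into the dialogue earns half its gain.
```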
44. Conclusions and future work (1)
(1) Construct a pilot Chinese human-human dialogue test collection
with subjective labels, nuggets and English translations.
(2) Design nugget-based evaluation measures and investigate the
correlation with subjective measures.
(3) Revise the criteria for subjective and nugget annotations, as well as
the measures.
(4) Construct a larger test collection with subjective labels, nuggets and
English translations. Re-investigate the correlations.
(5) Release the finalised test collection with code for computing the
measures.
[Figure labels: φ1 covers steps (1)-(3); φ2 covers steps (4)-(5); step (1) is marked Done.]
45. Conclusions and future work (2)
After φ2...
Evaluating human-system dialogues:
• Utilise the human-human dialogue test collection, with its subjective and nugget annotations, as an unstructured knowledge base.
• Given the task from a sampled dialogue with subjective and nugget annotations, Participant initiates a human-system dialogue for the same task, using Participant's own expressions.
• Participant terminates the dialogue as soon as he receives an incoherent or a breakdown-causing utterance from System.
• Can System still provide the goal nuggets?
• How does human-system UCH compare with human-human UCH?
47. Selected References (1)
[Black+09] The Spoken Dialogue Challenge, Proceedings of SIGDIAL 2009
[Galley+15] ΔBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets, Proceedings of ACL 2015.
[Higashinaka+16] The Dialogue Breakdown Detection Challenge: Task Description, Datasets, and Evaluation Metrics,
Proceedings of LREC 2016.
[Hartikainen+04] Subjective Evaluation of Spoken Dialogue Systems Using the SERVQUAL Method, Proceedings of INTERSPEECH
2004-ICSLP.
[Hone+00] Towards a Tool for the Subjective Assessment of Speech System Interfaces (SASSI), Natural Language Engineering,
6(3-4), 2000.
[Kim+16] The Fourth Dialog State Tracking Challenge, Proceedings of IWSDS 2016.
[Lin04] ROUGE: A Package for Automatic Evaluation of Summaries, Proceedings of the Workshop on Text Summarization
Branches Out, 2004.
[Lin+06] Will Pyramids Built of Nuggets Topple Over? Proceedings of HLT/NAACL 2006.
[Lowe+15] The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems,
Proceedings of SIGDIAL 2015.
48. Selected References (2)
[Sakai+11CIKM] Click the Search Button and Be Happy: Evaluating Direct and Immediate Information Access, Proceedings of
ACM CIKM 2011.
[Sakai+12AIRS] One Click One Revisited: Enhancing Evaluation based on Information Units, Proceedings of AIRS 2012.
[Sakai+13SIGIR] Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation,
Proceedings of ACM SIGIR 2013.
[Sakai+15AIRS] Topic Set Size Design with the Evaluation Measures for Short Text Conversation, Proceedings of AIRS 2015.
[Sakai15book] 情報アクセス評価方法論: 検索エンジンの進歩のために (Information Access Evaluation Methodology: For the Progress of Search Engines), Corona Publishing (コロナ社), 2015.
[Sakai16AIRS] The Effect of Score Standardisation on Topic Set Size Design, Proceedings of AIRS 2016, to appear.
[Sakai16ICTIR] A Simple and Effective Approach to Score Standardisation, Proceedings of ACM ICTIR 2016.
[Sakai16SIGIR] Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015,
Proceedings of ACM SIGIR 2016.
49. Selected References (3)
[Shang+16] Overview of the NTCIR-12 Short Text Conversation Task, Proceedings of NTCIR-12, 2016.
[Smucker+12] Time-Based Calibration of Effectiveness Measures, Proceedings of ACM SIGIR 2012.
[Walker+97] PARADISE: A Framework for Evaluating Spoken Dialogue Agents, Proceedings of ACL 1997.
[Williams+13] The Dialog State Tracking Challenge, Proceedings of SIGDIAL 2013.
50. Acknowledgements
• We thank Hang Li and Lifeng Shang (Huawei Noah's Ark Lab) for
helpful discussions and continued support; and Guan Jun, Lingtao Li
and Yimeng Fan (Waseda University) for helping us construct the pilot
test collection.
• We also thank Ryuichiro Higashinaka (NTT Media Intelligence
Laboratories) for providing us with valuable information related to the
evaluation of non-task-oriented dialogues.