Evaluating Helpdesk Dialogues:
Initial Considerations from An
Information Access Perspective
Tetsuya Sakai (Waseda University)
Zhaohao Zeng (Waseda University)
Cheng Luo (Tsinghua University/Waseda University)
tetsuyasakai@acm.org
September 29, 2016
@IPSJ SIGNL (unrefereed), Osaka.
TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
Motivation (1)
Motivation (2)
We want to evaluate task-oriented
multi-turn dialogues
Motivation (3)
• We cannot conduct a subjective evaluation for every dialogue that we
want to evaluate. We want an automatic evaluation method that
approximates subjective evaluation.
• Build a human-human helpdesk dialogue test collection with both
subjective annotations (target variables) and clues for automatic
evaluation (explanatory variables).
• Using the test collection, design and verify automatic evaluation
measures that approximate subjective evaluation.
• One step beyond: human-system dialogue evaluation based on the
human-human dialogue test collection.
TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
Evaluating non-task-oriented dialogues (1)
• Evaluating conversational responses
Discriminative BLEU [Galley+15] extends the machine-translation
measure BLEU to incorporate +/- weights for human references (=gold
responses) to reflect different subjective views.
• Dialogue Breakdown Detection Challenge [Higashinaka+16]
Find the point in a dialogue where it becomes impossible to continue due
to the system’s inappropriate utterances.
System’s output: a probability distribution over NB (not a breakdown),
PB (possible breakdown), or B (breakdown), which is compared against
a gold distribution.
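As a rough illustration of that distributional comparison, here is a minimal sketch of one plausible metric, Jensen-Shannon divergence between the system's and the gold distributions (the challenge reports several metrics; this sketch is not its official scorer):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over (NB, PB, B)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

gold = [0.2, 0.5, 0.3]      # gold distribution over NB, PB, B for one utterance
system = [0.1, 0.4, 0.5]    # system's predicted distribution
print(js_divergence(gold, system))  # lower is better
```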
Evaluating non-task-oriented dialogues (2)
• Evaluating the Short Text Conversation Task [Sakai+15AIRS,Shang+16]
Human-system single-turn dialogues by searching a repository of past
tweets. Ranked lists evaluated with information retrieval measures.
[Figure: Repository of (old post, old comment) pairs; Training data; Test data of new posts. For each new post, retrieve and rank old comments! Graded label (L0-L2) for each comment.]
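For concreteness, a generic sketch of one graded-relevance ranking measure (nDCG) applied to such a ranked list of retrieved comments; the task's official measures and gain settings may differ:

```python
import math

def ndcg(run_gains, all_gold_gains, k):
    """run_gains: gains of the returned comments in rank order (e.g. L0->0, L1->1, L2->2).
    all_gold_gains: all gold gains for the post, used to build the ideal ranking."""
    def dcg(gains):
        return sum(g / math.log2(r + 2) for r, g in enumerate(gains[:k]))
    ideal = dcg(sorted(all_gold_gains, reverse=True))
    return dcg(run_gains) / ideal if ideal > 0 else 0.0

print(ndcg([2, 0, 1], [2, 1, 0, 0], k=3))
```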
Evaluating task-oriented dialogues (1)
• PARADISE [Walker+97]
Task: Train timetable lookup
User satisfaction = f(task success, cost)
Attribute-value matrix (depart-city=?, arrival-city=?, depart-time=?...)
• Spoken Dialogue Challenge [Black+09]
Task: Bus timetable lookup
Live evaluation by calling systems on the phone
• Dialogue State Tracking Challenge [Williams+13,Kim+16]
Task: Bus timetable lookup
Evaluation: at each time t, the system outputs a probability distribution over
possible dialogue states (e.g. different bus routes), which is compared with a gold
label.
Closed-domain, slot filling tasks
Evaluating task-oriented dialogues (2)
• Subjective Assessment of Speech System Interfaces (SASSI) [Hone+00]
Task: In-car speech interface
Factor analysis of questionnaires revealed the following as key factors for subjective
assessment:
- system response accuracy
- likeability
- cognitive demand
- annoyance
- habitability
- speed
• SERVQUAL [Hartikainen+04]
Task: Phone-based email application
Closed-domain, slot filling tasks
Evaluating task-oriented dialogues (3)
• Response Selection [Lowe+15]
Ubuntu corpus containing “artificial” dyadic dialogues.
Task: Ubuntu Q&A (most similar to ours, with no pre-defined slot filling schemes)
Response selection task: each instance pairs a previous dialogue context with a candidate response; one candidate is the correct response from the original dialogue, and the others are incorrect responses taken from other dialogues.
Given the context, can the system choose the correct response from 10 choices?
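A minimal sketch of how such a setup is typically scored, i.e. Recall@k over the 10 candidates per context (the names here are illustrative, not the Ubuntu corpus tooling):

```python
def recall_at_k(correct_ranks, k=1):
    """correct_ranks: for each test context, the rank (1-based) that the system
    assigned to the correct response among the 10 candidates."""
    return sum(1 for r in correct_ranks if r <= k) / len(correct_ranks)

print(recall_at_k([1, 3, 1, 2, 7], k=1))  # 1-in-10 Recall@1 = 0.4
```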
Evaluating textual information access (1)
[Sakai15book]
• ROUGE for summarisation evaluation [Lin04]
Recall and F-measure based on n-grams and skip bigrams.
Requires multiple reference summaries.
• Nugget pyramids and POURPRE for QA [Lin+06]
• Nugget definition at TREC QA: “a fact for which the assessor could
make a binary decision as to whether a response contained that
nugget.”
Nugget recall, allowance-based nugget precision, nugget F-measure.
POURPRE: replaces manual nugget matching with automatic nugget
matching based on unigrams.
Text is regarded as a set of small textual units
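For reference, a sketch of the TREC-style nugget scores mentioned above, assuming the usual allowance of 100 non-whitespace characters per matched nugget and a recall-oriented F (beta = 3); the vital/okay nugget distinction is omitted, so this is not the exact scorer of [Lin+06]:

```python
def nugget_scores(num_matched, num_gold, response_length,
                  beta=3.0, allowance_per_nugget=100):
    """num_matched: gold nuggets found in the response; num_gold: gold nuggets in total;
    response_length: response length in non-whitespace characters."""
    recall = num_matched / num_gold if num_gold else 0.0
    allowance = allowance_per_nugget * num_matched
    precision = 1.0 if response_length <= allowance \
        else 1.0 - (response_length - allowance) / response_length
    denom = beta ** 2 * precision + recall
    f = (beta ** 2 + 1) * precision * recall / denom if denom > 0 else 0.0
    return recall, precision, f

print(nugget_scores(num_matched=3, num_gold=5, response_length=450))
```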
Evaluating textual information access (2)
[Sakai15book]
• S-measure [Sakai+11CIKM]
A measure for query-focussed summaries, introduces a decay function
over text, just as nDCG uses a decay function over ranks.
• T-measure [Sakai+12AIRS]
Nugget-precision that can handle different allowances for different
nuggets.
• U-measure [Sakai+13SIGIR]
A generalisation of S, which works for any textual information access
tasks, including web search, summaries, sessions etc.
Building trailtexts for U-measure (1)
Trailtext: <Sentence A> <Sentence Z>
Trailtext:
<Rank 1 snippet> <Rank 2 snippet> <Rank 2 full text> <Rank 1 full text>
(Nonlinear traversal)
Building trailtexts for U-measure (2)
Trailtext: <News 1> <Ad 2> <Blog 1>
Trailtext:
<Rank 1 snippet> <Rank 2 snippet> <Rank 1’ snippet> <Rank 1’ full text>
U-measure
U = Σ_pos g(pos) · D(pos), where D(pos) = max(0, 1 − pos/L)
pos: position in trailtext (how much text the user has read)
g(pos): gain at pos
D(pos): decay function that discounts the gain (decreasing linearly from 1 at pos = 0 to 0 at pos = L)
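A minimal sketch of this (unnormalised) U-measure; the parameter names are illustrative rather than taken from the original implementation:

```python
def u_measure(nuggets, L):
    """nuggets: list of (pos, gain) pairs, where pos is the position (amount of text
    read, e.g. in characters) at which the nugget is encountered in the trailtext.
    L: maximum tolerable trailtext length (the decay reaches 0 at pos = L)."""
    def decay(pos):
        return max(0.0, 1.0 - pos / L)
    return sum(gain * decay(pos) for pos, gain in nuggets)

# Two nuggets, one read early and one read late in the trailtext.
print(u_measure([(120, 1.0), (550, 2.0)], L=1000))  # 1*0.88 + 2*0.45 = 1.78
```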
Advertisement:
http://sigir.org/sigir2017/
Jan 17: full paper abstracts due
Jan 24: full papers due
Feb 28: short papers and demo proposals due
Aug 7: tutorials and doctoral consortium
Aug 8-10: main conference
Aug 11: workshops
The first ever SIGIR in Japan!
TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
Project overview
(1) Construct a pilot Chinese human-human dialogue test collection
with subjective labels, nuggets and English translations.
(2) Design nugget-based evaluation measures and investigate the
correlation with subjective measures.
(3) Revise criteria for subjective and nugget annotations, as well as the
measures
(4) Construct a larger test collection with subjective labels, nuggets and
English translations. Re-investigate the correlations.
(5) Release the finalised test collection with code for computing the
measures.
φ1, φ2: the five steps are grouped into two project phases.
Subjective labels = target variables
Possible axes for subjective annotation:
- Is the task clearly stated and is actually accomplished?
- How efficiently is the task accomplished through the dialogue?
- Is Customer satisfied with the dialogue, and to what degree?
Interlocutor viewpoints:
Customer’s viewpoint: Solve my problem efficiently, but I’m giving you
minimal information about it.
Helpdesk’s viewpoint: Solve Customer’s problem efficiently, as time is
money for the company.
The two viewpoints may be weighted depending on practical needs.
Why nuggets?
• Subjective labels tell us about the quality of the entire dialogue, but
not about why.
• Helpdesk dialogues lack pre-defined slot filling schemes.
• Subjective scores (gold standard) = f(nuggets) ?
• Parts-Make-The-Whole Hypothesis: The overall quality of a helpdesk
dialogue is governed by the quality of its parts.
[Figure: a Customer-Helpdesk dialogue (alternating C and H posts); its overall quality (subjective) is modelled as f(nuggets).]
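To make the hypothesis concrete, f could for instance be fitted as a simple linear model of nugget-based features; the features and data below are made up for illustration and are not the project's actual model:

```python
import numpy as np

# Hypothetical per-dialogue features (explanatory variables) and subjective score (target).
dialogues = [
    {"n_cust": 3, "n_help": 2, "goal": 1, "posts": 10, "score": 4},
    {"n_cust": 2, "n_help": 1, "goal": 0, "posts": 14, "score": 2},
    {"n_cust": 4, "n_help": 3, "goal": 1, "posts": 8,  "score": 5},
    {"n_cust": 1, "n_help": 1, "goal": 0, "posts": 20, "score": 1},
]
X = np.array([[d["n_cust"], d["n_help"], d["goal"], d["posts"], 1.0] for d in dialogues])
y = np.array([d["score"] for d in dialogues], dtype=float)
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit of a linear f(nuggets)
print(w)
```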
Nugget annotation vs subjective annotation
• Consistency Hypothesis: Nugget annotation achieves higher inter-annotator
consistency. (Smaller units reduce subjectivity and variation in the
annotation procedure.)
• Sensitivity Hypothesis: Nugget annotation enables finer distinctions
among different dialogues. (Nuggets = details)
• Reusability Hypothesis: Nugget annotation enables us to predict the
quality of unannotated dialogues more accurately.
[Figure: two dialogues for the same task, one WITH annotations and one WITHOUT; nuggets from the annotated dialogue are reused to assess the unannotated one.]
Unique features of nuggets for dialogue
evaluation
• A dialogue involves Customer and Helpdesk (not one search engine
user) – two types of nuggets
• Within each nugget type, nuggets are not homogeneous
- Special nuggets that identify the task (trigger nuggets)
- Special nuggets that accomplish the task (goal nuggets)
- Regular nuggets
Customer’s states and the role of nuggets
Possible requirements for nugget-based
evaluation measures (1)
(a) Highly correlated with subjective labels. (Validates the Parts-Make-The-
Whole Hypothesis)
(b) Easy to compute and to interpret.
(c) Accommodate Customer’s and Helpdesk’s viewpoints and change the
balance if required.
(d) Accommodate nugget weights (i.e., importance).
(e) For a given task, prefer a dialogue that accomplishes it over one that does
not.
(f) Given two dialogues containing the same set of nuggets for the same
task, prefer the shorter one.
(g) Given two dialogues that accomplish the same task, prefer the one that
reaches task accomplishment more quickly.
[Figure illustrating (e): for the same task, a dialogue containing a goal nugget is preferred (>) over one with no goal nuggets.]
Possible requirements for nugget-based
evaluation measures (2)
(a) Highly correlated with subjective labels. (Validates the Parts-Make-The-
Whole Hypothesis)
(b) Easy to compute and to interpret.
(c) Accommodate Customer’s and Helpdesk’s viewpoints and change the
balance if required.
(d) Accommodate nugget weights (i.e., importance).
(e) For a given task, prefer a dialogue that accomplishes it over one that does
not.
(f) Given two dialogues containing the same set of nuggets for the same
task, prefer the shorter one.
(g) Given two dialogues that accomplish the same task, prefer the one that
reaches task accomplishment more quickly.
[Figure illustrating (f): given two dialogues for the same task, the shorter one is preferred (>).]
Possible requirements for nugget-based
evaluation measures (3)
(a) Highly correlated with subjective labels. (Validates the Parts-Make-The-
Whole Hypothesis)
(b) Easy to compute and to interpret.
(c) Accommodate Customer’s and Helpdesk’s viewpoints and change the
balance if required.
(d) Accommodate nugget weights (i.e., importance).
(e) For a given task, prefer a dialogue that accomplishes it over one that does
not.
(f) Given two dialogues containing the same set of nuggets for the same
task, prefer the shorter one.
(g) Given two dialogues that accomplish the same task, prefer the one that
reaches task accomplishment more quickly.
[Figure illustrating (g): given two dialogues for the same task that both contain a goal nugget, the one that reaches the goal nugget earlier is preferred (>).]
Possible requirements for nugget-based
evaluation measures (4)
(a) Highly correlated with subjective labels. (Validates the Parts-Make-The-
Whole Hypothesis)
(b) Easy to compute and to interpret.
(c) Accommodate Customer’s and Helpdesk’s viewpoints and change the
balance if required.
(d) Accommodate nugget weights (i.e., importance).
(e) For a given task, prefer a dialogue that accomplishes it over one that does
not.
(f) Given two dialogues containing the same set of nuggets for the same
task, prefer the shorter one.
(g) Given two dialogues that accomplish the same task, prefer the one that
reaches task accomplishment more quickly.
After completing the project...
Evaluating human-system dialogues
[Figure] The human-human dialogue test collection with subjective and nugget annotations is utilised as an unstructured knowledge base. A dialogue with subjective and nugget annotations is sampled, and its Task is given to a Participant, who initiates a human-system dialogue for the same task using the participant’s own expressions. The Participant terminates the dialogue as soon as he/she receives an incoherent or a breakdown-causing utterance from the System.
Can the System still provide the goal nuggets?
How does human-system UCH compare with human-human UCH?
TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
Dialogue mining (100% done)
Pilot data containing 234 Customer-Helpdesk dialogues obtained as
follows:
1. Collect an initial set of Weibo accounts A0 by searching account
names with keywords such as assistant and helper (in Chinese).
2. For each account in A0, crawl the 200 most recent posts that mention
that account using “@”. Filter out accounts that did not respond to
more than 50% of those posts, and let the set of remaining “active”
accounts be A (a sketch of this filtering appears after the list).
3. For each account in A, crawl the 2000 most recent posts that mention
that account, and then extract the dialogues that contain at least
5 Customer posts AND at least 5 Helpdesk posts.
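A sketch of the filtering logic in steps 2 and 3 (as referenced above); the crawl itself is abstracted behind a hypothetical crawl_mentions() helper, and the field names are assumptions:

```python
def is_active(account, crawl_mentions, n_posts=200, min_response_rate=0.5):
    """Step 2: keep an account only if it responded to more than half of the
    most recent n_posts posts that mention it."""
    posts = crawl_mentions(account, limit=n_posts)   # hypothetical crawler helper
    if not posts:
        return False
    responded = sum(1 for p in posts if p["responded"])
    return responded / len(posts) > min_response_rate

def keep_dialogue(thread, min_customer=5, min_helpdesk=5):
    """Step 3: keep a Customer-Helpdesk thread only if both sides posted enough."""
    n_c = sum(1 for p in thread if p["speaker"] == "customer")
    n_h = sum(1 for p in thread if p["speaker"] == "helpdesk")
    return n_c >= min_customer and n_h >= min_helpdesk
```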
Subjective annotation criteria
Low inter-annotator agreement
(next slide)
Subjective annotation (100% done)
See
[Randolph05]
[Sakai15book]
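The slide reports inter-annotator agreement via [Randolph05], i.e. presumably Randolph's free-marginal multirater kappa; a minimal sketch of that statistic, under my reading of the cited paper:

```python
def randolph_kappa(labels, num_categories):
    """labels: one row per dialogue, one column per annotator (category ids).
    Free-marginal kappa: (P_obs - 1/q) / (1 - 1/q) with q categories."""
    def item_agreement(row):
        n = len(row)
        return sum(row.count(c) * (row.count(c) - 1) for c in set(row)) / (n * (n - 1))
    p_obs = sum(item_agreement(r) for r in labels) / len(labels)
    p_chance = 1.0 / num_categories
    return (p_obs - p_chance) / (1 - p_chance)

print(randolph_kappa([[1, 1, 2], [0, 0, 0], [2, 1, 2]], num_categories=3))
```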
Nugget definition (for annotators)
• A post: a piece of text input by Customer/Helpdesk who presses
ENTER to upload it on Weibo.
• A nugget:
(I) is a post, or a sequence of consecutive posts by the same
interlocutor.
(II) can neither partially nor wholly overlap with another nugget.
(III) should be minimal: it should not contain irrelevant posts at
start/end/middle.
(IV) helps Customer transition from Current State towards Target State.
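A minimal sketch of how constraints (I) and (II) could be checked mechanically over an annotated dialogue; the (speaker, start, end) representation is an assumption, not the project's annotation format, and (III)/(IV) still require human judgement:

```python
def check_nuggets(posts, nuggets):
    """posts: list of speaker ids ('C' or 'H') in dialogue order.
    nuggets: list of (speaker, start, end) with inclusive post indices."""
    violations = []
    for speaker, start, end in nuggets:
        if any(posts[i] != speaker for i in range(start, end + 1)):
            violations.append(f"(I) {speaker} nugget {start}-{end} spans another interlocutor's post")
    spans = sorted((s, e) for _, s, e in nuggets)
    for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
        if s2 <= e1:
            violations.append(f"(II) nuggets {s1}-{e1} and {s2}-{e2} overlap")
    return violations

print(check_nuggets(["C", "C", "H", "C"], [("C", 0, 1), ("H", 2, 2), ("C", 2, 3)]))
```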
Nugget types (for annotators)
CNUG0: Customer trigger nuggets. Define Customer’s initial problem.
CNUG: Customer regular nuggets.
HNUG: Helpdesk regular nuggets.
CNUG*: Customer goal nuggets. Customer tells Helpdesk that the
problem has been solved.
HNUG*: Helpdesk goal nuggets. Helpdesk provides Customer with a
solution to the problem.
Nuggets annotated for 40/234=17% of the dialogues
TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
Pilot measures for dialogue evaluation
• U-measure [Sakai+13SIGIR]
Trailtext = concatenation of all texts that the search engine user has
read
• UCH (U computed based on Customer’s and Helpdesk’s nuggets)
Trailtext = dyadic dialogue
- UC = U computed based on Customer’s nuggets (Helpdesk’s
viewpoint)
- UH = U computed based on Helpdesk’s nuggets (Customer’s
viewpoint)
UCH = (1-α) UC + α UH
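A minimal sketch of UC, UH and UCH over a dyadic dialogue; the nugget positions, gains and the heavier goal-nugget weight below are illustrative assumptions:

```python
def u(nuggets, L):
    """Unnormalised U-measure: nuggets as (pos, gain) pairs, linear decay with patience L."""
    return sum(g * max(0.0, 1.0 - pos / L) for pos, g in nuggets)

def uch(customer_nuggets, helpdesk_nuggets, L, alpha=0.5):
    uc = u(customer_nuggets, L)   # Customer's nuggets -> Helpdesk's viewpoint
    uh = u(helpdesk_nuggets, L)   # Helpdesk's nuggets -> Customer's viewpoint
    return (1 - alpha) * uc + alpha * uh

# Toy dialogue: positions in characters of the trailtext; HNUG* (goal nugget) weighted highest.
cnuggets = [(60, 1.0), (420, 1.0)]    # CNUG0 (trigger), CNUG
hnuggets = [(250, 1.0), (700, 3.0)]   # HNUG, HNUG* (goal nugget)
print(uch(cnuggets, hnuggets, L=2000, alpha=0.5))
```

Under this sketch, a dialogue that reaches its goal nugget earlier (i.e. with less text read) scores higher, in line with requirements (f) and (g).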
Nugget positions (1)
[Figure: nugget positions within the dialogue trailtext: a trigger nugget followed by two regular nuggets.]
Nugget positions (2)
[Figure: nugget positions within the dialogue trailtext: two regular nuggets followed by two goal nuggets.]
UCH = (1-α) UC + α UH. When α=0.5, UCH is the U-measure obtained by placing the two gain graphs (Customer’s and Helpdesk’s) on top of each other.
Weight of the goal nugget: higher than the sum of the others.
Normalisation? Unnecessary if score standardisation is applied [Sakai16ICTIR, Sakai16AIRS].
Maximum tolerable dialogue length: the parameter L in the decay function.
Possible variants
• Use different decay functions for Customer and Helpdesk.
• Use time rather than trailtext as the basis for discounting, as in Time-Biased Gain [Smucker+12], with a maximum tolerable dialogue duration in place of a maximum tolerable dialogue length.
  Pro: the gap between the timestamps of two posts can be quantified.
  Con/Pro: cannot quantify the amount of information conveyed in each post expressed in a particular language / language independence.
But remember Requirement (b): Measures should be easy to compute and to interpret.
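A sketch of the time-based variant, using an exponential half-life decay in the spirit of Time-Biased Gain; the decay form and half-life value here are assumptions, not a calibrated TBG model:

```python
import math

def u_time(nuggets, halflife):
    """nuggets: list of (seconds_elapsed, gain) pairs, with seconds_elapsed measured
    from the first post of the dialogue; gain decays as exp(-t * ln2 / halflife)."""
    return sum(g * math.exp(-t * math.log(2) / halflife) for t, g in nuggets)

# One nugget 1 minute in, another 15 minutes in; half-life of 10 minutes.
print(u_time([(60, 1.0), (900, 2.0)], halflife=600))
```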
A sneak peek (40 annotated dialogues)
[Figure: results on the 40 annotated dialogues, plotted against subjective annotation criterion Q3.]
TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
Conclusions and future work (1)
(1) Construct a pilot Chinese human-human dialogue test collection
with subjective labels, nuggets and English translations.
(2) Design nugget-based evaluation measures and investigate the
correlation with subjective measures.
(3) Revise criteria for subjective and nugget annotations, as well as the
measures
(4) Construct a larger test collection with subjective labels, nuggets and
English translations. Re-investigate the correlations.
(5) Release the finalised test collection with code for computing the
measures.
φ1, φ2: the five steps are grouped into two project phases; the completed parts are marked Done.
Conclusions and future work (2)
After φ2...
[Figure] The human-human dialogue test collection with subjective and nugget annotations is utilised as an unstructured knowledge base. A dialogue with subjective and nugget annotations is sampled, and its Task is given to a Participant, who initiates a human-system dialogue for the same task using the participant’s own expressions. The Participant terminates the dialogue as soon as he/she receives an incoherent or a breakdown-causing utterance from the System.
Can the System still provide the goal nuggets?
How does human-system UCH compare with human-human UCH?
Advertisement
Short Text Conversation@NTCIR-13
http://ntcirstc.noahlab.com.hk/STC2/stc-cn.htm
We Want Web@NTCIR-13
http://www.thuir.cn/ntcirwww/
Single-turn
human-
system
dialogues
Improving ad
hoc web
search over
4.5 years
Selected References (1)
[Black+09] The Spoken Dialogue Challenge, Proceedings of SIGDIAL 2009.
[Galley+15] ΔBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets, Proceedings of ACL 2015.
[Higashinaka+16] The Dialogue Breakdown Detection Challenge: Task Description, Datasets, and Evaluation Metrics,
Proceedings of LREC 2016.
[Hartikainen+04] Subjective Evaluation of Spoken Dialogue Systems Using SERVQUAL Method, Proceedings of INTERSPEECH
2004-ICSLP.
[Hone+00] Towards a Tool for the Subjective Assessment of Speech System Interfaces (SASSI), Natural Language Engineering,
6(3-4), 2000.
[Kim+16] The Fourth Dialog State Tracking Challenge, Proceedings of IWSDS 2016.
[Lin04] ROUGE: A Package for Automatic Evaluation of Summaries, Proceedings of the Workshop on Text Summarization
Branches Out, 2004.
[Lin+06] Will Pyramids Built of Nuggets Topple Over? Proceedings of HLT/NAACL 2006.
[Lowe+15] The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems,
Proceedings of SIGDIAL 2015.
Selected References (2)
[Sakai+11CIKM] Click the Search Button and Be Happy: Evaluating Direct and Immediate Information Access, Proceedings of
ACM CIKM 2011.
[Sakai+12AIRS] One Click One Revisited: Enhancing Evaluation based on Information Units, Proceedings of AIRS 2012.
[Sakai+13SIGIR] Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation,
Proceedings of ACM SIGIR 2013.
[Sakai+15AIRS] Topic Set Size Design with the Evaluation Measures for Short Text Conversation, Proceedings of AIRS 2015.
[Sakai15book] 情報アクセス評価方法論: 検索エンジンの進歩のために (Information Access Evaluation Methodology: For the Progress of Search Engines), コロナ社 (Corona Publishing), 2015.
[Sakai16AIRS] The Effect of Score Standardisation on Topic Set Size Design, Proceedings of AIRS 2016, to appear.
[Sakai16ICTIR] A Simple and Effective Approach to Score Standardisation, Proceedings of ACM ICTIR 2016.
[Sakai16SIGIR] Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015,
Proceedings of ACM SIGIR 2016.
Selected References (3)
[Shang+16] Overview of the NTCIR-12 Short Text Conversation Task, Proceedings of NTCIR-12, 2016.
[Smucker+12] Time-Based Calibration of Effectiveness Measures, Proceedings of ACM SIGIR 2012.
[Walker+97] PARADISE: A Framework for Evaluating Spoken Dialogue Agents, Proceedings of ACL 1997.
[Williams+13] The Dialog State Tracking Challenge, Proceedings of SIGDIAL 2013.
Acknowledgements
• We thank Hang Li and Lifeng Shang (Huawei Noah's Ark Lab) for
helpful discussions and continued support; and Guan Jun, Lingtao Li
and Yimeng Fan (Waseda University) for helping us construct the pilot
test collection.
• We also thank Ryuichiro Higashinaka (NTT Media Intelligence
Laboratories) for providing us with valuable information related to the
evaluation of non-task-oriented dialogues.