2. Answer distillation task
Given a query and a passage containing the answer, summarize the
answer for presentation, on-screen or read out.
Answers are query-biased
summaries:
• Single entities or phrases (e.g., “Rome”
for the query “Italy capital”)
• Multi-sentence (e.g., for the query
“how to get a passport”)
• Need not be spans from passage
• Might combine multiple passages
6. Existing QA Datasets
Data
Questions are not search queries
Questions are well-formed & curated
Single entity / phrase answers
Multiple-choice answers
E.g., TREC-QA, MCTest,
CBT, WikiQA, SQuAD
+
Metric
Designed for matching short responses, or
Poorly correlate with human judgments, or
Human in the loop (non-repeatable)
E.g., P/R, BLEU, METEOR
7. Our proposal
Data
Sample queries from Bing logs
Editorially curated reference answers
Many reference answers per query
+
Metric
Phrasing Aware (pa-) metrics
Modified versions of BLEU / METEOR
8. Towards variance
reduction
Use single reference passage set to
reduce variance from conflicting
information at source
Get many reference answers to
model the natural variance in answer
phrasing
Extend existing metrics to take better
advantage of the large number of
available reference answers
Query
law for ages for children allowed to sit in front seat
Passages
The law requires all children traveling in the front or
rear seat of any car, van or goods vehicle must use the
correct child car seat until they are either 135cm in
height or 12 years old (which ever they reach first).
After this they must use an adult seat belt. There are
very few exceptions.
Children under the age of 12 and less than 135cm tall
need a child car seat when traveling in the front or the
rear seat of a car.
Distilled answers
Children of any age can travel in the front or the rear
seat of a car. They need a child seat if under the age
of 12.
Children under the age of 12 need a child seat, unless
more than 135cm tall.
…
A child seat is necessary for children under 12.
Otherwise an adult seat belt must be worn.
9. Generating the dataset
Sample queries
• Randomly sample
from Bing logs
• Remove PII
• Remove navigational,
transactional queries
• Remove queries with
no deterministic
answers (E.g.,
“holiday recipes”)
Retrieve candidate
passages
• Retrieve top-N
candidate passages
per query
• Typically retrieved
from many different
documents
Select minimal
passage set
• Editors select the
minimal but
sufficient passage set
• If multiple passages
are selected then
information across
passages should not
conflict
Curate reference
answers
• Editors curate
minimal but
complete answer for
each query
• Answers can be
single entity or
phrase, or multi-
sentence passage
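The four steps above imply one record per query: the query itself, the minimal passage set chosen by editors, and the curated reference answers. A minimal sketch of such a record follows. The class name `DistillationExample`, its field names, and the `is_valid` check are hypothetical and chosen only for illustration; the slides do not specify a schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DistillationExample:
    """One dataset record (hypothetical schema, not from the slides)."""
    query: str                      # sampled query, PII already removed
    passages: List[str]             # minimal but sufficient passage set
    reference_answers: List[str] = field(default_factory=list)

    def is_valid(self) -> bool:
        # A usable record needs a non-empty query, at least one passage,
        # and at least one curated reference answer.
        return (
            bool(self.query.strip())
            and len(self.passages) >= 1
            and len(self.reference_answers) >= 1
        )


example = DistillationExample(
    query="law for ages for children allowed to sit in front seat",
    passages=["The law requires all children traveling in the front or "
              "rear seat of any car, van or goods vehicle must use the "
              "correct child car seat ..."],
    reference_answers=[
        "Children under the age of 12 need a child seat, unless "
        "more than 135cm tall.",
        "A child seat is necessary for children under 12. "
        "Otherwise an adult seat belt must be worn.",
    ],
)
```

Keeping many reference answers in one list per query is what lets a metric later compare a candidate against all of them.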
10. Phrasing
Aware Metrics
Score candidate answer based on
average similarity with all
available reference answers
Each reference answer is
importance weighted based on
agreement with other reference
answers
Metrics like BLEU (or METEOR)
can be used as similarity metric
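The scoring idea above can be sketched in a few lines. This is an illustration under stated assumptions, not the proposed pa-BLEU / pa-METEOR definition: the similarity function here is a simple unigram-overlap F1 swapped in for BLEU or METEOR, each reference's weight is assumed to be its mean similarity to the other references (normalized to sum to 1), and the names `token_f1`, `reference_weights`, and `pa_score` are hypothetical.

```python
from collections import Counter
from typing import Callable, List

Similarity = Callable[[str, str], float]


def token_f1(a: str, b: str) -> float:
    """Unigram-overlap F1: a simple stand-in for BLEU or METEOR."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ta)
    recall = overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def reference_weights(references: List[str],
                      sim: Similarity = token_f1) -> List[float]:
    """Weight each reference by its mean similarity to the other
    references (assumed weighting scheme), normalized to sum to 1."""
    n = len(references)
    if n == 1:
        return [1.0]
    raw = [
        sum(sim(r, o) for j, o in enumerate(references) if j != i) / (n - 1)
        for i, r in enumerate(references)
    ]
    total = sum(raw)
    if total == 0:
        # No agreement at all: fall back to uniform weights.
        return [1.0 / n] * n
    return [w / total for w in raw]


def pa_score(candidate: str, references: List[str],
             sim: Similarity = token_f1) -> float:
    """Weighted average similarity of the candidate to all references."""
    if not references:
        raise ValueError("at least one reference answer is required")
    weights = reference_weights(references, sim)
    return sum(w * sim(candidate, r) for w, r in zip(weights, references))
```

With this weighting, a reference that disagrees with all the others contributes little to the final score, so a candidate is rewarded mainly for matching the phrasing that editors converge on. Any sentence-level similarity with the same signature could be passed as `sim`.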
12. Request For
Comments
We want to make the proposed Answer
Distillation dataset and corresponding metrics
publicly available for academic research
We need YOUR feedback to build the right
evaluation framework
https://gitter.im/ProjectDistillery/Distillery
Editor's notes
I’m a PM: it’s my job to figure out what users want
Our analysis mirrors Google’s: they want answers directly on the SERP, as short as possible
In emerging interfaces (voice, mobile), this is even more critical
Problem: ideal, concise answer often doesn’t appear on the Web
Solution: start with passages from the Web, distill the concise answer automatically
Motivate with a particularly challenging example
Imagine this isn’t in our knowledge base
Conclusion – this problem combines machine comprehension with language synthesis
This is a hard problem, it’s a problem we really care about
We want to accelerate research in this space
Quite possible that the algorithms that can solve this problem have already been invented
Deep learning with memory and attention seems particularly promising
But without appropriate data, there’s little hope of applying them properly
Working around imperfect data and imperfect metrics