3. TALK OUTLINE
1. NTCIR and me
2. Survey of NTCIR overviews (1999-2019)
3. Q-measures etc.
4. D-measures etc.
5. Beyond graded relevance
6. Summary
4. NTCIR-1, -2, -3 (1999-2003)
⢠Sakai, T., Shibazaki, Y., Suzuki, M., Kajiura, M.,
Manabe, T. and Sumita, K.: Cross-Language
Information Retrieval for NTCIR at Toshiba,
Proceedings of NTCIR-1, 1999.
⢠Sakai, T., Robertson, S.E. and Walker, S.: Flexible
Pseudo-Relevance Feedback for NTCIR-2,
Proceedings of NTCIR-2, 2001.
⢠Sakai, T., Koyama, M., Suzuki, M. and Manabe, T.:
Toshiba KIDS at NTCIR-3: Japanese and English-
Japanese IR, Proceedings of NTCIR-3, 2003.
1 paper per NTCIR
5. NTCIR-4 (2004)
⢠Sakai, T., Koyama, M., Kumano, A. and Manabe, T.:
Toshiba BRIDJE at NTCIR-4 CLIR:
Monolingual/Bilingual IR and Flexible Feedback,
Proceedings of NTCIR-4, 2004.
⢠Sakai, T., Saito, Y., Ichimura, Y., Koyama, M. and
Kokubu, T.: Toshiba ASKMi at NTCIR-4 QAC2,
Procedings of NTCIR-4, 2004.
⢠Sakai, T.: New Performance Metrics based on
Multigrade Relevance: Their Application to
Question Answering, Proceedings of NTCIR-4
Proceedings (Open Submission Session), 2004.
Q-measure
This later evolved into EVIA
3 papers
6. NTCIR-5 (2005)
⢠Kokubu, T., Sakai, T., Saito, Y., Tsutsui, H., Manabe, T.,
Koyama, M. and Fujii, H.: The Relationship between
Answer Ranking and User Satisfaction in a Question
Answering System, Proceedings of NTCIR-5 (Open
Submission Session), 2005.
⢠Sakai, T.: The Effect of Topic Sampling on Sensitivity
Comparisons of Information Retrieval Metrics,
Proceedings of NTCIR-5 (Open Submission Session),
2005.
⢠Sakai, T., Manabe, T., Kumano, A., Koyama, M. and
Kokubu, T.: Toshiba BRIDJE at NTCIR-5: Evaluation using
Geometric Means, Proceedings of NTCIR-5, 2005.
3 papers
7. NTCIR-6 (2007)
⢠Sakai, T.: On Penalising Late Arrival of Relevant
Documents in Information Retrieval Evaluation with
Graded Relevance, Proceedings of EVIA 2007.
⢠Sakai, T.: User Satisfaction Task: A Proposal for
NTCIR-7, Proceedings of EVIA 2007.
⢠Sakai, T., Koyama, M., Izuha, T., Kumano, A.,
Manabe, T. and Kokubu, T.: Toshiba BRIDJE at
NTCIR-6 CLIR: The Head/Lead Method and Graded
Relevance Feedback, Proceedings of NTCIR-6, 2007.
3 papers
8. NTCIR-7 (2008)
⢠Sakai, T. and Robertson, S.: Modelling A User Population for
Designing Information Retrieval Metrics, Proceedings of
EVIA 2008.
⢠Sakai, T. and Kando, N.: Are Popular Documents More Likely
To Be Relevant? A Dive into the ACLIA IR4QA Pools,
Proceedings of EVIA 2008.
⢠Mitamura, T., Nyberg, E., Shima, H., Kato, T., Mori, T., Lin, C.-
Y., Song, R., Lin, C.-J., Sakai, T., Ji, D. and Kando, N.: Overview
of the NTCIR-7 ACLIA Tasks: Advanced Cross-Lingual
Information Access, Proceedings of NTCIR-7, 2008.
⢠Sakai, T., Kando, N., Lin, C.-J., Mitamura, T., Shima, H., Ji, D.,
Chen, K.-H., and Nyberg, E.: Overview of the NTCIR-7 ACLIA
IR4QA Task, Proceedings of NTCIR-7, 2008.
NCU
Debut as a task organiser
4 papers
9. NTCIR-8 (2010)
⢠Song, R., Qi, D., Liu, H., Sakai, T., Nie, J.-Y., Hon, H.-W. and Yu, Y.: Constructing a Test Collection
with Multi-Intent Queries, Proceedings of EVIA 2010.
⢠Sakai, T., Craswell, N., Song, R., Robertson, S., Dou, Z. and Lin, C.-Y.: Simple Evaluation Metrics for
Diversified Search Results, Proceedings of EVIA 2010.
⢠Sakai, T. and Lin, C.-Y.: Ranking Retrieval Systems without Relevance Assessments ? Revisited,
Proceedings of EVIA 2010.
⢠Mitamura, T., Shima, H., Sakai, T., Kando, N., Mori, T., Takeda, K., Lin, C.-Y., Song, R., Lin, C.-J. and
Lee, C.-W.: Overview of the NTCIR-8 ACLIA Tasks: Advanced Cross-Lingual Information Access,
Proceedings of NTCIR-8, 2010.
⢠Sakai, T., Shima, H., Kando, N., Song, R., Lin, C.-J., Mitamura, T., Sugimoto, M. and Lee, C.-W.:
Overview of NTCIR-8 ACLIA IR4QA, Proceedings of NTCIR-8, 2010.
⢠Gey, F., Larson, R., Kando, N., Machado, J. and Sakai, T.: NTCIR-GeoTime Overview: Evaluating
Geographic and Temporal Search, Proceedings of NTCIR-8, 2010.
⢠Ishikawa, D., Sakai, T. and Kando, N.: Overview of the NTCIR-8 Community QA Pilot Task (Part I):
The Test Collection and the Task, Proceedings of NTCIR-8, 2010.
⢠Sakai, T., Ishikawa, D. and Kando, N.: Overview of the NTCIR-8 Community QA Pilot Task (Part II):
System Evaluation, Proceedings of NTCIR-8, 2010.
⢠Song, Y.-I., Liu, J., Sakai, T., Wang, X.-J., Feng, G., Cao, Y., Suzuki, H. and Lin, C.-Y.: Microsoft
Research Asia with Redmond at the NTCIR-8 Community QA Pilot Task, Proceedings of NTCIR-8,
2010.
D-measures
9 papers
10. NTCIR-9 (2011)
⢠Ishikawa, D., Kando, N. and Sakai, T.: What Makes a Good Answer in Community
Question Answering? An Analysis of Assessors' Criteria, Proceedings of EVIA
2011.
⢠Song, R., Zhang, M., Sakai, T., Kato, M.P., Liu, Y., Sugimoto, M., Wang, Q. and Orii,
N.: Overview of the NTCIR-9 INTENT Task, Proceedings of NTCIR-9, 2011.
⢠Sakai, T., Kato, M.P. and Song, Y.-I.: Overview of NTCIR-9 1CLICK, Proceedings of
NTCIR-9, 2011.
⢠Orii, N., Song, Y.-I. and Sakai, T.: Microsoft Research Asia at the NTCIR-9 1CLICK
Task, Proceedings of NTCIR-9, 2011.
⢠Han, J., Wang, Q., Orii, N., Dou, Z., Sakai. T. and Song, R.: Microsoft Research
Asia at the NTCIR-9 Intent Task, Proceedings of NTCIR-9, 2011.
⢠Morita, H., Makino, T., Sakai, T., Takamura, H. and Okumura, M.: TTOKU
Summarization Based Systems at NTCIR-9 1CLICK Task, Proceedings of NTCIR-9,
2011.
⢠Joho, H. and Sakai, T.: Grid-based Interaction for NTCIR-9 VisEx Task, Proceedings
of NTCIR-9, 2011.
7 papers
11. NTCIR-10 (2013)
⢠Sakai, T.: The Unreusability of Diversified Search Test
Collections, Proceedings of EVIA 2013.
⢠Sakai, T., Dou, Z., Yamamoto, T., Liu, Y., Zhang, M., Song,
R., Kato, M.P. and Iwata, M.: Overview of the NTCIR-10
INTENT-2 Task, Proceedings of NTCIR-10, 2013.
⢠Kato, M.P., Ekstrand-Abueg, M., Pavlu, V., Sakai, T.,
Yamamoto, T. and Iwata, M.: Overview of the NTCIR-10
1CLICK-2 Task, Proceedings of NTCIR-10, 2013.
⢠Tsukuda, K., Dou, Z. and Sakai, T.: Microsoft Research
Asia at the NTCIR-10 Intent Task, Proceedigns of NTCIR-
10, 2013.
⢠Narita, K., Sakai, T., Dou, Z. and Song, Y.-I.: MSRA at
NTCIR-10 1CLICK-2, Proceedings of NTCIR-10, 2013.
5 papers
12. NTCIR-11 (2014)
⢠Sakai, T.: Topic Set Size Design with Variance
Estimates from Two-Way ANOVA, Proceedings of
EVIA 2014.
⢠Kato, M.P., Ekstrand-Abueg, M., Pavlu, V., Sakai, T.,
Yamamoto, T. and Iwata, M.: Overview of the
NTCIR-11 MobileClick Task, Proceedings of NTCIR-
11, 2014.
Joined Waseda in September 2013
2 papers
13. NTCIR-12 (2016)
⢠Sakai, T. and Shang, L: On Estimating Variances for Topic Set Size Design, Proceedings of EVIA
2016.
⢠Kato, M.P., Pavlu, V., Sakai, T., Yamamoto, T. and Morita, H.: Two-layered Summaries for Mobile
Search: Does the Evaluation Measure Reflect User Preferences?, Proceedings of EVIA 2016.
⢠Shang, L., Sakai, T., Lu, Z., Li, H., Higashinaka, R. and Miyao, Y.: Overview of the NTCIR-12 Short
Text Conversation Task, Proceedings of NTCIR-12, 2016.
⢠Kato, M.P., Sakai, T., Yamamoto, T., Pavlu, V., Morita, H. and Fujita, S.: Overview of the NTCIR-12
MobileClick Task, Proceedings of NTCIR-12, 2016.
⢠Nanba, H., Sakai, T., Kando, N., Keyaki, A., Eguchi, K., Hatano, K., Shimizu, T., Hirate, Y. and Fujii,
A.: NEXTI at NTCIR-12 IMine-2 Task, Proceedings of NTCIR-12, 2016.
⢠Higuchi, S. and Sakai, T.: SLQAL at the NTCIR-12 QALab-2 Task, Proceedings of NTCIR-12, 2016.
⢠Denawa, H., Sano, T., Kadotami, Y., Kato, S. and Sakai, T.: SLSTC at the NTCIR-12 STC Task,
Proceedings of NTCIR-12, 2016.
⢠Iijima, S. and Sakai, T.: SLLL at the NTCIR-12 Lifelog Task: Sleepflower and the LIT Subtask,
Proceedings of NTCIR-12
My students' debut at NTCIR
8 papers
14. NTCIR-13 (2017)
⢠Shang, L., Sakai, T., Li, H., Higashinaka, R., Miyao, Y., Arase, Y., and Nomoto,M.: Overview of the NTCIR-13 Short
Text Conversation Task, Proceedings of NTCIR-13, 2017.
⢠Luo, C., Sakai, T., Liu, Y., Dou, Z., Xiong, C., and Xu, J.: Overview of the NTCIR-13 We Want Web Task,
Proceedings of NTCIR-13, 2017.
⢠Kashimura, R. and Sakai, T.: SLOLQ at the NTCIR-13 OpenLiveQ Task, Proceedings of NTCIR-13, 2017.
⢠Sato, K. and Sakai, T.: SLQAL at the NTCIR-13 QA Lab-3 Task, Proceedings of NTCIR-13, 2017.
⢠Guan, J. and Sakai, T.: SLSTC at the NTCIR-13 STC Task, Proceedings of NTCIR-13, 2017.
⢠Xiao, P., Li, L., Fan, Y., and Sakai, T.: SLWWW at the NTCIR-13 WWW Task, Proceedings of NTCIR-13, 2017.
⢠Zeng, Z., Luo, C., Shang, L., Li, H., and Sakai, T.: Test Collections and Measures for Evaluating Customer-
Helpdesk Dialogues, Proceedings of EVIA 2017.
⢠Sakai, T.: Evaluating Evaluation Measures with Worst-Case Confidence Interval Widths, Proceedings of EVIA
2017.
⢠Sakai, T.: Towards Automatic Evaluation of Multi-Turn Dialogues: A Task Design that Leverages Inherently
Subjective Annotations, Proceedings of EVIA 2017.
⢠Sakai, T.: The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and
Students, Proceedings of EVIA 2017.
⢠Sakai, T.: Unanimity-Aware Gain for Highly Subjective Assessments, Proceedings of EVIA 2017.
11 papers
15. NTCIR-14 (2019)
⢠Sakai, T., Ferro, N., Soboroff, I., Zeng, Z., Xiao, P., and Maistro,
M.: Overview of the NTCIR-14 CENTRE Task, Proceedings of
NTCIR-14, 2019.
⢠Mao, J., Sakai, T., Luo, C., Xiao, P., Liu, Y., and Dou, Z.:
Overview of the NTCIR-14 We Want Web Task, Proceedings
of NTCIR-14, 2019.
⢠Zeng, Z., Kato, S., and Sakai, T.: Overview of the NTCIR-14
Short Text Conversation Task: Dialogue Quality and Nugget
Detection Subtasks, Proceedings of NTCIR-14, 2019.
⢠Kato, S., Suzuki, R., Zeng, Z., and Sakai, T.: SLSTC at the
NTCIR-14 STC-3 Dialogue Quality and Nugget Detection
Subtasks, Proceedings of NTCIR-14, 2019.
⢠Xiao, P. and Sakai, T.: SLWWW at the NTCIR-14 We Want
Web Task, Proceedings of NTCIR-14, 2019.
For the first time, I don't have a paper at EVIA!
5 papers?
16. Or so I thought...
⢠Oard, D.W., Sakai, T., and Kando, N.: Celebrating 20
Years of NTCIR: The Book, Proceedings of EVIA 2019.
17. TALK OUTLINE
1. NTCIR and me
2. Survey of NTCIR overviews (1999-2019)
3. Q-measures etc.
4. D-measures etc.
5. Beyond graded relevance
6. Summary
18. [Harman05] (The TREC book)
"Relevance was defined within the task of the information analyst, with TREC assessors instructed to judge a document relevant if information from that document would be used in some manner for the writing of a report on the subject of the topic. This also implies the use of binary relevance judgments;"
19. NTCIR overviews (1999-2019)
survey method
⢠Examined all overview papers (for tasks that
involved ranked retrieval only)
⢠Examined how many relevance levels were used
and how they were obtained in each task (ALL
NTCIR retrieval tasks use graded relevance levels!)
⢠Examined whether graded relevance measures
were used to evaluate the participating systems.
20. IF you want (a) > (b) > (c), then you should use graded relevance measures.
(a) Relevant / Partially relevant / Partially relevant / Nonrelevant
(b) Partially relevant / Partially relevant / Relevant / Nonrelevant
(c) Nonrelevant / Nonrelevant / Relevant / Nonrelevant
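The preference (a) > (b) > (c) can be checked with a small nDCG sketch; the gain values (Relevant=2, Partially relevant=1, Nonrelevant=0) are an assumption for illustration, not from the slide:

```python
import math

def dcg(gains):
    # DCG with a log2(rank + 1) discount
    return sum(g / math.log2(r + 2) for r, g in enumerate(gains))

def ndcg(gains):
    ideal = sorted(gains, reverse=True)
    return dcg(gains) / dcg(ideal)

# Assumed gain values: Relevant=2, Partially relevant=1, Nonrelevant=0
a = [2, 1, 1, 0]
b = [1, 1, 2, 0]
c = [0, 0, 2, 0]

# Binarising ("partially relevant" -> "relevant") makes (a) and (b) tie,
# while graded nDCG separates all three lists
binarise = lambda gains: [1 if g > 0 else 0 for g in gains]
```

With these gains, nDCG gives roughly 1.0, 0.84 and 0.5 for (a), (b) and (c), whereas the binarised versions of (a) and (b) become indistinguishable.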
28. Normalised Cumulative Utility (2)
[Figure: a ranked list (r = 1, 2, 3, ...) annotated with the stopping probability at each rank r; some users abandon the list at r=1, others at r=3.]
29. Normalised Cumulative Utility (3)
[Figure: for each user group that stops at rank r, measure the utility of the documents seen down to rank r (the utility at r).]
NCU is "expected utility"
30. AP is an NCU (1)
• Suppose R=3 relevant docs are known.
[Figure: retrieved list Nonrelevant, Relevant, Nonrelevant, Nonrelevant, Relevant; the third relevant doc is not retrieved. Stopping probability distribution: uniform over relevant docs, i.e. 33% of users stop at each relevant doc.]
31. AP is an NCU (2)
• Suppose R=3 relevant docs are known.
[Figure: same list; 33% of users stop at each relevant doc.]
Prec(2) = 1/2, Prec(5) = 2/5
AP = ( Prec(2) + Prec(5) + 0 ) / 3 = 0.300
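The expected-utility reading of AP above can be coded directly: each of the R known relevant docs gets stopping probability 1/R, the utility at a stop is Prec(r), and unretrieved relevant docs contribute zero (the function name is mine):

```python
def ap_as_ncu(rels, R):
    # rels: binary relevance flags of the ranked list, top to bottom
    # R: total number of known relevant docs (some may be unretrieved)
    ap = 0.0
    hits = 0
    for r, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += (1.0 / R) * (hits / r)  # stopping prob x Prec(r)
    return ap

# The slide's example: relevant docs at ranks 2 and 5, one unretrieved
ap_as_ncu([0, 1, 0, 0, 1], R=3)  # (1/2 + 2/5) / 3 = 0.300
```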
32. Q-measure is an NCU (1)
• Suppose R=3 relevant docs (1 highly relevant with gain 3, 2 partially relevant with gain 1) are known.
[Figure: retrieved list Nonrelevant, Highly rel: 3, Nonrelevant, Nonrelevant, Partially rel: 1; the other partially relevant doc is not retrieved. Stopping probability distribution: uniform over relevant docs, i.e. 33% of users stop at each relevant doc.]
33. Q-measure is an NCU (2)
• Suppose R=3 relevant docs (1 highly relevant with gain 3, 2 partially relevant with gain 1) are known.
[Figure: same list; 33% of users stop at each relevant doc.]
BR(2) = 4/6, BR(5) = 6/10
Q = ( BR(2) + BR(5) + 0 ) / 3 = 0.422
Q generalizes AP by using the Blended Ratio instead of Prec as Utility
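A minimal sketch of the Q-measure computation above (β=1; function name mine, not from the paper):

```python
def q_measure(gains, ideal_gains, beta=1.0):
    # gains: per-rank gain of the ranked list (0 for nonrelevant)
    # ideal_gains: gains of all R known relevant docs, sorted in
    #              decreasing order (i.e. the ideal ranked list)
    R = len(ideal_gains)
    cg = cig = 0.0   # cumulative gain / cumulative ideal gain
    count = 0        # relevant docs seen so far, C(r)
    q = 0.0
    for r, g in enumerate(gains, start=1):
        cg += g
        if r <= R:
            cig += ideal_gains[r - 1]
        if g > 0:
            count += 1
            br = (count + beta * cg) / (r + beta * cig)  # Blended Ratio
            q += br / R   # uniform stopping probability over relevant docs
    return q

# Slide example: highly relevant (gain 3) at rank 2, partially relevant
# (gain 1) at rank 5; a second partially relevant doc is not retrieved
q_measure([0, 3, 0, 0, 1], [3, 1, 1])  # (4/6 + 6/10 + 0) / 3 = 0.422
```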
36. Patience parameter β of BR (binary relevance environment)
Let r1 be the rank of the 1st relevant doc:
r1 <= R ⇒ BR(r1) = (1+β)/(r1+βr1) = 1/r1
r1 > R ⇒ BR(r1) = (1+β)/(r1+βR)
Large β ⇒ more tolerance to relevant docs at low ranks
[Figure: BR(r1) plotted over r1 = 1, ..., 20 for β = 0.1, 1, 10 with R=5; the curves coincide (1/r1) up to r1 = R and diverge beyond it, larger β giving higher values.]
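The formulas above can be verified numerically: for r1 <= R the β terms cancel, so the curves only diverge once the first relevant doc arrives below rank R (helper name is mine):

```python
def br_first_hit(r1, R, beta):
    # Blended Ratio at rank r1 of the first relevant doc under binary
    # relevance: the ideal cumulative gain at r1 is min(r1, R)
    return (1 + beta) / (r1 + beta * min(r1, R))

# R=5: identical (1/r1) for any beta while r1 <= R, but at r1 = 10
# a larger beta is more tolerant of the late arrival
[round(br_first_hit(10, 5, b), 3) for b in (0.1, 1, 10)]  # [0.105, 0.133, 0.183]
```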
37. TALK OUTLINE
1. NTCIR and me
2. Survey of NTCIR overviews (1999-2019)
3. Q-measures etc.
4. D-measures etc.
5. Beyond graded relevance
6. Summary
38. Diversified search
• Given an ambiguous/underspecified query, produce a single Search Engine Result Page that satisfies different user intents!
• Challenge: balancing relevance and diversity
[Figure: a SERP (Search Engine Result Page) annotated with competing goals: highly relevant docs near the top; cover many intents; give more space to popular intents? give more space to informational intents?]
39. Approaches to evaluating diversified search
• α-nDCG [Clarke+SIGIR08]
• Intent-Aware measures [Agrawal+WSDM09, Chapelle+IR11]
(1) Compute a measure for each intent
(2) Combine the measures using intent probabilities as weights
• D(#)-measures [Sakai+EVIA10, Sakai+SIGIR11]
(1) Combine intentwise graded relevance with intent probabilities to compute the gain of each document
(2) Construct an ideal list based on the gain, and then compute a graded relevance measure based on it
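Steps (1) and (2) of the D-measure family can be sketched as follows. This is a simplified illustration (function name mine): it builds the ideal list from the retrieved documents only, whereas the actual D-nDCG constructs it from all judged documents in the pool:

```python
import math

def d_ndcg(per_intent_gains, intent_probs):
    # per_intent_gains[i][r]: gain of the doc at rank r w.r.t. intent i
    # intent_probs[i]: estimated probability of intent i given the query
    n = len(per_intent_gains[0])
    # Step (1): global gain of each ranked doc
    gg = [sum(p * gains[r] for p, gains in zip(intent_probs, per_intent_gains))
          for r in range(n)]
    # Step (2): ideal list sorted by global gain, then plain nDCG on GG
    discount = lambda seq: sum(g / math.log2(r + 2) for r, g in enumerate(seq))
    idcg = discount(sorted(gg, reverse=True))
    return discount(gg) / idcg if idcg > 0 else 0.0

# Two intents (70% / 30%), three docs: doc 1 serves intent 1,
# doc 3 serves intent 2, doc 2 partially serves both
d_ndcg([[2, 1, 0], [0, 1, 2]], [0.7, 0.3])  # 1.0 (already ideally ordered)
```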
46. TALK OUTLINE
1. NTCIR and me
2. Survey of NTCIR overviews (1999-2019)
3. Q-measures etc.
4. D-measures etc.
5. Beyond graded relevance
6. Summary
47. Current approaches: gold relevance labels
[Figure: for each document, assessors' diverse ratings on a 0-1 scale are collapsed into a single final relevance grade, e.g. 0.5.]
48. New approaches: gold distributions
[Figure: for each document, assessors' diverse ratings on a 0-1 scale are kept as distributions.]
Use the distributions directly for evaluation! The gold data preserves the diverse views of users.
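One way to use gold distributions directly is to score a system by the distance between its estimated rating distribution and the gold distribution. The sketch below uses the match distance (the L1 distance between cumulative distributions over ordered rating categories), in the spirit of the distribution-based measures used at the STC-3 Dialogue Quality subtask; the toy numbers are illustrative only:

```python
def match_distance(p, q):
    # Match distance between two distributions over ordered rating
    # categories = L1 distance between their cumulative distributions
    cp = cq = dist = 0.0
    for pi, qi in zip(p, q):
        cp += pi
        cq += qi
        dist += abs(cp - cq)
    return dist

# Ratings on an ordered scale (0, 0.5, 1): half the assessors said 0,
# half said 1. Collapsing to a single averaged label of 0.5 hides the split.
gold_dist = [0.5, 0.0, 0.5]
collapsed = [0.0, 1.0, 0.0]   # point mass at the averaged label
match_distance(gold_dist, collapsed)  # 1.0: the collapse loses information
```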
49. Please see the STC-3 overview AND
https://waseda.box.com/SIGIR2018preprint
50. TALK OUTLINE
1. NTCIR and me
2. Survey of NTCIR overviews (1999-2019)
3. Q-measures etc.
4. D-measures etc.
5. Beyond graded relevance
6. Summary
51. Summary
• Survey of NTCIR ranked retrieval tasks (1999-2019): most of them utilise graded relevance measures, but not all.
• If relevance grades are important for your task, graded relevance measures should be used. Converting graded relevance to binary relevance is inadequate.
• Beyond relevance labels: utilise gold distributions that preserve diverse views.
• THE NTCIR BOOK WILL BE OUT IN 2020 FROM SPRINGER!