3. TALK OUTLINE
1. NTCIR and me
2. Survey of NTCIR overviews (1999-2019)
3. Q-measures etc.
4. D-measures etc.
5. Beyond graded relevance
6. Summary
4. NTCIR-1, -2, -3 (1999-2003)
⢠Sakai, T., Shibazaki, Y., Suzuki, M., Kajiura, M.,
Manabe, T. and Sumita, K.: Cross-Language
Information Retrieval for NTCIR at Toshiba,
Proceedings of NTCIR-1, 1999.
⢠Sakai, T., Robertson, S.E. and Walker, S.: Flexible
Pseudo-Relevance Feedback for NTCIR-2,
Proceedings of NTCIR-2, 2001.
⢠Sakai, T., Koyama, M., Suzuki, M. and Manabe, T.:
Toshiba KIDS at NTCIR-3: Japanese and English-
Japanese IR, Proceedings of NTCIR-3, 2003.
1 paper per NTCIR
5. NTCIR-4 (2004)
⢠Sakai, T., Koyama, M., Kumano, A. and Manabe, T.:
Toshiba BRIDJE at NTCIR-4 CLIR:
Monolingual/Bilingual IR and Flexible Feedback,
Proceedings of NTCIR-4, 2004.
⢠Sakai, T., Saito, Y., Ichimura, Y., Koyama, M. and
Kokubu, T.: Toshiba ASKMi at NTCIR-4 QAC2,
Procedings of NTCIR-4, 2004.
⢠Sakai, T.: New Performance Metrics based on
Multigrade Relevance: Their Application to
Question Answering, Proceedings of NTCIR-4
Proceedings (Open Submission Session), 2004.
Q-measure
This later evolved into EVIA
3 papers
6. NTCIR-5 (2005)
⢠Kokubu, T., Sakai, T., Saito, Y., Tsutsui, H., Manabe, T.,
Koyama, M. and Fujii, H.: The Relationship between
Answer Ranking and User Satisfaction in a Question
Answering System, Proceedings of NTCIR-5 (Open
Submission Session), 2005.
⢠Sakai, T.: The Effect of Topic Sampling on Sensitivity
Comparisons of Information Retrieval Metrics,
Proceedings of NTCIR-5 (Open Submission Session),
2005.
⢠Sakai, T., Manabe, T., Kumano, A., Koyama, M. and
Kokubu, T.: Toshiba BRIDJE at NTCIR-5: Evaluation using
Geometric Means, Proceedings of NTCIR-5, 2005.
3 papers
7. NTCIR-6 (2007)
⢠Sakai, T.: On Penalising Late Arrival of Relevant
Documents in Information Retrieval Evaluation with
Graded Relevance, Proceedings of EVIA 2007.
⢠Sakai, T.: User Satisfaction Task: A Proposal for
NTCIR-7, Proceedings of EVIA 2007.
⢠Sakai, T., Koyama, M., Izuha, T., Kumano, A.,
Manabe, T. and Kokubu, T.: Toshiba BRIDJE at
NTCIR-6 CLIR: The Head/Lead Method and Graded
Relevance Feedback, Proceedings of NTCIR-6, 2007.
3 papers
8. NTCIR-7 (2008)
⢠Sakai, T. and Robertson, S.: Modelling A User Population for
Designing Information Retrieval Metrics, Proceedings of
EVIA 2008.
⢠Sakai, T. and Kando, N.: Are Popular Documents More Likely
To Be Relevant? A Dive into the ACLIA IR4QA Pools,
Proceedings of EVIA 2008.
⢠Mitamura, T., Nyberg, E., Shima, H., Kato, T., Mori, T., Lin, C.-
Y., Song, R., Lin, C.-J., Sakai, T., Ji, D. and Kando, N.: Overview
of the NTCIR-7 ACLIA Tasks: Advanced Cross-Lingual
Information Access, Proceedings of NTCIR-7, 2008.
⢠Sakai, T., Kando, N., Lin, C.-J., Mitamura, T., Shima, H., Ji, D.,
Chen, K.-H., and Nyberg, E.: Overview of the NTCIR-7 ACLIA
IR4QA Task, Proceedings of NTCIR-7, 2008.
NCU
Debut as a task organiser
4 papers
9. NTCIR-8 (2010)
⢠Song, R., Qi, D., Liu, H., Sakai, T., Nie, J.-Y., Hon, H.-W. and Yu, Y.: Constructing a Test Collection
with Multi-Intent Queries, Proceedings of EVIA 2010.
⢠Sakai, T., Craswell, N., Song, R., Robertson, S., Dou, Z. and Lin, C.-Y.: Simple Evaluation Metrics for
Diversified Search Results, Proceedings of EVIA 2010.
⢠Sakai, T. and Lin, C.-Y.: Ranking Retrieval Systems without Relevance Assessments ? Revisited,
Proceedings of EVIA 2010.
⢠Mitamura, T., Shima, H., Sakai, T., Kando, N., Mori, T., Takeda, K., Lin, C.-Y., Song, R., Lin, C.-J. and
Lee, C.-W.: Overview of the NTCIR-8 ACLIA Tasks: Advanced Cross-Lingual Information Access,
Proceedings of NTCIR-8, 2010.
⢠Sakai, T., Shima, H., Kando, N., Song, R., Lin, C.-J., Mitamura, T., Sugimoto, M. and Lee, C.-W.:
Overview of NTCIR-8 ACLIA IR4QA, Proceedings of NTCIR-8, 2010.
⢠Gey, F., Larson, R., Kando, N., Machado, J. and Sakai, T.: NTCIR-GeoTime Overview: Evaluating
Geographic and Temporal Search, Proceedings of NTCIR-8, 2010.
⢠Ishikawa, D., Sakai, T. and Kando, N.: Overview of the NTCIR-8 Community QA Pilot Task (Part I):
The Test Collection and the Task, Proceedings of NTCIR-8, 2010.
⢠Sakai, T., Ishikawa, D. and Kando, N.: Overview of the NTCIR-8 Community QA Pilot Task (Part II):
System Evaluation, Proceedings of NTCIR-8, 2010.
⢠Song, Y.-I., Liu, J., Sakai, T., Wang, X.-J., Feng, G., Cao, Y., Suzuki, H. and Lin, C.-Y.: Microsoft
Research Asia with Redmond at the NTCIR-8 Community QA Pilot Task, Proceedings of NTCIR-8,
2010.
D-measures
9 papers
10. NTCIR-9 (2011)
⢠Ishikawa, D., Kando, N. and Sakai, T.: What Makes a Good Answer in Community
Question Answering? An Analysis of Assessors' Criteria, Proceedings of EVIA
2011.
⢠Song, R., Zhang, M., Sakai, T., Kato, M.P., Liu, Y., Sugimoto, M., Wang, Q. and Orii,
N.: Overview of the NTCIR-9 INTENT Task, Proceedings of NTCIR-9, 2011.
⢠Sakai, T., Kato, M.P. and Song, Y.-I.: Overview of NTCIR-9 1CLICK, Proceedings of
NTCIR-9, 2011.
⢠Orii, N., Song, Y.-I. and Sakai, T.: Microsoft Research Asia at the NTCIR-9 1CLICK
Task, Proceedings of NTCIR-9, 2011.
⢠Han, J., Wang, Q., Orii, N., Dou, Z., Sakai. T. and Song, R.: Microsoft Research
Asia at the NTCIR-9 Intent Task, Proceedings of NTCIR-9, 2011.
⢠Morita, H., Makino, T., Sakai, T., Takamura, H. and Okumura, M.: TTOKU
Summarization Based Systems at NTCIR-9 1CLICK Task, Proceedings of NTCIR-9,
2011.
⢠Joho, H. and Sakai, T.: Grid-based Interaction for NTCIR-9 VisEx Task, Proceedings
of NTCIR-9, 2011.
7 papers
11. NTCIR-10 (2013)
⢠Sakai, T.: The Unreusability of Diversified Search Test
Collections, Proceedings of EVIA 2013.
⢠Sakai, T., Dou, Z., Yamamoto, T., Liu, Y., Zhang, M., Song,
R., Kato, M.P. and Iwata, M.: Overview of the NTCIR-10
INTENT-2 Task, Proceedings of NTCIR-10, 2013.
⢠Kato, M.P., Ekstrand-Abueg, M., Pavlu, V., Sakai, T.,
Yamamoto, T. and Iwata, M.: Overview of the NTCIR-10
1CLICK-2 Task, Proceedings of NTCIR-10, 2013.
⢠Tsukuda, K., Dou, Z. and Sakai, T.: Microsoft Research
Asia at the NTCIR-10 Intent Task, Proceedigns of NTCIR-
10, 2013.
⢠Narita, K., Sakai, T., Dou, Z. and Song, Y.-I.: MSRA at
NTCIR-10 1CLICK-2, Proceedings of NTCIR-10, 2013.
5 papers
12. NTCIR-11 (2014)
⢠Sakai, T.: Topic Set Size Design with Variance
Estimates from Two-Way ANOVA, Proceedings of
EVIA 2014.
⢠Kato, M.P., Ekstrand-Abueg, M., Pavlu, V., Sakai, T.,
Yamamoto, T. and Iwata, M.: Overview of the
NTCIR-11 MobileClick Task, Proceedings of NTCIR-
11, 2014.
Joined Waseda in September 2013
2 papers
13. NTCIR-12 (2016)
⢠Sakai, T. and Shang, L: On Estimating Variances for Topic Set Size Design, Proceedings of EVIA
2016.
⢠Kato, M.P., Pavlu, V., Sakai, T., Yamamoto, T. and Morita, H.: Two-layered Summaries for Mobile
Search: Does the Evaluation Measure Reflect User Preferences?, Proceedings of EVIA 2016.
⢠Shang, L., Sakai, T., Lu, Z., Li, H., Higashinaka, R. and Miyao, Y.: Overview of the NTCIR-12 Short
Text Conversation Task, Proceedings of NTCIR-12, 2016.
⢠Kato, M.P., Sakai, T., Yamamoto, T., Pavlu, V., Morita, H. and Fujita, S.: Overview of the NTCIR-12
MobileClick Task, Proceedings of NTCIR-12, 2016.
⢠Nanba, H., Sakai, T., Kando, N., Keyaki, A., Eguchi, K., Hatano, K., Shimizu, T., Hirate, Y. and Fujii,
A.: NEXTI at NTCIR-12 IMine-2 Task, Proceedings of NTCIR-12, 2016.
⢠Higuchi, S. and Sakai, T.: SLQAL at the NTCIR-12 QALab-2 Task, Proceedings of NTCIR-12, 2016.
⢠Denawa, H., Sano, T., Kadotami, Y., Kato, S. and Sakai, T.: SLSTC at the NTCIR-12 STC Task,
Proceedings of NTCIR-12, 2016.
⢠Iijima, S. and Sakai, T.: SLLL at the NTCIR-12 Lifelog Task: Sleepflower and the LIT Subtask,
Proceedings of NTCIR-12
My students' debut at NTCIR
8 papers
14. NTCIR-13 (2017)
⢠Shang, L., Sakai, T., Li, H., Higashinaka, R., Miyao, Y., Arase, Y., and Nomoto,M.: Overview of the NTCIR-13 Short
Text Conversation Task, Proceedings of NTCIR-13, 2017.
⢠Luo, C., Sakai, T., Liu, Y., Dou, Z., Xiong, C., and Xu, J.: Overview of the NTCIR-13 We Want Web Task,
Proceedings of NTCIR-13, 2017.
⢠Kashimura, R. and Sakai, T.: SLOLQ at the NTCIR-13 OpenLiveQ Task, Proceedings of NTCIR-13, 2017.
⢠Sato, K. and Sakai, T.: SLQAL at the NTCIR-13 QA Lab-3 Task, Proceedings of NTCIR-13, 2017.
⢠Guan, J. and Sakai, T.: SLSTC at the NTCIR-13 STC Task, Proceedings of NTCIR-13, 2017.
⢠Xiao, P., Li, L., Fan, Y., and Sakai, T.: SLWWW at the NTCIR-13 WWW Task, Proceedings of NTCIR-13, 2017.
⢠Zeng, Z., Luo, C., Shang, L., Li, H., and Sakai, T.: Test Collections and Measures for Evaluating Customer-
Helpdesk Dialogues, Proceedings of EVIA 2017.
⢠Sakai, T.: Evaluating Evaluation Measures with Worst-Case Confidence Interval Widths, Proceedings of EVIA
2017.
⢠Sakai, T.: Towards Automatic Evaluation of Multi-Turn Dialogues: A Task Design that Leverages Inherently
Subjective Annotations, Proceedings of EVIA 2017.
⢠Sakai, T.: The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and
Students, Proceedings of EVIA 2017.
⢠Sakai, T.: Unanimity-Aware Gain for Highly Subjective Assessments, Proceedings of EVIA 2017.
11 papers
15. NTCIR-14 (2019)
⢠Sakai, T., Ferro, N., Soboroff, I., Zeng, Z., Xiao, P., and Maistro,
M.: Overview of the NTCIR-14 CENTRE Task, Proceedings of
NTCIR-14, 2019.
⢠Mao, J., Sakai, T., Luo, C., Xiao, P., Liu, Y., and Dou, Z.:
Overview of the NTCIR-14 We Want Web Task, Proceedings
of NTCIR-14, 2019.
⢠Zeng, Z., Kato, S., and Sakai, T.: Overview of the NTCIR-14
Short Text Conversation Task: Dialogue Quality and Nugget
Detection Subtasks, Proceedings of NTCIR-14, 2019.
⢠Kato, S., Suzuki, R., Zeng, Z., and Sakai, T.: SLSTC at the
NTCIR-14 STC-3 Dialogue Quality and Nugget Detection
Subtasks, Proceedings of NTCIR-14, 2019.
⢠Xiao, P. and Sakai, T.: SLWWW at the NTCIR-14 We Want
Web Task, Proceedings of NTCIR-14, 2019.
For the first time, I don't have a paper at EVIA!
5 papers?
16. Or so I thought...
⢠Oard, D.W., Sakai, T., and Kando, N.: Celebrating 20
Years of NTCIR: The Book, Proceedings of EVIA 2019.
17. TALK OUTLINE
1. NTCIR and me
2. Survey of NTCIR overviews (1999-2019)
3. Q-measures etc.
4. D-measures etc.
5. Beyond graded relevance
6. Summary
18. [Harman05] (The TREC book)
"Relevance was defined within the task of the information analyst, with TREC assessors instructed to judge a document relevant if information from that document would be used in some manner for the writing of a report on the subject of the topic. This also implies the use of binary relevance judgments;"
19. NTCIR overviews (1999-2019)
survey method
⢠Examined all overview papers (for tasks that
involved ranked retrieval only)
⢠Examined how many relevance levels were used
and how they were obtained in each task (ALL
NTCIR retrieval tasks use graded relevance levels!)
⢠Examined whether graded relevance measures
were used to evaluate the participating systems.
20. IF you want (a) > (b) > (c), then you should use graded relevance measures.
(a) Relevant / Partially relevant / Partially relevant / Nonrelevant
(b) Partially relevant / Partially relevant / Relevant / Nonrelevant
(c) Nonrelevant / Nonrelevant / Relevant / Nonrelevant
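The preference (a) > (b) > (c) can be checked with a small nDCG sketch; the gain values (Relevant=2, Partially relevant=1, Nonrelevant=0) are an assumption for illustration, not from the slide:

```python
import math

def dcg(gains):
    # DCG with a log2(rank + 1) discount
    return sum(g / math.log2(r + 2) for r, g in enumerate(gains))

def ndcg(gains):
    ideal = sorted(gains, reverse=True)
    return dcg(gains) / dcg(ideal)

# Assumed gain values: Relevant=2, Partially relevant=1, Nonrelevant=0
a = [2, 1, 1, 0]
b = [1, 1, 2, 0]
c = [0, 0, 2, 0]

# Binarising ("partially relevant" -> "relevant") makes (a) and (b) tie,
# while graded nDCG separates all three lists
binarise = lambda gains: [1 if g > 0 else 0 for g in gains]
```

With these gains, nDCG gives roughly 1.0, 0.84 and 0.5 for (a), (b) and (c), whereas the binarised versions of (a) and (b) become indistinguishable.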
28. Normalised Cumulative Utility (2)
[Figure: a ranked list (r = 1, 2, 3, ...) annotated with the stopping probability at each rank r; some users abandon the list at r=1, others at r=3.]
29. Normalised Cumulative Utility (3)
[Figure: for each user group that stops at rank r, measure the utility of the documents seen down to rank r (the utility at r).]
NCU is "expected utility"
30. AP is an NCU (1)
• Suppose R=3 relevant docs are known.
[Figure: retrieved list Nonrelevant, Relevant, Nonrelevant, Nonrelevant, Relevant; the third relevant doc is not retrieved. Stopping probability distribution: uniform over relevant docs, i.e. 33% of users stop at each relevant doc.]
31. AP is an NCU (2)
• Suppose R=3 relevant docs are known.
[Figure: same list; 33% of users stop at each relevant doc.]
Prec(2) = 1/2, Prec(5) = 2/5
AP = ( Prec(2) + Prec(5) + 0 ) / 3 = 0.300
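The expected-utility reading of AP above can be coded directly: each of the R known relevant docs gets stopping probability 1/R, the utility at a stop is Prec(r), and unretrieved relevant docs contribute zero (the function name is mine):

```python
def ap_as_ncu(rels, R):
    # rels: binary relevance flags of the ranked list, top to bottom
    # R: total number of known relevant docs (some may be unretrieved)
    ap = 0.0
    hits = 0
    for r, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += (1.0 / R) * (hits / r)  # stopping prob x Prec(r)
    return ap

# The slide's example: relevant docs at ranks 2 and 5, one unretrieved
ap_as_ncu([0, 1, 0, 0, 1], R=3)  # (1/2 + 2/5) / 3 = 0.300
```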
32. Q-measure is an NCU (1)
• Suppose R=3 relevant docs (1 highly relevant with gain 3, 2 partially relevant with gain 1) are known.
[Figure: retrieved list Nonrelevant, Highly rel: 3, Nonrelevant, Nonrelevant, Partially rel: 1; the other partially relevant doc is not retrieved. Stopping probability distribution: uniform over relevant docs, i.e. 33% of users stop at each relevant doc.]
33. Q-measure is an NCU (2)
• Suppose R=3 relevant docs (1 highly relevant with gain 3, 2 partially relevant with gain 1) are known.
[Figure: same list; 33% of users stop at each relevant doc.]
BR(2) = 4/6, BR(5) = 6/10
Q = ( BR(2) + BR(5) + 0 ) / 3 = 0.422
Q generalizes AP by using the Blended Ratio instead of Prec as Utility
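A minimal sketch of the Q-measure computation above (β=1; function name mine, not from the paper):

```python
def q_measure(gains, ideal_gains, beta=1.0):
    # gains: per-rank gain of the ranked list (0 for nonrelevant)
    # ideal_gains: gains of all R known relevant docs, sorted in
    #              decreasing order (i.e. the ideal ranked list)
    R = len(ideal_gains)
    cg = cig = 0.0   # cumulative gain / cumulative ideal gain
    count = 0        # relevant docs seen so far, C(r)
    q = 0.0
    for r, g in enumerate(gains, start=1):
        cg += g
        if r <= R:
            cig += ideal_gains[r - 1]
        if g > 0:
            count += 1
            br = (count + beta * cg) / (r + beta * cig)  # Blended Ratio
            q += br / R   # uniform stopping probability over relevant docs
    return q

# Slide example: highly relevant (gain 3) at rank 2, partially relevant
# (gain 1) at rank 5; a second partially relevant doc is not retrieved
q_measure([0, 3, 0, 0, 1], [3, 1, 1])  # (4/6 + 6/10 + 0) / 3 = 0.422
```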
36. Patience parameter β of BR (binary relevance environment)
Let r1 be the rank of the 1st relevant doc:
r1 <= R ⇒ BR(r1) = (1+β)/(r1+βr1) = 1/r1
r1 > R ⇒ BR(r1) = (1+β)/(r1+βR)
Large β ⇒ more tolerance to relevant docs at low ranks
[Figure: BR(r1) plotted over r1 = 1, ..., 20 for β = 0.1, 1, 10 with R=5; the curves coincide (1/r1) up to r1 = R and diverge beyond it, larger β giving higher values.]
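The formulas above can be verified numerically: for r1 <= R the β terms cancel, so the curves only diverge once the first relevant doc arrives below rank R (helper name is mine):

```python
def br_first_hit(r1, R, beta):
    # Blended Ratio at rank r1 of the first relevant doc under binary
    # relevance: the ideal cumulative gain at r1 is min(r1, R)
    return (1 + beta) / (r1 + beta * min(r1, R))

# R=5: identical (1/r1) for any beta while r1 <= R, but at r1 = 10
# a larger beta is more tolerant of the late arrival
[round(br_first_hit(10, 5, b), 3) for b in (0.1, 1, 10)]  # [0.105, 0.133, 0.183]
```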
37. TALK OUTLINE
1. NTCIR and me
2. Survey of NTCIR overviews (1999-2019)
3. Q-measures etc.
4. D-measures etc.
5. Beyond graded relevance
6. Summary
38. Diversified search
• Given an ambiguous/underspecified query, produce a single Search Engine Result Page that satisfies different user intents!
• Challenge: balancing relevance and diversity
[Figure: a SERP (Search Engine Result Page) annotated with competing goals: highly relevant docs near the top; cover many intents; give more space to popular intents? give more space to informational intents?]
39. Approaches to evaluating diversified search
• α-nDCG [Clarke+SIGIR08]
• Intent-Aware measures [Agrawal+WSDM09, Chapelle+IR11]
(1) Compute a measure for each intent
(2) Combine the measures using intent probabilities as weights
• D(#)-measures [Sakai+EVIA10, Sakai+SIGIR11]
(1) Combine intentwise graded relevance with intent probabilities to compute the gain of each document
(2) Construct an ideal list based on the gain, and then compute a graded relevance measure based on it
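Steps (1) and (2) of the D-measure family can be sketched as follows. This is a simplified illustration (function name mine): it builds the ideal list from the retrieved documents only, whereas the actual D-nDCG constructs it from all judged documents in the pool:

```python
import math

def d_ndcg(per_intent_gains, intent_probs):
    # per_intent_gains[i][r]: gain of the doc at rank r w.r.t. intent i
    # intent_probs[i]: estimated probability of intent i given the query
    n = len(per_intent_gains[0])
    # Step (1): global gain of each ranked doc
    gg = [sum(p * gains[r] for p, gains in zip(intent_probs, per_intent_gains))
          for r in range(n)]
    # Step (2): ideal list sorted by global gain, then plain nDCG on GG
    discount = lambda seq: sum(g / math.log2(r + 2) for r, g in enumerate(seq))
    idcg = discount(sorted(gg, reverse=True))
    return discount(gg) / idcg if idcg > 0 else 0.0

# Two intents (70% / 30%), three docs: doc 1 serves intent 1,
# doc 3 serves intent 2, doc 2 partially serves both
d_ndcg([[2, 1, 0], [0, 1, 2]], [0.7, 0.3])  # 1.0 (already ideally ordered)
```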
46. TALK OUTLINE
1. NTCIR and me
2. Survey of NTCIR overviews (1999-2019)
3. Q-measures etc.
4. D-measures etc.
5. Beyond graded relevance
6. Summary
47. Current approaches: gold relevance labels
[Figure: for each document, assessors' diverse ratings on a 0-1 scale are collapsed into a single final relevance grade, e.g. 0.5.]
48. New approaches: gold distributions
[Figure: for each document, assessors' diverse ratings on a 0-1 scale are kept as distributions.]
Use the distributions directly for evaluation! The gold data preserves the diverse views of users.
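One way to use gold distributions directly is to score a system by the distance between its estimated rating distribution and the gold distribution. The sketch below uses the match distance (the L1 distance between cumulative distributions over ordered rating categories), in the spirit of the distribution-based measures used at the STC-3 Dialogue Quality subtask; the toy numbers are illustrative only:

```python
def match_distance(p, q):
    # Match distance between two distributions over ordered rating
    # categories = L1 distance between their cumulative distributions
    cp = cq = dist = 0.0
    for pi, qi in zip(p, q):
        cp += pi
        cq += qi
        dist += abs(cp - cq)
    return dist

# Ratings on an ordered scale (0, 0.5, 1): half the assessors said 0,
# half said 1. Collapsing to a single averaged label of 0.5 hides the split.
gold_dist = [0.5, 0.0, 0.5]
collapsed = [0.0, 1.0, 0.0]   # point mass at the averaged label
match_distance(gold_dist, collapsed)  # 1.0: the collapse loses information
```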
49. Please see the STC-3 overview AND
https://waseda.box.com/SIGIR2018preprint
50. TALK OUTLINE
1. NTCIR and me
2. Survey of NTCIR overviews (1999-2019)
3. Q-measures etc.
4. D-measures etc.
5. Beyond graded relevance
6. Summary
51. Summary
• Survey of NTCIR ranked retrieval tasks (1999-2019): most of them utilise graded relevance measures, but not all.
• If relevance grades are important for your task, graded relevance measures should be used. Converting graded relevance to binary relevance is inadequate.
• Beyond relevance labels: utilise gold distributions that preserve diverse views.
• THE NTCIR BOOK WILL BE OUT IN 2020 FROM SPRINGER!