2. Outline
Introduction
– Introduction to IR
Kinds of evaluation
Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
User-based evaluation
Discussion on Evaluation
Conclusion
3. Introduction
• Why?
– Put a figure on the benefit we get from a system
– Because without evaluation, there is no research
Objective measurements
4. Information Retrieval
“Information retrieval is a field concerned with the structure,
analysis, organization, storage, searching, and retrieval of
information.” (Salton, 1968)
General definition that can be applied to many types of
information and search applications
Primary focus of IR since the 50s has been on text and
documents
8. Information Retrieval
Key insights of/for information retrieval
– text has no meaning
ฉันมีรถสีแดง (Thai for "I have a red car" – opaque unless you read Thai)
– but it is still the most informative source
ฉันมีรถสีฟ้า ("I have a blue car") is more similar to the above than คุณมีรถไฟฟ้า ("you have an electric train")
– text is not random
"I drive a red car" is more probable than
– "I drive a red horse"
– "A red car I drive"
– "Car red a drive I"
– meaning is defined by usage
"I drive a truck" / "I drive a car" / "I drive the bus" → truck / car / bus are similar in meaning
Some terms you will be hearing:
– term frequency (TF), document frequency (DF)
– TF-IDF, BM25 (Best Match 25)
– language models (uni-gram, bi-gram, n-gram)
– statistical semantics (latent semantic analysis, random indexing, deep learning)
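A toy sketch of the first pair of terms, combined into the TF-IDF weight (the corpus, function name, and log base are illustrative assumptions, not from the lecture):

```python
import math

# TF-IDF sketch: the weight of a term in a document grows with its
# term frequency (TF) and shrinks with its document frequency (DF).
docs = [["i", "drive", "a", "red", "car"],
        ["i", "drive", "a", "truck"],
        ["a", "red", "horse"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                           # term frequency
    df = sum(1 for d in docs if term in d)         # document frequency
    idf = math.log(len(docs) / df) if df else 0.0  # inverse document frequency
    return tf * idf

print(tf_idf("red", docs[0], docs))    # in 2 of 3 docs -> lower weight
print(tf_idf("truck", docs[1], docs))  # in 1 of 3 docs -> higher weight
```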
9. Big Issues in IR
Relevance
– What is it?
– Simple (and simplistic) definition: A relevant document contains
the information that a person was looking for when they submitted
a query to the search engine
– Many factors influence a person’s decision about what is relevant:
e.g., task, context, novelty, style
– Topical relevance (same topic) vs. user relevance (everything
else)
10. Relevance
– Retrieval models define a view of relevance
– Ranking algorithms used in search engines are based
on retrieval models
– Most models describe statistical properties of text rather than linguistic ones
i.e. counting simple text features such as words instead of parsing and analyzing the sentences
The statistical approach to text processing started with Luhn in the 1950s
Linguistic features can be part of a statistical model
11. Big Issues in IR
Evaluation
– Experimental procedures and measures for comparing system
output with user expectations
Originated in Cranfield experiments in the 60s
– IR evaluation methods now used in many fields
– Typically use test collection of documents, queries, and relevance
judgments
Most commonly used are TREC collections
– Recall and precision are two examples of effectiveness measures
12. Big Issues in IR
Users and Information Needs
– Search evaluation is user-centered
– Keyword queries are often poor descriptions of actual information
needs
– Interaction and context are important for understanding user intent
– Query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking
13. Introduction
• Why?
– Put a figure on the benefit we get from a system
– Because without evaluation, there is no research
• Why is this a research field in itself?
– Because there are many kinds of IR
• With different evaluation criteria
– Because it’s difficult
• Why?
– Because it involves human subjectivity (document relevance)
– Because of the amount of data involved (who can sit down
and evaluate 1,750,000 documents returned by Google for
‘university vienna’?)
15. Kinds of evaluation
• “Efficient and effective system”
• Time and space: efficiency
– Generally constrained by pre-development specification
• E.g. real-time answers vs. batch jobs
• E.g. index-size constraints
– Easy to measure
• Good results: effectiveness
– Harder to define → hence more research into it
• And…
16. Kinds of evaluation (cont.)
• User studies
– Does a 2% increase in some retrieval performance measure actually
make a user happier?
– Does displaying a text snippet improve usability even if the
underlying method is 10% weaker than some other method?
– Hard to do
– Mostly anecdotal examples
– Many IR people don’t like to do it (though it’s starting to change)
17. Kinds of evaluation (cont.)
Intrinsic
– "internal": the ultimate goal is the retrieved set itself
Extrinsic
– "external": evaluation in the context of the usage of the retrieval tool
19. What to measure in an IR system?
1966, Cleverdon:
1. coverage – the extent to which relevant matter exists in the
system
2. time lag ~ efficiency
3. presentation
4. effort on the part of the user to answer his information
need
5. recall
6. precision
Items 5 and 6 together constitute effectiveness
A desirable measure of retrieval performance would have the following properties: 1, it would be a measure of effectiveness. 2, it would not be confounded by the relative willingness of the system to emit items. 3, it would be a single number – in preference, for example, to a pair of numbers which may co-vary in a loosely specified way, or a curve representing a table of several pairs of numbers. 4, it would allow complete ordering of different performances, and assess the performance of any one system in absolute terms. Given a measure with these properties, we could be confident of having a pure and valid index of how well a retrieval system (or method) were performing the function it was primarily designed to accomplish, and we could reasonably ask questions of the form "Shall we pay X dollars for Y units of effectiveness?" (Swets, 1967)
20. Outline
• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
– Measures
– Test Collections
• User-based evaluation
• Discussion on Evaluation
• Conclusion
22. Retrieval Effectiveness
Precision
– How happy are we with what we've got
Recall
– How much more we could have had
Precision = # relevant documents retrieved / # documents retrieved
Recall = # relevant documents retrieved / # relevant documents
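As a minimal sketch of the two definitions above (function name and document ids are illustrative):

```python
# Set-based precision and recall for a single query.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 5 retrieved documents are relevant; 6 documents are relevant in total.
p, r = precision_recall(["d1", "d2", "d3", "d4", "d5"],
                        ["d1", "d3", "d5", "d7", "d8", "d9"])
print(p, r)  # 0.6 0.5
```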
25. Retrieval effectiveness
What if we don't like this twin-measure approach?
A solution:
– Van Rijsbergen's E-Measure:
  E = 1 − 1 / ( α · (1/precision) + (1−α) · (1/recall) )
– With a special case (α = 1/2): the harmonic mean
  F = 2 · precision · recall / (precision + recall)
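A sketch of both formulas, assuming plain precision/recall floats as input; the 0.1/0.9 example mirrors the editor's note at the end (the arithmetic mean would report a flattering 0.5):

```python
# Van Rijsbergen's E-measure and the harmonic-mean F-measure.
def e_measure(precision, recall, alpha=0.5):
    if precision == 0.0 or recall == 0.0:
        return 1.0  # worst possible effectiveness
    return 1.0 - 1.0 / (alpha / precision + (1.0 - alpha) / recall)

def f_measure(precision, recall):
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

print(f_measure(0.1, 0.9))        # 0.18: the harmonic mean punishes imbalance
print(1.0 - e_measure(0.1, 0.9))  # same value: F = 1 - E when alpha = 0.5
```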
27. Retrieval effectiveness
Tools we need:
– A set of documents (the “dataset”)
– A set of questions/queries/topics
– For each topic, and for each document, a decision: relevant or not
relevant
Let’s assume for the moment that’s all we need and that
we have it
28. Retrieval Effectiveness
• Precision and Recall are generally plotted as a "Precision-Recall curve"
[Figure: precision (y-axis) against recall (x-axis); precision drops as the size of the retrieved set increases]
• They do not play well together
31. Precision-Recall Curves
• How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall-axis
– Repeat for all queries
[Figure: per-query precision-recall points on the precision/recall axes]
32. Precision-Recall Curves
• And the average is the system's P-R curve
[Figure: averaged precision-recall curve; precision decreases as the number of retrieved documents increases]
• We can compare systems by comparing the curves
34. Interpolation
To average graphs, calculate precision at standard recall levels:
  P(R) = max{ P′ : R′ ≥ R ∧ (R′, P′) ∈ S }
– where S is the set of observed (R,P) points
This defines precision at any recall level as the maximum precision observed in any recall-precision point at a higher recall level
– produces a step function
– defines precision at recall 0.0
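A sketch of this interpolation, assuming the 11 standard recall levels 0.0, 0.1, …, 1.0 and a list of observed (recall, precision) points for one query:

```python
# Interpolated precision: at recall level r, take the maximum precision
# observed at any recall >= r (yields the step function described above).
def interpolate(points, levels=None):
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

# Three observed (recall, precision) points from one ranked list.
print(interpolate([(0.2, 1.0), (0.4, 0.67), (0.6, 0.5)]))
```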
36. Average Precision at
Standard Recall Levels
• Recall-precision graph plotted by simply
joining the average precision points at
the standard recall levels
39. Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
• Both P and R imply a cut-off value
– How about graded relevance?
• Some documents may be more relevant to the question than others
– How about ranking?
• Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which not?
40. Single-value measures
• Fix a "reasonable" cutoff
– R-precision: precision at R, where R is the number of relevant documents
• Fix the number of desired documents
– Reciprocal rank (RR): 1 / rank of the first relevant document in the ranked list
• Make it less sensitive to the cutoff: Average precision
– For each query:
  AP = (1/R) · Σ_{i=1..k} P(i) · rel(i)
  where R = # relevant documents, i = rank, k = # retrieved documents, P(i) = precision at rank i, and rel(i) = 1 if the document at rank i is relevant, 0 otherwise
– For each system:
• Compute the mean of these averages: Mean Average Precision (MAP) – one of the most used measures
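A sketch of these single-value measures, assuming binary relevance flags in rank order (names and example inputs are illustrative):

```python
# ranking: list of 0/1 relevance flags in rank order; R: # relevant documents.
def average_precision(ranking, R):
    hits, total = 0, 0.0
    for i, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += hits / i  # P(i) * rel(i)
    return total / R if R else 0.0

def reciprocal_rank(ranking):
    for i, rel in enumerate(ranking, start=1):
        if rel:
            return 1.0 / i     # 1 / rank of the first relevant document
    return 0.0

def r_precision(ranking, R):
    return sum(ranking[:R]) / R if R else 0.0

def mean_average_precision(per_query):  # per_query: [(ranking, R), ...]
    return sum(average_precision(rk, R) for rk, R in per_query) / len(per_query)
```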
41. R-Precision
Precision at the R-th position in the ranking of results for a query that has R relevant documents.

  n   doc#   relevant
  1   588    x
  2   589    x
  3   576
  4   590    x
  5   986
  6   592    x
  7   984
  8   988
  9   578
 10   985
 11   103
 12   591
 13   772    x
 14   990

R = # of relevant docs = 6
R-Precision = 4/6 ≈ 0.67
46. Cumulative Gain
• For each document d, and query q, define
rel(d,q) >= 0
• The higher the value, the more relevant the document is to
the query
• Pitfalls:
– Graded relevance introduces even more ambiguity in practice
– With great flexibility comes great responsibility to justify parameter values
48. Discounted Cumulative Gain
Popular measure for evaluating web search and related
tasks
Two assumptions:
– Highly relevant documents are more useful than marginally relevant documents
– The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined
49. Discounted Cumulative Gain
Uses graded relevance as a measure of the usefulness, or
gain, from examining a document
Gain is accumulated starting at the top of the ranking and
may be reduced, or discounted, at lower ranks
Typical discount is 1/log (rank)
– With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
50. Discounted Cumulative Gain
DCG is the total gain accumulated at a particular rank p:
  DCG_p = rel_1 + Σ_{i=2..p} rel_i / log2(i)   [Jarvelin:2000]
Alternative formulation:
– used by some web search companies
– emphasis on retrieving highly relevant documents
  DCG_p = Σ_{i=1..p} (2^rel_i − 1) / log2(1 + i)   [Burges:2005]
51. Discounted Cumulative Gain
• Neither CG nor DCG can be used for comparison across topics!
– The score depends on the # relevant documents per topic
52. Normalised Discounted Cumulative Gain
Compute the DCG of the optimal result set
E.g. the ideal ordering (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0,…)
has the Ideal Discounted Cumulative Gain: IDCG
Normalise:
  NDCG(n) = DCG(n) / IDCG(n)
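A sketch of DCG and nDCG under the 1/log2(rank) discount from the previous slides; the run below is the "our rank" list shown on the next slide, and treating the full graded judgment set as the ideal ordering is an assumption of this example:

```python
import math

def dcg(gains, n):
    # No discount at rank 1; 1/log2(i) from rank 2 onward.
    return sum(g / max(1.0, math.log2(i))
               for i, g in enumerate(gains[:n], start=1))

def ndcg(gains, all_judged_gains, n):
    ideal = dcg(sorted(all_judged_gains, reverse=True), n)  # IDCG
    return dcg(gains, n) / ideal if ideal else 0.0

run = [5, 2, 0, 0, 5, 2, 4, 0, 0, 1, 4]
judged = [5, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0, 0]
print(ndcg(run, judged, 10))
```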
53. Some more variations
E.g. the ideal ordering (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0,…)
has the Ideal Discounted Cumulative Gain: IDCG
"our rank": (5,2,0,0,5,2,4,0,0,1,4,…)
Two ranked lists → rank correlation measures
– Kendall's Tau (similarity of orderings)
– Pearson's Rho (linear correlation between variables)
– Spearman's Rho (Pearson on ranks)
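Each of the three can be computed with scipy; a sketch using the two score lists above, truncated to equal length:

```python
from scipy import stats

ideal = [5, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2]
ours  = [5, 2, 0, 0, 5, 2, 4, 0, 0, 1, 4]

tau, _ = stats.kendalltau(ideal, ours)  # similarity of orderings
r, _   = stats.pearsonr(ideal, ours)    # linear correlation
rho, _ = stats.spearmanr(ideal, ours)   # Pearson computed on ranks
print(tau, r, rho)
```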
54. Some more variations
Rank-biased precision (RBP)
– "log-based discount is not a good model of users' behaviour"
– imagine a probability p of the user moving on to the next document
  RBP(n) = (1−p) · Σ_{i=1..n} rel(i) · p^(i−1)
– p ≈ 0.95 models a persistent user; p ≈ 0.0 a user who looks only at the first result
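A sketch of RBP under that user model, assuming binary relevance for illustration:

```python
# With probability p the user moves on, so rank i is examined with
# probability p**(i-1); the (1-p) factor normalises the weights.
def rbp(rels, p=0.95):
    return (1.0 - p) * sum(rel * p ** i for i, rel in enumerate(rels))

ranking = [1, 0, 1, 1, 0, 0, 1]
print(rbp(ranking, p=0.95))  # persistent user: deep ranks still matter
print(rbp(ranking, p=0.0))   # impatient user: only rank 1 counts
```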
55. Time-based calibration
Assumption:
– The objective of the search engine is to improve the efficiency of an information-seeking task
Extend nDCG, replacing the rank discount with a time-based function (Smucker and Clarke:2011)
– normalization as in nDCG
– gain decays as a function of the time needed to reach item k in the ranked list
56. The water filling model (Luo et al., 2013)
and the corresponding Cube Test (CT)
– also for professional search: to capture embedded subtopics
– no assumption of linear traversal of documents: takes time into account
– potential cap on the amount of information taken into account
– high discriminative power
57. Other diversity metrics
Several aspects of the topic might [need to] be covered
– Aspectual recall/precision
The discount may take into account previously seen aspects
– α-NDCG = NDCG where
  rel(i) = Σ_{k=1..m} J(d_i, k) · (1−α)^(r_{k,i−1})
  r_{k,i−1} = Σ_{j=1..i−1} J(d_j, k)
  J(d_j, k) = 1 if d_j is relevant to nugget n_k, 0 otherwise
58. Other measures
• There are many IR measures!
• trec_eval is a little program that computes many of them
– 37 in v9.0, many of which are multi-point (e.g. Precision @10,
@20…)
• http://trec.nist.gov/trec_eval/
• “there is a measure to make anyone a winner”
– Not really true, but still…
59. Other measures
• How about correlations between measures?
• Kendall Tau values (from Voorhees and Harman, 2004)
• Overall they correlate

           P(30)  R-Prec  MAP   .5 prec  R(1,1000)  Rel Ret  MRR
P(10)      0.88   0.81    0.79  0.78     0.78       0.77     0.77
P(30)             0.87    0.84  0.82     0.80       0.79     0.72
R-Prec                    0.93  0.87     0.83       0.83     0.67
MAP                             0.88     0.85       0.85     0.64
.5 prec                                  0.77       0.78     0.63
R(1,1000)                                           0.92     0.67
Rel Ret                                                      0.66
60. Topic sets
Topic selection
– In early TREC, candidate topics were rejected if ambiguous
Are all topics equal?
– Mean Average Precision uses the arithmetic mean
– Classical Test Theory experiments (Bodoff and Li, 2007) identified outliers that could change the rankings
– MAP: a change in AP from 0.05 to 0.1 has the same effect as a change from 0.25 to 0.3
– GMAP (geometric mean): a change in AP from 0.05 to 0.1 has the same effect as a change from 0.25 to 0.5
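A small sketch of why that is, assuming GMAP is the geometric mean of the per-topic AP values (the scores are made up; ε avoids log(0)):

```python
import math

def map_score(aps):
    return sum(aps) / len(aps)

def gmap_score(aps, eps=1e-5):
    return math.exp(sum(math.log(ap + eps) for ap in aps) / len(aps))

base = [0.05, 0.25]
print(map_score([0.10, 0.25]) - map_score(base))    # 0.05 -> 0.10: +0.025 MAP
print(map_score([0.05, 0.30]) - map_score(base))    # 0.25 -> 0.30: same +0.025
print(gmap_score([0.10, 0.25]) / gmap_score(base))  # 0.05 -> 0.10: ratio ~1.41
print(gmap_score([0.05, 0.50]) / gmap_score(base))  # 0.25 -> 0.50: same ratio
```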
61. Measure measures
What is the best measure?
– What makes a measure better?
Match to task
– E.g.
Known item search: MRR
Something more quantitative?
– Correlations between measures
Does the system ranking change when using different measures?
Useful to group measures
– Ability to distinguish between runs
– Measure stability
62. Ad-hoc quiz
It was necessary to normalize the discounted cumulative gain (NDCG) because…
– of the assumption of a normal probability distribution
– to be able to compare across topics
– normalization is always better
– to be able to average across topics
63. Ad-hoc quiz
It was necessary to normalize the discounted cumulative gain (NDCG) because…
– of the assumption of a normal probability distribution
– to be able to compare across topics ✓
– normalization is always better
– to be able to average across topics ✓
65. Measure stability
Success criterion:
– A measure is good if it is able to predict differences between systems (on the average of future queries)
Method:
– Split the collection in 2
1. Use one half as the train collection to rank runs
2. Use the other half as the test collection to compute how many pair-wise comparisons hold
Observations:
– Cut-off measures are less stable than MAP
Any other criteria for measure quality?
66. Measure measures
We started with opinions from the '60s and have since seen some measures – have the targets changed?
7 numeric properties of effectiveness metrics (Moffat, 2013)
67. 7 properties of effectiveness metrics
Boundedness – the set of scores attainable by the metric is bounded, usually in [0,1]
Monotonicity – if a ranking of length k is extended so that k+1 elements are included, the score never decreases
Convergence – if a document outside the top k is swapped with a less relevant document inside the top k, the score strictly increases
Top-weightedness – if a document within the top k is swapped with a less relevant one higher in the ranking, the score strictly increases
Localization – a score at depth k can be computed based solely on knowledge of the documents that appear in the top k
Completeness – a score can be calculated even if the query has no relevant documents
Realizability – provided that the collection has at least one relevant document, it is possible for the score at depth k to be maximal
68. So far
– introduction
– metrics
We are now able to say "System A is better than System B" – a very strong statement!
Or are we? Remember:
– we only have limited data
– potential future applications are unbounded
69. Statistical validity
Whatever evaluation metric is used, all experiments must be statistically valid
– i.e. differences must not be the result of chance
[Figure: bar chart of MAP values on a 0–0.2 scale]
70. Statistical validity
• Ingredients of a significance test
– A test statistic (e.g. the differences between AP values)
– A null hypothesis (e.g. "there is no difference between the two systems")
  This gives us a particular distribution of the test statistic
– An alternative hypothesis (one- or two-tailed tests)
  Don't change it after the test
– A significance level, computed by taking the actual value of the test statistic and determining how likely it is to see this value given the distribution implied by the null hypothesis
• P-value
• If the p-value is low, we can feel confident in rejecting the null hypothesis → the systems are different
71. Statistical validity
Common practice is to declare systems different when the p-value ≤ 0.05
A few tests:
– Randomization tests
– Wilcoxon Signed Rank test
– Sign test
– Bootstrap test
– Student's paired t-test
See the recent discussion in SIGIR Forum
– T. Sakai – Statistical Reform in Information Retrieval?
  effect sizes, confidence intervals
72. Statistical validity
How do we increase the statistical validity of an
experiment?
By increasing the number of topics
– The more topics, the more confident we are that the difference
between average scores will be significant
What's the minimum number of topics? 42?
• It depends, but:
• TREC started with 50
• Below 25 is generally considered not significant
74. t-Test
Assumption is that the difference between the effectiveness values is a sample from a normal distribution
Null hypothesis is that the mean of the distribution of differences is zero
Test statistic:
  t = d̄ / (σ_d / √n)
– where d̄ and σ_d are the mean and standard deviation of the n per-topic differences
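A sketch with scipy's paired t-test; the per-topic AP scores are hypothetical:

```python
from scipy import stats

ap_system_a = [0.32, 0.45, 0.12, 0.50, 0.28, 0.41, 0.36, 0.22]
ap_system_b = [0.28, 0.40, 0.15, 0.42, 0.25, 0.38, 0.30, 0.20]

# Null hypothesis: the mean of the per-topic differences is zero.
t, p = stats.ttest_rel(ap_system_a, ap_system_b)
print(t, p)
print("systems differ" if p <= 0.05 else "cannot reject the null hypothesis")
```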
82. Summary
So far:
– introduction
– metrics
Next:
– where to get ground truth
– some more metrics
– discussion
84. Relevance assessments
• Ideally
– Sit down and look at all documents
• Practically
– The ClueWeb09 collection has
• 1,040,809,705 web pages, in 10 languages
• 5 TB, compressed. (25 TB, uncompressed.)
– No way to do this exhaustively
– Look only at the set of returned documents
• Assumption: if there are enough systems being tested and none of them returned a document, that document is not relevant
85. Relevance assessments - Pooling
Combine the results retrieved by all systems
Choose a parameter k (typically 100)
Choose the top k documents as ranked in each submitted
run
The pool is the union of these sets of docs
– Between k and (# submitted runs) × k documents in the pool
– The (k+1)st document returned in one run is either irrelevant or ranked higher in another run
Give the pool to judges for relevance assessment
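A minimal sketch of pool construction ('runs' and its contents are illustrative):

```python
# The pool is the union of the top-k documents over all submitted runs.
def build_pool(runs, k=100):
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:k])
    return pool

runs = {"runA": ["d3", "d1", "d7", "d2"],
        "runB": ["d1", "d9", "d3", "d4"]}
print(sorted(build_pool(runs, k=2)))  # ['d1', 'd3', 'd9'] go to the assessors
```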
87. Relevance assessments - Pooling
Conditions under which pooling works [Robertson]
– Range of different kinds of systems, including manual systems
– Reasonably deep pools (100+ from each system)
But depends on collection size
– The collections cannot be too big.
Big is so relative…
88. Relevance assessments - Pooling
Advantage of pooling:
– Fewer documents must be manually assessed for relevance
Disadvantages of pooling:
– Can’t be certain that all documents satisfying the query are found
(recall values may not be accurate)
– Runs that did not participate in the pooling may be disadvantaged
– If only one run finds certain relevant documents, but ranked lower
than 100, it will not get credit for these.
89. Relevance assessments
Pooling with randomized sampling
As the data collection grows, the top 100 may not be
representative of the entire result set
– (i.e. the assumption that everything after is not relevant does not
hold anymore)
Add to the pool a set of documents randomly sampled from the entire retrieved set
– If the sampling is uniform, it is easy to reason about, but it may be too sparse as the collection grows
– Stratified sampling: get more from the top of the ranked list [Yilmaz et al.:2008]
90. Relevance assessments - incomplete
• The unavoidable conclusion is that we have to handle incomplete relevance assessments
– Consider unjudged = non-relevant
– Do not consider unjudged at all (i.e. compress the ranked lists)
• A new measure: BPref (binary preference)
– r = a relevant returned document, n = a non-relevant returned document
– R = # documents judged relevant, N = # documents judged non-relevant
  BPref = (1/R) · Σ_r ( 1 − |{n : rank(n) < rank(r)}| / min(R, N) )
– i.e. each relevant document is penalised by the fraction of judged non-relevant documents ranked above it
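A sketch of BPref over a ranked list with incomplete judgments (document ids and the judgment dictionary are illustrative):

```python
# judgments: doc -> True (relevant) / False (non-relevant); unjudged
# documents are skipped entirely, which is the point of the measure.
def bpref(ranked_docs, judgments):
    R = sum(1 for rel in judgments.values() if rel)
    N = len(judgments) - R
    denom = min(R, N)
    if R == 0:
        return 0.0
    nonrel_above, score = 0, 0.0
    for doc in ranked_docs:
        if doc not in judgments:
            continue  # unjudged: ignored
        if judgments[doc]:
            score += 1.0 - (min(nonrel_above, denom) / denom if denom else 0.0)
        else:
            nonrel_above += 1
    return score / R

judgments = {"d1": True, "d2": False, "d3": True, "d4": False, "d5": True}
print(bpref(["d2", "d1", "dX", "d3", "d4", "d5"], judgments))  # ~0.33
```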
91. Relevance assessments - incomplete
• BPref was designed to mimic MAP
• Soon after, induced AP and inferred AP were proposed
• If the data is complete, both are equal to MAP
  indAP = (1/R) · Σ_r ( 1 − |{n : rank(n) < rank(r)}| / rank(r) )
  infAP = (1/R) · Σ_r [ 1/k + ((k−1)/k) · (|d100| / (k−1)) · ((rel + ε) / (rel + nonrel + ε)) ]
– where k = rank(r); d100 is the set of documents above rank k that fall in the depth-100 pool; rel and nonrel count the sampled relevant and non-relevant documents above rank k; ε is a small smoothing constant
– the bracketed term is the expectation of precision at rank k
92. Not only are we incomplete, but we might also be inconsistent in our judgments
93. Relevance assessment - subjectivity
In TREC-CHEM’09 we had each topic evaluated by two
students
– “conflicts” ranged between 2% and 33% (excluding a topic with 60%
conflict)
– This all increased if we considered “strict disagreement”
In general, inter-evaluator agreement is rarely above 80%
There is little one can do about it
94. Relevance assessment - subjectivity
Good news:
– “idiosyncratic nature of relevance judgments does not affect
comparative results” (E. Voorhees)
– Mean Kendall Tau between system rankings produced from
different query relevance sets: 0.938
– Similar results held for:
  Different query sets
  Different evaluation measures
  Different assessor types
  Single opinion vs. group opinion judgments
95. No assessors
Pooling assumes all relevant documents are found by the systems
– Take this assumption further
Voting-based relevance assessments
– Consider the top K only
[Soboroff et al.:2001]
96. Test Collections
Generally created as the result of an evaluation campaign
– TREC – Text Retrieval Conference (USA)
– CLEF – Cross Language Evaluation Forum (EU)
– NTCIR - NII Test Collection for IR Systems (JP)
– INEX – Initiative for evaluation of XML Retrieval
– …
First one and paradigm definer:
– The Cranfield Collection
  In the 1950s
  Aeronautics
  225 queries, about 1400 documents
  Fully evaluated (every query-document pair judged)
97. TREC
Started in 1992
Always organised in the States, on the NIST campus
As leader, introduced most of the jargon used in IR
Evaluation:
– Topic = query / request for information
– Run = a ranked list of results
– Qrel = relevance judgements
98. TREC
Organised as a set of tracks that focus on a particular sub-
problem of IR
– E.g.
Patient records, Session, Chemical, Genome, Legal, Blog, Spam, Q&A, Novelty, Enterprise, Terabyte, Web, Video, Speech, OCR, Chinese, Spanish, Interactive, Filtering, Routing, Million Query, Ad-Hoc, Robust
– The set of tracks in a year depends on
  Interest of participants
  Fit to TREC
  Needs of sponsors
  Resource constraints
100. TREC – Task definition
Each Track has a set of Tasks:
Examples of tasks from the Blog track:
– 1. Finding blog posts that contain opinions about the topic
– 2. Ranking positive and negative blog posts
– 3. (A separate baseline task to just find blog posts relevant to the
topic)
– 4. Finding blogs that have a principal, recurring interest in the
topic
101. TREC - Topics
For TREC, topics generally have a specific format (not
always though)
– <ID>
– <title>
Very short
– <description>
A brief statement of what would be a relevant document
– <narrative>
A long description, meant also for the evaluator to understand
how to judge the topic
102. TREC - Topics
Example:
– <ID>
312
– <title>
Hydroponics
– <description>
Document will discuss the science of growing plants in water or
some substance other than soil
– <narrative>
A relevant document will contain specific information on the
necessary nutrients, experiments, types of substrates, and/or
any other pertinent facts related to the science of hydroponics.
Related information includes, but is not limited to, the history
of hydro- …
103. CLEF
Cross Language Evaluation Forum
– From 2010: Conference on Multilingual and Multimodal
Information Access Evaluation
– Supported by the PROMISE Network of Excellence
Started in 2000
Grand challenge:
– Fully multilingual, multimodal IR systems
Capable of processing a query in any medium and any
language
Finding relevant information from a multilingual multimedia
collection
And presenting it in the style most likely to be useful for the
user
104. CLEF
• Previous tracks:
• Mono-, bi-, and multilingual text retrieval
• Interactive cross language retrieval
• Cross language spoken document retrieval
• QA in multiple languages
• Cross language retrieval in image collections
• CL geographical retrieval
• CL Video retrieval
• Multilingual information filtering
• Intellectual property
• Log file analysis
• Large scale grid experiments
• From 2010
– Organised as a series of “labs”
105. MediaEval
Dedicated to evaluating new algorithms for multimedia access and retrieval
Emphasizes the 'multi' in multimedia
Focuses on human and social aspects of multimedia tasks
– speech recognition, multimedia content analysis, music and audio analysis, user-contributed information (tags, tweets), viewer affective response, social networks, temporal and geo-coordinates
http://www.multimediaeval.org/
106. Test collections - summary
It is important to design the right experiment for the right IR task
– Web retrieval is very different from legal retrieval
The example of patent retrieval:
– High Recall: a single missed document can invalidate a patent
– Session-based: single searches may involve days of cycles of results review and query reformulation
– Defendable: process and results may need to be defended in court
107. Outline
Introduction
Kinds of evaluation
Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
User-based evaluation
Discussion on Evaluation
Conclusion
108. User-based evaluation
Different levels of user involvement
– Based on subjectivity levels
1. Relevant/non-relevant assessments
Used largely in lab-like evaluation as described before
2. User satisfaction evaluation
Some work on 1., very little on 2.
– User satisfaction is very subjective
UIs play a major role
Search dissatisfaction can be a result of the non-existence of
relevant documents
112. User-based evaluation
User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
Relative judgements of documents:
"Is document X more relevant than document Y for the given query?"
– Many more assessments needed
– Better inter-annotator agreement [Rees and Schultz, 1967]
114. User-based evaluation
User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
[Image from Thomas and Hawking, Evaluation by comparing result sets in context, CIKM 2006]
116. User-based evaluation
User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
Some issues, alternatives
– Control for all sorts of user-based biases
[Image from Bailey, Thomas and Hawking, Does brandname influence perceived search result quality?, ADCS 2007]
117. User-based evaluation
User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
Some issues, alternatives
– Control for all sorts of user-based biases
– Two-panel evaluation
  – limits the number of systems which can be evaluated
  – is unusable in real-life contexts
– Interspersed ranked lists with click monitoring
118. Effectiveness evaluation
Lab-like vs. user-focused
Results are mixed: some experiments show correlations, some do not
"Do User Preferences and Evaluation Measures Line Up?" (Sanderson, Paramita, Clough, Kanoulas, SIGIR 2010)
– shows the existence of correlations
User preference is inherently user-dependent
– Domain-specific IR will be different
"The relationship between IR effectiveness measures and user satisfaction" (Al-Maskari, Sanderson, Clough, SIGIR 2007)
– strong correlation between user satisfaction and DCG, which disappeared when normalized to NDCG
119. Predicting performance
Future data and queries
Not absolute, but relative performance
– ad-hoc evaluations suffer in particular
– no comparison between lab and operational settings
  for justified reasons, but still none
– how much better must a system be?
  generally, we require statistical significance
[Trippe:2011]
121. Predictive performance
Future systems
Test collections are often used to prove we have a better system than the state of the art
– not all documents were evaluated
– "retrofit" metrics that are not considered resilient to such evolution
  RBP [Webber:2009]
  Precision@n [Lipani:2014], Recall@n […]
Why do this?
– Precision@n and Recall@n are loved in industry
– Also in industry, technology migration steps are high (i.e. one holds on to a system that 'works' until it is patently obvious it affects business performance)
122. Are lab evals sufficient?
Patent search is an active process where the end-user engages in a process of understanding and interacting with the information
Evaluation needs a definition of success
– success ~ lower risk
  partly precision and recall
  partly (some argue the most important part) the intellectual and interactive role of the patent search system as a whole
A series of evaluation layers
– lab evals are now the lowest level
– to elevate them, they must measure risk and incentivize systems to provide estimates of confidence in the results they provide
[Trippe:2011]
123. Outline
Introduction
Kinds of evaluation
Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
User-based evaluation
Discussion on Evaluation
Conclusion
124. Discussion on evaluation
Laboratory evaluation – good or bad?
– Rigorous testing
– Over-constrained
I usually make the comparison to a tennis racket:
– No evaluation of the device will tell you how well it will perform in real life – that largely depends on the user
– But the user will choose the device based on the lab evaluation
125. Discussion on evaluation
There is bias to account for
– E.g. number of relevant documents per topic
126. Discussion on evaluation
Recall and recall-related measures are often contested [Cooper:1973, p.95]
– "The involvement of unexamined documents in a performance formula has long been taken for granted as a perfectly natural thing, but if one stops to ponder the situation, it begins to appear most peculiar. … Surely a document which the system user has not been shown in any form, to which he has not devoted the slightest particle of time or attention during his use of the system output, and of whose very existence he is unaware, does that user neither harm nor good in his search"
Clearly not true in the legal & patent domains
127. Discussion on Evaluation
Realistic tasks and user models
– Evaluation has to be based on the available data sets
  This creates the user model
  Tasks need to correspond to available techniques
Much literature on generating tasks
– Experts describe typical tasks
– Use of log files of various sorts
IR research is decades behind sociology in terms of user modeling – there is much to learn from there
128. Discussion on Evaluation
Competitiveness
– Most campaigns take pains to explain: "This is not a competition – this is an evaluation"
Competitions are stimulating, but
– Participants are wary of taking part if they are not sure to win
  Particularly commercial vendors
– Without special care from the organizers, competition stifles creativity:
  The best way to win is to take last year's method and improve it a bit
  Original approaches are risky
129. Discussion on Evaluation
Topical Relevance
What other kinds of relevance factors are there?
– diversity of information
– quality
– credibility
– ease of reading
130. Conclusion
• IR Evaluation is a research field in itself
• Without evaluation, research is pointless
– IR Evaluation research included
• Statistical significance testing is a must to validate results
• Most IR Evaluation exercises are laboratory experiments
– As such, care must be taken to match, to the extent possible, the real needs of the users
• Experiments in the wild are rare, small and domain-specific:
– VideOlympics (2007–2009)
– PatOlympics (2010–2012)
131. Bibliography
– Test Collection Based Evaluation of Information Retrieval Systems, M. Sanderson, 2010
– TREC – Experiment and Evaluation in Information Retrieval, E. Voorhees, D. Harman (eds.)
– On the history of evaluation in IR, S. Robertson, Journal of Information Science, 2008
– A Comparison of Statistical Significance Tests for Information Retrieval Evaluation, M. Smucker, J. Allan, B. Carterette, CIKM 2007
– A Simple and Efficient Sampling Method for Estimating AP and NDCG, E. Yilmaz, E. Kanoulas, J. Aslam, SIGIR 2008
132. Bibliography
– Do User Preferences and Evaluation Measures Line Up?, M. Sanderson, M. L. Paramita, P. Clough, E. Kanoulas, 2010
– A Review of Factors Influencing User Satisfaction in Information Retrieval, A. Al-Maskari, M. Sanderson, 2010
– Towards higher quality health search results: Automated quality rating of depression websites, D. Hawking, T. Tang, R. Sankaranarayana, K. Griffiths, N. Craswell, P. Bailey, 2007
– Evaluating Sampling Methods for Uncooperative Collections, P. Thomas, D. Hawking, 2007
– Comparing the Sensitivity of Information Retrieval Metrics, F. Radlinski, N. Craswell, 2010
– Redundancy, Diversity and Interdependent Document Relevance, F. Radlinski, P. Bennett, B. Carterette, T. Joachims, 2009
– Does Brandname influence perceived search result quality? Yahoo!, Google, and WebKumara, P. Bailey, P. Thomas, D. Hawking, 2007
– Methods for Evaluating Interactive Information Retrieval Systems with Users, D. Kelly, 2009
– C-TEST: Supporting Novelty and Diversity in Testfiles for Search Tuning, D. Hawking, T. Rowlands, P. Thomas, 2009
– Live Web Search Experiments for the Rest of Us, T. Jones, D. Hawking, R. Sankaranarayana, 2010
– Quality and relevance of domain-specific search: A case study in mental health, T. Tang, N. Craswell, D. Hawking, K. Griffiths, H. Christensen, 2006
– New methods for creating testfiles: Tuning enterprise search with C-TEST, D. Hawking, P. Thomas, T. Gedeon, T. Jones, T. Rowlands, 2006
– A Field Experimental Approach to the Study of Relevance Assessments in Relation to Document Searching, A. M. Rees, D. G. Schultz, Final Report to the National Science Foundation, Volume II, Appendices, Clearinghouse for Federal Scientific and Technical Information, October 1967
– The Water Filling Model and the Cube Test: Multi-dimensional Evaluation for Professional Search, J. Luo, C. Wing, H. Yang, M. Hearst, CIKM 2013
– On sample sizes for non-matched-pair IR experiments, S. Robertson, Information Processing & Management, 1990
– Splitting Water: Precision and Anti-Precision to Reduce Pool Bias, A. Lipani, M. Lupu, A. Hanbury, SIGIR 2015
– Score adjustment for correction of pooling bias, W. Webber, L. A. F. Park, SIGIR 2009
Editor's notes
Thai text for “I have a red car”
some terms you will be hearing us talking about
In this lecture we will focus on the first, intrinsic, evaluation, and only mention the second part, as it will be discussed in much more detail in K. Jarvelin’s lecture.
A desirable measure of retrieval performance would have the following properties: First, it would express solely the ability of a retrieval system to distinguish between wanted and unwanted items – that is, it would be a measure of effectiveness. Second, the desired measure would not be confounded by the relative willingness of the system to emit items – it would express discrimination power independent of any “acceptance criterion” employed, whether the criterion is characteristic of the system or adjusted by the user. Third, the measure would be a single number – in preference, for example, to a pair of numbers which may co-vary in a loosely specified way, or a curve representing a table of several pairs of numbers – so that it could be transmitted simply and immediately comprehended. Fourth, and finally, the measure would allow complete ordering of different performances, and assess the performance of any one system in absolute terms – that is, the metric would be a scale with a unit, a true zero, and a maximum value. Given a measure with these properties, we could be confident of having a pure and valid index of how well a retrieval system (or method) were performing the function it was primarily designed to accomplish, and we could reasonably ask questions of the form “Shall we pay X dollars for Y units of effectiveness?” (Swets, 1967, http://onlinelibrary.wiley.com/doi/10.1002/asi.4630200110/)
For the E measure, Beta indicates what the user prefers (precision: beta>1, recall: beta<1)
These methods clearly depend on cut-off values, which make them unusable for meaningful comparison between topics (a topic may have very few relevant documents, a topic may have many more)
The harmonic mean is considered better for averaging ratios.
Example: precision 0.1 and recall 0.9; the arithmetic average is 0.5 – quite high – while the harmonic mean is 0.18. An even more extreme case: think precision 0.01 and recall 0.99.
this interpolation is actually not obvious because we might not always have the same values for recall (remember that that depends on the number of relevant documents per topic).
The common way is to consider as precision at recall_i the highest precision measure at any level greater or equal than recall_i
Cut-off based measures also have the significant disadvantage that they are unstable with respect to the size of the collection
They are also unfair between topics: the number of relevant documents for each topic in the collection generally differs, but improvements are considered the same by these measures.
Across all seven participating groups, P(20) was higher for searches on the 20 GB collection than on the subset; on average 39% higher.
Note also that all forms of AP and R-precision approximate the area under a recall precision graph (Sanderson, Aslam)
Here is where genre may come into play, as well as difficulty. This time function needs to be calibrated to the user.
.5 prec is recall obtained by the system when precision first dips below 0.5 and at least ten documents have been retrieved (heuristic that users will look at the result set as long as there are more relevant than non-relevant documents)
R(1,1000) is a weighted rel ret, such that the topics with the most relevant documents do not dominate the measure.
Even now, topics are rejected (removed) if no relevant documents have been identified
Cost of evaluation: e.g. P@5 is very cheap, while MAP is much more expensive.
“What are the required conditions? Well, the evidence suggests that we need to start with a good range of different kinds of systems – preferably, in particular, including some manual systems involving human-designed search strategies and (preferably again) some degree of interaction in the search. Second, we need reasonably deep pools (preferably 100+ from each system, not 10). Third, the collections themselves cannot be too big. “ (Robertson:2008)
Or at least very little on 2. which can be published in IR journals and conferences