This talk covers how IR systems (search engines) are evaluated, in particular under the TREC methodology. The common measure of Mean Average Precision is discussed and compared to the newly proposed Rank-Biased Precision (Moffat and Zobel 2008).
For more discussion, see: http://alteregozi.com/2009/01/18/evaluating-search-engines-relevance/
IR Evaluation using Rank-Biased Precision
1. Alistair Moffat and Justin Zobel, "Rank-Biased Precision for Measurement of Retrieval Effectiveness", ACM TOIS, vol. 27, no. 1, 2008.
Ofer Egozi
LARA group, Technion
2. Introduction to IR Evaluation
Mean Average Precision
Rank-Biased Precision
Analysis of RBP
4. Task: given query q, output a ranked list of documents
◦ Find the probability that document d is relevant for q
Evaluation is difficult
◦ No (per query) test data
◦ Queries vary tremendously
◦ Relevance is a vague (human) concept
5. Precision / recall
Precision: |alg ∩ rel| / |alg|
Recall: |alg ∩ rel| / |rel|
[Venn diagram: alg(q,D) and rel(q,D) overlapping within the collection D]
◦ Precision and recall usually conflict
◦ Single measures proposed (P@X, RR, AP…)
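To make the set-based definitions concrete, here is a minimal Python sketch; the function names and example data are illustrative, not from the talk:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query.

    retrieved -- documents returned by the algorithm: alg(q, D)
    relevant  -- documents judged relevant:           rel(q, D)
    """
    hits = len(set(retrieved) & set(relevant))   # |alg ∩ rel|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def precision_at_k(ranked, relevant, k):
    """P@k: precision computed over the top k ranked results only."""
    return len(set(ranked[:k]) & set(relevant)) / k


# e.g. 3 relevant documents in the top 10 gives P@10 = 0.3
print(precision_at_k(list("abcdefghij"), {"a", "d", "h"}, 10))
```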
6. Relevance requires human judgment
◦ Exhaustive judging is not scalable
◦ TREC uses pooling
◦ Shown to miss a significant portion of the relevant documents…
◦ … but shown to support cross-system comparison well
◦ Bias against novel approaches
8. In the real world, what does recall measure?
◦ Recall is important only with "perfect" knowledge
◦ If I got one result, and there is another I don’t know of, am I half-satisfied?...
◦ …yes, for specific needs (e.g., legal or patent search sessions)
◦ "Boiling temperature of lead"
Precision is more user-oriented
◦ P@10 measures real user satisfaction
◦ Still, P@10=0.3 can mean first three or last three…
10. Calculated as AP = (1/R) · Σ P@k, summed over every rank k that holds a relevant document (R = total number of relevant documents)
◦ Intuitively: sum the P@k values at the ranks where relevant documents are found, then divide by the total number of relevant documents to normalize for summing across queries
Example: $$---$----$-----$--- (AP ≈ 0.6316)
Consider: $$---$----$-----$$$$
◦ AP drops to ≈0.5324, despite P@20 increasing from 0.25 to 0.40
◦ Finding more relevant documents can harm a system's AP!
◦ Similar problems arise if some documents are initially unjudged
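A small Python sketch of this computation, using the slide's notation ('$' = relevant, '-' = not relevant); it assumes every relevant document appears somewhere in the judged ranking, and the function name is mine:

```python
def average_precision(ranking):
    """AP for a ranked list given as a string: '$' relevant, '-' not.

    Sums P@k at each rank k holding a relevant document, then divides
    by the total number of relevant documents (here: those found,
    assuming all relevant documents appear in the ranking).
    """
    ap, rel_seen = 0.0, 0
    for k, doc in enumerate(ranking, start=1):
        if doc == '$':
            rel_seen += 1
            ap += rel_seen / k          # P@k at this relevant rank
    return ap / rel_seen if rel_seen else 0.0


print(average_precision('$$---$----$-----$---'))   # ≈ 0.6316
print(average_precision('$$---$----$-----$$$$'))   # ≈ 0.5324 -- lower!
```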
11. Methodological problem of instability
◦ Results may depend on the extent of judging
◦ More judging can be destabilizing (that is, error margins don’t shrink as uncertainty is reduced)
13. Complex abstraction of user satisfaction
◦ "Every time a relevant document is encountered, the user pauses, asks 'Over the documents I have seen so far, on average how satisfied am I?' and writes a number on a piece of paper. Finally, when the user has examined every document in the collection — because this is the only way to be sure that all of the relevant ones have been seen — the user computes the average of the values they have written."
How can R be truly calculated?
Think of evaluating a Google query…
Still, MAP is highly popular and useful:
◦ Validated in numerous TREC studies
◦ Shown to be stable and robust across query sets (for deep enough pools)
16. Induced by a user model
◦ The document at rank i is examined with probability p^(i-1)
◦ Expected #docs seen: 1 + p + p^2 + … = 1/(1-p)
◦ Total expected utility (r_i = known relevance of the document at rank i): Σ_i r_i · p^(i-1)
◦ RBP = expected utility rate = utility / effort = (1-p) · Σ_i r_i · p^(i-1)
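A minimal sketch of this formula with binary relevance (the function name `rbp` is mine):

```python
def rbp(relevances, p):
    """Rank-Biased Precision: RBP = (1 - p) * sum_i r_i * p^(i-1).

    relevances -- r_i for ranks 1, 2, ... (1 = relevant, 0 = not)
    p          -- persistence: probability of moving on to the next rank
    """
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevances))


# Relevant documents at ranks 1, 2 and 6:
print(rbp([1, 1, 0, 0, 0, 1], p=0.8))  # (1-0.8)*(1 + 0.8 + 0.8**5) ≈ 0.4255
```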
19. Values of p reflect user behaviors
◦ p=0.95: persistent user (0.95^10 ≈ 60% chance of reaching the 2nd page)
◦ p=0.5: impatient user (0.5^10 ≈ 0.1% chance of reaching the 2nd page)
◦ p=0: "I'm feeling lucky" (identical to P@1)
Values of p also control the contribution of each relevant document
◦ But that contribution is always positive!
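The page-two probabilities follow directly from the model: a user reaches rank 11 (the start of the second page of ten results) only after ten "continue" decisions, i.e. with probability p^10. A quick check:

```python
for p in (0.95, 0.5):
    print(f"p = {p}: P(reach rank 11) = {p ** 10:.4f}")
# p = 0.95: P(reach rank 11) = 0.5987   (~60%)
# p = 0.5:  P(reach rank 11) = 0.0010   (~0.1%)
```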
23. Uncertainty: how many relevant documents are there? (further down the ranking, or even unjudged at the current depth)
The computed RBP value is therefore inherently a lower bound
Residual uncertainty is easy to calculate – assume every unjudged document is relevant…
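A sketch of the bound computation, under the simplifying assumption that the first d ranks are fully judged: each rank i can contribute at most (1-p)·p^(i-1), so the unjudged tail beyond depth d adds at most Σ_{i>d} (1-p)·p^(i-1) = p^d. The helper name is mine:

```python
def rbp_bounds(relevances, p):
    """RBP lower/upper bounds when only the first d ranks are judged.

    Lower bound: score all unjudged documents as non-relevant.
    Upper bound: add the residual p^d, the maximum total weight the
    unjudged tail (ranks d+1, d+2, ...) could still contribute.
    """
    d = len(relevances)
    base = (1 - p) * sum(r * p ** i for i, r in enumerate(relevances))
    return base, base + p ** d


lo, hi = rbp_bounds([1, 1, 0, 0, 0, 1, 0, 0, 0, 0], p=0.8)
print(f"RBP is in [{lo:.4f}, {hi:.4f}]")  # residual 0.8**10 ≈ 0.1074
```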
25. RBP has significant advantages:
◦ Based on a solid and supported user model
◦ Computable in real life: no unknown quantities (R, |D|) required
◦ Error bounds for uncertainty
◦ Statistical significance as good as other measures
But also:
◦ Absolute values, not relative to query difficulty
◦ A choice for p must be made