August 22, seminar "RUSSIR Summer School Best Practices"
Ben Carterette "Advances in Information Retrieval Evaluation"
There is great interest in producing effectiveness measures that model user behavior in order to better model the utility of a system to its users. These measures are often formulated as a sum over the product of a discount function of ranks and a gain function mapping relevance assessments to numeric utility values. We develop a conceptual framework for analyzing such effectiveness measures based on classifying members of this broad family of measures into four distinct families, each of which reflects a different notion of system utility. This is a theory of model-based measures within which we can hypothesize about the properties that such a measure should have and test those hypotheses against user and system data.
1. System Effectiveness, User Models, and User Utility: A Conceptual Framework for Investigation
Ben Carterette, University of Delaware, carteret@cis.udel.edu
2. Effectiveness Evaluation
• Determine how good the system is at finding and ranking relevant documents
• An effectiveness measure should be correlated with the user's experience
  – Value increases when the user experience gets better; decreases when it gets worse
• Hence the interest in effectiveness measures based on explicit models of user interaction
  – RBP [Moffat & Zobel], DCG [Järvelin & Kekäläinen], ERR [Chapelle et al.], EBU [Yilmaz et al.], sessions [Kanoulas et al.], etc.
3. Discounted Gain Model
• Simple model of user interaction:
  – User steps down the ranked results one by one
  – Gains something from relevant documents
  – Is increasingly less likely to see documents deeper in the ranking
• Implementation of the model:
  – Gain is a function of relevance at rank k
  – Ranks k are increasingly discounted
  – Effectiveness = sum over ranks of gain times discount (see the sketch after this list)
• Most measures can be made to fit this framework
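To make the implementation concrete, here is a minimal sketch of the gain-times-discount template in Python (my code, not the talk's; the gain and discount functions passed in are just illustrative choices):

```python
import math

def discounted_gain_score(relevances, gain, discount):
    """Effectiveness = sum over ranks k of gain(rel_k) * discount(k)."""
    return sum(gain(rel) * discount(k)
               for k, rel in enumerate(relevances, start=1))

# Illustrative choices: binary gain with a DCG-style log discount.
ranking = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]   # relevance at ranks 1..10
score = discounted_gain_score(ranking,
                              gain=lambda rel: rel,
                              discount=lambda k: 1 / math.log2(k + 1))
print(round(score, 3))   # 2.689, matching the DCG example on slide 8
```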
4. Rank-Biased Precision [Moffat and Zobel, TOIS08]
Example query: "black powder ammunition". At each rank the user tosses a biased coin (θ): if HEADS, observe the next document; if TAILS, stop.
[Figure: ranked results 1-10 with the coin-toss browsing process]
8. Discounted Cumulative Gain [Järvelin and Kekäläinen, SIGIR00]
Example query: "black powder ammunition". Discount by rank: 1/log2(r+1).

rank | relevance | gain | discounted gain
   1 | R         | 1    | 1
   2 | R         | 1    | 0.63
   3 | N         | 0    | 0
   4 | N         | 0    | 0
   5 | R         | 1    | 0.38
   6 | R         | 1    | 0.35
   7 | N         | 0    | 0
   8 | R         | 1    | 0.31
   9 | N         | 0    | 0
  10 | N         | 0    | 0

DCG = 2.689; NDCG = DCG / optDCG = 0.91
9. Discounted Cumulative Gain

DCG = \sum_{i=1}^{\infty} \frac{rel_i}{\log_2(1+i)}

[Figure: the example relevance column (ranks 1-10) alongside the discount curve, probability axis 0.0-1.0]
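A short sketch reproducing the slide's numbers: binary gain, the 1/log2(r+1) discount, and NDCG as DCG over the ideal (re-sorted) ranking:

```python
import math

def dcg(rels):
    """Binary-relevance DCG with the 1/log2(r+1) rank discount."""
    return sum(rel / math.log2(r + 1) for r, rel in enumerate(rels, start=1))

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]    # R R N N R R N R N N
ideal = sorted(rels, reverse=True)        # optimal ordering of the same set
print(round(dcg(rels), 3))                # DCG  = 2.689
print(round(dcg(rels) / dcg(ideal), 2))   # NDCG = 0.91
```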
10. Expected Reciprocal Rank [Chapelle et al., CIKM09]
Example query: "black powder ammunition". At each rank the user either views the next item or stops.
[Figure: ranked results 1-10 with the view-next/stop browsing loop]
11. Expected Reciprocal Rank
The decision to stop depends on relevance: after viewing an item the user asks "Relevant?" (highly / somewhat / no), and depending on the answer either stops or views the next item.
[Figure: the same browsing loop with the relevance-conditioned stopping decision]
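A sketch of the cascade computation: ERR is the expected reciprocal rank at which the user stops, with each grade mapped to a per-rank stopping probability. The grade-to-probability mapping (2^g - 1)/2^g_max is the one used by Chapelle et al.; the example grades are mine:

```python
def err(grades, g_max=2):
    """Expected Reciprocal Rank under the cascade model: at rank k the
    user is satisfied and stops with probability r, else continues."""
    score, p_continue = 0.0, 1.0
    for k, g in enumerate(grades, start=1):
        r = (2 ** g - 1) / 2 ** g_max        # grade -> stopping probability
        score += p_continue * r / k          # P(stop exactly at k) * 1/k
        p_continue *= 1 - r
    return score

# Grades: 2 = highly relevant, 1 = somewhat, 0 = not relevant.
print(round(err([2, 0, 1, 0, 0]), 3))        # 0.771
```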
12. Models of Browsing Behavior
• Position-based models: the chance of observing a document depends on the position of the document in the ranked list.
• Cascade models: the chance of observing a document depends on its position as well as the relevance of the documents ranked above it.
[Figure: the two models illustrated on the "black powder ammunition" ranking]
13. A More Formal Model
• My claim: this implementation conflates at least four distinct models of user interaction
• Formalize it a bit:
  – Change the rank discount to a stopping probability density P(k)
  – Change the gain function to either a utility function or a cost function
• Then effectiveness = expected utility or cost over stopping points (see the sketch below):

M = \sum_{k=1}^{\infty} f(k) P(k)
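A minimal sketch of this formalization (my code, not the talk's): effectiveness as an expectation of f over stopping ranks, with the infinite sum truncated. Plugging in a geometric P(k) and relevance as f recovers RBP, as Model 1 below makes explicit:

```python
def expected_value(f, p, max_rank=1000):
    """M = sum over ranks k of f(k) * P(stop at k), truncated at max_rank."""
    return sum(f(k) * p(k) for k in range(1, max_rank + 1))

theta = 0.8
rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
rel_at = lambda k: rels[k - 1] if k <= len(rels) else 0
p_geometric = lambda k: (1 - theta) * theta ** (k - 1)
print(round(expected_value(rel_at, p_geometric), 3))   # RBP = 0.549
```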
14. Our Framework
• The components of a measure are:
  – a stopping rank probability P(k)
    • position-based vs. cascade is a feature of this distribution
  – a document utility model (binary relevance)
  – a utility accumulation model or a cost model
• We can test hypotheses about general properties of the stopping distribution and the utility/cost model
  – Instead of trying to evaluate every possible measure on its own, evaluate properties of the measure
15. Model Families
• Depending on these choices, we get four distinct families of user models
  – Each family is characterized by its utility/cost model
  – Within a family, there is freedom to choose P(k) and the document utility model
• Model 1: expected utility at the stopping point
• Model 2: expected total utility
• Model 3: expected cost
• Model 4: expected total utility per unit cost
16. Model 1: Expected Utility at Stopping Point
• Exemplar: Rank-Biased Precision (RBP)

RBP = (1 - \theta) \sum_{k=1}^{\infty} rel_k \, \theta^{k-1}
    = \sum_{k=1}^{\infty} rel_k \, \theta^{k-1} (1 - \theta)

• Interpretation:
  – P(k) = geometric density function
  – f(k) = relevance of the document at the stopping rank
  – Effectiveness = expected relevance at the stopping rank
17. Model 2: Expected Total Utility
• Instead of the stopping probability, think about the viewing probability:

P(view doc at k) = \sum_{i=k}^{\infty} P(i) = F(k)

• This fits in the discounted gain model framework:

M = \sum_{k=1}^{\infty} rel_k \, F(k)

• Does it fit in the expected utility framework?
  – Yes, and Discounted Cumulative Gain (DCG; Järvelin et al.) is the exemplar for this class
18. Model 2: Expected Total Utility

M = \sum_{k=1}^{\infty} rel_k \, F(k)
  = \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} P(i)
  = \sum_{k=1}^{\infty} P(k) \sum_{i=1}^{k} rel_i
  = \sum_{k=1}^{\infty} R_k \, P(k)

• f(k) = R_k (total summed relevance down to rank k)
• Let F_DCG(k) = 1/log2(k+1)
  – Then P_DCG(k) = F_DCG(k) - F_DCG(k+1)
  – P_DCG(k) = 1/log2(k+1) - 1/log2(k+2)
• Work the algebra backwards to show that you get binary-relevance DCG (if summing to infinity); the sketch below checks this numerically
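A numeric check of that algebra (my sketch): the viewing-probability form Σ rel_k F(k) and the stopping-expectation form Σ R_k P(k) give the same value. Past the last judged rank, R_k is constant, so the infinite tail of the second sum collapses exactly to R_n · F(n+1):

```python
import math
from itertools import accumulate

F = lambda k: 1 / math.log2(k + 1)      # F_DCG(k): P(view rank k)
P = lambda k: F(k) - F(k + 1)           # P_DCG(k): P(stop at rank k)

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]   # binary relevance at ranks 1..10
R = list(accumulate(rels))              # R_k: total relevance up to rank k

dcg_form = sum(rel * F(k) for k, rel in enumerate(rels, start=1))
exp_form = sum(R[k - 1] * P(k) for k in range(1, 11)) + R[-1] * F(11)
print(round(dcg_form, 3), round(exp_form, 3))   # 2.689 2.689
```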
19. Model 3: Expected Cost
• The user stops with probability based on accumulated utility rather than rank alone
  – P(k) = P(R_k) if the document at rank k is relevant, 0 otherwise
• Then use f(k) to model the cost of going down to rank k
• Exemplar measure: Expected Reciprocal Rank (ERR; Chapelle et al.), with binary relevance (sketched below):
  – P(k) = rel_k \cdot \theta^{R_k - 1} (1 - \theta)
  – 1/cost = f(k) = 1/k
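A sketch of that binary-relevance reading (θ here is an illustrative value): the user makes a stop/continue decision only at relevant documents, continuing past each with probability θ, and f(k) = 1/k is the reciprocal cost of reading down to rank k:

```python
theta = 0.5
rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

score, R_k = 0.0, 0
for k, rel in enumerate(rels, start=1):
    if rel:
        R_k += 1                                    # relevant docs seen so far
        p_stop = theta ** (R_k - 1) * (1 - theta)   # P(k) from the slide
        score += p_stop / k                         # times f(k) = 1/k
print(round(score, 3))                              # 0.664
```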
20. Model 4: Expected Utility per Unit Cost
• The user considers the expected effort of further browsing after each relevant document:

M = \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} f(i) P(i)

• Similar to the M2 family, manipulate algebraically:

\sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} f(i) P(i)
  = \sum_{k=1}^{\infty} f(k) P(k) \sum_{i=1}^{k} rel_i
  = \sum_{k=1}^{\infty} f(k) R_k \, P(k)
21. Model 4: Expected Utility per Unit Cost
• When f(k) = 1/k, we get:

M = \sum_{k=1}^{\infty} prec@k \cdot P(k)

• Average Precision (AP) is the exemplar for this class (see the check below):
  – P(k) = rel_k / R
  – utility/cost = prec@k
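A quick check (my sketch) that this reading reproduces ordinary Average Precision: with P(k) = rel_k/R, the sum Σ prec@k · P(k) is just the mean of precision at the relevant ranks:

```python
rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
R = sum(rels)            # total relevant documents (all retrieved here)

hits, ap = 0, 0.0
for k, rel in enumerate(rels, start=1):
    hits += rel          # R_k
    if rel:
        ap += (hits / k) * (1 / R)   # prec@k times P(k) = rel_k / R
print(round(ap, 3))                  # 0.778
```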
22. Summary So Far
• Four ways to turn a sum over gains times discounts into an expectation over stopping ranks
  – M1, M2, M3, M4
• Four exemplar measures from the IR literature
  – RBP, DCG, ERR, AP
• Four stopping probability distributions
  – P_RBP, P_DCG, P_ERR, P_AP
  – Add two more:
    • P_RR(k) = 1/(k(k+1)), P_RRR(k) = 1/(R_k(R_k+1))
23. Stopping Probability Densities
[Figure: stopping probability densities P(k) and cumulative probabilities F(k) over ranks 1-25 for the six distributions:
P_RBP = (1-θ)θ^(k-1), F_RBP = θ^(k-1);
P_RR = 1/(k(k+1)), F_RR = 1/k;
P_DCG = 1/log2(k+1) - 1/log2(k+2), F_DCG = 1/log2(k+1);
P_ERR = rel_k θ^(R_k - 1)(1-θ);
P_RRR = 1/(R_k(R_k+1));
P_AP = rel_k / R]
24. From Models to Measures
• Six stopping probability distributions, four model families
• Mix and match to create up to 24 new measures
  – Many of these are uninteresting: isomorphic to precision/recall, or constant-valued
  – 15 turn out to be interesting
26. Some Brief Asides
• From geometric to reciprocal rank
  – Integrate the geometric density with respect to the parameter θ
  – The result is 1/(k(k+1))
  – The cumulative form is 1/k (worked out below)
• Normalization
  – Every measure in the M2 family must be normalized by its maximum possible value
  – Other measures may not fall between 0 and 1
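The first aside, worked out (integrating over a uniform prior on θ is my reading of "integrate w.r.t. the parameter"; the tail sum then telescopes exactly):

```latex
\int_0^1 (1-\theta)\,\theta^{k-1}\,d\theta
  = \frac{1}{k} - \frac{1}{k+1}
  = \frac{1}{k(k+1)} = P_{RR}(k),
\qquad
F_{RR}(k) = \sum_{i=k}^{\infty}\frac{1}{i(i+1)}
  = \sum_{i=k}^{\infty}\Bigl(\frac{1}{i}-\frac{1}{i+1}\Bigr)
  = \frac{1}{k}.
```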
27. Some Brief Asides
• Rank cut-offs
  – The DCG formulation only works for n going to infinity
  – In reality we usually calculate DCG@K for small K
  – This fits our user model if we make a worst-case assumption about the relevance of documents below rank K (see the sketch below)
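In code, the worst-case assumption just zeroes out everything below rank K, truncating the infinite sum (a minimal sketch):

```python
import math

def dcg_at_k(rels, K):
    """DCG@K: assume rel = 0 below rank K (worst case), so the
    infinite DCG sum truncates at K."""
    return sum(rel / math.log2(r + 1)
               for r, rel in enumerate(rels[:K], start=1))

print(round(dcg_at_k([1, 1, 0, 0, 1, 1, 0, 1, 0, 0], 5), 3))   # 2.018
```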
28. Analyzing Measures
• Some questions raised:
  – Are models based on utility better than models based on effort? (Hypothesis: no difference)
  – Are measures based on stopping probabilities better than measures based on viewing probabilities? (Hypothesis: the latter are more robust)
  – What properties should the stopping distribution have? (Hypothesis: fatter tail and static distributions are more robust)
29. How to Analyze Measures
• Many possible ways, none widely accepted:
  – How well they correlate with user satisfaction
  – How robust they are to changes in the underlying data
  – How good they are for optimizing systems
  – How informative they are
30. Fit to Click Logs
• How well does a stopping distribution fit empirical click probabilities?
  – A click does not mean the end of a search
  – But we need some model of the stopping point, and a click is a decent proxy
• A good fit may indicate a good stopping model (one way to fit is sketched below)
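One simple way to do such a fit, as a hedged sketch not taken from the talk: treat last-click ranks as stopping ranks and fit the geometric (RBP) model by maximum likelihood. For a geometric density (1-θ)θ^(k-1), the MLE is θ = 1 - 1/mean(k); the click data here is made up:

```python
def fit_geometric_theta(stop_ranks):
    """MLE of theta for P(k) = (1 - theta) * theta**(k - 1)."""
    mean_rank = sum(stop_ranks) / len(stop_ranks)
    return 1 - 1 / mean_rank

last_click_ranks = [1, 1, 2, 3, 1, 5, 2]                  # hypothetical sample
print(round(fit_geometric_theta(last_click_ranks), 3))    # 0.533
```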
31. Fit to Logged Clicks
[Figure: empirical click distribution vs. P_RBP = (1-θ)θ^(k-1), P_RR = 1/(k(k+1)), and P_DCG = 1/log2(k+1) - 1/log2(k+2); log-log plot of probability P(k) (1e-06 to 1e-02) against rank k (1 to 500)]
32. Robustness and Stability
• How robust is the measure to changes in the underlying test collection data?
  – If one of the following changes:
    • topic sample
    • relevance judgments
    • pool depth of judgments
  – how different are the decisions about relative system effectiveness?
33. Data
• Three test collections + evaluation data:
  – TREC-6 ad hoc: 50 topics, 72,270 judgments, 550,000-document corpus; 74 runs submitted to TREC
    • Second set of judgments from Waterloo
  – TREC 2006 Terabyte named page: 180 topics, 2,361 judgments, 25M-document corpus; 43 runs submitted to TREC
  – TREC 2009 Web ad hoc: 50 topics, 18,666 judgments, 500M-document corpus; 37 runs submitted to TREC
34. Experimental Methodology
• Pick some part of the collection to vary
  – e.g. judgments, topic sample size, pool depth
• Evaluate all submitted systems with TREC's gold standard data
• Evaluate all submitted systems with the modified data
• Compare the first evaluation to the second using Kendall's tau rank correlation (see the sketch below)
• Determine which properties are most robust
  – Model family, tail fatness, static/dynamic distribution
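A sketch of the comparison step (evaluate, runs, and the qrels arguments are hypothetical stand-ins; kendalltau is SciPy's implementation):

```python
from scipy.stats import kendalltau

def tau_between_evaluations(runs, evaluate, gold_qrels, modified_qrels):
    """Score every run under both judgment sets and correlate the two
    induced system orderings with Kendall's tau."""
    gold_scores = [evaluate(run, gold_qrels) for run in runs]
    mod_scores = [evaluate(run, modified_qrels) for run in runs]
    tau, _ = kendalltau(gold_scores, mod_scores)
    return tau   # 1.0 = identical ordering, 0.0 = uncorrelated
```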
35. Varying Assessments
• Compare evaluation with TREC's judgments to evaluation with Waterloo's judgments
[Table: Kendall's tau for each stopping distribution P(k) (static: P_RBP, P_DCG, P_RR; dynamic: P_ERR, P_AP, P_RRR) crossed with model families M1-M4; e.g. RBP = 0.813, DCG = 0.920, ERR = 0.829, AP = 0.896]
• Tentative conclusions:
  – M2 most robust, followed by M3 (after removing the AP outlier)
  – Fatter-tailed distributions more robust
  – Dynamic distributions a bit more robust than static
36. Varying Topic Sample Size
• Sample a subset of N topics from the original 50; evaluate systems over that set
[Figure: mean Kendall's tau (0.5-1.0) against number of topics (10-40) for M1-M4, grouped by tail fatness: fat tail: P_DCG, P_AP; medium tail: P_RR, P_RRR; slim tail: P_RBP, P_ERR]
37. Varying Pool Depth
• Take only judgments on documents appearing at ranks 1 to depth D in the submitted systems
  – D = 1, 2, 4, 8, 16, 32, 64
[Figure: mean Kendall's tau (0.5-1.0) against pool depth (1-64, log scale) for M1-M4]
38. Conclusions
• Fatter-tailed distributions are generally more robust
  – Maybe better for mitigating the risk of not satisfying tail users
• M2 (expected total utility; DCG) is generally more robust
  – But does it model users better?
• M3 (expected cost; ERR) is more robust than expected
• M4 (expected utility per cost; AP) is not as robust as expected
  – AP is an outlier with a very fat tail
• DCG may be based on a more realistic user model than commonly thought
39. Conclusions
• The gain-times-discount formulation conflates four distinct models of user behavior
• Teasing these apart allows us to test hypotheses about general properties of measures
• This is a conceptual framework: it organizes and describes measures in order to provide structure for reasoning about general properties
• Hopefully it will provide directions for future research on evaluation measures