Ben Carterette: Advances in Information Retrieval Evaluation
1. System Effectiveness, User Models, and User Utility: A Conceptual Framework for Investigation
Ben Carterette, University of Delaware
carteret@cis.udel.edu
2. Effectiveness Evaluation
• Determine how good the system is at finding and ranking relevant documents
• An effectiveness measure should be correlated with the user's experience
– Value increases when the user experience gets better; decreases when it gets worse
• Thus the interest in effectiveness measures based on explicit models of user interaction
– RBP [Moffat & Zobel], DCG [Järvelin & Kekäläinen], ERR [Chapelle et al.], EBU [Yilmaz et al.], sessions [Kanoulas et al.], etc.
3. Discounted Gain Model
• Simple model of user interaction:
– The user steps down the ranked results one by one
– Gains something from relevant documents
– Is increasingly less likely to see documents deeper in the ranking
• Implementation of the model:
– Gain is a function of relevance at rank k
– Ranks k are increasingly discounted
– Effectiveness = sum over ranks of gain times discount (sketched in code below)
• Most measures can be made to fit this framework
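As a minimal sketch (in Python, with illustrative names not from the talk), the discounted-gain template is just a sum over ranks; plugging in a binary gain and the 1/log2(k+1) discount recovers binary DCG:

```python
import math

def discounted_gain_score(rels, gain, discount):
    """Generic discounted-gain effectiveness: the sum over ranks k of
    gain(relevance at k) times discount(k). Ranks are 1-based."""
    return sum(gain(rel) * discount(k) for k, rel in enumerate(rels, start=1))

# Binary gain plus a logarithmic discount gives binary DCG:
rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
dcg = discounted_gain_score(rels, gain=lambda r: r,
                            discount=lambda k: 1 / math.log2(k + 1))
print(round(dcg, 3))  # 2.689
```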
4. Rank-Biased Precision [Moffat and Zobel, TOIS08]
[Figure: for the example query "black powder ammunition", the user walks down the ranked list (ranks 1, 2, 3, …); at each rank, toss a biased coin (θ): if HEADS, observe the next document; if TAILS, stop.]
8. Discounted Cumulative Gain [Järvelin and Kekäläinen, SIGIR00]
Example query: "black powder ammunition". Discount by rank: 1/log2(r+1); NDCG = DCG / optDCG.

Rank  Relevance  Gain  Discounted gain
 1    R          1     1.00
 2    R          1     0.63
 3    N          0     0
 4    N          0     0
 5    R          1     0.38
 6    R          1     0.35
 7    N          0     0
 8    R          1     0.31
 9    N          0     0
10    N          0     0
 …    …          …     …

DCG = 2.689; NDCG = 0.91
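A short sketch reproducing the worked example above (function names are mine, not the talk's); the ideal ranking used for normalization simply sorts the relevance vector:

```python
import math

def dcg(rels):
    """Binary DCG with the 1/log2(k+1) rank discount (1-based ranks)."""
    return sum(rel / math.log2(k + 1) for k, rel in enumerate(rels, start=1))

def ndcg(rels):
    """Normalize DCG by optDCG, the DCG of the ideal (sorted) ranking."""
    return dcg(rels) / dcg(sorted(rels, reverse=True))

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]  # the example ranking above
print(round(dcg(rels), 3))   # 2.689
print(round(ndcg(rels), 2))  # 0.91
```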
9. Discounted Cumulative Gain

DCG = \sum_{i=1}^{\infty} \frac{rel_i}{\log_2(1+i)}

[Figure: the same example ranking (R, R, N, N, R, R, N, R, N, N, …) with the per-rank discount 1/log2(1+i) shown as bars from 0.0 to 1.0.]
10. Expected Reciprocal Rank [Chapelle et al., CIKM09]
[Figure: for the query "black powder ammunition", the user repeatedly views the next item in the ranked list (ranks 1 through 10, …) until deciding to stop.]
11. Expected Reciprocal Rank
[Figure: at each viewed item the user asks "Relevant?" (highly / somewhat / no); the answer determines whether they view the next item or stop.]
12. Models of Browsing Behavior
• Position-based models: the chance of observing a document depends on the position of the document in the ranked list.
• Cascade models: the chance of observing a document depends on its position as well as the relevance of documents ranked above it.
13. A More Formal Model
• My claim: this implementation conflates at least four distinct models of user interaction
• Formalize it a bit:
– Change the rank discount to a stopping probability density P(k)
– Change the gain function to either a utility function or a cost function
• Then effectiveness = expected utility or cost over stopping points:

M = \sum_{k=1}^{\infty} f(k) P(k)
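In code, this expectation is a one-liner; a minimal sketch (my naming), truncating the infinite sum at a finite rank n:

```python
def expected_over_stopping(f, P, n=1000):
    """M = sum over k of f(k) * P(k): the expected value of the
    utility/cost function f at the user's stopping rank, with the
    infinite sum truncated at rank n."""
    return sum(f(k) * P(k) for k in range(1, n + 1))
```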
14. Our Framework
• The components of a measure are:
– a stopping rank probability P(k)
• position-based vs. cascade is a feature of this distribution
– a document utility model (binary relevance)
– a utility accumulation model or cost model
• We can test hypotheses about general properties of the stopping distribution and the utility/cost model
– Instead of trying to evaluate every possible measure on its own, evaluate properties of the measure
15. Model Families
• Depending on choices, we get four distinct families of user models
– Each family is characterized by its utility/cost model
– Within a family, freedom to choose P(k) and the document utility model
• Model 1: expected utility at stopping point
• Model 2: expected total utility
• Model 3: expected cost
• Model 4: expected total utility per unit cost
16. Model 1: Expected Utility at Stopping Point
• Exemplar: Rank-Biased Precision (RBP)

RBP = (1 - \theta) \sum_{k=1}^{\infty} rel_k \theta^{k-1} = \sum_{k=1}^{\infty} rel_k \theta^{k-1} (1 - \theta)

• Interpretation:
– P(k) = geometric density function
– f(k) = relevance of the document at the stopping rank
– Effectiveness = expected relevance at the stopping rank
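A sketch of RBP in this form (the parameter value is illustrative): the geometric density (1 − θ)θ^{k−1} weights the relevance at each candidate stopping rank:

```python
def rbp(rels, theta=0.8):
    """RBP = (1 - theta) * sum_k rel_k * theta^(k-1): expected relevance
    at the stopping rank when P(k) is geometric with parameter theta."""
    return (1 - theta) * sum(rel * theta ** (k - 1)
                             for k, rel in enumerate(rels, start=1))

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
print(round(rbp(rels, theta=0.8), 3))  # 0.549 for this ranking
```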
17. Model 2: Expected Total Utility
• Instead of stopping probability, think about viewing probability:

P(\text{view doc at } k) = \sum_{i=k}^{\infty} P(i) = F(k)

• This fits in the discounted gain model framework:

M = \sum_{k=1}^{\infty} rel_k F(k)

• Does it fit in the expected utility framework?
– Yes, and Discounted Cumulative Gain (DCG; Järvelin et al.) is the exemplar for this class
18. Model 2: Expected Total Utility

M = \sum_{k=1}^{\infty} rel_k F(k) = \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} P(i)
  = \sum_{k=1}^{\infty} P(k) \sum_{i=1}^{k} rel_i = \sum_{k=1}^{\infty} R_k P(k)

• f(k) = R_k (total summed relevance through rank k)
• Let F_DCG(k) = 1/log2(k+1)
– Then P_DCG(k) = F_DCG(k) − F_DCG(k+1) = 1/log2(k+1) − 1/log2(k+2)
• Work the algebra backwards to show that you get binary-relevance DCG (if summing to infinity); a numeric check follows
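The check below (my construction) confirms the identity on the running example: summing R_k · P_DCG(k) over the ranked list and adding back the truncation term R_n · F_DCG(n+1) reproduces the discounted-gain form exactly, which is why the equivalence only holds outright when summing to infinity:

```python
import math

def F_dcg(k):  # viewing probability implied by the DCG discount
    return 1 / math.log2(k + 1)

def P_dcg(k):  # stopping density: F(k) - F(k+1)
    return F_dcg(k) - F_dcg(k + 1)

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
n = len(rels)

# M2 form: expected total utility R_k at the stopping rank.
R, m2 = 0, 0.0
for k, rel in enumerate(rels, start=1):
    R += rel
    m2 += R * P_dcg(k)

# Discounted-gain form: sum of rel_k * F(k).
dcg = sum(rel * F_dcg(k) for k, rel in enumerate(rels, start=1))

# Telescoping makes the two agree up to the truncation term R_n * F(n+1):
print(round(m2 + R * F_dcg(n + 1), 3), round(dcg, 3))  # 2.689 2.689
```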
19. Model 3: Expected Cost
• The user stops with probability based on accumulated utility rather than rank alone
– P(k) = P(R_k) if the document at rank k is relevant, 0 otherwise
• Then use f(k) to model the cost of going to rank k
• Exemplar measure: Expected Reciprocal Rank (ERR; Chapelle et al.), with binary relevance:
– P(k) = rel_k \cdot \theta^{R_k - 1} (1 - \theta)
– 1/cost = f(k) = 1/k
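A sketch of binary-relevance ERR under this reading of the slide (my interpretation: θ is the probability of continuing past a relevant document, so the user stops at a relevant document with probability 1 − θ):

```python
def err_binary(rels, theta=0.5):
    """Binary ERR as expected reciprocal rank under the stopping model
    P(k) = rel_k * theta^(R_k - 1) * (1 - theta), where R_k counts the
    relevant documents seen through rank k."""
    R, score = 0, 0.0
    for k, rel in enumerate(rels, start=1):
        if rel:
            R += 1
            score += (1 / k) * theta ** (R - 1) * (1 - theta)
    return score

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
print(round(err_binary(rels, theta=0.5), 3))  # 0.664
```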
20. Model 4: Expected Utility per Unit Cost
• The user considers the expected effort of further browsing after each relevant document:

M = \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} f(i) P(i)

• Similar to the M2 family, manipulate algebraically:

\sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} f(i) P(i) = \sum_{k=1}^{\infty} f(k) P(k) \sum_{i=1}^{k} rel_i = \sum_{k=1}^{\infty} f(k) R_k P(k)
21. Model 4: Expected Utility per Unit Cost
• When f(k) = 1/k, we get:

M = \sum_{k=1}^{\infty} \mathrm{prec@}k \cdot P(k)

• Average Precision (AP) is the exemplar for this class:
– P(k) = rel_k / R
– utility/cost = f(k) = prec@k
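The corresponding sketch for AP (assuming, for simplicity, that all R relevant documents appear in the ranking):

```python
def average_precision(rels):
    """AP = sum_k prec@k * P(k) with P(k) = rel_k / R: expected precision
    at a stopping rank drawn uniformly over the R relevant documents.
    Assumes all R relevant documents appear in the ranking."""
    R = sum(rels)
    hits, ap = 0, 0.0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += (hits / k) / R  # prec@k times P(k) = 1/R
    return ap

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
print(round(average_precision(rels), 3))  # 0.778
```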
22. Summary So Far
• Four ways to turn a sum over gains times discounts into an expectation over stopping ranks
– M1, M2, M3, M4
• Four exemplar measures from the IR literature
– RBP, DCG, ERR, AP
• Four stopping probability distributions
– P_RBP, P_DCG, P_ERR, P_AP
– Add two more:
• P_RR(k) = 1/(k(k+1)), P_RRR(k) = 1/(R_k(R_k+1))
23. Stopping Probability Densities
[Figure: the stopping densities P(k) (axis 0.0 to 0.5) and their cumulative forms F(k) (axis 0.0 to 1.0) plotted against rank 1 to 25 for the six distributions, including P_RBP = (1 − θ)θ^{k−1} with F_RBP = θ^{k−1}, P_RR = 1/(k(k+1)) with F_RR = 1/k, and P_DCG = 1/log2(k+1) − 1/log2(k+2) with F_DCG = 1/log2(k+1); the dynamic densities P_ERR, P_RRR, and P_AP depend on the accumulated relevance R_k.]
24. From Models to Measures
• Six stopping probability distributions, four model families
• Mix and match to create up to 24 new measures
– Many of these are uninteresting: isomorphic to precision/recall, or constant-valued
– 15 turn out to be interesting
26. Some Brief Asides
• From geometric to reciprocal rank (the integration is worked below):
– Integrate the geometric density with respect to the parameter θ
– The result is 1/(k(k+1))
– The cumulative form telescopes to exactly 1/k
• Normalization
– Every measure in the M2 family must be normalized by its maximum possible value
– Other measures may not fall between 0 and 1
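Spelling out the integration step named in the first aside, together with the tail sum:

```latex
\int_0^1 (1-\theta)\,\theta^{k-1}\,d\theta
  = \frac{1}{k} - \frac{1}{k+1}
  = \frac{1}{k(k+1)},
\qquad
F_{\mathrm{RR}}(k)
  = \sum_{i=k}^{\infty} \frac{1}{i(i+1)}
  = \sum_{i=k}^{\infty}\Bigl(\frac{1}{i} - \frac{1}{i+1}\Bigr)
  = \frac{1}{k}.
```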
27. Some Brief Asides
• Rank cut-offs
– The DCG formulation only works for n going to infinity
– In reality we usually calculate DCG@K for small K
– This fits our user model if we make a worst-case assumption about the relevance of documents below rank K
28. Analyzing Measures
• Some questions raised:
– Are models based on utility better than models based on effort? (Hypothesis: no difference)
– Are measures based on stopping probabilities better than measures based on viewing probabilities? (Hypothesis: the latter are more robust)
– What properties should the stopping distribution have? (Hypothesis: fatter tail, static more robust)
29. How to Analyze Measures
• Many possible ways, none widely accepted:
– How well they correlate with user satisfaction
– How robust they are to changes in the underlying data
– How good they are for optimizing systems
– How informative they are
30. Fit to Click Logs
• How well does a stopping distribution fit empirical click probabilities?
– A click does not mean the end of a search
– But we need some model of the stopping point, and a click is a decent proxy
• A good fit may indicate a good stopping model (log-likelihood sketch below)
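One way to operationalize this, sketched with made-up click data (the densities are the ones from slide 23), is to compare candidate stopping densities by the log-likelihood they assign to observed last-click ranks:

```python
import math

def p_rbp(k, theta=0.8):
    """Geometric (RBP) stopping density."""
    return (1 - theta) * theta ** (k - 1)

def p_rr(k):
    """Reciprocal-rank stopping density."""
    return 1 / (k * (k + 1))

def log_likelihood(click_ranks, density):
    """Total log-probability of observed last-click ranks under a
    candidate stopping density; higher means a better fit."""
    return sum(math.log(density(k)) for k in click_ranks)

clicks = [1, 1, 2, 1, 3, 1, 2, 5, 1, 10]  # hypothetical last-click ranks
for name, p in [("RBP", p_rbp), ("RR", p_rr)]:
    print(name, round(log_likelihood(clicks, p), 2))
```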
31. Fit to Logged Clicks
[Figure: log-log plot of the empirical click distribution against the stopping densities P_RBP = (1 − θ)θ^{k−1}, P_RR = 1/(k(k+1)), and P_DCG = 1/log2(k+1) − 1/log2(k+2); probability P(k) from 1e-06 to 1e-02, rank k from 1 to 500.]
32. Robustness and Stability
• How robust is the measure to changes in the underlying test collection data?
– If one of the following changes:
• topic sample
• relevance judgments
• pool depth of judgments
– how different are the decisions about relative system effectiveness?
33. Data
• Three test collections + evaluation data:
– TREC-6 ad hoc: 50 topics, 72,270 judgments, 550,000-document corpus; 74 runs submitted to TREC
• Second set of judgments from Waterloo
– TREC 2006 Terabyte named page: 180 topics, 2,361 judgments, 25M-document corpus; 43 runs submitted to TREC
– TREC 2009 Web ad hoc: 50 topics, 18,666 judgments, 500M-document corpus; 37 runs submitted to TREC
34. Experimental Methodology
• Pick some part of the collection to vary
– e.g. judgments, topic sample size, pool depth
• Evaluate all submitted systems with TREC's gold-standard data
• Evaluate all submitted systems with the modified data
• Compare the first evaluation to the second using Kendall's tau rank correlation (sketched below)
• Determine which properties are most robust
– Model family, tail fatness, static/dynamic distribution
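The comparison step, sketched with hypothetical per-system scores (scipy.stats.kendalltau does the rank correlation):

```python
from scipy.stats import kendalltau

# Hypothetical mean effectiveness of each submitted run under the
# gold-standard data and under the modified data:
gold_scores     = [0.31, 0.28, 0.27, 0.22, 0.19, 0.15]
modified_scores = [0.30, 0.29, 0.23, 0.25, 0.17, 0.16]

# Tau near 1 means the modification barely changed the relative
# ordering of systems; tau near 0 means the rankings disagree.
tau, p_value = kendalltau(gold_scores, modified_scores)
print(round(tau, 3))  # 0.867
```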
35. Varying Assessments
• Compare evaluation with TREC's judgments to evaluation with Waterloo's
[Table: Kendall's tau between the two evaluations, by stopping distribution P(k) (rows: static P_RBP, P_DCG, P_RR; dynamic P_ERR, P_AP, P_RRR) and model family (columns M1–M4). Recoverable entries: P_RBP row: RBP = 0.813 (M1), RBTR = 0.816 (M2), RBAP = 0.801 (M4), mean 0.810; P_DCG row: 0.831 (M1), DCG = 0.920 (M2), DAG = 0.819 (M4), mean 0.857; P_RR row: RRG = 0.819, mean 0.830; P_ERR row: ERR = 0.829 (M3), EPR = 0.836 (M4), mean 0.833; P_AP row: ARR = 0.847, AP = 0.896 (M4), mean 0.872; P_RRR row: RRAP = 0.844 (M4), mean 0.835; column means: 0.821 (M1), 0.865 (M2), 0.834 (M3), 0.835 (M4).]
• Tentative conclusions:
– M2 most robust, followed by M3 (after removing the AP outlier)
– Fatter-tailed distributions more robust
– Dynamic a bit more robust than static
36. Varying Topic Sample Size
• Sample a subset of N topics from the original 50; evaluate systems over that set
[Figure: mean Kendall's tau (0.5 to 1.0) against number of topics (10 to 40), one line per model family M1–M4; tail legend: fat tail: P_DCG, P_AP; medium tail: P_RR, P_RRR; slim tail: P_RBP, P_ERR.]
37. Varying Pool Depth
• Take only judgments on documents appearing at ranks 1 to depth D in submitted systems
– D = 1, 2, 4, 8, 16, 32, 64
[Figure: mean Kendall's tau (0.5 to 1.0) against pool depth (1 to 50, log scale), one line per model family M1–M4.]
38. Conclusions
• Fatter-tailed distributions are generally more robust
– Maybe better for mitigating the risk of not satisfying tail users
• M2 (expected total utility; DCG) is generally more robust
– But does it model users better?
• M3 (expected cost; ERR) is more robust than expected
• M4 (expected utility per cost; AP) is not as robust as expected
– AP is an outlier with a very fat tail
• DCG may be based on a more realistic user model than commonly thought
39. Conclusions
• The gain-times-discount formulation conflates four distinct models of user behavior
• Teasing these apart allows us to test hypotheses about general properties of measures
• This is a conceptual framework: it organizes and describes measures in order to provide structure for reasoning about general properties
• Hopefully it will provide directions for future research on evaluation measures