Ben Carterette: Advances in Information Retrieval Evaluation
1. System Effectiveness, User Models, and User Utility: A Conceptual Framework for Investigation
Ben Carterette, University of Delaware
carteret@cis.udel.edu
2. Effectiveness Evaluation
• Determine how good the system is at finding and ranking relevant documents
• An effectiveness measure should be correlated with the user's experience
– Value increases when the user experience gets better; decreases when it gets worse
• Thus the interest in effectiveness measures based on explicit models of user interaction
– RBP [Moffat & Zobel], DCG [Järvelin & Kekäläinen], ERR [Chapelle et al.], EBU [Yilmaz et al.], sessions [Kanoulas et al.], etc.
3. Discounted Gain Model
• Simple model of user interaction:
– The user steps down the ranked results one by one
– Gains something from relevant documents
– Is increasingly less likely to see documents deeper in the ranking
• Implementation of the model:
– Gain is a function of relevance at rank k
– Ranks k are increasingly discounted
– Effectiveness = sum over ranks of gain times discount (sketched in code below)
• Most measures can be made to fit this framework
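As a minimal sketch (in Python, with illustrative names not from the talk), the discounted-gain template is just a sum over ranks; plugging in a binary gain and the 1/log2(k+1) discount recovers binary DCG:

```python
import math

def discounted_gain_score(rels, gain, discount):
    """Generic discounted-gain effectiveness: the sum over ranks k of
    gain(relevance at k) times discount(k). Ranks are 1-based."""
    return sum(gain(rel) * discount(k) for k, rel in enumerate(rels, start=1))

# Binary gain plus a logarithmic discount gives binary DCG:
rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
dcg = discounted_gain_score(rels, gain=lambda r: r,
                            discount=lambda k: 1 / math.log2(k + 1))
print(round(dcg, 3))  # 2.689
```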
4. Rank-Biased Precision [Moffat and Zobel, TOIS08]
[Figure: for the example query "black powder ammunition", the user walks down the ranked list (ranks 1, 2, 3, …); at each rank, toss a biased coin (θ): if HEADS, observe the next document; if TAILS, stop.]
8. Discounted Cumulative Gain [Järvelin and Kekäläinen, SIGIR00]
Example query: "black powder ammunition". Discount by rank: 1/log2(r+1); NDCG = DCG / optDCG.

Rank  Relevance  Gain  Discounted gain
 1    R          1     1.00
 2    R          1     0.63
 3    N          0     0
 4    N          0     0
 5    R          1     0.38
 6    R          1     0.35
 7    N          0     0
 8    R          1     0.31
 9    N          0     0
10    N          0     0
 …    …          …     …

DCG = 2.689; NDCG = 0.91
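A short sketch reproducing the worked example above (function names are mine, not the talk's); the ideal ranking used for normalization simply sorts the relevance vector:

```python
import math

def dcg(rels):
    """Binary DCG with the 1/log2(k+1) rank discount (1-based ranks)."""
    return sum(rel / math.log2(k + 1) for k, rel in enumerate(rels, start=1))

def ndcg(rels):
    """Normalize DCG by optDCG, the DCG of the ideal (sorted) ranking."""
    return dcg(rels) / dcg(sorted(rels, reverse=True))

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]  # the example ranking above
print(round(dcg(rels), 3))   # 2.689
print(round(ndcg(rels), 2))  # 0.91
```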
9. Discounted Cumulative Gain

DCG = \sum_{i=1}^{\infty} \frac{rel_i}{\log_2(1+i)}

[Figure: the same example ranking (R, R, N, N, R, R, N, R, N, N, …) with the per-rank discount 1/log2(1+i) shown as bars from 0.0 to 1.0.]
10. Expected Reciprocal Rank [Chapelle et al., CIKM09]
[Figure: for the query "black powder ammunition", the user repeatedly views the next item in the ranked list (ranks 1 through 10, …) until deciding to stop.]
11. Expected Reciprocal Rank
[Figure: at each viewed item the user asks "Relevant?" (highly / somewhat / no); the answer determines whether they view the next item or stop.]
12. Models of Browsing Behavior
• Position-based models: the chance of observing a document depends on the position of the document in the ranked list.
• Cascade models: the chance of observing a document depends on its position as well as the relevance of documents ranked above it.
13. A More Formal Model
• My claim: this implementation conflates at least four distinct models of user interaction
• Formalize it a bit:
– Change the rank discount to a stopping probability density P(k)
– Change the gain function to either a utility function or a cost function
• Then effectiveness = expected utility or cost over stopping points:

M = \sum_{k=1}^{\infty} f(k) P(k)
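In code, this expectation is a one-liner; a minimal sketch (my naming), truncating the infinite sum at a finite rank n:

```python
def expected_over_stopping(f, P, n=1000):
    """M = sum over k of f(k) * P(k): the expected value of the
    utility/cost function f at the user's stopping rank, with the
    infinite sum truncated at rank n."""
    return sum(f(k) * P(k) for k in range(1, n + 1))
```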
14. Our Framework
• The components of a measure are:
– a stopping rank probability P(k)
• position-based vs. cascade is a feature of this distribution
– a document utility model (binary relevance)
– a utility accumulation model or cost model
• We can test hypotheses about general properties of the stopping distribution and the utility/cost model
– Instead of trying to evaluate every possible measure on its own, evaluate properties of the measure
15. Model Families
• Depending on choices, we get four distinct families of user models
– Each family is characterized by its utility/cost model
– Within a family, freedom to choose P(k) and the document utility model
• Model 1: expected utility at stopping point
• Model 2: expected total utility
• Model 3: expected cost
• Model 4: expected total utility per unit cost
16. Model 1: Expected Utility at Stopping Point
• Exemplar: Rank-Biased Precision (RBP)

RBP = (1 - \theta) \sum_{k=1}^{\infty} rel_k \theta^{k-1} = \sum_{k=1}^{\infty} rel_k \theta^{k-1} (1 - \theta)

• Interpretation:
– P(k) = geometric density function
– f(k) = relevance of the document at the stopping rank
– Effectiveness = expected relevance at the stopping rank
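A sketch of RBP in this form (the parameter value is illustrative): the geometric density (1 − θ)θ^{k−1} weights the relevance at each candidate stopping rank:

```python
def rbp(rels, theta=0.8):
    """RBP = (1 - theta) * sum_k rel_k * theta^(k-1): expected relevance
    at the stopping rank when P(k) is geometric with parameter theta."""
    return (1 - theta) * sum(rel * theta ** (k - 1)
                             for k, rel in enumerate(rels, start=1))

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
print(round(rbp(rels, theta=0.8), 3))  # 0.549 for this ranking
```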
17. Model 2: Expected Total Utility
• Instead of stopping probability, think about viewing probability:

P(\text{view doc at } k) = \sum_{i=k}^{\infty} P(i) = F(k)

• This fits in the discounted gain model framework:

M = \sum_{k=1}^{\infty} rel_k F(k)

• Does it fit in the expected utility framework?
– Yes, and Discounted Cumulative Gain (DCG; Järvelin et al.) is the exemplar for this class
18. Model 2: Expected Total Utility

M = \sum_{k=1}^{\infty} rel_k F(k) = \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} P(i)
  = \sum_{k=1}^{\infty} P(k) \sum_{i=1}^{k} rel_i = \sum_{k=1}^{\infty} R_k P(k)

• f(k) = R_k (total summed relevance through rank k)
• Let F_DCG(k) = 1/log2(k+1)
– Then P_DCG(k) = F_DCG(k) − F_DCG(k+1) = 1/log2(k+1) − 1/log2(k+2)
• Work the algebra backwards to show that you get binary-relevance DCG (if summing to infinity); a numeric check follows
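The check below (my construction) confirms the identity on the running example: summing R_k · P_DCG(k) over the ranked list and adding back the truncation term R_n · F_DCG(n+1) reproduces the discounted-gain form exactly, which is why the equivalence only holds outright when summing to infinity:

```python
import math

def F_dcg(k):  # viewing probability implied by the DCG discount
    return 1 / math.log2(k + 1)

def P_dcg(k):  # stopping density: F(k) - F(k+1)
    return F_dcg(k) - F_dcg(k + 1)

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
n = len(rels)

# M2 form: expected total utility R_k at the stopping rank.
R, m2 = 0, 0.0
for k, rel in enumerate(rels, start=1):
    R += rel
    m2 += R * P_dcg(k)

# Discounted-gain form: sum of rel_k * F(k).
dcg = sum(rel * F_dcg(k) for k, rel in enumerate(rels, start=1))

# Telescoping makes the two agree up to the truncation term R_n * F(n+1):
print(round(m2 + R * F_dcg(n + 1), 3), round(dcg, 3))  # 2.689 2.689
```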
19. Model 3: Expected Cost
• The user stops with probability based on accumulated utility rather than rank alone
– P(k) = P(R_k) if the document at rank k is relevant, 0 otherwise
• Then use f(k) to model the cost of going to rank k
• Exemplar measure: Expected Reciprocal Rank (ERR; Chapelle et al.), with binary relevance:
– P(k) = rel_k \cdot \theta^{R_k - 1} (1 - \theta)
– 1/cost = f(k) = 1/k
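A sketch of binary-relevance ERR under this reading of the slide (my interpretation: θ is the probability of continuing past a relevant document, so the user stops at a relevant document with probability 1 − θ):

```python
def err_binary(rels, theta=0.5):
    """Binary ERR as expected reciprocal rank under the stopping model
    P(k) = rel_k * theta^(R_k - 1) * (1 - theta), where R_k counts the
    relevant documents seen through rank k."""
    R, score = 0, 0.0
    for k, rel in enumerate(rels, start=1):
        if rel:
            R += 1
            score += (1 / k) * theta ** (R - 1) * (1 - theta)
    return score

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
print(round(err_binary(rels, theta=0.5), 3))  # 0.664
```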
20. Model 4: Expected Utility per Unit Cost
• The user considers the expected effort of further browsing after each relevant document:

M = \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} f(i) P(i)

• Similar to the M2 family, manipulate algebraically:

\sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} f(i) P(i) = \sum_{k=1}^{\infty} f(k) P(k) \sum_{i=1}^{k} rel_i = \sum_{k=1}^{\infty} f(k) R_k P(k)
21. Model 4: Expected Utility per Unit Cost
• When f(k) = 1/k, we get:

M = \sum_{k=1}^{\infty} \mathrm{prec@}k \cdot P(k)

• Average Precision (AP) is the exemplar for this class:
– P(k) = rel_k / R
– utility/cost = f(k) = prec@k
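The corresponding sketch for AP (assuming, for simplicity, that all R relevant documents appear in the ranking):

```python
def average_precision(rels):
    """AP = sum_k prec@k * P(k) with P(k) = rel_k / R: expected precision
    at a stopping rank drawn uniformly over the R relevant documents.
    Assumes all R relevant documents appear in the ranking."""
    R = sum(rels)
    hits, ap = 0, 0.0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += (hits / k) / R  # prec@k times P(k) = 1/R
    return ap

rels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
print(round(average_precision(rels), 3))  # 0.778
```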
22. Summary So Far
• Four ways to turn a sum over gains times discounts into an expectation over stopping ranks
– M1, M2, M3, M4
• Four exemplar measures from the IR literature
– RBP, DCG, ERR, AP
• Four stopping probability distributions
– P_RBP, P_DCG, P_ERR, P_AP
– Add two more:
• P_RR(k) = 1/(k(k+1)), P_RRR(k) = 1/(R_k(R_k+1))
23. Stopping Probability Densities
[Figure: the stopping densities P(k) (axis 0.0 to 0.5) and their cumulative forms F(k) (axis 0.0 to 1.0) plotted against rank 1 to 25 for the six distributions, including P_RBP = (1 − θ)θ^{k−1} with F_RBP = θ^{k−1}, P_RR = 1/(k(k+1)) with F_RR = 1/k, and P_DCG = 1/log2(k+1) − 1/log2(k+2) with F_DCG = 1/log2(k+1); the dynamic densities P_ERR, P_RRR, and P_AP depend on the accumulated relevance R_k.]
24. From Models to Measures
• Six stopping probability distributions, four model families
• Mix and match to create up to 24 new measures
– Many of these are uninteresting: isomorphic to precision/recall, or constant-valued
– 15 turn out to be interesting
26. Some Brief Asides
• From geometric to reciprocal rank (the integration is worked below):
– Integrate the geometric density with respect to the parameter θ
– The result is 1/(k(k+1))
– The cumulative form telescopes to exactly 1/k
• Normalization
– Every measure in the M2 family must be normalized by its maximum possible value
– Other measures may not fall between 0 and 1
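Spelling out the integration step named in the first aside, together with the tail sum:

```latex
\int_0^1 (1-\theta)\,\theta^{k-1}\,d\theta
  = \frac{1}{k} - \frac{1}{k+1}
  = \frac{1}{k(k+1)},
\qquad
F_{\mathrm{RR}}(k)
  = \sum_{i=k}^{\infty} \frac{1}{i(i+1)}
  = \sum_{i=k}^{\infty}\Bigl(\frac{1}{i} - \frac{1}{i+1}\Bigr)
  = \frac{1}{k}.
```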
27. Some Brief Asides
• Rank cut-offs
– The DCG formulation only works for n going to infinity
– In reality we usually calculate DCG@K for small K
– This fits our user model if we make a worst-case assumption about the relevance of documents below rank K
28. Analyzing Measures
• Some questions raised:
– Are models based on utility better than models based on effort? (Hypothesis: no difference)
– Are measures based on stopping probabilities better than measures based on viewing probabilities? (Hypothesis: the latter are more robust)
– What properties should the stopping distribution have? (Hypothesis: fatter tail, static more robust)
29. How to Analyze Measures
• Many possible ways, none widely accepted:
– How well they correlate with user satisfaction
– How robust they are to changes in the underlying data
– How good they are for optimizing systems
– How informative they are
30. Fit to Click Logs
• How well does a stopping distribution fit empirical click probabilities?
– A click does not mean the end of a search
– But we need some model of the stopping point, and a click is a decent proxy
• A good fit may indicate a good stopping model (log-likelihood sketch below)
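One way to operationalize this, sketched with made-up click data (the densities are the ones from slide 23), is to compare candidate stopping densities by the log-likelihood they assign to observed last-click ranks:

```python
import math

def p_rbp(k, theta=0.8):
    """Geometric (RBP) stopping density."""
    return (1 - theta) * theta ** (k - 1)

def p_rr(k):
    """Reciprocal-rank stopping density."""
    return 1 / (k * (k + 1))

def log_likelihood(click_ranks, density):
    """Total log-probability of observed last-click ranks under a
    candidate stopping density; higher means a better fit."""
    return sum(math.log(density(k)) for k in click_ranks)

clicks = [1, 1, 2, 1, 3, 1, 2, 5, 1, 10]  # hypothetical last-click ranks
for name, p in [("RBP", p_rbp), ("RR", p_rr)]:
    print(name, round(log_likelihood(clicks, p), 2))
```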
31. Fit to Logged Clicks
[Figure: log-log plot of the empirical click distribution against the stopping densities P_RBP = (1 − θ)θ^{k−1}, P_RR = 1/(k(k+1)), and P_DCG = 1/log2(k+1) − 1/log2(k+2); probability P(k) from 1e-06 to 1e-02, rank k from 1 to 500.]
32. Robustness and Stability
• How robust is the measure to changes in the underlying test collection data?
– If one of the following changes:
• topic sample
• relevance judgments
• pool depth of judgments
– how different are the decisions about relative system effectiveness?
33. Data
• Three test collections + evaluation data:
– TREC-6 ad hoc: 50 topics, 72,270 judgments, 550,000-document corpus; 74 runs submitted to TREC
• Second set of judgments from Waterloo
– TREC 2006 Terabyte named page: 180 topics, 2,361 judgments, 25M-document corpus; 43 runs submitted to TREC
– TREC 2009 Web ad hoc: 50 topics, 18,666 judgments, 500M-document corpus; 37 runs submitted to TREC
34. Experimental Methodology
• Pick some part of the collection to vary
– e.g. judgments, topic sample size, pool depth
• Evaluate all submitted systems with TREC's gold-standard data
• Evaluate all submitted systems with the modified data
• Compare the first evaluation to the second using Kendall's tau rank correlation (sketched below)
• Determine which properties are most robust
– Model family, tail fatness, static/dynamic distribution
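The comparison step, sketched with hypothetical per-system scores (scipy.stats.kendalltau does the rank correlation):

```python
from scipy.stats import kendalltau

# Hypothetical mean effectiveness of each submitted run under the
# gold-standard data and under the modified data:
gold_scores     = [0.31, 0.28, 0.27, 0.22, 0.19, 0.15]
modified_scores = [0.30, 0.29, 0.23, 0.25, 0.17, 0.16]

# Tau near 1 means the modification barely changed the relative
# ordering of systems; tau near 0 means the rankings disagree.
tau, p_value = kendalltau(gold_scores, modified_scores)
print(round(tau, 3))  # 0.867
```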
35. Varying Assessments
• Compare evaluation with TREC's judgments to evaluation with Waterloo's
[Table: Kendall's tau between the two evaluations, by stopping distribution P(k) (rows: static P_RBP, P_DCG, P_RR; dynamic P_ERR, P_AP, P_RRR) and model family (columns M1–M4). Recoverable entries: P_RBP row: RBP = 0.813 (M1), RBTR = 0.816 (M2), RBAP = 0.801 (M4), mean 0.810; P_DCG row: 0.831 (M1), DCG = 0.920 (M2), DAG = 0.819 (M4), mean 0.857; P_RR row: RRG = 0.819, mean 0.830; P_ERR row: ERR = 0.829 (M3), EPR = 0.836 (M4), mean 0.833; P_AP row: ARR = 0.847, AP = 0.896 (M4), mean 0.872; P_RRR row: RRAP = 0.844 (M4), mean 0.835; column means: 0.821 (M1), 0.865 (M2), 0.834 (M3), 0.835 (M4).]
• Tentative conclusions:
– M2 most robust, followed by M3 (after removing the AP outlier)
– Fatter-tailed distributions more robust
– Dynamic a bit more robust than static
36. Varying Topic Sample Size
• Sample a subset of N topics from the original 50; evaluate systems over that set
[Figure: mean Kendall's tau (0.5 to 1.0) against number of topics (10 to 40), one line per model family M1–M4; tail legend: fat tail: P_DCG, P_AP; medium tail: P_RR, P_RRR; slim tail: P_RBP, P_ERR.]
37. Varying Pool Depth
• Take only judgments on documents appearing at ranks 1 to depth D in submitted systems
– D = 1, 2, 4, 8, 16, 32, 64
[Figure: mean Kendall's tau (0.5 to 1.0) against pool depth (1 to 50, log scale), one line per model family M1–M4.]
38. Conclusions
• Fatter-tailed distributions are generally more robust
– Maybe better for mitigating the risk of not satisfying tail users
• M2 (expected total utility; DCG) is generally more robust
– But does it model users better?
• M3 (expected cost; ERR) is more robust than expected
• M4 (expected utility per cost; AP) is not as robust as expected
– AP is an outlier with a very fat tail
• DCG may be based on a more realistic user model than commonly thought
39. Conclusions
• The gain-times-discount formulation conflates four distinct models of user behavior
• Teasing these apart allows us to test hypotheses about general properties of measures
• This is a conceptual framework: it organizes and describes measures in order to provide structure for reasoning about general properties
• Hopefully it will provide directions for future research on evaluation measures