The evaluation of recommender systems is crucial for their development. In today's recommendation landscape there are many standardized recommendation algorithms and approaches; however, there exists no standardized method for the experimental setup of evaluation -- not even for widely used measures such as precision and root-mean-squared error. This creates a setting where the comparison of recommendation results on the same datasets becomes problematic. In this paper, we propose an evaluation protocol specifically developed with the recommendation use case in mind, i.e. the recommendation of one or several items to an end user. The protocol attempts to closely mimic the scenario of a deployed (production) recommendation system, taking specific user aspects into consideration and allowing a comparison of small- and large-scale recommendation systems. The protocol is evaluated on common recommendation datasets and compared to traditional recommendation settings found in the research literature. Our results show that the proposed model captures the quality of a recommender system better than traditional evaluation does, and is not affected by characteristics of the data (e.g. size, sparsity, etc.).
A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems
1. A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems
Alan Said, Alejandro Bellogín, Arjen De Vries
CWI
@alansaid, @abellogin, @arjenpdevries
2. Outline
• Evaluation
  – Real world
  – Offline
• Not algorithmic comparison!
• Comparison of evaluation
• Protocol
• Experiments & Results
• Conclusions
4. Evaluation
• Does p@10 in [Smith, 2010a] measure the same quality as p@10 in [Smith, 2012b]?
  – Even if it does:
    • Is the underlying data the same?
    • Was cross-validation performed similarly?
    • etc.
5. Evaluation
• What metrics should we use?
• How should we evaluate?
  – Relevance criteria for test items
  – Cross-validation (n-fold, random)
• Should all users and items be treated the same way?
  – Do certain users and items reflect different evaluation qualities?
6. Offline Evaluation
Recommender system accuracy evaluation is currently based on methods from IR/ML (sketched in code below):
  – One training set
  – One test set
  – (One validation set)
  – Algorithms are trained on the training set
  – Evaluate using metric@N (e.g. p@N, where N is a page size)
    • Even when N is larger than the number of test items
    • p@N = 1.0 is (almost) impossible
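A minimal sketch of this standard offline setup, under stated assumptions: ratings come as (user, item, rating) triples, the split is a random 80%-20% one, and items rated at least 3 count as relevant. The function names and the relevance cutoff are illustrative assumptions, not the paper's exact procedure. It shows why p@N = 1.0 is nearly unreachable: whenever a user has fewer than N relevant items in the test set, the numerator can never reach N.

```python
import random
from collections import defaultdict

def precision_at_n(recommended, relevant, n):
    """Standard p@N: fraction of the top-N recommendations that are relevant."""
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / n  # capped below 1.0 whenever len(relevant) < n

def random_split(ratings, test_ratio=0.2, seed=42):
    """Plain 80%-20% split of (user, item, rating) triples, ignoring users and time."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def evaluate_baseline(recommend, ratings, n=10, relevance_cutoff=3.0):
    """One global train/test split, one run, p@N averaged over test users."""
    train, test = random_split(ratings)
    relevant_by_user = defaultdict(set)
    for user, item, rating in test:
        if rating >= relevance_cutoff:
            relevant_by_user[user].add(item)
    scores = []
    for user, relevant in relevant_by_user.items():
        recommended = recommend(user, train, n)  # any recommender given the global training set
        scores.append(precision_at_n(recommended, relevant, n))
    return sum(scores) / len(scores) if scores else 0.0
```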
7. Evaluation in production
• One dynamic training set
  – All of the available data at a certain point in time
  – Continuously updated
• No test set
  – Only live user interactions
    • Clicked/purchased items are good recommendations
Can we simulate this offline?
8. Evaluation Protocol
• Based on "real world" concepts
• Uses as much available data as possible
• Trains algorithms once per user and evaluation setting (e.g. N)
• Evaluates p@N when there are exactly N correct items in the test set
  – possible p@N = 1 (gold standard)
9. Evaluation Protocol
Three concepts:
1. Personalized training & test sets
  – Use all available information about the system for the candidate user
  – Different test/training sets for different levels of N
2. Candidate item selection (items in test sets)
  – Only "good" items go in test sets (no random 80%-20% splits)
  – How "good" an item is depends on each user's personal preference
3. Candidate user selection (users in test sets)
  – Candidate users must have items in the training set
  – When evaluating p@N, each user in the test set should have N items in the test set
    • Effectively, precision becomes R-precision
Train each algorithm once for each user in the test set and once for each N (a sketch of this per-user procedure follows below).
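A minimal sketch of the per-user protocol under stated assumptions: items count as "good" for a user when rated above that user's personal relevance threshold, users are skipped unless they have at least N such items plus some remaining training data, and exactly N relevant items are held out so that p@N (effectively R-precision) can reach 1. The helper names (`relevance_threshold`, the `recommend` callable) and the exact threshold formula are illustrative, not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, stdev

def relevance_threshold(user_ratings):
    """Per-user threshold; the slides only say it depends on the user's mean
    rating and standard deviation, so this exact combination is an assumption."""
    return mean(user_ratings) + (stdev(user_ratings) if len(user_ratings) > 1 else 0.0)

def protocol_r_precision(recommend, ratings, n):
    """ratings: list of (user, item, rating); recommend(user, train, n) -> ranked item list."""
    by_user = defaultdict(list)
    for user, item, rating in ratings:
        by_user[user].append((item, rating))

    scores = []
    for user, user_ratings in by_user.items():
        threshold = relevance_threshold([r for _, r in user_ratings])
        good = [item for item, r in user_ratings if r >= threshold]
        if len(good) < n or len(user_ratings) - n < 1:
            continue  # candidate user selection: needs N "good" items and training data left over
        test_items = set(good[:n])                         # exactly N relevant items held out
        train = [(u, i, r) for u, i, r in ratings
                 if not (u == user and i in test_items)]   # personalized training set
        recommended = recommend(user, train, n)            # trained once per user and per N
        hits = sum(1 for item in recommended[:n] if item in test_items)
        scores.append(hits / n)                            # R-precision; 1.0 is attainable
    return sum(scores) / len(scores) if scores else 0.0
```

In contrast to the single global split on slide 6, the training set is rebuilt and the algorithm retrained for every (user, N) pair, which is where the |N| * |users| run count on slide 13 comes from.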
12. Experiments
Datasets:
  – Movielens 100k
    • Minimum 20 ratings per user
    • 943 users
    • 6.43% density
    • Not realistic
  – Movielens 1M sample
    • 100k ratings
    • 1000 users
    • 3.0% density
[Plots: number of users vs. number of ratings for both datasets]
Algorithms:
  – SVD
  – User-based CF (kNN)
  – Item-based CF
13. Experimental Settings
According to the proposed protocol:
• Evaluate R-precision for N = [1, 5, 10, 20, 50, 100]
• Users evaluated at N must have at least N items rated above the relevance threshold (RT)
• RT depends on the user's mean rating and standard deviation
• Number of runs: |N| * |users|
Baseline:
• Evaluate p@N for N = [1, 5, 10, 20, 50, 100]
• 80%-20% training-test split
• Items in test set rated at least 3
• Number of runs: 1
(Both settings are sketched together below.)
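Pulling the two sketches above together, the following illustrates the experimental loop implied by this slide. `evaluate_baseline` and `protocol_r_precision` are the hypothetical helpers sketched earlier, and `my_recommender` stands in for any of the three algorithms (SVD, user-based kNN, item-based CF); none of these names come from the paper itself.

```python
N_VALUES = [1, 5, 10, 20, 50, 100]

def run_experiments(my_recommender, ratings):
    results = {}
    for n in N_VALUES:
        # Baseline: one global 80%-20% split, a single run over all test users, p@N.
        results[("baseline", n)] = evaluate_baseline(my_recommender, ratings, n=n)
        # Proposed protocol: personalized splits, retrained per candidate user and per N,
        # R-precision; this is the |N| * |users| run count from the slide.
        results[("protocol", n)] = protocol_r_precision(my_recommender, ratings, n=n)
    return results
```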
16. Results
What about time?
  – |N| * |users| runs vs. 1?
  – Trade-off between a realistic evaluation and complexity?
17. Conclusions
• We can emulate a realistic production scenario by creating personalized training/test sets and evaluating them for each candidate user separately
• We can see how well a recommender performs at different levels of recall (page size)
• We can compare towards a gold standard
• We can reduce evaluation time
18. Questions?
• Thanks!
• Also: check out
  – The ACM TIST Special Issue on RecSys Benchmarking – bit.ly/RecSysBe
  – The ACM RecSys Wiki – www.recsyswiki.com