Replication of Recommender
Systems Research
Alejandro Bellogín
Universidad Autónoma de Madrid
@abellogin
Alan Said
University of Skövde
@alansaid
ACM RecSys Summer School 2017
Who are we?
Alejandro Bellogín
•  Lecturer (~Asst. Prof) @ Universidad
Autónoma de Madrid, Spain
•  PhD @ UAM, 2012
•  Research on
–  Evaluation
–  Similarity metrics
–  Replication & reproducibility
Alan Said
•  Lecturer (~Asst. Prof) @ University
of Skövde, Sweden
•  PhD @ TU Berlin, 2013
•  Research on
–  Evaluation
–  Replication & reproducibility
–  Health
Outline
•  Motivation [10 minutes]
•  Replication and reproducibility
•  Replication in Recommender Systems
•  Demo
•  Conclusions and Wrap-up [10 minutes]
•  Questions [10 minutes]
Motivation
In RecSys, we find inconsistent results for the “same”
–  Dataset
–  Algorithm
–  Evaluation metric
[Charts: Movielens 1M (Cremonesi et al, 2010); Movielens 100k (Gorla et al, 2013); Movielens 1M (Yin et al, 2012); Movielens 100k, SVD (Jambor & Wang, 2010)]
So what?
Motivation
A proper evaluation culture allows the field to advance
… or at least, identify when there is a problem!
We need to understand why this happens
Goal of this tutorial
•  Identify the steps that can act as hurdles when replicating experimental results
–  Focusing on the specific details inherent to recommender systems
•  We will analyze this problem using the following representation of a generic recommender system process
[Diagram: generic recommender system process. A splitting step produces training, validation, and test sets; the recommender generates a ranking (for a user) or a prediction for a given item (and user); an evaluator computes metrics such as ranking, prediction, coverage, diversity, …; a collector gathers the resulting statistics.]
In this tutorial
•  We will focus on replication and
reproducibility
– Define the context
– Present typical setting and problems
– Propose some guidelines
– Exhibit the most typical scenarios where
experimental results in recommendation may
hinder replication
NOT in this tutorial
•  Definition of evaluation in recommendation:
– In-depth analysis of evaluation metrics
– Novel evaluation dimensions
– User evaluation
→ See Wednesday’s lectures on evaluation
Reproducible Experimental Design
•  We need to distinguish
–  Replicability
–  Reproducibility
•  Different aspects:
–  Algorithmic
–  Published results
–  Experimental design
•  Goal:
–  to have an environment for reproducible experiments
Definition: Replicability
To copy something
•  The results
•  The data
•  The approach
Being able to evaluate in the same setting and obtain the same results
Definition: Reproducibility
To recreate something
•  The (complete) set of experiments
•  The (complete) set of results
•  The (complete) experimental setup
To (re)launch it in production with the same results
Comparing against the state-of-the-art
[Flowchart, reconstructed from the slide text:]
Your settings are not exactly like those in paper X, but it is a relevant paper.
•  Replicate the results of paper X. Do the results match the original paper?
–  No! There is something wrong/incomplete in the experimental design. Try again!
–  Yes! Then reproduce the results of paper X in your setting. Do the results agree with the original paper?
•  They agree: Congrats, you’re done!
•  They do not agree: Congrats! You have shown that paper X behaves differently in the new setting
What about Reviewer 3?
•  “It would be interesting to see this done on a
different dataset…”
–  Repeatability
–  The same person doing the whole pipeline over again
•  “How does your approach compare to [Reviewer 3 et
al. 2003]?”
–  Reproducibility or replicability (depending on how similar
the two papers are)
Repeat vs. replicate vs. reproduce vs. reuse
Motivation for reproducibility
In order to ensure that our experiments, settings,
and results are:
–  Valid
–  Generalizable
–  Comparable
–  Of use for others
–  etc.
we must make sure that others can reproduce our
experiments in their setting
Making reproducibility easier
•  Description, description, description
•  No magic numbers
•  Specify values for all parameters
•  Motivate!
•  Keep a detailed protocol of everything
•  Describe process clearly
•  Use standards
•  Publish code (nobody expects you to be an awesome developer, you’re a researcher)
•  Publish data
•  Publish supplemental material
Replicability, reproducibility, and progress
•  Can there be actual progress if no valid comparison
can be done?
•  What is the point of comparing two approaches if
the comparison is flawed?
•  How do replicability and reproducibility facilitate
actual progress in the field?
Summary
•  Important issues when running experiments
–  Validity of results (replicability)
–  Comparability of results (reproducibility)
–  Validity of experimental setup (repeatability)
•  We need to incorporate reproducibility and
replication to facilitate progress in the field
•  If your research is reproducible for others, it has
more value
Outline
•  Motivation [10 minutes]
•  Replication and reproducibility
•  Replication in Recommender Systems
–  Dataset collection
–  Splitting
–  Recommender algorithms
–  Candidate items
–  Evaluation metrics
–  Statistical testing
•  Demo
•  Conclusions and Wrap-up [10 minutes]
•  Questions [10 minutes]
Replication in Recommender Systems
•  Replicability/reproducibility/repeatability: useful and
desirable in any field
–  How can they be addressed when dealing with
recommender systems?
•  Proposal: analyze the recommendation process and
identify each stage that may affect the final results
DATA CREATION AND COLLECTION
What is a dataset?
Public datasets
•  Movielens 20M
–  “Users were selected at random for inclusion. All selected users had rated at least 20 movies.”
•  Netflix Prize
–  Details withheld
•  Xing (RecSys Challenge 2016/2017)
–  Details withheld
•  Last.fm (360k, MSD)
–  Undocumented cleaning applied
•  MovieTweetings
–  All IMDb ratings… from Twitter
–  Second-hand information
Creating your own datasets
•  Ask yourself:
–  What are we collecting?
–  How are we collecting it?
•  How should we be collecting it?
–  Are we collecting all (vital) interactions?
•  dwell time vs. clicks vs. comments vs. swipes vs.
likes vs. etc.
–  Are we documenting the process in sufficient detail?
–  Are we sharing the dataset in a format understood by
others (and supported by software)?
The user-item matrix

User   Item   Interaction   Timestamp
1      1      1             2017-…
1      2      1             …
2      3      2             …
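A minimal sketch (not part of the original slides) of how an interaction log shaped like the table above could be loaded into a sparse user-item matrix; the file name, column names, and the use of pandas/SciPy are illustrative assumptions.

```python
# Load a (user, item, interaction, timestamp) log into a sparse user-item matrix.
import pandas as pd
from scipy.sparse import csr_matrix

interactions = pd.read_csv("interactions.csv",
                           names=["user", "item", "interaction", "timestamp"])

# Map raw ids to contiguous indices so the matrix has no empty rows/columns.
user_index = {u: i for i, u in enumerate(interactions["user"].unique())}
item_index = {it: j for j, it in enumerate(interactions["item"].unique())}

rows = interactions["user"].map(user_index)
cols = interactions["item"].map(item_index)
values = interactions["interaction"].astype(float)

ui_matrix = csr_matrix((values, (rows, cols)),
                       shape=(len(user_index), len(item_index)))
print(ui_matrix.shape, ui_matrix.nnz)
```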
Releasing the dataset
•  Make the dataset publicly available
– Otherwise your work is not reproducible
•  Provide an in-depth overview
– Website, paper, etc.
•  Communicate it
– Mailing lists, RecSysWiki, website, etc.
Releasing the dataset
•  Consider releasing official training, test,
validation splits.
•  Present baseline algorithm results for released
splits.
•  Have code examples of how to work with the data (splits, evaluations, etc.), as in the sketch below
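A minimal sketch of releasing official splits together with their documentation; the file names, proportions, seed, and manifest fields are illustrative assumptions rather than a prescribed format.

```python
# Produce official train/validation/test splits plus a manifest that documents
# exactly how they were created, so others can reproduce them.
import json
import pandas as pd

SEED = 20170825
ratings = pd.read_csv("ratings.csv", names=["user", "item", "rating", "timestamp"])

shuffled = ratings.sample(frac=1.0, random_state=SEED)   # global random shuffle
n = len(shuffled)
train = shuffled.iloc[: int(0.8 * n)]
validation = shuffled.iloc[int(0.8 * n): int(0.9 * n)]
test = shuffled.iloc[int(0.9 * n):]

for name, part in [("train", train), ("validation", validation), ("test", test)]:
    part.to_csv(f"{name}.csv", index=False)

# The manifest is what makes the released split reproducible for others.
manifest = {"strategy": "global random shuffle", "proportions": [0.8, 0.1, 0.1],
            "seed": SEED, "filters": "none", "source": "ratings.csv"}
with open("split_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```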
DATA SPLITTING AND PREPARATION
Splitting
•  Sizes?
•  How to split?
•  Filtering?
•  How to document?
•  What’s the task?
–  Rating prediction
–  Top-n
•  What’s important for the algorithm?
–  Time
–  Relevance
•  Which are the candidate items that we will be recommending?
•  Who are the candidate users we will be recommending (and evaluating) for?
•  Do we have any limitations on numbers?
–  Cold start?
–  Temporal/trending recommendations?
–  Other?
Scenarios
•  Random
•  All users at once
•  All items at once
•  One user at once
•  One item at once
•  Temporal
•  Temporal for one user
•  Relevance thresholds
Random
•  The split does not take into consideration
–  Whether users or items are left out of the training or test sets
–  The relevance of items
–  The scenario of the recommendation
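A minimal sketch of the random scenario, in which each rating goes to train or test independently of everything else; the names and the 80/20 proportion are assumptions.

```python
# Random per-rating split: some users or items may end up only in the test set.
import random

def random_split(ratings, test_fraction=0.2, seed=42):
    """ratings: list of (user, item, value) tuples."""
    rng = random.Random(seed)
    train, test = [], []
    for r in ratings:
        (test if rng.random() < test_fraction else train).append(r)
    return train, test

ratings = [(1, 1, 5.0), (1, 2, 3.0), (2, 1, 4.0), (2, 3, 2.0), (3, 2, 1.0)]
train, test = random_split(ratings)
# A user that only appears in `test` becomes a cold-start user for the recommender.
```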
All users at once
•  The split does not take into consideration
–  Whether items are left out of the training or test sets
•  Can take into consideration
–  The relevance of items (per user or in general)
All items at once
•  The split does not take into consideration
–  Whether users are left out of the training or test sets
One user at once
•  The split takes into consideration
–  The interactions of all other users when creating the splits for one specific user
•  Resulting training set contains all other user-item interactions
One item at once
•  The split takes into consideration
–  The interactions of all other users when creating the splits for one specific item
•  Resulting training set contains all other user-item interactions
Temporal
•  The split takes into consideration
–  The timestamp of interactions
•  All items newer than a certain timestamp are discarded or become part of the test set (see the sketch below)
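A minimal sketch of a global temporal split; the cutoff value is an assumption, and it could also be chosen per user or as a quantile of the timestamps.

```python
# Temporal split: everything up to a cutoff timestamp is training data,
# everything after it is test data.
def temporal_split(ratings, cutoff):
    """ratings: list of (user, item, value, timestamp) tuples."""
    train = [r for r in ratings if r[3] <= cutoff]
    test = [r for r in ratings if r[3] > cutoff]
    return train, test

ratings = [(1, 1, 5.0, 100), (1, 2, 3.0, 250), (2, 1, 4.0, 90), (2, 3, 2.0, 300)]
train, test = temporal_split(ratings, cutoff=200)
```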
Filters
•  What filters?
•  Why filters?
•  Removing items/users with few interactions
creates a skewed dataset
– Sometimes this is a good thing
– Needs proper motivation
Implementation
•  Most recommender system frameworks implement some form of splitting
however
•  Documenting what choices were selected for the splitting is crucial for the work to be reproducible, even when using established frameworks
Data splitting - LensKit
http://lenskit.org/documentation/evaluator/data/
Data splitting - LibRec
https://www.librec.net/dokuwiki/doku.php?id=DataModel
Partitions
[Diagram: the dataset is partitioned into training, validation, and test sets]
RECOMMENDATION
The recommender
Defining the recommender
•  Many versions of the same thing
•  Various implementations/design choices for
– Collaborative Filtering
– Similarity/distance measures
– Factorization techniques (MF/FM)
– Probabilistic modeling
Design
•  There are multiple ways of implementing the same
algorithms, similarities, metrics.
•  Irregularities arise even when using a known
implementation (from an existing framework)
–  Look at and understand the source code
–  Report implementational variations (esp. when comparing
to others)
•  Magic numbers
•  Rounding errors
•  Thresholds
•  Optimizations
Collaborative Filtering
[The slide shows two alternative user-based kNN prediction equations]
•  Both equations are usually referred to using the same name, i.e. k-nearest neighbor, user-based, CF
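The equations themselves are not reproduced in this text version. As a hedged illustration, two user-based kNN scoring functions that are commonly both called "user-based CF" are the following standard textbook forms (not necessarily the exact ones on the slide):

```latex
% Mean-centered, normalized weighted average over the k nearest neighbors:
\hat{r}_{u,i} = \bar{r}_u +
  \frac{\sum_{v \in N_k(u)} \operatorname{sim}(u,v)\,(r_{v,i} - \bar{r}_v)}
       {\sum_{v \in N_k(u)} \lvert \operatorname{sim}(u,v) \rvert}

% Plain (unnormalized) weighted sum, common in ranking-oriented implementations:
\hat{r}_{u,i} = C \sum_{v \in N_k(u)} \operatorname{sim}(u,v)\, r_{v,i}
```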
Similarities
•  Similarity metrics may have different design choices as well (a sketch follows below)
–  Normalized (by user) ratings/values
–  Shrinking parameter
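A minimal sketch of those two design choices: ratings are mean-centered per user and a shrinkage term damps similarities computed on few co-rated items. The shrinkage constant is an assumption; frameworks expose it under different names and defaults.

```python
import math

def shrunk_pearson(ratings_u, ratings_v, shrinkage=10.0):
    """ratings_u, ratings_v: dicts item -> rating for two users."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    mean_u = sum(ratings_u.values()) / len(ratings_u)
    mean_v = sum(ratings_v.values()) / len(ratings_v)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = math.sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = math.sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0
    pearson = num / (den_u * den_v)
    # Shrink towards 0 when the overlap is small.
    return (len(common) / (len(common) + shrinkage)) * pearson
```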
CF Common Exceptions
•  Two users having one rating/interaction each
(same item)
•  Both have liked it/rated similarly
•  What is the similarity of these two?
CF Common Exceptions
•  Two users having rated 500 items each
•  5 items intersect and have the same ratings/values
•  What is the similarity of these two?
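A small illustration (not from the slides) of these two corner cases, using plain cosine similarity computed over the co-rated items only; the numbers are made up for the example.

```python
import math

def cosine(ratings_u, ratings_v):
    """Cosine similarity restricted to the items both users have rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

# Case 1: one shared rating each -> similarity 1.0, based on almost no evidence.
print(cosine({1: 5.0}, {1: 5.0}))        # 1.0

# Case 2: 500 ratings each, only 5 items in common with identical values ->
# also 1.0 here; if the norms were taken over the full profiles instead of the
# intersection, the value would drop to about 0.01.
u = {i: 4.0 for i in range(500)}
v = {i: 4.0 for i in range(495, 995)}    # items 495..499 shared
print(cosine(u, v))                       # 1.0
```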
CF Implementation
•  LensKit
–  No matter the recommender chosen, there is always a backup recommender to your chosen one (BaselineScorer)
–  If your chosen recommender cannot fill your list of recommended items, the backup recommender will do so instead
CF Implementation
•  RankSys
– Allows setting a similarity exponent, making the
similarity stronger/weaker than normal
– Similarity score defaults to 0.0
CF Implementation
•  LibRec
– Defaults to global mean when it cannot predict a
rating for a user
Matrix Factorization
•  Ranking vs. Rating prediction
–  Implementations vary between various frameworks
–  Some frameworks contain several implementations of the same algorithms to cater to ranking specifically or rating prediction specifically
MF Implementation
•  RankSys
–  Bundles probabilistic modeling (PLSA) with matrix factorization (ALS)
–  Has three parallel ALS implementations
•  Generic ALS
•  Y. Hu, Y. Koren, C. Volinsky. Collaborative Filtering for Implicit Feedback Datasets. ICDM 2008
•  I. Pilászy, D. Zibriczky and D. Tikk. Fast ALS-based Matrix Factorization for Explicit and Implicit Feedback Datasets. RecSys 2010
MF Implementation
•  LibRec
–  Separate rating prediction and ranking models
–  RankALSRecommender - Ranking
•  Takács and Tikk. Alternating Least Squares for Personalized Ranking. RecSys 2012
–  MFALSRecommender - Rating Prediction
•  Zhou et al. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. AAIM 2008
Probabilistic Modeling
•  Various ways of implementing the same
algorithm
– LDA using Gibbs sampling
– LDA using variational Bayes
Probabilistic Implementation
•  RankSys
–  Uses Mallet’s LDA implementation
–  Newman et al. 2009. Distributed Algorithms for Topic Models
Probabilistic Implementation
•  LibRec
– LDA for implicit feedback
– Griffiths. 2002. Gibbs sampling in the generative
model of Latent Dirichlet Allocation
Recommending: LibRec
https://www.librec.net/dokuwiki/doku.php?id=Recommender
https://github.com/guoguibing/librec/issues/76
Recommending: LensKit
Candidate item generation
Different ways to select candidate items to be ranked (a sketch follows below):
TestRatings: rated items by u in the test set
TestItems: every item in the test set
TrainingItems: every item in training
AllItems: all items in the system
Note: in CF, AllItems and TrainingItems produce the same results
[Figure: example rating matrix. The solid triangle represents the target user; boxed ratings denote the test set.]
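A minimal sketch (not from the slides) of the four strategies named above; the data structures are illustrative assumptions about how a framework might represent them.

```python
# `train` and `test` are dicts: user -> {item: rating}.
def candidate_items(strategy, user, train, test, all_items):
    if strategy == "TestRatings":      # only items rated by u in the test set
        return set(test.get(user, {}))
    if strategy == "TestItems":        # every item appearing in the test set
        return {i for items in test.values() for i in items}
    if strategy == "TrainingItems":    # every item appearing in training
        return {i for items in train.values() for i in items}
    if strategy == "AllItems":         # every item known to the system
        return set(all_items)
    raise ValueError(strategy)

# Implementations often also exclude the items u already has in training;
# whether that is done (and reported) is one more source of discrepancies.
```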
Candidate item generation
Different ways to select candidate items to be ranked:
[Bar charts: P@50 and Recall@50 of SVD50, IB, and UB50 recommenders under the candidate item strategies TR 3, TR 4, TeI, TrI, AI, and OPR. The solid triangle represents the target user; boxed ratings denote the test set.]
[Bellogín et al, 2011]
[Said & Bellogín, 2014]
Candidate item generation
Impact of different strategies for candidate item selection:
RPN (RelPlusN): a ranking with 1 relevant and N non-relevant items
UT (UserTest): same as TestRatings
[Bellogín et al, 2017]
Candidate item generation
Impact of test size for different candidate item selection strategies:
The actual value of the metric may be affected by the amount of known information
[Charts for the RPN and TestItems strategies]
Evaluation metric computation
When coverage is not complete, how are the metrics computed?
–  If a user receives 0 recommendations
Option a: average over all users in the test set (users without recommendations contribute zero)
Option b: average only over the users that received recommendations
MAE = Mean Absolute Error
RMSE = Root Mean Squared Error

                 rec1                 rec2
User             #recs   metric(u)    #recs   metric(u)
u1               5       0.8          5       0.7
u2               3       0.2          5       0.5
u3               0       --           5       0.3
u4               1       1.0          5       0.7
Option a                 0.50                 0.55
Option b                 0.66                 0.55
User coverage            3/4                  4/4
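A minimal sketch of the two averaging options from the table above; names are illustrative.

```python
def average_metric(per_user_values, n_test_users, option="a"):
    """per_user_values: metric values only for users that got recommendations."""
    total = sum(per_user_values)
    if option == "a":                 # users without recommendations count as 0
        return total / n_test_users
    return total / len(per_user_values)   # option b: only covered users

rec1 = [0.8, 0.2, 1.0]                # u3 received no recommendations
print(average_metric(rec1, n_test_users=4, option="a"))   # 0.5
print(average_metric(rec1, n_test_users=4, option="b"))   # ~0.66
```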
Evaluation metric computation
When coverage is not complete, how are the metrics computed?
–  If a user receives 0 recommendations
–  If a value is not predicted (esp. for error-based metrics)
MAE = Mean Absolute Error
RMSE = Root Mean Squared Error

User-item pairs            Real   Rec1   Rec2   Rec3
(u1, i1)                   5      4      NaN    4
(u1, i2)                   3      2      4      NaN
(u1, i3)                   1      1      NaN    1
(u2, i1)                   3      2      4      NaN
MAE/RMSE, ignoring NaNs    0.75/0.87   2.00/2.00   0.50/0.70
MAE/RMSE, NaNs as 0        0.75/0.87   2.00/2.65   1.75/2.18
MAE/RMSE, NaNs as 3        0.75/0.87   1.50/1.58   0.25/0.50
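A minimal sketch of the policies in the table (skip missing predictions, or impute a default); the example reuses the Rec3 column above, and the function names are illustrative.

```python
import math

def mae_rmse(real, predicted, missing_policy="ignore", default=0.0):
    errors = []
    for r, p in zip(real, predicted):
        if p is None:                       # a missing prediction
            if missing_policy == "ignore":
                continue
            p = default
        errors.append(abs(r - p))
    mae = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

real = [5, 3, 1, 3]
rec3 = [4, None, 1, None]
print(mae_rmse(real, rec3, "ignore"))             # MAE 0.50, RMSE ~0.71
print(mae_rmse(real, rec3, "impute", default=0))  # NaNs as 0 -> 1.75, ~2.18
print(mae_rmse(real, rec3, "impute", default=3))  # NaNs as 3 -> 0.25, 0.50
```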
Evaluation metric computation
Variations on metrics:
Error-based metrics can be normalized or averaged per user:
–  Normalize RMSE or MAE by the range of the ratings
(divide by rmax – rmin)
–  Average RMSE or MAE to compensate for unbalanced
distributions of items or users
Evaluation metric computation
Variations on metrics:
nDCG has at least two discounting functions
(linear and exponential decay)
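The formulas are not reproduced in this text version. A minimal sketch of how the choice of discount function changes nDCG; which exact pair of functions the slide refers to, and the constants used below, are assumptions.

```python
import math

def discount(rank, kind, cutoff=10, base=0.85):
    if kind == "logarithmic":
        return 1.0 / math.log2(rank + 1)
    if kind == "linear":                       # linear decay down to 0 at the cutoff
        return max(0.0, 1.0 - (rank - 1) / cutoff)
    if kind == "exponential":                  # geometric / exponential decay
        return base ** (rank - 1)
    raise ValueError(kind)

def ndcg(relevances, kind):
    """relevances: graded relevance of the ranked items, best first."""
    ideal = sorted(relevances, reverse=True)
    dcg_val = sum(r * discount(i, kind) for i, r in enumerate(relevances, 1))
    idcg_val = sum(r * discount(i, kind) for i, r in enumerate(ideal, 1))
    return dcg_val / idcg_val if idcg_val > 0 else 0.0

ranking = [3, 2, 0, 1]
print([round(ndcg(ranking, k), 3) for k in ("logarithmic", "linear", "exponential")])
```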
Evaluation metric computation
Variations on metrics:
Ranking-based metrics are usually computed up to
a ranking position or cutoff k
P = Precision (Precision at k)
R = Recall (Recall at k)
MAP = Mean Average Precision
Is the cutoff being reported? Are the metrics computed until the
end of the list? Is that number the same across all the users?
Evaluation metric computation
If ties are present in the ranking scores, results
may depend on the implementation
[Bellogín et al, 2013]
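A small illustration (not from the slides) of the tie problem: the same scores lead to different precision values depending on how the sort breaks ties.

```python
relevant = {"b"}
scores = {"b": 0.9, "a": 0.9, "c": 0.1}     # "a" and "b" are tied

def precision_at_k(ranking, rel, k):
    return sum(1 for i in ranking[:k] if i in rel) / k

order_by_score_only = sorted(scores, key=lambda i: -scores[i])        # keeps "b" first
order_with_id_ties = sorted(scores, key=lambda i: (-scores[i], i))    # puts "a" first

print(precision_at_k(order_by_score_only, relevant, 1))   # 1.0
print(precision_at_k(order_with_id_ties, relevant, 1))    # 0.0
```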
Evaluation metric computation
Internal evaluation methods of different frameworks (Mahout
(AM), LensKit (LK), MyMediaLite (MML)) present different
implementations of these aspects
[Said & Bellogín, 2014]
Evaluation metric computation
Decisions (implementations) found in some
recommendation frameworks:
Regarding coverage:
•  LK and MML use backup recommenders
•  LR: not actual backup recommender, but default values (global
mean) are provided when not enough neighbors (or all similarities
are negative) are found in KNN
•  RS allows metrics to be averaged explicitly by option a or b (see
different constructors of AverageRecommendationMetric)
LK: https://github.com/lenskit/lenskit
LR: https://github.com/guoguibing/librec
MML: https://github.com/zenogantner/MyMediaLite
RS: https://github.com/RankSys/RankSys
Evaluation metric computation
Decisions (implementations) found in some
recommendation frameworks:
Regarding metric variations:
•  LK, LR, MML use a logarithmic discount for nDCG
•  RS also uses a logarithmic discount for nDCG, but relevance is
normalized with respect to a threshold
•  LK does not take into account predictions without scores for
error metrics
•  LR fails if coverage is not complete for error metrics
•  RS does not compute error metrics
Evaluation metric computation
Decisions (implementations) found in some
recommendation frameworks:
Regarding candidate item generation:
•  LK allows defining a candidate and exclude set
•  LR: delegated to the Recommender class.
AbstractRecommender defaults to TrainingItems
•  MML allows different strategies: training, test, their overlap and
union, or an explicitly provided candidate set
•  RS defines different ways to call the recommender: without
restriction, with a list size limit, with a filter, or with a candidate
set
Evaluation metric computation
Decisions (implementations) found in some
recommendation frameworks:
Regarding ties:
•  MML: not deterministic (Recommender.Recommend sorts
items by descending score)
•  LK: depends on using predict package (not deterministic:
LongUtils.keyValueComparator only compares scores) or
recommend (same ranking as returned by algorithm)
•  LR: not deterministic (Lists.sortItemEntryListTopK only
compares the scores)
•  RS: deterministic (IntDoubleTopN compares values and then
keys)
Statistical testing
Make sure the statistical testing method is
reported
– Paired/unpaired test, effect size, confidence
interval
– Specify why this specific method is used
– Related statistics (such as mean, variance,
population size) are useful to interpret the results
[Sakai, 2014]
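A minimal sketch of a paired test on per-user metric values for two recommenders evaluated on the same users; the numbers are made-up illustrative values and SciPy is assumed to be available.

```python
import numpy as np
from scipy import stats

# Per-user nDCG values for two recommenders, aligned on the same users.
ndcg_rec_a = np.array([0.31, 0.42, 0.27, 0.55, 0.38, 0.46])
ndcg_rec_b = np.array([0.29, 0.45, 0.30, 0.59, 0.41, 0.44])

t, p = stats.ttest_rel(ndcg_rec_a, ndcg_rec_b)      # paired t-test
w, p_w = stats.wilcoxon(ndcg_rec_a, ndcg_rec_b)     # non-parametric alternative

# Report the test used, the p-value, and an effect size, not just "significant".
diff = ndcg_rec_b - ndcg_rec_a
effect_size = diff.mean() / diff.std(ddof=1)
print(p, p_w, effect_size)
```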
Statistical testing
When doing cross-validation, there are several
options to take the samples for the test:
–  One point for each aggregated value of the metric
•  Very few points (one per fold)
–  One point for each value of the metric, on a user basis
•  If we compute a test for each fold, we may find
inconsistencies
•  If we compute a test with all the (concatenated) values, we
may distort the test: many more points, not completely
independent
[Bouckaert, 2003]
Replication and Reproducibility in RecSys:
Summary
•  Reproducible experimental results depend on acknowledging every step in the recommendation process
–  Not treating them as black boxes, so that every setting is reported
–  Applies to data collection, data splitting, recommendation, candidate item generation, metric computation, and statistics
•  There exist several details (in implementation) that might hide important effects in final results
Replication and Reproducibility in RecSys:
Key takeaways
•  Every decision has an impact
–  We should log every step taken in the experimental
part and report that log
•  There are more things besides papers
–  Source code, web appendix, etc. are very useful to
provide additional details not present in the paper
•  You should not fool yourself
–  You have to be critical about what you measure and
not trust intermediate “black boxes”
Replication and Reproducibility in RecSys:
Next steps?
•  We should agree on standard implementations, parameters, instantiations, …
–  Example: trec_eval in IR
•  Replicable badges for journals / conferences
–  http://validation.scienceexchange.com: "The aim of the Reproducibility Initiative is to identify and reward high quality reproducible research via independent validation of key experimental results"
•  Investigate how to improve reproducibility
•  Benchmark, report, and store results
Pointers
•  Email and Twitter
–  Alejandro Bellogín
•  alejandro.bellogin@uam.es
•  @abellogin
–  Alan Said
•  alansaid@acm.org
•  @alansaid
•  Slides:
•  https://github.com/recommenders/rsss2017
RiVal
Recommender System Evaluation
Toolkit
http://rival.recommenders.net
http://github.com/recommenders/rival
Thank you!
References and Additional reading
•  [Armstrong et al, 2009] Improvements That Don’t Add Up: Ad-Hoc Retrieval Results Since 1998. CIKM
•  [Bellogín et al, 2010] A Study of Heterogeneity in Recommendations for a Social Music Service. HetRec
•  [Bellogín et al, 2011] Precision-Oriented Evaluation of Recommender Systems: an Algorithm Comparison. RecSys
•  [Bellogín et al, 2013] An Empirical Comparison of Social, Collaborative Filtering, and Hybrid Recommenders. ACM TIST
•  [Bellogín et al, 2017] Statistical Biases in Information Retrieval Metrics for Recommender Systems. Information Retrieval Journal
•  [Ben-Shimon et al, 2015] RecSys Challenge 2015 and the YOOCHOOSE Dataset. RecSys
•  [Bouckaert, 2003] Choosing Between Two Learning Algorithms Based on Calibrated Tests. ICML
•  [Cremonesi et al, 2010] Performance of Recommender Algorithms on Top-N Recommendation Tasks. RecSys
•  [Filippone & Sanguinetti, 2010] Information Theoretic Novelty Detection. Pattern Recognition
•  [Fleder & Hosanagar, 2009] Blockbuster Culture’s Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity. Management Science
References and Additional reading
•  [Ge et al, 2010] Beyond Accuracy: Evaluating Recommender Systems by Coverage and Serendipity. RecSys
•  [Gorla et al, 2013] Probabilistic Group Recommendation via Information Matching. WWW
•  [Herlocker et al, 2004] Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems
•  [Jambor & Wang, 2010] Goal-Driven Collaborative Filtering. ECIR
•  [Knijnenburg et al, 2011] A Pragmatic Procedure to Support the User-Centric Evaluation of Recommender Systems. RecSys
•  [Koren, 2008] Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. KDD
•  [Lathia et al, 2010] Temporal Diversity in Recommender Systems. SIGIR
•  [Li et al, 2010] Improving One-Class Collaborative Filtering by Incorporating Rich User Information. CIKM
•  [Pu et al, 2011] A User-Centric Evaluation Framework for Recommender Systems. RecSys
•  [Said & Bellogín, 2014] Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks. RecSys
•  [Sakai, 2014] Statistical Reform in Information Retrieval? SIGIR Forum
References and Additional reading
•  [Schein et al, 2002] Methods and Metrics for Cold-Start Recommendations. SIGIR
•  [Shani & Gunawardana, 2011] Evaluating Recommendation Systems. Recommender Systems Handbook
•  [Steck & Xin, 2010] A Generalized Probabilistic Framework and its Variants for Training Top-k Recommender Systems. PRSAT
•  [Tikk et al, 2014] Comparative Evaluation of Recommender Systems for Digital Media. IBC
•  [Vargas & Castells, 2011] Rank and Relevance in Novelty and Diversity Metrics for Recommender Systems. RecSys
•  [Weng et al, 2007] Improving Recommendation Novelty Based on Topic Taxonomy. WI-IAT
•  [Yin et al, 2012] Challenging the Long Tail Recommendation. VLDB
•  [Zhang & Hurley, 2008] Avoiding Monotony: Improving the Diversity of Recommendation Lists. RecSys
•  [Zhang & Hurley, 2009] Statistical Modeling of Diversity in Top-N Recommender Systems. WI-IAT
•  [Zhou et al, 2010] Solving the Apparent Diversity-Accuracy Dilemma of Recommender Systems. PNAS
•  [Ziegler et al, 2005] Improving Recommendation Lists Through Topic Diversification. WWW
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 

Kürzlich hochgeladen (20)

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 

Replication of Recommender Systems Research

  • 1. Replication of Recommender Systems Research Alejandro Bellogín Universidad Autónoma de Madrid @abellogin Alan Said University of Skövde @alansaid ACM RecSys Summer School 2017
  • 2. Who are we? Alejandro Bellogín •  Lecturer (~Asst. Prof) @ Universidad Autónoma de Madrid, Spain •  PhD @ UAM, 2012 •  Research on –  Evaluation –  Similarity metrics –  Replication & reproducibility Alan Said •  Lecturer (~Asst. Prof) @ University of Skövde, Sweden •  PhD @ TU Berlin, 2013 •  Research on –  Evaluation –  Replication & reproducibility –  Health 2ACM RecSys Summer School 201725-Aug-17
  • 3. Outline •  Motivation [10 minutes] •  Replication and reproducibility •  Replication in Recommender Systems •  Demo •  Conclusions and Wrap-up [10 minutes] •  Questions [10 minutes] 25-Aug-17 ACM RecSys Summer School 2017 3
  • 4. Motivation In RecSys, we find inconsistent results, for the “same” –  Dataset –  Algorithm –  Evaluation metric Movielens 1M [Cremonesi et al, 2010] Movielens 100k [Gorla et al, 2013] Movielens 1M [Yin et al, 2012] Movielens 100k, SVD [Jambor & Wang, 2010] 425-Aug-17 ACM RecSys Summer School 2017
  • 5. Motivation In RecSys, we find inconsistent results, for the “same” –  Dataset –  Algorithm –  Evaluation metric Movielens 1M [Cremonesi et al, 2010] Movielens 100k [Gorla et al, 2013] Movielens 1M [Yin et al, 2012] Movielens 100k, SVD [Jambor & Wang, 2010] 525-Aug-17 ACM RecSys Summer School 2017 So what?
  • 6. Motivation A proper evaluation culture allows the field to advance … or at least, identify when there is a problem! 25-Aug-17 ACM RecSys Summer School 2017 6
  • 7. Motivation In RecSys, we find inconsistent results, for the “same” –  Dataset –  Algorithm –  Evaluation metric Movielens 1M [Cremonesi et al, 2010] Movielens 100k [Gorla et al, 2013] Movielens 1M [Yin et al, 2012] Movielens 100k, SVD [Jambor & Wang, 2010] 725-Aug-17 ACM RecSys Summer School 2017
  • 8. Motivation In RecSys, we find inconsistent results, for the “same” –  Dataset –  Algorithm –  Evaluation metric Movielens 1M [Cremonesi et al, 2010] Movielens 100k [Gorla et al, 2013] Movielens 1M [Yin et al, 2012] Movielens 100k, SVD [Jambor & Wang, 2010] 825-Aug-17 ACM RecSys Summer School 2017 We need to understand why this happens
  • 9. Goal of this tutorial •  Identify the steps that can act as hurdles when replicating experimental results – Focusing on the specific details inherent to the recommender systems •  We will analyze this problem using the following representation of a generic recommender system process 25-Aug-17 ACM RecSys Summer School 2017 9
  • 10. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 25-Aug-17 ACM RecSys Summer School 2017 10
  • 11. In this tutorial •  We will focus on replication and reproducibility – Define the context – Present typical setting and problems – Propose some guidelines – Exhibit the most typical scenarios where experimental results in recommendation may hinder replication 25-Aug-17 ACM RecSys Summer School 2017 11
  • 12. NOT in this tutorial •  Definition of evaluation in recommendation: – In-depth analysis of evaluation metrics – Novel evaluation dimensions – User evaluation Ø Wednesday’s lectures on evaluation 25-Aug-17 ACM RecSys Summer School 2017 12
  • 13. Outline •  Motivation [10 minutes] •  Replication and reproducibility •  Replication in Recommender Systems •  Demo •  Conclusions and Wrap-up [10 minutes] •  Questions [10 minutes] 1325-Aug-17 ACM RecSys Summer School 2017
  • 14. Reproducible Experimental Design •  We need to distinguish –  Replicability –  Reproducibility •  Different aspects: –  Algorithmic –  Published results –  Experimental design •  Goal: –  to have an environment for reproducible experiments 25-Aug-17 ACM RecSys Summer School 2017 14
  • 15. Definition: Replicability To copy something •  The results •  The data •  The approach Being able to evaluate in the same setting and obtain the same results 25-Aug-17 ACM RecSys Summer School 2017 15
  • 16. Definition: Reproducibility To recreate something •  The (complete) set of experiments •  The (complete) set of results •  The (complete) experimental setup To (re)launch it in production with the same results 25-Aug-17 ACM RecSys Summer School 2017 16
  • 17. Comparing against the state-of-the-art 25-Aug-17 ACM RecSys Summer School 2017 17 Your settings are not exactly like those in paper X, but it is a relevant paper Congrats, you’re done! Congrats! You have shown that paper X behaves differently in the new setting There is something wrong/incomplete in the experimental design. Try again! They agree They do not agree Do results match the original paper? Yes! No! Do results agree with the original paper?
  • 18. Comparing against the state-of-the-art 25-Aug-17 ACM RecSys Summer School 2017 18 Your settings are not exactly like those in paper X, but it is a relevant paper Reproduce results of paper X Congrats, you’re done! Replicate results of paper X Congrats! You have shown that paper X behaves differently in the new setting There is something wrong/incomplete in the experimental design. Try again! They agree They do not agree Do results match the original paper? Yes! No! Do results agree with the original paper?
  • 19. What about Reviewer 3? •  “It would be interesting to see this done on a different dataset…” –  Repeatability –  The same person doing the whole pipeline over again •  “How does your approach compare to [Reviewer 3 et al. 2003]?” –  Reproducibility or replicability (depending on how similar the two papers are) 25-Aug-17 ACM RecSys Summer School 2017 19
  • 20. Repeat vs. replicate vs. reproduce vs. reuse 25-Aug-17 ACM RecSys Summer School 2017 20
  • 21. Motivation for reproducibility In order to ensure that our experiments, settings, and results are: –  Valid –  Generalizable –  Comparable –  Of use for others –  etc. we must make sure that others can reproduce our experiments in their setting 25-Aug-17 ACM RecSys Summer School 2017 21
  • 22. Making reproducibility easier •  Description, description, description •  No magic numbers •  Specify values for all parameters •  Motivate! •  Keep a detailed protocol of everything •  Describe process clearly •  Use standards •  Publish code (nobody expects you to be an awesome developer, you’re a researcher) •  Publish data •  Publish supplemental material 25-Aug-17 ACM RecSys Summer School 2017 22
  • 23. Replicability, reproducibility, and progress •  Can there be actual progress if no valid comparison can be done? •  What is the point of comparing two approaches if the comparison is flawed? •  How do replicability and reproducibility facilitate actual progress in the field? 25-Aug-17 ACM RecSys Summer School 2017 23
  • 24. Summary •  Important issues when running experiments –  Validity of results (replicability) –  Comparability of results (reproducibility) –  Validity of experimental setup (repeatability) •  We need to incorporate reproducibility and replication to facilitate progress in the field •  If your research is reproducible for others, it has more value 2425-Aug-17 ACM RecSys Summer School 2017
  • 25. Outline •  Motivation [10 minutes] •  Replication and reproducibility •  Replication in Recommender Systems –  Dataset collection –  Splitting –  Recommender algorithms –  Candidate items –  Evaluation metrics –  Statistical testing •  Demo •  Conclusions and Wrap-up [10 minutes] •  Questions [10 minutes] 25-Aug-17 ACM RecSys Summer School 2017 25
  • 26. Replication in Recommender Systems •  Replicability/reproducibility/repeatability: useful and desirable in any field –  How can they be addressed when dealing with recommender systems? •  Proposal: analyze the recommendation process and identify each stage that may affect the final results 25-Aug-17 ACM RecSys Summer School 2017 26
  • 27. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 25-Aug-17 ACM RecSys Summer School 2017 27
  • 28. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 25-Aug-17 ACM RecSys Summer School 2017 28
  • 29. DATA CREATION AND COLLECTION 2925-Aug-17 ACM RecSys Summer School 2017
  • 30. What is a dataset? 3025-Aug-17 ACM RecSys Summer School 2017
  • 31. 3125-Aug-17 ACM RecSys Summer School 2017
  • 32. Public datasets •  Movielens 20M –  “Users were selected at random for inclusion. All selected users had rated at least 20 movies.” •  Netflix Prize –  Details withheld •  Xing (RecSys Challenge 2016/2017) –  Details withheld •  Last.fm (360k, MSD) –  Undocumented cleaning applied •  MovieTweetings –  All IMDb ratings… from Twitter –  2nd hand information 3225-Aug-17 ACM RecSys Summer School 2017
  • 33. Creating your own datasets •  Ask yourself: –  What are we collecting? –  How are we collecting it? •  How should we be collecting it? –  Are we collecting all (vital) interactions? •  dwell time vs. clicks vs. comments vs. swipes vs. likes vs. etc. –  Are we documenting the process in sufficient detail? –  Are we sharing the dataset in a format understood by others (and supported by software)? 3325-Aug-17 ACM RecSys Summer School 2017
  • 34. The user-item matrix
User | Item | Interaction | Timestamp
1    | 1    | 1           | 2017-…
1    | 2    | 1           | …
2    | 3    | 2           | …
3425-Aug-17 ACM RecSys Summer School 2017
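As a minimal sketch, an interaction log in this layout can be loaded and pivoted into a user-item matrix view like this (the file name and column names are assumptions, not from the slides):

```python
import pandas as pd

# Hypothetical tab-separated interaction log with the four columns from the slide.
COLUMNS = ["user", "item", "interaction", "timestamp"]

def load_interactions(path="interactions.tsv"):
    """Read a user/item/interaction/timestamp log into a DataFrame."""
    return pd.read_csv(path, sep="\t", names=COLUMNS, header=None)

# Usage sketch: a (sparse) user-item matrix view of the log.
# df = load_interactions()
# matrix = df.pivot_table(index="user", columns="item", values="interaction", aggfunc="max")
```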
  • 35. Releasing the dataset •  Make the dataset publicly available – Otherwise your work is not reproducible •  Provide an in-depth overview – Website, paper, etc. •  Communicate it – Mailing lists, RecSysWiki, website, etc. 3525-Aug-17 ACM RecSys Summer School 2017
  • 36. Releasing the dataset •  Consider releasing official training, test, validation splits. •  Present baseline algorithm results for released splits. •  Have code examples of how to work with the data (splits, evaluations, etc.) 3625-Aug-17 ACM RecSys Summer School 2017
  • 37. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 3725-Aug-17 ACM RecSys Summer School 2017
  • 38. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 3825-Aug-17 ACM RecSys Summer School 2017
  • 39. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 3925-Aug-17 ACM RecSys Summer School 2017
  • 40. DATA SPLITTING AND PREPARATION 4025-Aug-17 ACM RecSys Summer School 2017
  • 42. Train Test Validation Splitting •  Sizes? •  How to split? •  Filtering? •  How to document? 4225-Aug-17 ACM RecSys Summer School 2017
  • 43. Train Test Validation Splitting •  Sizes? •  How to split? •  Filtering? •  How to document? •  What’s the task? –  Rating prediction –  Top-n •  What’s important for the algorithm? –  Time –  Relevance 4325-Aug-17 ACM RecSys Summer School 2017
  • 44. Train Test Validation Splitting •  Which are the candidate items that we will be recommending? •  Who are the candidate users we will be recommending (and evaluating) for? •  Do we have any limitations on numbers? –  Cold start? –  Temporal/trending recommendations? –  Other? 4425-Aug-17 ACM RecSys Summer School 2017
  • 45. Scenarios •  Random •  All users at once •  All items at once •  One user at once •  One item at once •  Temporal •  Temporal for one user •  Relevance thresholds 4525-Aug-17 ACM RecSys Summer School 2017
  • 46. Random •  The split does not take into consideration –  Whether users or items are left out of the training or test sets. –  The relevance of items –  The scenario of the recommendation 4625-Aug-17 ACM RecSys Summer School 2017
  • 47. All users at once •  The split does not take into consideration –  Whether items are left out of the training or test sets. •  Can take into consideration –  The relevance of items (per user or in general) 4725-Aug-17 ACM RecSys Summer School 2017
  • 48. All items at once •  The split does not take into consideration –  Whether users are left out of the training or test sets. 4825-Aug-17 ACM RecSys Summer School 2017
  • 49. One user at once •  The split takes into consideration –  The interactions of all other users when creating the splits for one specific user •  Resulting training set contains all other user-item interactions 4925-Aug-17 ACM RecSys Summer School 2017
  • 50. One item at once •  The split takes into consideration –  The interactions of all other users when creating the splits for one specific item •  Resulting training set contains all other user-item interactions 5025-Aug-17 ACM RecSys Summer School 2017
  • 51. Temporal •  The split takes into consideration –  The timestamp of interactions •  All items newer than a certain timestamp form the test set (or are discarded). 5125-Aug-17 ACM RecSys Summer School 2017
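To make the contrast between scenarios concrete, here is a minimal sketch of a random hold-out versus a temporal split over (user, item, timestamp) tuples; the 80/20 ratio and function names are assumptions for illustration, not taken from the slides:

```python
import random

def random_split(interactions, test_ratio=0.2, seed=42):
    """Random hold-out: ignores users, items, relevance and time."""
    data = list(interactions)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

def temporal_split(interactions, test_ratio=0.2):
    """Temporal split: the newest interactions (by timestamp) form the test set."""
    data = sorted(interactions, key=lambda x: x[2])  # (user, item, timestamp)
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]
```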
  • 52. Filters •  What filters? •  Why filters? •  Removing items/users with few interactions creates a skewed dataset – Sometimes this is a good thing – Needs proper motivation 5225-Aug-17 ACM RecSys Summer School 2017
  • 53. Implementation •  Most recommender system frameworks implement some form of splitting, however •  Documenting what choices were selected for the splitting is crucial for the work to be reproducible, even when using established frameworks 5325-Aug-17 ACM RecSys Summer School 2017
  • 54. Data splitting - LensKit http://lenskit.org/documentation/evaluator/data/ 5425-Aug-17 ACM RecSys Summer School 2017
  • 55. ACM RecSys Summer School 2017 Data splitting - LibRec https://www.librec.net/dokuwiki/doku.php?id=DataModel 5525-Aug-17
  • 56. ACM RecSys Summer School 2017 Data splitting - LibRec https://www.librec.net/dokuwiki/doku.php?id=DataModel 5625-Aug-17
  • 57. ACM RecSys Summer School 2017 Data splitting - LibRec https://www.librec.net/dokuwiki/doku.php?id=DataModel 5725-Aug-17
  • 58. ACM RecSys Summer School 2017 Data splitting - LibRec https://www.librec.net/dokuwiki/doku.php?id=DataModel 5825-Aug-17
  • 59. ACM RecSys Summer School 2017 Data splitting - LibRec https://www.librec.net/dokuwiki/doku.php?id=DataModel 5925-Aug-17
  • 60. ACM RecSys Summer School 2017 Data splitting - LibRec https://www.librec.net/dokuwiki/doku.php?id=DataModel 6025-Aug-17
  • 63. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 6325-Aug-17 ACM RecSys Summer School 2017
  • 64. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 6425-Aug-17 ACM RecSys Summer School 2017
  • 65. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 6525-Aug-17 ACM RecSys Summer School 2017
  • 67. The recommender 6725-Aug-17 ACM RecSys Summer School 2017
  • 68. Defining the recommender •  Many versions of the same thing •  Various implementations/design choices for – Collaborative Filtering – Similarity/distance measures – Factorization techniques (MF/FM) – Probabilistic modeling 6825-Aug-17 ACM RecSys Summer School 2017
  • 69. Design •  There are multiple ways of implementing the same algorithms, similarities, metrics. •  Irregularities arise even when using a known implementation (from an existing framework) –  Look at and understand the source code –  Report implementational variations (esp. when comparing to others) •  Magic numbers •  Rounding errors •  Thresholds •  Optimizations 6925-Aug-17 ACM RecSys Summer School 2017
  • 70. Collaborative Filtering 7025-Aug-17 ACM RecSys Summer School 2017
  • 71. Collaborative Filtering •  Both equations are usually referred to using the same name, i.e., k-nearest neighbor, user-based CF. 7125-Aug-17 ACM RecSys Summer School 2017
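The equations on the original slide did not survive the transcript; as an illustration, two user-based kNN predictors that are both commonly given exactly this name (assuming $\hat{r}_{ui}$ is the predicted rating, $N_k(u)$ the neighbourhood of $u$, and $\bar{r}_u$ the mean rating of $u$) are:

```latex
\hat{r}_{ui} = \frac{\sum_{v \in N_k(u)} \mathrm{sim}(u,v)\, r_{vi}}{\sum_{v \in N_k(u)} |\mathrm{sim}(u,v)|}
\qquad \text{vs.} \qquad
\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N_k(u)} \mathrm{sim}(u,v)\,\bigl(r_{vi} - \bar{r}_v\bigr)}{\sum_{v \in N_k(u)} |\mathrm{sim}(u,v)|}
```

Whether ratings are mean-centered, and whether the denominator uses absolute similarities, are exactly the kind of unreported design choices that make "the same" algorithm behave differently across frameworks.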
  • 72. Similarities •  Similarity metrics may have different design choices as well – Normalized (by user) ratings/values – Shrinking parameter 7225-Aug-17 ACM RecSys Summer School 2017
  • 73. CF Common Exceptions •  Two users having one rating/interaction each (same item) •  Both have liked it/rated similarly •  What is the similarity of these two? 7325-Aug-17 ACM RecSys Summer School 2017
  • 74. CF Common Exceptions •  Two users having rated 500 items each •  5 item intersect and have the same ratings/ values •  What is the similarity of these two? 7425-Aug-17 ACM RecSys Summer School 2017
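A small sketch of the two edge cases above with plain cosine similarity over full rating vectors (toy data); note that a framework computing the similarity only over the co-rated items would return 1.0 in both cases, which is precisely the kind of design choice that should be reported:

```python
import math

def cosine(u, v):
    """Cosine similarity between two user rating dicts {item: rating}."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = (math.sqrt(sum(r * r for r in u.values()))
           * math.sqrt(sum(r * r for r in v.values())))
    return num / den if den else 0.0

# Case 1: one shared item, rated identically -> similarity 1.0
u1, u2 = {"a": 5}, {"a": 5}
print(cosine(u1, u2))  # 1.0

# Case 2: 500 ratings each, only 5 shared (and identical) -> similarity 0.01
u3 = {f"x{i}": 5 for i in range(500)}
u4 = {f"y{i}": 5 for i in range(495)}
for i in range(5):
    u4[f"x{i}"] = 5
print(round(cosine(u3, u4), 3))  # 0.01
```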
  • 75. CF Implementation •  LensKit – No matter the recommender chosen, there is always a backup recommender to your chosen one. (BaselineScorer) – If your chosen recommender cannot fill your list of recommended items, the backup recommender will do so instead. 7525-Aug-17 ACM RecSys Summer School 2017
  • 76. CF Implementation •  RankSys – Allows setting a similarity exponent, making the similarity stronger/weaker than normal – Similarity score defaults to 0.0 7625-Aug-17 ACM RecSys Summer School 2017
  • 77. CF Implementation •  LibRec – Defaults to global mean when it cannot predict a rating for a user 7725-Aug-17 ACM RecSys Summer School 2017
  • 78. Matrix Factorization •  Ranking vs. Rating prediction – Implementations vary between various frameworks. – Some frameworks contain several implementations of the same algorithms to cater to ranking specifically or rating prediction specifically 7825-Aug-17 ACM RecSys Summer School 2017
  • 79. MF Implementation •  RankSys –  Bundles probabilistic modeling (PLSA) with matrix factorization (ALS) –  Has three parallel ALS implementations •  Generic ALS •  Y. Hu, Y. Koren, C. Volinsky. Collaborative filtering for implicit feedback datasets. ICDM 2008 •  I. Pilászy, D. Zibriczky and D. Tikk. Fast ALS-based Matrix Factorization for Explicit and Implicit Feedback Datasets. RecSys 2010. 7925-Aug-17 ACM RecSys Summer School 2017
  • 80. MF Implementation •  LibRec – Separate rating prediction and ranking models – RankALSRecommender - Ranking •  Takács and Tikk. Alternating Least Squares for Personalized Ranking. RecSys 2012. – MFALSRecommender – Rating Prediction •  Zhou et al. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. AAIM 2008 8025-Aug-17 ACM RecSys Summer School 2017
  • 81. Probabilistic Modeling •  Various ways of implementing the same algorithm – LDA using Gibbs sampling – LDA using variational Bayes 8125-Aug-17 ACM RecSys Summer School 2017
  • 82. Probabilistic Implementation •  RankSys – Uses Mallet’s LDA implementation – Newman et al. 2009. Distributed Algorithms for Topic Models. 8225-Aug-17 ACM RecSys Summer School 2017
  • 83. Probabilistic Implementation •  LibRec – LDA for implicit feedback – Griffiths. 2002. Gibbs sampling in the generative model of Latent Dirichlet Allocation 8325-Aug-17 ACM RecSys Summer School 2017
  • 84. 25-Aug-17 ACM RecSys Summer School 2017 Recommending: LibRec https://www.librec.net/dokuwiki/doku.php?id=Recommender 84
  • 85. 25-Aug-17 ACM RecSys Summer School 2017 Recommending: LibRec https://www.librec.net/dokuwiki/doku.php?id=Recommender https://github.com/guoguibing/librec/issues/76 85
  • 86. ACM RecSys Summer School 2017 Recommending LensKit 8625-Aug-17
  • 87. Outline •  Motivation [10 minutes] •  Replication and reproducibility •  Replication in Recommender Systems –  Dataset collection –  Splitting –  Recommender algorithm –  Candidate items –  Evaluation metrics –  Statistical testing •  Demo •  Conclusions and Wrap-up [10 minutes] •  Questions [10 minutes] 25-Aug-17 ACM RecSys Summer School 2017 87
  • 88. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 8825-Aug-17 ACM RecSys Summer School 2017
  • 89. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 8925-Aug-17 ACM RecSys Summer School 2017
  • 90. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 9025-Aug-17 ACM RecSys Summer School 2017
  • 91. Candidate item generation Different ways to select candidate items to be ranked: TestRatings: rated items by u in test set TestItems: every item in test set TrainingItems: every item in training AllItems: all items in the system Note: in CF, AllItems and TrainingItems produce the same results 25-Aug-17 ACM RecSys Summer School 2017 91 Solid triangle represents the target user. Boxed ratings denote test set.
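A sketch of how the four strategies translate into candidate sets, given training and test sets of (user, item) pairs; the names are assumptions following the slide, and whether items the target user already rated in training are removed afterwards is a further choice that should be reported:

```python
def candidate_items(user, train, test, all_items, strategy="TestItems"):
    """Return the items to be ranked for `user` under a given strategy.

    train / test: iterables of (user, item) pairs; all_items: set of item ids.
    """
    if strategy == "TestRatings":      # only items rated by the user in the test set
        return {i for (u, i) in test if u == user}
    if strategy == "TestItems":        # every item appearing in the test set
        return {i for (_, i) in test}
    if strategy == "TrainingItems":    # every item appearing in the training set
        return {i for (_, i) in train}
    if strategy == "AllItems":         # every item in the system
        return set(all_items)
    raise ValueError(f"unknown strategy: {strategy}")
```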
  • 92. Candidate item generation Different ways to select candidate items to be ranked: 25-Aug-17 ACM RecSys Summer School 2017 92 [Figure: P@50 and Recall@50 of SVD50, IB, and UB50 under the candidate item strategies TR 3, TR 4, TeI, TrI, AI, OPR] Solid triangle represents the target user. Boxed ratings denote test set. [Bellogín et al, 2011]
  • 93. [Said & Bellogín, 2014] Candidate item generation Impact of different strategies for candidate item selection: RPN: RelPlusN a ranking with 1 relevant and N non-relevant items UT: UserTest same as TestRatings 25-Aug-17 ACM RecSys Summer School 2017 93
  • 94. [Bellogín et al, 2017] Candidate item generation Impact of test size for different candidate item selection strategies: The actual value of the metric may be affected by the amount of known information 25-Aug-17 ACM RecSys Summer School 2017 94 [Figure: metric values as a function of test set size for the RPN and TestItems strategies]
  • 95. Outline •  Motivation [10 minutes] •  Replication and reproducibility •  Focus on Recommender Systems –  Dataset collection –  Splitting –  Recommender algorithm –  Candidate items –  Evaluation metrics –  Statistical testing •  Demo •  Conclusions and Wrap-up [10 minutes] •  Questions [10 minutes] 25-Aug-17 ACM RecSys Summer School 2017 95
  • 96. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 9625-Aug-17 ACM RecSys Summer School 2017
  • 97. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 9725-Aug-17 ACM RecSys Summer School 2017
  • 98. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 9825-Aug-17 ACM RecSys Summer School 2017
  • 99. Evaluation metric computation When coverage is not complete, how are the metrics computed? –  If a user receives 0 recommendations Option a: Option b: MAE = Mean Absolute Error RMSE = Root Mean Squared Error 25-Aug-17 ACM RecSys Summer School 2017 99
  • 100. Evaluation metric computation When coverage is not complete, how are the metrics computed? –  If a user receives 0 recommendations Option a: Option b: MAE = Mean Absolute Error RMSE = Root Mean Squared Error 25-Aug-17 ACM RecSys Summer School 2017 100
User          | rec1 #recs | rec1 metric(u) | rec2 #recs | rec2 metric(u)
u1            | 5          | 0.8            | 5          | 0.7
u2            | 3          | 0.2            | 5          | 0.5
u3            | 0          | --             | 5          | 0.3
u4            | 1          | 1.0            | 5          | 0.7
Option a      |            | 0.50           |            | 0.55
Option b      |            | 0.66           |            | 0.55
User coverage |            | 3/4            |            | 4/4
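Reading the table, the two averaging options can be written as follows, where $U$ is the set of test users and $U_c \subseteq U$ those who actually received recommendations (this is an interpretation of the table, not text from the original slide):

```latex
\text{Option a:}\quad \frac{1}{|U|} \sum_{u \in U_c} \mathrm{metric}(u)
\qquad\qquad
\text{Option b:}\quad \frac{1}{|U_c|} \sum_{u \in U_c} \mathrm{metric}(u)
```

In other words, option a implicitly counts uncovered users as contributing zero, while option b silently ignores them; for rec1 this yields 2.0/4 = 0.50 versus 2.0/3 ≈ 0.66.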
  • 101. Evaluation metric computation When coverage is not complete, how are the metrics computed? –  If a user receives 0 recommendations –  If a value is not predicted (esp. for error-based metrics) MAE = Mean Absolute Error RMSE = Root Mean Squared Error 25-Aug-17 ACM RecSys Summer School 2017 101
  • 102. Evaluation metric computation When coverage is not complete, how are the metrics computed? –  If a user receives 0 recommendations –  If a value is not predicted (esp. for error-based metrics) MAE = Mean Absolute Error RMSE = Root Mean Squared Error 25-Aug-17 ACM RecSys Summer School 2017 102
User-item pairs         | Real | Rec1      | Rec2      | Rec3
(u1, i1)                | 5    | 4         | NaN       | 4
(u1, i2)                | 3    | 2         | 4         | NaN
(u1, i3)                | 1    | 1         | NaN       | 1
(u2, i1)                | 3    | 2         | 4         | NaN
MAE/RMSE, ignoring NaNs |      | 0.75/0.87 | 2.00/2.00 | 0.50/0.70
MAE/RMSE, NaNs as 0     |      | 0.75/0.87 | 2.00/2.65 | 1.75/2.18
MAE/RMSE, NaNs as 3     |      | 0.75/0.87 | 1.50/1.58 | 0.25/0.50
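For reference, the standard definitions of both error metrics over the set $T$ of test pairs for which a prediction $\hat{r}_{ui}$ of the real rating $r_{ui}$ exists; how $T$ is chosen (ignore NaNs, or substitute a default value) is exactly what the table varies:

```latex
\mathrm{MAE} = \frac{1}{|T|} \sum_{(u,i) \in T} \bigl| r_{ui} - \hat{r}_{ui} \bigr|
\qquad\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \bigl( r_{ui} - \hat{r}_{ui} \bigr)^2}
```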
  • 103. Evaluation metric computation Variations on metrics: Error-based metrics can be normalized or averaged per user: –  Normalize RMSE or MAE by the range of the ratings (divide by rmax – rmin) –  Average RMSE or MAE to compensate for unbalanced distributions of items or users 25-Aug-17 ACM RecSys Summer School 2017 103
  • 104. Evaluation metric computation Variations on metrics: nDCG has at least two discounting functions (linear and exponential decay) 25-Aug-17 ACM RecSys Summer School 2017 104
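As an illustration of the two variants mentioned above, one common reading is that the "linear" and "exponential" forms differ in the gain applied to the relevance value, both with a logarithmic position discount; this is a sketch of that reading, not necessarily the exact formulas on the original slide:

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)}
\qquad \text{vs.} \qquad
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}
```

In both cases nDCG@k is obtained by dividing DCG@k by the DCG@k of the ideal ranking, so the two variants are not comparable across papers unless the choice is reported.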
  • 105. Evaluation metric computation Variations on metrics: Ranking-based metrics are usually computed up to a ranking position or cutoff k P = Precision (Precision at k) R = Recall (Recall at k) MAP = Mean Average Precision Is the cutoff being reported? Are the metrics computed until the end of the list? Is that number the same across all the users? 25-Aug-17 ACM RecSys Summer School 2017 105
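For completeness, the cutoff-based forms assumed here, with $\mathrm{Rel}_u$ the relevant test items of user $u$ and $L_u^k$ its top-$k$ recommendation list:

```latex
\mathrm{P@}k(u) = \frac{\bigl|L_u^k \cap \mathrm{Rel}_u\bigr|}{k}
\qquad\qquad
\mathrm{Recall@}k(u) = \frac{\bigl|L_u^k \cap \mathrm{Rel}_u\bigr|}{\bigl|\mathrm{Rel}_u\bigr|}
```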
  • 106. Evaluation metric computation If ties are present in the ranking scores, results may depend on the implementation 25-Aug-17 ACM RecSys Summer School 2017 106 [Bellogín et al, 2013]
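A toy sketch of the tie issue: items b, c, and d below share one score, so both orderings are valid rankings by descending score, yet P@2 differs depending on how the sort breaks the ties (the scores and relevance judgements are made up):

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for i in ranking[:k] if i in relevant) / k

# Hypothetical recommender output: three items share the same score.
scores = {"a": 0.9, "b": 0.5, "c": 0.5, "d": 0.5, "e": 0.1}
relevant = {"a", "d"}

# Both lists are valid rankings by descending score; they differ only in tie order.
ranking_1 = ["a", "b", "c", "d", "e"]
ranking_2 = ["a", "d", "c", "b", "e"]

print(precision_at_k(ranking_1, relevant, 2))  # 0.5
print(precision_at_k(ranking_2, relevant, 2))  # 1.0
```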
  • 107. Evaluation metric computation Internal evaluation methods of different frameworks (Mahout (AM), LensKit (LK), MyMediaLite (MML)) present different implementations of these aspects 25-Aug-17 ACM RecSys Summer School 2017 107[Said & Bellogín, 2014]
  • 108. Evaluation metric computation Decisions (implementations) found in some recommendation frameworks: Regarding coverage: •  LK and MML use backup recommenders •  LR: no actual backup recommender, but default values (global mean) are provided when not enough neighbors are found (or all similarities are negative) in KNN •  RS allows averaging metrics explicitly by option a or b (see different constructors of AverageRecommendationMetric) 25-Aug-17 ACM RecSys Summer School 2017 108 LK: https://github.com/lenskit/lenskit LR: https://github.com/guoguibing/librec MML: https://github.com/zenogantner/MyMediaLite RS: https://github.com/RankSys/RankSys
  • 109. Evaluation metric computation Decisions (implementations) found in some recommendation frameworks: Regarding metric variations: •  LK, LR, MML use a logarithmic discount for nDCG •  RS also uses a logarithmic discount for nDCG, but relevance is normalized with respect to a threshold •  LK does not take into account predictions without scores for error metrics •  LR fails if coverage is not complete for error metrics •  RS does not compute error metrics 25-Aug-17 ACM RecSys Summer School 2017 109 LK: https://github.com/lenskit/lenskit LR: https://github.com/guoguibing/librec MML: https://github.com/zenogantner/MyMediaLite RS: https://github.com/RankSys/RankSys
  • 110. Evaluation metric computation Decisions (implementations) found in some recommendation frameworks: Regarding candidate item generation: •  LK allows defining a candidate and exclude set •  LR: delegated to the Recommender class. AbstractRecommender defaults to TrainingItems •  MML allows different strategies: training, test, their overlap and union, or an explicitly provided candidate set •  RS defines different ways to call the recommender: without restriction, with a list size limit, with a filter, or with a candidate set 25-Aug-17 ACM RecSys Summer School 2017 110
  • 111. Evaluation metric computation Decisions (implementations) found in some recommendation frameworks: Regarding ties: •  MML: not deterministic (Recommender.Recommend sorts items by descending score) •  LK: depends on using predict package (not deterministic: LongUtils.keyValueComparator only compares scores) or recommend (same ranking as returned by algorithm) •  LR: not deterministic (Lists.sortItemEntryListTopK only compares the scores) •  RS: deterministic (IntDoubleTopN compares values and then keys) 25-Aug-17 ACM RecSys Summer School 2017 111
  • 112. Outline •  Motivation [10 minutes] •  Replication and reproducibility •  Focus on Recommender Systems –  Dataset collection –  Splitting –  Recommender algorithm –  Candidate items –  Evaluation metrics –  Statistical testing •  Demo •  Conclusions and Wrap-up [10 minutes] •  Questions [10 minutes] 25-Aug-17 ACM RecSys Summer School 2017 112
  • 113. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 11325-Aug-17 ACM RecSys Summer School 2017
  • 114. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 11425-Aug-17 ACM RecSys Summer School 2017
  • 115. Train Test Validation Splitting Recommender generates ranking (for a user) prediction for a given item (& user) Ranking Prediction Coverage Diversity . . . Evaluator Collector Stats 11525-Aug-17 ACM RecSys Summer School 2017
  • 116. Statistical testing Make sure the statistical testing method is reported – Paired/unpaired test, effect size, confidence interval – Specify why this specific method is used – Related statistics (such as mean, variance, population size) are useful to interpret the results 25-Aug-17 ACM RecSys Summer School 2017 116 [Sakai, 2014]
  • 117. Statistical testing When doing cross-validation, there are several options to take the samples for the test: –  One point for each aggregated value of the metric •  Very few points (one per fold) –  One point for each value of the metric, on a user basis •  If we compute a test for each fold, we may find inconsistencies •  If we compute a test with all the (concatenated) values, we may distort the test: many more points, not completely independent 25-Aug-17 ACM RecSys Summer School 2017 117 [Bouckaert, 2003]
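A minimal sketch of the two sampling choices using SciPy's paired t-test; the function names and the nested-list layout (one list of per-user metric values per fold, aligned across the two systems) are assumptions for illustration:

```python
from scipy import stats

def fold_level_test(per_user_a, per_user_b):
    """Pair one aggregated value per fold: very few points (one per fold)."""
    means_a = [sum(fold) / len(fold) for fold in per_user_a]
    means_b = [sum(fold) / len(fold) for fold in per_user_b]
    return stats.ttest_rel(means_a, means_b)

def pooled_user_test(per_user_a, per_user_b):
    """Pair concatenated per-user values: many points, but not fully independent."""
    pooled_a = [v for fold in per_user_a for v in fold]
    pooled_b = [v for fold in per_user_b for v in fold]
    return stats.ttest_rel(pooled_a, pooled_b)
```

Reporting which of these (or another) variant was used, together with effect sizes and population sizes, is part of making the comparison reproducible.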
  • 118. Replication and Reproducibility in RecSys: Summary •  Reproducible experimental results depend on acknowledging every step in the recommendation process –  As black boxes so every setting is reported –  Applies to data collection, data splitting, recommendation, candidate item generation, metric computation, and statistics •  There exist several details (in implementation) that might hide important effects in final results 25-Aug-17 ACM RecSys Summer School 2017 118
  • 119. Replication and Reproducibility in RecSys: Key takeaways •  Every decision has an impact –  We should log every step taken in the experimental part and report that log •  There are more things besides papers –  Source code, web appendix, etc. are very useful to provide additional details not present in the paper •  You should not fool yourself –  You have to be critical about what you measure and not trust intermediate “black boxes” 25-Aug-17 ACM RecSys Summer School 2017 119
  • 120. Replication and Reproducibility in RecSys: Next steps? •  We should agree on standard implementations, parameters, instantiations, … –  Example: trec_eval in IR 25-Aug-17 ACM RecSys Summer School 2017 120
  • 121. Replication and Reproducibility in RecSys: Next steps? •  We should agree on standard implementations, parameters, instantiations, … •  Replicable badges for journals / conferences 25-Aug-17 ACM RecSys Summer School 2017 121
  • 122. Replication and Reproducibility in RecSys: Next steps? •  We should agree on standard implementations, parameters, instantiations, … •  Replicable badges for journals / conferences 25-Aug-17 ACM RecSys Summer School 2017 122 http://validation.scienceexchange.com The aim of the Reproducibility Initiative is to identify and reward high-quality reproducible research via independent validation of key experimental results
  • 123. Replication and Reproducibility in RecSys: Next steps? •  We should agree on standard implementations, parameters, instantiations, … •  Replicable badges for journals / conferences •  Investigate how to improve reproducibility 25-Aug-17 ACM RecSys Summer School 2017 123
  • 124. Replication and Reproducibility in RecSys: Next steps? •  We should agree on standard implementations, parameters, instantiations, … •  Replicable badges for journals / conferences •  Investigate how to improve reproducibility •  Benchmark, report, and store results 25-Aug-17 ACM RecSys Summer School 2017 124
  • 125. Pointers •  Email and Twitter –  Alejandro Bellogín •  alejandro.bellogin@uam.es •  @abellogin –  Alan Said •  alansaid@acm.org •  @alansaid •  Slides: •  https://github.com/recommenders/rsss2017 125
  • 128. References and Additional reading •  [Armstrong et al, 2009] Improvements That Don’t Add Up: Ad-Hoc Retrieval Results Since 1998. CIKM •  [Bellogín et al, 2010] A Study of Heterogeneity in Recommendations for a Social Music Service. HetRec •  [Bellogín et al, 2011] Precision-Oriented Evaluation of Recommender Systems: an Algorithm Comparison. RecSys •  [Bellogín et al, 2013] An Empirical Comparison of Social, Collaborative Filtering, and Hybrid Recommenders. ACM TIST •  [Bellogín et al, 2017] Statistical biases in Information Retrieval metrics for recommender systems. Information Retrieval Journal •  [Ben-Shimon et al, 2015] RecSys Challenge 2015 and the YOOCHOOSE Dataset. RecSys •  [Bouckaert, 2003] Choosing Between Two Learning Algorithms Based on Calibrated Tests. ICML •  [Cremonesi et al, 2010] Performance of Recommender Algorithms on Top-N Recommendation Tasks. RecSys •  [Filippone & Sanguinetti, 2010] Information Theoretic Novelty Detection. Pattern Recognition •  [Fleder & Hosanagar, 2009] Blockbuster Culture’s Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity. Management Science 128
  • 129. References and Additional reading •  [Ge et al, 2010] Beyond accuracy: evaluating recommender systems by coverage and serendipity. RecSys •  [Gorla et al, 2013] Probabilistic Group Recommendation via Information Matching. WWW •  [Herlocker et al, 2004] Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems •  [Jambor & Wang, 2010] Goal-Driven Collaborative Filtering. ECIR •  [Knijnenburg et al, 2011] A Pragmatic Procedure to Support the User-Centric Evaluation of Recommender Systems. RecSys •  [Koren, 2008] Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. KDD •  [Lathia et al, 2010] Temporal Diversity in Recommender Systems. SIGIR •  [Li et al, 2010] Improving One-Class Collaborative Filtering by Incorporating Rich User Information. CIKM •  [Pu et al, 2011] A User-Centric Evaluation Framework for Recommender Systems. RecSys •  [Said & Bellogín, 2014] Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks. RecSys •  [Sakai, 2014] Statistical reform in Information Retrieval? SIGIR Forum 129
  • 130. References and Additional reading •  [Schein et al, 2002] Methods and Metrics for Cold-Start Recommendations. SIGIR •  [Shani & Gunawardana, 2011] Evaluating Recommendation Systems. Recommender Systems Handbook •  [Steck & Xin, 2010] A Generalized Probabilistic Framework and its Variants for Training Top-k Recommender Systems. PRSAT •  [Tikk et al, 2014] Comparative Evaluation of Recommender Systems for Digital Media. IBC •  [Vargas & Castells, 2011] Rank and Relevance in Novelty and Diversity Metrics for Recommender Systems. RecSys •  [Weng et al, 2007] Improving Recommendation Novelty Based on Topic Taxonomy. WI-IAT •  [Yin et al, 2012] Challenging the Long Tail Recommendation. VLDB •  [Zhang & Hurley, 2008] Avoiding Monotony: Improving the Diversity of Recommendation Lists. RecSys •  [Zhang & Hurley, 2009] Statistical Modeling of Diversity in Top-N Recommender Systems. WI-IAT •  [Zhou et al, 2010] Solving the Apparent Diversity-Accuracy Dilemma of Recommender Systems. PNAS •  [Ziegler et al, 2005] Improving Recommendation Lists Through Topic Diversification. WWW 130