Fairness and Popularity Bias in Recommender Systems: an Empirical Evaluation

@cataldomusto cataldo.musto@uniba.it
Fairness and Popularity Bias
in Recommender Systems:
an Empirical Evaluation
CATALDO MUSTO, PASQUALE LOPS, GIOVANNI SEMERARO
UNIVERSITÀ DEGLI STUDI DI BARI ‘ALDO MORO’ - ITALY

Recommender Systems
Technology able to push
relevant items (movies, news,
books, etc.) to the users
based on their preferences.
2
Cataldo Musto, Pasquale Lops, Giovanni Semeraro. Fairness and Popularity Bias in Recommender Systems: an Empirical Evaluation.
AIxIA 2021 – 20th International Conference of the Italian Association for Artificial Intelligence. Online Event. December 3, 2021

How to Evaluate a RecSys?
Recommender Systems are
typically evaluated by using
accuracy metrics (Precision,
Recall, NDCG, HitRate, etc.)
Is there anything else we can
evaluate?
3

Popular Non-Accuracy Metrics
Novelty Serendipity Fairness
4

Popular Non-Accuracy Metrics
Novelty Serendipity Fairness
5

Fairness in AI
By definition, fairness refers to the
non-discrimination of certain groups
based on particular protected
attributes (gender, race, etc.)
Fair Al algorithms do not suffer of
these bias (or implement particular
strategies to mitigate them)
6

Fairness in RecSys
As regards RecSys, the concept of fairness
is multi-sided
Items Fairness: if all the items have equal
probability of being recommended
Users Fairness: if the recommendation list
reflects the actual preferences of the users
based on a target features (e.g., genre)
7

Problem: Popularity Bias
8
Fairness of RecSys (and quality of recommendations) is
strongly affected by popularity bias
What does it mean?
Users tend to express preferences on popular
items, and this make popular items to be
recommended more frequently w.r.t. niche
ones [*].
[*] H. Abdollahpouri, M. Mansoury, R. Burke, B. Mobasher,
The unfairness of popularity biasin recommendation, arXiv
preprint arXiv:1907.13286 (2019).

Contribution
9
The goal of our study is to evaluate the
fairness-by-design of the most popular
recommendation paradigms.

Recommendation Paradigms
Collaborative
Filtering
Content-based
RecSys
Graph-based
RecSys
AIxIA 2021 – 20th International Conference of the Italian Association for Artificial Intelligence. Online Event. December 3, 2021 10

Collaborative
Filtering
Paradigms: Collaborative Filtering
Exploits the preferences of the
community to generate
recommendations.
Intuition: to suggest items liked
by users similar to the target
one

target
user
neighbor

target
user Problems: sparsity (cold start)
Neighbor ?
Neighbor ?
Neighbor ?
Neighbor ?

Paradigms: Collabor. Filtering (advanced)
15

Paradigms: Content-based RecSys
Exploit descriptive features
of the items (e.g. genre of a
book, director of a movie) to
generate recommendations.
Insight: to suggest items
similar to those the user
already liked
16

Paradigms: Content-based RecSys
user profile items
Recommendations
are generated by
matching the
features stored in
the user profile
with those
describing the
items to be
recommended.
17

Paradigms: (Recent) Content RecSys
18
Reecent methods exploit word
embedding techniques, which are based
on the distributional hypothesis
beer
wine
glass
spoon

19
Matrix Revolutions
Donnie Darko
Grandi Magazzini
The Matrix
Paradigms: (Recent) Content RecSys
Reecent methods exploit word
embedding techniques, which are based
on the distributional hypothesis

Paradigms: Graph-based RecSys
Exploit a graph-based data
model connecting users,
items and descriptive
properties of the items (e.g.,
gathered from a knowledge
graph)
Insight: recommendation
seen as a path ranking
problem

Typically, the
relevance score for all
the item nodes is
calculated and the
top-N are returned as
recommendations
PageRank
Spreading Activation
Personalized PR
…
Paradigms: Graph-based RecSys

Research Questions
22
1. Which recommendation paradigm is more
prone by design to popularity bias?
2. Which recommendation paradigm
generates more fair recommendation
lists?

Experimental Protocol
23
Algorithm
Top-10
Recommmendations
Evaluation Metrics
Ratings
Content
Data are typically split into
training and test

Experimental Evaluation
24
Two datasets
GoodBooks MovieLens1M
Users 53,424 6,040
Items 10,000 3,883
Ratings 6,000,000 1,000,209
%Positive 68.97% 57.91%
Sparsity 99.82% 96.42%
Content Plot Plot

25
GoodBooks MovieLens1M
Users 53,424 6,040
Items 10,000 3,883
Ratings 6,000,000 1,000,209
%Positive 68.97% 57.91%
Sparsity 99.82% 96.42%
Content Plot Plot
Two datasets

26
Algorithms: Collaborative Filtering
User-to-User CF
Item-to-Item CF
FunkSVD
Biased Matrix Factorization

27
Algorithms: Content-based RecSys
Vector Space Model +
TF/IDF
Word2Vec
Doc2Vec
Latent Semantic Indexing

28
Algorithms: Graph-based RecSys
PageRank
Personalized PageRank

29
Algorithms: Baselines
Random (upper bound)
Popularity-based (lower bound)

Evaluation Metrics
30
Items Fairness: Catalogue Coverage
Items Fairness: Gini Index
Percentage of items in the catalogue recommended to at least one
users (the higher, the better)
Measures how unbalanced (in terms of frequency) is the
distribution of the recommendations to all the users. Values in
range [0,1], the lower the better. Gini Index close to 1 means that
just a few items appear in the recommendation lists.

Evaluation Metrics
31
Users Fairness: ΔGAP
Gap between the average popularity of the items in the profile w.r.t. the average popularity
of the items in the recommendation list. A fair recommendation list reflects the average
popularity of the items in the profile.
Users are first split in groups (block-buster, diverse, niche users) and the average
popularity of each group is calculated

Evaluation Metrics
32
Users Fairness: ΔGAP
Gap between the average popularity of the items in the profile w.r.t. the average popularity
of the items in the recommendation list. A fair recommendation list reflects the average
popularity of the items in the profile.
Users are first split in groups (block-buster, diverse, niche users) and the average
popularity of each group is calculated
ΔGAP=0 fair recommendation list
ΔGAP<0 under-estimation of the popularity
ΔGAP>0 over-estimation of the popularity

Results
33
Catalogue Coverage
MovieLens 1M Goodbooks
Baselines Random 3,688 94.98% 10,000 100.00%
Popular 67 1.63% 243 2.43%
Collaborative
Filtering
U2U-CF 296 7.62% 2,210 22.10%
I2I-CF 471 12.13% 2,819 28.19%
Biased-MF 547 14.09% 1,830 18.30%
Content-based
Recsys
VSM/TF-IDF 444 11.43% 2,622 26.22%
Word2Vec 492 12.67% 3,081 30.81%
Doc2Vec 476 12.26% 2,987 29.87%
Graph-based
RecSys
PR 15 0.38% 11 0.11%
PPR 36 0.92% 22 0.22%

Results
34
Catalogue Coverage
Popular 67 1.63% 243 2.43%
Collaborative
Filtering
U2U-CF 296 7.62% 2,210 22.10%
I2I-CF 471 12.13% 2,819 28.19%
Biased-MF 547 14.09% 1,830 18.30%
Content-based
Recsys
VSM/TF-IDF 444 11.43% 2,622 26.22%
Word2Vec 492 12.67% 3,081 30.81%
Doc2Vec 476 12.26% 2,987 29.87%
Graph-based
RecSys
PR 15 0.38% 11 0.11%
PPR 36 0.92% 22 0.22%

Results
35
Catalogue Coverage
Popular 67 1.63% 243 2.43%
Collaborative
Filtering
U2U-CF 296 7.62% 2,210 22.10%
I2I-CF 471 12.13% 2,819 28.19%
Biased-MF 547 14.09% 1,830 18.30%
Content-based
Recsys
VSM/TF-IDF 444 11.43% 2,622 26.22%
Word2Vec 492 12.67% 3,081 30.81%
Doc2Vec 476 12.26% 2,987 29.87%
Graph-based
RecSys
PR 15 0.38% 11 0.11%
PPR 36 0.92% 22 0.22%

Take-Home Messages
36
1. Collaborative Filtering algorithms cover a larger
number of items when the dataset is less sparse.
2. When the sparsity is higher, content-based
recommendations are best option
3. Graph-based recommender systems behave worse
than popularity-based recsys
• All the algorithms are very prone to popularity bias (Gini
Index values very close to 1)
• Recommendations algorithms tipically under-estimate
the average popularity of the recommendations
• DGap are close to zero. Good fairness for the users, on
average

Results
37
Gini Index
MovieLens1M GoodBooks
Baselines Random 0.185 0.334
Popular 0.995 0.998
Collaborative Filtering U2U-CF 0.989 0.973
I2I-CF 0.990 0.986
Biased-MF 0.984 0.987
Content-based Recsys VSM/TF-IDF 0.990 0.961
Word2Vec 0.985 0.956
Doc2Vec 0.987 0.959
Graph-based RecSys PR 0.996 0.999
PPR 0.995 0.998

Results
38
Gini Index
MovieLens1M GoodBooks
Baselines Random 0.185 0.334
Popular 0.995 0.998
Collaborative Filtering U2U-CF 0.989 0.973
I2I-CF 0.990 0.986
Biased-MF 0.984 0.987
Content-based Recsys VSM/TF-IDF 0.990 0.961
Word2Vec 0.985 0.956
Doc2Vec 0.987 0.959
Graph-based RecSys PR 0.996 0.999
PPR 0.995 0.998

Take-Home Messages
39
1. Collaborative Filtering algorithms cover a larger number
of items when the dataset is less sparse.
3. Graph-based recommender systems behave worse than
popularity-based recsys
4. All the algorithms are very prone to popularity bias (Gini
• Recommendations algorithms tipically under-estimate
• DGap are close to zero. Good fairness for the users, on
average

Results
40
ΔGAP – MovieLens Data

Results
41
ΔGAP – MovieLens Data

Results
42
ΔGAP – GoodBooks Data

Results
43
ΔGAP – GoodBooks Data

Take-Home Messages
44
1. Collaborative Filtering algorithms cover a larger number
of items when the dataset is less sparse.
3. Graph-based recommender systems behave worse than
popularity-based recsys
4. All the algorithms are very prone to popularity bias (Gini
5. Recommendations algorithms tipically under-estimate
6. ΔGap is close to zero. Good fairness for the users, on
average

Conclusions
45
This paper provides a benchmark of the fairness-by-design of some
popular recommendation paradigms

Conclusions
46
This paper provides a benchmark of the fairness-by-design of some
popular recommendation paradigms
• Results showed that algorithms are strongly affected by popularity bias.
• Item coverage is not satisfying, recommendation lists are not fair (as
confirmed by Gini Index). Mitigation strategies are needed.
• ΔGAP values are more satisfying when more advanced techniques are
exploited
• Future Work. Evaluation of hybrid recommendation models and more
recent content-based recommendation techniques. Introduction of other
non-accuracy metrics (novelty, serendipity, etc.)

Thank you! cataldo.musto@uniba.it
@cataldomusto
Contacts
47

Fairness and Popularity Bias in Recommender Systems: an Empirical Evaluation

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Fairness and Popularity Bias in Recommender Systems: an Empirical Evaluation

Ähnlich wie Fairness and Popularity Bias in Recommender Systems: an Empirical Evaluation (20)

Mehr von Cataldo Musto

Mehr von Cataldo Musto (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Fairness and Popularity Bias in Recommender Systems: an Empirical Evaluation