Improving aggregate recommendation diversity

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, MANUSCRIPT ID 1

Improving Aggregate Recommendation Diversity
Using Ranking-Based Techniques
Gediminas Adomavicius, Member, IEEE, and YoungOk Kwon

Abstract— Recommender systems are becoming increasingly important to individual users and businesses for providing personalized
recommendations. However, while the majority of algorithms proposed in recommender systems literature have focused on improving
recommendation accuracy (as exemplified by the recent Netflix Prize competition), other important aspects of recommendation quality, such
as the diversity of recommendations, have often been overlooked. In this paper, we introduce and explore a number of item ranking
techniques that can generate recommendations that have substantially higher aggregate diversity across all users while maintaining
comparable levels of recommendation accuracy. Comprehensive empirical evaluation consistently shows the diversity gains of the proposed
techniques using several real-world rating datasets and different rating prediction algorithms.

Index Terms— Recommender systems, recommendation diversity, ranking functions, performance evaluation metrics, collaborative filtering.

—————————— ——————————

G. Adomavicius is with the Department of Information and Decision Sciences, Carlson School of Management, University of Minnesota, Minneapolis, MN 55455. E-mail: gedas@umn.edu.
Y. Kwon is with the Department of Information and Decision Sciences, Carlson School of Management, University of Minnesota, Minneapolis, MN 55455. E-mail: kwonx052@umn.edu.

Digital Object Identifier 10.1109/TKDE.2011.15 1041-4347/11/$26.00 © 2011 IEEE

1 Introduction

In the current age of information overload, it is becoming increasingly hard to find relevant content. This problem is not only widespread but also alarming [28]. Over the last 10-15 years, recommender systems technologies have been introduced to help people deal with these vast amounts of information [1], [7], [9], [30], [36], [39], and they have been widely used in research as well as e-commerce applications, such as the ones used by Amazon and Netflix.

The most common formulation of the recommendation problem relies on the notion of ratings, i.e., recommender systems estimate ratings of items (or products) that are yet to be consumed by users, based on the ratings of items already consumed. Recommender systems typically try to predict the ratings of unknown items for each user, often using other users’ ratings, and recommend top N items with the highest predicted ratings. Accordingly, there have been many studies on developing new algorithms that can improve the predictive accuracy of recommendations. However, the quality of recommendations can be evaluated along a number of dimensions, and relying on the accuracy of recommendations alone may not be enough to find the most relevant items for each user [24], [32]. In particular, the importance of diverse recommendations has been previously emphasized in several studies [8], [10], [14], [33], [46], [54], [57]. These studies argue that one of the goals of recommender systems is to provide a user with highly idiosyncratic or personalized items, and more diverse recommendations result in more opportunities for users to get recommended such items. With this motivation, some studies proposed new recommendation methods that can increase the diversity of recommendation sets for a given individual user, often measured by an average dissimilarity between all pairs of recommended items, while maintaining an acceptable level of accuracy [8], [33], [46], [54], [57]. These studies measure recommendation diversity from an individual user’s perspective (i.e., individual diversity).

In contrast to individual diversity, which has been explored in a number of papers, some recent studies [10], [14] started examining the impact of recommender systems on sales diversity by considering aggregate diversity of recommendations across all users. Note that high individual diversity of recommendations does not necessarily imply high aggregate diversity. For example, if the system recommends to all users the same five best-selling items that are not similar to each other, the recommendation list for each user is diverse (i.e., high individual diversity), but only five distinct items are recommended to all users and purchased by them (i.e., resulting in low aggregate diversity or high sales concentration).

While the benefits of recommender systems that provide higher aggregate diversity would be apparent to many users (because such systems focus on providing a wider range of items in their recommendations and not mostly bestsellers, which users are often capable of discovering by themselves), such systems could be beneficial for some business models as well [10], [11], [14], [20]. For example, it would be profitable to Netflix if the recommender systems can encourage users to rent “long-tail” type of movies (i.e., more obscure items that are located in the tail of the sales distribution [2]) because they are less costly to license and acquire from distributors than new-release or highly-popular movies of big studios [20]. However, the impact of recommender systems on aggregate diversity in real-world e-commerce applications has not been well understood. For example, one study [10], using data from an online clothing retailer, confirms the “long tail” phenomenon that refers to the increase in the tail of the sales distribution (i.e., the increase in aggregate diversity) attributable to the usage of the recommender system. On the other hand, another study [14] shows a contradictory finding that recommender systems actually can reduce the aggregate diversity in sales. This can be explained by the fact that the idiosyncratic items often have limited historical data and, thus, are more difficult to recommend to users; in contrast, popular items typically have more ratings and, therefore, can be recommended to more users. For example, in the context of the Netflix Prize competition [6], [22], there is some evidence that, since recommender systems seek to find the common items (among thousands of possible movies) that two users have watched, these systems inherently tend to avoid extremes and recommend very relevant but safe recommendations to users [50].

As seen from this recent debate, there is a growing awareness of the importance of aggregate diversity in recommender systems. Furthermore, while, as mentioned earlier, there has been a significant amount of work done on improving individual diversity, the issue of aggregate diversity in recommender systems has been largely untouched. Therefore, in this paper, we focus on developing algorithmic techniques for improving aggregate diversity of recommendations (which we will simply refer to as diversity throughout the paper, unless explicitly specified otherwise), which can be intuitively measured by the number of distinct items recommended across all users.

Higher diversity (both individual and aggregate), however, can come at the expense of accuracy. As is well known, there is a tradeoff between accuracy and diversity because high accuracy may often be obtained by safely recommending to users the most popular items, which can clearly lead to the reduction in diversity, i.e., less personalized recommendations [8], [33], [46]. And conversely, higher diversity can be achieved by trying to uncover and recommend highly idiosyncratic or personalized items for each user, which often have less data and are inherently more difficult to predict, and, thus, may lead to a decrease in recommendation accuracy.

Table 1 illustrates an example of the accuracy and diversity tradeoff in two extreme cases where only popular items or long-tail type items are recommended to users, using the MovieLens rating dataset (datasets used in this paper are discussed in Section 5.1). In this example, we used a popular recommendation technique, i.e., neighborhood-based collaborative filtering (CF) technique [9], to predict unknown ratings. Then, as candidate recommendations for each user, we considered only the items that were predicted above the pre-defined rating threshold to assure the acceptable level of accuracy, as is typically done in recommender systems. Among these candidate items for each user, we identified the item that was rated by the most users (i.e., the item with the largest number of known ratings) as a popular item, and the item that was rated by the fewest users (i.e., the item with the smallest number of known ratings) as a long-tail item. As illustrated by Table 1, if the system recommends each user the most popular item (among the ones that had a sufficiently high predicted rating), it is much more likely for many users to get the same recommendation (e.g., the best-selling item). The accuracy measured by the precision-in-top-1 metric (i.e., the percentage of truly “high” ratings among those that were predicted to be “high” by the recommender system) is 82%, but only 49 popular items out of approximately 2000 available distinct items are recommended across all users. The system can improve the diversity of recommendations from 49 up to 695 (a 14-fold increase) by recommending the long-tail item to each user (i.e., the least popular item among highly-predicted items for each user) instead of the popular item. However, high diversity in this case is obtained at the significant expense of accuracy, i.e., drop from 82% to 68%.

TABLE 1. ACCURACY-DIVERSITY TRADEOFF: EMPIRICAL EXAMPLE

                                                                     Quality metric:
Top-1 recommendation of:                                             Accuracy   Diversity
Popular item (item with the largest number of known ratings)         82%        49 distinct items
“Long-tail” item (item with the smallest number of known ratings)    68%        695 distinct items

Note. Recommendations (top-1 item for each user) are generated for 2828 users among the items that are predicted above the acceptable threshold 3.5 (out of 5), using a standard item-based collaborative filtering technique with 50 neighbors on the MovieLens dataset.

The above example shows that it is possible to obtain higher diversity simply by recommending less popular items; however, the loss of recommendation accuracy in this case can be substantial. In this paper, we explore new recommendation approaches that can increase the diversity of recommendations with only a minimal (negligible) accuracy loss using different recommendation ranking techniques. In particular, traditional recommender systems typically rank the relevant items in a descending order of their predicted ratings for each user and then recommend top N items, resulting in high accuracy. In contrast, the proposed approaches consider additional factors, such as item popularity, when ranking the recommendation list to substantially increase recommendation diversity while maintaining comparable levels of accuracy. This paper provides a comprehensive empirical evaluation of the proposed approaches, where they are tested with various datasets in a variety of different settings. For example, the best results show up to 20-25% diversity gain with only 0.1% accuracy loss, up to 60-80% gain with 1% accuracy loss, and even substantially higher diversity improvements (e.g., up to 250%) if some users are willing to tolerate higher accuracy loss.

In addition to providing significant diversity gains, the proposed ranking techniques have several other advantageous characteristics. In particular, these techniques are extremely efficient, because they are based on scalable sorting-based heuristics that make decisions based only on the “local” data (i.e., only on the candidate items of each individual user) without having to keep track of the “global” information, such as which items have been recommended across all users and how many times. The techniques are also parameterizable, since the user has the control to choose the acceptable level of accuracy for which the diversity will be maximized. Also, the proposed ranking techniques provide a flexible solution to improving recommendation diversity because: they are applied after the unknown item ratings have been estimated and, thus, can achieve diversity gains in conjunction with a number of different rating prediction techniques, as illustrated in the paper; as mentioned above, the vast majority of current recommender systems already employ some ranking approach, thus, the proposed techniques would not introduce new types of procedures into recommender systems (they would replace existing ranking procedures); the proposed ranking approaches do not require any additional information about users (e.g., demographics) or items (e.g., content features) aside from the ratings data, which makes them applicable in a wide variety of recommendation contexts.

The remainder of the paper is organized as follows. Section 2 reviews relevant literature on traditional recommendation algorithms and the evaluation of recommendation quality. Section 3 describes our motivations for alternative recommendation ranking techniques, such as item popularity. We then propose several additional ranking techniques in Section 4, and
the main empirical results follow in Section 5. Additional experiments are conducted to further explore the proposed ranking techniques in Section 6. Lastly, Section 7 concludes the paper by summarizing the contributions and future directions.

2 Related Work

2.1 Recommendation Techniques for Rating Prediction

Recommender systems are usually classified into three categories based on their approach to recommendation: content-based, collaborative, and hybrid approaches [1], [3]. Content-based recommender systems recommend items similar to the ones the user preferred in the past. Collaborative filtering (CF) recommender systems recommend items that users with similar preferences (i.e., “neighbors”) have liked in the past. Finally, hybrid approaches can combine content-based and collaborative methods in several different ways. Recommender systems can also be classified based on the nature of their algorithmic technique into heuristic (or memory-based) and model-based approaches [1], [9]. Heuristic techniques typically calculate recommendations based directly on the previous user activities (e.g., transactional data or rating values). One of the commonly used heuristic techniques is a neighborhood-based approach that finds nearest neighbors that have tastes similar to those of the target user [9], [13], [34], [36], [40]. In contrast, model-based techniques use previous user activities to first learn a predictive model, typically using some statistical or machine-learning methods, which is then used to make recommendations. Examples of such techniques include Bayesian clustering, aspect model, flexible mixture model, matrix factorization, and other methods [4], [5], [9], [25], [44], [48].

In real world settings, recommender systems generally perform the following two tasks in order to provide recommendations to each user. First, the ratings of unrated items are estimated based on the available information (typically using known user ratings and possibly also information about item content or user demographics) using some recommendation algorithm. And second, the system finds items that maximize the user’s utility based on the predicted ratings, and recommends them to the user. Ranking approaches proposed in this paper are designed to improve the recommendation diversity in the second task of finding the best items for each user.

Because of the decomposition of rating estimation and recommendation ranking tasks, our proposed ranking approaches provide a flexible solution, as mentioned earlier: they do not introduce any new procedures into the recommendation process and also can be used in conjunction with any available rating estimation algorithm. In our experiments, to illustrate the broad applicability of the proposed recommendation ranking approaches, we used them in conjunction with the most popular and widely employed CF techniques for rating prediction: a heuristic neighborhood-based technique and a model-based matrix factorization technique.

Before we provide an overview of each technique, we introduce some notation and terminology related to recommendation problem. Let U be the set of users of a recommender system, and let I be the set of all possible items that can be recommended to users. Then, the utility function that represents the preference of item i ∈ I by user u ∈ U is often defined as R: U × I → Rating, where Rating typically represents some numeric scale used by the users to evaluate each item. Also, in order to distinguish between the actual ratings and the predictions of the recommender system, we use the R(u, i) notation to represent a known rating (i.e., the actual rating that user u gave to item i), and the R*(u, i) notation to represent an unknown rating (i.e., the system-predicted rating for item i that user u has not rated before).

Neighborhood-based CF technique

There exist multiple variations of neighborhood-based CF techniques [9], [36], [40]. In this paper, to estimate R*(u, i), i.e., the rating that user u would give to item i, we first compute the similarity between user u and other users u' using a cosine similarity metric [9], [40]:

$$\mathrm{sim}(u, u') = \frac{\sum_{i \in I(u,u')} R(u,i)\, R(u',i)}{\sqrt{\sum_{i \in I(u,u')} R(u,i)^2}\;\sqrt{\sum_{i \in I(u,u')} R(u',i)^2}}, \qquad (1)$$

where I(u, u') represents the set of all items rated by both user u and user u'. Based on the similarity calculation, set N(u) of nearest neighbors of user u is obtained. The size of set N(u) can range anywhere from 1 to |U|-1, i.e., all other users in the dataset. Then, R*(u, i) is calculated as the adjusted weighted sum of all known ratings R(u', i), where u' ∈ N(u) [13], [34]:

$$R^*(u, i) = \bar{R}(u) + \frac{\sum_{u' \in N(u)} \mathrm{sim}(u, u')\,\bigl(R(u', i) - \bar{R}(u')\bigr)}{\sum_{u' \in N(u)} \lvert \mathrm{sim}(u, u') \rvert}. \qquad (2)$$

Here $\bar{R}(u)$ represents the average rating of user u.

A neighborhood-based CF technique can be user-based or item-based, depending on whether the similarity is calculated between users or items. Formulae (1) and (2) represent the user-based approach, but they can be straightforwardly rewritten for the item-based approach because of the symmetry between users and items in all neighborhood-based CF calculations [40]. In our experiments we used both user-based and item-based approaches for rating estimation.

Matrix factorization CF technique

Matrix factorization techniques have been the mainstay of numerical linear algebra dating back to the 1970s [16], [21], [27] and have recently gained popularity in recommender systems applications because of their effectiveness in improving recommendation accuracy [41], [47], [52], [55]. Many variations of matrix factorization techniques have been developed to solve the problems of data sparsity, overfitting, and convergence speed, and they turned out to be a crucial component of many well-performing algorithms in the popular Netflix Prize [footnote 1] competition [4], [5], [6], [15], [22], [29], [30]. We implemented the basic version of this technique, as presented in [15]. With the assumption that a user’s rating for an item is composed of a sum of preferences about the various features of that item, this model is induced by Singular Value Decomposition (SVD) on the user-item ratings matrix. In particular, using K features (i.e., rank-K SVD), user u is associated with a user-factors vector p_u (the user’s preferences for K features), and item i is associated with an item-factors vector q_i (the item’s importance weights for K features). The preference of how much user u likes item i, denoted by R*(u, i), is predicted by taking an inner product of the two vectors, i.e.,

$$R^*(u, i) = p_u^T q_i. \qquad (3)$$

[footnote 1] More information can be found at www.netflixprize.com.
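To make the neighborhood-based technique concrete, Equations (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch and not the authors' implementation: the nested-dictionary layout of `ratings`, the function names, the toy data, and the choice to draw the neighborhood only from users who have rated the target item are all assumptions made for the example.

```python
import math

def cosine_sim(ratings, u, v):
    """Equation (1): cosine similarity over the items rated by both users."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def mean_rating(ratings, u):
    """Average of all known ratings of user u."""
    return sum(ratings[u].values()) / len(ratings[u])

def predict(ratings, u, item, k=50):
    """Equation (2): adjusted weighted sum over the k nearest neighbors.

    Assumption: neighbors are chosen among users who have rated `item`.
    """
    candidates = [(cosine_sim(ratings, u, v), v)
                  for v in ratings if v != u and item in ratings[v]]
    neighbors = sorted(candidates, reverse=True)[:k]
    denom = sum(abs(s) for s, _ in neighbors)
    if denom == 0:
        return mean_rating(ratings, u)  # no usable neighbor: fall back to the user mean
    num = sum(s * (ratings[v][item] - mean_rating(ratings, v)) for s, v in neighbors)
    return mean_rating(ratings, u) + num / denom

# Hypothetical toy data: user -> {item: rating on a 1-5 scale}
ratings = {
    "u1": {"a": 5, "b": 3},
    "u2": {"a": 4, "b": 2, "c": 5},
    "u3": {"a": 1, "b": 5, "c": 2},
}
prediction = predict(ratings, "u1", "c")
```

Swapping the roles of users and items in the same two functions gives the item-based variant mentioned in the text.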



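The factor model of Equation (3), fitted with the regularized gradient-descent update that the paper gives in (4), can be sketched as follows. This is a simplified illustration under stated assumptions: all K features are updated together for a fixed number of passes (the paper describes training with a stopping rule per feature), both factor vectors are updated from their pre-update values, and the function names, default hyperparameters, initialization range, and toy data are hypothetical.

```python
import random

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def train_mf(data, n_factors=2, lr=0.01, reg=0.02, n_epochs=200, seed=0):
    """Fit user factors p and item factors q by stochastic gradient descent.

    data: list of (user, item, rating) triples with known ratings.
    lr is the learning rate and reg the regularization parameter.
    """
    rng = random.Random(seed)
    p, q = {}, {}
    for u, i, _ in data:
        # Arbitrary small starting values (the range is an assumption)
        p.setdefault(u, [rng.uniform(0.05, 0.15) for _ in range(n_factors)])
        q.setdefault(i, [rng.uniform(0.05, 0.15) for _ in range(n_factors)])
    for _ in range(n_epochs):
        for u, i, r in data:
            err = r - dot(p[u], q[i])
            for k in range(n_factors):
                pu, qi = p[u][k], q[i][k]
                p[u][k] = pu + lr * (err * qi - reg * pu)
                q[i][k] = qi + lr * (err * pu - reg * qi)
    return p, q

def rmse(data, p, q):
    """Root mean squared error of the predictions p_u . q_i on data."""
    return (sum((r - dot(p[u], q[i])) ** 2 for u, i, r in data) / len(data)) ** 0.5

# Hypothetical toy data
data = [("u1", "a", 5), ("u1", "b", 3), ("u2", "a", 4), ("u2", "b", 2),
        ("u2", "c", 5), ("u3", "a", 1), ("u3", "b", 5), ("u3", "c", 2)]
p0, q0 = train_mf(data, n_epochs=0)   # untrained factors, same seed
p, q = train_mf(data)
predicted_u1_c = dot(p["u1"], q["c"])  # Equation (3) for an unrated pair
```

Comparing `rmse(data, p0, q0)` with `rmse(data, p, q)` shows how far training has reduced the error on the known ratings.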
All values in user- and item-factor vectors are initially assigned to arbitrary numbers and estimated with a simple gradient descent technique as described in (4). User- and item-factor vectors are iteratively updated with learning rate parameter (η) as well as regularization parameter (λ), which is used to minimize overfitting, until the minimum improvement in predictive accuracy or a pre-defined number of iterations per feature is reached. One learning iteration is defined as:

For each rating R(u, i)
    $err = R(u, i) - p_u^T q_i$
    $p_u = p_u + \eta\,(err \cdot q_i - \lambda\, p_u)$    (4)
    $q_i = q_i + \eta\,(err \cdot p_u - \lambda\, q_i)$
End For

Finally, unknown ratings are estimated with the final two vectors p_u and q_i as stated in (3). More details on variations of matrix factorization techniques used in recommender systems can be found in [4], [5], [30], [52], [55].

2.2 Accuracy of Recommendations

Numerous recommendation techniques have been developed over the last few years, and various metrics have been employed for measuring the accuracy of recommendations, including statistical accuracy metrics and decision-support measures [24]. As examples of statistical accuracy metrics, mean absolute error (MAE) and root mean squared error (RMSE) metrics measure how well a system can predict an exact rating value for a specific item. Examples of decision-support metrics include precision (the percentage of truly “high” ratings among those that were predicted to be “high” by the recommender system), recall (the percentage of correctly predicted “high” ratings among all the ratings known to be “high”), and F-measure, which is a harmonic mean of precision and recall. In particular, the ratings of the datasets that we used in our experiments are integers between 1 and 5, inclusive, where a higher value represents a better-liked item. As commonly done in recommender systems literature, we define the ratings greater than 3.5 (threshold for “high” ratings, denoted by T_H) as “highly-ranked” and the ratings less than 3.5 as “non-highly-ranked.” Furthermore, in real world settings, recommender systems typically recommend the most highly-ranked N items since users are usually interested in only several most relevant recommendations, and this list of N items for user u can be defined as L_N(u) = {i_1, …, i_N}, where R*(u, i_k) ≥ T_H for all k ∈ {1, 2, …, N}. Therefore, in our paper, we evaluate the recommendation accuracy based on the percentage of truly “highly-ranked” ratings, denoted by correct(L_N(u)), among those that were predicted to be the N most relevant “highly ranked” items for each user, i.e., using the popular precision-in-top-N metric [24]. The metric can be written formally as:

$$\text{precision-in-top-}N = \sum_{u \in U} \lvert \mathrm{correct}(L_N(u)) \rvert \Big/ \sum_{u \in U} \lvert L_N(u) \rvert,$$

where correct(L_N(u)) = {i ∈ L_N(u) | R(u, i) ≥ T_H}. However, relying on the accuracy of recommendations alone may not be enough to find the most relevant items for a user. It has often been suggested that recommender systems must be not only accurate, but also useful [24], [32]. For example, [32] suggests new user-centric directions for evaluating recommender systems beyond the conventional accuracy metrics. They claim that serendipity in recommendations or user experiences and expectations also should be considered in evaluating the recommendation quality. Among many different aspects that cannot be measured by accuracy metrics alone, in this paper we focus on the notion of the diversity of recommendations, which is discussed next.

2.3 Diversity of Recommendations

As mentioned in Section 1, the diversity of recommendations can be measured in two ways: individual and aggregate.

Most recent studies have focused on increasing the individual diversity, which can be calculated from each user’s recommendation list (e.g., an average dissimilarity between all pairs of items recommended to a given user) [8], [33], [46], [54], [57]. These techniques aim to avoid providing too similar recommendations for the same user. For example, some studies [8], [46], [57] used an intra-list similarity metric to determine the individual diversity. Alternatively, [54] used a new evaluation metric, item novelty, to measure the amount of additional diversity that one item brings to a list of recommendations. Moreover, the loss of accuracy, resulting from the increase in diversity, is controlled by changing the granularity of the underlying similarity metrics in the diversity-conscious algorithms [33].

On the other hand, except for some work that examined sales diversity across all users of the system by measuring a statistical dispersion of sales [10], [14], there have been few studies that explore aggregate diversity in recommender systems, despite the potential importance of diverse recommendations from both user and business perspectives, as discussed in Section 1. Several metrics can be used to measure aggregate diversity, including the percentage of items that the recommender system is able to make recommendations for (often known as coverage) [24]. Since we intend to measure the recommender systems performance based on the top-N recommended items lists that the system provides to its users, in this paper we use the total number of distinct items recommended across all users as an aggregate diversity measure, which we will refer to as diversity-in-top-N and formally define as follows:

$$\text{diversity-in-top-}N = \Bigl\lvert \bigcup_{u \in U} L_N(u) \Bigr\rvert.$$

Note that the diversity-in-top-N metric can also serve as an indicator of the level of personalization provided by a recommender system. For example, a very low diversity-in-top-N indicates that all users are being recommended the same top-N items (low level of personalization), whereas a very high diversity-in-top-N points to the fact that every user receives her own unique top-N items (high level of personalization).

In summary, the goal of the proposed ranking approaches is to improve the diversity of recommendations; however, as described in Section 1, there is a potential tradeoff between recommendation accuracy and diversity. Thus, in this paper, we aim to find techniques that can improve aggregate diversity of recommendations while maintaining adequate accuracy.

3 Motivations for Recommendation Re-Ranking

In this section, we discuss how re-ranking of the candidate items whose predictions are above T_H can affect the accuracy-diversity tradeoff and how various item ranking factors, such as popularity-based approach, can improve the diversity of recommendations. Note that the general idea of personalized information ordering is not new; e.g., its importance has been discussed in information retrieval literature [35], [45], including some attempts to reduce redundancy and promote the diversity of retrieved results by re-ranking them [12], [38], [53].
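The two evaluation metrics that the rest of the paper relies on, precision-in-top-N (Section 2.2) and diversity-in-top-N (Section 2.3), can be sketched as follows. The data layout, function names, and toy values are assumptions for illustration; in particular, the sketch assumes that recommended items have known (e.g., held-out) ratings and counts a recommended item without one as not correct.

```python
def precision_in_top_n(top_n, actual, t_h=3.5):
    """Share of recommended items whose known rating is at least t_h.

    top_n:  user -> list of recommended items, i.e., L_N(u)
    actual: (user, item) -> known rating R(u, i)
    """
    recommended = 0
    correct = 0
    for user, items in top_n.items():
        for item in items:
            recommended += 1
            rating = actual.get((user, item))
            if rating is not None and rating >= t_h:
                correct += 1
    return correct / recommended if recommended else 0.0

def diversity_in_top_n(top_n):
    """Number of distinct items recommended across all users."""
    return len(set().union(*top_n.values()))

# Hypothetical toy data: three users, top-2 lists
top_n = {"u1": ["a", "b"], "u2": ["a", "c"], "u3": ["a", "b"]}
actual = {("u1", "a"): 5, ("u1", "b"): 3, ("u2", "a"): 4,
          ("u2", "c"): 4, ("u3", "a"): 2, ("u3", "b"): 5}
precision = precision_in_top_n(top_n, actual)
diversity = diversity_in_top_n(top_n)
```

In this toy case four of the six recommendations are truly highly ranked, and the six recommendations cover only three distinct items, which is the kind of concentration the aggregate measure is meant to expose.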

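As a small preview of the re-ranking idea, the sketch below reproduces in miniature the experiment behind Table 1: for each user, candidates predicted at or above T_H are ordered either by predicted rating, by highest popularity, or by lowest popularity, and the top item is recommended. The function name, the `criterion` labels, and all numbers are hypothetical and chosen only to illustrate the effect.

```python
def top_n_by(predicted, popularity, n, t_h=3.5, criterion="standard"):
    """Return the top-n candidate items of one user under a ranking criterion.

    predicted:  item -> predicted rating R*(u, i) for this user
    popularity: item -> number of known ratings of the item
    """
    candidates = [i for i, r in predicted.items() if r >= t_h]
    if criterion == "standard":      # highest predicted rating first
        key = lambda i: -predicted[i]
    elif criterion == "popular":     # most-rated item first
        key = lambda i: -popularity[i]
    elif criterion == "long_tail":   # least-rated item first
        key = lambda i: popularity[i]
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return sorted(candidates, key=key)[:n]

# Hypothetical toy data
popularity = {"hit": 900, "mid": 120, "nicheA": 8, "nicheB": 5}
predicted = {
    "u1": {"hit": 4.6, "mid": 4.1, "nicheA": 3.9, "nicheB": 3.2},
    "u2": {"hit": 4.8, "mid": 3.4, "nicheA": 3.0, "nicheB": 3.7},
    "u3": {"hit": 4.2, "mid": 4.4, "nicheA": 2.9, "nicheB": 3.1},
}

def distinct_top_1(criterion):
    """Count the distinct items recommended as top-1 across all users."""
    picks = {top_n_by(p, popularity, 1, criterion=criterion)[0]
             for p in predicted.values()}
    return len(picks)

by_standard = distinct_top_1("standard")
by_popular = distinct_top_1("popular")
by_long_tail = distinct_top_1("long_tail")
```

With these toy numbers, recommending the most popular candidate gives every user the same item, while recommending the least popular candidate gives each user a different one.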
3.1 Standard Ranking Approach

Typical recommender systems predict unknown ratings based on known ratings, using any traditional recommendation technique such as neighborhood-based or matrix factorization CF techniques, discussed in Section 2.1. Then, the predicted ratings are used to support the user’s decision-making. In particular, each user u gets recommended a list of top-N items, L_N(u), selected according to some ranking criterion. More formally, item i_x is ranked ahead of item i_y (i.e., i_x ≺ i_y) if rank(i_x) < rank(i_y), where rank: I → ℝ is a function representing the ranking criterion. The vast majority of current recommender systems use the predicted rating value as the ranking criterion:

$$rank_{Standard}(i) = R^*(u, i)^{-1}.$$

The power of -1 in the above expression indicates that the items with highest-predicted (as opposed to lowest-predicted) ratings R*(u, i) are the ones being recommended to the user. In the paper we refer to this as the standard ranking approach, and it shares the motivation with the widely used probability ranking principle in information retrieval literature that ranks the documents in order of decreasing probability of relevance [37].

Note that, by definition, recommending the most highly predicted items selected by the standard ranking approach is designed to help improve recommendation accuracy, but not recommendation diversity. Therefore, new ranking criteria are needed in order to achieve diversity improvement. Since recommending best-selling items to each user typically leads to diversity reduction, recommending less popular items intuitively should have an effect towards increasing recommendation diversity. And, as seen from the example in Table 1 (in Section 1), this intuition has empirical support. Following this motivation, we explore the possibility to use item popularity as a recommendation ranking criterion, and in the next subsection we show how this approach can affect the recommendation quality in terms of accuracy and diversity.

3.2 Proposed Approach: Item Popularity-Based Ranking

Item popularity-based ranking approach ranks items directly based on their popularity, from lowest to highest, where popularity is represented by the number of known ratings that each item has. More formally, item popularity-based ranking function can be written as follows:

$$rank_{ItemPop}(i) = \lvert U(i) \rvert, \quad \text{where } U(i) = \{u \in U \mid \exists R(u, i)\}.$$

We compared the performance of the item popularity-based ranking approach with the standard ranking approach using the MovieLens dataset and item-based CF, and we present this comparison using the accuracy-diversity plot in Fig. 1. In particular, the results show that, as compared to the standard ranking approach, the item popularity-based ranking approach increased recommendation diversity from 385 to 1395 (i.e., 3.6 times!); however, recommendation accuracy dropped from 89% to 69%. Here, despite the significant diversity gain, such a significant accuracy loss (20%) would not be acceptable in most real-life personalization applications. Therefore, next we introduce a general technique to parameterize recommendation ranking approaches, which makes it possible to achieve significant diversity gains while controlling accuracy losses (e.g., according to how much loss is tolerable in a given application).

[Fig. 1. Performance of the standard ranking approach and item popularity-based approach with its parameterized versions. MovieLens data, item-based CF (50 neighbors), top-5 item recommendation.]

3.3 Controlling Accuracy-Diversity Trade-Off: Parameterized Ranking Approaches

The item popularity-based ranking approach as well as all other ranking approaches proposed in this paper (to be discussed in Section 4) are parameterized with “ranking threshold” T_R ∈ [T_H, T_max] (where T_max is the largest possible rating on the rating scale, e.g., T_max = 5) to give the user the ability to choose a certain level of recommendation accuracy. In particular, given any ranking function rank_X(i), ranking threshold T_R is used for creating the parameterized version of this ranking function, rank_X(i, T_R), which is formally defined as:

$$rank_X(i, T_R) = \begin{cases} rank_X(i), & \text{if } R^*(u, i) \in [T_R, T_{max}] \\ \alpha_u + rank_{Standard}(i), & \text{if } R^*(u, i) \in [T_H, T_R) \end{cases}$$

$$\text{where } I_u^*(T_R) = \{i \in I \mid R^*(u, i) \ge T_R\}, \quad \alpha_u = \max_{i \in I_u^*(T_R)} rank_X(i).$$

Simply put, items that are predicted above ranking threshold T_R are ranked according to rank_X(i), while items that are below T_R are ranked according to the standard ranking approach rank_Standard(i). In addition, all items that are above T_R get ranked ahead of all items that are below T_R (as ensured by α_u in the above formal definition). Thus, increasing the ranking threshold T_R ∈ [T_H, T_max] towards T_max would enable choosing the most highly predicted items resulting in more accuracy and less diversity (becoming increasingly similar to the standard ranking approach); in contrast, decreasing the ranking threshold T_R ∈ [T_H, T_max] towards T_H would make rank_X(i, T_R) increasingly more similar to the pure ranking function rank_X(i), resulting in more diversity with some accuracy loss.

Therefore, choosing different T_R values in-between the extremes allows the user to set the desired balance between accuracy and diversity. In particular, as Fig. 1 shows, the recommendation accuracy of item popularity-based ranking approach could be improved by increasing the ranking threshold. For example, the item popularity-based ranking approach with ranking threshold 4.4 could minimize the accuracy loss to 1.32%, but still could obtain 83% diversity gain (from 385 to 703), compared to the standard ranking approach. An even higher threshold 4.7 still makes it possible to achieve 20% diversity gain (from 385 to 462) with only 0.06% of accuracy loss.

Also note that, even when there are fewer than N items above the ranking threshold T_R, by definition, all the items



above TR are recommended to a user, and the remaining top-N items are selected according to the standard ranking approach. This ensures that all the ranking approaches proposed in this paper provide exactly the same number of recommendations as their corresponding baseline techniques (the ones using the standard ranking approach), which is also very important from the experimental analysis point of view in order to have a fair performance comparison of different ranking techniques.

3.4 General Steps for Recommendation Re-ranking

The item popularity-based ranking approach described above is just one example of possible ranking approaches for improving recommendation diversity, and a number of additional ranking functions, rankX(i), will be introduced in Section 4. Here, based on the previous discussion in Section 3, we summarize the general ideas behind the proposed ranking approaches, as illustrated by Fig. 2.

The first step, shown in Fig. 2a, represents the standard approach, which, for each user, ranks all the predicted items according to the predicted rating value and selects top-N candidate items, as long as they are above the highly-predicted rating threshold TH. The recommendation quality of the overall recommendation technique is measured in terms of the precision-in-top-N and the diversity-in-top-N, as shown in the accuracy-diversity plot at the right side of the example (a).

The second step, illustrated in Fig. 2b, shows the recommendations provided by applying one of the proposed ranking functions, rankX(i), where several different items (that are not necessarily among the N most highly predicted, but are still above TH) are recommended to the user. This way, a user can be recommended more idiosyncratic, long-tail, less frequently recommended items that may not be as widely popular, but can still be very relevant to this user (as indicated by a relatively high predicted rating). Therefore, re-ranking the candidate items can significantly improve the recommendation diversity although, as discussed, this typically comes at some loss of recommendation accuracy. The performance graph of the second step (b) demonstrates this accuracy-diversity tradeoff.

The third step, shown in Fig. 2c, can significantly reduce accuracy loss by confining the re-ranked recommendations to the items above the newly introduced ranking threshold TR (e.g., 3.8 out of 5). In this particular illustration, note that the increased ranking threshold causes the fifth recommended item in step (b) (i.e., the item with predicted rating value of 3.65) to be filtered out, and the next possible item above the new ranking threshold (i.e., the item predicted as 3.81) is recommended to user u. Averaged across all users, this parameterization helps to keep the accuracy loss fairly small while still achieving a significant diversity gain (as compared to the standard ranking approach), as shown in the performance graph of step (c).

We now introduce several additional item ranking functions, and provide empirical evidence that supports our motivation of using these item criteria for diversity improvement.

4 ADDITIONAL RANKING APPROACHES

In many personalization applications (e.g., movie or book recommendations), there often exist more highly-predicted ratings for a given user than can be put in her top-N list. This provides opportunities to have a number of alternative ranking approaches, where different sets of items can possibly be recommended to the user. In this section, we introduce six additional ranking approaches that can be used as alternatives to rankStandard to improve recommendation diversity, and Fig. 3 provides some empirical evidence that supports the use of these item ranking criteria. Because of space limitations, in Fig. 3 we present the empirical results for the MovieLens dataset; however, consistently similar patterns were found in other datasets (discussed in Section 5.1) as well.

In particular, in our empirical analysis we consistently observed that popular items, on average, are likely to have higher predicted ratings than less popular items, using both heuristic- and model-based techniques for rating prediction, as shown in Fig. 3a. As discussed in Section 3, recommending less popular

(a) Recommending top-N highly predicted items for user u, according to standard ranking approach
(b) Recommending top-N items, according to some other ranking approach for better diversity
(c) Confining re-ranked recommendations to the items above new ranking threshold TR (e.g., 3.8) for better accuracy

Fig. 2. General overview of ranking-based approaches for improving recommendation diversity
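The three steps summarized in Fig. 2 can be illustrated with a minimal Python sketch for a single user. This is only an illustration under stated assumptions, not the authors' implementation: the function name `rerank_top_n`, the item labels, and all rating and popularity values are hypothetical. Items predicted at or above TR are ordered by the chosen criterion (here, item popularity), and items in [TH, TR) follow in the standard order, which plays the role of the α_u offset in the parameterized ranking function.

```python
from typing import Callable, Dict, List


def rerank_top_n(
    predicted: Dict[str, float],
    rank_x: Callable[[str], float],
    n: int,
    t_h: float,
    t_r: float,
) -> List[str]:
    """Top-N list for one user under the parameterized ranking rank_X(i, T_R)."""
    t_r = max(t_r, t_h)  # assumption: the ranking threshold never drops below T_H
    # Items at or above T_R are ordered by rank_X (lower value = recommended first);
    # ties are broken in favor of the higher predicted rating.
    above = sorted(
        (i for i, r in predicted.items() if r >= t_r),
        key=lambda i: (rank_x(i), -predicted[i]),
    )
    # Items in [T_H, T_R) keep the standard order: highest predicted rating first.
    below = sorted(
        (i for i, r in predicted.items() if t_h <= r < t_r),
        key=lambda i: -predicted[i],
    )
    # Listing `above` before `below` has the same effect as the alpha_u offset.
    return (above + below)[:n]


# Hypothetical predicted ratings R*(u, i) for one user and rating counts |U(i)|.
predicted = {"A": 4.9, "B": 4.7, "C": 4.5, "D": 4.2,
             "E": 3.9, "F": 3.81, "G": 3.65, "H": 3.2}
popularity = {"A": 900, "B": 750, "C": 600, "D": 80,
              "E": 300, "F": 40, "G": 10, "H": 5}

# Step (a): standard ranking, highest predicted rating first.
standard = rerank_top_n(predicted, lambda i: -predicted[i], 3, 3.5, 3.5)
# Step (b): pure popularity-based ranking over all items above T_H.
pure_pop = rerank_top_n(predicted, lambda i: popularity[i], 3, 3.5, 3.5)
# Step (c): popularity-based ranking confined to items at or above T_R = 3.8.
confined = rerank_top_n(predicted, lambda i: popularity[i], 3, 3.5, 3.8)
```

In this toy example, raising the ranking threshold from 3.5 to 3.8 drops the item predicted at 3.65 from the re-ranked list, mirroring the effect described for step (c). Raising it further leaves fewer than N items above the threshold, and the remaining slots are then filled in the standard order.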



items helps to improve recommendation diversity; therefore, as can be immediately suggested from the monotonic relationship between average item popularity and predicted rating value, recommending not as highly predicted items (but still predicted to be above TH) likely implies recommending, on average, less popular items, potentially leading to diversity improvements. Therefore, we propose to use predicted rating value itself as an item ranking criterion:

Reverse Predicted Rating Value, i.e., ranking the candidate (highly predicted) items based on their predicted rating value, from lowest to highest (as a result choosing less popular items, according to Fig. 3a). More formally:

$$\mathrm{rank}_{RevPred}(i) = R^*(u,i).$$

We now propose several other ranking criteria that exhibit consistent relationships to predicted rating value, including average rating, absolute likeability, relative likeability, item rating variance, and neighbors' rating variance, as shown in Figures 3b-3f. In particular, the relationship between predicted rating values and the average actual rating of each item (as explicitly rated by users), shown in Fig. 3b, also supports a similar conjecture that items with lower average rating, on average, are more likely to have lower predicted rating values (likely representing less popular items, as shown earlier). Thus, such items could be recommended for better diversity.

Item Average Rating, i.e., ranking items according to an average of all known ratings for each item:

$$\mathrm{rank}_{AvgRating}(i) = \bar{R}(i), \quad \text{where } \bar{R}(i) = \frac{1}{|U(i)|} \sum_{u \in U(i)} R(u,i).$$

Similarly, the relationship between predicted rating values and item absolute (or relative) likeability, shown in Fig. 3c and 3d, also suggests that the items with lower likeability, on average, are more likely to have lower predicted rating values (likely representing less popular movies) and, thus, could be recommended for better diversity.

Item Absolute Likeability, i.e., ranking items according to how many users liked them (i.e., rated the item above TH):

$$\mathrm{rank}_{AbsLike}(i) = |U_H(i)|, \quad \text{where } U_H(i) = \{\, u \in U(i) \mid R(u,i) \ge T_H \,\}.$$

Item Relative Likeability, i.e., ranking items according to the percentage of the users who liked an item (among all users who rated it):

$$\mathrm{rank}_{RelLike}(i) = |U_H(i)| \,/\, |U(i)|.$$

We can also use two different types of rating variances to improve recommendation diversity. With any traditional recommendation technique, each item's rating variance (which can be computed from known ratings submitted for that item) can be used for re-ranking candidate items. Also, if any neighborhood-based recommendation technique is used for prediction, we can use the rating variance of the neighbors whose ratings are used to predict the rating for re-ranking candidate items. As shown in Fig. 3e and 3f, the relationship between the predicted rating value and each item's rating variance, and the relationship between the predicted rating value and 50 neighbors' rating variance obtained by using a neighborhood-based CF technique, demonstrate that highly predicted items tend to be low in both item rating variance and neighbors' rating variance. In other words, among the highly-predicted ratings (i.e., above TH) there is more user consensus for higher-predicted items than for lower-predicted ones. These findings indicate that re-ranking the recommendation list by rating variance and choosing the items with higher variance could improve recommendation diversity.

Item Rating Variance, i.e., ranking items according to each item's rating variance (i.e., rating variance of users who rated the item):

(a) Average Predicted Rating Value (b) Item Average Rating (c) Average Item Absolute Likeability

(d) Average Item Relative Likeability (e) Average Item Rating Variance (f) Average Neighbors’ Rating Variance

Fig. 3. Relationships between various item-ranking criteria and predicted rating value, for highly-predicted ratings (MovieLens data)
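As a rough illustration of the item-level criteria plotted in panels (b) through (e) of Fig. 3, the following Python sketch computes them from a list of known ratings. It is a sketch under assumptions rather than the authors' code: the function name `item_criteria`, the `(user, item, rating)` tuple layout, and the sample ratings are hypothetical, "liked" is taken to mean a rating of at least TH, and the variance divides by the number of raters |U(i)|. Neighbors' rating variance is omitted because it additionally requires each user's neighborhood.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def item_criteria(
    ratings: Iterable[Tuple[str, str, float]], t_h: float = 3.5
) -> Dict[str, Dict[str, float]]:
    """Per-item ranking criteria computed from known (user, item, rating) triples."""
    by_item = defaultdict(list)
    for _user, item, rating in ratings:
        by_item[item].append(rating)

    stats = {}
    for item, values in by_item.items():
        count = len(values)                          # |U(i)|: item popularity
        mean = sum(values) / count                   # item average rating
        liked = sum(1 for r in values if r >= t_h)   # |U_H(i)|: absolute likeability
        stats[item] = {
            "popularity": count,
            "avg_rating": mean,
            "abs_like": liked,
            "rel_like": liked / count,               # share of raters who liked the item
            "variance": sum((r - mean) ** 2 for r in values) / count,
        }
    return stats


# Hypothetical known ratings on a 1-5 scale.
sample = [
    ("u1", "x", 5.0), ("u2", "x", 4.0), ("u3", "x", 3.0),
    ("u1", "y", 2.0), ("u2", "y", 4.0),
]
stats = item_criteria(sample)
```

Any of the resulting values can serve as the ranking criterion for a candidate list, for example by sorting a user's highly predicted items by `popularity` in ascending order.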



$$\mathrm{rank}_{ItemVar}(i) = \frac{1}{|U(i)|} \sum_{u \in U(i)} \left( R(u,i) - \bar{R}(i) \right)^2.$$

Neighbors' Rating Variance, i.e., ranking items according to the rating variance of neighbors of a particular user for a particular item. The closest neighbors of user u among the users who rated the particular item i, denoted by u', are chosen from the set U(i) ∩ N(u).

$$\mathrm{rank}_{NeighborVar}(i) = \frac{1}{|U(i) \cap N(u)|} \sum_{u' \in U(i) \cap N(u)} \left( R(u',i) - \bar{R}_u(i) \right)^2,$$

$$\text{where } \bar{R}_u(i) = \frac{1}{|U(i) \cap N(u)|} \sum_{u' \in U(i) \cap N(u)} R(u',i).$$

In summary, there exist a number of different ranking approaches that can improve recommendation diversity by recommending items other than the ones with topmost predicted rating values to a user. In addition, as indicated in Fig. 1, the degree of improvement (and, more importantly, the degree of tolerable accuracy loss) can be controlled by the chosen ranking threshold value TR. The next section presents comprehensive empirical results demonstrating the effectiveness and robustness of the proposed ranking techniques.

5 EMPIRICAL RESULTS

5.1 Data

The proposed recommendation ranking approaches were tested with several movie rating datasets, including MovieLens (data file available at grouplens.org), Netflix (data file available at netflixprize.com), and Yahoo! Movies (individual ratings collected from movie pages at movies.yahoo.com). We pre-processed each dataset to include users and movies with significant rating history, which makes it possible to have a sufficient number of highly-predicted items for recommendations to each user (in the test data). The basic statistical information of the resulting datasets is summarized in Table 2. For each dataset, we randomly chose 60% of the ratings as training data and used them to predict the remaining 40% (i.e., test data).

TABLE 2. BASIC INFORMATION OF MOVIE RATING DATASETS

                                           MovieLens    Netflix      Yahoo! Movies
Number of users                            2,830        3,333        1,349
Number of movies                           1,919        2,092        721
Number of ratings                          775,176      1,067,999    53,622
Data sparsity                              14.27%       15.32%       5.51%
Avg # of common movies between two users   64.6         57.3         4.1
Avg # of common users between two movies   85.1         99.5         6.5
Avg # of users per movie                   404.0        510.5        74.4
Avg # of movies per user                   274.1        320.4        39.8

5.2 Performance of Proposed Ranking Approaches

We conducted experiments on the three datasets described in Section 5.1, using three widely popular recommendation techniques for rating prediction, including two heuristic-based (user-based and item-based CF) and one model-based (matrix factorization CF) techniques, discussed in Section 2.1. All seven proposed ranking approaches were used in conjunction with each of the three rating prediction techniques to generate top-N (N=1, 5, 10) recommendations to each user on each dataset, with the exception of neighbors' variance-based ranking of model-based predicted ratings. In particular, because there is no concept of neighbors in a pure matrix factorization technique, the ranking approach based on neighbors' rating variance was applied only with heuristic-based techniques. We set the predicted rating threshold as TH = 3.5 (out of 5) to ensure that only relevant items are recommended to users, and ranking threshold TR was varied from 3.5 to 4.9. The performance of each ranking approach was measured in terms of precision-in-top-N and diversity-in-top-N (N=1, 5, 10), and, for comparison purposes, its diversity gain and precision loss with respect to the standard ranking approach were calculated.

Consistent with the accuracy-diversity tradeoff discussed in the introduction, all the proposed ranking approaches improved the diversity of recommendations by sacrificing the accuracy of recommendations. However, with each ranking approach, as ranking threshold TR increases, the accuracy loss is significantly reduced (smaller precision loss) while still exhibiting substantial diversity improvement. Therefore, with different ranking thresholds, one can obtain different diversity gains for different levels of tolerable precision loss, as compared to the standard ranking approach. Following this idea, in our experiments we compare the effectiveness (i.e., diversity gain) of different recommendation ranking techniques for a variety of different precision loss levels (0.1-10%).

While, as mentioned earlier, a comprehensive set of experiments was performed using every rating prediction technique in conjunction with every recommendation ranking function on every dataset for different numbers of top-N recommendations, the results were very consistent across all experiments and, therefore, for illustration purposes and because of space limitations, we show only three results: each using all possible ranking techniques on a different dataset, a different recommendation technique, and a different number of recommendations. (See Table 3.)

For example, Table 3a shows the performance of the proposed ranking approaches used in conjunction with the item-based CF technique to provide top-5 recommendations on the MovieLens dataset. In particular, one can observe that, with the precision loss of only 0.001 or 0.1% (i.e., with precision of 0.891, down from 0.892 of the standard ranking approach), the item average rating-based ranking approach can already increase recommendation diversity by 20% (i.e., absolute diversity gain of 78 on top of the 385 achieved by the standard ranking approach). If users can tolerate precision loss up to 1% (i.e., precision of 0.882 or 88.2%), the diversity could be increased by 81% with the same ranking technique; and 5% precision loss (i.e., 84.2%) can provide diversity gains up to 189% for this recommendation technique on this dataset. Substantial diversity improvements can be observed across different ranking techniques, different rating prediction techniques, and different datasets, as shown in Tables 3a, 3b, and 3c.

In general, all proposed ranking approaches were able to provide significant diversity gains, and the best-performing ranking approach may be different depending on the chosen dataset and rating prediction technique. Thus, system designers have the flexibility to choose the most desirable ranking approach based on the data in a given application. We would also like to point out that, since the proposed approaches essentially are implemented as sorting algorithms based on certain ranking heuristics, they are extremely scalable. For exam-



TABLE 3. DIVERSITY GAINS OF PROPOSED RANKING APPROACHES FOR DIFFERENT LEVELS OF PRECISION LOSS

                  Item            Reverse         Item Average    Item Abs        Item Relative   Item Rating     Neighbors'
                  Popularity      Prediction      Rating          Likeability     Likeability     Variance        Rating Variance
Precision Loss    Diversity Gain  Diversity Gain  Diversity Gain  Diversity Gain  Diversity Gain  Diversity Gain  Diversity Gain
-0.1              +800  3.078     +848  3.203     +975  3.532     +897  3.330     +937  3.434     +386  2.003     +702  2.823
-0.05             +594  2.543     +594  2.543     +728  2.891     +642  2.668     +699  2.816     +283  1.735     +451  2.171
-0.025            +411  2.068     +411  2.068     +513  2.332     +445  2.156     +484  2.257     +205  1.532     +258  1.670
-0.01             +270  1.701     +234  1.608     +311  1.808     +282  1.732     +278  1.722     +126  1.327     +133  1.345
-0.005            +189  1.491     +173  1.449     +223  1.579     +196  1.509     +199  1.517     +91   1.236     +87   1.226
-0.001            +93   1.242     +44   1.114     +78   1.203     +104  1.270     +96   1.249     +21   1.055     +20   1.052
Standard: 0.892   385   1.000     385   1.000     385   1.000     385   1.000     385   1.000     385   1.000     385   1.000
(a) MovieLens dataset, top-5 items, heuristic-based technique (item-based CF, 50 neighbors)

                  Item            Reverse         Item Average    Item Abs        Item Relative   Item Rating
                  Popularity      Prediction      Rating          Likeability     Likeability     Variance
Precision Loss    Diversity Gain  Diversity Gain  Diversity Gain  Diversity Gain  Diversity Gain  Diversity Gain
-0.1              +314  1.356     +962  2.091     +880  1.998     +732  1.830     +860  1.975     +115  1.130
-0.05             +301  1.341     +757  1.858     +718  1.814     +614  1.696     +695  1.788     +137  1.155
-0.025            +238  1.270     +568  1.644     +535  1.607     +464  1.526     +542  1.615     +110  1.125
-0.01             +156  1.177     +363  1.412     +382  1.433     +300  1.340     +385  1.437     +63   1.071
-0.005            +128  1.145     +264  1.299     +282  1.320     +247  1.280     +288  1.327     +47   1.053
-0.001            +64   1.073     +177  1.201     +118  1.134     +89   1.101     +148  1.168     +8    1.009
Standard: 0.834   882   1.000     882   1.000     882   1.000     882   1.000     882   1.000     882   1.000
(b) Netflix dataset, top-5 items, model-based technique (matrix factorization CF, K=64)

                  Item            Reverse         Item Average    Item Abs        Item Relative   Item Rating     Neighbors'
                  Popularity      Prediction      Rating          Likeability     Likeability     Variance        Rating Variance
Precision Loss    Diversity Gain  Diversity Gain  Diversity Gain  Diversity Gain  Diversity Gain  Diversity Gain  Diversity Gain
-0.1              +220  1.794     +178  1.643     +149  1.538     +246  1.888     +122  1.440     +86   1.310     +128  1.462
-0.05             +198  1.715     +165  1.596     +141  1.509     +226  1.816     +117  1.422     +72   1.260     +108  1.390
-0.025            +134  1.484     +134  1.484     +103  1.372     +152  1.549     +86   1.310     +70   1.253     +98   1.354
-0.01             +73   1.264     +92   1.332     +56   1.202     +77   1.278     +58   1.209     +56   1.202     +65   1.235
-0.005            +57   1.206     +86   1.310     +38   1.137     +63   1.227     +36   1.130     +28   1.101     +51   1.184
-0.001            +42   1.152     +71   1.256     +25   1.090     +43   1.155     +30   1.110     +19   1.069     +22   1.079
Standard: 0.911   277   1.000     277   1.000     277   1.000     277   1.000     277   1.000     277   1.000     277   1.000
(c) Yahoo dataset, top-1 item, heuristic-based technique (user-based CF, 15 neighbors)

Notation: Precision Loss = [Precision-in-top-N of proposed ranking approach] - [Precision-in-top-N of standard ranking approach]
Diversity Gain (column 1) = [Diversity-in-top-N of proposed ranking approach] - [Diversity-in-top-N of standard ranking approach]
Diversity Gain (column 2) = [Diversity-in-top-N of proposed ranking approach] / [Diversity-in-top-N of standard ranking approach]
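The quantities in the notation above can be sketched in a few lines of Python. This is a hedged illustration only: it assumes that diversity-in-top-N is the number of distinct items recommended across all users and that precision-in-top-N is the share of recommended items whose actual rating is at least TH, with every recommended item having a known actual rating. The function names, the dictionary layout, and the toy data are all hypothetical.

```python
from typing import Dict, List, Tuple

Recs = Dict[str, List[str]]               # user -> top-N recommended items
Actual = Dict[Tuple[str, str], float]     # (user, item) -> actual rating


def diversity_in_top_n(recs: Recs) -> int:
    """Number of distinct items recommended across all users (assumed definition)."""
    return len(set().union(*recs.values()))


def precision_in_top_n(recs: Recs, actual: Actual, t_h: float = 3.5) -> float:
    """Share of recommended (user, item) pairs whose actual rating is at least T_H."""
    pairs = [(user, item) for user, items in recs.items() for item in items]
    hits = sum(1 for pair in pairs if actual[pair] >= t_h)
    return hits / len(pairs)


def compare_to_standard(proposed: Recs, standard: Recs, actual: Actual,
                        t_h: float = 3.5) -> Tuple[float, int, float]:
    """Return (precision loss, diversity gain column 1, diversity gain column 2)."""
    loss = precision_in_top_n(proposed, actual, t_h) - precision_in_top_n(standard, actual, t_h)
    d_std = diversity_in_top_n(standard)
    d_new = diversity_in_top_n(proposed)
    return loss, d_new - d_std, d_new / d_std


# Hypothetical top-2 lists for three users and their actual test ratings.
standard = {"u1": ["A", "B"], "u2": ["A", "B"], "u3": ["A", "C"]}
proposed = {"u1": ["D", "B"], "u2": ["A", "E"], "u3": ["F", "C"]}
actual = {
    ("u1", "A"): 5.0, ("u1", "B"): 4.0, ("u1", "D"): 3.0,
    ("u2", "A"): 4.5, ("u2", "B"): 4.0, ("u2", "E"): 4.0,
    ("u3", "A"): 5.0, ("u3", "C"): 4.0, ("u3", "F"): 3.5,
}
loss, gain_abs, gain_ratio = compare_to_standard(proposed, standard, actual)
```

In this toy data the re-ranked lists cover six distinct items instead of three, at the cost of one recommendation that falls below TH.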

ple, it took, on average, less than 6 seconds to rank all the predicted items and select top-N recommendations for nearly 3,000 users in our experiments with MovieLens data.

5.3 Robustness Analysis for Different Parameters

In this subsection, we present robustness analysis of the proposed techniques with respect to several parameters: number of neighbors used in heuristic-based CF, number of features used in matrix factorization CF, number of top-N recommendations provided to each user, the value of predicted rating threshold TH, and the level of data sparsity.

We tested the heuristic-based technique with a different number of neighbors (15, 20, 30, and 50 neighbors) and the model-based technique with a different number of features (K=8, 16, 32, and 64). For illustration purposes, Fig. 4a and 4b show how two different ranking approaches for both heuristic-based and model-based rating prediction techniques are affected by different parameter values. While different parameter values may result in slightly different performance (as is well-known in recommender systems literature), the fundamental behavior of the proposed techniques remains robust and consistent, as shown in Fig. 4a and 4b. In other words, using the recommendation ranking techniques with any of the parameter values, it is possible to obtain substantial diversity improvements with only a small accuracy loss.

We also vary the number of top-N recommendations provided by the system. Note that, while it is intuitively clear that top-1, top-5, and top-10 recommendations will provide different accuracy and diversity levels (i.e., it is much easier to accurately recommend one relevant item than 10 relevant items, and it is much easier to have more aggregate diversity when you can provide more recommendations), again we observe that, with any number of top-N recommendations, the proposed techniques exhibit robust and consistent behavior, i.e., they make it possible to obtain substantial diversity gains at a small accuracy loss, as shown in Fig. 4c. For example, with only 1% precision loss, we were able to increase the diversity from 133 to 311 (134% gain) using the reverse predicted rating value-based ranking approach in the top-1 recommendation task, and from 385 to 655 (70% gain) using the item popularity-based ranking approach in the top-5 recommendation task.

In addition, our finding that the proposed ranking approaches help to improve recommendation diversity is also robust with respect to the "highly-predicted" rating threshold
