This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.


Improving Aggregate Recommendation Diversity Using Ranking-Based Techniques

Gediminas Adomavicius, Member, IEEE, and YoungOk Kwon

G. Adomavicius is with the Department of Information and Decision Sciences, Carlson School of Management, University of Minnesota, Minneapolis, MN 55455. E-mail: gedas@umn.edu.
Y. Kwon is with the Department of Information and Decision Sciences, Carlson School of Management, University of Minnesota, Minneapolis, MN 55455. E-mail: kwonx052@umn.edu.

Abstract—Recommender systems are becoming increasingly important to individual users and businesses for providing personalized recommendations. However, while the majority of algorithms proposed in the recommender systems literature have focused on improving recommendation accuracy (as exemplified by the recent Netflix Prize competition), other important aspects of recommendation quality, such as the diversity of recommendations, have often been overlooked. In this paper, we introduce and explore a number of item ranking techniques that can generate recommendations that have substantially higher aggregate diversity across all users while maintaining comparable levels of recommendation accuracy. Comprehensive empirical evaluation consistently shows the diversity gains of the proposed techniques using several real-world rating datasets and different rating prediction algorithms.

Index Terms—Recommender systems, recommendation diversity, ranking functions, performance evaluation metrics, collaborative filtering.


1 INTRODUCTION

In the current age of information overload, it is becoming increasingly hard to find relevant content. This problem is not only widespread but also alarming [28]. Over the last 10-15 years, recommender systems technologies have been introduced to help people deal with these vast amounts of information [1], [7], [9], [30], [36], [39], and they have been widely used in research as well as in e-commerce applications, such as the ones used by Amazon and Netflix.

The most common formulation of the recommendation problem relies on the notion of ratings, i.e., recommender systems estimate ratings of items (or products) that are yet to be consumed by users, based on the ratings of items already consumed. Recommender systems typically try to predict the ratings of unknown items for each user, often using other users' ratings, and recommend the top N items with the highest predicted ratings. Accordingly, there have been many studies on developing new algorithms that can improve the predictive accuracy of recommendations. However, the quality of recommendations can be evaluated along a number of dimensions, and relying on the accuracy of recommendations alone may not be enough to find the most relevant items for each user [24], [32]. In particular, the importance of diverse recommendations has been emphasized in several studies [8], [10], [14], [33], [46], [54], [57]. These studies argue that one of the goals of recommender systems is to provide a user with highly idiosyncratic or personalized items, and more diverse recommendations result in more opportunities for users to get recommended such items. With this motivation, some studies proposed new recommendation methods that can increase the diversity of the recommendation set for a given individual user, often measured by an average dissimilarity between all pairs of recommended items, while maintaining an acceptable level of accuracy [8], [33], [46], [54], [57]. These studies measure recommendation diversity from an individual user's perspective (i.e., individual diversity).

In contrast to individual diversity, which has been explored in a number of papers, some recent studies [10], [14] started examining the impact of recommender systems on sales diversity by considering the aggregate diversity of recommendations across all users. Note that high individual diversity of recommendations does not necessarily imply high aggregate diversity. For example, if the system recommends to all users the same five best-selling items that are not similar to each other, the recommendation list for each user is diverse (i.e., high individual diversity), but only five distinct items are recommended to all users and purchased by them (i.e., resulting in low aggregate diversity or high sales concentration).

While the benefits of recommender systems that provide higher aggregate diversity would be apparent to many users (because such systems focus on providing a wider range of items in their recommendations, and not mostly bestsellers, which users are often capable of discovering by themselves), such systems could be beneficial for some business models as well [10], [11], [14], [20]. For example, it would be profitable to Netflix if its recommender systems could encourage users to rent "long-tail" movies (i.e., more obscure items that are located in the tail of the sales distribution [2]) because they are less costly to license and acquire from distributors than new-release or highly popular movies of big studios [20]. However, the impact of recommender systems on aggregate diversity in real-world e-commerce applications has not been well understood. For example, one study [10], using data from an online clothing retailer, confirms the "long tail" phenomenon, i.e., an increase in the tail of the sales distribution (in other words, an increase in aggregate diversity) attributable to the usage of the recommender system. On the other hand, another study [14] shows a contradictory finding: recommender systems actually can reduce aggregate sales diversity. This can be explained by the fact that idiosyncratic items often have limited historical data and, thus, are more difficult to

recommend to users; in contrast, popular items typically have more ratings and, therefore, can be recommended to more users. For example, in the context of the Netflix Prize competition [6], [22], there is some evidence that, since recommender systems seek to find the common items (among thousands of possible movies) that two users have watched, these systems inherently tend to avoid extremes and make very relevant but safe recommendations to users [50].

As seen from this recent debate, there is a growing awareness of the importance of aggregate diversity in recommender systems. Furthermore, while, as mentioned earlier, a significant amount of work has been done on improving individual diversity, the issue of aggregate diversity in recommender systems has been largely untouched. Therefore, in this paper, we focus on developing algorithmic techniques for improving the aggregate diversity of recommendations (which we will simply refer to as diversity throughout the paper, unless explicitly specified otherwise), which can be intuitively measured by the number of distinct items recommended across all users.

Higher diversity (both individual and aggregate), however, can come at the expense of accuracy. As is well known, there is a tradeoff between accuracy and diversity, because high accuracy may often be obtained by safely recommending to users the most popular items, which can clearly lead to a reduction in diversity, i.e., less personalized recommendations [8], [33], [46]. Conversely, higher diversity can be achieved by trying to uncover and recommend highly idiosyncratic or personalized items for each user; such items often have less data, are inherently more difficult to predict, and, thus, may lead to a decrease in recommendation accuracy.

Table 1 illustrates an example of the accuracy-diversity tradeoff in two extreme cases where only popular items or only long-tail items are recommended to users, using the MovieLens rating dataset (the datasets used in this paper are discussed in Section 5.1). In this example, we used a popular recommendation technique, i.e., the neighborhood-based collaborative filtering (CF) technique [9], to predict unknown ratings. Then, as candidate recommendations for each user, we considered only the items that were predicted above a pre-defined rating threshold to assure an acceptable level of accuracy, as is typically done in recommender systems. Among these candidate items for each user, we identified the item that was rated by the most users (i.e., the item with the largest number of known ratings) as a popular item, and the item that was rated by the least number of users (i.e., the item with the smallest number of known ratings) as a long-tail item. As illustrated by Table 1, if the system recommends to each user the most popular item (among the ones that had a sufficiently high predicted rating), it is much more likely for many users to get the same recommendation (e.g., the best-selling item). The accuracy measured by the precision-in-top-1 metric (i.e., the percentage of truly "high" ratings among those that were predicted to be "high" by the recommender system) is 82%, but only 49 popular items out of approximately 2000 available distinct items are recommended across all users. The system can improve the diversity of recommendations from 49 up to 695 (a 14-fold increase) by recommending the long-tail item to each user (i.e., the least popular item among each user's highly predicted items) instead of the popular item. However, high diversity in this case is obtained at a significant expense of accuracy, i.e., a drop from 82% to 68%.

TABLE 1. ACCURACY-DIVERSITY TRADEOFF: EMPIRICAL EXAMPLE

  Top-1 recommendation of:                                            Accuracy   Diversity
  Popular item (item with the largest number of known ratings)        82%        49 distinct items
  "Long-tail" item (item with the smallest number of known ratings)   68%        695 distinct items

  Note. Recommendations (top-1 item for each user) are generated for 2828 users among the items that are predicted above the acceptable threshold 3.5 (out of 5), using a standard item-based collaborative filtering technique with 50 neighbors on the MovieLens dataset.

The above example shows that it is possible to obtain higher diversity simply by recommending less popular items; however, the loss of recommendation accuracy in this case can be substantial. In this paper, we explore new recommendation approaches that can increase the diversity of recommendations with only a minimal (negligible) accuracy loss, using different recommendation ranking techniques. In particular, traditional recommender systems typically rank the relevant items in descending order of their predicted ratings for each user and then recommend the top N items, resulting in high accuracy. In contrast, the proposed approaches consider additional factors, such as item popularity, when ranking the recommendation list, in order to substantially increase recommendation diversity while maintaining comparable levels of accuracy. This paper provides a comprehensive empirical evaluation of the proposed approaches, where they are tested with various datasets in a variety of different settings. For example, the best results show up to a 20-25% diversity gain with only a 0.1% accuracy loss, up to a 60-80% gain with a 1% accuracy loss, and even substantially higher diversity improvements (e.g., up to 250%) if some users are willing to tolerate higher accuracy loss.

In addition to providing significant diversity gains, the proposed ranking techniques have several other advantageous characteristics. In particular, these techniques are extremely efficient, because they are based on scalable sorting-based heuristics that make decisions based only on "local" data (i.e., only on the candidate items of each individual user) without having to keep track of "global" information, such as which items have been recommended across all users and how many times. The techniques are also parameterizable, since the user has control over the acceptable level of accuracy for which diversity will be maximized. Also, the proposed ranking techniques provide a flexible solution to improving recommendation diversity because: they are applied after the unknown item ratings have been estimated and, thus, can achieve diversity gains in conjunction with a number of different rating prediction techniques, as illustrated in the paper; as mentioned above, the vast majority of current recommender systems already employ some ranking approach, thus the proposed techniques would not introduce new types of procedures into recommender systems (they would replace existing ranking procedures); and the proposed ranking approaches do not require any additional information about users (e.g., demographics) or items (e.g., content features) aside from the ratings data, which makes them applicable in a wide variety of recommendation contexts.

The remainder of the paper is organized as follows. Section 2 reviews relevant literature on traditional recommendation algorithms and the evaluation of recommendation quality. Section 3 describes our motivations for alternative recommendation ranking techniques, such as ranking by item popularity. We then propose several additional ranking techniques in Section 4, and

the main empirical results follow in Section 5. Additional experiments are conducted to further explore the proposed ranking techniques in Section 6. Lastly, Section 7 concludes the paper by summarizing the contributions and future directions.

2 RELATED WORK

2.1 Recommendation Techniques for Rating Prediction

Recommender systems are usually classified into three categories based on their approach to recommendation: content-based, collaborative, and hybrid approaches [1], [3]. Content-based recommender systems recommend items similar to the ones the user preferred in the past. Collaborative filtering (CF) recommender systems recommend items that users with similar preferences (i.e., "neighbors") have liked in the past. Finally, hybrid approaches can combine content-based and collaborative methods in several different ways. Recommender systems can also be classified based on the nature of their algorithmic technique into heuristic (or memory-based) and model-based approaches [1], [9]. Heuristic techniques typically calculate recommendations directly from the previous user activities (e.g., transactional data or rating values). One of the commonly used heuristic techniques is the neighborhood-based approach, which finds nearest neighbors that have tastes similar to those of the target user [9], [13], [34], [36], [40]. In contrast, model-based techniques use previous user activities to first learn a predictive model, typically using some statistical or machine-learning methods, which is then used to make recommendations. Examples of such techniques include Bayesian clustering, the aspect model, the flexible mixture model, matrix factorization, and other methods [4], [5], [9], [25], [44], [48].

In real-world settings, recommender systems generally perform the following two tasks in order to provide recommendations to each user. First, the ratings of unrated items are estimated based on the available information (typically using known user ratings and possibly also information about item content or user demographics) using some recommendation algorithm. And second, the system finds items that maximize the user's utility based on the predicted ratings, and recommends them to the user. The ranking approaches proposed in this paper are designed to improve recommendation diversity in the second task of finding the best items for each user.

Because of the decomposition of the rating estimation and recommendation ranking tasks, our proposed ranking approaches provide a flexible solution, as mentioned earlier: they do not introduce any new procedures into the recommendation process and can be used in conjunction with any available rating estimation algorithm. In our experiments, to illustrate the broad applicability of the proposed recommendation ranking approaches, we used them in conjunction with the most popular and widely employed CF techniques for rating prediction: a heuristic neighborhood-based technique and a model-based matrix factorization technique.

Before we provide an overview of each technique, we introduce some notation and terminology related to the recommendation problem. Let U be the set of users of a recommender system, and let I be the set of all possible items that can be recommended to users. Then, the utility function that represents the preference of item i ∈ I by user u ∈ U is often defined as R: U × I → Rating, where Rating typically represents some numeric scale used by the users to evaluate each item. Also, in order to distinguish between the actual ratings and the predictions of the recommender system, we use the R(u, i) notation to represent a known rating (i.e., the actual rating that user u gave to item i), and the R*(u, i) notation to represent an unknown rating (i.e., the system-predicted rating for item i that user u has not rated before).

Neighborhood-based CF technique

There exist multiple variations of neighborhood-based CF techniques [9], [36], [40]. In this paper, to estimate R*(u, i), i.e., the rating that user u would give to item i, we first compute the similarity between user u and other users u' using a cosine similarity metric [9], [40]:

    sim(u, u') = \frac{\sum_{i \in I(u,u')} R(u, i) \, R(u', i)}{\sqrt{\sum_{i \in I(u,u')} R(u, i)^2} \, \sqrt{\sum_{i \in I(u,u')} R(u', i)^2}},    (1)

where I(u, u') represents the set of all items rated by both user u and user u'. Based on the similarity calculation, the set N(u) of nearest neighbors of user u is obtained. The size of set N(u) can range anywhere from 1 to |U| - 1, i.e., all other users in the dataset. Then, R*(u, i) is calculated as the adjusted weighted sum of all known ratings R(u', i), where u' ∈ N(u) [13], [34]:

    R^*(u, i) = \bar{R}(u) + \frac{\sum_{u' \in N(u)} sim(u, u') \, (R(u', i) - \bar{R}(u'))}{\sum_{u' \in N(u)} |sim(u, u')|}.    (2)

Here \bar{R}(u) represents the average rating of user u.

A neighborhood-based CF technique can be user-based or item-based, depending on whether the similarity is calculated between users or items. Formulae (1) and (2) represent the user-based approach, but they can be straightforwardly rewritten for the item-based approach because of the symmetry between users and items in all neighborhood-based CF calculations [40]. In our experiments we used both user-based and item-based approaches for rating estimation.
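To make (1) and (2) concrete, here is a minimal Python sketch of the user-based variant. This is our illustration rather than the paper's implementation: ratings are assumed to be stored as a dict of dicts (ratings[u][i] = R(u, i)), and N(u) is restricted to the k most similar users who have rated the target item, a common practical choice.

    import math

    def cosine_sim(ratings, u, v):
        """Cosine similarity between users u and v over co-rated items, as in (1)."""
        common = set(ratings[u]) & set(ratings[v])   # I(u, v)
        if not common:
            return 0.0
        num = sum(ratings[u][i] * ratings[v][i] for i in common)
        den = (math.sqrt(sum(ratings[u][i] ** 2 for i in common))
               * math.sqrt(sum(ratings[v][i] ** 2 for i in common)))
        return num / den if den else 0.0

    def predict_rating(ratings, u, i, k=50):
        """Adjusted weighted-sum prediction R*(u, i) over k nearest neighbors, as in (2)."""
        mean_u = sum(ratings[u].values()) / len(ratings[u])   # average rating of u
        # N(u): the k most similar users who have rated item i (an assumption of this sketch)
        neighbors = sorted((v for v in ratings if v != u and i in ratings[v]),
                           key=lambda v: cosine_sim(ratings, u, v), reverse=True)[:k]
        num = den = 0.0
        for v in neighbors:
            s = cosine_sim(ratings, u, v)
            mean_v = sum(ratings[v].values()) / len(ratings[v])
            num += s * (ratings[v][i] - mean_v)
            den += abs(s)
        return mean_u + num / den if den else mean_u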
Matrix factorization CF technique

Matrix factorization techniques have been the mainstay of numerical linear algebra dating back to the 1970s [16], [21], [27] and have recently gained popularity in recommender systems applications because of their effectiveness in improving recommendation accuracy [41], [47], [52], [55]. Many variations of matrix factorization techniques have been developed to solve the problems of data sparsity, overfitting, and convergence speed, and they turned out to be a crucial component of many well-performing algorithms in the popular Netflix Prize competition [4], [5], [6], [15], [22], [29], [30].¹ We implemented the basic version of this technique, as presented in [15]. With the assumption that a user's rating for an item is composed of a sum of preferences about the various features of that item, this model is induced by a Singular Value Decomposition (SVD) of the user-item ratings matrix. In particular, using K features (i.e., rank-K SVD), user u is associated with a user-factors vector p_u (the user's preferences for the K features), and item i is associated with an item-factors vector q_i (the item's importance weights for the K features). The preference of how much user u likes item i, denoted by R*(u, i), is predicted by taking the inner product of the two vectors, i.e.,

    R^*(u, i) = p_u^T q_i.    (3)

¹ More information can be found at www.netflixprize.com.

All values in the user- and item-factor vectors are initially assigned arbitrary numbers and are estimated with a simple gradient descent technique, as described in (4). The user- and item-factor vectors are iteratively updated with a learning rate parameter (λ) as well as a regularization parameter (γ), which is used to minimize overfitting, until the minimum improvement in predictive accuracy or a pre-defined number of iterations per feature is reached. One learning iteration is defined as:

    For each rating R(u, i):
        err = R(u, i) - p_u^T q_i
        p_u ← p_u + λ (err · q_i - γ · p_u)    (4)
        q_i ← q_i + λ (err · p_u - γ · q_i)
    End For

Finally, unknown ratings are estimated with the final two vectors p_u and q_i, as stated in (3). More details on the variations of matrix factorization techniques used in recommender systems can be found in [4], [5], [30], [52], [55].
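The learning loop in (4) translates almost line by line into code. The following NumPy sketch is our illustration; the function names and the default parameter values (K, lr, reg, n_iter) are arbitrary placeholders, not the paper's experimental settings.

    import numpy as np

    def train_mf(ratings, n_users, n_items, K=50, lr=0.005, reg=0.02, n_iter=100):
        """Learn rank-K user/item factors from (u, i, r) rating triples via (4)."""
        rng = np.random.default_rng(0)
        P = rng.normal(scale=0.1, size=(n_users, K))   # user-factor vectors p_u
        Q = rng.normal(scale=0.1, size=(n_items, K))   # item-factor vectors q_i
        for _ in range(n_iter):
            for u, i, r in ratings:
                err = r - P[u] @ Q[i]                  # prediction error
                pu = P[u].copy()                       # keep old p_u for the q_i update
                P[u] += lr * (err * Q[i] - reg * P[u])
                Q[i] += lr * (err * pu - reg * Q[i])
        return P, Q

    def predict_mf(P, Q, u, i):
        """R*(u, i) = p_u^T q_i, as in (3)."""
        return float(P[u] @ Q[i])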
                                                                                            Moreover, the loss of accuracy, resulting from the increase in
2.2 Accuracy of Recommendations

Numerous recommendation techniques have been developed over the last few years, and various metrics have been employed for measuring the accuracy of recommendations, including statistical accuracy metrics and decision-support measures [24]. As examples of statistical accuracy metrics, the mean absolute error (MAE) and root mean squared error (RMSE) metrics measure how well a system can predict an exact rating value for a specific item. Examples of decision-support metrics include precision (the percentage of truly "high" ratings among those that were predicted to be "high" by the recommender system), recall (the percentage of correctly predicted "high" ratings among all the ratings known to be "high"), and the F-measure, which is the harmonic mean of precision and recall. In particular, the ratings in the datasets that we used in our experiments are integers between 1 and 5, inclusive, where a higher value represents a better-liked item. As is commonly done in the recommender systems literature, we define the ratings greater than 3.5 (the threshold for "high" ratings, denoted by TH) as "highly-ranked" and the ratings less than 3.5 as "non-highly-ranked." Furthermore, in real-world settings, recommender systems typically recommend the N most highly-ranked items, since users are usually interested in only the several most relevant recommendations, and this list of N items for user u can be defined as LN(u) = {i1, ..., iN}, where R*(u, ik) ≥ TH for all k ∈ {1, 2, ..., N}. Therefore, in our paper, we evaluate recommendation accuracy based on the percentage of truly "highly-ranked" ratings, denoted by correct(LN(u)), among those that were predicted to be the N most relevant "highly-ranked" items for each user, i.e., using the popular precision-in-top-N metric [24]. The metric can be written formally as:

    \text{precision-in-top-N} = \frac{\sum_{u \in U} |\text{correct}(L_N(u))|}{\sum_{u \in U} |L_N(u)|},

where correct(LN(u)) = {i ∈ LN(u) | R(u, i) ≥ TH}. However, relying on the accuracy of recommendations alone may not be enough to find the most relevant items for a user. It has often been suggested that recommender systems must be not only accurate, but also useful [24], [32]. For example, [32] suggests new user-centric directions for evaluating recommender systems beyond the conventional accuracy metrics. They claim that serendipity in recommendations, as well as user experiences and expectations, should also be considered in evaluating recommendation quality. Among the many different aspects that cannot be measured by accuracy metrics alone, in this paper we focus on the notion of the diversity of recommendations, which is discussed next.

2.3 Diversity of Recommendations

As mentioned in Section 1, the diversity of recommendations can be measured in two ways: individual and aggregate.

Most recent studies have focused on increasing individual diversity, which can be calculated from each user's recommendation list (e.g., as an average dissimilarity between all pairs of items recommended to a given user) [8], [33], [46], [54], [57]. These techniques aim to avoid providing too-similar recommendations to the same user. For example, some studies [8], [46], [57] used an intra-list similarity metric to determine individual diversity. Alternatively, [54] used a new evaluation metric, item novelty, to measure the amount of additional diversity that one item brings to a list of recommendations. Moreover, the loss of accuracy resulting from the increase in diversity can be controlled by changing the granularity of the underlying similarity metrics in diversity-conscious algorithms [33].

On the other hand, except for some work that examined sales diversity across all users of the system by measuring a statistical dispersion of sales [10], [14], there have been few studies that explore aggregate diversity in recommender systems, despite the potential importance of diverse recommendations from both user and business perspectives, as discussed in Section 1. Several metrics can be used to measure aggregate diversity, including the percentage of items that the recommender system is able to make recommendations for (often known as coverage) [24]. Since we intend to measure the recommender system's performance based on the top-N recommended item lists that the system provides to its users, in this paper we use the total number of distinct items recommended across all users as the aggregate diversity measure, which we will refer to as diversity-in-top-N and formally define as follows:

    \text{diversity-in-top-N} = \Bigl| \bigcup_{u \in U} L_N(u) \Bigr|.

Note that the diversity-in-top-N metric can also serve as an indicator of the level of personalization provided by a recommender system. For example, a very low diversity-in-top-N indicates that all users are being recommended the same top-N items (a low level of personalization), whereas a very high diversity-in-top-N points to the fact that every user receives her own unique top-N items (a high level of personalization).
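Both metrics are straightforward to compute from the per-user top-N lists. A minimal sketch, ours rather than the paper's implementation: it assumes top_n maps each user u to the list LN(u), known maps (u, i) pairs to actual ratings R(u, i), and an item counts as correct only if its known rating is at least the threshold.

    def precision_in_top_n(top_n, known, th=3.5):
        """Share of truly highly-ranked items among all recommended items."""
        hits = total = 0
        for u, items in top_n.items():
            total += len(items)                          # |L_N(u)|
            hits += sum(1 for i in items
                        if known.get((u, i), 0) >= th)   # |correct(L_N(u))|
        return hits / total

    def diversity_in_top_n(top_n):
        """Number of distinct items recommended across all users."""
        return len(set().union(*top_n.values()))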
In summary, the goal of the proposed ranking approaches is to improve the diversity of recommendations; however, as described in Section 1, there is a potential tradeoff between recommendation accuracy and diversity. Thus, in this paper, we aim to find techniques that can improve the aggregate diversity of recommendations while maintaining adequate accuracy.

3 MOTIVATIONS FOR RECOMMENDATION RE-RANKING

In this section, we discuss how re-ranking the candidate items whose predictions are above TH can affect the accuracy-diversity tradeoff, and how various item ranking factors, such as item popularity, can improve the diversity of recommendations. Note that the general idea of personalized information ordering is not new; e.g., its importance has been discussed in the information retrieval literature [35], [45], including some attempts to reduce redundancy and promote the diversity of retrieved results by re-ranking them [12], [38], [53].


3.1 Standard Ranking Approach

Typical recommender systems predict unknown ratings based on known ratings, using any traditional recommendation technique, such as the neighborhood-based or matrix factorization CF techniques discussed in Section 2.1. Then, the predicted ratings are used to support the user's decision-making. In particular, each user u gets recommended a list of top-N items, LN(u), selected according to some ranking criterion. More formally, item ix is ranked ahead of item iy (i.e., ix ≺ iy) if rank(ix) < rank(iy), where rank: I → ℝ is a function representing the ranking criterion. The vast majority of current recommender systems use the predicted rating value as the ranking criterion:

    rank_{Standard}(i) = R^*(u, i)^{-1}.

The power of -1 in the above expression indicates that the items with the highest-predicted (as opposed to lowest-predicted) ratings R*(u, i) are the ones being recommended to the user. In this paper we refer to this as the standard ranking approach; it shares its motivation with the widely used probability ranking principle in the information retrieval literature, which ranks documents in order of decreasing probability of relevance [37].

Note that, by definition, recommending the most highly predicted items selected by the standard ranking approach is designed to help improve recommendation accuracy, but not recommendation diversity. Therefore, new ranking criteria are needed in order to achieve diversity improvements. Since recommending best-selling items to each user typically leads to diversity reduction, recommending less popular items intuitively should have an effect towards increasing recommendation diversity. And, as seen from the example in Table 1 (in Section 1), this intuition has empirical support. Following this motivation, we explore the possibility of using item popularity as a recommendation ranking criterion, and in the next subsection we show how this approach can affect recommendation quality in terms of accuracy and diversity.

3.2 Proposed Approach: Item Popularity-Based Ranking

The item popularity-based ranking approach ranks items directly based on their popularity, from lowest to highest, where popularity is represented by the number of known ratings that each item has. More formally, the item popularity-based ranking function can be written as follows:

    rank_{ItemPop}(i) = |U(i)|, where U(i) = \{u \in U \mid R(u, i) \text{ is known}\}.
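In code, both criteria reduce to sort keys over each user's candidate items (those with R*(u, i) ≥ TH). A minimal sketch, ours rather than the paper's implementation, with hypothetical dict-based inputs (predicted holds R*(u, i) for one user's candidates, num_ratings holds |U(i)|):

    def top_n_standard(candidates, predicted, n=5):
        """Standard ranking: order candidates by decreasing predicted rating R*(u, i)."""
        return sorted(candidates, key=lambda i: -predicted[i])[:n]

    def top_n_item_pop(candidates, num_ratings, n=5):
        """Item popularity-based ranking: order by |U(i)|, least-rated items first."""
        return sorted(candidates, key=lambda i: num_ratings[i])[:n]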
     We compared the performance of the item popularity-                               sulting in more diversity with some accuracy loss.
based ranking approach with the standard ranking approach                                   Therefore, choosing different TR values in-between the ex-
using MovieLens dataset and item-based CF, and we present                              tremes allows the user to set the desired balance between accu-
this comparison using the accuracy-diversity plot in Fig.1. In                         racy and diversity. In particular, as Fig. 1 shows, the recom-
particular, the results show that, as compared to the standard                         mendation accuracy of item popularity-based ranking ap-
ranking approach, the item popularity-based ranking approach                           proach could be improved by increasing the ranking threshold.
increased recommendation diversity from 385 to 1395 (i.e., 3.6                         For example, the item popularity-based ranking approach with
times!); however, recommendation accuracy dropped from                                 ranking threshold 4.4 could minimize the accuracy loss to
89% to 69%. Here, despite the significant diversity gain, such a                       1.32%, but still could obtain 83% diversity gain (from 385 to
significant accuracy loss (20%) would not be acceptable in most                        703), compared to the standard ranking approach. An even
real-life personalization applications. Therefore, next we in-                         higher threshold 4.7 still makes it possible to achieve 20% di-
troduce a general technique to parameterize recommendation                             versity gain (from 385 to 462) with only 0.06% of accuracy loss.
ranking approaches, which allows to achieve significant diver-                              Also note that, even when there are less than N items
sity gains while controlling accuracy losses (e.g., according to                       above the ranking threshold TR, by definition, all the items
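As an illustration only, the parameterized re-ranking described above can be sketched in a few lines of Python. This is not the authors' implementation; the names (predicted, rank_x, t_r) and the use of the reciprocal predicted rating as a positive-valued realization of rankStandard are our own assumptions, and candidate items are assumed to be predicted above TH (so predicted ratings are strictly positive).

```python
# A minimal sketch of the parameterized ranking function rank_X(i, T_R)
# defined above. `predicted` maps each candidate item of one user u to
# its predicted rating R*(u, i); `rank_x` maps each item to its value
# under the chosen ranking criterion (smaller value = ranked earlier).

def parameterized_order(candidates, predicted, rank_x, t_r):
    above = [i for i in candidates if predicted[i] >= t_r]
    # alpha_u: the largest rank_X value among items above T_R; adding it
    # as an offset guarantees that every below-threshold item is ranked
    # after all above-threshold items.
    alpha_u = max((rank_x[i] for i in above), default=0.0)

    def key(i):
        if predicted[i] >= t_r:
            return rank_x[i]
        # One simple positive-valued realization of rank_Standard:
        # 1 / R*(u, i), so higher-predicted items get smaller rank values.
        return alpha_u + 1.0 / predicted[i]

    return sorted(candidates, key=key)
```

For example, setting rank_x to each item's popularity count reproduces the item popularity-based ranking of Section 3.2, and raising t_r towards the maximum rating recovers the standard ranking.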
Also note that, even when there are fewer than N items above the ranking threshold TR, by definition all the items above TR are recommended to a user, and the remaining top-N items are selected according to the standard ranking approach. This ensures that all the ranking approaches proposed in this paper provide exactly the same number of recommendations as their corresponding baseline techniques (the ones using the standard ranking approach), which is also very important from the experimental analysis point of view, in order to have a fair performance comparison of different ranking techniques.
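As a concrete illustration of this padding rule, the following sketch (again with assumed names and data structures, not the authors' code) selects the final top-N list:

```python
def top_n(candidates, predicted, rank_x, t_h, t_r, n):
    # Only "highly predicted" items (R*(u, i) >= T_H) may be recommended.
    pool = [i for i in candidates if predicted[i] >= t_h]
    # Items above T_R are re-ranked by the diversity-oriented criterion...
    above = sorted((i for i in pool if predicted[i] >= t_r),
                   key=lambda i: rank_x[i])
    # ...and, if fewer than n such items exist, the list is padded with
    # the remaining items in standard (descending predicted rating) order.
    below = sorted((i for i in pool if predicted[i] < t_r),
                   key=lambda i: predicted[i], reverse=True)
    return (above + below)[:n]
```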
3.4 General Steps for Recommendation Re-ranking
The item popularity-based ranking approach described above is just one example of possible ranking approaches for improving recommendation diversity, and a number of additional ranking functions, rankX(i), will be introduced in Section 4. Here, based on the previous discussion in Section 3, we summarize the general ideas behind the proposed ranking approaches, as illustrated by Fig. 2.

The first step, shown in Fig. 2a, represents the standard approach, which, for each user, ranks all the predicted items according to the predicted rating value and selects the top-N candidate items, as long as they are above the highly-predicted rating threshold TH. The recommendation quality of the overall recommendation technique is measured in terms of the precision-in-top-N and the diversity-in-top-N, as shown in the accuracy-diversity plot at the right side of example (a).

The second step, illustrated in Fig. 2b, shows the recommendations provided by applying one of the proposed ranking functions, rankX(i), where several different items (that are not necessarily among the N most highly predicted, but are still above TH) are recommended to the user. This way, a user can get recommended more idiosyncratic, long-tail, less frequently recommended items that may not be as widely popular, but can still be very relevant to this user (as indicated by a relatively high predicted rating). Therefore, re-ranking the candidate items can significantly improve the recommendation diversity, although, as discussed, this typically comes at some loss of recommendation accuracy. The performance graph of the second step (b) demonstrates this accuracy-diversity tradeoff.

The third step, shown in Fig. 2c, can significantly reduce the accuracy loss by confining the re-ranked recommendations to the items above the newly introduced ranking threshold TR (e.g., 3.8 out of 5). In this particular illustration, note that the increased ranking threshold makes the fifth recommended item in step (b) (i.e., the item with a predicted rating value of 3.65) filtered out, and the next possible item above the new ranking threshold (i.e., the item predicted as 3.81) is recommended to user u instead. Averaged across all users, this parameterization helps to keep the level of accuracy loss fairly small while still achieving a significant diversity gain (as compared to the standard ranking approach), as shown in the performance graph of step (c).

    (a) Recommending top-N highly predicted items for user u, according to the standard ranking approach
    (b) Recommending top-N items, according to some other ranking approach for better diversity
    (c) Confining re-ranked recommendations to the items above the new ranking threshold TR (e.g., 3.8) for better accuracy

    Fig. 2. General overview of ranking-based approaches for improving recommendation diversity

We now introduce several additional item ranking functions and provide empirical evidence that supports our motivation for using these item criteria for diversity improvement.

4 ADDITIONAL RANKING APPROACHES
In many personalization applications (e.g., movie or book recommendations), there often exist more highly-predicted ratings for a given user than can be put in her top-N list. This provides opportunities to have a number of alternative ranking approaches, where different sets of items can possibly be recommended to the user. In this section, we introduce six additional ranking approaches that can be used as alternatives to rankStandard to improve recommendation diversity, and Fig. 3 provides some empirical evidence that supports the use of these item ranking criteria. Because of space limitations, in Fig. 3 we present the empirical results for the MovieLens dataset; however, consistently similar patterns were found in the other datasets (discussed in Section 5.1) as well.

In particular, in our empirical analysis we consistently observed that popular items, on average, are likely to have higher predicted ratings than less popular items, using both heuristic- and model-based techniques for rating prediction, as shown in Fig. 3a. As discussed in Section 3, recommending less popular
items helps to improve recommendation diversity; therefore, as can be immediately suggested from the monotonic relationship between average item popularity and predicted rating value, recommending not-as-highly-predicted items (but still predicted to be above TH) likely implies recommending, on average, less popular items, potentially leading to diversity improvements. Therefore, we propose to use the predicted rating value itself as an item ranking criterion:

• Reverse Predicted Rating Value, i.e., ranking the candidate (highly predicted) items based on their predicted rating value, from lowest to highest (as a result choosing less popular items, according to Fig. 3a). More formally:

    rankRevPred(i) = R*(u, i).

We now propose several other ranking criteria that exhibit consistent relationships to the predicted rating value, including average rating, absolute likeability, relative likeability, item rating variance, and neighbors' rating variance, as shown in Figs. 3b-3f. In particular, the relationship between predicted rating values and the average actual rating of each item (as explicitly rated by users), shown in Fig. 3b, supports a similar conjecture: items with a lower average rating, on average, are more likely to have lower predicted rating values (likely representing less popular items, as shown earlier). Thus, such items could be recommended for better diversity.

• Item Average Rating, i.e., ranking items according to an average of all known ratings for each item:

    rankAvgRating(i) = R̄(i), where R̄(i) = (1/|U(i)|) · Σ_{u ∈ U(i)} R(u, i).

Similarly, the relationship between predicted rating values and item absolute (or relative) likeability, shown in Figs. 3c and 3d, suggests that items with lower likeability, on average, are more likely to have lower predicted rating values (likely representing less popular movies) and, thus, could be recommended for better diversity.

• Item Absolute Likeability, i.e., ranking items according to how many users liked them (i.e., rated the item above TH):

    rankAbsLike(i) = |UH(i)|, where UH(i) = {u ∈ U(i) | R(u, i) ≥ TH}.

• Item Relative Likeability, i.e., ranking items according to the percentage of the users who liked an item (among all users who rated it):

    rankRelLike(i) = |UH(i)| / |U(i)|.

We can also use two different types of rating variance to improve recommendation diversity. With any traditional recommendation technique, each item's rating variance (which can be computed from the known ratings submitted for that item) can be used for re-ranking candidate items. Also, if any neighborhood-based recommendation technique is used for prediction, we can use the rating variance of the neighbors whose ratings are used to predict the rating when re-ranking candidate items. As shown in Figs. 3e and 3f, the relationship between the predicted rating value and each item's rating variance, and the relationship between the predicted rating value and 50 neighbors' rating variance (obtained by using a neighborhood-based CF technique), demonstrate that highly predicted items tend to be low in both item rating variance and neighbors' rating variance. In other words, among the highly-predicted ratings (i.e., above TH) there is more user consensus for higher-predicted items than for lower-predicted ones. These findings indicate that re-ranking the recommendation list by rating variance and choosing the items with higher variance could improve recommendation diversity.

    (a) Average Predicted Rating Value  (b) Item Average Rating  (c) Average Item Absolute Likeability
    (d) Average Item Relative Likeability  (e) Average Item Rating Variance  (f) Average Neighbors' Rating Variance

    Fig. 3. Relationships between various item-ranking criteria and predicted rating value, for highly-predicted ratings (MovieLens data)

• Item Rating Variance, i.e., ranking items according to each item's rating variance (i.e., the rating variance of the users who rated the item):
    rankItemVar(i) = (1/|U(i)|) · Σ_{u ∈ U(i)} (R(u, i) − R̄(i))².

• Neighbors' Rating Variance, i.e., ranking items according to the rating variance of the neighbors of a particular user for a particular item. The closest neighbors of user u among the users who rated the particular item i, denoted by u′, are chosen from the set U(i) ∩ N(u):

    rankNeighborVar(i) = (1/|U(i) ∩ N(u)|) · Σ_{u′ ∈ U(i) ∩ N(u)} (R(u′, i) − R̄u(i))²,

    where R̄u(i) = (1/|U(i) ∩ N(u)|) · Σ_{u′ ∈ U(i) ∩ N(u)} R(u′, i).
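For concreteness, the item-side criteria above can be computed from known (user, item, rating) triples as sketched below; the data layout and function names are our assumptions, not the authors' code. The reverse predicted rating criterion needs no precomputation (it is the predicted rating itself), and neighbors' rating variance is omitted here because it additionally depends on each target user's neighborhood N(u).

```python
from collections import defaultdict
from statistics import mean, pvariance

def item_ranking_criteria(ratings, t_h=3.5):
    """ratings: iterable of (user, item, rating) triples of known ratings."""
    by_item = defaultdict(list)
    for _, item, r in ratings:
        by_item[item].append(r)
    criteria = {}
    for item, rs in by_item.items():
        liked = sum(r >= t_h for r in rs)
        criteria[item] = {
            "item_pop": len(rs),            # rank_ItemPop(i) = |U(i)|
            "avg_rating": mean(rs),         # rank_AvgRating(i)
            "abs_like": liked,              # rank_AbsLike(i) = |U_H(i)|
            "rel_like": liked / len(rs),    # rank_RelLike(i)
            "item_var": pvariance(rs),      # rank_ItemVar(i)
        }
    return criteria
```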
In summary, there exist a number of different ranking approaches that can improve recommendation diversity by recommending to a user items other than the ones with the topmost predicted rating values. In addition, as indicated in Fig. 1, the degree of improvement (and, more importantly, the degree of tolerable accuracy loss) can be controlled by the chosen ranking threshold value TR. The next section presents comprehensive empirical results demonstrating the effectiveness and robustness of the proposed ranking techniques.

5 EMPIRICAL RESULTS

5.1 Data
The proposed recommendation ranking approaches were tested with several movie rating datasets, including MovieLens (data file available at grouplens.org), Netflix (data file available at netflixprize.com), and Yahoo! Movies (individual ratings collected from movie pages at movies.yahoo.com). We preprocessed each dataset to include users and movies with a significant rating history, which makes it possible to have a sufficient number of highly-predicted items for recommendations to each user (in the test data). The basic statistical information of the resulting datasets is summarized in Table 2. For each dataset, we randomly chose 60% of the ratings as training data and used them to predict the remaining 40% (i.e., test data).

TABLE 2. BASIC INFORMATION OF MOVIE RATING DATASETS

                                             MovieLens     Netflix   Yahoo! Movies
  Number of users                                2,830       3,333           1,349
  Number of movies                               1,919       2,092             721
  Number of ratings                            775,176   1,067,999          53,622
  Data sparsity                                 14.27%      15.32%           5.51%
  Avg # of common movies between two users        64.6        57.3             4.1
  Avg # of common users between two movies        85.1        99.5             6.5
  Avg # of users per movie                       404.0       510.5            74.4
  Avg # of movies per user                       274.1       320.4            39.8

5.2 Performance of Proposed Ranking Approaches
We conducted experiments on the three datasets described in Section 5.1, using three widely popular recommendation techniques for rating prediction: two heuristic-based (user-based and item-based CF) and one model-based (matrix factorization CF), discussed in Section 2.1. All seven proposed ranking approaches were used in conjunction with each of the three rating prediction techniques to generate top-N (N = 1, 5, 10) recommendations to each user on each dataset, with the exception of the neighbors' variance-based ranking of model-based predicted ratings. In particular, because there is no concept of neighbors in a pure matrix factorization technique, the ranking approach based on neighbors' rating variance was applied only with the heuristic-based techniques. We set the predicted rating threshold to TH = 3.5 (out of 5) to ensure that only relevant items are recommended to users, and the ranking threshold TR was varied from 3.5 to 4.9. The performance of each ranking approach was measured in terms of precision-in-top-N and diversity-in-top-N (N = 1, 5, 10), and, for comparison purposes, its diversity gain and precision loss with respect to the standard ranking approach were calculated.

Consistently with the accuracy-diversity tradeoff discussed in the introduction, all the proposed ranking approaches improved the diversity of recommendations by sacrificing some of their accuracy. However, with each ranking approach, as the ranking threshold TR increases, the accuracy loss is significantly reduced (smaller precision loss) while a substantial diversity improvement is still exhibited. Therefore, with different ranking thresholds, one can obtain different diversity gains for different levels of tolerable precision loss, as compared to the standard ranking approach. Following this idea, in our experiments we compare the effectiveness (i.e., diversity gain) of different recommendation ranking techniques for a variety of precision loss levels (0.1-10%).

While, as mentioned earlier, a comprehensive set of experiments was performed using every rating prediction technique in conjunction with every recommendation ranking function on every dataset for different numbers of top-N recommendations, the results were very consistent across all experiments. Therefore, for illustration purposes and because of space limitations, we show only three results, each using all possible ranking techniques on a different dataset, with a different recommendation technique, and with a different number of recommendations (see Table 3).

For example, Table 3a shows the performance of the proposed ranking approaches used in conjunction with the item-based CF technique to provide top-5 recommendations on the MovieLens dataset. In particular, one can observe that, with a precision loss of only 0.001 or 0.1% (i.e., with precision of 0.891, down from 0.892 of the standard ranking approach), the item average rating-based ranking approach can already increase recommendation diversity by 20% (i.e., an absolute diversity gain of 78 on top of the 385 achieved by the standard ranking approach). If users can tolerate a precision loss of up to 1% (i.e., precision of 0.882 or 88.2%), the diversity could be increased by 81% with the same ranking technique; and a 5% precision loss (i.e., 84.2%) can provide diversity gains of up to 189% for this recommendation technique on this dataset. Substantial diversity improvements can be observed across different ranking techniques, different rating prediction techniques, and different datasets, as shown in Tables 3a, 3b, and 3c.

In general, all proposed ranking approaches were able to provide significant diversity gains, and the best-performing ranking approach may differ depending on the chosen dataset and rating prediction technique. Thus, system designers have the flexibility to choose the most desirable ranking approach based on the data in a given application. We would also like to point out that, since the proposed approaches are essentially implemented as sorting algorithms based on certain ranking heuristics, they are extremely scalable.
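For reference, the two evaluation metrics behind Table 3 can be sketched as follows, under our reading of the definitions (diversity-in-top-N as the number of distinct items recommended across all users; precision-in-top-N as the share of recommended items, among those rated in the test data, whose actual rating is at least TH); the data structures are assumptions:

```python
def diversity_in_top_n(top_n_lists):
    # Number of distinct items recommended across all users;
    # top_n_lists maps each user to her top-N item list.
    return len({i for items in top_n_lists.values() for i in items})

def precision_in_top_n(top_n_lists, test_ratings, t_h=3.5):
    # Share of truly "high" actual ratings among recommended items
    # that appear in the test data (keyed by (user, item)).
    hits = judged = 0
    for user, items in top_n_lists.items():
        for i in items:
            if (user, i) in test_ratings:
                judged += 1
                hits += test_ratings[(user, i)] >= t_h
    return hits / judged if judged else 0.0
```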
                     TABLE 3. DIVERSITY GAINS OF PROPOSED RANKING APPROACHES FOR DIFFERENT LEVELS OF PRECISION LOSS
                           Item                  Reverse           Item Average          Item Abs            Item Relative         Item Rating          Neighbors’
                         Popularity             Prediction            Rating             Likeability           Likeability           Variance         Rating Variance
  Precision Loss       Diversity Gain         Diversity Gain       Diversity Gain      Diversity Gain        Diversity Gain       Diversity Gain       Diversity Gain
              -0.1    +800         3.078      +848      3.203     +975       3.532      +897     3.330       +937      3.434      +386       2.003     +702       2.823
            -0.05     +594         2.543      +594      2.543     +728       2.891      +642     2.668       +699      2.816      +283       1.735     +451       2.171
           -0.025     +411         2.068      +411      2.068     +513       2.332      +445     2.156       +484      2.257      +205       1.532     +258       1.670
            -0.01     +270         1.701      +234      1.608      +311      1.808      +282     1.732       +278      1.722      +126       1.327     +133       1.345
           -0.005     +189         1.491      +173      1.449     +223       1.579      +196     1.509       +199      1.517          +91    1.236      +87       1.226
           -0.001       +93        1.242       +44       1.114        +78    1.203      +104     1.270        +96      1.249          +21    1.055      +20       1.052
  Standard:0.892        385        1.000       385      1.000         385    1.000       385     1.000        385      1.000          385    1.000      385       1.000
                               (a) MovieLens dataset, top-5 items, heuristic-based technique (item-based CF, 50 neighbors)
                            Item                        Reverse              Item Average              Item Abs                Item Relative           Item Rating
                          Popularity                   Prediction                Rating                Likeability               Likeability             Variance
  Precision Loss        Diversity Gain               Diversity Gain          Diversity Gain          Diversity Gain            Diversity Gain         Diversity Gain
             -0.1         +314        1.356          +962        2.091       +880        1.998       +732        1.830         +860         1.975     +115        1.130
            -0.05         +301        1.341          +757        1.858       +718        1.814       +614        1.696         +695         1.788     +137        1.155
           -0.025         +238        1.270          +568        1.644       +535        1.607       +464        1.526         +542         1.615     +110        1.125
            -0.01         +156        1.177          +363        1.412       +382        1.433       +300        1.340         +385         1.437      +63        1.071
           -0.005         +128        1.145          +264        1.299       +282        1.320       +247        1.280         +288         1.327      +47        1.053
           -0.001            +64      1.073          +177        1.201       +118        1.134         +89       1.101         +148         1.168        +8       1.009
  Standard:0.834             882      1.000           882        1.000         882       1.000         882       1.000          882         1.000      882        1.000




                                    (b) Netflix dataset, top-5 items, model-based technique (matrix factorization CF, K=64)



                    Item             Reverse          Item Average     Item Abs         Item Relative    Item Rating      Neighbors'
                    Popularity       Prediction       Rating           Likeability      Likeability      Variance         Rating Variance
  Precision Loss    Diversity Gain   Diversity Gain   Diversity Gain   Diversity Gain   Diversity Gain   Diversity Gain   Diversity Gain
            -0.1    +220   1.794     +178   1.643     +149   1.538     +246   1.888     +122   1.440     +86    1.310     +128   1.462
           -0.05    +198   1.715     +165   1.596     +141   1.509     +226   1.816     +117   1.422     +72    1.260     +108   1.390
          -0.025    +134   1.484     +134   1.484     +103   1.372     +152   1.549     +86    1.310     +70    1.253     +98    1.354
           -0.01    +73    1.264     +92    1.332     +56    1.202     +77    1.278     +58    1.209     +56    1.202     +65    1.235
          -0.005    +57    1.206     +86    1.310     +38    1.137     +63    1.227     +36    1.130     +28    1.101     +51    1.184
          -0.001    +42    1.152     +71    1.256     +25    1.090     +43    1.155     +30    1.110     +19    1.069     +22    1.079
  Standard:0.911    277    1.000     277    1.000     277    1.000     277    1.000     277    1.000     277    1.000     277    1.000
                                     (c) Yahoo dataset, top-1 item, heuristic-based technique (user-based CF, 15 neighbors)




Notation: Precision Loss = [Precision-in-top-N of proposed ranking approach] – [Precision-in-top-N of standard ranking approach]
          Diversity Gain (column 1) = [Diversity-in-top-N of proposed ranking approach] – [Diversity-in-top-N of standard ranking approach]
          Diversity Gain (column 2) = [Diversity-in-top-N of proposed ranking approach] / [Diversity-in-top-N of standard ranking approach]
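A trivial helper, with hypothetical names of our own choosing, reproduces the table entries from this notation:

```python
def table3_entries(div_prop, div_std, prec_prop, prec_std):
    # Bookkeeping behind Table 3: precision loss plus the absolute and
    # relative diversity gains of a proposed approach vs. the standard one.
    return {
        "precision_loss": prec_prop - prec_std,
        "diversity_gain_abs": div_prop - div_std,     # first gain column
        "diversity_gain_ratio": div_prop / div_std,   # second gain column
    }
```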

For example, it took, on average, less than 6 seconds to rank all the predicted items and select top-N recommendations for nearly 3,000 users in our experiments with MovieLens data.

5.3 Robustness Analysis for Different Parameters
In this subsection, we present a robustness analysis of the proposed techniques with respect to several parameters: the number of neighbors used in heuristic-based CF, the number of features used in matrix factorization CF, the number of top-N recommendations provided to each user, the value of the predicted rating threshold TH, and the level of data sparsity.

We tested the heuristic-based technique with different numbers of neighbors (15, 20, 30, and 50) and the model-based technique with different numbers of features (K = 8, 16, 32, and 64). For illustration purposes, Figs. 4a and 4b show how two different ranking approaches, for both heuristic-based and model-based rating prediction techniques, are affected by different parameter values. While different parameter values may result in slightly different performance (as is well-known in the recommender systems literature), the fundamental behavior of the proposed techniques remains robust and consistent, as shown in Figs. 4a and 4b. In other words, using the recommendation ranking techniques with any of the parameter values, it is possible to obtain substantial diversity improvements with only a small accuracy loss.

We also vary the number of top-N recommendations provided by the system. Note that, while it is intuitively clear that top-1, top-5, and top-10 recommendations will provide different accuracy and diversity levels (i.e., it is much easier to accurately recommend one relevant item than 10 relevant items, and it is much easier to have more aggregate diversity when more recommendations can be provided), again we observe that, with any number of top-N recommendations, the proposed techniques exhibit robust and consistent behavior, i.e., they make it possible to obtain substantial diversity gains at a small accuracy loss, as shown in Fig. 4c. For example, with only 1% precision loss, we were able to increase the diversity from 133 to 311 (134% gain) using the reverse predicted rating value-based ranking approach in the top-1 recommendation task, and from 385 to 655 (70% gain) using the item popularity-based ranking approach in the top-5 recommendation task.

In addition, our finding that the proposed ranking approaches help to improve recommendation diversity is also robust with respect to the "highly-predicted" rating threshold
  • 2. For example, in the context of the Netflix Prize competition [6], [22], there is some evidence that, because recommender systems seek the common items (among thousands of possible movies) that two users have watched, they inherently tend to avoid extremes and to make very relevant but safe recommendations [50].

As seen from this recent debate, there is a growing awareness of the importance of aggregate diversity in recommender systems. Furthermore, while, as mentioned earlier, a significant amount of work has been done on improving individual diversity, the issue of aggregate diversity in recommender systems has remained largely untouched. Therefore, in this paper, we focus on developing algorithmic techniques for improving the aggregate diversity of recommendations (which we will simply refer to as diversity throughout the paper, unless explicitly specified otherwise), which can be intuitively measured by the number of distinct items recommended across all users.

Higher diversity (both individual and aggregate), however, can come at the expense of accuracy. As is well known, there is a tradeoff between accuracy and diversity, because high accuracy can often be obtained by safely recommending to users the most popular items, which clearly leads to a reduction in diversity, i.e., to less personalized recommendations [8], [33], [46]. Conversely, higher diversity can be achieved by trying to uncover and recommend highly idiosyncratic or personalized items for each user; such items often have less data and are inherently harder to predict, which may lead to a decrease in recommendation accuracy.

Table 1 illustrates this accuracy-diversity tradeoff in two extreme cases, where only popular items or only long-tail items are recommended to users, using the MovieLens rating dataset (the datasets used in this paper are discussed in Section 5.1). In this example, we used a popular recommendation technique, the neighborhood-based collaborative filtering (CF) technique [9], to predict unknown ratings. Then, as candidate recommendations for each user, we considered only the items whose predictions were above a pre-defined rating threshold, to assure an acceptable level of accuracy, as is typically done in recommender systems. Among these candidate items for each user, we identified the item rated by the most users (i.e., the item with the largest number of known ratings) as the popular item, and the item rated by the fewest users (i.e., the item with the smallest number of known ratings) as the long-tail item. As Table 1 illustrates, if the system recommends to each user the most popular item (among the ones with a sufficiently high predicted rating), it is much more likely that many users get the same recommendation (e.g., the best-selling item). The accuracy measured by the precision-in-top-1 metric (i.e., the percentage of truly "high" ratings among those predicted to be "high" by the recommender system) is 82%, but only 49 popular items out of approximately 2000 available distinct items are recommended across all users. The system can improve the diversity of recommendations from 49 up to 695 distinct items (a 14-fold increase) by recommending to each user the long-tail item (i.e., the least popular item among that user's highly-predicted items) instead of the popular item. However, the high diversity in this case is obtained at a significant expense of accuracy, i.e., a drop from 82% to 68% (a toy sketch of these two extremes is given at the end of this overview).

TABLE 1. ACCURACY-DIVERSITY TRADEOFF: EMPIRICAL EXAMPLE

  Top-1 recommendation of:                               Accuracy   Diversity
  Popular item (largest number of known ratings)              82%   49 distinct items
  "Long-tail" item (smallest number of known ratings)         68%   695 distinct items

  Note: Recommendations (the top-1 item for each user) are generated for 2828 users among the items predicted above the acceptable threshold of 3.5 (out of 5), using a standard item-based collaborative filtering technique with 50 neighbors on the MovieLens dataset.

The above example shows that it is possible to obtain higher diversity simply by recommending less popular items; however, the loss of recommendation accuracy can then be substantial. In this paper, we explore new recommendation approaches that can increase the diversity of recommendations with only a minimal (negligible) accuracy loss, using different recommendation ranking techniques. Traditional recommender systems typically rank the relevant items in descending order of their predicted ratings for each user and then recommend the top N items, resulting in high accuracy. In contrast, the proposed approaches consider additional factors, such as item popularity, when ranking the recommendation list, in order to substantially increase recommendation diversity while maintaining comparable levels of accuracy. This paper provides a comprehensive empirical evaluation of the proposed approaches, in which they are tested with various datasets in a variety of settings. For example, the best results show up to a 20-25% diversity gain with only a 0.1% accuracy loss, up to a 60-80% gain with a 1% accuracy loss, and even substantially higher diversity improvements (e.g., up to 250%) if some users are willing to tolerate a higher accuracy loss.

In addition to providing significant diversity gains, the proposed ranking techniques have several other advantageous characteristics. First, they are extremely efficient, because they are based on scalable sorting-based heuristics that make decisions using only "local" data (i.e., only the candidate items of each individual user), without having to keep track of "global" information, such as which items have been recommended across all users and how many times. Second, they are parameterizable: the user can choose the acceptable level of accuracy for which diversity will be maximized. Third, they provide a flexible solution to improving recommendation diversity, because they are applied after the unknown item ratings have been estimated and, thus, can achieve diversity gains in conjunction with a number of different rating prediction techniques, as illustrated in the paper. Moreover, since the vast majority of current recommender systems already employ some ranking approach, the proposed techniques would not introduce new types of procedures into recommender systems (they would replace existing ranking procedures). Finally, the proposed ranking approaches do not require any additional information about users (e.g., demographics) or items (e.g., content features) aside from the ratings data, which makes them applicable in a wide variety of recommendation contexts.
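To make the two extremes behind Table 1 concrete, the following toy Python sketch (all data invented; num_ratings and preds are hypothetical stand-ins for the known-rating counts and the CF-predicted ratings) picks, for each user, both the most popular and the least popular candidate among the items predicted above the threshold of 3.5:

T_H = 3.5

num_ratings = {"i1": 950, "i2": 120, "i3": 7}          # item popularity |U(i)|
preds = {                                              # R*(u, i) per user
    "u1": {"i1": 4.2, "i2": 4.6, "i3": 3.9},
    "u2": {"i1": 4.8, "i3": 4.0},
}

for u, p in preds.items():
    cand = [i for i, r in p.items() if r >= T_H]       # acceptable candidates
    popular = max(cand, key=num_ratings.get)           # high accuracy, low diversity
    long_tail = min(cand, key=num_ratings.get)         # high diversity, lower accuracy
    print(u, popular, long_tail)                       # u1 i1 i3 / u2 i1 i3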
The remainder of the paper is organized as follows. Section 2 reviews relevant literature on traditional recommendation algorithms and on the evaluation of recommendation quality. Section 3 describes our motivation for alternative recommendation ranking techniques, such as ranking by item popularity. We then propose several additional ranking techniques in Section 4, and the main empirical results follow in Section 5. Additional experiments that further explore the proposed ranking techniques are presented in Section 6. Finally, Section 7 concludes the paper by summarizing the contributions and future directions.
  • 3. 2 Related Work

2.1 Recommendation Techniques for Rating Prediction

Recommender systems are usually classified into three categories based on their approach to recommendation: content-based, collaborative, and hybrid approaches [1], [3]. Content-based recommender systems recommend items similar to the ones the user preferred in the past. Collaborative filtering (CF) recommender systems recommend items that users with similar preferences (i.e., "neighbors") have liked in the past. Finally, hybrid approaches can combine content-based and collaborative methods in several different ways. Recommender systems can also be classified, based on the nature of their algorithmic technique, into heuristic (or memory-based) and model-based approaches [1], [9]. Heuristic techniques typically calculate recommendations directly from previous user activities (e.g., transactional data or rating values). One commonly used heuristic technique is the neighborhood-based approach, which finds the nearest neighbors with tastes similar to those of the target user [9], [13], [34], [36], [40]. In contrast, model-based techniques use previous user activities to first learn a predictive model, typically by means of statistical or machine-learning methods, which is then used to make recommendations. Examples of such techniques include Bayesian clustering, the aspect model, the flexible mixture model, matrix factorization, and other methods [4], [5], [9], [25], [44], [48].

In real-world settings, recommender systems generally perform two tasks in order to provide recommendations to each user. First, the ratings of unrated items are estimated from the available information (typically using known user ratings and possibly also information about item content or user demographics) with some recommendation algorithm. Second, the system finds the items that maximize the user's utility based on the predicted ratings and recommends them to the user. The ranking approaches proposed in this paper are designed to improve recommendation diversity in the second task, i.e., in finding the best items for each user.

Because of this decomposition into rating estimation and recommendation ranking tasks, our proposed ranking approaches provide a flexible solution, as mentioned earlier: they do not introduce any new procedures into the recommendation process, and they can be used in conjunction with any available rating estimation algorithm. In our experiments, to illustrate the broad applicability of the proposed recommendation ranking approaches, we used them in conjunction with the most popular and widely employed CF techniques for rating prediction: a heuristic neighborhood-based technique and a model-based matrix factorization technique.

Before we provide an overview of each technique, we introduce some notation and terminology related to the recommendation problem. Let U be the set of users of a recommender system, and let I be the set of all items that can be recommended to users. The utility function that represents the preference of item i ∈ I by user u ∈ U is often defined as R: U × I → Rating, where Rating typically represents some numeric scale used by the users to evaluate each item. Also, in order to distinguish between the actual ratings and the predictions of the recommender system, we use R(u, i) to denote a known rating (i.e., the actual rating that user u gave to item i) and R*(u, i) to denote an unknown rating (i.e., the system-predicted rating for an item i that user u has not rated before).

Neighborhood-based CF technique. There exist multiple variations of neighborhood-based CF techniques [9], [36], [40]. In this paper, to estimate R*(u, i), i.e., the rating that user u would give to item i, we first compute the similarity between user u and every other user u' using the cosine similarity metric [9], [40]:

  sim(u, u') = \frac{\sum_{i \in I(u,u')} R(u,i) \, R(u',i)}{\sqrt{\sum_{i \in I(u,u')} R(u,i)^2} \, \sqrt{\sum_{i \in I(u,u')} R(u',i)^2}},   (1)

where I(u, u') is the set of all items rated by both user u and user u'. Based on this similarity calculation, the set N(u) of nearest neighbors of user u is obtained; the size of N(u) can range anywhere from 1 to |U| - 1, i.e., all other users in the dataset. Then, R*(u, i) is calculated as the adjusted weighted sum of all known ratings R(u', i), where u' ∈ N(u) [13], [34]:

  R*(u, i) = \bar{R}(u) + \frac{\sum_{u' \in N(u)} sim(u,u') \, (R(u',i) - \bar{R}(u'))}{\sum_{u' \in N(u)} |sim(u,u')|}.   (2)

Here \bar{R}(u) denotes the average rating of user u. A neighborhood-based CF technique can be user-based or item-based, depending on whether the similarity is calculated between users or between items. Formulae (1) and (2) represent the user-based approach, but they can be straightforwardly rewritten for the item-based approach because of the symmetry between users and items in all neighborhood-based CF calculations [40]. In our experiments, we used both the user-based and the item-based approach for rating estimation.

Matrix factorization CF technique. Matrix factorization techniques have been a mainstay of numerical linear algebra since the 1970s [16], [21], [27] and have recently gained popularity in recommender systems applications because of their effectiveness in improving recommendation accuracy [41], [47], [52], [55]. Many variations of matrix factorization have been developed to address data sparsity, overfitting, and convergence speed, and they turned out to be a crucial component of many well-performing algorithms in the popular Netflix Prize competition (more information can be found at www.netflixprize.com) [4], [5], [6], [15], [22], [29], [30]. We implemented the basic version of this technique, as presented in [15]. Under the assumption that a user's rating for an item is composed of a sum of preferences over the various features of that item, this model is induced by a Singular Value Decomposition (SVD) of the user-item ratings matrix. In particular, using K features (i.e., rank-K SVD), user u is associated with a user-factors vector p_u (the user's preferences for the K features), and item i is associated with an item-factors vector q_i (the item's importance weights for the K features). The prediction of how much user u likes item i, denoted by R*(u, i), is obtained by taking the inner product of the two vectors, i.e.,

  R*(u, i) = p_u^T q_i.   (3)
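As a concrete illustration of (1) and (2), the following minimal Python sketch implements the user-based variant on an invented toy ratings dictionary; as is common in practice (an assumption here, not spelled out in the formulas), the neighborhood for predicting R*(u, i) is restricted to users who have actually rated item i:

import math

# ratings[u][i] = known rating R(u, i); a tiny invented example.
ratings = {
    "u1": {"i1": 5.0, "i2": 3.0, "i3": 4.0},
    "u2": {"i1": 4.0, "i2": 3.5, "i4": 2.0},
    "u3": {"i2": 4.5, "i3": 4.0, "i4": 5.0},
}

def mean(u):
    return sum(ratings[u].values()) / len(ratings[u])

def sim(u, v):
    # Equation (1): cosine similarity over the items rated by both users.
    common = set(ratings[u]) & set(ratings[v])
    num = sum(ratings[u][i] * ratings[v][i] for i in common)
    den = math.sqrt(sum(ratings[u][i] ** 2 for i in common)) * \
          math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return num / den if den else 0.0

def predict(u, i, k=2):
    # Equation (2): adjusted weighted sum over the k nearest neighbors of u,
    # here restricted to the users who have rated item i.
    nbrs = sorted((v for v in ratings if v != u and i in ratings[v]),
                  key=lambda v: sim(u, v), reverse=True)[:k]
    den = sum(abs(sim(u, v)) for v in nbrs)
    if den == 0:
        return mean(u)
    num = sum(sim(u, v) * (ratings[v][i] - mean(v)) for v in nbrs)
    return mean(u) + num / den

print(round(predict("u1", "i4"), 3))  # R*(u1, i4), about 3.665 on this toy data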
  • 4. All values in the user- and item-factor vectors are initially set to arbitrary numbers and are then estimated with a simple gradient descent technique, as described in (4). The user- and item-factor vectors are iteratively updated using a learning rate parameter (γ) and a regularization parameter (λ), the latter used to reduce overfitting, until the minimum improvement in predictive accuracy or a pre-defined number of iterations per feature is reached. One learning iteration is defined as:

  For each known rating R(u, i):
      err = R(u, i) - p_u^T q_i
      p_u ← p_u + γ (err · q_i - λ p_u)   (4)
      q_i ← q_i + γ (err · p_u - λ q_i)
  End For

Finally, unknown ratings are estimated from the final vectors p_u and q_i, as stated in (3). More details on the variations of matrix factorization techniques used in recommender systems can be found in [4], [5], [30], [52], [55].
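To make the learning iteration concrete, here is a small self-contained Python sketch of rank-K matrix factorization trained with update (4); the toy data, K, the learning rate, the regularization strength, and the number of passes are all illustrative choices, not the paper's settings:

import random

random.seed(0)
K, gamma, lam, passes = 2, 0.01, 0.02, 500  # features, learning rate, regularization, sweeps

# Known ratings as (user, item, R(u, i)) triples; invented example data.
data = [("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u2", "i1", 4.0), ("u2", "i3", 2.0)]

p = {u: [random.gauss(0, 0.1) for _ in range(K)] for u, _, _ in data}
q = {i: [random.gauss(0, 0.1) for _ in range(K)] for _, i, _ in data}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

for _ in range(passes):            # each pass is one learning iteration as in (4)
    for u, i, r in data:
        err = r - dot(p[u], q[i])  # prediction error on the known rating
        for f in range(K):
            pu, qi = p[u][f], q[i][f]
            p[u][f] += gamma * (err * qi - lam * pu)
            q[i][f] += gamma * (err * pu - lam * qi)

# Equation (3): an unknown rating is then estimated as an inner product.
print(dot(p["u1"], q["i3"]))  # R*(u1, i3)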
2.2 Accuracy of Recommendations

Numerous recommendation techniques have been developed over the last few years, and various metrics have been employed to measure the accuracy of recommendations, including statistical accuracy metrics and decision-support measures [24]. As examples of statistical accuracy metrics, the mean absolute error (MAE) and root mean squared error (RMSE) measure how well a system can predict the exact rating value of a specific item. Examples of decision-support metrics include precision (the percentage of truly "high" ratings among those predicted to be "high" by the recommender system), recall (the percentage of correctly predicted "high" ratings among all ratings known to be "high"), and the F-measure, the harmonic mean of precision and recall. The ratings in the datasets used in our experiments are integers between 1 and 5, inclusive, where a higher value represents a better-liked item. As is commonly done in the recommender systems literature, we define ratings greater than 3.5 (the threshold for "high" ratings, denoted by T_H) as "highly-ranked" and ratings below 3.5 as "non-highly-ranked."

Furthermore, in real-world settings, recommender systems typically recommend the N most highly-ranked items, since users are usually interested in only the several most relevant recommendations. This list of N items for user u can be defined as L_N(u) = {i_1, ..., i_N}, where R*(u, i_k) ≥ T_H for all k ∈ {1, 2, ..., N}. Therefore, in this paper, we evaluate recommendation accuracy based on the percentage of truly "highly-ranked" ratings, denoted by correct(L_N(u)), among those predicted to be the N most relevant "highly-ranked" items for each user, i.e., using the popular precision-in-top-N metric [24]. Formally:

  precision-in-top-N = \sum_{u \in U} |correct(L_N(u))| \, / \, \sum_{u \in U} |L_N(u)|,

where correct(L_N(u)) = {i ∈ L_N(u) | R(u, i) ≥ T_H}. However, relying on the accuracy of recommendations alone may not be enough to find the most relevant items for a user. It has often been suggested that recommender systems must be not only accurate, but also useful [24], [32]. For example, [32] suggests new user-centric directions for evaluating recommender systems beyond the conventional accuracy metrics, claiming that serendipity in recommendations, as well as user experiences and expectations, should also be considered when evaluating recommendation quality. Among the many aspects that cannot be measured by accuracy metrics alone, in this paper we focus on the notion of the diversity of recommendations, discussed next.

2.3 Diversity of Recommendations

As mentioned in Section 1, the diversity of recommendations can be measured in two ways: individual and aggregate. Most recent studies have focused on increasing individual diversity, which can be calculated from each user's recommendation list (e.g., as the average dissimilarity between all pairs of items recommended to a given user) [8], [33], [46], [54], [57]. These techniques aim to avoid providing overly similar recommendations to the same user. For example, some studies [8], [46], [57] used an intra-list similarity metric to determine individual diversity. Alternatively, [54] used a new evaluation metric, item novelty, to measure the amount of additional diversity that one item brings to a list of recommendations. Moreover, the loss of accuracy resulting from an increase in diversity can be controlled by changing the granularity of the underlying similarity metrics in diversity-conscious algorithms [33].

On the other hand, except for some work that examined sales diversity across all users of a system by measuring the statistical dispersion of sales [10], [14], few studies have explored aggregate diversity in recommender systems, despite the potential importance of diverse recommendations from both the user and the business perspective, as discussed in Section 1. Several metrics can be used to measure aggregate diversity, including the percentage of items for which the recommender system is able to make recommendations (often known as coverage) [24]. Since we intend to measure recommender system performance based on the top-N recommendation lists that the system provides to its users, in this paper we use the total number of distinct items recommended across all users as the aggregate diversity measure, which we refer to as diversity-in-top-N and formally define as:

  diversity-in-top-N = \left| \bigcup_{u \in U} L_N(u) \right|.

Note that the diversity-in-top-N metric can also serve as an indicator of the level of personalization provided by a recommender system. For example, a very low diversity-in-top-N indicates that all users are recommended the same top-N items (a low level of personalization), whereas a very high diversity-in-top-N indicates that every user receives her own unique top-N items (a high level of personalization).

In summary, the goal of the proposed ranking approaches is to improve the diversity of recommendations; however, as described in Section 1, there is a potential tradeoff between recommendation accuracy and diversity. Thus, in this paper, we aim to find techniques that improve the aggregate diversity of recommendations while maintaining adequate accuracy.
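For concreteness, the two metrics can be computed as in the following minimal sketch; the recommendation lists and test ratings are invented, and recommended items without a known test rating are counted as incorrect (one possible convention, assumed here):

T_H = 3.5

# top_n[u] = L_N(u); actual[u][i] = known test rating R(u, i). Invented data.
top_n = {"u1": ["i1", "i2"], "u2": ["i1", "i3"]}
actual = {"u1": {"i1": 4.0, "i2": 3.0}, "u2": {"i1": 5.0, "i3": 4.5}}

def precision_in_top_n(top_n, actual):
    # Share of recommended items whose actual rating is "highly-ranked".
    correct = sum(1 for u, items in top_n.items()
                  for i in items if actual.get(u, {}).get(i, 0.0) >= T_H)
    total = sum(len(items) for items in top_n.values())
    return correct / total

def diversity_in_top_n(top_n):
    # Number of distinct items recommended across all users.
    return len(set().union(*top_n.values()))

print(precision_in_top_n(top_n, actual))  # 0.75
print(diversity_in_top_n(top_n))          # 3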
  • 5. 3 MOTIVATIONS FOR RECOMMENDATION RE-RANKING

In this section, we discuss how re-ranking the candidate items whose predictions are above T_H can affect the accuracy-diversity tradeoff, and how various item-ranking factors, such as item popularity, can improve the diversity of recommendations. Note that the general idea of personalized information ordering is not new; e.g., its importance has been discussed in the information retrieval literature [35], [45], including some attempts to reduce redundancy and promote the diversity of retrieved results by re-ranking them [12], [38], [53].

3.1 Standard Ranking Approach

Typical recommender systems predict unknown ratings from known ratings, using any traditional recommendation technique, such as the neighborhood-based or matrix factorization CF techniques discussed in Section 2.1. The predicted ratings are then used to support the user's decision-making. In particular, each user u gets recommended a list of top-N items, L_N(u), selected according to some ranking criterion. More formally, item i_x is ranked ahead of item i_y if rank(i_x) < rank(i_y), where rank: I → R is a function representing the ranking criterion. The vast majority of current recommender systems use the predicted rating value as the ranking criterion:

  rank_Standard(i) = R*(u, i)^{-1}.

The power of -1 in the above expression indicates that the items with the highest (as opposed to the lowest) predicted ratings R*(u, i) are the ones recommended to the user. In this paper we refer to this as the standard ranking approach; it shares its motivation with the widely used probability ranking principle in the information retrieval literature, which ranks documents in order of decreasing probability of relevance [37].

Note that, by definition, recommending the most highly predicted items selected by the standard ranking approach is designed to improve recommendation accuracy, but not recommendation diversity. Therefore, new ranking criteria are needed in order to achieve diversity improvements. Since recommending best-selling items to each user typically reduces diversity, recommending less popular items intuitively should increase recommendation diversity; and, as seen from the example in Table 1 (in Section 1), this intuition has empirical support. Following this motivation, we explore the possibility of using item popularity as a recommendation ranking criterion, and in the next subsection we show how this approach affects recommendation quality in terms of accuracy and diversity.

3.2 Proposed Approach: Item Popularity-Based Ranking

The item popularity-based ranking approach ranks items directly by their popularity, from lowest to highest, where popularity is represented by the number of known ratings that each item has. More formally, the item popularity-based ranking function is:

  rank_ItemPop(i) = |U(i)|, where U(i) = {u ∈ U | R(u, i) is known}.

We compared the performance of the item popularity-based ranking approach with the standard ranking approach using the MovieLens dataset and item-based CF, and we present the comparison in the accuracy-diversity plot of Fig. 1. The results show that, as compared to the standard ranking approach, the item popularity-based ranking approach increased recommendation diversity from 385 to 1395 distinct items (i.e., 3.6 times); however, recommendation accuracy dropped from 89% to 69%. Despite the significant diversity gain, such a large accuracy loss (20%) would not be acceptable in most real-life personalization applications. Therefore, we next introduce a general technique for parameterizing recommendation ranking approaches that allows significant diversity gains while controlling accuracy losses (e.g., according to how much loss is tolerable in a given application).

[Fig. 1. Performance of the standard ranking approach and the item popularity-based approach with its parameterized versions (MovieLens data, item-based CF with 50 neighbors, top-5 item recommendation).]

3.3 Controlling the Accuracy-Diversity Tradeoff: Parameterized Ranking Approaches

The item popularity-based ranking approach, as well as all other ranking approaches proposed in this paper (to be discussed in Section 4), is parameterized with a "ranking threshold" T_R ∈ [T_H, T_max] (where T_max is the largest possible rating on the rating scale, e.g., T_max = 5) to give the user the ability to choose a certain level of recommendation accuracy. In particular, given any ranking function rank_X(i), the ranking threshold T_R is used to create the parameterized version rank_X(i, T_R), formally defined as:

  rank_X(i, T_R) = rank_X(i),                  if R*(u, i) ∈ [T_R, T_max],
  rank_X(i, T_R) = α_u + rank_Standard(i),     if R*(u, i) ∈ [T_H, T_R),

  where I*_u(T_R) = {i ∈ I | R*(u, i) ≥ T_R} and α_u = max_{i ∈ I*_u(T_R)} rank_X(i).

Simply put, items predicted above the ranking threshold T_R are ranked according to rank_X(i), while items predicted below T_R are ranked according to the standard ranking approach rank_Standard(i). In addition, all items above T_R are ranked ahead of all items below T_R (as ensured by α_u in the above definition). Thus, increasing the ranking threshold T_R towards T_max selects the most highly predicted items, resulting in more accuracy and less diversity (becoming increasingly similar to the standard ranking approach); conversely, decreasing T_R towards T_H makes rank_X(i, T_R) increasingly similar to the pure ranking function rank_X(i), resulting in more diversity with some accuracy loss.

Therefore, choosing different T_R values between these extremes allows the user to set the desired balance between accuracy and diversity. In particular, as Fig. 1 shows, the recommendation accuracy of the item popularity-based ranking approach can be improved by increasing the ranking threshold. For example, with ranking threshold 4.4, the item popularity-based ranking approach reduces the accuracy loss to 1.32% while still obtaining an 83% diversity gain (from 385 to 703) compared to the standard ranking approach; an even higher threshold of 4.7 still achieves a 20% diversity gain (from 385 to 462) with only 0.06% accuracy loss. A sketch of this parameterized re-ranking follows below.
  • 6. Also note that, even when there are fewer than N items above the ranking threshold T_R, by definition all items above T_R are recommended to the user, and the remaining top-N items are selected according to the standard ranking approach. This ensures that all the ranking approaches proposed in this paper provide exactly the same number of recommendations as their corresponding baseline techniques (the ones using the standard ranking approach), which is very important for a fair experimental comparison of the different ranking techniques.

3.4 General Steps for Recommendation Re-ranking

The item popularity-based ranking approach described above is just one example of a ranking approach for improving recommendation diversity; a number of additional ranking functions rank_X(i) will be introduced in Section 4. Here, based on the discussion so far, we summarize the general ideas behind the proposed ranking approaches, as illustrated by Fig. 2.

The first step, shown in Fig. 2a, represents the standard approach: for each user, all predicted items are ranked according to the predicted rating value, and the top-N candidate items are selected, as long as they are above the highly-predicted rating threshold T_H. The recommendation quality of the overall technique is measured in terms of precision-in-top-N and diversity-in-top-N, as shown in the accuracy-diversity plot at the right side of example (a).

The second step, illustrated in Fig. 2b, shows the recommendations obtained by applying one of the proposed ranking functions rank_X(i), where several different items (not necessarily among the N most highly predicted, but still above T_H) are recommended to the user. This way, a user can be recommended more idiosyncratic, long-tail, less frequently recommended items that may not be as widely popular but can still be very relevant to this user (as indicated by a relatively high predicted rating). Re-ranking the candidate items can therefore significantly improve recommendation diversity although, as discussed, this typically comes at some loss of recommendation accuracy; the performance graph of step (b) demonstrates this accuracy-diversity tradeoff.

The third step, shown in Fig. 2c, can significantly reduce the accuracy loss by confining the re-ranked recommendations to the items above a newly introduced ranking threshold T_R (e.g., 3.8 out of 5). In this particular illustration, the increased ranking threshold causes the fifth recommended item of step (b) (the item with predicted rating 3.65) to be filtered out, and the next possible item above the new ranking threshold (the item predicted at 3.81) is recommended to user u instead. Averaged across all users, this parameterization keeps the accuracy loss fairly small while still producing a significant diversity gain over the standard ranking approach, as shown in the performance graph of step (c). The usage example below mimics these three steps.

[Fig. 2. General overview of the ranking-based approaches for improving recommendation diversity: (a) recommending the top-N highly predicted items for user u, according to the standard ranking approach; (b) recommending top-N items according to some other ranking approach for better diversity; (c) confining the re-ranked recommendations to the items above a new ranking threshold T_R (e.g., 3.8) for better accuracy.]
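As a usage illustration (with invented popularity counts, and predicted values chosen to echo the 3.65/3.81 example above), the top_n sketch from Section 3.3 can mimic the three steps of Fig. 2:

pred = {"i1": 4.8, "i2": 4.4, "i3": 4.1, "i4": 3.81, "i5": 3.65}
popularity = {"i1": 800, "i2": 300, "i3": 120, "i4": 45, "i5": 10}

standard = top_n(pred, rank_x=lambda i: -pred[i], n=3, t_r=5.0)        # step (a)
re_ranked = top_n(pred, rank_x=lambda i: popularity[i], n=3, t_r=3.5)  # step (b)
confined = top_n(pred, rank_x=lambda i: popularity[i], n=3, t_r=3.8)   # step (c)
print(standard)   # ['i1', 'i2', 'i3']: most highly predicted first
print(re_ranked)  # ['i5', 'i4', 'i3']: least popular candidates first
print(confined)   # ['i4', 'i3', 'i2']: the 3.65 item is filtered out by T_R = 3.8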
We now introduce several additional item ranking functions and provide empirical evidence that supports our motivation for using these item criteria for diversity improvement.

4 ADDITIONAL RANKING APPROACHES

In many personalization applications (e.g., movie or book recommendations), there often exist more highly-predicted ratings for a given user than can fit into her top-N list. This creates opportunities for alternative ranking approaches, under which different sets of items can be recommended to the user. In this section, we introduce six additional ranking approaches that can be used as alternatives to rank_Standard to improve recommendation diversity; Fig. 3 provides some empirical evidence that supports the use of these item ranking criteria. Because of space limitations, Fig. 3 presents the empirical results for the MovieLens dataset only; however, consistently similar patterns were found in the other datasets (discussed in Section 5.1) as well.

In particular, in our empirical analysis we consistently observed that popular items, on average, are likely to have higher predicted ratings than less popular items, using both heuristic-based and model-based techniques for rating prediction, as shown in Fig. 3a.
  • 7. As discussed in Section 3, recommending less popular items helps to improve recommendation diversity. Therefore, as immediately suggested by the monotonic relationship between average item popularity and predicted rating value, recommending items that are not as highly predicted (but still predicted above T_H) likely implies recommending, on average, less popular items, potentially leading to diversity improvements. We therefore propose to use the predicted rating value itself as an item ranking criterion:

Reverse Predicted Rating Value, i.e., ranking the candidate (highly predicted) items by their predicted rating value, from lowest to highest (as a result choosing less popular items, according to Fig. 3a). More formally:

  rank_RevPred(i) = R*(u, i).

We now propose several other ranking criteria that exhibit consistent relationships with the predicted rating value, including average rating, absolute likeability, relative likeability, item rating variance, and neighbors' rating variance, as shown in Figs. 3b-3f. In particular, the relationship between predicted rating values and the average actual rating of each item (as explicitly rated by users), shown in Fig. 3b, supports a similar conjecture: items with a lower average rating, on average, are more likely to have lower predicted rating values (likely representing less popular items, as shown earlier). Thus, such items could be recommended for better diversity.

Item Average Rating, i.e., ranking items according to the average of all known ratings of each item:

  rank_AvgRating(i) = \bar{R}(i), where \bar{R}(i) = \frac{1}{|U(i)|} \sum_{u \in U(i)} R(u, i).

Similarly, the relationships between predicted rating values and item absolute (or relative) likeability, shown in Figs. 3c and 3d, suggest that items with lower likeability, on average, are more likely to have lower predicted rating values (likely representing less popular movies) and, thus, could also be recommended for better diversity.

Item Absolute Likeability, i.e., ranking items according to how many users liked them (i.e., rated them above T_H):

  rank_AbsLike(i) = |U_H(i)|, where U_H(i) = {u ∈ U(i) | R(u, i) ≥ T_H}.

Item Relative Likeability, i.e., ranking items according to the percentage of users who liked an item (among all users who rated it):

  rank_RelLike(i) = |U_H(i)| / |U(i)|.

We can also use two different types of rating variance to improve recommendation diversity. With any traditional recommendation technique, each item's rating variance (computed from the known ratings submitted for that item) can be used to re-rank candidate items. In addition, if a neighborhood-based recommendation technique is used for prediction, the rating variance of the neighbors whose ratings are used to predict an item's rating can be used to re-rank candidate items. As shown in Figs. 3e and 3f, the relationship between the predicted rating value and each item's rating variance, and the relationship between the predicted rating value and the rating variance of 50 neighbors (obtained using a neighborhood-based CF technique), both demonstrate that highly predicted items tend to be low in item rating variance as well as in neighbors' rating variance. In other words, among the highly-predicted ratings (i.e., those above T_H), there is more user consensus for higher-predicted items than for lower-predicted ones. These findings indicate that re-ranking the recommendation list by rating variance and choosing the items with higher variance could improve recommendation diversity.

[Fig. 3. Relationships between various item-ranking criteria and the predicted rating value, for highly-predicted ratings (MovieLens data): (a) average predicted rating value, (b) item average rating, (c) average item absolute likeability, (d) average item relative likeability, (e) average item rating variance, (f) average neighbors' rating variance.]
  • 8. Item Rating Variance, i.e., ranking items according to each item's rating variance (i.e., the rating variance of the users who rated the item):

  rank_ItemVar(i) = \frac{1}{|U(i)|} \sum_{u \in U(i)} (R(u, i) - \bar{R}(i))^2.

Neighbors' Rating Variance, i.e., ranking items according to the rating variance of the neighbors of a particular user for a particular item, where the closest neighbors of user u among the users who rated item i, denoted by u', are chosen from the set U(i) ∩ N(u):

  rank_NeighborVar(i) = \frac{1}{|U(i) \cap N(u)|} \sum_{u' \in U(i) \cap N(u)} (R(u', i) - \bar{R}_u(i))^2,

  where \bar{R}_u(i) = \frac{1}{|U(i) \cap N(u)|} \sum_{u' \in U(i) \cap N(u)} R(u', i).

In summary, there exist a number of different ranking approaches that can improve recommendation diversity by recommending items other than those with the topmost predicted rating values. In addition, as indicated in Fig. 1, the degree of improvement (and, more importantly, the degree of tolerable accuracy loss) can be controlled by the chosen ranking threshold value T_R. The next section presents comprehensive empirical results demonstrating the effectiveness and robustness of the proposed ranking techniques.
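Before turning to the empirical results, here is a hedged sketch of these criteria as Python key functions, pluggable into the earlier top_n sketch; ratings_for is a hypothetical map from each item to its list of known ratings, and the neighbors' rating variance criterion is omitted because it additionally depends on the target user's neighborhood N(u). Since top_n sorts ascending, a criterion meant to prefer high-variance items (as motivated by Figs. 3e-3f) would be negated:

T_H = 3.5
ratings_for = {"i1": [5.0, 4.0, 4.5], "i2": [3.0, 4.0], "i3": [2.0, 5.0, 3.0]}  # invented

def rank_avg_rating(i):   # Item Average Rating
    rs = ratings_for[i]
    return sum(rs) / len(rs)

def rank_abs_like(i):     # Item Absolute Likeability: |U_H(i)|
    return sum(1 for r in ratings_for[i] if r >= T_H)

def rank_rel_like(i):     # Item Relative Likeability: |U_H(i)| / |U(i)|
    return rank_abs_like(i) / len(ratings_for[i])

def rank_item_var(i):     # Item Rating Variance
    rs = ratings_for[i]
    m = sum(rs) / len(rs)
    return sum((r - m) ** 2 for r in rs) / len(rs)

# Reverse Predicted Rating Value needs the per-user predictions, e.g.:
#   top_n(pred, rank_x=lambda i: pred[i], ...)
# Preferring high-variance items:
#   top_n(pred, rank_x=lambda i: -rank_item_var(i), ...)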
5 EMPIRICAL RESULTS

5.1 Data

The proposed recommendation ranking approaches were tested on several movie rating datasets: MovieLens (data file available at grouplens.org), Netflix (data file available at netflixprize.com), and Yahoo! Movies (individual ratings collected from movie pages at movies.yahoo.com). We pre-processed each dataset to include only users and movies with a significant rating history, which makes it possible to have a sufficient number of highly-predicted items to recommend to each user (in the test data). Basic statistics of the resulting datasets are summarized in Table 2. For each dataset, we randomly chose 60% of the ratings as training data and used them to predict the remaining 40% (the test data).

TABLE 2. BASIC INFORMATION OF THE MOVIE RATING DATASETS

                                              MovieLens     Netflix   Yahoo! Movies
  Number of users                                 2,830       3,333           1,349
  Number of movies                                1,919       2,092             721
  Number of ratings                             775,176   1,067,999          53,622
  Data sparsity                                  14.27%      15.32%           5.51%
  Avg # of common movies between two users         64.6        57.3             4.1
  Avg # of common users between two movies         85.1        99.5             6.5
  Avg # of users per movie                        404.0       510.5            74.4
  Avg # of movies per user                        274.1       320.4            39.8

5.2 Performance of the Proposed Ranking Approaches

We conducted experiments on the three datasets described in Section 5.1, using three widely popular recommendation techniques for rating prediction: two heuristic-based techniques (user-based and item-based CF) and one model-based technique (matrix factorization CF), all discussed in Section 2.1. All seven proposed ranking approaches were used in conjunction with each of the three rating prediction techniques to generate top-N (N = 1, 5, 10) recommendations for each user on each dataset, with one exception: because there is no concept of neighbors in a pure matrix factorization technique, the ranking approach based on neighbors' rating variance was applied only with the heuristic-based techniques. We set the predicted rating threshold to T_H = 3.5 (out of 5), to ensure that only relevant items are recommended to users, and varied the ranking threshold T_R from 3.5 to 4.9. The performance of each ranking approach was measured in terms of precision-in-top-N and diversity-in-top-N (N = 1, 5, 10) and, for comparison purposes, its diversity gain and precision loss with respect to the standard ranking approach were calculated.

Consistently with the accuracy-diversity tradeoff discussed in the introduction, all proposed ranking approaches improved the diversity of recommendations by sacrificing some accuracy. However, with each ranking approach, as the ranking threshold T_R increases, the accuracy loss shrinks substantially (smaller precision loss) while a substantial diversity improvement is still exhibited. Therefore, with different ranking thresholds, one can obtain different diversity gains for different levels of tolerable precision loss compared to the standard ranking approach. Following this idea, in our experiments we compare the effectiveness (i.e., the diversity gain) of the different recommendation ranking techniques for a variety of precision loss levels (0.1-10%).

While, as mentioned earlier, a comprehensive set of experiments was performed using every rating prediction technique in conjunction with every recommendation ranking function, on every dataset, and for different numbers of top-N recommendations, the results were very consistent across all experiments. Therefore, for illustration purposes and because of space limitations, we show only three results, each using all applicable ranking techniques on a different dataset, with a different recommendation technique, and with a different number of recommendations (see Table 3).

For example, Table 3a shows the performance of the proposed ranking approaches used in conjunction with the item-based CF technique to provide top-5 recommendations on the MovieLens dataset. One can observe that, with a precision loss of only 0.001, or 0.1% (i.e., precision of 0.891, down from 0.892 for the standard ranking approach), the item average rating-based ranking approach already increases recommendation diversity by 20% (an absolute diversity gain of 78 on top of the 385 achieved by the standard ranking approach). If users can tolerate a precision loss of up to 1% (precision of 0.882, or 88.2%), diversity can be increased by 81% with the same ranking technique; a 5% precision loss (84.2%) provides diversity gains of up to 189% for this recommendation technique on this dataset. Substantial diversity improvements can be observed across different ranking techniques, rating prediction techniques, and datasets, as shown in Tables 3a, 3b, and 3c.

In general, all proposed ranking approaches were able to provide significant diversity gains, and the best-performing ranking approach may differ depending on the chosen dataset and rating prediction technique. Thus, system designers have the flexibility to choose the most suitable ranking approach based on the data in a given application. We would also like to point out that, since the proposed approaches are essentially implemented as sorting algorithms based on certain ranking heuristics, they are extremely scalable. For example, it took, on average, less than 6 seconds to rank all predicted items and select the top-N recommendations for nearly 3,000 users in our experiments with the MovieLens data.
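The per-cell quantities of Table 3 below (see the notation beneath the table) can be expressed directly in terms of the metric sketches from Section 2.2 above; compare is an illustrative helper, and the example lists are invented:

def compare(top_std, top_alt, actual):
    # Diversity gain as a difference and as a ratio, and precision loss,
    # of a proposed ranking approach relative to the standard one.
    gain_abs = diversity_in_top_n(top_alt) - diversity_in_top_n(top_std)
    gain_ratio = diversity_in_top_n(top_alt) / diversity_in_top_n(top_std)
    loss = precision_in_top_n(top_alt, actual) - precision_in_top_n(top_std, actual)
    return gain_abs, gain_ratio, loss

top_std = {"u1": ["i1"], "u2": ["i1"]}
top_alt = {"u1": ["i2"], "u2": ["i3"]}
actual = {"u1": {"i1": 4.0, "i2": 4.5}, "u2": {"i1": 5.0, "i3": 3.0}}
print(compare(top_std, top_alt, actual))  # (1, 2.0, -0.5)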
  • 9. TABLE 3. DIVERSITY GAINS OF THE PROPOSED RANKING APPROACHES FOR DIFFERENT LEVELS OF PRECISION LOSS

(a) MovieLens dataset, top-5 items, heuristic-based technique (item-based CF, 50 neighbors)

  Precision   Item          Reverse       Item Avg      Item Abs      Item Rel      Item Rating   Neighbors'
  Loss        Popularity    Prediction    Rating        Likeability   Likeability   Variance      Rating Var.
  -0.1        +800 / 3.078  +848 / 3.203  +975 / 3.532  +897 / 3.330  +937 / 3.434  +386 / 2.003  +702 / 2.823
  -0.05       +594 / 2.543  +594 / 2.543  +728 / 2.891  +642 / 2.668  +699 / 2.816  +283 / 1.735  +451 / 2.171
  -0.025      +411 / 2.068  +411 / 2.068  +513 / 2.332  +445 / 2.156  +484 / 2.257  +205 / 1.532  +258 / 1.670
  -0.01       +270 / 1.701  +234 / 1.608  +311 / 1.808  +282 / 1.732  +278 / 1.722  +126 / 1.327  +133 / 1.345
  -0.005      +189 / 1.491  +173 / 1.449  +223 / 1.579  +196 / 1.509  +199 / 1.517   +91 / 1.236   +87 / 1.226
  -0.001       +93 / 1.242   +44 / 1.114   +78 / 1.203  +104 / 1.270   +96 / 1.249   +21 / 1.055   +20 / 1.052
  Standard precision: 0.892; standard diversity: 385 (gain 1.000) in every column.

(b) Netflix dataset, top-5 items, model-based technique (matrix factorization CF, K = 64)

  Precision   Item          Reverse       Item Avg      Item Abs      Item Rel      Item Rating
  Loss        Popularity    Prediction    Rating        Likeability   Likeability   Variance
  -0.1        +314 / 1.356  +962 / 2.091  +880 / 1.998  +732 / 1.830  +860 / 1.975  +115 / 1.130
  -0.05       +301 / 1.341  +757 / 1.858  +718 / 1.814  +614 / 1.696  +695 / 1.788  +137 / 1.155
  -0.025      +238 / 1.270  +568 / 1.644  +535 / 1.607  +464 / 1.526  +542 / 1.615  +110 / 1.125
  -0.01       +156 / 1.177  +363 / 1.412  +382 / 1.433  +300 / 1.340  +385 / 1.437   +63 / 1.071
  -0.005      +128 / 1.145  +264 / 1.299  +282 / 1.320  +247 / 1.280  +288 / 1.327   +47 / 1.053
  -0.001       +64 / 1.073  +177 / 1.201  +118 / 1.134   +89 / 1.101  +148 / 1.168    +8 / 1.009
  Standard precision: 0.834; standard diversity: 882 (gain 1.000) in every column.

(c) Yahoo! Movies dataset, top-1 item, heuristic-based technique (user-based CF, 15 neighbors)

  Precision   Item          Reverse       Item Avg      Item Abs      Item Rel      Item Rating   Neighbors'
  Loss        Popularity    Prediction    Rating        Likeability   Likeability   Variance      Rating Var.
  -0.1        +220 / 1.794  +178 / 1.643  +149 / 1.538  +246 / 1.888  +122 / 1.440   +86 / 1.310  +128 / 1.462
  -0.05       +198 / 1.715  +165 / 1.596  +141 / 1.509  +226 / 1.816  +117 / 1.422   +72 / 1.260  +108 / 1.390
  -0.025      +134 / 1.484  +134 / 1.484  +103 / 1.372  +152 / 1.549   +86 / 1.310   +70 / 1.253   +98 / 1.354
  -0.01        +73 / 1.264   +92 / 1.332   +56 / 1.202   +77 / 1.278   +58 / 1.209   +56 / 1.202   +65 / 1.235
  -0.005       +57 / 1.206   +86 / 1.310   +38 / 1.137   +63 / 1.227   +36 / 1.130   +28 / 1.101   +51 / 1.184
  -0.001       +42 / 1.152   +71 / 1.256   +25 / 1.090   +43 / 1.155   +30 / 1.110   +19 / 1.069   +22 / 1.079
  Standard precision: 0.911; standard diversity: 277 (gain 1.000) in every column.

Notation: precision loss = [precision-in-top-N of the proposed ranking approach] - [precision-in-top-N of the standard ranking approach]. In each cell, the first number is the absolute diversity gain, [diversity-in-top-N of the proposed approach] - [diversity-in-top-N of the standard approach], and the second number is the relative gain, [diversity-in-top-N of the proposed approach] / [diversity-in-top-N of the standard approach].

5.3 Robustness Analysis for Different Parameters

In this subsection, we present a robustness analysis of the proposed techniques with respect to several parameters: the number of neighbors used in heuristic-based CF, the number of features used in matrix factorization CF, the number of top-N recommendations provided to each user, the value of the predicted rating threshold T_H, and the level of data sparsity.

We tested the heuristic-based technique with different numbers of neighbors (15, 20, 30, and 50) and the model-based technique with different numbers of features (K = 8, 16, 32, and 64). For illustration purposes, Figs. 4a and 4b show how two different ranking approaches, for both the heuristic-based and the model-based rating prediction technique, are affected by the different parameter values. While different parameter values may result in slightly different performance (as is well known in the recommender systems literature), the fundamental behavior of the proposed techniques remains robust and consistent, as shown in Figs. 4a and 4b. In other words, with any of these parameter values, the recommendation ranking techniques obtain substantial diversity improvements with only a small accuracy loss.

We also varied the number of top-N recommendations provided by the system. While it is intuitively clear that top-1, top-5, and top-10 recommendations will provide different accuracy and diversity levels (it is much easier to accurately recommend one relevant item than 10 relevant items, and it is much easier to achieve higher aggregate diversity when more recommendations can be provided), we again observe that, for any number of top-N recommendations, the proposed techniques exhibit robust and consistent behavior, i.e., they obtain substantial diversity gains at a small accuracy loss, as shown in Fig. 4c. For example, with only a 1% precision loss, we were able to increase diversity from 133 to 311 (a 134% gain) using the reverse predicted rating value-based ranking approach in the top-1 recommendation task, and from 385 to 655 (a 70% gain) using the item popularity-based ranking approach in the top-5 recommendation task.

In addition, our finding that the proposed ranking approaches help to improve recommendation diversity is also robust with respect to the "highly-predicted" rating threshold