International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print),
ISSN 0976-6375 (Online), Volume 4, Issue 2, March-April (2013), © IAEME
A SURVEY OF MEMORY BASED METHODS FOR COLLABORATIVE
FILTERING BASED TECHNIQUES FOR ONLINE RECOMMENDER
SYSTEMS
Anuj Verma (1), Kishore Bhamidipati (2)
(1) Dept. of Computer Science and Engineering, Manipal Institute of Technology, Manipal
University, Manipal, Karnataka - 576104, India
(2) Asst. Professor - Sr. Scale, Dept. of Computer Science and Engineering, Manipal Institute
of Technology, Manipal University, Manipal, Karnataka - 576104, India
ABSTRACT
Cyberspace aims to provide an increasingly dynamic experience to users. The rise of
electronic commerce has led to efforts to provide a highly efficient, high-quality
experience to the consumer. Recommender Systems are a step in this direction. They help in
making sense of the vast amount of data available and, in particular, in understanding each
user. One of the most successful techniques for generating recommendations is Collaborative
Filtering. The technique uses the available information about existing users to generate
predictions for the active user. A widely employed approach for this purpose is the memory
based algorithm. The existing preferences of users are represented in the form of a
user-item matrix. The method makes use of the complete or partial user-item matrix to
isolate the nearest users for the active user and then generate the prediction. The majority
of initial efforts dedicated to understanding electronic commerce and recommender systems
concentrate only on technical aspects such as algorithm building and the computational needs
of such systems. Not much attention has been given to questions about the need for such
systems or how effective they are at what they try to do. Along with looking at the various
stages of a memory based collaborative filtering system, we propose an experiment to check
the effectiveness of the predictions or ratings generated by such systems.
Keywords: Collaborative Filtering, Collective Intelligence, Memory Based Algorithm,
Recommendation System
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING
& TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 4, Issue 2, March – April (2013), pp. 366-372
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com
1. INTRODUCTION
Recommender Systems prove to be an important tool for market analysis as well as for a
deeper understanding of customer behavior. This is especially relevant in the current era of
information explosion, when a customer is confounded by the variety of options available
for every need. Recommender systems try to bridge the gap between the user and the market
by mathematically determining what a user may prefer.
The distinguishing strength of collaborative filtering is its versatility. The technique is
unchanged for any type of data, i.e., the working of the system remains the same irrespective
of the nature of the information, so the system structure is the same for any application,
from a book recommendation system to a movie recommendation system.
Collaborative filtering can be implemented using two different approaches: memory-based and
model-based. Memory-based collaborative filtering systems use the complete user-item rating
matrix, or a part of it, to generate recommendations. The model-based approach attempts to
determine a pattern or trend in the given ratings data and then constructs a model to
generate recommendations [1].
The memory-based approach has been discussed at length and is predominantly utilized in
commercial systems due to several factors. The first reason is its ease of use. Since it
works directly on the user-item database, it is easier to apply and to account for. The
second reason is its intuitive nature: as the system keeps collecting data about a
particular user, it automatically takes this new information into account when generating
recommendations. Hence the predictions are always up-to-date. The third reason is cost.
Memory-based systems are less costly and hence outperform the model-based approach in speed
and resource usage [2].
The first limitation of this approach is that it is rating dependent. The behavioral
trends or tastes of a user may change over time. The user may also become reluctant during
the rating process and rate items selectively or incorrectly. Another factor is the limited
scope of ratings. Data belonging to a particular domain can be used to successfully generate
predictions for that specific domain only. It is difficult to predict the breakfast
preferences of a user by analyzing the music that user listens to. The second limitation is
data sparsity. When a new user is introduced to the system, it takes time to build a profile
for that user, as no information about them exists yet. This is called the cold start
problem [2].
We examine the various stages of a general collaborative filtering algorithm and then
try to analyze the effectiveness of the various techniques employed at each stage.
2. COLLABORATIVE FILTERING ALGORITHM
A Recommender System can be imagined as a black box used as a filter of
information. The input is the data gathered about a user (the active user). One of the most
important algorithms that functions inside the filter is the similarity computation
algorithm, which aims to determine the proximity of different users and represent this
‘nearness’ in the form of a numerical weight. This weight can be any measure that can be used
to determine ‘nearness’ between two entities. Euclidean distance and Correlation Coefficient
are widely used measures. Angular Distance can also be used. The output of the filter is the
generated recommendation. The three main stages of the process are: Representation of Input
Data, Similarity Computation and Recommendation Generation [3].
2.1 Representation of Input data
Data about a specific user can be gathered by explicit and implicit methods. The
explicit method involves collecting ratings or asking about the likes and dislikes of a
consumer (e.g., Thumbs Up/Thumbs Down buttons). The implicit method is concerned with
checking the browsing history, tracking the number of clicks and recording the time spent on
a particular page. The gathered data is represented in the form of the user-item matrix [4].
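As an illustration of this representation, the following minimal sketch builds a sparse user-item matrix from explicit ratings. The user names, movie titles, rating values and function names are hypothetical and are not taken from the study; a nested dictionary is only one of several reasonable layouts.

```python
# Hypothetical explicit ratings: (user, item, rating on a 1-5 scale).
ratings = [
    ("alice", "Toy Story", 5),
    ("alice", "Heat", 3),
    ("bob", "Toy Story", 4),
    ("bob", "Casino", 2),
    ("carol", "Heat", 4),
]


def build_user_item_matrix(triples):
    """Return a nested dict where matrix[user][item] = rating.

    Items a user has not rated are simply absent, which keeps the
    representation sparse.
    """
    matrix = {}
    for user, item, rating in triples:
        matrix.setdefault(user, {})[item] = float(rating)
    return matrix


def co_rated_items(matrix, a, b):
    """Items that both user a and user b have rated."""
    return sorted(set(matrix[a]) & set(matrix[b]))


matrix = build_user_item_matrix(ratings)
print(matrix["alice"])
print(co_rated_items(matrix, "alice", "bob"))
```

The co-rated items returned by the second helper are the only items on which two users can be compared directly, which is why they appear in every similarity measure discussed in this section.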
2.2 Similarity Computation
Similarity Computation is the most important step of the Recommendation System
because the accuracy of these calculations determines the accuracy of the system. This step
is concerned with identifying the k nearest users to the active user. These k users form the
neighborhood of the active user. The rating is generated from the ratings of this
neighborhood [3]. The different methods to calculate the similarity are:
2.2.1 Euclidean Distance
This method takes into account all the items that two users have rated in common and
treats each such item as an axis of a graph. The users are then represented as points on the
graph, and the distance between the points is measured using the Euclidean distance
formula. The distance between users A and B can be represented as follows:
$$w(A,B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2} \qquad (1)$$
where $A_i$ and $B_i$ are the ratings for the $i$-th item by users A and B respectively, who
have a total of $n$ co-rated items. Despite its simplicity, a disadvantage of this method is
the two dimensional nature of the measure. The range of this measure is [0,1] once the
distance is converted into a similarity score, for example as $1/(1 + w)$; the raw distance
itself is non-negative and unbounded [5].
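A minimal sketch of a Euclidean distance based similarity is given below. It assumes each user's ratings are held in a dictionary mapping items to ratings, and it maps the distance d to a score in (0, 1] using 1/(1 + d). That mapping is one common convention and is an assumption of this sketch, not something the study specifies; the function name and sample data are hypothetical.

```python
import math


def euclidean_similarity(ratings_a, ratings_b):
    """Similarity in (0, 1] derived from the Euclidean distance.

    Only co-rated items contribute. Returns 0.0 when the two users have
    no item in common. The 1 / (1 + d) mapping is an assumption.
    """
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    distance = math.sqrt(
        sum((ratings_a[i] - ratings_b[i]) ** 2 for i in common)
    )
    return 1.0 / (1.0 + distance)


# Hypothetical users: they share m1 and m2 and differ by one point on m1.
user_a = {"m1": 5, "m2": 3, "m3": 4}
user_b = {"m1": 4, "m2": 3, "m4": 2}
print(euclidean_similarity(user_a, user_b))
```

With this mapping, identical ratings on the co-rated items give a score of 1, and the score falls towards 0 as the distance grows.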
2.2.2 Pearson Correlation
The Pearson Correlation Coefficient is a widely used statistical measure of how strongly
two entities are related. It determines the degree of association between two
variables. The nearer the points are to a linear trajectory, the higher their strength of
association. The Pearson Correlation between users A and B can be represented as follows
[6]:
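$$w(A,B) = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2}\,\sqrt{\sum_{i=1}^{n} (B_i - \bar{B})^2}} \qquad (2)$$
where $A_i$ and $B_i$ are the ratings for the $i$-th item by users A and B respectively, who
have a total of $n$ co-rated items. $\bar{A}$ and $\bar{B}$ are the average ratings for user A
and user B respectively.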
Unlike the Euclidean distance, it has a wider range of [-1,1] and can therefore take
negative values. Its strength lies in the fact that it accommodates differences in scaling:
it corrects for users who rate consistently higher or lower than others, i.e., for the
non-normalized nature of the data.
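The following sketch shows one way equation (2) might be computed. It assumes the averages are taken over the co-rated items only, which is a common simplification but an assumption here, and it returns 0.0 when the correlation is undefined. The function name and sample ratings are hypothetical.

```python
import math


def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation over the co-rated items of two users.

    Assumption: each user's mean is taken over the co-rated items.
    Returns 0.0 if fewer than two items are shared or if either user
    gave the same rating to every shared item (zero variance).
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / n
    mean_b = sum(ratings_b[i] for i in common) / n
    numerator = sum(
        (ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common
    )
    spread_a = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
    spread_b = math.sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))
    if spread_a == 0 or spread_b == 0:
        return 0.0
    return numerator / (spread_a * spread_b)


low_rater = {"m1": 1, "m2": 2, "m3": 3}
high_rater = {"m1": 2, "m2": 4, "m3": 6}   # same trend on a larger scale
contrarian = {"m1": 3, "m2": 2, "m3": 1}   # opposite trend

print(pearson_similarity(low_rater, high_rater))
print(pearson_similarity(low_rater, contrarian))
```

The first pair illustrates the scaling property: the two users disagree on every absolute rating yet are almost perfectly correlated, whereas the second pair is almost perfectly anti-correlated.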
2.2.3 Cosine Similarity
Vector-based cosine similarity is an important technique used for string matching and
for checking the similarity of two documents. It can be applied to collaborative filtering
as well if the users are considered as documents to be matched, items
are considered as words and the ratings are considered to be the frequency of the occurrence
of words. By using this measure we are trying to establish the angle between the two vectors
[6]. The cosine similarity between two users A and B can be represented as follows:
$$w(A,B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{j=1}^{n_1} A_j^2}\,\sqrt{\sum_{j=1}^{n_2} B_j^2}} \qquad (3)$$
where $A_i$ and $B_i$ are the ratings for the $i$-th item by users A and B respectively, who
have a total of $n$ co-rated items. $n_1$ and $n_2$ denote the numbers of items rated by user
A and user B respectively.
It has a range of [-1,1]. A cosine of 1, which corresponds to an angle of 0, indicates that
the vectors overlap and hence that the users have similar tastes. This measure is
particularly useful when data is sparse or the co-rated items are few, so that a useful
relationship cannot be determined using other measures.
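A minimal sketch of equation (3) follows. As in the formula, the numerator runs over the co-rated items while each norm in the denominator runs over all items rated by that user. The function name and the sample ratings are hypothetical.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine of the angle between two users' rating vectors.

    Unrated items are treated as zeros, so only co-rated items
    contribute to the dot product, while each norm uses every item the
    user has rated.
    """
    common = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


user_a = {"m1": 3, "m2": 4}
user_b = {"m1": 4, "m3": 3}   # shares only m1 with user_a
user_c = {"m4": 5}            # shares nothing with user_a

print(cosine_similarity(user_a, user_a))
print(cosine_similarity(user_a, user_b))
print(cosine_similarity(user_a, user_c))
```

Because a score is defined even when the users share a single item, this measure yields a usable weight in sparse data where a correlation over the co-rated items cannot be computed.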
2.3 Prediction Generation
Once the similarity computation is completed and a suitable neighborhood
is formed, the prediction is generated. The task can be accomplished using
various methods, the simplest of which is taking a simple average or mean of the obtained
ratings. A more effective method is to take the weighted average of the available ratings.
The rating for a particular item k for user A can be represented as follows [6]:
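$$P_{A,k} = \bar{A} + \kappa \sum_{q=1}^{n} w(A,q)\,(q_k - \bar{q}) \qquad (4)$$
where $P_{A,k}$ is the predicted rating of user A for item k, $\bar{A}$ is the average rating
for items rated by user A, $\kappa$ is a normalizing factor, $w(A,q)$ is the similarity
between user A and neighbor q, and $n$ is the number of neighbors in the neighborhood. $q_k$
is user q's rating for item k and $\bar{q}$ is the average rating for all items rated by
user q.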
The aim is to calculate the expected rating for all items that have not yet been rated by the
active user and then recommend the N highest rated items in the
neighborhood. This is called the ‘Top N’ recommendation approach [7].
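The prediction step of equation (4) and the Top N selection can be sketched as follows. The sketch assumes the weighted sum is normalized by the sum of the absolute similarity weights of the neighbors who rated the item, which is a common choice but an assumption here. The data, the weights and the function names are hypothetical.

```python
def mean_rating(user_ratings):
    """Average of all ratings given by one user."""
    return sum(user_ratings.values()) / len(user_ratings)


def predict(matrix, weights, active, item):
    """Mean-centered weighted prediction for one (user, item) pair.

    weights maps each neighbor to its similarity with the active user.
    Returns None when no neighbor has rated the item, in which case no
    prediction can be generated.
    """
    weighted_sum = 0.0
    normalizer = 0.0
    for neighbor, w in weights.items():
        if item not in matrix[neighbor]:
            continue
        offset = matrix[neighbor][item] - mean_rating(matrix[neighbor])
        weighted_sum += w * offset
        normalizer += abs(w)
    if normalizer == 0:
        return None
    return mean_rating(matrix[active]) + weighted_sum / normalizer


def top_n(matrix, weights, active, n):
    """The n unseen items with the highest predicted rating."""
    seen = set(matrix[active])
    candidates = {i for q in weights for i in matrix[q]} - seen
    scored = []
    for item in candidates:
        prediction = predict(matrix, weights, active, item)
        if prediction is not None:
            scored.append((item, prediction))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]


matrix = {
    "A": {"m1": 4, "m2": 2},
    "q1": {"m1": 5, "m2": 3, "m3": 4},
    "q2": {"m1": 2, "m2": 4, "m3": 1, "m4": 5},
}
weights = {"q1": 1.0, "q2": 0.5}   # hypothetical similarities to user A

print(predict(matrix, weights, "A", "m3"))
print(top_n(matrix, weights, "A", 1))
```

Returning None when no neighbor has rated the item makes explicit that a prediction cannot always be generated for every item.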
3. EXPERIMENTAL DESIGN
The success of a collaborative filtering algorithm depends on the
effectiveness of the similarity computation method used. Hence our task is the evaluation of
three different techniques commonly used for memory based collaborative filtering: Euclidean
Distance, Pearson Correlation Coefficient and Vector Based Cosine Similarity. An empirical
study was conducted to measure the effectiveness of the ratings generated using these
similarity measures. For our study, we considered explicit input, i.e., numerical ratings
given by the user, to examine the system. The dataset used was one popularly used for movie
recommendation systems: the MovieLens 100,000 movie ratings dataset (MovieLens is a
free service provided by GroupLens Research at the University of Minnesota).
A standard one page questionnaire was prepared containing a list of 100 common movies
belonging to the dataset, and users were asked to provide ratings for them. The ratings
were collected from 52 users locally as well as through online social networking media. Basic
demographic information about the users was also recorded. For an input form to be valid, a
user must have rated at least 15% of the items, i.e., 15 movies out of 100. Participant
details are as follows:
• Total Participants = 52
• Age Range = 14-48 years
• Gender Ratio (Male:Female) = 2:1
• Valid Ratings = 1188
To empirically evaluate the techniques, we use repeated random sub-sampling
validation. About 10% of the collected ratings, i.e., 120 of the 1188 ratings, act as
validation data, so that we can generate the corresponding ratings from the system and then
measure the deviation. The validation subset is generated at random. The rest of the
collected ratings form one part of the training subset. To arrive at a firm conclusion, this
procedure is repeated thrice for each similarity measure. Mean-centered ratings are used. We
are interested in the relative performance of the three measures. The neighborhood size is
fixed at 30. Two users must have at least 5 co-rated items for their similarity to be
considered.
• Total Valid Collected Ratings = 1188
• Collected Ratings to be Tested = $R_{Test}$ = 10% of the Valid Collected Ratings = 120
• Collected Ratings used to generate predictions by the CF System = $R_{T1}$ = 1188 - 120 = 1068
• Already available Ratings for the corresponding items from the MovieLens Dataset = $R_{T2}$ = 19228
• Total ratings used by the CF System = $R_T = R_{T1} + R_{T2}$ = 1068 + 19228 = 20296
• Therefore, 20296 ratings will be used to generate predictions for 120 ratings.
• Neighborhood Size = 30 Nearest Neighbors
• The process is to be carried out thrice for 3 techniques, hence Total No. of Passes = 3 x 3 = 9
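The split used in each pass can be sketched as below. The counts (1188 collected ratings, 120 held out, 19228 MovieLens ratings) come from the list above; the rating triples themselves are synthetic placeholders, and the function name and seeds are hypothetical.

```python
import random

MOVIELENS_RATINGS = 19228   # ratings already available for the 100 movies
TEST_SIZE = 120             # validation ratings held out in each pass


def random_subsample_split(ratings, test_size, seed):
    """Shuffle a copy of the ratings and hold out test_size of them.

    Returns (training part, test part). A different seed gives a
    different random validation subset, as required for repeated random
    sub-sampling validation.
    """
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


# Synthetic placeholder triples (user, movie, rating) standing in for
# the 1188 ratings collected through the questionnaire.
collected = [(k % 52, k % 100, 1 + k % 5) for k in range(1188)]

for pass_no in range(1, 4):
    train, test = random_subsample_split(collected, TEST_SIZE, seed=pass_no)
    # Test size, collected training ratings, total training ratings.
    print(pass_no, len(test), len(train), len(train) + MOVIELENS_RATINGS)
```

Each pass holds out 120 ratings and leaves 1068 collected ratings which, together with the 19228 MovieLens ratings, give the 20296 training ratings.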
4. RESULTS
To judge the accuracy of the similarity computation technique we consider the
following parameters:
• Average deviation for the generated ratings. Deviation is the difference between the
actual rating and the predicted rating.
• Average deviation is measured by calculating the MAE (Mean Absolute Error), given by:
$$MAE(f) = \frac{1}{|R_{Test}|} \sum_{i \in R_{Test}} |r_i - f(i)|$$
where $r_i$ is the rating in $R_{Test}$ and $f(i)$ is the corresponding rating generated
using $R_T$.
• No. of satisfactory or good predictions, i.e., no. of generated predictions for which
deviation was <0.5
• No. of unsatisfactory or bad predictions, i.e., no. of generated predictions for which
deviation was >1
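The three parameters can be computed with a short routine such as the one below. It is a sketch with made-up rating pairs, and it averages over the predictions that were actually generated, skipping test ratings for which no prediction exists. The function name and the dictionary keys are hypothetical.

```python
def evaluate(pairs):
    """Summarize prediction quality for one pass.

    pairs is a list of (actual rating, predicted rating); a predicted
    value of None means that no prediction could be generated.
    """
    deviations = [abs(a - p) for a, p in pairs if p is not None]
    generated = len(deviations)
    total = sum(deviations)
    return {
        "generated": generated,
        "total_deviation": total,
        "mae": total / generated if generated else None,
        "good": sum(1 for d in deviations if d < 0.5),   # deviation < 0.5
        "bad": sum(1 for d in deviations if d > 1),      # deviation > 1
    }


# Made-up (actual, predicted) pairs; the last one has no prediction.
sample = [(4, 3.75), (2, 3.5), (5, 4.0), (3, None)]
print(evaluate(sample))
```

A deviation of exactly 1 counts as neither good nor bad under these thresholds, so the two counts need not add up to the number of generated predictions.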
The data obtained after carrying out the experimental design are as follows:
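Table 1: Obtained Results

| Pass | Similarity Type | R_Test | R_Train | R_Gen | Total Deviation | MAE   | Good Ratings | Bad Ratings |
|------|-----------------|--------|---------|-------|-----------------|-------|--------------|-------------|
| 1    | PCC             | 120    | 20296   | 107   | 98.05           | 0.916 | 38           | 39          |
| 2    | PCC             | 120    | 20296   | 108   | 102.89          | 0.953 | 36           | 45          |
| 3    | PCC             | 120    | 20296   | 110   | 95.43           | 0.868 | 39           | 33          |
| 4    | ED              | 120    | 20296   | 88    | 72.55           | 0.824 | 28           | 33          |
| 5    | ED              | 120    | 20296   | 96    | 87.35           | 0.91  | 32           | 37          |
| 6    | ED              | 120    | 20296   | 104   | 87.64           | 0.843 | 32           | 39          |
| 7    | CS              | 120    | 20296   | 120   | 85.49           | 0.712 | 49           | 32          |
| 8    | CS              | 120    | 20296   | 120   | 95.49           | 0.796 | 41           | 36          |
| 9    | CS              | 120    | 20296   | 120   | 85.64           | 0.714 | 45           | 29          |

(Note: PCC - Pearson Correlation Coefficient; ED - Euclidean Distance; CS - Cosine
Similarity)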
An important observation is that the number of ratings generated satisfies
$R_{Gen} \le R_{Test}$. This is because when similarity is computed for an active user, only
30 neighbours are considered. The items rated by these 30 users do not always cover all 100
items for which ratings have been recorded. Hence, for some test ratings, the rating for a
particular item for a specific user is left un-generated. Therefore, in calculating MAE, we
use $R_{Gen}$ instead of $R_{Test}$.
The three passes for each similarity type are then averaged to give its mean performance:
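Table 2: Mean Performance

| SN | Similarity Type     | Avg. MAE | Avg. % Good Ratings | Avg. % Bad Ratings |
|----|---------------------|----------|---------------------|--------------------|
| 1  | Euclidean Distance  | 0.859    | 31.97               | 37.85              |
| 2  | Pearson Correlation | 0.912    | 34.77               | 36.04              |
| 3  | Cosine Similarity   | 0.741    | 37.5                | 26.94              |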
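The averages in Table 2 can be re-derived from the per-pass figures in Table 1, as the sketch below shows. It assumes that each pass is first converted into an MAE and into percentages of the generated ratings, and that the three passes are then averaged with equal weight. The function and variable names are hypothetical.

```python
# (similarity type, ratings generated, total deviation, good, bad),
# copied from Table 1.
table1 = [
    ("PCC", 107, 98.05, 38, 39),
    ("PCC", 108, 102.89, 36, 45),
    ("PCC", 110, 95.43, 39, 33),
    ("ED", 88, 72.55, 28, 33),
    ("ED", 96, 87.35, 32, 37),
    ("ED", 104, 87.64, 32, 39),
    ("CS", 120, 85.49, 49, 32),
    ("CS", 120, 95.49, 41, 36),
    ("CS", 120, 85.64, 45, 29),
]


def mean_performance(rows):
    """Average MAE, % good and % bad per similarity type."""
    per_type = {}
    for sim, generated, total_dev, good, bad in rows:
        per_pass = (
            total_dev / generated,        # MAE of this pass
            100.0 * good / generated,     # % good predictions
            100.0 * bad / generated,      # % bad predictions
        )
        per_type.setdefault(sim, []).append(per_pass)
    return {
        sim: tuple(sum(p[k] for p in passes) / len(passes) for k in range(3))
        for sim, passes in per_type.items()
    }


for sim, (mae, good_pct, bad_pct) in mean_performance(table1).items():
    print(f"{sim}: MAE={mae:.3f} good={good_pct:.2f}% bad={bad_pct:.2f}%")
```

Under these assumptions the computed averages agree with Table 2 to the reported precision.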
As can be observed, for a slight increase in average MAE, the Pearson Correlation
Coefficient produces a higher percentage of good predictions and a lower percentage of bad
predictions than the Euclidean distance measure. It can thus be considered the superior
measure of the two. The Vector Based Cosine Similarity outperforms the other two on all 3
parameters. Hence it is the most effective of the three.
It can be stated that the Memory Based Collaborative Filtering algorithm performs the
recommendation task with creditable accuracy: the average MAE stays below one rating point
for all three similarity measures. It can thus be considered a significant approach for
Collaborative Filtering based recommendation.
5. CONCLUSION
The aim of recommender systems is to generate precise predictions automatically. We
set out to check the effectiveness of a widely used approach for this purpose, the Memory
Based Algorithm. The most crucial stage in the algorithm is neighborhood formation by
similarity calculation, so we checked the effectiveness of commonly used similarity
measures. The quantitative results of our experiments indicated that Vector Based Cosine
Similarity was a more effective similarity measure than Pearson Correlation Coefficient and
Euclidean distance based similarity. The memory based algorithm produces practicable
predictions and is thus an efficacious technique for online recommendation. Possible
extensions include carrying out a similar study after normalizing the ratings (z-score
normalization can be used for this) and varying the similarity weight according to the
number of co-rated items (significance weighting).
REFERENCES
[1] B.M. Sarwar, G. Karypis, J.A. Konstan and J. Riedl, Item based collaborative filtering
recommendation algorithms, Proc. 10th International Conference on World Wide Web
(WWW '01), 2001, 285-295.
[2] X. Su and T.M. Khoshgoftaar, A Survey of Collaborative Filtering Techniques,
Advances in Artificial Intelligence, Hindawi Publishing Corporation, Article ID
421425, 2009, 19 pages.
[3] E. Vozalis and K.G. Margaritis, Analysis of Recommender Systems' Algorithms, Proc.
6th Hellenic-European Conference on Computer Mathematics and its Applications
(HERCMA), 2003/9.
[4] D. Militaru and C. Zaharia, A survey of collaborative filtering-based systems for
online recommendation, Proc. 12th International Conference on Electronic
Commerce: Roadmap for the Future of Electronic Business (ICEC '10), ACM, New
York, 2010, 43-47.
[5] G. Adomavicius and A. Tuzhilin, Towards the Next Generation of Recommender
Systems: A Survey of the State-of-the-Art and Possible Extensions, IEEE Transactions
on Knowledge and Data Engineering, 17(6), 2005, 734-749.
[6] T. Segaran, Making Recommendations, in Programming Collective Intelligence, (USA:
O'Reilly Media, 2007) 7-28.
[7] J. Breese, D. Heckerman and C. Kadie, Empirical Analysis of Predictive Algorithms for
Collaborative Filtering, Microsoft Research, Redmond, Technical Report MSR-TR-98-12,
1998, 43-52.
[8] C.R. Cyril Anthoni and A. Christy, Integration of Feature Sets with Machine
Learning Techniques for Spam Filtering, International Journal of Computer
Engineering & Technology (IJCET), 2(1), 2011, 47-52, ISSN Print:
0976-6367, ISSN Online: 0976-6375.
[9] Suresh Kumar RG, S. Saravanan and Soumik Mukherjee, Recommendations for
Implementing Cloud Computing Management Platforms using Open Source,
International Journal of Computer Engineering & Technology (IJCET), 3(3),
2012, 83-93, ISSN Print: 0976-6367, ISSN Online: 0976-6375.