2. Georgian Micsa
β Software engineer with 6+ years of experience, mainly Java but also
JavaScript and .NET
β Interested on OOD, architecture and agile software development
methodologies
β Currently working as Senior Java Developer @ Cegeka
β georgian.micsa@gmail.com
β http://ro.linkedin.com/in/georgianmicsa
3. What is it?
β Recommender/recommendation system/engine/platform
β A subclass of information filtering system
β Predict the 'rating' or 'preference' that a user would give to a new item
(music, books, movies, people or groups etc)
β Can use a model built from the characteristics of an item (content-based
approaches)
β Can use the user's social environment (collaborative filtering approaches)
4. Examples
β Amazon.com
β Recommend additional books
β Frequently bought together books
β Implemented using a sparse matrix of book cooccurrences
β Pandora Radio
β Plays music with similar characteristics
β Content based filtering based on properties of song/artist
β Based also on user's feedback
β Users emphasize or deemphasize certain characteristics
5. Examples 2
β Last.fm
β Collaborative filtering
β Recommends songs by observing the tracks played by user and
comparing to behaviour of other users
β Suggests songs played by users with similar interests
β Netflix
β Predictions of movies
β Hybrid approach
β Collaborative filtering based on user`s previous ratings and watching
behaviours (compared to other users)
β Content based filtering based on characteristics of movies
6. Collaborative filtering
β Collect and analyze a large amount of information on usersβ behaviors,
activities or preferences
β Predict what users will like based on their similarity to other users
β It does not rely on the content of the items
β Measures user similarity or item similarity
β Many algorithms:
β the k-nearest neighborhood (k-NN)
β the Pearson Correlation
β etc.
7. Collaborative filtering 2
β Build a model from user's profile collecting explicit and implicit data
β Explicit data:
β Asking a user to rate an item on a sliding scale.
β Rank a collection of items from favorite to least favorite.
β Presenting two items to a user and asking him/her to choose the
better one of them.
β Asking a user to create a list of items that he/she likes.
β Implicit data:
β Observing the items that a user views in an online store.
β Analyzing item/user viewing times
β Keeping a record of the items that a user purchases online.
β Obtaining a list of items that a user has listened to or watched
β Analyzing the user's social network and discovering similar likes and
dislikes
8. Collaborative filtering 3
β Collaborative filtering approaches often suffer from three problems:
β Cold Start: needs a large amount of existing data on a user in order
to make accurate recommendations
β Scalability: a large amount of computation power is often necessary
to calculate recommendations.
β Sparsity: The number of items sold on major e-commerce sites is
extremely large. The most active users will only have rated a small
subset of the overall database. Thus, even the most popular items
have very few ratings.
9. Content-based filtering
β Based on information about and characteristics of the items
β Try to recommend items that are similar to those that a user liked in the
past (or is examining in the present)
β Use an item profile (a set of discrete attributes and features)
β Content-based profile of users based on a weighted vector of item
features
β The weights denote the importance of each feature to the user
β To compute the weights:
β average values of the rated item vector
β Bayesian Classifiers, cluster analysis, decision trees, and artificial
neural networks
10. Content-based filtering 2
β Can collect feedback from user to assign higher or lower weights on the
importance of certain attributes
β Cross-content recommendation: music, videos, products, discussions etc.
from different services can be recommended based on news browsing.
β Popular for movie recommendations: Internet Movie Database, See This
Next etc.
11. Hybrid Recommender Systems
β Combines collaborative filtering and content-based filtering
β Implemented in several ways:
β by making content-based and collaborative-based predictions
separately and then combining them
β by adding content-based capabilities to a collaborative-based
approach (and vice versa)
β by unifying the approaches into one model
β Studies have shown that hybrid methods can provide more accurate
recommendations than pure approaches
β Overcome cold start and the sparsity problems
β Netflix and See This Next
12. What is Apache Mahout?
β A scalable Machine Learning library
β Apache License
β Scalable to reasonably large datasets (core algorithms implemented in
Map/Reduce, runnable on Hadoop)
β Distributed and non-distributed algorithms
β Community
β Usecases
β’ Clustering (group items that are topically related)
β’ Classification (learn to assign categories to documents)
β’ Frequent Itemset Mining (find items that appear together)
β’ Recommendation Mining (find items a user might like)
13. Non-distributed recommenders
β Non-distributed, non Hadoop, collaborative recommender algorithms
β Java or external server which exposes recommendation logic to your
application via web services and HTTP
β Key interfaces:
β DataModel: CSV files or database
β UserSimilarity: computes similarity between users
β ItemSimilarity: computes similarity between items
β UserNeighborhood: used for similarity of users
β Recommender: produces recommendations
β Different implementations based on your needs
β Input in this format: UserId,ItemId,[Preference or Rating]
β Preference is not needed in case of associations (pages viewed by users)
14. User-based recommender example
DataModel model = new FileDataModel(new File("data.txt"));
UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
// Optional:
userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer());
UserNeighborhood neighborhood =
new NearestNUserNeighborhood(3, userSimilarity, model);
Recommender recommender =
new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
Recommender cachingRecommender = new CachingRecommender(recommender);
List<RecommendedItem> recommendations =
cachingRecommender.recommend(1234, 10);
15. Item-based recommender example
DataModel model = new FileDataModel(new File("data.txt"));
// Construct the list of pre-computed correlations
Collection<GenericItemSimilarity.ItemItemSimilarity> correlations = ...;
ItemSimilarity itemSimilarity = new GenericItemSimilarity(correlations);
Recommender recommender =
new GenericItemBasedRecommender(model, itemSimilarity);
Recommender cachingRecommender = new CachingRecommender(recommender);
List<RecommendedItem> recommendations =
cachingRecommender.recommend(1234, 10);
16. Recommender evaluation
For preference data models:
DataModel myModel = ...;
RecommenderBuilder builder = new RecommenderBuilder() {
public Recommender buildRecommender(DataModel model) {
// build and return the Recommender to evaluate here
}
};
RecommenderEvaluator evaluator =
new AverageAbsoluteDifferenceRecommenderEvaluator();
double evaluation = evaluator.evaluate(builder, myModel, 0.9, 1.0);
For boolean data models, precision and recall can be computed.
17. Distributed Item Based
β Mahout offers 2 Hadoop Map/Reduce jobs aimed to support Itembased
Collaborative Filtering
β org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
β computes all similar items
β input is a CSV file with theformat userID,itemID,value
β output is a file of itemIDs with their associated similarity value
β different configuration options: eg. similarity measure to use (co
occurrence, Euclidian distance, Pearson correlation, etc.)
β org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
β Completely distributed itembased recommender
β input is a CSV file with the format userID,itemID,value
β output is a file of userIDs with associated recommended itemIDs and
their scores
β also configuration options
18. Mahout tips
β Start with non-distributed recommenders
β 100M user-item associations can be handled by a modern server with 4GB
of heap available as a real-time recommender
β Over this scale distributed algorithms make sense
β Data can be sampled, noisy and old data can be pruned
β Ratings: GenericItemBasedRecommender and
PearsonCorrelationSimilarity
β Preferences: GenericBooleanPrefItemBasedRecommender and
LogLikelihoodSimilarity
β Content-based item-item similarity => your own ItemSimilarity
19. Mahout tips 2
β CSV files
β FileDataModel
β push new files periodically
β Database
β XXXJDBCDataModel
β ReloadFromJDBCDataModel
β Offline or live recommendations?
β Distributed algorithms => Offline periodical computations
β Data is pushed periodically as CSV files or in DB
β SlopeOneRecommender deals with updates quickly
β Real time update of the DataModel and refresh recommander after
some events (user rates an item etc.)