2. Structure of the talk
• Background/Problem description
• Goal-driven design
• Experimental results
• Conclusions
3. Collaborative filtering
• Predicting user preference
towards unknown items
• Based on previously expressed preferences
[Example rating matrix: Sophie, Peter and Jaden rate six movies – Love Actually,
Pulp Fiction, Crazy Heart, White Ribbon, Up in the Air and A Single Man; each
user's unknown ratings are marked "?" and must be predicted from the known ones]
4. Evaluation metrics
• Root Mean Squared Error: $\sqrt{E[(\hat{r} - r)^2]}$
• Netflix recommendation competition
adopted this metric
• The objective function for some of the SVD
implementations is equivalent to the performance
measure [Koren et al 2009]
• Criticism
– Error criterion is uniform across rating scales
– Is it consistent with users' satisfaction?
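A minimal sketch of this metric in Python (NumPy; the example ratings are
illustrative, not from the talk):

import numpy as np

def rmse(predicted, observed):
    """Root Mean Squared Error between predicted and observed ratings."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return np.sqrt(np.mean((predicted - observed) ** 2))

# Squaring emphasizes large deviations, and the criterion is the same
# everywhere on the rating scale - the uniformity criticized above.
print(rmse([4.2, 2.9, 1.5], [5, 3, 1]))  # ~0.548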
5. Goal-driven design
• We argue that
– Measure does not always reflect user needs
– Different user needs require different performance
measures
• The algorithm should be defined based on user
needs
– Start from the user's point of view and define the measure and
algorithm accordingly
8. Boundaries and the direction of error
• Taste boundary – the boundary between liked and disliked items
(3 on a 1-to-5 rating scale)
• Direction – whether the error is towards the boundary
• Magnitude – whether the error crosses the taste boundary
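One way to make these definitions concrete, as a sketch in Python (the 1-to-5
scale and boundary of 3 follow the speaker notes; the function name is mine):

BOUNDARY = 3.0  # taste boundary on a 1-to-5 scale (liked vs. disliked)

def directional_error(predicted, observed, boundary=BOUNDARY):
    """Classify a prediction error by its direction and magnitude
    relative to the taste boundary."""
    towards = abs(predicted - boundary) < abs(observed - boundary)
    crosses = (predicted - boundary) * (observed - boundary) < 0
    return {"towards_boundary": towards, "crosses_boundary": crosses}

# A liked item (r = 5) predicted at 2 errs towards the boundary and crosses it:
print(directional_error(2.0, 5.0))
# {'towards_boundary': True, 'crosses_boundary': True}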
10. The two-dimensional weighting function
(rows: predicted value p; columns: observed rating r)
                 r = 1,2   r = 3   r = 4,5
p <= 2.5           w1        w2      w3
2.5 < p <= 3.5     w4        w5      w6
p > 3.5            w7        w8      w9
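A minimal sketch of this lookup in Python (the concrete w1..w9 values here are
placeholders; in the talk they are set manually or learned by the genetic
algorithm):

# Weight matrix indexed by predicted-rating bin (rows) and
# observed-rating bin (columns), mirroring the table above.
W = [
    [1.0, 1.0, 1.0],  # p <= 2.5       : w1, w2, w3
    [1.0, 1.0, 1.0],  # 2.5 < p <= 3.5 : w4, w5, w6
    [1.0, 1.0, 1.0],  # p > 3.5        : w7, w8, w9
]

def weight(p, r):
    """Look up w(p, r) for predicted value p and observed rating r."""
    row = 0 if p <= 2.5 else (1 if p <= 3.5 else 2)
    col = 0 if r <= 2 else (1 if r == 3 else 2)
    return W[row][col]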
11. Two-stage Optimization (in General)
[Diagram: a first stage learns the directional-error weights using feedback/IR
metrics; a second stage learns the recommendation model with those weights; the
combined system is then evaluated in a testing phase]
12. Two-stage Optimization (An example)
• Genetic algorithm with NDCG as the fitness function
• Plug the learned weights into SVD
• Training objective:
$\arg\min_{q,p} \sum_{u,i} w\,(r_{ui} - q_i^\top p_u)^2 + \lambda\,(\|q_i\|^2 + \|p_u\|^2)$
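A minimal stochastic-gradient-descent sketch of this weighted objective
(`weight_fn` is expected to be a w(p, r) lookup such as the one sketched above;
hyperparameter values are illustrative, not from the paper):

import numpy as np

def train_weighted_svd(ratings, n_users, n_items, weight_fn,
                       k=20, lr=0.005, lam=0.02, epochs=20):
    """SGD on sum_ui w * (r_ui - q_i . p_u)^2 + lam * (|q_i|^2 + |p_u|^2).
    `ratings` is a list of (user, item, rating) triples."""
    P = np.random.normal(scale=0.1, size=(n_users, k))  # user factors p_u
    Q = np.random.normal(scale=0.1, size=(n_items, k))  # item factors q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            pu, qi = P[u].copy(), Q[i].copy()
            pred = qi @ pu
            w = weight_fn(pred, r)  # directional-error weight w(p, r)
            err = r - pred
            # The weight scales only the error term; the regularizer
            # penalizes factor magnitudes as usual.
            P[u] += lr * (w * err * qi - lam * pu)
            Q[i] += lr * (w * err * pu - lam * qi)
    return P, Q  # predict r_ui as Q[i] @ P[u]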
13. Genetic algorithms
• Search algorithms that work via the
process of natural selection
• Start with a sample set of potential solutions (a set
of weights)
• Evolve towards a set of more optimal solutions
• Poor solutions tend to die out (smaller NDCG)
• Better solutions remain in the population (higher
NDCG)
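A toy version of such a search over the nine sector weights (the operators and
parameters here are generic GA choices, not the paper's exact setup; `fitness`
is expected to train the weighted model and return NDCG on held-out data):

import random

def evolve_weights(fitness, pop_size=20, generations=30,
                   mutation_rate=0.1, n_weights=9):
    """Simple genetic algorithm: selection, crossover, mutation."""
    pop = [[random.uniform(0.0, 2.0) for _ in range(n_weights)]
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        survivors = ranked[:pop_size // 2]        # low-NDCG solutions die out
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)    # pick two parents
            cut = random.randrange(1, n_weights)  # one-point crossover
            child = a[:cut] + b[cut:]
            for j in range(n_weights):            # small random mutations
                if random.random() < mutation_rate:
                    child[j] += random.gauss(0.0, 0.1)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)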
14. Experiments
• MovieLens 100k dataset
• 1682 movies, 943 users
• Only using ratings
• Five-fold cross validation
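A sketch of the five-fold split over the rating triples (the shuffling and seed
details are my assumption; the talk only states five-fold cross validation):

import numpy as np

def five_fold_splits(n_ratings, seed=0):
    """Yield (train_idx, test_idx) pairs for five-fold cross validation."""
    idx = np.random.default_rng(seed).permutation(n_ratings)
    folds = np.array_split(idx, 5)
    for i in range(5):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        yield train, test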
15. Evaluation metrics
• Recommendation as a ranking problem
• IR measures
– Normalized discounted cumulative gain (NDCG)
– Mean average precision (MAP)
– Mean reciprocal rank (MRR)
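Minimal sketches of two of these rank measures (this uses the linear-gain NDCG
variant; other gain definitions exist):

import math

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k for one user; `ranked_relevances` are graded relevances
    (e.g. ratings) in the order the system ranks the items."""
    def dcg(rels):
        return sum(rel / math.log2(pos + 2)
                   for pos, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

def mrr(ranked_binary_relevances):
    """Mean reciprocal rank of the first relevant item per user."""
    ranks = []
    for rels in ranked_binary_relevances:
        first = next((pos + 1 for pos, rel in enumerate(rels) if rel), None)
        ranks.append(1.0 / first if first else 0.0)
    return sum(ranks) / len(ranks)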
16. Results – Experiment I
Probability that an item falls into each sector
(rows: predicted value p; columns: observed rating r)

Baseline SVD:
                 r = 1,2   r = 3    r = 4,5
p <= 2.5         0.0517    0.0193   0.0106
2.5 < p <= 3.5   0.0904    0.1461   0.1391
p > 3.5          0.0299    0.1012   0.4115

SVD with weights where w7 > w8 > w4:
                 r = 1,2   r = 3    r = 4,5
p <= 2.5         0.0759    0.0407   0.0264
2.5 < p <= 3.5   0.0837    0.1676   0.2381
p > 3.5          0.0125    0.0583   0.2966
17. Results – Experiment II
(rows: predicted value p; columns: observed rating r)
                 r = 1,2   r = 3   r = 4,5
p <= 2.5           w1        w2      w3
2.5 < p <= 3.5     w4        w5      w6
p > 3.5            w7        w8      w9
18. Results – Experiment II
• Genetic algorithm used to find the optimal weights for sectors
w7, w8 and w4 (improvements statistically significant)
           Weighted   Baseline
MAP         0.450      0.447
MRR         0.899      0.889
NDCG@10     0.726      0.720
NDCG@5      0.574      0.570
NDCG@3      0.450      0.447
19. Probability of correct prediction within sectors
[Fig. 5(a): probability of correct prediction within sectors;
Fig. 5(b): probability of predicting non-relevant items as relevant]
20. Improved user experience
• Users are more likely to receive relevant items on their
recommendation list
• Less likely that lower rated items receive higher
predictions
• But it is more likely that higher rated items receive
lower predictions
21. Conclusion
• Optimize the algorithm from the user's point of view
• Identify directional errors
• Assign risk to each direction
• Approach can be changed depending on how
items are presented
22. Future work
• Taste boundaries might be user dependent
• Directional error across items or users
• Different recommender goals
24. References
• Deshpande, M., Karypis, G.: Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst. 22(1) (2004)
• Herlocker, J.L., Konstan, J.A., Borchers, A., Riedl, J.: An algorithmic framework for performing collaborative filtering. In: SIGIR '99 (1999)
• Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8) (2009)
• Wang, J., de Vries, A.P., Reinders, M.J.T.: Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In: SIGIR '06 (2006)
Editor's Notes
Brief introduction. Name, where from. I am going to present this work on goal-driven collaborative filtering. This is a simple idea based on the assumption that not all the errors the system makes are equal.
I will start with a brief introduction on collaborative filtering.
This can be applied to a variety of items, for example movie recommendation. Sometimes other content is taken into account, such as the user's gender and geographic location, the type of the item, etc. In this work we use only ratings for prediction. Example: Sophie, Peter and Jaden.
Expected value of the squared difference between the predicted rating and the observed value. Used for the Netflix competition. Since the error is squared, we emphasize large errors. Obviously, large errors occur at the ends of the rating scale.
[Add goal-driven design image] We argue that current algorithms do not always optimize performance based on user needs: they optimize the algorithm based on a performance measure that does not always reflect user needs. In addition, different user needs require different performance measures. Therefore the system should be defined based on user needs and the performance should be measured accordingly. The measure might indicate the qualities that the algorithm should possess, but the algorithm should be designed based on user needs, not only on the measure.
This graph shows the probability that the model over-predicts or under-predicts certain groups of items. For example, items that are rated 3 are more likely to be over-predicted than under-predicted. Also note the pattern that the model works best with items that are rated 4; you get the best accuracy there, because we have the highest number of training points for this group of items.
Models... extract factors. In this work we used SVD as the baseline algorithm.
Uninteresting items. Depending on the way the items are presented: top-N list, exploring. Same error, but the question is whether this error should be treated the same.
Taste boundaries: the interval between liked and disliked items. On a rating scale from 1 to 5 the boundary would be 3. The direction represents whether the predicted rating, with respect to the observed rating, errs towards the taste boundary or not. And finally the magnitude shows whether this directional error crosses the taste boundary or not.
Here we obviously want to make the prediction correct at the diagonal. But if the prediction is not correct, we define the risk of predicting items differently depending on the criteria I just explained. The size of the arrow represents the magnitude of the risk, as we understand it. For example, it is more important to penalize lower-rated items getting higher predictions than the other way around. Therefore the aim is to minimize the error in the sectors identified as more important.
We define a weighting function that is a function of p, the predicted value of the item, and r, the observed rating. This gives 9 sectors. Red: reduce the probability that an item falls in a sector. Green: increase the probability that an item falls in a sector.
The objective function is to minimize the squared error, where w is the function of the predicted value and the observed rating defined on the previous slide. The second part of the equation is the regularizing term, which avoids overfitting by penalizing the magnitude of the parameters. We solve this using gradient descent optimization, so that we find a number of factors for each item and user in the dataset. To calculate the prediction for an unknown item-user pair, we just take the dot product of the item and user vectors. Our contribution here is the weighting function that forces the model to reduce the error in the chosen sectors.
We designed a two-level optimization in order to come up with the best set of weights. The weights were optimized on the second set and tested on the third.
Genetic algorithms are search algorithms that work via the process of natural selection. They begin with a sample set of potential solutions, which then evolves toward a set of more optimal solutions. Within the sample set, solutions that are poor tend to die out while better solutions remain in the population, thus introducing more solutions into the set.
Only use rating information
We assess the system performance on the top-k list.
Experiment I: we set the weights manually for the sectors where we wanted to reduce the error the most; these included w7, w8 and w4. The table shows that we reduced the probability that items fall into particular sectors, but we also reduced the probability that an item is correctly predicted.
Five-fold validation; the improvements were tested and turned out to be statistically significant.
Essentially this approach aims to minimize the error for the predefined sectors, which inevitably results in an increase of error in other sectors. Fig. 5(a) shows the probability that true ratings are correctly predicted within our predefined taste boundary by the optimized versus the baseline approach, using the weights obtained in the second experiment (Table 4). As expected, the baseline approach predicts higher ratings better than our optimized approach, since the optimized approach does not penalize this type of error (high ratings predicted lower), whereas we have some improvement in the lower range where we aimed to reduce the error. This approach takes the low-risk route: it hurts performance at the higher range of the spectrum, where it is less risky to predict something lower, and in exchange it reduces the error for items that are rated low. This means that it is less likely that users get items that are not relevant to them (Fig. 5(b)).