- Scalable recommendation algorithm based on Locality Sensitive Hashing (LSH) and Collaborative Filtering.
- Distributed implementation of LSH with Apache Spark.
2. Outline
• Introduction
• Collaborative Filtering (CF) and Scalability Problem
• Locality Sensitive Hashing (LSH) for Recommendation
• Improvement for LSH methods
• Preliminary Results
• Work Plan
3. Recommender Systems
•Recommender systems
•Applied to various domains:
•Book/movie/news recommendations
•Contextual advertising
•Search engine personalization
•Matchmaking
•Two types of problems:
• Preference elicitation (prediction)
• Set-based recommendations (top-N)
5. Neighborhood-based Methods
The idea: Similar users behave in a similar way.
• User-based: rely on the opinion of like-minded users to
predict a rating.
• Item-based: look at the ratings given to similar items.
Both require computing similarity weights to select trusted neighbors whose ratings are used in the prediction.
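As a rough illustration of the user-based variant, the sketch below predicts a rating as a similarity-weighted average over the k most similar users who rated the item. The cosine similarity, the `ratings[user][item]` dictionary layout, and the function names are assumptions made for this example; the slides do not fix a particular similarity measure or data structure here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict_user_based(ratings, u, item, k=2):
    """Predict u's rating of item from the k most similar users who rated it."""
    neighbors = [(cosine(ratings[u], r), r[item])
                 for v, r in ratings.items() if v != u and item in r]
    top = sorted(neighbors, key=lambda x: x[0], reverse=True)[:k]
    weight = sum(s for s, _ in top)
    if weight == 0:
        return None  # no usable neighbor
    return sum(s * r for s, r in top) / weight

# Hypothetical toy data.
ratings = {
    "alice": {"a": 5, "b": 3},
    "bob":   {"a": 5, "b": 3, "c": 4},
    "carol": {"a": 1, "b": 5, "c": 2},
}
print(predict_user_based(ratings, "alice", "c"))
```

Note that every other user is compared against the target user, which is the all-pairs cost that motivates the LSH methods later in the deck.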
6. Neighborhood-based Methods
Problem
• Compare all users/items to find trusted neighbors (k-nearest-neighbors)
• Does not scale well with data size (# of users/items)
Computational Complexity
             Space    Model Build   Query
User-based   O(m^2)   O(m^2 n)      O(m)
Item-based   O(n^2)   O(n^2 m)      O(n)
m : number of users
n : number of items
8. Locality Sensitive Hashing (LSH)
• Approximate nearest neighbor (ANN) search method
• Provides a way to avoid searching all of the data to find the nearest neighbors
• Finds the nearest neighbors quickly in basic neighborhood-based methods.
9. Locality Sensitive Hashing (LSH)
General approach:
• “Hash” items several times, in such a way that similar
items are more likely to be hashed to the same
bucket than dissimilar items are.
• Pairs hashed to the same bucket are candidate pairs.
• Check only the candidate pairs for similarity.
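The general approach above can be sketched as follows. The two toy hash functions (the sign of a single coordinate) and the `candidate_pairs` helper are hypothetical, chosen only to make the bucketing step concrete; any locality-sensitive family could be plugged in instead.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(items, hash_funcs):
    """Return pairs of item ids that share a bucket under at least one hash."""
    pairs = set()
    for h in hash_funcs:
        buckets = defaultdict(list)
        for item_id, value in items.items():
            buckets[h(value)].append(item_id)
        # Every pair inside the same bucket becomes a candidate pair.
        for members in buckets.values():
            pairs.update(combinations(sorted(members), 2))
    return pairs

# Toy hash functions (hypothetical): the sign of one coordinate each.
hash_funcs = [lambda v: v[0] >= 0, lambda v: v[1] >= 0]
items = {"x": (1.0, 2.0), "y": (0.9, 2.1), "z": (-1.0, -2.0)}
print(candidate_pairs(items, hash_funcs))
```

Only the pairs returned here would then be checked with an exact similarity computation.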
10. Locality-Sensitive Functions
The function h will “hash” items, and the decision will be based on whether or not the results are equal.
• h(x) = h(y): x and y become a candidate pair.
• h(x) ≠ h(y): x and y do not become a candidate pair.
g = h1 AND h2 AND h3 …
or
g = h1 OR h2 OR h3 …
A collection of functions of this form will be called a family of
functions.
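To see why the AND and OR constructions are useful, assume (as an idealization not stated on the slide) that each individual hash function collides for a given pair independently with probability p. Then an AND of K functions collides with probability p^K, and an OR of L functions collides with probability 1 − (1 − p)^L. The helper names below are hypothetical.

```python
def and_prob(p, k):
    """Collision probability of g = h1 AND ... AND hk."""
    return p ** k

def or_prob(p, l):
    """Collision probability of g = h1 OR ... OR hl."""
    return 1 - (1 - p) ** l

def and_then_or(p, k, l):
    """OR over l groups, each an AND of k functions."""
    return or_prob(and_prob(p, k), l)

for p in (0.3, 0.6, 0.9):
    print(p, round(and_then_or(p, k=4, l=5), 4))
```

AND makes dissimilar pairs less likely to collide, OR makes similar pairs more likely to collide, and combining the two sharpens the gap between them.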
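11. LSH for Cosine
Charikar defines a family of hash functions for cosine similarity as follows:
Let u and v be rating vectors and let r be a randomly generated vector whose components are +1 and −1.
The family of hash functions (H) generated:
$$h_r(u) = \begin{cases} 1 & \text{if } r \cdot u \ge 0 \\ 0 & \text{if } r \cdot u < 0 \end{cases}$$
, where
$$\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta(u, v)}{\pi}$$
shows the probability of u and v being declared as a candidate pair, and θ(u, v) is the angle between u and v.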
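A minimal sketch of this family, assuming the usual sign-of-dot-product form of the hash bit, is given below. The vectors `u` and `v` are made-up ratings, and the function names are hypothetical. With ±1 components and few dimensions, the empirical collision rate only approximates 1 − θ/π, so the two printed values should be read as roughly comparable rather than equal.

```python
import math
import random

def random_sign_vector(dim, rng):
    """Random vector r whose components are +1 or -1."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def hash_bit(r, u):
    """h_r(u) = 1 if r . u >= 0, else 0 (assumed form of the hash)."""
    return 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0

def signature(u, vectors):
    """One hash bit per random vector."""
    return tuple(hash_bit(r, u) for r in vectors)

def angle(u, v):
    """Angle between u and v in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

rng = random.Random(0)
u = [5, 3, 0, 1, 4, 0, 2, 5]
v = [4, 3, 1, 1, 5, 0, 2, 4]
vectors = [random_sign_vector(len(u), rng) for _ in range(2000)]
su, sv = signature(u, vectors), signature(v, vectors)
agreement = sum(a == b for a, b in zip(su, sv)) / len(vectors)
print("empirical collision rate:", agreement)
print("1 - theta/pi:            ", 1 - angle(u, v) / math.pi)
```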
17. LSH Methods: Prediction
UB-KNN-LSH
• Find the candidate set, C, for the target user, u, with LSH.
• Find the k nearest neighbors of u in C that have rated i.
• Use the k nearest neighbors to generate a prediction for u on i.
IB-KNN-LSH
• Find the candidate set, C, for the target item, i, with LSH.
• Find the k nearest neighbors of i in C that user u has rated.
• Use the k nearest neighbors to generate a prediction for u on item i.
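The user-based steps can be sketched as below. This is an illustration under several assumptions that the slides do not specify: hash bits are signs of dot products with random ±1 vectors, each of the L tables is keyed by K concatenated bits, neighbors are ranked by cosine similarity, and the prediction is a similarity-weighted average. All names and the toy data are hypothetical.

```python
import math
import random
from collections import defaultdict

def to_vector(user_ratings, items):
    """Dense rating vector over a fixed item order (0 = not rated)."""
    return [user_ratings.get(i, 0) for i in items]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def bucket_key(vec, planes):
    """K-bit key: one sign bit per random +/-1 vector."""
    return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes)

def build_tables(ratings, items, L, K, rng):
    """Build L hash tables, each keyed by K concatenated hash bits."""
    tables = []
    for _ in range(L):
        planes = [[rng.choice((-1, 1)) for _ in items] for _ in range(K)]
        buckets = defaultdict(set)
        for user, r in ratings.items():
            buckets[bucket_key(to_vector(r, items), planes)].add(user)
        tables.append((planes, buckets))
    return tables

def candidate_set(tables, vec, exclude):
    """Union of the buckets that vec falls into, minus the target itself."""
    found = set()
    for planes, buckets in tables:
        found |= buckets.get(bucket_key(vec, planes), set())
    found.discard(exclude)
    return found

def predict_ub_knn_lsh(ratings, items, tables, u, item, k):
    """k-NN prediction restricted to LSH candidates who rated the item."""
    vec_u = to_vector(ratings[u], items)
    scored = [(cosine(vec_u, to_vector(ratings[v], items)), ratings[v][item])
              for v in candidate_set(tables, vec_u, u) if item in ratings[v]]
    top = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    weight = sum(s for s, _ in top)
    return sum(s * r for s, r in top) / weight if weight > 0 else None

items = ["a", "b", "c", "d"]
ratings = {
    "u":     {"a": 5, "b": 3},
    "twin":  {"a": 5, "b": 3},
    "bob":   {"a": 4, "b": 3, "c": 4},
    "carol": {"a": 1, "d": 5, "c": 2},
}
tables = build_tables(ratings, items, L=5, K=2, rng=random.Random(0))
print(predict_ub_knn_lsh(ratings, items, tables, "u", "c", k=2))
```

The result may be `None` when no candidate has rated the item, which is one practical cost of searching only the candidate set. The item-based variant is symmetric: hash item vectors instead of user vectors.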
18. LSH Methods: Prediction
UB-LSH1
• Find the list of candidate users, Cl, for u who rated i, with LSH.
• Calculate the frequency of each user in Cl who rated i.
• Sort the candidate users by frequency and take the top k users.
• Use frequency as the weight to predict the rating for u on i with user-based prediction.
IB-LSH1
• Find the list of candidate items, Cl, for i with LSH.
• Calculate the frequency of the items in Cl that are rated by u.
• Sort the candidate items by frequency and take the top k items.
• Use frequency as the weight to predict the rating for u on i with item-based prediction.
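A small sketch of the frequency-weighted step, assuming the candidate list Cl has already been collected from the hash tables (so a user appears once per table in which they collide with u). The plain frequency-weighted average used here is an assumption; the slides only say that frequency serves as the weight. The names and data are hypothetical.

```python
from collections import Counter

def predict_lsh1(candidate_list, rating_of_i, k):
    """Frequency-weighted average over the k most frequent candidates.

    candidate_list: candidates (with repeats) who rated item i.
    rating_of_i:    {candidate: rating given to i}.
    """
    freq = Counter(candidate_list)
    top = freq.most_common(k)           # sort by frequency, keep top k
    total = sum(count for _, count in top)
    if total == 0:
        return None
    return sum(count * rating_of_i[v] for v, count in top) / total

cl = ["bob", "carol", "bob", "dave", "bob", "carol"]
rating_of_i = {"bob": 4, "carol": 2, "dave": 5}
print(predict_lsh1(cl, rating_of_i, k=2))  # (3*4 + 2*2) / 5 = 3.2
```

No similarity is computed at query time; the collision count stands in for it.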
19. Improvement: Prediction
UB-LSH2
• Find the list of candidate users, Cl, for u who rated i, with LSH.
• Select k users from Cl randomly.
• Predict the rating for u on i with user-based prediction as the average rating of the k users.
IB-LSH2
• Find the list of candidate items, Cl, for i with LSH.
• Select k items rated by u from Cl randomly.
• Predict the rating for u on i with item-based prediction as the average rating of the k items.
- Eliminates frequency calculation and sorting.
- Frequent users or items in Cl have a higher chance of being selected randomly.
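The random-selection variant might look like the sketch below. It assumes Cl is kept as a list with repeats and that k entries are drawn without replacement from that list, so a frequent candidate is more likely to be picked (and may be picked more than once). The function name and data are hypothetical.

```python
import random

def predict_lsh2(candidate_list, rating_of_i, k, rng=random):
    """Average the ratings of k entries drawn at random from Cl."""
    if not candidate_list:
        return None
    chosen = rng.sample(candidate_list, min(k, len(candidate_list)))
    return sum(rating_of_i[v] for v in chosen) / len(chosen)

cl = ["bob", "carol", "bob", "dave", "bob", "carol"]
rating_of_i = {"bob": 4, "carol": 2, "dave": 5}
print(predict_lsh2(cl, rating_of_i, k=3, rng=random.Random(0)))
```

Compared with the frequency-based variant, there is no counting or sorting pass over Cl, only k random draws.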
20. Complexity: Prediction
             Space   Model Build   Prediction
User-based   O(m)    O(m^2)        O(mn)
Item-based   O(n)    O(n^2)        O(mn)
UB-KNN-LSH   O(mL)   O(mLKt)       O(L + |C|n + k)
IB-KNN-LSH   O(nL)   O(nLKt)       O(L + |C|m + k)
UB-LSH1      O(mL)   O(mLKt)       O(L + |Cl| + |Cl| lg(|Cl|) + k)
IB-LSH1      O(nL)   O(nLKt)       O(L + |Cl| + |Cl| lg(|Cl|) + k)
UB-LSH2      O(mL)   O(mLKt)       O(L + 2k)
IB-LSH2      O(nL)   O(nLKt)       O(L + 2k)
m : number of users
n : number of items
L : number of hash tables
K : number of hash functions
t : time to evaluate a hash function
C : candidate user (or item) set ( |C| ≤ Lm / 2^K or |C| ≤ Ln / 2^K )
Cl : candidate user (or item) list ( |Cl| ≤ Lm / 2^K or |Cl| ≤ Ln / 2^K )
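As a quick arithmetic check of the bound on the candidate list size, the snippet below evaluates L·m / 2^K for L = 5 and m = 16,042, the values quoted on the next slide. The function name is hypothetical.

```python
def candidate_bound(num_tables, population, num_hashes):
    """Upper bound L * m / 2^K (or L * n / 2^K) on the candidate list size."""
    return num_tables * population / 2 ** num_hashes

for K in (1, 5, 10):
    print(K, candidate_bound(5, 16042, K))
```

With K = 1 the bound is 40,105, larger than m itself, and each additional hash function halves it.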
21. Candidate List (Cl): Prediction
[Figure: bound |Cl| ≤ Lm / 2^K for L = 5, m = 16,042. x-axis: Number of Hash Functions (1 to 10); y-axis: Number of Users (0 to 50,000); series: Cl, m.]
[Figure: bound |Cl| ≤ Ln / 2^K for L = 5, n = 17,454. x-axis: Number of Hash Functions (1 to 10); y-axis: Number of Items (0 to 50,000); series: Cl, n.]
30. LSH Methods: Top-N Recommendation
UB-LSH1
• Find the candidate set, C, for user u with LSH.
• For each user, v, in C, retrieve the items rated by v and add them to a running candidate list, Cl.
• Calculate the frequency of the items in Cl.
• Sort Cl by frequency.
• Recommend the N most frequent items to u.
IB-LSH1
• For each item, i, that u rated, retrieve the candidate set, C, for i with LSH and add C to a running candidate list, Cl.
• Calculate the frequency of the items in Cl.
• Sort Cl by frequency.
• Recommend the N most frequent items to u.
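The user-based top-N steps can be sketched as follows, assuming the candidate set C has already been retrieved with LSH. Skipping items that u has already rated is an assumption added for the example; the slides do not say how such items are handled. The names and data are hypothetical.

```python
from collections import Counter

def recommend_ub_lsh1(candidate_users, ratings, u, n):
    """Recommend the n items that occur most often among candidates' ratings."""
    seen = set(ratings[u])
    # Running candidate list Cl: items rated by each candidate user.
    cl = [i for v in candidate_users for i in ratings[v] if i not in seen]
    return [item for item, _ in Counter(cl).most_common(n)]

ratings = {
    "u":  {"a": 5},
    "v1": {"a": 4, "b": 5, "c": 3},
    "v2": {"b": 4, "d": 2},
}
print(recommend_ub_lsh1(["v1", "v2"], ratings, "u", n=1))
```

The item-based variant builds Cl from the LSH candidates of each item u rated, then applies the same counting and sorting.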
31. Improvement: Top-N Recommendation
UB-LSH2
• Find the candidate set, C, for user u with LSH.
• For each user, v, in C, retrieve the items rated by v and add them to a running candidate list, Cl.
• Select N items from Cl randomly and recommend them to u.
IB-LSH2
• For each item, i, that u rated, retrieve the candidate set, C, for i with LSH and add C to a running candidate list, Cl.
• Select N items from Cl randomly and recommend them to u.
Eliminates frequency calculation and sorting.
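A possible sketch of the random selection is given below. It assumes Cl is a list with repeats, so items that occur more often are more likely to be drawn early. Dropping duplicate picks so that N distinct items are returned is an assumption of this example, as are the names and data.

```python
import random

def recommend_lsh2(candidate_list, n, rng=random):
    """Pick up to n distinct items from Cl in random order."""
    if n <= 0:
        return []
    pool = list(candidate_list)
    rng.shuffle(pool)
    picked = []
    for item in pool:
        if item not in picked:
            picked.append(item)
            if len(picked) == n:
                break
    return picked

cl = ["b", "c", "b", "d", "b", "e"]
print(recommend_lsh2(cl, n=2, rng=random.Random(0)))
```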
32. Complexity: Top-N Recommendation
             Space   Model Build   Top-N Recommendation
User-based   O(m)    O(m^2)        O(mn)
Item-based   O(n)    O(n^2)        O(mn)
UB-LSH1      O(mL)   O(mLKt)       O(L + |C| + |Cl| + |Cl| lg(|Cl|))
IB-LSH1      O(nL)   O(nLKt)       O(pL + |Cl| + |Cl| lg(|Cl|))
UB-LSH2      O(mL)   O(mLKt)       O(L + |C| + N)
IB-LSH2      O(nL)   O(nLKt)       O(pL + N)
m : number of users
n : number of items
p : number of ratings of a user
L : number of hash tables
K : number of hash functions
t : time to evaluate a hash function
C : candidate user (or item) set ( |C| ≤ Lm / 2^K or |C| ≤ Ln / 2^K )
Cl : candidate item list ( |Cl| ≤ p|C| for UB-LSH1 and IB-LSH1, s.t. |Cl| ≤ Lpn / 2^K )
33. Candidate List (Cl): Top-N Recommendation
[Figure: bound |Cl| ≤ Lpn / 2^K for L = 5, n = 1000, p = 100 (avg. number of ratings for a user). x-axis: Number of Hash Functions (4 to 13); y-axis: Number of Items (0 to 35,000); series: min Cl, max Cl, n.]
40. Work Plan
• LSH as a real-time stream recommendation algorithm
• Dimensionality reduction methods (e.g., Matrix Factorization)
• Other ANN Methods:
• Tree based
• Clustering based