Modern web applications embrace personalization in order to provide a unique customer experience. Recommendation engines, in general, and Collaborative Filtering, in particular, are essential techniques for delivering state-of-the-art personalization effects on a web site.
These slides are based on a presentation that I gave to New England's Java User Group (NEJUG) in 2009; in that respect, they are quite old. Nevertheless, the content is about the fundamental concepts of these techniques and the fundamentals never go out of fashion.
The code references are from the Yooreeka project, which started with the code of the book "Algorithms of the Intelligent Web" (Manning, 2009). You can find the Yooreeka 2.0 API (Javadoc) at http://www.marmanis.com/static/javadoc/index.html
1. Recommendation Engines:
A key personalization feature of modern web applications
Haralambos (Babis) Marmanis
NEJUG
June 11, 2009
2. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Presentation Outline
1 Introduction
Recommendations in Action
“It’s the Economy ...”
Java source code
2 Basic Concepts
The Online Music Store Example
Similarity
Distance (formulas)
Similarity (formulas)
The “best” Similarity formula
3 Collaborative Filtering
User based
Rating Counting Matrix
Item based
4 Content based
3. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Recommendations in Action
Online store recommendations
Amazon.com
Provide recommendations for purchasing more items
4. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Recommendations in Action
Online store recommendations
Netflix.com
Provide recommendations for viewing more movies
5. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Recommendations in Action
Content recommendations
Any news portal or other content aggregator
Recommendations for articles, books, news stories
6. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
“It’s the Economy ...”
The Long Tail
Goodbye Pareto Principle, Hello Long Tail
Erik Brynjolfsson, Yu (Jeffrey) Hu, and Michael D. Smith,
used a log-linear curve to describe the relationship
between Amazon.com sales and sales ranking.
They found that a large proportion of Amazon.com’s book
sales come from obscure books that were not available in
brick-and-mortar stores.
They also found that consumer benefit from access to
increased product variety in online book stores is ten times
larger than their benefit from access to lower prices online!
9. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Java source code
Yooreeka!
Open Source, Machine Learning library
Search, recommendations, clustering, classification, and
combination of classifiers!
URL: http://code.google.com/p/yooreeka/
11. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
The Online Music Store Example
Frank’s music ratings
12. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
The Online Music Store Example
Constantine’s music ratings
13. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
The Online Music Store Example
Catherine’s music ratings
14. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Similarity
The notion of Similarity
Often based on the notion of distance
The smaller the distance, the greater the similarity
Similarity values, typically, constrained in [0,∞) or [0,1]
It is not necessary to define similarity formulas. E.g. if
d < ε (for some threshold ε) then similar, otherwise not.
Similarity could also be empirical or probabilistic
19. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Distance (formulas)
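Let X and Y be two vectors in $\mathbb{R}^N$ with components $X_i$ and $Y_i$
Minkowski or p-norm distance
$$d = \left( \sum_{i=1}^{N} |X_i - Y_i|^p \right)^{1/p} \qquad (1)$$
Manhattan distance (the case p = 1)
$$d = \sum_{i=1}^{N} |X_i - Y_i| \qquad (2)$$
Chebyshev or $L_\infty$ distance
$$d = \lim_{p \to \infty} \left( \sum_{i=1}^{N} |X_i - Y_i|^p \right)^{1/p} = \max_i |X_i - Y_i| \qquad (3)$$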
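A minimal Java sketch of these distances may help make them concrete. The class name `Distances` and its method names are hypothetical and are not part of Yooreeka. Manhattan is taken here as the p = 1 case of the Minkowski distance (the sum of absolute differences), and Chebyshev as the largest absolute coordinate difference, which is what the p → ∞ limit converges to.

```java
// Illustrative sketch only: class and method names are hypothetical, not Yooreeka API.
final class Distances {

    private Distances() {}

    // Minkowski (p-norm) distance as in equation (1); assumes p >= 1.
    static double minkowski(double[] x, double[] y, double p) {
        checkSameLength(x, y);
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.pow(Math.abs(x[i] - y[i]), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    // Manhattan distance: the Minkowski distance with p = 1.
    static double manhattan(double[] x, double[] y) {
        return minkowski(x, y, 1.0);
    }

    // Chebyshev (L-infinity) distance: the largest absolute coordinate difference.
    static double chebyshev(double[] x, double[] y) {
        checkSameLength(x, y);
        double max = 0.0;
        for (int i = 0; i < x.length; i++) {
            max = Math.max(max, Math.abs(x[i] - y[i]));
        }
        return max;
    }

    private static void checkSameLength(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }
}
```

For the vectors (1, 2, 3) and (4, 0, 3), the absolute differences are 3, 2 and 0, so `manhattan` gives 5.0 and `chebyshev` gives 3.0.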
23. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Similarity (formulas)
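Naïve Similarity
$$\mathrm{sim}_{\mathrm{Naive}} = \frac{\beta}{\beta + d} \qquad (4)$$
where d is the Euclidean distance.
Similarity I
$$\mathrm{sim}_{I} = 1 - \tanh(\sigma) \qquad (5)$$
where σ is the biased estimator of sample variance
Similarity II
$$\mathrm{sim}_{II} = \mathrm{sim}_{I} \times \frac{\mathrm{common}}{\mathrm{maximum}} \qquad (6)$$
There is more ... Jaccard, Tanimoto, and so on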
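The three formulas can be sketched in Java as below. This is an illustration under stated assumptions, not the Yooreeka implementation: the class `RatingSimilarity` is hypothetical, each user is represented as a map from item name to rating, σ is read as the root-mean-square rating difference over the items both users rated (dividing by n rather than n − 1), "common" is the number of items rated by both users, and "maximum" is the largest overlap that is possible, i.e. the smaller of the two rating counts.

```java
import java.util.Map;

// Illustrative sketch only: names and the reading of sigma/common/maximum are assumptions.
final class RatingSimilarity {

    private RatingSimilarity() {}

    // Equation (4): beta / (beta + d), where d is a (Euclidean) distance and beta > 0.
    static double naive(double d, double beta) {
        return beta / (beta + d);
    }

    // Equation (5): 1 - tanh(sigma), sigma = RMS rating difference over common items.
    static double simI(Map<String, Double> a, Map<String, Double> b) {
        int common = 0;
        double sumSq = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                double diff = e.getValue() - other;
                sumSq += diff * diff;
                common++;
            }
        }
        if (common == 0) {
            return 0.0; // assumption: no overlap means no evidence of similarity
        }
        double sigma = Math.sqrt(sumSq / common);
        return 1.0 - Math.tanh(sigma);
    }

    // Equation (6): simI scaled by the fraction of the possible overlap that is realized.
    static double simII(Map<String, Double> a, Map<String, Double> b) {
        int common = 0;
        for (String item : a.keySet()) {
            if (b.containsKey(item)) {
                common++;
            }
        }
        int maximum = Math.min(a.size(), b.size());
        if (maximum == 0) {
            return 0.0;
        }
        return simI(a, b) * common / maximum;
    }
}
```

The scaling in `simII` penalizes pairs of users who agree perfectly but only on a small part of what they rated: two users with identical ratings on two shared items out of three each get `simI` = 1 but `simII` = 2/3.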
26. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
The “best” Similarity formula
Which is the best similarity formula?
There is no such thing! It depends on the problem, the
data, the definition of ... “best”
Spertus, Sahami, and Büyükkökten (2005)
Evaluating similarity measures: a large-scale study in the
Orkut social network. Proceedings of the Eleventh ACM
SIGKDD International Conference on Knowledge Discovery
in Data Mining
The simple L2-based (cosine) similarity showed the best
empirical results among six similarity measures.
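For membership-style data, where an item is either present or absent (for example, the set of users belonging to a community), the cosine of two binary vectors reduces to the size of the intersection divided by the square root of the product of the set sizes. The sketch below shows that special case; the class name `SetCosine` is hypothetical, and the code is not taken from the cited study or from Yooreeka.

```java
import java.util.Set;

// Illustrative sketch only: cosine similarity for binary (set membership) data.
final class SetCosine {

    private SetCosine() {}

    // |A ∩ B| / sqrt(|A| * |B|); returns 0 when either set is empty.
    static <T> double similarity(Set<T> a, Set<T> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        // Iterate over the smaller set to keep the number of lookups low.
        Set<T> smaller = a.size() <= b.size() ? a : b;
        Set<T> larger = smaller == a ? b : a;
        int shared = 0;
        for (T item : smaller) {
            if (larger.contains(item)) {
                shared++;
            }
        }
        return shared / Math.sqrt((double) a.size() * b.size());
    }
}
```

Two sets of three members that share two of them score 2/3; disjoint sets score 0.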
31. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Tapestry
Experimental mail system by Goldberg et al. (circa 1992)
at Xerox PARC
41. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Item based
BeanShell script (Items)
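// ds is the music dataset created earlier in the session (not shown on this slide)
Delphi delphi = new Delphi(ds, RecommendationType.ITEM_BASED);
// Recommend items for a user
MusicUser mu1 = ds.pickUser("Bob");
delphi.recommend(mu1);
// Find items similar to a given item
MusicItem mi = ds.pickItem("La Bamba");
delphi.findSimilarItems(mi);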
42. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Item based
Peruse the code
Delphi
UserBasedSimilarity
ItemBasedSimilarity
BaseSimilarityMatrix
RatingCountMatrix
48. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Text Parsing & Analysis
No more ratings, what do we do?
Now we deal with documents
So, we need to define similarity based on the content of
the documents
Use Lucene’s StandardAnalyzer
Build your own! (see CustomAnalyzer)
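To illustrate what "similarity based on the content" can mean, here is a small plain-Java sketch that turns text into term-frequency vectors and compares two documents by the cosine of those vectors. It deliberately does not use Lucene: the naive tokenizer (lower-casing and splitting on anything that is not a letter or digit) only stands in for what an analyzer such as StandardAnalyzer does more carefully. The class name `TermVectors` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: a naive stand-in for a real text analyzer.
final class TermVectors {

    private TermVectors() {}

    // Lower-cases the text, splits on non-alphanumeric characters, and counts each term.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity of two sparse term-frequency vectors; 0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> v) {
        double sumSq = 0.0;
        for (int count : v.values()) {
            sumSq += (double) count * count;
        }
        return Math.sqrt(sumSq);
    }
}
```

With this tokenizer, `termFrequencies("Java, java and MORE Java!")` counts `java` three times and `and` and `more` once each. A production analyzer would typically also handle stop words and stemming, which this sketch omits.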
52. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Document representation
No more ratings!
56. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Netflix Prize Description
Netflix prize
More than 100 million ratings
480 thousand randomly-chosen, anonymous customers
18 thousand movie titles
60. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Lessons learned
Important considerations
Data normalization
Neighbor selection
How many neighbors?
Who are the “best” neighbors?
Neighbor weights
“Our experience is that most efforts should be
concentrated in deriving substantially different approaches,
rather than refining a single technique.”
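The three considerations (normalization, neighbor selection, neighbor weights) meet in a single neighborhood-based prediction. The sketch below shows one common textbook combination and is not the method of any Netflix Prize team: ratings are normalized by subtracting each user's mean, the k most similar neighbors are selected, and each neighbor's deviation is weighted by its similarity. The class `NeighborPrediction` and its parameters are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch only: mean-centered, similarity-weighted k-nearest-neighbor prediction.
final class NeighborPrediction {

    private NeighborPrediction() {}

    // userMean:        mean rating of the target user
    // sims[j]:         similarity between the target user and neighbor j
    // ratings[j]:      neighbor j's rating of the item being predicted
    // neighborMeans[j]: neighbor j's mean rating (used for normalization)
    // k:               how many of the most similar neighbors to use
    static double predict(double userMean, double[] sims, double[] ratings,
                          double[] neighborMeans, int k) {
        if (sims.length != ratings.length || sims.length != neighborMeans.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        // Neighbor selection: order candidate neighbors by decreasing similarity.
        Integer[] order = new Integer[sims.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> sims[i]).reversed());

        double weighted = 0.0;
        double totalWeight = 0.0;
        for (int n = 0; n < Math.min(k, order.length); n++) {
            int j = order[n];
            // Normalization: use the neighbor's deviation from their own mean.
            weighted += sims[j] * (ratings[j] - neighborMeans[j]);
            totalWeight += Math.abs(sims[j]);
        }
        // With no usable neighbors, fall back to the user's own mean.
        return totalWeight == 0.0 ? userMean : userMean + weighted / totalWeight;
    }
}
```

Changing `k` or swapping the similarity function changes the prediction, which is why the questions of how many neighbors to use and how to weight them deserve separate experiments.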
65. Introduction Basic Concepts Collaborative Filtering Content based Netflix Prize Summary
Important considerations
Business value validation - “Long Tail”, “niches to riches”,
etc.
Similarity metrics - Many to choose from, do not be afraid
to explore!
Collaborative Filtering: “Show me your friend ...”
User based
Item based
Content based recommendations - NLP challenges
Large scale implementations - Speed, data size, quality