ApacheCon 2009 talk describing methods for doing intelligent (well, really clever at least) search on items with no or poor meta-data.
The video of the talk should be available shortly on the ApacheCon web-site.
3-6. Some Preliminaries
• Text retrieval = matrix multiplication
A: our corpus
  documents are rows
  terms are columns
for each document d:
  for each term t:
    s_d += a_dt * q_t
i.e. s_d = Σ_t a_dt q_t, or in matrix form: s = A q
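The nested loop and the sum above are exactly a matrix-vector product. A minimal NumPy sketch, using a made-up 3-document, 4-term corpus:

```python
import numpy as np

# Hypothetical toy corpus: 3 documents x 4 terms.
# A[d, t] is the weight of term t in document d (e.g. a tf-idf weight).
A = np.array([
    [1.0, 0.0, 2.0, 0.0],   # doc 0
    [0.0, 1.0, 0.0, 1.0],   # doc 1
    [1.0, 1.0, 0.0, 0.0],   # doc 2
])

# Query vector: nonzero weight for each query term (terms 0 and 2 here).
q = np.array([1.0, 0.0, 1.0, 0.0])

# The doubly nested loop, s_d = sum_t a_dt * q_t, is a matrix-vector product:
s = A @ q

print(s)   # per-document scores: [3. 0. 1.], so doc 0 ranks first
```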
8-9. More Preliminaries
• Recommendation = Matrix multiply
A: our users' histories
  users are rows
  items are columns
Users who bought items in the list h also bought items in the list r
for each user u:
  for each item t1:
    for each item t2:
      r_t1 += a_u,t1 a_u,t2 h_t2
i.e. r = A'A h
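Reassociating the triple loop gives r = A'(A h): first find the users whose histories overlap h, then add up everything those users touched. A NumPy sketch with a made-up 4-user, 3-item history matrix:

```python
import numpy as np

# Hypothetical toy histories: 4 users x 3 items.
# A[u, i] = 1 if user u bought/watched item i.
A = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
], dtype=float)

# h: the current user's history (they interacted with item 0 only).
h = np.array([1.0, 0.0, 0.0])

# Triple loop from the slide: r[t1] += A[u, t1] * A[u, t2] * h[t2].
# A @ h scores each user by overlap with h; A.T @ (...) sums those
# users' histories into item scores.
r = A.T @ (A @ h)

print(r)   # [3. 2. 1.]: item 0 is what h already has, item 1 is next best
```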
14. Why so ish?
• In real life, ish happens because:
• Big data ... so we selectively sample
• Sparse data ... so we smooth
• Finite computers ... so we sparsify
• Top-40 effect ... so we use some stats
15. The same in spite of ish
• The shape of the computation is unchanged
• The cost of the computation is unchanged
• Broad algebraic conclusions still hold
17. Dyadic Structure
● Functional
– Interaction: actor -> item*
● Relational
– Interaction ⊆ Actors x Items
● Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions
● Predict missing observations
18. Fundamental Algorithmics
● Cooccurrence: K = A'A
● A is actors x items, K is items x items
● The product has the general shape of a matrix multiplication
● K tells us “users who interacted with x also
interacted with y”
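The cooccurrence product is easy to see on a toy example; the interaction matrix here is made up:

```python
import numpy as np

# Hypothetical actors x items interaction matrix (3 users x 3 items).
A = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
], dtype=float)

# Cooccurrence: K = A'A is items x items.
K = A.T @ A

# K[x, y] counts users who interacted with both x and y:
# "users who interacted with x also interacted with y".
print(K)   # e.g. K[0, 1] == 2: two users touched both item 0 and item 1
```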
23-26. What we have:
For a user who watched/bought/listened to this
Sum over all other users who watched/bought/...
Add up what they watched/bought/listened to
And recommend that
ish
33. But why not ...
Why just dyadic learning?
Why not triadic learning?
34. But why not ...
Why just dyadic learning?
Why not p-adic learning?
35. For example
● Users enter queries (A)
– (actor = user, item = query)
● Users view videos (B)
– (actor = user, item = video)
● A'A gives query recommendation
– “did you mean to ask for”
● B'B gives video recommendation
– “you might like these videos”
36. The punch-line
● B'A recommends videos in response to a query
– (isn't that a search engine?)
– (not quite, it doesn't look at content or meta-data)
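The cross product B'A can be sketched the same way; the log matrices here are made up (3 users, 2 query terms, 3 videos):

```python
import numpy as np

# Hypothetical logs for 3 users.
# A: users x query terms; B: users x videos.
A = np.array([
    [1, 0],    # user 0 searched for term 0
    [1, 1],
    [0, 1],
], dtype=float)
B = np.array([
    [1, 1, 0],  # user 0 watched videos 0 and 1
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)

# Cross-cooccurrence: (videos x users) @ (users x terms) = videos x terms.
# Entry [v, t] counts users who searched for term t and watched video v.
BtA = B.T @ A

# Score videos for a query on term 0: search with no content or meta-data.
q = np.array([1.0, 0.0])
scores = BtA @ q

print(scores)   # [1. 2. 0.]: video 1 ranks first for this query
```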
37. Real-life example
● Query: “Paco de Lucia”
● Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
● Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
40. System Diagram
[Diagram: on Hadoop, viewing logs are counted into (user, video) pairs and search logs into (user, query-term) pairs; a selective sampler feeds an LLR + sparsify step that produces related videos (v => v1 v2 ...), which are joined on video to produce related terms (v => t1 t2 ...)]
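The “llr + sparsify” step scores each cooccurrence cell with the log-likelihood ratio test from the paper cited at the end, keeping only strongly associated pairs. A minimal sketch following the entropy formulation used in Mahout (the counts in the example are made up):

```python
import math

def x_log_x(x):
    # x * log(x), with the 0 * log(0) = 0 convention.
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized entropy N*H of a list of counts, N = sum of counts.
    total = sum(counts)
    return x_log_x(total) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: users who saw both A and B; k12: A but not B;
    k21: B but not A; k22: neither.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# Independent counts score near zero; strong association scores high,
# so thresholding on LLR sparsifies the cooccurrence matrix.
print(llr(10, 10, 10, 10))   # ~0: no association
print(llr(10, 0, 0, 10))     # large: perfect association
```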
41. Indexing
[Diagram: related terms (v => t1 t2 ...), related videos (v => v1 v2 ...), and video meta-data (v => url, title, ...) are joined on video in Hadoop and loaded into a Lucene index (possibly distributed with Katta)]
42. Hypothetical Example
● Want a navigational ontology?
● Just put labels on a web page with traffic
– This gives A = users x label clicks
● Remember viewing history
– This gives B = users x items
● Cross recommend
– BʼA = click to item mapping
● After several users click, results are whatever
users think they should be
43. Resources
● My blog
– http://tdunning.blogspot.com/
● The original LLR in NLP paper
– Accurate Methods for the Statistics of Surprise and Coincidence
(check on citeseer)
● Source code
– Mahout project
– contact me (tdunning@apache.org)