8. 8
Algorithms in Mahout – Cont.
• Vector Similarity
– RowSimilarityJob (MR)
– VectorDistanceJob (MR)
• Other
– Collocations
• Non-MapReduce algorithms
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
9. 9
Mahout Focus on Scalability
• Goal: Be as fast and efficient as possible given the
intrinsic design of the algorithm
– Some algorithms won't scale to massive machine clusters
– Others fit logically on a Map Reduce framework like Apache
Hadoop
– Still others will need alternative distributed programming
models
– Be pragmatic
• Most Mahout implementations are Map Reduce
enabled
• (Always a) Work in Progress
10. 10
Prepare Data from Raw content
• Lucene integration
– bin/mahout lucenevector …
• Document Vectorizer
– bin/mahout seqdirectory …
– bin/mahout seq2sparse …
• Programmatically
– See the Utils module in Mahout
• Database (JDBC)
• File System (HDFS)
11. 11
Machine Learning
• “Machine Learning is programming computers
to optimize a performance criterion using
example data or past experience”
– Introduction to Machine Learning, E. Alpaydin
• Subset of Artificial Intelligence
• Lots of related fields:
– Information Retrieval
– Stats
– Biology
– Linear algebra
– Many more
15. 15
More use cases
• Recommend products/books/friends …
• Classify content into predefined groups
• Find similar content based on object properties
• Find associations/patterns in actions/behaviors
• Identify key topics in large collections of text
• Detect anomalies in machine output
• Rank search results (PageRank)
• Others
17. 17
Approach
• Collect User Preferences -> User vs Item Matrix
• Find Similar Users or Items (Neighborhood-based
approach)
• Works by finding similarly rated items in the user-
item-matrix (e.g. cosine, Pearson-Correlation,
Tanimoto Coefficient)
• Estimates a user's preference towards an item by
looking at his/her preferences towards similar items
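The three similarity measures named above can be sketched in a few lines. This is a minimal illustration, not Mahout's implementation: the function names and the sample ratings are assumptions made up for the example.

```javascript
// Cosine similarity of two equal-length rating vectors.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Pearson correlation: the cosine of the mean-centered vectors.
function pearson(a, b) {
  const mean = (v) => v.reduce((sum, x) => sum + x, 0) / v.length;
  const meanA = mean(a), meanB = mean(b);
  return cosine(a.map((x) => x - meanA), b.map((x) => x - meanB));
}

// Tanimoto coefficient: overlap of two sets, ignoring rating values.
function tanimoto(setA, setB) {
  let common = 0;
  for (const x of setA) if (setB.has(x)) common++;
  return common / (setA.size + setB.size - common);
}

// Hypothetical data: two items rated by the same two users (5,4 and 1,3),
// plus the sets of users who rated each item at all.
const itemX = [5, 4];
const itemY = [1, 3];
const ratersX = new Set(["Alice", "Peter"]);
const ratersY = new Set(["Alice", "Bob", "Peter"]);

console.log(cosine(itemX, itemY), pearson(itemX, itemY), tanimoto(ratersX, ratersY));
```

Cosine and Pearson use the rating values, while Tanimoto only looks at who rated what, which makes it a natural fit for 0/1 events such as views or purchases.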
18. 18
Collaborative Filtering – User Based
Find User Similarity
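1. How do we predict User 1's preference for Item 4?
2. Find n users who are similar to User 1 and have purchased
Item 4 (rated from their purchase records); call them User n
3. Take User n's ratings of Item 4 and fill in the result,
weighted by similarity
4. Repeat steps 1-3 for every user/item combination until all
empty cells are filled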
[Figure: user-item matrix; the unknown cell "?" in User 1's row is filled in from User n's rating]
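The steps on this slide can be sketched as follows. This is an illustrative sketch only: the ratings, the helper names, and the choice of cosine similarity over co-rated items are assumptions, not what Mahout does internally.

```javascript
// Hypothetical ratings: user -> { item -> rating }. user1 has not rated item4.
const ratings = {
  user1: { item1: 5, item2: 3 },
  user2: { item1: 5, item2: 3, item4: 4 },
  user3: { item1: 1, item2: 5, item4: 2 },
};

// Cosine similarity over the items both users have rated.
function similarity(a, b) {
  const common = Object.keys(a).filter((item) => item in b);
  if (common.length === 0) return 0;
  let dot = 0, normA = 0, normB = 0;
  for (const item of common) {
    dot += a[item] * b[item];
    normA += a[item] ** 2;
    normB += b[item] ** 2;
  }
  return dot / Math.sqrt(normA * normB);
}

// Predict `user`'s rating of `item` from the n most similar users who rated it.
function predictUserBased(data, user, item, n) {
  const neighbors = Object.keys(data)
    .filter((other) => other !== user && item in data[other])
    .map((other) => ({ other, sim: similarity(data[user], data[other]) }))
    .filter((x) => x.sim > 0)
    .sort((x, y) => y.sim - x.sim)
    .slice(0, n);
  let weighted = 0, total = 0;
  for (const { other, sim } of neighbors) {
    weighted += sim * data[other][item]; // similarity is the weight
    total += sim;
  }
  return total === 0 ? null : weighted / total;
}

console.log(predictUserBased(ratings, "user1", "item4", 2));
```

Because user2 agrees with user1 on every co-rated item, the prediction lands closer to user2's rating of 4 than to user3's rating of 2.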
20. 20
Test Drive of Mahout Recommender
• GroupLens dataset:
http://www.grouplens.org/node/12
• 1,000,209 anonymous ratings of 3,900 movies made
by 6,040 MovieLens users
• movies.dat (movie ids with title and category)
• ratings.dat (ratings of movies)
• users.dat (user information)
21. 21
Ratings File
• Each line of the ratings file has the format
UserID::MovieID::Rating::Timestamp
• Mahout requires the following CSV format
UserID,ItemID,Value
• tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > rating.csv
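The same conversion can be done programmatically when the ratings pass through application code first. A minimal sketch, assuming input lines in the `UserID::MovieID::Rating::Timestamp` format above; the function name and the sample line are illustrative.

```javascript
// Convert one "UserID::MovieID::Rating::Timestamp" line into "UserID,ItemID,Value".
function toCsvLine(line) {
  const fields = line.trim().split("::");
  if (fields.length !== 4) {
    throw new Error("Unexpected ratings line: " + line);
  }
  const [userId, movieId, rating] = fields; // the timestamp is dropped
  return [userId, movieId, rating].join(",");
}

// Sample line in the documented format.
console.log(toCsvLine("1::1193::5::978300760"));
```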
23. 23
Recommendation Result
• Recommendation Result will look like
UserID [ItemID:Weight, ItemID:Weight,…]
• Each line represents a UserID with associated
recommended ItemID
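A result line in this shape is easy to load into another system. The parser below is a sketch that assumes whitespace between the UserID and the bracketed list, as shown above; the function name and the sample IDs and weights are made up for illustration.

```javascript
// Parse "UserID [ItemID:Weight, ItemID:Weight, ...]" into a plain object.
function parseRecommendationLine(line) {
  const match = line.trim().match(/^(\S+)\s+\[(.*)\]$/);
  if (!match) throw new Error("Unexpected line format: " + line);
  const [, userId, body] = match;
  const items = body
    .split(",")
    .map((pair) => pair.trim())
    .filter((pair) => pair.length > 0)
    .map((pair) => {
      const [itemId, weight] = pair.split(":");
      return { itemId, weight: parseFloat(weight) };
    });
  return { userId, items };
}

// Hypothetical output line.
console.log(parseRecommendationLine("1\t[1566:5.0,1036:4.5]"));
```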
24. 24
Collect User Behavior Events
• Implicit (easy to collect)
– View
– Shopping Cart (0 or 1)
– Order or Buy (0 or 1)
– Duration Time (noisy)
• Explicit (hard to collect)
– Rating (0~5)
– Voting (0 or 1)
– Forward or Share (0 or 1)
– Add favorite (0 or 1)
– Tag (text analysis)
– Comments (text analysis)
25. 25
Process Event into Preference
• Group events by type and calculate similarity per event
type, e.g. Also View, Also Buy
• Weighting:
– Explicit Event > Implicit Event
– Order, Cart > View
• Noise Reduction
• Normalization
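One way to read the weighting, noise reduction, and normalization bullets is sketched below. The weights (view 1, cart 3, order 5), the "keep the strongest event" rule, and the field names are all assumptions chosen for illustration; the slides only state that order and cart outrank view.

```javascript
// Hypothetical weights: order > cart > view.
const EVENT_WEIGHTS = { view: 1, cart: 3, order: 5 };
const MAX_WEIGHT = Math.max(...Object.values(EVENT_WEIGHTS));

// Turn raw events into one preference value in (0, 1] per (user, item) pair.
function toPreferences(events) {
  const best = new Map();
  for (const { uid, pid, act } of events) {
    const weight = EVENT_WEIGHTS[act];
    if (weight === undefined) continue; // noise reduction: skip unknown event types
    const key = JSON.stringify([uid, pid]);
    const current = best.get(key);
    // Keep only the strongest event so repeated views do not inflate the score.
    if (!current || weight > current.weight) best.set(key, { uid, pid, weight });
  }
  // Normalization: scale by the largest weight.
  return [...best.values()].map(({ uid, pid, weight }) => ({
    uid,
    pid,
    value: weight / MAX_WEIGHT,
  }));
}

const events = [
  { uid: "u1", pid: "P1", act: "view" },
  { uid: "u1", pid: "P1", act: "view" },
  { uid: "u1", pid: "P1", act: "order" },
  { uid: "u1", pid: "P2", act: "cart" },
  { uid: "u2", pid: "P1", act: "view" },
  { uid: "u2", pid: "P3", act: "hover" }, // unknown type, dropped
];
console.log(toPreferences(events));
```

The resulting rows map directly onto the UserID,ItemID,Value format that the recommender job consumes.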
27. 27
Complementary
• Sometimes CF cannot generate enough
recommendations for all users
• Cold start problem
• New user and new item
• Some statistical approaches can be complementary
• Ranking is very easy to implement with MR: think Word
Count
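The Word Count comparison can be made concrete: a ranking job maps each event to (item, 1) and sums per item. The sketch below simulates the map and reduce steps in memory; it is not Hadoop code, and the event data and function names are hypothetical.

```javascript
// Map step: emit (itemId, 1) for every event, exactly like Word Count.
function mapEvent(event) {
  return [event.pid, 1];
}

// Reduce step: sum the counts per item.
function reduceCounts(pairs) {
  const counts = new Map();
  for (const [pid, one] of pairs) {
    counts.set(pid, (counts.get(pid) || 0) + one);
  }
  return counts;
}

// Ranking: sort by count, highest first; ties are broken by item ID.
function topN(counts, n) {
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1))
    .slice(0, n);
}

// Hypothetical view events.
const viewEvents = ["P1", "P2", "P1", "P3", "P1", "P2"].map((pid) => ({ pid }));
const ranking = topN(reduceCounts(viewEvents.map(mapEvent)), 2);
console.log(ranking);
```

A ranking like this needs no user history, so it can fill recommendation slots for new users and new items.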
29. 29
Data Process Flow
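[Diagram: data process flow. Front End: JavaScript on the page sends requests to the Event Collector (Nginx) and calls the Rec API and Item Mgmt. API. Backend: the access log goes through the Log Parser (preprocess & dispatch) into HDFS; the Core Engine (schedule & flow control) runs Mahout jobs (user based, item based) and MR jobs (ranking & stats); HBase holds the results. Admin: Dashboard & Mgmt Console.]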
30. 30
System Components
• Nginx
– Event Collector & Request Forwarder
• Log Parser
– Preprocess collected log and dispatch log to HDFS
• HDFS
– Fundamental storage of the system
• Core Engine
– Scheduling & Workflow Control
– Job Driver
• Management Console
– Dashboard (PV, UV, conversion rate)
– Scheduling, Log Viewer, System Configuration
31. 31
System Components – Cont.
• Recommendation Jobs
– Mahout jobs for CF
– MR jobs for Ranking
• HBase
– Recommendation Result for query
• Recommendation API
– API wrapper for frontend to query result from HBase table
– Handle business logic and policy here
• Item Management API
– API interface for frontend item management
– Allow List, Exception List
32. 32
HBase Table
Table       Rowkey      Column               Description
CATEGORY    CategoryID  column=f:id          Category ID
                        column=f:rank        ranking by view
                        column=f:rank_cart   ranking by cart
                        column=f:rank_order  ranking by order
                        column=f:rank_view   ranking by view
ERUID_USER  ErUid       column=f:uid         ERUID/UID mapping
USER_ERUID  uid         column=f:eruid       UID/ERUID mapping
SEARCH      Keyword     column=f:id          search ID
                        column=f:rec         item list
36. 36
Tracking Code Snippet
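<script id="etu-recommender" type="text/javascript">
// Host name of the recommender service (filled in by the page template).
var erHostname = '${erHostname}';
// Queue of tracking events, pushed before er.js is loaded.
var _qevents = _qevents || [];
_qevents.push({
  ${paramName} : '${paramValue}',
  ...
});
// Match the protocol of the current page; the prefix already ends with '/'.
var erUrlPrefix = ('https:' == document.location.protocol ?
    'https://' : 'http://') + erHostname + '/';
(function() {
  // Load er.js asynchronously; the timestamp defeats browser caching.
  var er = document.createElement('script');
  er.type = 'text/javascript';
  er.async = true;
  er.src = erUrlPrefix + 'er.js?' + (new Date().getTime());
  var currentJs = document.getElementById('etu-recommender');
  currentJs.parentNode.insertBefore(er, currentJs);
})();
</script>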
37. 37
Sample parameters for tracking a "view" action
#  Parameter Name  Parameter Type    Sample Value           Required
1  cid             String            "www.etusolution.com"  Yes
2  uid             String            "johnny_nien"          Yes
3  act             String            "view"                 Yes
4  pid             String            "P00001"               Yes
5  cat             String Array      [ "C", "C00001" ]      No, but treat it as required
6  avl *           Boolean (0 or 1)  1                      No
* Note: an explanation of "avl" will be available later
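For reference, a "view" event filled in with the sample values from the table might look like the object below. The check function is a hypothetical client-side helper written for this example, not part of er.js; it only confirms that the four required parameters are present before the event is pushed.

```javascript
// Parameters the table marks as required for a "view" action.
const REQUIRED_VIEW_PARAMS = ["cid", "uid", "act", "pid"];

// Return the names of required parameters that are missing or empty.
function missingViewParams(event) {
  return REQUIRED_VIEW_PARAMS.filter(
    (name) => typeof event[name] !== "string" || event[name] === ""
  );
}

// Sample values taken from the table above.
const viewEvent = {
  cid: "www.etusolution.com",
  uid: "johnny_nien",
  act: "view",
  pid: "P00001",
  cat: ["C", "C00001"],
  avl: 1,
};

const missing = missingViewParams(viewEvent);
if (missing.length === 0) {
  // In the page, this is where the event would go into the queue:
  // _qevents.push(viewEvent);
  console.log("event is complete");
} else {
  console.log("missing parameters: " + missing.join(", "));
}
```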
38. 38
Query Recommendations
<script id="etu-recommender" type="text/javascript">
// URL prefix of the recommender service (filled in by the page template).
var erUrlPrefix = '${erUrlPrefix}';
// Queue of recommendation queries, pushed before er.js is loaded.
var _qquery = _qquery || [];
_qquery.push({
  ${paramName} : '${paramValue}',
  ...
});
// Called with the original query parameters and the recommendation result.
function etuRecQueryCallBack(queryParams, queryResult) {
  // Implement your logic here.
}
(function() {
  // Load er.js asynchronously; the timestamp defeats browser caching.
  var er = document.createElement('script');
  er.type = 'text/javascript';
  er.async = true;
  er.src = erUrlPrefix + '/er.js?' + (new Date().getTime());
  var currentJs = document.getElementById('etu-recommender');
  currentJs.parentNode.insertBefore(er, currentJs);
})();
</script>
39. 39
Sample parameters for Also Buy … (Item based)
#  Parameter Name  Parameter Type  Sample Value           Required
1  cid             String          "www.etusolution.com"  Yes
2  type            String          "item"                 Yes
3  act             String          "order"                Yes
4  pid             String          "P001"                 Yes
5  cat             String          "C001"                 No, but highly recommended
44. 44
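Recommender Conversion Rate Analysis
Online Performance Tracking
[Diagram: Item A is reached either via the home page or any other page (PV1, UV1, U-Cart 1) or by clicking the recommendation list (PV2, UV2, U-Cart 2)]
• Recommended-item click-through rate = (PV2 or UV2) / (PV1 or UV1)
• Recommended-item conversion rate = U-Cart 2 (via the recommendation list) / (UV2 or PV2)
• PV: page views
• UV: unique visitors
• U-Cart: added to cart, counted by unique visitor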
Algorithm Benchmark
• Train vs Test (80-20)
• A/B test
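The click-through and conversion rates on this slide reduce to two divisions. The sketch below uses the PV variant of the click-through rate and the UV variant of the conversion rate; the function name and the sample counts are made up for illustration.

```javascript
// pv1: page views of the item reached via the home page or other pages
// pv2: page views reached by clicking the recommendation list
// uv2: unique visitors who arrived via the recommendation list
// uCart2: unique visitors who then added the item to the cart
function recommendationMetrics({ pv1, pv2, uv2, uCart2 }) {
  return {
    clickRate: pv1 === 0 ? null : pv2 / pv1,
    conversionRate: uv2 === 0 ? null : uCart2 / uv2,
  };
}

// Hypothetical counts for one item.
const metrics = recommendationMetrics({ pv1: 1000, pv2: 50, uv2: 40, uCart2: 4 });
console.log(metrics);
```

Tracking both numbers per algorithm variant is what makes an A/B test between recommenders comparable.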
45. 45
Summary
• Mahout is very useful if you would like to build a
machine learning application on top of Hadoop
• BUT, a recommendation system is more than just an algorithm
• DON'T reinvent the wheel. Leverage Mahout and
Hadoop
• Put most of your effort into integration, performance
tuning, and business logic
46. 46
Future Roadmap
• Offline to online integration -> Offline User Event
Collection
• 360 Degree CRM -> CRM Connector
• Social Recommendation -> Social Connector
• Retargeting -> Customer Behavior Data Warehouse
• Go real-time!
50. 50
Algorithms Examples –
Recommendation
• Prediction: Estimate Bob's preference towards "The
Matrix"
1. Look at all items that
– a) are similar to "The Matrix"
– b) have been rated by Bob
=> "Alien", "Inception"
2. Estimate the unknown preference with a weighted sum
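The weighted sum in step 2 can be sketched as below. Bob's ratings and the item similarities to "The Matrix" are hypothetical numbers chosen for the example, and the function is an illustration rather than Mahout's code.

```javascript
// Weighted sum: each of the user's ratings counts in proportion to how
// similar the rated item is to the target item.
function predictItemBased(userRatings, similarityToTarget) {
  let weighted = 0, total = 0;
  for (const [item, rating] of Object.entries(userRatings)) {
    const sim = similarityToTarget[item];
    if (sim === undefined) continue; // no known similarity to the target item
    weighted += sim * rating;
    total += Math.abs(sim);
  }
  return total === 0 ? null : weighted / total;
}

// Hypothetical inputs.
const bobRatings = { Alien: 2, Inception: 5 };
const similarityToMatrix = { Alien: 0.4, Inception: 0.8 };

// (0.4 * 2 + 0.8 * 5) / (0.4 + 0.8), up to floating-point rounding
console.log(predictItemBased(bobRatings, similarityToMatrix));
```

Dividing by the sum of the similarities keeps the estimate on the same scale as the original ratings.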
51. 51
Algorithms Examples –
Recommendation
• MapReduce phase 1
– Map – Make user the key
Input:
(Alice, Matrix, 5)
(Alice, Alien, 1)
(Alice, Inception, 4)
(Bob, Alien, 2)
(Bob, Inception, 5)
(Peter, Matrix, 4)
(Peter, Alien, 3)
(Peter, Inception, 2)
Output (keyed by user):
Alice (Matrix, 5)
Alice (Alien, 1)
Alice (Inception, 4)
Bob (Alien, 2)
Bob (Inception, 5)
Peter (Matrix, 4)
Peter (Alien, 3)
Peter (Inception, 2)
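The map step above amounts to re-keying each rating triple by user. A minimal in-memory sketch, using the ratings from this slide; the function name is an assumption and this is not Hadoop or Mahout code.

```javascript
// Rating triples as they appear on the slide: (user, item, rating).
const triples = [
  ["Alice", "Matrix", 5], ["Alice", "Alien", 1], ["Alice", "Inception", 4],
  ["Bob", "Alien", 2], ["Bob", "Inception", 5],
  ["Peter", "Matrix", 4], ["Peter", "Alien", 3], ["Peter", "Inception", 2],
];

// Map: emit the user as the key and (item, rating) as the value.
function mapByUser([user, item, rating]) {
  return [user, [item, rating]];
}

const mapped = triples.map(mapByUser);
console.log(mapped);
```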
52. 52
Algorithms Examples –
Recommendation
• MapReduce phase 1
– Reduce – Create inverted index
Input:
Alice (Matrix, 5)
Alice (Alien, 1)
Alice (Inception, 4)
Bob (Alien, 2)
Bob (Inception, 5)
Peter (Matrix, 4)
Peter (Alien, 3)
Peter (Inception, 2)
Output (one record per user):
Alice (Matrix, 5) (Alien, 1) (Inception, 4)
Bob (Alien, 2) (Inception, 5)
Peter (Matrix, 4) (Alien, 3) (Inception, 2)
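The reduce step collects every (item, rating) value that shares a user key into one record. The sketch below simulates that grouping in memory with the data from this slide; the function name is an assumption.

```javascript
// Keyed pairs as produced by the phase 1 map step.
const keyedPairs = [
  ["Alice", ["Matrix", 5]], ["Alice", ["Alien", 1]], ["Alice", ["Inception", 4]],
  ["Bob", ["Alien", 2]], ["Bob", ["Inception", 5]],
  ["Peter", ["Matrix", 4]], ["Peter", ["Alien", 3]], ["Peter", ["Inception", 2]],
];

// Reduce: gather all values with the same key into one list per user.
function groupByUser(pairs) {
  const grouped = new Map();
  for (const [user, itemRating] of pairs) {
    if (!grouped.has(user)) grouped.set(user, []);
    grouped.get(user).push(itemRating);
  }
  return grouped;
}

const userVectors = groupByUser(keyedPairs);
console.log(userVectors);
```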
53. 53
Algorithms Examples –
Recommendation
• MapReduce phase 2
– Map – Isolate all co-occurred ratings (all cases where a user
rated both items)
Input:
Alice (Matrix, 5) (Alien, 1) (Inception, 4)
Bob (Alien, 2) (Inception, 5)
Peter (Matrix, 4) (Alien, 3) (Inception, 2)
Output (keyed by item pair):
Matrix, Alien (5,1)
Matrix, Alien (4,3)
Alien, Inception (1,4)
Alien, Inception (2,5)
Alien, Inception (3,2)
Matrix, Inception (4,2)
Matrix, Inception (5,4)
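The phase 2 map step emits one record for every pair of items that a user has rated. The in-memory sketch below assumes each user's items are listed in a consistent order, as on the slide, so the same pair always produces the same key; a real job would typically order the pair by item ID. The function name is illustrative.

```javascript
// Per-user records as produced by the phase 1 reduce step.
const perUser = [
  ["Alice", [["Matrix", 5], ["Alien", 1], ["Inception", 4]]],
  ["Bob", [["Alien", 2], ["Inception", 5]]],
  ["Peter", [["Matrix", 4], ["Alien", 3], ["Inception", 2]]],
];

// Map: for every pair of items rated by the same user, emit
// ("itemA, itemB", [ratingA, ratingB]).
function emitCooccurrences(itemRatings) {
  const out = [];
  for (let i = 0; i < itemRatings.length; i++) {
    for (let j = i + 1; j < itemRatings.length; j++) {
      const [itemA, ratingA] = itemRatings[i];
      const [itemB, ratingB] = itemRatings[j];
      out.push([itemA + ", " + itemB, [ratingA, ratingB]]);
    }
  }
  return out;
}

const cooccurrences = perUser.flatMap(([, itemRatings]) => emitCooccurrences(itemRatings));

// Group by item pair to see the rating pairs a similarity measure would consume.
const byPair = new Map();
for (const [pair, ratingPair] of cooccurrences) {
  if (!byPair.has(pair)) byPair.set(pair, []);
  byPair.get(pair).push(ratingPair);
}
console.log(byPair);
```

Each item pair now carries the list of co-occurring ratings, which is the input needed to compute an item-item similarity.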