The document discusses implementations for scalable machine learning algorithms like collaborative filtering, clustering, classification, and dimensionality reduction. It provides examples of calculating similarity between users based on common movies rated and comparing different similarity metrics. It also discusses challenges like sparsity in rating datasets and techniques for improving performance like dimensionality reduction.
3. Similarity – Number of Common Movies between users
SIM(US1, US2) = 0, SIM(US1, US3) = 3
Threshold for Similarity
The more movies a user watches, the more similar he appears to other users
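The common-movie count and threshold above can be sketched as follows. This is a minimal illustration, not the deck's implementation: the `watched` data, the function names, and the threshold value are hypothetical, chosen only so that the counts match the slide (0 and 3). The Jaccard variant is a commonly used normalization that is not in the slides; it is included only to show one way to dampen the bias toward users who watch many movies.

```python
# Hypothetical data: user id -> set of movie ids the user has rated.
watched = {
    "US1": {"M1", "M2", "M3"},
    "US2": {"M4", "M5"},
    "US3": {"M1", "M2", "M3", "M6"},
}


def common_movies(a, b, data):
    """Similarity as the raw number of movies both users have rated."""
    return len(data[a] & data[b])


def similar_users(user, data, threshold):
    """Users whose common-movie count with `user` reaches the threshold."""
    return [
        other
        for other in data
        if other != user and common_movies(user, other, data) >= threshold
    ]


def jaccard(a, b, data):
    """Assumed alternative (not from the slides): overlap divided by union,
    so heavy watchers do not look similar to everyone."""
    union = data[a] | data[b]
    return len(data[a] & data[b]) / len(union) if union else 0.0


print(common_movies("US1", "US2", watched))
print(common_movies("US1", "US3", watched))
print(similar_users("US1", watched, threshold=2))
```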
7. Netflix Prize: October 2006 – 1 Million Dollars
Training Data Set
Users – 480,000
Movies – 18,000
Pairs – 100 Million
Ratings : 1- 5
Test Data Set
Ratings to be predicted – 1.5 Million Pairs
Metric – RMSE
Cinematch – 0.9514
Best RMSE – 0.8563 (achieved by BellKor’s Pragmatic Chaos)
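For reference, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual ratings. The sketch below computes it for a handful of made-up ratings and then derives the relative improvement implied by the two scores on the slide. The sample ratings and function name are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(predicted))


# Hypothetical predictions against true ratings on the 1-5 scale.
sample_rmse = rmse([3.5, 4.0, 2.0, 5.0], [4, 4, 1, 5])

# Relative improvement of the best score over Cinematch (figures from the slide).
cinematch, best = 0.9514, 0.8563
improvement = (cinematch - best) / cinematch

print(f"sample RMSE: {sample_rmse:.4f}")
print(f"improvement over Cinematch: {improvement:.1%}")
```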
13. Training Data Set
Users – 480,000
Movies – 18,000
Ratings – 100 Million
Sparse Matrix
Possible pairings – 480,000 * 18,000 = 8.64 Billion
Pairs present ≈ 1.16%
Best Representation:
(Key, Value) pair
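A minimal sketch of the (key, value) representation, assuming the key is a (user, movie) tuple and the value is the rating. Only observed ratings are stored, so memory grows with the 100 million ratings rather than the 8.64 billion possible cells. The sample entries and helper names are hypothetical.

```python
# Hypothetical sample: only observed (user, movie) pairs are stored.
ratings = {
    ("U1", "M1"): 5,
    ("U1", "M7"): 3,
    ("U2", "M1"): 4,
}


def get_rating(store, user, movie):
    """Return the rating, or None when the pair was never rated."""
    return store.get((user, movie))


def density(num_ratings, num_users, num_movies):
    """Fraction of the full user x movie matrix that is actually filled."""
    return num_ratings / (num_users * num_movies)


print(get_rating(ratings, "U1", "M7"))
print(get_rating(ratings, "U2", "M7"))
print(f"{density(100_000_000, 480_000, 18_000):.2%}")
```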
14. Similarity Matrix Computation
Time Complexity
User based Similarity :
For all user pairs: Sim(UserVector, UserVector)
Number of users = 480,000
Number of user pairs = 480,000 * 480,000= 230 Billion user pairs
Number of comparisons for one similarity value = 18,000
Total computations = 230 Billion * 18,000 = 4,140 Trillion operations
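The arithmetic above can be checked directly. The sketch also shows two assumptions that are not in the slides: similarity is symmetric, so only unordered pairs need computing, and with vectors stored sparsely the per-pair cost depends on how many movies a user rated (about 208 on average here) rather than on all 18,000. Cosine similarity is used purely as an example measure; the slides do not name one.

```python
import math

NUM_USERS = 480_000
NUM_MOVIES = 18_000
NUM_RATINGS = 100_000_000

ordered_pairs = NUM_USERS * NUM_USERS                # what the slide counts
unordered_pairs = NUM_USERS * (NUM_USERS - 1) // 2   # symmetric, no self-pairs
dense_ops = ordered_pairs * NUM_MOVIES               # dense vector comparison
avg_ratings_per_user = NUM_RATINGS / NUM_USERS


def sparse_cosine(u, v):
    """Cosine similarity of two users stored as {movie_id: rating} dicts.

    Only movies present in the smaller dict are visited.
    """
    if len(u) > len(v):
        u, v = v, u
    dot = sum(rating * v[movie] for movie, rating in u.items() if movie in v)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(f"{ordered_pairs:,} ordered pairs, {dense_ops:,} dense operations")
print(f"{unordered_pairs:,} unordered pairs")
print(f"{avg_ratings_per_user:.0f} ratings per user on average")
```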
18. User Based – Similarity Between Users
Product Based – Similarity Between Products
Click Based – Based on user Clicks/Likes
Content Based – Based on tags, reviews, ratings.
21. The Firm ∼ The Rainmaker
The Bourne Identity ∼ The Bourne Ultimatum
Uniform Weight
Weighted Parameters
Title                  Author         Category  Year
The Firm               John Grisham   Thriller  1991
The Bourne Identity    Robert Ludlum  Thriller  1980
The Bourne Ultimatum   Robert Ludlum  Thriller  1990
The Rainmaker          John Grisham   Thriller  1995
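One way to read "Uniform Weight" versus "Weighted Parameters" is as a weighted count of matching attributes. The sketch below is an illustration under that assumption: the scoring function and the weight values are hypothetical, and only author and category are compared, since the year would need a closeness measure rather than an exact match.

```python
# Attribute values taken from the table above.
books = {
    "The Firm": {"author": "John Grisham", "category": "Thriller", "year": 1991},
    "The Bourne Identity": {"author": "Robert Ludlum", "category": "Thriller", "year": 1980},
    "The Bourne Ultimatum": {"author": "Robert Ludlum", "category": "Thriller", "year": 1990},
    "The Rainmaker": {"author": "John Grisham", "category": "Thriller", "year": 1995},
}


def attribute_similarity(a, b, weights):
    """Weighted share of attributes on which two items agree (0.0 to 1.0)."""
    total = sum(weights.values())
    matched = sum(w for attr, w in weights.items() if a[attr] == b[attr])
    return matched / total


uniform = {"author": 1.0, "category": 1.0}   # every attribute counts equally
weighted = {"author": 0.7, "category": 0.3}  # hypothetical: author matters more

firm, rainmaker = books["The Firm"], books["The Rainmaker"]
identity = books["The Bourne Identity"]

print(attribute_similarity(firm, rainmaker, uniform))
print(attribute_similarity(firm, identity, uniform))
print(attribute_similarity(firm, identity, weighted))
```

With uniform weights, a shared category alone already gives two thrillers by different authors a score of one half; raising the author weight pulls that score down while leaving same-author pairs at the top.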
22. Problem:
User Reads a news article
Find Similar news articles
Don’t return the same news article.
How to convert document into a vector?
Extract all the words
Remove stop words
Identify Named Entities
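The first two steps above can be sketched with the standard library alone. This is a minimal bag-of-words illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, cosine similarity is an assumed choice of measure, and named-entity identification is left out because it needs an NLP library. The query article is skipped so that the same article is never returned.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is", "for", "by"}


def to_vector(text):
    """Extract lowercase words, drop stop words, and count what remains."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity of two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(article_id, articles):
    """Id of the closest article, excluding the query article itself."""
    query = to_vector(articles[article_id])
    scores = {
        other: cosine(query, to_vector(text))
        for other, text in articles.items()
        if other != article_id
    }
    return max(scores, key=scores.get) if scores else None


# Hypothetical articles.
articles = {
    "a1": "The central bank raised interest rates",
    "a2": "Interest rates raised again by the central bank",
    "a3": "Local team wins the football final",
}
print(most_similar("a1", articles))
```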
23. New Movie
- No views (or few views)
- No similar Movies
New User
- No ratings (or few ratings)
- No similar Users