- S.V.Giri
Apache Mahout provides implementations of scalable machine learning algorithms
-- Wikipedia
Machine Learning Algorithms
- Collaborative Filtering
- Clustering
- Classification
- Dimensionality reduction
- Anomaly detection
2
Similarity – Number of common movies between users
SIM(US1, US2) = 0, SIM(US1, US3) = 3
Threshold for Similarity
Drawback: the more movies a user watches, the more similar that user appears to everyone else
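A minimal sketch of this count-based similarity, assuming each user is represented by the set of movies they have watched. The watch lists below are hypothetical (the slide does not list them) and are chosen only so that the counts match SIM(US1, US2) = 0 and SIM(US1, US3) = 3. The Tanimoto coefficient is included as one common way to normalise the count by how much each user has watched.

```python
# Hypothetical watch lists, chosen to reproduce the counts on the slide.
watched = {
    "US1": {"SW1", "SW2", "LOTR1"},
    "US2": {"Notting Hill", "Mean Girls"},
    "US3": {"SW1", "SW2", "LOTR1", "Notting Hill", "Mean Girls"},
}


def common_count(a, b):
    """Number of movies both users have watched."""
    return len(a & b)


def tanimoto(a, b):
    """Shared movies divided by all movies either user watched (0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(common_count(watched["US1"], watched["US2"]))
print(common_count(watched["US1"], watched["US3"]))
print(tanimoto(watched["US1"], watched["US3"]))
```

Because the raw count can only grow as a user watches more, a heavy viewer overlaps with almost everyone; dividing by the size of the union is one way to damp that effect.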
3
Cosine Similarity
Tanimoto Coefficient
Pearson Correlation Coefficient
Euclidean Distance
LogLikelihood Similarity
Spearman Rank Correlation
4
- A measure of similarity between 2 vectors
- Values from 0 to 1 when all components are non-negative, as with ratings (-1 to 1 in general)
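$$
\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
           = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \; \sqrt{\sum_{i=1}^{n} y_i^2}}
$$
5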
Cos(US1,US2) = (5*0 + 4*2 + 0*4 + 1*5) / (8.19*6.71) = 0.24
Cos(US1,US3) = (5*5 + 4*5 + 5*5 + 0*2 + 1*1) / (8.19*8.94) = 0.97
Cos(US2,US4) = (0*1 + 4*4 + 5*3) / (6.71*5.09) = 0.91
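The calculations above can be sketched in a few lines of Python. The ratings are taken from the user/movie table shown later in the deck; storing them as dictionaries and treating a missing rating ("-") as contributing nothing to the dot product or the norm is an assumption that matches how the slide's numbers were obtained.

```python
import math

# Ratings from the deck's table; a missing rating ("-") is simply absent.
ratings = {
    "US1": {"SW1": 5, "SW2": 4, "LOTR1": 5, "Notting Hill": 0, "Mean Girls": 1},
    "US2": {"SW1": 0, "SW2": 2, "Notting Hill": 4, "Mean Girls": 5},
    "US3": {"SW1": 5, "SW2": 5, "LOTR1": 5, "Notting Hill": 2, "Mean Girls": 1},
    "US4": {"SW1": 1, "Notting Hill": 4, "Mean Girls": 3},
}


def cosine(a, b):
    """Cosine similarity of two sparse rating vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no information to compare
    return dot / (norm_a * norm_b)


for pair in [("US1", "US2"), ("US1", "US3"), ("US2", "US4")]:
    print(pair, round(cosine(ratings[pair[0]], ratings[pair[1]]), 2))
```

With this data, `cosine(ratings["US1"], ratings["US3"])` comes out at about 0.97 and `cosine(ratings["US2"], ratings["US4"])` at about 0.91.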
6
Netflix Prize: launched October 2006, 1 million dollar prize
Training Data Set
Users – 480,000
Movies – 18,000
Pairs – 100 Million
Ratings: 1 to 5
Test Data Set
Ratings to be predicted – 1.5 Million Pairs
Metric – RMSE
Cinematch – 0.9514
Best RMSE – 0.8563 (achieved by BellKor’s Pragmatic Chaos)
7
Actual Values –
(us1,mv1,5)(u2,mv1,3)(u3,mv2,1)(u5,mv6,3)
Predicted Values –
(us1,mv1,4)(u2,mv1,3)(u3,mv2,2)(u5,mv6,2)
RMSE = √(((5-4)²+(3-3)²+(1-2)²+(3-2)²)/4)
= √0.75 ≈ 0.87
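As a small sketch, the same RMSE can be computed directly from the actual and predicted triples on this slide. The dictionary layout keyed by (user, movie) is an assumption made for illustration.

```python
import math

# (user, movie) -> rating, copied from the slide.
actual = {("us1", "mv1"): 5, ("u2", "mv1"): 3, ("u3", "mv2"): 1, ("u5", "mv6"): 3}
predicted = {("us1", "mv1"): 4, ("u2", "mv1"): 3, ("u3", "mv2"): 2, ("u5", "mv6"): 2}


def rmse(actual, predicted):
    """Root of the mean squared error over all pairs present in `actual`."""
    squared_errors = [(actual[key] - predicted[key]) ** 2 for key in actual]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(rmse(actual, predicted))
```

The division by the number of pairs happens inside the square root: the squared errors sum to 3, the mean is 0.75, and the root is taken last.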
8
(US4, SW2) = ??
Average of all the other user ratings for this movie
= (4+2+5)/3 = round(3.67) = 4
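A minimal sketch of this baseline, assuming the movie in question is SW2 (the one US4 has not rated, with other users' ratings of 4, 2 and 5) and that ratings are stored per user as dictionaries. The function name `item_average` is hypothetical.

```python
# Ratings from the deck's table; missing ratings are absent.
ratings = {
    "US1": {"SW1": 5, "SW2": 4, "LOTR1": 5, "Notting Hill": 0, "Mean Girls": 1},
    "US2": {"SW1": 0, "SW2": 2, "Notting Hill": 4, "Mean Girls": 5},
    "US3": {"SW1": 5, "SW2": 5, "LOTR1": 5, "Notting Hill": 2, "Mean Girls": 1},
    "US4": {"SW1": 1, "Notting Hill": 4, "Mean Girls": 3},
}


def item_average(ratings, item, exclude_user):
    """Mean rating of `item` over every other user who rated it, or None."""
    values = [prefs[item] for user, prefs in ratings.items()
              if user != exclude_user and item in prefs]
    if not values:
        return None
    return sum(values) / len(values)


average = item_average(ratings, "SW2", exclude_user="US4")
prediction = round(average)
print(average, prediction)
```

This baseline ignores who the target user is, so every user without a rating gets the same prediction for a given movie.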
9
10
Sim(US4,US1) = 0.19
Sim(US4,US2) = 0.91
Sim(US4,US3)= 0.35
US4 is most similar to US2
Hence Rating(US4,SW2) = Rating(US2,SW2) = 2
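The nearest-neighbour step can be sketched as follows, assuming cosine similarity over the deck's rating table and copying the rating of the single most similar user who has rated the movie. `predict_from_nearest` is a hypothetical helper name, not a library function.

```python
import math

ratings = {
    "US1": {"SW1": 5, "SW2": 4, "LOTR1": 5, "Notting Hill": 0, "Mean Girls": 1},
    "US2": {"SW1": 0, "SW2": 2, "Notting Hill": 4, "Mean Girls": 5},
    "US3": {"SW1": 5, "SW2": 5, "LOTR1": 5, "Notting Hill": 2, "Mean Girls": 1},
    "US4": {"SW1": 1, "Notting Hill": 4, "Mean Girls": 3},
}


def cosine(a, b):
    """Cosine similarity of two sparse rating vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_from_nearest(ratings, user, item):
    """Copy the rating of the most similar user who has rated `item`."""
    candidates = [(cosine(ratings[user], prefs), other)
                  for other, prefs in ratings.items()
                  if other != user and item in prefs]
    if not candidates:
        return None
    _, nearest = max(candidates)
    return ratings[nearest][item]


print(predict_from_nearest(ratings, "US4", "SW2"))
```

Using a single neighbour keeps the example close to the slide; averaging over several neighbours weighted by similarity is a common refinement.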
11
Sim(US5,US2) = 0.955
Naive prediction: Rating(US5,SW2) = Rating(US2,SW2) = 2
Avg(US2) = 3, Avg(US5) = 2
Adjusted for each user's average: Rating(US5,SW2) = Rating(US2,SW2) + Avg(US5) - Avg(US2) = 2 + 2 - 3 = 1
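The adjustment above is a one-line formula; the sketch below wraps it in a hypothetical helper. Clamping the result to the 1 to 5 rating scale is an added assumption, not something stated on the slide.

```python
def mean_adjusted_rating(neighbor_rating, neighbor_mean, target_mean, low=1, high=5):
    """Shift the neighbour's rating by the gap between the two users' averages.

    The clamp to [low, high] is an assumption for illustration.
    """
    adjusted = neighbor_rating + (target_mean - neighbor_mean)
    return min(high, max(low, adjusted))


# Values from the slide: Rating(US2, SW2) = 2, Avg(US2) = 3, Avg(US5) = 2.
print(mean_adjusted_rating(neighbor_rating=2, neighbor_mean=3, target_mean=2))
```

The idea is that US5 rates one point lower than US2 on average, so a 2 from US2 corresponds to a 1 from US5.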
12
Training Data Set
Users – 480,000
Movies – 18,000
Ratings – 100 Million
Sparse Matrix
Possible pairings – 480,000 * 18,000 = 8.64 Billion
Pairs present – 100 Million / 8.64 Billion ≈ 1.2%
Best Representation:
(Key, Value) pair
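A short sketch of the arithmetic and of a key-value layout, assuming a plain Python dictionary keyed by (user, movie). The three sample entries come from the deck's rating table; `get_rating` is a hypothetical helper.

```python
users = 480_000
movies = 18_000
known_ratings = 100_000_000

possible_pairs = users * movies           # size of the full user x movie matrix
density = known_ratings / possible_pairs  # fraction of cells that hold a rating

print(possible_pairs, round(density * 100, 2))

# A sparse store keeps only the ratings that exist, keyed by (user, movie).
sparse = {("US1", "SW1"): 5, ("US1", "SW2"): 4, ("US2", "SW2"): 2}


def get_rating(store, user, movie):
    """Return the stored rating, or None when the pair was never rated."""
    return store.get((user, movie))


print(get_rating(sparse, "US1", "SW2"), get_rating(sparse, "US2", "LOTR1"))
```

Memory then grows with the number of known ratings rather than with the number of possible user/movie pairs.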
13
Similarity Matrix Computation
Time Complexity
User based Similarity:
For all pairs of users: Sim(UserVector, UserVector)
Number of users = 480,000
Number of user pairs = 480,000 * 480,000 = 230 Billion user pairs
Number of comparisons for one similarity value = 18,000
Total Computations = 230 Billion * 18,000 = 4,140 Trillion Operations
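The estimate above can be reproduced exactly with integer arithmetic. The last two lines add one observation that is not on the slide: similarity is symmetric, so only unordered pairs of distinct users need to be computed, which roughly halves the work without changing its order of magnitude.

```python
users = 480_000
movies = 18_000

ordered_pairs = users * users               # every (user, user) combination
total_operations = ordered_pairs * movies   # one comparison per movie per pair

print(ordered_pairs)      # about 230 billion
print(total_operations)   # about 4,147 trillion

# Sim(a, b) == Sim(b, a), so unordered pairs of distinct users suffice.
unordered_pairs = users * (users - 1) // 2
print(unordered_pairs)
```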
14
Dimensionality Reduction:
SVD (Singular Value Decomposition)
MinHashing
Locality Sensitive Hashing (LSH)
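To give a feel for one of these techniques, here is a minimal MinHash sketch, not the deck's or Mahout's implementation. It assumes each user is a non-empty set of movie titles (the two sets below are hypothetical), uses 128 random hash functions of the form (a*x + b) mod p, and estimates Jaccard similarity as the fraction of matching signature positions.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime used as the hash modulus


def make_hash_params(count, seed=0):
    """Draw `count` random (a, b) pairs for h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]


def minhash_signature(items, hash_params):
    """For each hash function, keep the smallest hash value over a non-empty set."""
    keys = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * k + b) % PRIME for k in keys) for a, b in hash_params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


params = make_hash_params(128)
# Hypothetical sets of movies each user watched.
us1 = {"SW1", "SW2", "LOTR1", "Mean Girls"}
us3 = {"SW1", "SW2", "LOTR1", "Notting Hill", "Mean Girls"}
estimate = estimate_jaccard(minhash_signature(us1, params),
                            minhash_signature(us3, params))
print(estimate)  # an approximation of the true Jaccard similarity, 4/5
```

Each user is reduced from a vector with one entry per movie to a fixed-length signature, so comparing two users costs 128 comparisons regardless of catalogue size. LSH then groups users whose signatures share bands, so that only likely-similar pairs are compared at all.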
15
US1 SW1 5
US1 SW2 4
US1 LOTR1 5
US1 Notting Hill 0
US1 Mean Girls 1
US2 SW1 0
US2 SW2 2
US2 LOTR1 -
…
16
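The user/movie/rating rows on slide 16 can be loaded into a nested dictionary with a few lines of Python. This is a sketch under two assumptions: fields are separated by whitespace with the user first and the rating last (so multi-word titles sit in between), and "-" marks a missing rating. Only the rows shown on the slide are included.

```python
rows = """US1 SW1 5
US1 SW2 4
US1 LOTR1 5
US1 Notting Hill 0
US1 Mean Girls 1
US2 SW1 0
US2 SW2 2
US2 LOTR1 -""".splitlines()


def parse_preferences(rows):
    """Build {user: {movie: rating}}, skipping rows whose rating is '-'."""
    preferences = {}
    for row in rows:
        user, *title_words, value = row.split()
        if value == "-":
            continue  # no rating recorded for this pair
        preferences.setdefault(user, {})[" ".join(title_words)] = int(value)
    return preferences


preferences = parse_preferences(rows)
print(preferences["US1"])
print(preferences["US2"])
```

Skipping the "-" rows keeps the structure sparse: a missing rating is absent rather than stored as a zero, which would otherwise be confused with an actual rating of 0.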
17
User Based – Similarity Between Users
Product Based – Similarity Between Products
Click Based – Based on user Clicks/Likes
Content Based – Based on tags, reviews, ratings.
18
19
Cos(SW1,SW2)= 0.94
Cos(SW1, Notting Hill)= 0.33
Cos(Mean Girls, Notting Hill)= 0.94
Movie         US1  US2  US3  US4
SW1             5    0    5    1
SW2             4    2    5    -
LOTR1           5    -    5    -
Notting Hill    0    4    2    4
Mean Girls      1    5    1    3
20
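Item-based similarity uses the same cosine measure, applied to the rows of this table (one vector per movie, one component per user) instead of the columns. The sketch below assumes the ratings are stored per user and pivots them into per-movie vectors first; `item_vectors` is a hypothetical helper name.

```python
import math

ratings = {
    "US1": {"SW1": 5, "SW2": 4, "LOTR1": 5, "Notting Hill": 0, "Mean Girls": 1},
    "US2": {"SW1": 0, "SW2": 2, "Notting Hill": 4, "Mean Girls": 5},
    "US3": {"SW1": 5, "SW2": 5, "LOTR1": 5, "Notting Hill": 2, "Mean Girls": 1},
    "US4": {"SW1": 1, "Notting Hill": 4, "Mean Girls": 3},
}


def item_vectors(ratings):
    """Pivot {user: {movie: rating}} into {movie: {user: rating}}."""
    items = {}
    for user, prefs in ratings.items():
        for movie, value in prefs.items():
            items.setdefault(movie, {})[user] = value
    return items


def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


items = item_vectors(ratings)
print(round(cosine(items["SW1"], items["SW2"]), 2))
print(round(cosine(items["Mean Girls"], items["Notting Hill"]), 2))
```

Both printed values come out at about 0.94, reflecting that the two Star Wars titles are rated alike and the two comedies are rated alike.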
The Firm ∼ The Rainmaker
The Bourne Identity ∼ The Bourne Ultimatum
- Uniform Weight
- Weighted Parameters
Title                 Author         Category  Year
The Firm              John Grisham   Thriller  1991
The Bourne Identity   Robert Ludlum  Thriller  1980
The Bourne Ultimatum  Robert Ludlum  Thriller  1990
The Rainmaker         John Grisham   Thriller  1995
21
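One way to read "uniform weight" versus "weighted parameters" is as a weighted average of per-attribute match scores. The sketch below is an illustration under assumptions that are not on the slide: author and category score 1 on an exact match, year scores by closeness within a 10-year window, and the example weights are arbitrary.

```python
books = {
    "The Firm": {"author": "John Grisham", "category": "Thriller", "year": 1991},
    "The Bourne Identity": {"author": "Robert Ludlum", "category": "Thriller", "year": 1980},
    "The Bourne Ultimatum": {"author": "Robert Ludlum", "category": "Thriller", "year": 1990},
    "The Rainmaker": {"author": "John Grisham", "category": "Thriller", "year": 1995},
}


def attribute_scores(a, b, year_window=10):
    """Per-attribute similarity in [0, 1]; the year window is an assumption."""
    return {
        "author": 1.0 if a["author"] == b["author"] else 0.0,
        "category": 1.0 if a["category"] == b["category"] else 0.0,
        "year": max(0.0, 1.0 - abs(a["year"] - b["year"]) / year_window),
    }


def similarity(a, b, weights=None):
    """Weighted average of attribute scores; uniform weights when none are given."""
    scores = attribute_scores(a, b)
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in scores) / total


author_heavy = {"author": 3.0, "category": 1.0, "year": 1.0}  # hypothetical weights
print(similarity(books["The Firm"], books["The Rainmaker"]))
print(similarity(books["The Firm"], books["The Bourne Identity"]))
print(similarity(books["The Bourne Identity"], books["The Bourne Ultimatum"], author_heavy))
```

With uniform weights, The Firm scores higher against The Rainmaker than against The Bourne Identity, because only the first pair shares an author. Raising the author weight pulls the two Bourne titles closer together despite the ten-year gap.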
Problem:
- User reads a news article
- Find similar news articles
- Do not return the same news article
How to convert a document into a vector?
- Extract all the words
- Remove stop words
- Identify named entities
22
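The first two steps on slide 22 (extract the words, remove stop words) can be sketched as a bag-of-words vector. The stop-word list and the sample sentences below are small hypothetical examples; named-entity identification is left out because it needs a dedicated NLP tool.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is"}


def to_vector(text):
    """Lower-case the text, split it into words, drop stop words, count the rest."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity of two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


article = to_vector("The election results in the capital")
related = to_vector("Election results and turnout")
unrelated = to_vector("A recipe for tomato soup")
print(cosine(article, related), cosine(article, unrelated))
```

An article compared with itself scores 1 (up to rounding), so filtering out scores at or near 1 is one simple way to avoid returning the same article.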
New Movie
- No views (or few views)
- No similar movies
New User
- No ratings (or few ratings)
- No similar users
23
Thank you
24

Mahout Taste Engine
