Recommendations with Python and Hadoop Streaming

      Andrew Look
      Senior Engineer, Shopzilla
Getting started
● Slides
  ○ http://bit.ly/J7vmx7
● Python/NumPy Installed
  ○ http://bit.ly/JWNWbq
● Sample code
  ○ http://aws-hadoop.s3.amazonaws.com/similarity.zip
Outline
●   Problem
●   Recommendation basics
●   MapReduce review and conventions
●   Python + Hadoop Streaming basics
●   MapReduce jobs (data, code, data-flow)
●   Recommendation algorithm
Problem - Music Recommendations
●   We want to recommend similar artists
●   We have data from Last.fm
●   Which Last.fm users liked which artists?
●   How can we decide which artists are similar?




     Toby Keith   Tupac      De La Soul   Garth Brooks
Solution - Find Artist Similarities
● We'll follow along with a tutorial from AWS
● By Data Wrangling blogger/AWS developer
  Peter Skomoroch
● Uses publicly available data from Last.fm
● User's rating of artist is number of plays
Solution - Find Artist Similarities
● We can look at co-ratings
● One user played artist A's songs X times
● The same user played artist B's songs Y times


     co-rating = ((A,X),(B,Y))
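
A minimal sketch of how co-ratings could be built for a single user, assuming the user's play counts are held in a dictionary. The function name `co_ratings` and the play counts are illustrative and do not come from the tutorial code.

```python
from itertools import combinations


def co_ratings(plays):
    """Return every co-rating ((A, X), (B, Y)) for one user's play counts.

    `plays` maps artist ID -> number of plays for a single user.
    """
    # Sorting by artist ID yields each unordered artist pair exactly once,
    # always in the same order.
    return list(combinations(sorted(plays.items()), 2))


# Illustrative play counts for one user (artist ID -> plays).
user_plays = {"1001820": 34, "1011819": 2, "700": 2}
for pair in co_ratings(user_plays):
    print(pair)
```

A user with n rated artists contributes n * (n - 1) / 2 co-ratings, which is why very active users can dominate the data volume.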
Recommendation Basics
● User Based
  ○ Given a user, recommend the artists that are favored
      by users with similar artist preferences


● Item Based
  ○ Given an item (artist), recommend the artists that
      were most commonly favored by users that also
      liked the input artist
Recommendation Basics
● Types of data
   ○ Explicit - user rates a movie on Netflix
   ○ Implicit - user watches a YouTube video


● Types of ratings
  ○ Multivalued, bounded - e.g. star rating (1-5)
  ○ Multivalued, unbounded - e.g. number of plays (>0)
  ○ Binary - did a user play a movie or not?
Last.fm Recommendations
●   Data was implicitly collected (as users play songs)
●   Transform binary data (did user listen to artist?) ...
●   Into multivalued data (how many times?)
●   We'll use item-based recommendations
Mapper Input
Map Output - Reduce Input
Chaining MapReduce Jobs
Distributed Cache
Python Shell and Hadoop Streaming
Streaming API requires shell commands
   ● Mapper
   ● Reducer

For the mapper and reducer commands, the Streaming API will
  ● Partition the input
  ● Distribute it across mappers and reducers
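
As a rough illustration of the Streaming contract, a mapper is any command that reads lines from standard input and writes `key<TAB>value` lines to standard output. The sketch below assumes whitespace-separated input rows; the function name `user_key_mapper` is hypothetical and is not part of similarity.py.

```python
import sys


def user_key_mapper(lines):
    """Yield 'key<TAB>value' records: first field as key, the rest as value."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        yield "%s\t%s" % (fields[0], " ".join(fields[1:]))


# In a real Streaming job the lines would come from sys.stdin;
# a small list stands in for it here.
for record in user_key_mapper(["1000020 1001820 20\n"]):
    sys.stdout.write(record + "\n")
```

Keeping the logic in a function that takes any iterable of lines makes the same code testable locally and usable with `sys.stdin` on the cluster.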
Python Shell and Hadoop Streaming
Full Recommendation Job Overview
Example - Working Data Set
○ Inspect your working data set ...
○ Each row is one "rating"
○ Each "number of plays" is the "rating value"

   Code

   cat input/sample_user_artist_data.txt | head
Example - Working Data Set
User ID      Artist ID   Number of Plays
1000020      1001820     20
1000020      1003557     1
1000021      700         1
1000029      1011819     1
1000036      1001820     34
1000036      1011819     2
1000036      700         2
1000040      1001820     1
1000057      1011819     37
1000060      700         17
Mapper 1 - Count Ratings per Artist
○ Prepend LongValueSum:<artist ID>
○ More on this later
○ Use a value of "1"


  Code

  cat input/sample_user_artist_data.txt | ./similarity.py mapper1
Mapper 1 - Count Ratings per Artist
      Artist ID              Number of Ratings
      LongValueSum:1001820   1
      LongValueSum:1003557   1
      LongValueSum:700       1
      LongValueSum:1011819   1
      LongValueSum:1001820   1
      LongValueSum:1011819   1
      LongValueSum:700       1
      LongValueSum:1001820   1
      LongValueSum:1011819   1
      LongValueSum:700       1
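
The output above could be produced by a mapper along these lines. This is a sketch that assumes each input row has exactly three whitespace-separated fields (user ID, artist ID, plays); it is not the tutorial's own implementation.

```python
def mapper1(lines):
    """Emit 'LongValueSum:<artist ID><TAB>1' for every rating row."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed or blank rows
        _user, artist, _plays = fields
        # The value is always 1: each row counts as one rating,
        # regardless of how many plays it records.
        yield "LongValueSum:%s\t1" % artist


sample = ["1000020 1001820 20", "1000020 1003557 1", "1000021 700 1"]
for record in mapper1(sample):
    print(record)
```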
Mapper 1 - Count Ratings per Artist
○ We use the sort command locally
○ We sort by artist ID
○ Emulates shuffle/sort in Hadoop
  Code

  cat input/sample_user_artist_data.txt | ./similarity.py mapper1 | sort
Mapper 1 - Count Ratings per Artist
      Artist ID              Number of Ratings
      LongValueSum:1001820   1
      LongValueSum:1001820   1
      LongValueSum:1001820   1
      LongValueSum:1003557   1
      LongValueSum:1011819   1
      LongValueSum:1011819   1
      LongValueSum:1011819   1
      LongValueSum:700       1
      LongValueSum:700       1
      LongValueSum:700       1
Reducer 1 - Count Ratings by Artist
○ LongValueSum tells the 'aggregate' reducer to
  ○ Group by artist ID
  ○ Sum up the 1's
  ○ Emit artist ID as Key, count(ratings) as Value
   Code

   cat input/sample_user_artist_data.txt \
   | ./similarity.py mapper1 | sort \
   | ./similarity.py reducer1 \
   > input/artist_playcounts.txt
Reducer 1 - Count Ratings by Artist
      Artist ID   Number of Ratings
      1000143     1905
      1000418     184
      1001820     12950
      700         7243
      1003557     2976
      1011819     7601
      1012511     1881
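
For local testing, the summing behaviour described for the 'aggregate' reducer can be emulated in a few lines. This is a sketch that assumes sorted, tab-separated mapper output; it is a stand-in, not the tutorial's `reducer1`.

```python
from itertools import groupby

PREFIX = "LongValueSum:"


def reducer1(sorted_lines):
    """Sum the values for each key and strip the LongValueSum: prefix."""
    records = (
        line.rstrip("\n").split("\t") for line in sorted_lines if line.strip()
    )
    # groupby only merges adjacent equal keys, so the input must be sorted.
    for key, group in groupby(records, key=lambda rec: rec[0]):
        total = sum(int(rec[1]) for rec in group)
        artist = key[len(PREFIX):] if key.startswith(PREFIX) else key
        yield "%s\t%d" % (artist, total)


sorted_input = [
    "LongValueSum:1001820\t1",
    "LongValueSum:1001820\t1",
    "LongValueSum:700\t1",
]
for record in reducer1(sorted_input):
    print(record)
```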
Mapper 2 - User Artist Preferences
○ Mapper2 outputs key user ID, artist ID
○ Mapper2 outputs rating as value (# plays)

   Code

   cat input/sample_user_artist_data.txt | ./similarity.py mapper2 int
Mapper 2 - User Artist Preferences
      User ID, Artist ID   Number of Plays
      1000020,1001820      20
      1000020,1003557      1
      1000021,700          1
      1000029,1011819      1
      1000036,1001820      34
      1000036,1011819      2
      1000036,700          2
      1000040,1001820      1
      1000057,1011819      37
      1000060,700          17
Mapper 2 - User Artist Preferences
○ Can large counts skew our results?
○ Apply a log function to dampen large play counts.

   Code

   cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort
Mapper 2 - Logarithmic Smoothing
    User ID, Artist ID   Smoothing   Smoothed Count
    1000020,1001820      log(20)     3
    1000020,1003557      log(1)      1
    1000021,700          log(1)      1
    1000029,1011819      log(1)      1
    1000036,1001820      log(34)     4
    1000036,1011819      log(2)      1
    1000036,700          log(2)      1
    1000040,1001820      log(1)      1
    1000057,1011819      log(37)     4
    1000060,700          log(17)     3
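
One smoothing rule that reproduces every row of the table above is the truncated natural log plus one. The slides do not state the exact formula, so treat `int(log(plays)) + 1` as an assumption that happens to match the sample values; the `mapper2` below is likewise a sketch.

```python
from math import log


def smooth(plays):
    """Assumed smoothing rule: truncated natural log of the play count, plus 1."""
    return int(log(plays)) + 1


def mapper2(lines, scale="log"):
    """Emit '<user ID>,<artist ID><TAB><rating>' for every rating row."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed or blank rows
        user, artist, plays = fields
        rating = smooth(int(plays)) if scale == "log" else int(plays)
        yield "%s,%s\t%d" % (user, artist, rating)


for plays in (20, 1, 34, 2, 37, 17):
    print(plays, smooth(plays))
```

The "+ 1" keeps a single play from collapsing to a rating of zero, since log(1) = 0.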
Reducer 2 - Aggregate User Prefs
○ Reduce for each user
○ Key - user ID
○ Value is complex
  ○ Count(ratings)
  ○ Sum(rating values)
  ○ Space-delimited list of "artist ID,rating value" pairs

  Code

  cat input/sample_user_artist_data.txt \
  | ./similarity.py mapper2 log | sort \
  | ./similarity.py reducer2
Reducer 2 - Aggregated User Prefs
     User ID   Count | Sum | Artist ID,Rating ...
     1000020   2 | 4 | 1001820,3 1003557,1
     1000021   1 | 1 | 700,1
     1000029   1 | 1 | 1011819,1
     1000036   3 | 6 | 1001820,4 1011819,1 700,1
     1000040   1 | 1 | 1001820,1
     1000057   1 | 4 | 1011819,4
     1000060   1 | 3 | 700,3
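
The per-user aggregation could be sketched as follows, assuming sorted input lines of the form `<user ID>,<artist ID><TAB><rating>`. The output packs count, sum and the rating list into one value separated by `|` (without the padding spaces shown in the table); this layout is an assumption based on the table, not the tutorial's exact format.

```python
from itertools import groupby


def reducer2(sorted_lines):
    """Emit '<user><TAB><count>|<sum>|<artist,rating artist,rating ...>'."""

    def parse(line):
        key, rating = line.rstrip("\n").split("\t")
        user, artist = key.split(",")
        return user, artist, int(rating)

    records = (parse(line) for line in sorted_lines if line.strip())
    # Input is sorted by key, so all rows for one user are adjacent.
    for user, group in groupby(records, key=lambda rec: rec[0]):
        ratings = [(artist, rating) for _user, artist, rating in group]
        total = sum(rating for _artist, rating in ratings)
        items = " ".join("%s,%d" % pair for pair in ratings)
        yield "%s\t%d|%d|%s" % (user, len(ratings), total, items)


sample = ["1000036,1001820\t4", "1000036,1011819\t1", "1000036,700\t1"]
for record in reducer2(sample):
    print(record)
```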
Mapper 3 - User Co-Ratings
○ Mapper3 culls users via cutoff
○ Drop the user ID and emit that user's ratings pairwise

   Code

   cat input/sample_user_artist_data.txt \
   | ./similarity.py mapper2 log | sort \
   | ./similarity.py reducer2 \
   | ./similarity.py mapper3 100 \
   input/artist_playcounts.txt | sort
Mapper 3 - User Co-Ratings
    Artist ID: X, Y   Rating: X, Y
    1000143 1003577   2 3
    1000143 1011819   2 3
    1001820 700       1 2
    1001820 700       1 3
    1011819 700       3 2
    1011819 700       3 3
    1011819 700       4 2
    1011819 700       4 2
    1011819 700       5 5
    1012511 700       1 1
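
A sketch of the pairwise emission, under two loudly stated assumptions: the cutoff is treated as a maximum number of ratings per user (users above it are skipped), and the artist play-count file passed on the command line is ignored. The slides do not spell out either detail, so the real `mapper3` may differ.

```python
from itertools import combinations


def mapper3(lines, cutoff=100):
    """Emit '<artist X> <artist Y><TAB><rating X> <rating Y>' per co-rating."""
    for line in lines:
        if not line.strip():
            continue
        _user, value = line.rstrip("\n").split("\t")
        count, _total, items = value.split("|")
        # Assumption: users with more than `cutoff` ratings are dropped,
        # because they would generate a very large number of pairs.
        if int(count) > cutoff:
            continue
        # Sort by artist ID so a given pair always appears in the same order.
        ratings = sorted(item.split(",") for item in items.split())
        for (artist_x, x), (artist_y, y) in combinations(ratings, 2):
            yield "%s %s\t%s %s" % (artist_x, artist_y, x, y)


for record in mapper3(["1000036\t3|6|1001820,4 1011819,1 700,1"]):
    print(record)
```

The user ID is no longer needed once the pairs are formed; the artist pair becomes the key for the next reduce step.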
Reducer 3 - Artist Similarities
○ Given num artists, computes similarities
○ Each pair of artists emitted w/ similarities
   Code

   cat input/sample_user_artist_data.txt \
   | ./similarity.py mapper2 log | sort \
   | ./similarity.py reducer2 \
   | ./similarity.py mapper3 100 \
   input/artist_playcounts.txt | sort \
   | ./similarity.py reducer3 147160 \
   > artist_similarities.txt
Reducer 3 - Artist Similarities
     Artist ID   Similarity        Artist ID   Co-Ratings
     1003557     0.121659425105    1012511     360
     1012511     0.121659425105    1003557     360
     1003557     0.0197107349416   700         212
     700         0.0197107349416   1003557     212
     1011819     0.0128808637553   1012511     259
     1012511     0.0128808637553   1011819     259
     1011819     0.297222927702    700         3050
     700         0.297222927702    1011819     3050
     1012511     0.0426446192482   700         270
     700         0.0426446192482   1012511     270
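
A simplified sketch of this reduce step: for each artist pair it accumulates running sums over the co-ratings and computes a plain Pearson correlation, then emits the result in both directions as in the table above. It ignores the "num artists" argument (147160 in the command), so it should not be expected to reproduce the exact similarity values shown.

```python
from itertools import groupby
from math import sqrt


def pearson_from_sums(n, sx, sy, sxx, syy, sxy):
    """Pearson correlation from running sums; 0.0 if either variance is zero."""
    numerator = n * sxy - sx * sy
    denominator = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return numerator / denominator if denominator else 0.0


def reducer3(sorted_lines):
    """Emit '<X> <similarity> <Y> <n>' and the mirrored row for each pair."""

    def parse(line):
        key, value = line.rstrip("\n").split("\t")
        x, y = value.split()
        return key, float(x), float(y)

    records = (parse(line) for line in sorted_lines if line.strip())
    for key, group in groupby(records, key=lambda rec: rec[0]):
        n = 0
        sx = sy = sxx = syy = sxy = 0.0
        for _key, x, y in group:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
        artist_x, artist_y = key.split()
        sim = pearson_from_sums(n, sx, sy, sxx, syy, sxy)
        yield "%s %.6f %s %d" % (artist_x, sim, artist_y, n)
        yield "%s %.6f %s %d" % (artist_y, sim, artist_x, n)


for record in reducer3(["A B\t1 2", "A B\t2 4", "A B\t3 6"]):
    print(record)
```

Working from running sums means the reducer never has to hold all co-ratings for a pair in memory at once.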
Mapper 4 - Sort by Artist Correlation
○ Emit artist ID and (1 - similarity), concatenated, as the key
○ Sorting by this key puts the most similar artists first = recommendation

   Code

   cat artist_similarities.txt | ./similarity.py mapper4 20 | sort
Mapper 4 - Sort by Artist Correlation
     Artist X-ID, 1 - Similarity   Artist Y-ID, Num Co-Ratings
     1012511,0.924219271937        1000143 237
     1012511,0.945653412649        1001820 468
     1012511,0.957355380752        700 270
     1012511,0.961454917198        1000418 50
     1012511,0.987119136245        1011819 259
     700,0.702777072298            1011819 3050
     700,0.898811337303            1001820 2250
     700,0.95212801312             1000143 114
     700,0.957355380752            1012511 270
     700,0.980289265058            1003557 212
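
The sample values suggest that the key carries 1 - similarity (for example, 1 - 0.297222927702 = 0.702777072298), so an ascending text sort lists the most similar artists first. The sketch below follows that reading and assumes the numeric argument (20 in the command) is a minimum number of co-ratings; that interpretation is a guess, not something the slides confirm.

```python
def mapper4(lines, min_coratings=20):
    """Emit '<artist X>,<1 - similarity><TAB><artist Y> <n>' for sorting."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip malformed or blank rows
        artist_x, similarity, artist_y, n = fields
        # Assumption: pairs backed by too few co-ratings are discarded.
        if int(n) < min_coratings:
            continue
        # Fixed-width formatting keeps text order equal to numeric order.
        distance = "%.6f" % (1.0 - float(similarity))
        yield "%s,%s\t%s %s" % (artist_x, distance, artist_y, n)


for record in mapper4(["700 0.25 1011819 3050", "700 0.5 999 5"]):
    print(record)
```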
Reducer 4 - Cosmetic Results
○ Reducer attaches artist names
   Code

   cat artist_similarities.txt \
   | ./similarity.py mapper4 20 \
   | sort \
   | ./similarity.py reducer4 3 lastfm/artist_data.txt \
   > related_artists.tsv
Reducer 4 - Cosmetic Results

Artist ID   Related Artist ID   Similarity   Number of Co-Ratings   Related Artist Name
1000143     1000143             1            0                      Toby Keith
1000143     1003557             0.2434       809                    Garth Brooks
1000143     1000418             0.1068       120                    Mark Chestnutt
1000143     1012511             0.0758       237                    Kenny Rogers
1000418     1000418             1            0                      Mark Chestnutt
1000418     1000143             0.1068       120                    Toby Keith
1000418     1003557             0.056        114                    Garth Brooks
1000418     1012511             0.0385       50                     Kenny Rogers
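
A sketch of the final formatting step, assuming the sorted input keys hold 1 - similarity and that artist names are already loaded into a dictionary. The `names` argument, the "?" placeholder for unknown IDs and the meaning of the numeric argument (top N related artists) are assumptions for illustration; the format of lastfm/artist_data.txt is not shown in the slides.

```python
from itertools import groupby


def reducer4(sorted_lines, names, top_n=3):
    """Emit tab-separated rows: artist, related artist, similarity, n, name."""

    def parse(line):
        key, value = line.rstrip("\n").split("\t")
        artist_x, distance = key.split(",")
        artist_y, n = value.split()
        return artist_x, 1.0 - float(distance), artist_y, int(n)

    records = (parse(line) for line in sorted_lines if line.strip())
    for artist_x, group in groupby(records, key=lambda rec: rec[0]):
        # Each artist is listed first as its own match (similarity 1).
        yield "\t".join([artist_x, artist_x, "1", "0", names.get(artist_x, "?")])
        # Input is sorted by distance, so the first rows are the best matches.
        for _x, sim, artist_y, n in list(group)[:top_n]:
            row = [artist_x, artist_y, "%.4f" % sim, str(n), names.get(artist_y, "?")]
            yield "\t".join(row)


names = {"700": "Artist A", "1011819": "Artist B"}  # placeholder names
for record in reducer4(["700,0.250000\t1011819 3050"], names):
    print(record)
```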
Pearson Similarity - Visualization




   covariance(A, B) =     2.44
   covariance(C, D) =    -2.36
Pearson Similarity - Equation

pearson(x, y)
  = covariance(x, y)
  / (stddev(x) * stddev(y))


   pearson(A, B) = 0.772
   pearson(C, D) = -0.746
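
The equation translates directly into code. This sketch uses population covariance and standard deviation (dividing by n); dividing by n - 1 throughout would give the same correlation because the factor cancels. The input vectors are illustrative and are not the A, B, C, D series from the slide.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Population covariance of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def stddev(x):
    """Population standard deviation, i.e. sqrt(covariance(x, x))."""
    return sqrt(covariance(x, x))


def pearson(x, y):
    """covariance(x, y) / (stddev(x) * stddev(y)); undefined for constant input."""
    return covariance(x, y) / (stddev(x) * stddev(y))


print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly correlated, close to 1
print(pearson([1, 2, 3], [3, 2, 1]))  # perfectly anti-correlated, close to -1
```

If either sequence is constant its standard deviation is zero and the function raises `ZeroDivisionError`, so real code needs a guard for that case.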
Pearson Similarity - Summary
○ Pearson similarity normalizes covariance
○ Measures linear dependence between two variables
○ Normalized to a fixed range:


         -1 <= pearson(x, y) <= 1

                (for any x, y with nonzero variance)
Questions?
Appendix
● Hadoop Streaming
  ○ http://hadoop.apache.org/common/docs/r0.20.1/streaming.html
● Explanation of LongValueSum
  ○ http://stackoverflow.com/questions/1946953/availiable-reducers-in-elastic-mapreduce
● Pearson Correlation
  ○ http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
● Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming
  ○ http://aws.amazon.com/articles/2294
Appendix
● Anscombe's Quartet
  ○ http://en.wikipedia.org/wiki/Anscombe's_quartet
● Tau Coefficient
  ○ http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient
● Jaccard Index
  ○ http://en.wikipedia.org/wiki/Jaccard_index
● Quality of Recommendations
  ○ http://en.wikipedia.org/wiki/Mean_squared_error
