Recommendations with hadoop streaming and python

Recommendations with Python and Hadoop Streaming

Andrew Look

Senior Engineer
Shopzilla

Getting started
● Slides
○ http://bit.ly/J7vmx7
● Python/NumPy Installed
○ http://bit.ly/JWNWbq
● Sample code
○ http://aws-hadoop.s3.amazonaws.com/similarity.zip

Outline
● Problem
● Recommendation basics
● MapReduce review and conventions
● Python + Hadoop Streaming basics
● MapReduce jobs (data, code, data-flow)
● Recommendation algorithm

Problem - Music Recommendations
● We want to recommend similar artists
● We have data from Last.fm
● Which Last.fm users liked which artists?
● How can we decide which artists are similar?

Example artists: Toby Keith, Tupac, De La Soul, Garth Brooks

Solution - Find Artist Similarities
● We'll follow along with a tutorial from AWS
● By Data Wrangling blogger/AWS developer Peter Skomoroch
● Uses publicly available data from Last.fm
● User's rating of artist is number of plays

Solution - Find Artist Similarities
● We can look at co-ratings
● One user played artist A's songs X times
● The same user played artist B's songs Y times

co-rating = ((A,X),(B,Y))

Recommendation Basics
● User Based
○ Given a user, recommend the artists that are favored by users with similar artist preferences

● Item Based
○ Given an item (artist), recommend the artists that were most commonly favored by users that also liked the input artist

Recommendation Basics
● Types of data
○ Explicit - user rates a movie on Netflix
○ Implicit - user watches a YouTube video

● Types of ratings
○ Multivalued - bounded, ex. star rating (1-5)
○ Multivalued - unbounded, ex. number of plays (>0)
○ Binary - did a user play a movie or not?

Last.fm Recommendations
● Data was implicitly collected (as users play songs)
● Transform binary data (did user listen to artist?) ...
● Into multivalued data (how many times?)
● We'll use item-based recommendations

Python Shell and Hadoop Streaming
Streaming API requires shell commands
● Mapper
● Reducer

For the mapper and reducer commands, the Streaming API will
● Partition the input
● Distribute it across mappers and reducers

Full Recommendation Job Overview

Example - Working Data Set
○ Inspect your working data set ...
○ Each row is one "rating"
○ Each "number of plays" is the "rating value"

Code

cat input/sample_user_artist_data.txt | head

Example - Working Data Set
User ID    Artist ID    Number of Plays
1000020    1001820      20
1000020    1003557      1
1000021    700          1
1000029    1001819      1
1000036    1001820      34
1000036    1011819      2
1000036    700          2
1000040    1001820      1
1000057    1011819      37
1000060    700          17

Mapper 1 - Count Ratings per Artist
○ Prepend LongValueSum: to the artist ID to form the key
○ More on this later
○ Use a value of "1"

Code

| ./similarity.py mapper1

Artist ID               Number of Ratings
LongValueSum:1001820    1
LongValueSum:700        1
LongValueSum:700        1
LongValueSum:700        1
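
As a rough illustration of the Mapper 1 step, here is a minimal sketch. It is not the code from similarity.py; the function name, the whitespace-separated input, and the tab-separated output are assumptions made for the example.

```python
def mapper1(lines):
    """Yield one 'LongValueSum:<artist ID>' record with value 1 per rating row."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed rows (assumed behavior)
        user_id, artist_id, plays = fields
        # The play count is ignored here: each row counts as one rating.
        yield "LongValueSum:%s\t1" % artist_id


# Rows taken from the working data set shown earlier.
sample = ["1000020 1001820 20", "1000020 1003557 1", "1000021 700 1"]
mapper1_output = list(mapper1(sample))
```

In a streaming job the same function would be fed from sys.stdin and its results printed to stdout, one record per line.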

○ We use the sort command locally
○ We sort by artist ID
○ Emulates shuffle/sort in Hadoop
Code

| ./similarity.py mapper1 | sort

Artist ID           Number of Ratings
LongValueSum:700    1
LongValueSum:700    1
LongValueSum:700    1

Reducer 1 - Count Ratings by Artist
○ LongValueSum tells the 'aggregate' reducer to sum the values for each key
○ Group by artist ID
○ Sum up the 1's
○ Emit artist ID as Key, count(ratings) as Value
Code

| ./similarity.py mapper1 | sort
| ./similarity.py reducer1
> input/artist_playcounts.txt

Reducer 1 - Count Ratings by Artist
Artist ID    Number of Ratings
1000143      1905
1000418      184
1001820      12950
700          7243
1003557      2976
1011819      7601
1012511      1881
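
To make the summing step concrete, the sketch below emulates locally what a LongValueSum aggregation does on sorted input: group by key and add up the values. The function name and the tab-separated record layout are assumptions for illustration, not the contents of similarity.py.

```python
from itertools import groupby


def reducer1(sorted_lines):
    """Sum the values per key for sorted 'LongValueSum:<key>' records."""
    records = (line.rstrip("\n").split("\t") for line in sorted_lines)
    # groupby only merges adjacent records, so the input must be sorted by key.
    for key, group in groupby(records, key=lambda rec: rec[0]):
        total = sum(int(value) for _, value in group)
        artist_id = key.split(":", 1)[1]  # drop the LongValueSum: prefix
        yield "%s\t%d" % (artist_id, total)


sorted_input = [
    "LongValueSum:1001820\t1",
    "LongValueSum:700\t1",
    "LongValueSum:700\t1",
    "LongValueSum:700\t1",
]
reducer1_output = list(reducer1(sorted_input))
```

This is why the local pipeline needs the sort command between mapper and reducer: without it, records for the same artist would not be adjacent.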

Mapper 2 - User Artist Preferences
○ Mapper2 outputs key user ID, artist ID
○ Mapper2 outputs rating as value (# plays)

Code

| ./similarity.py mapper2 int

User ID, Artist ID    Number of Plays
1000020,1001820       20
1000020,1003557       1
1000021,700           1
1000029,1011819       1
1000036,1001820       34
1000036,1011819       2
1000036,700           2
1000040,1001820       1
1000057,1011819       37
1000060,700           17

○ Can large play counts skew our results?
○ Apply a log function to dampen them.

Code

| ./similarity.py mapper2 log | sort

Mapper 2 - Logarithmic Smoothing
User ID, Artist ID    Smoothing    Smoothed Count
1000020,1001820       log(20)      3
1000020,1003557       log(1)       1
1000021,700           log(1)       1
1000029,1011819       log(1)       1
1000036,1001820       log(34)      4
1000036,1011819       log(2)       1
1000036,700           log(2)       1
1000040,1001820       log(1)       1
1000057,1011819       log(37)      4
1000060,700           log(17)      3
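
The slides do not spell out the exact smoothing formula. One formula that reproduces every row of the table above is the truncated natural log plus one; the sketch below assumes that formula, and the function name is hypothetical.

```python
from math import log


def smooth(plays):
    """Map a raw play count to a small integer rating (assumed formula)."""
    # int() truncates the natural log; adding 1 keeps a single play at rating 1
    # rather than 0, since log(1) == 0.
    return int(log(plays)) + 1


# Play counts that appear in the sample data.
smoothed = {plays: smooth(plays) for plays in (1, 2, 17, 20, 34, 37)}
```

With this mapping, a user with 37 plays counts only four times as much as a user with a single play, instead of 37 times.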

Reducer 2 - Aggregate User Prefs
○ Reduce for each user
○ Key - user ID
○ Value is complex
○ Count(ratings)
○ Sum(rating values)
○ Space delimited list of - artist ID, rating value

Code


Reducer 2 - Aggregated User Prefs
User ID    Count(ratings) | Sum(rating values) | Artist ID,Rating Value ...
1000020    2 | 4 | 1001820,3 1003557,1
1000021    1 | 1 | 700,1
1000029    1 | 1 | 1011819,1
1000036    3 | 6 | 1001820,4 1011819,1 700,1
1000040    1 | 1 | 1001820,1
1000057    1 | 4 | 1011819,4
1000060    1 | 3 | 700,3
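
A minimal sketch of how such per-user records could be built from sorted Mapper 2 output. The record layout (tab between key and value, "|" between the three value parts, no padding spaces) and the function name are assumptions for the example, not the format used by similarity.py.

```python
from itertools import groupby


def reducer2(sorted_lines):
    """Collapse sorted '<user>,<artist>' rating records into one line per user."""
    def parse(line):
        key, rating = line.rstrip("\n").split("\t")
        user_id, artist_id = key.split(",")
        return user_id, artist_id, int(rating)

    records = (parse(line) for line in sorted_lines)
    for user_id, group in groupby(records, key=lambda rec: rec[0]):
        prefs = [(artist_id, rating) for _, artist_id, rating in group]
        count = len(prefs)                           # Count(ratings)
        total = sum(rating for _, rating in prefs)   # Sum(rating values)
        pairs = " ".join("%s,%d" % pref for pref in prefs)
        yield "%s\t%d|%d|%s" % (user_id, count, total, pairs)


sorted_input = ["1000020,1001820\t3", "1000020,1003557\t1", "1000021,700\t1"]
reducer2_output = list(reducer2(sorted_input))
```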

Mapper 3 - User Co-Ratings
○ Mapper3 culls users via a cutoff
○ Drop the user ID, emit each pair of artists the user rated together with both ratings

Code

| ./similarity.py mapper3 100 input/artist_playcounts.txt | sort

Mapper 3 - User Co-Ratings
Artist ID X    Artist ID Y    Rating X    Rating Y
1000143        1003577        2           3
1000143        1011819        2           3
1001820        700            1           2
1001820        700            1           3
1011819        700            3           2
1011819        700            3           3
1011819        700            4           2
1011819        700            4           2
1011819        700            5           5
1012511        700            1           1
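
The pairwise step can be sketched as follows. The slides do not say how the cutoff is applied, so this example simply skips users with more ratings than the cutoff; that rule, the function name, and the input layout (the assumed Reducer 2 format with "|" separators) are all assumptions.

```python
from itertools import combinations


def mapper3(lines, max_ratings=100):
    """Emit one co-rating record per pair of artists rated by the same user."""
    for line in lines:
        user_id, value = line.rstrip("\n").split("\t")
        count, total, prefs = value.split("|")
        if int(count) > max_ratings:
            continue  # assumed cutoff rule: ignore very heavy raters
        # Sort by artist ID (as strings) so each pair has a consistent order.
        ratings = sorted(pref.split(",") for pref in prefs.split())
        for (artist_x, rating_x), (artist_y, rating_y) in combinations(ratings, 2):
            # The user ID is dropped; only the artist pair and ratings remain.
            yield "%s %s\t%s %s" % (artist_x, artist_y, rating_x, rating_y)


user_prefs = ["1000036\t3|6|1001820,4 1011819,1 700,1", "1000021\t1|1|700,1"]
mapper3_output = list(mapper3(user_prefs))
```

A user with n ratings produces n * (n - 1) / 2 pairs, which is one reason to cap the number of ratings per user.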

Reducer 3 - Artist Similarities
○ Given the number of artists, computes similarities
○ Each pair of artists is emitted with its similarity
Code

input/artist_playcounts.txt | sort
| ./similarity.py reducer3 147160
> artist_similarities.txt

Reducer 3 - Artist Similarities
Artist ID    Similarity         Artist ID    Co-Ratings
1003557      0.121659425105     1012511      360
1012511      0.121659425105     1003557      360
1003557      0.0197107349416    700          212
700          0.0197107349416    1003557      212
1011819      0.0128808637553    1012511      259
1012511      0.0128808637553    1011819      259
1011819      0.297222927702     700          3050
700          0.297222927702     1011819      3050
1012511      0.0426446192482    700          270
700          0.0426446192482    1012511      270

Mapper 4 - Sort by Artist Correlation
○ Emit artist ID and (1 - similarity) concatenated as the key
○ An ascending sort then lists the most similar artists first = recommendation

Code

cat artist_similarities.txt
| ./similarity.py mapper4 20 | sort

Mapper 4 - Sort by Artist Correlation
Artist X-ID, Similarity    Artist Y-ID    Num Co-Ratings
1012511,0.924219271937     1000143        237
1012511,0.945653412649     1001820        468
1012511,0.957355380752     700            270
1012511,0.961454917198     1000418        50
1012511,0.987119136245     1011819        259
700,0.702777072298         1011819        3050
700,0.898811337303         1001820        2250
700,0.95212801312          1000143        114
700,0.957355380752         1012511        270
700,0.980289265058         1003557        212
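
Comparing this table with the Reducer 3 output, the number in the key matches 1 minus the similarity (for example 0.0426446192482 becomes 0.957355380752 for the pair 700 and 1012511). A plausible reading is that the key is built this way so that a plain ascending sort puts the most similar artists first. The sketch below assumes that reading; the function name and the six-decimal formatting are hypothetical.

```python
def mapper4_key(artist_id, similarity):
    """Build a sort key that orders an artist's neighbors by descending similarity."""
    # Subtracting from 1 flips the order, so ascending sort = most similar first.
    return "%s,%.6f" % (artist_id, 1.0 - similarity)


# Similarities for artist 700 taken from the Reducer 3 output above.
keys = sorted([
    mapper4_key("700", 0.0426446192482),  # paired with 1012511
    mapper4_key("700", 0.297222927702),   # paired with 1011819
])
```

After sorting, the key for the more similar artist (1011819) comes first.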

Reducer 4 - Cosmetic Results
○ Reducer attaches artist names
Code

cat artist_similarities.txt
| sort
| ./similarity.py reducer4 3 lastfm/artist_data.txt
> related_artists.tsv

Reducer 4 - Cosmetic Results

Artist ID    Related Artist ID    Similarity    Number of Co-Ratings    Artist Name
1000143      1000143              1             0                       Toby Keith
1000143      1003557              0.2434        809                     Garth Brooks
1000143      1000418              0.1068        120                     Mark Chestnutt
1000143      1012511              0.0758        237                     Kenny Rogers
1000418      1000418              1             0                       Mark Chestnutt
1000418      1000143              0.1068        120                     Toby Keith
1000418      1003557              0.056         114                     Garth Brooks
1000418      1012511              0.0385        50                      Kenny Rogers

Pearson Similarity - Visualization

covariance(A, B) = 2.44
covariance(C, D) = -2.36

Pearson Similarity - Equation

pearson(x, y)
= covariance(x, y)
/ (stddev(x) * stddev(y))

pearson(A, B) = 0.772
pearson(C, D) = -0.746
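
The equation above can be written out directly in Python. This is a minimal sketch using population covariance and standard deviation; the helper names and the toy rating vectors are made up for illustration and are not the A, B, C, D series from the slides.

```python
from math import sqrt


def mean(values):
    return sum(values) / float(len(values))


def covariance(x, y):
    """Population covariance of two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / float(len(x))


def stddev(x):
    """Population standard deviation, i.e. sqrt of covariance(x, x)."""
    return sqrt(covariance(x, x))


def pearson(x, y):
    """Covariance scaled by both standard deviations."""
    return covariance(x, y) / (stddev(x) * stddev(y))


# Hypothetical co-ratings for two artist pairs.
rising = pearson([1, 2, 3], [2, 4, 6])   # perfectly linear, same direction
falling = pearson([1, 2, 3], [3, 2, 1])  # perfectly linear, opposite direction
```

Note that the function is undefined when either sequence is constant, because its standard deviation is zero.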

Pearson Similarity - Summary
○ Pearson similarity normalizes covariance
○ Measures linear dependence between two variables
○ Normalized ...

-1 <= pearson(x, y) <= 1

(for any x, y)

Appendix
● Hadoop Streaming
○ http://hadoop.apache.org/common/docs/r0.20.1/streaming.html

● Explanation of LongValueSum
○ http://stackoverflow.com/questions/1946953/availiable-reducers-in-elastic-mapreduce

● Pearson Correlation
○ http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
● Finding Similar Items with Amazon Elastic MapReduce,
Python, and Hadoop Streaming
○ http://aws.amazon.com/articles/2294

Appendix
● Anscombe's Quartet
○ http://en.wikipedia.org/wiki/Anscombe's_quartet

● Tau Coefficient
○ http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient
● Jaccard Index
○ http://en.wikipedia.org/wiki/Jaccard_index

● Quality of Recommendations
○ http://en.wikipedia.org/wiki/Mean_squared_error
