3. Outline
● Problem
● Recommendation basics
● MapReduce review and conventions
● Python + Hadoop Streaming basics
● MapReduce jobs (data, code, data-flow)
● Recommendation algorithm
4. Problem - Music Recommendations
● We want to recommend similar artists
● We have data from Last.fm
● Which Last.fm users liked which artists?
● How can we decide which artists are similar?
Toby Keith, Tupac, De La Soul, Garth Brooks
5. Solution - Find Artist Similarities
● We'll follow along with a tutorial from AWS
● By Data Wrangling blogger and AWS developer Peter Skomoroch
● Uses publicly available data from Last.fm
● User's rating of artist is number of plays
6. Solution - Find Artist Similarities
● We can look at co-ratings
● One user played artist A songs X times
● Same user played artist B songs Y times
co-rating = ((A,X),(B,Y))
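As a rough illustration of the co-rating idea, the sketch below builds every ((A,X),(B,Y)) pair from one user's ratings. The function name `co_ratings` and the artist labels and play counts are made up for the example; they are not taken from the tutorial.

```python
from itertools import combinations

def co_ratings(user_ratings):
    """Yield ((A, X), (B, Y)) for every pair of artists one user played.

    user_ratings: list of (artist, play_count) tuples for a single user.
    """
    # Sorting first means each unordered pair appears once, in a stable order.
    for pair in combinations(sorted(user_ratings), 2):
        yield pair

# Hypothetical user who played three artists: three co-ratings result.
one_user = [("B", 3), ("A", 10), ("C", 1)]
pairs = list(co_ratings(one_user))
for pair in pairs:
    print(pair)
```

A user with n rated artists contributes n(n-1)/2 co-ratings, so heavy listeners dominate the pair count.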
7. Recommendation Basics
● User Based
○ Given a user, recommend the artists that are favored
by users with similar artist preferences
● Item Based
○ Given an item (artist), recommend the artists that
were most commonly favored by users that also
liked the input artist
8. Recommendation Basics
● Types of data
○ Explicit - user rates a movie on Netflix
○ Implicit - user watches a YouTube video
● Types of ratings
○ Multivalued - bounded, ex. star rating (1-5)
○ Multivalued - unbounded, ex. number of plays (>0)
○ Binary - did a user play a movie or not?
9. Last.fm Recommendations
● Data was implicitly collected (as users play songs)
● Transform binary data (did user listen to artist?) ...
● Into multivalued data (how many times?)
● We'll use item-based recommendations
15. Python Shell and Hadoop Streaming
Streaming API requires shell commands
● Mapper
● Reducer
For the mapper and reducer commands, the Streaming API will
● Partition the input
● Distribute it across mappers and reducers
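Because Streaming only needs commands that read lines on standard input and write lines on standard output, one script can serve every step by dispatching on its first argument. The sketch below shows one way that could look; the `STEPS` registry, the `run` helper, and the `identity` step are hypothetical and are not the contents of similarity.py.

```python
import sys

def identity_mapper(lines):
    """Placeholder step: pass each input line through unchanged."""
    for line in lines:
        yield line.rstrip("\n")

# Hypothetical registry mapping a command-line step name to a function.
STEPS = {"identity": identity_mapper}

def run(step_name, lines, out=sys.stdout):
    """Run one named step over an iterable of lines, writing to out."""
    for record in STEPS[step_name](lines):
        out.write(record + "\n")

# In a real script the call would be: run(sys.argv[1], sys.stdin)
run("identity", ["1000020\t1001820\t20\n"])
```

Keeping each step as a plain function over an iterable of lines makes it testable without Hadoop.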
18. Example - Working Data Set
○ Inspect your working data set ...
○ Each row is one "rating"
○ Each "number of plays" is the "rating value"
Code
cat input/sample_user_artist_data.txt | head
19. Example - Working Data Set
User ID    Artist ID    Number of Plays
1000020    1001820      20
1000020    1003557      1
1000021    700          1
1000029    1011819      1
1000036    1001820      34
1000036    1011819      2
1000036    700          2
1000040    1001820      1
1000057    1011819      37
1000060    700          17
20. Mapper 1 - Count Ratings per Artist
○ Prepend LongValueSum: to the artist ID to form the key
○ More on this later
○ Use a value of "1"
Code
cat input/sample_user_artist_data.txt | ./similarity.py mapper1
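A minimal sketch of what a step like mapper1 might do is shown here. It assumes whitespace-separated input columns (user ID, artist ID, plays) and tab-separated output; the real similarity.py may differ in both respects.

```python
def mapper1(lines):
    """Emit 'LongValueSum:<artist ID>' with a value of 1 for each rating row."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        user_id, artist_id, plays = fields
        # The play count is ignored here: each row counts as one rating.
        yield "LongValueSum:%s\t1" % artist_id

sample = ["1000020 1001820 20", "1000020 1003557 1", "1000021 700 1"]
for record in mapper1(sample):
    print(record)
```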
21. Mapper 1 - Count Ratings per Artist
Artist ID               Number of Ratings
LongValueSum:1001820    1
LongValueSum:1003557    1
LongValueSum:700        1
LongValueSum:1011819    1
LongValueSum:1001820    1
LongValueSum:1011819    1
LongValueSum:700        1
LongValueSum:1001820    1
LongValueSum:1011819    1
LongValueSum:700        1
22. Mapper 1 - Count Ratings per Artist
○ We use the sort command locally
○ We sort by artist ID
○ Emulates shuffle/sort in Hadoop
Code
cat input/sample_user_artist_data.txt | ./similarity.py mapper1 | sort
23. Mapper 1 - Count Ratings per Artist
Artist ID               Number of Ratings
LongValueSum:1001820    1
LongValueSum:1001820    1
LongValueSum:1001820    1
LongValueSum:1003557    1
LongValueSum:1011819    1
LongValueSum:1011819    1
LongValueSum:1011819    1
LongValueSum:700        1
LongValueSum:700        1
LongValueSum:700        1
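The ordering above can be reproduced with a plain string sort, which is a reasonable stand-in for the local sort command on keys like these. The sketch assumes tab-separated key and value; the records are a subset of the mapper output shown earlier.

```python
records = [
    "LongValueSum:1001820\t1",
    "LongValueSum:700\t1",
    "LongValueSum:1011819\t1",
    "LongValueSum:700\t1",
    "LongValueSum:1001820\t1",
]

# Keys compare character by character, not numerically,
# so "700" lands after "1011819" because '7' > '1'.
shuffled = sorted(records)
for record in shuffled:
    print(record)
```

What matters for the reducer is not the order itself but that identical keys end up on adjacent lines.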
24. Reducer 1 - Count Ratings by Artist
○ LongValueSum tells the 'aggregate' reducer to sum the values for each key
○ Group by artist ID
○ Sum up the 1's
○ Emit artist ID as Key, count(ratings) as Value
Code
cat input/sample_user_artist_data.txt \
  | ./similarity.py mapper1 | sort \
  | ./similarity.py reducer1 \
  > input/artist_playcounts.txt
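The grouping and summing step can be sketched locally as follows. This is an assumed stand-in for the 'aggregate' behavior, not the tutorial's code: it expects sorted, tab-separated `LongValueSum:<artist ID>` records and strips the prefix on output.

```python
from itertools import groupby

def reducer1(sorted_lines):
    """Sum the values for each 'LongValueSum:<artist ID>' key."""
    def parse(line):
        key, value = line.rstrip("\n").split("\t")
        return key, int(value)

    parsed = (parse(line) for line in sorted_lines if line.strip())
    # groupby only merges adjacent equal keys, hence the sorted input.
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        artist_id = key.split(":", 1)[1]  # drop the LongValueSum: prefix
        yield "%s\t%d" % (artist_id, sum(value for _, value in group))

sorted_records = (
    ["LongValueSum:1001820\t1"] * 3
    + ["LongValueSum:1003557\t1"]
    + ["LongValueSum:1011819\t1"] * 3
    + ["LongValueSum:700\t1"] * 3
)
counts = list(reducer1(sorted_records))
for record in counts:
    print(record)
```

On the ten-row sample this yields small counts; the larger numbers on the next slide come from the full working data set.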
25. Reducer 1 - Count Ratings by Artist
Artist ID    Number of Ratings
1000143      1905
1000418      184
1001820      12950
700          7243
1003557      2976
1011819      7601
1012511      1881
26. Mapper 2 - User Artist Preferences
○ Mapper2 emits "user ID,artist ID" as the key
○ Mapper2 emits the rating (number of plays) as the value
Code
cat input/sample_user_artist_data.txt | ./similarity.py mapper2 int
27. Mapper 2 - User Artist Preferences
User ID, Artist ID    Number of Plays
1000020,1001820       20
1000020,1003557       1
1000021,700           1
1000029,1011819       1
1000036,1001820       34
1000036,1011819       2
1000036,700           2
1000040,1001820       1
1000057,1011819       37
1000060,700           17
28. Mapper 2 - User Artist Preferences
○ Can large counts skew our results?
○ Apply a log function to the play counts to dampen outliers
Code
cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort
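One way a mapper2-style step could handle the int and log modes is sketched below. The scaling formula `1 + int(log(plays))` is an assumption chosen for illustration, as is the whitespace-separated input; the tutorial's script may scale differently.

```python
import math

def mapper2(lines, scale="int"):
    """Emit 'user ID,artist ID' as key and a (possibly log-scaled) rating."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        user_id, artist_id, plays = fields[0], fields[1], int(fields[2])
        if scale == "log":
            # Assumed damping: natural log, floored, shifted so 1 play -> 1.
            rating = 1 + int(math.log(plays))
        else:
            rating = plays
        yield "%s,%s\t%d" % (user_id, artist_id, rating)

sample = ["1000020 1001820 20", "1000020 1003557 1", "1000036 1001820 34"]
for record in mapper2(sample, "log"):
    print(record)
```

With this scaling, 20 plays and 34 plays map to 3 and 4, so a single heavy listener no longer outweighs many casual ones. Play counts are assumed to be positive, as the deck states.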
30. Reducer 2 - Aggregate User Prefs
○ Reduce for each user
○ Key - user ID
○ Value is a composite of three parts
○ Count(ratings)
○ Sum(rating values)
○ Space-delimited list of artist ID, rating value pairs
Code
cat input/sample_user_artist_data.txt \
  | ./similarity.py mapper2 log | sort \
  | ./similarity.py reducer2
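The per-user aggregation can be sketched as below. The output layout, with `|` separating count, sum, and the artist list, is an assumption made for this example; the slides only say that the value combines those three parts.

```python
from itertools import groupby

def reducer2(sorted_lines):
    """Emit one record per user: count, sum, and 'artist,rating' pairs."""
    def parse(line):
        key, rating = line.rstrip("\n").split("\t")
        user_id, artist_id = key.split(",")
        return user_id, artist_id, int(rating)

    rows = (parse(line) for line in sorted_lines if line.strip())
    # Input is sorted, so all rows for one user are adjacent.
    for user_id, group in groupby(rows, key=lambda row: row[0]):
        ratings = [(artist_id, rating) for _, artist_id, rating in group]
        count = len(ratings)
        total = sum(rating for _, rating in ratings)
        pairs = " ".join("%s,%d" % pair for pair in ratings)
        yield "%s\t%d|%d|%s" % (user_id, count, total, pairs)

# Hypothetical log-scaled mapper2 output, already sorted.
sorted_prefs = [
    "1000036,1001820\t4",
    "1000036,1011819\t1",
    "1000036,700\t1",
    "1000040,1001820\t1",
]
user_prefs = list(reducer2(sorted_prefs))
for record in user_prefs:
    print(record)
```

Carrying the count and sum alongside the list lets later steps filter or normalize per user without another pass over the raw ratings.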