Slides from a talk at a meetup organized by SF Scala at Spotify's San Francisco office. The slides present details of playlist recommendations at Spotify and how Spotify uses Scalding to develop robust and reliable pipelines to generate these recommendations.
Meetup details: http://www.meetup.com/SF-Scala/events/224430674/
2. Who am I?
Nikhil Tibrewal (Nick-hill)
● Data Engineer on Lambda squad (Spotify’s primary ML team)
● Graduated from Carnegie Mellon University in Dec 2013
● B.Sc. in Computer Science + additional major in Econ
● Been part of Spotify band for ~1.5 years
● Worked on a range of projects, primarily Playlist Recommendations
3. Spotify in numbers
● Started in 2006, 58 markets
● 75M+ active users, 20M+ paying
● 30M+ songs, 20K new per day
● 1.5+ billion playlists
● 1 TB logs per day
4. Recommendations so far on Spotify
● Discover tab
● Radio
● Related Artists
● Discover Weekly
● Playlist recs on “Now” Strip
[Screenshot: recommendations for Ellie Goulding]
10. Quick Overview!
● Recommend only human-curated playlists (1000+)
○ Well-designed cover images
○ Thorough descriptions
○ Title reflects content
[Images: one good and one bad example playlist]
13. Quick Overview!
● Recommendations pipeline: Candidate Generation
○ Generate N-dimensional track vectors from collaborative filtering
○ Vectorize playlists:
■ Playlist vector derived from track vectors in playlist
○ Use Annoy to store playlist vectors in N-dimensional space
■ Annoy (Approximate Nearest Neighbors Oh Yeah) was created at Spotify: https://github.com/spotify/annoy
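The slides do not say how a playlist vector is derived from its track vectors. A minimal sketch, assuming a plain element-wise mean (the names `mean` and `playlistVector` are hypothetical, and the averaging itself is an assumption, not the method confirmed by the talk):

```scala
object PlaylistVectors {
  type Vec = Vector[Double]

  // Element-wise mean of the given vectors; None when there is nothing to average.
  def mean(vectors: Seq[Vec]): Option[Vec] =
    if (vectors.isEmpty) None
    else {
      val dim = vectors.head.length
      require(vectors.forall(_.length == dim), "all vectors must have the same dimension")
      val sums = vectors.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      Some(sums.map(_ / vectors.length))
    }

  // Hypothetical helper: look up each track's vector and average the ones found.
  // Tracks without a vector are skipped (an assumption for this sketch).
  def playlistVector(trackIds: Seq[String], trackVectors: Map[String, Vec]): Option[Vec] =
    mean(trackIds.flatMap(trackVectors.get))
}
```

The same averaging could be applied to the tracks in a user's listening history, which would place user and playlist vectors in the same N-dimensional space.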
15. Quick Overview!
● Recommendations pipeline: Candidate Generation (continued)
○ Vectorize user taste as well:
■ User vector derived from user listening history
○ User and playlist vectors are in the same space!
○ Query the Annoy tree for the playlists nearest to the user:
annoyTree.getNearest(seedVector, K)
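To make the nearest-playlist query concrete, here is a minimal sketch that swaps Annoy for an exact brute-force scan. The `nearest` function is hypothetical and is not Annoy's API, and cosine similarity is an assumption, since the slides do not name the distance metric.

```scala
object NearestPlaylists {
  type Vec = Vector[Double]

  private def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum
  private def norm(a: Vec): Double = math.sqrt(dot(a, a))

  // Cosine similarity; defined as 0.0 when either vector has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val denom = norm(a) * norm(b)
    if (denom == 0.0) 0.0 else dot(a, b) / denom
  }

  // Exact top-k by scanning every playlist. Annoy answers the same question
  // approximately, and much faster, by searching a forest of trees.
  def nearest(seed: Vec, playlists: Map[String, Vec], k: Int): Seq[String] =
    playlists.toSeq
      .map { case (id, v) => (id, cosine(seed, v)) }
      .sortBy { case (_, sim) => -sim }
      .take(k)
      .map(_._1)
}
```

A scan like this is linear in the number of playlists per query, which is why an approximate index is attractive at the scale of 1.5+ billion playlists.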
17. Quick Overview!
● Recommendations pipeline: Ranking Model
○ Use genre information, demographic data, and playlist popularity data to further rank recommendations
■ Example: John is 21, lives in the USA, and likes rock
■ He should get rock playlist recs that are popular in the USA and among 21-year-olds
○ Apply post-processing steps that shuffle results and add variety to avoid repetition
90% of DAUs (daily active users) have recs!
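The re-ranking idea from the John example can be sketched as a weighted score over the candidates. Everything here is hypothetical: the record fields, the linear form, and the weights are illustrative assumptions, not the model described in the talk.

```scala
object RankingSketch {
  case class User(age: Int, country: String, favoriteGenre: String)

  case class Candidate(
      playlistId: String,
      genre: String,
      similarity: Double,                       // score from candidate generation
      popularityByCountry: Map[String, Double], // values assumed to lie in [0, 1]
      popularityByAge: Map[Int, Double]         // values assumed to lie in [0, 1]
  )

  // Hypothetical linear score; the weights are illustrative, not tuned values.
  def score(u: User, c: Candidate): Double = {
    val genreMatch = if (c.genre == u.favoriteGenre) 1.0 else 0.0
    val countryPop = c.popularityByCountry.getOrElse(u.country, 0.0)
    val agePop     = c.popularityByAge.getOrElse(u.age, 0.0)
    0.4 * c.similarity + 0.3 * genreMatch + 0.2 * countryPop + 0.1 * agePop
  }

  // Highest-scoring candidates first.
  def rank(u: User, candidates: Seq[Candidate]): Seq[String] =
    candidates.sortBy(c => -score(u, c)).map(_.playlistId)
}
```

With these weights, a rock playlist that is popular in the USA among 21-year-olds outranks a jazz playlist that is closer in vector space but popular elsewhere.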
18. Quick Overview!
● Infrastructure
○ Luigi to manage workflow (also built at Spotify)
○ Entire pipeline written in Scalding
○ 1200+ nodes Hadoop cluster to run jobs
○ Cassandra (~dozen nodes for playlist recs)
○ Java backend micro-services serving recs
19. Quick Overview!
"Scalding is comprised of a DSL (domain-specific language) that makes MapReduce computations look like Scala’s collection API and is a wrapper for Cascading to make it easy to define jobs, test and data sources on an HDFS" (http://cascading.io/customer/twitter/)
20. Scalding w.r.t. Playlist Recs
● Used Python back in the day
○ Inputs and outputs were tab-separated
○ As complexity went up, so did the difficulty of maintenance
○ Hard to write tests
● Scalding provided compile-time error checks
○ Catch errors early
○ Define schemas (e.g. Avro)
● Can use Parquet + Avro for input/output
○ Easy to write and read data
○ Handles records with a lot of fields!
○ Lesson: Parquet hurts performance with fat columns (nested data structures)
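The contrast between untyped tab-separated rows and schema-backed records can be illustrated in plain Scala. This is a sketch under stated assumptions: `PlaylistRec` and its fields are hypothetical, and a case class stands in for an Avro-generated record.

```scala
object TypedRecords {
  // Hypothetical record type; the field names are illustrative.
  final case class PlaylistRec(userId: String, playlistId: String, score: Double)

  // Parse one tab-separated line, reporting malformed rows instead of failing later.
  def parse(line: String): Either[String, PlaylistRec] =
    line.split("\t", -1) match {
      case Array(user, playlist, rawScore) =>
        scala.util.Try(rawScore.toDouble).toOption
          .map(s => PlaylistRec(user, playlist, s))
          .toRight(s"score is not a number: $rawScore")
      case fields => Left(s"expected 3 fields, got ${fields.length}")
    }

  // Downstream code uses named, typed fields; a misspelled field or a
  // String/Double mix-up is rejected by the compiler rather than at run time.
  def topPlaylist(recs: Seq[PlaylistRec]): Option[String] =
    if (recs.isEmpty) None else Some(recs.maxBy(_.score).playlistId)
}
```

Once rows are parsed into a typed record at the pipeline boundary, every later step gets the compile-time checks the slide refers to.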
25. Scalding w.r.t. Playlist Recs
● Pipeline tolerance
○ Job failures are normal, and annoying with big jobs
○ Scalding checkpoints
○ Lesson: checkpoint itself is a map-reduce job and has the same caveats
○ Still very helpful!
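The idea behind a checkpoint can be shown without Scalding: save the output of an expensive step, and reuse it on a rerun instead of recomputing. This sketch uses a local file and a hypothetical `checkpoint` helper; it is not Scalding's checkpoint API.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

object CheckpointSketch {
  // Reuse the saved lines if the checkpoint file exists; otherwise run the
  // expensive step once and save its output before returning it.
  // Simplification: empty lines are dropped when reading the file back.
  def checkpoint(path: Path)(compute: => Seq[String]): Seq[String] =
    if (Files.exists(path)) {
      new String(Files.readAllBytes(path), UTF_8).split("\n").filter(_.nonEmpty).toSeq
    } else {
      val result = compute
      Files.write(path, result.mkString("\n").getBytes(UTF_8))
      result
    }
}
```

Writing the saved copy is extra work in its own right, which mirrors the lesson on the slide: on a cluster the checkpoint is itself a MapReduce job with the same caveats.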
26. Scalding w.r.t. Playlist Recs
● Job runtimes
○ Common solutions: more reducers and code optimizations
○ Speculative execution for larger jobs
○ Caveat: can take up unnecessary resources
28. Scalding w.r.t. Playlist Recs
● Memory issues
○ Used Sparkey indices in Python (developed at Spotify, now open source)
■ “Simple constant key/value storage lib for read-heavy systems with infrequent large bulk inserts”
■ Replicated to all mappers
○ Complex jobs in Scalding => higher memory config for jobs with Sparkey
○ Lesson: use joins instead, trading memory resources for MAYBE a little more run time:
bigPipe.join(exSparkeyPipe)
https://github.com/spotify/sparkey
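The trade-off between a replicated lookup table and a join can be sketched with plain Scala collections. This is an analogy under stated assumptions, not Scalding or Sparkey code: `Play`, `viaLookup`, and `viaJoin` are hypothetical names, and an in-memory `Map` stands in for a Sparkey index.

```scala
object LookupVsJoin {
  case class Play(userId: String, trackId: String)

  // Lookup style: every worker holds the whole side table in memory,
  // which is roughly what replicating an index to all mappers amounts to.
  def viaLookup(plays: Seq[Play], artistByTrack: Map[String, String]): Seq[(String, String)] =
    plays.flatMap(p => artistByTrack.get(p.trackId).map(artist => (p.userId, artist)))

  // Join style: both sides are grouped by key, so no worker needs the full
  // side table at once; in MapReduce this costs a shuffle instead of memory.
  def viaJoin(plays: Seq[Play], trackArtists: Seq[(String, String)]): Seq[(String, String)] = {
    val playsByTrack   = plays.groupBy(_.trackId)
    val artistsByTrack = trackArtists.groupBy(_._1)
    for {
      trackId <- playsByTrack.keys.toSeq
      play    <- playsByTrack(trackId)
      pair    <- artistsByTrack.getOrElse(trackId, Seq.empty)
    } yield (play.userId, pair._2)
  }
}
```

Both functions return the same pairs (possibly in a different order); the difference is where the side table lives while the work is done.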
29. Scalding w.r.t. Playlist Recs
● Driven
○ “A sophisticated tool that collects telemetry data from running Scalding / Cascading jobs on a cluster and presenting them in an intriguing User Interface.”
○ http://cascading.io/
33. Scalding w.r.t. Playlist Recs
● Other awesome benefits
○ Active community + big players
○ Data pipeline flows naturally follow the functional paradigm: you are essentially writing Scala code