Playlist Recommendations @ Spotify

Nikhil Tibrewal
@nikhil_tibrewal

Who am I?
Nikhil Tibrewal (Nick-hill)
● Data Engineer on Lambda squad (Spotify’s primary ML team)
● Graduated from Carnegie Mellon University in Dec 2013
● B.Sc. in Computer Science + additional major in Econ
● Part of the Spotify band for ~1.5 years
● Worked on a range of projects, primarily Playlist Recommendations

Spotify in numbers
● Started in 2006, 58 markets
● 75M+ active users, 20M+ paying
● 30M+ songs, 20K new per day
● 1.5+ billion playlists
● 1 TB logs per day

Recommendations so far on Spotify
● Discover tab
● Radio
● Related Artists
● Discover Weekly
● Playlist recs on “Now” Strip
(Screenshot shown for Ellie Goulding)

“Now” Strip
(Screenshot labels: human-curated playlist, recommended playlist)

But…
How are playlist recs generated?

Quick Overview!
● Recommend only human-curated playlists (1000+)
○ Well-designed cover images
○ Thorough descriptions
○ Title reflects content

Quick Overview!
(Figure: side-by-side examples labeled “Good” and “Bad”)

Quick Overview!
● Recommendations pipeline: Candidate Generation
○ Generate N-dimensional track vectors from collaborative filtering

Quick Overview!
○ Vectorize playlists:
■ Playlist vector derived from track vectors in playlist
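
The slide only says the playlist vector is "derived from" its track vectors, so the exact aggregation is not stated. A minimal sketch, assuming a plain element-wise mean (an assumption, not necessarily Spotify's method); the track IDs, the 3-dimensional vectors, and the function names are hypothetical:

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share one dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def playlist_vector(track_ids: Sequence[str],
                    track_vectors: Dict[str, Vector]) -> Vector:
    """Assumed aggregation: average the vectors of the tracks we know about."""
    known = [track_vectors[t] for t in track_ids if t in track_vectors]
    return mean_vector(known)


# Hypothetical track vectors; real ones would come from collaborative filtering.
track_vectors = {
    "track_a": [1.0, 0.0, 2.0],
    "track_b": [3.0, 2.0, 0.0],
}
example = playlist_vector(["track_a", "track_b", "unknown"], track_vectors)
print(example)
```

Tracks without a vector are skipped here; a weighted mean (for example by track position) would be an equally plausible choice.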

Quick Overview!
○ Use Annoy to store playlist vectors in N-dimensional space
■ Annoy (Approximate Nearest Neighbors Oh Yeah) was created at Spotify
■ https://github.com/spotify/annoy

Quick Overview!
○ Vectorize user taste as well:
■ User vector derived from user listening history
○ User and playlist vectors in same space!
○ Query for nearest playlists to user from Annoy tree
annoyTree.getNearest(seedVector, K)
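
To make the query above concrete, here is a minimal sketch of what a nearest-playlist lookup computes. It is an exact brute-force scan, not Annoy itself: Annoy answers the same kind of query approximately, using a forest of trees instead of scanning every vector. Cosine similarity as the distance, the playlist names, and the 2-dimensional vectors are all assumptions for illustration:

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_playlists(user_vec: Sequence[float],
                      playlist_vecs: Dict[str, Sequence[float]],
                      k: int) -> List[str]:
    """Exact top-k scan: score every playlist, keep the k most similar."""
    scored = [(pid, cosine_similarity(user_vec, vec))
              for pid, vec in playlist_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [pid for pid, _ in scored[:k]]


# Hypothetical playlist vectors living in the same space as the user vector.
playlist_vecs = {
    "rock_classics": [1.0, 0.0],
    "indie_mix": [1.0, 1.0],
    "deep_house": [0.0, 1.0],
}
user_vec = [2.0, 0.5]
top2 = nearest_playlists(user_vec, playlist_vecs, 2)
print(top2)
```

The scan costs time proportional to the number of playlists per query, which is the cost an approximate index is meant to avoid at scale.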

Quick Overview!
● Recommendations pipeline: Ranking Model
○ Use genre information, demographic data, and playlist popularity data to further rank recommendations
■ Example: John is 21, lives in the USA, and likes rock
■ He should get rock playlist recs that are popular in the USA and among 21-year-olds
○ Apply post-processing steps that shuffle results and add variety to avoid repetition
90% of DAUs have recs!
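
The ranking step can be pictured with a small sketch. The slides do not describe the actual model, so everything here is assumed for illustration: the additive score, the `genre_weight` parameter, the popularity table keyed by (country, age), and the "recently shown" filter standing in for the post-processing that avoids repetition.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class User:
    age: int
    country: str
    favorite_genre: str


@dataclass
class Playlist:
    playlist_id: str
    genre: str
    # Hypothetical popularity in [0, 1], keyed by (country, age).
    popularity: Dict[Tuple[str, int], float]


def score(user: User, playlist: Playlist, genre_weight: float = 1.0) -> float:
    """Assumed score: genre match bonus plus popularity in the user's demographic."""
    genre_bonus = genre_weight if playlist.genre == user.favorite_genre else 0.0
    demo_popularity = playlist.popularity.get((user.country, user.age), 0.0)
    return genre_bonus + demo_popularity


def rank(user: User, candidates: List[Playlist],
         recently_shown: Set[str]) -> List[Playlist]:
    """Drop playlists shown recently, then sort the rest by descending score."""
    fresh = [p for p in candidates if p.playlist_id not in recently_shown]
    return sorted(fresh, key=lambda p: score(user, p), reverse=True)


john = User(age=21, country="US", favorite_genre="rock")
candidates = [
    Playlist("jazz_evenings", "jazz", {("US", 21): 0.9}),
    Playlist("rock_anthems", "rock", {("US", 21): 0.6}),
    Playlist("rock_ballads", "rock", {("US", 21): 0.2}),
    Playlist("rock_repeat", "rock", {("US", 21): 0.8}),
]
ranked_ids = [p.playlist_id for p in rank(john, candidates, {"rock_repeat"})]
print(ranked_ids)
```

With these made-up numbers the rock playlists outrank the more popular jazz playlist because of the genre bonus, which matches the John example on the slide.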

Quick Overview!
● Infrastructure
○ Luigi to manage workflow (also built at Spotify)
○ Entire pipeline written in Scalding
○ 1200+ node Hadoop cluster to run jobs
○ Cassandra (~dozen nodes for playlist recs)
○ Java backend microservices serving recs

Quick Overview!
"Scalding is comprised of a DSL (domain-specific language)
that makes MapReduce computations look like Scala’s
collection API and is a wrapper for Cascading to make it easy
to define jobs, test and data sources on an HDFS"
(http://cascading.io/customer/twitter/)

Scalding w.r.t. Playlist Recs
● Used Python back in the day
○ Inputs and outputs were tab separated
○ Complexity UP => Difficulty to maintain UP
○ Hard to write tests
● Scalding provided compile-time error checks
○ Catch errors early
○ Define schemas (e.g. Avro)
● Can use Parquet + Avro for input/output
○ Easy to write and read data
○ Records with a lot of fields!
○ Lesson: Parquet hurts performance w/ fat columns (nested data structs)

● Data quality
○ Hadoop counters wrappers in extended Scalding library code
○ Verify counters within reasonable ranges
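
The counter check described above can be sketched in a few lines. This is not the Scalding library code mentioned on the slide; the counter names, the expected ranges, and the `check_counters` helper are hypothetical:

```python
from typing import Dict, List, Tuple


def check_counters(counters: Dict[str, int],
                   expected_ranges: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return one message per counter that is missing or out of its range."""
    problems = []
    for name, (low, high) in expected_ranges.items():
        if name not in counters:
            problems.append(f"{name}: missing")
            continue
        value = counters[name]
        if not low <= value <= high:
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems


# Hypothetical counter values collected after a job run.
counters = {"playlists_read": 1200, "users_without_vector": 50000}
expected = {
    "playlists_read": (1000, 5000),
    "users_without_vector": (0, 10000),
    "recs_written": (1, 10**9),
}
problems = check_counters(counters, expected)
for problem in problems:
    print(problem)
```

A pipeline could fail the run, or skip publishing the output, whenever the returned list is non-empty.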

● Pipeline tolerance
○ Job failures are normal, and annoying with big jobs
○ Scalding checkpoints
○ Lesson: a checkpoint is itself a MapReduce job and has the same caveats
○ Still very helpful!
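
The idea behind checkpointing can be shown without Scalding: persist an intermediate result once, and reuse it when the pipeline is rerun after a later failure. A minimal sketch, assuming a local JSON file as the checkpoint store; the `checkpointed` helper and the step are hypothetical and are not Scalding's checkpoint API:

```python
import json
import os
import tempfile
from typing import Any, Callable


def checkpointed(path: str, compute: Callable[[], Any]) -> Any:
    """Reuse the saved result at `path` if present, else compute and save it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(result, f)
    # Rename last, so a crash mid-write never leaves a partial checkpoint.
    os.replace(tmp_path, path)
    return result


calls = []


def expensive_step() -> dict:
    calls.append(1)  # record how many times the step really ran
    return {"playlist_count": 3}


with tempfile.TemporaryDirectory() as workdir:
    checkpoint_path = os.path.join(workdir, "step1.json")
    first = checkpointed(checkpoint_path, expensive_step)
    second = checkpointed(checkpoint_path, expensive_step)  # served from disk

print(len(calls), first == second)
```

Writing the checkpoint is extra work in its own right, which mirrors the lesson above: the saved state has a cost and can itself fail.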

● Job runtimes
○ Common solutions: more reducers and code optimizations
○ Speculative execution for larger jobs
○ Caveat: can take up unnecessary resources

● Memory issues
○ Used Sparkey indices in Python (developed at Spotify, now open source)
■ “Simple constant key/value storage lib for read-heavy systems with infrequent large bulk inserts”
■ Replicated to all mappers
○ Complex jobs in Scalding => higher memory config for jobs with Sparkey
○ Lesson: replace the Sparkey lookup with a join, trading lower memory use for MAYBE a little more runtime
bigPipe.join(exSparkeyPipe)
https://github.com/spotify/sparkey
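
The memory-versus-time trade in the lesson above can be illustrated outside Hadoop. In this sketch a dict stands in for a lookup table replicated to every mapper, and a sort-merge join over two key-sorted streams stands in for a distributed join. Both stand-ins, the sample data, and the function names are assumptions for illustration, not Sparkey or Scalding code:

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, str]
Joined = Tuple[str, str, str]


def lookup_join(big: Iterable[Row], side_table: Dict[str, str]) -> List[Joined]:
    """Lookup style: the whole side table must fit in memory on every worker."""
    return [(k, v, side_table[k]) for k, v in big if k in side_table]


def merge_join(big_sorted: Iterable[Row],
               side_sorted: Iterable[Row]) -> Iterator[Joined]:
    """Inner join of two key-sorted streams, holding one side row at a time.

    Assumes keys in `side_sorted` are unique.
    """
    side_iter = iter(side_sorted)
    side = next(side_iter, None)
    for key, value in big_sorted:
        # Advance the side stream until it catches up with the current key.
        while side is not None and side[0] < key:
            side = next(side_iter, None)
        if side is not None and side[0] == key:
            yield (key, value, side[1])


# Hypothetical data: play events keyed by track, and a track -> genre table.
big = [("t1", "play"), ("t2", "skip"), ("t2", "play"), ("t4", "play")]
side = [("t1", "rock"), ("t2", "jazz"), ("t3", "pop")]

via_lookup = lookup_join(big, dict(side))
via_merge = list(merge_join(big, side))
print(via_merge)
```

Both functions produce the same rows. The merge version needs sorted inputs, which is the extra time a join may spend, but its memory use does not grow with the size of the side table.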

● Driven
○ “A sophisticated tool that collects telemetry data from running Scalding / Cascading jobs on a cluster and presenting them in an intriguing User Interface.”
○ http://cascading.io/

● Other awesome benefits
○ Active community + big players
○ Data pipeline flows naturally follow the functional paradigm, so writing a job is essentially writing Scala code

Productivity without sacrificing performance!

Status: Completed
Spotify is hiring!
Nikhil Tibrewal
@nikhil_tibrewal
