This document provides an overview of Scala data pipelines at Spotify. It discusses:
- The speaker's background and Spotify's scale with over 75 million active users.
- Spotify's music recommendation systems including Discover Weekly and personalized radio.
- How Scala and frameworks like Scalding, Spark, and Crunch are used to build data pipelines for tasks like joins, aggregations, and machine learning algorithms.
- Techniques for optimizing pipelines including distributed caching, bloom filters, and Parquet for efficient storage and querying of large datasets.
- How the team went from 100+ Python Luigi jobs with few tests to 300+ tested Scalding jobs, while onboarding engineers new to Scala.
2. Who am I?
• Spotify NYC since 2011
• Formerly Yahoo! Search
• Music recommendations
• Data infrastructure
• Scala since 2013
3. Spotify in numbers
• Started in 2006, 58 markets
• 75M+ active users, 20M+ paying
• 30M+ songs, 20K new per day
• 1.5 billion playlists
• 1 TB logs per day
• 1200+ node Hadoop cluster
• 10K+ Hadoop jobs per day
4. Music recommendation @ Spotify
• Discover Weekly
• Radio
• Related Artists
• Discover Page
6. A little teaser
PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn, CombineFn<K,V> reduceFn)
Crunch: CombineFns are used to represent the associative operations…
Grouped[K, +V]::reduce[U >: V](fn: (U, U) => U)
Scalding: reduce with fn which must be associative and commutative…
PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V)
Spark: Merge the values for each key using an associative reduce function…
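A minimal side-by-side sketch of the same per-key sum in the Scalding and Spark APIs; the play-count data and names are made up for illustration:

import com.twitter.scalding.TypedPipe
import org.apache.spark.rdd.RDD

// hypothetical (user, playCount) pairs
val plays = Seq(("alice", 1L), ("bob", 2L), ("alice", 3L))

// Scalding: reduce on a Grouped pipe; fn must be associative and commutative
val scaldingCounts = TypedPipe.from(plays).group.reduce(_ + _)

// Spark: reduceByKey merges the values for each key with an associative fn
def sparkCounts(rdd: RDD[(String, Long)]): RDD[(String, Long)] =
  rdd.reduceByKey(_ + _)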
8. One more teaser
Linear equation in Alternating Least Squares (ALS) matrix factorization
x_u = (Y^T Y + Y^T (C^u − I) Y)^-1 Y^T C^u p(u)
vectors.map { case (id, v) => (id, v * v) }.map(_._2).reduce(_ + _) // YtY
ratings.keyBy(fixedKey).join(outerProducts) // YtCuIY
  .map { case (_, (r, op)) =>
    (solveKey(r), op * (r.rating * alpha))
  }.reduceByKey(_ + _)
ratings.keyBy(fixedKey).join(vectors) // YtCupu
  .map { case (_, (r, v)) =>
    val Cui = r.rating * alpha + 1
    val pui = if (Cui > 0.0) 1.0 else 0.0
    (solveKey(r), v * (Cui * pui))
  }.reduceByKey(_ + _)
http://www.slideshare.net/MrChrisJohnson/scala-data-pipelines-for-music-recommendations
9. Success story
• Mid 2013: 100+ Python Luigi M/R jobs, few tests
• 10+ new hires since, most fresh grads
• Few with Java experience, none with Scala
• Now: 300+ Scalding jobs, 400+ tests
• More ad-hoc jobs untracked
• Spark also taking off
19. Key-value file as distributed cache
val streams: TypedPipe[(String, String)] = ??? // (track gid, user)
val tgp: SparkeyManager = ??? // Sparkey file replicated to all mappers
streams
  .map { case (track, user) =>
    (user, tgp.get(track).split(",").toSet)
  }
  .group
  .sum
https://github.com/spotify/sparkey
SparkeyManager wraps DistributedCacheFile
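For reference, a rough sketch of reading a Sparkey file directly with sparkey-java, based on the library's README; the file name and key are made up:

import java.io.File
import com.spotify.sparkey.Sparkey

// open an off-heap Sparkey index shipped to the mapper via the distributed cache
val reader = Sparkey.open(new File("track-genres.spi"))
val genres = Option(reader.getAsString("some-track-gid")).map(_.split(",").toSet)
reader.close()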
20. Joins and CoGroups
• Require shuffle and reduce step
• Some ops force everything to reducers, e.g. mapGroup, mapValueStream
• CoGroup more flexible for complex logic
• Scalding flattens a.join(b).join(c)… into MultiJoin(a, b, c, …) (see the sketch below)
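A minimal sketch of a chained join in the typed API; the pipes and value types are hypothetical:

import com.twitter.scalding.TypedPipe

// hypothetical per-user pipes, keyed by user id
val plays: TypedPipe[(String, Int)] = ???
val skips: TypedPipe[(String, Int)] = ???
val saves: TypedPipe[(String, Int)] = ???

// a.join(b).join(c): Scalding plans this chain as a single MultiJoin
val joined = plays.group
  .join(skips.group)
  .join(saves.group)                            // values nest as ((plays, skips), saves)
  .mapValues { case ((p, s), v) => (p, s, v) }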
21. Distributed cache
• Faster with off-heap binary files
• Building cache = more wiring
• Memory mapping may interfere with YARN
• E.g. 64 GB nodes with 48 GB for containers (no cgroups)
• 12 × 2 GB containers, each with 2 GB JVM heap + mmap cache
• OOM and swap!
• Keep files small (< 1 GB) or fall back to joins…
22. Analyze your jobs
• Concurrent Driven
• Visualize job execution
• Workflow optimization
• Bottlenecks
• Data skew
24. Recommending tracks
• User listened to Rammstein - Du Hast
• Recommend 10 similar tracks
• 40-dimension feature vectors for tracks
• Compute cosine similarity between all pairs
• O(n) lookup per user where n ≈ 30M (brute force sketched below)
• Try that with 50M users × 10 seed tracks each
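A minimal sketch of the brute-force version, assuming 40-dimensional track vectors stored as Array[Double]; all names are illustrative only:

def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

// O(n) scan over ~30M candidate vectors for every seed track
def top10(seed: Array[Double], candidates: Map[String, Array[Double]]): Seq[(String, Double)] =
  candidates.toSeq
    .map { case (id, v) => (id, cosine(seed, v)) }
    .sortBy(-_._2)
    .take(10)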
25. ANNOY - cheat by approximation
• Approximate Nearest Neighbor Oh Yeah
• Random projections and binary tree search
• Build index on single machine
• Load in mappers via distributed cache
• O(log n) lookup
https://github.com/spotify/annoy
https://github.com/spotify/annoy-java
27. Filtering candidates
• Users don't like seeing artists/albums/tracks they already know
• But may forget what they listened to long ago
• 50M users × thousands of items each
• Over 5 years of streaming logs
• Need to update daily
• Need to purge old items per user
28. Options
• Aggregate all logs daily
• Aggregate last x days daily
• CSV of artist/album/track ids
• Bloom filters
29. Decayed value with cutoff
• Compute new user-item score daily
• Weighted on context, e.g. radio, search, playlist
• score' = score + previous × 0.99 (see the sketch below)
• half life = log(0.5) / log(0.99) ≈ 69 days
• Cut off at top 2000
• Items that users might remember seeing recently
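A minimal sketch of the daily update under these rules; the decay constant and cutoff come from the slide, everything else is illustrative:

val decay = 0.99
// number of days until a score decays to half: log(0.5) / log(0.99) ≈ 69
val halfLife = math.log(0.5) / math.log(decay)

// score' = today's score + previous * 0.99, then keep the top 2000 items per user
def update(previous: Map[String, Double], today: Map[String, Double]): Map[String, Double] = {
  val items = previous.keySet ++ today.keySet
  items.map { item =>
    item -> (today.getOrElse(item, 0.0) + previous.getOrElse(item, 0.0) * decay)
  }.toSeq.sortBy(-_._2).take(2000).toMap
}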
30. Bloom filters
• Probabilistic data structure
• Encodes a set of items with m bits and k hash functions
• No false negatives
• Tunable false positive probability
• Size proportional to capacity & FP probability
• Let's build one per user-{artists,albums,tracks}
• Algebird BloomFilterMonoid: z = all zero bits, + = bitwise OR (see the sketch below)
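A minimal sketch with the string-based Algebird API of that era, where BloomFilter(numEntries, fpProb) returns a BloomFilterMonoid; capacity, FP rate and items are made up:

import com.twitter.algebird.BloomFilter

val bfMonoid = BloomFilter(2000, 0.01)   // capacity 2000, 1% false positives

// the monoid's zero is all zero bits and + is bitwise OR, so per-user filters
// can be built as a plain sum over (user, item) pairs
val bf = bfMonoid.create("track:du-hast", "track:engel")
val seen = bf.contains("track:du-hast").isTrue   // true, no false negatives
val other = bf.contains("track:sonne").isTrue    // false, up to the FP rate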
31. Size versus max items & FP prob
• User-item distribution is uneven
• Assuming same setting for all users
• # items << capacity → wasting space
• # items > capacity → high FP rate
32. Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
33.-36. Scalable Bloom Filter (diagram: items are inserted into the n=1k BF first; each time a BF fills up, a larger one is added: n=1k → n=10k → n=100k → n=1m)
37. Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to << N max possible items
• Keep the smallest one with capacity > items inserted (see the sketch below)
• Expensive to build
• Cheap to store and lookup
38.-53. Opportunistic Bloom Filter (diagram: the n BFs, starting at n=1k, are filled in parallel; at the end only the smallest BF whose capacity exceeds the inserted items is kept)
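A minimal sketch of the idea, not Spotify's implementation, reusing the same Algebird BloomFilterMonoid as above; capacities and FP rate are illustrative:

import com.twitter.algebird.BloomFilter

val capacities = Seq(1000, 10000, 100000, 1000000)
val fpProb = 0.01

// build one BF per capacity in parallel, then keep the smallest one
// whose capacity still exceeds the number of items actually inserted
def opportunistic(items: Seq[String]) = {
  val candidates = capacities.map(n => n -> BloomFilter(n, fpProb).create(items: _*))
  candidates
    .find { case (capacity, _) => capacity > items.size }
    .getOrElse(candidates.last)
    ._2
}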
60. Track metadata
• Label dump → content ingestion
• Third party track genres, e.g. GraceNote
• Audio attributes, e.g. tempo, key, time signature
• Cultural data, e.g. popularity, tags
• Latent vectors from collaborative filtering
• Many sources for album, artist, user metadata too
61. Multiple data sources
• Big joins
• Complex dependencies
• Wide rows with few columns accessed
• Wasting I/O
62. Apache Parquet
• Pre-join sources into mega-datasets
• Store as Parquet columnar storage
• Column projection
• Predicate pushdown
• Avro within Scalding pipelines
63. Projection
pipe.map(a => (a.getName, a.getAmount))
versus
Parquet.project[Account]("name", "amount")
• Strings → unsafe and error prone
• No IDE auto-completion → finger injury
• my_fancy_field_name → .getMyFancyFieldName
• Hard to migrate existing code