SlideShare ist ein Scribd-Unternehmen logo
1 von 51
Downloaden Sie, um offline zu lesen
Scala Data
Pipelines @
Spotify
Neville Li
@sinisa_lyh
Who am I?
‣ SpotifyNYCsince2011
‣ FormerlyYahoo!Search
‣ Musicrecommendations
‣ Datainfrastructure
‣ Scalasince2013
Spotify in numbers
• Started in 2006, 58 markets
• 75M+ active users, 20M+ paying
• 30M+ songs, 20K new per day
• 1.5 billion playlists
• 1 TB logs per day
• 1200+ node Hadoop cluster
• 10K+ Hadoop jobs per day
Music recommendation @ Spotify
• Discover Weekly
• Radio
• RelatedArtists
• Discover Page
Recommendation systems
A little teaser
PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn,
CombineFn<K,V> reduceFn)
Crunch: CombineFns are used to represent the associative operations…
Grouped[K, +V]::reduce[U >: V](fn: (U, U) U)
Scalding: reduce with fn which must be associative and commutative…
PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V)
Spark: Merge the values for each key using an associative reduce function…
Monoid!
enables map side reduce
Actually it’s a semigroup
One more teaser
Linear equation inAlternate Least Square (ALS) Matrix factorization
xu = (YTY + YT(Cu − I)Y)−1YTCup(u)
vectors.map { case (id, v) => (id, v * v) }.map(_._2).reduce(_ + _) // YtY
ratings.keyBy(fixedKey).join(outerProducts) // YtCuIY
.map { case (_, (r, op)) =>
(solveKey(r), op * (r.rating * alpha))
}.reduceByKey(_ + _)
ratings.keyBy(fixedKey).join(vectors) // YtCupu
.map { case (_, (r, v)) =>
val (Cui, pui) = (r.rating * alpha + 1, if (Cui > 0.0) 1.0 else 0.0)
(solveKey(r), v * (Cui * pui))
}.reduceByKey(_ + _)
http://www.slideshare.net/MrChrisJohnson/scala-data-pipelines-for-music-recommendations
Success story
• Mid 2013: 100+ Python Luigi M/R jobs, few tests
• 10+ new hires since, most fresh grads
• Few with Java experience, none with Scala
• Now: 300+ Scalding jobs, 400+ tests
• More ad-hoc jobs untracked
• Spark also taking off
First 10 months
……
Activity over time
Guess how many jobs
written by yours truly?
Performance vs. Agility
https://nicholassterling.wordpress.com/2012/11/16/scala-performance/
Let’sdiveinto
something
technical
To join or not to join?
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, genre)
streams
.join(tgp)
.values // (user, genre)
.group
.mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
Hash join
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, genre)
streams
.hashJoin(tgp.forceToDisk) // tgp replicated to all mappers
.values // (user, genre)
.group
.mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
CoGroup
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, genre)
streams
.cogroup(tgp) { case (_, users, genres) =>
users.map((_, genres.toSet))
} // (track, (user, genres))
.values // (user, genres)

.group
.reduce(_ ++ _) // map-side reduce!
CoGroup
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, genre)
streams
.cogroup(tgp) { case (_, users, genres) =>
users.map((_, genres.toSet))
} // (track, (user, genres))
.values // (user, genres)

.group
.sum // SetMonoid[Set[T]] from Algebird
* sum[U >:V](implicit sg: Semigroup[U])
Key-value file as distributed cache
val streams: TypedPipe[(String, String)] = _ // (gid, user)
val tgp: SparkeyManager = _ // tgp replicated to all mappers
streams
.map { case (track, user) =>
(user, tgp.get(track).split(",").toSet)
}
.group
.sum
https://github.com/spotify/sparkey
SparkeyManagerwraps DistributedCacheFile
Joins and CoGroups
• Require shuffle and reduce step
• Some ops force everything to reducers

e.g. mapGroup, mapValueStream
• CoGroup more flexible for complex logic
• Scalding flattens a.join(b).join(c)…

into MultiJoin(a, b, c, …)
Distributed cache
• Fasterwith off-heap binary files
• Building cache = more wiring
• Memory mapping may interfere withYARN
• E.g. 64GB nodes with 48GB for containers (no cgroup)
• 12 × 2GB containers each with 2GB JVM heap + mmap cache
• OOM and swap!
• Keep files small (< 1GB) or fallback to joins…
Analyze your jobs
• Concurrent Driven
• Visualize job execution
• Workflow optimization
• Bottlenecks
• Data skew
Notenough
math?
Recommending tracks
• User listened to Rammstein - Du Hast
• Recommend 10 similartracks
• 40 dimension feature vectors fortracks
• Compute cosine similarity between all pairs
• O(n) lookup per userwhere n ≈ 30m
• Trythat with 50m users * 10 seed tracks each
ANNOY - cheat by approximation
• Approximate Nearest Neighbor OhYeah
• Random projections and binarytree search
• Build index on single machine
• Load in mappers via distribute cache
• O(log n) lookup
https://github.com/spotify/annoy
https://github.com/spotify/annoy-java
ANN Benchmark
https://github.com/erikbern/ann-benchmarks
Filtering candidates
• Users don’t like seeing artist/album/tracks they already know
• But may forget what they listened long ago
• 50m * thousands of items each
• Over 5 years of streaming logs
• Need to update daily
• Need to purge old items per user
Options
• Aggregate all logs daily
• Aggregate last x days daily
• CSVof artist/album/track ids
• Bloom filters
Decayed value with cutoff
• Compute new user-item score daily
• Weighted on context, e.g. radio, search, playlist
• score’ = score + previous * 0.99
• half life = log0.99
0.5 = 69 days
• Cut off at top 2000
• Items that users might remember seeing recently
Bloom filters
• Probabilistic data structure
• Encoding set of items with m bits and k hash functions
• No false negative
• Tunable false positive probability
• Size proportional to capacity & FP probability
• Let’s build one per user-{artists,albums,tracks}
• Algebird BloomFilterMonoid: z = all zero bits, + = bitwise OR
Size versus max items & FP prob
• User-item distribution is uneven
• Assuming same setting for all users
• # items << capacity → wasting space
• # items > capacity → high FP rate
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
n=1k
item
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
n=1k n=10k
item
full
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
item
n=1k n=10k n=100k
fullfull
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
n=1k n=10k n=100k n=1m
item
fullfullfull
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to << N max possible items
• Keep smallest one with capacity > items inserted
• Expensive to build
• Cheap to store and lookup
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to << N max possible items
• Keep smallest one with capacity > items inserted
• Expensive to build
• Cheap to store and lookup
n=1k
 
80%
n=10k
 
8%
n=100k
 
0.8%
n=1m
 
0.08%
item
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to  N max possible items
• Keep smallest one with capacity  items inserted
• Expensive to build
• Cheap to store and lookup
n=1k
 
100%
n=10k
 
70%
n=100k
 
7%
n=1m
 
0.7%
item
full
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to  N max possible items
• Keep smallest one with capacity  items inserted
• Expensive to build
• Cheap to store and lookup
n=1k
 
100%
n=10k
 
100%
n=100k
 
60%
n=1m

Weitere ähnliche Inhalte

Was ist angesagt?

CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyVidhya Murali
 
Calibrated Recommendations
Calibrated RecommendationsCalibrated Recommendations
Calibrated RecommendationsHarald Steck
 
Interactive Recommender Systems
Interactive Recommender SystemsInteractive Recommender Systems
Interactive Recommender SystemsRoelof van Zwol
 
How Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyJosh Baer
 
How to build a recommender system?
How to build a recommender system?How to build a recommender system?
How to build a recommender system?blueace
 
Collaborative Filtering with Spark
Collaborative Filtering with SparkCollaborative Filtering with Spark
Collaborative Filtering with SparkChris Johnson
 
Boston ML - Architecting Recommender Systems
Boston ML - Architecting Recommender SystemsBoston ML - Architecting Recommender Systems
Boston ML - Architecting Recommender SystemsJames Kirk
 
TOROS N2 - lightweight approximate Nearest Neighbor library
TOROS N2 - lightweight approximate Nearest Neighbor libraryTOROS N2 - lightweight approximate Nearest Neighbor library
TOROS N2 - lightweight approximate Nearest Neighbor libraryif kakao
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Zalando Technology
 
Big Practical Recommendations with Alternating Least Squares
Big Practical Recommendations with Alternating Least SquaresBig Practical Recommendations with Alternating Least Squares
Big Practical Recommendations with Alternating Least SquaresData Science London
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberXiang Fu
 
A Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at NetflixA Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at NetflixJaya Kawale
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsErik Bernhardsson
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Machine Learning at Netflix Scale
Machine Learning at Netflix ScaleMachine Learning at Netflix Scale
Machine Learning at Netflix ScaleAish Fenton
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 

Was ist angesagt? (20)

CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At Spotify
 
Calibrated Recommendations
Calibrated RecommendationsCalibrated Recommendations
Calibrated Recommendations
 
Interactive Recommender Systems
Interactive Recommender SystemsInteractive Recommender Systems
Interactive Recommender Systems
 
How Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At Spotify
 
How to build a recommender system?
How to build a recommender system?How to build a recommender system?
How to build a recommender system?
 
Collaborative Filtering with Spark
Collaborative Filtering with SparkCollaborative Filtering with Spark
Collaborative Filtering with Spark
 
Boston ML - Architecting Recommender Systems
Boston ML - Architecting Recommender SystemsBoston ML - Architecting Recommender Systems
Boston ML - Architecting Recommender Systems
 
TOROS N2 - lightweight approximate Nearest Neighbor library
TOROS N2 - lightweight approximate Nearest Neighbor libraryTOROS N2 - lightweight approximate Nearest Neighbor library
TOROS N2 - lightweight approximate Nearest Neighbor library
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
 
Learn to Rank search results
Learn to Rank search resultsLearn to Rank search results
Learn to Rank search results
 
Big Practical Recommendations with Alternating Least Squares
Big Practical Recommendations with Alternating Least SquaresBig Practical Recommendations with Alternating Least Squares
Big Practical Recommendations with Alternating Least Squares
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
 
A Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at NetflixA Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at Netflix
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive Analytics
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Machine Learning at Netflix Scale
Machine Learning at Netflix ScaleMachine Learning at Netflix Scale
Machine Learning at Netflix Scale
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
An Introduction to Druid
An Introduction to DruidAn Introduction to Druid
An Introduction to Druid
 

Andere mochten auch

Playlist Recommendations @ Spotify
Playlist Recommendations @ SpotifyPlaylist Recommendations @ Spotify
Playlist Recommendations @ SpotifyNikhil Tibrewal
 
Mugo one pager
Mugo one pagerMugo one pager
Mugo one pagerori segal
 
Jackdaw research music survey report
Jackdaw research music survey reportJackdaw research music survey report
Jackdaw research music survey reportJan Dawson
 
How We Listen to Music - SXSW 2015
How We Listen to Music - SXSW 2015How We Listen to Music - SXSW 2015
How We Listen to Music - SXSW 2015Paul Lamere
 

Andere mochten auch (6)

Playlist Recommendations @ Spotify
Playlist Recommendations @ SpotifyPlaylist Recommendations @ Spotify
Playlist Recommendations @ Spotify
 
Music survey results (2)
Music survey results (2)Music survey results (2)
Music survey results (2)
 
Music & interaction
Music & interactionMusic & interaction
Music & interaction
 
Mugo one pager
Mugo one pagerMugo one pager
Mugo one pager
 
Jackdaw research music survey report
Jackdaw research music survey reportJackdaw research music survey report
Jackdaw research music survey report
 
How We Listen to Music - SXSW 2015
How We Listen to Music - SXSW 2015How We Listen to Music - SXSW 2015
How We Listen to Music - SXSW 2015
 

Ähnlich wie Scala Data Pipelines @ Spotify

London devops logging
London devops loggingLondon devops logging
London devops loggingTomas Doran
 
CPANTS: Kwalitative website and its tools
CPANTS: Kwalitative website and its toolsCPANTS: Kwalitative website and its tools
CPANTS: Kwalitative website and its toolscharsbar
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupAndy Sloane
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Dan Halperin
 
Intelligent Search
Intelligent SearchIntelligent Search
Intelligent SearchTed Dunning
 
Scala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsScala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsChris Johnson
 
Vertica architecture
Vertica architectureVertica architecture
Vertica architectureZvika Gutkin
 
Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)LivePerson
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a clusterGal Marder
 
Tuning Your Engine
Tuning Your EngineTuning Your Engine
Tuning Your Enginejoelbradbury
 
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Kris Jack
 
Scala in practice - 3 years later
Scala in practice - 3 years laterScala in practice - 3 years later
Scala in practice - 3 years laterpatforna
 
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...Thoughtworks
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...SignalFx
 

Ähnlich wie Scala Data Pipelines @ Spotify (20)

London devops logging
London devops loggingLondon devops logging
London devops logging
 
Intelligent Search
Intelligent SearchIntelligent Search
Intelligent Search
 
CPANTS: Kwalitative website and its tools
CPANTS: Kwalitative website and its toolsCPANTS: Kwalitative website and its tools
CPANTS: Kwalitative website and its tools
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data Meetup
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
 
Let's Get to the Rapids
Let's Get to the RapidsLet's Get to the Rapids
Let's Get to the Rapids
 
Intelligent Search
Intelligent SearchIntelligent Search
Intelligent Search
 
Scala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsScala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music Recommendations
 
Vertica architecture
Vertica architectureVertica architecture
Vertica architecture
 
Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a cluster
 
Akka streams
Akka streamsAkka streams
Akka streams
 
Tuning Your Engine
Tuning Your EngineTuning Your Engine
Tuning Your Engine
 
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
 
Scala in practice - 3 years later
Scala in practice - 3 years laterScala in practice - 3 years later
Scala in practice - 3 years later
 
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca...
 
Hive at Last.fm
Hive at Last.fmHive at Last.fm
Hive at Last.fm
 
Graphite
GraphiteGraphite
Graphite
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
 

Mehr von Neville Li

Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify StoryNeville Li
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamNeville Li
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...Neville Li
 
Why functional why scala
Why functional  why scala Why functional  why scala
Why functional why scala Neville Li
 
Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at SpotifyNeville Li
 

Mehr von Neville Li (7)

Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
 
Scio
ScioScio
Scio
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...
 
Why functional why scala
Why functional  why scala Why functional  why scala
Why functional why scala
 
Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at Spotify
 

Kürzlich hochgeladen

Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfmaor17
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfkalichargn70th171
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxRTS corp
 

Kürzlich hochgeladen (20)

Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 

Scala Data Pipelines @ Spotify

  • 2. Who am I? ‣ SpotifyNYCsince2011 ‣ FormerlyYahoo!Search ‣ Musicrecommendations ‣ Datainfrastructure ‣ Scalasince2013
  • 3. Spotify in numbers • Started in 2006, 58 markets • 75M+ active users, 20M+ paying • 30M+ songs, 20K new per day • 1.5 billion playlists • 1 TB logs per day • 1200+ node Hadoop cluster • 10K+ Hadoop jobs per day
  • 4. Music recommendation @ Spotify • Discover Weekly • Radio • RelatedArtists • Discover Page
  • 6. A little teaser PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn, CombineFn<K,V> reduceFn) Crunch: CombineFns are used to represent the associative operations… Grouped[K, +V]::reduce[U >: V](fn: (U, U) U) Scalding: reduce with fn which must be associative and commutative… PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V) Spark: Merge the values for each key using an associative reduce function…
  • 7. Monoid! enables map side reduce Actually it’s a semigroup
  • 8. One more teaser Linear equation inAlternate Least Square (ALS) Matrix factorization xu = (YTY + YT(Cu − I)Y)−1YTCup(u) vectors.map { case (id, v) => (id, v * v) }.map(_._2).reduce(_ + _) // YtY ratings.keyBy(fixedKey).join(outerProducts) // YtCuIY .map { case (_, (r, op)) => (solveKey(r), op * (r.rating * alpha)) }.reduceByKey(_ + _) ratings.keyBy(fixedKey).join(vectors) // YtCupu .map { case (_, (r, v)) => val (Cui, pui) = (r.rating * alpha + 1, if (Cui > 0.0) 1.0 else 0.0) (solveKey(r), v * (Cui * pui)) }.reduceByKey(_ + _) http://www.slideshare.net/MrChrisJohnson/scala-data-pipelines-for-music-recommendations
  • 9. Success story • Mid 2013: 100+ Python Luigi M/R jobs, few tests • 10+ new hires since, most fresh grads • Few with Java experience, none with Scala • Now: 300+ Scalding jobs, 400+ tests • More ad-hoc jobs untracked • Spark also taking off
  • 12. Guess how many jobs written by yours truly?
  • 15. To join or not to join? val streams: TypedPipe[(String, String)] = _ // (track, user) val tgp: TypedPipe[(String, String)] = _ // (track, genre) streams .join(tgp) .values // (user, genre) .group .mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
  • 16. Hash join val streams: TypedPipe[(String, String)] = _ // (track, user) val tgp: TypedPipe[(String, String)] = _ // (track, genre) streams .hashJoin(tgp.forceToDisk) // tgp replicated to all mappers .values // (user, genre) .group .mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
  • 17. CoGroup val streams: TypedPipe[(String, String)] = _ // (track, user) val tgp: TypedPipe[(String, String)] = _ // (track, genre) streams .cogroup(tgp) { case (_, users, genres) => users.map((_, genres.toSet)) } // (track, (user, genres)) .values // (user, genres)
 .group .reduce(_ ++ _) // map-side reduce!
  • 18. CoGroup val streams: TypedPipe[(String, String)] = _ // (track, user) val tgp: TypedPipe[(String, String)] = _ // (track, genre) streams .cogroup(tgp) { case (_, users, genres) => users.map((_, genres.toSet)) } // (track, (user, genres)) .values // (user, genres)
 .group .sum // SetMonoid[Set[T]] from Algebird * sum[U >:V](implicit sg: Semigroup[U])
  • 19. Key-value file as distributed cache val streams: TypedPipe[(String, String)] = _ // (gid, user) val tgp: SparkeyManager = _ // tgp replicated to all mappers streams .map { case (track, user) => (user, tgp.get(track).split(",").toSet) } .group .sum https://github.com/spotify/sparkey SparkeyManagerwraps DistributedCacheFile
  • 20. Joins and CoGroups • Require shuffle and reduce step • Some ops force everything to reducers
 e.g. mapGroup, mapValueStream • CoGroup more flexible for complex logic • Scalding flattens a.join(b).join(c)…
 into MultiJoin(a, b, c, …)
  • 21. Distributed cache • Fasterwith off-heap binary files • Building cache = more wiring • Memory mapping may interfere withYARN • E.g. 64GB nodes with 48GB for containers (no cgroup) • 12 × 2GB containers each with 2GB JVM heap + mmap cache • OOM and swap! • Keep files small (< 1GB) or fallback to joins…
  • 22. Analyze your jobs • Concurrent Driven • Visualize job execution • Workflow optimization • Bottlenecks • Data skew
  • 24. Recommending tracks • User listened to Rammstein - Du Hast • Recommend 10 similartracks • 40 dimension feature vectors fortracks • Compute cosine similarity between all pairs • O(n) lookup per userwhere n ≈ 30m • Trythat with 50m users * 10 seed tracks each
  • 25. ANNOY - cheat by approximation • Approximate Nearest Neighbor OhYeah • Random projections and binarytree search • Build index on single machine • Load in mappers via distribute cache • O(log n) lookup https://github.com/spotify/annoy https://github.com/spotify/annoy-java
  • 27. Filtering candidates • Users don’t like seeing artist/album/tracks they already know • But may forget what they listened long ago • 50m * thousands of items each • Over 5 years of streaming logs • Need to update daily • Need to purge old items per user
  • 28. Options • Aggregate all logs daily • Aggregate last x days daily • CSVof artist/album/track ids • Bloom filters
  • 29. Decayed value with cutoff • Compute new user-item score daily • Weighted on context, e.g. radio, search, playlist • score’ = score + previous * 0.99 • half life = log0.99 0.5 = 69 days • Cut off at top 2000 • Items that users might remember seeing recently
  • 30. Bloom filters • Probabilistic data structure • Encoding set of items with m bits and k hash functions • No false negative • Tunable false positive probability • Size proportional to capacity & FP probability • Let’s build one per user-{artists,albums,tracks} • Algebird BloomFilterMonoid: z = all zero bits, + = bitwise OR
  • 31. Size versus max items & FP prob • User-item distribution is uneven • Assuming same setting for all users • # items << capacity → wasting space • # items > capacity → high FP rate
  • 32. Scalable Bloom Filter • Growing sequence of standard BFs • Increasing capacity and tighter FP probability • Most users have few BFs • Power users have many • Serialization and lookup overhead
  • 33. Scalable Bloom Filter • Growing sequence of standard BFs • Increasing capacity and tighter FP probability • Most users have few BFs • Power users have many • Serialization and lookup overhead n=1k item
  • 34. Scalable Bloom Filter • Growing sequence of standard BFs • Increasing capacity and tighter FP probability • Most users have few BFs • Power users have many • Serialization and lookup overhead n=1k n=10k item full
  • 35. Scalable Bloom Filter • Growing sequence of standard BFs • Increasing capacity and tighter FP probability • Most users have few BFs • Power users have many • Serialization and lookup overhead item n=1k n=10k n=100k fullfull
  • 36. Scalable Bloom Filter • Growing sequence of standard BFs • Increasing capacity and tighter FP probability • Most users have few BFs • Power users have many • Serialization and lookup overhead n=1k n=10k n=100k n=1m item fullfullfull
  • 37. Opportunistic Bloom Filter • Building n BFs of increasing capacity in parallel • Up to << N max possible items • Keep smallest one with capacity > items inserted • Expensive to build • Cheap to store and lookup
  • 38. Opportunistic Bloom Filter • Building n BFs of increasing capacity in parallel • Up to << N max possible items • Keep smallest one with capacity > items inserted • Expensive to build • Cheap to store and lookup n=1k
  • 43. Opportunistic Bloom Filter • Building n BFs of increasing capacity in parallel • Up to N max possible items • Keep smallest one with capacity items inserted • Expensive to build • Cheap to store and lookup n=1k
  • 48. Opportunistic Bloom Filter • Building n BFs of increasing capacity in parallel • Up to N max possible items • Keep smallest one with capacity items inserted • Expensive to build • Cheap to store and lookup n=1k
  • 53. Opportunistic Bloom Filter • Building n BFs of increasing capacity in parallel • Up to N max possible items • Keep smallest one with capacity items inserted • Expensive to build • Cheap to store and lookup n=1k
  • 60. Track metadata • Label dump → content ingestion • Third partytrack genres, e.g. GraceNote • Audio attributes, e.g. tempo, key, time signature • Cultural data, e.g. popularity, tags • Latent vectors from collaborative filtering • Many sources for album, artist, user metadata too
  • 61. Multiple data sources • Big joins • Complex dependencies • Wide rows with few columns accessed • Wasting I/O
  • 62. Apache Parquet • Pre-join sources into mega-datasets • Store as Parquet columnar storage • Column projection • Predicate pushdown • Avro within Scalding pipelines
  • 63. Projection pipe.map(a = (a.getName, a.getAmount)) versus Parquet.project[Account](name, amount) • Strings → unsafe and error prone • No IDE auto-completion → finger injury • my_fancy_field_name → .getMyFancyFieldName • Hard to migrate existing code
  • 64. Predicate pipe.filter(a = a.getName == Neville a.getAmount 100) versus FilterApi.and( FilterApi.eq(FilterApi.binaryColumn(name), Binary.fromString(Neville)), FilterApi.gt(FilterApi.floatColumn(amount), 100f.asInstnacesOf[java.lang.Float]))
  • 65. Macro to the rescue Code →AST→ (pattern matching) → (recursion) → (quasi-quotes) → Code Projection[Account](_.getName, _.getAmount) Predicate[Account](x = x.getName == “Neville x.getAmount 100) https://github.com/nevillelyh/parquet-avro-extra http://www.lyh.me/slides/macros.html
  • 66. What else? ‣ Analytics ‣ Adstargeting,prediction ‣ Metadataquality ‣ Zeppelin ‣ Morecoolstuffintheworks