3. Who am I?
• Technical Product Owner at Spotify
• Working with fast processing infrastructure
• Previously, building out Spotify’s 2500 node
Hadoop cluster
@l_phant
4. In 2008
• Spotify launches
• Instant access to a gigantic
catalog of music
• Click to play: instantaneous!
20. In the Beginning…
• Spotify was almost completely on-premise/bare
metal
• 2500 node Hadoop cluster, over 10K machines in
production at four globally distributed data centers
• Grew with users: from 1M in 2009 to over 100M in 2016
21. Why Move to the Cloud?
• Cloud providers have matured, decreasing in cost
and increasing in reliability and variety of services
offered
• Owning and operating physical machines is not a
competitive advantage for Spotify
22. Why Google’s Cloud?
• We believe Google’s industry-leading background
in Big Data technologies will give us a data
processing advantage
24. BigQuery
• Ad-hoc and interactive querying service for massive datasets
• Like Hive, but without needing to manage Hadoop and servers
• Leverages Google’s internal tech
• Dremel (query execution engine)
• Colossus (distributed storage)
• Borg (distributed compute)
• Jupiter (network)
Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
25. BigQuery vs. Hive
• Example Query: Find the top 10 songs by
popularity in Spain during October
• BigQuery (1.50 TB processed): 108s
• Hive (15.5 TB processed): 2647s
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on
a ~2500 node Yarn cluster. This is not considered to be a thorough benchmark
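As a rough illustration of what the example query computes, here is a minimal in-memory sketch. It assumes that "popularity" means play count, and the event fields (track, country, play date) and sample rows are hypothetical; the slides do not show the actual schema or query text.

```python
from collections import Counter
from datetime import date

# Hypothetical play events: (track, country, play_date). Not real Spotify data.
plays = [
    ("Safari", "ES", date(2016, 10, 1)),
    ("Safari", "ES", date(2016, 10, 2)),
    ("Closer", "ES", date(2016, 10, 2)),
    ("Closer", "SE", date(2016, 10, 3)),   # wrong country, ignored
    ("Safari", "ES", date(2016, 9, 30)),   # wrong month, ignored
]

def top_tracks(events, country, year, month, n=10):
    """Count plays per track for one country and month, most played first."""
    counts = Counter(
        track
        for track, event_country, day in events
        if event_country == country and day.year == year and day.month == month
    )
    return counts.most_common(n)

result = top_tracks(plays, "ES", 2016, 10)
```

Both engines express the same filter, group and sort; the benchmark difference comes from how much data each has to scan and how the work is executed.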
26. BigQuery vs. Hive (example #2)
• Example Query: Find the total hours of music
listening in Spain during October
• BigQuery (780 GB processed): 33s
• Hive (15.5 TB processed): 969s
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on
a ~2500 node Yarn cluster. This is not considered to be a thorough benchmark
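The two timings can be turned into speedup ratios with simple arithmetic. The sketch below uses only the numbers quoted on the two benchmark slides; the dictionary keys are labels chosen here for illustration, and the caveat that this is not a thorough benchmark still applies.

```python
# Timings copied from the two benchmark slides: (BigQuery seconds, Hive seconds).
benchmarks = {
    "top 10 songs": (108, 2647),
    "total listening hours": (33, 969),
}

def speedup(bigquery_seconds, hive_seconds):
    """How many times faster the BigQuery run finished than the Hive run."""
    return hive_seconds / bigquery_seconds

speedups = {
    name: round(speedup(bq, hive), 1)
    for name, (bq, hive) in benchmarks.items()
}
```

Under these figures the BigQuery runs finished roughly 24 to 29 times faster than the unoptimized Hive runs.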
27. Top 10 Tracks in Spain during October 2016
Rank  Artist(s)                                  Track Name
1     J Balvin                                   Safari
2     DJ Snake                                   Let Me Love You
3     Ricky Martin                               Vente Pa' Ca
4     Sebastian Yatra                            Traicionera
5     Zion & Lennox (feat. J Balvin)             Otra Vez
6     Carlos Vives, Shakira                      La Bicicleta
7     The Chainsmokers                           Closer
8     Major Lazer (feat. Justin Bieber & MØ)     Cold Water
9     Sia                                        The Greatest
10    IAmChino (feat. Pitbull, Yandel & Chacal)  Ay Mi Dios
28. Time Spent Listening to Spotify by Users in Spain during October
Nearly 10,000 Years!
29. BigQuery at Spotify
• Interactive and ad-hoc querying immediately
started to transfer to BigQuery once the data was
available on the cloud
• Pace of learning increases as the friction of asking
questions decreases
30. Cloud Pub/Sub
• At-least-once, globally distributed message queue
• For high-volume, low-topic-count (<10,000) publish/
subscribe behavior
• Like Kafka, but without needing to operate servers
and supporting services (ZooKeeper)
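At-least-once delivery means a subscriber may see the same message more than once, so consumers typically need to be idempotent. The following is a minimal in-memory sketch of deduplicating by message ID; it does not use the real Pub/Sub client, and the class name, message IDs and payloads are all hypothetical.

```python
class DedupingConsumer:
    """Processes each message ID at most once, even if it is delivered repeatedly."""

    def __init__(self, handler):
        self.handler = handler
        # Unbounded in this sketch; a real system would expire old IDs.
        self.seen_ids = set()

    def receive(self, message_id, payload):
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False  # duplicate delivery: safe to acknowledge, nothing to redo
        self.handler(payload)
        self.seen_ids.add(message_id)
        return True

processed = []
consumer = DedupingConsumer(processed.append)

# Simulated at-least-once delivery: message "m1" arrives twice.
deliveries = [("m1", "play"), ("m2", "pause"), ("m1", "play")]
results = [consumer.receive(message_id, body) for message_id, body in deliveries]
```

The ID is recorded only after the handler succeeds, so a failed message can be redelivered and retried.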
31. Cloud Pub/Sub at Spotify
• 800K events/second? No problem
• P99 latency of ingestion into Elasticsearch (ES): 500ms
• Ingestion from globally distributed non-GCP
datacenters is painless
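For readers less familiar with the term, a P99 latency of 500ms means that 99% of events were ingested within 500ms. A minimal sketch of the nearest-rank percentile calculation follows; the sample latencies are made up for illustration and are not Spotify measurements.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical ingestion latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p99 = percentile(latencies_ms, 99)
```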
32. Cloud Dataflow
• Managed service for running batch and streaming jobs
• Unified API for batch and streaming mode
• Inspired by internal Google tools like FlumeJava and
MillWheel
• Programming model open-sourced as Apache Beam
(currently incubating)
33. Cloud Dataflow (Batch) at Spotify
• Usually run via Scio: https://github.com/spotify/scio
• Scio provides a Scala API for running Dataflow jobs
and provides easy integrations with BigQuery
• New batch processing jobs at Spotify are being
written in Scio/Dataflow
34. Cloud Dataflow (Streaming) at Spotify
• Exactly-once stream processing framework
• A replacement for Spark/Flink streaming and
Storm workloads at Spotify
• Optimizes for consistency, which can complicate
real-time workloads
35. Spotify + Google Cloud Timeline
2015 to 2016
• Beginning of Google Cloud evaluation
• BigQuery begins to replace Hive
• Cloud Pub/Sub begins to replace Kafka
• Dataflow (streaming) begins to replace Storm
• Spotify + Google Cloud announcement
• Dataflow (batch) replacing Map/Reduce
Note: Dates are approximations
39. Introducing “Pulsar”
• An internal name for the system aggregating data
from Access Points and feeding it into Cloud
Pub/Sub
• Replaces the Kafka real-time event feed
42. Dataflow
• Subscribes to critical event Pub/Sub topics
• Aggregates events into one-minute windows
• Always running, no need to schedule or wait for
results
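Aggregating into minute windows can be pictured as assigning each event to the fixed one-minute bucket that contains its timestamp and counting per bucket. The sketch below is plain Python over a finite list, not Dataflow code; a real streaming job also has to deal with late and out-of-order data. The event tuples and function names are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def window_start(timestamp):
    """Start (in epoch seconds) of the fixed one-minute window containing timestamp."""
    return timestamp - (timestamp % WINDOW_SECONDS)

def count_per_window(events):
    """events: iterable of (epoch_seconds, event_type).

    Returns {(window_start, event_type): count}.
    """
    counts = defaultdict(int)
    for timestamp, event_type in events:
        counts[(window_start(timestamp), event_type)] += 1
    return dict(counts)

# Hypothetical events, timestamps in seconds.
events = [(0, "play"), (59, "play"), (60, "play"), (61, "skip"), (125, "play")]
counts = count_per_window(events)
```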
49. Problem
As a developer, I want to be able to instantly explore
data being logged by the clients.
50. Solution
• Produce a topic for all employee client events
• Store in Elasticsearch
• Visualize in Kibana
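One way to picture the first two steps is to filter the event stream down to employee events and then format them for bulk indexing. This sketch only builds the newline-delimited request body in the general shape the Elasticsearch bulk API accepts; it sends nothing. The "is_employee" field, the index name and the sample events are assumptions, and some Elasticsearch versions require extra metadata in the action line.

```python
import json

def is_employee_event(event):
    # "is_employee" is a hypothetical field; the slides do not say how
    # employee traffic is actually identified.
    return event.get("is_employee", False)

def to_bulk_payload(events, index_name):
    """Build a newline-delimited bulk body: one action line, then one document line."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"  # the bulk format ends with a newline

events = [
    {"user": "a", "type": "play", "is_employee": True},
    {"user": "b", "type": "play", "is_employee": False},
]
payload = to_bulk_payload(filter(is_employee_event, events), "employee-client-events")
```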
53. Benefits
• Able to understand what’s being sent by the client
as it happens
• Exploring events, visualizing distribution (e.g. does
this field actually get populated?)
• Prototyping analysis based on a sample
• Dashboards for Employee Releases
57. Dataflow to the Rescue!
• We created a library that allows teams to build
maps/filters with simple Java code
• Code gets translated into a Dataflow job
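The slides do not show the library's API, so the following is only a sketch of the general idea: teams declare a chain of map and filter steps, and something else turns that chain into a running job. It is written in Python for illustration (the actual library is described as Java), runs in memory rather than on Dataflow, and every name in it is hypothetical.

```python
class EventPipeline:
    """Collects map/filter steps, then applies them in order to each event."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self  # allow chaining

    def filter(self, predicate):
        self.steps.append(("filter", predicate))
        return self

    def run(self, events):
        """Yield each event that passes every filter, with all maps applied."""
        for event in events:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    event = fn(event)
                elif not fn(event):
                    keep = False
                    break
            if keep:
                yield event

pipeline = (
    EventPipeline()
    .filter(lambda e: e["type"] == "play")
    .map(lambda e: {"track": e["track"].upper()})
)
output = list(pipeline.run([
    {"type": "play", "track": "safari"},
    {"type": "skip", "track": "closer"},
]))
```

Keeping the user-facing surface this small is what lets the translation into a managed job stay out of the team's way.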
60. No Ops!
• For our users:
• Event-feed managed through Cloud Pub/Sub
• Dataflow managed by Google
• Shared Elasticsearch cluster (managed by an
infra team)
61. Low Ops :/
• Dataflow is improving, but it’s had some stability
issues with streaming jobs
• Teams may need to set up their own Elasticsearch
cluster if they require a higher SLA than the default
65. Live Results for X-Factor
• X-Factor: television music
competition
• Contest songs get loaded onto
Spotify immediately after show
airs
• Listener behavior determines the
order of contestants on the playlist
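A minimal sketch of the ordering step, assuming that "listener behavior" is measured by stream count (the slides do not say which signal is used). Track names and stream events are made up for illustration.

```python
from collections import Counter

def order_playlist(contestant_tracks, stream_events):
    """Order contestant tracks by stream count, most streamed first.

    Ties, including tracks with no streams, keep their original relative order
    because Python's sort is stable.
    """
    counts = Counter(track for track in stream_events if track in contestant_tracks)
    return sorted(contestant_tracks, key=lambda track: -counts[track])

# Hypothetical contestant tracks and a stream of play events.
tracks = ["song_a", "song_b", "song_c"]
streams = ["song_b", "song_c", "song_b", "unrelated_song"]
playlist = order_playlist(tracks, streams)
```

With a streaming pipeline feeding the counts, the playlist order can be refreshed continuously instead of waiting for a daily batch job.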
70. Cloud to the Rescue!
• Spotify has leveled up our ability to gain actionable
insights by leveraging Google Cloud tools, such as
Pub/Sub, Dataflow and BigQuery
71. The Value of a Fast Feedback Loop
• Detecting problems early in data avoids long backfills or
long term data loss
• Instant insights on newly developed features allow
teams to iterate more quickly and take risks
• Providing a quicker ad-hoc querying engine allows teams
to ask more questions and learn faster
72. Use Anything and Everything
• Open source and other cloud providers offer many
alternatives to the stack we’ve used
• Open source tools, like Elasticsearch/Kibana, and
proprietary solutions, like Tableau, have also been
useful additions
74. Stream Processing First
• The sun never sets on Spotify, why impose
boundaries on our datasets?
• What’s the shortest distance between two lines?
Zero!
• Can we reduce the feedback cycle to zero?