2. Who am I?
• Technical Product Owner at Spotify
• Working with fast processing infrastructure
• Previously built out Spotify’s 2,500-node Hadoop cluster
@l_phant
3. In 2008
• Spotify launches
• Access to a gigantic catalog of music
• Click to play, instantaneous!
19. In the Beginning…
• Spotify was almost completely on-premise/bare metal
• Grew to a 2,500-node Hadoop cluster and over 10K total machines in production at four globally distributed data centers
• “Flirted” with cloud providers at various times
21. Why Move to the Cloud?
• Cloud providers have matured: costs have decreased while reliability and the variety of services offered have increased
• Owning and operating physical machines is not a
competitive advantage for Spotify
22. Why Google’s Cloud?
• We believe Google’s industry-leading background in Big Data technologies will give us a data processing advantage
24. BigQuery
• Ad-hoc and interactive querying service for massive datasets
• Like Hive, but without needing to manage Hadoop and servers
• Leverages Google’s internal tech: Dremel (query execution engine), Colossus (distributed storage), Borg (distributed compute), Jupiter (network)
Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
25. BigQuery vs. Hive
• Example Queries:
• What are the top 10 songs by popularity in Spain
during October 2016?
• How many hours did users in Spain spend
listening to Spotify during October?
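To make the first example query concrete, here is a minimal sketch of the computation it describes, written in plain Python rather than as a BigQuery or Hive query. The record layout, the track names, and the choice of play count as the measure of "popularity" are all assumptions made for illustration; the slides do not show the actual schema or query.

```python
from collections import Counter
from datetime import date

# Hypothetical play records: (track, country code, play date). Made-up data.
plays = [
    ("Track A", "ES", date(2016, 10, 1)),
    ("Track B", "ES", date(2016, 10, 2)),
    ("Track A", "ES", date(2016, 10, 15)),
    ("Track A", "SE", date(2016, 10, 3)),   # other country, filtered out
    ("Track B", "ES", date(2016, 11, 1)),   # other month, filtered out
    ("Track C", "ES", date(2016, 10, 31)),
]

def top_tracks(records, country, year, month, n=10):
    """Count plays per track for one country and month; return the n most played."""
    counts = Counter(
        track
        for track, played_in, day in records
        if played_in == country and day.year == year and day.month == month
    )
    return counts.most_common(n)

print(top_tracks(plays, "ES", 2016, 10))
```

A query engine performs the same filter, group, count, and sort steps, only over terabytes of data instead of a small in-memory list.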
26. BigQuery vs. Hive
• What are the top 10 songs by popularity in Spain during October 2016?
• Hive
• 2647s (44min, 7sec)
• 15.5TB processed
• BigQuery
• 108s (1min, 48sec)
• 1.50TB processed
Note: Hive performance is unoptimized. Version used: 0.14; input format: Avro; run on a ~2,500-node YARN cluster. This is not considered to be a thorough benchmark.
27. Top 10 Tracks in Spain during October 2016
Rank  Artist(s)                                   Track Name
1     J Balvin                                    Safari
2     DJ Snake                                    Let Me Love You
3     Ricky Martin                                Vente Pa' Ca
4     Sebastian Yatra                             Traicionera
5     Zion & Lennox (feat. J Balvin)              Otra Vez
6     Carlos Vives, Shakira                       La Bicicleta
7     The Chainsmokers                            Closer
8     Major Lazer (feat. Justin Bieber & MØ)      Cold Water
9     Sia                                         The Greatest
10    IAmChino (feat. Pitbull, Yandel & Chacal)   Ay Mi Dios
28. BigQuery vs. Hive
• How much time did users in Spain spend listening to Spotify during October?
• Hive
• 969s (16min, 9sec)
• 15.5TB processed
• BigQuery
• 33s
• 780 GB processed
Note: Hive performance is unoptimized. Version used: 0.14; input format: Avro; run on a ~2,500-node YARN cluster. This is not considered to be a thorough benchmark.
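The ratios implied by the two comparisons can be worked out directly from the figures on the slides. The sketch below only does that arithmetic; it assumes decimal units (780 GB = 0.780 TB), and the caveat from the note still applies, since the Hive runs were unoptimized.

```python
# Figures as reported on the slides for the two example queries.
runs = {
    "top 10 songs": {"hive_s": 2647, "bq_s": 108, "hive_tb": 15.5, "bq_tb": 1.50},
    "listening time": {"hive_s": 969, "bq_s": 33, "hive_tb": 15.5, "bq_tb": 0.780},
}

def ratios(run):
    """Return (runtime ratio, data-processed ratio) of Hive relative to BigQuery."""
    return run["hive_s"] / run["bq_s"], run["hive_tb"] / run["bq_tb"]

for name, run in runs.items():
    speedup, scan = ratios(run)
    print(f"{name}: {speedup:.1f}x faster, {scan:.1f}x less data processed")
```

On these numbers BigQuery finished roughly 25x and 29x faster while processing roughly 10x and 20x less data.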
30. BigQuery at Spotify
• Interactive and ad-hoc querying started to move to BigQuery (BQ) as soon as the data was available in the cloud
• The pace of learning increases as the friction of asking questions decreases
31. Cloud Pub/Sub
• At-least-once, globally distributed message queue
• For high-volume publish/subscribe workloads with a low topic count (<10,000)
• Like Kafka, but without needing to operate servers and supporting services (ZooKeeper)
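At-least-once delivery means a subscriber may see the same message more than once, so consumers are usually written to tolerate duplicates. The sketch below shows one common approach, deduplicating by message ID, using plain Python and a simulated delivery stream. It does not use the Cloud Pub/Sub client library; `DedupingConsumer` and the message IDs are hypothetical names for illustration.

```python
class DedupingConsumer:
    """Processes each message ID at most once, even if it is delivered repeatedly."""

    def __init__(self, handler):
        self.handler = handler
        # In-memory only; a real system would need to bound or persist this set.
        self.seen_ids = set()

    def receive(self, message_id, payload):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.handler(payload)
        self.seen_ids.add(message_id)
        return True

processed = []
consumer = DedupingConsumer(processed.append)

# Simulated delivery stream in which message "m1" is redelivered.
for message_id, payload in [("m1", "play"), ("m2", "pause"), ("m1", "play")]:
    consumer.receive(message_id, payload)

print(processed)
```

The ID is recorded only after the handler succeeds, so a failed handler call leaves the message eligible for reprocessing on redelivery.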
32. Cloud Pub/Sub at Spotify
• 800K events/second? No problem
• P99 latency of ingestion into Elasticsearch (ES): 500ms
• Ingestion from globally distributed non-GCP
datacenters is painless
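For readers unfamiliar with the term, a P99 latency of 500ms means that 99% of ingested events arrive within 500ms. The sketch below computes a percentile with the nearest-rank method over made-up latency samples; the slides do not say how Spotify measured this figure, so the method and the data are assumptions.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 99))
```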
33. Cloud Dataflow
• Managed service for running batch and streaming jobs
• Unified API for batch and streaming mode
• Inspired by internal Google tools like FlumeJava and MillWheel
• Programming model open-sourced as Apache Beam (currently incubating)
34. Cloud Dataflow (Batch) at Spotify
• Usually run via Scio: https://github.com/spotify/scio
• Scio provides a Scala API for running Dataflow jobs and provides easy integrations with BigQuery
• New batch processing jobs at Spotify are being written in Scio/Dataflow
35. Cloud Dataflow (Streaming) at Spotify
• Exactly-once stream processing framework
• A replacement for Spark/Flink streaming and Storm workloads at Spotify
• Optimizes for consistency, which can complicate real-time workloads
37. Spotify + Google Cloud Timeline
Timeline spanning 2015 to 2016, with these milestones:
• Beginning of Google Cloud evaluation
• BigQuery begins to replace Hive
• Cloud Pub/Sub begins to replace Kafka
• Dataflow (streaming) begins to replace Storm
• Dataflow (batch) replacing Map/Reduce
Note: Dates are approximations
41. Getting Data from Clients to Pub/Sub
• Built Pulsar, a simple service aggregating data from
Access Points and feeding it into Cloud Pub/Sub
• Replaces the Kafka real-time event feed
43. Dataflow
• Subscribes to important event Pub/Sub topics
• Aggregates events into minute windows
• Always running, no need to schedule or wait for results
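The windowed aggregation can be illustrated with a small plain-Python sketch. It assumes fixed (non-overlapping) one-minute windows keyed by event time and counts events per type. It runs over a finite list, so it ignores the late-data and watermark handling that a streaming engine such as Dataflow provides. The event tuples are made up.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def window_start(timestamp_s):
    """Map an event timestamp (seconds since epoch) to the start of its one-minute window."""
    return timestamp_s - (timestamp_s % WINDOW_SECONDS)

def count_per_window(events):
    """Count events per (window start, event type). Events are (timestamp_s, event_type)."""
    counts = defaultdict(int)
    for timestamp_s, event_type in events:
        counts[(window_start(timestamp_s), event_type)] += 1
    return dict(counts)

# Made-up events: three plays in the first minute, one play and one skip in the second.
events = [(0, "play"), (15, "play"), (59, "play"), (60, "play"), (61, "skip")]
print(count_per_window(events))
```

In a streaming job the same grouping runs continuously, emitting each window's result once the window closes instead of waiting for a scheduled batch run.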
50. Problem
As a developer, I want to be able to instantly explore data being logged by the clients.
51. Solution
• Produce a topic for all employee client events
• Store in Elasticsearch
• Visualize in Kibana
54. Benefits
• Able to understand what’s being sent by the client as it happens
• Exploring events and visualizing distributions (e.g., does this field actually get populated?)
• Prototyping analysis based on a sample
• Dashboards for Employee Releases
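The "does this field actually get populated" check amounts to a simple ratio over a sample of events. The sketch below computes it in plain Python over made-up event dictionaries. In the setup described on the slides this kind of question would be answered in Kibana over data stored in Elasticsearch, so the function and the `device_model` field name are hypothetical.

```python
def population_rate(events, field):
    """Fraction of events in which `field` is present and not None."""
    if not events:
        return 0.0
    populated = sum(1 for event in events if event.get(field) is not None)
    return populated / len(events)

# Made-up sample of client events; "device_model" is a hypothetical field name.
sample = [
    {"event": "play", "device_model": "X1"},
    {"event": "play", "device_model": None},
    {"event": "pause"},
    {"event": "play", "device_model": "X2"},
]
print(population_rate(sample, "device_model"))
```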
58. Live Results for X-Factor
• X-Factor: music competition
• Songs available on Spotify
immediately after show airs
• Listener behavior determines the
order of contestants on the playlist
63. Putting it all together
Milliseconds to transfer
Milliseconds to process
Seconds to query
SECONDS TO INSIGHTS
64. The Value of a Fast Feedback Loop
• Detecting problems in data early avoids long backfills or long-term data loss
• Instant insights on newly developed features allow teams to iterate more quickly and take risks
• Providing a quicker ad-hoc querying engine allows teams to ask more questions and learn faster
65. Use Anything and Everything
• Spotify has leveraged Google Cloud tools, such as Pub/Sub, Dataflow and BigQuery
• Open source and other cloud providers offer many alternatives to this stack
• Open source tools (Elasticsearch/Kibana) and proprietary solutions (Tableau) have also been useful additions
67. Stream Processing First
• The sun never sets on Spotify, so why impose boundaries on our datasets?
• What’s the shortest distance between two points?
Zero!
• Can we reduce the feedback cycle to zero?