- Data observability matters at Spotify because they process massive amounts of data: roughly 8 million events per second.
- To ensure observability, Spotify annotates and documents data schemas, monitors pipeline execution times and record counts to catch errors, tracks the financial cost of pipelines and storage, and sets up alerts and dashboards to catch failures.
- Good data observability helps Spotify understand where data comes from and where it goes, troubleshoot issues quickly, and ensure that royalty payments to artists, which depend on these pipelines, are accurate.
4. What do I do?
I'm part of the Data and Insights tribe at Spotify.
Our team owns one of the biggest services at Spotify (~1M rps) and one of the biggest pipelines at Spotify: anonymization of event delivery.
6. Where is the data coming from?
[Diagram: Event Delivery System → pseudonymization pipelines → Cloud Storage. A pseudonymization pipeline runs every hour for every event, at 8 million events per second.]
7. A bit of scale
● 8,000,000 events per second
● The largest event types reach around 8 billion events per hour
● Over 400 unique event types, published as separate datasets
● 500 TB of data a day
● We used to own the largest Hadoop cluster in Europe
8. How does it feel to be on-call?
If you need to hot-fix something in production, it is like changing a flat tire on a car going 200 km/h down the highway without stopping it! The longer your system is down, the longer it will take to catch up, and the catch-up time for downstream consumers grows exponentially.
9. Who needs that much data?
Once delivered, events are processed by the numerous data jobs running at Spotify. The delivered data serves many different use cases: producing music recommendations, analysing our A/B tests, or analysing our client crashes. Most importantly, delivered data is used to calculate the royalties paid to artists based on generated streams.
11. Make sure your data is discoverable
Annotating your data is the key to avoiding piles of mess!
Upsides:
➢ Other people can find and use your data
➢ Sensitive data in the dataset? Encrypt based on annotations
➢ Easy mapping in your code, e.g. schema <-> case class
➢ Easier to find which key to join on
Downsides:
➢ You have to do it. Once.
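As a sketch of the "encrypt based on annotations" idea (the schema and field names here are hypothetical, not Spotify's actual ones), field-level metadata can mark sensitive columns so a generic pipeline step knows what to encrypt:

```python
from dataclasses import dataclass, field, fields

# Hypothetical annotated schema: the "sensitive" flag marks fields that a
# generic pipeline step should encrypt before the dataset is published.
@dataclass
class StreamEvent:
    track_id: str
    timestamp: int
    user_id: str = field(metadata={"sensitive": True})
    ip_address: str = field(metadata={"sensitive": True})

def sensitive_fields(cls) -> list:
    """Return the names of all fields annotated as sensitive."""
    return [f.name for f in fields(cls) if f.metadata.get("sensitive")]

print(sensitive_fields(StreamEvent))  # ['user_id', 'ip_address']
```

The annotation lives next to the schema, so the encryption step never needs a hand-maintained list of sensitive columns.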
14. Monitor your pipelines. Count it!
Never produce corrupt data! Implement as many sanity checks as possible.
Example: your pipeline encrypts each row in the dataset with a key derived from user_id, and uses a random key otherwise (making the row impossible to decrypt).
Count it: count the percentage of rows where the user_id could not be found or parsed, and alert if it rises above, say, 10%.
Sanity-check your data and alert!
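A minimal sketch of that counting check, assuming rows arrive as dicts and taking the slide's 10% as an illustrative threshold:

```python
# Count the fraction of rows whose user_id is missing or unparseable,
# and alert when it exceeds a threshold (10% here, as on the slide).
ALERT_THRESHOLD = 0.10

def missing_user_id_ratio(rows: list) -> float:
    missing = sum(1 for r in rows if not r.get("user_id"))
    return missing / len(rows) if rows else 0.0

def should_alert(rows: list) -> bool:
    return missing_user_id_ratio(rows) > ALERT_THRESHOLD

rows = [{"user_id": "u1"}, {"user_id": None}, {"user_id": "u2"}, {}]
print(missing_user_id_ratio(rows))  # 0.5
```

In a real pipeline the counter would be a streaming metric feeding the alerting system, not an in-memory list.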
17. Monitor your pipelines. Money. Per system.
Taking GDPR as an example: how much does it cost to serve a “Download your data” request? How much should you put into it?
What to monitor:
● How much do we pay for every request?
● The cost above: what is the cost of every pipeline that contributes to it?
● How many requests do people actually open?
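The per-request figure is simple arithmetic over the contributing pipelines; all numbers below are made up for illustration:

```python
# Hypothetical monthly costs of the pipelines that contribute to serving
# a "Download your data" request, divided by requests actually opened.
pipeline_costs = {"collect": 120.0, "package": 30.0, "deliver": 10.0}  # $/month
requests_opened = 800  # requests per month

total_cost = sum(pipeline_costs.values())
cost_per_request = total_cost / requests_opened
print(f"${cost_per_request:.2f} per request")  # $0.20 per request
```

Tracking this over time shows whether a new contributing pipeline quietly makes every request more expensive.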
18. Set up retention! Storage is ⅓ of the cost
● Set up a default retention policy: remove partitions after the expiration date
● Profile the storage: can cold storage be used (cheap to store, expensive to access)?
● It adds up: multi-regional vs regional buckets; where is the data accessed from?
● How is the data used? BigQuery or pipelines?
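The default-retention step above can be sketched as a small partition-expiry check; the 30-day window is an assumption, not Spotify's actual policy:

```python
from datetime import date, timedelta

# Drop date partitions older than the retention window.
RETENTION = timedelta(days=30)  # illustrative default

def expired_partitions(partitions: list, today: date) -> list:
    """Return the partitions that are past their expiration date."""
    return [p for p in partitions if today - p > RETENTION]

parts = [date(2020, 1, 1), date(2020, 2, 25), date(2020, 3, 1)]
print(expired_partitions(parts, today=date(2020, 3, 2)))
# [datetime.date(2020, 1, 1)]
```

In practice the same policy is usually expressed declaratively, e.g. as a bucket lifecycle rule or a table partition expiration, rather than a scheduled cleanup job.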
20. SLAs for your partitions
Concept of low, normal and high priority for
events. It gives us different SLAs for different events
(depending on importance, 6h, 24h, 72h). Thanks to
that we know which events recover first when shit hits
the fan. This also made our life better as normal
priority events will not alert during nights and low
priority events will not alert during weekends.
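The tiering above can be sketched as a priority→SLA map plus an alert-window check; the exact "night" hours are an assumption, only the 6h/24h/72h SLAs and the night/weekend suppression come from the slide:

```python
from datetime import datetime

# SLA hours per priority tier, as on the slide.
SLA_HOURS = {"high": 6, "normal": 24, "low": 72}

def should_page(priority: str, now: datetime) -> bool:
    """Decide whether a breach of this tier's SLA should page someone."""
    if priority == "high":
        return True                  # high priority always pages
    if priority == "normal":
        return 8 <= now.hour < 22    # assumed daytime window: no night pages
    return now.weekday() < 5         # low priority: weekdays only
```

The win is operational: the same monitoring fires for everything, but only the tiers worth waking someone up for actually page.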
22. Does your infra lose the BCD?
Business-critical data: royalty calculations, user accounts, ads, etc.
➔ High SLO
➔ “Special treatment” when recovering from an incident
➔ …and special observability, since the number of “BCD” events is limited
23. How to prove that no data is lost?
[Diagram: SDK → Service → Receiver Service → P/S → “Make hourly partitions, dedup, anonymize” → hourly partitions]
24. Who is watching the watcher
[Diagram: the same flow (SDK → Service → Receiver Service → P/S → “Make hourly partitions, dedup, anonymize” → hourly partitions), plus a streaming job that feeds NACKed and rejected events into a Counting Service, whose totals are compared against the delivered partitions.]
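The "compare" step at the end can be sketched as a per-event-type reconciliation; the counts below are made up, and in the real system the received side comes from the counting service while the delivered side is counted from the hourly partitions:

```python
# Compare independently produced counts per event type: any mismatch
# between what the receiver accepted and what landed in the partitions
# means data was lost (or duplicated) somewhere in between.
received = {"stream_started": 1_000_000, "crash_report": 5_000}
delivered = {"stream_started": 1_000_000, "crash_report": 4_998}

def find_mismatches(received: dict, delivered: dict) -> dict:
    return {
        event: (received[event], delivered.get(event, 0))
        for event in received
        if received[event] != delivered.get(event, 0)
    }

print(find_mismatches(received, delivered))  # {'crash_report': (5000, 4998)}
```

Because the two counts are produced by independent systems, agreement between them is evidence that the pipeline itself is not silently dropping events.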