2. I’m Kinshuk Mishra
• Work on distributed systems and data science problems
• Lead architecture for ads backend platform at Spotify
• You can find me @_kinshukmishra
3. What is Spotify?
• Started in 2006
• Currently has over 24 million users
• 6 million paying users
• Available in 28 countries
• Over 300 engineers, of whom 100 are in NYC
4. Why are Ads important?
• getFreeTierUsers() / getAllUsers() > 0.70
• getSpotifyPayoutToMusicLabels() = $$$
• Great medium for promotions and announcements
7. The types of questions we have
What is the total number of available audio ad impressions on the iOS
platform between 9/12/2013 and 9/13/2013 in the NYC metro area for male
users in the 18-35 age group who typically listen to the hip-hop
music genre?
8. What is unique about us?
• Rules triggering ad breaks are unique
• We also log user activity and audio streaming data
9. Different approaches
• Simulate ad delivery by replaying user events and
triggering ad breaks
• Pre-compute impression aggregates for different
dimensions and build a complex model to combine those
• Use subset of impression data then filter and extrapolate it
using a simple model
14. What was the big picture going to be like?
[Diagram components: Hadoop (ad impression log), Postgres DB (booked campaigns), forecasting engine, forecast query]
15. High-level forecasting engine algorithm
• Load log data (daily)
• Cache campaign data (once a minute)
• Submit forecast query
• Wait for query
• Apply filter criteria to dataset
• Count available impressions
• Apply growth and other extrapolation factors
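The flow on this slide can be sketched roughly as follows. This is a minimal illustration, not the engine described in the talk: the class name `ForecastEngine`, its method names, the dictionary-based impression rows, and the single multiplicative `growth_factor` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

# Hypothetical row shape: one logged ad impression as a plain dict.
Impression = Dict[str, object]


@dataclass
class ForecastEngine:
    # Assumed: growth and other extrapolation factors collapse to one multiplier.
    growth_factor: float = 1.0
    impressions: List[Impression] = field(default_factory=list)
    booked: Dict[str, int] = field(default_factory=dict)

    def load_log_data(self, rows: Iterable[Impression]) -> None:
        """Replace the in-memory impression dataset (run daily)."""
        self.impressions = list(rows)

    def cache_campaign_data(self, booked: Dict[str, int]) -> None:
        """Refresh the cached booked-campaign data (run once a minute)."""
        self.booked = dict(booked)

    def forecast(self, matches: Callable[[Impression], bool]) -> float:
        """Apply the filter criteria, count impressions, then extrapolate."""
        count = sum(1 for row in self.impressions if matches(row))
        return count * self.growth_factor
```

A query is then just a predicate over impression rows, for example `engine.forecast(lambda row: row["platform"] == "iOS")`. The cached campaign data is held but not used in this sketch.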
24. Optimizing data retrieval
• We analyzed our data access patterns and found that over 75% of
our campaigns are targeted by age and location.
• So we mapped each location to a list of users sorted by age, using
SortedSetMultimap
• This reduced user lookup by location and age group from a typical
O(kN) to O(k log N), where
N: total users for a location
k: constant
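To illustrate the idea, here is a small sketch of a location-to-sorted-users index. It uses a plain Python dict of sorted lists with `bisect` as a stand-in for the SortedSetMultimap named on the slide. The class `UsersByLocation`, its methods, and the assumption that ages are integers are all hypothetical.

```python
from bisect import bisect_left, insort
from collections import defaultdict


class UsersByLocation:
    """Maps each location to a list of (age, user_id) pairs kept sorted by age."""

    def __init__(self):
        self._by_location = defaultdict(list)

    def add(self, location, age, user_id):
        # insort keeps the per-location list ordered by (age, user_id).
        insort(self._by_location[location], (age, user_id))

    def in_age_range(self, location, min_age, max_age):
        """Return user ids at `location` with min_age <= age <= max_age.

        Assumes integer ages. Both range boundaries are found by binary
        search, so locating them costs O(log N) rather than a full scan.
        """
        users = self._by_location.get(location, [])
        lo = bisect_left(users, (min_age,))
        hi = bisect_left(users, (max_age + 1,))
        return [user_id for _, user_id in users[lo:hi]]
```

Finding the two boundaries is logarithmic in the number of users at the location. Returning the matches still costs time proportional to how many users fall in the range.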
26. How to find available inventory for the sample population?
1. Take all user ad impressions by applying “day of the month”
substitution
2. Apply filters by ad-type, location, age, gender, platform, etc.
3. Count the total impressions for all the users who match
4. Read booked impressions for similar targeting criteria from
the cache
5. Inventory available = total impressions – booked
impressions
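The five steps above can be sketched as one function. This is an illustration under stated assumptions, not the talk's implementation. In particular, "day of the month" substitution is assumed here to mean that impressions logged on a past date stand in for the forecast date with the same day of the month. The row fields and the predicate-based `criteria` are hypothetical.

```python
from datetime import date


def available_inventory(impressions, forecast_dates, criteria, booked_impressions):
    """Estimate available impressions for the sample population.

    impressions: logged rows as dicts, each with a "date" (hypothetical schema)
    forecast_dates: future dates being forecast
    criteria: dict mapping a field name to a predicate on that field's value
    booked_impressions: booked count for the targeting criteria (read from cache)
    """
    # Step 1: assumed day-of-month substitution, using historical rows
    # whose day of the month matches one of the forecast dates.
    days = {d.day for d in forecast_dates}
    candidates = [row for row in impressions if row["date"].day in days]

    # Step 2: apply the targeting filters (ad type, location, age, ...).
    matching = [
        row for row in candidates
        if all(check(row[name]) for name, check in criteria.items())
    ]

    # Step 3: count the matching impressions.
    total = len(matching)

    # Steps 4 and 5: subtract the booked impressions supplied by the caller.
    return total - booked_impressions


# Made-up sample log for illustration only.
SAMPLE_LOG = [
    {"date": date(2013, 8, 12), "platform": "iOS", "age": 25},
    {"date": date(2013, 8, 12), "platform": "android", "age": 25},
    {"date": date(2013, 8, 13), "platform": "iOS", "age": 30},
    {"date": date(2013, 8, 14), "platform": "iOS", "age": 30},
    {"date": date(2013, 8, 13), "platform": "iOS", "age": 40},
]
```

Expressing each criterion as a predicate lets range filters such as age sit alongside exact-match filters such as platform.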
28. Extrapolation
• Population (15 million) -> Sample (150,000)
• Scaling factor is 100
• Total Available inventory = scaling factor * available inventory for sample
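As a worked example of the arithmetic on this slide, the scaling factor is the population size divided by the sample size. The function names are hypothetical, and the sample inventory figure in the usage note is invented for illustration.

```python
def scaling_factor(population, sample):
    """Ratio used to scale sample-level counts up to the full population."""
    return population / sample


def extrapolate(sample_available, population=15_000_000, sample=150_000):
    """Total available inventory = scaling factor * sample available inventory."""
    return sample_available * scaling_factor(population, sample)
```

With the slide's figures, `scaling_factor(15_000_000, 150_000)` gives 100. A made-up sample inventory of 1,200 impressions would therefore extrapolate to 120,000.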
29. Other features
• Ad Frequency capping
• Day of the week and time of the day filtering
• View per user (VPU) capping
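The slide only names these features, so the following is one possible reading of per-user capping rather than the talk's method. It assumes a cap limits how many impressions from any single user count toward inventory. The function name and inputs are hypothetical.

```python
from collections import Counter


def capped_impressions(user_ids, cap):
    """Count impressions, letting each user contribute at most `cap` of them.

    user_ids: one entry per logged impression, naming the user who received it
    cap: assumed maximum number of countable impressions per user
    """
    per_user = Counter(user_ids)
    return sum(min(count, cap) for count in per_user.values())
```

For instance, with a cap of 2, a user with three logged impressions contributes only two to the total.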
30. What worked for us?
1. Fast lookups
2. Simple models scaled well
3. Deterministic algorithms easier to debug
4. Adding new targeting features was easy
5. Forecasting engine agnostic to changes in ad server
31. What didn’t work that well?
1. Campaign level forecasts difficult without simulation
2. Cold start is a real problem when there is no proxy dataset
3. Forecasting inventory for new ad types can be challenging
32. What we’ve learnt
• Think data volume
• Consider Sampling
• Choose appropriate time window
• Analyze data access patterns and optimize for it
• Use deterministic algorithms
• Analyze data trends and factor those in computation
• Simple models scale well
33. May 12, 2014
Email - Kinshuk@spotify.com
https://twitter.com/Spotifyjobs
Thanks!