Spotify has many time series to forecast, such as streams, monthly active users, and ads inventory and consumption. The Prophet library has been very effective at capturing most of our time-series requirements; however, tuning is necessary because the default parameters don't perform well on our noisier datasets, which have many confounding factors.
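As a sketch of what that tuning looks like: Prophet's trend flexibility and seasonality strength are controlled by a few hyperparameters. The values below are illustrative guesses for noisy business series, not the parameters Spotify actually uses.

```python
# Hypothetical non-default Prophet hyperparameters for a noisy series.
# Parameter names are Prophet's real ones; the values are assumptions.
TUNED_PARAMS = {
    "changepoint_prior_scale": 0.5,        # default 0.05; higher = more flexible trend
    "seasonality_prior_scale": 1.0,        # default 10.0; lower = damped seasonality
    "seasonality_mode": "multiplicative",  # growth-driven series often scale this way
}

def build_model(params=TUNED_PARAMS):
    """Construct a Prophet model with non-default hyperparameters."""
    # Deferred import: the package is `prophet` (published as `fbprophet`
    # in older releases, which is what existed at the time of this talk).
    from prophet import Prophet
    return Prophet(**params)
```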
3. Who am I
● Data & Software Engineer
● 3 years on Spotify’s Forecasting team
● Spark, Scio, Google Cloud & Hadoop user
4. Spotify’s Fast Facts
Spotify is a digital music, podcast, and video streaming freemium service
● 100M+ Subscribers
● 217M+ Monthly Active Users
● 50M+ Songs
● 3B+ Playlists
● 79 Markets
(retrieved 2019-08)
https://newsroom.spotify.com/company-info/
5. Forecasting at Spotify
● We forecast streams, monthly active users, ads inventory, etc.
● 10K+ time series to forecast
● Usually forecasting 2 years ahead
● Happens at the end of each quarter
6. Time Series Data
“a series of data points indexed (or listed or graphed) in time order.” Wikipedia
7. What is Prophet
● An open source time series forecasting tool by Facebook
● Available in Python and R
● Forecasting with default settings is often as accurate as more complex models
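A minimal default-settings Prophet run looks like the sketch below. The function name and the two-year horizon are illustrative; the `ds`/`y` column names are Prophet's required input format.

```python
import pandas as pd

def forecast_default(history: pd.DataFrame, horizon_days: int = 730) -> pd.DataFrame:
    """Fit Prophet with all defaults and forecast ~2 years ahead.

    `history` must have Prophet's expected columns: `ds` (date) and `y` (value).
    """
    from prophet import Prophet  # deferred import; requires `pip install prophet`
    m = Prophet()                # all defaults: linear trend, automatic seasonalities
    m.fit(history)
    future = m.make_future_dataframe(periods=horizon_days)
    return m.predict(future)     # includes yhat, yhat_lower, yhat_upper columns
```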
8. Spark on GCP*
● Dataproc is a Google Cloud service for running Spark
● Fast
● Easy-to-use
● Fully managed
● Cost-effective
* Google Cloud Platform
9. Python is the new King!
● Majority of data science libs are available in Python
● Prophet is also in Python
● Python+Spark=PySpark helps to scale easily
● Installing Python libs on Dataproc is a piece of cake
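Scaling many independent Prophet fits with PySpark typically uses the grouped-map pattern: each time series becomes one group, fitted on one worker. This is a sketch using Spark 3's `applyInPandas` (the 2019-era equivalent was a `GROUPED_MAP` `pandas_udf`); the column names `series_id`, `ds`, `y` and the table name are assumptions.

```python
import pandas as pd

def forecast_one_series(pdf: pd.DataFrame) -> pd.DataFrame:
    """Runs on one worker for one group, i.e. one time series."""
    from prophet import Prophet  # imported on the worker, not the driver
    m = Prophet()
    m.fit(pdf[["ds", "y"]])
    future = m.make_future_dataframe(periods=730)  # ~2 years ahead
    out = m.predict(future)[["ds", "yhat"]]
    out["series_id"] = pdf["series_id"].iloc[0]    # carry the group key through
    return out

def run(spark, input_table: str):
    """Fan the 10K+ series out across the cluster, one Prophet fit per group."""
    df = spark.table(input_table)  # hypothetical table of (series_id, ds, y) rows
    return (
        df.groupBy("series_id")
          .applyInPandas(
              forecast_one_series,
              schema="ds timestamp, yhat double, series_id string",
          )
    )
```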
14. In-house Model Distributor
● A tool to distribute/scale a Python model
● Plug & play, no Spark knowledge needed
● Runs a model with permutation of parameter values
● Model outputs can be aggregated with Spark for further analysis
● Full integration with BigQuery, Cloud Storage and Dataproc
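"Runs a model with permutation of parameter values" amounts to expanding a parameter grid into its Cartesian product and fanning one model run out per combination. A stdlib sketch of the expansion (the grid values are illustrative, not Spotify's):

```python
from itertools import product

# Illustrative search grid; parameter names are Prophet's, values are guesses.
GRID = {
    "changepoint_prior_scale": [0.01, 0.05, 0.5],
    "seasonality_prior_scale": [0.1, 1.0, 10.0],
}

def permutations(grid):
    """Expand a dict of value lists into a list of parameter dicts (Cartesian product)."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
```

Each resulting dict can then be shipped to a worker as one model run, and the outputs aggregated with Spark to pick the best-scoring combination.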
21. Challenges
● Getting the most out of the Dataproc cluster
● It’s not efficient to tune all 10K+ forecasts
● Jobs finishing with no results
● Testing locally with local Spark is not enough to catch some errors
● Big output fragmented into many small files
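One common mitigation for the small-files problem is to reduce the number of output partitions before writing. This is a generic sketch, not Spotify's fix; the path and target file count are assumptions, and the right count depends on output size.

```python
def write_compact(df, path: str, target_files: int = 64):
    """Merge a DataFrame's partitions down before writing, so the job emits
    a bounded number of reasonably sized files instead of thousands of tiny ones."""
    # coalesce() avoids a full shuffle; use repartition() instead if the
    # partitions are badly skewed and need rebalancing.
    df.coalesce(target_files).write.mode("overwrite").parquet(path)
```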