Spotify has many time series to forecast, such as streams, monthly active users, and ads inventory and consumption. The Prophet library has been very effective at capturing most of our time-series requirements; however, tuning is necessary because the default parameters don't perform well on our noisier datasets, which have many confounding factors.
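As a sketch of what that tuning looks like: Prophet's trend flexibility and seasonality strength are controlled by a few hyperparameters. The values below are illustrative guesses for noisy business series, not the parameters Spotify actually uses.

```python
# Hypothetical non-default Prophet hyperparameters for a noisy series.
# Parameter names are Prophet's real ones; the values are assumptions.
TUNED_PARAMS = {
    "changepoint_prior_scale": 0.5,        # default 0.05; higher = more flexible trend
    "seasonality_prior_scale": 1.0,        # default 10.0; lower = damped seasonality
    "seasonality_mode": "multiplicative",  # growth-driven series often scale this way
}

def build_model(params=TUNED_PARAMS):
    """Construct a Prophet model with non-default hyperparameters."""
    # Deferred import: the package is `prophet` (published as `fbprophet`
    # in older releases, which is what existed at the time of this talk).
    from prophet import Prophet
    return Prophet(**params)
```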
3. Who am I
● Data & Software Engineer
● 3 years on Spotify’s Forecasting team
● Spark, Scio, Google Cloud & Hadoop user
4. Spotify’s Fast Facts
Spotify is a digital music, podcast, and video streaming freemium service
● 100M+ Subscribers
● 217M+ Monthly Active Users
● 50M+ Songs
● 3B+ Playlists
● 79 Markets
(retrieved 2019-08)
https://newsroom.spotify.com/company-info/
5. Forecasting at Spotify
● We forecast streams, monthly active users, ads inventory, etc.
● 10K+ time series to forecast
● Usually forecasting 2 years ahead
● Happens at the end of each quarter
6. Time Series Data
“a series of data points indexed (or listed or graphed) in time order.” Wikipedia
7. What is Prophet
● An open source time series forecasting tool by Facebook
● Available in Python and R
● Forecasting with default settings is often as accurate as more complex models
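A minimal default-settings Prophet run looks like the sketch below. The function name and the two-year horizon are illustrative; the `ds`/`y` column names are Prophet's required input format.

```python
import pandas as pd

def forecast_default(history: pd.DataFrame, horizon_days: int = 730) -> pd.DataFrame:
    """Fit Prophet with all defaults and forecast ~2 years ahead.

    `history` must have Prophet's expected columns: `ds` (date) and `y` (value).
    """
    from prophet import Prophet  # deferred import; requires `pip install prophet`
    m = Prophet()                # all defaults: linear trend, automatic seasonalities
    m.fit(history)
    future = m.make_future_dataframe(periods=horizon_days)
    return m.predict(future)     # includes yhat, yhat_lower, yhat_upper columns
```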
8. Spark on GCP*
● Dataproc is a Google Cloud service for running Spark
● Fast
● Easy-to-use
● Fully managed
● Cost-effective
* Google Cloud Platform
9. Python is the new King!
● Majority of data science libs are available in Python
● Prophet is also in Python
● Python+Spark=PySpark helps to scale easily
● Installing Python libs on Dataproc is a piece of cake
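Scaling many independent Prophet fits with PySpark typically uses the grouped-map pattern: each time series becomes one group, fitted on one worker. This is a sketch using Spark 3's `applyInPandas` (the 2019-era equivalent was a `GROUPED_MAP` `pandas_udf`); the column names `series_id`, `ds`, `y` and the table name are assumptions.

```python
import pandas as pd

def forecast_one_series(pdf: pd.DataFrame) -> pd.DataFrame:
    """Runs on one worker for one group, i.e. one time series."""
    from prophet import Prophet  # imported on the worker, not the driver
    m = Prophet()
    m.fit(pdf[["ds", "y"]])
    future = m.make_future_dataframe(periods=730)  # ~2 years ahead
    out = m.predict(future)[["ds", "yhat"]]
    out["series_id"] = pdf["series_id"].iloc[0]    # carry the group key through
    return out

def run(spark, input_table: str):
    """Fan the 10K+ series out across the cluster, one Prophet fit per group."""
    df = spark.table(input_table)  # hypothetical table of (series_id, ds, y) rows
    return (
        df.groupBy("series_id")
          .applyInPandas(
              forecast_one_series,
              schema="ds timestamp, yhat double, series_id string",
          )
    )
```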
14. In-house Model Distributor
● A tool to distribute/scale a Python model
● Plug & play, no Spark knowledge needed
● Runs a model with permutation of parameter values
● Model outputs can be aggregated with Spark for further analysis
● Full integration with BigQuery, Cloud Storage and Dataproc
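"Runs a model with permutation of parameter values" amounts to expanding a parameter grid into its Cartesian product and fanning one model run out per combination. A stdlib sketch of the expansion (the grid values are illustrative, not Spotify's):

```python
from itertools import product

# Illustrative search grid; parameter names are Prophet's, values are guesses.
GRID = {
    "changepoint_prior_scale": [0.01, 0.05, 0.5],
    "seasonality_prior_scale": [0.1, 1.0, 10.0],
}

def permutations(grid):
    """Expand a dict of value lists into a list of parameter dicts (Cartesian product)."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
```

Each resulting dict can then be shipped to a worker as one model run, and the outputs aggregated with Spark to pick the best-scoring combination.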
21. Challenges
● Getting the most out of the Dataproc cluster
● It’s not efficient to tune all 10K+ forecasts
● Jobs finishing with no results
● Testing locally with local Spark is not enough to catch some errors
● Big output fragmented into many small files
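One common mitigation for the small-files problem is to reduce the number of output partitions before writing. This is a generic sketch, not Spotify's fix; the path and target file count are assumptions, and the right count depends on output size.

```python
def write_compact(df, path: str, target_files: int = 64):
    """Merge a DataFrame's partitions down before writing, so the job emits
    a bounded number of reasonably sized files instead of thousands of tiny ones."""
    # coalesce() avoids a full shuffle; use repartition() instead if the
    # partitions are badly skewed and need rebalancing.
    df.coalesce(target_files).write.mode("overwrite").parquet(path)
```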