WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Ben Weber, Zynga
Automating Predictive Modeling
at Zynga with Pandas UDFs
#UnifiedAnalytics #SparkAISummit
Zynga Analytics
Zynga Portfolio
Our Challenge
• We want to build game-specific models for
behaviors such as likelihood to purchase
• Our games have diverse event taxonomies
• We have tens of millions of players and
dozens of games across multiple platforms
Our Approach
• Featuretools for automating feature engineering
• Pandas UDFs for distributing Featuretools
• Databricks for building our model pipeline
AutoModel
• Zynga’s first portfolio-scale data product
• Generates hundreds of propensity models
• Powers features in our games & live services
AutoModel Pipeline
Data Extract → Feature Engineering → Feature Application → Model Training → Model Publish
Data Extraction
Pipeline stage: Data Extract (S3 & Parquet)
Feature Engineering
Pipeline stage: Feature Engineering
Automated Feature Engineering
• Goals
– Translate our narrow and deep data tables into a shallow
and wide representation
– Support dozens of titles with diverse event taxonomies
– Scale to billions of records and millions of players
– Minimize manual data science workflows
Featuretools
• A Python library for deep feature synthesis
• Represents data as entity sets
• Identifies feature descriptors for transforming
your data into new representations
Entity Sets
Entityset: transactions
Entities:
customers (shape = [5, 3])
transactions (shape = [500, 5])
Relationships:
transactions.customer_id -> customers.customer_id
• Define the relationships between tables
• Work with Pandas data frames
Feature Synthesis
import featuretools as ft
feature_matrix, features_defs = ft.dfs(entityset=es, target_entity="customers")
feature_matrix.head(5)
customer_id zip_code count(transactions) sum(transactions.amounts)
1 91000 0 0
2 91000 10 120.5
3 91005 5 17.96
4 91005 2 9.99
5 91000 3 29.97
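The aggregation features that dfs generates, such as count(transactions) and sum(transactions.amounts), reduce to ordinary groupby aggregations over the child entity. A pandas-only sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical transactions table mirroring the entity set above
transactions = pd.DataFrame({
    "customer_id": [2, 2, 3, 4, 5],
    "amount": [100.0, 20.5, 17.96, 9.99, 29.97],
})

# Each aggregation feature DFS generates per customer corresponds to
# a groupby aggregation over the transactions entity
features = transactions.groupby("customer_id")["amount"].agg(["count", "sum"])
print(features)
```

Deep feature synthesis automates discovering and stacking these aggregations across the entity relationships, so this step never has to be written by hand per game.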
Using Featuretools
import featuretools as ft
# one-hot encode the raw event data
es = ft.EntitySet(id="events")
es = es.entity_from_dataframe(entity_id="events", dataframe=rawDataDF)
feature_matrix, defs = ft.dfs(entityset=es, target_entity="events", max_depth=1)
encodedDF, encoders = ft.encode_features(feature_matrix, defs)
# perform deep feature synthesis on the encoded data
es = ft.EntitySet(id="events")
es = es.entity_from_dataframe(entity_id="events", dataframe=encodedDF)
es = es.normalize_entity(base_entity_id="events", new_entity_id="users", index="user_id")
generated_features, descriptors = ft.dfs(entityset=es, target_entity="users", max_depth=3)
Scaling Up
• Parallelize the process
• Translate feature descriptions to Spark SQL
• Find a way to distribute the task
Feature Application
Pipeline stage: Feature Application (Pandas UDFs)
Pandas UDFs
• Introduced in Spark 2.3
• Provide scalar and grouped map operations
• Partitioned using a groupby clause
• Enable distributing code that uses Pandas
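A grouped map UDF body is plain pandas: Spark splits the data by the groupby key and hands each group to the function as a pandas data frame. The contract can be sketched locally, without Spark, using hypothetical column names:

```python
import pandas as pd

def summarize(group_pd: pd.DataFrame) -> pd.DataFrame:
    # This is the shape of a grouped map UDF body: receive one group
    # as a pandas DataFrame, return a pandas DataFrame
    return pd.DataFrame({
        "user_id": [group_pd["user_id"].iloc[0]],
        "events": [len(group_pd)],
    })

df = pd.DataFrame({"user_id": [1, 1, 2], "action": ["a", "b", "a"]})

# Locally, groupby().apply-style iteration plays the role Spark plays at scale
result = pd.concat(summarize(g) for _, g in df.groupby("user_id"))
print(result)
```

On a cluster the same function is registered with @pandas_udf and applied via groupby().apply(), which is what distributes the pandas logic across executors.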
Grouped Map UDFs
Spark Input → (split by groupby key) → Pandas Input → UDF → Pandas Output → (combined) → Spark Output, with one UDF invocation per group
When to use UDFs?
• You need to operate on Pandas data frames
• Your data can be represented as a single Spark
data frame
• You can partition your data set
Distributing SciPy
from scipy.optimize import leastsq
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

schema = StructType([StructField('ID', LongType(), True),
                     StructField('b0', DoubleType(), True),
                     StructField('b1', DoubleType(), True)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def analyze_player(player_pd):
    # fit is the residual function defined elsewhere in the notebook
    result = leastsq(fit, [1, 0], args=(player_pd.shots, player_pd.hits))
    return pd.DataFrame({'ID': [player_pd.player_id.iloc[0]],
                         'b0': [result[0][0]], 'b1': [result[0][1]]})

result_spark_df = spark_df.groupby('player_id').apply(analyze_player)
Step 1: Define the schema
Step 2: Choose a partition
Step 3: Use Pandas
Step 4: Return Pandas
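The per-player fit inside the UDF can be exercised locally before distributing it. This sketch assumes a simple linear residual function fit, standing in for whatever residual the notebook actually defines:

```python
import pandas as pd
from scipy.optimize import leastsq

# Hypothetical residual function matching the leastsq call in the UDF:
# model hits as a linear function of shots
def fit(params, shots, hits):
    b0, b1 = params
    return hits - (b0 + b1 * shots)

# Synthetic data for one player: hits = 2 * shots exactly
player_pd = pd.DataFrame({"shots": [1.0, 2.0, 3.0, 4.0],
                          "hits":  [2.0, 4.0, 6.0, 8.0]})

result = leastsq(fit, [1, 0], args=(player_pd.shots, player_pd.hits))
b0, b1 = result[0]
print(b0, b1)
```

Testing the function on a single pandas data frame first is the easiest way around the debugging limitations of UDFs noted later.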
Distributing Featuretools
# schema matches the generated feature matrix; saved_features are the
# feature definitions produced by running ft.dfs on a sample of the data
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def apply_feature_generation(pandasInputDF):
    # create the entity set representation for this partition
    es = ft.EntitySet(id="events")
    es = es.entity_from_dataframe(entity_id="events", dataframe=pandasInputDF)
    es = es.normalize_entity(base_entity_id="events", new_entity_id="users", index="user_id")
    # apply the saved feature definitions and return the result
    return ft.calculate_feature_matrix(saved_features, es)

sparkFeatureDF = sparkInputDF.groupby('user_group').apply(apply_feature_generation)
Issues with Pandas UDFs
• Debugging is a challenge
• Pushes the limits of Apache Arrow
• Data type mismatches
• Schema needs to be known before execution
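One way to ease the up-front schema requirement is to derive the schema from a sample of the UDF's pandas output. This sketch builds a Spark DDL string from pandas dtypes; the dtype mapping is illustrative and covers only a few common types:

```python
import pandas as pd

# Minimal pandas dtype -> Spark SQL type mapping (extend as needed)
SPARK_TYPES = {"int64": "long", "float64": "double", "object": "string"}

def ddl_schema(sample_pd: pd.DataFrame) -> str:
    # Build a DDL string usable as the schema argument of a pandas UDF
    return ", ".join(f"{col} {SPARK_TYPES[str(dtype)]}"
                     for col, dtype in sample_pd.dtypes.items())

# Run the pandas logic on a small sample, then derive the schema from it
sample = pd.DataFrame({"ID": [1], "b0": [0.5], "b1": [1.5]})
print(ddl_schema(sample))
```

This keeps the schema declaration in sync with the generated features, which matters when Featuretools produces hundreds of columns.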
Model Training & Scoring
Pipeline stage: Model Training (MLlib)
Propensity Models
• Classification models
– Gradient-Boosted Trees
– XGBoost
• Hyperparameter tuning
– ParamGridBuilder
– CrossValidator
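ParamGridBuilder enumerates the cross-product of the candidate values supplied for each hyperparameter, and CrossValidator then scores one model per combination. The expansion itself is equivalent to the following (parameter names are illustrative):

```python
from itertools import product

# ParamGridBuilder-style expansion: every combination of the values
# supplied for each hyperparameter
grid = {"maxDepth": [3, 5, 7], "maxIter": [10, 20]}

param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(param_maps))
```

With k-fold cross validation, each of these candidate settings is trained and evaluated k times, so grid size directly multiplies training cost.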
Model Application
Pipeline stage: Model Publish (Couchbase)
Productizing with Databricks
Driver notebook (thread pool) → Jobs API → model notebooks (one per game) → publish scores
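The fan-out from the driver notebook can be sketched with a thread pool; run_model_notebook here is a hypothetical stand-in for dbutils.notebook.run, which is what a Databricks driver notebook would actually call:

```python
from concurrent.futures import ThreadPoolExecutor

def run_model_notebook(game: str) -> str:
    # Stand-in for dbutils.notebook.run("model_notebook", timeout,
    # {"game": game}) on Databricks
    return f"published scores for {game}"

games = ["game_1", "game_2", "game_3"]

# One thread per game keeps the long-running notebook jobs concurrent;
# the threads mostly wait on the Jobs API, so a thread pool is sufficient
with ThreadPoolExecutor(max_workers=len(games)) as pool:
    results = list(pool.map(run_model_notebook, games))
print(results)
```

Because each model notebook runs as its own job, a failure in one game's model does not block the rest of the portfolio.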
Pandas UDFs at Zynga
• AutoModel
– Featuretools
• Experimentation
– StatsModels
– SciPy
– NumPy
Machine Learning at Zynga
Old Approach
• Custom data science and
engineering work per model
• Months-long development process
• Ad-hoc process for productizing
models
New Approach
• Minimal effort for building new
propensity models
• No custom work for new games
• Predictions are deployed to
our application database
Takeaways
• Pandas UDFs unlock a new scale of processing for
Python libraries
• Zynga is using PySpark to build portfolio-scale
data products
Questions?
• We are hiring! Zynga.com/jobs
Ben Weber Zynga Analytics @bgweber
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

Weitere ähnliche Inhalte

Was ist angesagt?

Scipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonScipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in Python
Wes McKinney
 

Was ist angesagt? (20)

Data Visualization Tools in Python
Data Visualization Tools in PythonData Visualization Tools in Python
Data Visualization Tools in Python
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentation
 
Scipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonScipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in Python
 
Scikit Learn intro
Scikit Learn introScikit Learn intro
Scikit Learn intro
 
Machine learning algorithms
Machine learning algorithmsMachine learning algorithms
Machine learning algorithms
 
KERAS Python Tutorial
KERAS Python TutorialKERAS Python Tutorial
KERAS Python Tutorial
 
Machine Learning Inference at the Edge
Machine Learning Inference at the EdgeMachine Learning Inference at the Edge
Machine Learning Inference at the Edge
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Data Science in Manufacturing and Automation
Data Science in Manufacturing and AutomationData Science in Manufacturing and Automation
Data Science in Manufacturing and Automation
 
Unsupervised Learning - Teaching AI to Understand Our World
Unsupervised Learning - Teaching AI to Understand Our WorldUnsupervised Learning - Teaching AI to Understand Our World
Unsupervised Learning - Teaching AI to Understand Our World
 
High Dimensional Data Visualization
High Dimensional Data VisualizationHigh Dimensional Data Visualization
High Dimensional Data Visualization
 
Multiple Classifier Systems
Multiple Classifier SystemsMultiple Classifier Systems
Multiple Classifier Systems
 
Module 2: Machine Learning Deep Dive
Module 2:  Machine Learning Deep DiveModule 2:  Machine Learning Deep Dive
Module 2: Machine Learning Deep Dive
 
Deep Learning using Keras
Deep Learning using KerasDeep Learning using Keras
Deep Learning using Keras
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
 
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with Python
 
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesPython - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning Libraries
 
Introduction to Statistical Machine Learning
Introduction to Statistical Machine LearningIntroduction to Statistical Machine Learning
Introduction to Statistical Machine Learning
 

Ähnlich wie Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs

Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
Miguel González-Fierro
 
Taking Web Apps Offline
Taking Web Apps OfflineTaking Web Apps Offline
Taking Web Apps Offline
Pedro Morais
 
Drupal 8. Search API. Facets. Customize / combine facets
Drupal 8. Search API. Facets. Customize / combine facetsDrupal 8. Search API. Facets. Customize / combine facets
Drupal 8. Search API. Facets. Customize / combine facets
AnyforSoft
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Databricks
 

Ähnlich wie Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs (20)

Strategies for refactoring and migrating a big old project to be multilingual...
Strategies for refactoring and migrating a big old project to be multilingual...Strategies for refactoring and migrating a big old project to be multilingual...
Strategies for refactoring and migrating a big old project to be multilingual...
 
Elasticsearch first-steps
Elasticsearch first-stepsElasticsearch first-steps
Elasticsearch first-steps
 
Relevance trilogy may dream be with you! (dec17)
Relevance trilogy  may dream be with you! (dec17)Relevance trilogy  may dream be with you! (dec17)
Relevance trilogy may dream be with you! (dec17)
 
When you need more data in less time...
When you need more data in less time...When you need more data in less time...
When you need more data in less time...
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
 
Ember.js Tokyo event 2014/09/22 (English)
Ember.js Tokyo event 2014/09/22 (English)Ember.js Tokyo event 2014/09/22 (English)
Ember.js Tokyo event 2014/09/22 (English)
 
Educate 2017: Customizing Assessments: Why extending the APIs is easier than ...
Educate 2017: Customizing Assessments: Why extending the APIs is easier than ...Educate 2017: Customizing Assessments: Why extending the APIs is easier than ...
Educate 2017: Customizing Assessments: Why extending the APIs is easier than ...
 
Taking Web Apps Offline
Taking Web Apps OfflineTaking Web Apps Offline
Taking Web Apps Offline
 
Big Objects in Salesforce
Big Objects in SalesforceBig Objects in Salesforce
Big Objects in Salesforce
 
Drupal 8. Search API. Facets. Customize / combine facets
Drupal 8. Search API. Facets. Customize / combine facetsDrupal 8. Search API. Facets. Customize / combine facets
Drupal 8. Search API. Facets. Customize / combine facets
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
A miało być tak... bez wycieków
A miało być tak... bez wyciekówA miało być tak... bez wycieków
A miało być tak... bez wycieków
 
Ibis: Seamless Transition Between Pandas and Apache Spark
Ibis: Seamless Transition Between Pandas and Apache SparkIbis: Seamless Transition Between Pandas and Apache Spark
Ibis: Seamless Transition Between Pandas and Apache Spark
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
 
Spark AI 2020
Spark AI 2020Spark AI 2020
Spark AI 2020
 
Introduction to Swagger
Introduction to SwaggerIntroduction to Swagger
Introduction to Swagger
 
Magento Indexes
Magento IndexesMagento Indexes
Magento Indexes
 
SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"
SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"
SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"
 
What's Coming Next in Sencha Frameworks
What's Coming Next in Sencha FrameworksWhat's Coming Next in Sencha Frameworks
What's Coming Next in Sencha Frameworks
 

Mehr von Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Kürzlich hochgeladen

Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
vexqp
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 

Kürzlich hochgeladen (20)

Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 

Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs

  • 1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
  • 2. Ben Weber, Zynga Automating Predictive Modeling at Zynga with Pandas UDFs #UnifiedAnalytics #SparkAISummit
  • 5. Our Challenge • We want to build game-specific models for behaviors such as likelihood to purchase • Our games have diverse event taxonomies • We have tens of millions of players and dozens of games across multiple platforms 5#UnifiedAnalytics #SparkAISummit
  • 6. Our Approach • Featuretools for automating feature engineering • Pandas UDFs for distributing Featuretools • Databricks for building our model pipeline 6#UnifiedAnalytics #SparkAISummit
  • 7. AutoModel • Zynga’s first portfolio-scale data product • Generates hundreds of propensity models • Powers features in our games & live services 7#UnifiedAnalytics #SparkAISummit
  • 11. Automated Feature Engineering • Goals – Translate our narrow and deep data tables into a shallow and wide representation – Support dozens of titles with diverse event taxonomies – Scale to billions of records and millions of players – Minimize manual data science workflows 11#UnifiedAnalytics #SparkAISummit
  • 12. Featuretools • A Python library for deep feature synthesis • Represents data as entity sets • Identifies feature descriptors for transforming your data into new representations 12#UnifiedAnalytics #SparkAISummit
  • 13. Entity Sets 13#UnifiedAnalytics #SparkAISummit
      Entityset: transactions
        Entities:
          customers (shape = [5, 3])
          transactions (shape = [500, 5])
        Relationships:
          transactions.customer_id -> customers.customer_id
    • Define the relationships between tables
    • Work with Pandas data frames
  • 14. Feature Synthesis 14#UnifiedAnalytics #SparkAISummit
      import featuretools as ft
      feature_matrix, features_defs = ft.dfs(entityset=es, target_entity="customers")
      feature_matrix.head(5)

      customer_id  zip_code  count(transactions)  sum(transactions.amounts)
      1            91000     0                    0
      2            91000     10                   120.5
      3            91005     5                    17.96
      4            91005     2                    9.99
      5            91000     3                    29.97
  • 15. Using Featuretools 15#UnifiedAnalytics #SparkAISummit
      import featuretools as ft

      # 1-hot encode the raw event data
      es = ft.EntitySet(id="events")
      es = es.entity_from_dataframe(entity_id="events", dataframe=rawDataDF)
      feature_matrix, defs = ft.dfs(entityset=es, target_entity="events", max_depth=1)
      encodedDF, encoders = ft.encode_features(feature_matrix, defs)

      # perform deep feature synthesis on the encoded data
      es = ft.EntitySet(id="events")
      es = es.entity_from_dataframe(entity_id="events", dataframe=encodedDF)
      es = es.normalize_entity(base_entity_id="events", new_entity_id="users", index="user_id")
      generated_features, descriptors = ft.dfs(entityset=es, target_entity="users", max_depth=3)
  • 16. Scaling Up • Parallelize the process • Translate feature descriptions to Spark SQL • Find a way to distribute the task 16#UnifiedAnalytics #SparkAISummit
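One way to read "translate feature descriptions to Spark SQL": simple aggregation features produced by deep feature synthesis (e.g. SUM(transactions.amount)) map directly onto SQL aggregate expressions. A hypothetical translator for that narrow case — the descriptor string format and the helper are illustrative, not Featuretools API:

```python
def descriptor_to_sql(descriptor: str) -> str:
    # e.g. "SUM(transactions.amount)" -> "SUM(amount) AS sum_transactions_amount"
    func, rest = descriptor.split("(", 1)
    table, column = rest.rstrip(")").split(".")
    alias = f"{func.lower()}_{table}_{column}"
    return f"{func}({column}) AS {alias}"

expr = descriptor_to_sql("SUM(transactions.amount)")
```

A real translator would need to handle nested and transform primitives, which is why distributing the Featuretools computation itself (next slides) is the more general route.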
  • 18. Pandas UDFs • Introduced in Spark 2.3 • Provide Scalar and Grouped map operations • Partitioned using a groupby clause • Enable distributing code that uses Pandas 18#UnifiedAnalytics #SparkAISummit
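A grouped map Pandas UDF hands each group's rows to your function as a plain Pandas data frame, so the per-group logic can be written and tested locally with pandas before wiring it into Spark. A minimal sketch — the column names here are made up for illustration:

```python
import pandas as pd

def summarize_group(group_pd: pd.DataFrame) -> pd.DataFrame:
    # the function sees one group's rows as a regular Pandas data frame
    return pd.DataFrame({
        'player_id': [group_pd['player_id'].iloc[0]],
        'total_hits': [group_pd['hits'].sum()],
    })

# local test with plain pandas; on Spark the same function would be
# decorated as a grouped map UDF and applied with
# spark_df.groupby('player_id').apply(summarize_group)
events = pd.DataFrame({'player_id': [1, 1, 2], 'hits': [3, 4, 5]})
result = pd.concat([summarize_group(g) for _, g in events.groupby('player_id')],
                   ignore_index=True)
```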
  • 19. Grouped Map UDFs 19#UnifiedAnalytics #SparkAISummit
    (diagram: a Spark input data frame is partitioned by the groupby key; each partition is converted to a Pandas input data frame, passed through the UDF to produce a Pandas output, and the outputs are combined into a single Spark output data frame)
  • 20. When to use UDFs? • You need to operate on Pandas data frames • Your data can be represented as a single Spark data frame • You can partition your data set 20#UnifiedAnalytics #SparkAISummit
  • 21. Distributing SciPy 21#UnifiedAnalytics #SparkAISummit
      schema = StructType([StructField('ID', LongType(), True),
                           StructField('b0', DoubleType(), True),
                           StructField('b1', DoubleType(), True)])

      @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
      def analyze_player(player_pd):
          # fit is the residual function defined elsewhere
          result = leastsq(fit, [1, 0], args=(player_pd.shots, player_pd.hits))
          return pd.DataFrame({'ID': [player_pd.player_id[0]],
                               'b0': result[0][0], 'b1': result[0][1]})

      result_spark_df = spark_df.groupby('player_id').apply(analyze_player)
  • 22. Step 1: Define the schema 22#UnifiedAnalytics #SparkAISummit
  • 23. Step 2: Choose a partition 23#UnifiedAnalytics #SparkAISummit
  • 24. Step 3: Use Pandas 24#UnifiedAnalytics #SparkAISummit
  • 25. Step 4: Return Pandas 25#UnifiedAnalytics #SparkAISummit
  • 26. Distributing Featuretools 26#UnifiedAnalytics #SparkAISummit
      @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
      def apply_feature_generation(pandasInputDF):
          # create Entity Set representation
          es = ft.EntitySet(id="events")
          es = es.entity_from_dataframe(entity_id="events", dataframe=pandasInputDF)
          es = es.normalize_entity(base_entity_id="events", new_entity_id="users", index="user_id")
          # apply the feature calculation and return the result
          return ft.calculate_feature_matrix(saved_features, es)

      sparkFeatureDF = sparkInputDF.groupby('user_group').apply(apply_feature_generation)
  • 27. Issues with Pandas UDFs • Debugging is a challenge • Pushes the limits of Apache Arrow • Data type mismatches • Schema needs to be known before execution 27#UnifiedAnalytics #SparkAISummit
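The last point is the painful one in practice: a grouped map UDF's output schema must be declared before execution. One workaround is to run the feature generation on a small sample locally and derive the schema from the resulting Pandas dtypes — here rendered as a DDL string, which Spark also accepts in place of a StructType. The dtype-to-Spark-type mapping below is a deliberately simplified assumption:

```python
import pandas as pd

# simplified pandas-dtype -> Spark-SQL-type mapping (assumption, not exhaustive)
_SPARK_TYPES = {'int64': 'bigint', 'float64': 'double',
                'object': 'string', 'bool': 'boolean'}

def schema_from_sample(sample_pd: pd.DataFrame) -> str:
    # build a Spark DDL schema string, e.g. "user_id bigint, score double"
    fields = [f"{col} {_SPARK_TYPES.get(str(dtype), 'string')}"
              for col, dtype in sample_pd.dtypes.items()]
    return ", ".join(fields)

sample = pd.DataFrame({'user_id': [1], 'score': [0.5]})
ddl = schema_from_sample(sample)
```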
  • 28. Model Training &amp; Scoring 28#UnifiedAnalytics #SparkAISummit
    (pipeline diagram: Data Extract → Feature Engineering → Feature Application → Model Training → Model Publish, with MLlib used for the training and scoring stages)
  • 29. Propensity Models • Classification models – Gradient-Boosted Trees – XGBoost • Hyperparameter tuning – ParamGridBuilder – CrossValidator 29#UnifiedAnalytics #SparkAISummit
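ParamGridBuilder simply enumerates the cross-product of the candidate parameter values, and CrossValidator then fits and evaluates the model once per combination and fold. The grid-expansion half can be shown with plain Python — the parameter names are illustrative:

```python
from itertools import product

def build_param_grid(grid: dict) -> list:
    # expand {'maxDepth': [3, 5], 'maxIter': [10, 20]} into the
    # cross-product of parameter maps, as ParamGridBuilder does
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[n] for n in names))]

param_maps = build_param_grid({'maxDepth': [3, 5], 'maxIter': [10, 20]})
```

With k-fold cross-validation, CrossValidator performs len(param_maps) × k fits, so grids grow expensive quickly.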
  • 31. Productizing with Databricks 31#UnifiedAnalytics #SparkAISummit
    (diagram: a driver notebook uses a thread pool and the Jobs API to launch a model notebook per game — Game 1, Game 2, Game 3 — and then publishes the scores)
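The driver pattern on this slide can be approximated with a thread pool that launches one model notebook per game. On Databricks the launch call would be dbutils.notebook.run (or a Jobs API request); it is stubbed out below so the orchestration sketch is runnable anywhere:

```python
from concurrent.futures import ThreadPoolExecutor

def run_model_notebook(game: str) -> str:
    # stand-in for dbutils.notebook.run("model_notebook", timeout, {"game": game}),
    # which only exists inside a Databricks workspace
    return f"scores_published:{game}"

games = ["Game 1", "Game 2", "Game 3"]
with ThreadPoolExecutor(max_workers=3) as pool:
    # one notebook run per game, executed concurrently
    results = list(pool.map(run_model_notebook, games))
```

Running the per-game notebooks concurrently keeps the wall-clock time of the nightly pipeline bounded by the slowest game rather than the sum of all games.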
  • 32. Pandas UDFs at Zynga • AutoModel – Featuretools • Experimentation – StatsModels – SciPy – NumPy 32#UnifiedAnalytics #SparkAISummit
  • 33. Machine Learning at Zynga Old Approach • Custom data science and engineering work per model • Months-long development process • Ad-hoc process for productizing models New Approach • Minimal effort for building new propensity models • No custom work for new games • Predictions are deployed to our application database 33#UnifiedAnalytics #SparkAISummit
  • 34. Takeaways • Pandas UDFs unlock a new magnitude of processing for Python libraries • Zynga is using PySpark to build portfolio-scale data products 34#UnifiedAnalytics #SparkAISummit
  • 35. Questions? • We are hiring! Zynga.com/jobs Ben Weber Zynga Analytics @bgweber 35#UnifiedAnalytics #SparkAISummit
  • 36. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT