Building propensity models at Zynga used to be a time-intensive task that required custom data science and engineering work for every new model. We've built an automated model pipeline that uses PySpark and automated feature generation to eliminate this manual work. The challenge we faced was that Featuretools, the library we wanted to use for automated feature engineering, works only on Pandas data frames, which limited the size of the data sets we could handle. Our solution is to use Pandas UDFs to scale the feature engineering process to our entire player base: we start with the full set of players, partition the data into smaller chunks that fit into memory, apply the feature engineering step to each subset, and then combine the results back into one large data set. This presentation outlines how we use Pandas UDFs in production to automate propensity modeling at Zynga. The outcome is that we now have hundreds of propensity models in production that teams can use to personalize game experiences. Instead of spending time on feature engineering and model fitting, our data scientists now spend more of their time engaging with game teams to help build new features.
5. Our Challenge
• We want to build game-specific models for behaviors such as likelihood to purchase
• Our games have diverse event taxonomies
• We have tens of millions of players and dozens of games across multiple platforms
6. Our Approach
• Featuretools for automating feature engineering
• Pandas UDFs for distributing Featuretools
• Databricks for building our model pipeline
7. AutoModel
• Zynga’s first portfolio-scale data product
• Generates hundreds of propensity models
• Powers features in our games & live services
11. Automated Feature Engineering
• Goals
– Translate our narrow and deep data tables into a shallow and wide representation (sketched below)
– Support dozens of titles with diverse event taxonomies
– Scale to billions of records and millions of players
– Minimize manual data science workflows
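To illustrate the narrow-to-wide goal, here is a minimal Pandas sketch with hypothetical event data; the real pipeline derives these features with Featuretools rather than a simple crosstab:

import pandas as pd

# narrow and deep: one row per raw event (hypothetical data)
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event": ["login", "purchase", "login"],
})

# shallow and wide: one row per player, one column per derived feature
wide = pd.crosstab(events["user_id"], events["event"])
# event    login  purchase
# user_id
# 1            1         1
# 2            1         0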
12. Featuretools
• A Python library for deep feature synthesis
• Represents data as entity sets
• Identifies feature descriptors for transforming your data into new representations
13. Entity Sets
• Define the relationships between tables
• Work with Pandas data frames

Entityset: transactions
  Entities:
    customers (shape = [5, 3])
    transactions (shape = [500, 5])
  Relationships:
    transactions.customer_id -> customers.customer_id
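A minimal sketch of how an entity set like the one above could be constructed with the Featuretools 0.x API used in this talk; the table contents are hypothetical:

import pandas as pd
import featuretools as ft

# hypothetical parent (customers) and child (transactions) tables
customers_df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "country": ["US", "US", "CA", "GB", "US"],
    "signup_date": pd.date_range("2019-01-01", periods=5),
})
transactions_df = pd.DataFrame({
    "transaction_id": range(500),
    "customer_id": [i % 5 + 1 for i in range(500)],
    "item": ["coins"] * 500,
    "amount": [1.0] * 500,
    "timestamp": pd.date_range("2019-01-01", periods=500, freq="H"),
})

# register both tables and the key that links them
es = ft.EntitySet(id="transactions")
es = es.entity_from_dataframe(entity_id="customers", dataframe=customers_df, index="customer_id")
es = es.entity_from_dataframe(entity_id="transactions", dataframe=transactions_df, index="transaction_id")
es = es.add_relationship(ft.Relationship(es["customers"]["customer_id"],
                                         es["transactions"]["customer_id"]))
print(es)  # prints a summary like the one shown above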
15. Using Featuretools
import featuretools as ft

# one-hot encode the raw event data (rawDataDF is a Pandas data frame of events)
es = ft.EntitySet(id="events")
es = es.entity_from_dataframe(entity_id="events", dataframe=rawDataDF)
feature_matrix, defs = ft.dfs(entityset=es, target_entity="events", max_depth=1)
encodedDF, encoders = ft.encode_features(feature_matrix, defs)

# perform deep feature synthesis on the encoded data, aggregating events per user
es = ft.EntitySet(id="events")
es = es.entity_from_dataframe(entity_id="events", dataframe=encodedDF)
es = es.normalize_entity(base_entity_id="events", new_entity_id="users", index="user_id")
generated_features, descriptors = ft.dfs(entityset=es, target_entity="users", max_depth=3)
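The distributed step on slide 26 applies a saved set of feature descriptors. Here is a hedged sketch of how the descriptors returned by dfs above might be persisted and reloaded with Featuretools' save_features/load_features; the file path is illustrative:

# persist the feature descriptors so the distributed step can reuse them
ft.save_features(descriptors, "features.json")
saved_features = ft.load_features("features.json")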
16. Scaling Up
• Parallelize the process
• Translate feature descriptors to Spark SQL
• Find a way to distribute the task
18. Pandas UDFs
• Introduced in Spark 2.3
• Provide Scalar and Grouped map operations
• Partitioned using a groupby clause
• Enable distributing code that uses Pandas
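A minimal scalar Pandas UDF sketch using the Spark 2.3-era API; plus_one and the input column are illustrative, and an active SparkSession named spark is assumed:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType, col
from pyspark.sql.types import DoubleType

# scalar Pandas UDF: receives and returns a pd.Series, one Arrow batch at a time
@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1.0

df = spark.range(10).withColumn("value", col("id").cast("double"))
df.withColumn("value_plus_one", plus_one(col("value"))).show()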
20. When to use UDFs?
• You need to operate on Pandas data frames
• Your data can be represented as a single Spark data frame
• You can partition your data set
26. Distributing Featuretools
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def apply_feature_generation(pandasInputDF):
    # create the entity set representation for this partition of players
    es = ft.EntitySet(id="events")
    es = es.entity_from_dataframe(entity_id="events", dataframe=pandasInputDF)
    es = es.normalize_entity(base_entity_id="events", new_entity_id="users", index="user_id")
    # apply the saved feature descriptors and return the result as a Pandas data frame
    return ft.calculate_feature_matrix(saved_features, es)

sparkFeatureDF = sparkInputDF.groupby('user_group').apply(apply_feature_generation)
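The schema argument above must be known before execution (see the issues on the next slide). One hedged way to derive it is to run the same Featuretools logic on a small sample of players and reuse the inferred schema; the sample size is illustrative:

# hypothetical: infer the output schema from a small sample of players
sample_pd = sparkInputDF.limit(10000).toPandas()
es = ft.EntitySet(id="events")
es = es.entity_from_dataframe(entity_id="events", dataframe=sample_pd)
es = es.normalize_entity(base_entity_id="events", new_entity_id="users", index="user_id")
sample_matrix = ft.calculate_feature_matrix(saved_features, es)
schema = spark.createDataFrame(sample_matrix.reset_index()).schema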
27. Issues with Pandas UDFs
• Debugging is a challenge
• Pushes the limits of Apache Arrow
• Data type mismatches between Pandas and Spark
• The output schema needs to be known before execution
28. Model Training & Scoring
Diagram: the pipeline runs as Data Extract → Feature Engineering → Feature Application → Model Training → Model Publish, with MLlib powering the model training and scoring steps.
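A hedged MLlib sketch of the training and scoring steps, assuming sparkFeatureDF from the feature application step plus a binary label column; the column names and the LogisticRegression choice are illustrative:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# assemble the generated feature columns into a single vector column
feature_cols = [c for c in sparkFeatureDF.columns if c not in ("user_id", "label")]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_df = assembler.transform(sparkFeatureDF)

# fit a propensity model and score players
lr = LogisticRegression(maxIter=10)
model = lr.fit(train_df)
scores = model.transform(train_df).select("user_id", "probability")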
31. Productizing with Databricks
Diagram: a driver notebook uses a thread pool to launch one model notebook per game (Game 1, Game 2, Game 3) through the Jobs API, then publishes the resulting scores.
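A hedged sketch of the driver notebook pattern, assuming a Databricks environment where dbutils is available; the notebook path, timeout, and game list are illustrative:

from concurrent.futures import ThreadPoolExecutor

games = ["game_1", "game_2", "game_3"]  # hypothetical game identifiers

def run_model_notebook(game):
    # dbutils.notebook.run launches a child notebook job and blocks until it completes
    return dbutils.notebook.run("model_notebook", 3600, {"game": game})

# launch one model notebook per game in parallel from the driver
with ThreadPoolExecutor(max_workers=len(games)) as pool:
    results = list(pool.map(run_model_notebook, games))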
33. Machine Learning at Zynga
Old Approach
• Custom data science and engineering work per model
• Months-long development process
• Ad-hoc process for productizing models
New Approach
• Minimal effort for building new propensity models
• No custom work for new games
• Predictions are deployed to our application database
34. Takeaways
• Pandas UDFs unlock a new magnitude of processing for Python libraries
• Zynga is using PySpark to build portfolio-scale data products
35. Questions?
• We are hiring! Zynga.com/jobs
Ben Weber, Zynga Analytics, @bgweber