Multi Model Machine Learning by Maximo Gurmendez and Beth Logan

Multi-Model Machine Learning for
Real Time Bidding over Display Ads
Beth Logan
Senior Director of Optimization
Maximo Gurmendez
Data Science Engineering Team Lead
With credit to our Spark developers:
Inés Guelfi, Juan Tejería, Martin Manasliski, Victoria Seoane

We wanted to try Spark but wondered:
• Is Spark thread safe?
• Is Spark fast enough?
• Does it use too much memory?

Agenda
1. What we do
2. How we do it
3. Why Spark?
4. Challenges Addressed
5. Main Takeaways

DataXu’s Mission
Make marketing
smarter through
Data Science!

Taking Action Automatically
• Bid in real-time ad auctions on behalf of advertisers
• Machine Learning System learns from past bids
[Diagram: bid flow among the User, the Browser, the Ad exchanges, the DataXu Real time system, and the DataXu Machine Learning system. Labeled messages: Request, Ad Bid Request, Ad Selection + Bid, Ad.]

DataXu ML System
[Diagram: ML system loop. Labeled elements: Ads shown and User actions (purchases, clicks, etc.), Hive database, Hadoop, Learn, Models, Calibrate, Evaluate, Only high quality models, Real Time Bidding.]

Why is this hard?
Huge Scale
• 2 Petabytes Processed Daily
• 1.6 Million Bid Decisions Per Second
• Runs 24x7 on 5 Continents
• Thousands of ML Models Trained per Day
Unattended Operation
• Model training and deployment runs automatically every day
Changing Industry
• Need ability to adapt quickly to new customer requirements

Why Spark?
• Large open source Machine Learning library
– Fast turnaround of research to production
– Easy to prototype and support new customer use cases
– Built-in upgrade of algorithms
– Increased reliability
• Trains models faster than Hadoop
• Enables iterative models
• Elastic environment via cloud

Challenges Addressed
• Smart Dataset Partitioning by Campaign
• Categorical Features
• Functional Features
• Pipelines + RowTransformers
• Use of SparkSQL
• Real-time model instantiations

Partitioning the data
• Need 1 RDD per campaign
• "Fat Reducers" or "Many files" problem
• 2-pass solution

Partitioning the data: Solution
• Sample the RDD
• Construct histogram of sizes
• Use histogram to allocate more processes (pseudo-sub-partition)
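The two-pass idea above can be sketched in plain Python, with no Spark dependency. This is only an illustration under stated assumptions: the record layout (`campaign_id`), the `target_rows` threshold, and both function names are hypothetical and do not come from the talk.

```python
import math
import random
from collections import Counter


def plan_sub_partitions(records, sample_fraction=0.1, target_rows=1000, seed=0):
    """Pass 1: sample the data, build a histogram of rows per campaign,
    and decide how many sub-partitions each campaign should get."""
    rng = random.Random(seed)
    sample = [r for r in records if rng.random() < sample_fraction]
    histogram = Counter(r["campaign_id"] for r in sample)
    plan = {}
    for campaign, sampled_rows in histogram.items():
        # Scale the sampled count back up to an estimate of the full size.
        estimated_rows = sampled_rows / sample_fraction
        plan[campaign] = max(1, math.ceil(estimated_rows / target_rows))
    return plan


def partition_key(record, plan, rng):
    """Pass 2: spread a large campaign over several (campaign, index) keys
    so that no single worker receives all of its rows."""
    n = plan.get(record["campaign_id"], 1)
    return (record["campaign_id"], rng.randrange(n))
```

With such a plan, a large campaign is split across several workers while a small one stays on a single key, which is one way to avoid both the "fat reducer" and the "many files" extremes.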

Spark ML Pipelines
[Diagram: pipeline stages. Transformers: Raw Feature Transformation, Feature Encoding, Feature Selection. Estimator: Decision Tree Trainer.]

Spark ML Pipelines
[Class diagram]
Transformer: transform(DataFrame): DataFrame
Estimator: fit(DataFrame): Model
Model extends Transformer
• Great for training, evaluation & experimentation
• Can we use them at bid time?

ML Pipelines: Row Transformer
Problem: At bid time there is no DataFrame!
Solution: Use a row transformer
[Class diagram]
Transformer: transform(DataFrame): DataFrame
RowTransformer: transform(Row): Row
RowTransformer extends Transformer
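A minimal sketch of the row transformer idea, in plain Python rather than against the Spark API. A list of dicts stands in for a DataFrame and a dict stands in for a Row; the `HourOfDay` feature and the method name `transform_row` are hypothetical and only illustrate the pattern.

```python
from abc import ABC, abstractmethod


class RowTransformer(ABC):
    """A transformer whose logic is defined per row, so the same code
    can serve batch training and single-request scoring at bid time."""

    @abstractmethod
    def transform_row(self, row: dict) -> dict:
        """Transform one row (used directly at bid time)."""

    def transform(self, rows):
        """Batch path: apply the row logic to every row of a dataset."""
        return [self.transform_row(r) for r in rows]


class HourOfDay(RowTransformer):
    """Hypothetical functional feature: hour of day from a Unix timestamp."""

    def transform_row(self, row):
        out = dict(row)  # copy so the input row is left untouched
        out["hour"] = (row["timestamp"] // 3600) % 24
        return out
```

Because the batch method is derived from the per-row method, training and bidding share one implementation of each feature.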

Meta-Pipeline Extension
• Combines and evaluates several pipelines
• DAG with all steps and dependencies
• JSON Configurable
• Pipelines = All possible paths from root to leaves
• Use to train multiple classifiers with little overhead (training time dominated by data read and transformation)
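To illustrate "pipelines = all possible paths from root to leaves", here is a small sketch that reads a JSON description of a DAG and enumerates every root-to-leaf path. The JSON layout (`root`, `edges`) and the step names are assumptions made for this example, not the configuration format used in the talk.

```python
import json

# Hypothetical JSON configuration: each step lists the steps that follow it.
CONFIG = json.loads("""
{
  "root": "raw_features",
  "edges": {
    "raw_features": ["enumerate_encoding", "onehot_encoding"],
    "enumerate_encoding": ["decision_tree", "random_forest"],
    "onehot_encoding": ["logistic_regression"]
  }
}
""")


def enumerate_pipelines(config):
    """Return every root-to-leaf path of the DAG; each path is one pipeline."""
    edges = config["edges"]
    pipelines = []

    def walk(node, path):
        path = path + [node]
        children = edges.get(node, [])
        if not children:  # a leaf closes one complete pipeline
            pipelines.append(path)
            return
        for child in children:
            walk(child, path)

    walk(config["root"], [])
    return pipelines
```

Pipelines that share a prefix (here, the two tree models after `enumerate_encoding`) could reuse the output of the shared steps, which fits the observation that data read and transformation dominate training time.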

Best model varies from campaign to campaign
[Chart: AUC of the candidate models across campaigns]

Models at bid time
• Standard java serialization
• Models relatively light-weight and fast
[Charts: Model Size in Memory (KB) and Avg. Latency (milliseconds), comparing the Current DataXu Model with a Spark Random Forest. Annotation: Add Preprocessing Metadata ~ 130K]

Use and abuse of SparkSQL
• Choosing features via select command
• Functional features and categorical to numerical encoding via UDFs
• Top K feature values via UDAF
• Reuse UDFs at bid time
• Imperative to declarative
• Huge savings in LOC

SparkSQL: TopK UDAF Example
For categorical encoding we first obtain the most popular nominals:
select topk(os) from training_data
Result:
{windows:1562, macos:928, linux:21}
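The aggregation behind a top-K UDAF can be sketched without Spark as an object with update, merge, and evaluate steps, which is the general shape of a distributed aggregate. The class and method names below are illustrative and are not the Spark UDAF interface.

```python
from collections import Counter


class TopK:
    """Counts values and reports the k most frequent ones."""

    def __init__(self, k):
        self.k = k
        self.counts = Counter()

    def update(self, value):
        """Fold one input value into this partial aggregate."""
        self.counts[value] += 1

    def merge(self, other):
        """Combine with a partial aggregate computed on another partition."""
        self.counts.update(other.counts)

    def evaluate(self):
        """Final result: the k most frequent values with their counts."""
        return dict(self.counts.most_common(self.k))
```

Each partition would fill its own `TopK`, and the partial results would then be merged before `evaluate` is called.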

SparkSQL: Feature Encoding
Easily encode categorical features using UDFs
Enumerate: select enumerate_encode(os), enumerate_encode(browser) from training_data
Result:
1,3
3,1
2,1
One-Hot-Encoding: select onehot(os,'macos'), onehot(os,'windows'), onehot(os,'linux') …
Result:
1,0,0
0,1,0
0,0,1
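The two encodings can be sketched as plain Python functions of the kind that might sit behind such UDFs. How codes are assigned is an assumption here (most popular value gets code 1, unseen values get 0); the talk does not state its scheme, so this sketch does not try to reproduce the exact numbers shown in the results above. `build_codes` is a hypothetical helper.

```python
def build_codes(top_values):
    """Assign integer codes by popularity (assumption: most popular = 1).
    `top_values` maps each value to its count, as a top-K result would."""
    ranked = sorted(top_values, key=top_values.get, reverse=True)
    return {value: i + 1 for i, value in enumerate(ranked)}


def enumerate_encode(value, codes, unknown=0):
    """Map a categorical value to its integer code; unseen values get `unknown`."""
    return codes.get(value, unknown)


def onehot(value, target):
    """Return 1 if the value equals the target category, else 0."""
    return 1 if value == target else 0
```

For example, with the counts `{"windows": 1562, "macos": 928, "linux": 21}`, `build_codes` assigns windows, macos, and linux the codes 1, 2, and 3 under this assumption, and the same functions could be called on a single request at bid time.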

Takeaways
• It works!
• Spark SQL: maintainable & declarative
• Models can bid in real time
• Automated & unattended ML at large scale
• ML Pipelines had to be extended

THANK YOU!
blogan@dataxu.com
mgurmendez@dataxu.com
dataxu.com/careers
always looking for smart people!!
