Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
1. Multi-Model Machine Learning for Real Time Bidding over Display Ads
Beth Logan
Senior Director of Optimization
Maximo Gurmendez
Data Science Engineering Team Lead
With credit to our Spark developers:
Inés Guelfi, Juan Tejería, Martin Manasliski, Victoria Seoane
2. We wanted to try Spark but wondered
• Is Spark thread safe?
• Is Spark fast enough?
• Does it use too much memory?
3. Agenda
1. What we do
2. How we do it
3. Why Spark?
4. Challenges Addressed
5. Main Takeaways
6. Taking Action Automatically
• Bid in real-time ad auctions on behalf of advertisers
• Machine Learning System learns from past bids
[Diagram labels: User; Browser; Request; Ad; Ad exchanges; Ad Bid Request; Ad Selection + Bid; DataXu Real time system; DataXu Machine Learning system]
7. DataXu ML System
[Diagram labels: Ads shown; User actions (purchase, clicks, etc); Hive database; Hadoop; Learn; Calibrate; Evaluate; Models; Only high quality models; Real Time Bidding]
8. Why is this hard?
Huge Scale
• 2 Petabytes Processed Daily
• 1.6 Million Bid Decisions Per Second
• Runs 24x7 on 5 Continents
• Thousands of ML Models Trained per Day
Unattended Operation
• Model training and deployment run automatically every day
Changing Industry
• Need ability to adapt quickly to new customer requirements
9. Why Spark?
• Large open source Machine Learning library
– Fast turnaround of research to production
– Easy to prototype and support new customer use cases
– Built-in upgrade of algorithms
– Increased reliability
• Trains models faster than Hadoop
• Enables iterative models
• Elastic environment via cloud
10. Challenges Addressed
• Smart Dataset Partitioning by Campaign
• Categorical Features
• Functional Features
• Pipelines + RowTransformers
• Use of SparkSQL
• Real-time model instantiations
11. Partitioning the data
• Need 1 RDD per campaign
• "Fat Reducers" or "Many files" problem
• 2-pass solution
12. Partitioning the data: Solution
• Sample the RDD
• Construct histogram of sizes
• Use histogram to allocate more processes (pseudo-sub-partition)
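The two-pass idea above can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the function names, the `target_rows` threshold, and the record layout are hypothetical and are not taken from the DataXu implementation. Pass one samples records and estimates a size per campaign; pass two routes each record to one of the sub-partitions allocated to its campaign, so a large campaign is spread over several workers instead of landing on one "fat reducer".

```python
import math
import random
from collections import Counter

def sample_histogram(records, fraction, seed=0):
    """Pass 1: estimate rows per campaign from a random sample."""
    rng = random.Random(seed)
    counts = Counter(r["campaign"] for r in records if rng.random() < fraction)
    # Scale sampled counts back up to an estimate of the full size.
    return {c: n / fraction for c, n in counts.items()}

def allocate_subpartitions(histogram, target_rows):
    """Give each campaign ceil(size / target_rows) sub-partitions, at least 1."""
    return {c: max(1, math.ceil(size / target_rows)) for c, size in histogram.items()}

def assign_partition(record, allocation, rng):
    """Pass 2: pick a (campaign, sub-partition) key for one record."""
    campaign = record["campaign"]
    n = allocation.get(campaign, 1)  # campaigns missed by the sample get 1
    return (campaign, rng.randrange(n))

# Hypothetical data: one large and one small campaign.
records = [{"campaign": "big"}] * 9000 + [{"campaign": "small"}] * 100
hist = sample_histogram(records, fraction=1.0)  # sample everything for a deterministic demo
alloc = allocate_subpartitions(hist, target_rows=1000)
key = assign_partition(records[0], alloc, random.Random(1))
```

In a real job the sampling fraction would be well below 1.0, so the histogram is an estimate and the allocation only approximately balances the load.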
13. Spark ML Pipelines
Transformers: Raw Feature Transformation → Feature Encoding → Feature Selection
Estimator: Decision Tree Trainer
15. ML Pipelines: Row Transformer
Problem:
At bid time there is no DataFrame!
Solution:
Use row transformer
Transformer: transform(DataFrame):DataFrame
RowTransformer (extends Transformer): transform(Row):Row
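A minimal sketch of the row-transformer idea, written in plain Python rather than against the Spark API. Lists of dicts stand in for a DataFrame and a single dict stands in for a Row; `transform_row` and the `LowercaseOs` stage are hypothetical names chosen for the illustration. The point it tries to show is that the batch path can delegate to the per-row path, so the same logic serves training and bid time.

```python
from abc import ABC, abstractmethod

class Transformer(ABC):
    """Batch interface: stands in for transform(DataFrame): DataFrame."""

    @abstractmethod
    def transform(self, rows):
        ...

class RowTransformer(Transformer):
    """Adds a per-row entry point usable when only a single row exists."""

    @abstractmethod
    def transform_row(self, row):
        ...

    def transform(self, rows):
        # The batch path reuses the row logic, so both paths agree.
        return [self.transform_row(r) for r in rows]

class LowercaseOs(RowTransformer):
    """Hypothetical example stage: normalizes the 'os' field."""

    def transform_row(self, row):
        return {**row, "os": row["os"].lower()}

stage = LowercaseOs()
batch = stage.transform([{"os": "Windows"}, {"os": "MacOS"}])  # training-style call
single = stage.transform_row({"os": "Linux"})                  # bid-time-style call
```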
16. Meta-Pipeline Extension
• Combines and evaluates several pipelines
• DAG with all steps and dependencies
• JSON Configurable
• Pipelines = All possible paths from root to leaves
• Use to train multiple classifiers with little overhead (training time dominated by data read and transformation)
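The "pipelines = all paths from root to leaves" idea can be illustrated with a small plain-Python sketch. The JSON layout (`root`, `edges`) and the step names are assumptions invented for the example; the actual configuration format is not shown in the slides.

```python
import json

# Hypothetical JSON configuration describing the DAG of steps.
CONFIG = json.loads("""
{
  "root": "transform",
  "edges": {
    "transform": ["encode"],
    "encode": ["select_topk", "select_all"],
    "select_topk": ["decision_tree", "logistic_regression"],
    "select_all": ["decision_tree"]
  }
}
""")

def pipelines(config):
    """Return every root-to-leaf path in the DAG; each path is one pipeline."""
    edges = config["edges"]

    def walk(node, prefix):
        path = prefix + [node]
        children = edges.get(node, [])
        if not children:  # a leaf closes one complete pipeline
            yield path
        for child in children:
            yield from walk(child, path)

    return list(walk(config["root"], []))

paths = pipelines(CONFIG)
```

Here all three resulting pipelines share the `transform` and `encode` steps. The sketch only enumerates paths; a real implementation would also need to compute shared steps once to get the low overhead the slide describes.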
18. Models at bid time
• Standard java serialization
• Models relatively light-weight and fast
[Chart: Model Size in Memory (KB) and Avg. Latency (milliseconds), Current DataXu Model vs. Spark Random Forest. Annotation: Add Preprocessing Metadata ~ 130K]
19. Use and abuse of SparkSQL
• Choosing features via select command
• Functional features and categorical to numerical encoding via UDFs
• Top K feature values via UDAF
• Reuse UDFs at bid time
• Imperative to declarative
• Huge savings in LOC
20. SparkSQL: TopK UDAF Example
For categorical encoding we first obtain the most popular nominals:
select topk(os) from training_data
Result:
{windows:1562, macos:928, linux:21}
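What a top-K aggregate computes can be sketched in plain Python. This is not the UDAF from the talk, whose code is not shown; the function name mirrors the slide, but the `k` parameter and the sample column are assumptions made for the illustration.

```python
from collections import Counter

def topk(values, k):
    """Return the k most frequent values with their counts, most frequent first."""
    return dict(Counter(values).most_common(k))

# Hypothetical 'os' column; the counts are made up for the example.
os_column = ["windows"] * 5 + ["macos"] * 3 + ["linux"] * 2 + ["other"] * 1
top = topk(os_column, 3)
```

Values outside the top K (here `other`) drop out, which keeps the categorical vocabulary bounded before encoding.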
21. SparkSQL: Feature Encoding
Easily encode categorical features using UDFs
Enumerate: select enumerate_encode(os), enumerate_encode(browser) from training_data
Result:
1,3
3,1
2,1
One-Hot-Encoding: select onehot(os,'macos'), onehot(os,'windows'), onehot(os,'linux') …
Result:
1,0,0
0,1,0
0,0,1
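The two encodings can be sketched as ordinary Python functions of the kind one might register as UDFs. The bodies are assumptions: the slides show only the SQL calls, so the 1-based indexing, the mapping of unknown values to 0, and the helper name `make_enumerate_encoder` are hypothetical.

```python
def make_enumerate_encoder(top_values):
    """Map each known nominal to a 1-based index; unknown values map to 0."""
    index = {v: i for i, v in enumerate(top_values, start=1)}
    return lambda value: index.get(value, 0)

def onehot(value, target):
    """Return 1 when value equals target, else 0."""
    return 1 if value == target else 0

# Assumed vocabulary, e.g. taken from a top-K step.
encode_os = make_enumerate_encoder(["windows", "macos", "linux"])

rows = ["macos", "windows", "linux"]
enumerated = [encode_os(v) for v in rows]
one_hot = [[onehot(v, t) for t in ("macos", "windows", "linux")] for v in rows]
unknown = encode_os("solaris")
```

Because these are plain functions of a single value, the same code can in principle be called on one row at bid time, which is the reuse the earlier slide mentions.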
22. Takeaways
• It works!
• Spark SQL: maintainable & declarative
• Models can bid in real time
• Automated & unattended ML at large scale
• ML Pipelines had to be extended