The concept of the talk is as follows:
- to give a general idea of the user segmentation task in the DMP project and how solving this problem helps our business
- to describe how we use AutoML to solve this task and to explain its components
- to share the techniques we apply to make our pipeline fast and stable on huge datasets
3. About Rambler&Co AdTech
Projects:
• Data management platform (user
segmentation)
• Recommender systems
• “Lumiere” (forecasting offline cinema traffic)
• Computer vision
5. In this talk:
• DMP and user segmentation tasks explained
• Key structures of AutoML pipeline for user segmentation
• Problems we faced while maintaining the pipeline
• Feature engineering for machine learning at scale
• Optimization of pipeline tasks
6. Data management platform (DMP): a powerful AdTech
solution
• Collect user behavior data from various sources
• Integrate the data to create a complete customer view
• Store and manage audience segments
• Target audience segments in online ad campaigns
7. Types and sources of data
Types of data:
• 1st party data – raw event logs (visited websites)
• 2nd party data – customer journey data
• 3rd party data – data collected from partners
Sources of data:
• Media resources
• Products and services
• Data from ad campaigns, behavioral factors
• Other sources
8. DMP AutoML pipeline: a solution for any user segmentation task
About 1000 models are fitted on a daily basis
Every model is applied to about 300 million test samples daily
ML problems:
• binary/multiclass classification
• look-alike → binary classification (segment vs. random); a sketch of how such a train set can be built follows below
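For the look-alike case, the training set can be built by treating segment members as positives and a random sample of other users as negatives. The sketch below illustrates this framing; the DataFrame column (user_id), the helper name and the negative-to-positive ratio are assumptions for illustration, not the production code.

```python
# Framing look-alike as binary classification: segment vs. random users.
# Column names ("user_id") and the helper are illustrative assumptions.
import pandas as pd

def build_lookalike_train(segment: pd.DataFrame,
                          all_users: pd.DataFrame,
                          neg_ratio: float = 1.0) -> pd.DataFrame:
    positives = segment.assign(label=1)
    # Random users outside the segment serve as negatives.
    candidates = all_users[~all_users["user_id"].isin(segment["user_id"])]
    negatives = candidates.sample(n=int(len(segment) * neg_ratio),
                                  random_state=0).assign(label=0)
    return pd.concat([positives, negatives], ignore_index=True)
```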
10. General principles of DMP AutoML
• All models share a similar structure of fit and apply stages
• Adding models and managing them in production has to be possible through a web interface
• ML developers should not have to support the bulk of routine operations by hand
13. AutoML pipeline daily workflow
Felix
• Compute features
• Create train table
• Train models
• Compute pivots
• Load pivots
• Apply and slice predictions
• Compute metrics
• Load models
14. Workflow manager: Apache Airflow
• Run a series of tasks as a DAG (directed acyclic graph)
• Express task dependencies
• Handle failures
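A minimal sketch of how such a DAG can be expressed in Airflow 2.x is shown below; the dag_id, task names and callables are illustrative placeholders rather than the actual production DAG.

```python
# Minimal Airflow 2.x sketch of the train part of the pipeline.
# The dag_id, task names and callables are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def train_models():
    pass  # placeholder: fit the queued models on the freshly built train table

with DAG(
    dag_id="dmp_automl_train",
    start_date=datetime(2017, 1, 1),
    schedule_interval=timedelta(hours=4),                # the train DAG runs every 4 hours
    catchup=False,
    default_args={"retries": 2,
                  "retry_delay": timedelta(minutes=10)}, # handle failures with retries
) as dag:
    compute_features = PythonOperator(task_id="compute_features",
                                      python_callable=lambda: None)
    create_train_table = PythonOperator(task_id="create_train_table",
                                        python_callable=lambda: None)
    train = PythonOperator(task_id="train_models",
                           python_callable=train_models)

    # Task dependencies are expressed as edges of the DAG.
    compute_features >> create_train_table >> train
```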
15. Train and apply DAGs
Train DAG interval: every 4 hours
Apply DAG interval: every hour
16. Train and apply DAGs
Train DAG interval: every 4 hours
Apply DAG interval: every hour
Problem:
Some target segments (labels) take longer to compute than others.
Solution:
While some models wait for their target segments, other models keep training.
18. Key problems we faced
• Data collection delay
• Out-of-memory issues
• High-cardinality feature matrices
• Mapping predictions to label thresholds takes too much time
• Some models are applied more often than others
20. Data collection delay: do not wait too long
• Use an Airflow sensor to wait up to MAX_FEATURE_DELAY
• If the delay is exceeded, fill the missing parts of the feature table with the last computed day
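A plain-Python sketch of this fallback logic is shown below; MAX_FEATURE_DELAY, partition_exists and copy_partition are hypothetical names introduced only for illustration (in the pipeline itself this is handled by an Airflow sensor with a timeout).

```python
# Sketch of the "do not wait too long" fallback for delayed features.
# MAX_FEATURE_DELAY, partition_exists and copy_partition are hypothetical.
import time
from datetime import date, timedelta

MAX_FEATURE_DELAY = 6 * 60 * 60  # seconds to wait before giving up
POKE_INTERVAL = 10 * 60          # re-check every 10 minutes

def partition_exists(day: date) -> bool:
    """Check whether the feature table partition for `day` is ready."""
    ...  # query the metastore / storage here

def copy_partition(src_day: date, dst_day: date) -> None:
    """Fill the missing partition with the last computed day's data."""
    ...  # copy or alias the previous partition here

def wait_for_features(day: date) -> None:
    deadline = time.monotonic() + MAX_FEATURE_DELAY
    while time.monotonic() < deadline:
        if partition_exists(day):
            return                     # features arrived in time
        time.sleep(POKE_INTERVAL)
    # Delay exceeded: reuse the last computed day instead of blocking the DAG.
    copy_partition(day - timedelta(days=1), day)
```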
21. Feature engineering (FE): overcoming high-cardinality feature matrices
Main rule:
new features must be applicable to the majority of models
Key techniques:
• Counting-based FE
• Distance-based FE
22. Feature matrix of shape (N, 10000)
id Feature_1 ... Feature_10000
1 42 ... 542
.... ... ... ...
N 89 ... 0
23. Distance-based FE: cluster distance
Algorithm (a sketch follows below):
1) Reduce the dimension of the feature matrix if needed (we use SVD decomposition)
2) Fit the KMeans clustering algorithm with K clusters on the given data
3) Calculate the distance from each sample point to the centroid of each of the K clusters
4) Use the distances as the feature representation of the sample row
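A minimal scikit-learn sketch of these steps is given below; the matrix sizes, n_components and K are illustrative values, not the settings used in the pipeline.

```python
# Sketch of the cluster-distance feature; sizes and hyperparameters are illustrative.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((1000, 10000))          # stand-in for the (N, 10000) feature matrix

# 1) Reduce dimension with an SVD decomposition.
svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)

# 2) Fit KMeans with K clusters.
K = 20
kmeans = KMeans(n_clusters=K, random_state=0, n_init=10).fit(X_reduced)

# 3)-4) Distances from every sample to each of the K centroids
#        become the new (N, K) feature representation.
X_dist = kmeans.transform(X_reduced)   # shape (1000, 20)
```

The same fitted SVD and KMeans objects can be persisted and reused across models, which is exactly the "global" cluster distance feature described a few slides later.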
24. Feature matrix of shape (N, K)
id dist_to_1st_cluster ... dist_to_Kth_cluster
1 0.6757 ... 0.0942
.... ... ... ...
N 0.342 ... 0.6113
25. Problem:
Fitting KMeans until convergence and computing the distances separately for every model may take a lot of time…
26. Solution: "global" cluster distance feature
• Fit KMeans only once, on a representative unlabeled sample, to extract general information, and reuse it for all models
27. Experimental results:
• Replacing the per-model distance features with the "global" feature does not harm model quality
• Combining both feature representations improves the ROC AUC score by about 1%
28. Counting-based FE
Traditional approaches:
1) Feature hashing
2) One-hot encoding
User Domain Count
Bob news.rambler.ru 5
Bob auto.ru 11
Bob mercedes-benz.ru 15
29. Counting-based FE
Traditional approaches:
1) Feature hashing
2) One-hot encoding
General problems:
1) High cardinality
2) Efficient only with linear models
30. Counting-based FE: DRACULA
Domain Robust Algorithm for Counting Based Learning
Source: http://www.slideshare.net/SessionsEvents/misha-bilenko-principal-researcher-microsoft
31. Algorithm
1) Compute the counts table from all train data
2) Compute P(label | feature) for every unique feature
3) Aggregate the list of probabilities to get a low-cardinality data representation
32. Counts of visited domains for a single user:
User Domain Count
Bob news.rambler.ru 5
Bob auto.ru 11
Bob mercedes-benz.ru 15
Total counts in train data:
Domain Count
news.rambler.ru 95859
auto.ru 31040
mercedes-benz.ru 1386
37. Compute the data representation for a single user
P(label=0 | domain = "news.rambler.ru") = 0.43
P(label=0 | domain = "auto.ru") = 0.86
P(label=0 | domain = "mercedes-benz.ru") = 0.81
N = 10
Bins = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Histogram = [0, 0, 0, 0, 1, 0, 0, 0, 2, 0]
Interpretation: the concentration of nonzero elements in the histogram represents an estimate of P(label | user)
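The steps above can be condensed into a short sketch; the per-label counts below are made up so that they reproduce the totals and probabilities from the previous slides, and the helper name is an assumption.

```python
# Sketch of the counting-based (DRACULA-style) representation for one user.
# The per-label counts are invented to match the totals and probabilities above.
import numpy as np

# Counts table from all train data: visits per domain split by label.
counts = {
    "news.rambler.ru":  {0: 41219, 1: 54640},   # total 95859
    "auto.ru":          {0: 26694, 1: 4346},    # total 31040
    "mercedes-benz.ru": {0: 1123,  1: 263},     # total 1386
}

def p_label_given_domain(domain: str, label: int) -> float:
    c = counts[domain]
    return c[label] / (c[0] + c[1])

# Domains visited by a single user (Bob).
user_domains = ["news.rambler.ru", "auto.ru", "mercedes-benz.ru"]
probs = [p_label_given_domain(d, label=0) for d in user_domains]   # ~[0.43, 0.86, 0.81]

# Aggregate the list of probabilities into a fixed-length histogram:
# this is the low-cardinality feature vector fed to the model.
hist, bin_edges = np.histogram(probs, bins=10, range=(0.0, 1.0))
print(hist)   # [0 0 0 0 1 0 0 0 2 0]
```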
38. Advantages of the algorithm
• Scalable (add new features, recompute probabilities)
• Adaptive (works for binary and multiclass classification as well as regression)
• Efficient for gradient boosting decision trees due to low cardinality
• Features can be computed in a distributed manner (MapReduce)
• The counts table can be stored with a count-min sketch
39. Compute model pivots task
An approach to approximating label thresholds
Problems:
• How to select thresholds for labels?
• How to do it computationally fast?
Desired solution:
• Heuristics to approximate label thresholds
40. Compute model pivots task
An approach to approximating label thresholds
Precision vs. threshold for binary classification:
what probability threshold optimizes the given quality metric?
41. Compute model pivots task
An approach to approximating label thresholds
Algorithm (a sketch follows below):
• Take a sample of the apply data (we use 5%, about 15 million samples)
• Compute a probability histogram for this sample
• Use the Nth percentile as the estimate of the label threshold
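A numpy sketch of this estimation is shown below; the synthetic score distribution and the target percentile are illustrative choices, not values prescribed by the pipeline.

```python
# Sketch of percentile-based label-threshold (pivot) estimation.
# The score distribution and target percentile are illustrative.
import numpy as np

rng = np.random.default_rng(0)
apply_scores = rng.beta(2, 5, size=1_000_000)   # stand-in for model probabilities on apply data

# 1) Take a sample of the apply data (the talk uses about 5%).
sample = rng.choice(apply_scores, size=int(0.05 * apply_scores.size), replace=False)

# 2)-3) The Nth percentile of the sampled scores serves as the label threshold.
target_percentile = 95
threshold = np.percentile(sample, target_percentile)
print(f"estimated label threshold: {threshold:.3f}")
```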
42. Apply model task
Task interval: every hour
Number of models per run: 200
General problem:
• Some models are applied more often than others
43. Priority schema for applying models (a sketch follows below)
1) Request all models
2) Filter out models that are not yet trained
3) Sort by the date the model was added (descending)
4) Sort by the date of the last apply (ascending)
5) Take the N top-priority models
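Below is a small Python sketch of this priority schema; the Model record, its field names and the default batch size are assumptions made for illustration.

```python
# Sketch of the apply-priority selection; the Model record and its fields are assumed.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Model:
    name: str
    trained: bool
    added_at: datetime
    last_applied_at: Optional[datetime]  # None if never applied yet

def select_models_to_apply(models: list[Model], n: int = 200) -> list[Model]:
    # 2) Keep only models that have already been trained.
    trained = [m for m in models if m.trained]
    # 3)+4) Least recently applied first (never-applied models get top priority),
    #        ties broken by the most recently added model.
    trained.sort(key=lambda m: (m.last_applied_at or datetime.min,
                                -m.added_at.timestamp()))
    # 5) Take the N top-priority models for this run.
    return trained[:n]
```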