AutoML for user segmentation
Ilya Boytsov
Rambler&Co
Rambler&Co - largest media holding in Russia
About Rambler&Co AdTech
Projects:
• Data management platform (user segmentation)
• Recommender systems
• “Lumiere” (forecasting offline cinema traffic)
• Computer vision
In this talk:
• DMP and user segmentation tasks explained
• Key structures of AutoML pipeline for user segmentation
• Problems we faced while maintaining pipeline
• Feature engineering for machine learning at scale
• Optimization of pipeline tasks
Data management platform (DMP): a powerful AdTech solution
- Collect user behavior data from various sources
- Integrate data to create a complete customer view
- Store and manage audience segments
- Target audience segments in online ad campaigns
Types of data
1st party data – raw event logs (visited websites)
2nd party data – customer journey data
3rd party data – data collected from partners
Sources of data
Media resources
Products and services
Data from ad campaigns, behavioral factors
Other sources
DMP AutoML pipeline: solution for any user segmentation task
About 1,000 models are fitted daily
Every model is applied to 300 million test samples daily
ML problems:
• Binary/multiclass classification
• Look-alike -> binary classification (segment vs. random)
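The "segment vs. random" framing can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name, the choice to exclude segment members from the random negatives, and the fixed seed are all assumptions.

```python
import random


def build_lookalike_train_set(segment_ids, all_ids, n_negatives, seed=0):
    """Label segment members 1 and a random sample of other users 0.

    Hypothetical helper: excluding segment members from the negatives
    is an assumption, the slides only say "segment vs random".
    """
    rng = random.Random(seed)
    segment = set(segment_ids)
    candidates = sorted(uid for uid in all_ids if uid not in segment)
    negatives = rng.sample(candidates, min(n_negatives, len(candidates)))
    positives = [(uid, 1) for uid in sorted(segment)]
    return positives + [(uid, 0) for uid in negatives]


# Toy data: 3 segment users out of 20 known users.
train = build_lookalike_train_set([1, 2, 3], range(1, 21), n_negatives=5)
```

Any binary classifier can then be fitted on these labels, and its scores rank the remaining users by similarity to the segment.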
Examples: Look-alike modelling boosts CTR
[Figure: two panels, Retargeting_973 and Retargeting_1069, each compared with look-alike buckets 0.0-1%, 1.0-5.0% and 5.0-10.0%]
General principles of DMP AutoML
All models share a similar structure of fit and apply stages
Adding models and operating them must be possible through a web interface
ML developers are not needed to support key operations
Felix
Backoffice and web interface for the AutoML pipeline
Create new models, add new segments, visualize model performance and much more
AutoML pipeline daily workflow
[Diagram: tasks connected to Felix]
• Compute features
• Create train table
• Train models
• Compute pivots
• Load pivots
• Apply and slice predictions
• Compute metrics
• Load models
Workflow manager: Apache Airflow
• Run a series of tasks as DAG (directed acyclic graph)
• Express task dependencies
• Handle failures
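The idea of running tasks as a DAG with explicit dependencies can be illustrated without Airflow itself. The sketch below uses Python's standard `graphlib` module; the task names echo the workflow slide, but the dependency edges between them are assumptions made for illustration only.

```python
from graphlib import TopologicalSorter

# Each key depends on the tasks in its value set.
# The edges are hypothetical, the slides do not spell them out.
dependencies = {
    "create_train_table": {"compute_features"},
    "train_models": {"create_train_table"},
    "load_models": {"train_models"},
    "compute_pivots": {"load_models"},
    "apply_and_slice_predictions": {"compute_pivots"},
    "compute_metrics": {"apply_and_slice_predictions"},
}

# A valid execution order: every task comes after its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
```

A workflow manager adds scheduling, retries and failure handling on top of this ordering.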
Train and apply DAGs
Train DAG interval: every 4 hours
Apply DAG interval: every hour
Problem:
Some target segments (labels) finish computing later than others.
Solution:
While some models wait for their target segments, other models keep training.
Key problems we faced
• Data collection delay
• Out-of-memory issues
• High cardinality feature matrices
• Mapping predictions to label thresholds takes too long
• Some models are applied more often than others
Data collection delay: do not wait too long
• Use an Airflow sensor to wait up to MAX_FEATURE_DELAY
• If the delay is exceeded, fill the missing parts of the features table with the last computed day
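The wait-then-fall-back rule can be sketched in plain Python. This is only an illustration of the decision logic, not Airflow sensor code: the function name, the six-hour value of `MAX_FEATURE_DELAY`, and the example dates are hypothetical.

```python
from datetime import date, timedelta

MAX_FEATURE_DELAY = timedelta(hours=6)  # hypothetical value


def choose_feature_day(target_day, available_days, waited):
    """Return the day whose features should be used, or None to keep waiting."""
    if target_day in available_days:
        return target_day  # fresh features have arrived
    if waited < MAX_FEATURE_DELAY:
        return None  # still within the allowed delay, keep waiting
    # Delay exceeded: fall back to the last computed day before the target.
    earlier = [d for d in available_days if d < target_day]
    return max(earlier) if earlier else None


today = date(2019, 5, 10)
computed = {date(2019, 5, 8), date(2019, 5, 9)}
still_waiting = choose_feature_day(today, computed, waited=timedelta(hours=1))
fallback = choose_feature_day(today, computed, waited=timedelta(hours=7))
```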
Feature Engineering (FE): overcoming high cardinality feature matrices
Main rule:
New features must be applicable to a majority of models
Key techniques
• Counting based FE
• Distance based FE
Feature matrix of shape (N, 10000)
id Feature_1 ... Feature_10000
1 42 ... 542
.... ... ... ...
N 89 ... 0
Distance based FE: Cluster distance
Algorithm:
1) Reduce the dimension of the feature matrix if needed (we use SVD decomposition)
2) Fit the KMeans clustering algorithm with K clusters on the given data
3) Calculate the distance from each sample point to the centroid of every cluster, 1st to Kth
4) Use the K distances as the feature representation of the sample row
Feature matrix of shape (N, K)
id dist_to_1st_cluster ... dist_to_Kth_cluster
1 0.6757 ... 0.0942
.... ... ... ...
N 0.342 ... 0.6113
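Steps 2 to 4 can be sketched in pure Python. This is a toy illustration under stated assumptions: it skips the SVD step, uses plain Lloyd iterations with centroids initialised from the first K points, and uses Euclidean distance; a production pipeline would more likely rely on a library implementation.

```python
import math


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; centroids start as the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def cluster_distance_features(points, centroids):
    """One row per sample, one column per cluster: distance to each centroid."""
    return [[math.dist(p, c) for c in centroids] for p in points]


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centroids = kmeans(points, k=2)
features = cluster_distance_features(points, centroids)  # shape (N, K) = (4, 2)
```

The resulting matrix has K columns regardless of how many columns the original feature matrix had.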
Problem:
It may take a long time for KMeans to converge and compute distances for every model…
Solution: "Global" cluster distance feature
• Fit KMeans only once on a representative unlabeled sample to extract general information, and use it for all models
Experimental results:
• Replacing distance features fitted individually per model with the "Global" feature does not harm model quality
• Combining both feature representations improves the ROC AUC score by about 1%
Counting based FE
Traditional approaches:
1) Feature Hashing
2) One hot encoding
User Domain Count
Bob news.rambler.ru 5
Bob auto.ru 11
Bob mercedes-benz.ru 15
General problems:
1) High cardinality
2) Efficient only with linear models
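For reference, feature hashing can be sketched as below. This is a minimal illustration: the CRC32 hash, the bucket count of 8, and the function name are assumptions, not details from the talk. One-hot encoding would instead need one column per distinct domain, which is where the high cardinality comes from.

```python
import zlib


def hash_features(domain_counts, n_buckets):
    """Map domain counts into a fixed-size vector; colliding domains share a bucket."""
    vector = [0] * n_buckets
    for domain, count in domain_counts.items():
        bucket = zlib.crc32(domain.encode("utf-8")) % n_buckets
        vector[bucket] += count
    return vector


bob = {"news.rambler.ru": 5, "auto.ru": 11, "mercedes-benz.ru": 15}
hashed = hash_features(bob, n_buckets=8)
```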
Counting based FE: DRACULA
Distributed Robust Algorithm for CoUnt-based LeArning
Source: http://www.slideshare.net/SessionsEvents/misha-bilenko-principal-researcher-microsoft
Algorithm
01 Compute the counts table from all train data
02 Compute P(label | feature) for every unique feature
03 Aggregate the list of probabilities to get a low cardinality data representation
Counts of visited domains for a single user
User Domain Count
Bob news.rambler.ru 5
Bob auto.ru 11
Bob mercedes-benz.ru 15
Total counts in train data
Domain Count
news.rambler.ru 95859
auto.ru 31040
mercedes-benz.ru 1386
Counts table
Domain            Total count  Count(label=0)  Count(label=1)
news.rambler.ru   95859        41268           54591
auto.ru           31040        26809           4231
mercedes-benz.ru  1386         1120            266
Total             128285       69197           59088
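Building such a counts table is a group-by-key aggregation, which is also why it distributes well. The sketch below is a single-machine illustration; the row layout `(domain, label, visits)` and the toy numbers are hypothetical.

```python
from collections import defaultdict


def build_counts_table(rows):
    """Aggregate (domain, label, visits) rows into per-domain label counts."""
    table = defaultdict(lambda: {0: 0, 1: 0})
    for domain, label, visits in rows:
        table[domain][label] += visits
    return {
        domain: {"total": c[0] + c[1], 0: c[0], 1: c[1]}
        for domain, c in table.items()
    }


# Toy train data: one row per (user, domain) with that user's label.
rows = [
    ("auto.ru", 0, 11),
    ("auto.ru", 1, 4),
    ("news.rambler.ru", 1, 5),
    ("auto.ru", 0, 2),
]
counts = build_counts_table(rows)
```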
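General formula: probability of label given domain

$$P(\text{label} \mid \text{domain}) = \frac{N_{\text{label}} + N_{\text{pos}}^{\text{prior}}}{N + N_{\text{pos}}^{\text{prior}} + N_{\text{neg}}^{\text{prior}}}$$

where
$N_{\text{label}}$ = domain frequency with the given label
$N_{\text{pos}}^{\text{prior}}$ = smoothing constant × prior class probability
$N_{\text{neg}}^{\text{prior}} = 1 - N_{\text{pos}}^{\text{prior}}$
$N = N_{\text{pos}} + N_{\text{neg}}$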
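The smoothed estimate can be checked numerically against the counts table. In this sketch the function name, the prior of 0.5 and the smoothing constant of 1.0 are assumptions; applying the smoothing constant to both pseudo-counts is also an assumption, and with a constant of 1 it reduces to the formula on the slide.

```python
def smoothed_probability(n_label, n_total, prior_label, smoothing=1.0):
    """(N_label + prior pseudo-count) / (N + both prior pseudo-counts)."""
    n_label_prior = smoothing * prior_label
    n_other_prior = smoothing * (1.0 - prior_label)
    return (n_label + n_label_prior) / (n_total + n_label_prior + n_other_prior)


# (Count(label=0), Total count) per domain, taken from the counts table.
counts = {
    "news.rambler.ru": (41268, 95859),
    "auto.ru": (26809, 31040),
    "mercedes-benz.ru": (1120, 1386),
}
p_label0 = {
    domain: round(smoothed_probability(n0, n, prior_label=0.5), 2)
    for domain, (n0, n) in counts.items()
}
# p_label0 == {"news.rambler.ru": 0.43, "auto.ru": 0.86, "mercedes-benz.ru": 0.81}
```

With counts this large the pseudo-counts barely move the estimate; they matter for rare domains seen only a handful of times.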
Compute data representation for a single user
P(label=0 | domain = "news.rambler.ru") = 0.43
P(label=0 | domain = "auto.ru") = 0.86
P(label=0 | domain = "mercedes-benz.ru") = 0.81
N = 10 (number of bins)
Bins = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Histogram = [0, 0, 0, 0, 1, 0, 0, 0, 2, 0]
Interpretation: the concentration of nonzero elements in the histogram represents an estimate of P(label | user)
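The aggregation step can be reproduced in a few lines. A minimal sketch, assuming equal-width bins over [0, 1] with the value 1.0 placed in the last bin; the function name is hypothetical.

```python
def probability_histogram(probabilities, n_bins=10):
    """Count how many per-domain probabilities fall into each equal-width bin."""
    histogram = [0] * n_bins
    for p in probabilities:
        # Clamp so that p == 1.0 lands in the last bin.
        histogram[min(int(p * n_bins), n_bins - 1)] += 1
    return histogram


# Bob's three domains from the slide.
user_features = probability_histogram([0.43, 0.86, 0.81])
# user_features == [0, 0, 0, 0, 1, 0, 0, 0, 2, 0]
```

The vector length depends only on the number of bins, not on how many distinct domains exist.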
Advantages of the algorithm
• Scalable (add new features, recompute probabilities)
• Adaptive (suits binary and multiclass classification as well as regression)
• Efficient for gradient boosted decision trees due to low cardinality
• Features can be computed in a distributed manner (MapReduce)
• The counts table can be stored with a count-min sketch
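A count-min sketch stores approximate counts in fixed memory and never underestimates. The version below is a minimal sketch, not the pipeline's implementation: the width, depth and the SHA-256 based row hashing are illustrative assumptions.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter; estimates are >= the true count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One bucket per row, derived from a row-salted hash of the key.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.rows[row][col] += count

    def estimate(self, key):
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(self.rows[row][col] for row, col in self._buckets(key))


sketch = CountMinSketch()
sketch.add("news.rambler.ru", 95859)
sketch.add("auto.ru", 31040)
sketch.add("mercedes-benz.ru", 1386)
approx = sketch.estimate("auto.ru")
```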
Compute models pivots task
Approach to approximate label thresholds
Problems:
• How to select thresholds for labels?
• How to do it computationally fast?
Desired solution:
• Heuristics to approximate label thresholds
Precision vs. threshold for binary classification:
which probability threshold optimizes the given quality metric?
Algorithm:
• Take a sample of the apply data (we use 5%, about 15 million samples)
• Compute a histogram of predicted probabilities for this sample
• Use the Nth percentile as the estimate of the label threshold
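The histogram-based percentile estimate can be sketched as follows. The bin count of 100, the function names, and the choice to return the upper edge of the bin where the cumulative share reaches the percentile are assumptions; the synthetic sample stands in for the 5% sample of predictions.

```python
def histogram(probabilities, n_bins=100):
    """Equal-width histogram of predicted probabilities over [0, 1]."""
    counts = [0] * n_bins
    for p in probabilities:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


def threshold_from_histogram(counts, percentile):
    """Upper edge of the first bin where the cumulative share reaches `percentile`."""
    target = sum(counts) * percentile / 100.0
    cumulative = 0
    for i, c in enumerate(counts):
        cumulative += c
        if cumulative >= target:
            return (i + 1) / len(counts)
    return 1.0


# Synthetic scores spread evenly over [0, 1), ten per bin.
sample = [(i + 0.5) / 1000 for i in range(1000)]
counts = histogram(sample)
threshold = threshold_from_histogram(counts, percentile=95)
```

Only the histogram has to be kept per model, so the threshold can be approximated without sorting all predictions.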
Apply model task
Task interval: every hour
Number of models per run: 200
General problem:
• Some models are applied more often than others
Priority scheme for applying models
1) Request all models
2) Filter out not yet trained models
3) Sort by date of adding a model (descending)
4) Sort by date of last apply (ascending)
5) Take N top priority models
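The five steps map onto two stable sorts. In this sketch the record fields and dates are hypothetical, and applying the sorts in the listed order (so that the date of last apply becomes the primary key and the date of adding breaks ties) is an assumption about how steps 3 and 4 combine.

```python
from datetime import date


def pick_models(models, n):
    """Return the names of the N top-priority trained models."""
    trained = [m for m in models if m["trained"]]      # step 2
    trained.sort(key=lambda m: m["added"], reverse=True)  # step 3: newest first
    trained.sort(key=lambda m: m["last_applied"])      # step 4: stalest first
    return [m["name"] for m in trained[:n]]            # step 5


models = [  # step 1: all models (toy data)
    {"name": "a", "trained": True, "added": date(2019, 1, 1), "last_applied": date(2019, 3, 2)},
    {"name": "b", "trained": True, "added": date(2019, 2, 1), "last_applied": date(2019, 3, 1)},
    {"name": "c", "trained": False, "added": date(2019, 3, 1), "last_applied": date(2019, 3, 1)},
    {"name": "d", "trained": True, "added": date(2019, 2, 15), "last_applied": date(2019, 3, 1)},
]
selected = pick_models(models, n=2)
```

Because Python's sort is stable, models with the same last-apply date keep the newest-first order from the previous sort.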
Key notes:
1) Think about a scalable approach
2) Implement monitoring of pipeline performance
3) Run experiments
In lieu of conclusion...
Pipeline automation...is full of fun
Questions? Contact me!
https://www.facebook.com/ieboytsov
i.boytsov@rambler-co.ru
AutoML for user segmentation: how to match millions of users with hundreds of segments every day

Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 

Kürzlich hochgeladen (20)

Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 

AutoML for user segmentation: how to match millions of users with hundreds of segments every day

  • 1. AutoML for user segmentation Ilya Boytsov Rambler&Co
  • 2. Rambler&Co - largest media holding in Russia
  • 3. About Rambler&Co AdTech Projects: • Data management platform (user segmentation) • Recommender systems • “Lumiere” (forecasting offline cinema traffic) • Computer vision
  • 5. In this talk: • DMP and user segmentation tasks explained • Key structures of AutoML pipeline for user segmentation • Problems we faced while maintaining pipeline • Feature engineering for machine learning at scale • Optimization of pipeline tasks
  • 6. Data management platform (DMP): a powerful AdTech solution • Collect user behavior data from various sources • Integrate data to create a complete customer view • Store and manage audience segments • Target audience segments in online ad campaigns
  • 7. Types of data: 1st-party data (raw event logs, e.g. visited websites); 2nd-party data (customer journey data); 3rd-party data (data collected from partners). Sources of data: media resources; products and services; data from ad campaigns, behavioral factors; other sources.
  • 8. DMP AutoML pipeline: a solution for any user segmentation task. About 1,000 models are fitted daily. Every model is applied to 300 million test samples daily. ML problems: • binary/multiclass classification • look-alike, treated as binary classification (segment vs. random)
  • 9. Examples: look-alike modelling boosts CTR (chart comparing Retargeting_973 and Retargeting_1069, each with its look-alike segments 0.0-1%, 1.0-5.0% and 5.0-10.0%)
  • 10. General principles of DMP AutoML: • All models share a similar structure of fit and apply stages • Adding models and operating them must be possible through a web interface • ML developers should not be needed to support key routine operations
  • 11. Felix: back office and web interface for the AutoML pipeline. Create new models, add new segments, visualize model performance, and much more
  • 13. AutoML pipeline daily workflow (diagram). Components and tasks: Felix; compute features; create train table; train models; compute pivots; load pivots; apply and slice predictions; compute metrics; load models
  • 14. Workflow manager: Apache Airflow • Runs a series of tasks as a DAG (directed acyclic graph) • Expresses task dependencies • Handles failures
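To make the idea of "tasks as a DAG" concrete, here is a minimal sketch in plain Python (deliberately not Airflow code). The task names follow the daily workflow slide, but the dependency edges between them are an assumption made for illustration; the talk does not spell them out.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: task -> set of tasks it depends on.
# Task names come from the workflow slide; the edges are assumed.
dependencies = {
    "compute_features": set(),
    "create_train_table": {"compute_features"},
    "train_models": {"create_train_table"},
    "load_models": {"train_models"},
    "compute_pivots": {"load_models"},
    "load_pivots": {"compute_pivots"},
    "apply_and_slice_predictions": {"load_pivots"},
    "compute_metrics": {"apply_and_slice_predictions"},
}


def execution_order(deps):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A workflow manager adds scheduling, retries and failure handling on top of this ordering, which is the part the pipeline delegates to Airflow.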
  • 16. Train and apply DAGs. Train DAG interval: every 4 hours. Apply DAG interval: every hour. Problem: some target segments (labels) finish computing later than others. Solution: while some models wait for their target segments, other models keep training
  • 18. Key problems we faced • Data collection delay • Out-of-memory issues • High-cardinality feature matrices • Mapping predictions to label thresholds takes too long • Some models are applied more often than others
  • 20. Data collection delay: do not wait too long • Use an Airflow sensor to wait for up to MAX_FEATURE_DELAY • If the delay is exceeded, fill the missing parts of the features table with the last computed day
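The wait-then-fall-back rule can be sketched as a small decision function. This is an illustration only: the value of `MAX_FEATURE_DELAY`, the function name `resolve_feature_partition` and the day-level granularity are assumptions, and the waiting itself would be done by the Airflow sensor rather than by this code.

```python
from datetime import date, timedelta

MAX_FEATURE_DELAY = timedelta(hours=6)  # hypothetical value; not stated in the talk


def resolve_feature_partition(target_day, available_days, waited):
    """Decide which day's features to use for target_day.

    Returns (day, action) where action is "use", "wait" or "fallback".
    """
    if target_day in available_days:
        return target_day, "use"
    if waited < MAX_FEATURE_DELAY:
        return None, "wait"  # keep polling, as a sensor would
    earlier = [d for d in available_days if d < target_day]
    if not earlier:
        raise LookupError("no earlier features to fall back to")
    return max(earlier), "fallback"  # reuse the last computed day


days = {date(2019, 5, 1), date(2019, 5, 2)}
print(resolve_feature_partition(date(2019, 5, 3), days, timedelta(hours=7)))
```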
  • 21. Feature engineering (FE): overcoming high-cardinality feature matrices. Main rule: new features must be applicable to the majority of models. Key techniques: • Counting-based FE • Distance-based FE
  • 22. Feature matrix of shape (N, 10000):
        id     Feature_1    ...    Feature_10000
        1      42           ...    542
        ...    ...          ...    ...
        N      89           ...    0
  • 23. Distance-based FE: cluster distance. Algorithm: 1) Reduce the dimension of the feature matrix if needed (we use SVD) 2) Fit the KMeans clustering algorithm with K clusters on the given data 3) Calculate the distance from each sample point to the centroid of each of the K clusters 4) Use the K distances as the feature representation of the sample row
  • 24. Feature matrix of shape (N, K):
        id     dist_to_1st_cluster    ...    dist_to_Kth_cluster
        1      0.6757                 ...    0.0942
        ...    ...                    ...    ...
        N      0.342                  ...    0.6113
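Steps 3 and 4 of the cluster-distance algorithm can be sketched in a few lines. The centroids below are hypothetical stand-ins for the output of an already fitted KMeans (steps 1 and 2 are not shown), and Euclidean distance is an assumption, since the slides do not name the metric.

```python
import math


def cluster_distance_features(rows, centroids):
    """Map each row to its Euclidean distances to the K centroids."""
    return [[math.dist(row, c) for c in centroids] for row in rows]


# Hypothetical centroids standing in for a fitted KMeans with K = 2.
centroids = [(0.0, 0.0), (3.0, 4.0)]
rows = [(0.0, 0.0), (3.0, 0.0)]

features = cluster_distance_features(rows, centroids)
print(features)  # one row per sample, one column per cluster: shape (N, K)
```

Whatever the width of the original matrix, the resulting representation has only K columns, which is the point of the technique.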
  • 25. Problem: fitting KMeans to convergence and computing distances separately for every model can take a long time
  • 26. Solution: "Global" cluster distance feature • Fit KMeans only once, on a representative unlabeled sample, to extract general information, and reuse it for all models
  • 27. Experimental results: • Replacing the distance features fitted individually per model with the "Global" feature does not harm model quality • Combining both feature representations improves ROC AUC by about 1%
  • 28. Counting-based FE. Traditional approaches: 1) Feature hashing 2) One-hot encoding. Example input:
        User    Domain              Count
        Bob     news.rambler.ru     5
        Bob     auto.ru             11
        Bob     mercedes-benz.ru    15
  • 29. Counting-based FE. Traditional approaches: 1) Feature hashing 2) One-hot encoding. General problems: 1) High cardinality 2) Efficient only with linear models
  • 30. Counting-based FE: DRACULA (Distributed Robust Algorithm for Count-based Learning). Source: http://www.slideshare.net/SessionsEvents/misha-bilenko-principal-researcher-microsoft
  • 31. Algorithm: 1) Compute the counts table from all train data 2) Compute P(label | feature) for every unique feature 3) Aggregate the list of probabilities to get a low-cardinality data representation
  • 32. Counts of visited domains for a single user:
        User    Domain              Count
        Bob     news.rambler.ru     5
        Bob     auto.ru             11
        Bob     mercedes-benz.ru    15
      Total counts in train data:
        Domain              Count
        news.rambler.ru     95859
        auto.ru             31040
        mercedes-benz.ru    1386
  • 33. Counts table:
        Domain              Total count    Count(label=0)    Count(label=1)
        news.rambler.ru     95859          41268             54591
        auto.ru             31040          26809             4231
        mercedes-benz.ru    1386           1120              266
        Total               128285         69197             59088
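  • 34. General formula: probability of a label given a domain. $N_{\text{label}}$ = domain frequency with the given label; $N^{\text{pos}}_{\text{prior}}$ = smoothing constant × prior class probability; $N^{\text{neg}}_{\text{prior}} = 1 - N^{\text{pos}}_{\text{prior}}$; $N = N_{\text{pos}} + N_{\text{neg}}$. Then $$P(\text{label} \mid \text{domain}) = \frac{N_{\text{label}} + N^{\text{pos}}_{\text{prior}}}{N + N^{\text{pos}}_{\text{prior}} + N^{\text{neg}}_{\text{prior}}}$$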
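As a rough illustration of the formula, the sketch below applies additive smoothing to the per-domain counts from the counts-table slide. The parameterization is an assumption: `smoothing` is treated as the total pseudo-count and `prior0` as the prior probability of label 0, with hypothetical values 1.0 and 0.5, since the talk does not state the constants it used.

```python
# Per-domain counts from the counts-table slide: (count with label=0, count with label=1).
COUNTS = {
    "news.rambler.ru": (41268, 54591),
    "auto.ru": (26809, 4231),
    "mercedes-benz.ru": (1120, 266),
}


def smoothed_probability(n_label, n_total, prior_label, smoothing=1.0):
    """P(label | domain) pulled toward the class prior by `smoothing` pseudo-counts."""
    return (n_label + smoothing * prior_label) / (n_total + smoothing)


def p_label0(domain, prior0=0.5, smoothing=1.0):
    """Smoothed P(label=0 | domain); prior0 and smoothing are assumed values."""
    n0, n1 = COUNTS[domain]
    return smoothed_probability(n0, n0 + n1, prior0, smoothing)


for domain in COUNTS:
    print(domain, round(p_label0(domain), 2))
```

With counts this large the smoothing barely moves the estimate; it matters for rare domains, where a handful of visits would otherwise produce probabilities of exactly 0 or 1.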
  • 37. Compute the data representation for a single user: P(label=0 | domain = "news.rambler.ru") = 0.43; P(label=0 | domain = "auto.ru") = 0.86; P(label=0 | domain = "mercedes-benz.ru") = 0.81. Number of bins N = 10; Bins = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]; Histogram = [0, 0, 0, 0, 1, 0, 0, 0, 2, 0]. Interpretation: the bins where the nonzero elements of the histogram concentrate give an estimate of P(label | user)
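The aggregation step can be sketched as a fixed-width histogram over the per-domain probabilities. The function name and the choice to count each domain once (rather than weighting by visit count) are assumptions that match the numbers on the slide.

```python
def probability_histogram(probabilities, n_bins=10):
    """Count how many probabilities fall into each of n_bins equal-width bins on [0, 1]."""
    hist = [0] * n_bins
    for p in probabilities:
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        hist[index] += 1
    return hist


# P(label=0 | domain) for the three domains Bob visited.
user_probs = [0.43, 0.86, 0.81]
histogram = probability_histogram(user_probs)
print(histogram)  # [0, 0, 0, 0, 1, 0, 0, 0, 2, 0]
```

However many distinct domains a user visited, the representation always has `n_bins` columns, which is what keeps cardinality low for tree-based models.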
  • 38. Advantages of the algorithm • Scalable (add new features, recompute probabilities) • Adaptive (suits binary and multiclass classification as well as regression) • Efficient for gradient-boosted decision trees due to low cardinality • Features can be computed in a distributed manner (MapReduce) • The counts table can be stored in a count-min sketch
  • 39. Compute model pivots task: an approach to approximating label thresholds. Problems: • How to select thresholds for labels? • How to do it computationally fast? Desired solution: • Heuristics to approximate label thresholds
  • 40. Compute model pivots task: an approach to approximating label thresholds. Precision-threshold curve for binary classification: which probability threshold optimizes the given quality metric?
  • 41. Compute model pivots task: an approach to approximating label thresholds. Algorithm: • Take a sample of the apply data (we use 5%, about 15 million samples) • Compute the histogram of predicted probabilities for this sample • Use the Nth percentile as the estimate of the label threshold
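The sampling heuristic can be sketched as follows. Note the simplification: instead of building a histogram, this sketch sorts the sample and reads the percentile off directly, which gives the same kind of estimate on small data. The function name, the `top_fraction` parameter and the fixed seed are assumptions for illustration.

```python
import random


def estimate_threshold(scores, top_fraction, sample_rate=0.05, seed=0):
    """Estimate the score cut-off that keeps roughly the top `top_fraction` of users.

    Only a random sample of the scores is inspected, so the result is approximate.
    """
    rng = random.Random(seed)
    sample = [s for s in scores if rng.random() < sample_rate]
    if not sample:
        raise ValueError("sample is empty; increase sample_rate")
    sample.sort()
    index = min(int(len(sample) * (1.0 - top_fraction)), len(sample) - 1)
    return sample[index]


# Synthetic, uniformly spread scores standing in for model predictions.
scores = [i / 100000 for i in range(100000)]
threshold = estimate_threshold(scores, top_fraction=0.01)
share_kept = sum(s >= threshold for s in scores) / len(scores)
print(threshold, share_kept)
```

The same threshold computed on the sample is then applied to all predictions, so the expensive full sort over hundreds of millions of rows is avoided.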
  • 42. Apply model task. Task interval: every hour. Number of models per run: 200. General problem: • Some models are applied more often than others
  • 43. Priority scheme for applying models: 1) Request all models 2) Filter out models that are not yet trained 3) Sort by the date the model was added (descending) 4) Sort by the date of the last apply (ascending) 5) Take the N top-priority models
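One way to read the priority steps is as two consecutive stable sorts, so that the last apply date becomes the primary key and the add date breaks ties. The sketch below follows that reading; the record fields and the rule that never-applied models go first are assumptions, not details confirmed by the talk.

```python
from datetime import datetime


def pick_models_to_apply(models, n):
    """Return the names of the n highest-priority trained models.

    Each model is a dict with keys: name, trained, added_at, last_applied_at.
    """
    trained = [m for m in models if m["trained"]]
    # Stable sorts: the second sort is the primary key, the first breaks ties.
    trained.sort(key=lambda m: m["added_at"], reverse=True)
    # Assumption: models that were never applied (None) are served first.
    trained.sort(key=lambda m: m["last_applied_at"] or datetime.min)
    return [m["name"] for m in trained[:n]]


models = [
    {"name": "a", "trained": True, "added_at": datetime(2019, 1, 1),
     "last_applied_at": datetime(2019, 5, 1, 10)},
    {"name": "b", "trained": True, "added_at": datetime(2019, 3, 1),
     "last_applied_at": None},
    {"name": "c", "trained": False, "added_at": datetime(2019, 4, 1),
     "last_applied_at": None},
    {"name": "d", "trained": True, "added_at": datetime(2019, 4, 1),
     "last_applied_at": None},
    {"name": "e", "trained": True, "added_at": datetime(2019, 2, 1),
     "last_applied_at": datetime(2019, 5, 1, 9)},
]

selected = pick_models_to_apply(models, 3)
print(selected)  # ['d', 'b', 'e']
```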
  • 47. Key notes: 1) Think of a scalable approach 2) Implement monitoring of pipeline performance 3) Run experiments
  • 48. In lieu of conclusion... Pipeline automation...is full of fun