Automating ML Research for Increased Predictability

Towards increasing predictability
of machine-learning research
Artem Vorozhtsov

System for displaying ads
on Yandex’s search result pages
and partner’s websites
Ad Targeting Group
Automation of Machine Learning Research
Research with profit
Introduction

R&D best practices
— Modularity
— Computational Measurability
— Transparency and Sharing
— Automation

— Modularity
Units, reuse, abstraction
R&D best practices

— Computational Measurability
Metrics Driven Development
R&D best practices

— Transparency and Sharing
Collaboration and reproducibility
R&D best practices

— Automation
…
R&D best practices

R&D best practices
— Modularity: units, reuse, abstractions
— Computational Measurability: MDD
— Transparency and Sharing: collaboration
— Automation: …

Happy life principles
— Kindness
— Wholeheartedness
— Love
— Discipline
— Self-development

Happy life principles
— Kindness
— Wholeheartedness
— Love
— Discipline
— Self-development
This is a list of global things,
not local (everyday) rules

— Where does automation stop?
— Story of automating
— Everyday rules
— Questions
Plan

Automation –
is the use of machines, control systems
and information technologies to reduce
the need for human work to optimize productivity
in the production of goods and services.

Automation –
is the use of information technologies
to optimize productivity and to increase
predictability in the research, development
and other projects.

Complex KPIs
Where does automation stop?

KPI stands for Key Performance Indicators:
— Money, Clicks on Ads
— Comparison with rivals (# of segments we are better)
— Number of Nobel Prices
— Users & Government Loyalty
— Logliklihood of prediction

— Strategy Thinking, Complex KPIs
— Research

.. where real research starts
Intuition
Research
Creativity
ScienceTools
Complex Maths
Automated pipelines
PDEs
Metrics
Validators
SVMPCA

1. Imagine how simple and agile research
work could be.
1. Believe it is possible, automate the most
and find the place for research.
Recipe

Task:
Ad click probability prediction
(binary classification problem)
KPI:
Profit, Clicks, Conversions, Loglikelihood
Yandex LLC
Story of automation

Story of automation
Classifier
(matrixnet)
filtersfiltersfiltersfilters
filtersfiltersfiltersreducers
filtersfiltersfiltersmetrics
GnuPlot
filterssimulators
MapReduce
STORAGE
clipart from http://www.stoneys.ch

Story of automation
Classifier
(matrixnet)
GnuPlot
filterssimulators
MapReduce
STORAGE
clipart from http://clipartov.net

Story of automation
Classifier
(TMVA, …)
filters
filters
filters
filters
filters
filters
filters
reducers
filters
filters
filters
metrics
GnuPlot
filters
simulators
MapReduce
STORAGE
ML Infrastructure
Report
Idea

Pipeline (no automation)
— Prepare raw data set for ML
— Apply filters (cuts) and mappers
— Calculate features
— Assign weights
— Split to train and test
— Train classifier at training set
— Look at learn curve and check for overfitting
— Apply resulted classifier model to testing set
— Calculate metrics and compare with current best
Story of automation

Pipeline (no automation)
— Prepare raw data set for ML
— Apply filters (cuts) and mappers (add new filter)
— Calculate features (add new feature)
— Assign weights (new idea for weighting)
— Split to train and test
— Train classifier at training set (new train options)
— Look at learn curve and check for overfitting
— Apply resulted classifier model to testing set
— Calculate metrics and compare with current best
Story of automation

— Create and commit YAML file
— Read the report
Story of automation
Engine: “matrixnet” # options: VW, TMVA (TODO!)
Mappers: |
[
Join(„PLACE FOR NEW FEATURES‟),
Grep(„r.Age > 10 and PLACE FOR GREP IDEA'),
Mapper(„r.Weight = PLACE FOR WEIGHT IDEA‟),
yabs.matrixnet.factor.DefaultFactors(),
]
MailTo: ml-reports@yandex-team.ru
Options: „PLACE FOR NEW OPTIONS‟
Tables: „EFHFactors:last_14_days‟
Pipeline (with automation)

Story of automation
Classifier
(TMVA, …)
filters
filters
filters
filters
filters
filters
filters
reducers
filters
filters
filters
metrics
GnuPlot
filters
simulators
MapReduce
STORAGE
ML Infrastructure
Report
YAML-file

Story of automation
metric | learn | test | test cur.
---------------------------------------
ll_p | 0.38171 | 0.36074 | 0.14527
ll_r | 0.38966 | 0.37151 | 0.33247
f1_p | 0.44869 | 0.44430 | 0.43266
fom_p | 0.91526 | 0.90580 | 0.88528
kl_p | 0.31143 | 0.29581 | 0.13186
log_loss | 0.39965 | 0.40354 | 0.44178
mcc_p | 0.30788 | 0.30159 | 0.28512
q10_p | 2.6632 | 2.5994 | 2.5261
q2_p | 1.6315 | 1.6212 | 1.5886
q_p | 1.6244 | 1.6089 | 1.5777
Report

Story of automation
ML Infrastructure
Classifier
(TMVA, …)
GnuPlot
filters
simulator
s
MapReduce
STORAGE
Production
Report (Money, Clicks)
Experiment (1%)
Deploy new model
YAML-file
Report (llp)
Report (Money, Clicks)

Challenges (scientific)
— Multi-armed bandit problem
• Banner is black box with estimated CTR
• Historical data is used for prediction
— Default model bias
• Training set is generated by default model
— Move from KPIs to metrics and cost functions
• Business Strategy  (approx) metrics
— Balancing between different cost functions
• Clicks, Money, Conversions, CPA

Challenge (automation):
Graphical Pipelines Framework
Simulation
data
Experimental
data
map
train
Cut by
threshold
Show mass
distribution
Filter
background
Estimate mixture
parameters
classifymap
Run

Automation for me is:
— Tools
What is Automation?

— Tools (in TMVA)
What is Automation?
Normalization
Rectangular Cuts
SVM
Boosted Trees
Gaussianisation
PCA
PDE
Decorrelation
Genetic Algorithms

— Tools
• Macro language (high level language)
for expressing ideas
What is Automation?
Simulation
data
Experimental
data
map
train
Filter by
threshold
Show mass
distribution
Filter
background
Estimate mixture
parameters
classifymap

— Tools
— Infrastructure
• Connecting with arrows
• Whole pipeline coverage
What is Automation?

— Tools
— Infrastructure
— Specialization
• Collaboration and delegation
What is Automation?

—…
— Specialization
What is Automation?
classifier
train set model
parameters

Parameters
What is Automation?
Comp. Complexity
Model
ProperDefective
Cost Function
Learning rate
Tree depth
Regularization
Features TypesNumber of trees

— Tools
• Macro language (high level language) for
expressing ideas
— Infrastructure
— Specialization
What is Automation?

(1) Copy and paste data
— Add new boxes to automated pipeline
— Automate transport between all boxes
— Do not use strange software
Everyday rules: anti-patterns

(2) Execute data pipeline steps manually in a cycle.
— Define new command for this pipeline
— Use standard formats for data streams
— Define needed ‘mappers’ and ‘reducers’ for data
stream and use them

(3) Your code is >3 times longer than natural language
description
— Start working on new tools (macro languages, DSL)

(4) It takes >1 man-hour to recalculate final graph of
your research
— Automate the whole pipeline

(5) You write line of code that has no chance of being
executed >10,000 times

Code (>10000 times) Interactive Data Analysis (once)
def pca(data, reduce_dims=0, corr=True,
normalise=False,subtract_mean=True):
data_mean = None
if subtract_mean:
data_mean = mean(data, axis=0)
data -= data_mean
transposed = transpose(data)
cov_matrix = corrcoef(transposed)
# Compute eigenvalues and sort into
# descending order
eigen_vals,eigen_vecs =
linalg.eig(cov_matrix)
indices = argsort(eigen_vals)
indices = indices[::-1]
eigen_vecs = eigen_vecs[:, indices]
eigen_vals = eigen_vals[indices]
data = filter(data, “RegionID = 213”)
data1, data2 = split_random(data)
data2ext = decorrelate(data1, data2,
fields = [“age”, “income”, …])
report = check_features(data2ext)
show_report(report)

Choose one action a time (A) or (B):
A. Interactive data analysis using high level tools
B. Coding: extending/improving tools library or
infrastructure. Delegate it?
There is no other options.

(6) Your colleagues think that you are doing something
useless
— Stop doing questionable things

(7) You have a dream, and it hasn’t came true yet
— Tell Yandex about your dream

Artem Vorozhtsov
Head of Ads Targeting Group
avorozhtsov@yandex-team.ru
Thank you!

Automating ML Research for Increased Predictability

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (13)

Ähnlich wie Automating ML Research for Increased Predictability

Ähnlich wie Automating ML Research for Increased Predictability (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Automating ML Research for Increased Predictability

Hinweis der Redaktion