An introductory course on building ML applications, with a primary focus on supervised learning. Covers the typical ML application cycle: problem formulation, data definition, offline modeling, and platform design. Also includes key tenets for building applications.
Note: This is an old slide deck. The content on building internal ML platforms is a bit outdated and slides on the model choices do not include deep learning models.
2. Outline
• Overview of ML
– What is ML ? Why ML ?
– Predictive modeling recap
– ML application lifecycle
• Problem formulation & data definition
• Offline modeling
• Building an internal ML platform
• Key tenets for real-world ML applications
3. What is ML?
3/28/18 3
A top bank uses a ML model for underwriting decisions built for a foreign customer segment on Indian customers with no changes.
HYPE! FEAR! MISUSE!
4. What is Machine Learning?
“Field of study that gives computers the ability to learn from data
without being explicitly programmed”.
- Arthur Samuel (1959)
5. What is Machine Learning?
“Field of study that gives computers the ability to learn without
being explicitly programmed”.
- Arthur Samuel (1959)
Main elements of ML
• Representing relevant information as concrete variables
• Collecting empirical observations on variables
• Algorithms to infer associations between variables
9. Why Machine Learning?
• Learn it when you can’t code it
– e.g. product similarity (fuzzy relationships & trade-offs)
• Learn it when you need to contextualize
– e.g., personalized product recommendations (fine-grained context)
• Learn it when you can’t track it over time
– e.g., seller fraud detection (input-output mapping changes dynamically)
• Learn it when you can’t scale it
– e.g., customer service ( complex task, large scale input, low latency)
• Learn it when you don’t understand it
– e.g., review aspect-sentiment mining (hidden structure to be detected)
10. Why not Machine Learning ?
• Low problem complexity
– e.g., classes being well separated in chosen representation
• Less interpretability
– true for complex models relative to simple rules
– when it drives critical decisions (e.g., health care)
• Lag time in case data has to be collected or available data is biased
– when there exists more holistic domain knowledge
• Expensive modeling effort
12. Given: An input object with some features X (covariates/independent variables)
Goal: Predict a new target/label attribute Y (response/dependent variable)
Note:
– Input & output attributes (X,Y) can be simple (e.g., numeric or categorical
values/vector) or have a complex structure (e.g., time-series, text sequences)
– Classification (Y is categorical), Regression (Y is numeric/ordinal)
Typical Predictive Modeling Problem
16. Advertising
Given: a user, search query and a
candidate product ad
Predict: expected click through rate
of user for the product ad
17. Many more applications !
• Advertising
• Product search and browse experience
• Forecasting product demand and supply
• Product pricing/competitor monitoring
• Understanding customer profiles and lifetime value
• Detecting seller and customer fraud
• Enriching product catalog & review content
• …
18. Given: An input object with some features X (covariates/independent variables)
Goal: Predict a new target/label attribute Y (response/dependent variable)
Note:
– Input & output attributes (X,Y) can be simple (e.g., numeric or categorical
values/vector) or have a complex structure (e.g., time-series, text sequences)
Typical Predictive Modeling Problem
19. • Training data with correct input-output pairs (X,Y)
• Data samples in both train/unseen data are generated in same way (i.i.d)
Supervised Learning: Key Assumptions
[Figure: example orders labeled Fraud / Not Fraud]
20. Supervised Learning
Training: Given training examples {(Xi,Yi)} where Xi is input features and Yi the
target variable, learn a model F to best fit the training data (i.e., Yi ≈ F(Xi) for all i)
21. Supervised Learning
Training: Given training examples {(Xi,Yi)} where Xi is input features and Yi the
target variable, learn a model F to best fit the training data (i.e., Yi ≈ F(Xi) for all i)
Prediction: Given a new sample X with unknown Y, predict it using F(X)
22. Training: Find a “good” model f from the training data!
Key Elements of a Supervised Learning Algorithm
23. Training: Find a “good” model f from the training data !
• What is an allowed “model” ?
– Member of a model class H, e.g., linear models
• What is “good” ?
– Accurate predictions on training data in terms of a loss function L, e.g.,
squared error (Y − F(X))²
• How do you “find” it ?
– Optimization algorithm A, e.g., gradient descent
Key Elements of a Supervised Learning Algorithm
24. Training: Apply algorithm A to find model from the class H that optimizes a
loss function L on the training data D
• H: model class, L: loss function, A: optimization algorithm
• Different choices lead to different models on same data D
Key Elements of a Supervised Learning Algorithm
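The three choices above (H, L, A) can be made concrete in a short sketch: a linear model class, squared-error loss, and batch gradient descent. The synthetic 1-D data below is purely an illustrative assumption, not from the deck.

```python
import numpy as np

# Model class H: linear models F(x) = w*x + b (1-D for simplicity).
# Loss L: mean squared error. Algorithm A: batch gradient descent.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
Y = 3.0 * X + 1.0 + rng.normal(0, 0.1, size=100)  # synthetic data (assumption)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * X + b
    grad_w = 2 * np.mean((pred - Y) * X)  # dL/dw for L = mean (Y - F(X))^2
    grad_b = 2 * np.mean(pred - Y)
    w -= lr * grad_w
    b -= lr * grad_b
# After convergence, (w, b) is close to the generating (3, 1)
```

Swapping any one of the three choices (e.g., absolute-error loss, or a tree model class) yields a different learned model on the same data D.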
38. Machine Learning Problem Definition
Business Problem: Optimize a decision process to improve business metrics
• Sub-optimal decisions due to missing information
• Solution strategy: predict missing information from available data using ML
39. Machine Learning Problem Definition
Business Problem: Optimize a decision process to improve business metrics
• Sub-optimal decisions due to missing information
• Solution strategy: predict missing information from available data using ML
Example: Reseller fraud
• Business objective: Limit fraud orders to increase #users served and reduce return
shipping expenses.
• Decision process: Add friction to orders via disabling cash on delivery (COD)
• Missing information relevant to the decision:
– Likelihood of the buyer reselling the products
– Likely return shipping costs
– Unserved demand for the product
40. Key elements of a ML Prediction Problem
• Instance definition
• Target variable to be predicted
• Input features
• Sources of training data
• Modeling metrics (Online/Offline, Primary/Secondary)
• Deployment constraints
41. Instance Definition
• Is it the right granularity from the business perspective?
• Is it feasible from the data collection perspective ?
42. Instance Definition
Multiple options
• a customer
• a purchase order spanning multiple products
• a single product order (quantity can be >1 though)
43. Instance Definition
Multiple options
• a customer
• a purchase order spanning multiple products
• a single product order (quantity can be >1 though) [preferred choice]
Why?
• Reselling behavior is also at a single product level
• COD presented per product not per entire purchase
• Blocking a customer across all their orders can even hurt
44. Target Variable to be Predicted
• Can we express the business metrics (approximately) in terms of the
prediction quality of the target?
• Will accurate predictions of the target improve the business metrics
substantially?
45. Potential Business Impact
• Will accurate predictions of the target improve the business metrics
substantially?
• Compute business metrics for each case
– Ideal scenario (perfect predictions on target)
– Dumb baseline (default predictions, e.g., majority class)
– Simple predictor with rules/domain knowledge
– Existing solution (if one exists)
– Likely scenario with a reasonable-effort ML model
• Assess effort vs. benefits
46. Target Variable to be Predicted
• Can we express the business metrics (approximately) in terms of the
prediction quality of the target?
• Will accurate predictions of the target improve the business metrics
substantially?
• What is the data collection effort ?
– manual labeling costs
• Is it possible to get high quality observations?
– uncertainty in the definition, noise or bias in labeling process
47. Target Variable
Multiple options
• Likelihood of buyer reselling the current order
• Number of unserved users because of the current order
• Expected return shipping expenses for the current order
48. Target Variable
Multiple options.
• Likelihood of buyer reselling the current order [compromise choice]
• Number of unserved users because of the current order
• Expected return shipping expenses for the current order
Why?
• Last two choices better in terms of business metrics, but data collection is
difficult
• First choice makes data collection easy (esp. as a binary label) and addresses
business metrics in a reasonable, but slightly suboptimal way
49. Input features
• Is the feature predictive of the target ?
• Are the features going to be available in production setting ?
– Need to define exact time windows for features based on aggregates
– Watch out for time lags in data availability
– Be wary of target leakages (esp. conditional expectations of target )
• How costly is it to compute/acquire the feature?
– Might be different in training/prediction settings
50. Input Features
Reselling vs. Non-reselling indicators
• High product discount
• High order quantity relative to other orders of same product
– Normalize by median/mean to get relative values
• More for some products/verticals
– Product/vertical id can be used
• Buyer being a business store in product category
– Buyer’s category purchase count
– Buyer being a business store
51. Input Features
Reselling vs. Non-reselling indicators
• High product discount [feasible]
• High order quantity relative to other orders of same product
– Normalize by median/mean to get relative values [with lag]
• More for some products/verticals
– Product/vertical id can be used [feasible]
• Buyer being a business store in product category
– Buyer’s category purchase count [with lag]
– Buyer being a business store [expensive join with external info]
52. Sources of Training Data
• Is the distribution of training data similar to production data?
– e.g., if production data evolves over time, can the “training data” be
adjusted accordingly ?
• Are there systemic biases (e.g., data filters) in training data?
– Adjust the scope of prediction process so that it matches with the
training data setting
53. Sources of Training Data
Historical order data
– input features are available, but target is missing
Target observations
– Manual labeling on a random subset after focused investigations on the
address and the customer purchase history.
– Improve labeling efficiency by filtering by order quantity and applying the same
filtering in production
54. Modeling Metrics
• Online metrics are measured on a live system
– Can be defined directly in terms of the key business metrics
– typically measured via A/B tests and these metrics are potentially influenced by a
number of factors (e.g., net revenue)
• Offline metrics are meant to be computed on retrospective “labeled” data
– typically measured during offline experimentation and more closely tied to
prediction quality (e.g., area under ROC curve)
• Primary metrics are ones that we are actively trying to optimize
– e.g., losses due to fraud
• Secondary metrics are ones that can serve as guardrails
– e.g., customer base size
55. Offline Modeling Metrics
• Does improvement in offline modeling metrics result in gains in
online business metrics ?
Model quality:
– A) Maximize coverage of fraud orders at certain level of certainty (>90%)
– B) Binary target: Four decision possibilities
• Maximize average payoff in terms of expected return costs given the
different possibilities
Return pay-offs      Predicted Fraud            Predicted Not Fraud
Actual Fraud         0                          − avg. return costs
Actual Not Fraud     − avg. lost-order costs    0
56. Deployment Constraints
• What are the application latency & hardware constraints ?
Computational constraints:
– Orders per sec, allowed latency for COD disable action
– Available processing power, memory
57. Problem Formulation Template
• Template(s) with questions
on all the key elements
– Listing of possible choices
– Reason for preferred
choice
• Populated for each project
by product manager + ML
expert
58. Exercise: ML Problem Definition
Good choices for target variable, features & other elements ?
– Predicting shipping time for an order
– Forecasting the demand for different products
– Determining the nature of a customer complaint
– Predicting customer preference for a product
61. Motivation
• Early detection & prevention of common data related errors
• Reproducibility
• Auditability
• Robustness to failure in data fetch pipelines
62. Data Elements of Interest
• Instance identifiers
• Target variables
• Input features
• Other factors useful for evaluating online/offline metrics
Fields to specify for each variable of interest
• ID, Name, Version
• Modeling role, Owner, Description, Tags
63. Definitions
Three possible copies of the same variable based on the stage
• Offline training, Offline evaluation, Deployment
Fields to specify for each variable of interest for each stage
• Precise definition (query + source for raw ones or formula for derived ones)
• Data type, value check conditions
• Units/Level sets
• Is it an aggregate? Exact aggregation set or time window
• Missing or invalid value indicators, reasons, mitigations (e.g., div by 0 for ratios)
• First creation date
• Known quality issues
64. Review Criteria
• Unambiguous definitions to allow for ready implementation
• Parity across different stages (training/evaluation/deployment)
– Definitions
– Data type, value checks, units, level sets
– Aggregation windows
– Missing/invalid value handling of derived variables
65. Review Criteria
• Is the input X to target Y map invariant across stages?
– Do definitions drift with time? (Use averages, not sums, in general)
• e.g., customer spend to date in books → order fraud status
– Do we have the correct feature snapshot of X for Y?
• e.g., customer loyalty category (from when?)
66. Review Criteria
• Common data leakages
– Unintentional peeking into future, target, or any kind of unobserved
variables
– Ambiguously specified aggregates, e.g., customer revenue till the
“most recent” order ; interpretation can be different in training
data and deployment settings because of delays in data logging
– Time-varying features for which only certain (or recent) snapshots
are available, e.g., marital status of the customer
67. Review Criteria
• Handling of invalid/missing values of raw variables
– Join errors in preprocessing
– Service failures in deployment
• Handling of known data quality issues
– Corruption of data for certain segments/time periods
68. Data Definition Template
• Template(s) with details of all the data elements and review questions
• Populated for each project by all the relevant stakeholders
69. Outline
• Overview of ML ecosystem
• Problem formulation & data definition
• Offline modeling
• Building an internal ML platform
• Key tenets for real-world ML applications
74. Data Collection & Integration
Abstract process: (specifics depend on data management infrastructure)
• Find where the data resides
– API, database/table names, external web sources
• Identify mappings between schemas of different sources
• Obtain the instance identifiers
• Perform queries (joins/aggregations) to obtain the
features/target
Data access/integration tools:
• SQL
• Hive, Pig, SparkSQL (for large joins)
• Scrapy, Nutch (web-crawling)
76. Data Exploration: Why?
• Data quality is really critical
– “Garbage in garbage out”
– Need to verify that data conforms to expectations (domain knowledge)
• ML modeling requires making a lot of choices!
• Better understanding of data
– Early detection and fixing of data quality issues
– Good preprocessing, feature engineering & modeling choices
77. Data Exploration: What to look for ?
• Size and schema
– #instances & #features,
– feature names & data types (numeric, ordinal, categorical, text)
• Univariate feature and target summaries
– prevalence of missing, outlier, or junk values
– distributional properties of features,
– skew in target distribution
• Bivariate target-feature dependencies
– distributional properties of features conditioned on the target
– feature-target correlation measures
• Temporal variations of features & targets
81. Feature-Target Dependencies
• Class histograms conditioned on feature value
Note: for small feature values, the fraud fraction is almost 7–10 times lower; for large values it is comparable or higher.
83. Data Sampling & Splitting
• Generalization to “unseen” data forms the core of ML
• Split “labeled” data into train and test datasets
– Train split is used to learn models
– Test split is a proxy for the “unseen” data in the deployment setting and is used to
evaluate model performance
Note: In the test split, target is known unlike data in deployment setting
84. Creating Train & Test Splits
• Random disjoint splits
– Randomly shuffle & then split into train and test sets (e.g., 80% train & 20% test)
• K-fold cross validation
– Partition into K subsets: Use K-1 to train & one for test
– Rotate among the different sets to create K different train-test splits
– More reliable avg. performance estimate, variance measures, statistical tests
– Leave-one-out: Extreme case of K-fold (each fold is a single instance)
• Out-of-time splits (data has a temporal dependence)
– Train & test splits obtained via time-based cutoff (from production constraints)
• Special case: Imbalanced classes (for certain algorithms, e.g., decision trees)
– Balance train split alone by down (or up) sampling majority (minority) class
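The K-fold scheme above can be sketched by hand (a minimal illustrative version; libraries such as scikit-learn provide equivalent utilities):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k folds; yield (train_idx, test_idx) pairs."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# Each of the k splits trains on k-1 folds and tests on the remaining one.
splits = list(kfold_indices(10, 5))
```

Averaging the test metric across the k splits gives the more reliable performance estimate mentioned above.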
86. Complex pipelines: Additional Splits
• In a simple scenario, the target labels are used only by the “learning
algorithm”
– Train and test splits suffice for this case
• Complex pipelines might have multiple elements that need a peek
at the target
– e.g., feature selection, meta-learning algorithms, output calibration, etc.
– Separate data splits for each element lead to better generalization
– Need to consider size of available labeled data as well
88. Data Preprocessing
• Special handling of text valued features
– Necessary to preserve the relevant “information”
– Appropriate handling of special characters, punctuation, spaces & markup
• Feature/row scaling (for numeric attributes)
– Necessary to avoid numerical computation issues, speedup convergence
– Columns:
• z-scoring: subtract mean, divide by std-deviation → mean=0, variance=1
• fixed range: subtract min, divide by range → 0 to 1 range
– Rows: L1 norm, L2 norm
• Imputing missing/outlier values
– Necessary to avoid incorrect estimation of model parameters
– Handling strategies depend on the semantics of “missing”
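The two column-scaling schemes above can be sketched as (toy values are an assumption):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Column z-scoring: subtract mean, divide by std-deviation -> mean 0, variance 1
z = (x - x.mean()) / x.std()

# Fixed-range scaling: subtract min, divide by range -> values in [0, 1]
r = (x - x.min()) / (x.max() - x.min())
```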
89. Handling Outliers & MissingValues
• Indication of suspect instance: discard the record
• Informative w.r.t. target: introduce a new indicator variable
• Missing at random
– Numeric: replace with mean/median or conditional mean (regression)
– Categorical: replace with mode or likely value conditioned on the rest
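The missing-at-random and informative-missingness strategies above can be sketched as (the toy array is an assumption):

```python
import numpy as np

x = np.array([4.0, np.nan, 6.0, np.nan, 8.0])

# Missing at random (numeric): replace with the observed median
median = np.nanmedian(x)
x_imputed = np.where(np.isnan(x), median, x)

# Informative missingness: keep an indicator variable alongside the feature
is_missing = np.isnan(x).astype(int)
```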
91. Feature Engineering
• Case 1: Raw features are not highly predictive of the target, esp. in
case of simple model classes (e.g., linear models)
• Solution: Feature extraction, i.e., construct new more predictive
features from raw ones to boost model performance
• Case 2: Too many features with few training instances →
“memorizing” or “overfitting” situation leading to poor generalization.
• Solution: Feature selection, i.e., drop non-informative features to
improve generalization.
92. Feature Extraction
• Basic conversions for linear models
– e.g., 1-Hot encoding, Sparse encoding of text
• Non-linear feature transformations for linear models
– Linear models are scalable, but not expressive → need non-linear features
– e.g. Binning, quadratic interactions
• Domain-specific transformations
– Well studied for their effectiveness on special data types such as text, images
– e.g., TF-IDF transformation, SIFT features
• Dimensionality reduction
– High dimensional features (e.g., text) can lead to “overfitting”, but retaining
only some dimensions may be sub-optimal
– Informative low dimensional approximation of the raw feature
– e.g., PCA, clustering
93. Basic Conversions: Categorical Features
• One-Hot Encoding
– Converts a categorical feature with K values into a binary vector of size K−1
– Just a representation to enable use in linear models
Product_vertical   isBook   isMobile
Handset            0        0
Book               1        0
Mobile             0        1
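A minimal sketch of the K−1 encoding in the table above; the helper name `one_hot_k_minus_1` and the choice of reference level are illustrative assumptions:

```python
def one_hot_k_minus_1(values, reference):
    """Encode a categorical feature with K values as K-1 binary columns,
    dropping the reference level (as in the Handset/Book/Mobile table)."""
    levels = [v for v in dict.fromkeys(values) if v != reference]
    return [[int(v == level) for level in levels] for v in values], levels

rows, cols = one_hot_k_minus_1(["Handset", "Book", "Mobile"], reference="Handset")
# cols -> ["Book", "Mobile"]; rows -> [[0, 0], [1, 0], [0, 1]]
```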
94. Basic Conversions: High Dimensional Text-like Features
Sparse Matrix Encoding
Text features:
• each feature value snippet is split into tokens (dimensions)
• Bag of tokens → a sparse vector of “counts” over the token vocabulary
• Single text feature → sparse matrix with #columns = vocabulary size
Other high dimensional features:
• similar process via map from raw features to a bag of dimensions
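A sketch of the bag-of-tokens encoding above; dense lists stand in for a sparse matrix to keep the example dependency-free, and the two snippets are assumptions:

```python
from collections import Counter

docs = ["great phone great price", "bad phone"]

# Build the token vocabulary, then represent each snippet as a bag of counts
vocab = sorted({tok for d in docs for tok in d.split()})
rows = [Counter(d.split()) for d in docs]
matrix = [[row.get(tok, 0) for tok in vocab] for row in rows]
# one row per snippet, #columns = vocabulary size
```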
95. Non-linearity: Numeric Features
• Non-linear functions of features or target (in regression)
– Log transformation, polynomial powers, Box-cox transforms
– Useful given additional knowledge on the feature-target relationship
• Binning
– Numeric feature → categorical one with #values = #bins
– Results in more weights (K-1 for K bins) in linear models instead of just
one weight for the raw numeric feature
– Useful when the feature-target relation is non-linear or non-monotonic
– Bins: equal ranges, equal #examples, maximize bin purity (e.g. entropy)
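A minimal binning sketch in the equal-range style described above; the feature values and cut-off edges are illustrative assumptions:

```python
import numpy as np

order_quantity = np.array([1, 2, 3, 10, 50, 200])

# Bin edges turn the numeric feature into a categorical one;
# np.digitize returns the bin index for each value.
edges = np.array([5, 100])           # 3 bins: <5, 5..99, >=100 (illustrative)
bins = np.digitize(order_quantity, edges)
# bins -> [0, 0, 0, 1, 1, 2]
```

A linear model then learns one weight per bin instead of a single weight for the raw feature.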
97. Interaction Features
• Required when the features’ influence on the target is not purely additive
– linear combinations of features won’t work
Example: Order with 50% discount on mobiles is much more likely to indicate
fraud than a simple combination of 50% discount or mobile order.
Common Interaction Features:
• Non-linear functions of two or more numeric features, e.g., products & ratios
• Cross-products of two or more categorical features
• Aggregates of numerical features corresponding to categorical feature values
• Tree-paths: use leaves from decision trees trained on a smaller sample
99. Numerical-Categorical Interaction Features
• Compute aggregates of a numeric feature corresponding to each value
of a categorical feature
• New interaction feature → numeric one obtained by replacing the
categorical feature value with the corresponding numeric aggregate
• e.g., brand_id → brand_avg_rating, brand_avg_return_cnt
• Especially useful for categorical features with high cardinality (>50)
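The brand_id → brand_avg_rating replacement above can be sketched as follows; the toy (brand, rating) orders are an assumption:

```python
from collections import defaultdict

# (brand_id, rating) pairs; replace brand_id with brand_avg_rating
orders = [("b1", 4.0), ("b1", 2.0), ("b2", 5.0)]

sums = defaultdict(float)
counts = defaultdict(int)
for brand, rating in orders:
    sums[brand] += rating
    counts[brand] += 1
brand_avg_rating = {b: sums[b] / counts[b] for b in sums}

# The categorical value is replaced by its numeric aggregate
features = [brand_avg_rating[b] for b, _ in orders]
```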
100. Tree Path Features
• Learn a decision tree on a small data sample with raw features
• Paths to the leaves are conjunctions constructed from conditions
on multiple raw features
• Highly informative with respect to the target.
101. Domain-Specific Transformations
Text Analytics and Natural Language Processing
• Stop-words removal/Stemming: Helps focus on semantics
• Removing high/low percentiles: Reduces features w/o loss in predictive power
• TF-IDF normalization: Corpus wide normalization of word frequency
• Frequent N-grams: Capture multi-word concepts
• Parts of speech/Ontology tagging: Focus on words with specific roles
Web Information Extraction
• Hyperlinks, Separating multiple fields of text (URL, in/out anchor text, title, body)
• Structural cues: XPaths/CSS; Visual cues: relative sizes/positions of elements
• Text style (italics/bold, font-size, colors)
Image Processing
• SIFT features, Edge extractors, Patch extractors
103. Feature Selection
Key Idea: Sometimes “Less (features) is more (predictive power)”
Motivating reasons:
• To improve generalization
• To meet prediction latency or model storage constraints (for some applications)
Broadly, three classes of methods:
• Filter or Univariate methods, e.g., information-gain filtering
• Wrapper methods, e.g., forward search
• Embedded methods, e.g., regularization
104. Feature Selection: Filter or Univariate Methods
• Goal: Find the “top” individually predictive features
– “predictive”: specified correlation metric
• Ranking by a univariate score
– Score features via an empirical statistical measure that corresponds to
predictive power w.r.t. target
– Those above a cut-off (count, percentile, score threshold) are retained
Note:
• Fast, but highly sub-optimal since features are evaluated in isolation
• Independent of the learning algorithm.
• e.g., Chi-squared test, information gain, correlation coefficient
105. Feature-Target Correlation
• Mutual information: Captures correlation between categorical
feature (X) and class label (Y)
• p(x,y): Fraction of examples with X=x and Y=y
• p(x), p(y): Fraction of examples with X=x (resp. Y=y)

I(X,Y) = \sum_{x \in \mathrm{sup}(X)} \sum_{y \in \mathrm{sup}(Y)} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
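The empirical mutual information defined above can be computed directly from joint counts; the toy (X, Y) pairs below are an assumption:

```python
import math
from collections import Counter

# Empirical mutual information between categorical feature X and label Y
pairs = [("high", 1), ("high", 1), ("low", 0), ("low", 0)]  # toy data (assumption)
n = len(pairs)
p_xy = {k: c / n for k, c in Counter(pairs).items()}
p_x = {k: c / n for k, c in Counter(x for x, _ in pairs).items()}
p_y = {k: c / n for k, c in Counter(y for _, y in pairs).items()}

mi = sum(p * math.log(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
# X perfectly determines Y here, so mi = log(2) nats (the entropy of Y)
```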
106. Feature-Target Correlation
• Pearson’s correlation coefficient: Captures linear relationship between
numeric feature (X) and target value (Y)
• X_i, Y_i: Value of X, Y in the ith instance
• \bar{X}, \bar{Y}: Mean of X, Y
• Covariance matrix: Captures correlations between every pair of features

\rho(X,Y) = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\left(\sum_i (X_i - \bar{X})^2\right)^{1/2} \left(\sum_i (Y_i - \bar{Y})^2\right)^{1/2}}
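Pearson's coefficient can be computed from its definition and cross-checked against numpy's built-in; the toy data is an assumption:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.0, 8.2])

# Pearson correlation from the definition: centered dot product over norms
xc, yc = x - x.mean(), y - y.mean()
rho = (xc @ yc) / (np.sqrt((xc ** 2).sum()) * np.sqrt((yc ** 2).sum()))

# np.corrcoef returns the full correlation matrix; rho is the off-diagonal entry
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
```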
107. Feature Selection: Wrapper Methods
• Goal: Find the “best” subset from all possible subsets of input features
– “best” : specified performance metric & specified learning algo
• Iterative search
– Start with an initial choice (e.g., entire set, random subset)
– Each stage: find a better choice from a pool of candidate subsets.
Note:
• Computationally very expensive
• e.g., Backward search/Recursive feature elimination, Forward search
108. Feature Selection: Embedded Methods
• Identify predictive features while the model is being created itself
• Penalty methods: learning objective has an additional penalty term
that pushes the learning algorithm to prefer simpler models
• Good trade-off in terms of the optimality & computational costs
• e.g., Regularization methods (LASSO, Elastic Net, Ridge Regression)
110. Training: Find a “good” model f from the training data !
Key Elements of a Supervised Learning Algorithm
111. Training: Find a “good” model f from the training data !
• What is an allowed “model” ?
– Member of a model class H, e.g., linear models
• What is “good” ?
– Accurate predictions on training data in terms of a loss function L, e.g.,
squared error (Y − F(X))²
• How do you “find” it ?
– Optimization algorithm A, e.g., gradient descent
Key Elements of a Supervised Learning Algorithm
112. Training: Apply algorithm A to find model from the class H that optimizes a
loss function L on the training data D
• H: model class, L: loss function, A: optimization algorithm
• Different choices lead to different models on same data D
Key Elements of a Supervised Learning Algorithm
113. Model Training (Recap)
Key elements in learning algorithms:
• Model class, e.g., linear models, decision trees, neural networks
• Loss function, e.g., logistic loss, hinge loss
• Optimization algorithm, e.g., gradient descent, & assoc. params
Lots of algorithms & hyper-parameters to choose from!
115. Model Choice: Classification
Primary factors:
High #data instances (> 10 MM)
• Linear models – online learning (SGD)
High #features/examples ratio (>1)
• Linear models: Aggressive (L1) regularization
• Linear models: Dimensionality reduction
• Naïve Bayes (homogeneous independent
features)
Need non-linear interactions
• Kernel methods (e.g., Gaussian SVM)
• Tree Ensembles (e.g., GBDT, RF)
• Deep learning methods (e.g., CNNs, RNNs)
116. Model Choice: Regression
Primary factors:
High #data instances (> 10 MM)
• Linear models – online learning (SGD)
High #features/examples ratio (>1)
• Linear models: Aggressive (L1) regularization
• Linear models: Dimensionality reduction
Need non-linear interactions
• Kernel methods (e.g., Gaussian SVR)
• Tree Ensembles (e.g., GBDT, RF)
• Deep learning methods (e.g., CNNs, RNNs)
117. Model Evaluation & Diagnostics
Model Evaluation:
• Train error: Estimate of the expressive power of the model/algorithm relative
to training data
• Test error: A more reliable estimate of likely performance on “unseen” data
Post evaluation: What is the right strategy to get a better model ?
• 1) Get more training data instances
• 2a) Get more features or construct more complex features
• 2b) Explore more complex models/algorithms
• 3a) Drop some features
• 3b) Explore simpler models/algorithm
118. Overfitting
• Overfitting: Model fits training data well (low training error) but does not
generalize well to unseen data (poor test error)
• Complex models with large #parameters capture not only good patterns (that
generalize), but also noisy ones
[Figure: overfitting – a complex model curve tracks noisy training points (actual vs. predicted) and shows high prediction error on unseen data]
119. Underfitting
• Underfitting: Model lacks the expressive power to capture target distribution
(poor training and test error)
• Simple linear model cannot capture target distribution
120. Bias & Variance
• Bias of algo: Difference between the actual target and the avg. estimated target
where averaging is done over models trained on different data samples
• Variance of algo: Variation in predictions of models trained on diff. data samples
121. Model Complexity: Bias & Variance
• Simple learning algos with small #params → high bias & low variance
– e.g., Linear models with few features
– Reduce bias by increasing model complexity (adding more features)
• Complex learning algos with large #params → low bias & high variance
– e.g., Linear models with sparse high-dimensional features, decision trees
– Reduce variance by increasing training data & decreasing model complexity
(feature selection)
3/28/18 121
122. Validation Curve
[Figure: validation curve – prediction performance vs. model complexity parameter, showing an overfitting region and the optimal choice]
Ideal choice: Match of complexity between the learning algorithm and the training data.
124. Common Evaluation Metrics
Standard evaluation metrics exist for each class of predictive learning scenarios
– Binary Classification
– Multi-class & Multi-label Classification
– Regression
– Ranking
• Loss function used in training objective is just one choice of evaluation metric
– Usually picked because the learning algorithm is readily available
– Might be a good, but not necessarily ideal choice from business perspective
• Critical to work backwards from business metrics to create more meaningful
metrics
126. Classification – Operational Point Evaluation Metrics
• For each score threshold, a confusion matrix for binary classification of P vs. N
• Precision = TP/(TP+FP): How correct are we on the ones we predicted P?
• Recall = TP/(TP+FN): What fraction of actual P’s did we predict correctly?
• True Positive Rate (TPR) = Recall
• False Positive Rate (FPR) = FP/(FP+TN): What fraction of N’s did we predict wrongly?
             Actual P   Actual N
Predicted P  TP         FP
Predicted N  FN         TN
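The operational-point metrics above follow directly from the confusion-matrix counts; the counts used here are illustrative assumptions:

```python
def precision_recall(tp, fp, fn, tn):
    """Operational-point metrics from the binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # = true positive rate (TPR)
    fpr = fp / (fp + tn)      # false positive rate
    return precision, recall, fpr

p, r, fpr = precision_recall(tp=80, fp=20, fn=40, tn=860)
```

Sweeping the score threshold changes (TP, FP, FN, TN) and traces out the trade-off curve on the next slide.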
[Figure: trade-off curve – % cumulative frauds (true positive rate) vs. % cumulative non-frauds (false positive rate), with a 90% TPR operational point marked]
ROC curve: Plot of TPR vs. FPR
AUC: Area under ROC curve
• Perfect classifier: AUC = 1
• Random classifier: AUC = 0.5
• Probability of ranking a random P above a random N
• Effective for comparing learning
algorithms across operational pts
Operational point:
• Maximize (TPR – FPR), F1-measure
• Other business driven choices
Receiver Operating Characteristic (ROC) Curve
129. • Binary Classification: Score threshold corresponds to operational point
• Application-specific bounds on Precision or Recall
– Maximize recall (or precision) with a lower bound on precision (or recall)
• Application-specific misclassification cost matrix
– Optimize overall misclassification cost (TP·C_TP + FP·C_FP + TN·C_TN + FN·C_FN)
– Reduces to standard misclassification error when C_TP = C_TN = 0 and C_FP = C_FN = 1
Classification: Picking an Operational Point
3/28/18 129
Actual'P Actual N
Predicted'P CTP CFP
Predicted'N CFN CTN
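A minimal sketch of cost-based threshold selection (toy data; the costs are illustrative and assume a missed positive, CFN, is five times as costly as a false alarm, CFP):

```python
# Pick the score threshold that minimizes total misclassification cost.
# Cost-matrix entries are illustrative: CTP = CTN = 0, CFP = 1, CFN = 5.
def total_cost(y_true, scores, thr, c_tp=0.0, c_fp=1.0, c_tn=0.0, c_fn=5.0):
    cost = 0.0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= thr else 0
        if pred == 1:
            cost += c_tp if y == 1 else c_fp
        else:
            cost += c_fn if y == 1 else c_tn
    return cost

y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.4, 0.7, 0.9]
# Candidate thresholds: the observed scores themselves.
best_thr = min(scores, key=lambda t: total_cost(y_true, scores, t))
```

With these toy numbers the chosen threshold trades one false alarm against the much costlier missed positives.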
130. Regression – Evaluation Metrics
• Metrics when regression is used for predicting target values
– Root Mean Square Error (RMSE): [ (1/N) Σi (Yi − F(Xi))² ]^(1/2)
– R²: How much better is the model compared to just picking the best constant?
R² = 1 − (Model Mean Squared Error / Variance)
– MAPE (Mean Absolute Percent Error): (1/N) Σi |Yi − F(Xi)| / |Yi|
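A minimal sketch of these three metrics in plain Python (toy data; MAPE here uses |Yi| in the denominator, the common convention):

```python
# The three regression metrics above, computed directly from their definitions.
import math

def rmse(y, f):
    return math.sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y))

def r_squared(y, f):
    mean_y = sum(y) / len(y)
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)
    variance = sum((yi - mean_y) ** 2 for yi in y) / len(y)
    return 1.0 - mse / variance  # 1 means perfect; 0 means no better than the mean

def mape(y, f):
    return sum(abs(yi - fi) / abs(yi) for yi, fi in zip(y, f)) / len(y)

y = [100.0, 200.0, 300.0]  # toy targets
f = [110.0, 190.0, 330.0]  # toy predictions
```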
131. Model Fine-tuning
• Many algorithms and hyper-parameters (e.g., learning rate) to choose from
– Infeasible to explore all choices
• Practical solution approach
– Narrow down a few suitable algorithms from metadata (size/attribute types)
– For each chosen algorithm, systematically explore hyper-parameter choices
• Alternate optimization
• Exhaustive grid search
• Bayesian optimization (e.g., Spearmint, MOE)
• Each exploration: learning a model on a train split & evaluating on a test split
• Preferred split mechanism: k-fold cross-validation
• Best hyper-parameter choices picked based on the test (cross-validation) error
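The grid-search-with-cross-validation step can be sketched with scikit-learn (synthetic data and an illustrative grid over the regularization strength C):

```python
# Exhaustive grid search with k-fold cross-validation (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)  # synthetic data
grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # hyper-parameter choices to explore

# Each grid point is evaluated by 5-fold cross-validation.
search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=5)
search.fit(X, y)

best = search.best_params_  # choice with the best cross-validation score
```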
132. Multiple stages of optimization
• Objective: Find f(.) to optimize some cost L(Yunseen, f(Xunseen))
• ML Methodology:
– (Step I) Model Learning: Determine good choices of f(.) that optimize LA(Ytrain, Xtrain) for different choices of hyper-parameters and algorithms
– (Step II) Hyper-parameter Fine-tuning: Among choices in Step I, pick the one that optimizes LB(Yeval-split, Xeval-split)
– (Step III) Operational choices (e.g., score thresholding, output calibration): For the choice in Step II, determine the operational choices so as to optimize LC(Yop-split, Xop-split)
• Note:
– Ideal choice is to have L = LA = LB = LC and the data splits to be i.i.d., but this is not always possible
• e.g., need L = max. recall for precision > 90%, but LA = logistic loss & LB = area under ROC
– Preferable to choose intermediate metrics that are “close” to the desired business metric and have robust off-the-shelf implementations
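As a sketch of Step III under the example constraint above (maximize recall subject to precision ≥ 0.9), with toy labels and scores:

```python
# Step III sketch: among candidate thresholds, maximize recall subject to a
# precision lower bound (0.9, as in the example). Data is illustrative.
def prec_rec(y_true, scores, thr):
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= thr)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= thr)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < thr)
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

y_true = [1, 1, 1, 0, 1, 0]
scores = [0.95, 0.9, 0.8, 0.75, 0.6, 0.4]

# Recall only grows as the threshold drops, so the lowest threshold that
# still meets the precision bound maximizes recall.
feasible = [t for t in scores if prec_rec(y_true, scores, t)[0] >= 0.9]
best_thr = min(feasible)
```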
139. Primary Challenges
• Fast error-proof productionization
• Scalability vs. flexibility trade-off
• Reusability & extensibility of modeling effort
• Management of offline modeling experiments
• Interactive monitoring of modeling experiments
140. Challenge: Road to Productionization
• Long, slow road to delivery for each new application; very little reuse across applications
• POC code → production code translation is highly error-prone
• Rigorous evaluation & debugging of actual production systems is unlikely, since these tasks are owned by dev-ops folks and data scientists don't understand production code
[Diagram: Product Manager hands app requirements & metrics to a Data Scientist, whose PoC modeling (R, Python) is translated by Software Engineers into production code operated by Dev Ops.]
141. Solution: Self-contained “Models”
Data scientists
• build application-specific configurations for data collection & modeling
• ship self-contained production “models” (i.e., “model”, configurations, library dependencies), say via Docker (not POC code!)
Software engineers
• build application-agnostic* production code & systems for automation of data collection, model scoring, re-training, evaluation, etc.
[Diagram: Data Scientist ships self-contained models; Software Engineers build application-agnostic production code; Dev Ops operate it.]
* Need to consider data scale and latency for scoring & retraining, which have some dependency on the application
142. Solution: Self-contained “Models”
• ML packages such as scikit-learn, spark-mllib, and Keras allow for easy serialization of the entire processing pipeline (i.e., preprocessing, feature engineering, scoring) along with the fitted parameters as a single “model” that can be exported for scoring.
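A minimal sketch of this serialization pattern with scikit-learn and joblib (synthetic data; the file location is illustrative):

```python
# Serialize a whole preprocessing + learning pipeline as one "model" artifact.
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)  # fits both the scaler and the classifier

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipe, path)        # ship this single file (e.g., inside Docker)
restored = joblib.load(path)   # the scoring side only needs this artifact
```

The scoring service never sees the POC code, only the serialized pipeline with its fitted parameters.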
143. Challenge: Scalability vs. Flexibility Trade-off
• Scalability requirements vary across applications
• Factors to consider
– Size of training data
– Frequency of retraining
– Rate of arrival of prediction instances and latency bounds (in case of
online predictions)
– Size of batch and frequency of scoring (in case of batch predictions)
• Data scientists prefer to train models on single machines where
possible
144. Solution: Support Multiple Choices
• Moderate scale for training & prediction
– train models on a single machine (in Python/R)
– export the model as-is to multiple machines with the same image and predict in parallel
• Moderate scale for training, but large prediction scale
– train models on a single machine (e.g., spark-mllib in Python/R)
– export the model to a different environment (e.g., Scala/Java) that allows more efficient parallelization
• Large scale for both training & prediction
– train models and predict on a cluster (e.g., via sparkit-learn, PySpark, or Scala)
145. Challenge: Reusability & Extensibility of Modeling Effort
• ML workflows are more than just the “models pipeline”
– e.g., data fetch/aggregation from multiple sources, evaluation across
multiple models, exploratory data analysis
• Offline modeling code (notebooks) tends to get dirty fast
– Mix of interactive analysis (specific to application) and processing of data
• Common approach to reuse
– limited use of libraries + cut & paste code
147. Example Workflow: Model Learning
[Diagram: a model-learning workflow. Inputs: consolidated data plus a learning config (data split/sampling, target, feature, model, HP-search, and eval configs). Stages: data splitter, target constructor, feature pipeline setup, model setup, HP search, predict & eval. Built on libraries of filters/splitters/samplers, transformers, and learning algorithms. Outputs: a model, eval metrics, and a report.]
148. Example Workflow: Model Evaluation
[Diagram: a model-evaluation workflow. Inputs: labeled data plus an evaluation config (pre-trained feature/model configs and eval config). Stages: feature pipeline setup, model setup, predict, eval. Built on libraries of transformers and learning algorithms. Outputs: eval metrics and reports.]
149. Example Workflow: Model Scoring
[Diagram: a model-scoring workflow. Inputs: unlabeled data plus a prediction config (pre-trained feature/model config). Stages: feature pipeline setup, model setup, predict. Output: predictions.]
150. Solution: Workflow Abstractions
• Each workflow is represented as a DAG over nodes
– DAGs can be encoded as YAML or JSON files
• Each node is a computational unit with the following elements
– name
– environment of execution (e.g., python/scala)
– actual function to be executed (via a link to an existing module, class, or method)
– inputs (with default choices) and outputs
– tags to aid discovery
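As an illustration, one node of such a DAG might be encoded in YAML like this (the schema and all names are hypothetical, not an actual platform format):

```yaml
# Hypothetical encoding of a single workflow-DAG node (illustrative schema).
name: train_test_splitter
environment: python
function: mylib.splitters.StratifiedSplitter.split   # link to existing code
inputs:
  data: consolidated_data        # output of an upstream node
  test_fraction: 0.2             # default choice, overridable per run
outputs:
  - train_split
  - test_split
tags: [data-split, sampling]     # aids discovery in the node repository
```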
151. Solution: Workflow Abstractions
• Wrapper libraries allow hooks to existing ML packages (sklearn, keras, etc.) via nodes
• Properly indexed repositories of workflow DAGs, nodes, and node-configurations allow discovery and reuse
• Editing tools for composing DAGs enable extensibility
153. Challenge: Management of Experiments
• Manual tracking of experimental results requires considerable
effort and is error-prone
• Low reproducibility and auditability of offline modeling
experiments
154. Solution: Automated Repositories of ML Entities
• Run: execution of a workflow
– consumes datasets and configurations as inputs and generates models,
reports and new datasets as outputs
– organizes all the inputs/outputs and intermediate results in an
appropriate directory structure
• Automatically updated versioned repositories
– workflow DAGs, nodes, configs
– runs, datasets, models, reports
• After each run, the repositories are automatically updated with the appropriate linkages between the different entities
156. Solution: Read-only monitoring
• Additional layer that allows workflow DAGs to be executed one
step at a time and outputs to be examined from an interactive tool
(e.g., Jupyter notebooks)
– run_node(), load_input(), load_output()
• Cloning of intermediate inputs & outputs on demand so that these
can be analyzed without affecting the original run
– Changes to the actual run have to be explicitly made via workflow
DAGs, configs
158. Key Tenets for Real-world ML applications
Design phase:
• Work backwards from the application use case
– ML problem formulation & evaluation metrics aligned with business goals
– Software stack/ML libraries based on scalability/latency/retraining needs
• Keep the ML problem formulation simple (but ensure validity)
– Understand assumptions/limitations of ML methods & apply them with care
– Should enable ease of development, testing, and maintenance
159. Key Tenets for Real-world ML applications
Modeling phase:
• Ensure data is of high quality
– Fix missing values, outliers, target leakages
• Narrow down modeling options based on data characteristics
– Learn about the relative effectiveness of various preprocessing, feature engineering,
and learning algorithms for different types of data.
• Be smart on the trade-off between feature engineering effort & model complexity
– The sweet spot depends on the problem complexity, available domain knowledge, and computational requirements
• Ensure offline evaluation is a good “proxy” for the “real unseen” data
evaluation
– Generate train/test splits similar to how it would be during deployment
160. Key Tenets for Real-world ML applications
Deployment phase:
• Establish train vs. production parity
– Checks on every possible component that could change
• Establish improvement in business metrics before scaling up
– A/B testing over random buckets of instances
• Trust the models, but always audit
– Insert safe-guards (automated monitoring) and manual audits
• View model building as a continuous process not a one-time effort
– Retrain periodically to handle data drifts & design for this need
Don’t adopt Machine Learning because of the hype!