An introductory course on building ML applications, with a primary focus on supervised learning. Covers the typical ML application cycle: problem formulation, data definition, offline modeling, and platform design. Also includes key tenets for building applications.
Note: This is an old slide deck. The content on building internal ML platforms is a bit outdated and slides on the model choices do not include deep learning models.
2. Outline
• Overview of ML
– What is ML ? Why ML ?
– Predictive modeling recap
– ML application lifecycle
• Problem formulation & data definition
• Offline modeling
• Building an internal ML platform
• Key tenets for real-world ML applications
3. What is ML?
3/28/18 3
A top bank uses a ML model for underwriting decisions built for a foreign customer segment on Indian customers with no changes.
HYPE! FEAR! MISUSE!
4. What is Machine Learning?
“Field of study that gives computers the ability to learn from data
without being explicitly programmed”.
- Arthur Samuel (1959)
5. What is Machine Learning?
“Field of study that gives computers the ability to learn without
being explicitly programmed”.
- Arthur Samuel (1959)
Main elements of ML
• Representing relevant information as concrete variables
• Collecting empirical observations on variables
• Algorithms to infer associations between variables
9. Why Machine Learning?
• Learn it when you can’t code it
– e.g. product similarity (fuzzy relationships & trade-offs)
• Learn it when you need to contextualize
– e.g., personalized product recommendations (fine-grained context)
• Learn it when you can’t track it over time
– e.g., seller fraud detection (input-output mapping changes dynamically)
• Learn it when you can’t scale it
– e.g., customer service ( complex task, large scale input, low latency)
• Learn it when you don’t understand it
– e.g., review aspect-sentiment mining (hidden structure to be detected)
10. Why not Machine Learning ?
• Low problem complexity
– e.g., classes being well separated in chosen representation
• Less interpretability
– true for complex models relative to simple rules
– when it drives critical decisions (e.g., health care)
• Lag time in case data has to be collected or available data is biased
– when there exists more holistic domain knowledge
• Expensive modeling effort
12. Given: An input object with some features X (covariates/independent variables)
Goal: Predict a new target/label attribute Y (response/dependent variable)
Note:
– Input & output attributes (X,Y) can be simple (e.g., numeric or categorical
values/vector) or have a complex structure (e.g., time-series, text sequences)
– Classification (Y is categorical), Regression (Y is numeric/ordinal)
Typical Predictive Modeling Problem
16. Advertising
Given: a user, search query and a
candidate product ad
Predict: expected click through rate
of user for the product ad
17. Many more applications !
• Advertising
• Product search and browse experience
• Forecasting product demand and supply
• Product pricing/competitor monitoring
• Understanding customer profiles and lifetime value
• Detecting seller and customer fraud
• Enriching product catalog & review content
• …
18. Given: An input object with some features X (covariates/independent variables)
Goal: Predict a new target/label attribute Y (response/dependent variable)
Note:
– Input & output attributes (X,Y) can be simple (e.g., numeric or categorical
values/vector) or have a complex structure (e.g., time-series, text sequences)
Typical Predictive Modeling Problem
19. • Training data with correct input-output pairs (X,Y)
• Data samples in both train/unseen data are generated in same way (i.i.d)
Supervised Learning: Key Assumptions
[Figure: example orders labeled Fraud / Not Fraud]
20. Supervised Learning
Training: Given training examples {(Xi,Yi)} where Xi is input features and Yi the
target variable, learn a model F to best fit the training data (i.e., Yi ≈ F(Xi) for all i)
21. Supervised Learning
Training: Given training examples {(Xi,Yi)} where Xi is input features and Yi the
target variable, learn a model F to best fit the training data (i.e., Yi ≈ F(Xi) for all i)
Prediction: Given a new sample X with unknown Y, predict it using F(X)
22. Training: Find a “good” model f from the training data!
Key Elements of a Supervised Learning Algorithm
23. Training: Find a “good” model f from the training data !
• What is an allowed “model” ?
– Member of a model class H, e.g., linear models
• What is “good” ?
– Accurate predictions on training data in terms of a loss function L, e.g.,
squared error (Y − F(X))²
• How do you “find” it ?
– Optimization algorithm A, e.g., gradient descent
Key Elements of a Supervised Learning Algorithm
24. Training: Apply algorithm A to find model from the class H that optimizes a
loss function L on the training data D
• H: model class, L: loss function, A: optimization algorithm
• Different choices lead to different models on same data D
Key Elements of a Supervised Learning Algorithm
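The three choices above (H, L, A) can be made concrete in a short sketch: a linear model class, squared-error loss, and batch gradient descent. The synthetic 1-D data below is purely an illustrative assumption, not from the deck.

```python
import numpy as np

# Model class H: linear models F(x) = w*x + b (1-D for simplicity).
# Loss L: mean squared error. Algorithm A: batch gradient descent.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
Y = 3.0 * X + 1.0 + rng.normal(0, 0.1, size=100)  # synthetic data (assumption)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * X + b
    grad_w = 2 * np.mean((pred - Y) * X)  # dL/dw for L = mean (Y - F(X))^2
    grad_b = 2 * np.mean(pred - Y)
    w -= lr * grad_w
    b -= lr * grad_b
# After convergence, (w, b) is close to the generating (3, 1)
```

Swapping any one of the three choices (e.g., absolute-error loss, or a tree model class) yields a different learned model on the same data D.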
38. Machine Learning Problem Definition
Business Problem: Optimize a decision process to improve business metrics
• Sub-optimal decisions due to missing information
• Solution strategy: predict missing information from available data using ML
39. Machine Learning Problem Definition
Business Problem: Optimize a decision process to improve business metrics
• Sub-optimal decisions due to missing information
• Solution strategy: predict missing information from available data using ML
Example: Reseller fraud
• Business objective: Limit fraud orders to increase #users served and reduce return
shipping expenses.
• Decision process: Add friction to orders via disabling cash on delivery (COD)
• Missing information relevant to the decision:
– Likelihood of the buyer reselling the products
– Likely return shipping costs
– Unserved demand for the product
40. Key elements of a ML Prediction Problem
• Instance definition
• Target variable to be predicted
• Input features
• Sources of training data
• Modeling metrics (Online/Offline, Primary/Secondary)
• Deployment constraints
41. Instance Definition
• Is it the right granularity from the business perspective?
• Is it feasible from the data collection perspective ?
42. Instance Definition
Multiple options
• a customer
• a purchase order spanning multiple products
• a single product order (quantity can be >1 though)
43. Instance Definition
Multiple options
• a customer
• a purchase order spanning multiple products
• a single product order (quantity can be >1 though) [preferred choice]
Why?
• Reselling behavior is also at a single product level
• COD presented per product not per entire purchase
• Blocking a customer across all their orders can even hurt
44. Target Variable to be Predicted
• Can we express the business metrics (approximately) in terms of the
prediction quality of the target?
• Will accurate predictions of the target improve the business metrics
substantially?
45. Potential Business Impact
• Will accurate predictions of the target improve the business metrics
substantially?
• Compute business metrics for each case
– Ideal scenario (perfect predictions on target)
– Dumb baseline (default predictions, e.g., majority class)
– Simple predictor with rules/domain knowledge
– Existing solution (if one exists)
– Likely scenario with a reasonable-effort ML model
• Assess effort vs. benefits
46. Target Variable to be Predicted
• Can we express the business metrics (approximately) in terms of the
prediction quality of the target?
• Will accurate predictions of the target improve the business metrics
substantially?
• What is the data collection effort ?
– manual labeling costs
• Is it possible to get high quality observations?
– uncertainty in the definition, noise or bias in labeling process
47. Target Variable
Multiple options
• Likelihood of buyer reselling the current order
• Number of unserved users because of the current order
• Expected return shipping expenses for the current order
48. Target Variable
Multiple options.
• Likelihood of buyer reselling the current order [compromise choice]
• Number of unserved users because of the current order
• Expected return shipping expenses for the current order
Why?
• Last two choices better in terms of business metrics, but data collection is
difficult
• First choice makes data collection easy (esp. as a binary label) and addresses
business metrics in a reasonable, but slightly suboptimal way
49. Input features
• Is the feature predictive of the target ?
• Are the features going to be available in production setting ?
– Need to define exact time windows for features based on aggregates
– Watch out for time lags in data availability
– Be wary of target leakages (esp. conditional expectations of target )
• How costly is it to compute/acquire the feature?
– Might be different in training/prediction settings
50. Input Features
Reselling vs. Non-reselling indicators
• High product discount
• High order quantity relative to other orders of same product
– Normalize by median/mean to get relative values
• More for some products/verticals
– Product/vertical id can be used
• Buyer being a business store in product category
– Buyer’s category purchase count
– Buyer being a business store
51. Input Features
Reselling vs. Non-reselling indicators
• High product discount [feasible]
• High order quantity relative to other orders of same product
– Normalize by median/mean to get relative values [with lag]
• More for some products/verticals
– Product/vertical id can be used [feasible]
• Buyer being a business store in product category
– Buyer’s category purchase count [with lag]
– Buyer being a business store [expensive join with external info]
52. Sources of Training Data
• Is the distribution of training data similar to production data?
– e.g., if production data evolves over time, can the “training data” be
adjusted accordingly ?
• Are there systemic biases (e.g., data filters) in training data?
– Adjust the scope of prediction process so that it matches with the
training data setting
53. Sources of Training Data
Historical order data
– input features are available, but target is missing
Target observations
– Manual labeling on a random subset after focused investigations on the
address and the customer purchase history.
– Improve labeling efficiency by filtering by order quantity and applying the same
filtering in production
54. Modeling Metrics
• Online metrics are measured on a live system
– Can be defined directly in terms of the key business metrics
– typically measured via A/B tests and these metrics are potentially influenced by a
number of factors (e.g., net revenue)
• Offline metrics are meant to be computed on retrospective “labeled” data
– typically measured during offline experimentation and more closely tied to
prediction quality (e.g., area under ROC curve)
• Primary metrics are ones that we are actively trying to optimize
– e.g., losses due to fraud
• Secondary metrics are ones that can serve as guardrails
– e.g., customer base size
55. Offline Modeling Metrics
• Does improvement in offline modeling metrics result in gains in
online business metrics ?
Model quality:
– A) Maximize coverage of fraud orders at certain level of certainty (>90%)
– B) Binary target: Four decision possibilities
• Maximize average payoff in terms of expected return costs given the
different possibilities
Return pay-offs      Predicted Fraud            Predicted Not Fraud
Actual Fraud         0                          − avg. return costs
Actual Not Fraud     − avg. lost-order costs    0
56. Deployment Constraints
• What are the application latency & hardware constraints ?
Computational constraints:
– Orders per sec, allowed latency for COD disable action
– Available processing power, memory
57. Problem Formulation Template
• Template(s) with questions
on all the key elements
– Listing of possible choices
– Reason for preferred
choice
• Populated for each project
by product manager + ML
expert
58. Exercise: ML Problem Definition
Good choices for target variable, features & other elements ?
– Predicting shipping time for an order
– Forecasting the demand for different products
– Determining the nature of a customer complaint
– Predicting customer preference for a product
61. Motivation
• Early detection & prevention of common data related errors
• Reproducibility
• Auditability
• Robustness to failure in data fetch pipelines
62. Data Elements of Interest
• Instance identifiers
• Target variables
• Input features
• Other factors useful for evaluating online/offline metrics
Fields to specify for each variable of interest
• ID, Name, Version
• Modeling role, Owner, Description, Tags
63. Definitions
Three possible copies of the same variable based on the stage
• Offline training, Offline evaluation, Deployment
Fields to specify for each variable of interest for each stage
• Precise definition (query + source for raw ones or formula for derived ones)
• Data type, value check conditions
• Units/Level sets
• Is it an aggregate? Exact aggregation set or time window
• Missing or invalid value indicators, reasons, mitigations (e.g., div by 0 for ratios)
• First creation date
• Known quality issues
64. Review Criteria
• Unambiguous definitions to allow for ready implementation
• Parity across different stages (training/evaluation/deployment)
– Definitions
– Data type, value checks, units, level sets
– Aggregation windows
– Missing/invalid value handling of derived variables
65. Review Criteria
• Is the input X to target Y map invariant across stages?
– Do definitions drift with time? (Use averages, not sums, in general)
• e.g., customer spend to date in books → order fraud status
– Do we have the correct feature snapshot of X for Y?
• e.g., customer loyalty category (from when?)
66. Review Criteria
• Common data leakages
– Unintentional peeking into future, target, or any kind of unobserved
variables
– Ambiguously specified aggregates, e.g., customer revenue till the
“most recent” order ; interpretation can be different in training
data and deployment settings because of delays in data logging
– Time-varying features for which only certain (or recent) snapshots
are available, e.g., marital status of the customer
67. Review Criteria
• Handling of invalid/missing values of raw variables
– Join errors in preprocessing
– Service failures in deployment
• Handling of known data quality issues
– Corruption of data for certain segments/time periods
68. Data Definition Template
• Template(s) with details of all the data elements and review questions
• Populated for each project by all the relevant stakeholders
69. Outline
• Overview of ML ecosystem
• Problem formulation & data definition
• Offline modeling
• Building an internal ML platform
• Key tenets for real-world ML applications
74. Data Collection & Integration
Abstract process: (specifics depend on data management infrastructure)
• Find where the data resides
– API, database/table names, external web sources
• Identify mappings between schemas of different sources
• Obtain the instance identifiers
• Perform queries (joins/aggregations) to obtain the
features/target
Data access/integration tools:
• SQL
• Hive, Pig, SparkSQL (for large joins)
• Scrapy, Nutch (web-crawling)
76. Data Exploration: Why?
• Data quality is really critical
– “Garbage in garbage out”
– Need to verify that data conforms to expectations (domain knowledge)
• ML modeling requires making a lot of choices!
• Better understanding of data
– Early detection and fixing of data quality issues
– Good preprocessing, feature engineering & modeling choices
77. Data Exploration: What to look for ?
• Size and schema
– #instances & #features,
– feature names & data types (numeric, ordinal, categorical, text)
• Univariate feature and target summaries
– prevalence of missing, outlier, or junk values
– distributional properties of features,
– skew in target distribution
• Bivariate target-feature dependencies
– distributional properties of features conditioned on the target
– feature-target correlation measures
• Temporal variations of features & targets
81. Feature-Target Dependencies
• Class histograms conditioned on feature value
Note: for small feature values, the fraud fraction is almost 7–10 times lower; for large values it is comparable or higher.
83. Data Sampling & Splitting
• Generalization to “unseen” data forms the core of ML
• Split “labeled” data into train and test datasets
– Train split is used to learn models
– Test split is a proxy for the “unseen” data in the deployment setting and is used to
evaluate model performance
Note: In the test split, target is known unlike data in deployment setting
84. Creating Train & Test Splits
• Random disjoint splits
– Randomly shuffle & then split into train and test sets (e.g., 80% train & 20% test)
• K-fold cross validation
– Partition into K subsets: Use K-1 to train & one for test
– Rotate among the different sets to create K different train-test splits
– More reliable avg. performance estimate, variance measures, statistical tests
– Leave-one-out: Extreme case of K-fold (each fold is a single instance)
• Out-of-time splits (data has a temporal dependence)
– Train & test splits obtained via time-based cutoff (from production constraints)
• Special case: Imbalanced classes (for certain algorithms, e.g., decision trees)
– Balance train split alone by down (or up) sampling majority (minority) class
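The K-fold scheme above can be sketched by hand (a minimal illustrative version; libraries such as scikit-learn provide equivalent utilities):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k folds; yield (train_idx, test_idx) pairs."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# Each of the k splits trains on k-1 folds and tests on the remaining one.
splits = list(kfold_indices(10, 5))
```

Averaging the test metric across the k splits gives the more reliable performance estimate mentioned above.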
86. Complex pipelines: Additional Splits
• In a simple scenario, the target labels are used only by the “learning
algorithm”
– Train and test splits suffice for this case
• Complex pipelines might have multiple elements that need a peek
at the target
– e.g., feature selection, meta-learning algorithms, output calibration, etc.
– Separate data splits for each element lead to better generalization
– Need to consider size of available labeled data as well
88. Data Preprocessing
• Special handling of text valued features
– Necessary to preserve the relevant “information”
– Appropriate handling of special characters, punctuation, spaces & markup
• Feature/row scaling (for numeric attributes)
– Necessary to avoid numerical computation issues, speedup convergence
– Columns:
• z-scoring: subtract mean, divide by std-deviation → mean=0, variance=1
• fixed range: subtract min, divide by range → 0 to 1 range
– Rows: L1 norm, L2 norm
• Imputing missing/outlier values
– Necessary to avoid incorrect estimation of model parameters
– Handling strategies depend on the semantics of “missing”
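The two column-scaling schemes above can be sketched as (toy values are an assumption):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Column z-scoring: subtract mean, divide by std-deviation -> mean 0, variance 1
z = (x - x.mean()) / x.std()

# Fixed-range scaling: subtract min, divide by range -> values in [0, 1]
r = (x - x.min()) / (x.max() - x.min())
```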
89. Handling Outliers & MissingValues
• Indication of suspect instance: discard the record
• Informative w.r.t. target: introduce a new indicator variable
• Missing at random
– Numeric: replace with mean/median or conditional mean (regression)
– Categorical: replace with mode or likely value conditioned on the rest
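The missing-at-random and informative-missingness strategies above can be sketched as (the toy array is an assumption):

```python
import numpy as np

x = np.array([4.0, np.nan, 6.0, np.nan, 8.0])

# Missing at random (numeric): replace with the observed median
median = np.nanmedian(x)
x_imputed = np.where(np.isnan(x), median, x)

# Informative missingness: keep an indicator variable alongside the feature
is_missing = np.isnan(x).astype(int)
```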
91. Feature Engineering
• Case 1: Raw features are not highly predictive of the target, esp. in
case of simple model classes (e.g., linear models)
• Solution: Feature extraction, i.e., construct new more predictive
features from raw ones to boost model performance
• Case 2: Too many features with few training instances →
“memorizing” or “overfitting” situation leading to poor generalization.
• Solution: Feature selection, i.e., drop non-informative features to
improve generalization.
92. Feature Extraction
• Basic conversions for linear models
– e.g., 1-Hot encoding, Sparse encoding of text
• Non-linear feature transformations for linear models
– Linear models are scalable, but not expressive → need non-linear features
– e.g. Binning, quadratic interactions
• Domain-specific transformations
– Well studied for their effectiveness on special data types such as text, images
– e.g., TF-IDF transformation, SIFT features
• Dimensionality reduction
– High dimensional features (e.g., text) can lead to “overfitting”, but retaining
only some dimensions may be sub-optimal
– Informative low dimensional approximation of the raw feature
– e.g., PCA, clustering
93. Basic Conversions: Categorical Features
• One-Hot Encoding
– Converts a categorical feature with K values into a binary vector of size K−1
– Just a representation to enable use in linear models
Product_vertical   isBook   isMobile
Handset            0        0
Book               1        0
Mobile             0        1
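A minimal sketch of the K−1 encoding in the table above; the helper name `one_hot_k_minus_1` and the choice of reference level are illustrative assumptions:

```python
def one_hot_k_minus_1(values, reference):
    """Encode a categorical feature with K values as K-1 binary columns,
    dropping the reference level (as in the Handset/Book/Mobile table)."""
    levels = [v for v in dict.fromkeys(values) if v != reference]
    return [[int(v == level) for level in levels] for v in values], levels

rows, cols = one_hot_k_minus_1(["Handset", "Book", "Mobile"], reference="Handset")
# cols -> ["Book", "Mobile"]; rows -> [[0, 0], [1, 0], [0, 1]]
```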
94. Basic Conversions: High Dimensional Text-like Features
Sparse Matrix Encoding
Text features:
• each feature value snippet is split into tokens (dimensions)
• Bag of tokens → a sparse vector of “counts” over the token vocabulary
• Single text feature → sparse matrix with #columns = vocabulary size
Other high dimensional features:
• similar process via map from raw features to a bag of dimensions
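A sketch of the bag-of-tokens encoding above; dense lists stand in for a sparse matrix to keep the example dependency-free, and the two snippets are assumptions:

```python
from collections import Counter

docs = ["great phone great price", "bad phone"]

# Build the token vocabulary, then represent each snippet as a bag of counts
vocab = sorted({tok for d in docs for tok in d.split()})
rows = [Counter(d.split()) for d in docs]
matrix = [[row.get(tok, 0) for tok in vocab] for row in rows]
# one row per snippet, #columns = vocabulary size
```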
95. Non-linearity: Numeric Features
• Non-linear functions of features or target (in regression)
– Log transformation, polynomial powers, Box-cox transforms
– Useful given additional knowledge on the feature-target relationship
• Binning
– Numeric feature → categorical one with #values = #bins
– Results in more weights (K-1 for K bins) in linear models instead of just
one weight for the raw numeric feature
– Useful when the feature-target relation is non-linear or non-monotonic
– Bins: equal ranges, equal #examples, maximize bin purity (e.g. entropy)
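A minimal binning sketch in the equal-range style described above; the feature values and cut-off edges are illustrative assumptions:

```python
import numpy as np

order_quantity = np.array([1, 2, 3, 10, 50, 200])

# Bin edges turn the numeric feature into a categorical one;
# np.digitize returns the bin index for each value.
edges = np.array([5, 100])           # 3 bins: <5, 5..99, >=100 (illustrative)
bins = np.digitize(order_quantity, edges)
# bins -> [0, 0, 0, 1, 1, 2]
```

A linear model then learns one weight per bin instead of a single weight for the raw feature.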
97. Interaction Features
• Required when the features’ influence on the target is not purely additive
– linear combinations of features won’t work
Example: Order with 50% discount on mobiles is much more likely to indicate
fraud than a simple combination of 50% discount or mobile order.
Common Interaction Features:
• Non-linear functions of two or more numeric features, e.g., products & ratios
• Cross-products of two or more categorical features
• Aggregates of numerical features corresponding to categorical feature values
• Tree-paths: use leaves from decision trees trained on a smaller sample
99. Numerical-Categorical Interaction Features
• Compute aggregates of a numeric feature corresponding to each value
of a categorical feature
• New interaction feature → numeric one obtained by replacing the
categorical feature value with the corresponding numeric aggregate
• e.g., brand_id → brand_avg_rating, brand_avg_return_cnt
• Especially useful for categorical features with high cardinality (>50)
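The brand_id → brand_avg_rating replacement above can be sketched as follows; the toy (brand, rating) orders are an assumption:

```python
from collections import defaultdict

# (brand_id, rating) pairs; replace brand_id with brand_avg_rating
orders = [("b1", 4.0), ("b1", 2.0), ("b2", 5.0)]

sums = defaultdict(float)
counts = defaultdict(int)
for brand, rating in orders:
    sums[brand] += rating
    counts[brand] += 1
brand_avg_rating = {b: sums[b] / counts[b] for b in sums}

# The categorical value is replaced by its numeric aggregate
features = [brand_avg_rating[b] for b, _ in orders]
```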
100. Tree Path Features
• Learn a decision tree on a small data sample with raw features
• Paths to the leaves are conjunctions constructed from conditions
on multiple raw features
• Highly informative with respect to the target.
101. Domain-Specific Transformations
Text Analytics and Natural Language Processing
• Stop-words removal/Stemming: Helps focus on semantics
• Removing high/low percentiles: Reduces features w/o loss in predictive power
• TF-IDF normalization: Corpus wide normalization of word frequency
• Frequent N-grams: Capture multi-word concepts
• Parts of speech/Ontology tagging: Focus on words with specific roles
Web Information Extraction
• Hyperlinks, Separating multiple fields of text (URL, in/out anchor text, title, body)
• Structural cues: XPaths/CSS; Visual cues: relative sizes/positions of elements
• Text style (italics/bold, font-size, colors)
Image Processing
• SIFT features, Edge extractors, Patch extractors
103. Feature Selection
Key Idea: Sometimes “Less (features) is more (predictive power)”
Motivating reasons:
• To improve generalization
• To meet prediction latency or model storage constraints (for some applications)
Broadly, three classes of methods:
• Filter or Univariate methods, e.g., information-gain filtering
• Wrapper methods, e.g., forward search
• Embedded methods, e.g., regularization
104. Feature Selection: Filter or Univariate Methods
• Goal: Find the “top” individually predictive features
– “predictive”: specified correlation metric
• Ranking by a univariate score
– Score features via an empirical statistical measure that corresponds to
predictive power w.r.t. target
– Those above a cut-off (count, percentile, score threshold) are retained
Note:
• Fast, but highly sub-optimal since features are evaluated in isolation
• Independent of the learning algorithm.
• e.g., Chi-squared test, information gain, correlation coefficient
105. Feature-Target Correlation
• Mutual information: Captures correlation between categorical
feature (X) and class label (Y)
• p(x,y): Fraction of examples with X=x and Y=y
• p(x), p(y): Fraction of examples with X=x (resp. Y=y)

I(X,Y) = \sum_{x \in \mathrm{sup}(X)} \sum_{y \in \mathrm{sup}(Y)} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
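The empirical mutual information defined above can be computed directly from joint counts; the toy (X, Y) pairs below are an assumption:

```python
import math
from collections import Counter

# Empirical mutual information between categorical feature X and label Y
pairs = [("high", 1), ("high", 1), ("low", 0), ("low", 0)]  # toy data (assumption)
n = len(pairs)
p_xy = {k: c / n for k, c in Counter(pairs).items()}
p_x = {k: c / n for k, c in Counter(x for x, _ in pairs).items()}
p_y = {k: c / n for k, c in Counter(y for _, y in pairs).items()}

mi = sum(p * math.log(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
# X perfectly determines Y here, so mi = log(2) nats (the entropy of Y)
```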
106. Feature-Target Correlation
• Pearson’s correlation coefficient: Captures linear relationship between
numeric feature (X) and target value (Y)
• X_i, Y_i: Value of X, Y in the ith instance
• \bar{X}, \bar{Y}: Mean of X, Y
• Covariance matrix: Captures correlations between every pair of features

\rho(X,Y) = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\left(\sum_i (X_i - \bar{X})^2\right)^{1/2} \left(\sum_i (Y_i - \bar{Y})^2\right)^{1/2}}
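Pearson's coefficient can be computed from its definition and cross-checked against numpy's built-in; the toy data is an assumption:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.0, 8.2])

# Pearson correlation from the definition: centered dot product over norms
xc, yc = x - x.mean(), y - y.mean()
rho = (xc @ yc) / (np.sqrt((xc ** 2).sum()) * np.sqrt((yc ** 2).sum()))

# np.corrcoef returns the full correlation matrix; rho is the off-diagonal entry
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
```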
107. Feature Selection: Wrapper Methods
• Goal: Find the “best” subset from all possible subsets of input features
– “best” : specified performance metric & specified learning algo
• Iterative search
– Start with an initial choice (e.g., entire set, random subset)
– Each stage: find a better choice from a pool of candidate subsets.
Note:
• Computationally very expensive
• e.g., Backward search/Recursive feature elimination, Forward search
108. Feature Selection: Embedded Methods
• Identify predictive features while the model is being created itself
• Penalty methods: learning objective has an additional penalty term
that pushes the learning algorithm to prefer simpler models
• Good trade-off in terms of the optimality & computational costs
• e.g., Regularization methods (LASSO, Elastic Net, Ridge Regression)
110. Training: Find a “good” model f from the training data !
Key Elements of a Supervised Learning Algorithm
111. Training: Find a “good” model f from the training data !
• What is an allowed “model” ?
– Member of a model class H, e.g., linear models
• What is “good” ?
– Accurate predictions on training data in terms of a loss function L, e.g.,
squared error (Y − F(X))²
• How do you “find” it ?
– Optimization algorithm A, e.g., gradient descent
Key Elements of a Supervised Learning Algorithm
112. Training: Apply algorithm A to find model from the class H that optimizes a
loss function L on the training data D
• H: model class, L: loss function, A: optimization algorithm
• Different choices lead to different models on same data D
Key Elements of a Supervised Learning Algorithm
113. Model Training (Recap)
Key elements in learning algorithms:
• Model class, e.g., linear models, decision trees, neural networks
• Loss function, e.g., logistic loss, hinge loss
• Optimization algorithm, e.g., gradient descent, & assoc. params
Lots of algorithms & hyper-parameters to choose from!
115. Model Choice: Classification
Primary factors:
High #data instances (> 10 MM)
• Linear models – online learning (SGD)
High #features/examples ratio (>1)
• Linear models: Aggressive (L1) regularization
• Linear models: Dimensionality reduction
• Naïve Bayes (homogeneous independent
features)
Need non-linear interactions
• Kernel methods (e.g., Gaussian SVM)
• Tree Ensembles (e.g., GBDT, RF)
• Deep learning methods (e.g., CNNs, RNNs)
116. Model Choice: Regression
Primary factors:
High #data instances (> 10 MM)
• Linear models – online learning (SGD)
High #features/examples ratio (>1)
• Linear models: Aggressive (L1) regularization
• Linear models: Dimensionality reduction
Need non-linear interactions
• Kernel methods (e.g., Gaussian SVR)
• Tree Ensembles (e.g., GBDT, RF)
• Deep learning methods (e.g., CNNs, RNNs)
117. Model Evaluation & Diagnostics
Model Evaluation:
• Train error: Estimate of the expressive power of the model/algorithm relative
to training data
• Test error: A more reliable estimate of likely performance on “unseen” data
Post evaluation: What is the right strategy to get a better model ?
• 1) Get more training data instances
• 2a) Get more features or construct more complex features
• 2b) Explore more complex models/algorithms
• 3a) Drop some features
• 3b) Explore simpler models/algorithm
118. Overfitting
• Overfitting: Model fits training data well (low training error) but does not
generalize well to unseen data (poor test error)
• Complex models with large #parameters capture not only good patterns (that
generalize), but also noisy ones
[Figure: overfitting – a complex model curve tracks noisy training points (actual vs. predicted) and shows high prediction error on unseen data]
119. Underfitting
• Underfitting: Model lacks the expressive power to capture target distribution
(poor training and test error)
• Simple linear model cannot capture target distribution
120. Bias & Variance
• Bias of algo: Difference between the actual target and the avg. estimated target
where averaging is done over models trained on different data samples
• Variance of algo: Variation in predictions of models trained on diff. data samples
121. Model Complexity: Bias & Variance
• Simple learning algos with small #params → high bias & low variance
– e.g., Linear models with few features
– Reduce bias by increasing model complexity (adding more features)
• Complex learning algos with large #params → low bias & high variance
– e.g., Linear models with sparse high-dimensional features, decision trees
– Reduce variance by increasing training data & decreasing model complexity
(feature selection)
3/28/18 121
122. Validation Curve
[Figure: validation curve – prediction performance vs. model complexity parameter, showing an overfitting region and the optimal choice]
Ideal choice: Match of complexity between the learning algorithm and the training data.
124. Common Evaluation Metrics
Standard evaluation metrics exist for each class of predictive learning scenarios
– Binary Classification
– Multi-class & Multi-label Classification
– Regression
– Ranking
• Loss function used in training objective is just one choice of evaluation metric
– Usually picked because the learning algorithm is readily available
– Might be a good, but not necessarily ideal choice from business perspective
• Critical to work backwards from business metrics to create more meaningful
metrics
126. Classification – Operational Point Evaluation Metrics
• For each score threshold, a confusion matrix for binary classification of P vs. N
• Precision = TP/(TP+FP): How correct are we on the ones we predicted P?
• Recall = TP/(TP+FN): What fraction of actual P’s did we predict correctly?
• True Positive Rate (TPR) = Recall
• False Positive Rate (FPR) = FP/(FP+TN): What fraction of N’s did we predict wrongly?
             Actual P   Actual N
Predicted P  TP         FP
Predicted N  FN         TN
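The operational-point metrics above follow directly from the confusion-matrix counts; the counts used here are illustrative assumptions:

```python
def precision_recall(tp, fp, fn, tn):
    """Operational-point metrics from the binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # = true positive rate (TPR)
    fpr = fp / (fp + tn)      # false positive rate
    return precision, recall, fpr

p, r, fpr = precision_recall(tp=80, fp=20, fn=40, tn=860)
```

Sweeping the score threshold changes (TP, FP, FN, TN) and traces out the trade-off curve on the next slide.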
[Figure: trade-off curve – % cumulative frauds (true positive rate) vs. % cumulative non-frauds (false positive rate), with a 90% TPR operational point marked]
ROC curve: Plot of TPR vs. FPR
AUC: Area under ROC curve
• Perfect classifier: AUC = 1
• Random classifier: AUC = 0.5
• Probability of ranking a random P above a random N
• Effective for comparing learning
algorithms across operational pts
Operational point:
• Maximize (TPR – FPR), F1-measure
• Other business driven choices
Receiver Operating Characteristic (ROC) Curve
129. • Binary Classification: Score threshold corresponds to operational point
• Application-specific bounds on Precision or Recall
– Maximize recall (or precision) with a lower bound on precision (or recall)
• Application-specific misclassification cost matrix
– Optimize overall misclassification cost (TP·C_TP + FP·C_FP + TN·C_TN + FN·C_FN)
– Reduces to standard misclassification error when C_TP = C_TN = 0 and C_FP = C_FN = 1
Classification: Picking an Operational Point
3/28/18 129
Actual'P Actual N
Predicted'P CTP CFP
Predicted'N CFN CTN
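A minimal sketch of cost-based threshold selection (toy data; the costs are illustrative and assume a missed positive, CFN, is five times as costly as a false alarm, CFP):

```python
# Pick the score threshold that minimizes total misclassification cost.
# Cost-matrix entries are illustrative: CTP = CTN = 0, CFP = 1, CFN = 5.
def total_cost(y_true, scores, thr, c_tp=0.0, c_fp=1.0, c_tn=0.0, c_fn=5.0):
    cost = 0.0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= thr else 0
        if pred == 1:
            cost += c_tp if y == 1 else c_fp
        else:
            cost += c_fn if y == 1 else c_tn
    return cost

y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.4, 0.7, 0.9]
# Candidate thresholds: the observed scores themselves.
best_thr = min(scores, key=lambda t: total_cost(y_true, scores, t))
```

With these toy numbers the chosen threshold trades one false alarm against the much costlier missed positives.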
130. Regression – Evaluation Metrics
• Metrics when regression is used for predicting target values
– Root Mean Square Error (RMSE): [ (1/N) Σi (Yi − F(Xi))² ]^(1/2)
– R²: How much better is the model compared to just picking the best constant?
R² = 1 − (Model Mean Squared Error / Variance)
– MAPE (Mean Absolute Percent Error): (1/N) Σi |Yi − F(Xi)| / |Yi|
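A minimal sketch of these three metrics in plain Python (toy data; MAPE here uses |Yi| in the denominator, the common convention):

```python
# The three regression metrics above, computed directly from their definitions.
import math

def rmse(y, f):
    return math.sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y))

def r_squared(y, f):
    mean_y = sum(y) / len(y)
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)
    variance = sum((yi - mean_y) ** 2 for yi in y) / len(y)
    return 1.0 - mse / variance  # 1 means perfect; 0 means no better than the mean

def mape(y, f):
    return sum(abs(yi - fi) / abs(yi) for yi, fi in zip(y, f)) / len(y)

y = [100.0, 200.0, 300.0]  # toy targets
f = [110.0, 190.0, 330.0]  # toy predictions
```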
131. Model Fine-tuning
• Many algorithms and hyper-parameters (e.g., learning rate) to choose from
– Infeasible to explore all choices
• Practical solution approach
– Narrow down a few suitable algorithms from metadata (size/attribute types)
– For each chosen algorithm, systematically explore hyper-parameter choices
• Alternate optimization
• Exhaustive grid search
• Bayesian optimization (e.g., Spearmint, MOE)
• Each exploration: learning a model on a train split & evaluating on a test split
• Preferred split mechanism: k-fold cross-validation
• Best hyper-parameter choices picked based on the test (cross-validation) error
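The grid-search-with-cross-validation step can be sketched with scikit-learn (synthetic data and an illustrative grid over the regularization strength C):

```python
# Exhaustive grid search with k-fold cross-validation (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)  # synthetic data
grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # hyper-parameter choices to explore

# Each grid point is evaluated by 5-fold cross-validation.
search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=5)
search.fit(X, y)

best = search.best_params_  # choice with the best cross-validation score
```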
132. Multiple stages of optimization
• Objective: Find f(.) to optimize some cost L(Yunseen, f(Xunseen))
• ML Methodology:
– (Step I) Model Learning: Determine good choices of f(.) that optimize LA(Ytrain, Xtrain) for different choices of hyper-parameters and algorithms
– (Step II) Hyper-parameter Fine-tuning: Among choices in Step I, pick the one that optimizes LB(Yeval-split, Xeval-split)
– (Step III) Operational choices (e.g., score thresholding, output calibration): For the choice in Step II, determine the operational choices so as to optimize LC(Yop-split, Xop-split)
• Note:
– Ideal choice is to have L = LA = LB = LC and the data splits to be i.i.d., but this is not always possible
• e.g., need L = max. recall for precision > 90%, but LA = logistic loss & LB = area under ROC
– Preferable to choose intermediate metrics that are “close” to the desired business metric and have robust off-the-shelf implementations
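As a sketch of Step III under the example constraint above (maximize recall subject to precision ≥ 0.9), with toy labels and scores:

```python
# Step III sketch: among candidate thresholds, maximize recall subject to a
# precision lower bound (0.9, as in the example). Data is illustrative.
def prec_rec(y_true, scores, thr):
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= thr)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= thr)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < thr)
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

y_true = [1, 1, 1, 0, 1, 0]
scores = [0.95, 0.9, 0.8, 0.75, 0.6, 0.4]

# Recall only grows as the threshold drops, so the lowest threshold that
# still meets the precision bound maximizes recall.
feasible = [t for t in scores if prec_rec(y_true, scores, t)[0] >= 0.9]
best_thr = min(feasible)
```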
139. Primary Challenges
• Fast error-proof productionization
• Scalability vs. flexibility trade-off
• Reusability & extensibility of modeling effort
• Management of offline modeling experiments
• Interactive monitoring of modeling experiments
140. Challenge: Road to Productionization
• Long, slow road to delivery for each new application; very little reuse across applications
• POC code → production code translation is highly error-prone
• Rigorous evaluation & debugging of actual production systems is unlikely, since these tasks are owned by dev-ops folks and data scientists don't understand production code
[Diagram: Product Manager hands app requirements & metrics to a Data Scientist, whose PoC modeling (R, Python) is translated by Software Engineers into production code operated by Dev Ops.]
141. Solution: Self-contained “Models”
Data scientists
• build application-specific configurations for data collection & modeling
• ship self-contained production “models” (i.e., “model”, configurations, library dependencies), say via Docker (not POC code!)
Software engineers
• build application-agnostic* production code & systems for automation of data collection, model scoring, re-training, evaluation, etc.
[Diagram: Data Scientist ships self-contained models; Software Engineers build application-agnostic production code; Dev Ops operate it.]
* Need to consider data scale and latency for scoring & retraining, which have some dependency on the application
142. Solution: Self-contained “Models”
• ML packages such as scikit-learn, spark-mllib, and Keras allow for easy serialization of the entire processing pipeline (i.e., preprocessing, feature engineering, scoring) along with the fitted parameters as a single “model” that can be exported for scoring.
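A minimal sketch of this serialization pattern with scikit-learn and joblib (synthetic data; the file location is illustrative):

```python
# Serialize a whole preprocessing + learning pipeline as one "model" artifact.
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)  # fits both the scaler and the classifier

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipe, path)        # ship this single file (e.g., inside Docker)
restored = joblib.load(path)   # the scoring side only needs this artifact
```

The scoring service never sees the POC code, only the serialized pipeline with its fitted parameters.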
143. Challenge: Scalability vs. Flexibility Trade-off
• Scalability requirements vary across applications
• Factors to consider
– Size of training data
– Frequency of retraining
– Rate of arrival of prediction instances and latency bounds (in case of
online predictions)
– Size of batch and frequency of scoring (in case of batch predictions)
• Data scientists prefer to train models on single machines where
possible
144. Solution: Support Multiple Choices
• Moderate scale for training & prediction
– train models on a single machine (in Python/R)
– export the model as-is to multiple machines with the same image and predict in parallel
• Moderate scale for training, but large prediction scale
– train models on a single machine (e.g., spark-mllib in Python/R)
– export the model to a different environment (e.g., Scala/Java) that allows more efficient parallelization
• Large scale for both training & prediction
– train models and predict on a cluster (e.g., via sparkit-learn, PySpark, or Scala)
145. Challenge: Reusability & Extensibility of Modeling Effort
• ML workflows are more than just the “models pipeline”
– e.g., data fetch/aggregation from multiple sources, evaluation across
multiple models, exploratory data analysis
• Offline modeling code (notebooks) tends to get dirty fast
– Mix of interactive analysis (specific to application) and processing of data
• Common approach to reuse
– limited use of libraries + cut & paste code
147. Example Workflow: Model Learning
[Diagram: a model-learning workflow. Inputs: consolidated data plus a learning config (data split/sampling, target, feature, model, HP-search, and eval configs). Stages: data splitter, target constructor, feature pipeline setup, model setup, HP search, predict & eval. Built on libraries of filters/splitters/samplers, transformers, and learning algorithms. Outputs: a model, eval metrics, and a report.]
148. Example Workflow: Model Evaluation
[Diagram: a model-evaluation workflow. Inputs: labeled data plus an evaluation config (pre-trained feature/model configs and eval config). Stages: feature pipeline setup, model setup, predict, eval. Built on libraries of transformers and learning algorithms. Outputs: eval metrics and reports.]
149. Example Workflow: Model Scoring
[Diagram: a model-scoring workflow. Inputs: unlabeled data plus a prediction config (pre-trained feature/model config). Stages: feature pipeline setup, model setup, predict. Output: predictions.]
150. Solution: Workflow Abstractions
• Each workflow is represented as a DAG over nodes
– DAGs can be encoded as YAML or JSON files
• Each node is a computational unit with the following elements
– name
– environment of execution (e.g., python/scala)
– actual function to be executed (via a link to an existing module, class, or method)
– inputs (with default choices) and outputs
– tags to aid discovery
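As an illustration, one node of such a DAG might be encoded in YAML like this (the schema and all names are hypothetical, not an actual platform format):

```yaml
# Hypothetical encoding of a single workflow-DAG node (illustrative schema).
name: train_test_splitter
environment: python
function: mylib.splitters.StratifiedSplitter.split   # link to existing code
inputs:
  data: consolidated_data        # output of an upstream node
  test_fraction: 0.2             # default choice, overridable per run
outputs:
  - train_split
  - test_split
tags: [data-split, sampling]     # aids discovery in the node repository
```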
151. Solution: Workflow Abstractions
• Wrapper libraries allow hooks to existing ML packages (sklearn, keras, etc.) via nodes
• Properly indexed repositories of workflow DAGs, nodes, and node-configurations allow discovery and reuse
• Editing tools for composing DAGs enable extensibility
153. Challenge: Management of Experiments
• Manual tracking of experimental results requires considerable
effort and is error-prone
• Low reproducibility and auditability of offline modeling
experiments
154. Solution: Automated Repositories of ML Entities
• Run: execution of a workflow
– consumes datasets and configurations as inputs and generates models,
reports and new datasets as outputs
– organizes all the inputs/outputs and intermediate results in an
appropriate directory structure
• Automatically updated versioned repositories
– workflow DAGs, nodes, configs
– runs, datasets, models, reports
• After each run, the repositories are automatically updated with the appropriate linkages between the different entities
156. Solution: Read-only monitoring
• Additional layer that allows workflow DAGs to be executed one
step at a time and outputs to be examined from an interactive tool
(e.g., Jupyter notebooks)
– run_node(), load_input(), load_output()
• Cloning of intermediate inputs & outputs on demand so that these
can be analyzed without affecting the original run
– Changes to the actual run have to be explicitly made via workflow
DAGs, configs
158. Key Tenets for Real-world ML applications
Design phase:
• Work backwards from the application use case
– ML problem formulation & evaluation metrics aligned with business goals
– Software stack/ML libraries based on scalability/latency/retraining needs
• Keep the ML problem formulation simple (but ensure validity)
– Understand assumptions/limitations of ML methods & apply them with care
– Should enable ease of development, testing, and maintenance
159. Key Tenets for Real-world ML applications
Modeling phase:
• Ensure data is of high quality
– Fix missing values, outliers, target leakages
• Narrow down modeling options based on data characteristics
– Learn about the relative effectiveness of various preprocessing, feature engineering,
and learning algorithms for different types of data.
• Be smart on the trade-off between feature engineering effort & model complexity
– The sweet spot depends on the problem complexity, available domain knowledge, and computational requirements
• Ensure offline evaluation is a good “proxy” for the “real unseen” data
evaluation
– Generate train/test splits similar to how it would be during deployment
160. Key Tenets for Real-world ML applications
Deployment phase:
• Establish train vs. production parity
– Checks on every possible component that could change
• Establish improvement in business metrics before scaling up
– A/B testing over random buckets of instances
• Trust the models, but always audit
– Insert safe-guards (automated monitoring) and manual audits
• View model building as a continuous process not a one-time effort
– Retrain periodically to handle data drifts & design for this need
Don’t adopt Machine Learning because of the hype!