Building ML Applications – A practitioner’s perspective
Srujana Merugu
Outline
• Overview of ML
– What is ML ? Why ML ?
– Predictive modeling recap
– ML application lifecycle
• Problem formulation & data definition
• Offline modeling
• Building an internal ML platform
• Key tenets for real-world ML applications
What is ML?
A top bank uses an ML model for underwriting decisions built for a foreign customer segment on Indian customers with no changes.
HYPE! FEAR! MISUSE!
What is Machine Learning?
“Field of study that gives computers the ability to learn from data
without being explicitly programmed”.
- Arthur Samuel (1959)
Main elements of ML
• Representing relevant information as concrete variables
• Collecting empirical observations on variables
• Algorithms to infer associations between variables
ML Research
• Abstract problem areas, e.g., time series modeling, active learning
  – variable dependency structure (e.g., sequences, trees)
  – nature of observations (e.g., noisy, adversarial)
  – observation process (e.g., active, incremental)
• Model classes, e.g., linear models, CNNs, CRFs, SVMs
  – quantification of variable dependencies & assumptions
  – specifying an exact optimization problem
  – theoretical results
• Learning algorithms, e.g., SGD, EM, distributed LDA
  – mechanisms to solve the optimization problem
  – theoretical results
  – practical enhancements: scalable, distributed, incremental versions
ML Software Development
• Implementation of algorithms/models, e.g., sklearn, word2vec, Inception
  – software encoding of algorithms or models in specific programming languages
• ML platform utilities, e.g., AzureML, H2O, scikit-learn, Keras
  – software for efficient management of ML workflows
ML Practice
• Translating application problems (e.g., seller fraud detection) to ML problems
• Applying ML methodology, making the right modeling choices
• Using existing software tools effectively
Bridges the ML literature, algorithm & model implementations, and ML platform utilities to build robust production systems.
More art than science!
Why Machine Learning?
• Learn it when you can’t code it
– e.g. product similarity (fuzzy relationships & trade-offs)
• Learn it when you need to contextualize
– e.g., personalized product recommendations (fine-grained context)
• Learn it when you can’t track it over time
– e.g., seller fraud detection (input-output mapping changes dynamically)
• Learn it when you can’t scale it
– e.g., customer service ( complex task, large scale input, low latency)
• Learn it when you don’t understand it
– e.g., review aspect-sentiment mining (hidden structure to be detected)
Why not Machine Learning ?
• Low problem complexity
– e.g., classes being well separated in chosen representation
• Less interpretability
– true for complex models relative to simple rules
– when it drives critical decisions (e.g., health care)
• Lag time in case data has to be collected or available data is biased
– when there exists more holistic domain knowledge
• Expensive modeling effort
Broad Areas of Machine Learning
• Supervised learning: predict new data based on observed data
• Unsupervised learning: detect latent structure in the data
• Reinforcement learning: adapt behavior to optimize long-term goals using observed rewards
Typical Predictive Modeling Problem
Given: An input object with some features X (covariates/independent variables)
Goal: Predict a new target/label attribute Y (response/dependent variable)
Note:
– Input & output attributes (X, Y) can be simple (e.g., numeric or categorical values/vectors) or have a complex structure (e.g., time-series, text sequences)
– Classification (Y is categorical), Regression (Y is numeric/ordinal)
Shipping Logistics
Given: a customer order and
seller details
Predict: expected shipping time
Product Catalog Management
Given: a new product
Predict: product category it
should be placed in
Product Recommendations
Given: a user, current context & a
candidate product
Predict: preference of user for the
product
Advertising
Given: a user, search query and a
candidate product ad
Predict: expected click through rate
of user for the product ad
Many more applications !
• Advertising
• Product search and browse experience
• Forecasting product demand and supply
• Product pricing/competitor monitoring
• Understanding customer profiles and lifetime value
• Detecting seller and customer fraud
• Enriching product catalog & review content
• …
Supervised Learning: Key Assumptions
• Training data with correct input-output pairs (X, Y)
• Data samples in both train and unseen data are generated in the same way (i.i.d.)
Supervised Learning
Training: Given training examples {(Xi, Yi)} where Xi is the input features and Yi the target variable, learn a model F to best fit the training data (i.e., Yi ≈ F(Xi) for all i)
Prediction: Given a new sample X with unknown Y, predict it using F(X)

Key Elements of a Supervised Learning Algorithm
Training: Find a "good" model F from the training data!
• What is an allowed "model"?
  – Member of a model class H, e.g., linear models
• What is "good"?
  – Accurate predictions on training data in terms of a loss function L, e.g., squared error (Y − F(X))²
• How do you "find" it?
  – Optimization algorithm A, e.g., gradient descent
Training: Apply algorithm A to find the model from class H that optimizes a loss function L on the training data D
• H: model class, L: loss function, A: optimization algorithm
• Different choices lead to different models on the same data D
ML Application Development Life Cycle
Offline Modeling Process
• Data Collection & Integration
• Data Exploration
• Data Sampling/Splitting
• Data Preprocessing
• Feature Engineering
• Model Training, Evaluation & Fine-tuning
• Meet Business Goals?
Problem Formulation
Machine Learning Problem Definition
Business Problem: Optimize a decision process to improve business metrics
• Sub-optimal decisions due to missing information
• Solution strategy: predict missing information from available data using ML
Example: Reseller fraud
• Business objective: Limit fraud orders to increase #users served and reduce return
shipping expenses.
• Decision process: Add friction to orders via disabling cash on delivery (COD)
• Missing information relevant to the decision:
– Likelihood of the buyer reselling the products
– Likely return shipping costs
– Unserved demand for the product
Key elements of a ML Prediction Problem
• Instance definition
• Target variable to be predicted
• Input features
• Sources of training data
• Modeling metrics (Online/Offline, Primary/Secondary)
• Deployment constraints
Instance Definition
• Is it the right granularity from the business perspective?
• Is it feasible from the data collection perspective ?
Instance Definition
Multiple options
• a customer
• a purchase order spanning multiple products
• a single product order (quantity can be >1 though) [preferred choice]
Why?
• Reselling behavior is also at a single product level
• COD presented per product not per entire purchase
• Blocking a customer on all his orders can even hurt
Target Variable to be Predicted
• Can we express the business metrics (approximately) in terms of the
prediction quality of the target?
• Will accurate predictions of the target improve the business metrics
substantially?
Potential Business Impact
• Will accurate predictions of the target improve the business metrics substantially?
• Compute business metrics for each case
  – Ideal scenario (perfect predictions on target)
  – Dumb baseline (default predictions, e.g., majority class)
  – Simple predictor with rules/domain knowledge
  – Existing solution (if one exists)
  – Likely scenario with a reasonable-effort ML model
• Assess effort vs. benefits
Target Variable to be Predicted
• What is the data collection effort?
  – manual labeling costs
• Is it possible to get high-quality observations?
  – uncertainty in the definition, noise or bias in the labeling process
Target Variable
Multiple options
• Likelihood of buyer reselling the current order [compromise choice]
• Number of unserved users because of the current order
• Expected return shipping expenses for the current order
Why?
• Last two choices better in terms of business metrics, but data collection is
difficult
• First choice makes data collection easy (esp. as a binary label) and addresses
business metrics in a reasonable, but slightly suboptimal way
Input features
• Is the feature predictive of the target ?
• Are the features going to be available in production setting ?
– Need to define exact time windows for features based on aggregates
– Watch out for time lags in data availability
– Be wary of target leakages (esp. conditional expectations of target )
• How costly is it to compute/acquire the feature?
– Might be different in training/prediction settings
Input Features
Reselling vs. Non-reselling indicators
• High product discount [feasible]
• High order quantity relative to other orders of same product
– Normalize by median/mean to get relative values [with lag]
• More for some products/verticals
– Product/vertical id can be used [feasible]
• Buyer being a business store in product category
– Buyer’s category purchase count [with lag]
– Buyer being a business store [expensive join with external info]
Sources of Training Data
• Is the distribution of training data similar to production data?
– e.g., if production data evolves over time, can the “training data” be
adjusted accordingly ?
• Are there systemic biases (e.g., data filters) in training data?
– Adjust the scope of prediction process so that it matches with the
training data setting
Sources of Training Data
Historical order data
– input features are available, but target is missing
Target observations
– Manual labeling on a random subset after focused investigations on the
address and the customer purchase history.
– Improve labeling efficiency by filtering by order quantity and apply same
filtering in production
Modeling Metrics
• Online metrics are measured on a live system
– Can be defined directly in terms of the key business metrics
– typically measured via A/B tests and these metrics are potentially influenced by a
number of factors (e.g., net revenue)
• Offline metrics are meant to be computed on retrospective “labeled” data
– typically measured during offline experimentation and more closely tied to
prediction quality (e.g., area under ROC curve)
• Primary metrics are ones that we are actively trying to optimize
– e.g., losses due to fraud
• Secondary metrics are ones that can serve as guardrails
– e.g., customer base size
Offline Modeling Metrics
• Does improvement in offline modeling metrics result in gains in
online business metrics ?
Model quality:
– A) Maximize coverage of fraud orders at certain level of certainty (>90%)
– B) Binary target: Four decision possibilities
• Maximize average payoff in terms of expected return costs given the
different possibilities
Return pay-offs:
                   Predicted Fraud             Predicted Not Fraud
Actual Fraud       0                           − avg. return costs
Actual Not Fraud   − avg. lost-order costs     0
Deployment Constraints
• What are the application latency & hardware constraints ?
Computational constraints:
– Orders per sec, allowed latency for COD disable action
– Available processing power, memory
Problem Formulation Template
• Template(s) with questions
on all the key elements
– Listing of possible choices
– Reason for preferred
choice
• Populated for each project
by product manager + ML
expert
Exercise: ML Problem Definition
Good choices for target variable, features & other elements ?
– Predicting shipping time for an order
– Forecasting the demand for different products
– Determining the nature of a customer complaint
– Predicting customer preference for a product
Data Definition
Motivation
• Early detection & prevention of common data related errors
• Reproducibility
• Auditability
• Robustness to failure in data fetch pipelines
Data Elements of Interest
• Instance identifiers
• Target variables
• Input features
• Other factors useful for evaluating online/offline metrics
Fields to specify for each variable of interest
• ID, Name, Version
• Modeling role, Owner, Description, Tags
Definitions
Three possible copies for same variable based on the stages
• Offline training, Offline evaluation, Deployment
Fields to specify for each variable of interest for each stage
• Precise definition (query + source for raw ones or formula for derived ones)
• Data type, value check conditions
• Units/Level sets
• Is Aggregate? , Exact aggregation set or time window
• Missing or invalid value indicators, reasons, mitigations (e.g., div by 0 for ratios)
• First creation date
• Known quality issues
Review Criteria
• Unambiguous definitions to allow for ready implementation
• Parity across different stages (training/evaluation/deployment)
– Definitions
– Data type, value checks, units, level sets
– Aggregation windows
– Missing/invalid value handling of derived variables
Review Criteria
• Is the input X to target Y map invariant across stages?
  – Do definitions drift with time? (Use averages, not sums, in general)
    • e.g., customer spend to date in books → order fraud status
  – Do we have the correct feature snapshot of X for Y?
    • e.g., customer loyalty category (from when?)
Review Criteria
• Common data leakages
– Unintentional peeking into future, target, or any kind of unobserved
variables
– Ambiguously specified aggregates, e.g., customer revenue till the
“most recent” order ; interpretation can be different in training
data and deployment settings because of delays in data logging
– Time-varying features for which only certain (or recent) snapshots
are available, e.g., marital status of the customer
Review Criteria
• Handling of invalid/missing values of raw variables
– Join errors in preprocessing
– Service failures in deployment
• Handling of known data quality issues
– Corruption of data for certain segments/time periods
Data Definition Template
• Template(s) with details of all the data elements and review questions
• Populated for each project by all the relevant stakeholders
Outline
• Overview of ML ecosystem
• Problem formulation & data definition
• Offline modeling
• Building an internal ML platform
• Key tenets for real-world ML applications
Offline Modeling
Offline Modeling Process
• Data Collection & Integration
• Data Exploration
• Data Sampling/Splitting
• Data Preprocessing
• Feature Engineering
• Model Training, Evaluation & Fine-tuning
• Meet Business Goals?
Data Collection & Integration
Abstract process: (specifics depend on data management infrastructure)
• Find where the data resides
– API, database/table names, external web sources
• Identify mappings between schemas of different sources
• Obtain the instance identifiers
• Perform a bunch of queries (joins/aggregations) for obtaining the
features/target
Data access/integration tools:
• SQL
• Hive, Pig, SparkSQL (for large joins)
• Scrapy, Nutch (web-crawling)
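To make the abstract process concrete, here is a minimal pandas sketch of the join/aggregation step; the tables and column names (orders, products, quantity, etc.) are hypothetical.

```python
import pandas as pd

# Hypothetical extracts: one row per single-product order, plus product attributes.
orders = pd.read_csv("orders.csv")       # order_id, customer_id, product_id, quantity, order_ts
products = pd.read_csv("products.csv")   # product_id, vertical, discount_pct

# Join order rows with product attributes.
df = orders.merge(products, on="product_id", how="left")

# Aggregate feature: median quantity per product, used to normalize order quantity.
median_qty = df.groupby("product_id")["quantity"].median().rename("product_median_qty")
df = df.join(median_qty, on="product_id")
df["relative_quantity"] = df["quantity"] / df["product_median_qty"]
```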
Data Exploration: Why?
• Data quality is really critical
– “Garbage in garbage out”
– Need to verify that data conforms to expectations (domain knowledge)
• ML modeling requires making a lot of choices !
• Better understanding of data
– Early detection and fixing of data quality issues
– Good preprocessing, feature engineering & modeling choices
Data Exploration: What to look for ?
• Size and schema
– #instances & #features,
– feature names & data types (numeric, ordinal, categorical, text)
• Univariate feature and target summaries
– prevalence of missing, outlier, or junk values
– distributional properties of features,
– skew in target distribution
• Bivariate target-feature dependencies
– distributional properties of features conditioned on the target
– feature-target correlation measures
• Temporal variations of features & targets
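A minimal pandas sketch of these checks, assuming a consolidated dataset with a hypothetical binary target column is_fraud and a numeric column quantity:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")   # hypothetical consolidated dataset

print(df.shape)                                       # #instances & #features
print(df.dtypes)                                      # feature names & data types
print(df.isna().mean())                               # prevalence of missing values per column
print(df.describe())                                  # distributional properties of numeric features
print(df["is_fraud"].value_counts(normalize=True))    # skew in the target distribution

# Bivariate check: a numeric feature's distribution conditioned on the target.
print(df.groupby("is_fraud")["quantity"].describe())
```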
Example: Data Schema
Example: Dataset Summary
Univariate (Feature or Target) Histograms
[Example histograms showing missing values, class imbalance, and skewed distributions; Y-axis on log scale]
Feature-Target Dependencies
• Class histograms conditioned on feature value
– e.g., for small feature values the fraud fraction is almost 7–10 times lower, while for large values it is comparable or higher
Data Sampling & Splitting
• Generalization to “unseen” data forms the core of ML
• Split “labeled” data into train and test datasets
– Train split is used to learn models
– Test split is a proxy for the "unseen" data in the deployment setting and is used to evaluate model performance
Note: In the test split, the target is known, unlike data in the deployment setting
Creating Train & Test Splits
• Random disjoint splits
– Randomly shuffle & then split into train and test sets (e.g., 80% train & 20% test)
• K-fold cross validation
– Partition into K subsets: Use K-1 to train & one for test
– Rotate among the different sets to create K different train-test splits
– More reliable avg. performance estimate, variance measures, statistical tests
– Leave-one out: Extreme case of K-fold (each fold is single instance)
• Out-of-time splits (data has a temporal dependence)
– Train & test splits obtained via time-based cutoff (from production constraints)
• Special case: Imbalanced classes (for certain algorithms, e.g., decision trees)
– Balance train split alone by down (or up) sampling majority (minority) class
K-fold Cross Validation
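A minimal sketch of these split strategies with scikit-learn, using placeholder data for X and y:

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)  # placeholder data

# Random disjoint split (80% train / 20% test), stratified to preserve class balance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# K-fold cross validation: rotate through K train/test splits.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ... fit and evaluate a model on each fold ...

# Out-of-time split: replace the random shuffle with a time-based cutoff,
# e.g., rows with order_ts < cutoff go to train and the rest to test.
```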
Complex pipelines: Additional Splits
• In a simple scenario, the target labels are used only by the “learning
algorithm”
– Train and test splits suffice for this case
• Complex pipelines might have multiple elements that need a peek
at the target
– e.g., Feature selection, Meta learning algorithms, Output calibration etc.
– A separate data split for each such element leads to better generalization
– Need to consider size of available labeled data as well
Data Preprocessing
• Special handling of text valued features
– Necessary to preserve the relevant “information”
– Appropriate handling of special characters, punctuation, spaces & markup
• Feature/row scaling (for numeric attributes)
– Necessary to avoid numerical computation issues, speedup convergence
– Columns:
• z-scoring: subtract mean, divide by std-deviation → mean = 0, variance = 1
• fixed range: subtract min, divide by range → 0 to 1 range
– Rows: L1 norm, L2 norm
• Imputing missing/outlier values
– Necessary to avoid incorrect estimation of model parameters
– Handling strategies depend on the semantics of “missing”
Handling Outliers & MissingValues
• Indication of suspect instance: discard the record
• Informative w. r. t. target: introduce a new indicator variable
• Missing at random
– Numeric: replace with mean/median or conditional mean (regression)
– Categorical: replace with mode or likely value conditioned on the rest
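A minimal scikit-learn sketch combining the scaling and imputation choices above; the column names are hypothetical:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_cols = ["quantity", "discount_pct"]      # hypothetical numeric features
categorical_cols = ["vertical"]                  # hypothetical categorical feature

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),        # missing-at-random numerics
        ("scale", StandardScaler()),                         # z-scoring: mean 0, variance 1
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # replace with the mode
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
# Fit on the train split only and reuse the fitted transformer at prediction time.
```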
Feature Engineering
• Case 1: Raw features are not highly predictive of the target, esp. in
case of simple model classes (e.g., linear models )
• Solution: Feature extraction, i.e., construct new more predictive
features from raw ones to boost model performance
• Case 2: Too many features with few training instances → a "memorizing" or "overfitting" situation leading to poor generalization.
• Solution: Feature selection, i.e., drop non-informative features to
improve generalization.
Feature Extraction
• Basic conversions for linear models
– e.g., 1-Hot encoding, Sparse encoding of text
• Non-linear feature transformations for linear models
– Linear models are scalable, but not expressive → need non-linear features
– e.g. Binning, quadratic interactions
• Domain-specific transformations
– Well studied for their effectiveness on special data types such as text, images
– e.g.,TF-IDF transformation, SIFT features
• Dimensionality reduction
– High dimensional features (e.g., text) can lead to “overfitting”, but retaining
only some dimensions may be sub-optimal
– Informative low dimensional approximation of the raw feature
– e.g., PCA, clustering
Basic Conversions: Categorical Features
• One-Hot Encoding
– Converts a categorical feature with K values into a binary vector of size K−1
– Just a representation to enable use in linear models
Product_vertical    isBook    isMobile
Handset             0         0
Book                1         0
Mobile              0         1
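A minimal pandas sketch reproducing the table above; dropping one of the K indicator columns (here the Handset one) gives the K−1 representation described:

```python
import pandas as pd

df = pd.DataFrame({"Product_vertical": ["Handset", "Book", "Mobile"]})

# Full one-hot encoding (K columns); drop one column to get the K-1 dummy coding.
encoded = pd.get_dummies(df["Product_vertical"], prefix="is")
encoded = encoded.drop(columns=["is_Handset"]).astype(int)
print(encoded)
#    is_Book  is_Mobile
# 0        0          0
# 1        1          0
# 2        0          1
```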
Basic Conversions: High Dimensional Text-like Features
Sparse Matrix Encoding
Text features:
• each feature value snippet is split into tokens (dimensions)
• Bag of tokens → a sparse vector of "counts" over the token vocabulary
• Single text feature → sparse matrix with #columns = vocabulary size
Other high dimensional features:
• similar process via map from raw features to a bag of dimensions
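A minimal scikit-learn sketch of the sparse bag-of-words encoding (and its TF-IDF variant) on toy text values:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["great phone great battery", "book arrived damaged"]   # toy text feature values

bow = CountVectorizer()
X_counts = bow.fit_transform(docs)       # sparse matrix: #columns = vocabulary size
print(bow.get_feature_names_out())       # token vocabulary (dimensions)
print(X_counts.toarray())                # bag-of-token counts per snippet

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)      # corpus-wide TF-IDF normalization
```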
Non-linearity: Numeric Features
• Non-linear functions of features or target (in regression)
– Log transformation, polynomial powers, Box-cox transforms
– Useful given additional knowledge on the feature-target relationship
• Binning
– Numeric feature → categorical one with #values = #bins
– Results in more weights (K-1 for K bins) in linear models instead of just
one weight for the raw numeric feature
– Useful when the feature-target relation is non-linear or non-monotonic
– Bins: equal ranges, equal #examples, maximize bin purity (e.g. entropy)
Numeric Feature Binning [3]
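A minimal scikit-learn sketch of binning a numeric feature, with equal-#examples ("quantile") bins one-hot encoded so a linear model gets one weight per bin; the feature values are toy data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

order_value = np.array([[12.0], [55.0], [80.0], [230.0], [990.0]])   # toy numeric feature

binner = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="quantile")
binned = binner.fit_transform(order_value)
print(binner.bin_edges_)   # learned bin boundaries
print(binned)              # one indicator column per bin
```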
Interaction Features
• Required when features influence on target is not purely additive
– linear combinations of features won’t work
Example: Order with 50% discount on mobiles is much more likely to indicate
fraud than a simple combination of 50% discount or mobile order.
Common Interaction Features:
• Non-linear functions of two or more numeric features, e.g., products & ratios
• Cross-products of two or more categorical features
• Aggregates of numerical features corresponding to categorical feature values
• Tree-paths: use leaves from decision trees trained on a smaller sample
Categorical-Categorical Interaction Features
Numerical-Categorical Interaction Features
• Compute aggregates of a numeric feature corresponding to each value
of a categorical feature
• New interaction feature → a numeric one obtained by replacing the categorical feature value with the corresponding numeric aggregate
• e.g., brand_id → brand_avg_rating, brand_avg_return_cnt
• Especially useful for categorical features with high cardinality (>50)
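A minimal pandas sketch of such an aggregate feature; brand_id and rating are hypothetical columns, and the aggregate is computed on the training split only to avoid leakage:

```python
import pandas as pd

train = pd.DataFrame({
    "brand_id": [1, 1, 2, 2, 3],
    "rating":   [4.0, 5.0, 2.0, 3.0, 4.5],
})

# Per-brand average rating, used to replace the high-cardinality brand_id.
brand_avg_rating = train.groupby("brand_id")["rating"].mean().rename("brand_avg_rating")
train = train.join(brand_avg_rating, on="brand_id")

# At prediction time, map brand_id through the same lookup (with a fallback
# value for brands unseen in training).
```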
Tree Path Features
• Learn a decision tree on a small data sample with raw features
• Paths to the leaves are conjunctions constructed from conditions
on multiple raw features
• Highly informative with respect to the target.
Domain-Specific Transformations
Text Analytics and Natural Language Processing
• Stop-words removal/Stemming: Helps focus on semantics
• Removing high/low percentiles: Reduces features w/o loss in predictive power
• TF-IDF normalization: Corpus wide normalization of word frequency
• Frequent N-grams: Capture multi-word concepts
• Parts of speech/Ontology tagging: Focus on words with specific roles
Web Information Extraction
• Hyperlinks, Separating multiple fields of text (URL, in/out anchor text, title, body)
• Structural cues: XPaths/CSS; Visual cues: relative sizes/positions of elements
• Text style (italics/bold, font-size, colors)
Image Processing
• SIFT features, Edge extractors, Patch extractors
Dimensionality Reduction
• Clustering along feature values
– K-means variants (along feature values)
• Low rank matrix factorization
– Principal Component Analysis (PCA)
– Non-negative Matrix Factorization (NNMF)
• Topic models
– Latent Dirichlet Allocation (LDA)
– Probabilistic Latent Semantic Analysis (PLSA)
• Neural embeddings
– Word2Vec(Skip-gram), Para2Vec
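A minimal scikit-learn sketch of two of these options on toy data: PCA for dense numeric features and TruncatedSVD for sparse TF-IDF text features:

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Dense numeric features: low-rank approximation via PCA.
X_dense = np.random.rand(300, 50)
X_low = PCA(n_components=10).fit_transform(X_dense)

# Sparse high-dimensional text features: TruncatedSVD works directly on sparse matrices.
docs = ["great phone great battery", "book arrived damaged", "fast delivery"]
X_text = TfidfVectorizer().fit_transform(docs)
X_text_low = TruncatedSVD(n_components=2).fit_transform(X_text)

print(X_low.shape, X_text_low.shape)
```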
Feature Selection
Key Idea: Sometimes “Less (features) is more (predictive power)”
Motivating reasons:
• To improve generalization
• To meet prediction latency or model storage constraints(for some applications)
Broadly, three classes of methods:
• Filter or Univariate methods, e.g., information-gain filtering
• Wrapper methods, e.g., forward search
• Embedded methods, e.g., regularization
Feature Selection: Filter or Univariate Methods
• Goal: Find the “top” individually predictive features
– “predictive”: specified correlation metric
• Ranking by a univariate score
– Score features via an empirical statistical measure that corresponds to
predictive power w.r.t. target
– Those above a cut-off (count, percentile, score threshold) are retained
Note:
• Fast, but highly sub-optimal since features evaluated in isolation
• Independent of the learning algorithm.
• e.g., Chi-squared test, information gain, correlation coefficient
Feature-Target Correlation
• Mutual information: Captures correlation between categorical
feature (X) and class label (Y)
• p(x, y): fraction of examples with X = x and Y = y
• p(x), p(y): fraction of examples with X = x and with Y = y, respectively

$$I(X,Y) = \sum_{x \in \mathrm{sup}(X)} \sum_{y \in \mathrm{sup}(Y)} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$
Feature-Target Correlation
• Pearson’s correlation coefficient: Captures linear relationship between
numeric feature (X) and target value (Y)
• X_i, Y_i: value of X, Y in the i-th instance; \bar{X}, \bar{Y}: mean of X, Y

$$\rho(X,Y) = \frac{\mathrm{cov}(X,Y)}{\sigma_X\,\sigma_Y} = \frac{\sum_i (X_i-\bar{X})(Y_i-\bar{Y})}{\left(\sum_i (X_i-\bar{X})^2\right)^{1/2}\left(\sum_i (Y_i-\bar{Y})^2\right)^{1/2}}$$

• Covariance matrix: captures correlations between every pair of features
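A minimal sketch of computing these correlation measures with scikit-learn and scipy on toy data:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((500, 3))                                    # toy numeric features
y = (X[:, 0] + 0.1 * rng.random(500) > 0.5).astype(int)     # toy binary target

# Mutual information between each feature and the class label.
print(mutual_info_classif(X, y, random_state=0))

# Pearson correlation coefficient between one feature and the target.
rho, _ = pearsonr(X[:, 0], y)
print(rho)

# Correlation matrix across all feature pairs.
print(np.corrcoef(X, rowvar=False))
```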
Feature Selection: Wrapper Methods
• Goal: Find the “best” subset from all possible subsets of input features
– “best” : specified performance metric & specified learning algo
• Iterative search
– Start with an initial choice (e.g., entire set, random subset)
– Each stage: find a better choice from a pool of candidate subsets.
Note:
• Computationally very expensive
• e.g., Backward search/Recursive feature elimination, Forward search
Feature Selection: Embedded Methods
• Identify predictive features while the model is being created itself
• Penalty methods: learning objective has an additional penalty term
that pushes the learning algorithm to prefer simpler models
• Good trade-off in terms of the optimality & computational costs
• e.g., Regularization methods (LASSO, Elastic Net, Ridge Regression)
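A minimal scikit-learn sketch of embedded selection via an L1 penalty on toy data (the regularization strength C is an illustrative choice):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((400, 20))
y = (X[:, 0] - X[:, 1] > 0).astype(int)     # only the first two features matter

# L1-penalized logistic regression drives uninformative weights to zero;
# SelectFromModel keeps the features with non-zero weights.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
selector = SelectFromModel(l1_model).fit(X, y)

print(selector.get_support())               # boolean mask of selected features
X_selected = selector.transform(X)
```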
Model Training (Recap)
Key elements in learning algorithms:
• Model class, e.g., linear models, decision trees, neural networks
• Loss function, e.g., logistic loss, hinge loss
• Optimization algorithm, e.g., gradient descent, & assoc. params
Lot of algorithms & hyper-parameters to choose from !
Scikit-learn Guide
Model Choice: Classification
Primary factors:
• High #data instances (> 10 MM)
  – Linear models with online learning (SGD)
• High #features/#examples ratio (> 1)
  – Linear models: aggressive (L1) regularization
  – Linear models: dimensionality reduction
  – Naïve Bayes (homogeneous independent features)
• Need non-linear interactions
  – Kernel methods (e.g., Gaussian SVM)
  – Tree ensembles (e.g., GBDT, RF)
  – Deep learning methods (e.g., CNNs, RNNs)
Model Choice: Regression
Primary factors:
• High #data instances (> 10 MM)
  – Linear models with online learning (SGD)
• High #features/#examples ratio (> 1)
  – Linear models: aggressive (L1) regularization
  – Linear models: dimensionality reduction
• Need non-linear interactions
  – Kernel methods (e.g., Gaussian SVR)
  – Tree ensembles (e.g., GBDT, RF)
  – Deep learning methods (e.g., CNNs, RNNs)
Model Evaluation & Diagnostics
Model Evaluation:
• Train error: Estimate of the expressive power of the model/algorithm relative
to training data
• Test error: A more reliable estimate of likely performance on “unseen” data
Post evaluation: What is the right strategy to get a better model ?
• 1) Get more training data instances
• 2a) Get more features or construct more complex features
• 2b) Explore more complex models/algorithms
• 3a) Drop some features
• 3b) Explore simpler models/algorithm
Overfitting
• Overfitting: Model fits training data well (low training error) but does not
generalize well to unseen data (poor test error)
• Complex models with large #parameters capture not only good patterns (that
generalize), but also noisy ones
[Plot: Y vs. X with actual and predicted curves; the overfit model shows high prediction error on unseen data]
Underfitting
• Underfitting: Model lacks the expressive power to capture target distribution
(poor training and test error)
• Simple linear model cannot capture target distribution
Bias &Variance
• Bias of algo: Difference between the actual target and the avg. estimated target
where averaging is done over models trained on different data samples
• Variance of algo: Variation in predictions of models trained on diff. data samples
Model Complexity: Bias &Variance
• Simple learning algos with small #params → high bias & low variance
  – e.g., linear models with few features
  – Reduce bias by increasing model complexity (adding more features)
• Complex learning algos with large #params → low bias & high variance
  – e.g., linear models with sparse high-dimensional features, decision trees
  – Reduce variance by increasing training data & decreasing model complexity (feature selection)
Validation Curve
Prediction performance vs. model complexity parameter
[Plot annotations: overfitting region; optimal choice]
Ideal choice: match of complexity between the learning algorithm and the training data.
Learning Curve
Prediction performance vs. number of training examples
Ideal choice: early portion of the flat region.
Common Evaluation Metrics
Standard evaluation metrics exist for each class of predictive learning scenarios
– Binary Classification
– Multi-class & Multi-label Classification
– Regression
– Ranking
• Loss function used in training objective is just one choice of evaluation metric
– Usually picked because the learning algorithm is readily available
– Might be a good, but not necessarily ideal choice from business perspective
• Critical to work backwards from business metrics to create more meaningful
metrics
Classification – Making Predictions
• Score using customer order features to create a rank order from low to high certainty
• Customer orders: blues are not fraudulent (P), reds are fraudulent (N)
• Operational decision point: thresholding on the score (user has to choose!)
Classification – Operational Point Evaluation Metrics
• For each threshold, confusion matrix for binary classification of P vs. N
• Precision = TP/(TP+FP): How correct are we on the ones we predicted P?
• Recall = TP/(TP+FN): What fraction of actual P's did we predict correctly?
• True Positive Rate (TPR) = Recall
• False Positive Rate (FPR) = FP/(FP+TN): What fraction of N's did we predict wrongly?

                Actual P    Actual N
Predicted P     TP          FP
Predicted N     FN          TN
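A minimal scikit-learn sketch of these operational-point metrics on toy labels and scores:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.65, 0.4, 0.35, 0.1, 0.8, 0.55])

threshold = 0.5                                # the operational point the user chooses
y_pred = (y_score >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)
print(precision_score(y_true, y_pred))         # TP / (TP + FP)
print(recall_score(y_true, y_pred))            # TP / (TP + FN)
print(roc_auc_score(y_true, y_score))          # threshold-free comparison across operational points
```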
Tradeoff Curve
[Plot: % cumulative frauds (true positive rate) vs. % cumulative non-frauds (false positive rate); a 90% level is marked]
Receiver Operating Characteristic (ROC) Curve
• ROC curve: plot of TPR vs. FPR
• AUC: area under the ROC curve
  – Perfect classifier: AUC = 1
  – Random classifier: AUC = 0.5
  – Odds of scoring P > N
  – Effective for comparing learning algorithms across operational points
• Operational point:
  – Maximize (TPR − FPR), F1-measure
  – Other business-driven choices
Precision–Recall Curve
[Plot: precision vs. recall, marking high-precision and high-recall regions; the curve is noisy for small datasets in part of the range]
Classification: Picking an Operational Point
• Binary classification: the score threshold corresponds to the operational point
• Application-specific bounds on precision or recall
  – Maximize recall (or precision) with a lower bound on precision (or recall)
• Application-specific misclassification cost matrix
  – Optimize overall misclassification cost (TP·C_TP + FP·C_FP + TN·C_TN + FN·C_FN)
  – Reduces to standard misclassification error when C_TP = C_TN = 0 and C_FP = C_FN = 1

                Actual P    Actual N
Predicted P     C_TP        C_FP
Predicted N     C_FN        C_TN
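A minimal sketch of picking the operational point from such a cost matrix by sweeping thresholds; the cost values are illustrative placeholders:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def total_cost(y_true, y_score, threshold, c_fp=40.0, c_fn=120.0, c_tp=0.0, c_tn=0.0):
    """Overall misclassification cost at a given score threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp * c_tp + fp * c_fp + tn * c_tn + fn * c_fn

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.65, 0.4, 0.35, 0.1, 0.8, 0.55])

# Sweep candidate thresholds and keep the one with the lowest expected cost.
thresholds = np.linspace(0.05, 0.95, 19)
best = min(thresholds, key=lambda t: total_cost(y_true, y_score, t))
print(best)
```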
Regression – Evaluation Metrics
• Metrics when regression is used for predicting target values
  – Root Mean Square Error (RMSE):
    $$\mathrm{RMSE} = \left(\frac{1}{N}\sum_i \left(Y_i - F(X_i)\right)^2\right)^{1/2}$$
  – R²: How much better is the model compared to just picking the best constant?
    R² = 1 − (Model Mean Squared Error / Variance)
  – MAPE (Mean Absolute Percent Error):
    $$\mathrm{MAPE} = \frac{1}{N}\sum_i \left|\frac{Y_i - F(X_i)}{Y_i}\right|$$
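A minimal sketch computing these metrics on toy predictions (the MAPE line assumes no zero targets):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.5, 2.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)                        # 1 - MSE / variance
mape = np.mean(np.abs((y_true - y_pred) / y_true))   # assumes no zero targets
print(rmse, r2, mape)
```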
Model Fine-tuning
• Lot of algorithms and hyper-parameters (e.g., learning rate) to choose from
– Infeasible to explore all choices
• Practical solution approach
– Narrow down a few suitable algorithms from meta data (size/attribute types)
– For each chosen algorithm, systematically explore hyper-parameter choices
• Alternate optimization
• Exhaustive grid search
• Bayesian optimization (e.g., Spearmint, MOE)
• Each exploration: learning a model on a train split & evaluating on a test split.
• Preferred split mechanism: k-fold cross-validation
• Best hyperparameter choices based on the test (cross-validation) error
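A minimal scikit-learn sketch of exhaustive grid search with k-fold cross-validation, on placeholder data and an illustrative gradient-boosting grid:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X = np.random.rand(500, 8)
y = np.random.randint(0, 2, 500)                 # placeholder data

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",      # offline metric used to pick hyper-parameters
    cv=5,                   # k-fold cross-validation splits
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```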
Multiple stages of optimization
• Objective: Find f(.) to optimize some cost L(Yunseen, f(Xunseen))
• ML Methodology:
– (Step 1) Model Learning: Determine good choices of f(.) that optimize LA(Ytrain, Xtrain) for different
choices of hyperparameters and algorithms
– (Step II) Hyperparameter Fine-tuning : Among choices in Step I, pick the one that optimizes
LB(Yeval-split, Xeval-split)
– (Step III) Operational choices (e.g., score thresholding, output calibration): For the choice in Step II,
determine the operational choices so as to optimize LC(Yop-split, Xop-split)
• Note:
– Ideal choice is to have L=LA = LB = LC and the data splits to be i.i.d., but not always possible,
• e.g., need L =max. recall for precision >90%, but LA = logistic loss & LB = Area under ROC
– Preferable to choose intermediate metrics that are “close” to the desired business metric and
robust off-the-shelf implementations
Building an Internal ML Platform
Typical ML Production System
[System diagram components: raw data; data fetch + aggregation; offline modeling (producing models, configs, reports); production re-training; production models; production scoring; A/B bucketing; business logic optimization; the environment generating (stimulus, action, outcome); data collection & attribution; data monitoring & A/B tests; dashboards, alerts, A/B results]
Offline Modeling
[Diagram components: raw data sources; data fetch + aggregation (config); interactive data analysis; model learning; model evaluation; outputs: models, configs, reports]
Existing ML Platform Utilities
• Open source packages
  – Free and flexible
  – Gaps in functionality
• Managed services, e.g., Google Cloud ML
  – Not cost effective for large companies
  – Need to move data to external clouds
Large companies need an internal ML platform to make up for the gaps!
Primary Challenges
• Fast error-proof productionization
• Scalability vs. flexibility trade-off
• Reusability & extensibility of modeling effort
• Management of offline modeling experiments
• Interactive monitoring of modeling experiments
Challenge: Road to Productionization
• Long slow road to delivery for each new application, Very little reuse across applications
• POC code → Production code translation is highly error prone
• Rigorous evaluation & debugging of actual production systems is unlikely since these tasks
are owned by dev ops folks and data scientists don’t understand production code
[Diagram: Product Manager (app requirements & metrics) → Data Scientist (PoC modeling in R/Python) → Software Engineers (production code) → Dev Ops]
Solution: Self-contained “Models”
Data scientists
• build application-specific configurations for data collection & modeling
• ship self-contained production “models” (i.e.,“model”, configurations,
library dependencies) say via Docker (not POC code !)
Software engineers
• build application-agnostic* production code & systems for automation of
data collection, model scoring, re-training, evaluation, etc.
[Diagram: Data Scientist ships self-contained models → Software Engineers build application-agnostic* production code → Dev Ops]
* Need to consider data scale and latency for scoring & retraining, which have some dependency on the application
Solution: Self-contained “Models”
• ML packages such as scikit-learn, spark-mllib, Keras allow for an easy
serialization of the entire processing pipeline (i.e., preprocessing, feature-
engineering, scoring) along with the fitted parameters as a single “model”
that can be exported to be used for scoring.
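A minimal sketch of this export pattern with scikit-learn and joblib; the file name and pipeline contents are illustrative:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 4)
y = np.random.randint(0, 2, 200)                 # placeholder data

# The whole processing pipeline (preprocessing + estimator) is one "model".
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
]).fit(X, y)

joblib.dump(model, "fraud_model.joblib")         # shipped alongside configs & dependencies

loaded = joblib.load("fraud_model.joblib")       # e.g., inside the production scorer
print(loaded.predict_proba(X[:3]))
```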
Challenge: Scalability vs. Flexibility Trade-off
• Scalability requirements vary across applications
• Factors to consider
– Size of training data
– Frequency of retraining
– Rate of arrival of prediction instances and latency bounds (in case of
online predictions)
– Size of batch and frequency of scoring (in case of batch predictions)
• Data scientists prefer to train models on single machines where
possible
Solution: Support Multiple Choices
• Moderate scale for training & prediction
– train models on a single machine (in Python/R);
– export model as is to multiple machines with the same image and predict in
parallel
• Moderate scale for training, but large prediction scale
– train models on a single machine (e.g., spark-Mllib in Python/R);
– export model to a different environment (e.g., Scala/Java ) that allows more
efficient parallelization.
• Large scale for both training & prediction
– train models and predict on a cluster (e.g., via sparkit-learn, PySpark, or Scala)
Challenge: Reusability & Extensibility of Modeling Effort
• ML workflows are more than just the “models pipeline”
– e.g., data fetch/aggregation from multiple sources, evaluation across
multiple models, exploratory data analysis
• Offline modeling code (notebooks) tends to get dirty fast
– Mix of interactive analysis (specific to application) and processing of data
• Common approach to reuse
– limited use of libraries + cut & paste code
Example Workflow: Data Fetch + Aggregation
[Workflow diagram: data sources 1–3 → data reader → data aggregation → data writer → consolidated data file(s); each step is backed by read/aggregation/write utilities from shared libraries and driven by the read/aggregation/write sections of a data aggregation config]
Example Workflow: Model Learning
[Workflow diagram: consolidated data → data splitter (data split/sampling config) → target constructor (target config) → feature pipeline setup (feature config) → model setup (model config) → hyper-parameter search (HP search config) → predict & eval (eval config, eval metrics) → model & report; backed by libraries of filters/splitters/samplers, transformers, and learning algos, driven by a learning config]
Example Workflow: Model Evaluation
[Workflow diagram: labeled data → feature pipeline setup & model setup (pre-trained feature/model config) → predict → eval (eval config) → eval metrics & eval reports; backed by libraries of transformers and learning algos, driven by an evaluation config]
Example Workflow: Model Scoring
[Workflow diagram: unlabeled data → feature pipeline setup & model setup (pre-trained feature/model config) → predict → predictions; backed by libraries of transformers and learning algos, driven by a prediction config]
Solution: Workflow Abstractions
• Each workflow is represented as a DAG over nodes
– DAGs can be encoded as YAML or JSON files
• Each node is a computational unit with the following elements
– name
– environment of execution (e.g., python/scala)
– actual function to be executed (via a link to existing module, class,
method)
– inputs (with default choices) and outputs
– tags to aid discovery
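A hypothetical sketch of what such a node/DAG declaration might look like; the schema, module paths, and names below are made up for illustration and could equally be serialized to YAML or JSON:

```python
import json

# Hypothetical workflow DAG: two nodes, each a computational unit with a name,
# execution environment, function reference, inputs/outputs, and tags.
workflow = {
    "name": "reseller_fraud_training",
    "nodes": [
        {
            "name": "data_splitter",
            "environment": "python",
            "function": "mlplatform.splitters.time_based_split",   # hypothetical module path
            "inputs": {"data": "consolidated_orders", "cutoff": "2018-01-01"},
            "outputs": ["train_split", "test_split"],
            "tags": ["sampling"],
        },
        {
            "name": "model_learning",
            "environment": "python",
            "function": "mlplatform.learners.gbdt_train",          # hypothetical module path
            "inputs": {"data": "train_split", "config": "gbdt_config_v1"},
            "outputs": ["model", "train_report"],
            "tags": ["training"],
        },
    ],
}
print(json.dumps(workflow, indent=2))
```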
Solution: Workflow Abstractions
• Wrapper libraries allow hooks to existing ML packages (sklearn,
keras, etc) via nodes
• Properly indexed repositories of workflow DAGs, nodes and
node-configurations allow discovery and reuse
• Editing tools for composing DAGs enable extensibility
ML Workflow-centric Architecture
[Diagram components: ML workflow library with discover, edit, and contribution tools; orchestrator; deployment engine; physical computing resources]
• The orchestrator assembles the composition and manages the deployment with the help of the deployment manager
• Physical computing resources provide the execution environment
• Discover and deploy: search the library for workflows meeting certain criteria and deploy them
• Edit & experiment: take an existing ML workflow, create a new one by making some edits (mostly data, config parameters), experiment with it, and publish it
• Create & contribute: create entirely new library functions, nodes, and possibly workflow DAGs, and add them to the repository
Challenge: Management of Experiments
• Manual tracking of experimental results requires considerable
effort and is error-prone
• Low reproducibility and auditability of offline modeling
experiments
Solution: Automated Repositories of ML Entities
• Run: execution of a workflow
– consumes datasets and configurations as inputs and generates models,
reports and new datasets as outputs
– organizes all the inputs/outputs and intermediate results in an
appropriate directory structure
• Automatically updated versioned repositories
– workflow DAGs, nodes, configs
– runs, datasets, models, reports
• Post each run, the repositories are automatically updated with the
appropriate linkages between the different entities
Challenge: Interactive Monitoring of Experiments
• Interactive execution of experiments → messy code
Solution: Read-only monitoring
• Additional layer that allows workflow DAGs to be executed one
step at a time and outputs to be examined from an interactive tool
(e.g., Jupyter notebooks)
– run_node(), load_input(), load_output()
• Cloning of intermediate inputs & outputs on demand so that these
can be analyzed without affecting the original run
– Changes to the actual run have to be explicitly made via workflow
DAGs, configs
Key Tenets for Real-world ML Applications
Key Tenets for Real-world ML applications
Design phase:
• Work backwards from the application use case
– ML problem formulation & evaluation metrics aligned with business goals
– Software stack/ML libraries based on scalability/latency/retraining needs
• Keep the ML problem formulation simple (but ensure validity)
– Understand assumptions/limitations of ML methods & apply them with care
– Should enable ease of development, testing, and maintenance
Key Tenets for Real-world ML applications
Modeling phase:
• Ensure data is of high quality
– Fix missing values, outliers, target leakages
• Narrow down modeling options based on data characteristics
– Learn about the relative effectiveness of various preprocessing, feature engineering,
and learning algorithms for different types of data.
• Be smart on the trade-off between feature engg. effort & model complexity
– Sweet spot depends on the problem complexity, available domain knowledge, and
computational requirements;
• Ensure offline evaluation is a good “proxy” for the “real unseen” data
evaluation
– Generate train/test splits similar to how it would be during deployment
Key Tenets for Real-world ML applications
Deployment phase:
• Establish train vs. production parity
– Checks on every possible component that could change
• Establish improvement in business metrics before scaling up
– A/B testing over random buckets of instances
• Trust the models, but always audit
– Insert safe-guards (automated monitoring) and manual audits
• View model building as a continuous process not a one-time effort
– Retrain periodically to handle data drifts & design for this need
Don’t adopt Machine Learning because of the hype !
Thank You !
Happy Modeling !
Contact: srujana@gmail.com
Useful References
• Google AI Course
  https://ai.google/education/#?modal_active=none
• Rules of Machine Learning: Best Practices for ML Engineering
  http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
• What's your ML Test Score? A rubric for ML production systems
  https://www.eecs.tufts.edu/~dsculley/papers/ml_test_score.pdf
• Practical advice for analysis of large, complex data sets
  http://www.unofficialgoogledatascience.com/2016/10/practical-advice-for-analysis-of-large.html
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

ML Basics

  • 10. Why not Machine Learning? • Low problem complexity – e.g., classes are well separated in the chosen representation, so simple rules suffice • Lower interpretability – complex models are harder to explain than simple rules, which matters when they drive critical decisions (e.g., health care) • Lag time when data still has to be collected, or bias in the available data – especially when more holistic domain knowledge already exists • Expensive modeling effort
  • 11. Broad Areas of Machine Learning • Supervised Learning: predict new data based on observed data • Unsupervised Learning: detect latent structure in the data • Reinforcement Learning: adapt behavior to optimize long-term goals using observed rewards
  • 12. Typical Predictive Modeling Problem Given: An input object with some features X (covariates/independent variables) Goal: Predict a new target/label attribute Y (response/dependent variable) Note: – Input & output attributes (X,Y) can be simple (e.g., numeric or categorical values/vectors) or have a complex structure (e.g., time-series, text sequences) – Classification (Y is categorical), Regression (Y is numeric/ordinal)
  • 13. Shipping Logistics 3/28/18 13 Given: a customer order and seller details Predict: expected shipping time
  • 14. Product Catalog Management 3/28/18 14 Given: a new product Predict: product category it should be placed in
  • 15. Product Recommendations 3/28/18 15 Given: a user, current context & a candidate product Predict: preference of user for the product
  • 16. Advertising 3/28/18 16 Given: a user, search query and a candidate product ad Predict: expected click through rate of user for the product ad
  • 17. Many more applications ! • Advertising • Product search and browse experience • Forecasting product demand and supply • Product pricing/competitor monitoring • Understanding customer profiles and lifetime value • Detecting seller and customer fraud • Enriching product catalog & review content • … 3/28/18 17
  • 18. Typical Predictive Modeling Problem Given: An input object with some features X (covariates/independent variables) Goal: Predict a new target/label attribute Y (response/dependent variable) Note: – Input & output attributes (X,Y) can be simple (e.g., numeric or categorical values/vectors) or have a complex structure (e.g., time-series, text sequences)
  • 19. Supervised Learning: Key Assumptions • Training data with correct input-output pairs (X,Y) • Data samples in both train and unseen data are generated in the same way (i.i.d.) [Figure: example orders labeled Fraud / Not Fraud]
  • 20. Supervised Learning Training: Given training examples {(Xi,Yi)} where Xi is input features and Yi the target variable, learn a model F to best fit the training data (i.e., Yi ≈ F(Xi) for all i)
  • 21. Supervised Learning Training: Given training examples {(Xi,Yi)} where Xi is input features and Yi the target variable, learn a model F to best fit the training data (i.e., Yi ≈ F(Xi) for all i) Prediction: Given a new sample X with unknown Y, predict it using F(X)
  • 22. Key Elements of a Supervised Learning Algorithm Training: Find a “good” model f from the training data!
  • 23. Key Elements of a Supervised Learning Algorithm Training: Find a “good” model f from the training data! • What is an allowed “model”? – A member of a model class H, e.g., linear models • What is “good”? – Accurate predictions on training data in terms of a loss function L, e.g., squared error (Y − F(X))² • How do you “find” it? – An optimization algorithm A, e.g., gradient descent
  • 24. Key Elements of a Supervised Learning Algorithm Training: Apply algorithm A to find a model from the class H that optimizes a loss function L on the training data D • H: model class, L: loss function, A: optimization algorithm • Different choices lead to different models on the same data D
  • 25. ML Application Development Life Cycle 3/28/18 25
  • 38. Machine Learning Problem Definition Business Problem: Optimize a decision process to improve business metrics • Sub-optimal decisions due to missing information • Solution strategy: predict missing information from available data using ML
  • 39. Machine Learning Problem Definition Business Problem: Optimize a decision process to improve business metrics • Sub-optimal decisions due to missing information • Solution strategy: predict missing information from available data using ML Example: Reseller fraud • Business objective: Limit fraud orders to increase #users served and reduce return shipping expenses. • Decision process: Add friction to orders via disabling cash on delivery (COD) • Missing information relevant to the decision: – Likelihood of the buyer reselling the products – Likely return shipping costs – Unserved demand for the product
  • 40. Key elements of a ML Prediction Problem • Instance definition • Target variable to be predicted • Input features • Sources of training data • Modeling metrics (Online/Offline, Primary/Secondary) • Deployment constraints
  • 41. Instance Definition • Is it the right granularity from the business perspective? • Is it feasible from the data collection perspective ?
  • 42. Instance Definition Multiple options • a customer • a purchase order spanning multiple products • a single product order (quantity can be >1 though)
  • 43. Instance Definition Multiple options • a customer • a purchase order spanning multiple products • a single product order (quantity can be >1 though) [preferred choice] Why? • Reselling behavior is also at a single product level • COD presented per product not per entire purchase • Blocking a customer on all his orders can even hurt 3/28/18 43
  • 44. Target Variable to be Predicted • Can we express the business metrics (approximately) in terms of the prediction quality of the target? • Will accurate predictions of the target improve the business metrics substantially?
  • 45. Potential Business Impact • Will accurate predictions of the target improve the business metrics substantially? • Compute business metrics for each case – Ideal scenario (perfect predictions on target) – Dumb baseline (default predictions, e.g., majority class) – Simple predictor with rules/domain knowledge – Existing solution (if one exists) – Likely scenario with a reasonable-effort ML model • Assess effort vs. benefits
  • 46. Target Variable to be Predicted • Can we express the business metrics (approximately) in terms of the prediction quality of the target? • Will accurate predictions of the target improve the business metrics substantially? • What is the data collection effort? – manual labeling costs • Is it possible to get high quality observations? – uncertainty in the definition, noise or bias in the labeling process
  • 47. Target Variable Multiple options • Likelihood of buyer reselling the current order • Number of unserved users because of the current order • Expected return shipping expenses for the current order
  • 48. Target Variable Multiple options • Likelihood of buyer reselling the current order [compromise choice] • Number of unserved users because of the current order • Expected return shipping expenses for the current order Why? • The last two choices are better in terms of business metrics, but data collection is difficult • The first choice makes data collection easy (esp. as a binary label) and addresses the business metrics in a reasonable, but slightly suboptimal way
  • 49. Input features • Is the feature predictive of the target ? • Are the features going to be available in production setting ? – Need to define exact time windows for features based on aggregates – Watch out for time lags in data availability – Be wary of target leakages (esp. conditional expectations of target ) • How costly is to compute/acquire the feature ? – Might be different in training/prediction settings
  • 50. Input Features Reselling vs. Non-reselling indicators • High product discount • High order quantity relative to other orders of same product – Normalize by median/mean to get relative values • More for some products/verticals – Product/vertical id can be used • Buyer being a business store in product category – Buyer’s category purchase count – Buyer being a business store 3/28/18 50
  • 51. Input Features Reselling vs. Non-reselling indicators • High product discount [feasible] • High order quantity relative to other orders of same product – Normalize by median/mean to get relative values [with lag] • More for some products/verticals – Product/vertical id can be used [feasible] • Buyer being a business store in product category – Buyer’s category purchase count [with lag] – Buyer being a business store [expensive join with external info] 3/28/18 51
  • 52. Sources of Training Data • Is the distribution of training data similar to production data? – e.g., if production data evolves over time, can the “training data” be adjusted accordingly ? • Are there systemic biases (e.g., data filters) in training data? – Adjust the scope of prediction process so that it matches with the training data setting
  • 53. Sources of Training Data Historical order data – input features are available, but target is missing Target observations – Manual labeling on a random subset after focused investigations on the address and the customer purchase history. – Improve labeling efficiency by filtering by order quantity and apply same filtering in production 3/28/18 53
  • 54. Modeling Metrics • Online metrics are measured on a live system – Can be defined directly in terms of the key business metrics – typically measured via A/B tests and these metrics are potentially influenced by a number of factors (e.g., net revenue) • Offline metrics are meant to be computed on retrospective “labeled” data – typically measured during offline experimentation and more closely tied to prediction quality (e.g., area under ROC curve) • Primary metrics are ones that we are actively trying to optimize – e.g., losses due to fraud • Secondary metrics are ones that can serve as guardrails – e.g., customer base size 3/28/18 54
  • 55. Offline Modeling Metrics • Does improvement in offline modeling metrics result in gains in online business metrics? Model quality: – A) Maximize coverage of fraud orders at a certain level of certainty (>90%) – B) Binary target: four decision possibilities • Maximize average payoff in terms of expected return costs given the different possibilities. Return pay-offs: (Actual Fraud, Predicted Fraud) = 0; (Actual Fraud, Predicted Not Fraud) = −avg. return costs; (Actual Not Fraud, Predicted Fraud) = −avg. lost-order costs; (Actual Not Fraud, Predicted Not Fraud) = 0
  • 56. Deployment Constraints • What are the application latency & hardware constraints ? Computational constraints: – Orders per sec, allowed latency for COD disable action – Available processing power, memory
  • 57. Problem Formulation Template 3/28/18 57 • Template(s) with questions on all the key elements – Listing of possible choices – Reason for preferred choice • Populated for each project by product manager + ML expert
  • 58. Exercise: ML Problem Definition Good choices for target variable, features & other elements ? – Predicting shipping time for an order – Forecasting the demand for different products – Determining the nature of a customer complaint – Predicting customer preference for a product 3/28/18 58
  • 61. Motivation 3/28/18 61 • Early detection & prevention of common data related errors • Reproducibility • Auditability • Robustness to failure in data fetch pipelines
  • 62. Data Elements of Interest 3/28/18 62 • Instance identifiers • Target variables • Input features • Other factors useful for evaluating online/offline metrics Fields to specify for each variable of interest • ID, Name, Version • Modeling role, Owner, Description, Tags
  • 63. Definitions 3/28/18 63 Three possible copies for same variable based on the stages • Offline training, Offline evaluation, Deployment Fields to specify for each variable of interest for each stage • Precise definition (query + source for raw ones or formula for derived ones) • Data type, value check conditions • Units/Level sets • Is Aggregate? , Exact aggregation set or time window • Missing or invalid value indicators, reasons, mitigations (e.g., div by 0 for ratios) • First creation date • Known quality issues
  • 64. Review Criteria 3/28/18 64 • Unambiguous definitions to allow for ready implementation • Parity across different stages (training/evaluation/deployment) – Definitions – Data type, value checks, units, level sets – Aggregation windows – Missing/invalid value handling of derived variables
  • 65. Review Criteria • Is the input X to target Y map invariant across stages? – Do definitions drift with time? (Use averages, not sums, in general) • e.g., customer spend to date in books → order fraud status – Do we have the correct feature snapshot of X for Y? • e.g., customer loyalty category (from when?)
  • 66. Review Criteria 3/28/18 66 • Common data leakages – Unintentional peeking into future, target, or any kind of unobserved variables – Ambiguously specified aggregates, e.g., customer revenue till the “most recent” order ; interpretation can be different in training data and deployment settings because of delays in data logging – Time-varying features for which only certain (or recent) snapshots are available, e.g., marital status of the customer
  • 67. Review Criteria 3/28/18 67 • Handling of invalid/missing values of raw variables – Join errors in preprocessing – Service failures in deployment • Handling of known data quality issues – Corruption of data for certain segments/time periods
  • 68. Data Definition Template 3/28/18 68 • Template(s) with details of all the data elements and review questions • Populated for each project by all the relevant stake holders
  • 69. Outline • Overview of ML ecosystem • Problem formulation & data definition • Offline modeling • Building an internal ML platform • Key tenets for real-world ML applications 69
  • 74. Data Collection & Integration Abstract process: (specifics depend on data management infrastructure) • Find where the data resides – API, database/table names, external web sources • Identify mappings between schemas of different sources • Obtain the instance identifiers • Perform a bunch of queries (joins/aggregations) for obtaining the features/target Data access/integration tools: • SQL • Hive, Pig, SparkSQL (for large joins) • Scrapy, Nutch (web-crawling) 3/28/18 74
  • 76. Data Exploration: Why? • Data quality is really critical – “Garbage in garbage out” – Need to verify if data confirms to expectations (domain knowledge) • ML modeling requires making a lot of choices ! • Better understanding of data – Early detection and fixing of data quality issues – Good preprocessing, feature engineering & modeling choices 3/28/18 76
  • 77. Data Exploration: What to look for ? • Size and schema – #instances & #features, – feature names & data types (numeric, ordinal, categorical, text) • Univariate feature and target summaries – prevalence of missing, outlier, or junk values – distributional properties of features, – skew in target distribution • Bivariate target-feature dependencies – distributional properties of features conditioned on the target – feature-target correlation measures • Temporal variations of features & targets 3/28/18 77
  • 79. Example: Dataset Summary [Figure: summary table highlighting missing values, class imbalance, and a skewed distribution]
  • 80. Univariate (Feature or Target) Histograms [Figure: histograms with the Y-axis on a log scale]
  • 81. Feature-Target Dependencies • Class histograms conditioned on feature value • For small feature values the fraud fraction is almost 7–10 times lower; for large values it is comparable or higher
  • 83. Data Sampling & Splitting • Generalization to “unseen” data forms the core of ML • Split “labeled” data into train and test datasets – Train split is used to learn models – Test split is proxy for the “unseen” data in deployment setting is used to evaluate model performance Note: In the test split, target is known unlike data in deployment setting 3/28/18 83
  • 84. Creating Train & Test Splits • Random disjoint splits – Randomly shuffle & then split into train and test sets (e.g., 80% train & 20% test) • K-fold cross validation – Partition into K subsets: use K−1 to train & one to test – Rotate among the different sets to create K different train-test splits – More reliable avg. performance estimate, variance measures, statistical tests – Leave-one-out: extreme case of K-fold (each fold is a single instance) • Out-of-time splits (data has a temporal dependence) – Train & test splits obtained via a time-based cutoff (from production constraints) • Special case: imbalanced classes (for certain algorithms, e.g., decision trees) – Balance the train split alone by down- (or up-) sampling the majority (minority) class
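A minimal sketch of the split strategies above using scikit-learn; the arrays X, y, and timestamps below are placeholders, and a binary classification target is assumed.

import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.random.rand(1000, 5)           # placeholder feature matrix
y = np.random.randint(0, 2, 1000)     # placeholder binary target

# Random disjoint split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# K-fold cross validation: rotate through K train/test splits
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    # fit a model on (X_tr, y_tr) and evaluate on (X_te, y_te) here

# Out-of-time split: a time-based cutoff instead of random shuffling
timestamps = np.arange(1000)          # placeholder event times aligned with X
cutoff = np.quantile(timestamps, 0.8)
train_mask = timestamps <= cutoff
X_train_t, X_test_t = X[train_mask], X[~train_mask]
y_train_t, y_test_t = y[train_mask], y[~train_mask]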
  • 86. Complex pipelines: Additional Splits • In a simple scenario, the target labels are used only by the “learning algorithm” – Train and test splits suffice for this case • Complex pipelines might have multiple elements that need a peek at the target – e.g., Feature selection, Meta learning algorithms, Output calibration etc. – Separate data splits for each elements leads to better generalization – Need to consider size of available labeled data as well 3/28/18 86
  • 88. Data Preprocessing • Special handling of text valued features – Necessary to preserve the relevant “information” – Appropriate handling of special characters, punctuation, spaces & markup • Feature/row scaling (for numeric attributes) – Necessary to avoid numerical computation issues, speed up convergence – Columns: • z-scoring: subtract mean, divide by std-deviation → mean = 0, variance = 1 • fixed range: subtract min, divide by range → 0 to 1 range – Rows: L1 norm, L2 norm • Imputing missing/outlier values – Necessary to avoid incorrect estimation of model parameters – Handling strategies depend on the semantics of “missing”
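A minimal preprocessing sketch for numeric columns with scikit-learn; the DataFrame and its column names are hypothetical, not taken from the deck's data definition.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"order_amount": [250.0, np.nan, 1200.0, 90.0],
                   "item_qty": [1.0, 4.0, np.nan, 2.0]})

# Impute missing values with the column median (missing-at-random assumption)
X = SimpleImputer(strategy="median").fit_transform(df)

# Column scaling: z-scoring -> mean 0, variance 1
X_z = StandardScaler().fit_transform(X)

# Column scaling: fixed 0-to-1 range
X_01 = MinMaxScaler().fit_transform(X)

# Row scaling: L2-normalize each instance
X_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)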
  • 89. Handling Outliers & MissingValues • Indication of suspect instance: discard the record • Informative w. r. t. target: introduce a new indicator variable • Missing at random – Numeric: replace with mean/median or conditional mean (regression) – Categorical: replace with mode or likely value conditioned on the rest
  • 91. Feature Engineering • Case 1: Raw features are not highly predictive of the target, esp. in case of simple model classes (e.g., linear models) • Solution: Feature extraction, i.e., construct new more predictive features from raw ones to boost model performance • Case 2: Too many features with few training instances → a “memorizing” or “overfitting” situation leading to poor generalization • Solution: Feature selection, i.e., drop non-informative features to improve generalization
  • 92. Feature Extraction • Basic conversions for linear models – e.g., 1-Hot encoding, sparse encoding of text • Non-linear feature transformations for linear models – Linear models are scalable, but not expressive → need non-linear features – e.g., binning, quadratic interactions • Domain-specific transformations – Well studied for their effectiveness on special data types such as text, images – e.g., TF-IDF transformation, SIFT features • Dimensionality reduction – High dimensional features (e.g., text) can lead to “overfitting”, but retaining only some dimensions may be sub-optimal – Informative low dimensional approximation of the raw feature – e.g., PCA, clustering
  • 93. Basic Conversions: Categorical Features • One-Hot Encoding – Converts a categorical feature with K values into a binary vector of size K−1 – Just a representation to enable use in linear models – Example: Product_vertical ∈ {Handset, Book, Mobile} → (isBook, isMobile): Handset → (0, 0), Book → (1, 0), Mobile → (0, 1)
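A minimal pandas sketch of the K−1 encoding in the Product_vertical example above; the column names emitted by get_dummies (is_Book, is_Mobile) differ cosmetically from the slide's isBook/isMobile, and the dropped baseline is whichever category is listed first.

import pandas as pd

df = pd.DataFrame({"Product_vertical": ["Handset", "Book", "Mobile"]})
df["Product_vertical"] = pd.Categorical(
    df["Product_vertical"], categories=["Handset", "Book", "Mobile"])

# drop_first=True keeps K-1 indicator columns; Handset becomes the baseline
encoded = pd.get_dummies(df["Product_vertical"], prefix="is",
                         drop_first=True, dtype=int)
print(encoded)
#    is_Book  is_Mobile
# 0        0          0    (Handset)
# 1        1          0    (Book)
# 2        0          1    (Mobile)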
  • 94. Basic Conversions: High Dimensional Text-like Features Sparse Matrix Encoding Text features: • each feature value snippet is split into tokens (dimensions) • Bag of tokens → a sparse vector of “counts” over the token vocabulary • Single text feature → sparse matrix with #columns = vocabulary size Other high dimensional features: • similar process via a map from raw features to a bag of dimensions
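A minimal bag-of-tokens sketch using scikit-learn's CountVectorizer; the product-title strings are made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer

titles = ["red cotton shirt", "blue denim shirt", "cotton denim jacket"]
vectorizer = CountVectorizer()
X_sparse = vectorizer.fit_transform(titles)     # scipy sparse matrix

print(X_sparse.shape)                           # (3, vocabulary size)
print(vectorizer.get_feature_names_out())       # token vocabulary (columns)
print(X_sparse.toarray())                       # dense view, only for tiny examples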
  • 95. Non-linearity: Numeric Features • Non-linear functions of features or target (in regression) – Log transformation, polynomial powers, Box-Cox transforms – Useful given additional knowledge on the feature-target relationship • Binning – Numeric feature → categorical one with #values = #bins – Results in more weights (K−1 for K bins) in linear models instead of just one weight for the raw numeric feature – Useful when the feature-target relation is non-linear or non-monotonic – Bins: equal ranges, equal #examples, maximize bin purity (e.g., entropy)
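A minimal binning sketch in pandas: a numeric feature becomes a categorical one via equal-width ranges or equal-count quantiles, and can then be one-hot encoded for a linear model; the discount values are placeholders.

import pandas as pd

order_discount = pd.Series([0.05, 0.10, 0.45, 0.50, 0.80, 0.95])

equal_range_bins = pd.cut(order_discount, bins=3)    # equal-width ranges
equal_count_bins = pd.qcut(order_discount, q=3)      # equal #examples per bin

# The bin labels can be one-hot encoded so a linear model gets one weight per bin
binned = pd.get_dummies(equal_count_bins, prefix="discount_bin", dtype=int)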
  • 97. Interaction Features • Required when the features' influence on the target is not purely additive – linear combinations of features won't work Example: An order with a 50% discount on mobiles is much more likely to indicate fraud than a simple additive combination of “50% discount” and “mobile order” would suggest. Common Interaction Features: • Non-linear functions of two or more numeric features, e.g., products & ratios • Cross-products of two or more categorical features • Aggregates of numerical features corresponding to categorical feature values • Tree-paths: use leaves from decision trees trained on a smaller sample
  • 99. Numerical-Categorical Interaction Features • Compute aggregates of a numeric feature corresponding to each value of a categorical feature • New interaction feature → a numeric one obtained by replacing the categorical feature value with the corresponding numeric aggregate • e.g., brand_id → brand_avg_rating, brand_avg_return_cnt • Especially useful for categorical features with high cardinality (>50)
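A minimal sketch of such aggregate features with a pandas groupby; brand_id, rating, and is_returned are hypothetical column names, not taken from the deck.

import pandas as pd

orders = pd.DataFrame({
    "brand_id":    [101, 101, 102, 103, 103, 103],
    "rating":      [4.0, 3.5, 2.0, 5.0, 4.5, 4.0],
    "is_returned": [0, 1, 1, 0, 0, 1],
})

# brand_id -> brand_avg_rating, brand_avg_return_rate
orders["brand_avg_rating"] = orders.groupby("brand_id")["rating"].transform("mean")
orders["brand_avg_return_rate"] = orders.groupby("brand_id")["is_returned"].transform("mean")

# In practice the aggregates should be computed on the training split only
# (or on a time window preceding each order) to avoid target leakage.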
  • 100. Tree Path Features • Learn a decision tree on a small data sample with raw features • Paths to the leaves are conjunctions constructed from conditions on multiple raw features • Highly informative with respect to the target. 3/28/18 100
  • 101. Domain-Specific Transformations Text Analytics and Natural Language Processing • Stop-words removal/Stemming: Helps focus on semantics • Removing high/low percentiles: Reduces features w/o loss in predictive power • TF-IDF normalization: Corpus wide normalization of word frequency • Frequent N-grams: Capture multi-word concepts • Parts of speech/Ontology tagging: Focus on words with specific roles Web Information Extraction • Hyperlinks, Separating multiple fields of text (URL, in/out anchor text, title, body) • Structural cues: XPaths/CSS; Visual cues: relative sizes/positions of elements • Text style (italics/bold, font-size, colors) Image Processing • SIFT features, Edge extractors, Patch extractors 3/28/18 101
  • 102. Dimensionality Reduction • Clustering along feature values – K-means variants (along feature values) • Low rank matrix factorization – Principal Component Analysis (PCA) – Non-negative Matrix Factorization (NNMF) • Topic models – Latent Dirichlet Allocation (LDA) – Probabilistic Latent Semantic Analysis (PLSA) • Neural embeddings – Word2Vec(Skip-gram), Para2Vec 3/28/18 102
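A minimal dimensionality-reduction sketch: PCA for dense numeric features and TruncatedSVD for sparse bag-of-words features; all inputs below are placeholders.

import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

X_dense = np.random.rand(100, 50)                   # placeholder numeric features
X_pca = PCA(n_components=10).fit_transform(X_dense)

docs = ["low cost phone cover", "phone cover red", "hardback fiction book"]
X_text = CountVectorizer().fit_transform(docs)      # sparse term-count matrix
X_svd = TruncatedSVD(n_components=2).fit_transform(X_text)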
  • 103. Feature Selection Key Idea: Sometimes “Less (features) is more (predictive power)” Motivating reasons: • To improve generalization • To meet prediction latency or model storage constraints(for some applications) Broadly, three classes of methods: • Filter or Univariate methods, e.g., information-gain filtering • Wrapper methods, e.g., forward search • Embedded methods, e.g., regularization 3/28/18 103
  • 104. Feature Selection: Filter or Univariate Methods • Goal: Find the “top” individually predictive features – “predictive”: specified correlation metric • Ranking by a univariate score – Score features via an empirical statistical measure that corresponds to predictive power w.r.t. target – Those above a cut-off (count, percentile, score threshold) are retained Note: • Fast, but highly sub-optimal since features evaluated in isolation • Independent of the learning algorithm. • e.g., Chi-squared test, information gain, correlation coefficient 3/28/18 104
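A minimal univariate-filter sketch with scikit-learn: score each feature against the target and keep the top k; the data is a placeholder and a classification target is assumed.

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X = np.random.rand(500, 20)
y = np.random.randint(0, 2, 500)

selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_top5 = selector.fit_transform(X, y)

print(selector.scores_)                     # per-feature univariate scores
print(selector.get_support(indices=True))   # indices of the retained features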
  • 105. Feature-Target Correlation • Mutual information: captures the correlation between a categorical feature (X) and the class label (Y): I(X,Y) = Σ_{x∈sup(X)} Σ_{y∈sup(Y)} p(x,y) log [ p(x,y) / (p(x) p(y)) ] • p(x,y): fraction of examples with X=x and Y=y • p(x), p(y): fraction of examples with X=x (resp. Y=y)
  • 106. Feature-Target Correlation • Pearson's correlation coefficient: captures the linear relationship between a numeric feature (X) and target value (Y): ρ(X,Y) = cov(X,Y) / (σ_X σ_Y) = Σ_i (X_i − μ_X)(Y_i − μ_Y) / [ (Σ_i (X_i − μ_X)²)^{1/2} (Σ_i (Y_i − μ_Y)²)^{1/2} ] • X_i, Y_i: value of X, Y in the i-th instance • μ_X, μ_Y: mean of X, Y • Covariance matrix: captures correlations between every pair of features
  • 107. Feature Selection: Wrapper Methods • Goal: Find the “best” subset from all possible subsets of input features – “best” : specified performance metric & specified learning algo • Iterative search – Start with an initial choice (e.g., entire set, random subset) – Each stage: find a better choice from a pool of candidate subsets. Note: • Computationally very expensive • e.g., Backward search/Recursive feature elimination, Forward search 3/28/18 107
  • 108. Feature Selection: Embedded Methods • Identify predictive features while the model is being created itself • Penalty methods: learning objective has an additional penalty term that pushes the learning algorithm to prefer simpler models • Good trade-off in terms of the optimality & computational costs • e.g., Regularization methods (LASSO, Elastic Net, Ridge Regression) 3/28/18 108
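A minimal embedded-selection sketch: an L1-penalized logistic regression drives many coefficients to exactly zero, selecting features during training itself; the data is a placeholder.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(1000, 30)
y = np.random.randint(0, 2, 1000)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])     # features with non-zero weights
print(f"kept {selected.size} of {X.shape[1]} features")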
  • 110. Key Elements of a Supervised Learning Algorithm Training: Find a “good” model f from the training data!
  • 111. Key Elements of a Supervised Learning Algorithm Training: Find a “good” model f from the training data! • What is an allowed “model”? – A member of a model class H, e.g., linear models • What is “good”? – Accurate predictions on training data in terms of a loss function L, e.g., squared error (Y − F(X))² • How do you “find” it? – An optimization algorithm A, e.g., gradient descent
  • 112. Key Elements of a Supervised Learning Algorithm Training: Apply algorithm A to find a model from the class H that optimizes a loss function L on the training data D • H: model class, L: loss function, A: optimization algorithm • Different choices lead to different models on the same data D
  • 113. Model Training (Recap) Key elements in learning algorithms: • Model class, e.g., linear models, decision trees, neural networks • Loss function, e.g., logistic loss, hinge loss • Optimization algorithm, e.g., gradient descent, & assoc. params Lot of algorithms & hyper-parameters to choose from ! 3/28/18 113
  • 115. Model Choice: Classification Primary factors: • High #data instances (> 10 MM) – Linear models with online learning (SGD) • High #features/#examples ratio (>1) – Linear models: aggressive (L1) regularization – Linear models: dimensionality reduction – Naïve Bayes (homogeneous independent features) • Need non-linear interactions – Kernel methods (e.g., Gaussian SVM) – Tree ensembles (e.g., GBDT, RF) – Deep learning methods (e.g., CNNs, RNNs)
  • 116. Model Choice: Regression Primary factors: • High #data instances (> 10 MM) – Linear models with online learning (SGD) • High #features/#examples ratio (>1) – Linear models: aggressive (L1) regularization – Linear models: dimensionality reduction • Need non-linear interactions – Kernel methods (e.g., Gaussian SVR) – Tree ensembles (e.g., GBDT, RF) – Deep learning methods (e.g., CNNs, RNNs)
  • 117. Model Evaluation & Diagnostics Model Evaluation: • Train error: Estimate of the expressive power of the model/algorithm relative to training data • Test error: A more reliable estimate of likely performance on “unseen” data Post evaluation: What is the right strategy to get a better model ? • 1) Get more training data instances • 2a) Get more features or construct more complex features • 2b) Explore more complex models/algorithms • 3a) Drop some features • 3b) Explore simpler models/algorithm
  • 118. Overfitting • Overfitting: Model fits training data well (low training error) but does not generalize well to unseen data (poor test error) • Complex models with large #parameters capture not only good patterns (that generalize), but also noisy ones [Figure: actual vs. predicted curve showing high prediction error on unseen data]
  • 119. Underfitting • Underfitting: Model lacks the expressive power to capture the target distribution (poor training and test error) • A simple linear model cannot capture the target distribution [Figure: linear fit missing the non-linear target pattern]
  • 120. Bias & Variance • Bias of algo: difference between the actual target and the avg. estimated target, where averaging is done over models trained on different data samples • Variance of algo: variation in the predictions of models trained on different data samples
  • 121. Model Complexity: Bias & Variance • Simple learning algos with small #params → high bias & low variance – e.g., linear models with few features – Reduce bias by increasing model complexity (adding more features) • Complex learning algos with large #params → low bias & high variance – e.g., linear models with sparse high dimensional features, decision trees – Reduce variance by increasing training data & decreasing model complexity (feature selection)
  • 122. Validation Curve • Prediction performance vs. model complexity parameter • Ideal choice: match of complexity between the learning algorithm and the training data [Figure: validation curve marking the overfitting region and the optimal choice]
  • 123. Learning Curve • Prediction performance vs. number of training examples • Ideal choice: early portion of the flat region [Figure: learning curve]
  • 124. Common Evaluation Metrics Standard evaluation metrics exist for each class of predictive learning scenarios – Binary Classification – Multi-class & Multi-label Classification – Regression – Ranking • Loss function used in training objective is just one choice of evaluation metric – Usually picked because the learning algorithm is readily available – Might be a good, but not necessarily ideal choice from business perspective • Critical to work backwards from business metrics to create more meaningful metrics 3/28/18 124
  • 126. Classification – Operational Point Evaluation Metrics • For each threshold, the confusion matrix for binary classification of P vs. N: (Predicted P, Actual P) = TP; (Predicted P, Actual N) = FP; (Predicted N, Actual P) = FN; (Predicted N, Actual N) = TN • Precision = TP/(TP+FP): How correct are we on the ones we predicted P? • Recall = TP/(TP+FN): What fraction of actual P's did we predict correctly? • True Positive Rate (TPR) = Recall • False Positive Rate (FPR) = FP/(FP+TN): What fraction of N's did we predict wrongly?
  • 127. Receiver Operating Characteristic (ROC) Curve • ROC curve: plot of TPR vs. FPR across score thresholds • AUC: area under the ROC curve – Perfect classifier: AUC = 1; random classifier: AUC = 0.5 – Odds of scoring a P above an N – Effective for comparing learning algorithms across operational points • Operational point: – Maximize (TPR − FPR), F1-measure – Other business-driven choices [Figure: trade-off curve of % cumulative frauds vs. % cumulative non-frauds, highlighting a 90% true positive rate]
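A minimal sketch of the operational-point and ROC metrics above using scikit-learn; the labels and scores are placeholders and 0.5 is an arbitrary threshold.

import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             roc_auc_score, roc_curve)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3])

# Threshold-free: area under the ROC curve
print("AUC:", roc_auc_score(y_true, scores))
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Metrics at one operational point (score threshold = 0.5)
y_pred = (scores >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", precision_score(y_true, y_pred))
print("recall (TPR):", recall_score(y_true, y_pred))
print("FPR:", fp / (fp + tn))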
  • 129. Classification: Picking an Operational Point • Binary classification: the score threshold corresponds to the operational point • Application-specific bounds on precision or recall – Maximize recall (or precision) with a lower bound on precision (or recall) • Application-specific misclassification cost matrix – Costs: (Predicted P, Actual P) = C_TP; (Predicted P, Actual N) = C_FP; (Predicted N, Actual P) = C_FN; (Predicted N, Actual N) = C_TN – Optimize overall misclassification cost (TP·C_TP + FP·C_FP + TN·C_TN + FN·C_FN) – Reduces to standard misclassification error when C_TP = C_TN = 0 and C_FP = C_FN = 1
  • 130. Regression – Evaluation Metrics • Metrics when regression is used for predicting target values – Root Mean Square Error (RMSE): RMSE = [ (1/N) Σ_i (Y_i − F(X_i))² ]^{1/2} – R²: How much better is the model compared to just picking the best constant? R² = 1 − (Model Mean Squared Error / Variance) – MAPE (Mean Absolute Percent Error): MAPE = (1/N) Σ_i | Y_i − F(X_i) | / | Y_i |
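A minimal sketch computing RMSE, R², and MAPE for a regression model's predictions; y_true and y_pred are placeholder arrays, and MAPE assumes y_true contains no zeros.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.5, 2.0, 7.0])
y_pred = np.array([2.5, 6.0, 2.2, 6.5])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)                        # 1 - MSE / variance
mape = np.mean(np.abs((y_true - y_pred) / y_true))   # undefined if y_true has zeros

print(rmse, r2, mape)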
  • 131. Model Fine-tuning • Lot of algorithms and hyper-parameters (e.g., learning rate) to choose from – Infeasible to explore all choices • Practical solution approach – Narrow down a few suitable algorithms from meta data (size/attribute types) – For each chosen algorithm, systematically explore hyper-parameter choices • Alternate optimization • Exhaustive grid search • Bayesian optimization (e.g., Spearmint, MOE) • Each exploration: learning a model on a train split & evaluating on a test split. • Preferred split mechanism: k-fold cross-validation • Best hyperparameter choices based on the test (cross-validation) error 3/28/18 131
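A minimal sketch of exhaustive grid search with k-fold cross-validation over a couple of hyperparameters; the data, the chosen algorithm, and the grid values are placeholders.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X = np.random.rand(500, 10)
y = np.random.randint(0, 2, 500)

param_grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                      scoring="roc_auc", cv=5)     # 5-fold cross-validation
search.fit(X, y)

print(search.best_params_, search.best_score_)
best_model = search.best_estimator_                # refit on the full training data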
  • 132. Multiple stages of optimization • Objective: Find f(.) to optimize some cost L(Y_unseen, f(X_unseen)) • ML Methodology: – (Step I) Model learning: determine good choices of f(.) that optimize L_A(Y_train, X_train) for different choices of hyperparameters and algorithms – (Step II) Hyperparameter fine-tuning: among the choices in Step I, pick the one that optimizes L_B(Y_eval-split, X_eval-split) – (Step III) Operational choices (e.g., score thresholding, output calibration): for the choice in Step II, determine the operational choices so as to optimize L_C(Y_op-split, X_op-split) • Note: – The ideal choice is to have L = L_A = L_B = L_C and the data splits to be i.i.d., but this is not always possible • e.g., we may need L = max. recall for precision > 90%, but L_A = logistic loss & L_B = area under ROC – Preferable to choose intermediate metrics that are “close” to the desired business metric and have robust off-the-shelf implementations
  • 134. Building an Internal ML Platform 3/28/18 134
  • 138. Existing ML Platform Utilities • Open source packages: free and flexible, but gaps in functionality • Managed services (e.g., Google Cloud ML): not cost effective for large companies; need to move data to external clouds • Large companies need an internal ML platform to make up for the gaps!
  • 139. Primary Challenges • Fast error-proof productionization • Scalability vs. flexibility trade-off • Reusability & extensibility of modeling effort • Management of offline modeling experiments • Interactive monitoring of modeling experiments 3/28/18 139
  • 140. Challenge: Road to Productionization • Long slow road to delivery for each new application; very little reuse across applications • POC code → production code translation is highly error prone • Rigorous evaluation & debugging of actual production systems is unlikely since these tasks are owned by dev-ops folks and data scientists don't understand production code [Figure: hand-off chain: Product Manager (app requirements & metrics) → Data Scientist (PoC modeling in R/Python) → Software Engineers (production code) → Dev Ops]
  • 141. Solution: Self-contained “Models” • Data scientists – build application-specific configurations for data collection & modeling – ship self-contained production “models” (i.e., “model”, configurations, library dependencies), say via Docker (not POC code!) • Software engineers – build application-agnostic* production code & systems for automation of data collection, model scoring, re-training, evaluation, etc. (* need to consider data scale and latency for scoring & retraining, which have some dependency on the application) [Figure: Data Scientist ships self-contained models → Software Engineers maintain application-agnostic production code → Dev Ops]
  • 142. Solution: Self-contained “Models” • ML packages such as scikit-learn, spark-mllib, Keras allow for an easy serialization of the entire processing pipeline (i.e., preprocessing, feature- engineering, scoring) along with the fitted parameters as a single “model” that can be exported to be used for scoring. 3/28/18 142
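A minimal sketch of exporting a whole preprocessing-plus-model pipeline as one artifact with scikit-learn and joblib; the toy text data, labels, and file name are made up.

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("features", TfidfVectorizer()),    # preprocessing + feature engineering
    ("clf", LogisticRegression()),      # scoring model
])
pipeline.fit(["cheap phones bulk order", "single book order"], [1, 0])

joblib.dump(pipeline, "fraud_model_v1.joblib")      # the self-contained "model"

# In the production scorer: load and predict, with no PoC-to-production rewrite
model = joblib.load("fraud_model_v1.joblib")
print(model.predict_proba(["bulk order of phones"]))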
  • 143. Challenge: Scalability vs. Flexibility Trade-off • Scalability requirements vary across applications • Factors to consider – Size of training data – Frequency of retraining – Rate of arrival of prediction instances and latency bounds (in case of online predictions) – Size of batch and frequency of scoring (in case of batch predictions) • Data scientists prefer to train models on single machines where possible 3/28/18 143
  • 144. Solution: Support Multiple Choices • Moderate scale for training & prediction – train models on a single machine (in Python/R) – export the model as-is to multiple machines with the same image and predict in parallel • Moderate scale for training, but large prediction scale – train models on a single machine (e.g., spark-mllib in Python/R) – export the model to a different environment (e.g., Scala/Java) that allows more efficient parallelization • Large scale for both training & prediction – train models and predict on a cluster (e.g., via sparkit-learn, PySpark, or Scala)
  • 145. Challenge: Reusability & Extensibility of Modeling Effort • ML workflows are more than just the “models pipeline” – e.g., data fetch/aggregation from multiple sources, evaluation across multiple models, exploratory data analysis • Offline modeling code (notebooks) tends to get dirty fast – Mix of interactive analysis (specific to application) and processing of data • Common approach to reuse – limited use of libraries + cut & paste code 3/28/18 145
  • 146. Example Workflow: Data Fetch + Aggregation [Figure: workflow of a Data Reader pulling from Data Sources 1–3, a Data Aggregation step, and a Data Writer producing consolidated data file(s), each driven by read/aggregation/write configs and backed by library utilities]
  • 147. Example Workflow: Model Learning [Figure: workflow from consolidated data through data splitter, target constructor, feature pipeline setup, model setup, hyperparameter search, and predict & eval steps, each driven by its config (data split/sampling, target, feature, model, HP search, eval) and backed by libraries of filters/splitters/samplers, transformers, learning algos, param search, and eval metrics; outputs are a model and a report]
  • 148. Example Workflow: Model Evaluation [Figure: workflow taking labeled data and a config for pre-trained feature/model pipelines through feature pipeline setup, model setup, predict, and eval steps, producing eval reports]
  • 149. Example Workflow: Model Scoring [Figure: workflow taking unlabeled data and a prediction config (pre-trained feature/model) through feature pipeline setup, model setup, and predict steps, producing predictions]
  • 150. Solution: Workflow Abstractions • Each workflow is represented as a DAG over nodes – DAGs can be encoded as YAML or JSON files • Each node is a computational unit with the following elements – name – environment of execution (e.g., python/scala) – actual function to be executed (via a link to an existing module, class, method) – inputs (with default choices) and outputs – tags to aid discovery
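A hypothetical JSON-style node specification with the fields listed above; the schema, module path, and file names are illustrative assumptions, not an actual platform API.

node_spec = {
    "name": "train_fraud_classifier",
    "environment": "python3",                    # execution environment
    "function": "ml_lib.training.fit_model",     # link to an existing module/method
    "inputs": {
        "train_data": "consolidated_orders.parquet",
        "model_config": "configs/gbdt_default.json",
    },
    "outputs": ["model_artifact", "training_report"],
    "tags": ["fraud", "classification", "gbdt"],
}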
  • 151. Solution: Workflow Abstractions • Wrapper libraries allow hooks to existing ML packages (sklearn, keras, etc) via nodes • Properly indexed repositories of workflow DAGs, nodes and node-configurations allow discovery and reuse • Editing tools for composing DAGs enable extensibility 3/28/18 151
  • 153. Challenge: Management of Experiments • Manual tracking of experimental results requires considerable effort and is error-prone • Low reproducibility and auditability of offline modeling experiments 3/28/18 153
  • 154. Solution: Automated Repositories of ML Entities • Run: execution of a workflow – consumes datasets and configurations as inputs and generates models, reports and new datasets as outputs – organizes all the inputs/outputs and intermediate results in an appropriate directory structure • Automatically updated versioned repositories – workflow DAGs, nodes, configs – runs, datasets, models, reports • Post each run, the repositories are automatically updated with the appropriate linkages between the different entities 3/28/18 154
  • 155. Challenge: Interactive Monitoring of Experiments • Interactive execution of experiments → messy code
  • 156. Solution: Read-only monitoring • Additional layer that allows workflow DAGs to be executed one step at a time and outputs to be examined from an interactive tool (e.g., Jupyter notebooks) – run_node(), load_input(), load_output() • Cloning of intermediate inputs & outputs on demand so that these can be analyzed without affecting the original run – Changes to the actual run have to be explicitly made via workflow DAGs, configs 3/28/18 156
  • 157. Key Tenets for Real-world ML Applications 3/28/18 157
  • 158. Key Tenets for Real-world ML applications Design phase: • Work backwards from the application use case – ML problem formulation & evaluation metrics aligned with business goals – Software stack/ML libraries based on scalability/latency/retraining needs • Keep the ML problem formulation simple (but ensure validity) – Understand assumptions/limitations of ML methods & apply them with care – Should enable ease of development, testing, and maintenance
  • 159. Key Tenets for Real-world ML applications Modeling phase: • Ensure data is of high quality – Fix missing values, outliers, target leakages • Narrow down modeling options based on data characteristics – Learn about the relative effectiveness of various preprocessing, feature engineering, and learning algorithms for different types of data. • Be smart on the trade-off between feature engg. effort & model complexity – Sweet spot depends on the problem complexity, available domain knowledge, and computational requirements; • Ensure offline evaluation is a good “proxy” for the “real unseen” data evaluation – Generate train/test splits similar to how it would be during deployment
  • 160. Key Tenets for Real-world ML applications Deployment phase: • Establish train vs. production parity – Checks on every possible component that could change • Establish improvement in business metrics before scaling up – A/B testing over random buckets of instances • Trust the models, but always audit – Insert safe-guards (automated monitoring) and manual audits • View model building as a continuous process not a one-time effort – Retrain periodically to handle data drifts & design for this need Don’t adopt Machine Learning because of the hype !
  • 161. Thank You ! Happy Modeling ! Contact: srujana@gmail.com
  • 162. Useful References • Google AI Course – https://ai.google/education/#?modal_active=none • Rules of Machine Learning: Best Practices for ML Engineering – http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf • What's your ML Test Score? A rubric for ML production systems – https://www.eecs.tufts.edu/~dsculley/papers/ml_test_score.pdf • Practical advice for analysis of large, complex data sets – http://www.unofficialgoogledatascience.com/2016/10/practical-advice-for-analysis-of-large.html