SlideShare ist ein Scribd-Unternehmen logo
1 von 104
Downloaden Sie, um offline zu lesen
@2016 LinkedIn Corporation. All Rights Reserved.
KDD 16’ Tutorial
Business Applications of Predictive
Modeling at Scale
@2016 LinkedIn Corporation. All Rights Reserved. 2
Instructors
Songtao Guo
Sr. Staff of Business
Analytics Data Mining at
LinkedIn
Yan Liu
Sr. Manager of Business
Analytics Data Mining at
LinkedIn
Paul Ogilvie
Sr. Manager of Machine
Learning Algorithms Team at
LinkedIn
Qiang Zhu
Staff of Business Analytics
Data Mining at LinkedIn
@2016 LinkedIn Corporation. All Rights Reserved. 3
Outline
Introduction
❏ Why predictive modeling is important?
❏ Understanding of audience background and learning objectives
Predictive
Modeling
Overview
❏ End-to-end walkthrough of a production modeling solution
❏ Common pitfalls and challenges
❏ Case Study - LinkedIn Feed Ranking
❏ Considerations when choosing a framework
❏ Overview of tools and platforms across industry
❏ Modoop - example of a scaled framework
Choosing a
Framework
@2016 LinkedIn Corporation. All Rights Reserved. 4
Outline
Introduction
❏ Why predictive modeling is important?
❏ Understanding of audience background and learning objectives
Predictive
Modeling
Overview
❏ End-to-end walkthrough of a production modeling solution
❏ Common pitfalls and challenges
❏ Case Study - LinkedIn Feed Ranking
❏ Considerations when choosing a framework
❏ Overview of tools and platforms across industry
❏ Modoop - example of a scaled framework
Choosing a
Framework
@2016 LinkedIn Corporation. All Rights Reserved. 5
Example 1: Search Ranking
@2016 LinkedIn Corporation. All Rights Reserved. 6
Example 2: Recommendation
Job recommendations
on LinkedIn
Movie recommendations
on Netflix
@2016 LinkedIn Corporation. All Rights Reserved. 7
Example 3: User Intention Predictions
Predict which users are more likely to make certain actions, such as click, purchase, churn, etc.
@2016 LinkedIn Corporation. All Rights Reserved. 8
Example 4: B2B Predictive Analytics
@2016 LinkedIn Corporation. All Rights Reserved. 9
Example 5: Web Content Analytics
Social data
Customer feedback
• Customer service
• Group updates
• Network updates
Survey results
Products Categorization
Sentiments
Group Subscription
Home
Page
Mobile Message
Text Classification
Relevance
@2016 LinkedIn Corporation. All Rights Reserved. 10
Other Applications...
Image Recognition
Fraud/Spam Detection NLP Speech Recognition
and more...
@2016 LinkedIn Corporation. All Rights Reserved. 11
Outline
Introduction
❏ Why predictive modeling is important?
❏ Understanding of audience background and learning objectives
Predictive
Modeling
Overview
❏ End-to-end walkthrough of a production modeling solution
❏ Common pitfalls and challenges
❏ Case Study - LinkedIn Feed Ranking
❏ Considerations when choosing a framework
❏ Overview of tools and platforms across industry
❏ Modoop - example of a scaled framework
Choosing a
Framework
@2016 LinkedIn Corporation. All Rights Reserved. 12
Outline
Introduction
❏ Why predictive modeling is important?
❏ Understanding of audience background and learning objectives
Predictive
Modeling
Overview
❏ End-to-end walkthrough of a production modeling solution
❏ Common pitfalls and challenges
❏ Case Study - LinkedIn Feed Ranking
❏ Considerations when choosing a framework
❏ Overview of tools and platforms across industry
❏ Modoop - example of a scaled framework
Choosing a
Framework
@2016 LinkedIn Corporation. All Rights Reserved.
Problem Definition
13
Example: Job Seeker Subscription Model
Assume we periodically send marketing promotions/campaigns to
LinkedIn members for job-seeker subscriptions. How do we decide
who we should send these emails to?
Binary classification problem: let i
represents the product
subscription status of member
Job 1
Senior Data Scientist
Linkedin
Mountain View, CA, US
12 connections work here
------------------------------------------
Posted 2 days ago
Job 2 Job 3
Job Seeker Subscription
Reaching out
• Who's viewed your profile
• InMail™ messages
Finding the right people
• Premium search
Job seeking
• Featured applicant
• Applicant insights
• Salary data
Standing out
• Premium Profile
• Larger search listings
@2016 LinkedIn Corporation. All Rights Reserved.
crowd and
internal
judgments
label
preparation
raw features
features with
label
log data
online A/B test random
bucket
evaluate for
model selection
data
partitioning
model training
+
-
- -
-
+ +
best
model
compute offline
evaluation metrics
offline scoring
and indexing
log data
a/b test reports
feature
engineering
label
feature
integration
training
data
testing
data
validation
data
model
performance
raw features
feature
integration
scoring
features
-
raw data
-
online/offline
systems
@2016 LinkedIn Corporation. All Rights Reserved.
crowd and
internal
judgments
label
preparation
raw features
features with
label
log data
online A/B test random
bucket
evaluate for
model selection
model training
+
-
- -
-
+ +
best
model
compute offline
evaluation metrics
offline scoring
and indexing
a/b test reports
feature
engineering
label
feature
integration
training
data
testing
data
validation
data
model
performance
raw features
feature
integration
scoring
features
-
raw data
-
data
partitioning
online/offline
systems
log data
@2016 LinkedIn Corporation. All Rights Reserved.
Label Preparation
▪ A set of labels(“right answers”) is defined in
advance
▪ Methods
– Derive from data
▪ Historical transaction
▪ User preference
▪ User activity
– Domain expert
– Scale up label collection
▪ Crowdsourcing
Example1: Job seeker model
• user_id
• Target: whether subscribed
JobSeeker product
• {Yes,No} -> {1,0}
• timestamp
Example2: Churn prediction
• Account + timestamp
• Target: Close/Renew
Example3: sentiment analysis
• Review
• Target: sentiment types
• Strong negative
• Negative
• Neutral
• Positive
• Strong positive
16
@2016 LinkedIn Corporation. All Rights Reserved.
crowd and
internal
judgments
label
preparation
raw features
features with
label
log data
online A/B test random
bucket
evaluate for
model selection
model training
+
-
- -
-
+ +
best
model
compute offline
evaluation metrics
offline scoring
and indexing
a/b test reports
feature
engineering
label
feature
integration
training
data
testing
data
validation
data
model
performance
raw features
feature
integration
scoring
features
-
raw data
-
data
partitioning
online/offline
systems
log data
@2016 LinkedIn Corporation. All Rights Reserved.
Feature Engineering
Entity Features Social FeaturesBehavioral Features
• Multiple data sources, hundreds to billions of features
• Multiple levels of granularity
• Monitoring dynamic changes
18
• Demographics
• Personal and
professional interest
• …
• pageviews
• searches
• activities on external
sites
• …
• Social network identity
and behaviors
• …
The process of transforming raw data into features that better
represent the underlying problem to the predictive models
@2016 LinkedIn Corporation. All Rights Reserved.
Feature Metadata
▪ Location
▪ Description
▪ Type
– numerical, categorical,
ordinal, binary
▪ Granularity
– snapshot, hourly, daily,
weekly, monthly...
▪ Creation frequency
– hourly, daily, weekly,
monthly…
19
▪ Aggregatable
▪ Aggregation length
– snapshot, hourly, daily,
weekly, monthly...
▪ Aggregation option
– min, max, sum, avg,
median...
▪ Transformation option
– log, binary, self…
▪ Owner
▪ Retention policy
@2016 LinkedIn Corporation. All Rights Reserved.
Feature Storage
▪ File format: one of the most essential drivers
of functionality and performance for big data
processing
▪ What must be considered?
– How big are your files?
– How important is file format “splitability”?
– Does block compression matter?
– Are you more concerned about read or write?
20
@2016 LinkedIn Corporation. All Rights Reserved.
Feature Storage Type
21
Storage Text Sequence File Avro Parquet ORC
type row based row based row based column-oriented column + row
block
compression
X V V V V
splittable
compression
X V V V V
schema support X X V V V
schema
evolution
X X V V V
@2016 LinkedIn Corporation. All Rights Reserved.
Feature Representation
▪ Key-Value pair
▪ Full feature vector
▪ Sparse feature vector
22
entity_id feature1 feature2 ...
xx1 0 0.3
xx2 1 0.1
xx3 0 NA
… … 0.9
entity_id feature1 feature2 ...
xx1 0 0.3
xx2 1 0.1
xx3 0 NA
… … 0.9
entity_id feature1 feature2 ...
xx1 1 0.4
xx2 NA 0.2
xx3 0 NA
… … 0.7
2016-02-29
2016-03-31
Entity
Feature
Tim
estam
p
2016-01-31
@2016 LinkedIn Corporation. All Rights Reserved.
Raw Feature Schema Example
23
{
"type" : "record",
"name" : "TUPLE",
"fields" : [ {
"name" : "id",
"type" : [ "null", "long" ]
}, {
"name" : "timestamp",
"type" : [ "null", "int" ]
}, {
"name" : "feature_name",
"type" : [ "null", "string" ]
}, {
"name" : "feature_value",
"type" : [ "null", "double" ]
}
]
}
▪ Key-value pair
entity_id timestamp name value
1 t1 F1 0.4
1 t1 F2 0
1 t2 F4 5
2 t3 F1 1
… … F5 ...
@2016 LinkedIn Corporation. All Rights Reserved.
Raw Feature Schema Example
24
{
"type": "record",
"name": "TUPLE",
"fields": [
{
"name": "id",
"type": [ "null", "long" ]
},
{
"name": "timestamp",
"type": [ "null", "int" ]
},
{
"name": "feature1",
"type": [ "null", "double" ]
},
{
"name": "feature1",
"type": [ "null", "double" ]
},
{
"name": "feature3",
"type": [ "null", "double" ]
},
...
]
}
▪ Full feature vector
entity_id timestamp feature1 feature2 ...
xx1 t1 1 0.4
xx2 t1 NA 0.2
xx3 t2 0 NA
… ... … 0.7
@2016 LinkedIn Corporation. All Rights Reserved.
Raw Feature Schema Example
25
{
"type": "record",
"name": "FeatureVector",
"fields": [
{
"name": "entity_id",
"type": [ "string" ]
},
{
"name": "timestamp",
"type": [ "null", "string" ]
},
{
"name": "features",
"type": { "type": "array", "items": {
"type": "record",
"name": "FeatureValue",
"fields": [
{ "name": "id", "type": "string", "doc": "identifier of a feature"},
{ "name": "value", "type": "string", "doc": "text representation of a feature value" }
]
}},
"default": []
}
]
}
▪ Sparse feature vector
LIBSVM
label index1:value1 index2:value2 ...
@2016 LinkedIn Corporation. All Rights Reserved.
Feature Monitoring
26
Labor Day
Typical metrics to monitor:
● summation: e.g.: total page views, total
searches
● coverage: e.g.: number of users who sent
inmails
● percentiles: e.g.: top 95% of profile views
count
Methods to detect anomaly
● Percentage change
● T-test, Chi-square test
● Generalized ESD (Extreme Studentized
Deviate) test, Seasonal Hybrid ESD.
@2016 LinkedIn Corporation. All Rights Reserved.
crowd and
internal
judgments
label
preparation
raw features
features with
label
log data
online A/B test random
bucket
evaluate for
model selection
model training
+
-
- -
-
+ +
best
model
compute offline
evaluation metrics
offline scoring
and indexing
a/b test reports
feature
engineering
label
feature
integration
training
data
testing
data
validation
data
model
performance
raw features
feature
integration
scoring
features
-
raw data
-
data
partitioning
online/offline
systems
log data
@2016 LinkedIn Corporation. All Rights Reserved.
Feature Integration
28
user_id Label
xx1 0
xx2 1
xx3 0
… …
user_id pageviews searches …
xx1 97 5 …
xx2 27 2 …
xx4 58 4 …
user_id Tenure Is employed …
xx1 3876 1 …
xx2 60 0 …
xx3 2700 1 …
user_id connections Connect in …
xx1 60 4 …
xx2 5 5 …
xx3 120 30 …
user_id Label pageviews searches … Tenure Is employed … connections Connect in …
xx1 0 97 5 … 3876 1 … 60 4 …
xx2 1 27 2 … 60 0 … 5 5 …
xx3 0 null null null 2700 1 … 120 30 …
… …
Label set
Feature set 1
Feature set 2
Feature set 3
@2016 LinkedIn Corporation. All Rights Reserved. 29
Feature Integration
Training data with
timestamp
Feature Mart
present
week 1
week 2
week 3
week 4
mid Y X
xx1 0 Snapshot features Aggregated features
xx2 1 … …
xx3 0 … …
trans desc
self x
sign sign(x)
log sign(x)*log(abs(x)+1))
isna 1:x==null; 0:x!=null
trans desc
sum sum(xi
)
Weighted sum sum(wi
*xi
), wi
=α^i
avg avg(xi
)
Weighted avg sum(wi
*xi
)/sum(wi
), wi
=α^i
past
@2016 LinkedIn Corporation. All Rights Reserved.
Feature Transformation and Interactions
▪ Feature transformation important for linear models
▪ Numeric values:
– binary buckets based on histograms of data
– log transforms
– scale to unit variance, mean of 0
▪ Categorical values
– binary indicator for each categorical value
▪ Interactions
– cross-products of feature types, e.g.
▪ {Skillsmember
} X {Skillsjob
}
▪ {Skillsmember
} X {Industryjob
}
30
@2016 LinkedIn Corporation. All Rights Reserved.
crowd and
internal
judgments
label
preparation
raw features
-
features with
label
log data
online A/B test random
bucket
evaluate for
model selection
model training
+
-
- -
-
+ +
best
model
compute offline
evaluation metrics
offline scoring
and indexing
a/b test reports
feature
engineering label
feature
integration
training
data
testing
data
validation
data
model
performance
raw features
feature
integration
scoring
features
-
raw data
-
data
partitioning
online/offline
systems
log data
@2016 LinkedIn Corporation. All Rights Reserved.
Model Learning
32
Training Set
Validation Set
Testing Set
Model Validation
Model Training
Model Testing
Data Partitioning
Best Model
Model Selection
Solvers
Logistic regression L1/L2
Random Forest
Gradient Boosting Trees
SVM
… ...
@2016 LinkedIn Corporation. All Rights Reserved.
Hyperparameter search
▪ Modeling techniques must be tuned for best performance
– Regression
▪ regularization methods (Ridge, Lasso, Elastic Net)
▪ regularization weight
– Gradient boosted trees
▪ number trees
▪ depth of tree
▪ learning rate
▪ sampling parameters
▪ Provide wrappers to ease hyperparameter search in early problem
exploration
– Grid Search
– Randomized Search, Bayesian Optimization
▪ Typically don’t want to rerun hyperparameter search for each retrain, do
only periodically when there is a major change in data volume, features,
etc.
33
@2016 LinkedIn Corporation. All Rights Reserved. 34
Measure Model Performance
▪ Precision/Recall
– Precision = TP/(TP+FP)
– Recall = TP/(TP+FN)
▪ true positive rate / sensitivity
▪ F1 score
– 2 * (Precision * Recall) / (Precision + Recall)
▪ False positive rate:
– FPR = FP/N = FP/(FP+TN)
▪ Specificity: true negative rate
– SPC = TN/N = TN/(FP+TN)
▪ Accuracy
– ACC = (TP+TN)/(P+N)
True
Positive
False
Positive
False
Negative
True
Negative
Actual Value
p n
P N
Prediction
Outcome
p’
n’
total
@2016 LinkedIn Corporation. All Rights Reserved. 35
Measure Model Performance
@2016 LinkedIn Corporation. All Rights Reserved.
crowd and
internal
judgments
label
preparation
raw features
features with
label
log data
online A/B test random
bucket
evaluate for
model selection
model training
+
-
- -
-
+ +
best
model
offline scoring
and indexing
a/b test reports
feature
engineering
label
feature
integration
training
data
testing
data
validation
data
model
performance
raw features
feature
integration
scoring
features
-
raw data
-
data
partitioning
compute offline
evaluation metrics
online/offline
systems
log data
@2016 LinkedIn Corporation. All Rights Reserved. 37
Model Testing
▪ Similar to model validation
– Test on data NOT used in model selection
– Scoring using the best model
– Performance evaluation/comparison
Raw Testing Data Production
Model
Best Model
@2016 LinkedIn Corporation. All Rights Reserved.
crowd and
internal
judgments
label
preparation
raw features
features with
label
log data
online A/B test random
bucket
evaluate for
model selection
model training
+
-
- -
-
+ +
best
model
offline scoring
and indexing
online/offline
systems
a/b test reports
feature
engineering
label
feature
integration
training
data
testing
data
validation
data
model
performance
raw features
feature
integration
scoring
features
-
raw data
-
data
partitioning
compute offline
evaluation metrics
log data
@2016 LinkedIn Corporation. All Rights Reserved.
Performance Measurement via A/B Test
39
Algorithm b
20%
Algorithm a
80%
Collect results to determine which one is better
@2016 LinkedIn Corporation. All Rights Reserved.
Best Practices for Running A/B Test
▪ Start testing on a small portion of users
▪ Measure one change at a time
▪ Be aware of potential biases (time, targeted population, etc.)
▪ Avoid coupling a marketing campaign with an A/B test
▪ Use a simple rule of thumb to address multiple testing problems
– 0.05 p-value cutoff for metrics that are expected to be impacted
– a smaller cutoff, say 0.001, for metrics that are not
40
crowd and
internal
judgments
label
preparation
raw features
features with
label
log data
online A/B test random
bucket
evaluate for
model selection
model training
+
-
- -
-
+ +
best
model
compute offline
evaluation metrics
offline scoring
and indexing
a/b test reports
feature
engineering
label
feature
integration
training
data
testing
data
validation
data
model
performance
raw features
feature
integration
scoring
features
-
raw data
-
data
partitioning
online/offline
systems
log data
@2016 LinkedIn Corporation. All Rights Reserved. 42
Model Deployment
▪ Schedule and run the scoring pipeline regularly
– Feature integration and transformation
– Scoring using the selected model
▪ Score reasoning
▪ Downstream apps: e.g., marketing campaigns
Production
Model
Active members
past k weeks
id score rank
xxx1 0.99 1
xxx2 0.87 2
xxx3 0.72 3
… … …
@2016 LinkedIn Corporation. All Rights Reserved.
Model Management
43
Day 1 Day 30
Photo credit: asmfoto Marcell Mizik, photo license withDepositphotos File Purchase Agreement #41549281
@2016 LinkedIn Corporation. All Rights Reserved. 44
Model Management
Feed in new training data to
generate new model periodically
Monitor performance changes
over time Keep model up to date
Ensemble historical models as
one of the baseline models
@2016 LinkedIn Corporation. All Rights Reserved. 45
Outline
Introduction
❏ Why predictive modeling is important?
❏ Understanding of audience background and learning objectives
Predictive
Modeling
Overview
❏ End-to-end walkthrough of a production modeling solution
❏ Common pitfalls and challenges
❏ Case Study - LinkedIn Feed Ranking
❏ Considerations when choosing a framework
❏ Overview of tools and platforms across industry
❏ Modoop - example of a scaled framework
Choosing a
Framework
@2016 LinkedIn Corporation. All Rights Reserved.
Leakage: the introduction of information about the data mining target, which should NOT be legitimately available to mine
from. -- Shachar Kaufman, et.al. KDD11
Common leakage scenarios
● Leaking data from the testing set into the training set
● Leaking of information from the future into the past
● Leaking information from data samples outside of scope of the algorithm’s intended use
Data Leakage
ID Entity Time Label Usage 1 Usage 2 Region Size Spending Activity 1 Activity 2 …
1 Company 1 01/01 1 10 2 SF 118 3,810 60 4 …
2 Company 1 02/01 1 5 0 SF 117 3,810 70 5 …
3 Company 1 03/01 1 0 3 SF 119 3,810 120 30 …
4 Company 2 01/01 0 100 30 NYC 50 2,000 23 0 ...
5 Company 2 04/01 0 90 43 NYC 50 2,000 23 0 ...
… … … … … … … … … … … …
1 3
2
4
5
Training
Testing
randomSplit()
price1
price2
price3
Yesterday Today Tomorrow
featurei
= (price1
+ price2
+ price3
) / 3 Target = future price ?
Feature = “Rep id” Target: customer churn
A specific sales representative was assigned to take over an churned
account
Kaggle 2011 IJCNN Social Network Challenge: the winners de-anonymized 80% nodes and 60% edges to disclose real identities in the social network
@2016 LinkedIn Corporation. All Rights Reserved.
● Identify data leakage
○ Exploratory data analysis (EDA)
■ An approach to analyzing data sets to
summarize their main characteristics, often with
visual methods
○ Model performance is too good to be true?
○ Early in-the-field testing
47
Data Leakage
@2016 LinkedIn Corporation. All Rights Reserved.
Class Imbalance
▪ A dataset is said to be imbalanced when the binomial or multinomial
response variable has one or more classes that are underrepresented in
the training data, with respect to the other classes
– “I have a binary classification problem and the label is distributed in 1:100 ratio in my training set. My
results are overfit to majority class.”
▪ The class imbalance problem is pervasive and ubiquitous
– e.g job recommendation, ads CTR, fraud detection
▪ Misclassify the minority class usually with high cost
– rejecting a valid credit card transaction VS. approving a large fraudulent transaction
48
?
@2016 LinkedIn Corporation. All Rights Reserved. 49
Solutions
▪ Can You Collect More Data?
▪ Consider different evaluation metrics
– “Accuracy” might be misleading for imbalanced training data
– Confusion Matrix, Precision, Recall, F1, Kappa, ROC
▪ Re-sampling data set
– Up-sampling (Over-sampling)
– Down-sampling
– Synthetic Minority Oversampling Technique (SMOTE)
▪ Use different learning algorithms
– Decision tree, random forest
– AUC-Maximizing Algorithms
▪ Cost-Sensitive Training
Class Imbalance
@2016 LinkedIn Corporation. All Rights Reserved.
Categorical Data
▪ Categorical feature
– A variable that can take on one of a limited, and usually fixed, number of possible values,
thus assigning each individual to a particular group or "category” - The Practice of Statistics , 2003
▪ |SIGKDD Conference| = ?
▪ |KDD workshop| = ?
▪ |KDD host country| = ?
▪ High cardinality categorical features are common in the data
– E.g. Industry, country, city
▪ Too many levels
– Not all the levels (distinct values of the categorical feature) got enough support. Some are
less useful
– Many Machine Learning tools can only handle certain amount of levels
▪ E.g. Random Forest implementation in R has a hard limit of 32-levels for a categorical feature
50
@2016 LinkedIn Corporation. All Rights Reserved. 51
▪ Solutions
– Reduce cardinality by grouping categories into higher-level ones
– Transform a categorical feature into multiple binary ones
▪ Introduce an additional ‘others’ feature to represent all the new
categories in the testing set
Cloud and Distributed Computing Dental
Hadoop Spark Hive Yarn ... others
1 1 0 0 ... 0
Categorical Data
@2016 LinkedIn Corporation. All Rights Reserved.
▪ Missing data scenarios
– Missing Completely at Random (MCAR)
▪ Is not related to other variables AND is not related to value of missing variable
▪ E.g. Computer crash
– Missing at Random (MAR)
▪ Is related to other, known variables. BUT, is not related to value of missing
variable(s) once we take the above relation(s) into account
▪ E.g. male participants are more likely to refuse to fill out the depression survey, but it does not
depend on the level of their depress
– Missing Not at Random (MNAR)
▪ Is related to what the value of the missing data would have been even if we take
into consideration other variables
▪ E.g. People with low high school GPA decline to report it
52
Missing Data
@2016 LinkedIn Corporation. All Rights Reserved.
Missing Data
▪ Solutions
– Remove observations with missing values
▪ When values are missing at random and you have enough data
– Missing data imputation
▪ Common imputation strategies
– Categorical: Choose the category with the most support
– Numerical: Median, mean, or simply set to 0
▪ Predict missing data using a model
– Introduce a corresponding dummy feature to indicate its availability
– Use a model that are robust to missing data, e.g. tree-based model.
53
@2016 LinkedIn Corporation. All Rights Reserved.
Outliers
▪ Outliers can be introduced in response or predictors
– Rare event (valid)
– Erroneous metrics (invalid)
▪ Impact of the outliers
– Outlier values can have a disproportionate weight on the model.
– MSE will focus on handling outlier observations more to reduce square error
– Boosting will spend considerable modeling effort fitting these observations
▪ Solutions
– Whether the outlier value is valid or invalid?
– Remove observation with outlier feature values
– Apply transformation to reduce impact:
▪ log
▪ binning (e.g. based on distribution)
▪ spline transforms
▪ Impose a constraint on data range (cap values)
– Choose a more robust loss function (e.g. trees)
54
@2016 LinkedIn Corporation. All Rights Reserved.
Model Interpretation
▪ Why?
– Debug, diagnose, generate new hypotheses
– Inevitable questions about why a prediction was made from your
business counterparts
– for presenting reasons to users - may be output of a ML model itself
55
... With premium account, you get more
search results and access to …
... Do you know 5 of your connections
have started to use premium account …
@2016 LinkedIn Corporation. All Rights Reserved.
Model Interpretation
Univariate Feature Interpretation
Pros:
▪ Get a sense of importance for each
feature
▪ Many available algorithms: Random
Forest, Regularized linear models,
various feature selection algorithms
Cons:
▪ Bias, e.g. impurity evaluation of RF
is biased towards preferring
variables with more categories
▪ Difficulty of interpreting the ranking
of correlated variables
▪ Single feature may contain lots of
noise.
56
Group-wise Feature Interpretation
Pros:
▪ Easy to capture the overall look by
grouping massive features into a
few buckets
▪ Strong semantic meanings
▪ Inter-group correlation are less
▪ Impact from noise is reduced by
analysing multiple features at the
same time
Cons:
▪ Domain knowledge required
@2016 LinkedIn Corporation. All Rights Reserved.
Model Interpretation
57
SOCIAL
E-LEARNING
MARKETING
AFFILIATION
ENGAGEMENT SPENDING
GROWTH
PRODUCT
USAGE
Provide high-level
overview and insights
Deep dive to
finer-level predictors
@2016 LinkedIn Corporation. All Rights Reserved. 58
Outline
Introduction
❏ Why predictive modeling is important?
❏ Understanding of audience background and learning objectives
Predictive
Modeling
Overview
❏ End-to-end walkthrough of a production modeling solution
❏ Common pitfalls and challenges
❏ Case Study - LinkedIn Feed Ranking
❏ Considerations when choosing a framework
❏ Overview of tools and platforms across industry
❏ Modoop - example of a scaled framework
Choosing a
Framework
@2016 LinkedIn Corporation. All Rights Reserved. 59
Sponsored Ads
Updates from followed companies
Updates from connections
Case Study - LinkedIn Feed Ranking
@2016 LinkedIn Corporation. All Rights Reserved. 60
Case Study - LinkedIn Feed Ranking
Feed Events
➢ Sponsored Ads
➢ Updates from connections
➢ Updates from followed companies
➢ Updates from joined groups
➢ Articles shared / commented / liked by
connections
➢ Articles mentioned connections
➢ Articles posted by influencers or
connections
➢ News from followed channels
➢ Job recommendations
➢ People You May Know
➢ ...
@2016 LinkedIn Corporation. All Rights Reserved.
Actor-Verb-Object Formalism and Activity Types
61
@2016 LinkedIn Corporation. All Rights Reserved. 62
Case Study - LinkedIn Feed Ranking
Actor
Object
Verb
@2016 LinkedIn Corporation. All Rights Reserved.
Case Study - LinkedIn Feed Ranking
63
▪ Context
▪ Questions
▪ Group Discussions
▪ Discuss LinkedIn Solutions
- 400+ million members
- may have thousands of items to rank
- focus on the modeling problem - assume that items are provided
- skewed activity - some members very active, some not very active
- ranking problem, online scoring
- How to formulate the feed ranking problem?
- What are the suitable algorithms?
- What kind of features are preferred here?
- How to perform offline validation?
@2016 LinkedIn Corporation. All Rights Reserved.
▪ Binary classification problem: let it
represent the interaction between viewer i and update t:
it
=
▪ Assume using logistic regression model, let Xit
be a vector of features characterizing viewers,
feed update and the context, and be a vector of parameters:
log = ′Xit
which is equivalent to:
Problem Formulation
64
1, if viewer interacts with feed update
-1, otherwise
P( it
= 1 | viewer, update)
1 - P( it
= 1 | viewer, update)
P( it
= 1 | viewer, update) = 1/ (1 + exp(- ′Xit
))
@2016 LinkedIn Corporation. All Rights Reserved.
Model Training
▪ The parameter vector is estimated by maximizing the likelihood of the training data as a
function of
L( ) = (1 + exp(- ′Xit
))-1
▪ Add a regularization term (Euclidean norm on the parameter vector) to the log-likelihood
function to mitigate overfitting
( ) = - ∑ log (1 + exp(- it
′Xit
)) - || ||2
▪ Compute the gradient vector of the regularized log-likelihood function by using one of the
standard computational tools for gradient-based maximization, for example, stochastic
gradient descent.
65
it
it
@2016 LinkedIn Corporation. All Rights Reserved.
Potential Features
▪ Features in the feature vector Xit
are classified into following categories:
– Viewer-only features Xi
▪ member industry, geo location, skills
– Activity-only features Xt
▪ time of activity, verb type (like, comment, share, recommendation, ...)
– Object-only features Xo
▪ position
– Viewer-actor features Xij
:
▪ number messages sent to actor from viewer, number shared connections, ...
– Viewer-activity type features Xik
▪ viewer CTR for activity type
– Viewer-actor-activity type features Xijk
▪ viewer CTR for actor - activity type combination
– Viewer-object features Xio
▪ object same language as viewer profile, number previous interactions
66
@2016 LinkedIn Corporation. All Rights Reserved.
Feature Design and Engineering
67
▪ Apply feature transformation to bypass the limitation from the linear
models.
– log function on raw counts,
– indicator function on raw counts,
– bucketize a numerical feature as a set of categorical features concatenated as a binary
sub-vector
@2016 LinkedIn Corporation. All Rights Reserved.
Three-step Offline Evaluation
68
▪ AUC for model selection
▪ Comparing the ratio of observed to expected (predictions) probabilities
for different activity types (denoted as the O/E ratio) and whether or not
the data likelihood converged to the global maximum.
– The CTR model overestimates different pCTR probabilities if O/E<1
– The CTR model underestimates different pCTR probabilities if O/E>1
▪ The “Replay” tool
– Takes a logistic regression model and runs it on historic (random bucket) data and
records the total interaction on “matched impressions”
– Reorders them using the new model that we evaluate;
– Count only clicks on matched impressions that appear in the first position.
@2016 LinkedIn Corporation. All Rights Reserved. 69
Outline
Introduction
❏ Why predictive modeling is important?
❏ Understanding of audience background and learning objectives
Predictive
Modeling
Overview
❏ End-to-end walkthrough of a production modeling solution
❏ Common pitfalls and challenges
❏ Case Study - LinkedIn Feed Ranking
❏ Considerations when choosing a framework
❏ Overview of tools and platforms across industry
❏ Modoop - example of a scaled framework
Choosing a
Framework
@2016 LinkedIn Corporation. All Rights Reserved.
Scale of the problem
70
▪ Can it fit on one machine?
– source: https://github.com/szilard/benchm-ml
– Vowpal Wabbit can train linear model on a
▪ 1B example x 9 dense feature dataset in
~ 23 minutes on AWS r3.8xlarge instance
▪ 10M example dataset in 15 seconds
– xgboost can train gradient boosted trees on a
▪ 10M example x 9 dense feature dataset in as
little as 15 minutes using 6GB RAM
▪ Easier to explore a wide range of algorithms
▪ Single machine algorithms often the most
robust, come with good tools
64
"Green Globe In Child’s Hands" (CC BY-SA 2.0) by kenteegardin
@2016 LinkedIn Corporation. All Rights Reserved. 71
Integration with your company’s tech stack
▪ Data formats or sources
– Avro, Parquet, Hive, SQL, Kafka ….
– Build your own data converters
▪ Can be surprisingly difficult!
▪ Character encoding
▪ Sparse, text named features →
indexed features
▪ Model deployment
– Model formats: PMML
– Embeddable scorers: what
programming language?
?
@2016 LinkedIn Corporation. All Rights Reserved.
Modeling Techniques
▪ Linear regression models (linear, logistic, poisson)
– Scale to the largest datasets
– Can have useful model diagnostics
– Are fast at scoring
– Can be easily retrained or updated online
– Require much more feature engineering
▪ Gradient boosted regression trees and random forests
– Provide great model accuracy with minimal feature engineering
– Are slower to train
– Are slower to score
▪ Others?
– Factorization machines
– Deep learning
72
Σ
@2016 LinkedIn Corporation. All Rights Reserved.
▪ Sometimes efficient scoring is critical
– ranking problems, recommender systems, edge prediction
▪ Pre-compute in batch
– build your own candidate selection
– feed into scoring utility
– rank and store
– scaling for very large numbers of candidate
pairs can be challenging
– relatively expensive if not queried
▪ Compute on-demand
– candidates provided by an underlying index
(e.g. search engine), database, or store
– score online
– rank and return
– latency is critical
– requires robust online infrastructure
Scoring Support
73
Σ
Σ
@2016 LinkedIn Corporation. All Rights Reserved.
▪ build
– in-house building often only makes sense for large teams
– allows specialization, innovation
▪ assemble
– leverage open-source software
– make it easier to use within your company’s tech stack
– fill in gaps that open-source software doesn’t address: feature integration, model
updates, a/b testing, …
▪ buy
– can get support contracts
– makes sense if little to no engineering support within company or business unit
– good for small teams
– can be limiting in terms of techniques
74
"Brick laying" (CC BY 2.0) by hnnbz
"dollar-sign" (CC BY-SA 2.0) by Oldmaison
"LEGO house" (CC BY 2.0)
by Atsushi Tadokoro
@2016 LinkedIn Corporation. All Rights Reserved.
Maturity of platform
▪ Leverage your network
– Who is using it? in production?
– Is it working for them?
– What are its limits? scalability, stability, ease of use, ….
▪ Open source
– How active is the development of the project?
▪ sometimes means lots of changes or instability
▪ usually a good sign
– From what companies?
▪ Diverse companies suggests broad adoption
▪ Big companies suggests scale, maturity of engineering processes
▪ Academic only may indicate weak points in scale or robustness (especially
with respect to scoring solutions)
▪ Purchasing solutions or support
– Consider size of the company, amount of funding raised, your tolerance to risk
– Pedigree of the ML experts and engineers at the company
75
"Jersey City waterfront at night" (CC BY-SA 2.0) by mattk1979
@2016 LinkedIn Corporation. All Rights Reserved.
Anticipate evolving needs
▪ May start with buying a solution
▪ But later need to integrate more tightly and move
toward assembling
▪ Or later build major components yourself
▪ Consider whether the solutions you buy are built on
open source and can help you transition to
assembling solutions
76
"Butterfly" (CC BY-SA 2.0) by Salvatore Gerace
@2016 LinkedIn Corporation. All Rights Reserved. 77
Outline
Introduction
❏ Why predictive modeling is important?
❏ Understanding of audience background and learning objectives
Predictive
Modeling
Overview
❏ End-to-end walkthrough of a production modeling solution
❏ Common pitfalls and challenges
❏ Case Study - LinkedIn Feed Ranking
❏ Considerations when choosing a framework
❏ Overview of tools and platforms across industry
❏ Modoop - example of a scaled framework
Choosing a
Framework
@2016 LinkedIn Corporation. All Rights Reserved.
Spark ML, MLlib
▪ ML
– Usability through DataFrames and Pipelines
– More polished implementations, often more scalable
– Regression, Trees, k-Means, LDA
▪ MLlib
– Old RDD API
– More extensive set of methods
– Matrix Factorization, Power Iteration Clustering, SVD, PCA, Frequent Pattern Mining,
…
▪ Extensive support for reading data sources via Spark: Amazon S3, Amazon Redshift,
mongoDB, mySQL, Shark, Hive, HDFS, byo, ...
▪ Batch and streaming scoring support
▪ PMML model export for some techniques
78
@2016 LinkedIn Corporation. All Rights Reserved.
Vowpal Wabbit
79
▪ Very fast SGD
– hashing trick
– scales to very large number of features and datasets
on a single machine
– models less interpretable (can’t know the weight on any single feature)
▪ Allreduce server for distributed learning
▪ Quadratic, cubic feature interaction
▪ Most users run from command line
▪ Also C++ and C# APIs
▪ Linear, logistic, svm, quantile regression, online (contextual bandit),
LDA, matrix factorization
▪ Batch, API, service scoring
@2016 LinkedIn Corporation. All Rights Reserved.
DMLC
▪ (mostly) research collaboration
▪ Mixed language support across methods
▪ XGBoost: Scalable GBDT
▪ MXNet: flexible deep learning
▪ DiFacto: factorization machines, logistic regression using
parameter server
80
@2016 LinkedIn Corporation. All Rights Reserved.
scikit-learn
▪ Popular python package for machine learning
▪ SVM, regression, trees (RF, GBDT),
k-means, PCA, matrix factorization, …
▪ Load easily from file, HDFS, Amazon S3
▪ Batch, API scoring
81
@2016 LinkedIn Corporation. All Rights Reserved.
R
▪ Application for data mining, statistics, ML
▪ Generally not considered “production” grade
▪ Extensive packages
▪ Among most popular: SVM, regression trees (GBRT, RF, ...),
PCA, Quadratic Programming, GLM, GBM, association rule
mining, bagging, ...
▪ Packages for loading data from file, SQL, JDBC, HDFS, hive,
Redshift, Amazon S3
▪ PMML model export for some packages
82
@2016 LinkedIn Corporation. All Rights Reserved.
TensorFlow
▪ Primarily used for Deep Learning
▪ Python (first class), C++ (minimal) APIs
▪ Designed from ground up for distributed ML
▪ GPU support
▪ Flexible modeling framework beyond deep learning
▪ Auto-differentiation makes it easier to add new techniques
▪ Designed for production learning and scoring
▪ Batch scoring, model export
83
@2016 LinkedIn Corporation. All Rights Reserved.
AWS + Amazon Machine Learning
84
console, API
S3, Redshift,
RDS (SQL)
limited
transformations
logistic regression, multinomial regression,
linear regression
@2016 LinkedIn Corporation. All Rights Reserved.
AWS + Spark
85
linear, logistic, multiclass, and survival
regression, trees (RF, GBDT), MF,
survival regression, k-means, LDA
Scala, Java, R, python
MySQL, Shark, Hive,
HDFS, ODBC, …
pipelines
console, API
@2016 LinkedIn Corporation. All Rights Reserved.
AWS + Spark + databricks
86
linear, logistic, multiclass, and survival
regression, trees (RF, GBDT), MF,
survival regression, k-means, LDA
Scala, Java, R, python
MySQL, Shark, Hive,
HDFS, ODBC, …
pipelines
notebooks,
scheduled jobs,
model deployment to
scoring server
S3, Redshift,
RDS (SQL)
@2016 LinkedIn Corporation. All Rights Reserved.
AWS + H2O
87
extensive regression &
classification, GBDT,
ensembles, deep learning
notebooks, model export (jar, JSON)
HDFS, file,
URL download,
Amazon S3
@2016 LinkedIn Corporation. All Rights Reserved.
AWS + H2O + Spark
88
extensive regression &
classification, GBDT,
ensembles, deep learning
notebooks, model export (jar, JSON)
@2016 LinkedIn Corporation. All Rights Reserved.
AWS + Turi
89
extensive regression &
classification, recommenders
(factorization machines, MF, …),
k-means, LDA, deep learning
notebooks, C++, python
HDFS, Amazon
S3, file, ODBC,
Avro, ...
API, batch scoring
@2016 LinkedIn Corporation. All Rights Reserved.
AWS + Turi + Spark
90
extensive regression &
classification, recommenders
(factorization machines, MF, …),
k-means, LDA, deep learning
notebooks, C++, python
HDFS, Amazon
S3, file, ODBC,
Avro, ...
Spark RDD <-> Dato SFrame
data conversion
@2016 LinkedIn Corporation. All Rights Reserved.
Azure Cloud:
dbs, batch
computation,
Power BI,
stream analytics,
...
Microsoft Azure Machine Learning
91
extensive regression &
classification, k-means,
Vowpal Wabbit
API, Notebooks, GUI Workflows, R, python
API, batch scoring
Hive, Azure,
bulk upload,
URL download,. ..
@2016 LinkedIn Corporation. All Rights Reserved.
Azure Cloud:
dbs, batch
computation,
Power BI,
stream analytics,
...
Azure Machine Learning + Hadoop/Spark
92
extensive regression &
classification, k-means,
Vowpal Wabbit
API, Notebooks, GUI Workflows, R, python
HDInsight:
run Hadoop and
Spark on Azure
API, batch scoring
Hive, Azure,
bulk upload,
URL download,. ..
@2016 LinkedIn Corporation. All Rights Reserved.
Google Machine Learning (limited preview)
93
compute, app,
networking,
storage, big data,
security, ...
TensorFlow: deep learning,
flexible optimization framework
notebooks
Cloud Storage,
BigQuery
model export, API, batch
@2016 LinkedIn Corporation. All Rights Reserved. 94
Outline
Introduction
❏ Why predictive modeling is important?
❏ Understanding of audience background and learning objectives
Predictive
Modeling
Overview
❏ End-to-end walkthrough of a production modeling solution
❏ Common pitfalls and challenges
❏ Case Study - LinkedIn Feed Ranking
❏ Considerations when choosing a framework
❏ Overview of tools and platforms across industry
❏ Modoop - example of a scaled framework
Choosing a
Framework
@2016 LinkedIn Corporation. All Rights Reserved.
Intelligence Engine
HDFS Feature Mart
Application 1 Application 2 Application N
… …
Workflows Workflows Workflows
Feature Engineering
Libraries
Machine Learning
Libraries
Workflow Scheduler & Manager
Ground Truth
Drivers
Metadata
Store
Web UI
95
Modoop - example of building/assembling a platform
ScoresModels
MapReduce / Yarn
Pig, Hive, Python,
Scala, Shell script, ...
@2016 LinkedIn Corporation. All Rights Reserved.
Intelligence Engine
HDFS Feature Mart
Application 1 Application 2 Application N
… …
Workflows Workflows Workflows
Feature Engineering
Libraries
Machine Learning
Libraries
Workflow Scheduler & Manager
Ground Truth
Drivers
Metadata
Store
Web UI
96
ScoresModels
MapReduce / Yarn
Pig, Hive, Python,
Scala, Shell script, ...
Leverage hadoop/Spark, Azkaban open source
@2016 LinkedIn Corporation. All Rights Reserved.
Intelligence Engine
HDFS Feature Mart
Application 1 Application 2 Application N
… …
Workflows Workflows Workflows
Feature Engineering
Libraries
Machine Learning
Libraries
Workflow Scheduler & Manager
Ground Truth
Drivers
Metadata
Store
Web UI
97
ScoresModels
MapReduce / Yarn
Pig, Hive, Python,
Scala, Shell script, ...
Feature mart and metadata store enable feature integration and feature engineering
@2016 LinkedIn Corporation. All Rights Reserved.
Intelligence Engine
HDFS Feature Mart
Application 1 Application 2 Application N
… …
Workflows Workflows Workflows
Feature Engineering
Libraries
Machine Learning
Libraries
Workflow Scheduler & Manager
Ground Truth
Drivers
Metadata
Store
Web UI
98
ScoresModels
MapReduce / Yarn
Pig, Hive, Python,
Scala, Shell script, ...
Leverage open source, augment with evaluation and model selection
Photon ML
Metronome: internal
ML workflow tools
@2016 LinkedIn Corporation. All Rights Reserved.
Intelligence Engine
HDFS Feature Mart
Application 1 Application 2 Application N
… …
Workflows Workflows Workflows
Feature Engineering
Libraries
Machine Learning
Libraries
Workflow Scheduler & Manager
Ground Truth
Drivers
Metadata
Store
Web UI
99
ScoresModels
MapReduce / Yarn
Pig, Hive, Python,
Scala, Shell script, ...
Create end-to-end workflow based on user-provided label and configurations
@2016 LinkedIn Corporation. All Rights Reserved.
Intelligence Engine
HDFS Feature Mart
Application 1 Application 2 Application N
… …
Workflows Workflows Workflows
Feature Engineering
Libraries
Machine Learning
Libraries
Workflow Scheduler & Manager
Ground Truth
Drivers
Metadata
Store
Web UI
100
ScoresModels
MapReduce / Yarn
Pig, Hive, Python,
Scala, Shell script, ...
Custom user interface provides model understanding tools
@2016 LinkedIn Corporation. All Rights Reserved.
Summary
▪ Mining data to extract useful and enduring patterns remains a
skill arguably more art than science
▪ Choose appropriate evaluation metrics for different audiences
▪ A black box model is not enough
▪ Consider scale, features, integration costs, maturity, and
flexibility of platforms
101
@2016 LinkedIn Corporation. All Rights Reserved.
Appendix
102
@2016 LinkedIn Corporation. All Rights Reserved. 103
Spark ML Vowpal Wabbit dmlc scikit-learn R TensorFlow
Interfaces python, Scala, Java, R command line, C#, C++ command line,
xgboost: C++, Python, R,
Java, Scala, Julia
mxnet: Python, R, Julia,
Scala, Go, Javascript
difacto: C++
notebook, command line,
python
gui, command line python
Data Sources Amazon S3, Amazon Redshift,
mongoDB, mySQL, Shark, Hive,
HDFS, byo, ...
file, API file, HDFS, Amazon S3 file, HDFS, Amazon S3, file, SQL, JDBC, HDFS,
hive, Redshift, Amazon S3
file, API
Modeling
Techniques
MLlib: linear, logistic, tree (RF,
GBDT), MF, survival regression,
multiclass, k-means, LDA
linear, logistic, svm,
quantile regression, online
(contextual bandit), LDA,
MF
xgboost (gbrt), mxnet
(deep learning), difacto
(factorization machines,
logistic regression)
SVM, regression, trees
(RF, GBDT), k-means,
PCA, matrix factorization,
...
SVM, regression trees
(GBRT, RF, ..), PCA, QP,
GLM, GBM, association
rule mining, bagging, ...
deep learning,
linear/logistic regression,
flexible learning framework
Platforms Spark Single Machine, Custom
AllReduce
parameter server,
AllReduce (Rabit), Spark,
YARN
- - Google Cloud Compute
Engine
Scoring Batch, Streaming Batch, API Batch Batch, API PMML export for some
packages
Batch, API
Open Source Software
@2016 LinkedIn Corporation. All Rights Reserved. 104
Microsoft Azure Machine
Learning
Google Cloud Machine
Learning Platform
(alpha)
Amazon Machine
Learning
Databricks H20 Turi
Interfaces API, Notebooks, GUI Workflows,
R, python
API, Notebooks,
Programmatic Workflows,
Java, python (alpha)
API, console API, Notebooks,
Programmatic Workflows,
Java, Scala, python, R
API, Notebook, R, Java,
Scala, python
Notebooks, C++, python
Cloud Microsoft Azure Google Cloud Compute
Engine
Amazon AWS Amazon AWS Amazon AWS, Microsoft
Azure*, Google Cloud*,
your data center
Amazon AWS, Microsoft
Azure*, Google Cloud*,
your data center
Data Sources Hive, Azure blob storage, Azure
table, Azure SQL
OData feed, bulk upload, URL
download
text (including JSON),
Google Cloud Bigtable,
Google Cloud Datastore,
byo
Amazon S3, Amazon
Redshift, Amazon RDS
(SQL)
Amazon S3, Amazon
Redshift, mongoDB,
mySQL, Shark, Hive,
HDFS, byo, ...
HDFS, file, URL download,
Amazon S3
HDFS, Amazon S3, file,
ODBC, Avro, SparkRDD,
...
Modeling
Techniques
extensive regression &
classification, k-means, Vowpal
Wabbit
TensorFlow: flexible deep
learning, regression, ...
binary, multiclass
classification, regression
MLlib: linear, logistic, tree
(RF, GBDT), MF, survival
regression, multiclass,
k-means, LDA
extensive regression &
classification, GBDT,
ensembles, deep learning
extensive regression &
classification,
recommenders
(factorization machines,
MF, …), k-means, LDA,
deep learning
Limitations 10GB training datasets
full info
ML in limited preview
Prediciton API: 2.5GB
training datasets
100GB training datasets
1k input features
full info
Extensibility byo R, python, package imports TensorFlow SDK, Google
Cloud Platform
byo AWS services pipelines, byo Spark byo on your platform python, C++, byo platform
Pretrained Models vision, speech, sentiment, ... speech, vision, translation,
...
Scoring API, batch API, batch, planned export API, batch API, batch, streaming,
PMML export
API, JSON, POJO export API, Batch
Machine Learning Platforms

Weitere ähnliche Inhalte

Was ist angesagt?

Palestra Medindo seu nível de Visibilidade e Detecção usando ATT&CK e DeTT&CT
Palestra Medindo seu nível de Visibilidade e Detecção usando ATT&CK e DeTT&CTPalestra Medindo seu nível de Visibilidade e Detecção usando ATT&CK e DeTT&CT
Palestra Medindo seu nível de Visibilidade e Detecção usando ATT&CK e DeTT&CTTenchi Security
 
PowerShell for Practical Purple Teaming
PowerShell for Practical Purple TeamingPowerShell for Practical Purple Teaming
PowerShell for Practical Purple TeamingNikhil Mittal
 
Sql injection
Sql injectionSql injection
Sql injectionZidh
 
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...Chris Gates
 
Hot potato Privilege Escalation
Hot potato Privilege EscalationHot potato Privilege Escalation
Hot potato Privilege EscalationSunny Neo
 
Protection and security
Protection and securityProtection and security
Protection and securitymbadhi
 
STKI IT Delivery staffing ratios 2018 v3
STKI IT Delivery staffing ratios 2018 v3STKI IT Delivery staffing ratios 2018 v3
STKI IT Delivery staffing ratios 2018 v3Pini Cohen
 
Threat Activity Groups - Dragos
Threat Activity Groups - Dragos Threat Activity Groups - Dragos
Threat Activity Groups - Dragos Dragos, Inc.
 
POS and OMS
POS and OMSPOS and OMS
POS and OMSenVista
 
Human Resource Training and Development
Human Resource Training and DevelopmentHuman Resource Training and Development
Human Resource Training and DevelopmentChetan Parihar
 
Network security (vulnerabilities, threats, and attacks)
Network security (vulnerabilities, threats, and attacks)Network security (vulnerabilities, threats, and attacks)
Network security (vulnerabilities, threats, and attacks)Fabiha Shahzad
 

Was ist angesagt? (12)

Palestra Medindo seu nível de Visibilidade e Detecção usando ATT&CK e DeTT&CT
Palestra Medindo seu nível de Visibilidade e Detecção usando ATT&CK e DeTT&CTPalestra Medindo seu nível de Visibilidade e Detecção usando ATT&CK e DeTT&CT
Palestra Medindo seu nível de Visibilidade e Detecção usando ATT&CK e DeTT&CT
 
PowerShell for Practical Purple Teaming
PowerShell for Practical Purple TeamingPowerShell for Practical Purple Teaming
PowerShell for Practical Purple Teaming
 
Sql injection
Sql injectionSql injection
Sql injection
 
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
 
Jump-Start The MASVS
Jump-Start The MASVSJump-Start The MASVS
Jump-Start The MASVS
 
Hot potato Privilege Escalation
Hot potato Privilege EscalationHot potato Privilege Escalation
Hot potato Privilege Escalation
 
Protection and security
Protection and securityProtection and security
Protection and security
 
STKI IT Delivery staffing ratios 2018 v3
STKI IT Delivery staffing ratios 2018 v3STKI IT Delivery staffing ratios 2018 v3
STKI IT Delivery staffing ratios 2018 v3
 
Threat Activity Groups - Dragos
Threat Activity Groups - Dragos Threat Activity Groups - Dragos
Threat Activity Groups - Dragos
 
POS and OMS
POS and OMSPOS and OMS
POS and OMS
 
Human Resource Training and Development
Human Resource Training and DevelopmentHuman Resource Training and Development
Human Resource Training and Development
 
Network security (vulnerabilities, threats, and attacks)
Network security (vulnerabilities, threats, and attacks)Network security (vulnerabilities, threats, and attacks)
Network security (vulnerabilities, threats, and attacks)
 

Andere mochten auch

recruitDC Spring 2012
recruitDC Spring 2012recruitDC Spring 2012
recruitDC Spring 2012RecruitDC
 
A Short PMML Tutorial by LatentView
A Short PMML Tutorial by LatentViewA Short PMML Tutorial by LatentView
A Short PMML Tutorial by LatentViewramesh.latentview
 
An Introduction to Skyline It Services
An Introduction to Skyline It ServicesAn Introduction to Skyline It Services
An Introduction to Skyline It Servicesguest5c9d51
 
1 introduction to staffing
1 introduction to staffing1 introduction to staffing
1 introduction to staffingPreeti Bhaskar
 
Chap001 the nature_of_staffing_editing
Chap001 the nature_of_staffing_editingChap001 the nature_of_staffing_editing
Chap001 the nature_of_staffing_editingZilafeeq Shafilla
 
Earned Value Management (EVM) Limitations
Earned Value Management (EVM) LimitationsEarned Value Management (EVM) Limitations
Earned Value Management (EVM) LimitationsIftikhar Anjum
 
Advanced Analytics in Hadoop
Advanced Analytics in HadoopAdvanced Analytics in Hadoop
Advanced Analytics in HadoopAnalyticsWeek
 
Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.
Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips. Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.
Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips. Hakka Labs
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingAhmed Banafa
 
Developing a LinkedIn Recruiting Strategy
Developing a LinkedIn Recruiting StrategyDeveloping a LinkedIn Recruiting Strategy
Developing a LinkedIn Recruiting Strategyaparton
 
Social recruiting seminar maximizing linked in and facebook for recruiting ...
Social recruiting seminar   maximizing linked in and facebook for recruiting ...Social recruiting seminar   maximizing linked in and facebook for recruiting ...
Social recruiting seminar maximizing linked in and facebook for recruiting ...HireClix
 
SF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsSF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsPeter Skomoroch
 
Social Media Data Mining
Social Media Data MiningSocial Media Data Mining
Social Media Data MiningTeresa Rothaar
 
Strata 2012: Humans, Machines, and the Dimensions of Microwork
Strata 2012: Humans, Machines, and the Dimensions of MicroworkStrata 2012: Humans, Machines, and the Dimensions of Microwork
Strata 2012: Humans, Machines, and the Dimensions of MicroworkDaniel Tunkelang
 

Andere mochten auch (20)

Get your data analytics strategy right!
Get your data analytics strategy right!Get your data analytics strategy right!
Get your data analytics strategy right!
 
recruitDC Spring 2012
recruitDC Spring 2012recruitDC Spring 2012
recruitDC Spring 2012
 
R and Data Science
R and Data ScienceR and Data Science
R and Data Science
 
A Short PMML Tutorial by LatentView
A Short PMML Tutorial by LatentViewA Short PMML Tutorial by LatentView
A Short PMML Tutorial by LatentView
 
An Introduction to Skyline It Services
An Introduction to Skyline It ServicesAn Introduction to Skyline It Services
An Introduction to Skyline It Services
 
1 introduction to staffing
1 introduction to staffing1 introduction to staffing
1 introduction to staffing
 
Staffing Analytics
Staffing AnalyticsStaffing Analytics
Staffing Analytics
 
Chap001[1]
Chap001[1]Chap001[1]
Chap001[1]
 
Chap001 the nature_of_staffing_editing
Chap001 the nature_of_staffing_editingChap001 the nature_of_staffing_editing
Chap001 the nature_of_staffing_editing
 
Earned Value Management (EVM) Limitations
Earned Value Management (EVM) LimitationsEarned Value Management (EVM) Limitations
Earned Value Management (EVM) Limitations
 
Interactive Marketing
Interactive MarketingInteractive Marketing
Interactive Marketing
 
Advanced Analytics in Hadoop
Advanced Analytics in HadoopAdvanced Analytics in Hadoop
Advanced Analytics in Hadoop
 
LinkedIn for B2B
LinkedIn for B2BLinkedIn for B2B
LinkedIn for B2B
 
Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.
Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips. Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.
Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Developing a LinkedIn Recruiting Strategy
Developing a LinkedIn Recruiting StrategyDeveloping a LinkedIn Recruiting Strategy
Developing a LinkedIn Recruiting Strategy
 
Social recruiting seminar maximizing linked in and facebook for recruiting ...
Social recruiting seminar   maximizing linked in and facebook for recruiting ...Social recruiting seminar   maximizing linked in and facebook for recruiting ...
Social recruiting seminar maximizing linked in and facebook for recruiting ...
 
SF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsSF Data Science: Developing Data Products
SF Data Science: Developing Data Products
 
Social Media Data Mining
Social Media Data MiningSocial Media Data Mining
Social Media Data Mining
 
Strata 2012: Humans, Machines, and the Dimensions of Microwork
Strata 2012: Humans, Machines, and the Dimensions of MicroworkStrata 2012: Humans, Machines, and the Dimensions of Microwork
Strata 2012: Humans, Machines, and the Dimensions of Microwork
 

Ähnlich wie Business Applications of Predictive Modeling at Scale - KDD 2016 Tutorial

Business Applications of Predictive Modeling at Scale
Business Applications of Predictive Modeling at ScaleBusiness Applications of Predictive Modeling at Scale
Business Applications of Predictive Modeling at ScaleSongtao Guo
 
Virtual Sandbox for Data Scientists at Enterprise Scale
Virtual Sandbox for Data Scientists at Enterprise ScaleVirtual Sandbox for Data Scientists at Enterprise Scale
Virtual Sandbox for Data Scientists at Enterprise ScaleDenodo
 
Agile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for SuccessAgile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for SuccessInside Analysis
 
Building Scalable ML Products by TripAdvisor PM & Data Scientist
Building Scalable ML Products by TripAdvisor PM & Data ScientistBuilding Scalable ML Products by TripAdvisor PM & Data Scientist
Building Scalable ML Products by TripAdvisor PM & Data ScientistProduct School
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDatabricks
 
Roadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyRoadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyNeo4j
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionProvectus
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
 
3 Challenges of Building Complex Dashboards with Open Source Components
3 Challenges of Building Complex Dashboards with Open Source Components3 Challenges of Building Complex Dashboards with Open Source Components
3 Challenges of Building Complex Dashboards with Open Source ComponentsRyan MacCarrigan
 
Transforming B2B Sales with Spark Powered Sales Intelligence
Transforming B2B Sales with Spark Powered Sales IntelligenceTransforming B2B Sales with Spark Powered Sales Intelligence
Transforming B2B Sales with Spark Powered Sales IntelligenceSongtao Guo
 
UX STRAT 2018 | Flying Blind On a Rocket Cycle: Pioneering Experience Centere...
UX STRAT 2018 | Flying Blind On a Rocket Cycle: Pioneering Experience Centere...UX STRAT 2018 | Flying Blind On a Rocket Cycle: Pioneering Experience Centere...
UX STRAT 2018 | Flying Blind On a Rocket Cycle: Pioneering Experience Centere...Joe Lamantia
 
UX STRAT USA Presentation: Joe Lamantia, Bottomline Technologies
UX STRAT USA Presentation: Joe Lamantia, Bottomline TechnologiesUX STRAT USA Presentation: Joe Lamantia, Bottomline Technologies
UX STRAT USA Presentation: Joe Lamantia, Bottomline TechnologiesUX STRAT
 
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveScaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveJune Andrews
 
7 Steps Big Data Journey for Enterprises
7 Steps Big Data Journey for Enterprises7 Steps Big Data Journey for Enterprises
7 Steps Big Data Journey for EnterprisesRaju Shreewastava
 
Using Data Science to Build an End-to-End Recommendation System
Using Data Science to Build an End-to-End Recommendation SystemUsing Data Science to Build an End-to-End Recommendation System
Using Data Science to Build an End-to-End Recommendation SystemVMware Tanzu
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDBMongoDB
 
The Agile Drupalist - Methodologies & Techniques for Running Effective Drupal...
The Agile Drupalist - Methodologies & Techniques for Running Effective Drupal...The Agile Drupalist - Methodologies & Techniques for Running Effective Drupal...
The Agile Drupalist - Methodologies & Techniques for Running Effective Drupal...Adrian Jones
 
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...Databricks
 
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceSpark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceWei Di
 

Ähnlich wie Business Applications of Predictive Modeling at Scale - KDD 2016 Tutorial (20)

Business Applications of Predictive Modeling at Scale
Business Applications of Predictive Modeling at ScaleBusiness Applications of Predictive Modeling at Scale
Business Applications of Predictive Modeling at Scale
 
Virtual Sandbox for Data Scientists at Enterprise Scale
Virtual Sandbox for Data Scientists at Enterprise ScaleVirtual Sandbox for Data Scientists at Enterprise Scale
Virtual Sandbox for Data Scientists at Enterprise Scale
 
Agile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for SuccessAgile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for Success
 
Building Scalable ML Products by TripAdvisor PM & Data Scientist
Building Scalable ML Products by TripAdvisor PM & Data ScientistBuilding Scalable ML Products by TripAdvisor PM & Data Scientist
Building Scalable ML Products by TripAdvisor PM & Data Scientist
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in Production
 
Roadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyRoadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph Strategy
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
 
3 Challenges of Building Complex Dashboards with Open Source Components
3 Challenges of Building Complex Dashboards with Open Source Components3 Challenges of Building Complex Dashboards with Open Source Components
3 Challenges of Building Complex Dashboards with Open Source Components
 
Model Factory at ING Bank
Model Factory at ING BankModel Factory at ING Bank
Model Factory at ING Bank
 
Transforming B2B Sales with Spark Powered Sales Intelligence
Transforming B2B Sales with Spark Powered Sales IntelligenceTransforming B2B Sales with Spark Powered Sales Intelligence
Transforming B2B Sales with Spark Powered Sales Intelligence
 
UX STRAT 2018 | Flying Blind On a Rocket Cycle: Pioneering Experience Centere...
UX STRAT 2018 | Flying Blind On a Rocket Cycle: Pioneering Experience Centere...UX STRAT 2018 | Flying Blind On a Rocket Cycle: Pioneering Experience Centere...
UX STRAT 2018 | Flying Blind On a Rocket Cycle: Pioneering Experience Centere...
 
UX STRAT USA Presentation: Joe Lamantia, Bottomline Technologies
UX STRAT USA Presentation: Joe Lamantia, Bottomline TechnologiesUX STRAT USA Presentation: Joe Lamantia, Bottomline Technologies
UX STRAT USA Presentation: Joe Lamantia, Bottomline Technologies
 
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveScaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
 
7 Steps Big Data Journey for Enterprises
7 Steps Big Data Journey for Enterprises7 Steps Big Data Journey for Enterprises
7 Steps Big Data Journey for Enterprises
 
Using Data Science to Build an End-to-End Recommendation System
Using Data Science to Build an End-to-End Recommendation SystemUsing Data Science to Build an End-to-End Recommendation System
Using Data Science to Build an End-to-End Recommendation System
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
 
The Agile Drupalist - Methodologies & Techniques for Running Effective Drupal...
The Agile Drupalist - Methodologies & Techniques for Running Effective Drupal...The Agile Drupalist - Methodologies & Techniques for Running Effective Drupal...
The Agile Drupalist - Methodologies & Techniques for Running Effective Drupal...
 
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
Transforming B2B Sales with Spark-Powered Sales Intelligence with Songtao Guo...
 
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceSpark summit 2017- Transforming B2B sales with Spark powered sales intelligence
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligence
 

Kürzlich hochgeladen

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 

Kürzlich hochgeladen (20)

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 

Business Applications of Predictive Modeling at Scale - KDD 2016 Tutorial

  • 1. @2016 LinkedIn Corporation. All Rights Reserved. KDD 16’ Tutorial Business Applications of Predictive Modeling at Scale
  • 2. @2016 LinkedIn Corporation. All Rights Reserved. 2 Instructors Songtao Guo Sr. Staff of Business Analytics Data Mining at LinkedIn Yan Liu Sr. Manager of Business Analytics Data Mining at LinkedIn Paul Ogilvie Sr. Manager of Machine Learning Algorithms Team at LinkedIn Qiang Zhu Staff of Business Analytics Data Mining at LinkedIn
  • 3. @2016 LinkedIn Corporation. All Rights Reserved. 3 Outline Introduction ❏ Why predictive modeling is important? ❏ Understanding of audience background and learning objectives Predictive Modeling Overview ❏ End-to-end walkthrough of a production modeling solution ❏ Common pitfalls and challenges ❏ Case Study - LinkedIn Feed Ranking ❏ Considerations when choosing a framework ❏ Overview of tools and platforms across industry ❏ Modoop - example of a scaled framework Choosing a Framework
  • 4. @2016 LinkedIn Corporation. All Rights Reserved. 4 Outline Introduction ❏ Why predictive modeling is important? ❏ Understanding of audience background and learning objectives Predictive Modeling Overview ❏ End-to-end walkthrough of a production modeling solution ❏ Common pitfalls and challenges ❏ Case Study - LinkedIn Feed Ranking ❏ Considerations when choosing a framework ❏ Overview of tools and platforms across industry ❏ Modoop - example of a scaled framework Choosing a Framework
  • 5. @2016 LinkedIn Corporation. All Rights Reserved. 5 Example 1: Search Ranking
  • 6. @2016 LinkedIn Corporation. All Rights Reserved. 6 Example 2: Recommendation Job recommendations on LinkedIn Movie recommendations on Netflix
  • 7. @2016 LinkedIn Corporation. All Rights Reserved. 7 Example 3: User Intention Predictions Predict which users are more likely to make certain actions, such as click, purchase, churn, etc.
  • 8. @2016 LinkedIn Corporation. All Rights Reserved. 8 Example 4: B2B Predictive Analytics
  • 9. @2016 LinkedIn Corporation. All Rights Reserved. 9 Example 5: Web Content Analytics Social data Customer feedback • Customer service • Group updates • Network updates Survey results Products Categorization Sentiments Group Subscription Home Page Mobile Message Text Classification Relevance
  • 10. @2016 LinkedIn Corporation. All Rights Reserved. 10 Other Applications... Image Recognition Fraud/Spam Detection NLP Speech Recognition and more...
  • 11. @2016 LinkedIn Corporation. All Rights Reserved. 11 Outline Introduction ❏ Why predictive modeling is important? ❏ Understanding of audience background and learning objectives Predictive Modeling Overview ❏ End-to-end walkthrough of a production modeling solution ❏ Common pitfalls and challenges ❏ Case Study - LinkedIn Feed Ranking ❏ Considerations when choosing a framework ❏ Overview of tools and platforms across industry ❏ Modoop - example of a scaled framework Choosing a Framework
  • 12. @2016 LinkedIn Corporation. All Rights Reserved. 12 Outline Introduction ❏ Why predictive modeling is important? ❏ Understanding of audience background and learning objectives Predictive Modeling Overview ❏ End-to-end walkthrough of a production modeling solution ❏ Common pitfalls and challenges ❏ Case Study - LinkedIn Feed Ranking ❏ Considerations when choosing a framework ❏ Overview of tools and platforms across industry ❏ Modoop - example of a scaled framework Choosing a Framework
  • 13. @2016 LinkedIn Corporation. All Rights Reserved. Problem Definition 13 Example: Job Seeker Subscription Model Assume we periodically send marketing promotions/campaigns to LinkedIn members for job-seeker subscriptions. How do we decide who we should send these emails to? Binary classification problem: let i represents the product subscription status of member Job 1 Senior Data Scientist Linkedin Mountain View, CA, US 12 connections work here ------------------------------------------ Posted 2 days ago Job 2 Job 3 Job Seeker Subscription Reaching out • Who's viewed your profile • InMail™ messages Finding the right people • Premium search Job seeking • Featured applicant • Applicant insights • Salary data Standing out • Premium Profile • Larger search listings
  • 14. @2016 LinkedIn Corporation. All Rights Reserved. crowd and internal judgments label preparation raw features features with label log data online A/B test random bucket evaluate for model selection data partitioning model training + - - - - + + best model compute offline evaluation metrics offline scoring and indexing log data a/b test reports feature engineering label feature integration training data testing data validation data model performance raw features feature integration scoring features - raw data - online/offline systems
  • 15. @2016 LinkedIn Corporation. All Rights Reserved. crowd and internal judgments label preparation raw features features with label log data online A/B test random bucket evaluate for model selection model training + - - - - + + best model compute offline evaluation metrics offline scoring and indexing a/b test reports feature engineering label feature integration training data testing data validation data model performance raw features feature integration scoring features - raw data - data partitioning online/offline systems log data
  • 16. @2016 LinkedIn Corporation. All Rights Reserved. Label Preparation ▪ A set of labels(“right answers”) is defined in advance ▪ Methods – Derive from data ▪ Historical transaction ▪ User preference ▪ User activity – Domain expert – Scale up label collection ▪ Crowdsourcing Example1: Job seeker model • user_id • Target: whether subscribed JobSeeker product • {Yes,No} -> {1,0} • timestamp Example2: Churn prediction • Account + timestamp • Target: Close/Renew Example3: sentiment analysis • Review • Target: sentiment types • Strong negative • Negative • Neutral • Positive • Strong positive 16
  • 17. @2016 LinkedIn Corporation. All Rights Reserved. crowd and internal judgments label preparation raw features features with label log data online A/B test random bucket evaluate for model selection model training + - - - - + + best model compute offline evaluation metrics offline scoring and indexing a/b test reports feature engineering label feature integration training data testing data validation data model performance raw features feature integration scoring features - raw data - data partitioning online/offline systems log data
  • 18. @2016 LinkedIn Corporation. All Rights Reserved. Feature Engineering Entity Features Social FeaturesBehavioral Features • Multiple data sources, hundreds to billions of features • Multiple levels of granularity • Monitoring dynamic changes 18 • Demographics • Personal and professional interest • … • pageviews • searches • activities on external sites • … • Social network identity and behaviors • … The process of transforming raw data into features that better represent the underlying problem to the predictive models
  • 19. @2016 LinkedIn Corporation. All Rights Reserved. Feature Metadata ▪ Location ▪ Description ▪ Type – numerical, categorical, ordinal, binary ▪ Granularity – snapshot, hourly, daily, weekly, monthly... ▪ Creation frequency – hourly, daily, weekly, monthly… 19 ▪ Aggregatable ▪ Aggregation length – snapshot, hourly, daily, weekly, monthly... ▪ Aggregation option – min, max, sum, avg, median... ▪ Transformation option – log, binary, self… ▪ Owner ▪ Retention policy
  • 20. @2016 LinkedIn Corporation. All Rights Reserved. Feature Storage ▪ File format: one of the most essential drivers of functionality and performance for big data processing ▪ What must be considered? – How big are your files? – How important is file format “splitability”? – Does block compression matter? – Are you more concerned about read or write? 20
  • 21. @2016 LinkedIn Corporation. All Rights Reserved. Feature Storage Type 21 Storage Text Sequence File Avro Parquet ORC type row based row based row based column-oriented column + row block compression X V V V V splittable compression X V V V V schema support X X V V V schema evolution X X V V V
  • 22. @2016 LinkedIn Corporation. All Rights Reserved. Feature Representation ▪ Key-Value pair ▪ Full feature vector ▪ Sparse feature vector 22 entity_id feature1 feature2 ... xx1 0 0.3 xx2 1 0.1 xx3 0 NA … … 0.9 entity_id feature1 feature2 ... xx1 0 0.3 xx2 1 0.1 xx3 0 NA … … 0.9 entity_id feature1 feature2 ... xx1 1 0.4 xx2 NA 0.2 xx3 0 NA … … 0.7 2016-02-29 2016-03-31 Entity Feature Tim estam p 2016-01-31
  • 23. @2016 LinkedIn Corporation. All Rights Reserved. Raw Feature Schema Example 23 { "type" : "record", "name" : "TUPLE", "fields" : [ { "name" : "id", "type" : [ "null", "long" ] }, { "name" : "timestamp", "type" : [ "null", "int" ] }, { "name" : "feature_name", "type" : [ "null", "string" ] }, { "name" : "feature_value", "type" : [ "null", "double" ] } ] } ▪ Key-value pair entity_id timestamp name value 1 t1 F1 0.4 1 t1 F2 0 1 t2 F4 5 2 t3 F1 1 … … F5 ...
  • 24. @2016 LinkedIn Corporation. All Rights Reserved. Raw Feature Schema Example 24 { "type": "record", "name": "TUPLE", "fields": [ { "name": "id", "type": [ "null", "long" ] }, { "name": "timestamp", "type": [ "null", "int" ] }, { "name": "feature1", "type": [ "null", "double" ] }, { "name": "feature1", "type": [ "null", "double" ] }, { "name": "feature3", "type": [ "null", "double" ] }, ... ] } ▪ Full feature vector entity_id timestamp feature1 feature2 ... xx1 t1 1 0.4 xx2 t1 NA 0.2 xx3 t2 0 NA … ... … 0.7
  • 25. @2016 LinkedIn Corporation. All Rights Reserved. Raw Feature Schema Example 25 { "type": "record", "name": "FeatureVector", "fields": [ { "name": "entity_id", "type": [ "string" ] }, { "name": "timestamp", "type": [ "null", "string" ] }, { "name": "features", "type": { "type": "array", "items": { "type": "record", "name": "FeatureValue", "fields": [ { "name": "id", "type": "string", "doc": "identifier of a feature"}, { "name": "value", "type": "string", "doc": "text representation of a feature value" } ] }}, "default": [] } ] } ▪ Sparse feature vector LIBSVM label index1:value1 index2:value2 ...
  • 26. @2016 LinkedIn Corporation. All Rights Reserved. Feature Monitoring 26 Labor Day Typical metrics to monitor: ● summation: e.g.: total page views, total searches ● coverage: e.g.: number of users who sent inmails ● percentiles: e.g.: top 95% of profile views count Methods to detect anomaly ● Percentage change ● T-test, Chi-square test ● Generalized ESD (Extreme Studentized Deviate) test, Seasonal Hybrid ESD.
  • 27. @2016 LinkedIn Corporation. All Rights Reserved. crowd and internal judgments label preparation raw features features with label log data online A/B test random bucket evaluate for model selection model training + - - - - + + best model compute offline evaluation metrics offline scoring and indexing a/b test reports feature engineering label feature integration training data testing data validation data model performance raw features feature integration scoring features - raw data - data partitioning online/offline systems log data
  • 28. @2016 LinkedIn Corporation. All Rights Reserved. Feature Integration 28 user_id Label xx1 0 xx2 1 xx3 0 … … user_id pageviews searches … xx1 97 5 … xx2 27 2 … xx4 58 4 … user_id Tenure Is employed … xx1 3876 1 … xx2 60 0 … xx3 2700 1 … user_id connections Connect in … xx1 60 4 … xx2 5 5 … xx3 120 30 … user_id Label pageviews searches … Tenure Is employed … connections Connect in … xx1 0 97 5 … 3876 1 … 60 4 … xx2 1 27 2 … 60 0 … 5 5 … xx3 0 null null null 2700 1 … 120 30 … … … Label set Feature set 1 Feature set 2 Feature set 3
  • 29. @2016 LinkedIn Corporation. All Rights Reserved. 29 Feature Integration Training data with timestamp Feature Mart present week 1 week 2 week 3 week 4 mid Y X xx1 0 Snapshot features Aggregated features xx2 1 … … xx3 0 … … trans desc self x sign sign(x) log sign(x)*log(abs(x)+1)) isna 1:x==null; 0:x!=null trans desc sum sum(xi ) Weighted sum sum(wi *xi ), wi =α^i avg avg(xi ) Weighted avg sum(wi *xi )/sum(wi ), wi =α^i past
  • 30. @2016 LinkedIn Corporation. All Rights Reserved. Feature Transformation and Interactions ▪ Feature transformation important for linear models ▪ Numeric values: – binary buckets based on histograms of data – log transforms – scale to unit variance, mean of 0 ▪ Categorical values – binary indicator for each categorical value ▪ Interactions – cross-products of feature types, e.g. ▪ {Skillsmember } X {Skillsjob } ▪ {Skillsmember } X {Industryjob } 30
  • 31. @2016 LinkedIn Corporation. All Rights Reserved. crowd and internal judgments label preparation raw features - features with label log data online A/B test random bucket evaluate for model selection model training + - - - - + + best model compute offline evaluation metrics offline scoring and indexing a/b test reports feature engineering label feature integration training data testing data validation data model performance raw features feature integration scoring features - raw data - data partitioning online/offline systems log data
  • 32. @2016 LinkedIn Corporation. All Rights Reserved. Model Learning 32 Training Set Validation Set Testing Set Model Validation Model Training Model Testing Data Partitioning Best Model Model Selection Solvers Logistic regression L1/L2 Random Forest Gradient Boosting Trees SVM … ...
  • 33. @2016 LinkedIn Corporation. All Rights Reserved. Hyperparameter search ▪ Modeling techniques must be tuned for best performance – Regression ▪ regularization methods (Ridge, Lasso, Elastic Net) ▪ regularization weight – Gradient boosted trees ▪ number trees ▪ depth of tree ▪ learning rate ▪ sampling parameters ▪ Provide wrappers to ease hyperparameter search in early problem exploration – Grid Search – Randomized Search, Bayesian Optimization ▪ Typically don’t want to rerun hyperparameter search for each retrain, do only periodically when there is a major change in data volume, features, etc. 33
  • 34. @2016 LinkedIn Corporation. All Rights Reserved. 34 Measure Model Performance ▪ Precision/Recall – Precision = TP/(TP+FP) – Recall = TP/(TP+FN) ▪ true positive rate / sensitivity ▪ F1 score – 2 * (Precision * Recall) / (Precision + Recall) ▪ False positive rate: – FPR = FP/N = FP/(FP+TN) ▪ Specificity: true negative rate – SPC = TN/N = TN/(FP+TN) ▪ Accuracy – ACC = (TP+TN)/(P+N) True Positive False Positive False Negative True Negative Actual Value p n P N Prediction Outcome p’ n’ total
  • 35. @2016 LinkedIn Corporation. All Rights Reserved. 35 Measure Model Performance
  • 36. @2016 LinkedIn Corporation. All Rights Reserved. crowd and internal judgments label preparation raw features features with label log data online A/B test random bucket evaluate for model selection model training + - - - - + + best model offline scoring and indexing a/b test reports feature engineering label feature integration training data testing data validation data model performance raw features feature integration scoring features - raw data - data partitioning compute offline evaluation metrics online/offline systems log data
  • 37. @2016 LinkedIn Corporation. All Rights Reserved. 37 Model Testing ▪ Similar to model validation – Test on data NOT used in model selection – Scoring using the best model – Performance evaluation/comparison Raw Testing Data Production Model Best Model
  • 38. @2016 LinkedIn Corporation. All Rights Reserved. crowd and internal judgments label preparation raw features features with label log data online A/B test random bucket evaluate for model selection model training + - - - - + + best model offline scoring and indexing online/offline systems a/b test reports feature engineering label feature integration training data testing data validation data model performance raw features feature integration scoring features - raw data - data partitioning compute offline evaluation metrics log data
  • 39. @2016 LinkedIn Corporation. All Rights Reserved. Performance Measurement via A/B Test 39 Algorithm b 20% Algorithm a 80% Collect results to determine which one is better
  • 40. @2016 LinkedIn Corporation. All Rights Reserved. Best Practices for Running A/B Test ▪ Start testing on a small portion of users ▪ Measure one change at a time ▪ Be aware of potential biases (time, targeted population, etc.) ▪ Avoid coupling a marketing campaign with an A/B test ▪ Use a simple rule of thumb to address multiple testing problems – 0.05 p-value cutoff for metrics that are expected to be impacted – a smaller cutoff, say 0.001, for metrics that are not 40
  • 41. crowd and internal judgments label preparation raw features features with label log data online A/B test random bucket evaluate for model selection model training + - - - - + + best model compute offline evaluation metrics offline scoring and indexing a/b test reports feature engineering label feature integration training data testing data validation data model performance raw features feature integration scoring features - raw data - data partitioning online/offline systems log data
  • 42. @2016 LinkedIn Corporation. All Rights Reserved. 42 Model Deployment ▪ Schedule and run the scoring pipeline regularly – Feature integration and transformation – Scoring using the selected model ▪ Score reasoning ▪ Downstream apps: e.g., marketing campaigns Production Model Active members past k weeks id score rank xxx1 0.99 1 xxx2 0.87 2 xxx3 0.72 3 … … …
  • 43. @2016 LinkedIn Corporation. All Rights Reserved. Model Management 43 Day 1 Day 30 Photo credit: asmfoto Marcell Mizik, photo license withDepositphotos File Purchase Agreement #41549281
  • 44. @2016 LinkedIn Corporation. All Rights Reserved. 44 Model Management Feed in new training data to generate new model periodically Monitor performance changes over time Keep model up to date Ensemble historical models as one of the baseline models
  • 45. @2016 LinkedIn Corporation. All Rights Reserved. 45 Outline Introduction ❏ Why predictive modeling is important? ❏ Understanding of audience background and learning objectives Predictive Modeling Overview ❏ End-to-end walkthrough of a production modeling solution ❏ Common pitfalls and challenges ❏ Case Study - LinkedIn Feed Ranking ❏ Considerations when choosing a framework ❏ Overview of tools and platforms across industry ❏ Modoop - example of a scaled framework Choosing a Framework
  • 46. @2016 LinkedIn Corporation. All Rights Reserved. Leakage: the introduction of information about the data mining target, which should NOT be legitimately available to mine from. -- Shachar Kaufman, et.al. KDD11 Common leakage scenarios ● Leaking data from the testing set into the training set ● Leaking of information from the future into the past ● Leaking information from data samples outside of scope of the algorithm’s intended use Data Leakage ID Entity Time Label Usage 1 Usage 2 Region Size Spending Activity 1 Activity 2 … 1 Company 1 01/01 1 10 2 SF 118 3,810 60 4 … 2 Company 1 02/01 1 5 0 SF 117 3,810 70 5 … 3 Company 1 03/01 1 0 3 SF 119 3,810 120 30 … 4 Company 2 01/01 0 100 30 NYC 50 2,000 23 0 ... 5 Company 2 04/01 0 90 43 NYC 50 2,000 23 0 ... … … … … … … … … … … … … 1 3 2 4 5 Training Testing randomSplit() price1 price2 price3 Yesterday Today Tomorrow featurei = (price1 + price2 + price3 ) / 3 Target = future price ? Feature = “Rep id” Target: customer churn A specific sales representative was assigned to take over an churned account Kaggle 2011 IJCNN Social Network Challenge: the winners de-anonymized 80% nodes and 60% edges to disclose real identities in the social network
  • 47. @2016 LinkedIn Corporation. All Rights Reserved. ● Identify data leakage ○ Exploratory data analysis (EDA) ■ An approach to analyzing data sets to summarize their main characteristics, often with visual methods ○ Model performance is too good to be true? ○ Early in-the-field testing 47 Data Leakage
  • 48. @2016 LinkedIn Corporation. All Rights Reserved. Class Imbalance ▪ A dataset is said to be imbalanced when the binomial or multinomial response variable has one or more classes that are underrepresented in the training data, with respect to the other classes – “I have a binary classification problem and the label is distributed in 1:100 ratio in my training set. My results are overfit to majority class.” ▪ The class imbalance problem is pervasive and ubiquitous – e.g job recommendation, ads CTR, fraud detection ▪ Misclassify the minority class usually with high cost – rejecting a valid credit card transaction VS. approving a large fraudulent transaction 48 ?
  • 49. @2016 LinkedIn Corporation. All Rights Reserved. 49 Solutions ▪ Can You Collect More Data? ▪ Consider different evaluation metrics – “Accuracy” might be misleading for imbalanced training data – Confusion Matrix, Precision, Recall, F1, Kappa, ROC ▪ Re-sampling data set – Up-sampling (Over-sampling) – Down-sampling – Synthetic Minority Oversampling Technique (SMOTE) ▪ Use different learning algorithms – Decision tree, random forest – AUC-Maximizing Algorithms ▪ Cost-Sensitive Training Class Imbalance
  • 50. @2016 LinkedIn Corporation. All Rights Reserved. Categorical Data ▪ Categorical feature – A variable that can take on one of a limited, and usually fixed, number of possible values, thus assigning each individual to a particular group or "category” - The Practice of Statistics , 2003 ▪ |SIGKDD Conference| = ? ▪ |KDD workshop| = ? ▪ |KDD host country| = ? ▪ High cardinality categorical features are common in the data – E.g. Industry, country, city ▪ Too many levels – Not all the levels (distinct values of the categorical feature) got enough support. Some are less useful – Many Machine Learning tools can only handle certain amount of levels ▪ E.g. Random Forest implementation in R has a hard limit of 32-levels for a categorical feature 50
  • 51. @2016 LinkedIn Corporation. All Rights Reserved. 51 ▪ Solutions – Reduce cardinality by grouping categories into higher-level ones – Transform a categorical feature into multiple binary ones ▪ Introduce an additional ‘others’ feature to represent all the new categories in the testing set Cloud and Distributed Computing Dental Hadoop Spark Hive Yarn ... others 1 1 0 0 ... 0 Categorical Data
  • 52. @2016 LinkedIn Corporation. All Rights Reserved. ▪ Missing data scenarios – Missing Completely at Random (MCAR) ▪ Is not related to other variables AND is not related to value of missing variable ▪ E.g. Computer crash – Missing at Random (MAR) ▪ Is related to other, known variables. BUT, is not related to value of missing variable(s) once we take the above relation(s) into account ▪ E.g. male participants are more likely to refuse to fill out the depression survey, but it does not depend on the level of their depress – Missing Not at Random (MNAR) ▪ Is related to what the value of the missing data would have been even if we take into consideration other variables ▪ E.g. People with low high school GPA decline to report it 52 Missing Data
  • 53. @2016 LinkedIn Corporation. All Rights Reserved. Missing Data ▪ Solutions – Remove observations with missing values ▪ When values are missing at random and you have enough data – Missing data imputation ▪ Common imputation strategies – Categorical: Choose the category with the most support – Numerical: Median, mean, or simply set to 0 ▪ Predict missing data using a model – Introduce a corresponding dummy feature to indicate its availability – Use a model that are robust to missing data, e.g. tree-based model. 53
  • 54. @2016 LinkedIn Corporation. All Rights Reserved. Outliers ▪ Outliers can be introduced in response or predictors – Rare event (valid) – Erroneous metrics (invalid) ▪ Impact of the outliers – Outlier values can have a disproportionate weight on the model. – MSE will focus on handling outlier observations more to reduce square error – Boosting will spend considerable modeling effort fitting these observations ▪ Solutions – Whether the outlier value is valid or invalid? – Remove observation with outlier feature values – Apply transformation to reduce impact: ▪ log ▪ binning (e.g. based on distribution) ▪ spline transforms ▪ Impose a constraint on data range (cap values) – Choose a more robust loss function (e.g. trees) 54
  • 55. @2016 LinkedIn Corporation. All Rights Reserved. Model Interpretation ▪ Why? – Debug, diagnose, generate new hypotheses – Inevitable questions about why a prediction was made from your business counterparts – for presenting reasons to users - may be output of a ML model itself 55 ... With premium account, you get more search results and access to … ... Do you know 5 of your connections have started to use premium account …
  • 56. @2016 LinkedIn Corporation. All Rights Reserved. Model Interpretation Univariate Feature Interpretation Pros: ▪ Get a sense of importance for each feature ▪ Many available algorithms: Random Forest, Regularized linear models, various feature selection algorithms Cons: ▪ Bias, e.g. impurity evaluation of RF is biased towards preferring variables with more categories ▪ Difficulty of interpreting the ranking of correlated variables ▪ Single feature may contain lots of noise. 56 Group-wise Feature Interpretation Pros: ▪ Easy to capture the overall look by grouping massive features into a few buckets ▪ Strong semantic meanings ▪ Inter-group correlation are less ▪ Impact from noise is reduced by analysing multiple features at the same time Cons: ▪ Domain knowledge required
  • 57. @2016 LinkedIn Corporation. All Rights Reserved. Model Interpretation 57 SOCIAL E-LEARNING MARKETING AFFILIATION ENGAGEMENT SPENDING GROWTH PRODUCT USAGE Provide high-level overview and insights Deep dive to finer-level predictors
  • 58. @2016 LinkedIn Corporation. All Rights Reserved. 58 Outline Introduction ❏ Why predictive modeling is important? ❏ Understanding of audience background and learning objectives Predictive Modeling Overview ❏ End-to-end walkthrough of a production modeling solution ❏ Common pitfalls and challenges ❏ Case Study - LinkedIn Feed Ranking ❏ Considerations when choosing a framework ❏ Overview of tools and platforms across industry ❏ Modoop - example of a scaled framework Choosing a Framework
  • 59. @2016 LinkedIn Corporation. All Rights Reserved. 59 Sponsored Ads Updates from followed companies Updates from connections Case Study - LinkedIn Feed Ranking
  • 60. @2016 LinkedIn Corporation. All Rights Reserved. 60 Case Study - LinkedIn Feed Ranking Feed Events ➢ Sponsored Ads ➢ Updates from connections ➢ Updates from followed companies ➢ Updates from joined groups ➢ Articles shared / commented / liked by connections ➢ Articles mentioned connections ➢ Articles posted by influencers or connections ➢ News from followed channels ➢ Job recommendations ➢ People You May Know ➢ ...
  • 61. @2016 LinkedIn Corporation. All Rights Reserved. Actor-Verb-Object Formalism and Activity Types 61
  • 62. @2016 LinkedIn Corporation. All Rights Reserved. 62 Case Study - LinkedIn Feed Ranking Actor Object Verb
  • 63. @2016 LinkedIn Corporation. All Rights Reserved. Case Study - LinkedIn Feed Ranking 63 ▪ Context ▪ Questions ▪ Group Discussions ▪ Discuss LinkedIn Solutions - 400+ million members - may have thousands of items to rank - focus on the modeling problem - assume that items are provided - skewed activity - some members very active, some not very active - ranking problem, online scoring - How to formulate the feed ranking problem? - What are the suitable algorithms? - What kind of features are preferred here? - How to perform offline validation?
  • 64. @2016 LinkedIn Corporation. All Rights Reserved. ▪ Binary classification problem: let it represent the interaction between viewer i and update t: it = ▪ Assume using logistic regression model, let Xit be a vector of features characterizing viewers, feed update and the context, and be a vector of parameters: log = ′Xit which is equivalent to: Problem Formulation 64 1, if viewer interacts with feed update -1, otherwise P( it = 1 | viewer, update) 1 - P( it = 1 | viewer, update) P( it = 1 | viewer, update) = 1/ (1 + exp(- ′Xit ))
  • 65. @2016 LinkedIn Corporation. All Rights Reserved. Model Training ▪ The parameter vector is estimated by maximizing the likelihood of the training data as a function of L( ) = (1 + exp(- ′Xit ))-1 ▪ Add a regularization term (Euclidean norm on the parameter vector) to the log-likelihood function to mitigate overfitting ( ) = - ∑ log (1 + exp(- it ′Xit )) - || ||2 ▪ Compute the gradient vector of the regularized log-likelihood function by using one of the standard computational tools for gradient-based maximization, for example, stochastic gradient descent. 65 it it
  • 66. @2016 LinkedIn Corporation. All Rights Reserved. Potential Features ▪ Features in the feature vector Xit are classified into following categories: – Viewer-only features Xi ▪ member industry, geo location, skills – Activity-only features Xt ▪ time of activity, verb type (like, comment, share, recommendation, ...) – Object-only features Xo ▪ position – Viewer-actor features Xij : ▪ number messages sent to actor from viewer, number shared connections, ... – Viewer-activity type features Xik ▪ viewer CTR for activity type – Viewer-actor-activity type features Xijk ▪ viewer CTR for actor - activity type combination – Viewer-object features Xio ▪ object same language as viewer profile, number previous interactions 66
  • 67. @2016 LinkedIn Corporation. All Rights Reserved. Feature Design and Engineering 67 ▪ Apply feature transformation to bypass the limitation from the linear models. – log function on raw counts, – indicator function on raw counts, – bucketize a numerical feature as a set of categorical features concatenated as a binary sub-vector
  • 68. @2016 LinkedIn Corporation. All Rights Reserved. Three-step Offline Evaluation 68 ▪ AUC for model selection ▪ Comparing the ratio of observed to expected (predictions) probabilities for different activity types (denoted as the O/E ratio) and whether or not the data likelihood converged to the global maximum. – The CTR model overestimates different pCTR probabilities if O/E<1 – The CTR model underestimates different pCTR probabilities if O/E>1 ▪ The “Replay” tool – Takes a logistic regression model and runs it on historic (random bucket) data and records the total interaction on “matched impressions” – Reorders them using the new model that we evaluate; – Count only clicks on matched impressions that appear in the first position.
  • 69. @2016 LinkedIn Corporation. All Rights Reserved. 69 Outline Introduction ❏ Why predictive modeling is important? ❏ Understanding of audience background and learning objectives Predictive Modeling Overview ❏ End-to-end walkthrough of a production modeling solution ❏ Common pitfalls and challenges ❏ Case Study - LinkedIn Feed Ranking ❏ Considerations when choosing a framework ❏ Overview of tools and platforms across industry ❏ Modoop - example of a scaled framework Choosing a Framework
  • 70. @2016 LinkedIn Corporation. All Rights Reserved. Scale of the problem 70 ▪ Can it fit on one machine? – source: https://github.com/szilard/benchm-ml – Vowpal Wabbit can train linear model on a ▪ 1B example x 9 dense feature dataset in ~ 23 minutes on AWS r3.8xlarge instance ▪ 10M example dataset in 15 seconds – xgboost can train gradient boosted trees on a ▪ 10M example x 9 dense feature dataset in as little as 15 minutes using 6GB RAM ▪ Easier to explore a wide range of algorithms ▪ Single machine algorithms often the most robust, come with good tools 64 "Green Globe In Child’s Hands" (CC BY-SA 2.0) by kenteegardin
  • 71. @2016 LinkedIn Corporation. All Rights Reserved. 71 Integration with your company’s tech stack ▪ Data formats or sources – Avro, Parquet, Hive, SQL, Kafka …. – Build your own data converters ▪ Can be surprisingly difficult! ▪ Character encoding ▪ Sparse, text named features → indexed features ▪ Model deployment – Model formats: PMML – Embeddable scorers: what programming language? ?
  • 72. @2016 LinkedIn Corporation. All Rights Reserved. Modeling Techniques ▪ Linear regression models (linear, logistic, poisson) – Scale to the largest datasets – Can have useful model diagnostics – Are fast at scoring – Can be easily retrained or updated online – Require much more feature engineering ▪ Gradient boosted regression trees and random forests – Provide great model accuracy with minimal feature engineering – Are slower to train – Are slower to score ▪ Others? – Factorization machines – Deep learning 72 Σ
  • 73. @2016 LinkedIn Corporation. All Rights Reserved. ▪ Sometimes efficient scoring is critical – ranking problems, recommender systems, edge prediction ▪ Pre-compute in batch – build your own candidate selection – feed into scoring utility – rank and store – scaling for very large numbers of candidate pairs can be challenging – relatively expensive if not queried ▪ Compute on-demand – candidates provided by an underlying index (e.g. search engine), database, or store – score online – rank and return – latency is critical – requires robust online infrastructure Scoring Support 73 Σ Σ
  • 74. @2016 LinkedIn Corporation. All Rights Reserved. ▪ build – in-house building often only makes sense for large teams – allows specialization, innovation ▪ assemble – leverage open-source software – make it easier to use within your company’s tech stack – fill in gaps that open-source software doesn’t address: feature integration, model updates, a/b testing, … ▪ buy – can get support contracts – makes sense if little to no engineering support within company or business unit – good for small teams – can be limiting in terms of techniques 74 "Brick laying" (CC BY 2.0) by hnnbz "dollar-sign" (CC BY-SA 2.0) by Oldmaison "LEGO house" (CC BY 2.0) by Atsushi Tadokoro
  • 75. @2016 LinkedIn Corporation. All Rights Reserved. Maturity of platform ▪ Leverage your network – Who is using it? in production? – Is it working for them? – What are its limits? scalability, stability, ease of use, …. ▪ Open source – How active is the development of the project? ▪ sometimes means lots of changes or instability ▪ usually a good sign – From what companies? ▪ Diverse companies suggests broad adoption ▪ Big companies suggests scale, maturity of engineering processes ▪ Academic only may indicate weak points in scale or robustness (especially with respect to scoring solutions) ▪ Purchasing solutions or support – Consider size of the company, amount of funding raised, your tolerance to risk – Pedigree of the ML experts and engineers at the company 75 "Jersey City waterfront at night" (CC BY-SA 2.0) by mattk1979
  • 76. @2016 LinkedIn Corporation. All Rights Reserved. Anticipate evolving needs ▪ May start with buying a solution ▪ But later need to integrate more tightly and move toward assembling ▪ Or later build major components yourself ▪ Consider whether the solutions you buy are built on open source and can help you transition to assembling solutions 76 "Butterfly" (CC BY-SA 2.0) by Salvatore Gerace
  • 77. @2016 LinkedIn Corporation. All Rights Reserved. 77 Outline Introduction ❏ Why predictive modeling is important? ❏ Understanding of audience background and learning objectives Predictive Modeling Overview ❏ End-to-end walkthrough of a production modeling solution ❏ Common pitfalls and challenges ❏ Case Study - LinkedIn Feed Ranking ❏ Considerations when choosing a framework ❏ Overview of tools and platforms across industry ❏ Modoop - example of a scaled framework Choosing a Framework
  • 78. @2016 LinkedIn Corporation. All Rights Reserved. Spark ML, MLlib ▪ ML – Usability through DataFrames and Pipelines – More polished implementations, often more scalable – Regression, Trees, k-Means, LDA ▪ MLlib – Old RDD API – More extensive set of methods – Matrix Factorization, Power Iteration Clustering, SVD, PCA, Frequent Pattern Mining, … ▪ Extensive support for reading data sources via Spark: Amazon S3, Amazon Redshift, mongoDB, mySQL, Shark, Hive, HDFS, byo, ... ▪ Batch and streaming scoring support ▪ PMML model export for some techniques 78
  • 79. @2016 LinkedIn Corporation. All Rights Reserved. Vowpal Wabbit 79 ▪ Very fast SGD – hashing trick – scales to very large number of features and datasets on a single machine – models less interpretable (can’t know the weight on any single feature) ▪ Allreduce server for distributed learning ▪ Quadratic, cubic feature interaction ▪ Most users run from command line ▪ Also C++ and C# APIs ▪ Linear, logistic, svm, quantile regression, online (contextual bandit), LDA, matrix factorization ▪ Batch, API, service scoring
  • 80. @2016 LinkedIn Corporation. All Rights Reserved. DMLC ▪ (mostly) research collaboration ▪ Mixed language support across methods ▪ XGBoost: Scalable GBDT ▪ MXNet: flexible deep learning ▪ DiFacto: factorization machines, logistic regression using parameter server 80
  • 81. @2016 LinkedIn Corporation. All Rights Reserved. scikit-learn ▪ Popular python package for machine learning ▪ SVM, regression, trees (RF, GBDT), k-means, PCA, matrix factorization, … ▪ Load easily from file, HDFS, Amazon S3 ▪ Batch, API scoring 81
  • 82. @2016 LinkedIn Corporation. All Rights Reserved. R ▪ Application for data mining, statistics, ML ▪ Generally not considered “production” grade ▪ Extensive packages ▪ Among most popular: SVM, regression trees (GBRT, RF, ...), PCA, Quadratic Programming, GLM, GBM, association rule mining, bagging, ... ▪ Packages for loading data from file, SQL, JDBC, HDFS, hive, Redshift, Amazon S3 ▪ PMML model export for some packages 82
  • 83. @2016 LinkedIn Corporation. All Rights Reserved. TensorFlow ▪ Primarily used for Deep Learning ▪ Python (first class), C++ (minimal) APIs ▪ Designed from ground up for distributed ML ▪ GPU support ▪ Flexible modeling framework beyond deep learning ▪ Auto-differentiation makes it easier to add new techniques ▪ Designed for production learning and scoring ▪ Batch scoring, model export 83
  • 84. @2016 LinkedIn Corporation. All Rights Reserved. AWS + Amazon Machine Learning 84 console, API S3, Redshift, RDS (SQL) limited transformations logistic regression, multinomial regression, linear regression
  • 85. @2016 LinkedIn Corporation. All Rights Reserved. AWS + Spark 85 linear, logistic, multiclass, and survival regression, trees (RF, GBDT), MF, survival regression, k-means, LDA Scala, Java, R, python MySQL, Shark, Hive, HDFS, ODBC, … pipelines console, API
  • 86. @2016 LinkedIn Corporation. All Rights Reserved. AWS + Spark + databricks 86 linear, logistic, multiclass, and survival regression, trees (RF, GBDT), MF, survival regression, k-means, LDA Scala, Java, R, python MySQL, Shark, Hive, HDFS, ODBC, … pipelines notebooks, scheduled jobs, model deployment to scoring server S3, Redshift, RDS (SQL)
  • 87. @2016 LinkedIn Corporation. All Rights Reserved. AWS + H2O 87 extensive regression & classification, GBDT, ensembles, deep learning notebooks, model export (jar, JSON) HDFS, file, URL download, Amazon S3
  • 88. @2016 LinkedIn Corporation. All Rights Reserved. AWS + H2O + Spark 88 extensive regression & classification, GBDT, ensembles, deep learning notebooks, model export (jar, JSON)
  • 89. @2016 LinkedIn Corporation. All Rights Reserved. AWS + Turi 89 extensive regression & classification, recommenders (factorization machines, MF, …), k-means, LDA, deep learning notebooks, C++, python HDFS, Amazon S3, file, ODBC, Avro, ... API, batch scoring
  • 90. @2016 LinkedIn Corporation. All Rights Reserved. AWS + Turi + Spark 90 extensive regression & classification, recommenders (factorization machines, MF, …), k-means, LDA, deep learning notebooks, C++, python HDFS, Amazon S3, file, ODBC, Avro, ... Spark RDD <-> Dato SFrame data conversion
  • 91. @2016 LinkedIn Corporation. All Rights Reserved. Azure Cloud: dbs, batch computation, Power BI, stream analytics, ... Microsoft Azure Machine Learning 91 extensive regression & classification, k-means, Vowpal Wabbit API, Notebooks, GUI Workflows, R, python API, batch scoring Hive, Azure, bulk upload, URL download,. ..
  • 92. @2016 LinkedIn Corporation. All Rights Reserved. Azure Cloud: dbs, batch computation, Power BI, stream analytics, ... Azure Machine Learning + Hadoop/Spark 92 extensive regression & classification, k-means, Vowpal Wabbit API, Notebooks, GUI Workflows, R, python HDInsight: run Hadoop and Spark on Azure API, batch scoring Hive, Azure, bulk upload, URL download,. ..
  • 93. @2016 LinkedIn Corporation. All Rights Reserved. Google Machine Learning (limited preview) 93 compute, app, networking, storage, big data, security, ... TensorFlow: deep learning, flexible optimization framework notebooks Cloud Storage, BigQuery model export, API, batch
  • 94. @2016 LinkedIn Corporation. All Rights Reserved. 94 Outline Introduction ❏ Why predictive modeling is important? ❏ Understanding of audience background and learning objectives Predictive Modeling Overview ❏ End-to-end walkthrough of a production modeling solution ❏ Common pitfalls and challenges ❏ Case Study - LinkedIn Feed Ranking ❏ Considerations when choosing a framework ❏ Overview of tools and platforms across industry ❏ Modoop - example of a scaled framework Choosing a Framework
  • 95. @2016 LinkedIn Corporation. All Rights Reserved. Intelligence Engine HDFS Feature Mart Application 1 Application 2 Application N … … Workflows Workflows Workflows Feature Engineering Libraries Machine Learning Libraries Workflow Scheduler & Manager Ground Truth Drivers Metadata Store Web UI 95 Modoop - example of building/assembling a platform ScoresModels MapReduce / Yarn Pig, Hive, Python, Scala, Shell script, ...
  • 96. @2016 LinkedIn Corporation. All Rights Reserved. Intelligence Engine HDFS Feature Mart Application 1 Application 2 Application N … … Workflows Workflows Workflows Feature Engineering Libraries Machine Learning Libraries Workflow Scheduler & Manager Ground Truth Drivers Metadata Store Web UI 96 ScoresModels MapReduce / Yarn Pig, Hive, Python, Scala, Shell script, ... Leverage hadoop/Spark, Azkaban open source
  • 97. @2016 LinkedIn Corporation. All Rights Reserved. Intelligence Engine HDFS Feature Mart Application 1 Application 2 Application N … … Workflows Workflows Workflows Feature Engineering Libraries Machine Learning Libraries Workflow Scheduler & Manager Ground Truth Drivers Metadata Store Web UI 97 ScoresModels MapReduce / Yarn Pig, Hive, Python, Scala, Shell script, ... Feature mart and metadata store enable feature integration and feature engineering
  • 98. @2016 LinkedIn Corporation. All Rights Reserved. Intelligence Engine HDFS Feature Mart Application 1 Application 2 Application N … … Workflows Workflows Workflows Feature Engineering Libraries Machine Learning Libraries Workflow Scheduler & Manager Ground Truth Drivers Metadata Store Web UI 98 ScoresModels MapReduce / Yarn Pig, Hive, Python, Scala, Shell script, ... Leverage open source, augment with evaluation and model selection Photon ML Metronome: internal ML workflow tools
  • 99. @2016 LinkedIn Corporation. All Rights Reserved. Intelligence Engine HDFS Feature Mart Application 1 Application 2 Application N … … Workflows Workflows Workflows Feature Engineering Libraries Machine Learning Libraries Workflow Scheduler & Manager Ground Truth Drivers Metadata Store Web UI 99 ScoresModels MapReduce / Yarn Pig, Hive, Python, Scala, Shell script, ... Create end-to-end workflow based on user-provided label and configurations
  • 100. @2016 LinkedIn Corporation. All Rights Reserved. Intelligence Engine HDFS Feature Mart Application 1 Application 2 Application N … … Workflows Workflows Workflows Feature Engineering Libraries Machine Learning Libraries Workflow Scheduler & Manager Ground Truth Drivers Metadata Store Web UI 100 ScoresModels MapReduce / Yarn Pig, Hive, Python, Scala, Shell script, ... Custom user interface provides model understanding tools
  • 101. @2016 LinkedIn Corporation. All Rights Reserved. Summary ▪ Mining data to extract useful and enduring patterns remains a skill arguably more art than science ▪ Choose appropriate evaluation metrics for different audiences ▪ A black box model is not enough ▪ Consider scale, features, integration costs, maturity, and flexibility of platforms 101
  • 102. @2016 LinkedIn Corporation. All Rights Reserved. Appendix 102
  • 103. @2016 LinkedIn Corporation. All Rights Reserved. 103 Spark ML Vowpal Wabbit dmlc scikit-learn R TensorFlow Interfaces python, Scala, Java, R command line, C#, C++ command line, xgboost: C++, Python, R, Java, Scala, Julia mxnet: Python, R, Julia, Scala, Go, Javascript difacto: C++ notebook, command line, python gui, command line python Data Sources Amazon S3, Amazon Redshift, mongoDB, mySQL, Shark, Hive, HDFS, byo, ... file, API file, HDFS, Amazon S3 file, HDFS, Amazon S3, file, SQL, JDBC, HDFS, hive, Redshift, Amazon S3 file, API Modeling Techniques MLlib: linear, logistic, tree (RF, GBDT), MF, survival regression, multiclass, k-means, LDA linear, logistic, svm, quantile regression, online (contextual bandit), LDA, MF xgboost (gbrt), mxnet (deep learning), difacto (factorization machines, logistic regression) SVM, regression, trees (RF, GBDT), k-means, PCA, matrix factorization, ... SVM, regression trees (GBRT, RF, ..), PCA, QP, GLM, GBM, association rule mining, bagging, ... deep learning, linear/logistic regression, flexible learning framework Platforms Spark Single Machine, Custom AllReduce parameter server, AllReduce (Rabit), Spark, YARN - - Google Cloud Compute Engine Scoring Batch, Streaming Batch, API Batch Batch, API PMML export for some packages Batch, API Open Source Software
  • 104. @2016 LinkedIn Corporation. All Rights Reserved. 104 Microsoft Azure Machine Learning Google Cloud Machine Learning Platform (alpha) Amazon Machine Learning Databricks H20 Turi Interfaces API, Notebooks, GUI Workflows, R, python API, Notebooks, Programmatic Workflows, Java, python (alpha) API, console API, Notebooks, Programmatic Workflows, Java, Scala, python, R API, Notebook, R, Java, Scala, python Notebooks, C++, python Cloud Microsoft Azure Google Cloud Compute Engine Amazon AWS Amazon AWS Amazon AWS, Microsoft Azure*, Google Cloud*, your data center Amazon AWS, Microsoft Azure*, Google Cloud*, your data center Data Sources Hive, Azure blob storage, Azure table, Azure SQL OData feed, bulk upload, URL download text (including JSON), Google Cloud Bigtable, Google Cloud Datastore, byo Amazon S3, Amazon Redshift, Amazon RDS (SQL) Amazon S3, Amazon Redshift, mongoDB, mySQL, Shark, Hive, HDFS, byo, ... HDFS, file, URL download, Amazon S3 HDFS, Amazon S3, file, ODBC, Avro, SparkRDD, ... Modeling Techniques extensive regression & classification, k-means, Vowpal Wabbit TensorFlow: flexible deep learning, regression, ... binary, multiclass classification, regression MLlib: linear, logistic, tree (RF, GBDT), MF, survival regression, multiclass, k-means, LDA extensive regression & classification, GBDT, ensembles, deep learning extensive regression & classification, recommenders (factorization machines, MF, …), k-means, LDA, deep learning Limitations 10GB training datasets full info ML in limited preview Prediciton API: 2.5GB training datasets 100GB training datasets 1k input features full info Extensibility byo R, python, package imports TensorFlow SDK, Google Cloud Platform byo AWS services pipelines, byo Spark byo on your platform python, C++, byo platform Pretrained Models vision, speech, sentiment, ... speech, vision, translation, ... Scoring API, batch API, batch, planned export API, batch API, batch, streaming, PMML export API, JSON, POJO export API, Batch Machine Learning Platforms