This talk walks through the stages of building data mining models, putting them into production, and eventually replacing them. A common theme throughout is three attributes of predictive models: accuracy, generalization, and description. I assert you can have all three, and having all three is important for managing the lifecycle. A subtle point is that this is a step toward embedded, automated data mining systems that can figure out for themselves when they need to be updated.
1. © 2016 LigaData, Inc. All Rights Reserved.
Production Model
Lifecycle Management
Presented: Tue, Sept 20, 2016
greg@ligadata.com www.Ligadata.org www.Kamanja.org
2.
Develop a Robust Solution (or get fired)
Selecting the Best Model w/ Model Notebook
Describing the Model
Putting a Model in Production
Model Drift over Time (Non-Stationary)
Retrain or Refresh the Model
Kamanja Open Source PMML Scoring Platform
Contents
Model attributes:
Accurate
General
Understandable
Can you have all 3 model attributes?
3.
Epsilon (owned by American Express then)
ACG's first neural network (1992) (~40 quants in Analytic Consulting Group)
Score 250mm households every month, pick the best 5mm hh
Neural net by a previous consultant:
did great "in the lab" !!
did "reasonable" month 1
Develop a Robust Solution (or get fired)
General
4.
Epsilon (owned by American Express then)
ACG's first neural network (1992) (~40 quants in Analytic Consulting Group)
Score 250mm households every month, pick the best 5mm hh
Neural net by a previous consultant:
did great "in the lab" !!
did "reasonable" month 1
did "worse" month 2
"bad" month 3 (no lift over random)
The prior consultant was fired
I was hired, and told why I was replacing him
My model captured the same response with 4mm hh mailed
was stable for 24+ months, saved $1mm / month
Why? Good KDD Process (Knowledge Discovery in Databases)
Develop a Robust Solution (or get fired)
General
5.
Contents – next: Selecting the Best Model w/ Model Notebook
7.
R package "caret"
A uniform parameter-search wrapper over 217 algorithms
http://topepo.github.io/caret/index.html
A "section" of a model notebook
Still need to track the results of each section
Model Notebook
Accurate
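caret's value is a single search interface over many algorithms, but you still have to track the outcome of every trial yourself. A minimal sketch of one notebook "section" (Python here rather than caret's R; the algorithm name, parameter grid, and scoring rule are all invented for illustration):

```python
import itertools

def run_section(algorithm, param_grid, evaluate):
    """One model-notebook 'section': try every parameter combination for
    one algorithm and record the outcome of each trial, best-first."""
    names = sorted(param_grid)
    trials = []
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        trials.append({"algorithm": algorithm,
                       "params": params,
                       "score": evaluate(params)})
    # Higher score is better in this sketch; keep the section sorted.
    trials.sort(key=lambda t: t["score"], reverse=True)
    return trials

# Toy stand-in for a real train/validate run (hypothetical scoring rule).
def toy_eval(params):
    return 1.0 - abs(params["depth"] - 4) * 0.1 - params["lr"]

section = run_section("toy_tree",
                      {"depth": [2, 4, 8], "lr": [0.1, 0.3]},
                      toy_eval)
best = section[0]
```

A real notebook would persist every section (algorithm, params, train and validation results) so the search is reproducible when you retrain later.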
8.
Bad vs. Good
217 R Algorithms Covered
Do you really want a one-off solution?
• Experimenting with Algorithms
• Experimenting with Algorithm Parameters
• Variable description → refine preprocessing
• …
• Deep Learning architectures have many parameters and network designs
Accurate
10. Model Notebook
Bad vs. Good
Q) What is the best outcome metric?
ROC, R², Lift, MAD, …
A) Deployment simulation of cost-value-strategy
Does the business problem mirror the 80-20 rule?
Just act on top 1% or top 5%?
Is the business deployment over all the score range [0 … 1]?
Just over the top 1% or 5% of the score (then NOT ROC, R², corr)
Are some records 5x or 20x more valuable?
→ Use cost-profit weighting, or a more complex system
Is this taught in mining competitions or classes?
Accurate in terms of business focus
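When deployment only acts on the top 1% or 5%, a direct way to simulate it is lift at the top of the score range. A minimal sketch (the toy scores and outcomes are made up):

```python
def lift_at_top(scores, outcomes, frac):
    """Lift at the top `frac` of scores: response rate among the
    highest-scored records divided by the overall response rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * frac))
    top_rate = sum(y for _, y in ranked[:k]) / k
    base_rate = sum(outcomes) / len(outcomes)
    return top_rate / base_rate

# Toy example: 10 records, the 2 responders score highest.
scores   = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1,   1,   0,   0,   0,   0,   0,   0,   0,   0]
lift = lift_at_top(scores, outcomes, 0.20)  # top 20% = 2 records
```

A lift of 1.0 means no better than random; a model that concentrates responders in the top slice scores much higher, which is exactly the behavior a top-N% mailing deployment pays for.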
11. Calculate $ of "Business Pain"
[Chart: forecast error distribution – zero error at center, with Over Stock and Under Stock tails]
Need to Deeply Understand Business Metrics
Accurate
12. Calculate $ of "Business Pain"
[Chart: same error distribution – 1% business pain $ vs. 15% business pain $ on the two tails]
"Equal mistakes → Unequal PAIN in $"
Need to Deeply Understand Business Metrics
At least use Type I vs. Type II weighting
Accurate in terms of business focus
13. Calculate $ of "Business Pain"
No way – that could get you fired!
New progress in getting feedback
[Chart: Over Stock tail – 4 week supply of SKU → 30% off sale; 1%, 15%, and 30% business pain $ regions around zero error]
"Equal mistakes → Unequal PAIN in $"
Accurate in terms of business focus
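The "unequal pain" idea above can be made concrete as an asymmetric cost metric: score the same absolute error differently depending on which side of zero it falls. A small sketch (the per-unit costs below are hypothetical, not from the talk):

```python
def business_pain(forecast, actual, under_cost, over_cost):
    """Dollar 'pain' of a stocking forecast: under-stock and over-stock
    errors of the same size can carry very different costs."""
    total = 0.0
    for f, a in zip(forecast, actual):
        if f < a:                     # under-stock: lost sales
            total += (a - f) * under_cost
        else:                         # over-stock: markdown / carrying cost
            total += (f - a) * over_cost
    return total

# Hypothetical costs: a unit short hurts twice as much as a unit over.
forecast = [100, 100]
actual   = [110, 90]      # one under by 10 units, one over by 10 units
pain = business_pain(forecast, actual, under_cost=3.0, over_cost=1.5)
```

Two "equal mistakes" of 10 units produce unequal dollar pain; ranking models by this number instead of RMSE is what aligns the notebook with the business.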
14. Model Notebook
Outcome Details
• My Heuristic Design Objectives (yours may be different):
– Accuracy in deployment
– Reliability and consistent behavior, a general solution
• Use one or more hold-out data sets to check consistency
• Penalize more, as the forecast becomes less consistent
– No penalty for model complexity (if it validates consistently)
– Develop a "smooth, continuous metric" to sort and find models that perform "best" in future deployment
What would you do?
15. Model Notebook
Outcome Details
• Training = results on the training set
• Validation = results on the validation hold-out
• Gap = abs(Training – Validation)
A bigger gap (volatility) is a bigger concern for deployment, a symptom
Minimize Senior VP heart attacks! (one penalty for volatility)
Set expectations & meet expectations
Regularization helps significantly
• Conservative Result = worst(Training, Validation) + Gap_penalty
Corr / Lift / Profit – higher is better: Cons Result = min(Trn, Val) - Gap
MAD / RMSE / Risk – lower is better: Cons Result = max(Trn, Val) + Gap
Business Value or Pain ranking = function of(conservative result)
Generalization: you can't optimize something you don't measure
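The conservative-result rules above translate directly into code; a small sketch using the talk's own formulas (the example metric values are invented):

```python
def conservative_result(train, valid, higher_is_better=True):
    """Penalize the train/validation gap so the notebook ranks
    consistent models above volatile ones."""
    gap = abs(train - valid)
    if higher_is_better:              # e.g. corr, lift, profit
        return min(train, valid) - gap
    return max(train, valid) + gap    # e.g. MAD, RMSE, risk

# Volatile model vs. consistent model (lift-style metric, higher better):
volatile   = conservative_result(0.90, 0.60)  # min 0.60 - gap 0.30 = 0.30
consistent = conservative_result(0.72, 0.70)  # min 0.70 - gap 0.02 = 0.68
```

The volatile model's flashy 0.90 training result loses to the steadier 0.72/0.70 model, which is the point: a single smooth number to sort the notebook by expected deployment behavior.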
17. Model Notebook Process
Tracking Detail – Training the Data Miner
[Table: model notebook tracking grid – Input / Test, Outcome, algorithm (Regression, AutoNeural, Neural, …), results at Top 5% / Top 10% / Top 20%]
Heuristic Strategy:
• Try a few models of many algorithm types (seed the search)
• Opportunistically spend more effort on what is working (invest in top stocks)
• Still try a few trials on medium success (diversify, limited by project time-box)
• Try ensemble methods, combining model forecasts & top source vars w/ model
The Data Mining Battle Field
18.
Contents – next: Describing the Model
19.
When Rejecting Credit – Law Requires 4 Record-Level Reasons
The law does not care how complex the model or ensemble was…
i.e. NOT sex, age, marital status, race, …
i.e. "over 180 days late on 2+ bills"
There are solutions to this constraint, for an arbitrary black box
The solutions have broad use in many areas of the model lifecycle
Understandable
20.
Should a data miner cut algorithm
choices, so they can come up with
reasons?
21.
97% of the time, NO!
(or let me compete with you)
Focus on the most GENERAL & ACCURATE system first
A VP does not need to know how to program a B+ tree, in order to
make a SQL vendor purchase decision. (Be a trusted advisor)
Should a data miner cut algorithm
choices, so they can come up with
reasons?
"I understand how a bike works, but I drive a car to work"
"I can explain the model, to the level of detail needed to drive your business"
Understandable
22.
Description Solution – Sensitivity Analysis
(OAT) One At a Time
https://en.wikipedia.org/wiki/Sensitivity_analysis
[Diagram: (S) source fields → arbitrarily complex data mining system → target field]
For source fields with binned ranges, sensitivity tells you the importance of the range, i.e. "low", … "high"
Can put sensitivity values in Pivot Tables or Cluster
Record-level "reason codes" can be extracted from the most important bins that apply to the given record
23.
Description Solution – Sensitivity Analysis
(OAT) One At a Time
[Diagram: (S) source fields → arbitrarily complex data mining system → target field]
Present record N, S times, each input 5% bigger (fixed input delta)
Record the delta change in output, S times per record
Aggregate: average(abs(delta)), target change per input field delta
Delta in forecast
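The OAT procedure above (present each record S times, one input bumped 5% each time, then average the absolute forecast deltas) can be sketched for any black-box scoring function; the toy model and records below are made up:

```python
def oat_sensitivity(model, records, delta=0.05):
    """One-At-a-Time sensitivity for an arbitrary black-box model:
    bump each input by a fixed 5% delta and average the absolute
    change in the forecast across records."""
    fields = sorted(records[0])
    sens = {f: 0.0 for f in fields}
    for rec in records:
        base = model(rec)
        for f in fields:
            bumped = dict(rec)
            bumped[f] = rec[f] * (1.0 + delta)
            sens[f] += abs(model(bumped) - base)
    return {f: s / len(records) for f, s in sens.items()}

# Toy black box: field 'a' matters far more than field 'b'.
model = lambda r: 10.0 * r["a"] + 0.5 * r["b"]
records = [{"a": 1.0, "b": 1.0}, {"a": 2.0, "b": 3.0}]
sens = oat_sensitivity(model, records)
```

Nothing here looks inside the model: the ranking comes purely from probing, which is why it works for any arbitrarily complex ensemble.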
24.
Description Solution – Sensitivity Analysis
Applying reasons per record (independent of variable ranking)
• Reason codes are specific to the model and record
• Ranked predictive fields:
                           record 1     record 2
                           Mr. Smith    Mr. Jones
  max_late_payment_120d    0            1
  max_late_payment_90d     1            0
  bankrupt_in_last_5_yrs   1            1
  max_late_payment_60d     0            0
• Mr. Smith's reason codes include:
  max_late_payment_90d 1
  bankrupt_in_last_5_yrs 1
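Extracting Mr. Smith's reason codes is then just a sort: of the bins that fire for his record, report the most important first. A sketch (the importance values are hypothetical stand-ins for sensitivity results):

```python
def reason_codes(record_bins, bin_importance, top_n=2):
    """Record-level reason codes: of the binary bins that apply to this
    record (value 1), report the most important first."""
    hits = [b for b, v in record_bins.items() if v == 1]
    hits.sort(key=lambda b: bin_importance[b], reverse=True)
    return hits[:top_n]

# Importance ranking from sensitivity analysis (hypothetical values).
importance = {"max_late_payment_120d": 0.9,
              "max_late_payment_90d": 0.7,
              "bankrupt_in_last_5_yrs": 0.6,
              "max_late_payment_60d": 0.4}

smith = {"max_late_payment_120d": 0, "max_late_payment_90d": 1,
         "bankrupt_in_last_5_yrs": 1, "max_late_payment_60d": 0}
codes = reason_codes(smith, importance)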
25.
Description Solution – Alternatives
R's caret offers some feature selection:
• http://topepo.github.io/caret/featureselection.html
Filter methods (univariate)
Wrapper methods
• Recursive feature elimination
• Simulated Annealing
• Genetic algorithms
Variable Importance
• http://topepo.github.io/caret/varimp.html
• Algorithm specific (9 kinds)
• Model Independent Metrics
If classification: ROC curve analysis (univariate) per predictor
If regression: fit a linear model
With variable ranking, you still need to relate field ranking to record reasons
Univariate methods do NOT cover variable interactions in the model, or non-linearity
Understandable
26.
Description Solution
Local Interpretable Model-agnostic Explanations (LIME)
"Why Should I Trust You?" Explaining the Predictions of Any Classifier – Knowledge Discovery in Databases (KDD) 2016 (August 13-17)
https://arxiv.org/abs/1602.04938 (PDF)
https://github.com/marcotcr/lime-experiments (Python code)
Describes models locally,
in terms of their variables
Minimize locality-aware loss
Understandable
27.
Description Solution
Local Interpretable Model-agnostic Explanations (LIME)
Understandable
28.
Contents – next: Putting a Model in Production
29.
Cut out extra preprocessed variables not used in the final model
Minimize passes over the data
In many situations, I have had to RECODE prep and/or model to meet production system requirements
• BAD: recode to Oracle, move SAS to mainframe & create JCL
Could take 2 months for conversion & full QA
• GOOD: generate PMML code for the model
Build up a PMML preprocessing library, like Netflix
Putting a Model in Production
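For reference, a PMML model is just XML that any DMG-compliant scorer can run, which is what makes the "generate, don't recode" path portable. A minimal hand-written illustration of the shape (the field names and coefficients are invented; a real file would be exported by your modeling tool):

```xml
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Illustrative linear model (example only)"/>
  <DataDictionary numberOfFields="2">
    <DataField name="max_late_payment_90d" optype="continuous" dataType="double"/>
    <DataField name="risk_score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="toy_risk" functionName="regression">
    <MiningSchema>
      <MiningField name="max_late_payment_90d"/>
      <MiningField name="risk_score" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="0.1">
      <NumericPredictor name="max_late_payment_90d" coefficient="0.35"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because the preprocessing can also be expressed in PMML transformations, the same document can carry both the prep and the model into production unchanged.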
30.
Putting a Model in Production
www.DMG.org/PMML/products
31.
Contents – next: Model Drift over Time (Non-Stationary)
32.
Tracking Model Drift
(easy to see with 2 input dimensions vs. score)
[Figure: 2-D scatter – Current Scoring Data shifted away from the Training Data region]
General
33.
A trained model is only as general as
the variety of behavior in the training data
the artifacts abstracted out by preprocessing
Good KDD process and variable design make the analysis universe like the general scoring universe
Over time, there is "drift" between the behavior represented in the scoring data and the original training data
Stock market cycles
Bull → Bear → Bull → …
Tracking Model Drift
General
34.
MODEL DRIFT DETECTOR in N dimensions
• Change in distribution of the target (alert over threshold)
During training, find thresholds for 10 or 20 equal-frequency bins of the score
During scoring, look at key thresholds around business decisions (act vs. not)
Has the % over the fixed threshold changed much?
• Change in distribution of the most important input fields
Diagnose CAUSES: what is changing, how much…
Out of the top 25% of the most important input fields…
Which had the largest change in a contingency table metric?
Tracking Model Drift
General & Description
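The first detector bullet (equal-frequency score bins fixed at training time, then watching how the share per bin moves during scoring) can be sketched as follows; the alert threshold and toy data are illustrative:

```python
def score_bin_drift(train_scores, new_scores, n_bins=10):
    """Drift check on the score distribution: fix equal-frequency bin
    thresholds on the training scores, then measure how far the share
    of new scores per bin moves from the expected 1/n_bins."""
    s = sorted(train_scores)
    # Upper thresholds of equal-frequency bins (last bin open-ended).
    cuts = [s[len(s) * i // n_bins - 1] for i in range(1, n_bins)]
    counts = [0] * n_bins
    for x in new_scores:
        b = sum(1 for c in cuts if x > c)   # index of the bin x falls in
        counts[b] += 1
    expected = 1.0 / n_bins
    shares = [c / len(new_scores) for c in counts]
    # Max absolute shift in bin share; alert when over a business threshold.
    return max(abs(p - expected) for p in shares)

train = [i / 100.0 for i in range(100)]          # uniform training scores
same  = [i / 100.0 + 0.001 for i in range(100)]  # nearly identical scoring run
drift = score_bin_drift(train, same)
shifted = score_bin_drift(train, [0.95] * 100)   # mass piles into the top bin
```

Because the thresholds are frozen at training time, a stable population keeps roughly 1/n of records per bin; a large shift in any bin, especially around the act/don't-act cut, is the alert.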
35.
A frequent process in companies – RETRAIN EVERY DAY
• Does yesterday's 4th of July sale training data best represent your 5th of July activity?
• Have you "forgotten" past lessons not in yesterday's data?
The Stability vs. Plasticity dilemma, or:
learn how to play the guitar without forgetting grandmother
What about fraud cases from 6 months ago?
The same issues exist in online training
• Drifting vs. forgetting?
Choose robustness and transparency, whichever you do
Tracking Model Drift
General & Description
36.
Contents – next: Retrain or Refresh the Model
37.
Model Retrain (1-2 months)
• Brute force: most effort, most expense, most reliable
• Repeat the full data mining model training project
• Re-evaluate all algorithms, preprocessing, ensembles
Model Refresh (3-5 days)
• "Minimal retraining"
• Just run the final 1-3 model trainings on "fresher" data
• Do not repeat exploring all algorithms and ensembles
• Assume the "structure" is a reasonable solution
• Go back to your prior Model Notebook – choose the best as a shortcut
Retrain, Refresh or Update DBC
General
38.
Contents – next: Kamanja Open Source PMML Scoring Platform
39.
Solution Architecture for Threat and Compliance
Lambda Architecture with Continuous Decisioning
[Architecture diagram with six numbered components]
40.
Solution Stack for Threat and Compliance
Leveraging Primarily Open Source Big Data Technologies
41.
Problem
Diverse Inputs
• Structured and unstructured data, with varying latencies
Data Enrichment
• Long and laborious process, manual and ad hoc
Quality of Threat Intelligence
• Lots of false positives waste analyst resources
Poor Integrations with Response Teams
• Manual and time-consuming process
Solution
• Ingest IP addresses, malware signatures, hash values, email addresses, etc. in real time
• Automatically enrich with third-party data
• Check historical logs against new threats continuously
• Predictive analytics based on machine learning flag suspicious activity before it becomes a problem
• Direct integration with dashboards to generate alerts and speed up investigation
Use Kamanja to detect potential cyber security breaches
Continuous Decisioning
Use Case: Cyber Threat Detection & Response
42.
Problem
• Legacy system is batch oriented
• Months required to create and implement new alerts
• Slow speed-to-market developing new source-system extracts; months required to assimilate new data
• Risks to PII and NPI, with compliance implications
Solution
• Use an open source big data stack to migrate to real-time data streaming, rapid model deployment, and alerts with no manual intervention
• Calculate the number of times PII/NPI is accessed over an eight-hour period, and calculate risk to generate alerts
• Machine learning to identify normal patterns of out-of-office-hours access; trigger automatic alerts when anomalies occur
• Rapid implementation of new models to deal with emerging threats
Use Kamanja to detect insider attacks to sensitive data
Continuous Decisioning
Use Case: Application Monitoring
43.
Problem
• Need timely alerting of potentially unauthorized trading activity
• Must tie together voluminous data, reports, and risk measures
• Meet increasingly stringent time requirements
Solution
• Create a Trader Surveillance Dashboard
• Provide a holistic view of a trader, based on all relevant information about the trader, the marketplace, and peers
• Build supervised and unsupervised machine learning models based on operational, transactional, and financial data
• Real-time analysis and monitoring of trader activity automatically highlights unusual activity and triggers alerts on trades to investigate
Use Kamanja to reduce the risk of rogue behavior at an investment bank
Continuous Decisioning
Use Case: Unauthorized Trading Detection
44.
Problem
• $16.3 billion in credit card fraud losses annually
• Fraud is growing more quickly than transaction value
• New types of fraud are one step ahead of existing solutions
• Dependence on third-party proprietary systems means slow reaction times and expensive changes
Solution
• Apply Kamanja to IVR, web, and transactional data to trigger alerts
• Initial models detect suspicious web traffic, common purchase points, and application rarity
• Leverage existing infrastructure as well as existing third-party systems (Falcon and TSYS)
• Reduce costs by 80% with open source software
Use Kamanja to incrementally reduce fraud losses by applying multiple predictive models for transaction authorization
Continuous Decisioning
Use Case: Credit Card Fraud Detection
45.
You can have it all: accurate, general & describable
• You may fully understand a bike – but drive a car to work (level of detail)
Control and plan complexity: track in a model notebook
• Reuse the notebook when you need to retrain
• Balance accuracy and generalization in the notebook outcomes
• Track business net value per model (be more competitive)
Model- and record-level description helps the model lifecycle
• Helps during model building, to improve preprocessing, DBC
• Helps gain trust
• Helps track model drift and degradation
Use Kamanja, a real-time decisioning engine, for production deployment
Summary
Accurate
General
Understandable
Model
46.
Thank You
Tuesday, September 20, 2016
greg@ligadata.com
www.Linkedin.com/in/GregMakowski
www.Kamanja.org (Apache open source licensed)