Despite the wide array of advanced techniques available today, too many practitioners are forced back to an older toolkit of approaches deemed “more interpretable.” Whether driven by non-legal policy or the difficulty of presenting to executives, these constraints result from poor analytics communication and an inability to explain model risks and outcomes, not from a failing of the techniques.
From sampling to feature reduction to supervised modeling, the toolboxes and communications of data scientists are limited by these constraints. But instead of simplifying their models, data scientists can reintroduce often-ignored statistical practices to describe their models, the associated risks, and the impact of changes in the customer environment.
Even in situations without such restrictions, these approaches will improve how practitioners select models and communicate results. Through measurement and simulation, the approaches reviewed here can articulate the promises, risks, and assumptions of a model without requiring deep statistical explanations.
2. 2
• Director of Data Science at Think Big
• I work at the intersection of statistics and technology
• But also business and analytics
• Too often see data scientists limit themselves and their businesses
Dan Mallinger
3. 3
1. Importance of Communication
2. Lost Tools of Analytics Communication
3. Tricks for those in Regulated Environments
4. More Communication
Today
8. 8
Why We Should Care:
We Won’t Waste Money
Alas, not even a 250Gb server was sufficient: even after waiting three days, the data couldn't even be loaded. […]
Steve said it would be difficult for managers to accept a process that involved sampling.
9. 9
hlm.html('Test1', test1_score__eoy ~ test1_score__boy + ...
         is_special_ed * perc_free_lunch + ...
         other_score * support_rec + ...
         (is_focal | inst_sid), data=kinder)
Technically this is a regression…
So simple anyone can understand it!
Why We Should Care:
You Can’t Explain Your Models Anyway
10. 10
• If your model needs to be re-fit every month, it probably has an eating disorder
• Be a better communicator to yourself
Why We Should Care:
Some of Us Don’t Understand Our Models
18. 18
• If X matters, then shuffling it should hurt our model
• Then bootstrap for confidence intervals on the importance scores
• Most R models have a method for this (see caret); a minimal sketch follows below
Shining a light into the parameters of our black box
Variable Importance
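A minimal sketch of permutation importance in R, assuming a regression model fit with a standard predict method, a data frame df, and an outcome column name outcome (all names illustrative; caret::varImp gives model-specific importances out of the box):

# Permutation importance: shuffle one column and measure how much the
# model's error grows; repeat to get a distribution of scores.
perm_importance <- function(fit, df, outcome, n_reps = 100) {
  baseline <- mean((predict(fit, df) - df[[outcome]])^2)
  vars <- setdiff(names(df), outcome)
  sapply(vars, function(v) {
    drops <- replicate(n_reps, {
      shuffled <- df
      shuffled[[v]] <- sample(shuffled[[v]])  # break v's link to the outcome
      mean((predict(fit, shuffled) - df[[outcome]])^2) - baseline
    })
    quantile(drops, c(0.05, 0.5, 0.95))  # a rough confidence band per variable
  })
}

Repeated shuffles stand in for the bootstrap here; resampling rows before each shuffle would give a true bootstrap band.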
19. 19
Shining a light into the parameters of our black box
Variable Importance: Bob’s Data
20. 20
• Similar to variable importance
• How do relationships in our model play out in different settings?
• How much does our model depend on accurate measurement? (sketch below)
Sensitivity and Robustness
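A minimal sensitivity check in R, assuming the same illustrative fit and df as before: nudge one numeric input by a fraction of its standard deviation and watch how far the predictions move.

# Sensitivity: shift one input by delta standard deviations and report
# the average change in prediction. A model that swings on a small nudge
# depends heavily on accurate measurement of that variable.
sensitivity <- function(fit, df, var, delta = 0.1) {
  perturbed <- df
  perturbed[[var]] <- perturbed[[var]] + delta * sd(perturbed[[var]])
  mean(predict(fit, perturbed) - predict(fit, df))
}

Sweeping delta over a grid, or perturbing several variables at once, turns this single number into the robustness story the bullets above ask for.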
21. 21
Sensitivity and Robustness Example
My code wasn’t working, so thanks to:
https://beckmw.wordpress.com/2013/10/07/sensitivity-analysis-for-neural-networks/
23. 23
• Bob’s manager has told him that black box models are not allowed
• But Bob’s neural net performed better than anything else. Oh dear!
Dang!
24. 24
• Bob’s work in neural nets can be leveraged!
• Generically: Prototype selection
• Identify points on the decision boundary to improve the model
• Specifically: Extracting decision trees from neural nets (a sketch follows the methodology slide)
Blackbox to Whitebox
25. 25
Blackbox to Whitebox: Methodology
“Extracting Decision Trees from Trained Neural Networks” - Krishnan & Bhattacharya
Also: https://github.com/dvro/scikit-protopy
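Not the exact procedure from the paper above, but a sketch of the same idea via distillation, assuming a data frame df with a factor outcome y (illustrative names): let the trained net label the data, then grow an interpretable tree that mimics those labels.

library(nnet)   # the black box
library(rpart)  # the white box

# Distillation sketch: the tree becomes a white-box stand-in for the
# net, with splits that can be read and explained. This illustrates the
# general idea; Krishnan & Bhattacharya give a more specific extraction.
nn <- nnet(y ~ ., data = df, size = 10, decay = 0.01)
df$nn_label <- factor(predict(nn, df, type = "class"))
surrogate <- rpart(nn_label ~ . - y, data = df)

The surrogate won't agree with the net everywhere; checking their agreement on held-out data tells you how faithful the white-box story is.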
26. 26
• Bob has shown how variables impact his black box
• He’s shown how they behave in different contexts
• He’s shown how robust they are to errors
• But he hasn’t told us why we should care
Now What?
28. 28
• Enterprises are slow: Predict KPI not KRI
• Give confidence bands, sensitivities, and impact of context changes
• Build a story about the model internals and assumptions; tie it to the audience’s domain knowledge
• Explainability is up to the modeler, not the model *
• Unless, of course, your regulator says otherwise!
Conclusions