This document discusses explaining machine learning model predictions. It begins by introducing the presenters and their focus on Python and scikit-learn. The document then discusses the importance of explaining model predictions for applications like loan approvals, fraud detection, and job recommendations. Next, it provides a toy example using the Titanic dataset to illustrate the need for explaining individual predictions. The document reviews approaches for linear models and random forests, highlighting limitations. It outlines the state of the art in interpreting random forests and demonstrates an approach using scikit-learn on the Titanic data. Finally, it discusses the additional data needed to better interpret random forest predictions for individual cases.
PyData Paris 2015 - Track 2.3 AXA
1. WHITENING THE BLACKBOX: WHY AND HOW TO EXPLAIN MACHINE LEARNING PREDICTIONS?
PyData 2015 / Paris
Christophe Bourguignat (@chris_bour)
Marcin Detyniecki
Bora Eang
2. We are a new Data team
We are heavy Python & scikit-learn users, and hopefully contributors
We like tricky problems, doubts, and questions
And love to share them with fellow data geeks!
DISCLAIMER - WHO ARE WE?
4. Why explain Machine Learning models?
Explain why a given loan application did not meet credit underwriting policy
Explain why a given transaction is suspicious
Explain why a given job is recommended to an unemployed person
7. Why explain Machine Learning models?
French « Conseil d'Etat » recommendation:
Impose a transparency requirement on algorithm-based decisions, covering the personal data used by the algorithm and the general reasoning it followed. Give the person subject to the decision the possibility of submitting their observations.
9. What do we actually want?
We don't ask for a "typical profile" of the selected population
We want a reason why an observation got selected by our algorithm
This reason must be simple and understandable (actionable), but can be specific to that observation
Observation A, next to observation B in our selected population, can be selected for a completely different reason
http://www.holehouse.org/mlclass/07_Regularization.html
10. Toy Example: Titanic Dataset (1/2)
11. Toy Example: Titanic Dataset (2/2)
These women were predicted with high confidence as « survived »
These women were predicted with high confidence as « not survived »
We are looking for a method that says: why?
14. A simple case: linear models
[Bar charts for Observation 1 and Observation 2: for each feature, the coefficient Beta, the value x, and the per-feature contribution Beta.x]
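To make the decomposition concrete: for a fitted linear model, the contribution of feature j to an observation's score is simply coef_[j] * x[j], and the decision function is the intercept plus the sum of those terms. A minimal sketch (the toy data and values are illustrative assumptions, not from the talk):

# Minimal sketch with made-up toy data: a linear model's score
# decomposes into one Beta_j * x_j contribution per feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 3.0], [2.0, 0.5], [0.0, 4.0], [3.0, 1.0]])
y = np.array([0, 1, 0, 1])
model = LogisticRegression().fit(X, y)

x = X[1]
contributions = model.coef_[0] * x  # Beta_j * x_j, one term per feature
print(contributions)
# intercept + sum of contributions == decision_function(x)
print(model.intercept_[0] + contributions.sum())
print(model.decision_function(x.reshape(1, -1))[0])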
15. Unfortunately, predictive models that are the most powerful are usually the least interpretable
http://machinelearningmastery.com/model-prediction-versus-interpretation-in-machine-learning/
16. scikit-learn includes the .feature_importances_ attribute …
Implementation from Breiman, Friedman, "Classification and Regression Trees", 1984 ("Gini Importance" or "Mean Decrease Impurity")
Louppe, 2014, "Understanding Random Forests", PhD dissertation
The R randomForest package also implements "Mean Decrease Accuracy"
… but it has nothing to show feature contributions for a given observation
A complicated case: Random Forests
http://ngm.nationalgeographic.com/2012/12/sequoias/quammen-text
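For illustration, a minimal sketch (iris data as a stand-in) of what the attribute does and does not give:

# Sketch: .feature_importances_ is one global score per feature
# (Mean Decrease Impurity), with no per-observation counterpart.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)  # one number per feature, whole model
# Nothing built-in answers "which features drove THIS prediction?"
print(forest.predict_proba(X[:1]))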
17. « What if » explanation
Sensitivity of the variable (i.e. derivative)
Feature contribution (« path approach »)
How to interpret a forest?
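The « what if » idea above can be probed directly: perturb one feature of an observation and see how the predicted probability moves. A minimal sketch under stated assumptions (dataset, perturbation size, and the what_if helper are illustrative, not from the talk):

# Hypothetical sketch of a « what if » probe: shift one feature and
# compare predicted probabilities before and after.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def what_if(model, x, feature, delta):
    """Change in P(positive class) when `feature` is shifted by `delta`."""
    base = model.predict_proba(x.reshape(1, -1))[0, 1]
    x_mod = x.copy()
    x_mod[feature] += delta
    return model.predict_proba(x_mod.reshape(1, -1))[0, 1] - base

x = X[0].astype(float)
for j in range(3):  # probe the first three features
    print(j, what_if(forest, x, j, delta=X[:, j].std()))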
19. State Of The Art
http://arxiv.org/abs/1312.1121
20. State Of The Art
http://blog.datadive.net/interpreting-random-forests/
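The post above decomposes each prediction as prediction = bias + sum of per-feature contributions, accumulated along the decision path; its author distributes this as the treeinterpreter package. A minimal sketch, assuming that package is installed (the dataset is an illustrative stand-in for the Titanic data):

# Sketch using the treeinterpreter package (pip install treeinterpreter),
# which implements the path decomposition from the blog post above:
# prediction = bias (root-node distribution) + per-feature contributions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

prediction, bias, contributions = ti.predict(forest, X[:1])
print(prediction[0])                           # predicted class probabilities
print(bias[0] + contributions[0].sum(axis=0))  # identical, by construction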
21. Scikit-Learn IPython demo
Playing with the Titanic data
Traversing the trees and using a trivial metric (see the sketch below):
+1 when a feature is crossed
+ impurity when a feature is crossed
Limitation: scikit-learn stores:
The number of samples for each node (tree_.n_node_samples)
The breakdown by class (tree_.value), but only for leaves
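A minimal sketch of that traversal (the dataset is an illustrative stand-in; the talk used the Titanic data): walk each tree's decision path for one observation and credit every feature tested along the way.

# Sketch of the demo's trivial metric: +1 (or + node impurity) each
# time a feature is crossed on the observation's path through a tree.
from collections import defaultdict
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def feature_crossings(forest, x):
    scores = defaultdict(float)
    for estimator in forest.estimators_:
        tree = estimator.tree_
        node = 0
        while tree.children_left[node] != -1:  # -1 marks a leaf
            f = tree.feature[node]
            scores[f] += 1                     # or: scores[f] += tree.impurity[node]
            if x[f] <= tree.threshold[node]:
                node = tree.children_left[node]
            else:
                node = tree.children_right[node]
    return scores

crossings = feature_crossings(forest, X[0])
for f, s in sorted(crossings.items(), key=lambda t: -t[1]):
    print(f, s)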
22. Be able to compare the subsets induced by each tree
What do we need? (1/2)
[Illustration: Mr Smith's observation traversing each tree of the forest]
27. Ideally, we would need easy access to all node attributes:
average_score
node_size (absolute or %)
number of class 0 samples (absolute or %)
number of class 1 samples (absolute or %)
…
For each tree
  For each node
  - metric += F(parent_node, node, left_child_node, right_child_node, brother_node)
  - e.g. F = parent_node.average_score – node.average_score (a sketch follows below)
What do we need? (2/2)
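A minimal sketch of that loop under stated assumptions: recent scikit-learn releases do populate tree_.value for internal nodes, so a node's average_score can be derived as its fraction of class-1 samples; F is applied between each node and the child actually taken, with the sign flipped versus the slide so that positive values push toward class 1. The dataset and helper names are illustrative.

# Sketch of the metric loop from the pseudocode above. Assumes
# tree_.value is populated for internal nodes (true in recent
# scikit-learn) and defines average_score as P(class 1) at a node.
from collections import defaultdict
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def average_score(tree, node):
    counts = tree.value[node][0]      # per-class weights at this node
    return counts[1] / counts.sum()   # fraction of class-1 samples

def path_contributions(forest, x):
    """metric[f] += child.average_score - node.average_score, per split."""
    scores = defaultdict(float)
    for estimator in forest.estimators_:
        tree = estimator.tree_
        node = 0
        while tree.children_left[node] != -1:
            f = tree.feature[node]
            child = (tree.children_left[node] if x[f] <= tree.threshold[node]
                     else tree.children_right[node])
            scores[f] += average_score(tree, child) - average_score(tree, node)
            node = child
    return scores

for f, s in sorted(path_contributions(forest, X[0]).items(), key=lambda t: -t[1]):
    print(f, round(s, 3))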
28. Thank You
and join us, we have many other problems to crack!
Christophe Bourguignat (@chris_bour)
Marcin Detyniecki
Bora Eang