This document discusses explaining machine learning model predictions. It begins by introducing the presenters and their focus on Python and scikit-learn. The document then discusses the importance of explaining model predictions for applications like loan approvals, fraud detection, and job recommendations. Next, it provides a toy example using the Titanic dataset to illustrate the need for explaining individual predictions. The document reviews approaches for linear models and random forests, highlighting limitations. It outlines the state of the art in interpreting random forests and demonstrates an approach using scikit-learn on the Titanic data. Finally, it discusses the additional data needed to better interpret random forest predictions for individual cases.
PyData Paris 2015 - Track 2.3 AXA
1. WHITENING THE BLACKBOX: WHY AND HOW TO EXPLAIN MACHINE LEARNING PREDICTIONS?
PyData 2015 / Paris
Christophe Bourguignat (@chris_bour)
Marcin Detyniecki
Bora Eang
2. We are a new Data team
We are heavy Python & scikit-learn users, and hopefully contributors
We like tricky problems, doubts, and questions
And love to share them with fellow data geeks!
DISCLAIMER - WHO ARE WE?
4. Why explain Machine Learning models?
Explain why a given loan application did not meet credit underwriting policy
Explain why a given transaction is suspicious
Explain why a given job is recommended to an unemployed person
7. Why explain Machine Learning models?
French « Conseil d'Etat » recommendation:
Impose a transparency requirement on algorithm-based decisions, covering the personal data used by the algorithm and the general reasoning it followed. Give the person subject to the decision the possibility of submitting their observations.
9. What do we actually want?
We don't ask for a "typical profile" of the selected population
We want a reason why an observation got selected by our algorithm
This reason must be simple and understandable (actionable), but can be specific to that observation
Observation A, next to observation B in our selected population, can be selected for a completely different reason
http://www.holehouse.org/mlclass/07_Regularization.html
10. Toy Example: Titanic Dataset (1/2)
11. Toy Example: Titanic Dataset (2/2)
These women were predicted with high confidence as « survived »
These women were predicted with high confidence as « not survived »
We are looking for a method that says: why?
14. A simple case: linear models
[Bar charts for Observation 1 and Observation 2: for each feature, the coefficient Beta, the value x, and the per-feature contribution Beta.x]
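To make the decomposition concrete: for a fitted linear model, the contribution of feature j to an observation's score is simply coef_[j] * x[j], and the decision function is the intercept plus the sum of those terms. A minimal sketch (the toy data and values are illustrative assumptions, not from the talk):

# Minimal sketch with made-up toy data: a linear model's score
# decomposes into one Beta_j * x_j contribution per feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 3.0], [2.0, 0.5], [0.0, 4.0], [3.0, 1.0]])
y = np.array([0, 1, 0, 1])
model = LogisticRegression().fit(X, y)

x = X[1]
contributions = model.coef_[0] * x  # Beta_j * x_j, one term per feature
print(contributions)
# intercept + sum of contributions == decision_function(x)
print(model.intercept_[0] + contributions.sum())
print(model.decision_function(x.reshape(1, -1))[0])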
15. Unfortunately, predictive models that are the most powerful are usually the least interpretable
http://machinelearningmastery.com/model-prediction-versus-interpretation-in-machine-learning/
16. scikit-learn includes the .feature_importances_ attribute …
Implementation from Breiman, Friedman, "Classification and Regression Trees", 1984 ("Gini Importance" or "Mean Decrease Impurity")
Louppe, 2014, "Understanding Random Forests", PhD dissertation
The R randomForest package also implements "Mean Decrease Accuracy"
… but it has nothing to show feature contributions for a given observation
A complicated case: Random Forests
http://ngm.nationalgeographic.com/2012/12/sequoias/quammen-text
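For illustration, a minimal sketch (iris data as a stand-in) of what the attribute does and does not give:

# Sketch: .feature_importances_ is one global score per feature
# (Mean Decrease Impurity), with no per-observation counterpart.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)  # one number per feature, whole model
# Nothing built-in answers "which features drove THIS prediction?"
print(forest.predict_proba(X[:1]))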
17. « What if » explanation
Sensitivity of the variable (i.e. derivative)
Feature contribution (« path approach »)
How to interpret a forest?
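The « what if » idea above can be probed directly: perturb one feature of an observation and see how the predicted probability moves. A minimal sketch under stated assumptions (dataset, perturbation size, and the what_if helper are illustrative, not from the talk):

# Hypothetical sketch of a « what if » probe: shift one feature and
# compare predicted probabilities before and after.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def what_if(model, x, feature, delta):
    """Change in P(positive class) when `feature` is shifted by `delta`."""
    base = model.predict_proba(x.reshape(1, -1))[0, 1]
    x_mod = x.copy()
    x_mod[feature] += delta
    return model.predict_proba(x_mod.reshape(1, -1))[0, 1] - base

x = X[0].astype(float)
for j in range(3):  # probe the first three features
    print(j, what_if(forest, x, j, delta=X[:, j].std()))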
19. State Of The Art
http://arxiv.org/abs/1312.1121
20. State Of The Art
http://blog.datadive.net/interpreting-random-forests/
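The post above decomposes each prediction as prediction = bias + sum of per-feature contributions, accumulated along the decision path; its author distributes this as the treeinterpreter package. A minimal sketch, assuming that package is installed (the dataset is an illustrative stand-in for the Titanic data):

# Sketch using the treeinterpreter package (pip install treeinterpreter),
# which implements the path decomposition from the blog post above:
# prediction = bias (root-node distribution) + per-feature contributions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

prediction, bias, contributions = ti.predict(forest, X[:1])
print(prediction[0])                           # predicted class probabilities
print(bias[0] + contributions[0].sum(axis=0))  # identical, by construction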
21. Scikit-Learn IPython demo
Playing with the Titanic data
Traversing the trees and using a trivial metric (see the sketch below):
+1 when a feature is crossed
+ impurity when a feature is crossed
Limitation: scikit-learn stores:
The number of samples for each node (tree_.n_node_samples)
The breakdown by class (tree_.value), but only for leaves
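A minimal sketch of that traversal (the dataset is an illustrative stand-in; the talk used the Titanic data): walk each tree's decision path for one observation and credit every feature tested along the way.

# Sketch of the demo's trivial metric: +1 (or + node impurity) each
# time a feature is crossed on the observation's path through a tree.
from collections import defaultdict
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def feature_crossings(forest, x):
    scores = defaultdict(float)
    for estimator in forest.estimators_:
        tree = estimator.tree_
        node = 0
        while tree.children_left[node] != -1:  # -1 marks a leaf
            f = tree.feature[node]
            scores[f] += 1                     # or: scores[f] += tree.impurity[node]
            if x[f] <= tree.threshold[node]:
                node = tree.children_left[node]
            else:
                node = tree.children_right[node]
    return scores

crossings = feature_crossings(forest, X[0])
for f, s in sorted(crossings.items(), key=lambda t: -t[1]):
    print(f, s)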
22. Be able to compare the subsets induced by each tree
What do we need? (1/2)
[Illustration: Mr Smith's observation traversing each tree of the forest]
27. Ideally, we would need easy access to all node attributes:
average_score
node_size (absolute or %)
number of class 0 samples (absolute or %)
number of class 1 samples (absolute or %)
…
For each tree
  For each node
  - metric += F(parent_node, node, left_child_node, right_child_node, brother_node)
  - e.g. F = parent_node.average_score – node.average_score (a sketch follows below)
What do we need? (2/2)
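A minimal sketch of that loop under stated assumptions: recent scikit-learn releases do populate tree_.value for internal nodes, so a node's average_score can be derived as its fraction of class-1 samples; F is applied between each node and the child actually taken, with the sign flipped versus the slide so that positive values push toward class 1. The dataset and helper names are illustrative.

# Sketch of the metric loop from the pseudocode above. Assumes
# tree_.value is populated for internal nodes (true in recent
# scikit-learn) and defines average_score as P(class 1) at a node.
from collections import defaultdict
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def average_score(tree, node):
    counts = tree.value[node][0]      # per-class weights at this node
    return counts[1] / counts.sum()   # fraction of class-1 samples

def path_contributions(forest, x):
    """metric[f] += child.average_score - node.average_score, per split."""
    scores = defaultdict(float)
    for estimator in forest.estimators_:
        tree = estimator.tree_
        node = 0
        while tree.children_left[node] != -1:
            f = tree.feature[node]
            child = (tree.children_left[node] if x[f] <= tree.threshold[node]
                     else tree.children_right[node])
            scores[f] += average_score(tree, child) - average_score(tree, node)
            node = child
    return scores

for f, s in sorted(path_contributions(forest, X[0]).items(), key=lambda t: -t[1]):
    print(f, round(s, 3))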
28. Thank You
and join us, we have many other problems to crack!
Christophe Bourguignat (@chris_bour)
Marcin Detyniecki
Bora Eang