Surface features with nonparametric ML
How such algos might help (or not)
Alexis Bondu Sylvain Ferrandiz
Capturing patterns via model selection
and hyperparameter tuning is …
autoML
[Figure: the XOR problem, with panels Logistic Regression, MLP (4 neurons), Decision Tree]
XOR: Z = (XY > 0), with inputs X and Y and target Z
Or one can capture patterns via data prep and feature engineering …
With MODL nonparametric ML (why not?)
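The XOR slide can be made concrete with a small sketch. It assumes X and Y are drawn uniformly in [-1, 1] (the sampling scheme and all function names are illustrative, not from the deck) and shows that X alone says nothing about Z, while the 2 x 2 grid of signs of (X, Y) determines Z.

```python
import random

def make_xor_sample(n, seed=0):
    """Draw (x, y, z) rows with z = 1 when x * y > 0 (assumed uniform inputs)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        rows.append((x, y, int(x * y > 0)))
    return rows

def class_rate(rows, key):
    """Estimate P(Z = 1) inside each group defined by key(x, y)."""
    totals, positives = {}, {}
    for x, y, z in rows:
        group = key(x, y)
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + z
    return {group: positives[group] / totals[group] for group in totals}

rows = make_xor_sample(10_000)
by_x = class_rate(rows, lambda x, y: x > 0)                # univariate view
by_grid = class_rate(rows, lambda x, y: (x > 0, y > 0))    # bivariate grid

print("P(Z=1 | sign of X):", by_x)
print("P(Z=1 | signs of X and Y):", by_grid)
```

The univariate rates stay close to 0.5, whereas each cell of the grid is pure. A feature built from that grid is the kind of pattern the deck proposes to surface through data preparation rather than through model selection.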

How? MODL* in a nutshell
*M. Boullé. Data grid models for preparation and modeling in supervised learning.
In Hands-On Pattern Recognition: Challenges in Machine Learning, volume 1,
I. Guyon, G. Cawley, G. Dror, A. Saffari (eds.), pp. 99-130, Microtome Publishing, 2011.
1. The MODL
framework
For users to explore a new world!
1. Nonparametric set of ‘models’
2. Nonparametric regularized criterion
3. Nonparametric algorithm to find the best model
(with respect to the criterion)
Why?
Resource savvy
Interpretable*
Reliable
Good enough performance**
No grid-search, no cross-validation
*https://www.quora.com/What-makes-a-model-interpretable/answer/Claudia-Perlich
** http://automl.chalearn.org/
1. The MODL
framework
Let’s see how
1. MODL: the discretization algo
2. Non informative features detection
3. Drift detection
4. Model calibration
5. Data recoding
6. Supervised bivariate analysis
7. Co-clustering
8. Multi-table
9. Sequential rules extraction
You cannot win a Kaggle competition
by pushing a button, can you?
Either way, nonparametric ML can help
(Kaggle competition or business project)
Choice of the number of intervals
Choice of the bounds positions
Description of the class values distribution
Likelihood of the data given the model
1. The MODL
framework
Example: the discretization criterion
https://lnkd.in/dteteWA
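The four terms of the criterion can be sketched as a cost in nats. This assumes the formulation published by Boullé for MODL discretization, as we understand it: a uniform prior on the number of intervals, on the bounds, and on the class distribution of each interval, plus a multinomial likelihood term. The function names are hypothetical, and the formula should be checked against the reference before being relied on.

```python
from math import lgamma, log

def log_binomial(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def modl_discretization_cost(intervals):
    """Return (prior, likelihood) for a discretization.

    intervals: list of per-class count lists, one list per interval.
    """
    n_classes = len(intervals[0])
    n = sum(sum(counts) for counts in intervals)
    n_intervals = len(intervals)
    prior = log(n)                                               # number of intervals
    prior += log_binomial(n + n_intervals - 1, n_intervals - 1)  # bounds positions
    likelihood = 0.0
    for counts in intervals:
        n_i = sum(counts)
        prior += log_binomial(n_i + n_classes - 1, n_classes - 1)  # class distribution
        likelihood += lgamma(n_i + 1) - sum(lgamma(c + 1) for c in counts)
    return prior, likelihood

# Three candidate models for 20 instances: 10 of class 0 followed by 10 of class 1.
candidates = {
    "single interval": [[10, 10]],
    "two pure intervals": [[10, 0], [0, 10]],
    "one interval per instance": [[1, 0]] * 10 + [[0, 1]] * 10,
}
for name, model in candidates.items():
    prior, likelihood = modl_discretization_cost(model)
    print(f"{name}: prior={prior:.2f} likelihood={likelihood:.2f} total={prior + likelihood:.2f}")
```

On this toy sample the single interval has the cheapest prior but a costly likelihood, the per-instance model has a free likelihood but an expensive prior, and the two-interval model minimizes the total.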
1. The MODL
framework
Example: the discretization criterion
https://lnkd.in/dteteWA
Optimizing only the « Prior »: a single interval
Optimizing only the « Likelihood »: one interval for each « pure » zone
1. The MODL
approach
The example of the supervised discretization
N N x 2N / 2
The example of the variable « Age » of the dataset « Adult »
MODL avoids over-fitting, with no user parameters to adjust!
2. Non informative
features detection
Why do we want to select variables?
Var 1   Var 2   …   Class
O       12      …   A
Y       98      …   B
Y       4       …   A
1 – Scaling of the learning algorithms (N × K data table)
2 – Accuracy of the learned models (curse of dimensionality)
How can uninformative variables be filtered out before training a model?
• in a robust way (depending on N)
• without assumptions
Are x and y independent?
2. Non informative
features detection
The supervised discretization can be used as a non-parametric test
• Method: if the most probable model contains only a single interval or group,
the variable can be eliminated!
• Advantage: MODL is a universal approximator of P(y|x), so it can
detect any kind of dependence between x and y.
Dataset: 30 000 rows, 969 numerical variables + 2141 categorical variables
After filtering: 39 numerical variables + 49 categorical variables
Logistic Regression (default parameters):
- AUC: +0.06
- Computing time: x1500
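The single-interval test can be sketched as follows. This is a simplified illustration, not the authors' optimization algorithm: it starts from equal-frequency bins and greedily merges adjacent intervals while a MODL-style cost decreases (the cost formula is our reading of Boullé's published criterion; the function names and the starting number of bins are assumptions). A variable whose best model ends up with one interval is flagged as uninformative.

```python
from math import lgamma, log

def log_binomial(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def cost(intervals, n_classes):
    """Prior plus likelihood of a discretization given per-interval class counts."""
    n = sum(sum(counts) for counts in intervals)
    total = log(n) + log_binomial(n + len(intervals) - 1, len(intervals) - 1)
    for counts in intervals:
        n_i = sum(counts)
        total += log_binomial(n_i + n_classes - 1, n_classes - 1)
        total += lgamma(n_i + 1) - sum(lgamma(c + 1) for c in counts)
    return total

def greedy_discretize(values, labels, n_classes=2, n_start_bins=20):
    """Greedy bottom-up merge; returns the class counts of the final intervals."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = max(1, len(order) // n_start_bins)
    intervals = []
    for start in range(0, len(order), size):
        counts = [0] * n_classes
        for i in order[start:start + size]:
            counts[labels[i]] += 1
        intervals.append(counts)
    best = cost(intervals, n_classes)
    while len(intervals) > 1:
        candidates = []
        for k in range(len(intervals) - 1):
            merged = [a + b for a, b in zip(intervals[k], intervals[k + 1])]
            trial = intervals[:k] + [merged] + intervals[k + 2:]
            candidates.append((cost(trial, n_classes), trial))
        trial_cost, trial = min(candidates, key=lambda c: c[0])
        if trial_cost >= best:
            break  # no merge lowers the cost any further
        best, intervals = trial_cost, trial
    return intervals

def is_informative(values, labels):
    """A variable is kept only if its selected model has more than one interval."""
    return len(greedy_discretize(values, labels)) > 1

values = [i / 200 for i in range(200)]
step_labels = [int(i >= 100) for i in range(200)]   # class depends on the value
noise_labels = [i % 2 for i in range(200)]          # class unrelated to the value
print("step variable informative:", is_informative(values, step_labels))
print("noise variable informative:", is_informative(values, noise_labels))
```

No threshold or significance level is passed in: the decision falls out of comparing the cost of candidate models, which is the point made on the slide.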
3. Drift detection
What is drift?
[Figure: Train Model, Deploy, Scoring, Drift]
How to detect it?
[Figure: Train and Deploy rows stacked, with a label y = 0 for Train rows and y = 1 for Deploy rows]
Method: detection of univariate « drift »
• The MODL discretization is a universal approximator
=> No hypothesis on the shape of the « drift »
3. Drift detection
$$GC = 1 - \frac{-\log P(M \mid D)}{-\log P(M_0 \mid D)}$$
Definition: comparison between the coding length of the current model M and that of the simplest model M0, which includes a single interval
Output : variables sorted by drift level
Dataset
Reliable measurement
(compression gain)
A real life example …
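A hedged sketch of the drift measurement: rows are labeled 0 when they come from the training set and 1 when they come from deployment, then the variable is discretized against that label and the compression gain is computed. Two simplifications are assumed here and are not from the deck: the model cost is a MODL-style prior plus likelihood standing in for -log P(M | D), and only single-cut (two-interval) models are searched. The function names are hypothetical.

```python
from math import lgamma, log

def log_binomial(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def cost(intervals):
    """MODL-style cost of a discretization given per-interval [train, deploy] counts."""
    n_classes = len(intervals[0])
    n = sum(sum(counts) for counts in intervals)
    total = log(n) + log_binomial(n + len(intervals) - 1, len(intervals) - 1)
    for counts in intervals:
        n_i = sum(counts)
        total += log_binomial(n_i + n_classes - 1, n_classes - 1)
        total += lgamma(n_i + 1) - sum(lgamma(c + 1) for c in counts)
    return total

def drift_gain(train_values, deploy_values):
    """Compression gain of the best single-cut model over the null model M0."""
    rows = sorted([(v, 0) for v in train_values] + [(v, 1) for v in deploy_values])
    totals = [len(train_values), len(deploy_values)]
    null_cost = cost([totals])
    best_cost = null_cost
    left = [0, 0]
    for idx, (value, origin) in enumerate(rows[:-1]):
        left[origin] += 1
        if value == rows[idx + 1][0]:
            continue  # a cut cannot separate equal values
        right = [totals[0] - left[0], totals[1] - left[1]]
        best_cost = min(best_cost, cost([left, right]))
    return 1.0 - best_cost / null_cost

train = {"age": [i / 100 for i in range(100)], "income": [i / 100 for i in range(100)]}
deploy = {"age": [i / 100 for i in range(100)], "income": [1 + i / 100 for i in range(100)]}
gains = {name: drift_gain(train[name], deploy[name]) for name in train}
for name in sorted(gains, key=gains.get, reverse=True):
    print(f"{name}: compression gain = {gains[name]:.3f}")
```

Variables are printed from most to least drifted. The variable names and values are made up for illustration; a gain of 0 means no cut point distinguishes training rows from deployment rows better than the single-interval model.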
4. Model
calibration
Logistic regression : shape of the output
P(y=1|var1, var2)
P(y=1|X)
Logistic regression on the Adult dataset
Some classifiers distort the output estimated probabilities …
How can this problem be solved in a robust way, without assumptions?
4. Model
calibration
The supervised discretization is suitable for this problem
Estimated P(y=1|X): output of the model
Calibrated P(y=1|X)
Logistic regression on the “Adult” dataset
Robustness: the number and the size of the intervals depend on N
P(y=1|X) y
0.967 1
0.865 1
0.765 0
0.75 1
New training set
Accuracy: the calibrated distribution is not necessarily monotonic (universal approximator)
In this case, there is an improvement
of the AUC: +0.09
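The calibration step can be sketched as a lookup from score intervals to observed class frequencies. In this minimal example the interval bounds are passed in by hand, whereas in the approach described they would come from the supervised discretization of the score; the data and function names are illustrative assumptions.

```python
from bisect import bisect_right

def fit_calibration(scores, labels, bounds):
    """Observed frequency of y = 1 in each score interval.

    bounds: sorted interior cut points on the score axis (assumed given here).
    """
    totals = [0] * (len(bounds) + 1)
    positives = [0] * (len(bounds) + 1)
    for score, y in zip(scores, labels):
        k = bisect_right(bounds, score)
        totals[k] += 1
        positives[k] += y
    return [p / t if t else None for p, t in zip(positives, totals)]

def calibrate(score, bounds, rates):
    """Replace a raw score by the frequency observed in its interval."""
    return rates[bisect_right(bounds, score)]

# New training set: (model output, true label) pairs, made up for illustration.
scores = [0.10, 0.20, 0.30, 0.55, 0.60, 0.70, 0.90, 0.95]
labels = [0, 0, 1, 0, 1, 1, 1, 1]
bounds = [0.5, 0.8]

rates = fit_calibration(scores, labels, bounds)
print("calibrated P(y=1) per interval:", rates)
print("raw 0.65 ->", calibrate(0.65, bounds, rates))
```

Because each interval gets its own frequency, nothing forces the calibrated values to increase with the raw score, which matches the remark that the calibrated distribution need not be monotonic.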
5. Data recoding
[Figure: fit maps each Color level to its Danger rate (100%, 30%, 0%); transform replaces each Color value by P(danger | color) (1.0, 0.3, 0.0)]
Most ML algorithms process only numerical variables …
Advantages
• Encodes categorical variables as numerical ones, regardless of the number of levels
• Limited number of recoded variables (number of classes - 1)
• Gain in robustness
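A minimal fit/transform sketch of this recoding for a binary target. It assumes each level keeps its own rate, whereas the approach described would first group levels with the supervised grouping; the class name, the color names, and the fallback for unseen levels are illustrative choices.

```python
from collections import defaultdict

class ConditionalProbabilityEncoder:
    """Recode a categorical variable as P(target = 1 | level)."""

    def fit(self, categories, targets):
        totals = defaultdict(int)
        positives = defaultdict(int)
        for category, target in zip(categories, targets):
            totals[category] += 1
            positives[category] += target
        self.rates_ = {c: positives[c] / totals[c] for c in totals}
        # Fallback for levels never seen during fit: the overall target rate.
        self.default_ = sum(positives.values()) / sum(totals.values())
        return self

    def transform(self, categories):
        return [self.rates_.get(c, self.default_) for c in categories]

# Toy data: "red" is always dangerous, "orange" 3 times out of 10, "green" never.
colors = ["red"] * 2 + ["orange"] * 10 + ["green"] * 3
danger = [1, 1] + [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [0, 0, 0]

encoder = ConditionalProbabilityEncoder().fit(colors, danger)
encoded = encoder.transform(["red", "orange", "green", "blue"])
print(encoded)
```

With two classes a single numerical column is produced, in line with the "number of classes - 1" remark; the unseen level "blue" receives the overall danger rate.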
6. Supervised
bivariate analysis
Applied to explain interest level
for a listing
6. Supervised
bivariate analysis
7. Co-clustering
270,000
cookies
40,000
websites
visiting
6,600,000
data points
427 groups
of websites
1,612 groups
of cookies
What it is, from a real use case
7. Co-clustering
How to use it to explain interest
level for a listing
7. Co-clustering
Applied to explain interest level
for a listing
We have a
new feature!
Relational schema:
Users (CustomerId, Firstname, Lastname, Age)
Sales (CustomerId, Product, Amount, Time)
Web (CustomerId, Page, Time)
Columns of the flattened table:
Users.CustomerId
Users.Firstname
Users.Lastname
Users.Age
Outcome
Count(Sales.Product)
CountDistinct(Sales.Product)
Mean(Sales.Amount)
Sum(Sales.Amount) where Sales.Product = 'Mobile Data'
Count(Web.Page) where Day(Web.Time) in [6;7]
…
8. Multi-table
feature
engineering
What it is
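The aggregates listed for the Users, Sales and Web tables can be sketched in plain Python. The records are made up, and reading `Day(Web.Time) in [6;7]` as "Saturday or Sunday" (ISO weekday 6 or 7) is an assumption; the function name is hypothetical.

```python
from datetime import datetime

sales = [
    {"CustomerId": 1, "Product": "Mobile Data", "Amount": 10.0, "Time": datetime(2024, 1, 3)},
    {"CustomerId": 1, "Product": "Handset", "Amount": 200.0, "Time": datetime(2024, 1, 4)},
    {"CustomerId": 1, "Product": "Mobile Data", "Amount": 15.0, "Time": datetime(2024, 1, 5)},
    {"CustomerId": 2, "Product": "Handset", "Amount": 150.0, "Time": datetime(2024, 1, 5)},
]
web = [
    {"CustomerId": 1, "Page": "/offers", "Time": datetime(2024, 1, 6)},   # Saturday
    {"CustomerId": 1, "Page": "/help", "Time": datetime(2024, 1, 7)},     # Sunday
    {"CustomerId": 1, "Page": "/offers", "Time": datetime(2024, 1, 8)},   # Monday
]

def build_features(customer_id, sales, web):
    """Flatten the secondary tables into one row of aggregates for a customer."""
    own_sales = [row for row in sales if row["CustomerId"] == customer_id]
    own_web = [row for row in web if row["CustomerId"] == customer_id]
    amounts = [row["Amount"] for row in own_sales]
    return {
        "Count(Sales.Product)": len(own_sales),
        "CountDistinct(Sales.Product)": len({row["Product"] for row in own_sales}),
        "Mean(Sales.Amount)": sum(amounts) / len(amounts) if amounts else None,
        "Sum(Sales.Amount) where Sales.Product = 'Mobile Data'": sum(
            row["Amount"] for row in own_sales if row["Product"] == "Mobile Data"
        ),
        "Count(Web.Page) where Day(Web.Time) in [6;7]": sum(
            1 for row in own_web if row["Time"].isoweekday() in (6, 7)
        ),
    }

features = build_features(1, sales, web)
for name, value in features.items():
    print(f"{name}: {value}")
```

Each aggregate becomes one column of the flattened table, keyed by CustomerId, so that a standard single-table learner can consume it alongside the Users columns.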
8. Multi-table
feature
engineering
1 listing, 3 photos
=
Relational data
8. Multi-table
feature
engineering
Some results
Class, Sequence
0, <A,B,D,D,D,E,B,A,D,A,E,A,D>
1, <D,A,B,D,E,D,A,D,E,D,A,D,A,E,D,D,D,E,A,D,D,E,D,A,D,E>
1, <A,C,C,V,A,C,C,A,V,V,A,C,C,A,V,V,A,C,C,A,V>
0, <C,A,B,D,A,C,B,A,E,A,C>
0, <B,A,C,B,C,A,B,E>
1, <A,C,B,A,B,C,D,A,E>
0, <A,B,B,A,C,B,A,C,C,A,B,B,A,C>
1, <A,B,C>
0, <A,B,C,A,B,E,E>
1, <B,C,A,C,C,A,E,E,D,A,E,D,A>
1, <A,B,C,D,A,B,C,E>
0, <A,B,B,C,A,C,D,A,C,D,A,B,B,A,C,D,A,A,B,C,A,D,E,A,E,C,C,A,D>
1, <A,B,C,B,A,C,B,B,C,A,B,D,D,D,A,E>
0, <A,B,C,A,B,C,D,A,D,C,C,A,D,A,E>
0, <A,B,B,C,A,C,E,E,E>
Examples of sequential data: DNA, texts, web sessions, predictive maintenance
9. Sequential rule
extraction
Another kind of variable
9. Sequential rule
extraction
Abstract
Whole genome RNA expression studies permit systematic approaches to understanding
the correlation between gene_expression profiles to disease states or different
developmental stages of a cell. Microarray analysis provides quantitative_information
about the complete transcription profile of cells that facilitate drug and therapeutics
development, disease_diagnosis, and understanding in the basic cell biology. One of the
challenges in microarray analysis, especially in cancerous gene_expression profiles, is to
identify genes or groups of genes that are highly expressed in tumour_cells but not in
normal cells and vice versa. Previously, we have shown that ensemble machine_learning
consistently performs well in classifying biological data. In this paper, we focus on three
different supervised machine_learning techniques in cancer classification, namely C4.5
decision_tree, and bagged and boosted decision_trees.
< classifying, data > → P(ML) = 95%, P(medicine) = 5%
Two classes of scientific articles : medicine, machine learning
Example : categorization of texts
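A rule such as < classifying, data > can be read as "the items appear in this order somewhere in the sequence". The sketch below assumes gaps between items are allowed (the deck does not say whether matches must be contiguous) and estimates the class distribution among the sequences a rule covers. The documents and function names are made up for illustration.

```python
def matches(rule, sequence):
    """True if the items of rule occur in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # "item in remaining" advances the iterator, so order is enforced.
    return all(item in remaining for item in rule)

def class_distribution(rule, sequences, labels):
    """Class frequencies among the sequences covered by the rule."""
    counts = {}
    for sequence, label in zip(sequences, labels):
        if matches(rule, sequence):
            counts[label] = counts.get(label, 0) + 1
    covered = sum(counts.values())
    return {label: count / covered for label, count in counts.items()} if covered else {}

documents = [
    ("ensemble learning performs well in classifying biological data", "ML"),
    ("classifying text data with decision trees", "ML"),
    ("classifying streaming data", "ML"),
    ("patient data collected before classifying tumours", "medicine"),
    ("classifying tumours from clinical data", "medicine"),
]
sequences = [text.split() for text, _ in documents]
labels = [label for _, label in documents]

distribution = class_distribution(("classifying", "data"), sequences, labels)
print(distribution)
```

The fourth document contains both words but in the wrong order, so the rule does not cover it; the remaining four give the class distribution attached to the rule.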
9. Sequential rule
extraction
Robustness of the compression gain illustrated by using the dataset « skater »
Confidence Growth rate
Compression gain
MODL
$$GC = 1 - \frac{-\log P(M \mid D)}{-\log P(M_0 \mid D)}$$
Recall: the compression gain compares the coding length of the current model with
that of the null model M0, which does not include any element in the rule.
9. Sequential rule
extraction
Recoding the rules & Training of a classifier
Ensemble of
informative rules
A B C D E F G
0 0 1 0 1 0 0
1 1 0 0 0 1 0
0 1 0 0 0 0 1
1 1 0 0 0 0 0
0 1 0 1 0 0 0
Binary recoding
Rules
Observations
Training of the
classifier
Compression gain > 0
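The binary recoding can be sketched as one 0/1 column per retained rule and one row per observation. The four sequences below are taken from the "Class, Sequence" example; the three rules are hypothetical, matching is assumed to allow gaps between items, and the filter on compression gain > 0 is taken as already applied.

```python
def matches(rule, sequence):
    """True if the items of rule occur in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in rule)

def recode(sequences, rules):
    """One row per observation, one binary column per rule."""
    return [[int(matches(rule, sequence)) for rule in rules] for sequence in sequences]

observations = [
    (0, "A,B,D,D,D,E,B,A,D,A,E,A,D"),
    (1, "A,B,C"),
    (0, "B,A,C,B,C,A,B,E"),
    (1, "A,C,B,A,B,C,D,A,E"),
]
classes = [label for label, _ in observations]
sequences = [text.split(",") for _, text in observations]

# Hypothetical informative rules; in practice only rules with gain > 0 are kept.
rules = [("A", "B", "C"), ("D", "E"), ("C", "A")]

matrix = recode(sequences, rules)
for label, row in zip(classes, matrix):
    print(label, row)
```

The resulting matrix, together with the class column, is an ordinary tabular dataset on which any classifier can then be trained.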
9. Sequential rule
extraction
Examples of extracted rules
Amazon reviews : sentiment analysis
- No preprocessing
- 2 classes
- AUC = 0.911 with 500 rules
• « I + highly + recommend »
• « dont + waste + your + money »
• « This + is + a + great »
SMS :
- No preprocessing
- 2 classes (spam / non spam)
- AUC = 0.96 with 50 rules
• « FREE »
• « URGENT!»
• « $1000 »
E-mails Reuters :
- No preprocessing
- 10 classes
- 4 sequential variables (organization / place / subject / body)
- AUC = 0.975 with 1000 rules
• « the + acquisition + of »
• « crude + oil »
• « trade + surplus »
Takeaways
Nonparametric ML can help
(Kaggle competition or business project)
Complementary to autoML
Non informative features detection
Drift detection
Model calibration
Data recoding
Supervised bivariate analysis
Co-clustering
Multi-table
Sequential rules extraction
… Yours soon?
Want to use it?
What is Edge ML?
- A new kind of AutoML library
- Optimized using C++ and OpenMP
- Easy to use (simple command lines)
- Integrated with Python
Who can use Edge ML?
- Data scientists, to secure and accelerate their projects
- Everyone, using the automatic mode
How to get more information?
- www.edge-ml.fr
- www.marc-boulle.com (MODL approach)
Edge ML is free for competitors, students and professors. Enjoy!
Want to use it?
Happy users: Coders, Clickers, Analysts, Experts, Scientists, Devops
Infrastructure: On-premise, As-a-service
14-day free trial!
https://predicsis.ai/free-trial/
Let’s stay in touch!
Alexis BONDU
alexis.bondu@edge-ml.com
Sylvain FERRANDIZ
sfe@predicsis.ai

Surface features with nonparametric machine learning

  • 1. Surface features with nonparametric ML How such algos might help (or not ) Alexis Bondu Sylvain Ferrandiz
  • 2. Capturing patterns via model selection and hyperparameter tuning is … autoML Logistic Regression XOR MLP 4 neuronsDecision Tree   
  • 3. XOR X Y Z = (XY > 0) Z With MODL nonparametric ML (why not?) Or one can capture patterns via data prep’ and feature engineering … 
  • 4. How? MODL* in a nutshell *M. Boullé. Data grid models for preparation and modeling in supervised learning. In Hands-On Pattern Recognition: Challenges in Machine Learning, volume 1, I. Guyon, G. Cawley, G. Dror, A. Saffari (eds.), pp. 99-130, Microtome Publishing, 2011. 1. The MODL framework For users to explore a new world! 3. Nonparametric algorithm to find the best model (with respect to the criterion) 1. Nonparametric set of ‘models’ 2. Nonparametric regularized criterion
  • 6. Let’s see how 1. MODL: the discretization algo 2. Non informative features detection 3. Drift detection 4. Model calibration 5. Data recoding 6. Supervised bivariate analysis 7. Co-clustering 8. Multi-table 9. Sequential rules extraction You cannot win a Kaggle competition by pushing a button, can you? Whatever, nonparametric ML can help (Kaggle competition or business project)
  • 7. Choice of the number of intervals d’intervalles Choice of the bounds positions Description of the classe values distribution Likelihood of the data given the model 1. The MODL framework Example: the discretization criterion https://lnkd.in/dteteWA
  • 8. 1. The MODL framework Optimizing only the « Prior » On interval for each « pure » zone Optimizing only the « Likelihood » Example: the discretization criterion https://lnkd.in/dteteWA A single interval
  • 9. 1. The MODL approach The example of the supervised discretization N N x 2N / 2 The example of the variable « Age » of the dataset « Adult » MODL avoids over-fitting, without user parameters to be adjusted !
  • 10. 2. Non informative features detection Why we want to select variables ? Var 1 Var 2 … Class O 12 … A Y 98 … B Y 4 … A 1 – Scaling of the learning algorithms N K 2 – Accuracy of the learned models Curse of dimensionality How to filter uninformative variables before training a model ? • in a robust way (depending on N) • without assumption x y Independence ?
  • 11. 2. Non informative features detection The supervised discretization can be used as a non-parametric test  • Method : If the most probable model only contains a single interval / group, the variable can be eliminated! • Advantage : MODL is a universal approximator of P(y|x), thus it is able to detect any kind of correlation. 969 numerical variables + 2141 categorical variables 39 numerical variables + 49 categorical variables After filtering Dataset 30 000 rows Logistic Regression - default parameters - AUC: +0.06 - Computing time: x1500
  • 12. 3. Drift detection Method : detection of univariate « drift » • The MODL discretization is a universal approximator => No hypothesis on the shape of the « Drift »  Train Model Deploy Drift Scoring What is drift ? Train Deploy y 0000000001111111111 How detect it ?
  • 13. 3. Drift detection GC =1- -log(P(M | D)) -log(P(M0 | D)) Definition : Comparison between coding length of the current model and the simplest model wich include a single interval Output : variables sorted by drift level Dataset Reliable measurement (compression gain) A real life example …
  • 14. 4. Model calibration Logistic regression : shape of the output P(y=1|var1, var2) P(y=1|X) Logistic regression on the Adult dataset Some classifiers distort the output estimated probabilities … How to solve this problem in a robust way, without assumptions ?
  • 15. 4. Model calibration The supervised discretization is suitable for this problem  Estimated P(y=1|X) : output of the model CalibratedP(y=1|X) Logistic regression on the “Adult” dataset Robustness: the number and the size of the intervals depend on N P(y=1|X) y 0.967 1 0.865 1 0.765 0 0.75 1 New training set Accuracy: the calibrated distribution is not necessary monotonous universal approximator In this case, there is an improvement of the AUC: +0.09
  • 16. 5. Data recoding Color Danger 100% 30% 0% fit Color P(danger | color) 1.0 1.0 0.3 0.3 0.0 0.0 0.0 0.3 0.0 1.0 1.0 0.3 transform Advantages • Encodes categorical into numerical variables, regardless of the levels number • Limited number of recoded variables (nb classes -1) • Gain of robustness The most of ML algorithms process only numerical variables …
  • 18. Applied to explain interest level for a listing 6. Supervised bivariate analysis
  • 19. 7. Co-clustering 270,000 cookies 40,000 websites visiting 6,600,000 data points 427 groups of websites 1,612 groups of cookies What it is, from a real use case
  • 20. 7. Co-clustering How to use it to explain interest level for a listing
  • 21. 7. Co-clustering Applied to explain interest level for a listing We have a new feature!
  • 23. 8. Multi-table feature engineering 1 listing, 3 photos = Relational data
  • 26. 9. Sequential rule extraction. Another kind of variable: sequences (DNA, texts, web sessions, predictive maintenance).
      Class, Sequence
      0, <A,B,D,D,D,E,B,A,D,A,E,A,D>
      1, <D,A,B,D,E,D,A,D,E,D,A,D,A,E,D,D,D,E,A,D,D,E,D,A,D,E>
      1, <A,C,C,V,A,C,C,A,V,V,A,C,C,A,V,V,A,C,C,A,V>
      0, <C,A,B,D,A,C,B,A,E,A,C>
      0, <B,A,C,B,C,A,B,E>
      1, <A,C,B,A,B,C,D,A,E>
      0, <A,B,B,A,C,B,A,C,C,A,B,B,A,C>
      1, <A,B,C>
      0, <A,B,C,A,B,E,E>
      1, <B,C,A,C,C,A,E,E,D,A,E,D,A>
      1, <A,B,C,D,A,B,C,E>
      0, <A,B,B,C,A,C,D,A,C,D,A,B,B,A,C,D,A,A,B,C,A,D,E,A,E,C,C,A,D>
      1, <A,B,C,B,A,C,B,B,C,A,B,D,D,D,A,E>
      0, <A,B,C,A,B,C,D,A,D,C,C,A,D,A,E>
      0, <A,B,B,C,A,C,E,E,E>
  • 27. 9. Sequential rule extraction. Example: categorization of texts. Two classes of scientific articles: medicine, machine learning.
      Abstract: Whole genome RNA expression studies permit systematic approaches to understanding the correlation between gene_expression profiles to disease states or different developmental stages of a cell. Microarray analysis provides quantitative_information about the complete transcription profile of cells that facilitate drug and therapeutics development, disease_diagnosis, and understanding in the basic cell biology. One of the challenges in microarray analysis, especially in cancerous gene_expression profiles, is to identify genes or groups of genes that are highly expressed in tumour_cells but not in normal cells and vice versa. Previously, we have shown that ensemble machine_learning consistently performs well in classifying biological data. In this paper, we focus on three different supervised machine_learning techniques in cancer classification, namely C4.5 decision_tree, and bagged and boosted decision_trees.
      Extracted rule: < classifying, data > → P(ML) = 95%, P(medicine) = 5%
  • 28. 9. Sequential rule extraction. Robustness of the compression gain, illustrated on the « skater » dataset (measures compared: confidence, growth rate, MODL compression gain). Recall: $GC = 1 - \frac{-\log P(M \mid D)}{-\log P(M_0 \mid D)}$. The compression gain compares the coding length of the current model with that of the null model M0, which includes no element in the rule.
  • 29. 9. Sequential rule extraction. Recoding the rules & training of a classifier. Ensemble of informative rules (compression gain > 0) → binary recoding → training of the classifier.
      Binary recoding (columns: rules, rows: observations):
      A B C D E F G
      0 0 1 0 1 0 0
      1 1 0 0 0 1 0
      0 1 0 0 0 0 1
      1 1 0 0 0 0 0
      0 1 0 1 0 0 0
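The binary recoding step can be sketched as follows: one column per rule, one row per observation, and a 1 when the rule occurs in the sequence. The matching semantics are an assumption made for the example (a rule matches when its items appear in the same order, gaps allowed); the slides do not say whether the items must be contiguous. `matches` and `recode` are hypothetical helper names.

```python
def matches(rule, sequence):
    """True if the items of `rule` appear in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match,
    # so later items are only searched after earlier ones.
    return all(item in remaining for item in rule)


def recode(sequences, rules):
    """Binary matrix: rows are observations, columns are rules."""
    return [[int(matches(rule, seq)) for rule in rules] for seq in sequences]


# Two of the sequences shown on the earlier slide, with illustrative rules.
sequences = [
    ["A", "B", "C"],
    ["B", "A", "C", "B", "C", "A", "B", "E"],
]
rules = [("A", "C"), ("C", "A"), ("E",)]
for row in recode(sequences, rules):
    print(row)
```

The resulting matrix is an ordinary numerical table, so any standard classifier can be trained on it.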
  • 30. 9. Sequential rule extraction. Examples of extracted rules.
      Amazon reviews (sentiment analysis): no preprocessing, 2 classes, AUC = 0.911 with 500 rules. Rules: « I + highly + recommend », « dont + waste + your + money », « This + is + a + great ».
      SMS: no preprocessing, 2 classes (spam / non spam), AUC = 0.96 with 50 rules. Rules: « FREE », « URGENT! », « $1000 ».
      Reuters e-mails: no preprocessing, 10 classes, 4 sequential variables (organization / place / subject / body), AUC = 0.975 with 1000 rules. Rules: « the + acquisition + of », « crude + oil », « trade + surplus ».
  • 31. Takeaways. Nonparametric ML can help (Kaggle competition or business project): non-informative feature detection, drift detection, model calibration, data recoding, supervised bivariate analysis, co-clustering, multi-table, sequential rule extraction … Yours soon? Complementary to autoML.
  • 32. Want to use it?
      What is Edge ML? A new kind of AutoML library; optimized with C++ and OpenMP; easy to use (simple command lines); integrated with Python.
      Who can use Edge ML? Data scientists, to secure and accelerate their projects; everyone, by using the automatic mode.
      How to get more information? www.edge-ml.fr and www.marc-boulle.com (MODL approach).
      Edge ML is free for competitors, students and professors. Enjoy!
  • 33. Want to use it? Happy users: Coders, Clickers, Analysts, Experts, Scientists, Devops. Infrastructure: On-premise, As-a-service. 14-day free trial! https://predicsis.ai/free-trial/
  • 34. Let’s stay in touch! Alexis BONDU alexis.bondu@edge-ml.com Sylvain FERRANDIZ sfe@predicsis.ai

Editor's Notes

  1. Alexis + Sylvain: we introduce ourselves
  2. Sylvain
  3. Sylvain.
  4. Sylvain Coclustering, classifier, regression model, rule extraction, multi-table
  5. Sylvain
  6. Sylvain
  7. Alexis
  8. Alexis
  9. Alexis
  10. Alexis
  11. Alexis Method: detection of univariate « drift ». Discretization with the class values « Train » and « Deploy ». The MODL discretization is a universal approximator => no hypothesis on the shape of the « drift ».
  12. Alexis
  13. Alexis
  14. Alexis
  15. Alexis Try Criteo
  16. Sylvain
  17. Sylvain
  18. Sylvain Multiple uses: recommendation and sparse matrices (EGC tutorial)
  19. Sylvain
  20. Sylvain
  21. Sylvain Feature construction: an opening toward aggregate generation
  22. Sylvain
  23. Sylvain
  24. Sylvain
  25. Alexis
  26. Alexis
  27. Alexis
  28. Alexis
  29. Alexis
  30. Sylvain
  31. Add a differentiating slide.
  32. Add a differentiating slide.