For data-savvy users (analysts, scientists, ops, engineers) who want to discover some nonparametric machine learning algorithms that might help while competing on Kaggle or, more down to earth, when there is not much time to spend on a predictive analytics project. Talk given at the Paris Kaggle meetup.
Surface features with nonparametric machine learning
1. Surface features with nonparametric ML
How such algos might help (or not)
Alexis Bondu, Sylvain Ferrandiz
2. Capturing patterns via model selection
and hyperparameter tuning is … autoML
(Figure: decision boundaries of a logistic regression, an MLP with 4 neurons, and a decision tree on the XOR problem)
3. XOR
(Figure: X–Y scatter plot of the XOR pattern, Z = (XY > 0))
Or one can capture patterns via data
prep’ and feature engineering …
With MODL nonparametric ML
(why not?)
4. How? MODL* in a nutshell
*M. Boullé. Data grid models for preparation and modeling in supervised learning.
In Hands-On Pattern Recognition: Challenges in Machine Learning, volume 1,
I. Guyon, G. Cawley, G. Dror, A. Saffari (eds.), pp. 99-130, Microtome Publishing, 2011.
5. 1. The MODL
framework
For users to explore a new world!
1. Nonparametric set of ‘models’
2. Nonparametric regularized criterion
3. Nonparametric algorithm to find the best model
(with respect to the criterion)
6. Let’s see how
1. MODL: the discretization algo
2. Non informative features detection
3. Drift detection
4. Model calibration
5. Data recoding
6. Supervised bivariate analysis
7. Co-clustering
8. Multi-table
9. Sequential rules extraction
You cannot win a Kaggle competition
by pushing a button, can you?
Either way, nonparametric ML can help
(Kaggle competition or business project)
7. 1. The MODL
framework
Example: the discretization criterion
Choice of the number of intervals
Choice of the bound positions
Description of the class value distribution in each interval
Likelihood of the data given the model
https://lnkd.in/dteteWA
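The four ingredients listed above map onto the terms of the MODL discretization criterion. As a sketch following Boullé (2011), up to notation: for N instances, J classes and I intervals, with n_i instances in interval i of which n_ij belong to class j, the coding length to minimize is

```latex
\underbrace{\log N}_{\text{number of intervals}}
+ \underbrace{\log \binom{N + I - 1}{I - 1}}_{\text{bound positions}}
+ \underbrace{\sum_{i=1}^{I} \log \binom{n_i + J - 1}{J - 1}}_{\text{class distributions}}
+ \underbrace{\sum_{i=1}^{I} \log \frac{n_i!}{n_{i1}!\,\cdots\,n_{iJ}!}}_{\text{likelihood}}
```

The first three terms are the prior (the « Prior » of the next slide) and the last one is the likelihood of the data given the model.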
8. 1. The MODL
framework
Example: the discretization criterion
Optimizing only the « Prior » → a single interval
Optimizing only the « Likelihood » → one interval for each « pure » zone
https://lnkd.in/dteteWA
9. 1. The MODL
approach
The example of supervised discretization:
the variable « Age » of the dataset « Adult »
MODL avoids over-fitting, with no user parameters to adjust!
10. 2. Non informative
features detection
Why do we want to select variables?

Var 1 | Var 2 | … | Class
O     | 12    | … | A
Y     | 98    | … | B
Y     | 4     | … | A

1 – Scaling of the learning algorithms (N rows, K variables)
2 – Accuracy of the learned models (curse of dimensionality)
How to filter uninformative variables before training a model?
• in a robust way (depending on N)
• without assumptions
(Figure: x–y scatter plot — independence?)
11. 2. Non informative
features detection
Supervised discretization can be used as a non-parametric test
• Method: if the most probable model contains only a single interval / group,
the variable can be eliminated!
• Advantage: MODL is a universal approximator of P(y|x), thus it is able to
detect any kind of correlation.
Dataset: 30 000 rows,
969 numerical variables + 2141 categorical variables
After filtering:
39 numerical variables + 49 categorical variables
Logistic Regression
- default parameters
- AUC: +0.06
- Computing time: ×1500 faster
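A hedged sketch of this filter, simplified to compare only the null (one-interval) model against the best single-cut model, whereas real MODL searches over all interval counts; the criterion follows the MODL coding length up to notation:

```python
import math
from collections import Counter

def log_binom(n, k):
    """log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def coding_length(intervals, J, N):
    """MODL-style coding length of a discretization.

    intervals: list of per-interval class-count mappings;
    J: number of classes; N: total number of instances.
    """
    I = len(intervals)
    length = math.log(N) + log_binom(N + I - 1, I - 1)  # prior: I and bounds
    for counts in intervals:
        n_i = sum(counts.values())
        length += log_binom(n_i + J - 1, J - 1)  # class-distribution prior
        length += math.lgamma(n_i + 1) - sum(
            math.lgamma(c + 1) for c in counts.values())  # likelihood
    return length

def is_informative(x, y):
    """Keep the variable only if some 2-interval model beats the null model."""
    N = len(x)
    J = len(set(y))
    pairs = sorted(zip(x, y))
    null = coding_length([Counter(y)], J, N)
    best = null
    left, right = Counter(), Counter(y)
    for i in range(N - 1):
        left[pairs[i][1]] += 1
        right[pairs[i][1]] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # only cut between distinct values
        best = min(best, coding_length([left, +right], J, N))
    return best < null
```

A perfectly separating variable is kept, while a variable whose class alternates independently of its value is rejected, with no user-tuned threshold.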
12. 3. Drift detection
What is drift? Train a model, deploy it … and the scoring data no longer follows the training distribution.
How to detect it?
Method: detection of univariate « drift » — discretize each variable against the class values « Train » (0…0) and « Deploy » (1…1)
• The MODL discretization is a universal approximator
=> No hypothesis on the shape of the « drift »
13. 3. Drift detection
Definition: comparison between the coding length of the current model and that of the simplest model, which includes a single interval:
GC = 1 − (−log P(M | D)) / (−log P(M0 | D))
Output: variables sorted by drift level
Reliable measurement (compression gain)
A real-life example …
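An illustrative sketch of this compression gain, with the same simplification as before (only single-cut models are searched, and the coding length follows the MODL-style criterion up to notation): train rows are labeled 0, deploy rows 1, and a gain above 0 flags drift on that variable.

```python
import math
from collections import Counter

def log_binom(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def coding_length(intervals, N, J=2):
    """MODL-style coding length for J classes (here Train vs Deploy)."""
    I = len(intervals)
    length = math.log(N) + log_binom(N + I - 1, I - 1)
    for counts in intervals:
        n = sum(counts.values())
        length += log_binom(n + J - 1, J - 1)
        length += math.lgamma(n + 1) - sum(
            math.lgamma(c + 1) for c in counts.values())
    return length

def drift_gain(train_values, deploy_values):
    """GC = 1 - L(best model) / L(null model); > 0 means drift detected."""
    data = sorted([(v, 0) for v in train_values] +
                  [(v, 1) for v in deploy_values])
    N = len(data)
    labels = [lbl for _, lbl in data]
    null = coding_length([Counter(labels)], N)
    best = null
    left, right = Counter(), Counter(labels)
    for i in range(N - 1):
        left[data[i][1]] += 1
        right[data[i][1]] -= 1
        if data[i][0] == data[i + 1][0]:
            continue  # only cut between distinct values
        best = min(best, coding_length([left, +right], N))
    return 1.0 - best / null
```

A shifted deployment distribution yields a strictly positive gain, while identical distributions yield a gain of exactly zero (no split compresses better than the null model).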
14. 4. Model
calibration
Some classifiers distort the estimated output probabilities …
(Figure: shape of the output P(y=1|var1, var2) and P(y=1|X) for a logistic regression on the Adult dataset)
How to solve this problem in a robust way, without assumptions?
15. 4. Model
calibration
Supervised discretization is suitable for this problem: the estimated P(y=1|X) (the output of the model) becomes the single input of a new training set

P(y=1|X) | y
0.967    | 1
0.865    | 1
0.765    | 0
0.75     | 1

Robustness: the number and the size of the intervals depend on N
Accuracy: the calibrated distribution is not necessarily monotonous
(universal approximator)
Logistic regression on the « Adult » dataset: in this case, the AUC improves by +0.09
16. 5. Data recoding
Most ML algorithms process only numerical variables …
fit: estimate P(danger | color) on the training set (100% / 30% / 0% depending on the color)
transform: replace each color value by its estimated probability (1.0 / 0.3 / 0.0 / …)
Advantages
• Encodes categorical variables into numerical ones, regardless of the number of levels
• Limited number of recoded variables (number of classes − 1)
• Gain of robustness
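A sketch of this fit/transform recoding (the color names below are illustrative placeholders, not from the deck): each category is replaced by the empirical P(class | category) estimated on the training set.

```python
from collections import defaultdict

def fit_encoder(categories, labels):
    """Estimate P(class=1 | category) from the training set."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for cat, y in zip(categories, labels):
        totals[cat] += 1
        positives[cat] += y
    return {cat: positives[cat] / totals[cat] for cat in totals}

def transform(categories, encoder, default=0.0):
    # Unseen categories fall back to a default value
    # (in practice, a global positive rate).
    return [encoder.get(cat, default) for cat in categories]
```

For example, fitting on four observations and transforming a mix of seen and unseen colors:

```python
enc = fit_encoder(["red", "red", "orange", "green"], [1, 1, 0, 0])
transform(["red", "green", "blue"], enc, default=0.5)  # → [1.0, 0.0, 0.5]
```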
27. 9. Sequential rule
extraction
Abstract
Whole genome RNA expression studies permit systematic approaches to understanding
the correlation between gene_expression profiles to disease states or different
developmental stages of a cell. Microarray analysis provides quantitative_information
about the complete transcription profile of cells that facilitate drug and therapeutics
development, disease_diagnosis, and understanding in the basic cell biology. One of the
challenges in microarray analysis, especially in cancerous gene_expression profiles, is to
identify genes or groups of genes that are highly expressed in tumour_cells but not in
normal cells and vice versa. Previously, we have shown that ensemble machine_learning
consistently performs well in classifying biological data. In this paper, we focus on three
different supervised machine_learning techniques in cancer classification, namely C4.5
decision_tree, and bagged and boosted decision_trees.
Example: categorization of texts
Two classes of scientific articles: medicine, machine learning
< classifying, data > → P(ML) = 95%, P(medicine) = 5%
28. 9. Sequential rule
extraction
Robustness of the compression gain, illustrated on the dataset « skater »
(compared against confidence and growth rate)
MODL compression gain:
GC = 1 − (−log P(M | D)) / (−log P(M0 | D))
Recall: the compression gain compares the coding length of the current model with that of the null model M0, which does not include any element in the rule.
29. 9. Sequential rule
extraction
Recoding the rules & training of a classifier
The ensemble of informative rules (compression gain > 0) is kept; each observation is then recoded as a binary vector, one column per rule:

Rules:        A B C D E F G
Observations: 0 0 1 0 1 0 0
              1 1 0 0 0 1 0
              0 1 0 0 0 0 1
              1 1 0 0 0 0 0
              0 1 0 1 0 0 0

A classifier is then trained on this binary recoding.
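A minimal sketch of the binary recoding step, in which word presence stands in for true sequential matching (each rule is modeled as a set of words that must all appear, in the spirit of the examples on the next slide):

```python
def recode(documents, rules):
    """One 0/1 column per rule: does the document match the rule?"""
    matrix = []
    for doc in documents:
        words = set(doc.lower().split())
        matrix.append([1 if rule <= words else 0 for rule in rules])
    return matrix
```

For instance, with two rules and two short documents:

```python
rules = [{"free"}, {"waste", "money"}]
recode(["FREE entry now", "dont waste your money"], rules)  # → [[1, 0], [0, 1]]
```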
30. 9. Sequential rule
extraction
Examples of extracted rules
Amazon reviews : sentiment analysis
- No preprocessing
- 2 classes
- AUC = 0.911 with 500 rules
• « I + highly + recommend »
• « dont + waste + your + money »
• « This + is + a + great »
SMS :
- No preprocessing
- 2 classes (spam / non spam)
- AUC = 0.96 with 50 rules
• « FREE »
• « URGENT!»
• « $1000 »
Reuters e-mails:
- No preprocessing
- 10 classes
- 4 sequential variables (organization / place / object / body)
- AUC = 0.975 with 1000 rules
• « the + acquisition + of »
• « crude + oil »
• « trade + surplus »
31. Takeaways
Non informative features detection
Drift detection
Model calibration
Data recoding
Supervised bivariate analysis
Co-clustering
Multi-table
Sequential rules extraction
… Yours soon?
Nonparametric ML can help
(Kaggle competition or business project)
Complementary to autoML
32. Want to use it?
What is Edge ML?
- A new kind of AutoML library
- Optimized using C++ and OpenMP
- Easy to use (simple command lines)
- Integrated with Python
Who can use Edge ML?
- Data scientists, to secure and accelerate their projects!
- Everyone, by using the automatic mode
How to get more information?
- www.edge-ml.fr
- www.marc-boulle.com (MODL approach)
Edge ML is free for competitors, students and professors. Enjoy!
33. Want to use it?
Happy users: clickers and coders — analysts, experts, scientists, devops
Infrastructure: on-premise or as-a-service
14-day free trial!
https://predicsis.ai/free-trial/
Speaker notes:
- Try Criteo
- Multiple uses: recommendation and sparse matrices (EGC tutorial)
- Variable construction: an opening toward the generation of aggregates