Parallel Filter-Based Feature Selection
Based on Balanced Incomplete Block
Designs
Antonio Salmerón1, Anders L. Madsen2,3, Frank Jensen2,
Helge Langseth4, Thomas D. Nielsen3, Darío Ramos-López1,
Ana M. Martínez3, Andrés R. Masegosa4
1Department of Mathematics, University of Almería, Spain
2 Hugin Expert A/S, Aalborg, Denmark
3Department of Computer Science, Aalborg University, Denmark
4 Department of Computer and Information Science,
The Norwegian University of Science and Technology, Norway
22nd European Conference on Artificial Intelligence, The Hague, 1st September 2016
The AMIDST Toolbox for probabilistic
machine learning
You can download our open source Java toolbox:
http://www.amidsttoolbox.com
Software demonstration today at 12
in the Lobby!
Outline
1 Motivation
2 Preliminaries
3 Scaling up filter-based feature selection
4 Experimental results
5 Conclusions
Feature Selection Approaches
Before learning a classification model, a key step is choosing the most informative variables and excluding the redundant ones. Good feature selection prevents overfitting and reduces the computational load of the learning process.
Main groups of feature selection methods:
Wrapper methods: typically require a model fit for each candidate subset of variables (time-consuming).
Embedded methods: also model-dependent, guided by some specific property of the model.
Filter methods: independent of the model; features are usually ranked according to a univariate score (lower computational complexity).
Our proposal
We propose an algorithm for scaling up filter-based feature selection in classification problems, such that:
It makes use of conditional mutual information as the filter measure.
It parallelizes the workload while avoiding unnecessary calculations.
It uses balanced incomplete block (BIB) designs to reduce disk access and to distribute the calculations.
It significantly reduces the computational load in a multi-core shared-memory environment.
Entropy and Conditional Entropy
X = {X_1, ..., X_n}: discrete variables; C: the class variable.
Entropy of X ∈ X:
H(X) = -\sum_{x \in \Omega_X} p(x) \log p(x)
(uncertainty in the distribution of X)
Conditional entropy of X_i given X_j:
H(X_i | X_j) = -\sum_{x_j \in \Omega_{X_j}} p(x_j) \sum_{x_i \in \Omega_{X_i}} p(x_i | x_j) \log p(x_i | x_j)
(remaining uncertainty in the distribution of X_i after observing X_j)
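As a minimal illustration (ours, not part of the toolbox), the empirical versions of these quantities can be computed from samples; `conditional_entropy` uses the equivalent identity H(X | Y) = H(X, Y) - H(Y):

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical entropy H(X) of a sample, in nats."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """Empirical conditional entropy H(X | Y) = H(X, Y) - H(Y)."""
    return entropy(list(zip(xs, ys))) - entropy(ys)

# A fair coin has maximal entropy log 2; observing an identical copy removes it.
x = [0, 1, 0, 1]
print(round(entropy(x), 4))                 # 0.6931  (= log 2)
print(round(conditional_entropy(x, x), 4))  # 0.0
```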
Mutual Information
Mutual information of X_i and X_j:
I(X_i, X_j) = H(X_i) - H(X_i | X_j)
(amount of information shared by two variables)
The mutual information is symmetric: I(X_i, X_j) = I(X_j, X_i).
Conditional mutual information between X_i and X_j given X_k:
I(X_i, X_j | X_k) = H(X_i | X_k) - H(X_i | X_j, X_k)
(amount of information shared by two variables, given a third one)
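Building on the empirical entropies above, both measures follow directly from their definitions (again an illustrative sketch of ours):

```python
import math
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def cond_entropy(xs, ys):
    return entropy(list(zip(xs, ys))) - entropy(ys)

def mutual_info(xs, ys):
    """I(X, Y) = H(X) - H(X | Y)."""
    return entropy(xs) - cond_entropy(xs, ys)

def cond_mutual_info(xs, ys, zs):
    """I(X, Y | Z) = H(X | Z) - H(X | Y, Z)."""
    return cond_entropy(xs, zs) - cond_entropy(xs, list(zip(ys, zs)))

x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
# Symmetry holds empirically: I(X, Y) == I(Y, X).
assert abs(mutual_info(x, y) - mutual_info(y, x)) < 1e-12
# These samples are independent, so the mutual information is zero.
print(round(mutual_info(x, y), 4))  # 0.0
```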
Information Theoretic Filter Methods
Information-theoretic filter methods have been analyzed using the conditional likelihood:
L(S, \tau | D) = \prod_{i=1}^{n} q(c_i | x_i, \tau),
where:
S: features included in the model,
\tau: parameters of the distributions involved in the model,
D: a dataset, D = {(x_i, c_i), i = 1, ..., n},
q: the learnt model.
The conditional likelihood is maximized by minimizing I(X \ S, C | S), the mutual information between the class and the features not included in the model, given the variables actually included.
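For concreteness, the log conditional likelihood of any fitted classifier q can be evaluated as below; the function name and the dictionary representation of q's predictive distribution are our illustrative choices, not the paper's:

```python
import math

def log_conditional_likelihood(probs, labels):
    """Sum of log q(c_i | x_i) over the dataset.
    probs[i] is the model's predictive distribution for instance i (class -> probability);
    labels[i] is the true class c_i."""
    return sum(math.log(p[c]) for p, c in zip(probs, labels))

# Two instances, binary class: a sharper (more confident, correct) model scores higher.
sharp = [{0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}]
vague = [{0: 0.6, 1: 0.4}, {0: 0.5, 1: 0.5}]
labels = [0, 1]
assert log_conditional_likelihood(sharp, labels) > log_conditional_likelihood(vague, labels)
```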
Filter measures
We shall make the following technical assumption:
For X_i, X_j ∈ S and X_k ∈ X \ S, X_i and X_j are conditionally independent both when conditioning on X_k and when conditioning on {X_k, C}.
This allows us to select features greedily, using as a filter measure the following quantity based on the conditional mutual information (cmi):
J_{cmi}(X_i) = I(X_i, C | S) = I(X_i, C) - \sum_{X_j \in S} [ I(X_i, X_j) - I(X_i, X_j | C) ],
where X_i is the candidate variable to include in the model.
Filter measures
Another remarkable filter measure is the joint mutual information (jmi), defined as:
J_{jmi}(X_i) = \sum_{X_j \in S} I({X_i, X_j}, C) = I(X_i, C) - \frac{1}{|S|} \sum_{X_j \in S} [ I(X_i, X_j) - I(X_i, X_j | C) ].
According to the literature, J_{jmi} is the metric showing the best accuracy/stability tradeoff, and is therefore the one we use in what follows.
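The J_jmi score can be evaluated cheaply once the pairwise statistics are available. A sketch (function and dictionary names are ours; the statistics are assumed precomputed, as in the algorithm below):

```python
def jmi_score(cand, selected, mi_with_class, mi, cmi_given_class):
    """J_jmi(X_i) = I(X_i, C) - (1/|S|) * sum_j [ I(X_i, X_j) - I(X_i, X_j | C) ].
    mi_with_class[i] = I(X_i, C); mi[(i, j)] = I(X_i, X_j);
    cmi_given_class[(i, j)] = I(X_i, X_j | C), stored once per unordered pair."""
    if not selected:
        return mi_with_class[cand]
    key = lambda a, b: (min(a, b), max(a, b))
    penalty = sum(mi[key(cand, j)] - cmi_given_class[key(cand, j)] for j in selected)
    return mi_with_class[cand] - penalty / len(selected)

# Toy numbers: candidate 2 carries more class information but is redundant
# with the already-selected feature 0, so candidate 1 wins.
mi_c = {1: 0.30, 2: 0.50}
mi = {(0, 1): 0.05, (0, 2): 0.40}
cmi = {(0, 1): 0.05, (0, 2): 0.10}
print(round(jmi_score(1, [0], mi_c, mi, cmi), 3))  # 0.3
print(round(jmi_score(2, [0], mi_c, mi, cmi), 3))  # 0.2
```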
Filter-based feature selection algorithm
1  Function Filter(X, C, M)
   Input: The set of features, X = {X_1, ..., X_N}; the class variable, C; the maximum number of features to be selected, M.
   Output: S, a set of selected features.
2  begin
3    for i ← 1 to N do
4      Compute I(X_i, C);
5      for j ← i + 1 to N do
6        Compute I(X_i, X_j | C);
7        Compute I(X_i, X_j);
8      end
9    end
10   X* ← arg max_{1 ≤ i ≤ N} I(X_i, C);
11   S ← {X*};
12   for i ← 1 to M - 1 do
13     R ← X \ S;
14     for X ∈ R do
15       Compute J_jmi(X) using the statistics computed in Steps 4, 6, and 7;
16     end
17     X* ← arg max_{X ∈ R} J_jmi(X);
18     S ← S ∪ {X*};
19   end
20   return S;
21 end
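The greedy loop (Steps 10-19) can be sketched compactly, with the scoring function passed in as a parameter so any filter measure plugs in; names and signatures are ours, not the toolbox's:

```python
def filter_select(n_features, M, mi_with_class, score):
    """Greedy filter: start from the feature maximizing I(X, C), then repeatedly
    add the remaining feature maximizing score(candidate, selected) until M
    features are selected."""
    first = max(range(n_features), key=lambda i: mi_with_class[i])
    S = [first]
    while len(S) < M:
        rest = [i for i in range(n_features) if i not in S]
        S.append(max(rest, key=lambda i: score(i, S)))
    return S

# Degenerate score that ignores redundancy: selection simply follows the
# I(X, C) ranking, so features 1 (0.7) and 3 (0.5) are picked.
mi_c = [0.1, 0.7, 0.3, 0.5]
picked = filter_select(4, 2, mi_c, lambda i, S: mi_c[i])
print(picked)  # [1, 3]
```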
Optimizing the calculation of the scores
To compute all the necessary information-theoretic scores, one possibility is to create a thread for each pair of features. However, for n variables, this requires accessing the dataset \binom{n}{2} times, inducing a significant overhead due to disk/network access.
We propose grouping the variables into blocks, each of which accesses the dataset only once.
To do this, two key issues must be addressed:
Finding an appropriate block size.
Ensuring that every pair of variables appears in exactly one block (to avoid duplicated calculations).
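The arithmetic behind the saving, for a hypothetical run with v = 31 features and blocks of k = 6 (a (31, 6, 1)-BIB design is known to exist: the lines of the projective plane of order 5):

```python
from math import comb

v, k = 31, 6
pairs = comb(v, 2)            # naive approach: one dataset access per pair
blocks = pairs // comb(k, 2)  # each block of 6 features covers C(6, 2) = 15 pairs
print(pairs, blocks)          # 465 dataset passes reduced to 31
```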
Balanced incomplete block (BIB) designs
What we seek can be achieved using Balanced Incomplete Block (BIB) designs.
Given a set of variables X, we say (X, A) is a design if A is a collection of non-empty subsets of X (blocks).
A design is a (v, k, λ)-BIB design if:
v = |X| is the number of variables in X;
each block in A contains exactly k variables;
every pair of distinct variables is contained in exactly λ blocks.
We have found that (q, 6, 1)-BIB designs (blocks of 6 variables) are appropriate for practical use.
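The defining properties are easy to check mechanically. A sketch (ours) verifying them on the classic (7, 3, 1)-BIB design, the Fano plane; the paper's designs use k = 6, but the check is identical:

```python
from itertools import combinations
from collections import Counter

# The Fano plane: the standard (7, 3, 1)-BIB design, 0-indexed.
blocks = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]

def is_bib(blocks, v, k, lam):
    """True iff every block has size k and every pair of distinct points
    from {0, ..., v-1} is covered by exactly lam blocks."""
    if any(len(b) != k for b in blocks):
        return False
    cover = Counter(frozenset(p) for b in blocks for p in combinations(sorted(b), 2))
    return all(cover[frozenset(p)] == lam for p in combinations(range(v), 2))

print(is_bib(blocks, 7, 3, 1))  # True
```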
Balanced incomplete block (BIB) designs
Some considerations about BIB designs:
A (v, k, λ)-BIB design might not exist for some combinations of the parameters.
Finding a BIB design is an NP-complete problem.
To use BIB designs efficiently, we have pre-computed a number of them, which are then used at run time.
BIB designs can be generated from difference sets, avoiding the need to store the full design.
Check the paper for more details on this!
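The difference-set construction is simple to sketch: developing a base set modulo v yields the whole design on the fly. An illustration (ours) with the (7, 3, 1) difference set {0, 1, 3} in Z_7:

```python
from itertools import combinations
from collections import Counter

def blocks_from_difference_set(diff_set, v):
    """Develop a difference set modulo v: block i is {d + i mod v : d in diff_set}.
    Only the base set needs storing; blocks are generated when needed."""
    return [frozenset((d + i) % v for d in diff_set) for i in range(v)]

# {0, 1, 3} is a perfect difference set in Z_7: every nonzero residue arises
# exactly once as a difference, so the developed design has lambda = 1.
blocks = blocks_from_difference_set({0, 1, 3}, 7)
cover = Counter(frozenset(p) for b in blocks for p in combinations(sorted(b), 2))
assert all(cover[frozenset(p)] == 1 for p in combinations(range(7), 2))
print(len(blocks))  # 7
```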
Example of BIB designs
Figure 1: Example illustrating the use of (q, 6, 1) and (3, 2, 1) designs.
Experimental setup
Goal: empirical evaluation of the runtime improvement obtained by using BIB designs.
Hardware: a shared-memory multi-core computer with an Intel(TM) i7-5820K 3.3 GHz processor (6 physical / 12 logical cores) and 64 GB RAM, running Red Hat Enterprise Linux 7.
Time measurements: elapsed (wall-clock) time, averaged over 10 runs with each dataset.
Datasets: randomly simulated from well-known Bayesian networks, plus a real-world dataset from a Spanish bank.
Tested Bayesian networks
Table 1: Bayesian networks from which datasets were generated.
Dataset |X| |E| Total CPT size
Munin1 189 282 19,466
Diabetes 413 602 461,069
Munin2 1,003 1,244 83,920
SACSO 2,371 3,521 44,274
|X|: number of variables in the Bayesian network
|E|: number of edges in the Bayesian network
CPT: total conditional probability table size
In addition, a real-world dataset of 1,823 variables from a Spanish bank was also tested.
Speed-up for Munin1 datasets
[Plots: average runtime in seconds (left axis) and average speed-up factor (right axis) vs. number of threads (1-12), for Munin1 datasets of 100,000, 250,000, and 500,000 cases.]
Speed-up for Munin2 datasets
[Plots: average runtime in seconds and average speed-up factor vs. number of threads (1-12), for Munin2 datasets of 100,000, 250,000, and 500,000 cases.]
Speed-up for SACSO datasets
[Plots: average runtime in seconds and average speed-up factor vs. number of threads (1-12), for SACSO datasets of 250,000 and 500,000 cases.]
Speed-up for Diabetes datasets
[Plot: average runtime in seconds and average speed-up factor vs. number of threads (1-12), for the Diabetes dataset of 100,000 cases.]
Speed-up for Bank datasets
[Plots: average runtime in seconds and average speed-up factor vs. number of threads (1-12), for Bank datasets of 100,000, 250,000, and 500,000 cases.]
Impact of using the (q, 6, 1)-BIB designs
Bank dataset
[Plots: average runtime and speed-up factor vs. number of threads on the Bank dataset with 100,000 cases, comparing the (q, 6, 1)-BIB design approach against computing over pairs directly.]
Conclusions
A parallel filter-based feature selection algorithm has been proposed.
It makes use of information-theoretic measures (conditional mutual information) for filtering.
A two-step Balanced Incomplete Block (BIB) design scheme is used to distribute and optimize the computations asynchronously: first (q, 6, 1), then (3, 2, 1).
For variables with a large number of states, it might be preferable to skip the first step (as in the Diabetes dataset).
The performance improvement when using the (q, 6, 1) designs is substantial in most cases.
Speed-up factors of about 4 to 6 were obtained on a computer with 6 physical cores.
Future work
Horizontal parallelization: each computing unit holds only a subset of the data over all variables.
This will support parallelization on a distributed-memory system.
Thank you for your attention
You can download our open source Java toolbox:
http://www.amidsttoolbox.com
Acknowledgments: This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 619209.
Weitere ähnliche Inhalte

Was ist angesagt?

Alternative Infill Strategies for Expensive Multi-Objective Optimisation
Alternative Infill Strategies for Expensive Multi-Objective OptimisationAlternative Infill Strategies for Expensive Multi-Objective Optimisation
Alternative Infill Strategies for Expensive Multi-Objective Optimisation
Alma Rahat
 
01 graphical models
01 graphical models01 graphical models
01 graphical models
zukun
 
Surrogate models for uncertainty quantification and reliability analysis
Surrogate models for uncertainty quantification and reliability analysisSurrogate models for uncertainty quantification and reliability analysis
Surrogate models for uncertainty quantification and reliability analysis
Bruno Sudret
 

Was ist angesagt? (18)

Convolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernelsConvolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernels
 
Data Science for Number and Coding Theory
Data Science for Number and Coding TheoryData Science for Number and Coding Theory
Data Science for Number and Coding Theory
 
Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...
 
Deep Learning Opening Workshop - Horseshoe Regularization for Machine Learnin...
Deep Learning Opening Workshop - Horseshoe Regularization for Machine Learnin...Deep Learning Opening Workshop - Horseshoe Regularization for Machine Learnin...
Deep Learning Opening Workshop - Horseshoe Regularization for Machine Learnin...
 
AILABS Lecture Series - Is AI The New Electricity. Topic - Deep Learning - Ev...
AILABS Lecture Series - Is AI The New Electricity. Topic - Deep Learning - Ev...AILABS Lecture Series - Is AI The New Electricity. Topic - Deep Learning - Ev...
AILABS Lecture Series - Is AI The New Electricity. Topic - Deep Learning - Ev...
 
AbdoSummerANS_mod3
AbdoSummerANS_mod3AbdoSummerANS_mod3
AbdoSummerANS_mod3
 
How good is your prediction a gentle introduction to conformal prediction.
How good is your prediction  a gentle introduction to conformal prediction.How good is your prediction  a gentle introduction to conformal prediction.
How good is your prediction a gentle introduction to conformal prediction.
 
Session II - Estimation methods and accuracy Li-Chun Zhang Discussion: Sess...
Session II - Estimation methods and accuracy   Li-Chun Zhang Discussion: Sess...Session II - Estimation methods and accuracy   Li-Chun Zhang Discussion: Sess...
Session II - Estimation methods and accuracy Li-Chun Zhang Discussion: Sess...
 
Extracting biclusters of similar values with Triadic Concept Analysis
Extracting biclusters of similar values with Triadic Concept AnalysisExtracting biclusters of similar values with Triadic Concept Analysis
Extracting biclusters of similar values with Triadic Concept Analysis
 
Lecture17 xing fei-fei
Lecture17 xing fei-feiLecture17 xing fei-fei
Lecture17 xing fei-fei
 
Deep Learning Opening Workshop - Domain Adaptation Challenges in Genomics: a ...
Deep Learning Opening Workshop - Domain Adaptation Challenges in Genomics: a ...Deep Learning Opening Workshop - Domain Adaptation Challenges in Genomics: a ...
Deep Learning Opening Workshop - Domain Adaptation Challenges in Genomics: a ...
 
Integrating Practical2009
Integrating Practical2009Integrating Practical2009
Integrating Practical2009
 
Em
EmEm
Em
 
Introduction to probabilistic programming with pyro
Introduction to probabilistic programming with pyroIntroduction to probabilistic programming with pyro
Introduction to probabilistic programming with pyro
 
Alternative Infill Strategies for Expensive Multi-Objective Optimisation
Alternative Infill Strategies for Expensive Multi-Objective OptimisationAlternative Infill Strategies for Expensive Multi-Objective Optimisation
Alternative Infill Strategies for Expensive Multi-Objective Optimisation
 
01 graphical models
01 graphical models01 graphical models
01 graphical models
 
Improvement of id3 algorithm based on simplified information entropy and coor...
Improvement of id3 algorithm based on simplified information entropy and coor...Improvement of id3 algorithm based on simplified information entropy and coor...
Improvement of id3 algorithm based on simplified information entropy and coor...
 
Surrogate models for uncertainty quantification and reliability analysis
Surrogate models for uncertainty quantification and reliability analysisSurrogate models for uncertainty quantification and reliability analysis
Surrogate models for uncertainty quantification and reliability analysis
 

Andere mochten auch

LaFay Resume
LaFay ResumeLaFay Resume
LaFay Resume
Amy LaFay
 
Kevin Lee New Resume (2016)
Kevin Lee New Resume (2016)Kevin Lee New Resume (2016)
Kevin Lee New Resume (2016)
Kevin Lee
 
2015 CSR Annual Report
2015 CSR Annual Report2015 CSR Annual Report
2015 CSR Annual Report
Andrew Russell
 

Andere mochten auch (17)

Teori belajar bahasa pend,bhs dan sastra indonesia
Teori belajar bahasa pend,bhs dan sastra indonesiaTeori belajar bahasa pend,bhs dan sastra indonesia
Teori belajar bahasa pend,bhs dan sastra indonesia
 
October 9th, 2014 - LHS Key Club Meeting
October 9th, 2014 - LHS Key Club MeetingOctober 9th, 2014 - LHS Key Club Meeting
October 9th, 2014 - LHS Key Club Meeting
 
Apresentação Áraucarias - Josiane Corretrora 41- 9608-0587
Apresentação Áraucarias - Josiane Corretrora 41- 9608-0587Apresentação Áraucarias - Josiane Corretrora 41- 9608-0587
Apresentação Áraucarias - Josiane Corretrora 41- 9608-0587
 
LaFay Resume
LaFay ResumeLaFay Resume
LaFay Resume
 
2016 Resume
2016 Resume2016 Resume
2016 Resume
 
Resume
ResumeResume
Resume
 
Kevin Lee New Resume (2016)
Kevin Lee New Resume (2016)Kevin Lee New Resume (2016)
Kevin Lee New Resume (2016)
 
Oberservation Hours
Oberservation HoursOberservation Hours
Oberservation Hours
 
Los ciegos y el elefante
Los ciegos y el elefanteLos ciegos y el elefante
Los ciegos y el elefante
 
Html part1
Html part1Html part1
Html part1
 
Lean sytems lecture ayub jake salik 2012
Lean sytems lecture ayub jake salik 2012Lean sytems lecture ayub jake salik 2012
Lean sytems lecture ayub jake salik 2012
 
October 2nd, 2014 - LHS Key Club Meeting
October 2nd, 2014 - LHS Key Club MeetingOctober 2nd, 2014 - LHS Key Club Meeting
October 2nd, 2014 - LHS Key Club Meeting
 
Case study: Solving Challenges faced by KIIT E-cell
Case study: Solving Challenges faced by KIIT E-cellCase study: Solving Challenges faced by KIIT E-cell
Case study: Solving Challenges faced by KIIT E-cell
 
El jabón 2
El jabón 2El jabón 2
El jabón 2
 
Recommender system
Recommender systemRecommender system
Recommender system
 
sports day
sports daysports day
sports day
 
2015 CSR Annual Report
2015 CSR Annual Report2015 CSR Annual Report
2015 CSR Annual Report
 

Ähnlich wie Parallel Filter-Based Feature Selection Based on Balanced Incomplete Block Designs (ECAI2016)

Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
butest
 
Iwsm2014 an analogy-based approach to estimation of software development ef...
Iwsm2014   an analogy-based approach to estimation of software development ef...Iwsm2014   an analogy-based approach to estimation of software development ef...
Iwsm2014 an analogy-based approach to estimation of software development ef...
Nesma
 
(Hierarchical) Topic Modeling_Yueshen Xu
(Hierarchical) Topic Modeling_Yueshen Xu(Hierarchical) Topic Modeling_Yueshen Xu
(Hierarchical) Topic Modeling_Yueshen Xu
Yueshen Xu
 

Ähnlich wie Parallel Filter-Based Feature Selection Based on Balanced Incomplete Block Designs (ECAI2016) (20)

Parallelisation of the PC Algorithm (CAEPIA2015)
Parallelisation of the PC Algorithm (CAEPIA2015)Parallelisation of the PC Algorithm (CAEPIA2015)
Parallelisation of the PC Algorithm (CAEPIA2015)
 
Learning Pulse - paper presentation at LAK17
Learning Pulse - paper presentation at LAK17Learning Pulse - paper presentation at LAK17
Learning Pulse - paper presentation at LAK17
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
SASA 2016
SASA 2016SASA 2016
SASA 2016
 
Artificial Intelligence - Anna Uni -v1.pdf
Artificial Intelligence - Anna Uni -v1.pdfArtificial Intelligence - Anna Uni -v1.pdf
Artificial Intelligence - Anna Uni -v1.pdf
 
Iwsm2014 an analogy-based approach to estimation of software development ef...
Iwsm2014   an analogy-based approach to estimation of software development ef...Iwsm2014   an analogy-based approach to estimation of software development ef...
Iwsm2014 an analogy-based approach to estimation of software development ef...
 
Asynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and AlgorithmsAsynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and Algorithms
 
Scalable MAP inference in Bayesian networks based on a Map-Reduce approach (P...
Scalable MAP inference in Bayesian networks based on a Map-Reduce approach (P...Scalable MAP inference in Bayesian networks based on a Map-Reduce approach (P...
Scalable MAP inference in Bayesian networks based on a Map-Reduce approach (P...
 
Machine Learning part 2 - Introduction to Data Science
Machine Learning part 2 -  Introduction to Data Science Machine Learning part 2 -  Introduction to Data Science
Machine Learning part 2 - Introduction to Data Science
 
Prototype-based models in machine learning
Prototype-based models in machine learningPrototype-based models in machine learning
Prototype-based models in machine learning
 
(Hierarchical) Topic Modeling_Yueshen Xu
(Hierarchical) Topic Modeling_Yueshen Xu(Hierarchical) Topic Modeling_Yueshen Xu
(Hierarchical) Topic Modeling_Yueshen Xu
 
Ijetr042170
Ijetr042170Ijetr042170
Ijetr042170
 
Log Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningLog Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine Learning
 
Log Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningLog Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine Learning
 
[IJET V2I3P11] Authors: Payal More, Rohini Pandit, Supriya Makude, Harsh Nirb...
[IJET V2I3P11] Authors: Payal More, Rohini Pandit, Supriya Makude, Harsh Nirb...[IJET V2I3P11] Authors: Payal More, Rohini Pandit, Supriya Makude, Harsh Nirb...
[IJET V2I3P11] Authors: Payal More, Rohini Pandit, Supriya Makude, Harsh Nirb...
 
Csc446: Pattern Recognition
Csc446: Pattern Recognition Csc446: Pattern Recognition
Csc446: Pattern Recognition
 
Object Detection using Deep Neural Networks
Object Detection using Deep Neural NetworksObject Detection using Deep Neural Networks
Object Detection using Deep Neural Networks
 
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
 
Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine Learning
 

Mehr von AMIDST Toolbox

Analysis of massive data using R (CAEPIA2015)
Analysis of massive data using R (CAEPIA2015)Analysis of massive data using R (CAEPIA2015)
Analysis of massive data using R (CAEPIA2015)
AMIDST Toolbox
 

Mehr von AMIDST Toolbox (6)

Analysis of massive data using R (CAEPIA2015)
Analysis of massive data using R (CAEPIA2015)Analysis of massive data using R (CAEPIA2015)
Analysis of massive data using R (CAEPIA2015)
 
Dynamic Bayesian modeling for risk prediction in credit operations (SCAI2015)
Dynamic Bayesian modeling for risk prediction in credit operations (SCAI2015)Dynamic Bayesian modeling for risk prediction in credit operations (SCAI2015)
Dynamic Bayesian modeling for risk prediction in credit operations (SCAI2015)
 
A Java Toolbox for Analysis of MassIve Data STreams using Probabilistic Graph...
A Java Toolbox for Analysis of MassIve Data STreams using Probabilistic Graph...A Java Toolbox for Analysis of MassIve Data STreams using Probabilistic Graph...
A Java Toolbox for Analysis of MassIve Data STreams using Probabilistic Graph...
 
Amidst demo (BNAIC 2015)
Amidst demo (BNAIC 2015)Amidst demo (BNAIC 2015)
Amidst demo (BNAIC 2015)
 
d-VMP: Distributed Variational Message Passing (PGM2016)
d-VMP: Distributed Variational Message Passing (PGM2016)d-VMP: Distributed Variational Message Passing (PGM2016)
d-VMP: Distributed Variational Message Passing (PGM2016)
 
Flink Forward 2016
Flink Forward 2016Flink Forward 2016
Flink Forward 2016
 

Kürzlich hochgeladen

CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptx
Silpa
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
Silpa
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 

Kürzlich hochgeladen (20)

Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptx
 
CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptx
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx

Parallel Filter-Based Feature Selection Based on Balanced Incomplete Block Designs (ECAI2016)

  • 1. Parallel Filter-Based Feature Selection Based on Balanced Incomplete Block Designs
    Antonio Salmerón(1), Anders L. Madsen(2,3), Frank Jensen(2), Helge Langseth(4), Thomas D. Nielsen(3), Darío Ramos-López(1), Ana M. Martínez(3), Andrés R. Masegosa(4)
    (1) Department of Mathematics, University of Almería, Spain
    (2) Hugin Expert A/S, Aalborg, Denmark
    (3) Department of Computer Science, Aalborg University, Denmark
    (4) Department of Computer and Information Science, The Norwegian University of Science and Technology, Norway
    22nd European Conference on Artificial Intelligence, The Hague, 1st September 2016
  • 2. The AMIDST Toolbox for probabilistic machine learning
    You can download our open source Java toolbox: http://www.amidsttoolbox.com
    Software demonstration today at 12 in the Lobby!
  • 3. Outline
    1 Motivation
    2 Preliminaries
    3 Scaling up filter-based feature selection
    4 Experimental results
    5 Conclusions
  • 4. Outline
    1 Motivation
    2 Preliminaries
    3 Scaling up filter-based feature selection
    4 Experimental results
    5 Conclusions
  • 5. Feature Selection Approaches
    Before learning a classification model, a key step is choosing the most informative variables and excluding the redundant ones. Good feature selection prevents overfitting and reduces the computational load of the learning process.
    Main groups of feature selection methods:
    Wrapper: normally requires a model fit for each subset of variables (time consuming).
    Embedded: also model-dependent, guided by some specific property of the model.
    Filter: independent of the model; features are usually ranked according to a univariate score (lower computational complexity).
  • 6. Our proposal
    We propose an algorithm for scaling up filter-based feature selection in classification problems, such that:
    It uses conditional mutual information as the filter measure.
    It parallelizes the workload while avoiding unnecessary calculations.
    It uses balanced incomplete block (BIB) designs to reduce disk access and to distribute the calculations.
    It significantly reduces the computational load in a multi-core shared-memory environment.
  • 7. Outline
    1 Motivation
    2 Preliminaries
    3 Scaling up filter-based feature selection
    4 Experimental results
    5 Conclusions
  • 8. Entropy and Conditional Entropy
    X = {X1, ..., Xn}: discrete variables; C: the class variable.
    Entropy of X ∈ X:
    H(X) = − Σ_{x ∈ Ω_X} p(x) log p(x)
    (uncertainty in the distribution of X)
    Conditional entropy of Xi given Xj:
    H(Xi | Xj) = − Σ_{xj ∈ Ω_Xj} p(xj) Σ_{xi ∈ Ω_Xi} p(xi | xj) log p(xi | xj)
    (remaining uncertainty in the distribution of Xi after observing Xj)
  • 9. Mutual Information
    Mutual information of Xi and Xj:
    I(Xi, Xj) = H(Xi) − H(Xi | Xj)
    (amount of information shared by two variables)
    The mutual information is symmetric: I(Xi, Xj) = I(Xj, Xi).
    Conditional mutual information between Xi and Xj given Xk:
    I(Xi, Xj | Xk) = H(Xi | Xk) − H(Xi | Xj, Xk)
    (amount of information shared by two variables, given a third one)
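    These quantities can all be estimated from data by plain frequency counts. A minimal Python sketch of the empirical estimators (natural logarithms; the helper names are our own, not taken from the paper's implementation):

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical entropy H(X) of a sample, using natural logarithms."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X | Y): remaining uncertainty in X after observing Y (paired samples)."""
    n = len(ys)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    # H(X | Y) = sum_y p(y) * H(X | Y = y)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

def mutual_information(xs, ys):
    """I(X, Y) = H(X) - H(X | Y); symmetric in its arguments."""
    return entropy(xs) - conditional_entropy(xs, ys)

def conditional_mutual_information(xs, ys, zs):
    """I(X, Y | Z) = H(X | Z) - H(X | Y, Z), conditioning on (Y, Z) pairs."""
    return conditional_entropy(xs, zs) - conditional_entropy(xs, list(zip(ys, zs)))
```

    For instance, `mutual_information` returns H(X) when called with two copies of the same sample, and the symmetry I(X, Y) = I(Y, X) holds up to floating-point rounding.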
  • 10. Information Theoretic Filter Methods
    Information-theoretic filter methods have been analyzed using the conditional likelihood:
    L(S, τ | D) = Π_{i=1}^{n} q(ci | xi, τ)
    S: features included in the model
    τ: parameters of the distributions involved in the model
    D: a dataset, D = {(xi, ci), i = 1, ..., n}
    q: the learnt model
    The conditional likelihood is maximized by minimizing I(X \ S, C | S) (the mutual information between the class and the features not included in the model, given the variables actually included).
  • 11. Filter measures
    We shall make this technical assumption: for Xi, Xj ∈ S and Xk ∈ X \ S, Xi and Xj are conditionally independent both when conditioning on Xk and on {Xk, C}.
    This allows us to select features greedily, using as a filter measure the following quantity based on the conditional mutual information (cmi):
    Jcmi(Xi) = I(Xi, C | S) = I(Xi, C) − Σ_{Xj ∈ S} [ I(Xi, Xj) − I(Xi, Xj | C) ],
    where Xi is the candidate variable to include in the model.
  • 12. Filter measures
    Another remarkable filter measure is the joint mutual information (jmi), defined as:
    Jjmi(Xi) = Σ_{Xj ∈ S} I({Xi, Xj}, C) = I(Xi, C) − (1/|S|) Σ_{Xj ∈ S} [ I(Xi, Xj) − I(Xi, Xj | C) ]
    According to the literature, Jjmi is the metric showing the best accuracy/stability tradeoff, and is therefore the one we utilize in the following.
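    Given the pairwise statistics, the Jjmi score is a simple combination of precomputed numbers. A sketch of the scoring step (the dictionary-based interface is an assumption for illustration, not the paper's actual data structures):

```python
def j_jmi(i, S, mi_class, mi_pair, cmi_pair):
    """Jjmi(X_i) = I(X_i, C) - (1/|S|) * sum_{X_j in S} [I(X_i, X_j) - I(X_i, X_j | C)].

    Hypothetical containers for the precomputed statistics:
      mi_class[i]                 -> I(X_i, C)
      mi_pair[frozenset({i, j})]  -> I(X_i, X_j)
      cmi_pair[frozenset({i, j})] -> I(X_i, X_j | C)
    """
    if not S:
        # With no features selected yet, the score reduces to I(X_i, C).
        return mi_class[i]
    redundancy = sum(mi_pair[frozenset({i, j})] - cmi_pair[frozenset({i, j})]
                     for j in S)
    return mi_class[i] - redundancy / len(S)
```

    Because every term is looked up rather than recomputed, scoring all candidates at each greedy step is cheap once the statistics are available.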
  • 13. Outline
    1 Motivation
    2 Preliminaries
    3 Scaling up filter-based feature selection
    4 Experimental results
    5 Conclusions
  • 14. Filter-based feature selection algorithm
     1  Function Filter(X, C, M)
        Input: The set of features, X = {X1, ..., XN}. The class variable, C.
               The maximum number of features to be selected, M.
        Output: S, a set of selected features.
     2  begin
     3    for i ← 1 to N do
     4      Compute I(Xi, C);
     5      for j ← i + 1 to N do
     6        Compute I(Xi, Xj | C);
     7        Compute I(Xi, Xj);
     8      end
     9    end
    10    X* ← arg max_{1 ≤ i ≤ N} I(Xi, C);
    11    S ← {X*};
    12    for i ← 1 to M − 1 do
    13      R ← X \ S;
    14      for X ∈ R do
    15        Compute Jjmi(X) using the statistics computed in Steps 4, 6, and 7;
    16      end
    17      X* ← arg max_{X ∈ R} Jjmi(X);
    18      S ← S ∪ {X*};
    19    end
    20    return S;
    21  end
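    The greedy loop of the algorithm (Steps 10-20) can be sketched in a few lines once the statistics of Steps 4, 6, and 7 are available. The dictionary-based interface below is our own simplification, not the paper's implementation:

```python
def filter_select(features, mi_class, mi_pair, cmi_pair, M):
    """Greedy Jjmi-based feature selection over precomputed statistics.

    mi_class[i], mi_pair[frozenset({i, j})], cmi_pair[frozenset({i, j})]
    are hypothetical containers for I(X_i, C), I(X_i, X_j), I(X_i, X_j | C).
    """
    def jjmi(i, S):
        red = sum(mi_pair[frozenset({i, j})] - cmi_pair[frozenset({i, j})]
                  for j in S)
        return mi_class[i] - red / len(S)

    # Seed with the feature sharing the most information with the class.
    S = [max(features, key=lambda i: mi_class[i])]
    while len(S) < M:
        R = [i for i in features if i not in S]   # remaining candidates
        S.append(max(R, key=lambda i: jjmi(i, S)))
    return S
```

    Note how each iteration only reads precomputed numbers; all dataset access is confined to the statistic-gathering phase, which is exactly the part the BIB designs parallelize.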
  • 15. Optimizing the calculation of the scores
    To compute all the necessary information theory scores, one possibility is to create a thread for each pair of features. However, for n variables, this requires accessing the dataset C(n, 2) = n(n−1)/2 times, inducing a significant overhead due to disk/network access.
    We propose making groups of variables (blocks), each of which accesses the dataset only a single time. To do this, two key issues:
    Finding an appropriate block size.
    Ensuring that every pair of variables appears in exactly one block (to avoid duplicated calculations).
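    When every pair appears in exactly one block of size k, the number of blocks is b = v(v−1)/(k(k−1)), so the saving in dataset passes follows from a little arithmetic (the parameter values below are illustrative, not taken from the experiments):

```python
from math import comb

def num_blocks(v, k):
    """Number of blocks in a (v, k, 1)-BIB design: b = v(v-1) / (k(k-1)).

    The divisibility check is a necessary (not sufficient) existence condition.
    """
    assert (v * (v - 1)) % (k * (k - 1)) == 0
    return v * (v - 1) // (k * (k - 1))

# With v = 31 variables and blocks of k = 6, the dataset is read
# num_blocks(31, 6) = 31 times instead of comb(31, 2) = 465 times
# with one thread per pair.
```

    The k(k−1) factor in the denominator is why moderately sized blocks already cut the number of passes by more than an order of magnitude.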
  • 16. Balanced incomplete block (BIB) designs
    What we seek can be achieved using balanced incomplete block (BIB) designs.
    Given a set of variables X, we say (X, A) is a design if A is a collection of non-empty subsets of X (blocks).
    A design is a (v, k, λ)-BIB design if:
    v = |X| is the number of variables in X.
    Each block in A contains exactly k variables.
    Every pair of distinct variables is contained in exactly λ blocks.
    We have found that (q, 6, 1)-BIB designs (blocks of 6 variables) are appropriate for practical use.
  • 17. Balanced incomplete block (BIB) designs
    Some considerations about BIB designs:
    A (v, k, λ)-BIB design might not exist for some combinations of the parameters.
    Finding a BIB design is an NP-complete problem.
    To use BIB designs efficiently, we have pre-calculated a number of them, which are then used at run-time.
    BIB designs can be generated from difference sets, avoiding the need to store the full design.
    Check the paper for more details on this!
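    The difference-set construction can be illustrated on a small case. The sketch below develops a (7, 3, 1)-BIB design (the Fano plane) from the planar difference set {0, 1, 3} mod 7 and verifies the BIB properties; it is a toy stand-in for the pre-calculated (q, 6, 1) designs actually used in the paper:

```python
from collections import Counter
from itertools import combinations

def develop(diff_set, v):
    """Develop a design from a difference set modulo v: one block per cyclic shift,
    so only the base block needs to be stored."""
    return [frozenset((d + t) % v for d in diff_set) for t in range(v)]

def is_bib(blocks, v, k, lam):
    """Check the (v, k, lam)-BIB properties on an explicit block list."""
    if any(len(b) != k for b in blocks):
        return False
    # Count how many blocks contain each unordered pair of variables.
    counts = Counter(p for b in blocks for p in combinations(sorted(b), 2))
    return all(counts[p] == lam for p in combinations(range(v), 2))

# {0, 1, 3} is a planar difference set mod 7: every nonzero residue arises
# exactly once as a difference of two of its elements.
blocks = develop({0, 1, 3}, 7)
```

    Each of the 7 blocks triggers one pass over the data, and every pair of the 7 variables is covered exactly once, so no pairwise statistic is ever computed twice.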
  • 18. Example of BIB designs
    Figure 1: Example illustrating the use of (q, 6, 1) and (3, 2, 1) designs.
  • 19. Outline
    1 Motivation
    2 Preliminaries
    3 Scaling up filter-based feature selection
    4 Experimental results
    5 Conclusions
  • 20. Experimental setup
    Goal: empirical evaluation of the time performance improvement using BIB designs.
    Shared-memory computer with multiple cores: Intel(TM) i7-5820K 3.3 GHz processor (6 physical and 12 logical cores), with 64 GB RAM, running Red Hat Enterprise Linux 7.
    Time measurements: elapsed (wall-clock) time, averaged over 10 runs with each dataset.
    Datasets: randomly simulated from well-known Bayesian networks, and a real-world dataset from a Spanish bank.
  • 21. Tested Bayesian networks
    Table 1: Bayesian networks from which datasets were generated.
    Dataset    |X|    |E|    Total CPT size
    Munin1     189    282    19,466
    Diabetes   413    602    461,069
    Munin2     1,003  1,244  83,920
    SACSO      2,371  3,521  44,274
    |X|: number of variables in the Bayesian network
    |E|: number of edges in the Bayesian network
    CPT: total conditional probability table size
    In addition, a real-world dataset of 1,823 variables from a Spanish bank was also tested.
  • 22. Speed-up for Munin1 datasets
    [Plots of average runtime (seconds) and average speed-up factor vs. number of threads, for 100,000 / 250,000 / 500,000 cases.]
  • 23. Speed-up for Munin2 datasets
    [Plots of average runtime (seconds) and average speed-up factor vs. number of threads, for 100,000 / 250,000 / 500,000 cases.]
  • 24. Speed-up for SACSO datasets
    [Plots of average runtime (seconds) and average speed-up factor vs. number of threads, for 250,000 / 500,000 cases.]
  • 25. Speed-up for Diabetes datasets
    [Plot of average runtime (seconds) and average speed-up factor vs. number of threads, for 100k cases.]
  • 26. Speed-up for Bank datasets
    [Plots of average runtime (seconds) and average speed-up factor vs. number of threads, for 100k / 250k / 500k cases.]
  • 27. Impact of using the (q, 6, 1)-BIB designs
    Bank dataset, 100k cases.
    [Plots comparing runtime and speed-up with BIB designs vs. using pairs directly.]
  • 28. Outline
    1 Motivation
    2 Preliminaries
    3 Scaling up filter-based feature selection
    4 Experimental results
    5 Conclusions
  • 29. Conclusions
    A parallel filter-based feature selection algorithm has been proposed.
    It uses information-theoretic measures (conditional mutual information) for filtering.
    A two-step balanced incomplete block (BIB) design has been used to distribute and optimize the computations asynchronously ((q, 6, 1) and then (3, 2, 1)).
    For variables with a large number of states, it might be preferable to skip the first step (Diabetes).
    The performance improvement when using (q, 6, 1) designs is in most cases substantial: speed-up factors of about 4-6 were obtained running on a computer with 6 physical cores.
  • 30. Future work
    Horizontal parallelization: each computing unit holds only a subset of the data over all variables. This will support parallelization on a distributed-memory system.
  • 31. Thank you for your attention
    You can download our open source Java toolbox: http://www.amidsttoolbox.com
    Acknowledgments: This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no 619209.