In an era of growing data volume and complexity, feature selection plays a key role in reducing the high dimensionality of machine learning problems.
https://www.bigdataspain.org/2017/talk/feature-selection-for-big-data-advances-and-challenges
Big Data Spain 2017
November 16th-17th, Kinépolis Madrid
4. The more data, the better… right?
The curse of dimensionality
5. Feature selection
“Feature selection is the process of selecting the relevant features and discarding the irrelevant and redundant ones”
Note: we are not talking about feature extraction for dimensionality reduction!
PCA, t-SNE, manifold learning? No: those transform the data and lose the meaning of the original features
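Below is a minimal sketch of that contrast, assuming scikit-learn; the synthetic dataset, the feature names f0…f19, and k=5 are illustrative choices, not part of the talk:

```python
# Feature extraction (PCA) vs. feature selection: both reduce
# dimensionality, but only selection keeps the original features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])  # illustrative names

# Extraction: each PCA component mixes all 20 original features,
# so the new axes no longer correspond to measurable quantities.
X_pca = PCA(n_components=5).fit_transform(X)

# Selection: keep 5 of the original columns; their meaning is preserved.
selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print("selected:", feature_names[selector.get_support()])
```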
6. What is a relevant feature?
Imagine that you are trying to guess the price of a car…
● Relevant: engine size, age, mileage, presence of rust, ...
● Irrelevant: color of windscreen wipers, stickers on windows, ...
● Redundant: age / mileage
7. Why feature selection?
● General data reduction: to limit storage requirements and increase algorithm speed
● Feature set reduction: to save resources in the next round of data collection
● Performance improvement: to gain in predictive accuracy
● Data understanding: to gain knowledge about the process that generated the data, or for visualization
8. Feature selection methods
Subset vs Ranker
Filters vs Embedded vs Wrappers
Univariate vs Multivariate
Sorry… There is no one-size-fits-all method!
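To make the taxonomy concrete, here is a minimal sketch of two of these families, a univariate filter and a wrapper, assuming scikit-learn; the synthetic dataset and k=5 are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

# Filter (univariate ranker): scores each feature independently of any model.
filt = SelectKBest(mutual_info_classif, k=5).fit(X, y)

# Wrapper: repeatedly fits a model and drops the weakest features (RFE).
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

print("filter picked: ", filt.get_support(indices=True))
print("wrapper picked:", wrap.get_support(indices=True))
```

The filter is cheap and model-agnostic; the wrapper is costlier but tuned to the classifier, which is one reason no single method fits every problem.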
13. Big Dimensionality
● > 29 million features, > 20 million samples
● > 54 million features, > 149 million samples
14. Scalability
In scaling up learning algorithms, the issue is not so much one of
speeding up a slow algorithm, as one of turning an impracticable
algorithm into a practical one
“Good enough” solutions, obtained as fast and as efficiently as possible
16. Distributed feature selection
● Data is sometimes already distributed at its origin
● Privacy issues
● Vertical or horizontal distribution?
● Overlap between partitions?
● How to aggregate partial results? (see the sketch below)
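One possible scheme is sketched below under illustrative assumptions (a horizontal partition into three nodes, per-node mutual-information rankings, aggregation by mean rank); it is not the specific method from the talk:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=900, n_features=30, random_state=0)

# Horizontal partition: each "node" holds a share of the samples.
ranks = []
for Xp, yp in zip(np.array_split(X, 3), np.array_split(y, 3)):
    scores = mutual_info_classif(Xp, yp, random_state=0)
    ranks.append(scores.argsort()[::-1].argsort())  # rank 0 = best feature

# Aggregation: average the per-node ranks and keep the 5 best overall.
mean_rank = np.mean(ranks, axis=0)
print("selected features:", np.argsort(mean_rank)[:5])
```

Mean-rank aggregation is exactly the kind of function that Arrow's theorem (next slide) says cannot be ideal, which is why "good enough" is the realistic goal.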
17. Distributed feature selection
Arrow’s impossibility theorem:
“With at least two rankers (nodes) and at least three options to rank (features), it is impossible to design an aggregation function that satisfies, in a strong sense, a set of desirable conditions all at once”
18. Distributed feature selection
Good enough solutions in terms of accuracy
Bolón-Canedo, V., et al. “Exploring the consequences of distributed feature selection in DNA microarray data.” Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 1665-1672, 2017.
22. Online feature selection
What is done today:
● Pre-selecting features, with no subsequent online classification
● Classifiers that are not flexible with respect to their input features

What is needed (a minimal sketch follows this list):
● Flexible feature selection methods capable of modifying the selected subset of features as new training samples arrive
● Methods that can be executed in a dynamic feature space, initially empty, that adds features as new information arrives
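Here is a minimal sketch of the first requirement, under an illustrative scoring rule (a running class-mean separation for binary labels; this is an assumption for the example, not a method from the talk):

```python
import numpy as np

class StreamingSelector:
    """Keeps per-class running sums so the selected feature subset can be
    recomputed cheaply every time a new labelled sample arrives."""
    def __init__(self, n_features, k):
        self.k = k
        self.sums = np.zeros((2, n_features))  # per-class running sums
        self.counts = np.zeros(2)

    def partial_fit(self, x, label):  # one (sample, binary label) at a time
        self.sums[label] += x
        self.counts[label] += 1

    def selected(self):
        means = self.sums / np.maximum(self.counts, 1)[:, None]
        score = np.abs(means[0] - means[1])  # separation between class means
        return np.argsort(score)[::-1][:self.k]

sel = StreamingSelector(n_features=10, k=3)
for x, label in [(np.random.rand(10), 0), (np.random.rand(10), 1)]:
    sel.partial_fit(x, label)            # the subset may change per sample
print("current subset:", sel.selected())
```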
26. Feature cost: a real case
In tear film lipid layer classification, the time (cost) of extracting each feature is not the same, and this cost should be minimized.
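One simple way to fold such costs into a filter is sketched below, under illustrative assumptions (synthetic costs and a hypothetical trade-off weight lam; the talk does not prescribe this rule):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cost = np.random.default_rng(0).uniform(0.1, 1.0, X.shape[1])  # synthetic extraction times

relevance = mutual_info_classif(X, y, random_state=0)
lam = 0.5                            # hypothetical cost/relevance trade-off
penalized = relevance - lam * cost   # cheap, relevant features rank highest
print("top features:", np.argsort(penalized)[::-1][:3])
```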
27. Visualization and interpretability
Typical approach: feature extraction. Loss of interpretability!
A model is only as good as its features, so features play a central role in model interpretability.
Two-fold need for interpretability and transparency in the feature selection and model creation processes:
● More interactive model visualizations, to better interact with the model and visualize future scenarios
● A more interactive feature selection process where, using interactive visualizations, it is possible to iterate through different feature subsets (see the sketch below)
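A minimal sketch of the kind of subset-iteration loop such a tool drives visually, assuming scikit-learn; the candidate subsets and the model are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Illustrative candidate subsets a user might explore interactively.
candidate_subsets = [[0, 1, 2], [3, 4, 5, 6], [0, 5, 7, 11]]
for subset in candidate_subsets:
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          X[:, subset], y, cv=5).mean()
    print(f"features {subset}: accuracy {acc:.3f}")
```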
28. Visualization and interpretability
Digital Diogenes Syndrome
Organizations need to gather data in a meaningful way
From data-rich/knowledge-poor to data-rich/knowledge-rich
Krause, J., Perer, A., & Bertini, E. (2014). INFUSE: interactive feature selection for predictive modeling of high dimensional data. IEEE Transactions on Visualization and Computer Graphics, 20(12), 1614-1623.
29. What is big in Big Data?
A new opportunity to develop methods for computationally constrained platforms!
30. Take home message
1. If you have never considered applying feature selection to your problem, give it a try!
2. If you are interested in feature selection, it is a prolific, open line of research facing the new challenges that Big Data has brought.