In an era of growing data volume and complexity, feature selection plays a key role in reducing the high dimensionality of machine learning problems.
https://www.bigdataspain.org/2017/talk/feature-selection-for-big-data-advances-and-challenges
Big Data Spain 2017
November 16th-17th, Kinépolis Madrid
4. The more data, the better… right?
The curse of dimensionality
5. Feature selection
“Feature selection is the process of selecting the relevant features and discarding the irrelevant and redundant ones”
Note: we are not talking about feature extraction for dimensionality reduction!
PCA, t-SNE, manifold learning? No: those transform the data and lose the meaning of the original features
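Below is a minimal sketch of that contrast, assuming scikit-learn; the synthetic dataset, the feature names f0…f19, and k=5 are illustrative choices, not part of the talk:

```python
# Feature extraction (PCA) vs. feature selection: both reduce
# dimensionality, but only selection keeps the original features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])  # illustrative names

# Extraction: each PCA component mixes all 20 original features,
# so the new axes no longer correspond to measurable quantities.
X_pca = PCA(n_components=5).fit_transform(X)

# Selection: keep 5 of the original columns; their meaning is preserved.
selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print("selected:", feature_names[selector.get_support()])
```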
6. What is a relevant feature?
Imagine that you are trying to guess the price of a car…
● Relevant: engine size, age, mileage, presence of rust, ...
● Irrelevant: color of windscreen wipers, stickers on windows, ...
● Redundant: age / mileage
7. Why feature selection?
● General data reduction: to limit storage requirements and increase algorithm speed
● Feature set reduction: to save resources in the next round of data collection
● Performance improvement: to gain in predictive accuracy
● Data understanding: to gain knowledge about the process that generated the data, or for visualization
8. Feature selection methods
Subset vs Ranker
Filters vs Embedded vs Wrappers
Univariate vs Multivariate
Sorry… There is no one-size-fits-all method!
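To make the taxonomy concrete, here is a minimal sketch of two of these families, a univariate filter and a wrapper, assuming scikit-learn; the synthetic dataset and k=5 are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

# Filter (univariate ranker): scores each feature independently of any model.
filt = SelectKBest(mutual_info_classif, k=5).fit(X, y)

# Wrapper: repeatedly fits a model and drops the weakest features (RFE).
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

print("filter picked: ", filt.get_support(indices=True))
print("wrapper picked:", wrap.get_support(indices=True))
```

The filter is cheap and model-agnostic; the wrapper is costlier but tuned to the classifier, which is one reason no single method fits every problem.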
13. Big Dimensionality
● > 29 million features, > 20 million samples
● > 54 million features, > 149 million samples
14. Scalability
In scaling up learning algorithms, the issue is not so much one of
speeding up a slow algorithm, as one of turning an impracticable
algorithm into a practical one
“Good enough” solutions, obtained as fast and as efficiently as possible
16. Distributed feature selection
● Data is sometimes already distributed at its origin
● Privacy issues
● Vertical or horizontal distribution?
● Overlap between partitions?
● How to aggregate partial results? (see the sketch below)
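One possible scheme is sketched below under illustrative assumptions (a horizontal partition into three nodes, per-node mutual-information rankings, aggregation by mean rank); it is not the specific method from the talk:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=900, n_features=30, random_state=0)

# Horizontal partition: each "node" holds a share of the samples.
ranks = []
for Xp, yp in zip(np.array_split(X, 3), np.array_split(y, 3)):
    scores = mutual_info_classif(Xp, yp, random_state=0)
    ranks.append(scores.argsort()[::-1].argsort())  # rank 0 = best feature

# Aggregation: average the per-node ranks and keep the 5 best overall.
mean_rank = np.mean(ranks, axis=0)
print("selected features:", np.argsort(mean_rank)[:5])
```

Mean-rank aggregation is exactly the kind of function that Arrow's theorem (next slide) says cannot be ideal, which is why "good enough" is the realistic goal.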
17. Distributed feature selection
Arrow’s impossibility theorem:
“With at least two rankers (nodes) and at least three options to rank (features), it is impossible to design an aggregation function that satisfies, in a strong sense, a set of desirable conditions all at once”
18. Distributed feature selection
Good enough solutions in terms of accuracy
Bolón-Canedo, V., et al. “Exploring the consequences of distributed feature selection in DNA microarray data.” Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 1665-1672, 2017.
22. Online feature selection
What is done today:
● Pre-selecting features, with no subsequent online classification
● Classifiers that are not flexible with respect to their input features

What is needed (a minimal sketch follows this list):
● Flexible feature selection methods capable of modifying the selected subset of features as new training samples arrive
● Methods that can be executed in a dynamic feature space, initially empty, that adds features as new information arrives
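Here is a minimal sketch of the first requirement, under an illustrative scoring rule (a running class-mean separation for binary labels; this is an assumption for the example, not a method from the talk):

```python
import numpy as np

class StreamingSelector:
    """Keeps per-class running sums so the selected feature subset can be
    recomputed cheaply every time a new labelled sample arrives."""
    def __init__(self, n_features, k):
        self.k = k
        self.sums = np.zeros((2, n_features))  # per-class running sums
        self.counts = np.zeros(2)

    def partial_fit(self, x, label):  # one (sample, binary label) at a time
        self.sums[label] += x
        self.counts[label] += 1

    def selected(self):
        means = self.sums / np.maximum(self.counts, 1)[:, None]
        score = np.abs(means[0] - means[1])  # separation between class means
        return np.argsort(score)[::-1][:self.k]

sel = StreamingSelector(n_features=10, k=3)
for x, label in [(np.random.rand(10), 0), (np.random.rand(10), 1)]:
    sel.partial_fit(x, label)            # the subset may change per sample
print("current subset:", sel.selected())
```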
26. Feature cost: a real case
In tear film lipid layer classification, the time (cost) of extracting each feature is not the same, and this cost should be minimized.
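One simple way to fold such costs into a filter is sketched below, under illustrative assumptions (synthetic costs and a hypothetical trade-off weight lam; the talk does not prescribe this rule):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cost = np.random.default_rng(0).uniform(0.1, 1.0, X.shape[1])  # synthetic extraction times

relevance = mutual_info_classif(X, y, random_state=0)
lam = 0.5                            # hypothetical cost/relevance trade-off
penalized = relevance - lam * cost   # cheap, relevant features rank highest
print("top features:", np.argsort(penalized)[::-1][:3])
```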
27. Visualization and interpretability
Typical approach: feature extraction. Loss of interpretability!
A model is only as good as its features, so features play a central role in model interpretability.
Two-fold need for interpretability and transparency in the feature selection and model creation processes:
● More interactive model visualizations, to better interact with the model and visualize future scenarios
● A more interactive feature selection process where, using interactive visualizations, it is possible to iterate through different feature subsets (see the sketch below)
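A minimal sketch of the kind of subset-iteration loop such a tool drives visually, assuming scikit-learn; the candidate subsets and the model are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Illustrative candidate subsets a user might explore interactively.
candidate_subsets = [[0, 1, 2], [3, 4, 5, 6], [0, 5, 7, 11]]
for subset in candidate_subsets:
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          X[:, subset], y, cv=5).mean()
    print(f"features {subset}: accuracy {acc:.3f}")
```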
28. Visualization and interpretability
Digital Diogenes Syndrome
Organizations need to gather data in a meaningful way
From data-rich/knowledge-poor to data-rich/knowledge-rich
Krause, J., Perer, A., & Bertini, E. (2014). INFUSE: interactive feature selection for predictive modeling of high dimensional data. IEEE Transactions on Visualization and Computer Graphics, 20(12), 1614-1623.
29. What is big in Big Data?
A new opportunity to develop methods for computationally constrained platforms!
30. Take home message
1. If you have never considered applying feature selection to your problem, give it a try!
2. If you are interested in feature selection, it is a prolific, open line of research facing the new challenges that Big Data has brought.