Feature selection for Big Data:
advances and challenges
Verónica Bolón-Canedo
Big Data
Volume
Velocity
Variety
Veracity
Value
Variability
Visualization
Validity
Vulnerability
Volatility
Variables
The more data, the better… right?
The curse of dimensionality
Feature selection
“Feature selection is the process of selecting the relevant
features and discarding the irrelevant and redundant ones”
Note: not talking about feature extraction for dimensionality reduction!
PCA, t-SNE, manifold learning? No, they lose the meaning of original features
What is a relevant feature?
Imagine that you are trying to guess the price of a car…
● Relevant: engine size, age, mileage, presence of rust, ...
● Irrelevant: color of windscreen wipers, stickers on windows, ...
● Redundant: age / mileage
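The car example can be sketched numerically. This is a minimal illustration on synthetic data, not material from the talk: the feature names, the price formula, and the use of absolute Pearson correlation as a relevance and redundancy measure are all assumptions made for the sketch. Pearson correlation only captures linear dependence, so a real analysis may need a different measure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic, made-up car data (assumption: not taken from the slides).
engine_size = rng.uniform(1.0, 3.0, n)             # litres
age = rng.uniform(0, 15, n)                        # years
mileage = 15000 * age + rng.normal(0, 5000, n)     # tracks age closely
wiper_color = rng.integers(0, 3, n).astype(float)  # unrelated to price
price = 20000 + 5000 * engine_size - 800 * age + rng.normal(0, 1000, n)

features = {
    "engine_size": engine_size,
    "age": age,
    "mileage": mileage,
    "wiper_color": wiper_color,
}


def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(a, b)[0, 1])


# Relevance: how strongly each feature is associated with the target.
relevance = {name: abs_corr(col, price) for name, col in features.items()}

# Redundancy: how strongly two features are associated with each other.
redundancy_age_mileage = abs_corr(age, mileage)

for name, score in relevance.items():
    print(f"relevance({name}) = {score:.2f}")
print(f"redundancy(age, mileage) = {redundancy_age_mileage:.2f}")
```

In this sketch the wiper color scores near zero (irrelevant), while age and mileage are both relevant but nearly interchangeable (redundant), so keeping one of them is enough.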
Why feature selection?
● General data reduction: to limit storage requirements and increase algorithm speed
● Feature set reduction: to save resources in the next round of data collection
● Performance improvement: to gain in predictive accuracy
● Data understanding: to gain knowledge about the process that generated the data, or for visualization
Feature selection methods
Subset vs Ranker
Filters vs Embedded vs Wrappers
Univariate vs Multivariate
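The three families can be contrasted with a short sketch. It assumes scikit-learn (version 0.24 or later) is available and uses a synthetic dataset; the specific estimators and parameter values are illustrative choices, not the methods used in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    SelectFromModel,
    SelectKBest,
    SequentialFeatureSelector,
    f_classif,
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic data: 20 features, only a few of them informative.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=2, random_state=0
)

# Filter (univariate ranker): scores each feature independently of any
# classifier, then keeps the k best-ranked ones.
filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)
filter_idx = filt.get_support(indices=True)

# Wrapper (subset search): evaluates candidate subsets by the
# cross-validated performance of a classifier.
wrapper = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",
    cv=3,
).fit(X, y)
wrapper_idx = wrapper.get_support(indices=True)

# Embedded: selection happens during training; the L1 penalty drives
# the weights of unhelpful features to zero.
embedded = SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False)).fit(X, y)
embedded_idx = embedded.get_support(indices=True)

print("filter:  ", filter_idx)
print("wrapper: ", wrapper_idx)
print("embedded:", embedded_idx)
```

The filter is the cheapest and ignores feature interactions; the wrapper is the most expensive because it retrains the classifier for every candidate subset; the embedded method sits in between.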
Sorry… There is no one-size-fits-all method!
Feature selection is successful!
If you want to know more about feature selection...
Big Dimensionality
[Chart: dataset dimensionality by decade (1980s, 1990s, 2000s); plotted values include 100, 1500, and 3,000,000]
Big Dimensionality
> 29 million features, > 20 million samples
> 54 million features, > 149 million samples
Scalability
In scaling up learning algorithms, the issue is not so much one of
speeding up a slow algorithm, as one of turning an impracticable
algorithm into a practical one
“Good enough” solutions
as “fast” as possible
and as “efficiently” as possible
Scalability
Model complexity
Univariate vs Multivariate
Parameter tuning
Stability
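One common way to quantify the stability item is to run the same selector on resampled versions of the data and measure how much the selected subsets overlap. The sketch below is an assumption-laden illustration: the correlation-based ranker, the bootstrap scheme, and the average pairwise Jaccard index are choices made here, not measures named in the talk.

```python
from itertools import combinations

import numpy as np


def top_k_by_correlation(X, y, k):
    """Return the set of the k features most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    scores = np.abs(Xc.T @ yc) / denom
    return set(np.argsort(scores)[-k:].tolist())


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)


def selection_stability(X, y, k, n_resamples=20, seed=0):
    """Average pairwise Jaccard index of subsets selected on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subsets = []
    for _ in range(n_resamples):
        rows = rng.integers(0, n, n)  # sample rows with replacement
        subsets.append(top_k_by_correlation(X[rows], y[rows], k))
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Synthetic data: only the first three of 50 features drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=200)

stability = selection_stability(X, y, k=3)
print(f"stability = {stability:.2f}")
```

A value near 1 means the selector picks almost the same features regardless of the sample; a value near 0 means the selected subset is largely an artifact of the particular sample.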
Distributed learning
Distributed feature selection
● Data is, sometimes, distributed in origin
● Privacy issues
● Vertical or horizontal distribution?
● Overlap between partitions?
● How to aggregate partial results?
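The questions above can be made concrete with a small sketch of one possible design: horizontal distribution (each node holds a disjoint subset of the samples and sees all features), a local ranking per node, and aggregation by mean rank position. All of these are assumptions for illustration; the correlation-based ranker and the function names are hypothetical and are not the method from the talk.

```python
import numpy as np


def rank_features(X, y):
    """Return feature indices ordered best-first by absolute Pearson correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    scores = np.abs(Xc.T @ yc) / denom
    return np.argsort(-scores, kind="stable")


def distributed_ranking(X, y, n_nodes, seed=0):
    """Rank features on disjoint sample partitions and aggregate by mean rank."""
    rng = np.random.default_rng(seed)
    shuffled_rows = rng.permutation(X.shape[0])
    partitions = np.array_split(shuffled_rows, n_nodes)  # no overlap between nodes

    n_features = X.shape[1]
    rank_sum = np.zeros(n_features)
    for rows in partitions:
        local_ranking = rank_features(X[rows], y[rows])
        positions = np.empty(n_features)
        positions[local_ranking] = np.arange(n_features)  # 0 = best on this node
        rank_sum += positions

    mean_rank = rank_sum / n_nodes
    return np.argsort(mean_rank, kind="stable")  # best-first global ranking


# Synthetic data: features 0 and 1 drive the target, the rest are noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 30))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=600)

global_ranking = distributed_ranking(X, y, n_nodes=4)
print("top features:", global_ranking[:5])
```

Each node only needs its own rows, and only the rankings travel to the aggregator, which is one way to respect the privacy constraint listed above.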
Distributed feature selection
Arrow’s impossibility theorem:
“When having at least two rankers
(nodes), and at least three options to rank
(features), it is impossible to design an
aggregation function that satisfies in a
strong way a set of desirable conditions at
once”
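A tiny example shows the kind of conflict behind this result. It is an illustration only, not a proof of the theorem: the three hypothetical nodes and the feature names f1, f2, f3 are made up. Pairwise majority voting produces a cycle, and a Borda-style sum of positions produces a three-way tie.

```python
# Rankings from three hypothetical nodes, best feature first.
rankings = [
    ["f1", "f2", "f3"],
    ["f2", "f3", "f1"],
    ["f3", "f1", "f2"],
]


def majority_prefers(a, b, rankings):
    """True if more than half of the rankers place feature a above feature b."""
    votes = sum(r.index(a) < r.index(b) for r in rankings)
    return votes > len(rankings) / 2


# Pairwise majorities form a cycle: f1 beats f2, f2 beats f3, f3 beats f1.
has_cycle = (
    majority_prefers("f1", "f2", rankings)
    and majority_prefers("f2", "f3", rankings)
    and majority_prefers("f3", "f1", rankings)
)

# Borda-style aggregation (sum of positions, lower is better) cannot
# separate the features either: every feature gets the same total.
borda = {f: sum(r.index(f) for r in rankings) for f in ["f1", "f2", "f3"]}

print("majority cycle:", has_cycle)
print("Borda totals:", borda)
```

Any aggregation function has to break such conflicts somehow, which is why "good enough" aggregation rules are used in practice.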
Distributed feature selection
Good enough solutions in terms
of accuracy
Bolón-Canedo, Verónica, et al. "Exploring the consequences of distributed feature selection in DNA microarray data." In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 1665-1672 (2017).
Parallel feature selection
Real-time processing
Spam detection
Video/image detection
Portable devices
CAD systems
etc.
Online feature selection
● Pre-selecting features
● No subsequent online classification
● Classifiers not flexible with respect to input features
● Find flexible feature selection methods capable of modifying the selected subset of features as new training samples arrive
● Methods that can be executed in a dynamic feature space that is initially empty and adds features as new information arrives
Online feature selection
Chi2
k-means
One-layer ANN
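To give a feel for how a Chi2 filter can work online, the sketch below keeps a running contingency table per feature and recomputes the chi-squared statistic whenever a new batch arrives. This is a simplified illustration under stated assumptions, not the pipeline from the talk: the class name `OnlineChi2` is hypothetical, features are assumed to be already discretized into integer bins, and the k-means and one-layer ANN stages are left out.

```python
import numpy as np


class OnlineChi2:
    """Incremental chi-squared scoring for discretized features (illustrative sketch)."""

    def __init__(self, n_features, n_bins, n_classes):
        # counts[j, b, c] = samples seen so far with feature j in bin b and class c
        self.counts = np.zeros((n_features, n_bins, n_classes))

    def partial_fit(self, X_binned, y):
        """Update the contingency tables with one batch of integer-coded data."""
        for j in range(self.counts.shape[0]):
            np.add.at(self.counts[j], (X_binned[:, j], y), 1)
        return self

    def scores(self):
        """Chi-squared statistic of each feature against the class, from counts so far."""
        observed = self.counts
        with np.errstate(divide="ignore", invalid="ignore"):
            total = observed.sum(axis=(1, 2), keepdims=True)
            bin_totals = observed.sum(axis=2, keepdims=True)
            class_totals = observed.sum(axis=1, keepdims=True)
            expected = bin_totals * class_totals / total
            terms = np.where(
                expected > 0, (observed - expected) ** 2 / expected, 0.0
            )
        return terms.sum(axis=(1, 2))


# Simulated stream: feature 0 depends on the class, the others are random.
rng = np.random.default_rng(0)
selector = OnlineChi2(n_features=5, n_bins=3, n_classes=2)
for _ in range(10):
    y = rng.integers(0, 2, size=100)
    X_binned = rng.integers(0, 3, size=(100, 5))
    X_binned[:, 0] = y + (rng.random(100) < 0.2).astype(int)
    selector.partial_fit(X_binned, y)

scores = selector.scores()
best_feature = int(np.argmax(scores))
print("chi2 scores:", np.round(scores, 1))
print("best feature:", best_feature)
```

Because only counts are stored, memory use does not grow with the number of samples, and the ranking can change as new batches arrive.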
Feature cost
Feature cost: a real case
In tear film lipid layer classification, the time (cost) needed to extract
each feature differs, and it should be minimized.
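One simple way to take cost into account is to penalize each feature's relevance by its normalized cost. The sketch below is an assumption for illustration, not the method from the talk: the function name, the trade-off parameter `lam`, the feature names, and all numbers are made up.

```python
def cost_aware_ranking(relevance, cost, lam):
    """Rank features by relevance minus lam times their cost (normalized to [0, 1])."""
    max_cost = max(cost.values())
    penalized = {
        name: relevance[name] - lam * cost[name] / max_cost for name in relevance
    }
    return sorted(penalized, key=penalized.get, reverse=True)


# Hypothetical relevance scores and extraction times in seconds.
relevance = {"feature_a": 0.90, "feature_b": 0.85, "feature_c": 0.40}
cost_seconds = {"feature_a": 10.0, "feature_b": 1.0, "feature_c": 0.5}

ranking_ignoring_cost = cost_aware_ranking(relevance, cost_seconds, lam=0.0)
ranking_with_cost = cost_aware_ranking(relevance, cost_seconds, lam=0.5)

print("lam = 0.0:", ranking_ignoring_cost)
print("lam = 0.5:", ranking_with_cost)
```

With `lam = 0` the ranking depends on relevance alone; raising `lam` moves the expensive `feature_a` below the almost equally relevant but much cheaper `feature_b`.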
Visualization and interpretability
Typical approach: feature extraction
Loss of interpretability!
A model is only as good as its features, so features play a
preponderant role in model interpretability
Two-fold need for interpretability and transparency in feature selection and
model creation processes:
● More interactive model visualizations to better interact with the model
and visualize future scenarios
● More interactive feature selection process where, using interactive
visualizations, it is possible to iterate through different feature subsets
Visualization and interpretability
Digital Diogenes Syndrome
Organizations need to gather data in a meaningful way
From data-rich/knowledge-poor to data-rich/knowledge-rich
Krause, J., Perer, A., & Bertini, E. (2014). INFUSE: Interactive feature selection for predictive modeling of high dimensional data. IEEE Transactions on Visualization and Computer Graphics, 20(12), 1614-1623.
What is big in Big Data?
New opportunity to develop methods in computationally
constrained platforms!
Take home message
1. If you have never considered applying feature
selection to your problem, give it a try!
2. If you are interested in feature selection, it is
a prolific, open line of research facing the new
challenges that Big Data has brought.
Feature selection for Big Data: advances and challenges by Verónica Bolón-Canedo at Big Data Spain 2017

 
Deep reinforcement learning : Starcraft learning environment by Gema Parreño ...
Deep reinforcement learning : Starcraft learning environment by Gema Parreño ...Deep reinforcement learning : Starcraft learning environment by Gema Parreño ...
Deep reinforcement learning : Starcraft learning environment by Gema Parreño ...
 
End-to-End “Exactly Once” with Heron & Pulsar by Ivan Kelly at Big Data Spain...
End-to-End “Exactly Once” with Heron & Pulsar by Ivan Kelly at Big Data Spain...End-to-End “Exactly Once” with Heron & Pulsar by Ivan Kelly at Big Data Spain...
End-to-End “Exactly Once” with Heron & Pulsar by Ivan Kelly at Big Data Spain...
 

Feature selection for Big Data: advances and challenges by Verónica Bolón-Canedo at Big Data Spain 2017

  • 2. Feature selection for Big Data: advances and challenges. Verónica Bolón-Canedo
  • 4. The more data, the better… right? The curse of dimensionality
  • 5. Feature selection: “Feature selection is the process of selecting the relevant features and discarding the irrelevant and redundant ones.” Note: this is not about feature extraction for dimensionality reduction! PCA, t-SNE, manifold learning? No, they lose the meaning of the original features.
  • 6. What is a relevant feature? Imagine that you are trying to guess the price of a car…
      ● Relevant: engine size, age, mileage, presence of rust, ...
      ● Irrelevant: color of windscreen wipers, stickers on windows, ...
      ● Redundant: age / mileage
  • 7. Why feature selection?
      ● General data reduction: to limit storage requirements and increase algorithm speed
      ● Feature set reduction: to save resources in the next round of data collection
      ● Performance improvement: to gain in predictive accuracy
      ● Data understanding: to gain knowledge about the process that generated the data, or for visualization
  • 8. Feature selection methods
      ● Subset vs Ranker
      ● Filters vs Embedded vs Wrappers
      ● Univariate vs Multivariate
    Sorry… There is no one-size-fits-all method!
  • 9. Feature selection is successful!
  • 10. If you want to know more about feature selection...
  • 13. Big Dimensionality: > 29 million features, > 20 million samples; > 54 million features, > 149 million samples
  • 14. Scalability: In scaling up learning algorithms, the issue is not so much one of speeding up a slow algorithm, as one of turning an impracticable algorithm into a practical one. “Good enough” solutions, as “fast” as possible and as “efficiently” as possible.
  • 15. Scalability: model complexity, univariate vs multivariate, parameter tuning, stability, distributed learning
  • 16. Distributed feature selection
      ● Data is, sometimes, distributed in origin
      ● Privacy issues
      ● Vertical or horizontal distribution?
      ● Overlap between partitions?
      ● How to aggregate partial results?
  • 17. Distributed feature selection. Arrow’s impossibility theorem: “When having at least two rankers (nodes), and at least three options to rank (features), it is impossible to design an aggregation function that satisfies in a strong way a set of desirable conditions at once.”
  • 18. Distributed feature selection: good enough solutions in terms of accuracy. Bolón-Canedo, Verónica, et al. "Exploring the consequences of distributed feature selection in DNA microarray data." In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 1665-1672 (2017).
  • 21. Real-time processing: spam detection, video/image detection, portable devices, CAD systems, etc.
  • 22. Online feature selection
      ● Pre-selecting features: no subsequent online classification
      ● Classifiers are not flexible with respect to input features
      ● Find flexible feature selection methods capable of modifying the selected subset of features as new training samples arrive
      ● Methods that can be executed in a dynamic feature space that is initially empty and adds features as new information arrives
  • 26. Feature cost: a real case. In tear film lipid layer classification, the time (cost) of extracting each feature is not the same, and the total should be minimized.
  • 27. Visualization and interpretability. Typical approach: feature extraction. Loss of interpretability! A model is only as good as its features, so features play a preponderant role in model interpretability. Two-fold need for interpretability and transparency in feature selection and model creation processes:
      ● More interactive model visualizations to better interact with the model and visualize future scenarios
      ● More interactive feature selection process where, using interactive visualizations, it is possible to iterate through different feature subsets
  • 28. Visualization and interpretability. Digital Diogenes Syndrome: organizations need to gather data in a meaningful way. Data-rich/Knowledge-poor, Data-rich/Knowledge-rich. Krause, J., Perer, A., & Bertini, E. (2014). INFUSE: interactive feature selection for predictive modeling of high dimensional data. IEEE Transactions on Visualization and Computer Graphics, 20(12), 1614-1623.
  • 29. What is big in Big Data? New opportunity to develop methods in computationally constrained platforms!
  • 30. Take home message
      1. If you have never considered applying feature selection to your problem, give it a try!
      2. If you are interested in feature selection, it is a prolific open line of research facing the new challenges that Big Data has brought.
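
Slides 6 and 8 distinguish relevant, irrelevant and redundant features, and mention univariate rankers. A minimal sketch of a univariate filter ranker follows, assuming absolute Pearson correlation as the relevance score (the slides do not name a specific measure). The car data, the column names and the helper functions `pearson` and `rank_features` are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def rank_features(columns, target):
    """Univariate filter: score each feature on its own, return names best-first."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Hypothetical toy data: five cars, three candidate features.
cars = {
    "age_years": [1, 3, 5, 8, 10],
    "mileage_k_km": [15, 40, 70, 110, 150],
    "wiper_color_code": [2, 1, 2, 1, 2],
}
price_k_eur = [30, 25, 19, 12, 8]

ranking, scores = rank_features(cars, price_k_eur)

# Redundancy check: two relevant features that carry nearly the same information.
redundancy = abs(pearson(cars["age_years"], cars["mileage_k_km"]))
```

On this toy data the wiper color ranks last, while age and mileage both score high and are strongly correlated with each other. A univariate ranker alone would keep both, which is one reason multivariate methods exist.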
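
Slides 16 and 17 ask how to aggregate partial results from several nodes. One simple option, shown here as an assumption rather than the method used in the talk, is a Borda-style count: each node ranks all features on its own horizontal partition, and the positions are summed. Arrow's theorem implies that no such rule is ideal, so this is only a "good enough" sketch. The function name `borda_aggregate` and the node rankings are hypothetical.

```python
def borda_aggregate(rankings):
    """Combine best-first rankings from several nodes by summing positions (lower is better).

    Assumes every node ranks the same full set of features (horizontal partitioning).
    """
    totals = {}
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            totals[feature] = totals.get(feature, 0) + position
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(totals, key=lambda f: (totals[f], f))

# Hypothetical rankings produced independently on three data partitions.
node_rankings = [
    ["f1", "f2", "f3", "f4"],
    ["f2", "f1", "f3", "f4"],
    ["f1", "f3", "f2", "f4"],
]

global_ranking = borda_aggregate(node_rankings)
top_k = global_ranking[:2]  # keep the two best features overall
```

Only the rankings travel between nodes, not the raw samples, which fits the privacy concern listed on slide 16.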
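
Slide 22 calls for methods that can revise the selected subset as new samples arrive and that work in a feature space that starts empty. The sketch below is one possible reading, assuming a running absolute Pearson correlation per feature as the score. The class `OnlineCorrelationRanker` and the stream are hypothetical, not taken from the talk.

```python
import math

class OnlineCorrelationRanker:
    """Keeps running sums per feature so scores can be updated one sample at a time."""

    def __init__(self, k):
        self.k = k
        # feature -> [n, sum_x, sum_xx, sum_xy, sum_y, sum_yy]
        self.stats = {}

    def update(self, sample, y):
        """sample maps feature name to value; unseen features are added on the fly."""
        for name, x in sample.items():
            s = self.stats.setdefault(name, [0, 0.0, 0.0, 0.0, 0.0, 0.0])
            s[0] += 1
            s[1] += x
            s[2] += x * x
            s[3] += x * y
            s[4] += y
            s[5] += y * y

    def score(self, name):
        n, sx, sxx, sxy, sy, syy = self.stats[name]
        if n < 2:
            return 0.0
        cov = sxy - sx * sy / n
        vx = sxx - sx * sx / n
        vy = syy - sy * sy / n
        if vx <= 1e-12 or vy <= 1e-12:
            return 0.0
        return abs(cov) / math.sqrt(vx * vy)

    def selected(self):
        """Current top-k features, best first (ties broken by name)."""
        return sorted(self.stats, key=lambda f: (-self.score(f), f))[: self.k]

# Hypothetical stream: "signal" only starts being measured at the third sample.
stream = [
    ({"noise": 1.0}, 1.0),
    ({"noise": 0.0}, 2.0),
    ({"noise": 1.0, "signal": 3.0}, 3.0),
    ({"noise": 0.0, "signal": 4.0}, 4.0),
    ({"noise": 1.0, "signal": 5.0}, 5.0),
]

ranker = OnlineCorrelationRanker(k=1)
history = []
for sample, y in stream:
    ranker.update(sample, y)
    history.append(ranker.selected())
```

The selected subset changes during the stream: early on only `noise` is available, and once `signal` has enough observations it replaces it. A downstream classifier would also have to tolerate such changes, which is the flexibility issue the slide raises.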
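
Slide 26 notes that features can differ in extraction cost. A minimal way to account for this, assumed here for illustration and not confirmed as the approach in the talk, is to subtract a weighted cost from each relevance score before ranking. The feature names, scores, costs and the weight `lam` are hypothetical.

```python
def cost_aware_ranking(relevance, cost, lam):
    """Rank features by relevance minus lam times cost (both assumed scaled to [0, 1])."""
    penalized = {f: relevance[f] - lam * cost[f] for f in relevance}
    return sorted(penalized, key=lambda f: (-penalized[f], f))

# Hypothetical relevance scores and normalized extraction times.
relevance = {"texture": 0.90, "color": 0.80, "shape": 0.30}
cost = {"texture": 0.90, "color": 0.10, "shape": 0.05}

ignore_cost = cost_aware_ranking(relevance, cost, lam=0.0)
with_cost = cost_aware_ranking(relevance, cost, lam=0.5)
```

With `lam=0.0` the ranking follows relevance alone. With `lam=0.5` the expensive feature drops below a slightly less relevant but much cheaper one, so `lam` sets how much accuracy one is willing to trade for speed.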