Machine learning workshop
guodong@hulu.com

Machine learning introduction
Logistic regression
Feature selection
Boosting, tree boosting

See more machine learning posts: http://dongguo.me
  	
  
	
  
Outline
• Introduction
• Typical feature selection methods
• Feature selection in logistic regression
• Tips and conclusion
  
What’s/why feature selection
• A procedure in machine learning to find a subset of features that produces a ‘better’ model for a given dataset
  – Avoid overfitting and achieve better generalization ability
  – Reduce the storage requirement and training time
  – Interpretability
  
When feature selection is important
• Noisy data
• Lots of low-frequency features
• Use of multi-type features
• Too many features compared to the number of samples
• Complex model
• Samples in the real scenario are inhomogeneous with the training & test samples
  
When No.(samples)/No.(features) is large
• Feature selection with Gini indexing
• Algorithm: logistic regression
• Training samples: 640k; test samples: 49k
• Features: watch behavior of audiences; show level (11,327 features)
  
[Figure: AUC of L1-LR and L2-LR (roughly 0.80–0.83) versus the ratio of features used, from all features down to 10%]
  
When No.(samples) equals No.(features)
• L1 logistic regression
• Training samples: 50k; test samples: 49k
• Features: watch behavior of audiences; video level (49,166 features)
  
[Figure: how AUC changes with the number of features selected; AUC stays roughly between 0.728 and 0.736 as the ratio of features used drops from all down to 10%]
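Curves like the two above can be approximated with standard tooling. The following is a hedged sketch, not the slides’ actual pipeline: rank features by a precomputed per-feature score (for example a Gini index), keep the top fraction, retrain an L1- or L2-regularized logistic regression with scikit-learn, and record the test AUC. All names, the choice of ratios, and the inputs X_train/y_train/X_test/y_test/scores are assumptions.

```python
# Hedged sketch of an AUC-vs-feature-ratio curve; `scores` holds one relevance
# score per feature (e.g. a Gini index), and X_*, y_* are assumed datasets.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc_vs_feature_ratio(X_train, y_train, X_test, y_test, scores, penalty="l1"):
    order = np.argsort(-np.asarray(scores))          # best-scored features first
    aucs = {}
    for ratio in (1.0, 0.8, 0.6, 0.4, 0.2, 0.1):
        keep = order[: max(1, int(ratio * len(order)))]
        model = LogisticRegression(penalty=penalty, solver="liblinear")
        model.fit(X_train[:, keep], y_train)
        proba = model.predict_proba(X_test[:, keep])[:, 1]
        aucs[ratio] = roc_auc_score(y_test, proba)
    return aucs
```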
  
Typical methods for feature selection
• Categories

              Single feature evaluation                Subset selection
  filter      MI, IG, KL-D, GI, CHI                    Category distance, …
  wrapper     Ranking accuracy using single feature    For LR (SFO, Grafting)

• Single feature evaluation
  – Frequency based, mutual information, KL divergence, Gini indexing, information gain, Chi-square statistic
• Subset selection methods
  – Sequential forward selection
  – Sequential backward selection
  
Single feature evaluation
• Measure the quality of features by various metrics
  – Frequency based
  – Dependence of feature and label (co-occurrence)
    • mutual information, Chi-square statistic
  – Information theory
    • KL divergence, information gain
  – Gini indexing
  
Frequency based
• Remove features according to the frequency of the feature, or the number of instances containing the feature (see the sketch below)
• Typical scenario
  – Text mining
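A minimal sketch of the idea, assuming each instance is represented as a set of feature ids; the threshold value is arbitrary:

```python
# Keep only features whose document frequency (number of instances containing
# the feature) reaches a minimum threshold.
from collections import Counter

def frequent_features(instances, min_df=5):
    df = Counter(f for inst in instances for f in inst)
    return {f for f, count in df.items() if count >= min_df}

# Usage: kept = frequent_features(train_instances, min_df=5)
# (train_instances is assumed data; filter each instance down to `kept`.)
```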
  
Mutual information
• Measure the dependence of two random variables
• Definition (a code sketch follows below)
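The definition on the original slide was an image; as a reminder, for discrete variables it is I(X;Y) = Σ_{x,y} p(x,y) ln [ p(x,y) / (p(x) p(y)) ]. A minimal sketch of estimating it from feature/label pairs:

```python
import math
from collections import Counter

def mutual_information(feature_values, labels):
    """I(X;Y) = sum over (x, y) of p(x,y) * ln( p(x,y) / (p(x) * p(y)) )."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    px, py = Counter(feature_values), Counter(labels)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

# Toy example: a binary feature that mostly co-occurs with the positive label.
print(mutual_information([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```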
  
Chi-Square Statistic
• Measure the dependence of two variables (a code sketch follows below)
  – A: number of times feature t and category c co-occur
  – B: number of times t occurs without c
  – C: number of times c occurs without t
  – D: number of times neither c nor t occurs
  – N: total number of instances
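The formula on the original slide was an image; with the A/B/C/D/N definitions above, the standard chi-square statistic used in text categorization is χ²(t, c) = N(AD − CB)² / [(A + C)(B + D)(A + B)(C + D)]. A minimal sketch:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic from the four co-occurrence counts; N = A+B+C+D."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# Toy counts: feature t appears in 40 of 50 positive docs and 10 of 950 negative docs.
print(chi_square(A=40, B=10, C=10, D=940))
```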
  
Entropy
• Characterize the (im)purity of a collection of examples

  $\mathrm{Entropy}(S) = -\sum_i P_i \ln P_i$
  
Information Gain
• Reduction in entropy caused by partitioning the examples according to the attribute (a code sketch follows below)
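A minimal sketch combining the two slides above: entropy of a label collection, and the information gain of a discrete attribute, IG(S, A) = Entropy(S) − Σ_v (|S_v|/|S|) · Entropy(S_v):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i P_i * ln(P_i)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Reduction in entropy from partitioning the examples by the attribute."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(attribute_values):
        subset = [y for x, y in zip(attribute_values, labels) if x == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy example: the attribute separates the labels almost perfectly.
print(information_gain(["a", "a", "a", "b", "b", "b"], [1, 1, 1, 0, 0, 1]))
```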
  
KL divergence
• Measure the difference between two probability distributions (a code sketch follows below)
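A minimal sketch, assuming two discrete distributions given as lists of probabilities over the same support; note that D(P‖Q) is asymmetric and never negative (Gibbs’ inequality):

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_i P_i * ln(P_i / Q_i), with 0 * ln(0/q) treated as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))   # != kl_divergence(q, p)
```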
  
Gini indexing
• Calculate the conditional probability of f given the class label
• Normalize across all classes
• Calculate the Gini coefficient (a hedged sketch of these steps follows below)
• For the two-category case
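A heavily hedged sketch of the three steps listed above: estimate P(f | c) for each class, normalize those probabilities across classes, and sum their squares as the Gini score. The exact formulation in the referenced JMLR’03 paper may differ, and the counts used here are hypothetical.

```python
def gini_index(count_f_in_class, count_class):
    """count_f_in_class[c]: instances of class c containing feature f;
    count_class[c]: total instances of class c."""
    p_f_given_c = {c: count_f_in_class.get(c, 0) / count_class[c] for c in count_class}
    total = sum(p_f_given_c.values())
    if total == 0:
        return 0.0
    return sum((p / total) ** 2 for p in p_f_given_c.values())

# Two-category case: a feature concentrated in one class scores close to 1.
print(gini_index({"pos": 45, "neg": 2}, {"pos": 50, "neg": 950}))
```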
  
Comparison in text categorization (1)
• A comparative study on feature selection in text categorization (ICML’97)
  
Comparison in text categorization (2)
• Feature selection for text classification based on Gini Coefficient of Inequality (JMLR’03)
  
Shortcomings of single feature evaluation
• Relevance between features is ignored
  – Features could be redundant
  – A feature that is completely useless by itself can provide a significant performance improvement when taken with others
  – Two features that are useless by themselves can be useful together
  
Shortcomings of single feature evaluation (2)
• A feature that is completely useless by itself can provide a significant performance improvement when taken with others
  
Shortcomings of single feature evaluation (3)
• Two features that are useless by themselves can be useful together
  
Subset selection methods
• Select subsets of features that together have good predictive power, as opposed to ranking features individually
• Always proceed by adding new features to the existing set or removing features from it (a sketch of sequential forward selection follows below)
  – Sequential forward selection
  – Sequential backward selection
• Evaluation
  – Category distance measurement
  – Classification error
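A minimal sketch of sequential forward selection with a generic evaluator (for example cross-validated accuracy or a category-distance measure); the names `evaluate` and `k` are assumptions of this sketch. Sequential backward selection is the mirror image: start from the full set and greedily remove the least useful feature at each step.

```python
def sequential_forward_selection(candidates, evaluate, k):
    """Greedily grow a feature subset; evaluate(subset) returns a score to maximize."""
    selected, remaining = [], set(candidates)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```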
  
Category distance measurement
• Select the feature subset with a large category distance
  
Wrapper methods for logistic regression
• Forward feature selection
  – Naïve method
    • needs to build a number of models quadratic in the number of features
  – Grafting
  – Single feature optimization (SFO)
  
SFO (Singh et al., 2009)
• Only optimize the coefficient of the new feature
• Only need to iterate over instances that contain the new feature
• Also fully relearn a new model with the selected feature included (a hedged sketch follows below)
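A hedged sketch of the single-feature-optimization idea for logistic regression: hold the current coefficients fixed, fit only the candidate feature’s coefficient with a few one-dimensional Newton steps, and iterate only over instances that contain that feature. The variable names, the Newton update, and the returned log-likelihood gain are this sketch’s assumptions, not the paper’s code.

```python
import numpy as np

def sfo_gain(X, y, w, j, steps=10):
    """Approximate log-likelihood gain from adding feature j to fixed weights w."""
    rows = np.nonzero(X[:, j])[0]          # only instances containing feature j
    base = X[rows] @ w                     # fixed contribution of the current model
    xj, yj, wj = X[rows, j], y[rows], 0.0
    for _ in range(steps):                 # 1-D Newton ascent on the log-likelihood
        p = 1.0 / (1.0 + np.exp(-(base + wj * xj)))
        grad = xj @ (yj - p)
        hess = (xj * xj) @ (p * (1.0 - p))
        wj += grad / (hess + 1e-12)
    loglik = lambda p: yj @ np.log(p + 1e-12) + (1 - yj) @ np.log(1 - p + 1e-12)
    p_old = 1.0 / (1.0 + np.exp(-base))
    p_new = 1.0 / (1.0 + np.exp(-(base + wj * xj)))
    return loglik(p_new) - loglik(p_old), wj
```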
  
Grafting (Perkins, 2003)
• Use the loss function’s gradient with respect to the new feature to decide whether to add the feature (a hedged sketch follows below)
• At each step, the feature with the largest gradient is added
• The model is fully relearned after each feature is added
  – Need to build only D models overall
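A hedged sketch of the grafting test: with the current weights, the gradient of the negative log-likelihood with respect to an unused feature’s (still zero) weight indicates how much that feature could help; the feature with the largest gradient magnitude is the candidate, and under an L1 penalty λ it is only worth adding if that magnitude exceeds λ. The names and the exact acceptance criterion are assumptions of this sketch.

```python
import numpy as np

def grafting_candidate(X, y, w, unused, lam=0.0):
    """Pick the unused feature with the largest loss gradient at weight zero."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))            # current model predictions
    grads = X[:, unused].T @ (p - y)              # d(neg log-lik)/d w_j at w_j = 0
    j = int(np.argmax(np.abs(grads)))
    if np.abs(grads[j]) <= lam:                   # not worth paying the L1 penalty
        return None
    return unused[j]
```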
  
Experimentation
• Percent improvement of log-likelihood on the test set
• Both SFO and Grafting are easily parallelized
  
Summarization
• Categories

              Single feature evaluation                Subset selection
  filter      MI, IG, KL-D, GI, CHI                    Category distance, …
  wrapper     Ranking accuracy using single feature    For LR (SFO, Grafting)

• Filter + single feature evaluation
  – Less time consuming, usually works well
• Wrapper + subset selection
  – Higher accuracy, but prone to overfitting
  	
  
Tips about feature selection
• Remove features that cannot occur in the real scenario
• If a feature contributes nothing, the fewer features the better
• Use L1 regularization for logistic regression (a sketch follows below)
• Use the random subspace method
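A minimal sketch of the L1-regularization tip with scikit-learn: the L1 penalty drives many coefficients to exactly zero, so feature selection happens as part of training (smaller C means stronger regularization). The synthetic dataset is only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=100, n_informative=10, random_state=0)
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("features with non-zero weight:", np.count_nonzero(model.coef_))
```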
  
References
• Feature selection for Classification (IDA’97)
• An Introduction to Variable and Feature Selection (JMLR’03)
• Feature selection for text classification based on Gini Coefficient of Inequality (JMLR’03)
• A comparative study on feature selection in text categorization (ICML’97)
• Scaling Up Machine Learning
  


Editor's notes

  1. Why samples of different categories can be separated: if they are separated well, the classification error is smaller; different features make different contributions.
  2. Noisy data. Lots of low-frequency features: using ad-id as a feature easily overfits. Multi-type features. Too many features compared to samples: feature number > sample number; feature combinations. Complex model: ANN. Samples to be predicted are inhomogeneous with the training & test samples: demographic targeting; time-series related.
  3. Key points: “how to measure the quality of features” and “whether and how to use the underlying algorithms”. 1. The optimal feature set could only be selected by an exhaustive method; 2. In all existing feature selection methods, the feature set is generated by adding features to or removing features from the set of the last step.
  4. Decision tree
  5. KL divergence is not a true distance metric because it is not symmetric; it cannot be negative (Gibbs’ inequality); used in topic models.
  6. Features could be redundant: videoId, contentId
  7. With 1000 features, at a cost of 1 second to build one model on average, it would cost about 1 week.
  8. Both