Aggressive feature selection for text categorization
E Gabrilovich and S Markovitch, “Text categorization with many
redundant features: Using aggressive feature selection to make
SVMs competitive with C4.5,” 21st International Conference on
Machine Learning, ACM, 2004.
Presented by Hershel Safer
in Machine Learning :: Reading Group Meetup
on 12/2/14
Aggressive feature selection for text categorization – Hershel Safer, 12 February 2014
Results
Key result: They introduce a measure of the redundancy of
words in a collection of documents that predicts if feature
selection will improve categorization of the documents.
Also:
A method to generate labeled datasets for testing text-categorization algorithms (previous work)
A platform for testing text-categorization algorithms
Background: Text categorization
Text categorization: Given a set of natural-language documents
and a set of labels, assign one or more labels to each document.
Most algorithms treat a document as a collection of words, with
each word as a feature; so even modest collections have
thousands or tens of thousands of features.
For such high-dimensional problems, feature selection is often
used to reduce noise and avoid overfitting.
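To make the "each word is a feature" representation concrete, here is a minimal sketch of a bag-of-words encoding. The tokenizer (lowercase, letters only) and the names `tokenize` and `bag_of_words` are illustrative assumptions, not something taken from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors), with one word-count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # Every distinct word in the collection becomes one feature.
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


# Toy documents, made up for illustration.
docs = ["The cat sat on the mat.", "The dog chased the cat."]
vocab, vecs = bag_of_words(docs)
print(vocab)
print(vecs)
```

Even these two short sentences yield seven features, which suggests how quickly the feature count grows for a real collection.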
Background: Feature selection
Use various methods to measure how well specific words
discriminate between categories: information gain (IG),
chi-squared, bi-normal separation, document frequency, etc.
Feature selection: Choose the most informative features using a
score cutoff or a fixed percentage of the top-scoring features.
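As a rough sketch of scoring and selection, the code below computes information gain for the binary feature "word occurs in the document" and keeps a fixed fraction of the top-scoring words. Documents are modeled as sets of words; the function names and the toy data are assumptions made for illustration, and the paper's own setup may differ in detail.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, word):
    """IG of the binary feature 'word occurs in the document'."""
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = ((len(present) / n) * entropy(present)
                   + (len(absent) / n) * entropy(absent))
    return entropy(labels) - conditional


def select_top_fraction(docs, labels, fraction):
    """Keep the top-scoring `fraction` of the vocabulary (at least one word)."""
    vocabulary = sorted(set().union(*docs))
    scores = {w: information_gain(docs, labels, w) for w in vocabulary}
    keep = max(1, round(fraction * len(vocabulary)))
    ranked = sorted(vocabulary, key=lambda w: scores[w], reverse=True)
    return ranked[:keep], scores


# Toy collection, made up for illustration: each document is a set of words.
docs = [
    {"goal", "match", "team"},
    {"team", "score", "goal"},
    {"vote", "party", "team"},
    {"party", "election", "vote"},
]
labels = ["sport", "sport", "politics", "politics"]
selected, scores = select_top_fraction(docs, labels, fraction=0.4)
print(selected)
```

Aggressive selection in the sense used on these slides would correspond to a `fraction` of about 0.01 on a realistic vocabulary.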
Previous work on standard document collections found that
even words with low discriminative power improved
classification.
Question asked by this work: When does aggressive feature
selection (using ~1% of the words in the collection) improve text
categorization?
The data
The data consist of 100 datasets created from Web directories,
each containing documents from 2 categories.
The categorization difficulty ranges from very easy to very hard.
Baseline accuracy of categorization using SVM is fairly uniformly
distributed between 0.6 and 0.92.
Distribution of IG and effect of feature selection
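The key is not the level of the IG values but rather how quickly they decrease.
For dataset D and feature set F, the Outlier Count (OC) is the number of features
whose IG is at least 3 standard deviations above the mean IG:
$\mathrm{OC}(D, F) = |\{ f \in F : \mathrm{IG}(f) \ge \mu_{\mathrm{IG}} + 3\,\sigma_{\mathrm{IG}} \}|$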
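The Outlier Count described above can be sketched in a few lines. This version assumes the IG scores are already computed and uses the population standard deviation; whether the paper uses the population or the sample form is not stated on the slide, so treat that choice, the function name `outlier_count`, and the toy scores as assumptions.

```python
import statistics


def outlier_count(ig_values, k=3.0):
    """Count features whose IG is at least k standard deviations above the mean."""
    mean = statistics.fmean(ig_values)
    sd = statistics.pstdev(ig_values)  # population SD (an assumption)
    if sd == 0:
        # All scores are identical, so no feature stands out.
        return 0
    threshold = mean + k * sd
    return sum(1 for v in ig_values if v >= threshold)


# Toy scores, made up for illustration: 99 weak features and one strong one.
ig_scores = [0.01] * 99 + [0.9]
print(outlier_count(ig_scores))
```

A small count means that only a few words carry far more information than the rest.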
Effect of Outlier Count on SVM accuracy
OC has a strong negative correlation with the improvement in
SVM accuracy that results from aggressive feature selection.
Studies that found no benefit from aggressive feature selection
used datasets with very large OC.
Choosing classifier and feature-selection methods
Using feature selection may affect the choice of classifier.
Different feature-selection methods give different results.
The authors report that information gain, chi-squared, and bi-normal
separation perform best.
