A Comparative Study on Feature
Selection in Text Categorization
     Presented by Hector Franco
                TCD
Objective
• Reduce the number of dimensions (terms). Some
  classification methods cannot cope with a very
  high-dimensional feature space.
Statistical classification methods
1.   Regression models
2.   kNN
3.   Bayes
4.   Decision trees
5.   Neural networks
6.   Symbolic rule learning
7.   Inductive learning algorithms
Feature selection methods:
•   DF: Document frequency thresholding
•   IG: Information gain
•   MI: Mutual information
•   CHI: χ² statistic
•   TS: Term strength
DF: Document frequency thresholding

• DF is the number of documents in which a term occurs.
• Terms whose DF falls below a threshold are removed, which eliminates rare terms.
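
A minimal sketch of document frequency thresholding, assuming each document is already tokenized into a list of terms. The function names and the `min_df` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df

def df_threshold(docs, min_df):
    """Return the terms whose document frequency is at least min_df."""
    df = document_frequency(docs)
    return {term for term, count in df.items() if count >= min_df}

# Toy corpus (made up for this example)
docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "oil"],
    ["wheat", "price"],
]
print(sorted(df_threshold(docs, min_df=2)))  # ['oil', 'price']
```

Note that "oil" occurs twice in the second document but is counted only once there, because DF counts documents rather than occurrences.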
Information gain
• Of the term t:




• Time: O(N); space: O(VN)
• N = number of documents, V = vocabulary size
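
Information gain is commonly defined as G(t) = -Σ P(c) log P(c) + P(t) Σ P(c|t) log P(c|t) + P(¬t) Σ P(c|¬t) log P(c|¬t), where the sums run over all categories c. The sketch below follows that standard definition. It assumes one category label per document, a boolean presence flag per document, and base-2 logarithms; the function and variable names are hypothetical.

```python
import math

def information_gain(term_present, labels):
    """Information gain of a term, in bits.

    term_present: list of bool, one per document (True if the term occurs).
    labels: list of category labels, one per document.
    """
    n = len(labels)
    categories = set(labels)

    def sum_p_log_p(subset):
        # Sum over categories of P(c) * log2 P(c) within `subset`; 0 if empty.
        if not subset:
            return 0.0
        total = 0.0
        for c in categories:
            p = sum(1 for y in subset if y == c) / len(subset)
            if p > 0:
                total += p * math.log2(p)
        return total

    with_t = [y for y, present in zip(labels, term_present) if present]
    without_t = [y for y, present in zip(labels, term_present) if not present]
    p_t = len(with_t) / n

    return (-sum_p_log_p(labels)
            + p_t * sum_p_log_p(with_t)
            + (1 - p_t) * sum_p_log_p(without_t))

labels = ["a", "a", "b", "b"]
print(information_gain([True, True, False, False], labels))  # 1.0
print(information_gain([True, False, True, False], labels))  # 0.0
```

The first term separates the two categories perfectly and gains the full 1 bit of category entropy; the second is spread evenly over both categories and gains nothing.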
Mutual information



• If t and c are independent, the value is 0.
• O(VN)
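
Mutual information between a term t and a category c is usually defined as I(t, c) = log( P(t ∧ c) / (P(t) P(c)) ) and estimated from document counts as log( A·N / ((A + C)(A + B)) ). The sketch below assumes that estimate, with A = documents containing t in category c, B = documents containing t outside c, C = documents in c without t, and N = total documents. The natural logarithm and the handling of A = 0 are assumptions made for this example.

```python
import math

def mutual_information(a, b, c, n):
    """Estimate I(t, c) = log( A*N / ((A+C) * (A+B)) ).

    a: documents that contain t and belong to c
    b: documents that contain t but do not belong to c
    c: documents that belong to c but do not contain t
    n: total number of documents
    """
    if a == 0:
        return float("-inf")  # assumption: t and c never co-occur
    return math.log(a * n / ((a + c) * (a + b)))

# Independent case: P(t) = 0.2, P(c) = 0.5, P(t and c) = 0.1
print(mutual_information(a=10, b=10, c=40, n=100))  # 0.0
# Positive association: every document containing t belongs to c
print(mutual_information(a=20, b=0, c=30, n=100) > 0)  # True
```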
χ² statistic (CHI)
• Measures the lack of independence between t
  and c.
• A: t and c co-occur;       B: t occurs without c
• C: c occurs without t;     D: neither t nor c occurs
• N: total number of documents

• If t and c are independent, the value is 0.
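
With the counts A, B, C, D and N = A + B + C + D, the χ² statistic for a 2×2 contingency table is χ²(t, c) = N (AD − CB)² / ((A + C)(B + D)(A + B)(C + D)). The following is a small sketch of that formula; the function name is hypothetical, and returning 0 when a row or column of the table is empty is an assumption made here to avoid division by zero.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of term t and category c from a 2x2 table.

    a: documents with t, in c        b: documents with t, not in c
    c: documents without t, in c     d: documents without t, not in c
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # assumption: treat a degenerate table as "no evidence"
    return n * (a * d - c * b) ** 2 / denom

# Independent: t occurs in 20% of documents both inside and outside c
print(chi_square(a=10, b=10, c=40, d=40))  # 0.0
# Perfect association: t occurs in exactly the documents of c
print(chi_square(a=50, b=0, c=0, d=50))    # 100.0
```

In the perfectly associated case the statistic equals N, its maximum for a 2×2 table.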
χ² statistic (CHI), continued
• O(VN)
TS: Term strength

• Based on document clustering.
• Estimates how likely a term is to appear in
  closely related documents.
• O(N^2)
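
Term strength is usually written as s(t) = P(t ∈ y | t ∈ x), where x and y are any pair of closely related documents. The sketch below is one possible reading of that definition: it treats two documents as related when their cosine similarity reaches a threshold. Using raw term counts as the document vectors, the threshold value, and all function names are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import permutations

def cosine(x, y):
    """Cosine similarity between two token lists, using raw term counts."""
    cx, cy = Counter(x), Counter(y)
    dot = sum(cx[t] * cy[t] for t in cx)
    norm = (math.sqrt(sum(v * v for v in cx.values()))
            * math.sqrt(sum(v * v for v in cy.values())))
    return dot / norm if norm else 0.0

def term_strength(docs, threshold):
    """Estimate s(t) = P(t in y | t in x) over ordered related pairs (x, y)."""
    in_x = Counter()     # related pairs where t occurs in the first document
    in_both = Counter()  # related pairs where t occurs in both documents
    for x, y in permutations(docs, 2):  # all ordered pairs: O(N^2)
        if cosine(x, y) < threshold:
            continue
        sx, sy = set(x), set(y)
        in_x.update(sx)
        in_both.update(sx & sy)
    return {t: in_both[t] / in_x[t] for t in in_x}

# Toy corpus (made up for this example)
docs = [
    ["oil", "price", "opec"],
    ["oil", "price", "barrel"],
    ["wheat", "harvest"],
]
ts = term_strength(docs, threshold=0.5)
print(ts["oil"], ts["opec"])  # 1.0 0.0
```

The loop over all document pairs is what gives the quadratic cost in N. Terms that never occur in a related pair, such as "wheat" here, receive no score in this sketch.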
EXPERIMENTS
• Classifiers
   – kNN
   – LLSF
• Corpora:
   – Reuters-22173
   – OHSUMED
• Use of SMART system for unified
  preprocessing.
Reduction in the number of words
Best performance at a vocabulary size of 2000
Best methods: IG (more reduction) and CHI
Most aggressive in term removal
Creative Commons license


You are free:
• to copy, distribute, display, and perform the work
• to make derivative works

Under the following conditions:
• Attribution. You must give the original author credit.

• Non-Commercial. You may not use this work for commercial purposes.
• For any reuse or distribution, you must make clear to others the licence terms of this work.
• Any of these conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.
