web page classification
with naïve bayes classifiers

nabeelah ali
27 november 2013
outline
• what is web page classification
• motivation
• literature review
• project design
• experiments
• evaluation
description &
motivation
what is classification?
web page classification
web page classification can
be seen as a type of
document classification
documents vs web pages
• web pages have structure
• HTML indicates headings, paragraphs,
meta-information

• web pages are interconnected
• they contain hyperlinks to other pages
• they have locations (URLs)
why?
web directories
why?
improving search results
why?
• user profile mining
• information filtering
• creation of domain-specific search engines
literature
review
bag of words
text is represented as an unordered
collection of words: word order is discarded,
but word counts can be kept
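A minimal sketch of a bag-of-words representation, assuming a simple lowercase alphanumeric tokeniser (the slides do not say how tokens were extracted). The function name `bag_of_words` is hypothetical.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

bow = bag_of_words("Course outline: the course covers machine learning")
print(bow["course"])  # 2

# Word order does not matter: both strings give the same bag.
print(bag_of_words("New York") == bag_of_words("york new"))  # True
```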
n-gram representation
• the document is represented by a vector of
n-gram features (sequences of n consecutive words)

• concepts expressed by phrases can be
captured (e.g. “New York” vs “new” and
“york”)
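A short sketch of word n-gram extraction, assuming the same kind of simple tokeniser and n = 2 by default. The name `word_ngrams` is hypothetical and not taken from the slides.

```python
import re
from collections import Counter

def word_ngrams(text, n=2):
    """Count every run of n consecutive word tokens in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

grams = word_ngrams("I moved to New York")
print(sorted(grams))  # ['i moved', 'moved to', 'new york', 'to new']
```

Here "new york" survives as a single feature, which a bag of single words would split into "new" and "york".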
using html structure
• assign a weight to each HTML element, and
make each feature a weighted linear combination
of the term's frequencies in those elements

• e.g. headings would have a greater weight

• four main elements are considered: title,
headings, metadata and main text

Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and
metadata in automated subject classification." Research and Advanced Technology
for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
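The weighted combination can be sketched as below. The weight values and the name `combine` are assumptions for illustration only; they are not the values reported by Golub and Ardö.

```python
from collections import Counter

# Hypothetical weights for the four elements named on the slide.
WEIGHTS = {"title": 4.0, "headings": 2.0, "metadata": 2.0, "text": 1.0}

def combine(element_counts, weights=WEIGHTS):
    """Return each term's score as a weighted sum of its per-element counts."""
    combined = Counter()
    for element, counts in element_counts.items():
        w = weights.get(element, 0.0)
        for term, count in counts.items():
            combined[term] += w * count
    return dict(combined)

# Term counts per element for one made-up page.
page = {
    "title": {"course": 1},
    "headings": {"course": 1, "syllabus": 1},
    "metadata": {},
    "text": {"course": 3, "exam": 2},
}
features = combine(page)
print(features["course"])  # 4*1 + 2*1 + 1*3 = 9.0
```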
visual analysis
• the visual representation rendered by the
web browser is important

• each web page is visualised as an adjacency
multigraph, with each section representing
a different kind of content

Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel
approach for a Web page classification." Proceedings of
SAWM04 workshop, ECML2004. 2004.
URL features
• pages do not need to be fetched or
analysed

• fast!
• derives tokens from the URL and uses
these tokens as features

Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification
using URL features." Proceedings of the 14th ACM international
conference on Information and knowledge management. ACM, 2005.
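A simplified sketch of deriving tokens from a URL without fetching the page. It only splits the host and path on non-alphanumeric characters; the published method does more than this, and both the function name `url_tokens` and the example URL are made up.

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    raw = parsed.netloc + " " + parsed.path
    return [t for t in re.split(r"[^a-z0-9]+", raw.lower()) if t]

print(url_tokens("http://www.cs.example.edu/courses/cs101/index.html"))
# ['www', 'cs', 'example', 'edu', 'courses', 'cs101', 'index', 'html']
```

Tokens such as "courses" can then be used as features in the same way as words from the page text.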
web page classification
project design
dataset
• 4 universities dataset (cornell, texas,
washington, wisconsin)

• each page must be classified into a
category: course, department, faculty,
project, staff, student, other
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
document classification
single label classification: one and only one
class label is assigned to each instance
hard classification: an instance can either be
or not be in a particular class, with no
intermediate state
multi-class classification: instances are
divided among more than two categories
details of the dataset
experiment #1
bag of words
use the words, unweighted, as features
[figure: example word features: assistant, CS, Dr, intern, 220, admission, Professor, room, research]
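A sketch of a naïve Bayes classifier over unweighted word features. The slides do not say which variant was used, so the multinomial model with add-one (Laplace) smoothing is an assumption, and the class name and toy pages are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (assumed variant)."""

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.n_docs = len(labels)
        return self

    def predict(self, tokens):
        best_label, best_score = None, -math.inf
        for label, n in self.class_docs.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / self.n_docs)  # log prior
            for w in tokens:
                if w not in self.vocab:
                    continue  # ignore words never seen in training
                p = (self.word_counts[label][w] + 1) / (total + len(self.vocab))
                score += math.log(p)  # log likelihood of the word
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training pages, already tokenised.
docs = [
    ["course", "syllabus", "exam"],
    ["course", "homework"],
    ["professor", "research", "publications"],
    ["professor", "office", "research"],
]
labels = ["course", "course", "faculty", "faculty"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["research", "professor"]))  # faculty
```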
experiment #2

HTML tag weighting

use words weighted by the HTML tags (e.g.
words in <h1> tags will be weighted more
heavily than those in <p> tags)
[figure: example word features: assistant, CS, Dr, intern, 220, admission, Professor, room, research]
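One way to obtain tag-weighted word counts, sketched with Python's standard `html.parser`. The weights, the default weight for text outside the listed tags, and all names are assumptions; the slides only state that <h1> words weigh more than <p> words.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; only "h1 > p" comes from the slide.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "h2": 2.0, "p": 1.0}
DEFAULT_WEIGHT = 1.0  # assumed weight for text outside the listed tags

class TagWeightedCounter(HTMLParser):
    """Accumulate word counts, scaled by the innermost weighted open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop up to and including the matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        weight = next(
            (TAG_WEIGHTS[t] for t in reversed(self.open_tags) if t in TAG_WEIGHTS),
            DEFAULT_WEIGHT,
        )
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.counts[word] += weight

def weighted_counts(html):
    parser = TagWeightedCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)

html = (
    "<html><head><title>CS Course</title></head><body>"
    "<h1>Course outline</h1><p>This course covers exams.</p>"
    "</body></html>"
)
counts = weighted_counts(html)
print(counts["course"])  # 4.0 (title) + 3.0 (h1) + 1.0 (p) = 8.0
```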
experiment #3
n-gram
use phrases instead of single words as features
[figure: example phrase features: research assistant, contact information, program description, course outline]
evaluation

k-fold cross validation

From http://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/
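A minimal sketch of how k-fold cross validation splits the data, assuming the samples are already in random order (no shuffling is done here). The function name `k_fold_indices` is hypothetical, and the value of k used in the project is not given on the slide.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k folds of near-equal size."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(10, 3):
    print(len(train), test)
# 6 [0, 1, 2, 3]
# 7 [4, 5, 6]
# 7 [7, 8, 9]
```

Each sample is used for testing exactly once, and the k scores are then averaged.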
evaluation
confusion matrix

http://en.wikipedia.org/wiki/Confusion_matrix
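A small sketch of building a confusion matrix for a multi-class problem, with rows as actual classes and columns as predicted classes. The labels and predictions below are made up and are not results from the project.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

labels = ["course", "faculty", "student"]
actual = ["course", "course", "faculty", "student", "student"]
predicted = ["course", "faculty", "faculty", "student", "course"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
# course [1, 1, 0]
# faculty [0, 1, 0]
# student [1, 0, 1]

# Accuracy is the diagonal (correct predictions) over all samples.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
print(accuracy)  # 0.6
```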
bibliography
B. Choi and Z. Yao: Web Page Classification, StudFuzz 180, 221–274 (2005)
Qi, Xiaoguang, and Brian D. Davison. "Web page classification: Features and
algorithms." ACM Computing Surveys (CSUR) 41.2 (2009): 12.
Golub, Koraljka, and Anders Ardö. "Importance of HTML
structural elements and metadata in automated subject classification." Research and
Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL
features." Proceedings of the 14th ACM international conference on Information
and knowledge management. ACM, 2005.
Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web
page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.
questions?
