Web Page Classification: Features and Algorithms. Xiaoguang Qi and Brian D. Davison, Department of Computer Science & Engineering, Lehigh University, June 2007. Presented by Mr. Pachara Chutisawaeng, Department of Computer Science, Mahidol University, July 2009.
Agenda: Webpage classification significance; Introduction; Background; Applications of web classification; Features; Algorithms; Blog classification; Conclusion.
Webpage classification significance
Webpage classification significance: Let’s go back about 10 years in history. The evolution of websites: how five popular websites have changed.
Apple - present
Apple – 10 Years ago!
Amazon - present
Amazon – 9 Years ago
CNN - present
CNN – 8 Years ago
Yahoo! - present
Yahoo! – 12 Years ago
Webpage classification significance: What is different between past and present? What changed?
Nike - present
Nike – 8 Years ago
Webpage classification significance: What is different between past and present? What changed? Flash animation, JavaScript, video clips and embedded objects, and advertising (Google AdSense, Yahoo!).
Introduction
Introduction: Webpage classification, or webpage categorization, is the process of assigning a webpage to one or more category labels, e.g. “News”, “Sport”, “Business”. Goal: the authors survey existing web classification techniques to find new areas for research, including web-specific features and algorithms that have been found to be useful for webpage classification.
Introduction: What will you learn? A detailed review of useful features for web classification, the algorithms used, and future research directions. Webpage classification can help improve the quality of web search. Knowing this can help you improve your SEO skills. Each search engine keeps its techniques secret.
Background
Background: The general problem of webpage classification can be divided into subject classification, which concerns the subject or topic of a webpage (e.g. “Adult”, “Sport”, “Business”), and function classification, which concerns the role that the webpage plays (e.g. “Personal homepage”, “Course page”, “Admission page”).
Background: Based on the number of classes in the problem, webpage classification can be divided into binary classification and multi-class classification. Based on the number of classes that can be assigned to an instance, classification can be divided into single-label classification and multi-label classification.
Types of classification
Applications of web classification
Applications of web classification: Constructing and expanding web directories (web hierarchies), such as Yahoo! and the ODP, or “Open Directory Project” (http://www.dmoz.org). How are they doing it?
Keyworder
Applications of web classification: How are they doing it? By human effort. In July 2006, it was reported that there were 73,354 editors in the dmoz ODP. As the web changes and continues to grow, “automatic creation of classifiers from web corpora based on user-defined hierarchies” was introduced by Huang et al. in 2004. The starting point of this presentation!
Applications of web classification Improving quality of search results Categories view Ranking view
Categories and Ranking View
Applications of web classification: Improving quality of search results (categories view and ranking view). In 1998, Page and Brin developed the link-based ranking algorithm called PageRank, which computes rankings from hyperlinks without considering the topic of each page.
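To make the link-based idea concrete, here is a minimal sketch of PageRank by power iteration. It is an illustration only, not the production algorithm of any search engine; the damping factor of 0.85, the iteration count, and the toy graph `toy_web` are assumptions that do not come from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    damping=0.85 and iterations=50 are assumed, conventional values.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on evenly through its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web; note that no topic information is used.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

Page "c" collects links from three pages and ends up ranked highest, regardless of what any of the pages is about. This is the gap that topic-aware ranking tries to fill.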
Google – 11 Years ago
Applications of web classification: Helping question answering systems. Yang and Chua (2004) suggest finding answers to list questions, e.g. “name all the countries in Europe”. How did it work? Queries are formulated and sent to search engines. The results are classified into four categories: collection pages (containing lists of items), topic pages (representing an answer instance), relevant pages (supporting an answer instance), and irrelevant pages. After that, topic pages are clustered, and answers are extracted from the clusters. Question answering systems could benefit from web classification in both accuracy and efficiency.
Applications of web classification Other applications Web content filtering Assisted web browsing Knowledge base construction
Features
Features: In this section, we review the types of features that are useful in webpage classification research. The most important characteristic that makes webpage classification different from plain-text classification is the hyperlink (<a>…</a>). We classify features into on-page features, which are located directly on the page, and neighbor features, which are found on pages related to the page to be classified.
Features: On-page
Features: On-page. Textual content and tags. N-gram features: imagine two different documents, one containing the phrase “New York” and the other containing the terms “New” and “York” separately; a 2-gram feature tells them apart. Yahoo! used 5-gram features. HTML tags or DOM: title, headings, metadata, and main text are each assigned an arbitrary weight. Nowadays most websites use nested lists (<ul><li>), which really helps in webpage classification.
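The “New York” example can be sketched in a few lines. The helper name `word_ngrams` and the two sample sentences are made up for illustration; tokenizing on whitespace is a simplifying assumption.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams in text (lowercased, whitespace tokens)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


doc_a = "I love New York"         # contains the phrase "New York"
doc_b = "A new bridge near York"  # contains "new" and "York" separately

unigrams_a, unigrams_b = word_ngrams(doc_a, 1), word_ngrams(doc_b, 1)
bigrams_a, bigrams_b = word_ngrams(doc_a, 2), word_ngrams(doc_b, 2)
```

With unigrams alone, both documents contain "new" and "york" and look alike on those terms. Only the 2-gram features show that `doc_a` contains "new york" and `doc_b` does not.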
Features: On-page. Textual content and tags. URL: Kan and Thi (2004) demonstrated that a webpage can be classified based on its URL.
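The slide does not say how URL features are built, so the following is only one plausible first step, assumed for illustration: split the host and path of a URL into lowercase tokens that a text classifier could then consume. The function name `url_tokens` and the example path are hypothetical.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split a URL's host and path into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc} {parsed.path}"
    return [t for t in re.split(r"[^a-z0-9]+", raw.lower()) if t]


# Hypothetical directory-style URL; the path is made up for the example.
tokens = url_tokens("http://www.dmoz.org/Sports/Basketball/")
```

Tokens such as "sports" and "basketball" already hint at the topic before the page itself is downloaded, which is what makes URL-only classification attractive.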
Features: On-page. Visual analysis. Each webpage has two representations: the text, represented in HTML, and the visual representation rendered by a web browser. Most approaches focus on the text while ignoring the visual information, which is useful as well. Kovacevic et al. (2004): each webpage is represented as a hierarchical “visual adjacency multigraph”, in which each node represents an HTML object and each edge represents a spatial relation in the visual representation.
Visual analysis
Features: Neighbor Features
Features: Neighbor Features. Motivation: on some pages, the useful on-page features discussed previously are missing or unrecognizable.
Example webpage which has few useful on-page features
Features: Neighbor features. Underlying assumptions: when exploring the features of neighbors, some assumptions are implicitly made in existing work. The presence of many “Sport” pages in the neighborhood of page P-a increases the probability of P-a being in “Sport”. Chakrabarti et al. (2002) and Menczer (2005) showed that linked pages were more likely to have terms in common. Neighbor selection: existing research mainly focuses on pages within two steps of the page to be classified, i.e. at a distance no greater than two. There are six types of neighboring pages: parent, child, sibling, spouse, grandparent, and grandchild.
Neighbors within a radius of two
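The six neighbor types can be read directly off the link graph. The sketch below assumes the usual definitions (siblings share a parent with the page, spouses share a child with it); the function name `neighbors_within_two` and the toy graph are made up for illustration.

```python
def neighbors_within_two(links, page):
    """Group the pages within two link steps of `page` by neighbor type.

    links: dict {page: set of pages it links to}.
    """
    def children(p):
        return set(links.get(p, ()))

    def parents(p):
        return {q for q, targets in links.items() if p in targets}

    parent = parents(page)
    child = children(page)
    groups = {
        "parent": parent,                                            # pages linking to `page`
        "child": child,                                              # pages `page` links to
        "sibling": set().union(*(children(q) for q in parent)),      # share a parent
        "spouse": set().union(*(parents(c) for c in child)),         # share a child
        "grandparent": set().union(*(parents(q) for q in parent)),
        "grandchild": set().union(*(children(c) for c in child)),
    }
    # The page itself is never its own neighbor.
    return {kind: pages - {page} for kind, pages in groups.items()}


# Hypothetical graph: g -> p -> {x, s}, x -> c, o -> c, c -> gc
toy_links = {"g": {"p"}, "p": {"x", "s"}, "x": {"c"}, "o": {"c"}, "c": {"gc"}}
around_x = neighbors_within_two(toy_links, "x")
```

For page "x" this yields parent "p", child "c", sibling "s", spouse "o", grandparent "g", and grandchild "gc".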
Features: Neighbor features. Neighbor selection (cont.). Furnkranz (1999): the text on the parent pages surrounding the link is used to train a classifier, instead of the text on the target page. A target page is therefore assigned multiple labels, which are then combined by some voting scheme to form the final prediction of the target page’s class. Sun et al. (2002): using the text on the target page together with page titles and anchor text from parent pages can improve classification compared with a pure text classifier.
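The slide leaves the voting scheme open (“some voting scheme”). As one assumed possibility, a plain plurality vote over the labels predicted from each parent page’s link text could look like this; the function name `vote` and the sample labels are hypothetical.

```python
from collections import Counter


def vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# One predicted label per parent page that links to the target page.
labels_from_parents = ["Sport", "News", "Sport"]
final_label = vote(labels_from_parents)
```

Weighted variants (for example, weighting each vote by classifier confidence) fit the same shape; only the counting step changes.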
Features: Neighbor features. Neighbor selection (cont.). Summary: parent, child, sibling, and spouse pages are all useful in classification, and siblings are found to be the best source. Using information from neighboring pages may introduce extra noise, so it should be used carefully.
Features: Neighbor features. Features: labels (assigned by an editor or a keyworder); partial content (anchor text, the text surrounding anchor text, titles, headers); full content. Among the three types of features, using the full content of neighboring pages is the most expensive; however, it generates better accuracy.
Features: Neighbor features. Utilizing artificial links (implicit links): hyperlinks are not the only choice. What is an implicit link? A connection between pages that appear in the results of the same query and are both clicked by users. Implicit links can help webpage classification as well as hyperlinks.
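The definition of an implicit link translates into a small computation over a click log. The sketch below is a simplification under stated assumptions: the log is a list of (query, clicked page) pairs, clicks are pooled per query across users, and the function name `implicit_links` and all log entries are made up.

```python
from collections import Counter
from itertools import combinations


def implicit_links(click_log):
    """Count implicit links: page pairs clicked for the same query.

    click_log: iterable of (query, clicked_page) pairs.
    """
    clicked = {}
    for query, page in click_log:
        clicked.setdefault(query, set()).add(page)
    links = Counter()
    for pages in clicked.values():
        # Sort so that each unordered pair has one canonical form.
        for a, b in combinations(sorted(pages), 2):
            links[(a, b)] += 1
    return links


# Hypothetical click log.
log = [
    ("nba scores", "espn.example/nba"),
    ("nba scores", "nba.example"),
    ("nba scores", "espn.example/nba"),
    ("python", "python.example"),
]
links = implicit_links(log)
```

The two pages clicked for "nba scores" become implicitly linked even if neither page hyperlinks to the other.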
Discussion: Features. Since the results of different approaches are based on different implementations and different datasets, it is difficult to compare their performance. Sibling pages are even more useful than parents and children. The reason may lie in the process of hyperlink creation: a page often acts as a bridge connecting its outgoing links, which are likely to have a common topic.
Tip! Tracking incoming links: how do you know when someone links to you?
Algorithms
Algorithm Approaches for Webpage Classification
Dimension Reduction: Feature weighting
A way of boosting classification by emphasizing the features with better discriminative power
A special case of weighting: “feature selection”
Dimension Reduction (cont.): Feature Selection. Simple approaches: the first fragment of each document; the first fragment applied to web documents in hierarchical classification. Text categorization approaches: information gain, mutual information, etc.
Feature Selection (cont’d): Simple measure. Using the first fragment of each document. Assumption: a summary is at the beginning of the document. Fast and accurate classification for news articles; not satisfying for other types of documents. Applied to web documents in hierarchical classification: useful for web documents.
Feature Selection (cont’d): Text categorization measures. Using expected mutual information and mutual information: two well-known metrics, used in an approach based on a variation of the k-Nearest Neighbor algorithm; terms are weighted according to the HTML tags in which they appear, since terms within different tags carry different importance. Using information gain: another well-known metric. It is still not clear which one is superior for web classification.
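As a worked illustration of one of these metrics, information gain for a term can be computed as the class entropy minus the entropy that remains after splitting the documents on whether they contain the term. This is the standard textbook definition, not a detail taken from the slides; the four toy documents and their labels are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Information gain of splitting on presence/absence of `term`.

    docs: list of sets of terms; labels: parallel list of class labels.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without) / n) * entropy(without)
    return entropy(labels) - conditional


# Hypothetical corpus: each document is reduced to its set of terms.
docs = [{"goal", "match"}, {"goal", "league"}, {"stock", "market"}, {"market", "match"}]
labels = ["Sport", "Sport", "Business", "Business"]
ig_goal = information_gain(docs, labels, "goal")    # separates the classes perfectly
ig_match = information_gain(docs, labels, "match")  # appears once in each class
```

Feature selection then keeps the terms with the highest scores: here "goal" would be kept and "match" dropped.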
Feature Selection (cont’d): Text categorization measures. Improving the performance of SVM classifiers by aggressive feature selection: a measure was developed with the ability to predict selection effectiveness without training and testing classifiers. Latent Semantic Indexing (LSI), a popular technique for text documents: documents are reinterpreted in a smaller, transformed, but less intuitive space. Cons: high computational complexity makes it inefficient to scale in web classification, so experiments are based on small datasets. Some work has tried to make it applicable to larger datasets, which still needs further study.
Algorithm Approaches for Webpage Classification
Relational Learning
Relational Learning (cont’d): two main approaches. Relaxation labeling algorithms: originally proposed for image analysis; currently used in image and vision analysis, artificial intelligence, pattern recognition, and web mining. Link-based classification algorithms: utilizing two popular link-based algorithms, loopy belief propagation and iterative classification.
Relational Learning (cont’d): Relaxation Labeling Algorithms
Relational Learning (cont’d): Link-based classification algorithms. Two popular link-based algorithms, loopy belief propagation and iterative classification, perform better on a web collection than textual classifiers. During the researchers’ study, a toolkit was implemented. Toolkit features: it classifies networked data using a relational classifier and a collective inference procedure, and it demonstrated strong performance on several datasets, including web collections.
Algorithm Approaches for Webpage Classification
Modifications to traditional algorithms: traditional algorithms adjusted in the context of webpage classification. k-Nearest Neighbors (kNN): quantifies the distance between the test document and each training document using a dissimilarity measure; cosine similarity or inner product is used by most existing kNN classifiers. Support Vector Machine (SVM).
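For reference, a baseline kNN text classifier with cosine similarity can be sketched as follows. This is the unmodified starting point, not any of the modified variants; the bag-of-words representation, the similarity-weighted vote, `k=3`, and the four training snippets are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(test_doc, training, k=3):
    """Label test_doc by a similarity-weighted vote of its k nearest neighbors.

    training: list of (text, label) pairs.
    """
    test_vec = Counter(test_doc.lower().split())
    scored = sorted(
        ((cosine(test_vec, Counter(text.lower().split())), label)
         for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Hypothetical training set.
training = [
    ("football match goal", "Sport"),
    ("league goal striker", "Sport"),
    ("stock market shares", "Business"),
    ("market prices fall", "Business"),
]
predicted = knn_classify("late goal wins the match", training, k=3)
```

The test document shares "goal" and "match" with the sport snippets and nothing with the business ones, so the weighted vote picks "Sport".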
Modified Algorithms (cont’d): k-Nearest Neighbors algorithm. Varieties of modifications: using term co-occurrence in documents; using probability computation; using “co-training”.
k-Nearest Neighbors Algorithm (cont’d): modification varieties. Using term co-occurrence in documents: an improved similarity measure; the more co-occurring terms two documents have in common, the stronger the relationship between them; better performance than normal kNN (cosine similarity and inner product measures). Using probability computation: the probability of a document d being in class c is determined by the distance between d and its neighbors and by its neighbors’ probability of being in c. Simple equation: Prob. of d in c = (distance between d and neighbors) × (neighbors’ prob. of being in c).
k-Nearest Neighbors Algorithm (cont’d): modification varieties (2). Using “co-training”: makes use of both labeled and unlabeled data, aiming to achieve better accuracy. Scenario: binary classification. To classify the unlabeled instances, two classifiers are trained on different sets of features, and the predictions of each one are used to train the other. Compared with using only labeled instances, co-training can cut the error rate by half. When generalized to multi-class problems in which the number of categories is large, co-training is not satisfying. On the other hand, combining error-correcting output coding (with more than enough classifiers in use) with co-training can boost performance.
Modified Algorithms (cont’d): SVM-based approach. In classification, both positive and negative examples are required. Aim of the SVM-based approach: to eliminate the need for manual collection of negative examples while still retaining similar classification accuracy.
SVM-based Approach (cont’d): flow of the SVM-based algorithm
Take a Break! The Internet’s Ad Marketplace Besides Google AdWords
Algorithm Approaches for Webpage Classification
Hierarchical Classification: There is not much research, since most web classification work focuses on classifying at a single level. Approaches: based on “divide and conquer”; error minimization; topical hierarchy; hierarchical SVMs; using the degree of misclassification; hierarchical text categorization.
Hierarchical Classification (cont’d): approaches. Hierarchical classification based on “divide and conquer”: classification problems are split into sub-problems hierarchically; this is more efficient and accurate than the non-hierarchical way. Error minimization: when the lower-level category is uncertain, minimize error by shifting the assignment to the higher-level one. Topical hierarchy: classify a web page into a topical hierarchy, and update the category information as the hierarchy expands.
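The error-minimization idea (fall back to the higher-level category when the lower-level one is uncertain) can be sketched as a walk down the hierarchy that stops at the first uncertain step. The confidence threshold of 0.6, the function name `assign_category`, and the category names and scores are assumptions for illustration, not values from the surveyed work.

```python
def assign_category(path_scores, threshold=0.6):
    """Return the deepest category on the path whose confidence is high enough.

    path_scores: list of (category, confidence) pairs ordered from the top
    of the hierarchy down to the leaf. Returns None if even the top-level
    category is uncertain.
    """
    assigned = None
    for category, confidence in path_scores:
        if confidence < threshold:
            break  # stay with the higher-level category assigned so far
        assigned = category
    return assigned


# Hypothetical per-level classifier outputs for one page.
scores = [
    ("Sports", 0.9),
    ("Sports/Basketball", 0.8),
    ("Sports/Basketball/College", 0.4),
]
category = assign_category(scores)
```

The page is filed under "Sports/Basketball" rather than risking a wrong leaf-level label.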
Hierarchical Classification (cont’d): approaches (2). Hierarchical SVMs: hierarchical SVMs are observed to be more efficient than flat SVMs, but neither is satisfying in effectiveness for large taxonomies; hierarchical settings do more harm than good to kNN and naive Bayes classifiers. Hierarchical classification by the degree of misclassification: as opposed to measuring “correctness”, distance is measured between the classifier-assigned class and the true class. Hierarchical text categorization: a detailed review was provided in 2005.
Algorithm Approaches for Webpage Classification
Combining Information from Multiple Sources: Different sources are utilized; combining link and content information is quite popular. Common way of combining: treat information from different sources as different (usually disjoint) feature sets, on which multiple classifiers are trained; the final decision is then generated by combining the classifiers. This usually has the potential to perform better than any single method.
Information Combination (cont’d): approaches. Voting and stacking: well-developed methods in machine learning. Co-training: effective in combining multiple sources, since here different classifiers are trained on disjoint feature sets.
Information Combination (cont’d): cautions. Please note that additional resource needs are sometimes a disadvantage, and that the combination of two sources is NOT always better than each one separately.
Blog classification
Take a Break! Follow the Trend!! Everybody RETWEET!!
Follow me on Twitter: follow pChr. Also my blog: Http://www.PacharaStudio.com
Blog classification: The word “blog” was originally a short form of “web log”. As blogging has gained popularity in recent years, an increasing amount of research about blogs has also been conducted. It can be broken into three types: blog identification (determining whether a web document is a blog), mood classification, and genre classification.
Blog classification: Elgersma and de Rijke (2006) applied common classification algorithms to blog identification using a number of human-selected features, e.g. “Comments” and “Archives”, with accuracy around 90%. Mihalcea and Liu (2006) classify blogs into two polarities of mood, happiness and sadness (mood classification). Nowson (2006) discussed the distinction between three types of blogs (genre classification): news, commentary, and journal.
Blog classification: Qu et al. (2006): automatic classification of blogs into four genres (personal diary, news, political, and sports), using a unigram tf-idf document representation and naive Bayes classification. Qu et al.’s approach can achieve an accuracy of 84%.
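To show what a unigram tf-idf representation looks like, here is a minimal sketch of the document representation only (the naive Bayes step is left out). The weighting variant used here, relative term frequency times log(N / document frequency), is one common choice and an assumption; the slides do not say which variant Qu et al. used. The three sample posts are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dictionary per document.

    Weight = (term count / document length) * log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


# Hypothetical blog snippets.
posts = ["my day at home", "election day results", "match day report"]
vectors = tfidf(posts)
```

The term "day" occurs in every post and gets weight zero, while genre-specific terms such as "election" keep a positive weight; these weighted vectors are what a classifier would then be trained on.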
Conclusion
Conclusion: Webpage classification is a type of supervised learning problem that aims to categorize webpages into a set of predefined categories based on labeled training data. The authors expect that future web classification efforts will certainly combine content and link information in some form.
Conclusion: Future work would be well advised to emphasize text and labels from siblings over other types of neighbors; incorporate anchor text from parents; and utilize other sources of (implicit or explicit) human knowledge, such as query logs and click-through behavior, in addition to existing labels to guide classifier creation.
Thank you.
Question?

International Journal of Engineering Research and Development (IJERD)
 
`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas
 
132-ArticleText-800-1-10-20210331 (1).pdf
132-ArticleText-800-1-10-20210331 (1).pdf132-ArticleText-800-1-10-20210331 (1).pdf
132-ArticleText-800-1-10-20210331 (1).pdf
 
PageRank algorithm and its variations: A Survey report
PageRank algorithm and its variations: A Survey reportPageRank algorithm and its variations: A Survey report
PageRank algorithm and its variations: A Survey report
 
Pratical Deep Dive into the Semantic Web - #smconnect
Pratical Deep Dive into the Semantic Web - #smconnectPratical Deep Dive into the Semantic Web - #smconnect
Pratical Deep Dive into the Semantic Web - #smconnect
 
IRJET- A Literature Review and Classification of Semantic Web Approaches for ...
IRJET- A Literature Review and Classification of Semantic Web Approaches for ...IRJET- A Literature Review and Classification of Semantic Web Approaches for ...
IRJET- A Literature Review and Classification of Semantic Web Approaches for ...
 
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A ReviewIRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
 
Plenary paper-2012-weideman-academic-content-web-visibility-presence
Plenary paper-2012-weideman-academic-content-web-visibility-presencePlenary paper-2012-weideman-academic-content-web-visibility-presence
Plenary paper-2012-weideman-academic-content-web-visibility-presence
 
beginners-guide.pdf
beginners-guide.pdfbeginners-guide.pdf
beginners-guide.pdf
 
The beginners guide to SEO
The beginners guide to SEOThe beginners guide to SEO
The beginners guide to SEO
 
Mazhiming
MazhimingMazhiming
Mazhiming
 
Internet 信息检索中的数学
Internet 信息检索中的数学Internet 信息检索中的数学
Internet 信息检索中的数学
 
Team of Rivals: UX, SEO, Content & Dev UXDC 2015
Team of Rivals: UX, SEO, Content & Dev  UXDC 2015Team of Rivals: UX, SEO, Content & Dev  UXDC 2015
Team of Rivals: UX, SEO, Content & Dev UXDC 2015
 
SEOMoz The Beginners Guide To SEO
SEOMoz The Beginners Guide To SEOSEOMoz The Beginners Guide To SEO
SEOMoz The Beginners Guide To SEO
 
Modern web search: Web Information Systems
Modern web search: Web Information SystemsModern web search: Web Information Systems
Modern web search: Web Information Systems
 
Modern web search: Lecture 11
Modern web search: Lecture 11Modern web search: Lecture 11
Modern web search: Lecture 11
 
Optimizing Library Websites for Better Visibility
Optimizing Library Websites for Better VisibilityOptimizing Library Websites for Better Visibility
Optimizing Library Websites for Better Visibility
 
Optimizing Library Websites for Better Visibility
Optimizing Library Websites for Better VisibilityOptimizing Library Websites for Better Visibility
Optimizing Library Websites for Better Visibility
 
Recent research in web page classification – a review
Recent research in web page classification – a reviewRecent research in web page classification – a review
Recent research in web page classification – a review
 

Kürzlich hochgeladen

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Kürzlich hochgeladen (20)

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Web Page Classification

  • 1. Web Page Classification: Features and Algorithms. Xiaoguang Qi and Brian D. Davison, Department of Computer Science & Engineering, Lehigh University, June 2007. Presented by Mr. Pachara Chutisawaeng, Department of Computer Science, Mahidol University, July 2009.
  • 2. Agenda: Webpage classification significance; Introduction; Background; Applications of web classification; Features; Algorithms; Blog classification; Conclusion.
  • 3. Webpage classification significance
  • 4. Webpage classification significance Let’s go back in history about 10 years. The Evolution of Websites: How 5 popular Websites have changed 
  • 6. Apple – 10 Years ago!
  • 8. Amazon – 9 Years ago
  • 10. CNN – 8 Years ago
  • 12. Yahoo! – 12 Years ago
  • 13. Webpage classification significance: What is different between past and present? What changed?
  • 15. Nike – 8 Years ago
  • 16. Webpage classification significance: What is different between past and present? What changed? Flash animation; JavaScript; video clips and embedded objects; advertising (Google AdSense, Yahoo!).
  • 17. Introduction
  • 18. Introduction: Webpage classification, or webpage categorization, is the process of assigning a webpage to one or more category labels, e.g. “News”, “Sport”, “Business”. Goal: the authors survey existing web classification techniques, including web-specific features and algorithms that have been found useful for webpage classification, in order to identify new areas for research.
  • 19. Introduction: What will you learn? A detailed review of useful features for web classification; the algorithms used; the future research directions. Webpage classification can help improve the quality of web search. Knowing this can help you improve your SEO skills. Each search engine keeps its techniques secret.
  • 20. Background
  • 21. Background: The general problem of webpage classification can be divided into subject classification, which concerns the subject or topic of a webpage, e.g. “Adult”, “Sport”, “Business”, and functional classification, which concerns the role that the webpage plays, e.g. “Personal homepage”, “Course page”, “Admission page”.
  • 22. Background: Based on the number of classes in the problem, classification can be divided into binary classification and multi-class classification. Based on the number of classes that can be assigned to an instance, classification can be divided into single-label classification and multi-label classification.
  • 24. Applications of web classification
  • 25. Applications of web classification: Constructing and expanding web directories (web hierarchies), such as Yahoo! and the ODP, or “Open Directory Project” (http://www.dmoz.org). How are they maintained?
  • 27. Applications of web classification: How are they maintained? By human effort: in July 2006, it was reported that there were 73,354 editors in the dmoz ODP. As the web changes and continues to grow, “automatic creation of classifiers from web corpora based on user-defined hierarchies” was introduced by Huang et al. in 2004. This is the starting point of this presentation!
  • 28. Applications of web classification: Improving the quality of search results. Categories view; ranking view.
  • 30. Applications of web classification: Improving the quality of search results. Categories view; ranking view. In 1998, Page and Brin developed the link-based ranking algorithm called PageRank, which computes rankings from hyperlinks without considering the topic of each page.
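To make the PageRank remark concrete, here is a minimal power-iteration sketch. It is only an illustration: the toy link graph, the damping factor of 0.85 and the fixed iteration count are assumptions, not values taken from the slides. Note that the computation uses link structure only and never looks at page topics.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from three pages, "d" from none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph the most linked-to page receives the highest rank regardless of what any page is about, which is the limitation that topic-aware classification aims to address.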
  • 31. Google – 11 Years ago
  • 32. Applications of web classification: Helping question answering systems. Yang and Chua 2004 suggest finding answers to list questions, e.g. “name all the countries in Europe”. How did it work? Queries were formulated and sent to search engines. The results were classified into four categories: collection pages (containing a list of items), topic pages (representing an answer instance), relevant pages (supporting an answer instance) and irrelevant pages. After that, topic pages are clustered, and answers are extracted from the clusters. Question answering systems could benefit from web classification in both accuracy and efficiency.
  • 33. Applications of web classification: Other applications. Web content filtering; assisted web browsing; knowledge base construction.
  • 34. Features
  • 35. Features: In this section, we review the types of features that are useful in webpage classification research. The most important characteristic that makes webpage classification different from plain-text classification is the hyperlink, <a>…</a>. We classify features into on-page features, which are located directly on the page, and neighbor features, which are found on pages related to the page to be classified.
  • 36. Features: On-page
  • 37. Features: On-page. Textual content and tags. N-gram features: imagine two different documents, one containing the phrase “New York” and the other containing the terms “New” and “York” separately. A 2-gram feature distinguishes them, while single words do not. Yahoo! used 5-gram features. HTML tags or DOM: title, headings, metadata and main text, each assigned an arbitrary weight. Nowadays most websites use nested lists (<ul><li>), which really helps in web page classification.
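The “New York” example and the per-tag weighting can be sketched in a few lines. This is a minimal illustration: the whitespace tokenizer, the sample sentences and the weight values are assumptions chosen for the example, not taken from the slides.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the list of word n-grams in a text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weighted_term_counts(fields, weights):
    """Count terms per HTML field, scaling each occurrence by the field's weight.

    fields:  {"title": "...", "body": "..."}  (text already extracted from the tags)
    weights: {"title": 3.0, "body": 1.0}      (arbitrary, as the slide notes)
    """
    counts = Counter()
    for field, text in fields.items():
        for token in text.lower().split():
            counts[token] += weights.get(field, 1.0)
    return counts


doc_a = "flights to New York"
doc_b = "a new hotel in York"
```

Both documents contain the unigrams “new” and “york”, but only `doc_a` yields the bigram “new york”, so the 2-gram feature separates the two where single words cannot.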
  • 38. Features: On-page. Textual content and tags. URL: Kan and Thi 2004 demonstrated that a webpage can be classified based on its URL.
  • 39. Features: On-page. Visual analysis: each webpage has two representations, the text represented in HTML and the visual representation rendered by a web browser. Most approaches focus on the text while ignoring the visual information, which is useful as well. Kovacevic et al. 2004: each webpage is represented as a hierarchical “visual adjacency multigraph”, in which each node represents an HTML object and each edge represents a spatial relation in the visual representation.
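One way to picture URL-based classification is to split a URL into word-like tokens and hand those tokens to any text classifier. The sketch below shows only the tokenization step; the splitting rule is an assumption for illustration and is not the segmentation method of the cited work.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    raw = parsed.netloc + " " + parsed.path
    return [t for t in re.split(r"[^a-z0-9]+", raw.lower()) if t]


tokens = url_tokens("http://www.dmoz.org/Sports/Soccer/")
```

The resulting tokens, such as `sports` and `soccer`, can already hint at the topic without downloading the page, which is what makes URL features cheap to use.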
  • 41. Features: Neighbors Features
  • 42. Features: Neighbors Features. Motivation: the useful on-page features discussed previously may be missing or unrecognizable on a particular page.
  • 43. Example webpage which has few useful on-page features
  • 44. Features: Neighbors features. Underlying assumptions: when exploring the features of neighbors, some assumptions are implicitly made in existing work. For example, the presence of many “sports” pages in the neighborhood of page P-a increases the probability of P-a being in “Sport”. Chakrabarti et al. 2002 and Menczer 2005 showed that linked pages were more likely to have terms in common. Neighbor selection: existing research mainly focuses on pages within two steps of the page to be classified, that is, at a distance no greater than two. There are six types of neighboring pages: parent, child, sibling, spouse, grandparent and grandchild.
  • 45. Neighbors within a radius of two
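The six neighbor types can be derived from a plain link graph. The sketch below assumes the usual reading of the terms (a parent links to the page, a sibling shares a parent with it, a spouse shares a child with it); the toy graph and the function name are hypothetical.

```python
def neighbors(links, page):
    """Return the six neighbor sets of `page`; links is {page: set of pages it links to}."""
    def children(p):
        return set(links.get(p, ()))

    def parents(p):
        return {q for q, targets in links.items() if p in targets}

    kids, folks = children(page), parents(page)
    return {
        "parent": folks,
        "child": kids,
        # Siblings share at least one parent with the page.
        "sibling": {s for f in folks for s in children(f)} - {page},
        # Spouses share at least one child with the page.
        "spouse": {s for k in kids for s in parents(k)} - {page},
        "grandparent": {g for f in folks for g in parents(f)},
        "grandchild": {g for k in kids for g in children(k)},
    }


# Hypothetical toy graph: top -> hub -> {p, s}; p -> c; q -> c; c -> gc.
toy = {"top": {"hub"}, "hub": {"p", "s"}, "p": {"c"}, "q": {"c"}, "c": {"gc"}}
around_p = neighbors(toy, "p")
```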
  • 46. Features: Neighbors features. Neighbor selection (cont.). Furnkranz 1999: the text on the parent pages surrounding the link is used to train a classifier, instead of the text on the target page. A target page will thus be assigned multiple labels, which are then combined by some voting scheme to form the final prediction of the target page’s class. Sun et al. 2002: in addition to the text on the target page, using the page title and anchor text from parent pages can improve classification compared with a pure text classifier.
  • 47. Features: Neighbors features. Neighbor selection (cont.). Summary: parent, child, sibling and spouse pages are all useful in classification, and siblings are found to be the best source. Information from neighboring pages may introduce extra noise, so it should be used carefully.
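The voting idea can be sketched as follows: each parent page contributes one predicted label based on the text around its link, and the target page takes the most common label. The keyword-overlap classifier here is a hypothetical stand-in for a trained classifier, and simple majority voting is only one possible voting scheme.

```python
from collections import Counter

# Hypothetical keyword lists standing in for a trained text classifier.
KEYWORDS = {
    "sport": {"football", "league", "score"},
    "business": {"stock", "market", "profit"},
}


def classify_context(text):
    """Label one anchor context by keyword overlap; None if nothing matches."""
    tokens = set(text.lower().split())
    best = max(KEYWORDS, key=lambda label: len(tokens & KEYWORDS[label]))
    return best if tokens & KEYWORDS[best] else None


def classify_target(anchor_contexts):
    """Majority vote over the labels predicted from each parent's link text."""
    labels = [lab for lab in map(classify_context, anchor_contexts) if lab is not None]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


contexts = ["latest football score here", "premier league table", "stock market update"]
predicted = classify_target(contexts)
```

Two of the three parents describe the target in sports terms, so the vote settles on “sport” even though one parent disagrees.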
  • 49. Features: Neighbors features. Features: labels (assigned by an editor or keyworder); partial content (anchor text, the text surrounding the anchor text, titles, headers); full content. Among the three types of features, using the full content of neighboring pages is the most expensive; however, it generates better accuracy.
  • 50. Features: Neighbors features. Utilizing artificial links (implicit links): hyperlinks are not the only choice. What is an implicit link? A connection between pages that appear in the results of the same query and are both clicked by users. Implicit links can help webpage classification just as hyperlinks do.
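Implicit links as defined above can be derived from a click log. A minimal sketch, assuming the log is a list of (query, clicked URL) pairs; the log format and the sample entries are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def implicit_links(click_log):
    """Return undirected page pairs that were both clicked for the same query."""
    clicked = defaultdict(set)
    for query, url in click_log:
        clicked[query].add(url)
    links = set()
    for urls in clicked.values():
        # Sorting gives each undirected pair one canonical orientation.
        for a, b in combinations(sorted(urls), 2):
            links.add((a, b))
    return links


log = [("jaguar", "a.com"), ("jaguar", "b.com"), ("python", "c.com"), ("jaguar", "a.com")]
pairs = implicit_links(log)
```

The resulting pairs can then be treated like hyperlinks when collecting neighbor features.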
  • 52. Discussion: Features. The results of different approaches are based on different implementations and different datasets, which makes it difficult to compare their performance. Sibling pages are even more useful than parents and children. The reason may lie in the process of hyperlink creation: a page often acts as a bridge connecting its outgoing links, which are likely to share a common topic.
  • 54. Tip! Tracking incoming links: how do you know when someone links to you?
  • 55. Algorithms
  • 56. Algorithm Approaches for Webpage Classification
  • 58. A way of boosting classification by emphasizing the features with better discriminative power
  • 60. Dimension Reduction (cont’d): Feature Selection. Simple approaches: using the first fragment of each document; applying the first fragment to web documents in hierarchical classification. Text categorization approaches: information gain, mutual information, etc.
  • 62. Feature Selection (cont’d): Text Categorization Measures. Expected mutual information and mutual information: two well-known metrics, used in work based on a variation of the k-Nearest Neighbor algorithm in which terms are weighted according to the HTML tags they appear in, since terms within different tags carry different importance. Information gain: another well-known metric. It is still not clear which one is superior for web classification.
  • 63. Feature Selection (cont’d): Text Categorization Measures. Improving the performance of SVM classifiers by aggressive feature selection: a measure was developed that can predict the selection effectiveness without training and testing classifiers. Latent Semantic Indexing (LSI), a popular approach: text documents are reinterpreted in a smaller transformed, but less intuitive, space. Cons: its high computational complexity makes it inefficient to scale in web classification, so experiments were based on small datasets. Some work has improved it to make it applicable to larger datasets, but this still needs further study.
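Information gain, one of the metrics named above, measures how much knowing whether a term is present reduces uncertainty about the class. The sketch below uses the standard entropy-based definition on a tiny hypothetical corpus; the data layout (a set of terms plus a label per document) is an assumption for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, term):
    """IG(term) = H(class) - H(class | term present or absent).

    docs: list of (set_of_terms, label) pairs.
    """
    all_labels = [label for _, label in docs]
    with_term = [label for terms, label in docs if term in terms]
    without = [label for terms, label in docs if term not in terms]
    n = len(docs)
    conditional = (len(with_term) / n) * entropy(with_term) + (len(without) / n) * entropy(without)
    return entropy(all_labels) - conditional


# Hypothetical corpus: "goal" separates the classes perfectly, "team" not at all.
corpus = [
    ({"goal", "match"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"stock", "team"}, "business"),
    ({"stock", "bank"}, "business"),
]
```

Feature selection then keeps the terms with the highest scores, so here “goal” would be kept and “team” discarded.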
  • 64. Algorithm Approaches for Webpage Classification
  • 66. Relational Learning (cont’d): 2 Main Approaches. Relaxation labeling algorithms: originally proposed for image analysis; currently used in image and vision analysis, artificial intelligence, pattern recognition and web mining. Link-based classification algorithms: utilizing 2 popular link-based algorithms, loopy belief propagation and iterative classification.
  • 68. Relational Learning (cont’d): Link-based Classification Algorithms. Two popular link-based algorithms: loopy belief propagation and iterative classification. They perform better on a web collection than textual classifiers. During the researchers’ study, a toolkit was implemented. Toolkit features: it classifies networked data using a relational classifier and a collective inference procedure, and it demonstrated strong performance on several datasets, including web collections.
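The flavor of iterative classification can be shown with a heavily simplified sketch: unlabeled pages repeatedly take the majority label of their already-labeled neighbors until nothing changes. This is an assumption-laden simplification. A full iterative classification algorithm would use a trained local classifier over content and neighbor-label features, whereas the “local classifier” here is just a neighbor majority vote on a hypothetical undirected graph.

```python
from collections import Counter


def iterative_classification(graph, seed_labels, rounds=10):
    """Propagate labels over a graph {node: set of neighbors} from known seed labels."""
    labels = dict(seed_labels)
    for _ in range(rounds):
        changed = False
        for node in sorted(graph):
            if node in seed_labels:
                continue  # never overwrite the known labels
            votes = Counter(labels[n] for n in graph[node] if n in labels)
            if votes:
                new_label = votes.most_common(1)[0][0]
                if labels.get(node) != new_label:
                    labels[node] = new_label
                    changed = True
        if not changed:
            break  # labels are stable
    return labels


# Hypothetical graph with two chains: a-b-c-d and e-f-g.
graph = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e", "g"}, "g": {"f"},
}
result = iterative_classification(graph, {"a": "sport", "g": "business"})
```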
  • 69. Algorithm Approaches for Webpage Classification
  • 70. Modifications to traditional algorithms: traditional algorithms adjusted to the context of webpage classification. k-Nearest Neighbors (kNN): quantifies the distance between the test document and each training document using a dissimilarity measure; cosine similarity or inner product is what most existing kNN classifiers use. Support Vector Machine (SVM).
  • 71. Modification Algorithms (cont’d): k-Nearest Neighbors Algorithm. Varieties of modifications: using term co-occurrence in documents; using probability computation; using “co-training”.
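For reference, a plain kNN text classifier with cosine similarity, the baseline that the modifications build on, might look like the sketch below. The sparse term-weight dictionaries, the value of k and the toy training set are assumptions for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(test_doc, training, k=3):
    """Label test_doc by majority vote among its k most similar training documents.

    training: list of ({term: weight}, label) pairs.
    """
    ranked = sorted(training, key=lambda item: cosine(test_doc, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set of term-count vectors.
training = [
    ({"goal": 2, "team": 1}, "sport"),
    ({"match": 1, "goal": 1}, "sport"),
    ({"stock": 2, "bank": 1}, "business"),
    ({"market": 1, "stock": 1}, "business"),
]
label = knn_classify({"goal": 1, "match": 1}, training, k=3)
```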
  • 72. k-Nearest Neighbors Algorithm (cont’d): Modification Varieties. Using term co-occurrence in documents: an improved similarity measure; the more co-occurring terms two documents have in common, the stronger the relationship between them. It performs better than normal kNN (cosine similarity and inner product measures). Using probability computation: the probability of a document d being in class c is determined by the distance between d and its neighbors and by its neighbors’ probability of being in c. As a simple equation: Prob. of d in c = (distance between d and neighbors) × (neighbors’ prob. of being in c).
  • 73. k-Nearest Neighbors Algorithm (cont’d): Modification Varieties (2). Using “co-training”: makes use of both labeled and unlabeled data, aiming to achieve better accuracy. Scenario (binary classification): two classifiers trained on different sets of features classify the unlabeled instances, and the prediction of each one is used to train the other. Compared with using only labeled instances, co-training can cut the error rate by half. When generalized to multi-class problems: when the number of categories is large, co-training is not satisfactory; on the other hand, combining error-correcting output coding (which uses more classifiers than strictly necessary) with co-training can boost performance.
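One plausible way to write the probability rule in symbols is given below. This is an interpretation of the slide, not a formula quoted from the underlying work: the neighbor set N_k(d), the similarity function sim and the use of a sum over neighbors are all assumptions.

```latex
P(c \mid d) \;\propto\; \sum_{n \in N_k(d)} \operatorname{sim}(d, n)\, P(c \mid n)
```

Here N_k(d) denotes the k nearest neighbors of d, sim(d, n) grows as the distance between d and n shrinks, and P(c | n) is neighbor n's own probability of belonging to class c.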
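The co-training loop can be sketched with two views per page, for example page text and anchor text. Everything specific here is hypothetical: the per-view classifier is a simple token-count scorer rather than a real learner, one instance is promoted per view and round, and the data are toy examples. The sketch assumes at least one labeled example is available.

```python
from collections import Counter


def train(view_examples):
    """Build per-label token counts from (token_set, label) pairs."""
    model = {}
    for tokens, label in view_examples:
        model.setdefault(label, Counter()).update(tokens)
    return model


def predict(model, tokens):
    """Return (label, confidence); confidence is the score margin over the runner-up."""
    scores = {label: sum(counts[t] for t in tokens) for label, counts in model.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return best_label, best - runner_up


def co_train(labeled, unlabeled, rounds=5):
    """labeled: list of ((view1, view2), label); unlabeled: list of (view1, view2)."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        models = [train([(views[i], label) for views, label in labeled]) for i in (0, 1)]
        picked = {}
        for i, model in enumerate(models):
            # Each view's classifier labels the pool instance it is most confident about.
            scored = [(predict(model, views[i]), idx) for idx, views in enumerate(pool)]
            (label, confidence), idx = max(scored, key=lambda s: s[0][1])
            if confidence > 0:
                picked.setdefault(idx, label)
        if not picked:
            break
        # Move the newly labeled instances into the shared training set.
        for idx in sorted(picked, reverse=True):
            labeled.append((pool.pop(idx), picked[idx]))
    return [train([(views[i], label) for views, label in labeled]) for i in (0, 1)]


seed = [
    (({"goal", "team"}, {"sports", "news"}), "sport"),
    (({"stock", "bank"}, {"finance", "news"}), "business"),
]
pool = [
    ({"goal", "league"}, {"football", "scores"}),
    ({"league", "cup"}, {"football", "fixtures"}),
    ({"stock", "shares"}, {"market", "report"}),
]
text_model, anchor_model = co_train(seed, pool)
```

In this toy run the text view labels pages the anchor view cannot yet recognize, and after retraining the anchor view learns words such as “football” and “market” that never appeared in the labeled seed data.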
  • 74. Modification Algorithms (cont’d): SVM-based Approach. In classification, both positive and negative examples are required. Aim of the SVM-based approach: to eliminate the need for manual collection of negative examples while still retaining similar classification accuracy.
  • 75. SVM-based Approach (cont’d): flow of the SVM-based algorithm
  • 76. Take a break! The Internet’s ad marketplace besides Google AdWords
  • 77. Algorithm Approaches for Webpage Classification
  • 78. Hierarchical Classification: there is not much research, since most web classification work focuses on categories at a single level. Approaches: “divide and conquer”; error minimization; topical hierarchy; hierarchical SVMs; using the degree of misclassification; hierarchical text categorization.
  • 79. Hierarchical Classification (cont’d): Approaches. Hierarchical classification based on “divide and conquer”: classification problems are split into sub-problems hierarchically, which is more efficient and accurate than the non-hierarchical way. Error minimization: when the lower-level category is uncertain, errors are minimized by shifting the assignment to the higher-level category. Topical hierarchy: classify a web page into a topical hierarchy and update the category information as the hierarchy expands.
  • 80. Hierarchical Classification (cont’d): Approaches (2). Hierarchical SVMs: the observations are that hierarchical SVMs are more efficient than flat SVMs, that none are satisfactory in effectiveness for large taxonomies, and that hierarchical settings do more harm than good to kNN and naive Bayes classifiers. Hierarchical classification by the degree of misclassification: as opposed to measuring “correctness”, the distance is measured between the classifier-assigned class and the true class. Hierarchical text categorization: a detailed review was provided in 2005.
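The “divide and conquer” and error-minimization ideas can be combined in a small top-down sketch: at each level only the children of the current node compete, and when no child wins clearly the page stays at the higher-level category. The taxonomy, the keyword scorer and the margin threshold are all hypothetical placeholders for trained per-node classifiers.

```python
# Hypothetical toy hierarchy and per-node keyword sets.
TAXONOMY = {
    "root": ["sport", "business"],
    "sport": ["football", "tennis"],
    "business": ["banking", "retail"],
}
KEYWORDS = {
    "sport": {"match", "player", "goal", "serve"},
    "business": {"profit", "loan", "store"},
    "football": {"goal", "striker"},
    "tennis": {"serve", "racket"},
    "banking": {"loan", "interest"},
    "retail": {"store", "discount"},
}


def score_children(node, tokens):
    """Score only the children of `node`: one small sub-problem per level."""
    return {child: len(tokens & KEYWORDS[child]) for child in TAXONOMY.get(node, [])}


def classify_top_down(tokens, min_margin=1):
    """Descend the hierarchy; stop early when the best child is not clearly ahead."""
    node = "root"
    while node in TAXONOMY:
        scores = score_children(node, tokens)
        ranked = sorted(scores.values(), reverse=True)
        margin = ranked[0] - (ranked[1] if len(ranked) > 1 else 0)
        if margin < min_margin:
            break  # uncertain at this level: keep the higher-level category
        node = max(scores, key=scores.get)
    return node
```

A page mentioning only general sports terms is assigned to “sport” rather than being forced into “football” or “tennis”, which is the back-off behavior described under error minimization.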
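Measuring the degree of misclassification can be illustrated with a tree distance: the number of edges between the predicted class and the true class in the hierarchy. The toy hierarchy and the choice of edge count as the distance are assumptions made for this sketch.

```python
# Hypothetical child -> parent map for a small category tree.
PARENT = {
    "football": "sport",
    "tennis": "sport",
    "banking": "business",
    "sport": "root",
    "business": "root",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(predicted, actual):
    """Number of edges between two classes via their lowest common ancestor."""
    up_pred, up_true = path_to_root(predicted), path_to_root(actual)
    depth_of = {n: i for i, n in enumerate(up_pred)}
    for j, n in enumerate(up_true):
        if n in depth_of:
            return depth_of[n] + j
    raise ValueError("classes are not in the same hierarchy")
```

Under this measure, predicting “tennis” for a “football” page (distance 2) counts as a smaller error than predicting “banking” (distance 4), whereas plain accuracy would treat both as equally wrong.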
  • 81. Algorithm Approaches for Webpage Classification
  • 82. Combining Information from Multiple Sources. Different sources are utilized; combining link and content information is quite popular. A common way to combine: treat information from different sources as different (usually disjoint) feature sets on which multiple classifiers are trained; the final decision is then generated from the outputs of those classifiers. The combination usually has the potential to do better than any single method.
  • 83. Information Combination (cont’d): Approaches. Voting and stacking: well-developed methods in machine learning. Co-training: effective in combining multiple sources, since here different classifiers are trained on disjoint feature sets.
  • 84. Information Combination (cont’d): Cautions. Please note that additional resource needs are sometimes a disadvantage, and that the combination of two sources is NOT always better than each separately.
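A simple form of voting over classifiers trained on different sources is a weighted average of their class probabilities. The sketch below assumes each classifier outputs a {label: probability} dictionary; the example probabilities and weights are made up for illustration.

```python
def combine(predictions, weights=None):
    """Weighted average of per-classifier class probabilities.

    predictions: list of {label: probability} dicts, one per classifier.
    Returns (winning label, combined scores).
    """
    weights = weights or [1.0] * len(predictions)
    total = sum(weights)
    labels = {label for p in predictions for label in p}
    combined = {
        label: sum(w * p.get(label, 0.0) for w, p in zip(weights, predictions)) / total
        for label in labels
    }
    return max(combined, key=combined.get), combined


# Hypothetical outputs of a content-based and a link-based classifier.
content = {"sport": 0.6, "business": 0.4}
link = {"sport": 0.3, "business": 0.7}
equal_vote, _ = combine([content, link])
content_heavy, _ = combine([content, link], weights=[3.0, 1.0])
```

With equal weights the link-based classifier tips the decision to “business”; trusting the content classifier three times as much flips it back to “sport”, which shows why the choice of combination scheme matters.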
  • 85. Blog classification
  • 86. Take a break! Follow the trend! Everybody retweet!
  • 87. Follow me on Twitter: pChr. Also my blog: Http://www.PacharaStudio.com
  • 88. Blog classification: The word “blog” was originally a short form of “web log”. As blogging has gained popularity in recent years, an increasing amount of research about blogs has also been conducted. It can be broken into three types: blog identification (determining whether a web document is a blog), mood classification and genre classification.
  • 89. Blog classification: Elgersma and de Rijke 2006 applied common classification algorithms to blog identification, using a number of human-selected features, e.g. “Comments” and “Archives”; accuracy was around 90%. Mihalcea and Liu 2006 classified blogs into two polarities of mood, happiness and sadness (mood classification). Nowson 2006 discussed the distinction between three types of blogs (genre classification): news, commentary and journal.
  • 90. Blog classification: Qu et al. 2006 proposed automatic classification of blogs into four genres: personal diary, news, political and sports. Using a unigram tf-idf document representation and naive Bayes classification, Qu et al.’s approach can achieve an accuracy of 84%.
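A unigram tf-idf representation of the kind mentioned above can be computed as in the sketch below. It uses one common tf-idf variant (relative term frequency times log of inverse document frequency); the exact weighting used by Qu et al. is not given in the slides, and the three toy blog posts are hypothetical. The resulting vectors would then be passed to a classifier such as naive Bayes.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document; docs is a list of token lists."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


posts = [
    ["my", "day", "today"],
    ["election", "vote", "today"],
    ["match", "score", "today"],
]
vectors = tfidf(posts)
```

A word that appears in every post, such as “today”, gets weight zero, while genre-specific words such as “election” keep a positive weight.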
  • 91. Conclusion
  • 92. Conclusion: Webpage classification is a type of supervised learning problem that aims to categorize webpages into a set of predefined categories based on labeled training data. The authors expect that future web classification efforts will certainly combine content and link information in some form.
  • 93. Conclusion: Future work would be well advised to emphasize text and labels from siblings over other types of neighbors; to incorporate anchor text from parents; and to utilize other sources of (implicit or explicit) human knowledge, such as query logs and click-through behavior, in addition to existing labels to guide classifier creation.
  • 94. Thank you.
  • 95. Question?