6. documents vs web pages
• web pages have structure
• HTML indicates headings, paragraphs,
meta-information
• web pages are interconnected
• they contain hyperlinks to other pages
• they have locations (URLs)
12. n-gram representation
• each document is represented by a vector
of n-gram features
• concepts expressed by phrases can be
captured (e.g. “New York” vs “new” and
“york”)
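A minimal sketch of word n-gram extraction, assuming lowercasing and whitespace tokenisation (the slides do not say how text was tokenised). The function names `ngrams` and `ngram_features` are illustrative only.

```python
from collections import Counter

def ngrams(text, n):
    """Return the word n-grams of text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, max_n=2):
    """Count every n-gram for n = 1..max_n; the counts form the feature vector."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(text, n))
    return counts

features = ngram_features("New York is new")
# "new york" is kept as its own feature, separate from "new" and "york"
```

With bigrams included, the phrase "new york" gets its own feature, while the unigram "new" still counts both occurrences of the word.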
13. using html structure
• assign a weight to each HTML tag, and
compute each feature as a linear
combination of its weighted per-tag values
• e.g. headings would have a greater weight
• four main elements are considered: title,
headings, metadata and main text
Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and
metadata in automated subject classification." Research and Advanced Technology
for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
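One way to write the linear combination down, as a sketch: the symbols below are assumptions introduced here, not notation from the slides or the cited paper. Let $E$ be the four elements, $\mathrm{tf}_e(t, d)$ the number of times term $t$ occurs inside element $e$ of page $d$, and $w_e$ the weight assigned to element $e$.

```latex
f(t, d) = \sum_{e \in E} w_e \, \mathrm{tf}_e(t, d),
\qquad E = \{\text{title}, \text{headings}, \text{metadata}, \text{main text}\}
```

Giving headings a greater weight then means choosing $w_{\text{headings}} > w_{\text{main text}}$; the actual weight values are not given in the slides.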
14. visual analysis
• visual representation by web browser is
important
• each web page is visualised as an adjacency
multigraph, with each section representing
a different kind of content
Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel
approach for a Web page classification." Proceedings of
SAWM04 workshop, ECML2004. 2004.
15. URL features
• pages do not need to be fetched or
analysed
• fast!
• derives tokens from the URL and uses
these tokens as features
Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification
using URL features." Proceedings of the 14th ACM international
conference on Information and knowledge management. ACM, 2005.
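A minimal sketch of deriving tokens from a URL without fetching the page. It only splits on non-alphanumeric characters, which is a simplification and not necessarily the full method of the cited paper; the function name `url_tokens` and the example URL are hypothetical.

```python
import re
from urllib.parse import urlsplit

def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens; no network access needed."""
    parts = urlsplit(url)
    raw = " ".join([parts.netloc, parts.path, parts.query])
    return [tok for tok in re.split(r"[^a-z0-9]+", raw.lower()) if tok]

# Hypothetical URL, used only for illustration
tokens = url_tokens("http://www.cs.cornell.edu/courses/cs100/index.html")
# tokens == ["www", "cs", "cornell", "edu", "courses", "cs100", "index", "html"]
```

Tokens such as "courses" or "cs100" can then be used as features in the same way as words from the page text.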
17. dataset
• 4 universities dataset (cornell, texas,
washington, wisconsin)
• each page must be classified into a
category: course, department, faculty,
project, staff, student, other
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
18. document classification
single label classification: one and only one
class label is assigned to each instance
hard classification: an instance can either be
or not be in a particular class, with no
intermediate state
multi-class classification: instances are
divided among more than two categories
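The three properties combine into a simple decision rule, sketched here under the assumption that some classifier has already produced one score per class. The scores and the name `classify` are made up for illustration.

```python
def classify(scores):
    """Single-label, hard decision: return exactly one class, the top-scoring one."""
    return max(scores, key=scores.get)

# Hypothetical scores for one page over more than two classes (multi-class)
scores = {"course": 0.2, "faculty": 0.7, "student": 0.1}
label = classify(scores)  # "faculty"
```

The page is assigned to one class only, and the scores of the other classes are discarded rather than kept as partial memberships.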
20. experiment #1
bag of words
use the words, unweighted, as features
[figure: example words: assistant, CS, Dr, intern, 220, admission, Professor, room, research]
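The bag-of-words features for this experiment could be built roughly as follows. This is a sketch that assumes raw occurrence counts and a simple alphanumeric tokeniser; a binary present/absent variant would be an equally plausible reading of "unweighted". The vocabulary and sample text are invented.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count each lowercase alphanumeric token; position and markup are ignored."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def to_vector(bag, vocabulary):
    """Project a bag onto a fixed vocabulary so every page has the same dimensions."""
    return [bag[word] for word in vocabulary]

# Hypothetical vocabulary and page text
vocabulary = ["professor", "research", "room", "admission"]
bag = bag_of_words("Professor of CS. Research room: Room 220.")
vector = to_vector(bag, vocabulary)  # [1, 1, 2, 0]
```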
21. experiment #2
HTML tag weighting
use words weighted by the HTML tags (e.g.
words in <h1> tags will be weighted more
heavily than those in <p> tags)
[figure: example words: assistant, CS, Dr, intern, 220, admission, Professor, room, research]
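A sketch of tag-weighted word counting with the standard-library HTML parser. The weights in `TAG_WEIGHTS` are hypothetical (the slides do not list the values used), and treating text outside any listed tag as weight 1.0 is also an assumption.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; the slides do not give the values used
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "h2": 2.0, "p": 1.0}

class WeightedWords(HTMLParser):
    """Accumulate, per word, the weight of the innermost weighted tag around it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open tags
        self.weights = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; ignore stray end tags
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Innermost open tag that has a weight, else the default of 1.0
        weight = next((TAG_WEIGHTS[t] for t in reversed(self.open_tags)
                       if t in TAG_WEIGHTS), 1.0)
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[word] += weight

page = ("<html><head><title>CS Faculty</title></head><body>"
        "<h1>Research</h1><p>Faculty research room</p></body></html>")
parser = WeightedWords()
parser.feed(page)
parser.close()
```

In this made-up page, "faculty" scores 5.0 (4.0 from the title plus 1.0 from the paragraph), while "room", which appears only in the paragraph, scores 1.0.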
22. experiment #3
n-gram
use phrases instead of single words as features
[figure: example phrases: research assistant, contact information, program description, course outline]
25. bibliography
Choi, B., and Z. Yao. "Web page classification." StudFuzz 180 (2005): 221-274.
Qi, Xiaoguang, and Brian D. Davison. "Web page classification: Features and
algorithms." ACM Computing Surveys (CSUR) 41.2 (2009): 12.
Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and
metadata in automated subject classification." Research and Advanced Technology
for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL
features." Proceedings of the 14th ACM international conference on Information
and knowledge management. ACM, 2005.
Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web
page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.