Machine Learning Web Page Classification Features

Machine Learning in DIADEM

Reading Course Presentation

Andrey Kravchenko

20th of January, 2010

Current area of research

Real estate page classification

Current area of research

Input and output page distinction

Current area of research

Page element classification

The Reading List

Papers not included in this presentation

• “An interactive clustering-based approach to integrating source query interfaces on the Deep Web”
  ◦ This paper is concerned with input forms.

• “Automatic wrapper induction from hidden-web sources with domain knowledge”
  ◦ Only a part of the paper deals with the output pages. Their methodology for processing the output pages is based on gazetteers and is thus closer to linguistics than ML.

• “Web scale extraction of structured data”
  ◦ Deals with the whole Web.

• “An adaptive information extraction system based on wrapper induction with POS tagging”
  ◦ The labels are of very low granularity (e.g. work_name, work_location) and of linguistic nature. The comparison is done against linguistics systems such as Rapier (another excluded paper on the reading list), GATE-SVM, etc. Introducing POS tagging provides only a 5% gain in accuracy, and only for some target slots for one corpus, with no gain for the other two.

The Reading List

Papers not included in this presentation

• “Learning (k,l)-contextual tree languages for information extraction from Web pages”
  ◦ The paper deals with learning an extraction language rather than extraction itself.

• “Bottom-up relational learning of problem matching rules for Information Retrieval”
  ◦ Deals with textual documents only.

• “Learning rules to pre-process Web data for automatic integration”
  ◦ Relies on web data extraction and alignment phases performed by the VIPER system that are not described in the paper. I wasn’t able to detect any ML involved in the stage of rule learning. No clear description of practical results. Low-level granularity of labels.

• “Learning rules for information extraction”
  ◦ Is not HTML/DOM specific.

The Reading List

Papers included in this presentation

#1 “Web-page classification: features and algorithms” - 2007

#2 “Web page element classification based on visual features”

#3 “Stylistic and lexical co-training for Web-block classification”

#4 “Can we learn a template-independent wrapper for news article extraction from a single training site?”

#5 “Efficient record-level wrapper induction”

#6 “Towards combining Web classification and Web Information Extraction: a case study”

Paper #1

Web page classification: features and algorithms

X. Qi and B. Davison (Lehigh University, 2007)

• The paper distinguishes between four types of classification;
• They also distinguish between subject classification, functional classification, sentiment classification, and other types of classification;
• The paper distinguishes between on-page features and the features of the neighbours;
• On-page features:
  ◦ Textual analysis: bag of words vs n-gram;
  ◦ Visual analysis: the multigraph approach.
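
The difference between the bag-of-words and n-gram representations mentioned above can be illustrated with a minimal sketch. The whitespace tokenizer, the function names and the sample sentence are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(text):
    """Count each word independently, ignoring word order."""
    return Counter(tokenize(text))


def word_ngrams(text, n=2):
    """Count every run of n consecutive words, which preserves some local word order."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical page text, used only to show the two representations side by side.
page_text = "two bedroom flat for sale in Oxford"
unigrams = bag_of_words(page_text)
bigrams = word_ngrams(page_text, n=2)
```

The bigram counter keeps pairs such as ("for", "sale") that the bag of words cannot express, at the cost of a much larger feature space.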

Paper #1

Web page classification: features and algorithms

X. Qi and B. Davison (Lehigh University, 2007)

• When using the features of neighbouring pages, the authors distinguish between the weak assumption and the strong assumption;
• They also distinguish between different types of neighbours: parents/children, grandparents/grandchildren and siblings/spouses;
• It appears that siblings are the most important neighbours;
• Various features are used for different types of neighbouring pages;
• Algorithm survey: dimension reduction and relational learning approaches.
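
The neighbour types listed above can be made concrete with a small sketch over a directed link graph. It assumes the usual reading of the terms: siblings share a parent with the page, and spouses share a child with it. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def neighbour_sets(links, page):
    """Return the neighbour groups of `page` in a directed link graph.

    `links` is an iterable of (source, target) pairs, meaning source links to target.
    """
    children = defaultdict(set)
    parents = defaultdict(set)
    for source, target in links:
        children[source].add(target)
        parents[target].add(source)

    def step(nodes, relation):
        # Follow one hop of `relation` from every node in `nodes`.
        reached = set()
        for node in nodes:
            reached |= relation[node]
        return reached

    page_parents = set(parents[page])
    page_children = set(children[page])
    return {
        "parents": page_parents,
        "children": page_children,
        "grandparents": step(page_parents, parents) - {page},
        "grandchildren": step(page_children, children) - {page},
        "siblings": step(page_parents, children) - {page},  # share a parent
        "spouses": step(page_children, parents) - {page},   # share a child
    }


# Hypothetical toy graph: root -> hub -> {a, b}, and both a and b link to c.
links = [("root", "hub"), ("hub", "a"), ("hub", "b"), ("a", "c"), ("b", "c")]
groups = neighbour_sets(links, "a")
```

In this toy graph, page "b" is both a sibling of "a" (common parent "hub") and a spouse of "a" (common child "c").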

Paper #2

Web page element classification based on visual features

R. Burget and I. Rudolfova (Brno University, 2009)

• Problem: Classification of elements from a web page based on its visual rendering;
• Assumptions: A tagged corpus, DOM tree, CSSBox layout;
• Approach: Page segmentation followed by block classification performed via Weka’s J48 decision tree classifier;
• Features: Font features, spatial features, text features, colour features;
• Evaluation: News domain. Average F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.
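
Since the evaluations in this presentation are reported as F1 measures, a minimal sketch of the per-label computation may be useful. F1 is the harmonic mean of precision and recall; the label names and the toy predictions below are hypothetical and are not results from any of the papers.

```python
def f1_score(predicted, actual, label):
    """Compute the F1 measure for one label from parallel lists of labels."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == label and a == label)
    false_pos = sum(1 for p, a in pairs if p == label and a != label)
    false_neg = sum(1 for p, a in pairs if p != label and a == label)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical block labels for five page elements.
actual = ["title", "body", "body", "other", "title"]
predicted = ["title", "body", "other", "other", "body"]
body_f1 = f1_score(predicted, actual, "body")
```

For the "body" label there is one true positive, one false positive and one false negative, so precision, recall and F1 are all 0.5.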

Paper #2

Web page element classification based on visual features

R. Burget and I. Rudolfova (Brno University, 2009)

• The approach of this paper is split into two phases:
  ◦ Page segmentation;
  ◦ Page element classification;
• Page segmentation is done in four phases:
  ◦ Page rendering;
  ◦ Detecting basic visual areas;
  ◦ Text line detection;
  ◦ Block detection;
• As a result of page segmentation we obtain a tree of areas.

Paper #2

Web page element classification based on visual features

R. Burget and I. Rudolfova (Brno University, 2009)

• The actual page element classification is performed for each area via Weka’s J48 decision tree classifier based on the following set of features:
  ◦ Font features {fontsize, weight};
  ◦ Spatial features {aabove, abelow, aleft, aright};
  ◦ Text features {tdigits, tlower, tupper, tspaces, tlength};
  ◦ Colour features {contrast}.
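
As an illustration of the text feature group, the sketch below computes one plausible reading of tdigits, tlower, tupper, tspaces and tlength for the text of a single area. The slides only give the feature names, so the definitions used here (character-class fractions plus the raw length) are an assumption, and the sample string is hypothetical.

```python
def text_features(text):
    """Compute simple character-level text features for one visual area.

    Assumption: tdigits, tlower, tupper and tspaces are the fractions of digit,
    lowercase, uppercase and whitespace characters; tlength is the text length.
    """
    length = len(text)
    if length == 0:
        return {"tdigits": 0.0, "tlower": 0.0, "tupper": 0.0, "tspaces": 0.0, "tlength": 0}
    return {
        "tdigits": sum(c.isdigit() for c in text) / length,
        "tlower": sum(c.islower() for c in text) / length,
        "tupper": sum(c.isupper() for c in text) / length,
        "tspaces": sum(c.isspace() for c in text) / length,
        "tlength": length,
    }


# Hypothetical text of a price element on a property page.
features = text_features("Price: 250000 GBP")
```

A price-like element yields a high digit fraction, which is the kind of signal a decision tree can split on.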

Paper #2

Web page element classification based on visual features

R. Burget and I. Rudolfova (Brno University, 2009)

Results

The set of labels (the testing pages from a different source than the training pages)

Paper #3

Stylistic and Lexical Co-training for Web Block Classification

C. Lee et al. (National University of Singapore, 2004)

• Problem: Classification of elements from a web page based on both stylistic and lexical features;
• Assumptions: A tagged corpus, DOM tree, CSSBox layout;
• Approach: Web block division followed by co-training with BoosTexter, an ensemble learning method in which each weak learner is a decision stump;
• Features: Lexical and stylistic;
• Evaluation: News domain. Average F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.

Paper #3

Stylistic and Lexical Co-training for Web Block Classification

C. Lee et al. (National University of Singapore, 2004)

• The authors aim to combine two different classifiers with distinct sets of features (lexical and stylistic);
• They’ve created a PARser for Content Extraction and Layout Structure (PARCELS);
• Web page division: the authors differentiate between structural tags and content tags.

Paper #3

Stylistic and Lexical Co-training for Web Block Classification

C. Lee et al. (National University of Singapore, 2004)

• The authors distinguish between labels of different levels of granularity. They define 17 tags for labelling;
• Stylistic features:
  ◦ Linear structure: paragraph (<p>), header (<h1>-<h6>) and rule tags (<hr>);
  ◦ Table structure: cell flow, neighbouring cells’ data, the position of table cells;
  ◦ XHTML/CSS structure: height, width, z-index;
  ◦ Font features: colour, weight, family, size, hyperlink features;
  ◦ Images: size, number of images within a block;
• Lexical features:
  ◦ Low-level features: count and vocabulary of the words present in the text block;
  ◦ High-level features: POS-tags, mailto-links, image-links, text-links, total-links;
• BoosTexter is used for co-training. It is an ensemble learning method in which each weak learner is a decision stump.
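
To show what boosting over decision stumps looks like, here is a minimal sketch of plain binary AdaBoost on a single numeric feature. It is a simplified stand-in and not BoosTexter itself, which handles multi-class, multi-label text data; the function names and the toy data are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """A decision stump: one threshold test on one feature, returning +1 or -1."""
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted training error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Train a weighted ensemble of stumps, re-weighting misclassified examples each round."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples gain weight, correctly classified ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Classify x by the sign of the weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate: the negative class sits in the middle.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
training_predictions = [predict(ensemble, x) for x in xs]
```

Each stump on its own misclassifies two of the six points, but the weighted vote of three stumps fits the whole training set, which is the reason weak learners this simple are usable inside an ensemble.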

Paper #4

Can we learn a template-independent wrapper for news article extraction from a single training site?

J. Wang et al. (2009, Zhejiang University, MS Research)

• Problem: Classification of titles and bodies of news taken from webpages belonging to the news domain;
• Assumptions: A tagged corpus, DOM tree, CSSBox layout;
• Approach: SVM; the decision function gets converted to a posterior probability;
• Features: Different sets of features for body and title extraction. Features are divided into content and spatial features;
• Evaluation: Overall 99% extraction accuracy.

Paper #4

Can we learn a template-independent wrapper for news article extraction from a single training site?

J. Wang et al. (2009, Zhejiang University, MS Research)

• The aim of the paper is to efficiently extract and then combine titles and bodies of news articles;
• The main problem is in dealing with various kinds of noise around the titles.

Paper #4

Can we learn a template-independent wrapper for news article extraction from a single training site?

J. Wang et al. (2009, Zhejiang University, MS Research)

• News body extraction:
  ◦ Content features: FormattingElementsNum and FormattedContentLen;
  ◦ Spatial features: normalised RectLeft, RectTop, RectWidth and RectHeight;
  ◦ News body extraction heuristics: TopInScreen(T) and BigEnough(T);
• News title extraction:
  ◦ Content features: FontSize, EndWithFullStop, WordNum;
  ◦ Spatial features: RectLeft, RectTop, RectWidth, RectHeight, Overlap, Distance, Flat;
  ◦ News title extraction heuristics: WholeInScreen(T), NoAnchorText(T), NotCategoryName(T);
• An SVM approach is chosen for classification. The decision function gets converted to a posterior probability.
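
The slides do not say how the SVM decision function is converted to a posterior probability. One common choice, shown here purely as an assumption, is Platt's sigmoid, P(y = 1 | f) = 1 / (1 + exp(A f + B)), where A and B are normally fitted on held-out data. The parameter values below are hypothetical placeholders, not fitted values.

```python
import math


def platt_probability(decision_value, a=-1.0, b=0.0):
    """Map an SVM decision value f to P(y = 1 | f) = 1 / (1 + exp(a * f + b)).

    The defaults a=-1.0 and b=0.0 are placeholders; in practice both parameters
    are estimated from held-out decision values and their true labels.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Decision values on the boundary and on either side of it.
p_boundary = platt_probability(0.0)
p_positive = platt_probability(2.0)
p_negative = platt_probability(-2.0)
```

With a negative value of A the mapping is monotonically increasing, so candidates further on the positive side of the margin receive higher probabilities, which makes scores comparable when choosing the single most likely title or body candidate.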

Paper #4

Can we learn a template-independent wrapper for news article extraction from a single training site?

J. Wang et al. (2009, Zhejiang University, MS Research)

Testing results on the large-scale experiment

Extraction results

Paper #5

Efficient record-level wrapper induction

S. Zheng et al. (Pennsylvania State University, 2009)

• Problem: Efficient extraction of records from Web pages and classification of their elements;
• Assumptions: A tagged corpus, DOM tree;
• Approach: Alignment of the DOM subtree and the possible wrappers;
• Features: None;
• Evaluation: Four different domains (online shops, user reviews, digital libraries, search results). Seven detail page datasets and eleven list page datasets. A 99% F1 value.

Paper #5

Efficient record-level wrapper induction

S. Zheng et al. (Pennsylvania State University, 2009)

• The paper is concerned with extracting records and their respective attributes;
• The key distinction from other approaches is record-level extraction as opposed to page-level extraction;
• The authors propose a novel broom structure for this task;
• The broom structure has a head and a stick;
• One of the main issues is crossing records.

Paper #5

Efficient record-level wrapper induction

S. Zheng et al. (Pennsylvania State University, 2009)

• The general architecture of the system involves training and testing phases.

Paper #5

Efficient record-level wrapper induction

S. Zheng et al. (Pennsylvania State University, 2009)

• The authors claim to achieve remarkable extraction accuracy and a significant improvement in running time.

Paper #6

Towards combining Web classification and Web Information Extraction: a case study

P. Luo et al. (HP Labs China, 2009)

• Problem: Combination of the classification of web pages based on their relevance to a specific domain with the extraction of their specific elements, using both forward and backward dependencies;
• Assumptions: A tagged corpus, DOM tree;
• Approach: Conditional Random Fields (CRFs);
• Features: Course terms and heuristics for course homepage detection; format, position and content features for course title extraction;
• Evaluation: OfCourse system for online course information extraction. 90% F1 value for course page classification, 83% F1 value for course title extraction.

Paper #6

Towards combining Web classification and Web Information Extraction: a case study

P. Luo et al. (HP Labs China, 2009)

• The authors propose a method that utilises both forward and backward dependencies between Web classification and information extraction;
• The authors use a unified graphical CRF model for joint and simultaneous optimisation of these two steps;
• This methodology has been used for building the OfCourse online search engine;
• In their results for OfCourse the authors claim that their model significantly outperforms the two baseline methods;
• Drawbacks: they only deal with DOM leaf nodes as classification variables for the information extraction phase.

Lessons learnt from the Reading Course

#1 “Web page classification: features and algorithms” by X. Qi and B. Davison (2007): the importance of the features of neighbouring pages;

#2 “Web page element classification based on visual features” by R. Burget and I. Rudolfova (2009): a broad set of visual features (font features, spatial features, text features and colour features);

#3 “Stylistic and Lexical Co-training for Web Block Classification” by C. Lee et al. (2004): a useful web block division algorithm, and the possibility of co-training on the same corpus using two distinct sets of features;

Lessons learnt from the Reading Course

#4 “Can we learn a template-independent wrapper for news article extraction from a single training site?” by J. Wang et al. (2009): a distinctive set of features for news title extraction, many of which can be used for property title extraction in DIADEM;

#5 “Efficient record-level wrapper induction” by S. Zheng et al. (2009): a new record-level approach to extraction. Performs much better and faster than the page-level approaches. Can be useful for DIADEM extraction in record-heavy domains;

#6 “Towards combining Web classification and Web Information Extraction: a case study” by P. Luo et al. (2009): the backward dependency between these two tasks can be exploited as well. Thus it is worthwhile to experiment with coupling them.

General lessons learnt

• Most of the papers are recent or very recent (2004-2009);
• Features play a much more important role than algorithms;
• Initial page segmentation into blocks can help with subsequent determination of relevant DOM subtrees;
• All features can be broadly divided into content features and visual features;
• The news domain is a very popular one (3 out of 5 reviewed systems). No mention of real estate in any of the papers.

Summary of the Reading Course and its relevance to DIADEM

• The six proposed papers are of relevance to all three areas of my current research:
  ◦ Real estate page classification;
  ◦ Output/Input page distinction;
  ◦ Property page elements’ classification;
• The most obvious synergy is with Omer’s NLP work, although overlaps with Cheng’s and Xiaonan’s work are also possible;
• I plan to use a subset of the features presented in these papers in the classification of the elements of output pages and in subsequent real estate page classification.

Thank you for your attention!
