5. The Reading List Papers not included in this presentation
• “An interactive clustering-based approach to integrating source query interfaces on the Deep Web”
  • This paper is concerned with input forms.
• “Automatic wrapper induction from hidden-web sources with domain knowledge”
  • Only a part of the paper deals with the output pages. Their methodology for processing the output pages is based on gazetteers and is thus closer to linguistics than ML.
• “Web scale extraction of structured data”
  • Deals with the whole Web.
• “An adaptive information extraction system based on wrapper induction with POS tagging”
  • The labels are of very low granularity (e.g. work_name, work_location) and of linguistic nature. The comparison is done against linguistics systems such as Rapier (another excluded paper on the reading list), GATE-SVM, etc. Introducing POS tagging provides only a 5% gain in accuracy, and only for some target slots in one corpus; there is no gain for the other two.
6. The Reading List Papers not included in this presentation
• “Learning (k,l)-contextual tree languages for information extraction from Web pages”
  • The paper deals with learning an extraction language rather than extraction itself.
• “Bottom-up relational learning of pattern matching rules for information extraction”
  • Deals with textual documents only.
• “Learning rules to pre-process Web data for automatic integration”
  • Relies on web data extraction and alignment phases performed by the VIPER system that are not described in the paper. I wasn’t able to detect any ML involved in the stage of rule learning. No clear description of practical results. Low-level granularity of labels.
• “Learning rules for information extraction”
  • Is not HTML/DOM specific.
7. The Reading List Papers included in this presentation
#1 “Web-page classification: features and algorithms” - 2007
#2 “Web page element classification based on visual features”
#3 “Stylistic and lexical co-training for Web-block classification”
#4 “Can we learn a template-independent wrapper for news article extraction from a single training site?”
#5 “Efficient record-level wrapper induction”
#6 “Towards combining Web classification and Web Information Extraction: a case study”
8. Paper #1
Web page classification: features and algorithms
X. Qi and B. Davison (Lehigh University, 2007)
• The paper distinguishes between four types of classification;
• They also distinguish between subject classification, functional classification, sentiment classification, and other types of classification;
• The paper distinguishes between on-page features and the features of the neighbours;
• On-page features:
  • Textual analysis: bag of words vs n-gram;
  • Visual analysis: the multigraph approach.
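The difference between the two textual representations can be shown with a small sketch. It is only an illustration under simple assumptions: the whitespace tokenizer, the function names and the sample text are hypothetical and are not taken from the paper.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(text):
    """Count individual tokens; word order is ignored."""
    return Counter(tokenize(text))


def word_ngrams(text, n=2):
    """Count runs of n consecutive tokens, which keeps local word order."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical page text, used only to show the two representations side by side.
page_text = "new york real estate new york listings"
print(bag_of_words(page_text))
print(word_ngrams(page_text, n=2))
```

The bag of words only records that "new" and "york" each occur twice, whereas the bigram counts keep the phrase ("new", "york") as a single feature.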
10. Paper #1
Web page classification: features and algorithms
X. Qi and B. Davison (Lehigh University, 2007)
• When using the features of neighbouring pages, the authors distinguish between the weak assumption and the strong assumption;
• They also distinguish between different types of neighbours: parents/children, grandparents/grandchildren and siblings/spouses;
• It appears that siblings are the most important neighbours;
• There are various features used for different types of neighbouring pages;
• Algorithm survey: dimension reduction and relational learning approaches.
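The neighbour types listed above can be derived from a plain link graph. The sketch below assumes the usual reading of these terms: parents link to the page, children are linked from it, siblings share a parent with it, and spouses share a child with it. The graph, the page names and the function name are hypothetical.

```python
def neighbours(links, page):
    """Return the neighbour sets of `page`.

    `links` maps every page to the set of pages it links to.
    The definitions of sibling and spouse are assumptions stated in the lead-in.
    """
    children = set(links.get(page, set()))
    parents = {p for p, outs in links.items() if page in outs}

    # Siblings: other pages linked from any of this page's parents.
    siblings = set()
    for parent in parents:
        siblings |= links[parent]
    siblings.discard(page)

    # Spouses: other pages that link to at least one of this page's children.
    spouses = {p for p, outs in links.items() if outs & children}
    spouses.discard(page)

    return {
        "parents": parents,
        "children": children,
        "siblings": siblings,
        "spouses": spouses,
    }


# Hypothetical toy graph: "hub" links to "a" and "b"; "a" and "d" both link to "c".
toy_links = {
    "hub": {"a", "b"},
    "a": {"c"},
    "b": set(),
    "c": set(),
    "d": {"c"},
}
print(neighbours(toy_links, "a"))
```

Grandparents and grandchildren follow by applying the parent or child step twice.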
11. Paper #2
Web page element classification based on visual features
R. Burget and I. Rudolfova (Brno University, 2009)
• Problem: Classification of elements from a web page based on its visual rendering;
• Assumptions: A tagged corpus, DOM tree, CSSBox layout;
• Approach: Page segmentation followed by block classification performed via Weka’s J48 decision tree classifier;
• Features: Font features, spatial features, text features, colour features;
• Evaluation: News domain. Average F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.
12. Paper #2
Web page element classification based on visual features
R. Burget and I. Rudolfova (Brno University, 2009)
• The approach of this paper is split into two phases:
  • Page segmentation;
  • Page element classification;
• Page segmentation is done in four phases:
  • Page rendering;
  • Detecting basic visual areas;
  • Text line detection;
  • Block detection;
• As a result of page segmentation we obtain a tree of areas.
13. Paper #2
Web page element classification based on visual features
R. Burget and I. Rudolfova (Brno University, 2009)
• The actual page element classification is performed for each area via Weka’s J48 decision tree classifier based on the following set of features:
  • Font features {fontsize, weight};
  • Spatial features {aabove, abelow, aleft, aright};
  • Text features {tdigits, tlower, tupper, tspaces, tlength};
  • Colour features {contrast}.
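As an illustration of the text feature group, the sketch below computes one plausible version of these features for the text of a single area. The slide gives only the feature names, so the exact definitions used here (character-class ratios plus the raw length) are an assumption, as is the function name.

```python
def text_features(text):
    """Compute assumed text features for the text content of one visual area.

    tdigits, tlower, tupper and tspaces are taken to be the fraction of
    characters in each class; tlength is the number of characters.
    These definitions are an assumption, not quoted from the paper.
    """
    n = len(text)
    if n == 0:
        return {"tdigits": 0.0, "tlower": 0.0, "tupper": 0.0,
                "tspaces": 0.0, "tlength": 0}
    return {
        "tdigits": sum(c.isdigit() for c in text) / n,
        "tlower": sum(c.islower() for c in text) / n,
        "tupper": sum(c.isupper() for c in text) / n,
        "tspaces": sum(c.isspace() for c in text) / n,
        "tlength": n,
    }


# A date line and a headline tend to differ sharply on the digit ratio.
print(text_features("12 May 2009"))
print(text_features("Government announces new budget"))
```

Feature vectors of this kind, joined with the font, spatial and colour features, would then be passed to a decision tree learner.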
14. Paper #2
Web page element classification based on visual features
R. Burget and I. Rudolfova (Brno University, 2009)
[Figure: the set of labels]
[Figure: results (the testing pages come from another source than the training pages)]
15. Paper #3
Stylistic and Lexical Co-training for Web Block Classification
C. Lee et al. (National University of Singapore, 2004)
• Problem: Classification of elements from a web page based on both stylistic and lexical features;
• Assumptions: A tagged corpus, DOM tree, CSSBox layout;
• Approach: Web block division followed by co-training with BoosTexter, an ensemble learning method with a decision stump corresponding to a single weak learner;
• Features: Lexical and stylistic;
• Evaluation: News domain. Average F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.
16. Paper #3
Stylistic and Lexical Co-training for Web Block Classification
C. Lee et al. (National University of Singapore, 2004)
• The authors aim to combine two different classifiers with distinct sets of features (lexical and stylistic);
• They’ve created a PARser for Content Extraction and Layout Structure (PARCELS);
• Web page division: the authors differentiate between structural tags and content tags.
18. Paper #3
Stylistic and Lexical Co-training for Web Block Classification
C. Lee et al. (National University of Singapore, 2004)
• The authors distinguish between labels of different levels of granularity. They define 17 tags for labelling;
• Stylistic features:
  • Linear structure: paragraph (<p>), header (<h1>-<h6>) and rule tags (<hr>);
  • Table structure: cell flow, neighbouring cells’ data, the position of table cells;
  • XHTML/CSS structure: height, width, z-index;
  • Font features: colour, weight, family, size, hyperlink features;
  • Images: size, number of images within a block;
• Lexical features:
  • Low-level features: count and vocabulary of the words present in the text block;
  • High-level features: POS-tags, mailto-links, image-links, text-links, total-links;
• BoosTexter is used for co-training. It is an ensemble learning method with a decision stump corresponding to a single weak learner.
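To make "an ensemble of decision stumps" concrete, here is a minimal sketch of binary AdaBoost with threshold stumps. It is a simplification: BoosTexter itself uses multi-class, multi-label boosting variants, and the co-training loop between the lexical and stylistic views is not shown. The function names and the toy data (font size and a link flag per block) are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Pick the one-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    """Apply a stump (err, feature index, threshold, sign) to one example."""
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign


def adaboost(X, y, rounds=5):
    """Train a weighted ensemble of stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the logarithm below stays finite.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical blocks: [font size, contains a link]; +1 = title block, -1 = other.
X = [[12, 0], [14, 1], [24, 0], [28, 1]]
y = [-1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
print([predict(model, row) for row in X])
```

In a co-training setting, one such ensemble would be trained per feature view, and each would label additional unlabelled blocks for the other.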
21. Paper #4
Can we learn a template-independent wrapper for news article extraction from a single training site?
J. Wang et al. (2009, Zhejiang University, MS Research)
• Problem: Classification of titles and bodies of news taken from the webpages belonging to the news domain;
• Assumptions: A tagged corpus, DOM tree, CSSBox layout;
• Approach: SVM; the decision function gets converted to a posterior probability;
• Features: Different sets of features for body and title extraction. Features are divided into content and spatial features;
• Evaluation: Overall 99% extraction accuracy.
22. Paper #4
Can we learn a template-independent wrapper for news article extraction from a single training site?
J. Wang et al. (2009, Zhejiang University, MS Research)
• The aim of the paper is to efficiently extract and then combine titles and bodies of news articles;
• The main problem is dealing with the various kinds of noise around the titles.
23. Paper #4
Can we learn a template-independent wrapper for news article extraction from a single training site?
J. Wang et al. (2009, Zhejiang University, MS Research)
• News body extraction:
  • Content features: FormattingElementsNum and FormattedContentLen;
  • Spatial features: normalised RectLeft, RectTop, RectWidth and RectHeight;
  • News body extraction heuristics: TopInScreen(T) and BigEnough(T);
• News title extraction:
  • Content features: FontSize, EndWithFullStop, WordNum;
  • Spatial features: RectLeft, RectTop, RectWidth, RectHeight, Overlap, Distance, Flat;
  • News title extraction heuristics: WholeInScreen(T), NoAnchorText(T), NotCategoryName(T);
• An SVM approach is chosen for classification. The decision function gets converted to a posterior probability.
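One common way to turn an SVM decision value into a posterior probability is a Platt-style sigmoid. The slides do not say which mapping the authors use, so the sketch below is an assumption; the parameters `a` and `b` are hypothetical defaults that would normally be fitted on held-out data, and the candidate texts and scores are made up.

```python
import math


def sigmoid_posterior(decision_value, a=-1.0, b=0.0):
    """Map an SVM decision value f to P(y = 1 | f) = 1 / (1 + exp(a * f + b)).

    With the placeholder defaults a = -1, b = 0 this is the plain logistic
    function; fitted values of a and b would replace them in practice.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def best_candidate(candidates):
    """Return the (text, decision_value) pair with the highest posterior."""
    return max(candidates, key=lambda c: sigmoid_posterior(c[1]))


# Hypothetical title candidates with made-up SVM decision values.
candidates = [("Home", -0.4), ("Breaking news headline", 1.7), ("Sports", 0.2)]
for text, value in candidates:
    print(text, round(sigmoid_posterior(value), 3))
print(best_candidate(candidates))
```

Working with probabilities rather than raw margins makes scores from the separate title and body classifiers easier to compare and threshold.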
24. Paper #4
Can we learn a template-independent wrapper for news article extraction from a single training site?
J. Wang et al. (2009, Zhejiang University, MS Research)
[Figure: testing results on the large scale experiment]
[Figure: extraction results]
25. Paper #5
Efficient record-level wrapper induction
S. Zheng et al. (Pennsylvania State University, 2009)
• Problem: Efficient extraction of records from Web pages and classification of their elements;
• Assumptions: A tagged corpus, DOM tree;
• Approach: Alignment of the DOM subtree and the possible wrappers;
• Features: None;
• Evaluation: Four different domains (online shops, user reviews, digital libraries, search results). Seven detail page datasets and eleven list page datasets. A 99% F1 value.
26. Paper #5
Efficient record-level wrapper induction
S. Zheng et al. (Pennsylvania State University, 2009)
• The paper is concerned with extracting records and their respective attributes;
• The key distinction from other approaches is record-level extraction as opposed to page-level extraction;
• The authors propose a novel broom structure for this task;
• The broom structure has a head and a stick;
• One of the main issues is crossing records.
28. Paper #5
Efficient record-level wrapper induction
S. Zheng et al. (Pennsylvania State University, 2009)
• The general architecture of the system involves training and testing phases.
29. Paper #5
Efficient record-level wrapper induction
S. Zheng et al. (Pennsylvania State University, 2009)
• The authors claim to achieve remarkable extraction accuracy and a significant improvement in running time.
30. Paper #6
Towards combining Web classification and Web Information Extraction: a case study
P. Luo et al. (HP Labs China, 2009)
• Problem: Combination of web page classification based on their relevance to a specific domain with the extraction of its specific elements, using both forward and backward dependencies;
• Assumptions: A tagged corpus, DOM tree;
• Approach: Conditional Random Fields (CRFs);
• Features: Course terms and heuristics for course homepage detection; format, position and content features for course title extraction;
• Evaluation: OfCourse system for online course information extraction. 90% F1 value for course page classification, 83% F1 value for course title extraction.
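For reference, the F1 values quoted throughout these slides are the harmonic mean of precision and recall. The sketch below shows the standard computation from raw counts; the counts in the example are made up and are not taken from any of the papers.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return F1 = 2 * P * R / (P + R), or 0.0 when it is undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 83 titles extracted correctly, 17 wrong, 17 missed.
print(round(f1_score(83, 17, 17), 2))
```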
31. Paper #6
Towards combining Web classification and Web Information Extraction: a case study
P. Luo et al. (HP Labs China, 2009)
• The authors propose a method that utilises both forward and backward dependencies between Web classification and information extraction;
• The authors use a unified graphical CRF model for joint and simultaneous optimisation of these two steps;
• This methodology has been used for building the OfCourse online search engine;
• In their results for OfCourse the authors claim that their model significantly outperforms the two baseline methods;
• Drawbacks: they only deal with DOM leaf nodes as classification variables for the information extraction phase.
33. Lessons learnt from the Reading Course
#1 “Web page classification: features and algorithms” by X. Qi and B. Davison (2007): the importance of the features of neighbouring pages;
#2 “Web page element classification based on visual features” by R. Burget and I. Rudolfova (2009): a broad set of visual features (font features, spatial features, text features and colour features);
#3 “Stylistic and Lexical Co-training for Web Block Classification” by C. Lee et al. (2004): a useful web block division algorithm, and the possibility of co-training on the same corpus using two distinct sets of features;
34. Lessons learnt from the Reading Course
#4 “Can we learn a template-independent wrapper for news article extraction from a single training site?” by J. Wang et al. (2009): a distinctive set of features for news title extraction, many of which can be used for property title extraction in DIADEM;
#5 “Efficient record-level wrapper induction” by S. Zheng et al. (2009): a new record-level approach for extraction. Performs much better and faster than the page-level approaches. Can be useful for DIADEM extraction in the record-heavy domains;
#6 “Towards combining Web classification and Web Information Extraction: a case study” by P. Luo et al. (2009): the backward dependency between these two tasks can work as well. Thus it is worthwhile to experiment with their mutual tie-up.
35. General lessons learnt
• Most of the papers are recent or very recent (2004-2009);
• Features play a much more important role than algorithms;
• Initial page segmentation into blocks can help with subsequent determination of relevant DOM subtrees;
• All features can be broadly divided into content features and visual features;
• The news domain is a very popular one (3 out of 5 reviewed systems). No mention of real estate in any of the papers.
36. Summary of the Reading Course and its relevance to DIADEM
• The six proposed papers are of relevance to all three areas of my current research:
  • Real estate page classification;
  • Output/Input page distinction;
  • Property page elements’ classification;
• The most obvious synergy is with Omer’s NLP work, although overlaps with Cheng’s and Xiaonan’s work are also possible;
• I plan to use a subset of the features presented in these papers in the classification of the elements of output pages and subsequent real estate page classification.