Joint Repairs for Web Wrappers
Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, and Tim Furche
ICDE, Helsinki - May 19, 2016
Schindler’s List
Lawrence of Arabia (re-release)
Le cercle Rouge (re-release)
Director: Steven Spielberg Rating: R Runtime: 195 min
Director: David Lean Rating: PG Runtime: 216 min
Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
Director: David R Lynch Rating: Not Rated Runtime: 123 min
SUFFIX = substring-before("_(")
PREFIX = substring-after("tor:_")
SUFFIX = substring-before("_Rat")
PREFIX = substring(string-length()-7)
WADaR induces regular expressions by looking, e.g., at
common prefixes, suffixes, non-content strings, and character
length.
Induced expressions improve recall
token value1 token token token token
token token value2 token token token
token token token value3 token token
When WADaR cannot induce regular expressions (not enough
regularity), data is repaired directly with annotators. Wrappers
are instead repaired with value-based expressions, i.e.,
disjunction of the annotated values.
ATTRIBUTE = string-contains("value1"|"value2"|"value3")
WADaR is highly robust to errors of the NERs.
Optimality
WADaR provably produces relations of maximum fitness,
provided that the number of correctly annotated tuples is more
than the maximum error rate of the annotators.
Background: Web wrapping
refcode postcode bedrooms bathrooms available price
33453 OX2 6AR 3 2 15/10/2013 £1280 pcm
33433 OX4 7DG 2 1 18/04/2013 £995 pcm
Process of turning semi-structured (templated) web data into structured form
Hidden databases are actually a form of dark / dim data (ref. panel on Tuesday)
manual / (semi) supervised
accurate
expensive + non-scalable
unsupervised
less accurate
cheaper + scalable
Wrapidity
Background: Web wrapping
From (manually or automatically) created examples to XPath-based wrappers
Even on templated websites, automatic wrapping can be inaccurate
Pairs <field,expression> that, once applied to the DOM, return structured records
field expression
listing //body
record //div[contains(@class,'movlist_wrap')]
title //span[contains(@class,'title')]/text()
rated .//span[.='rating:']/following-sibling::strong/text()
genre .//span[.='genre']/following-sibling::strong/text()
releaseMo .//span[@class='release']/text()
releaseDy .//span[@class='release']/text()
releaseYr .//span[@class='release']/text()
image .//@src
runtime .//span[.='runtime']/following-sibling::strong/text()
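To make the ⟨field, expression⟩ idea concrete, here is a minimal sketch of how such a wrapper could be applied to a DOM with Python and lxml; the HTML snippet and class names are illustrative, not taken from an actual site.

```python
from lxml import html

# A toy wrapper: one XPath selecting record nodes, plus one XPath per attribute,
# evaluated relative to each record node.
WRAPPER = {
    "record": "//div[contains(@class,'movlist_wrap')]",
    "title": ".//span[contains(@class,'title')]/text()",
    "rated": ".//span[.='rating:']/following-sibling::strong/text()",
}

PAGE = """
<body>
  <div class="movlist_wrap">
    <span class="title">Schindler's List</span>
    <span>rating:</span><strong>R</strong>
  </div>
</body>
"""

def apply_wrapper(page_source):
    dom = html.fromstring(page_source)
    records = []
    for node in dom.xpath(WRAPPER["record"]):
        record = {}
        for field, expr in WRAPPER.items():
            if field == "record":
                continue
            values = node.xpath(expr)
            record[field] = values[0].strip() if values else None
        records.append(record)
    return records

print(apply_wrapper(PAGE))
# [{'title': "Schindler's List", 'rated': 'R'}]
```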
Problems with wrapping
Inaccurate wrapping results in over- (or under-) segmented data
Attribute_1 Attribute_2
Ava’s Possessions Release Date: March 4, 2016 | Rated: R | Genre(s): Sci-Fi,
Mystery, Thriller, Horror | Production Company: Off
Hollywood Pictures | Runtime: 216 min
Camino
Release Date: March 4, 2016 | Rated: Not Rated | Genre(s):
Action, Adventure, Thriller | Production Company: Bielberg
Entertainment | Runtime: 103 min
Cemetery of Splendor
Release Date: March 4, 2016 | Rated: Not Rated | Genre(s):
Drama | User Score: 4.6 | Production Company: Centre
National de la Cinématographie (CNC) | Runtime: 122 min
Title Release Genre Rating Runtime
RS: Source Relation
Σ: Target Schema
Example extraction using RoadRunner (Crescenzi et al.)
Questions
The questions we want to answer are:
can we fix the data, and use what we learn to repair wrappers as well?
are the solutions scalable?
Why do we care?
Companies such as FB and Skyscanner spend millions of dollars of engineering
time creating and maintaining wrappers
Wrapper maintenance is a major cost of data acquisition from the web
Fixing the data
The wrapper thinks it is filling this schema…
MAKE MODEL PRICE
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
£10k Citroën C3
£22k Ford C-max Titanium X
If all instances looked like this (i.e., only mis-segmentation, no garbage, no shuffling)
table induction problem: TEGRA, WebTables, etc.
Moreover… we still have no clue on how to fix the wrapper afterwards
…but instead it produces this instance…
£19k Make: Audi Model:A3 Sportback
£43k Make: Audi Model: A6 Allroad
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
What is a good relation?
The problem is that wrapper generated relations really look like this…
First, we need a way to determine how “far” we are from a good relation…
ū = ⟨u1, u2, …, un⟩
a tuple generated by the wrapper
Σ = ⟨A1, A2, …, Am⟩
the (target) schema for the extraction
Ω = {ωA1, …, ωAarity(Σ)} set of oracles for Σ
The fitness then quantifies how well ū (resp. the whole instance) “fits” Σ
Ω = {ωMAKE, ωMODEL, ωPRICE}, Σ = ⟨MAKE, PRICE, MODEL⟩
ωA(u)=1 if u ∈ dom(A) or u=null, and ωA(u)=0 otherwise
f(R, Σ, Ω) = 1/2 = 50%
£19k Make: Audi Model:A3 Sportback
£43k Make: Audi Model: A6 Allroad
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
ωMAKE ωPRICE ωMODEL
Problem Definition: Fitness
Σ = ⟨A1, A2, …, Am⟩ attributes (fields) of the target schema of the relation
ū = ⟨u1, u2, …, un⟩ tuple of the wrapper-generated relation R
Ω = {ωA1, …, ωAarity(Σ)} set of oracles for the fields of Σ, s.t.
ωA(u)=1 if u ∈ dom(A) or u=null, and ωA(u)=0 otherwise
We define the fitness of a tuple ū (resp. relation R) w.r.t. a schema Σ as:
f(ū, Σ, Ω) = ∑i=1..c ωAi(ui) / d, where c = min{arity(Σ), arity(R)} and d = max{arity(Σ), arity(R)}
(resp. f(R, Σ, Ω) = ∑ū∈R f(ū, Σ, Ω) / |R|)
Input: a wrapper W, a relation R | W(P)=R for some set of pages P, and a schema Σ
MAKE MODEL PRICE
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
Citroën £10k C3
Ford £22k C-max Titanium X
f(R, Σ, Ω) = 1/6 = 17%
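A minimal Python sketch of this fitness computation, with hand-rolled membership oracles standing in for the paper's NER-based surrogates:

```python
# Sketch of the fitness definition above; oracles here are plain membership
# tests, whereas the paper uses (noisy) named-entity recognisers.
def tuple_fitness(u, schema, oracles):
    c = min(len(schema), len(u))
    d = max(len(schema), len(u))
    return sum(oracles[schema[i]](u[i]) for i in range(c)) / d

def relation_fitness(R, schema, oracles):
    return sum(tuple_fitness(u, schema, oracles) for u in R) / len(R)

oracles = {
    "MAKE":  lambda v: int(v is None or v in {"Audi", "Ford", "Citroën"}),
    "MODEL": lambda v: int(v is None or v in {"A3 Sportback", "C3"}),
    "PRICE": lambda v: int(v is None or (isinstance(v, str) and v.startswith("£"))),
}
schema = ("MAKE", "MODEL", "PRICE")
R = [("£19k", "Audi", "A3 Sportback"),   # all three values misplaced
     ("Citroën", "£10k", "C3")]          # only MAKE lands in an accepted field
print(relation_fitness(R, schema, oracles))  # 1/6 ≈ 0.167, far from a good relation
```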
Problem Definition: Σ-repairs
A Σ-repair is a pair σ = ⟨Π,ρ⟩ where:
Π = (i, j, …, k) is a permutation of the fields of R
ρ = { ⟨A1,ƐA1⟩, ⟨A2,ƐA2⟩, …, ⟨Am,ƐAm⟩ } is a set of regexes, one for each attribute in Σ
Σ-repairs can be applied to a tuple ū in the following way
σ(ū) = ⟨ ƐA1(Π(ū)), ƐA2(Π(ū)), …, ƐAm(Π(ū)) ⟩
The notion of applicability extends naturally to relations σ(R) (i.e., sets of tuples)
Similarly, Σ-repairs can be applied to wrappers as well [details in the paper]
Output: a wrapper W’ and a relation R’ | W’(P)=R’ and R’ is of maximum fitness w.r.t. Σ
The goal is to find the Σ-repair that maximises the fitness
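As an illustration, the following sketch applies a Σ-repair ⟨Π, ρ⟩ to a single tuple; here the Ɛ-expressions are plain Python regexes rather than the substring-style expressions used in the slides, and the regexes are illustrative.

```python
import re

# Sketch of applying a Σ-repair σ = ⟨Π, ρ⟩ to a tuple: permute the fields,
# concatenate them, then let each attribute's expression extract its value.
def apply_repair(u, permutation, rho):
    permuted = " ".join(u[i] for i in permutation)
    repaired = []
    for attr, pattern in rho:
        m = re.search(pattern, permuted)
        repaired.append(m.group(1) if m else None)
    return tuple(repaired)

u = ("£19k", "Make: Audi Model: A3 Sportback")
rho = [
    ("MAKE",  r"Make:\s*(\S+)"),
    ("MODEL", r"Model:\s*(.+)$"),
    ("PRICE", r"(£\S+)"),
]
print(apply_repair(u, (0, 1), rho))
# ('Audi', 'A3 Sportback', '£19k')
```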
Computing Σ-repairs
Complexity [details in the paper]:
1. non-atomic misplacements: NP-complete (red. from Weighted Set Packing)
2. atomic misplacements: polynomial (red. from Stars and Buckets)
We have an atomic misplacement when the correct value for an attribute is:
1. entirely misplaced, or
2. if it is over-segmented, the fragments are in adjacent fields in the relation
MAKE MODEL PRICE
£22k Ford C-max Titanium X
MAKE MODEL PRICE
C-max £22k X Ford Titanium
atomic misplacement non atomic misplacement
Naïve Algorithm:
For each tuple…
1. permute tuples in all possible ways (only if non-atomic misplacements)
2. segment tuples in all possible ways
3. ask the oracles and keep the segmentation of highest fitness
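A sketch of steps 2–3 for a single tuple under the atomic-misplacement assumption (no permutation): enumerate all contiguous segmentations of the tuple's tokens, score each with the oracles, keep the best. The oracles are the toy membership tests from the fitness sketch above.

```python
from itertools import combinations

# Naïve per-tuple step: try every contiguous segmentation of the tokens into
# len(schema) fields and keep the one of highest fitness. O(n^k) worst case.
def best_segmentation(tokens, schema, oracles):
    n, m = len(tokens), len(schema)
    best, best_fit = None, -1.0
    for cuts in combinations(range(1, n), m - 1):       # choose m-1 cut points
        bounds = (0,) + cuts + (n,)
        fields = tuple(" ".join(tokens[bounds[i]:bounds[i + 1]]) for i in range(m))
        fit = sum(oracles[schema[i]](fields[i]) for i in range(m)) / m
        if fit > best_fit:
            best, best_fit = fields, fit
    return best, best_fit

oracles = {
    "MAKE":  lambda v: int(v in {"Audi", "Ford"}),
    "MODEL": lambda v: int(v in {"A3 Sportback", "C3"}),
    "PRICE": lambda v: int(v.startswith("£")),
}
print(best_segmentation(["Audi", "A3", "Sportback", "£19k"],
                        ("MAKE", "MODEL", "PRICE"), oracles))
# (('Audi', 'A3 Sportback', '£19k'), 1.0)
```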
Approximating Σ-repairs
The naïve algorithm has the following problems:
1. oracles do not (always) exist
2. it fixes one tuple at a time, the wrapper needs a single fix for each attribute
3. even under the assumption of atomic misplacements we still have to try O(n^k)
different segmentations (worst case) before finding the one of maximum fitness
(1) Weak oracles
Use noisy NERs in place of oracles. If unavailable, it’s easy to build one.
In this work we use ROSeAnn (Chen et al., PVLDB ’13)
(2 and 3) Approximate relation-wide repairs
Wrappers are programs: if they make a mistake, they make it consistently
There is hope to have a common underlying attribute structure
Finding the right structure
We have to solve two problems:
find the underlying structure(s) of the relation
find a segmentation that maximises the fitness
An obvious way is sequence labelling (e.g., Markov chains + Viterbi) where oracles are
simulated by NERs (so they can make mistakes)
[Figure: Markov-chain network over attributes A–D with SOURCE and SINK nodes and transition counts]
The maximum likelihood sequence is actually <A,D> which “fits” ~28%
It looks like there’s another sequence that fits better…
a b c
a b c
a b c
a d
a d
b a d
b a d
Ω = {ωA, ωB, ωC, ωD}
Finding the right structure
The sequence corresponding to the max-flow is <A,B,C> which “fits” ~32%
[Figure: context-aware flow network with nodes vA,(), vB,(A), vC,(A,B), vD,(A), vB,(), vA,(B), vD,(B,A); the path vA → vB → vC carries the maximum flow]
The problem is that Markov chains are memory-less…
we have to remember the context and
make sure our sequence satisfies the oracles more than any other
Ok… this sounds like a max-flow!
Ω = {ωA, ωB, ωC, ωD}
(same annotated records and oracles Ω = {ωA, ωB, ωC, ωD} as above)
Iteratively compute max flows on the network, i.e., likely sequences of high fitness
[Figure: Iteration 0 saturates the path PRICE → MAKE → MODEL with flow 6/6; Iteration 1 saturates MAKE → PRICE → MODEL with flow 3/3]
We stop when we covered “enough” of the tuples in the relation
First, annotate the relation using NERs (surrogate oracles) and build the network
[Figure: annotation network built from the example below, with edge capacities given by annotation counts]
Example:
£19k Make: Audi Model: A3 Sportback
£43k Make: Audi Model: A6 Allroad quattro
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
Ω = {ωMAKE, ωMODEL, ωPRICE}
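A toy version of this iteration in Python with networkx; the network construction is heavily simplified (capacities are hard-coded to mirror the example) and the node naming is illustrative.

```python
import networkx as nx

# Toy sketch of the max-flow encoding: nodes are (attribute, context) pairs so
# the network "remembers" what preceded each annotation; edge capacities count
# how many annotated records support each transition.
G = nx.DiGraph()
# 6 records support PRICE -> MAKE -> MODEL
G.add_edge("SOURCE", ("PRICE", ()), capacity=6)
G.add_edge(("PRICE", ()), ("MAKE", ("PRICE",)), capacity=6)
G.add_edge(("MAKE", ("PRICE",)), ("MODEL", ("PRICE", "MAKE")), capacity=6)
G.add_edge(("MODEL", ("PRICE", "MAKE")), "SINK", capacity=6)
# 3 records support the competing MAKE -> PRICE -> MODEL structure
G.add_edge("SOURCE", ("MAKE", ()), capacity=3)
G.add_edge(("MAKE", ()), ("PRICE", ("MAKE",)), capacity=3)
G.add_edge(("PRICE", ("MAKE",)), ("MODEL", ("MAKE", "PRICE")), capacity=3)
G.add_edge(("MODEL", ("MAKE", "PRICE")), "SINK", capacity=3)

flow_value, flow = nx.maximum_flow(G, "SOURCE", "SINK")

# Read off the most supported sequence by following the largest flow out of each node.
node, sequence = "SOURCE", []
while node != "SINK":
    node = max(flow[node], key=flow[node].get)
    if node != "SINK":
        sequence.append(node[0])
print(flow_value, sequence)  # 9 ['PRICE', 'MAKE', 'MODEL']
```

In the actual algorithm the saturated path is then removed and the computation repeats until enough of the tuples are covered.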
Fixing the relation (and the wrapper)
Max flows represent likely sequences. We use them to eliminate unsound annotations.
We can use standard regex-induction algorithms to obtain robust expressions
£19k Make: Audi Model: A3 Sportback
MAKE [11,15)
MODEL [24,36)
PRICE [0,4)
The remaining annotations can be used as examples for regex induction
The induced expressions recover missing (incomplete) annotations
£19k Make: Audi Model: A3 Sportback
£43k Make: Audi Model: A6 Allroad quattro
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
ρ = {
⟨MAKE, substring-before($, £) or
substring-before(substring-after($, 'ke:␣'), '␣Mo')⟩,
⟨MODEL, substring-after($, el:␣)⟩,
⟨PRICE, substring-after(substring-before($, 'kMa␣' || 'kMo␣'), ␣)⟩
}
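A rough sketch of the affix-induction idea, under the assumption that annotations come as character spans over each record; the window size and helper name are made up for illustration.

```python
import os

# Given records with a clean annotation span for one attribute, find the longest
# common string immediately before and after the annotated values, and use it
# as a substring-style extraction marker.
def induce_affixes(records, spans, window=8):
    befores = [r[max(0, s - window):s] for r, (s, e) in zip(records, spans)]
    afters  = [r[e:e + window] for r, (s, e) in zip(records, spans)]
    # longest common suffix of the "before" contexts = the attribute's prefix marker
    prefix = os.path.commonprefix([b[::-1] for b in befores])[::-1]
    suffix = os.path.commonprefix(afters)
    return prefix, suffix

records = [
    "£19k Make: Audi Model: A3 Sportback",
    "£43k Make: Audi Model: A6 Allroad quattro",
]
spans = [(11, 15), (11, 15)]             # the two "Audi" occurrences (MAKE)
print(induce_affixes(records, spans))    # ('k Make: ', ' Model: ')
```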
Approximating Σ-repairs
MAKE MODEL PRICE
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
Citroën £10k C3
Ford £22k C-max Titanium X
When an expression fails to match a minimum number of tuples, we fall back to
the NERs: value-based expressions
ρ = {
⟨MAKE, value-based($, [Audi, Ford] )⟩,
⟨MODEL, substring-after(substring-after($, ␣), ␣)⟩,
⟨PRICE, substring-after(substring-before($, k␣),␣)⟩
}
Example: (induction threshold 75%)
MAKE MODEL PRICE
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
Citroën £10k C3
Ford £22k C-max Titanium X
ρ = {
⟨MAKE, substring-before($, £) or
substring-before(substring-after($, k␣),␣)⟩,
⟨MODEL, substring-after(substring-after($, ␣), ␣)⟩,
⟨PRICE, substring-after(substring-before($, k␣),␣)⟩
}
Example: (induction threshold 20%)
Evaluation
Dataset:
An enhanced version of the SWDE dataset (https://swde.codeplex.com)
10 domains, 100 websites, 78 attributes, ~100k pages, ~130k records
Systems:
wrapper generation systems: DIADEM, DEPTA, ViNTs, RoadRunner
Baseline wrapper induction/repair systems: WEIR (Crescenzi et al., VLDB ’13)
Implementation: WADaR (Wrapper and Data Repair) – Java + SQL
Evaluation: Highlights
[Fig. 2: Impact of repair — Precision, Recall, and F1-Score before (Original) and after (Repaired) repair, per system and domain: ViNTs (RE/Auto), DIADEM (RE/Auto), DEPTA (RE/Auto), RoadRunner (Auto, Book, Camera, Job, Movie, NBA, Restaurant, University)]
to 30% in real estate, with an identical effect in almost all
domains.
Attribute-level accuracy. Another question is whether
there are substantial differences in attribute-level accuracy.
The top of Table III shows attributes where the repair is
very effective (F1-Score ≈ 1 after repair). These values appear
as highly structured attributes on web pages and the corresponding
expressions repair almost all tuples. As an example,
DOOR NUMBER is almost always followed by the suffixes dr or
door. In these cases, the wrapper induction under-segmented
the text due to lack of sufficient examples.
TABLE III: Attribute-level evaluation.
System Domain Attribute Original F1-Score Repaired F1-Score
DIADEM real estate POSTCODE 0.304 0.947
DIADEM auto DOOR 0 0.984
Table IV shows Precision and Recall computed on the sample (values
higher than 0.9 are highlighted in bold).
TABLE IV: Accuracy of large scale evaluation.
Attribute Precision Recall % Modified values
LOCALITY 0.993 0.993 11.34%
OPENING HOURS 1.00 0.461 17.14%
LOCATED WITHIN 1.00 0.224 29.75%
PHONE 0.987 0.849 50.74%
POSTCODE 0.999 0.989 9.4%
STREET ADDRESS 0.983 0.98 83.78%
In order to estimate the impact of the repair, we computed, for each
attribute, the percentage of values that are different before and after
the repair step. These numbers are shown in the last column of
Table IV. Clearly, the repair is beneficial in all of the cases. For
OPENING HOURS and LOCATED WITHIN, where recall is very
WADaR increases F1-score between 15% and 60% (excluding ViNTs)
[Chart: WEIR vs. Repair — Precision, Recall, and F1-Score per domain (Auto, Book, Camera, Job, Movie, NBA, Restaurant, University)]
WADaR is 23% more accurate than WEIR on average
Evaluation: Robustness
We studied how F1-score varies w.r.t. annotation noise
The accuracy numbers are limited to those attributes where
our approach induces regular expressions, since it is already
clear that annotator errors directly reduce the accuracy of
value-based expressions. This is still a significant number of
attributes, i.e., 65% in all cases except for RoadRunner on
book (35%), and RoadRunner on movie (46%). Figure 8 shows
Fig. 8: Annotator recall drop - Fixed threshold
the impact of a drop in recall (x-axis) on F1-Score. As we
can see, our approach is robust to a drop in recall until we
reach 80% loss, then the performance rapidly decays. This is
somehow expected, since the regular expressions compensate
for the missing recall up to the point where the max-flow
sequences are no longer able to determine the underlying
attribute structure reliably.
Figure 9 shows the effect on F1-Score if we set a low regex-
induction threshold (i.e., 0.1) instead. Clearly, in this case
our approach is highly robust to annotator inaccuracy and we
notice a loss in performance only after 80-90% loss in recall.
In summary, a lower regex-induction threshold is advisable
when we know that annotators have low recall. Even involving
an annotator with very low accuracy, our approach is robust
Fixed induction threshold 75%
(high dependence on annotation quality)
Fig. 9: F1-Score variation with a threshold value of 0.1
Fixed induction threshold 10%
(low dependence on annotation quality)
F1 starts being affected when recall loss at ~80%
Precision loss does not affect WADaR until ~300% (random noise)
Evaluation: Scalability
Worst-case scenario: all tuples are annotated with all attribute types
WADaR scales linearly w.r.t. the size of the relation and polynomially w.r.t. attributes
i.e., each record contains k annotated tokens, each annotation
has a different context and each record produces a different
path on the network. This results in a network with n · k + 2
nodes, and n · k + n edges. The chart on the left of Figure 3
plots the running time over an increasing number of records
(with number of attributes fixed), while the chart on the right
Fig. 3: Running time.
Oracles decouple the problem of finding similar instances from the segmentation
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
£10k Citroën C3
£22k Ford C-max Titanium X
Ω = {ωMAKE, ωMODEL, ωPRICE}
Open issues
Learning oracles
Building oracles is not difficult but still requires engineering time.
The IBM SystemT people did some good work in this direction. We can start there.
Missing attributes
Right now, if the wrapper fails to recover data, then we cannot repair it.
It is possible to manipulate the wrapper to match more content.
Markov Chains vs Max flows on wrapped relations
They seem to eventually compute the same sequences but in different order… proof?
What I know is that max-flows best approximate the maximum fitness at every step.
Questions?
References:
L. Chen, S. Ortona, G. Orsi, M. Benedikt. Aggregating Semantic Annotators. PVLDB ’13
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche. Joint Repairs for Web Wrappers. ICDE ’16
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche. WADaR: Joint Wrapper and Data Repair. VLDB ’15 (Demo)
T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, C. Schallhart, C. Wang. DIADEM: Thousands of websites to a single database. PVLDB ’15
Web Data Extraction
Title Director Rating Runtime
Schindler’s List Steven Spielberg R 195 min
RoadRunner / DEPTA:
Attribute_1 Attribute_2
Schindler’s List Director: Steven Spielberg Rating: R Runtime: 195 min
Lawrence of Arabia (re-release) Director: David Lean Rating: PG Runtime: 216 min
Le cercle Rouge (re-release) Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
Joint Data and Wrapper Repair
Attribute_1 Attribute_2
Schindler’s List Director: Steven Spielberg Rating: R Runtime: 195 min
Title Director Rating Runtime
Schindler’s List Steven Spielberg R 195 min
Maximal Repair is NP-complete
Attribute
Director: Steven Spielberg Rating: R Runtime: 195 min
[Figure: candidate partitions φ1–φ4 of the record into Director | Rating | Runtime; the best yields Steven Spielberg | R | 195 min]
OBSERVATIONS
Templated Websites:
Data is published following a template.
Wrapper Behaviour
Wrappers rarely misplace and over-segment at the same time.
Wrappers make systematic errors.
Oracles
Oracles can be implemented as (ensembles of) NERs.
NERs are not perfect, i.e., they make mistakes
Joint Wrapper and Data Repair
Authors
When values are both misplaced and over-segmented, computing repairs
of maximal fitness is hard; otherwise, just do the following:
(1) Compute all possible k non-crossing partitions (k = |R|) of tokens, i.e., assign to each attribute an element of the partition (O(n^k), a Narayana number).
(2) Discard tokens never accepted by oracles in any of the partitions.
(3) Collapse identical partitions and choose the one with maximal fitness.
Without misplacement and over-segmentation, a solution is found in polynomial time by computing non-crossing k-partitions.
NP-hardness: reduction from Weighted Set Packing. Membership in NP: guess a partition, decide non-crossing, and compute fitness in PTIME.
Stefano Ortona stefano.ortona@cs.ox.ac.uk University of Oxford, UK
Giorgio Orsi giorgio.orsi@cs.ox.ac.uk University of Oxford, UK
Marcello Buoncristiano marcello.buoncrisitano@yahoo.it Università della Basilicata, Italy
Tim Furche tim.furche@cs.ox.ac.uk University of Oxford, UK
http://diadem.cs.ox.ac.uk/wadar
Web data extraction (aka scraping/wrapping) uses
wrappers to turn web pages into structured data.
Wrapper: structure { ⟨R, πR⟩, { ⟨A1, πA1⟩, …, ⟨Am, πAm⟩ } } specifying objects to be
extracted (listings, records, attributes) and corresponding XPath expressions π.
Wrappers are often created algorithmically and in large numbers.
Tools capable of maintaining them over time are missing.
⟨RATING, //li[@class='second']/p⟩
⟨RUNTIME, //li[@class='third']/ul/li[1]⟩
Algorithmically-created wrappers generate data that is far from perfect.
Data can be badly segmented and misplaced.
⟨TITLE,⟨1⟩,string($)⟩
⟨DIRECTOR,⟨2⟩,substring-before(substring-after($,tor:_),_Rat)⟩
⟨RATING,⟨2⟩,substring-before(substring-after($,ing:_),_Run)⟩
⟨RUNTIME,⟨2⟩,substring-after($,time:_)⟩
Take a set Ω of oracles, where each ωA in Ω can say whether a value vA
belongs to the domain of A. We define the fitness of a relation R w.r.t. Ω as the average tuple fitness (see the fitness definition above).
Repair: specifies regular expressions that, when applied to the original
relation, produce a new relation with higher fitness.
Badly segmented and misplaced values (examples): ⟨Director: Steven⟩, ⟨195 min⟩, ⟨Director:⟩⟨Steven Spielberg⟩, ⟨Rating: R Runtime:195⟩, ⟨Runtime: k195 min⟩, ⟨min Director: Steven Spielberg⟩, ⟨Rating: Runtime: 195⟩, ⟨Director: Steven Spielberg⟩, ⟨Rating: R⟩, ⟨R⟩, ⟨Spielberg Rating: R Runtime:⟩
WADaR:
⟨DIRECTOR, //li[@class='first']/div/span⟩
APPROXIMATING JOINT REPAIRS
Step 1: Annotation
Each record is interpreted as a string (concatenation of
attributes), where NERs analyse and identify relevant attributes.
Entity recognisers make mistakes, WADaR tolerates incorrect
and missing annotations.
Attribute_1 Attribute_2
Schindler’s List Director: Steven Spielberg Rating: R Runtime: 195 min
Lawrence of Arabia (re-release) Director: David Lean Rating: PG Runtime: 216 min
Le cercle Rouge (re-release) Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
The life of Jack Tarantino (coming soon) Director: David R Lynch Rating: Not Rated Runtime: 123 min
[Figure: NER annotations (Title, Director, Rating, Runtime) overlaid on the records above]
Step 2: Segmentation
Goal: understand the underlying structure of the relation.
Two possible ways of encoding the problem:
1. Max Flow Sequence in a Flow Network
[Figure: flow network over TITLE, DIRECTOR, RATING, RUNTIME; MAX FLOW SEQUENCE: DIRECTOR …]
2. Most Likely Sequence in a Memoryless Markov Chain
[Figure: Markov chain over the same attributes; MOST LIKELY SEQUENCE: DIRECTOR RATING RUNTIME]
Solutions often coincide.
Markov Chains: intuitive and faster to compute.
Max Flows: provably optimal.
Step 3: Induction
Schindler’s List
Lawrence of Arabia (re-release)
Le cercle Rouge (re-release)
Director: Steven Spielberg Rating: R Runtime: 195 min
Director: David Lean Rating: PG Runtime: 216 min
Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
Director: David R Lynch Rating: Not Rated Runtime: 123 min
SUFFIX = substring-before("_(")
PREFIX = substring-after("tor:_")
SUFFIX = substring-before("_Rat")
PREFIX = substring(string-length()-7)
Input: set of clean annotations to be used as positive examples.
WADaR induces regular expressions by looking, e.g., at
common prefixes, suffixes, non-content strings, and character
length.
Induced expressions improve recall
token value1 token token token token
token token value2 token token token
token token token value3 token token
When WADaR cannot induce regular expressions (not enough
regularity), data is repaired directly with annotators. Wrappers
are instead repaired with value-based expressions, i.e.,
disjunction of the annotated values.
ATTRIBUTE = string-contains("value1"|"value2"|"value3")
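A small sketch of such a value-based fallback, building a regex that is simply the disjunction of annotator-accepted values (the helper name is ours, not WADaR's):

```python
import re

# Value-based expression: when no regular expression can be induced, fall back
# to a disjunction of the values the annotators accepted. Longer values first,
# so a value that is a prefix of another cannot shadow it.
def value_based(annotated_values):
    alternation = "|".join(re.escape(v)
                           for v in sorted(annotated_values, key=len, reverse=True))
    return re.compile(alternation)

make = value_based({"Audi", "Ford", "Citroën"})
m = make.search("Citroën £10k Model: C3")
print(m.group(0) if m else None)  # Citroën
```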
Empirical Evaluation
[Charts: Precision, Recall, and F1-Score, Original vs. Repaired, for each system configuration (ViNTs RE/Auto, DIADEM RE/Auto, DEPTA RE/Auto, RoadRunner per domain)]
5.1 Setting
Datasets. The dataset consists of 100 websites from 10 domains
and is an enhanced version of SWDE [20], a benchmark commonly
used in web data extraction. SWDE’s data is sourced from
80 sites and 8 domains: auto, book, camera, job, movie, NBA player,
restaurant, and university. For each website, SWDE provides collections
of 400 to 2k detail pages (i.e., where each page corresponds
to a single record). We complemented SWDE with collections of
listing pages (i.e., pages with multiple records) from 20 websites
of the real estate (RE) and auto domains. Table 1 summarises the
characteristics of the dataset.
Table 1: Dataset characteristics.
Domain Type Sites Pages Records Attributes
Real Estate listing 10 271 3,286 15
Auto listing 10 153 1,749 27
Auto detail 10 17,923 17,923 4
Book detail 10 20,000 20,000 5
Camera detail 10 5,258 5,258 3
Job detail 10 20,000 20,000 4
Movie detail 10 20,000 20,000 4
Nba Player detail 10 4,405 4,405 4
Restaurant detail 10 20,000 20,000 4
University detail 10 16,705 16,705 4
Total - 100 124,715 129,326 78
SWDE comes with ground-truth data created under the assumption
that wrapper-generation systems could only generate extraction
rules with DOM-element granularity, i.e., without segmenting text
nodes. Modern wrapper-generation systems support text-node
segmentation and we therefore refined the ground-truth accordingly.
As an example, in the camera domain, the original ground-truth
values for MODEL consisted of the entire product title; besides the
model, the text node also includes COLOR, PIXELS, and MANUFACTURER.
The ground-truth for the real estate and auto domains has been
created following the SWDE format. The final dataset consists of
more than 120k pages, for almost 130k records containing more
than 500k attribute values.
Wrapper-generation systems. We generated input relations
for our evaluation using four wrapper-generation systems: DIADEM [19],
DEPTA [36] and ViNTs [39] for listing pages, and RoadRunner [12]
for detail pages.1 The output of DIADEM, DEPTA, and
RoadRunner can be readily used in the evaluation since these are
full-fledged data extraction systems, supporting the segmentation
of both records and attributes within listings or (sets of) detail
pages. ViNTs, on the other hand, segments rows into records within
a search result listing and, as such, it does not have a concept of
attribute. Instead, it segments rows within a record. We therefore
post-processed its output, typing the content of lines from different
records that are likely to have the same semantics. We used a
naïve heuristic similarity based on relative position in the record
and string-edit distance of the row’s content. This is a very simple
version of more advanced alignment methods based on instance-level
redundancy used by, e.g., WEIR and TEGRA [7].
Metrics. The performance of the repair is evaluated by comparing
wrapper-generated relations against the SWDE ground truth
before and after the repair. The metrics used for the evaluation
are Precision, Recall, and F1-Score computed at attribute level. Both
the ground truth and the extracted values are normalised, and exact
matching between the extracted values and the ground-truth is required
for a hit. For space reasons, in this paper we only present
the most relevant results. The results of the full evaluation, together
with the dataset, gold standard, extracted relations, the code of the
normaliser and of the scorer, are available at the online appendix [1].
1 RoadRunner can be configured for listings but it performs better on
detail pages.
All experiments are run on a desktop with an Intel quad-core i7
at 3.40GHz with 16 GB Ram and Linux Mint OS 17.
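For concreteness, a small sketch of attribute-level scoring as described; the normalisation here is just lower-casing and whitespace collapsing, whereas the paper's normaliser is more involved.

```python
# Attribute-level scoring: normalise both sides, count exact matches as hits,
# then compute precision, recall, and F1 for the attribute.
def normalise(v):
    return " ".join(v.lower().split()) if v else None

def attribute_scores(extracted, gold):
    hits = sum(1 for e, g in zip(extracted, gold)
               if e is not None and normalise(e) == normalise(g))
    extracted_n = sum(1 for e in extracted if e is not None)
    gold_n = sum(1 for g in gold if g is not None)
    p = hits / extracted_n if extracted_n else 0.0
    r = hits / gold_n if gold_n else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(attribute_scores(["OX2 6AR", None, "ox4  7dg"],
                       ["OX2 6AR", "OX1 2JD", "OX4 7DG"]))
# (1.0, 0.666..., 0.8)
```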
5.2 Repair performance
Relation-level Accuracy. The first two questions we want to answer
are whether joint repairs are necessary and what their impact
is in terms of quality. Table 2 reports, for each system, the percentage
of: (i) Correctly extracted values. (ii) Under-segmentations,
i.e., when values for an attribute are extracted together with values
of other attributes or spurious content. Indeed, websites often
publish multiple attribute values within the same text node and the
involved extraction systems are not able to split values into multiple
attributes. (iii) Over-segmentations, i.e., when attribute values
are split over multiple fields. As anticipated in Section 2, this rarely
happens since an attribute value is often contained in a single text
node. In this setting an attribute value can be over-segmented only
if the extraction system is capable of splitting single text nodes
(DIADEM), but even in this case the splitting happens only when
the system can identify a strong regularity within the text node.
(iv) Misplacements, i.e., values are placed or labeled as the wrong
attribute. This is mostly due to lack of semantic knowledge and
confusion introduced by overlapping attribute domains. (v) Missing
values, due to lack of regularity and optionality in the web
source (RoadRunner, DEPTA, ViNTs) or missing values from the
domain knowledge (DIADEM).
Table 2: Wrapper generation system errors.
System Correct (%) Under-Segmented (%) Over-Segmented (%) Misplaced (%) Missing (%)
DIADEM 60.9 34.6 0 23.2 3.5
DEPTA 49.7 44 0 25.3 6
ViNTs 23.9 60.8 0 36.4 15.2
RoadRunner 46.3 42.8 0 18.6 10.4
Note that the numbers do not add up to 100% since errors may fall
into multiple categories. These numbers clearly show that there is a
quality problem in wrapper-generated relations and also support the
atomic misplacement assumption.
Figure 2 shows, for each system and each domain, the impact
of the joint repair on our metrics. Light (resp. dark) colored bars
denote the quality of the relation before (resp. after) the repair.
A first conclusion that can be drawn is that a repair is always beneficial.
Of 697 extracted attributes, 588 (84.4%) require some
form of repair and the average pre-repair F1-Score produced by the
systems is 50%. We are able to induce a correct regular expression
for 335 (57%) attributes, while for the remaining 253 (43%) our
approach produces value-based expressions. We can repair at least
one attribute in each of the wrappers in all of the cases, and we can
repair more than 75% of attributes in more than 80% of the cases.
Among the considered systems, DIADEM delivers, on average,
the highest pre-repair F1-Score (~60%), but it never exceeds 65%.
RoadRunner is on average worse than DIADEM but it reaches a better
70% F1-Score on restaurant. Websites in this domain are in fact
highly structured and individual attribute values are contained in a
dedicated text node. When attributes are less structured, e.g., on
book, camera, and movie, RoadRunner has a significant drop in
performance. As expected, ViNTs delivers the worst pre-cleaning results.
In terms of accuracy, our approach delivers a boost in F1-Score
between 15% and 60%. Performance is consistently close to or
above 80% across domains and, except for ViNTs, across systems,
with a peak of 91% for RoadRunner on NBA player.
The following are the remaining causes of errors: (i) Missing
values cannot be repaired as we can only use the data available in
[Chart: WEIR vs. Repair — Precision, Recall, and F1-Score per domain (Auto, Book, Camera, Job, Movie, NBA, Restaurant, University)]
Evaluation
100 websites
10 domains
4 wrapper generation systems.
Precision, Recall, F1-Score
computed before and after repair.
WADaR boosts F1-Score between 15% and 60%.
Performance consistently close to or above 80%.
Metrics computed considering exact matches.
WADaR against WEIR.
WADaR is highly robust to errors of the NERs.
WADaR scales linearly with the size of the input
relation. Optimal joint-repair approximations
computed in polynomial time.
Optimality
WADaR provably produces relations of maximum fitness,
provided that the number of correctly annotated tuples is more
than the maximum error rate of the annotators.
More questions? Come to the poster later!!!
T. Furche, G. Gottlob, L. Libkin, G. Orsi, N. Paton. Data Wrangling for Big Data. EDBT ’16

Weitere ähnliche Inhalte

Was ist angesagt?

Strinng Classes in c++
Strinng Classes in c++Strinng Classes in c++
Strinng Classes in c++
Vikash Dhal
 
From Python to Scala
From Python to ScalaFrom Python to Scala
From Python to Scala
FFunction inc
 
Error Control in Multimedia Communications using Wireless Sensor Networks report
Error Control in Multimedia Communications using Wireless Sensor Networks reportError Control in Multimedia Communications using Wireless Sensor Networks report
Error Control in Multimedia Communications using Wireless Sensor Networks report
Muragesh Kabbinakantimath
 
String & its application
String & its applicationString & its application
String & its application
Tech_MX
 

Was ist angesagt? (20)

Strinng Classes in c++
Strinng Classes in c++Strinng Classes in c++
Strinng Classes in c++
 
String in programming language in c or c++
 String in programming language  in c or c++  String in programming language  in c or c++
String in programming language in c or c++
 
From Python to Scala
From Python to ScalaFrom Python to Scala
From Python to Scala
 
Error Control in Multimedia Communications using Wireless Sensor Networks report
Error Control in Multimedia Communications using Wireless Sensor Networks reportError Control in Multimedia Communications using Wireless Sensor Networks report
Error Control in Multimedia Communications using Wireless Sensor Networks report
 
Strings in c++
Strings in c++Strings in c++
Strings in c++
 
Strings
StringsStrings
Strings
 
E6
E6E6
E6
 
C string
C stringC string
C string
 
Format String Vulnerability
Format String VulnerabilityFormat String Vulnerability
Format String Vulnerability
 
Strings
StringsStrings
Strings
 
05 c++-strings
05 c++-strings05 c++-strings
05 c++-strings
 
String c
String cString c
String c
 
C++ string
C++ stringC++ string
C++ string
 
String & its application
String & its applicationString & its application
String & its application
 
Lex analysis
Lex analysisLex analysis
Lex analysis
 
string in C
string in Cstring in C
string in C
 
Strings In OOP(Object oriented programming)
Strings In OOP(Object oriented programming)Strings In OOP(Object oriented programming)
Strings In OOP(Object oriented programming)
 
String.ppt
String.pptString.ppt
String.ppt
 
String in c
String in cString in c
String in c
 
[ASM]Lab8
[ASM]Lab8[ASM]Lab8
[ASM]Lab8
 

Andere mochten auch

DIADEM WWW 2012
DIADEM WWW 2012DIADEM WWW 2012
DIADEM WWW 2012
Giorgio Orsi
 
Datalog and its Extensions for Semantic Web Databases
Datalog and its Extensions for Semantic Web DatabasesDatalog and its Extensions for Semantic Web Databases
Datalog and its Extensions for Semantic Web Databases
Giorgio Orsi
 
Chapter 01 concepts and principles of insurance
Chapter 01   concepts and principles of insuranceChapter 01   concepts and principles of insurance
Chapter 01 concepts and principles of insurance
iipmff2
 

Andere mochten auch (8)

DIADEM WWW 2012
DIADEM WWW 2012DIADEM WWW 2012
DIADEM WWW 2012
 
OPAL: automated form understanding for the deep web - WWW 2012
OPAL: automated form understanding for the deep web - WWW 2012OPAL: automated form understanding for the deep web - WWW 2012
OPAL: automated form understanding for the deep web - WWW 2012
 
Deos 2014 - Welcome
Deos 2014 - WelcomeDeos 2014 - Welcome
Deos 2014 - Welcome
 
OPAL: a passe-partout for web forms - WWW 2012 (Demonstration)
OPAL: a passe-partout for web forms - WWW 2012 (Demonstration)OPAL: a passe-partout for web forms - WWW 2012 (Demonstration)
OPAL: a passe-partout for web forms - WWW 2012 (Demonstration)
 
Querying UML Class Diagrams - FoSSaCS 2012
Querying UML Class Diagrams - FoSSaCS 2012Querying UML Class Diagrams - FoSSaCS 2012
Querying UML Class Diagrams - FoSSaCS 2012
 
Datalog and its Extensions for Semantic Web Databases
Datalog and its Extensions for Semantic Web DatabasesDatalog and its Extensions for Semantic Web Databases
Datalog and its Extensions for Semantic Web Databases
 
Allegati 1 11
Allegati 1 11Allegati 1 11
Allegati 1 11
 
Chapter 01 concepts and principles of insurance
Chapter 01   concepts and principles of insuranceChapter 01   concepts and principles of insurance
Chapter 01 concepts and principles of insurance
 

Ähnlich wie Joint Repairs for Web Wrappers

Roslyn compiler as a service
Roslyn compiler as a serviceRoslyn compiler as a service
Roslyn compiler as a service
Eugene Zharkov
 

Ähnlich wie Joint Repairs for Web Wrappers (20)

Microchip Mfg. problem
Microchip Mfg. problemMicrochip Mfg. problem
Microchip Mfg. problem
 
St Petersburg R user group meetup 2, Parallel R
St Petersburg R user group meetup 2, Parallel RSt Petersburg R user group meetup 2, Parallel R
St Petersburg R user group meetup 2, Parallel R
 
Refactoring Ruby Code
Refactoring Ruby CodeRefactoring Ruby Code
Refactoring Ruby Code
 
Computation Assignment Help
Computation Assignment Help Computation Assignment Help
Computation Assignment Help
 
Intro to OpenMP
Intro to OpenMPIntro to OpenMP
Intro to OpenMP
 
Kotlin: maybe it's the right time
Kotlin: maybe it's the right timeKotlin: maybe it's the right time
Kotlin: maybe it's the right time
 
Kamil witecki asynchronous, yet readable, code
Kamil witecki asynchronous, yet readable, codeKamil witecki asynchronous, yet readable, code
Kamil witecki asynchronous, yet readable, code
 
Refactoring
RefactoringRefactoring
Refactoring
 
Roslyn compiler as a service
Roslyn compiler as a serviceRoslyn compiler as a service
Roslyn compiler as a service
 
Microsoft cloud workshop - automated cloud service for MongoDB on Microsoft A...
Microsoft cloud workshop - automated cloud service for MongoDB on Microsoft A...Microsoft cloud workshop - automated cloud service for MongoDB on Microsoft A...
Microsoft cloud workshop - automated cloud service for MongoDB on Microsoft A...
 
Parallelising Dynamic Programming
Parallelising Dynamic ProgrammingParallelising Dynamic Programming
Parallelising Dynamic Programming
 
The Last Line Effect
The Last Line EffectThe Last Line Effect
The Last Line Effect
 
NIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph ConvolutionNIPS2017 Few-shot Learning and Graph Convolution
NIPS2017 Few-shot Learning and Graph Convolution
 
Building Enigma with State Monad & Lens
Building Enigma with State Monad & LensBuilding Enigma with State Monad & Lens
Building Enigma with State Monad & Lens
 
Graph Database Query Languages
Graph Database Query LanguagesGraph Database Query Languages
Graph Database Query Languages
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
Eta
EtaEta
Eta
 
C2.0 propositional logic
C2.0 propositional logicC2.0 propositional logic
C2.0 propositional logic
 
Template Haskell
Template HaskellTemplate Haskell
Template Haskell
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
 

Mehr von Giorgio Orsi

wadar_poster_final
wadar_poster_finalwadar_poster_final
wadar_poster_final
Giorgio Orsi
 
ROSeAnn: Reconciling Opinions of Semantic Annotators VLDB 2014
ROSeAnn: Reconciling Opinions of Semantic Annotators VLDB 2014ROSeAnn: Reconciling Opinions of Semantic Annotators VLDB 2014
ROSeAnn: Reconciling Opinions of Semantic Annotators VLDB 2014
Giorgio Orsi
 
AMBER WWW 2012 Poster
AMBER WWW 2012 PosterAMBER WWW 2012 Poster
AMBER WWW 2012 Poster
Giorgio Orsi
 
The Diadem Ontology
The Diadem OntologyThe Diadem Ontology
The Diadem Ontology
Giorgio Orsi
 
AMBER presentation
AMBER presentationAMBER presentation
AMBER presentation
Giorgio Orsi
 

Mehr von Giorgio Orsi (20)

Web Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseWeb Data Extraction: A Crash Course
Web Data Extraction: A Crash Course
 
Fairhair.ai – alan turing institute june '17 (public)
Fairhair.ai – alan turing institute june '17 (public)Fairhair.ai – alan turing institute june '17 (public)
Fairhair.ai – alan turing institute june '17 (public)
 
diadem-vldb-2015
diadem-vldb-2015diadem-vldb-2015
diadem-vldb-2015
 
wadar_poster_final
wadar_poster_finalwadar_poster_final
wadar_poster_final
 
Query Rewriting and Optimization for Ontological Databases
Query Rewriting and Optimization for Ontological DatabasesQuery Rewriting and Optimization for Ontological Databases
Query Rewriting and Optimization for Ontological Databases
 
ROSeAnn: Reconciling Opinions of Semantic Annotators VLDB 2014
ROSeAnn: Reconciling Opinions of Semantic Annotators VLDB 2014ROSeAnn: Reconciling Opinions of Semantic Annotators VLDB 2014
ROSeAnn: Reconciling Opinions of Semantic Annotators VLDB 2014
 
Perv a ds-rr13
Perv a ds-rr13Perv a ds-rr13
Perv a ds-rr13
 
Heuristic Ranking in Tightly Coupled Probabilistic Description Logics
Heuristic Ranking in Tightly Coupled Probabilistic Description LogicsHeuristic Ranking in Tightly Coupled Probabilistic Description Logics
Heuristic Ranking in Tightly Coupled Probabilistic Description Logics
 
AMBER WWW 2012 Poster
AMBER WWW 2012 PosterAMBER WWW 2012 Poster
AMBER WWW 2012 Poster
 
AMBER WWW 2012 (Demonstration)
AMBER WWW 2012 (Demonstration)AMBER WWW 2012 (Demonstration)
AMBER WWW 2012 (Demonstration)
 
Nyaya: Semantic data markets: a flexible environment for knowledge management...
Nyaya: Semantic data markets: a flexible environment for knowledge management...Nyaya: Semantic data markets: a flexible environment for knowledge management...
Nyaya: Semantic data markets: a flexible environment for knowledge management...
 
Table Recognition
Table RecognitionTable Recognition
Table Recognition
 
The Diadem Ontology
The Diadem OntologyThe Diadem Ontology
The Diadem Ontology
 
Diadem 1.0
Diadem 1.0Diadem 1.0
Diadem 1.0
 
Oxpath vldb
Oxpath vldbOxpath vldb
Oxpath vldb
 
Gottlob ICDE 2011
Gottlob ICDE 2011Gottlob ICDE 2011
Gottlob ICDE 2011
 
OPAL Presentation
OPAL PresentationOPAL Presentation
OPAL Presentation
 
AMBER presentation
AMBER presentationAMBER presentation
AMBER presentation
 
Orsi PersDB11
Orsi PersDB11Orsi PersDB11
Orsi PersDB11
 
Orsi Vldb11
Orsi Vldb11Orsi Vldb11
Orsi Vldb11
 

Kürzlich hochgeladen

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Joint Repairs for Web Wrappers

  • 1. Joint Repairs for Web Wrappers Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, and Tim Furche ICDE Helsinki - May, 19 2016 E E Schindler’s List Lawrence of Arabia (re-release) Le cercle Rouge (re-release) Director: Steven Spielberg Rating: R Runtime: 195 min Director: David Lean Rating: PG Runtime: 216 min Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min Director: Steven Spielberg Rating: R Runtime: 195 min Director: David Lean Rating: PG Runtime: 216 min Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min Director: David R Lynch Rating: Not Rated Runtime: 123 min SUFFIX= substring-before(“_(“) PREFIX= substring-after(“tor:_“) SUFFIX= substring-before(“_Rat“) PREFIX= substring(string-length()-7) WADaR induces regular expressions by looking, e.g., at common prefixes, suffixes, non-content strings, and character length. Induced expressions improve recall token value1 token token token token token token value2 token token token token token token value3 token token When WADaR cannot induce regular expressions (not enough regularity), data is repaired directly with annotators. Wrappers are instead repaired with value-based expressions, i.e., disjunction of the annotated values. ATTRIBUTE= string-contains(“value1”|”value2”|”value3”) y$ $ e)$ WADaR is highly robust to errors of the NERs. Optimality WADaR provably produces relations of maximum fitness, provided that the number of correctly annotated tuples is more than the maximum error rate of the annotators.
  • 2. Background: Web wrapping refcode postcode bedrooms bathrooms available price 33453 OX2 6AR 3 2 15/10/2013 £1280 pcm 33433 OX4 7DG 2 1 18/04/2013 £995 pcm Process or turning semi-structured (templated) web data into structured form Hidden databases are actually a form of dark / dim data (ref. panel on Tuesday)
  • 3. manual / (semi) supervised accurate expensive + non-scalable unsupervised less accurate cheaper + scalable Wrapidity Background: Web wrapping
  • 4. Background: Web wrapping From (manually or automatically) created examples to XPath-based wrappers Even on templated websites, automatic wrapping can be inaccurate Pairs <field,expression> that, once applied to the DOM, return structured records field expression listing //body record //div[contains(@class,'movlist_wrap')] title //span[contains(@class,’title’)]/text() rated .//span[.='rating:']/following-sibling::strong/text() genre .//span[.=genre']/following-sibling::strong/text() releaseMo .//span[@class='release']/text() releaseDy .//span[@class='release']/text() releaseYr .//span[@class='release']/text() image .//@src runtime .//span[.=runtime']/following-sibling::strong/text()
  • 5. Problems with wrapping Inaccurate wrapping results in over(under) segmented data Attribute_1 Attribute_2 Ava’s Possessions Release Date: March 4, 2016 | Rated: R | Genre(s) : Sci-Fi, Mystery, Thriller, Horror | Production Company: Off Hollywood Pictures “| Runtime: 216 min Camino Release Date: March 4, 2016 | Rated: Not Rated | Genre(s): Action, Adventure, Thriller | Production Company: Bielberg Entertainment | Runtime: 103 min Cemetery of Splendor Release Date: March 4, 2016 | Rated: Not Rated | Genre(s): Drama | User Score: 4.6 | Production Company: Centre National de la Cinématographie (CNC) | Runtime: 122 min Title Release Genre Rating Runtime RS: Source Relation : Target Schema Example extraction using RoadRunner (Crescenzi Et Al.)
  • 6. Questions The questions we want to answer are: can we fix the data, and use what we learn to repair wrappers as well? are the solutions scalable? Why do we care? Companies such as FB, and Skyscanner spend millions of dollars of engineering time, creating and maintaining wrappers Wrapper maintenance is a major cost of data acquisition from the web
  • 7. Fixing the data MAKE MODEL PRICEThe wrapper thinks it is filling this schema… £19k Audi A3 Sportback £43k Audi A6 Allroad quattro £10k Citroën C3 £22k Ford C-max Titanium X If all instances looked like this (i.e., mis-segmentation, no garbage, no shuffling) table induction problem: TEGRA, WebTables, etc. Moreover… we still have no clue on how to fix the wrapper afterwards …but instead it produces this instance… £19k Make: Audi Model:A3 Sportback £43k Make: Audi Model: A6 Allroad Citroën £10k Model: C3 Ford £22k Model: C-max Titanium X
  • 8. What is a good relation? The problem is that wrapper generated relations really look like this… First, we need a way to determine how “far” we are from a good relation… ū = ⟨u1, u2, …, un⟩ a tuple generated by the wrapper Σ = ⟨A1, A2, …, Am⟩ the (target) schema for the extraction Ω = {ωA1 , …, ωAarity(Σ) } set of oracles for Σ The fitness then quantifies how well ū (resp. the whole instance) “fits” Σ Ω = {ωMAKE, ωMODEL,ωPRICE}, Σ = ⟨MAKE, PRICE, MODEL⟩ ωA(u)=1 if u ∈ dom(A) or u=null, and ωA(u)=0 otherwise f(R, Σ, Ω) = 1/2 = 50% £19k Make: Audi Model:A3 Sportback £43k Make: Audi Model: A6 Allroad Citroën £10k Model: C3 Ford £22k Model: C-max Titanium X ωMAKE ωPRICE ωMODEL
  • 9. Problem Definition: Fitness Σ = ⟨A1, A2, …, Am⟩ attributes (fields) of the target schema of the relation ū = ⟨u1, u2, …, un⟩ tuple of the wrapper-generated relation R Ω = {ωA1 , …, ωAarity(Σ) } set of oracles for the fields of the Σ, s.t., ωA(u)=1 if u ∈ dom(A) or u=null, and ωA(u)=0 otherwise We define the fitness of a tuple ū (resp. relation R) w.r.t. a schema Σ as: f(ū, Σ, Ω) = ∑ ωAi (ui) / d i=1 c where: c=min{ arity(Σ), arity(R) } and d=max{ arity(Σ), arity(R) } resp. f(R, Σ, Ω) = ∑ f(ū, Σ, Ω) / |R| ū∈R Input: a wrapper W, a relation R | W(P)=R for some set of pages P, and a schema Σ £19k Audi A3 Sportback £43k Audi A6 Allroad quattro Citroën £10k C3 Ford £22k C-max Titanium X f(R, Σ, Ω) = 1/6 = 17% MAKE MODEL PRICE
  • 10. Problem Definition: Σ-repairs Π = (i, j, …, k) permutation of the fields of R ρ = { ⟨A1,ƐAi ⟩, ⟨A2,ƐAi ⟩, …, ⟨Am,ƐAm ⟩ } set of regexes for each attribute in Σ A Σ-repair is a pair σ = ⟨Π,ρ⟩ where: Σ-repairs can be applied to a tuple ū in the following way σ(ū) = ⟨ ƐAi (Π(ū)), ƐA2 (Π(ū)), … , ƐAm (Π(ū)) ⟩ The notion of applicability extends naturally to relations σ(R) (i.e., sets of tuples) Similarly, Σ-repairs can be applied to wrappers as well [details in the paper] Output: a wrapper W’ and a relation R’ | W’(P)=R’ and R’ is of maximum fitness w.r.t. Σ The goal is to find the Σ-repair that maximises the fitness
  • 11. Computing Σ-repairs Complexity [details in the paper]: 1. non atomic misplacements: NP-complete (red. from Weighted Set Packing) 2. atomic misplacements: polynomial (red. from Stars and Buckets) We have an atomic misplacement when the correct value for an attribute is: 1. entirely misplaced, or 2. if it is over-segmented, the fragments are in adjacent fields in the relation MAKE MODEL PRICE £22k Ford C-max Titanium X MAKE MODEL PRICE C-max £22k X Ford Titanium atomic misplacement non atomic misplacement Naïve Algorithm: For each tuple… 1. permute tuples in all possible ways (only if non-atomic misplacements) 2. segment tuples in all possible ways 3. ask the oracles and keep the segmentation of highest fitness
  • 12. Approximating Σ-repairs The naïve algorithm has the following problems: (1) oracles do not (always) exist; (2) it fixes one tuple at a time, while the wrapper needs a single fix for each attribute; (3) even under the assumption of atomic misplacements, we still have to try O(nᵏ) different segmentations (worst case) before finding the one of maximum fitness.
(1) Weak oracles: use noisy NERs in place of oracles; if none is available, it is easy to build one. In this work we use ROSeAnn (Chen et al., PVLDB '13).
(2 and 3) Approximate relation-wide repairs: wrappers are programs; if they make a mistake, they make it consistently, so there is hope for a common underlying attribute structure.
  • 13. Finding the right structure We have to solve two problems: find the underlying structure(s) of the relation, and find a segmentation that maximises the fitness. An obvious approach is sequence labelling (e.g., Markov chains + Viterbi), where the oracles are simulated by NERs (so they can make mistakes).
[Figure: a memoryless Markov chain over Ω = {ωA, ωB, ωC, ωD}, built from the annotation sequences ⟨a b c⟩ ×3, ⟨a d⟩ ×2, ⟨a d b⟩ ×2.]
The maximum-likelihood sequence is actually ⟨A, D⟩, which "fits" ~28%; it looks like there is another sequence that fits better…
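A toy sketch of the memoryless encoding, using illustrative counts rather than the slide's exact figure; a greedy walk over transition counts stands in for the most-likely-path computation and shows how memorylessness can prefer a short sequence over one that satisfies more oracles:

```python
# Toy sketch of the memoryless Markov-chain encoding; counts are
# illustrative, not the slide's exact figure.
from collections import Counter

records = [["A", "B", "C"]] * 3 + [["A", "D"]] * 3 + [["A", "D", "B"]]

trans = Counter()                      # transition counts between annotations
for seq in records:
    path = ["SOURCE"] + seq + ["SINK"]
    trans.update(zip(path, path[1:]))

def greedy_sequence(trans):
    """Greedy stand-in for the most-likely walk: follow the heaviest edge."""
    seq, cur = [], "SOURCE"
    while cur != "SINK":
        cur = max((t for (s, t) in trans if s == cur),
                  key=lambda t: trans[(cur, t)])
        if cur != "SINK":
            seq.append(cur)
    return seq

# Prefers <A, D>, although <A, B, C> would satisfy more oracles per record.
print(greedy_sequence(trans))          # ['A', 'D']
```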
  • 14. Finding the right structure The problem is that Markov chains are memoryless: we have to remember the context and make sure that our sequence satisfies the oracles more than any other. OK… this sounds like a max flow!
[Figure: flow network whose nodes are (attribute, context) pairs, e.g., vA,(), vB,(A), vC,(A,B), vD,(A), with capacities derived from the annotation counts of the previous slide.]
The sequence corresponding to the max flow is ⟨A, B, C⟩, which "fits" ~32%.
  • 15. Finding the right structure First, annotate the relation using the NERs (surrogate oracles) and build the network; then iteratively compute max flows on the network, i.e., likely sequences of high fitness. We stop when we have covered "enough" of the tuples in the relation.
Example (Ω = {ωMAKE, ωMODEL, ωPRICE}): £19k Make: Audi Model:A3 Sportback / £43k Make: Audi Model: A6 Allroad quattro / Citroën £10k Model: C3 / Ford £22k Model: C-max Titanium X.
[Figure: iteration 0 extracts the sequence ⟨PRICE, MAKE, MODEL⟩ with flow 6/6; iteration 1 extracts ⟨MAKE, PRICE, MODEL⟩ with flow 3/3.] A runnable sketch follows below.
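A minimal sketch of the context-annotated network for one such computation, assuming the networkx library (the actual implementation is Java + SQL); node labels and capacities are illustrative, loosely following the slide's two iterations:

```python
# Context-annotated flow network; nodes are (attribute, context) pairs and
# capacities count the records supporting each transition. Assumes networkx.
import networkx as nx

G = nx.DiGraph()
edges = [
    ("SOURCE",                    ("PRICE", ()),               6),
    (("PRICE", ()),               ("MAKE",  ("PRICE",)),       6),
    (("MAKE",  ("PRICE",)),       ("MODEL", ("PRICE", "MAKE")), 6),
    (("MODEL", ("PRICE", "MAKE")), "SINK",                     6),
    ("SOURCE",                    ("MAKE",  ()),               3),
    (("MAKE",  ()),               ("PRICE", ("MAKE",)),        3),
    (("PRICE", ("MAKE",)),        ("MODEL", ("MAKE", "PRICE")), 3),
    (("MODEL", ("MAKE", "PRICE")), "SINK",                     3),
]
for u, v, c in edges:
    G.add_edge(u, v, capacity=c)

flow_value, flow = nx.maximum_flow(G, "SOURCE", "SINK")
print(flow_value)  # 9: the 6-unit path <PRICE, MAKE, MODEL> plus the 3-unit one
# Per iteration, the structure finding would take the saturated path of
# highest flow (<PRICE, MAKE, MODEL> here), remove it, and repeat.
```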
  • 16. Fixing the relation (and the wrapper) Max flows represent likely sequences; we use them to eliminate unsound annotations. The remaining annotations can then be used as examples for regex induction: standard regex-induction algorithms yield robust expressions, and the induced expressions recover missing (incomplete) annotations.
Example annotations on "£19k Make: Audi Model: A3 Sportback": PRICE [0,4), MAKE [11,15), MODEL [24,36).
Relation: £19k Make: Audi Model: A3 Sportback / £43k Make: Audi Model: A6 Allroad quattro / Citroën £10k Model: C3 / Ford £22k Model: C-max Titanium X.
Induced repair (a sketch of the induction step follows below):
ρ = { ⟨MAKE, substring-before($, £) or substring-before(substring-after($, 'ke:␣'), '␣Mo')⟩,
⟨MODEL, substring-after($, 'el:␣')⟩,
⟨PRICE, substring-after(substring-before($, 'kMa␣' || 'kMo␣'), ␣)⟩ }
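A sketch of prefix/suffix induction from clean annotation spans; the 6-character context window and helper names are illustrative, not WADaR's actual induction (which also uses non-content strings and character length):

```python
# Induce a regex from (record, start, end) spans by finding the longest
# common context before and after the annotated values.
import os, re

def induce(examples):
    """examples: (record_string, start, end) spans for one attribute."""
    befores = [rec[:s][-6:] for rec, s, e in examples]   # context before value
    afters  = [rec[e:][:6]  for rec, s, e in examples]   # context after value
    pre = os.path.commonprefix([b[::-1] for b in befores])[::-1]  # common suffix
    suf = os.path.commonprefix(afters)
    return re.compile(re.escape(pre) + r"(.*?)" + (re.escape(suf) or "$"))

ex = [("£19k Make: Audi Model: A3 Sportback", 11, 15),
      ("£43k Make: Audi Model: A6 Allroad",  11, 15)]
rx = induce(ex)   # matches 'Make: (.*?) Model', akin to substring-after/-before
print(rx.search("£10k Make: Citroën Model: C3").group(1))   # Citroën
```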
  • 17. Approximating Σ-repairs When an induced expression fails to match a minimum number of tuples, we fall back to the NERs: value-based expressions, i.e., disjunctions of the annotated values.
Relation (MAKE | MODEL | PRICE): £19k Audi A3 Sportback / £43k Audi A6 Allroad quattro / Citroën £10k C3 / Ford £22k C-max Titanium X.
Example (induction threshold 75%):
ρ = { ⟨MAKE, value-based($, [Audi, Ford])⟩, ⟨MODEL, substring-after(substring-after($, ␣), ␣)⟩, ⟨PRICE, substring-after(substring-before($, k␣), ␣)⟩ }
Example (induction threshold 20%):
ρ = { ⟨MAKE, substring-before($, £) or substring-before(substring-after($, k␣), ␣)⟩, ⟨MODEL, substring-after(substring-after($, ␣), ␣)⟩, ⟨PRICE, substring-after(substring-before($, k␣), ␣)⟩ }
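A minimal sketch of the value-based fallback, assuming the annotated values are simply joined into a disjunction (the function name and data are illustrative):

```python
# Value-based fallback: a disjunction of NER-accepted values. Inherits the
# annotators' errors, hence the robustness study on the next slide.
import re

def value_based(annotated_values):
    alts = sorted(set(annotated_values), key=len, reverse=True)  # longest first
    return re.compile("|".join(re.escape(v) for v in alts))

make = value_based(["Audi", "Ford"])             # NER-accepted MAKE values
print(make.search("£10k Citroën C3"))            # None: unseen makes are missed
print(make.search("£22k Ford C-max").group())    # Ford
```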
  • 18. Evaluation Dataset: an enhanced version of the SWDE dataset (https://swde.codeplex.com): 10 domains, 100 websites, 78 attributes, ~100k pages, ~130k records. Systems: four wrapper-generation systems (DIADEM, DEPTA, ViNTs, RoadRunner); baseline wrapper induction/repair system: WEIR (Crescenzi et al., VLDB '13). Implementation: WADaR (Wrapper and Data Repair), Java + SQL.
  • 19. Evaluation: Highlights
[Fig. 2: Impact of repair. Precision, Recall, and F-Score before (Original) and after (Repaired) the repair, per system and domain: ViNTs (RE/Auto), DIADEM (RE/Auto), DEPTA (RE/Auto), and RoadRunner (Auto, Book, Camera, Job, Movie, Nba, Restaurant, University).]
WADaR increases the F1-score between 15% and 60% (excluding ViNTs); […] to 30% in real estate, with an identical effect in almost all domains.
Attribute-level accuracy. Another question is whether there are substantial differences in attribute-level accuracy. The top of Table III shows attributes where the repair is very effective (F1-score ≈ 1 after repair). These values appear as highly structured attributes on web pages, and the corresponding expressions repair almost all tuples. As an example, DOOR NUMBER is almost always followed by the suffixes dr or door. In these cases, the wrapper induction under-segmented the text due to a lack of sufficient examples.
Table III: Attribute-level evaluation (excerpt).
System | Domain      | Attribute | Original F1-Score | Repaired F1-Score
DIADEM | real estate | POSTCODE  | 0.304             | 0.947
DIADEM | auto        | DOOR      | 0                 | 0.984
Table IV shows Precision and Recall computed on the sample. To estimate the impact of the repair, we computed, for each attribute, the percentage of values that differ before and after the repair step (last column of Table IV); the repair is beneficial in all of the cases. For OPENING HOURS and LOCATED WITHIN, where recall is very […]
Table IV: Accuracy of the large-scale evaluation.
Attribute      | Precision | Recall | % Modified values
LOCALITY       | 0.993     | 0.993  | 11.34%
OPENING HOURS  | 1.00      | 0.461  | 17.14%
LOCATED WITHIN | 1.00      | 0.224  | 29.75%
PHONE          | 0.987     | 0.849  | 50.74%
POSTCODE       | 0.999     | 0.989  | 9.4%
STREET ADDRESS | 0.983     | 0.98   | 83.78%
[Figure: per-domain comparison with WEIR (Precision, Recall, F-Score).]
WADaR is 23% more accurate than WEIR on average.
  • 20. Evaluation: Robustness We studied how the F1-score varies w.r.t. annotation noise. The accuracy numbers are limited to those attributes where our approach induces regular expressions, since it is already clear that annotator errors directly reduce the accuracy of value-based expressions. This is still a significant number of attributes, i.e., 65% in all cases except for RoadRunner on book (35%) and RoadRunner on movie (46%).
[Fig. 8: Annotator recall drop, fixed induction threshold of 75% (high dependence on annotation quality).] Figure 8 shows the impact of a drop in annotator recall (x-axis) on the F1-score. Our approach is robust to a drop in recall until we reach an 80% loss; then the performance rapidly decays. This is somewhat expected, since the regular expressions compensate for the missing recall up to the point where the max-flow sequences are no longer able to determine the underlying attribute structure reliably.
[Fig. 9: F1-score variation with a threshold value of 0.1 (low dependence on annotation quality).] Figure 9 shows the effect on the F1-score of a low regex-induction threshold (i.e., 0.1) instead. In this case our approach is highly robust to annotator inaccuracy, and we notice a loss in performance only after an 80-90% loss in recall. In summary, a lower regex-induction threshold is advisable when we know that the annotators have low recall; even with a very inaccurate annotator, our approach remains robust.
The F1-score starts being affected at a recall loss of ~80%; precision loss does not affect WADaR until ~300% (random noise).
  • 21. Evaluation: Scalability Worst-case scenario: all tuples are annotated with all attribute types, i.e., each record contains k annotated tokens, each annotation has a different context, and each record produces a different path on the network. This results in a network with n·k + 2 nodes and n·k + n edges (a worked check follows below). WADaR scales linearly w.r.t. the size of the relation and polynomially w.r.t. the number of attributes; the largest network obtained […] 45,797 edges […] 3 seconds.
[Fig. 3: Running time over an increasing number of records (left, number of attributes fixed) and over an increasing number of attributes (right).]
Oracles decouple the problem of finding similar instances from the segmentation.
Example (Ω = {ωMAKE, ωMODEL, ωPRICE}): £19k Audi A3 Sportback / £43k Audi A6 Allroad quattro / £10k Citroën C3 / £22k Ford C-max Titanium X.
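A worked check of the worst-case network size stated above, with a hypothetical relation size:

```python
# Worst-case network size: n records, k annotated tokens per record,
# every record a distinct path through the network.
n, k = 10_000, 4          # hypothetical relation size
nodes = n * k + 2         # one node per annotated token, plus SOURCE and SINK
edges = n * k + n         # k+1 edges per record: k-1 internal + SOURCE + SINK
print(nodes, edges)       # 40002 50000
```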
  • 22. Open issues Learning oracles: building oracles is not difficult, but it still requires engineering time; the IBM SystemT people did some good work in this direction, and we can start there. Missing attributes: right now, if the wrapper fails to recover data, we cannot repair it; it is possible to manipulate the wrapper to match more content. Markov chains vs. max flows on wrapped relations: they seem to eventually compute the same sequences, but in a different order… proof? What we know is that max flows best approximate the maximum fitness at every step.
  • 23. Questions?
References:
L. Chen, S. Ortona, G. Orsi, M. Benedikt. Aggregating Semantic Annotators. PVLDB '13.
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche. Joint Repairs for Web Wrappers. ICDE '16.
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche. WADaR: Joint Wrapper and Data Repair. VLDB '15 (demo).
T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, C. Schallhart, C. Wang. DIADEM: Thousands of websites to a single database. PVLDB '15.
T. Furche, G. Gottlob, L. Libkin, G. Orsi, N. Paton. Data Wrangling for Big Data. EDBT '16.

[Poster] Joint Wrapper And Data Repair
Stefano Ortona (stefano.ortona@cs.ox.ac.uk), Giorgio Orsi (giorgio.orsi@cs.ox.ac.uk), Tim Furche (tim.furche@cs.ox.ac.uk), University of Oxford, UK; Marcello Buoncristiano (marcello.buoncrisitano@yahoo.it), Università della Basilicata, Italy. http://diadem.cs.ox.ac.uk/wadar
Web data extraction (aka scraping/wrapping) uses wrappers to turn web pages into structured data. A wrapper is a structure { ⟨R, ƐR⟩, { ⟨A1, ƐA1⟩, …, ⟨Am, ƐAm⟩ } } specifying the objects to be extracted (listings, records, attributes) and the corresponding XPath expressions, e.g., ⟨RATING, //li[@class='second']/p⟩ and ⟨RUNTIME, //li[@class='third']/ul/li[1]⟩. Wrappers are often created algorithmically and in large numbers; tools capable of maintaining them over time are missing. Algorithmically-created wrappers generate data that is far from perfect: data can be badly segmented and misplaced.
[Poster figure: running example. RoadRunner and DEPTA extract a two-column relation (Attribute_1, Attribute_2) whose second column bundles Director, Rating, and Runtime; the joint data and wrapper repair recovers the relation Title | Director | Rating | Runtime.]
Observations. Templated websites: data is published following a template. Wrapper behaviour: wrappers rarely misplace and over-segment at the same time, and they make systematic errors. Oracles: oracles can be implemented as (ensembles of) NERs; NERs are not perfect, i.e., they make mistakes.
Maximal repair is NP-complete. When values are both misplaced and over-segmented, computing repairs of maximal fitness is hard; otherwise, just do the following: (1) compute all possible k non-crossing partitions (k = |R|) of the tokens, i.e., assign to each attribute an element of the partition (O(nᵏ), a Narayana number); (2) discard tokens never accepted by the oracles in any of the partitions; (3) collapse identical partitions and choose the one with maximal fitness. Without misplacement and over-segmentation, a solution is found in polynomial time by computing non-crossing k-partitions. NP-hardness: reduction from Weighted Set Packing. Membership in NP: guess a partition, decide non-crossing, and compute fitness in PTIME.
[Poster figure: reduction illustration. Segmenting "Director: Steven Spielberg Rating: R Runtime: 195 min" into Director | Rating | Runtime.]
Repaired wrapper example: ⟨TITLE, ⟨1⟩, string($)⟩, ⟨DIRECTOR, ⟨2⟩, substring-before(substring-after($, 'tor:_'), '_Rat')⟩, ⟨RATING, ⟨2⟩, substring-before(substring-after($, 'ing:_'), '_Run')⟩, ⟨RUNTIME, ⟨2⟩, substring-after($, 'time:_')⟩.
Take a set Ω of oracles, where each ωA in Ω can say whether a value vA belongs to the domain of A. We define the fitness of a relation R w.r.t.
Ω as the average tuple fitness (the formula from slide 9). Repair: specifies regular expressions that, when applied to the original relation, produce a new relation with higher fitness. WADaR also repairs the wrapper itself, e.g., ⟨DIRECTOR, //li[@class='first']/div/span⟩.
Approximating joint repairs.
(1) Annotation. Each record is interpreted as a string (a concatenation of attributes), which the NERs analyse to identify relevant attributes. Entity recognisers make mistakes; WADaR tolerates incorrect and missing annotations. [Poster figure: noisy NER annotations over the running example, e.g., <Director: Steven>, <195 min>, <Director:><Steven Spielberg>, <Rating: R Runtime:195>, <min Director: Steven Spielberg>; the movie relation is annotated with Title, Director, Rating, and Runtime labels, including the extra row "The life of Jack Tarantino (coming soon) | Director: David R Lynch Rating: Not Rated Runtime: 123 min".]
(2) Segmentation. Goal: understand the underlying structure of the relation. Two possible ways of encoding the problem: (i) a max-flow sequence in a flow network; (ii) the most likely sequence in a memoryless Markov chain. The solutions often coincide; Markov chains are intuitive and faster to compute, while max flows are provably optimal. [Poster figure: both encodings yield the sequence DIRECTOR → RATING → RUNTIME.]
(3) Induction. Input: a set of clean annotations to be used as positive examples.
Empirical evaluation.
[Figure: Precision, Recall, and F-Score per system and domain, before (Original) and after (Repaired) the repair.]
5.1 Setting. Datasets. The dataset consists of 100 websites from 10 domains and is an enhanced version of SWDE [20], a benchmark commonly used in web data extraction. SWDE's data is sourced from 80 sites and 8 domains: auto, book, camera, job, movie, NBA player, restaurant, and university. For each website, SWDE provides collections of 400 to 2k detail pages (i.e., where each page corresponds to a single record). We complemented SWDE with collections of listing pages (i.e., pages with multiple records) from 20 websites of the real estate (RE) and auto domains. Table 1 summarises the characteristics of the dataset.
Table 1: Dataset characteristics.
Domain      | Type    | Sites | Pages   | Records | Attributes
Real Estate | listing | 10    | 271     | 3,286   | 15
Auto        | listing | 10    | 153     | 1,749   | 27
Auto        | detail  | 10    | 17,923  | 17,923  | 4
Book        | detail  | 10    | 20,000  | 20,000  | 5
Camera      | detail  | 10    | 5,258   | 5,258   | 3
Job         | detail  | 10    | 20,000  | 20,000  | 4
Movie       | detail  | 10    | 20,000  | 20,000  | 4
Nba Player  | detail  | 10    | 4,405   | 4,405   | 4
Restaurant  | detail  | 10    | 20,000  | 20,000  | 4
University  | detail  | 10    | 16,705  | 16,705  | 4
Total       | -       | 100   | 124,715 | 129,326 | 78
SWDE comes with ground-truth data created under the assumption that wrapper-generation systems could only generate extraction rules with DOM-element granularity, i.e., without segmenting text nodes. Modern wrapper-generation systems support text-node segmentation, and we therefore refined the ground truth accordingly. As an example, in the camera domain, the original ground-truth values for MODEL consisted of the entire product title; the text node includes, other than the model, COLOR, PIXELS, and MANUFACTURER. The ground truth for the real estate and auto domains has been created following the SWDE format. The final dataset consists of more than 120k pages, for almost 130k records containing more than 500k attribute values.
Wrapper-generation systems.
We generated input relations for our evaluation using four wrapper-generation systems: DIADEM [19], DEPTA [36], and ViNTs [39] for listing pages, and RoadRunner [12] for detail pages (RoadRunner can be configured for listings, but it performs better on detail pages). The output of DIADEM, DEPTA, and RoadRunner can be readily used in the evaluation, since these are full-fledged data-extraction systems supporting the segmentation of both records and attributes within listings or (sets of) detail pages. ViNTs, on the other hand, segments rows into records within a search-result listing and, as such, has no concept of attribute; instead, it segments rows within a record. We therefore post-processed its output, typing the content of lines from different records that are likely to have the same semantics, using a naïve heuristic similarity based on relative position in the record and the string-edit distance of the row's content. This is a very simple version of more advanced alignment methods based on instance-level redundancy used by, e.g., WEIR and TEGRA [7].
Metrics. The performance of the repair is evaluated by comparing wrapper-generated relations against the SWDE ground truth before and after the repair. The metrics used for the evaluation are Precision, Recall, and F1-Score computed at attribute level. Both the ground truth and the extracted values are normalised, and exact matching between the extracted values and the ground truth is required for a hit. For space reasons, in this paper we only present the most relevant results; the results of the full evaluation, together with the dataset, gold standard, extracted relations, and the code of the normaliser and of the scorer, are available at the online appendix [1]. All experiments are run on a desktop with an Intel quad-core i7 at 3.40 GHz, 16 GB RAM, and Linux Mint OS 17.
5.2 Repair performance. Relation-level accuracy. The first two questions we want to answer are whether joint repairs are necessary and what their impact is in terms of quality. Table 2 reports, for each system, the percentage of: (i) correctly extracted values; (ii) under-segmentations, i.e., when values for an attribute are extracted together with values of other attributes or spurious content; indeed, websites often publish multiple attribute values within the same text node, and the involved extraction systems are not able to split values into multiple attributes; (iii) over-segmentations, i.e., when attribute values are split over multiple fields; as anticipated in Section 2, this rarely happens, since an attribute value is often contained in a single text node, and in this setting a value can be over-segmented only if the extraction system is capable of splitting single text nodes (DIADEM), which happens only when the system identifies a strong regularity within the text node; (iv) misplacements, i.e., values placed or labelled as the wrong attribute, mostly due to a lack of semantic knowledge and to confusion introduced by overlapping attribute domains; (v) missing values, due to a lack of regularity and optionality in the web source (RoadRunner, DEPTA, ViNTs) or to missing values from the domain knowledge (DIADEM). Note that the numbers do not add up to 100%, since errors may fall into multiple categories.
Table 2: Wrapper-generation system errors.
System     | Correct (%) | Under-segmented (%) | Over-segmented (%) | Misplaced (%) | Missing (%)
DIADEM     | 60.9        | 34.6                | 0                  | 23.2          | 3.5
DEPTA      | 49.7        | 44                  | 0                  | 25.3          | 6
ViNTs      | 23.9        | 60.8                | 0                  | 36.4          | 15.2
RoadRunner | 46.3        | 42.8                | 0                  | 18.6          | 10.4
These numbers clearly show that there is a quality problem in wrapper-generated relations, and they also support the atomic-misplacement assumption. Figure 2 shows, for each system and each domain, the impact of the joint repair on our metrics; light (resp. dark) coloured bars denote the quality of the relation before (resp. after) the repair. A first conclusion that can be drawn is that a repair is always beneficial. Of 697 extracted attributes, 588 (84.4%) require some form of repair, and the average pre-repair F1-Score produced by the systems is 50%. We are able to induce a correct regular expression for 335 (57%) attributes, while for the remaining 253 (43%) the approach produces value-based expressions. We can repair at least one attribute in each of the wrappers in all of the cases, and we can repair more than 75% of attributes in more than 80% of the cases.
Among the considered systems, DIADEM delivers, on average, the highest pre-repair F1-Score (~60%), but it never exceeds 65%. RoadRunner is on average worse than DIADEM, but it reaches a better 70% F1-Score on restaurant; websites in this domain are in fact highly structured, and individual attribute values are contained in a dedicated text node. When attributes are less structured, e.g., on book, camera, and movie, RoadRunner has a significant drop in performance. As expected, ViNTs delivers the worst pre-cleaning results. In terms of accuracy, our approach delivers a boost in F1-Score between 15% and 60%; performance is consistently close to or above 80% across domains and, except for ViNTs, across systems, with a peak of 91% for RoadRunner on NBA player. The remaining causes of errors include: (i) missing values, which cannot be repaired, as we can only use the data available in […]
[Poster figure: WADaR against WEIR, per-domain Precision, Recall, and F-Score.]
Evaluation summary: 100 websites, 10 domains, 4 wrapper-generation systems; Precision, Recall, and F1-Score computed before and after repair, considering exact matches. WADaR boosts the F1-Score between 15% and 60%, with performance consistently close to or above 80%. WADaR scales linearly with the size of the input relation, and optimal joint-repair approximations are computed in polynomial time. More questions? Come to the poster later!!!