Wrapper Generation Supervised by a Noisy Crowd

Valter Crescenzi, Paolo Merialdo, Disheng Qiu
Dipartimento di Ingegneria
Università degli Studi Roma Tre
Via della Vasca Navale, 79, Rome
disheng@dia.uniroma3.it

Extracting Data
2M pages from IMDB; we want to extract titles, directors, etc.
[Diagram labels: Inference algorithm, Wrapper, DB]

Wrapper as XPath
r1 = /html/table/tr[1]/td/text()
r2 = //*[contains(.,"Ratings:")]/../../tr[1]/td/text()
r3 = //*[contains(.,"Director:")]/../../tr[1]/td/text()
....
To generate wrappers:
• From a single annotated page, the system generates a pool of XPath rules
• All the rules are correct on the annotated page
• Some of the rules do not work correctly on all the target pages
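A minimal sketch of this situation, under stated assumptions: the two toy pages below are hypothetical, and the rules are rewritten in the limited XPath subset of Python's standard xml.etree.ElementTree (which has no contains() or text()), so they only approximate r1 and r3. Both rules agree on the annotated page, but only one of them still works on a page that lacks the "Director:" row.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: PAGE0 plays the annotated page, PAGE2 has no "Director:" row.
PAGE0 = ("<html><table><tr><td>Spirited Away</td></tr>"
         "<tr><td>Director:</td><td>Hayao Miyazaki</td></tr></table></html>")
PAGE2 = "<html><table><tr><td>Howl's Moving Castle</td></tr></table></html>"

# Approximations of r1 and r3 in ElementTree's XPath subset (assumption).
R1 = "./table/tr[1]/td"
R3 = ".//td[.='Director:']/../../tr[1]/td"


def extract(rule, page):
    """Apply one rule to one page; return the extracted text or None."""
    root = ET.fromstring(page)
    node = root.find(rule)
    return None if node is None else node.text


if __name__ == "__main__":
    for name, page in (("page0", PAGE0), ("page2", PAGE2)):
        print(name, extract(R1, page), extract(R3, page))
```

On the annotated page both rules return the title; on the second page the anchor-based rule finds no "Director:" cell and returns nothing, which mirrors the null value produced by r3 in the slides.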

Wrapper as XPath
      Single Page     Other pages
      page0           page1         page2                  ..
r1    Spirited Away   City of God   Howl’s Moving Castle   ..
r2    Spirited Away   -             9.3                    ..
r3    Spirited Away   City of God   null                   ..
....
Which one is correct?

Extracting Data
               Scalability   Accuracy   Coverage
Supervised     NO            OK         High
Unsupervised   OK            NO         High
Sup.+Annot.    OK            OK         Low

Crowdsourcing
An opportunity to scale supervised approaches

Scaling Wrapper Inference
Scaling out with crowdsourcing platforms opens new challenges:
Issues:              Contributions:
Non-expert workers   • Simple interactions
                     • Membership Query (yes/no answer)
                     • Redundant tasks and worker error rate estimation
Costs                • Active Learning*
                     • Dynamically engaging workers
Quality              • Quality Model
                     • Sampling algorithm*
*[Crescenzi WWW2013]

Inference Algorithm
      page0           page1         page2
r1    Spirited Away   City of God   Howl’s Moving Castle
r2    Spirited Away   -             9.3
r3    Spirited Away   City of God   null
....
[Diagram labels: First annotation, Sample, Yes/No queries, Worker’s answers, Quality Model: P(r1)]
For each new answer:
• Rules compatible with the answer become more likely to be correct
• If no rule is good enough, a new query is selected (Active Learning)*
*[Crescenzi WWW2013]
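The update "rules compatible with the answer become more likely" can be sketched as a Bayesian reweighting. This is an illustrative assumption, not necessarily the authors' exact quality model: each yes/no answer is taken to be correct with probability 1 - η, and the function name `update` and its arguments are hypothetical.

```python
def update(prior, extracted, queried, answer_yes, eta):
    """Reweight rule probabilities after one membership query.

    prior:      dict rule -> current probability
    extracted:  dict rule -> value the rule extracts on the queried page
    queried:    value shown to the worker
    answer_yes: True if the worker answered "yes, this value is correct"
    eta:        assumed worker error rate, with 0 < eta < 1
    """
    posterior = {}
    for rule, p in prior.items():
        # A rule agrees with the answer when "rule extracts the queried value"
        # matches the worker's yes/no reply.
        agrees = (extracted[rule] == queried) == answer_yes
        posterior[rule] = p * ((1 - eta) if agrees else eta)
    total = sum(posterior.values())
    return {rule: p / total for rule, p in posterior.items()}


# Values of the three rules on page1, taken from the table above.
PAGE1 = {"r1": "City of God", "r2": "-", "r3": "City of God"}
UNIFORM = {"r1": 1 / 3, "r2": 1 / 3, "r3": 1 / 3}

if __name__ == "__main__":
    # The worker is asked whether "-" is correct on page1 and answers "no".
    print(update(UNIFORM, PAGE1, "-", False, eta=0.1))
```

With a uniform prior and η = 0.1, a "no" on the value "-" lowers r2 to about 0.05 and raises r1 and r3 to about 0.47 each; a further query on a page where r1 and r3 disagree is needed to separate them.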

Termination Strategies
Different termination strategies:
• HALT_r (quality): expected quality of the wrapper (probability of correctness)
• HALT_MQ (costs): number of MQ used
• HALT_H (trade-off quality/costs): uncertainty of the questioned value
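The three stopping conditions could be sketched as simple predicates. The thresholds are hypothetical, and reading the "uncertainty of the questioned value" as a binary entropy is an assumption made for illustration only.

```python
import math


def halt_r(rule_probs, threshold=0.95):
    """Stop when the most likely rule is probable enough (threshold is hypothetical)."""
    return max(rule_probs.values()) >= threshold


def halt_mq(num_queries, budget=10):
    """Stop when the budget of membership queries is used up (budget is hypothetical)."""
    return num_queries >= budget


def halt_h(p_value_correct, max_entropy=0.3):
    """Stop when the entropy (in bits) of the queried value's correctness is low.

    p_value_correct is the estimated probability that the value that would be
    queried next is correct; interpreting H this way is an assumption.
    """
    entropy = 0.0
    for p in (p_value_correct, 1.0 - p_value_correct):
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy <= max_entropy
```

For instance, a value that is correct with probability 0.5 carries one full bit of uncertainty and would trigger another query, while a value at 0.99 carries well under 0.3 bits and would stop the loop.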

Multiple Workers
Workers can make mistakes
We engage multiple workers on the same task, but how many?
• Too many workers: waste of money
• Not enough workers: quality loss
We apply our quality model at runtime to:
• Estimate the workers’ error rates
• Select the right number of redundant tasks

Dynamically Engaging Workers
Algorithm main steps:
• Start with a minimal amount of redundancy
• Collect the workers’ answers
• Estimate rule quality and workers’ error rates, using:
  • the workers’ error rates to estimate rule quality
  • the rule quality to estimate the workers’ error rates
• If no rule is good enough, a new worker is engaged
[Diagram labels: Workers’ answers, Most likely rule, Error rate estimation, Is it good enough?]
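The mutual estimation in the steps above can be sketched as an alternating loop. This is an illustrative simplification, not the authors' algorithm: the answers are hypothetical, the error rates are clamped to an arbitrary range to avoid degenerate values, and a fixed number of rounds stands in for a convergence test.

```python
# Values extracted by each rule on two pages (from the earlier example table).
RULES = {
    "r1": {"page1": "City of God", "page2": "Howl's Moving Castle"},
    "r2": {"page1": "-", "page2": "9.3"},
    "r3": {"page1": "City of God", "page2": None},
}

# Hypothetical answers: (worker, page, queried value, answered yes?).
ANSWERS = [
    ("w1", "page1", "City of God", True),
    ("w1", "page2", "Howl's Moving Castle", True),
    ("w1", "page2", "9.3", False),
    ("w2", "page1", "City of God", False),  # disagrees with w1
    ("w2", "page2", "Howl's Moving Castle", True),
    ("w2", "page2", "9.3", False),
]


def agrees(rule, page, value, yes):
    """True if the answer is what a perfect worker would say were the rule correct."""
    return (RULES[rule][page] == value) == yes


def rule_posterior(eta):
    """Use the workers' error rates to estimate rule quality (uniform prior)."""
    scores = {}
    for rule in RULES:
        p = 1.0
        for worker, page, value, yes in ANSWERS:
            p *= (1 - eta[worker]) if agrees(rule, page, value, yes) else eta[worker]
        scores[rule] = p
    total = sum(scores.values())
    return {rule: s / total for rule, s in scores.items()}


def worker_error_rates(posterior, floor=0.01, cap=0.49):
    """Use rule quality to estimate each worker's expected fraction of wrong answers."""
    wrong, count = {}, {}
    for worker, page, value, yes in ANSWERS:
        p_wrong = sum(p for rule, p in posterior.items()
                      if not agrees(rule, page, value, yes))
        wrong[worker] = wrong.get(worker, 0.0) + p_wrong
        count[worker] = count.get(worker, 0) + 1
    return {w: min(cap, max(floor, wrong[w] / count[w])) for w in count}


def estimate(initial_eta=0.1, rounds=20):
    eta = {worker: initial_eta for worker, _, _, _ in ANSWERS}
    for _ in range(rounds):
        eta = worker_error_rates(rule_posterior(eta))
    return rule_posterior(eta), eta


if __name__ == "__main__":
    print(estimate())
```

On these toy answers the loop settles on r1 as the most likely rule and assigns the second worker a clearly higher error rate than the first. A real system would then check the termination condition and, if it is not met, engage another worker.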

• Two real workers are engaged
• A new sequence is defined considering the union of all the answers
η = expected error rate

First worker
Answers   “Spirited Away”   “-”    “9.3”
η         0.1               0.1    0.1
          Yes               No     No

Second worker
Answers   “Spirited Away”   “City of God”   “9.3”
η         0.1               0.1             0.1
          Yes               No              No

Union of the answers
Answers   “Spirited Away”   “City of God”   “9.3”   “Spirited Away”   “-”    “9.3”
η         0.1               0.1             0.1     0.1               0.1    0.1
          Yes               No              No      Yes               No     No

• The most likely rule and its values are returned
• The most likely rule and its probability are used to estimate η
Values under query: “Spirited Away” (page0), “-” (page1), “9.3” (page2); both sequences of answers are Yes, No, No
e.g.: P(r1) = 0.9
                           P(r1)   η (first sequence)   η (second sequence)
Initial estimate           0.9     0.1                  0.1
After re-estimating η      0.9     0.1                  0.37
After re-estimating P(r1)  0.93    0.1                  0.37

• When the computation converges, the system checks the termination condition
• If it is not met, a new worker is considered and the computation starts again
At convergence: P(r1) = 0.95, with η = 0.05 for the first sequence of answers and η = 0.35 for the second

Experiments - Dataset
Site               Entity         |Pages|
www.imdb.com       Actor          500k
www.imdb.com       Movies         500k
www.allmusic.com   Band           500k
www.allmusic.com   Albums         500k
www.nasdaq.com     Stock Quotes   7k
40 attributes
manually crafted golden rules
Measures:
• Costs: #MQ
• Quality: Precision, Recall and F-measure
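As a reminder of how the quality measures are computed, here is a small sketch that scores the values extracted by one rule against those of a golden rule. Treating r1 as the golden rule and scoring r3 against it is a hypothetical choice made only for this example.

```python
def precision_recall_f(extracted, golden):
    """Score extracted values against golden values.

    Both arguments map page -> value; None means "nothing extracted/expected".
    """
    n_extracted = sum(1 for v in extracted.values() if v is not None)
    n_expected = sum(1 for v in golden.values() if v is not None)
    correct = sum(1 for page, v in extracted.items()
                  if v is not None and golden.get(page) == v)
    precision = correct / n_extracted if n_extracted else 0.0
    recall = correct / n_expected if n_expected else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical golden values (those of r1) and the values extracted by r3.
GOLDEN = {"page0": "Spirited Away", "page1": "City of God",
          "page2": "Howl's Moving Castle"}
R3_VALUES = {"page0": "Spirited Away", "page1": "City of God", "page2": None}

if __name__ == "__main__":
    print(precision_recall_f(R3_VALUES, GOLDEN))
```

Here r3 extracts two values, both correct, out of three expected ones, so its precision is 1, its recall is 2/3 and its F-measure is 0.8.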

Simulating Real Workers
100 Real (and noisy) AMT workers
[Figure: distribution of the workers’ error rates; x axis: error rate, 0.00 to 0.50; y axis: 0% to 40%]
Real workers:
• 1/3 perfect
• Average η* = 10%
• σ_η* = 11%
We simulated the error rate distribution with an exponential function
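One way to produce synthetic workers with such a distribution is to draw each error rate from an exponential distribution whose mean matches the observed average of 10%. The cap at 0.5, the seed and the function names are assumptions of this sketch; the slides do not give the exact parameters used.

```python
import random


def simulate_error_rates(n_workers=100, mean_eta=0.10, cap=0.5, seed=0):
    """Draw one error rate per synthetic worker from an exponential distribution.

    The cap keeps every worker better than a coin flip (an assumption).
    """
    rng = random.Random(seed)
    return [min(cap, rng.expovariate(1.0 / mean_eta)) for _ in range(n_workers)]


def simulated_answer(truth, eta, rng):
    """Return the true yes/no answer, flipped with probability eta."""
    return (not truth) if rng.random() < eta else truth


if __name__ == "__main__":
    rates = simulate_error_rates()
    print(sum(rates) / len(rates))
```

Each synthetic worker then answers every membership query through `simulated_answer`, so a worker with η = 0 never lies and one with η close to 0.5 is almost uninformative.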

Wrong Estimation
Noisy single worker:
- η: expected error rate
- η*: observed error rate
      η* > η (optimistic)   η* = η (correct)   η* < η (pessimistic)
MQ    ~10                   ~10                ~30
F     ~0.65                 ~1                 ~1
• η close to η* (good estimation): few MQ, good F
• η* > η (too optimistic): too few MQ, low F
• η > η* (too pessimistic): too many MQ, same F
Need to estimate the workers’ error rate

Synthetic (and noisy) workers (|W| = # workers)
Algorithm           F      σF    #MQ     max-MQ   max-|W|   |η-η*|
ALFη (one worker)   0.92   17%   7.58    11       1         -
ALFREDno            1      1%    18.6    83       9         -
ALFRED              1      1%    16.1    44       4         0.8%
ALFRED*             1      1%    16.07   40       4         0%
Observations:
• lower quality, fewer MQ
• almost perfect wrapper
• correct estimation required
• accurate estimation, but achieved only at the end
[Figure: distribution of |W| (2, 3, 4) in percent, axis from 0% to 100%; bar values 2%, 6%, 92%]

Conclusions
We proposed a framework for wrapper generation:
• simple tasks can be completed by non-expert workers
• cost-effective wrapper generation
• highly predictable quality of the output wrapper
Solid background in machine learning and computational learning theory*
The proposed framework can be applied to other learning tasks:
• Crawling
• NLP
*[Angluin-Laird1988, Angluin2001]

Thank you for your attention!

Future development
• Learning framework applied to other problems (NLP, Entity Linkage)
• ALFRED adopted to learn structure-driven crawling algorithms
• Hybrid approaches combining human and automatic annotations
• Alternative models of truth/error rate
• Optimizing the initial number of workers

Wrong Estimation
- η = 0.1
- η* = from 0.05 to 0.4
[Figure, two plots over η* from 0.05 to 0.4: F (0.4 to 1) for HALT_r, HALT_H and HALT_MQ; MQ (4 to 14) for HALT_r and HALT_H]

Wrong Estimation
- η = from 0 to 0.4
- η* = 0.1
[Figure, two plots over η from 0 to 0.4: F (0.5 to 1) for HALT_r, HALT_H and HALT_MQ; MQ (axis ticks 3, 10, 100) for HALT_r and HALT_H]

Sampling & Quality
Pages make apparent the differences among the rules:
• page0 only: r1 = r2 = r3 (all three rules extract “Spirited Away”)
• page0 and page1: r1 = r3 ≠ r2 (on page1, r1 extracts “City of God” while r2 extracts “-”)
• page0, page1 and page2: r1 ≠ r3 ≠ r2 (on page2, r2 extracts “9.3”)
Find a small set that makes apparent the same differences observed in the whole set of pages

Sampling & Quality
The problem:
Find the smallest set that makes apparent the differences among the rules
(e.g., 100 pages that make apparent the same differences that we would observe in 2M pages).
The problem is NP-hard; it can be cast as a Set Cover problem:
find the smallest set of pages that covers all the groups of rules (group = equivalent rules).
The smallest set is not needed:
a greedy algorithm, O(|Pages|) in time and O(1) in space, works very well in practice.

Sampling & Quality
XPath rules
For every page p:
    if (p makes apparent new differences)
        representative pages += p
An offline algorithm that can be easily parallelized
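The greedy pass above can be sketched as follows. Representing the "differences" as the set of rule pairs that disagree on a page is an assumption of this sketch, and page3 is a hypothetical page added to show a page that contributes nothing new.

```python
from itertools import combinations


def differing_pairs(values):
    """Pairs of rules that extract different values on one page."""
    return {(a, b) for a, b in combinations(sorted(values), 2)
            if values[a] != values[b]}


def representative_pages(pages):
    """Single pass over the pages: keep a page only if it separates a new pair."""
    seen = set()      # rule pairs already told apart by a kept page
    sample = []
    for name, values in pages:
        new = differing_pairs(values) - seen
        if new:
            sample.append(name)
            seen |= new
    return sample


PAGES = [
    ("page0", {"r1": "Spirited Away", "r2": "Spirited Away", "r3": "Spirited Away"}),
    ("page1", {"r1": "City of God", "r2": "-", "r3": "City of God"}),
    ("page2", {"r1": "Howl's Moving Castle", "r2": "9.3", "r3": None}),
    ("page3", {"r1": "Ran", "r2": "-", "r3": "Ran"}),  # hypothetical extra page
]

if __name__ == "__main__":
    print(representative_pages(PAGES))
```

The memory used depends on the number of rules rather than on the number of pages, and because each page is examined independently, the per-page work could be distributed before merging the discovered differences.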

Sampling
Entity   Sampling         |Pages|   P      R
Movies   Biased           250       0.98   0.71
Movies   Random           250       0.99   0.99
Movies   Representative   42        1.00   1.00
Actors   Biased           250       1.00   1.00
Actors   Random           250       1.00   0.96
Stocks   Biased           86        1.00   0.98
Stocks   Random           86        1.00   0.99
Albums   Biased           258       1.00   0.99
Albums   Random           258       1.00   1.00
Bands    Biased           289       1.00   0.68
Bands    Random           289       1.00   1.00
• Representative: perfect
• Biased: recall loss
• Random: better than biased, but not perfect

Related Wrapper Generation
• Automatic Wrappers for Large Scale Web Extraction, Nilesh Dalvi et al., VLDB 2011
• DIADEM, T. Furche, G. Gottlob et al., WWW 2012
• Web Data Extraction Based on Partial Tree Alignment, Yanhong Zhai, WWW 2005
• Extracting Structured Data from Web Pages, Arvind Arasu, Hector Garcia-Molina, SIGMOD 2003
• RoadRunner, Crescenzi, VLDB 2001
• Wrapper Induction for Information Extraction, Kushmerick, IJCAI 1997
• Active Learning with Multiple Views, Ion Muslea, JAIR 2006
• Interactive Wrapper Generation with Minimal User Effort, Utku Irmak, WWW 2006
