We present solutions based on crowdsourcing platforms to support the large-scale production of accurate wrappers around data-intensive websites.
Our approach is based on supervised wrapper induction algorithms that delegate the burden of generating the training data to the workers of a crowdsourcing platform. Workers are paid for answering simple membership queries chosen by the system. We present two algorithms: a single-worker algorithm (ALF) and a multiple-worker algorithm (ALFRED). Both algorithms deal with the inherent uncertainty of the responses and use an active learning approach to select the most informative queries.
ALFRED estimates the workers’ error rates to decide at runtime how many workers are needed. The experiments that we conducted on real and synthetic data are encouraging: our approach is able to produce accurate wrappers at a low cost, even in the presence of workers with a significant error rate.
Wrapper Generation Supervised by a Noisy Crowd
1. Wrapper Generation Supervised by a Noisy Crowd
Valter Crescenzi, Paolo Merialdo, Disheng Qiu
Dipartimento di Ingegneria
Università degli Studi Roma Tre
Via della Vasca Navale, 79, Rome
disheng@dia.uniroma3.it
4. Extracting Data
2M pages from IMDB, and we want to extract titles, directors, etc.
Diagram labels: Inference algorithm; Wrapper; DB
8. Wrapper as XPath
To generate wrappers:
• From a single annotated page, the system generates a pool of XPath rules
• All XPath rules are correct solutions for the annotated page
• Some of the rules do not work correctly on all the target pages

r1 = /html/table/tr[1]/td/text()
r2 = //*[contains(.,"Ratings:")]/../../tr[1]/td/text()
r3 = //*[contains(.,"Director:")]/../../tr[1]/td/text()
....

Single page: page0. Other pages: page1, page2, ..

      page0           page1         page2                  ..
r1    Spirited Away   City of God   Howl’s Moving Castle   ..
r2    Spirited Away   -             9.3                    ..
r3    Spirited Away   City of God   null                   ..

Which one is correct?
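The disagreement among the candidate rules can be checked mechanically. The following is a minimal sketch, assuming the values extracted by each rule are already available as in the table above (no XPath engine is invoked); the names `EXTRACTED` and `pages_with_disagreement` are hypothetical and not part of the original system.

```python
# Values extracted by each candidate rule, copied from the slide's table.
# None stands for the slide's "null"; "-" is kept verbatim from the slide.
EXTRACTED = {
    "r1": ["Spirited Away", "City of God", "Howl's Moving Castle"],
    "r2": ["Spirited Away", "-", "9.3"],
    "r3": ["Spirited Away", "City of God", None],
}


def pages_with_disagreement(extracted):
    """Return the indexes of the pages on which the rules do not all agree."""
    n_pages = len(next(iter(extracted.values())))
    return [
        page
        for page in range(n_pages)
        if len({values[page] for values in extracted.values()}) > 1
    ]


disagreements = pages_with_disagreement(EXTRACTED)
print(disagreements)
```

On the annotated page (index 0) all rules agree, so only the other pages can tell them apart.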
11. Scaling Wrapper Inference
Scaling out with crowdsourcing platforms opens new challenges.
Issues:
• Non-expert workers
• Costs
• Quality
Contributions:
• Simple interactions
• Membership Query (yes/no answer)
• Redundant tasks and worker error rate estimation
• Active Learning*
• Dynamically engaging workers
• Quality Model
• Sampling algorithm*
*[Crescenzi WWW2013]
16. Inference Algorithm
Diagram labels: First annotation; Sample; Worker’s answers (Yes/No)
Quality Model: P(r1)

For each new answer:
• Rules compatible with the answer are more likely to be correct
• If no rule is good enough:
  • a new query is selected (Active Learning)*

r1 = /html/table/tr[1]/td/text()
r2 = //*[contains(.,"Ratings:")]/../../tr[1]/td/text()
r3 = //*[contains(.,"Director:")]/../../tr[1]/td/text()
....

      page0           page1         page2
r1    Spirited Away   City of God   Howl’s Moving Castle
r2    Spirited Away   -             9.3
r3    Spirited Away   City of God   null

*[Crescenzi WWW2013]
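The slides do not spell out the update formula behind the quality model. The sketch below assumes a plain Bayesian update in which a rule compatible with a worker's yes/no answer gets likelihood 1 - η and an incompatible one gets η; the function name, the uniform prior and the value of η are illustrative assumptions, not the authors' implementation.

```python
# Values extracted by each candidate rule (from the slide's table).
EXTRACTED = {
    "r1": ["Spirited Away", "City of God", "Howl's Moving Castle"],
    "r2": ["Spirited Away", "-", "9.3"],
    "r3": ["Spirited Away", "City of God", None],
}


def update_rule_probabilities(priors, extracted, page, value, answer_is_yes, eta):
    """Bayesian update of P(rule) after one yes/no membership query.

    A rule is compatible with the answer when it extracts `value` from
    `page` and the worker said yes, or it extracts something else and the
    worker said no. Compatible rules get likelihood 1 - eta, others eta.
    """
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must be strictly between 0 and 1")
    posteriors = {}
    for rule, prior in priors.items():
        extracts_value = extracted[rule][page] == value
        compatible = extracts_value == answer_is_yes
        likelihood = (1.0 - eta) if compatible else eta
        posteriors[rule] = prior * likelihood
    total = sum(posteriors.values())
    return {rule: p / total for rule, p in posteriors.items()}


# Hypothetical query: "is 'City of God' the right value on page1?" -> yes.
uniform = {rule: 1.0 / len(EXTRACTED) for rule in EXTRACTED}
posterior = update_rule_probabilities(
    uniform, EXTRACTED, page=1, value="City of God", answer_is_yes=True, eta=0.1
)
print(posterior)
```

After this single answer, r1 and r3 gain probability mass at the expense of r2, which extracts a different value from page1.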
17. Termination Strategies
Different termination strategies (positioned on a Quality/Costs diagram):
• HALT_r: expected quality of the wrapper (probability of correctness)
• HALT_MQ: number of used MQ
• HALT_H: uncertainty of the questioned value (trade-off quality/costs)
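The three strategies can be sketched as simple stopping predicates. This is only an illustration under stated assumptions: reading the H in HALT_H as a binary entropy is an assumption, and all thresholds (`threshold`, `max_queries`, `max_entropy`) are hypothetical values, not taken from the slides.

```python
import math


def halt_r(rule_probs, threshold=0.95):
    """HALT_r-style check: stop when the best rule is probable enough."""
    return max(rule_probs.values()) >= threshold


def halt_mq(num_queries, max_queries=10):
    """HALT_MQ-style check: stop when the query budget is used up."""
    return num_queries >= max_queries


def value_entropy(p_correct):
    """Binary entropy (in bits) of 'the questioned value is correct'."""
    p = min(max(p_correct, 0.0), 1.0)  # guard against rounding noise
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def halt_h(rule_probs, extracted, page, value, max_entropy=0.3):
    """HALT_H-style check (assumed): stop when the answer to the next
    query is already nearly certain under the current rule probabilities."""
    p_correct = sum(
        prob for rule, prob in rule_probs.items()
        if extracted[rule][page] == value
    )
    return value_entropy(p_correct) <= max_entropy
```

For instance, `halt_r({"r1": 0.96, "r2": 0.04})` holds with the default threshold, while a uniform distribution over two rules does not satisfy it.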
20. Multiple Workers
Workers can make mistakes.
We engage multiple workers on the same task, but how many?
• Too many workers: waste of money
• Not enough workers: quality loss
We apply our quality model at runtime to:
• Estimate the workers’ error rates
• Select the right number of redundant tasks
22. Dynamically Engaging Workers
Algorithm main steps:
• Starts with a minimal amount of redundancy
• Collects workers’ answers
• Estimates rule quality and workers’ error rate, using:
  • workers’ error rate to estimate rule quality
  • rule quality to estimate workers’ error rate
• If no rule is good enough, a new worker is engaged
Diagram labels: Workers’ answers; Most likely rule; Error rate estimation; Is it good enough?
24. Dynamically Engaging Workers
η = expected error rate
• Two real workers are engaged
• A new sequence is defined considering the union of all the answers

First worker:
Answers   “Spirited Away”   “-”   “9.3”
          Yes               No    No
η         0.1               0.1   0.1

Second worker:
Answers   “Spirited Away”   “City of God”   “9.3”
          Yes               No              No
η         0.1               0.1             0.1

Union of the answers:
Answers   “Spirited Away”   “City of God”   “9.3”   “Spirited Away”   “-”   “9.3”
          Yes               No              No      Yes               No    No
η         0.1               0.1             0.1     0.1               0.1   0.1
27. Dynamically Engaging Workers
η = expected error rate
• The most likely rule and its values are returned
• The most likely rule and its probability are used to estimate η

      page0           page1         page2
r1    Spirited Away   City of God   Howl’s Moving Castle
r2    Spirited Away   -             9.3
r3    Spirited Away   City of God   null

e.g.: P(r1) = 0.9, then updated to P(r1) = 0.93

First worker:
Answers   “Spirited Away”   “-”   “9.3”
          Yes               No    No
η         0.1               0.1   0.1

Second worker:
Answers   “Spirited Away”   “City of God”   “9.3”
          Yes               No              No
η         0.37              0.37            0.37
29. Dynamically Engaging Workers
η = expected error rate
• When the computation converges, the system checks the termination condition
• If it is not met, a new worker is considered and the computation starts again

      page0           page1         page2
r1    Spirited Away   City of God   Howl’s Moving Castle
r2    Spirited Away   -             9.3
r3    Spirited Away   City of God   null

P(r1) = 0.95

First worker:
Answers   “Spirited Away”   “-”   “9.3”
          Yes               No    No
η         0.05              0.05  0.05

Second worker:
Answers   “Spirited Away”   “City of God”   “9.3”
          Yes               No              No
η         0.35              0.35            0.35
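The mutual estimation of rule quality and workers' error rates can be illustrated with an alternating, EM-style iteration. This is a sketch under assumptions, not the authors' algorithm: the uniform prior, the clamping bounds, the worker names and the mapping of each questioned value to a page are all hypothetical, so the numbers it produces are not expected to match the ones on the slides.

```python
EXTRACTED = {
    "r1": ["Spirited Away", "City of God", "Howl's Moving Castle"],
    "r2": ["Spirited Away", "-", "9.3"],
    "r3": ["Spirited Away", "City of God", None],
}

# (worker, page index, questioned value, worker said yes) -- assumed mapping.
ANSWERS = [
    ("w1", 0, "Spirited Away", True),
    ("w1", 1, "-", False),
    ("w1", 2, "9.3", False),
    ("w2", 0, "Spirited Away", True),
    ("w2", 1, "City of God", False),
    ("w2", 2, "9.3", False),
]


def rule_posterior(extracted, answers, eta):
    """P(rule | answers) with a uniform prior and per-worker error rates."""
    weights = {}
    for rule, values in extracted.items():
        weight = 1.0
        for worker, page, value, said_yes in answers:
            compatible = (values[page] == value) == said_yes
            weight *= (1.0 - eta[worker]) if compatible else eta[worker]
        weights[rule] = weight
    total = sum(weights.values())
    return {rule: w / total for rule, w in weights.items()}


def expected_error_rate(extracted, answers, probs, worker):
    """Expected fraction of the worker's answers that contradict the rules."""
    own = [a for a in answers if a[0] == worker]
    expected = 0.0
    for rule, prob in probs.items():
        wrong = sum(
            1 for _, page, value, said_yes in own
            if (extracted[rule][page] == value) != said_yes
        )
        expected += prob * wrong / len(own)
    return expected


def estimate(extracted, answers, initial_eta=0.1, tol=1e-6, max_iter=100):
    """Alternate between rule probabilities and error rates until stable."""
    workers = sorted({a[0] for a in answers})
    eta = {w: initial_eta for w in workers}
    for _ in range(max_iter):
        probs = rule_posterior(extracted, answers, eta)
        # Clamp so that no answer is ever treated as certain.
        new_eta = {
            w: min(0.49, max(0.01, expected_error_rate(extracted, answers, probs, w)))
            for w in workers
        }
        converged = max(abs(new_eta[w] - eta[w]) for w in workers) < tol
        eta = new_eta
        if converged:
            break
    return rule_posterior(extracted, answers, eta), eta


probs, eta = estimate(EXTRACTED, ANSWERS)
print(probs, eta)
```

In this toy run the worker who rejected "City of God" ends with the higher estimated error rate. r1 and r3 remain tied because none of the six answers separates them, which is the situation in which a further query or a further worker would be needed.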
30. Experiments - Dataset
Site               Entity         |Pages|
www.imdb.com       Actor          500k
www.imdb.com       Movies         500k
www.allmusic.com   Band           500k
www.allmusic.com   Albums         500k
www.nasdaq.com     Stock Quotes   7k
40 attributes, with manually crafted golden rules
Measures:
• Costs: #MQ
• Quality: Precision, Recall and F-measure
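The quality measures can be computed by comparing the values extracted by a wrapper with those of the golden rule. The sketch below uses the standard definitions of precision, recall and F-measure; the per-page comparison and the function name are assumptions, since the slides do not detail how the measures were computed.

```python
def precision_recall_f(extracted, golden):
    """Compare extracted values with golden values, page by page.

    `extracted` and `golden` are equal-length lists; None means no value.
    """
    true_pos = sum(
        1 for e, g in zip(extracted, golden) if e is not None and e == g
    )
    n_extracted = sum(1 for e in extracted if e is not None)
    n_golden = sum(1 for g in golden if g is not None)
    precision = true_pos / n_extracted if n_extracted else 0.0
    recall = true_pos / n_golden if n_golden else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: r1's values taken as golden, r3's values as the wrapper output.
golden = ["Spirited Away", "City of God", "Howl's Moving Castle"]
output = ["Spirited Away", "City of God", None]
p, r, f = precision_recall_f(output, golden)
print(p, r, f)
```

Here the wrapper never extracts a wrong value but misses one page, so precision stays at 1 while recall drops.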
31. Simulating Real Workers
100 real (and noisy) AMT workers
Real workers:
• 1/3 perfect
• Average η* = 10%
• ση* = 11%
We simulated the error rate distribution with an exponential function.
[Figure: distribution of the workers’ error rate; x-axis: error rate, 0.00 to 0.50; y-axis: 0% to 40%]
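A simulation of noisy workers along these lines could look as follows. It is a minimal sketch: the mean of 0.10 mirrors the average η* reported above, while the cap at 0.5, the seed and the function names are assumptions made only for illustration.

```python
import random


def sample_error_rate(rng, mean=0.10, cap=0.5):
    """Draw a worker error rate from an exponential distribution, capped."""
    return min(rng.expovariate(1.0 / mean), cap)


def simulated_answer(rng, true_answer, eta):
    """Return the true yes/no answer, flipped with probability eta."""
    return (not true_answer) if rng.random() < eta else true_answer


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
error_rates = [sample_error_rate(rng) for _ in range(1000)]
mean_rate = sum(error_rates) / len(error_rates)
print(round(mean_rate, 3))
```

Each synthetic worker then answers every membership query through `simulated_answer` with its own sampled η.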
36. Wrong Estimation
Noisy single worker:
- η expected error rate
- η* observed error rate

      η* > η (optimistic)   η* = η (correct)   η* < η (pessimistic)
MQ    ~10                   ~10                ~30
F     ~0.65                 ~1                 ~1

η close to η* (good estimation):
- few MQ
- good F
η* > η (too optimistic):
- too few MQ
- low F
η > η* (too pessimistic):
- too many MQ
- same F
Need to estimate the workers’ error rate
42. Dynamically Engaging Workers
Synthetic (and noisy) workers; |W| = # workers

Algorithm           F      σF    #MQ     max-MQ   max-|W|   |η-η*|
ALFη (one worker)   0.92   17%   7.58    11       1         -
ALFREDno            1      1%    18.6    83       9         -
ALFRED              1      1%    16.1    44       4         0.8%
ALFRED*             1      1%    16.07   40       4         0%

Callouts:
• lower quality, fewer MQ
• correct estimation required
• accurate estimation, but achieved only at the end
• almost perfect wrapper
[Figure: percentage of runs by number of engaged workers |W| (2, 3, 4); x-axis: 0% to 100%; plotted values: 2%, 6%, 92%]
43. Conclusions
We proposed a framework for wrapper generation:
• simple tasks can be completed by non-expert workers
• cost-effective wrapper generation
• highly predictable quality of the output wrapper
Solid background in machine learning and computational learning theory*
The proposed framework can be applied to other learning tasks:
• Crawling
• NLP
*[Angluin-Laird1988, Angluin2001]
45. Future development
• Learning framework applied to other problems (NLP, Entity Linkage)
• ALFRED adopted to learn structure-driven crawling algorithms
• Hybrid approaches combining human annotations and automatic annotations
• Alternative models of truth/error rate
• Optimizing the initial number of workers
51. Sampling & Quality
Pages make apparent the differences among the rules.

      page0
r1    Spirited Away
r2    Spirited Away
r3    Spirited Away
r1 = r2 = r3

      page0           page1
r1    Spirited Away   City of God
r2    Spirited Away   -
r3    Spirited Away   City of God
r1 = r3 ≠ r2

      page0           page1         page2
r1    Spirited Away   City of God   Howl’s Moving Castle
r2    Spirited Away   -             9.3
r3    Spirited Away   City of God   null
r1 ≠ r3 ≠ r2

Find a small set that makes apparent the same differences observed in the whole set of pages.
52. Sampling & Quality
The problem:
Find the smallest set of pages that makes apparent the differences among the rules
(e.g., 100 pages that make apparent the same differences that we would observe in 2M pages).
It is an NP-hard problem! Reduction to the Set Cover problem:
find the smallest set of pages that covers all the groups of rules (group = equivalent rules).
The smallest set is not needed:
a greedy algorithm, O(|Pages|) in time and O(1) in space, works very well in practice.
53. Sampling & Quality
Input: XPath rules
For every page p:
    if (p makes apparent new differences)
        representative pages += p
An offline algorithm that can be easily parallelized
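The greedy pass above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: rules are modeled as plain functions from a page to an extracted value, pages are toy dictionaries, and the sketch keeps one group id per rule, so it does not reproduce the O(1) space bound stated on the previous slide.

```python
def representative_pages(pages, rules):
    """One pass over `pages`; keep a page only if it separates rules that
    all previously kept pages left indistinguishable.

    `rules` maps a rule name to a function page -> extracted value.
    """
    groups = {name: 0 for name in rules}  # current equivalence group per rule
    n_groups = 1
    kept = []
    for page in pages:
        # Refine every group by the value each rule extracts from this page.
        keys = {name: (groups[name], rule(page)) for name, rule in rules.items()}
        ids = {}
        refined = {name: ids.setdefault(key, len(ids)) for name, key in keys.items()}
        if len(ids) > n_groups:  # the page makes new differences apparent
            kept.append(page)
            groups, n_groups = refined, len(ids)
    return kept


# Toy pages holding the values each rule would extract (from the slides).
PAGES = [
    {"id": "page0", "r1": "Spirited Away", "r2": "Spirited Away", "r3": "Spirited Away"},
    {"id": "page1", "r1": "City of God", "r2": "-", "r3": "City of God"},
    {"id": "page2", "r1": "Howl's Moving Castle", "r2": "9.3", "r3": None},
]
RULES = {name: (lambda page, name=name: page[name]) for name in ("r1", "r2", "r3")}

sample = [page["id"] for page in representative_pages(PAGES, RULES)]
print(sample)
```

In this toy run page0 is skipped, since every rule extracts the same value from it, while page1 and page2 each split an existing group of rules.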
54. Sampling
Entity   Sampling         |Pages|   P      R
Movies   Biased           250       0.98   0.71
Movies   Random           250       0.99   0.99
Movies   Representative   42        1.00   1.00
Actors   Biased           250       1.00   1.00
Actors   Random           250       1.00   0.96
Actors   Representative   30        1.00   1.00
Stocks   Biased           86        1.00   0.98
Stocks   Random           86        1.00   0.99
Stocks   Representative   15        1.00   1.00
Albums   Biased           258       1.00   0.99
Albums   Random           258       1.00   1.00
Albums   Representative   59        1.00   1.00
Bands    Biased           289       1.00   0.68
Bands    Random           289       1.00   1.00
Bands    Representative   36        1.00   1.00

• Representative: perfect
• Biased: recall loss
• Random: better than biased but not perfect
58. Related Wrapper Generation
• Automatic Wrappers for Large Scale Web Extraction, Nilesh Dalvi et al., VLDB 2011
• DIADEM, T. Furche, G. Gottlob, et al., WWW 2012
• Web Data Extraction Based on Partial Tree Alignment, Yanhong Zhai, WWW 2005
• Extracting Structured Data from Web Pages, Arvind Arasu, Hector Garcia-Molina, SIGMOD 2003
• RoadRunner, Crescenzi, VLDB 2001
• Wrapper Induction for Information Extraction, Kushmerick, IJCAI 1997
• Active Learning with Multiple Views, Ion Muslea, JAIR 2006
• Interactive Wrapper Generation with Minimal User Effort, Utku Irmak, WWW 2006