This document discusses query optimization over crowdsourced data in the Deco system. Deco takes a declarative approach to crowdsourcing: queries are posed over conceptual relations, and the system accesses the crowd as needed to answer them. The query optimizer searches for the query execution plan with the lowest estimated monetary cost, which requires estimating both costs and cardinalities. Cost estimation accounts for the price of fetching data from crowd workers and requires estimating the final database state. Experimental results show that Deco's cost estimates are accurate and let the optimizer select the lowest-cost query plan.
2. Deco: Declarative Crowdsourcing
Give me a Spanish-speaking country.
Give me a country.
What language do they speak in country X?
What is the capital of country X?
8/27/2013 Hyunjung Park 2
“Find the capitals of eight Spanish-speaking countries”
DBMS
country language capital
Italy Italian Rome
Spain Spanish Madrid
… … …
country language capital
Italy Italian Rome
Spain Spanish Madrid
Deco System
3. Deco Query Optimization
• Crowd incurs monetary cost
• Some query plans are much cheaper than others
• Cost estimation is complicated by:
– Previously collected data
– Unknown database state
– Inconsistency of human answers
4. Outline
• Motivating example
• Deco data model and queries
• Cost and cardinality estimation
• Experimental results
Everything implemented in full prototype
5. Motivating Example: Plan 1
Give me a country.
What language do they speak in country X?
What is the capital of country X?
[Figure: flowchart over the three questions above, with decision nodes “unseen” and “Spanish” (T/F branches), repeated 8x]
“Find the capitals of eight Spanish-speaking countries”
6. Motivating Example: Plan 2
Give me a Spanish-speaking country.
What language do they speak in country X?
What is the capital of country X?
[Figure: flowchart over the three questions above, with decision nodes “unseen” and “Spanish” (T/F branches), repeated 8x]
“Find the capitals of eight Spanish-speaking countries”
7. Preview of Experimental Results
[Figure: bar chart, “Actual costs spent on Mechanical Turk” ($, axis 0 to 15), for Plan 1 and Plan 2, broken down by question: “Give me a country.”, “Give me a Spanish-speaking country.”, “What language do they speak in country X?”, “What is the capital of country X?”]
8. Outline
• Motivating example
• Deco data model and queries
• Cost and cardinality estimation
• Experimental results
9. Deco: Data Model (1/2)
• Conceptual Relation: visible to end-users
Country (country, language, capital)
• Resolution Rules: cleanse raw data using UDFs
country: dupElim
language: majority(3)
capital: majority(3)
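The two resolution functions named above can be sketched in Python. This is a minimal illustration under stated assumptions, not Deco's implementation: `dup_elim` is assumed to compare values case-insensitively after trimming whitespace, and `majority` is assumed to return a value as soon as more than half of at most `n` answers agree (so two matching answers are enough when `n` is 3). Both function names and these exact semantics are hypothetical.

```python
from collections import Counter


def dup_elim(values):
    """Keep the first occurrence of each value (assumed: case-insensitive, trimmed)."""
    seen, out = set(), []
    for v in values:
        key = v.strip().lower()
        if key not in seen:
            seen.add(key)
            out.append(v.strip())
    return out


def majority(answers, n=3):
    """Return the value that reaches a strict majority within the first n answers.

    Returns None when no value has reached the threshold yet, or when the
    first n answers all disagree.
    """
    needed = n // 2 + 1  # 2 agreeing answers when n == 3
    counts = Counter()
    for a in answers[:n]:
        counts[a] += 1
        if counts[a] >= needed:
            return a
    return None


print(dup_elim(["Spain", "spain ", "Peru"]))
print(majority(["Madrid", "Barcelona", "Madrid"]))
```

Under this reading, a value resolves after two answers when they agree and needs a third answer otherwise, which is one way the average number of answers per resolved value can fall between 2 and 3.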
10. Deco: Data Model (2/2)
• Fetch Rules: “access methods” for the crowd
language => country [$0.05]
“Give me a {language}-speaking country.”
Ø => country [$0.01]
“Give me a country.”
country => language [$0.02]
“What language do they speak in {country}?”
country => capital [$0.03]
“What is the capital of {country}?”
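To make the "access method" idea concrete, the four fetch rules can be modeled as question templates that are filled in with known attribute values. The `FetchRule` class, its field names, and the `prompt` method are hypothetical names introduced for this sketch; only the rule shapes and question texts come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchRule:
    given: tuple   # attributes whose values are handed to the worker
    asked: str     # attribute the worker is asked to supply
    template: str  # question text with {attribute} placeholders

    def prompt(self, **values):
        """Render the question, requiring a value for every 'given' attribute."""
        missing = [a for a in self.given if a not in values]
        if missing:
            raise ValueError(f"missing values for: {missing}")
        return self.template.format(**values)


RULES = [
    FetchRule(("language",), "country", "Give me a {language}-speaking country."),
    FetchRule((), "country", "Give me a country."),
    FetchRule(("country",), "language", "What language do they speak in {country}?"),
    FetchRule(("country",), "capital", "What is the capital of {country}?"),
]

print(RULES[0].prompt(language="Spanish"))
print(RULES[3].prompt(country="Spain"))
```

The empty `given` tuple plays the role of Ø: the rule needs no input values and can be invoked at any time.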
11. Deco: Queries
• Deco query: SQL query over conceptual relations
SELECT country, capital
FROM Country
WHERE language=‘Spanish’
MINTUPLES 8
• Query processor: access the crowd as needed to produce the query result while:
1. Minimizing monetary cost (query optimizer)
2. Reducing latency (query execution engine)
12. Query Optimization
• Find the best query plan in terms of estimated
monetary cost
• As in a traditional query optimizer, this involves:
1. Cost and cardinality estimation
2. Search space
3. Plan enumeration algorithm
13. Cost Estimation
• Total monetary cost = $\sum_{F \in \mathrm{Fetch}} F.\mathrm{price} \times F.\mathrm{cardinality}$
– Existing data is “free”
• Definition of Cardinality in Deco
– Expected total number of tuples output by an operator until query execution terminates
• Cardinality estimation
– Final database state needs to be estimated
simultaneously
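The cost formula on this slide can be sketched as a short Python function. Prices are kept in integer cents to avoid floating-point rounding. The helper `new_answers_needed` is a hypothetical simplification of "existing data is free" (only the shortfall is fetched), and the numbers in the example are illustrative, not taken from the slides.

```python
def total_cost_cents(fetch_ops):
    """Sum price x cardinality over Fetch operators; prices are integer cents."""
    return sum(price_cents * cardinality for price_cents, cardinality in fetch_ops)


def new_answers_needed(answers_required, answers_stored):
    """Assumed simplification: only answers not already stored must be fetched."""
    return max(0, answers_required - answers_stored)


# Hypothetical numbers, not taken from the slides.
fetch_ops = [
    (5, new_answers_needed(30, 10)),  # 20 new answers at 5 cents each
    (2, new_answers_needed(12, 12)),  # fully covered by stored data, so free
]

print(total_cost_cents(fetch_ops) / 100)  # cost in dollars
```

The second operator contributes nothing to the total, which is the sense in which previously collected data changes the relative cost of plans.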
14. Cardinality Estimation: Setting
• $0.05 for all fetch rules
• No existing data
• Selectivity factors
– language=‘Spanish’: 0.1
– dupElim: 0.8
– majority(3): 0.4 (=1/2.5)
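One way to read these selectivity factors is as output-to-input ratios: an operator with selectivity s is expected to need 1/s input tuples per output tuple. The sketch below applies that reading to the factors on this slide. The function name `inputs_needed` and the dictionary keys are hypothetical, and this is a back-of-the-envelope illustration rather than Deco's estimation algorithm.

```python
from fractions import Fraction

# Selectivity factors from the slide, kept as exact fractions.
SELECTIVITY = {
    "language='Spanish'": Fraction(1, 10),
    "dupElim": Fraction(4, 5),
    "majority(3)": Fraction(2, 5),  # i.e. 1/2.5: about 2.5 answers per resolved value
}


def inputs_needed(outputs, operator):
    """Expected number of input tuples required to produce `outputs` tuples."""
    return Fraction(outputs) / SELECTIVITY[operator]


# To end up with 8 Spanish-speaking countries, the filter needs this many countries:
print(inputs_needed(8, "language='Spanish'"))
# To resolve 8 capitals with majority(3), this many raw answers are expected:
print(inputs_needed(8, "majority(3)"))
```

Exact fractions are used so that chained divisions such as 8 / 0.1 / 0.8 do not pick up floating-point error.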
15. Cardinality Estimation: Plan 1
SELECT country, capital
FROM Country
WHERE language=‘Spanish’
MINTUPLES 8
[Figure: query plan tree with operators numbered 1 to 14]
MinTuples[8]
Project[co,ca]
DLOJoin[co]
DLOJoin[co]
Resolve[dupElim]
Filter[la=‘Spanish’]
Resolve[maj3]
Resolve[maj3]
Scan[CtryA], Fetch[Ø→co]
Scan[CtryD1], Fetch[co→la]
Scan[CtryD2], Fetch[co→ca]
Estimated Fetch cardinalities:
Ø => country: 100
country => language: 200
country => capital: 20
Cost estimation:
$0.05 × (100 + 200 + 20) = $16.00
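The three Fetch cardinalities for Plan 1 can be reproduced from the selectivity factors in the setting slide. The derivation below is one reading that is consistent with the slide's numbers, working backward from the 8 required result tuples; the intermediate labels are introduced here for illustration.

```latex
\begin{align*}
\text{countries with a resolved language} &= \frac{8}{0.1} = 80
  && \text{(Filter, selectivity 0.1)} \\
\text{Fetch}[\emptyset \to \text{country}] &= \frac{80}{0.8} = 100
  && \text{(dupElim, selectivity 0.8)} \\
\text{Fetch}[\text{country} \to \text{language}] &= \frac{80}{0.4} = 200
  && \text{(majority(3), 2.5 answers per country)} \\
\text{Fetch}[\text{country} \to \text{capital}] &= \frac{8}{0.4} = 20
  && \text{(only the 8 Spanish-speaking countries)} \\
\text{cost} &= \$0.05 \times (100 + 200 + 20) = \$16.00
\end{align*}
```

Most of the cost comes from asking for the language of 80 countries, of which only about one in ten passes the filter.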
16. Cardinality Estimation: Plan 2
SELECT country, capital
FROM Country
WHERE language=‘Spanish’
MINTUPLES 8
[Figure: query plan tree with numbered operators; the Fetch for country is labeled 8a]
MinTuples[8]
Project[co,ca]
DLOJoin[co]
DLOJoin[co]
Resolve[dupElim]
Filter[la=‘Spanish’]
Resolve[maj3]
Resolve[maj3]
Scan[CtryA], Fetch[la→co]
Scan[CtryD1], Fetch[co→la]
Scan[CtryD2], Fetch[co→ca]
Estimated Fetch cardinalities:
language => country: 10
country => language: 20
country => capital: 20
Cost estimation:
$0.05 × (10 + 20 + 20) = $2.50
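For Plan 2, the same backward reasoning gives 8 / 0.8 = 10 country fetches and 8 / 0.4 = 20 fetches each for language and capital, assuming every fetched country is already Spanish-speaking. The sketch below plugs the cardinalities from both cost expressions into the cost formula and picks the cheaper plan. The names `PLANS`, `estimated_cost_cents`, and `cheapest_plan` are hypothetical, and choosing the minimum over two fixed plans stands in for the optimizer's plan enumeration, which is not described in these slides.

```python
PRICE_CENTS = 5  # $0.05 for every fetch rule, as in the setting slide

# Estimated Fetch cardinalities as they appear in the two cost expressions.
PLANS = {
    "Plan 1": {"Ø => country": 100, "country => language": 200, "country => capital": 20},
    "Plan 2": {"language => country": 10, "country => language": 20, "country => capital": 20},
}


def estimated_cost_cents(cardinalities):
    """Uniform price times the total number of expected fetches."""
    return PRICE_CENTS * sum(cardinalities.values())


def cheapest_plan(plans):
    """Return the name of the plan with the lowest estimated cost."""
    return min(plans, key=lambda name: estimated_cost_cents(plans[name]))


for name, cards in PLANS.items():
    print(f"{name}: ${estimated_cost_cents(cards) / 100:.2f}")
print("chosen:", cheapest_plan(PLANS))
```

With non-uniform prices the per-rule price would move inside the sum, as in the general cost formula.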
17. Experimental Results
[Figure: bar charts of actual cost ($) for Plan 1 (axis 0 to 15) and Plan 2 (axis 0 to 3), broken down by fetch rule: Ø => country, language => country, country => language, country => capital]
18. Experimental Results
[Figure: bar charts of actual vs. estimated cost ($) for Plan 1 (axis 0 to 15) and Plan 2 (axis 0 to 3), broken down by fetch rule: Ø => country, language => country, country => language, country => capital]
19. Related Work
• Declarative approach for crowdsourcing
– Arnold, CrowdDB, CrowdSearcher, Jabberwocky, Qurk, ...
• Crowd-powered algorithms/operations
– Filter, sort, join, max, entity resolution, …
• Also:
– Traditional query optimization
– Heterogeneous or federated database systems
20. Summary
• Cost estimation in Deco
– Distinguish between existing data vs. new data
– Estimate cardinality and final database state
simultaneously
• In the paper:
– Full description of cost estimation and plan
enumeration algorithms
– More experimental results