Enterprises are adopting large-scale data processing platforms, such as Hadoop, to gain actionable insights from their "big data". Query optimization is still an open challenge in this environment due to the volume and heterogeneity of the data, which comprise both structured and un/semi-structured datasets. Moreover, it has become common practice to push business logic close to the data via user-defined functions (UDFs), which are usually opaque to the optimizer, further complicating cost-based optimization. As a result, classical relational query optimization techniques do not fit well in this setting, while at the same time suboptimal query plans can be disastrous with large datasets. In this talk, I will present new techniques that take into account UDFs and correlations between relations when optimizing queries running on large-scale clusters. We introduce "pilot runs", which execute part of the query over a sample of the data to estimate selectivities, and employ a cost-based optimizer that uses these selectivities to choose an initial query plan. Then, we follow a dynamic optimization approach in which plans evolve as parts of the query get executed. Our experimental results show that our techniques produce plans that are at least as good as, and up to 2x (4x) better than, the best hand-written left-deep query plans for Jaql (Hive).
Dynamically Optimizing Queries over Large Scale Data Platforms
1. Dynamically Optimizing Queries over
Large Scale Data Platforms
[Work done at IBM Almaden Research Center]
Konstantinos Karanasos♯, Andrey Balmin§, Marcel Kutsch♣,
Fatma Özcan*, Vuk Ercegovac◊, Chunyang Xia♦, Jesse Jackson♦
♯Microsoft *IBM Research §Platfora ♣Apple ◊Google ♦IBM
Inria Saclay
November 26, 2014
2. 2
The Big Data Landscape
[Figure: landscape of Big Data platforms (Hadoop, Hive/Stinger, Jaql, Spark, Stratosphere, Impala, Dryad, HAWQ, Hadapt, Polybase, Drill) and languages (HiveQL, DryadLINQ, Pig, Spark, SQL, Jaql, Stratosphere), spanning nested and relational data models over structured, semi-structured, unstructured, and streaming data]
Need for efficient Big Data management
Challenging due to size and heterogeneity of data,
variety of applications
Query optimization is crucial
3. Query Optimization in Large Scale Data Platforms
3
• Existing challenges
• Exponential error propagation in joins
• Correlations between predicates
• “New” challenges
• Prominent use of UDFs
• Complex data types (arrays, maps, structs)
• Poor statistics (do we own the data?)
• Bad plans over Big data may be disastrous
• Exploit cluster’s resources (parallel execution)
Traditional static techniques are not sufficient
We introduce dynamic techniques that are:
• at least as good as and
• up to 2x (4x) better than
the best hand-written left-deep Jaql (Hive) plans
4. 4
SELECT <projection list> FROM (
  SELECT <projection list>
  FROM "PART", "SUPPLIER", "LINEITEM",
       "PARTSUPP", "ORDERS", "NATION"              -- 5-way join
  WHERE <join conditions>
    AND "PART"."p_name" LIKE '%green%'
    AND "ORDERS"."o_orderdate" BETWEEN '1995-01-01' AND '1995-07-01'
    AND "ORDERS"."o_orderstatus" = 'P'             -- correlated predicates
    AND udf("PARTSUPP"."ps_partkey") < 0.001       -- external UDFs
    AND <udf list>
) "PROFIT"
GROUP BY "PROFIT"."NATION", "PROFIT"."order_YEAR"
ORDER BY "PROFIT"."NATION" ASC, "PROFIT"."order_YEAR" DESC;
Example: TPCH Q9’
5. 5
“SQL” Processing in Large Scale Platforms
• Relational operators -> MapReduce jobs
• Two join algorithms:
• Repartition join (RJ) – 1MR job (default)
• Memory join (MJ) – map-only job
• Optimizations based on rewrite rules and hints
• RJ -> MJ
• Chain MJs (multiple joins in one map job)
• Left-deep plans
• This is the picture for Jaql (and Hive)
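The two join algorithms can be sketched in plain Python over lists of dicts. This is a toy stand-in for the actual MapReduce implementations, not the real code; the function names are illustrative:

```python
from collections import defaultdict

def repartition_join(left, right, key):
    """Repartition join (RJ): both inputs are shuffled on the join key
    (the map phase), then each key group is joined (the reduce phase)."""
    groups = defaultdict(lambda: ([], []))          # simulated shuffle on the key
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    return [{**l, **r} for ls, rs in groups.values() for l in ls for r in rs]

def memory_join(big, small, key):
    """Memory (broadcast) join (MJ): the small input is loaded into a
    hash table on every mapper, so the job is map-only (no shuffle)."""
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)
    return [{**b, **s} for b in big for s in table[b[key]]]
```

MJ avoids the costly shuffle, which is why converting RJ to MJ (and chaining MJs into one map job) is worthwhile whenever the smaller input fits in memory.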
6. 6
Limitations
• No selectivity estimation for predicates/UDFs
• Conservative application of memory joins
• No cost-based join enumeration
• Rely on order of relations in FROM clause
• Left-deep plans
• Often close to optimal for centralized settings
• Not sufficient for distributed query processing
7. 7
TPCH Q9’: Execution Plans
[Figure: two join trees. Left: best left-deep hand-written Jaql plan, a chain of repartition joins (RJ) plus one memory join (MJ) over p, ps, l, o, s, n, applying udf(p), udf(ps), udf(o), and udf(o,l) along the chain. Right: best relational optimizer plan, a bushy tree mixing RJ and MJ over the same relations and UDFs.]
8. 8
Dynamic Optimization
• Key idea: alter execution plan at runtime
• Studied in the relational setting
• Both centralized and distributed
• Basic concern: when to break the pipeline?
• No emphasis on UDFs and correlated predicates
• Increasingly being used in large scale platforms
(e.g., Scope, Shark, Hive)
Goal: dynamic optimization techniques for large
scale data platforms (implemented in Jaql)
9. 9
IBM BigInsights Jaql
Dataflows for conceptual JSON data
Key differentiators
• Functions:
reusability + abstraction
• Physical Transparency:
precise control when needed
• Data model:
semi-structured based on JSON
Flexible scripting language
Scalable map-reduce runtime
Fault-tolerant DFS
[Figure: Jaql scripts execute as a pipeline of Jaql map and reduce stages on the MapReduce runtime]
10. 10
Jaql Script: Example
read transform group by write
Query Data
read(hdfs("reviews"))
-> transform { pid: $.placeid, rev: sentAn($.review) }
-> group by p = ($.pid) as r into { pid: p, revs: r.rev }
-> write(hdfs("group-reviews"))
[
{ pid: 12, revs: [ 3*, 4*, … ] },
{ pid: 19, revs: [ 2*, 1*, … ] }
]
Group user reviews by place
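For readers unfamiliar with Jaql, the same read → transform → group by → write dataflow can be sketched in Python. The `sent_an` stub below merely stands in for the sentAn sentiment-analysis UDF and is purely illustrative:

```python
from itertools import groupby

def sent_an(review):
    # Stand-in for the sentAn UDF: here, count stars as the review score.
    return review.count("*")

def group_reviews(reviews):
    """Mirror of the Jaql script: project each record to {pid, rev},
    then group the per-review scores by place id."""
    projected = [{"pid": r["placeid"], "rev": sent_an(r["review"])}
                 for r in reviews]
    projected.sort(key=lambda r: r["pid"])       # groupby needs sorted input
    return [{"pid": pid, "revs": [r["rev"] for r in grp]}
            for pid, grp in groupby(projected, key=lambda r: r["pid"])]
```

In Jaql the same grouping runs as a MapReduce job, with the transform in the map phase and the group-by in the reduce phase.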
11. 11
Jaql to MapReduce
mapReduce(
input: { type: hdfs, location: "reviews" },
output: { type: hdfs, location: "group-reviews" },
map: fn($mapIn) (
$mapIn -> transform { pid: $.placeid, rev: sentAn($.review) }
-> transform [ $.placeid, $.rev ] ),
reduce: fn($p, $r) ( [ { pid: $p, revs: $r } ] ) )
• Functions as parameters
• Rewritten script is valid Jaql!
read(hdfs("reviews"))
-> transform { pid: $.placeid, rev: sentAn($.review) }
-> group by p = ($.pid) as r into { pid: p, revs: r.rev }
-> write(hdfs("group-reviews"))
Rewrite Engine
12. 12
Outline
• Introduction
• System Architecture
• Pilot Runs
• Adaptation of Execution Plans
• Experiments
• Conclusion
13. 13
DynO Architecture
[Figure: DynO query flow (steps 1–8). A query enters the Jaql compiler; join query blocks go to the cost-based optimizer (join enumeration); pilot runs over the DFS populate the statistics store; the best plan is handed to the Jaql runtime, which executes part of the plan on MapReduce; the remaining plan is re-optimized, and finally the query result is returned.]
14. 14
Pilot Runs
• PilR algorithm:
• Push-down selections/UDFs
• Get leaf expressions (scans + local predicates)
• Transform them to map-only jobs
• Execute them over random splits of each relation
• Until k tuples are output
• Collect statistics during execution
• Parallel execution of pilot runs (~4.5x speedup)
• Approx. 3% overhead to the execution
• Performance speedup of up to 2x (4x) for Jaql (Hive)
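The PilR sampling loop above can be sketched as follows, assuming a relation arrives as a list of splits (lists of rows); the `k` cutoff and the predicate are illustrative, not the paper's actual interfaces:

```python
import random

def pilot_run(splits, predicate, k=1000):
    """Execute a leaf expression (scan + local predicates/UDFs) over
    randomly ordered splits until k tuples pass, and estimate the
    predicate's selectivity from the tuples seen so far."""
    random.shuffle(splits)                 # random splits of the relation
    seen = passed = 0
    sample = []
    for split in splits:
        for row in split:
            seen += 1
            if predicate(row):
                passed += 1
                sample.append(row)
            if passed >= k:
                return sample, passed / seen
    return sample, (passed / seen if seen else 0.0)
```

Because each pilot run is a map-only job over a few splits, several of them can execute in parallel, which is where the ~4.5x speedup of parallel pilot runs comes from.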
15. 15
udf(o,l)
[Figure: the best left-deep hand-written Jaql plan and the best relational optimizer plan (as on slide 7), compared with the DynO plan: a chain of memory joins (MJ) over p, ps, l, o, s, n with udf(p), udf(ps), udf(o), and udf(o,l) pushed down]
Up to 2x speedup (4x when applied to Hive)
TPCH Q9’: Impact of Pilot Runs
16. 16
Pilot Runs: Details
• Collected statistics:
• #tuples, min/max, #distinct values
• add more if the optimizer can support them
• Statistics reusability
• Optimization for selective (and expensive) predicates
• Shortcomings:
• Non-local predicates
• Non-primary/foreign-key joins
• Join correlations
Runtime adaptation of execution plans
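The statistics listed above can be gathered in a single pass per column; a minimal sketch (exact distinct counting here, where a production system might use an approximate sketch instead):

```python
def collect_stats(rows, column):
    """One-pass collection of the statistics listed above:
    #tuples, min/max, and #distinct values for one column."""
    count = 0
    lo = hi = None
    distinct = set()
    for row in rows:
        v = row[column]
        count += 1
        lo = v if lo is None or v < lo else lo
        hi = v if hi is None or v > hi else hi
        distinct.add(v)
    return {"tuples": count, "min": lo, "max": hi, "distinct": len(distinct)}
```

Since the pass piggybacks on work the job is doing anyway, the added cost is small, consistent with the low statistics-collection overhead reported later.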
17. 17
Adaptation of Execution Plans
• Cost-based optimizer
• Based on Columbia (top-down) optimizer
• Focuses on join enumeration
• Accurate statistics from pilot runs and/or previous executions
• Bushy plans (intra-query parallelization)
• Online statistics collection
• Re-optimization points (natural in MR)
• Execution strategies: choosing leaf jobs
• Degree of parallelization, cost/uncertainty of jobs
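The adaptation loop can be sketched as re-planning at each MapReduce job boundary. All names here are illustrative: `join_order_fn` stands in for the cost-based join enumeration, `run_join_fn` for one executed leaf job, and the generated t-names mirror the intermediate results t1, t2, ... shown on the plan slides:

```python
def execute_adaptively(relations, join_order_fn, run_join_fn):
    """Re-optimize at each job boundary: order the remaining joins with
    the current statistics, execute only the chosen join, record the
    observed size of its result, and re-plan with that knowledge."""
    stats = {name: len(rows) for name, rows in relations.items()}  # initial #tuples
    current = dict(relations)
    step = 1
    while len(current) > 1:
        # Re-optimization point: pick the next join using up-to-date stats.
        a, b = join_order_fn(current, stats)
        joined = run_join_fn(current.pop(a), current.pop(b))
        name = f"t{step}"                   # intermediate result, like t1, t2, ...
        current[name] = joined
        stats[name] = len(joined)           # online statistics collection
        step += 1
    return next(iter(current.values()))
```

Each iteration corresponds to one MapReduce job, which is why job boundaries are natural re-optimization points: the result of the finished subtree is materialized anyway.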
18. 18
TPCH Q8’: Impact of Execution Plan Adaptation
[Figure: best left-deep hand-written Jaql plan (a chain of RJ/MJ joins over p, s, l, o, c, n1, n2, r, with udf(o,c) applied along the chain) versus the best relational optimizer plan (a bushy RJ/MJ tree over the same relations)]
19. 19
TPCH Q8’: Impact of Execution Plan Adaptation (cont.)
[Figure: the plan evolves at re-optimization points; executed subtrees are replaced by their intermediate results t1, t2, t3, and the remaining joins over p, s, l, o, c, n1, n2, r are re-ordered and switched between RJ and MJ as statistics are collected]
Speedup up to 2x without
any initial statistics
(despite the added overhead)
20. 20
Outline
• Introduction
• System Architecture
• Pilot Runs
• Adaptation of Execution Plans
• Experiments
• Conclusion
21. 21
Experimental Setup
• 15-node cluster, 10 GbE
• Each machine:
• 12 cores, 96 GB RAM (2 GB per MR slot), 12×2 TB disks
• 10 map / 8 reduce slots
• Hadoop 1.1.1
• ZooKeeper for coordination (in statistics collection)
• TPCH data, SF = {100, 300, 1000}
• TPCH queries (with additional UDFs)
22. 22
Execution Times Comparison
• At least as good as the best left-deep hand-written plans
• Benefits from bushy plans (Q2)
• Benefits from pilot runs due to many UDFs (Q9’)
• Benefits from re-optimization due to UDF on join result (Q8’)
• Biggest benefit is brought by the pilot runs
23. 23
Benefits of our Approach on Hive
• Similar performance trends with Jaql
• Bigger speedup (up to 4x) due to the implementation of broadcast joins (Hive 0.12 exploits DistributedCache)
24. 24
Overhead of Dynamic Optimization
• Pilot runs overhead: 2.5-6.5%
• Stats collection overhead: 0.1-2.8%
• Overall overhead: 7-10%
25. 25
Conclusion
• Pilot runs to account for UDFs
• Dynamic adaptation of execution plans
• Traditional optimizer for join ordering (bushy plans)
• Online statistics collection (no need for initial statistics)
• Execution strategies
• At least as good plans as the left-deep hand-written ones
• Up to 2x faster (4x for Hive)
• Applicability to other systems (e.g., Hive)
26. 26
Perspectives
• Broader range of applications (e.g., ML)
• Other runtimes (e.g., Tez)
• Adaptive operators
• Extend optimizer to support grouping, ordering