1. 1© Copyright 2015 Pivotal. All rights reserved.
GPORCA: Query Optimization as a Service
Venkatesh Raghavan
Apache HAWQ Nest
2. 2© Copyright 2015 Pivotal. All rights reserved.
Outline
• Motivation
• Introduction to GPORCA (a.k.a. Orca)
• How to add an Orca feature?
• How to enable Orca on Apache HAWQ?
3. 3© Copyright 2015 Pivotal. All rights reserved.
Why a New Optimizer?
• Query optimization is key to performance
• Legacy Planner was not initially designed with distributed
data processing in mind
• Average time to fix customer issues:
– Legacy Optimizer (Planner) ~ 70 days
– Pivotal Query Optimizer (Orca) ~ 13 days
4. 4© Copyright 2015 Pivotal. All rights reserved.
Legacy Query Planner
• Technology of the 90’s
• Addresses Join re-ordering
• Treats everything else as add-on (grouping, with clause, etc.)
• Imposes order on specific optimization steps
• Recursively descends into sub-queries
– Cannot Unnest Complex Correlated Sub Queries
• High Code Complexity
– Maximum: 102 (Orca 8.5)
– Minimum: 6.4 (Orca 1.5)
5. 5© Copyright 2015 Pivotal. All rights reserved.
Join Ordering vs. “Everything Else”
• TPC-H Query 5
– 6 Tables
– “Harmless” query
• “Everything Else”: size of search space ~230,000,000
• Join Order Problem: size of search space < 100,000
6. 6© Copyright 2015 Pivotal. All rights reserved.
What Is GP-Orca?
• State-of-the-art query optimization framework designed from
scratch to
– Improve – performance, ease-of-use
– Enable – foundation for future research and development
– Connect – applies to multiple host systems (GPDB and HAWQ)
7. 7© Copyright 2015 Pivotal. All rights reserved.
Modularity
• Orca is not baked into one host system
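[Diagram: in the host system, the parser hands SQL to Q2DXL, which sends a DXL query to Orca. Orca sends metadata requests (MD req.) to the MD Provider, which reads the host catalog and returns DXL MD. Orca returns a DXL plan, which DXL2Plan converts for the executor, and the executor produces the results.]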
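The separation between Orca and the host catalog can be illustrated with a small interface sketch. This is a toy under stated assumptions: `TableMD`, `MDProvider`, `InMemoryCatalogProvider`, and `EstimateScanRows` are hypothetical names invented here, and the sketch passes plain structs where Orca exchanges DXL documents.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical metadata record (Orca itself exchanges metadata as DXL).
struct TableMD {
    std::string name;
    double row_count;
};

// Hypothetical interface: optimizer code sees only this, never the host catalog.
class MDProvider {
public:
    virtual ~MDProvider() = default;
    // Returns true and fills *out if the host knows the table.
    virtual bool Lookup(const std::string& table, TableMD* out) const = 0;
};

// One possible host-side implementation, backed by an in-memory catalog.
class InMemoryCatalogProvider : public MDProvider {
public:
    explicit InMemoryCatalogProvider(std::map<std::string, double> rows)
        : rows_(std::move(rows)) {}

    bool Lookup(const std::string& table, TableMD* out) const override {
        auto it = rows_.find(table);
        if (it == rows_.end()) return false;
        out->name = it->first;
        out->row_count = it->second;
        return true;
    }

private:
    std::map<std::string, double> rows_;
};

// Optimizer-side helper: depends only on the interface, so any host can plug in.
inline double EstimateScanRows(const MDProvider& md, const std::string& table) {
    TableMD t;
    return md.Lookup(table, &t) ? t.row_count : 0.0;
}
```

A second host system would supply its own `MDProvider` implementation without any change to the optimizer-side helper.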
8. 8© Copyright 2015 Pivotal. All rights reserved.
Key Features
• Smarter partition elimination
• Subquery unnesting
• Common table expressions (CTE)
• Additional Functionality
– Improved join ordering
– Join-Aggregate reordering
– Sort order optimization
– Skew aware
9. 9© Copyright 2015 Pivotal. All rights reserved.
Not Yet Feature Complete
• Improve Performance for Short Running Queries
• External parameters
• Cubes
• Multiple grouping sets
• Inverse distribution functions
• Ordered aggregates
• Catalog Queries
10. 10© Copyright 2015 Pivotal. All rights reserved.
Currently in Apache HAWQ
[Diagram: SQL → Parser → Query → Planner → Plan → Executor → Results; Orca is shown next to the Planner.]
11. 11© Copyright 2015 Pivotal. All rights reserved.
When Orca is exercised
[Diagram: SQL → Parser → Query → Orca → Plan → Executor → Results; the Planner is shown next to Orca.]
12. 12© Copyright 2015 Pivotal. All rights reserved.
When Orca falls back
[Diagram: SQL → Parser → Query → Orca → Fallback → Planner → Plan → Executor → Results]
Orca automatically falls back to the legacy optimizer (Planner) for unsupported features.
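The fallback behaviour can be sketched as a try/catch around the new optimizer. This is only an illustration: `Plan`, `UnsupportedFeature`, `OptimizeWithOrca`, `OptimizeWithPlanner`, and `Optimize` are hypothetical names, and detecting a feature by searching the SQL text is a stand-in for whatever check the real system performs.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical plan type that only records which optimizer produced it.
struct Plan {
    std::string producer;
};

// Hypothetical exception for features the new optimizer does not support.
struct UnsupportedFeature : std::runtime_error {
    explicit UnsupportedFeature(const std::string& what)
        : std::runtime_error(what) {}
};

// Stand-in for Orca: rejects queries that use multiple grouping sets,
// one of the items on the "Not Yet Feature Complete" slide.
inline Plan OptimizeWithOrca(const std::string& sql) {
    if (sql.find("GROUPING SETS") != std::string::npos)
        throw UnsupportedFeature("multiple grouping sets");
    return Plan{"orca"};
}

// Stand-in for the legacy Planner: accepts every query.
inline Plan OptimizeWithPlanner(const std::string&) {
    return Plan{"planner"};
}

// Try Orca first; on an unsupported feature, fall back to the Planner.
inline Plan Optimize(const std::string& sql) {
    try {
        return OptimizeWithOrca(sql);
    } catch (const UnsupportedFeature&) {
        return OptimizeWithPlanner(sql);
    }
}
```

The caller never sees the failure: it receives a plan either way, which matches the "automatic" fallback described on the slide.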
13. 13© Copyright 2015 Pivotal. All rights reserved.
Subquery Unnesting
14. 14© Copyright 2015 Pivotal. All rights reserved.
Subqueries: Definition
• A subquery is a query that is nested inside an outer query block
• A correlated subquery (CSQ) is a subquery that uses values from the outer query
SELECT * FROM part p1
WHERE price >
(SELECT avg(price) FROM part p2 WHERE p2.brand = p1.brand)
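To make the cost difference concrete, the query above can be evaluated in two ways over an in-memory list of parts. The sketch below is a simulation, not Orca code: `Part`, `CorrelatedExecution`, and `AggregateThenJoin` are hypothetical names, and NULL handling is ignored.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Part {
    std::string brand;
    double price;
};

// Correlated execution: re-run the subquery for every outer row (quadratic work).
inline std::vector<Part> CorrelatedExecution(const std::vector<Part>& parts) {
    std::vector<Part> out;
    for (const Part& p1 : parts) {
        double sum = 0.0;
        int n = 0;
        for (const Part& p2 : parts) {
            if (p2.brand == p1.brand) {
                sum += p2.price;
                ++n;
            }
        }
        // n >= 1 because p1 always matches its own brand.
        if (p1.price > sum / n) out.push_back(p1);
    }
    return out;
}

// Decorrelated form: aggregate once per brand, then join back on brand.
inline std::vector<Part> AggregateThenJoin(const std::vector<Part>& parts) {
    std::map<std::string, std::pair<double, int>> agg;  // brand -> (sum, count)
    for (const Part& p : parts) {
        auto& a = agg[p.brand];
        a.first += p.price;
        ++a.second;
    }
    std::vector<Part> out;
    for (const Part& p : parts) {
        const auto& a = agg[p.brand];
        if (p.price > a.first / a.second) out.push_back(p);
    }
    return out;
}
```

Both functions return the same rows; the second reads the table twice instead of once per outer row, which is the benefit of turning a correlated subquery into a join.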
15. 15© Copyright 2015 Pivotal. All rights reserved.
Subqueries: Impact
• Heavily used in many workloads
– BI/Reporting tools generate a substantial number of subqueries
– TPC-H workload: 40% of the 22 queries contain subqueries
– TPC-DS workload: 20% of the 111 queries contain subqueries
• An inefficient plan means the query takes a long time or does not terminate
• Optimizations
– De-correlation
– Conversion of subqueries to joins
16. 16© Copyright 2015 Pivotal. All rights reserved.
Subqueries in Disjunctive Filters
• Find parts with: size > 40 OR price > the average brand price
SELECT *
FROM part p1
WHERE p_size > 40 OR
      p_retailprice > (SELECT avg(p_retailprice)
                       FROM part p2
                       WHERE p2.p_brand = p1.p_brand)
18. 18© Copyright 2015 Pivotal. All rights reserved.
Subquery Handling: Orca vs. Planner
CSQ Class                           | Planner              | Orca
CSQ in select list                  | Correlated Execution | Join
CSQ in disjunctive filter           | Correlated Execution | Join
Multi-Level CSQ                     | No Plan              | Join
CSQ with group by and inequality    | Correlated Execution | Join
CSQ must return one row             | Correlated Execution | Join
CSQ with correlation in select list | Correlated Execution | Correlated Execution
19. 19© Copyright 2015 Pivotal. All rights reserved.
TPC-DS Orca vs. Planner
TPC-DS 10TB, 16 nodes, 48 GB/node
20. 20© Copyright 2015 Pivotal. All rights reserved.
How to add an Orca feature?
21. 21© Copyright 2015 Pivotal. All rights reserved.
Optimization Steps in Orca
• Step 1: Pre-Process Input Logical Expression
– Apply heuristics such as pushing selects down
• Step 2: Exploration (via Transforms)
– Generate all equivalent logical plans
• Step 3: Implementation (via Transforms)
– Generate all physical implementations of each logical operator
• Step 4: Optimization
– Enforce distribution and ordering requirements and pick the cheapest plan
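Step 4 can be shown in miniature: given a few physical alternatives, add an enforcer where a required property is missing, then keep the cheapest. This is a toy sketch, not Orca's cost model; `Alternative` and `PickCheapest` are hypothetical names, only the ordering requirement is modelled, and all costs are made-up numbers.

```cpp
#include <limits>
#include <string>
#include <vector>

struct Alternative {
    std::string name;
    double cost;
    bool delivers_sorted;  // does the plan already produce the required order?
};

// Add a Sort enforcer to alternatives that lack the required order,
// then return the alternative with the lowest total cost.
inline Alternative PickCheapest(const std::vector<Alternative>& alts,
                                bool order_required, double sort_cost) {
    Alternative best{"", std::numeric_limits<double>::infinity(), false};
    for (const Alternative& a : alts) {
        Alternative candidate = a;
        if (order_required && !candidate.delivers_sorted) {
            candidate.name = "Sort(" + candidate.name + ")";
            candidate.cost += sort_cost;
            candidate.delivers_sorted = true;
        }
        if (candidate.cost < best.cost) best = candidate;
    }
    return best;
}
```

With a hash join at cost 100 (unsorted) and a merge join at cost 120 (sorted), a sort cost of 50 makes the merge join win when ordered output is required, while the hash join wins when it is not. The winner depends on the requirement, which is why enforcement and costing happen together.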
22. 22© Copyright 2015 Pivotal. All rights reserved.
Let’s Pair
• Split an aggregate into a pair of local and global aggregates.
• Schema: CREATE TABLE foo (a int, b int, c int) distributed by (a);
• Query: SELECT sum(c) FROM foo GROUP BY b
• Each segment does a local aggregation and sends its partial result to the master.
• The master then does the final (global) aggregation.
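The local/global split can be simulated in a few lines to check that it computes the same answer as a single aggregate. This is a sketch under assumptions: `Row`, `Groups`, `Distribute`, `LocalAgg`, `GlobalAgg`, and `SplitAggregate` are hypothetical names, and placing rows by `a` modulo the segment count is an assumption made for illustration, not the host system's actual hashing.

```cpp
#include <cstddef>
#include <map>
#include <vector>

struct Row {
    int a, b, c;
};
typedef std::map<int, long> Groups;  // b -> sum(c)

// Spread rows across segments by column a, mimicking "distributed by (a)".
inline std::vector<std::vector<Row>> Distribute(const std::vector<Row>& rows,
                                                std::size_t segments) {
    std::vector<std::vector<Row>> out(segments);
    for (const Row& r : rows)
        out[static_cast<std::size_t>(r.a) % segments].push_back(r);
    return out;
}

// Local aggregate: one segment computes partial sum(c) per b over its own rows.
inline Groups LocalAgg(const std::vector<Row>& segment_rows) {
    Groups g;
    for (const Row& r : segment_rows) g[r.b] += r.c;
    return g;
}

// Global aggregate: the master merges the partial sums from all segments.
inline Groups GlobalAgg(const std::vector<Groups>& partials) {
    Groups g;
    for (const Groups& p : partials)
        for (const auto& kv : p) g[kv.first] += kv.second;
    return g;
}

// Full pipeline: distribute, aggregate locally, then aggregate globally.
inline Groups SplitAggregate(const std::vector<Row>& rows, std::size_t segments) {
    std::vector<Groups> partials;
    for (const std::vector<Row>& seg : Distribute(rows, segments))
        partials.push_back(LocalAgg(seg));
    return GlobalAgg(partials);
}
```

Because the table is distributed by `a` but grouped by `b`, rows of one group land on several segments; the split works because `sum` can be merged from partial sums, and only one partial row per group per segment travels to the master.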
23. 23© Copyright 2015 Pivotal. All rights reserved.
// HEADER FILES
~/orca/libgpopt/include/gpopt/xforms
// SOURCE FILES
~/orca/libgpopt/src/xforms
CXformSplitGbAgg
24. 24© Copyright 2015 Pivotal. All rights reserved.
• Pattern
• Pre-Condition Check
Define What Will Trigger This Transformation
25. 25© Copyright 2015 Pivotal. All rights reserved.
Pattern
GPOS_NEW(pmp) CExpression
    (
    pmp,
    // logical aggregate operator
    GPOS_NEW(pmp) CLogicalGbAgg(pmp),
    // relational child
    GPOS_NEW(pmp) CExpression(pmp, GPOS_NEW(pmp) CPatternLeaf(pmp)),
    // scalar project list
    GPOS_NEW(pmp) CExpression(pmp, GPOS_NEW(pmp) CPatternTree(pmp))
    )
26. 26© Copyright 2015 Pivotal. All rights reserved.
// Compatibility function for splitting aggregates
virtual
BOOL FCompatible(CXform::EXformId exfid)
{
    return (CXform::ExfSplitGbAgg != exfid);
}
Pre-Condition Check
Common-sense rules, such as: do not fire this rule on a logical operator
that was itself produced by the same rule (this avoids infinite recursion).
27. 27© Copyright 2015 Pivotal. All rights reserved.
void Transform
    (
    CXformContext *pxfctxt,
    CXformResult *pxfres,
    CExpression *pexpr
    )
    const;
The Actual Transformation
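The three pieces (pattern, pre-condition check, transformation) can be put together in a toy rewrite rule. This is not the body of `CXformSplitGbAgg::Transform`: `Expr`, `Make`, and `SplitGbAgg` are hypothetical stand-ins, the tree is a plain struct rather than a `CExpression`, and results are returned as a vector instead of being added to a `CXformResult`.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy expression node; field names are hypothetical, not Orca's.
struct Expr {
    std::string op;      // e.g. "GbAgg" or "Get"
    std::string stage;   // "", "local" or "global" for aggregates
    std::string origin;  // name of the rule that produced this node, if any
    std::vector<std::shared_ptr<Expr>> children;
};
typedef std::shared_ptr<Expr> ExprPtr;

inline ExprPtr Make(const std::string& op,
                    const std::vector<ExprPtr>& children = std::vector<ExprPtr>(),
                    const std::string& stage = "",
                    const std::string& origin = "") {
    ExprPtr e = std::make_shared<Expr>();
    e->op = op;
    e->stage = stage;
    e->origin = origin;
    e->children = children;
    return e;
}

// Returns the alternatives produced by the rule (empty if it does not apply).
inline std::vector<ExprPtr> SplitGbAgg(const ExprPtr& expr) {
    std::vector<ExprPtr> results;
    // Pattern: a group-by aggregate with a relational child.
    if (expr->op != "GbAgg" || expr->children.empty()) return results;
    // Pre-condition: never re-fire on the output of this same rule.
    if (expr->origin == "SplitGbAgg") return results;
    // Transformation: GbAgg(child) becomes GlobalGbAgg(LocalGbAgg(child)).
    ExprPtr local = Make("GbAgg", expr->children, "local", "SplitGbAgg");
    ExprPtr global = Make("GbAgg", std::vector<ExprPtr>(1, local), "global", "SplitGbAgg");
    results.push_back(global);
    return results;
}
```

Without the origin check the rule would match its own output, since the new nodes are still group-by aggregates, and exploration would never terminate.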
28. 28© Copyright 2015 Pivotal. All rights reserved.
void CXformFactory::Instantiate()
{
    ….
    Add(GPOS_NEW(m_pmp) CXformSplitGbAgg(m_pmp));
    ….
}
Register Transformation
29. 29© Copyright 2015 Pivotal. All rights reserved.
How to enable Orca on HAWQ?
30. 30© Copyright 2015 Pivotal. All rights reserved.
• GPORCA https://github.com/greenplum-db/gporca
• White Paper: bit.ly/1ntrE8v
• Pivotal Tracker: bit.ly/1m1WGDn
Get Your Hands On It!
31. 31© Copyright 2015 Pivotal. All rights reserved.
• Step 1: Get the CentOS 7 Docker image
https://github.com/xinzweb/hawq-devel-env/blob/master/README.md
• Step 2: Update CMake to 3.4.3
• Step 3: Install GPORCA
– Install Xerces 3.1.2
– Install GPOS
Getting Orca on Apache HAWQ
32. 32© Copyright 2015 Pivotal. All rights reserved.
• Step 4: Apply my changes to Apache HAWQ Makefile
https://github.com/vraghavan78/incubator-hawq/tree/fix-enable-orca
• Step 5: Compile HAWQ with orca enabled
./configure --enable-orca --with-perl --with-python --with-libxml
• Step 6: You may have to copy the libraries to each segment host
For example: scp -r . gpadmin@centos7-datanode1:/usr/local/lib
Getting Orca on Apache HAWQ
33. 33© Copyright 2015 Pivotal. All rights reserved.
Publications
• Optimization of Common Table Expressions in MPP Database Systems, VLDB 2015
– Amr El-Helw, Venkatesh Raghavan, Mohamed A. Soliman, George C. Caragea, Zhongxian Gu, Michalis Petropoulos.
• Orca: A Modular Query Optimizer Architecture for Big Data, SIGMOD 2014
– Mohamed A. Soliman, Lyublena Antova, Venkatesh Raghavan, Amr El-Helw, Zhongxian Gu, Entong Shen, George C. Caragea, Carlos Garcia-Alvarado, Foyzur Rahman, Michalis Petropoulos, Florian Waas, Sivaramakrishnan Narayanan, Konstantinos Krikellas, Rhonda Baldwin
• Optimizing Queries over Partitioned Tables in MPP Systems, SIGMOD 2014
– Lyublena Antova, Amr El-Helw, Mohamed Soliman, Zhongxian Gu, Michalis Petropoulos, Florian Waas
• Reversing Statistics for Scalable Test Databases Generation, DBTest 2013
– Entong Shen, Lyublena Antova
• Total Operator State Recall - Cost-Effective Reuse of Results in Greenplum Database, ICDE Workshops 2013
– George C. Caragea, Carlos Garcia-Alvarado, Michalis Petropoulos, Florian M. Waas
• Testing the Accuracy of Query Optimizers, DBTest 2012
– Zhongxian Gu, Mohamed A. Soliman, Florian M. Waas
• Automatic Capture of Minimal, Portable, and Executable Bug Repros using AMPERe, DBTest 2012
– Lyublena Antova, Konstantinos Krikellas, Florian M. Waas
• Automatic Data Placement in MPP Databases, ICDE Workshops 2012
– Carlos Garcia-Alvarado, Venkatesh Raghavan, Sivaramakrishnan Narayanan, Florian M. Waas