Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Yin Huai
Spark Summit 2016
Write Programs Using RDD API

SELECT count(*)
FROM (
  SELECT t1.id
  FROM t1 JOIN t2
  WHERE
    t1.id = t2.id AND
    t2.id > 50 * 1000) tmp
Solution 1

val count = t1.cartesian(t2).filter {
  case (id1FromT1, id2FromT2) => id1FromT1 == id2FromT2
}.filter {
  case (id1FromT1, id2FromT2) => id1FromT2 > 50 * 1000
}.map {
  case (id1FromT1, id2FromT2) => id1FromT1
}.count
println("Count: " + count)

[Diagram: t1 join t2, with t1.id = t2.id and t2.id > 50 * 1000 both applied after the cartesian product]
Solution 2

val filteredT2 =
  t2.filter(id1FromT2 => id1FromT2 > 50 * 1000)
val preparedT1 =
  t1.map(id1FromT1 => (id1FromT1, id1FromT1))
val preparedT2 =
  filteredT2.map(id1FromT2 => (id1FromT2, id1FromT2))
val count = preparedT1.join(preparedT2).map {
  case (id1FromT1, _) => id1FromT1
}.count
println("Count: " + count)

[Diagram: the filter t2.id > 50 * 1000 is applied to t2 before the join on t1.id = t2.id]
Solution 1 vs. Solution 2

Execution time: Solution 1 takes 53.92 s; Solution 2 takes 0.26 s, roughly 200x faster.
Write Programs Using RDD API

• Users' functions are black boxes
  • Opaque computation
  • Opaque data type
• Programs built using the RDD API have total control over how to execute every data operation
• Developers have to write efficient programs for different kinds of workloads
Is there an easy way to write efficient programs?
The easiest way to write efficient programs is to not worry about it and get your programs automatically optimized.
How?

• Write programs using high-level programming interfaces
  • Programs are used to describe what data operations are needed, without specifying how to execute those operations
  • High-level programming interfaces: SQL, DataFrame, and Dataset
• Get an optimizer that automatically finds the most efficient plan to execute the data operations specified in the user's program (see the sketch below)
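As a concrete illustration (a sketch, not from the original slides), the opening count query can be written against the DataFrame API. Assuming t1 and t2 are DataFrames that each have an id column, Catalyst derives the optimized plan from this declarative form automatically:

// Sketch: the same query via the DataFrame API (Spark 2.x).
// Assumes t1 and t2 are DataFrames with an "id" column.
val tmp = t1.join(t2, t1("id") === t2("id"))
  .where(t2("id") > 50 * 1000)
  .select(t1("id"))
println("Count: " + tmp.count())  // Catalyst handles filter pushdown and join planning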
Catalyst: Apache Spark's Optimizer
How Catalyst Works: An Overview

SQL AST / DataFrame / Dataset (Java/Scala)
  → Query Plan
  → Optimized Query Plan
  → RDDs (via Code Generation)

Catalyst operates on abstractions of users' programs (Trees) and manipulates them with Transformations.
Trees: Abstractions of Users' Programs

SELECT sum(v)
FROM (
  SELECT
    t1.id,
    1 + 2 + t1.value AS v
  FROM t1 JOIN t2
  WHERE
    t1.id = t2.id AND
    t2.id > 50 * 1000) tmp
Expression

• An expression represents a new value, computed based on input values
  • e.g. 1 + 2 + t1.value
• Attribute: a column of a dataset (e.g. t1.id) or a column generated by a specific data operation (e.g. v)
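In the public API, such expression trees are built from Column objects, each of which wraps a Catalyst Expression. A minimal example (illustrative, using the standard functions API):

import org.apache.spark.sql.functions.{col, lit}

// 1 + 2 + t1.value as an expression tree; `as` names the resulting attribute v.
val v = (lit(1) + lit(2) + col("value")).as("v")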
Query Plan

Aggregate [sum(v)]
  Project [t1.id, 1+2+t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50*1000]
      Join
        Scan (t1)
        Scan (t2)
Logical Plan

• A Logical Plan describes computation on datasets without defining how to conduct the computation
• output: a list of attributes generated by this Logical Plan, e.g. [id, v]
• constraints: a set of invariants about the rows generated by this plan, e.g. t2.id > 50 * 1000
Physical Plan

• A Physical Plan describes computation on datasets with specific definitions on how to conduct the computation
• A Physical Plan is executable

Hash-Aggregate [sum(v)]
  Project [t1.id, 1+2+t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50*1000]
      Sort-Merge Join
        Parquet Scan (t1)
        JSON Scan (t2)
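These trees can also be inspected programmatically; for instance, using the tmp DataFrame from the earlier sketch, queryExecution (a developer API in Spark 2.x) exposes each stage:

// Peek at Catalyst's trees for a DataFrame (developer API):
val qe = tmp.queryExecution
println(qe.analyzed)      // resolved Logical Plan
println(qe.optimizedPlan) // Logical Plan after optimization
println(qe.sparkPlan)     // the selected Physical Plan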
Transformations

• Transformations without changing the tree type (Transform and Rule Executor)
  • Expression => Expression
  • Logical Plan => Logical Plan
  • Physical Plan => Physical Plan
• Transforming a tree to another kind of tree
  • Logical Plan => Physical Plan
Transform

• A function associated with every tree, used to implement a single rule

Before (1 + 2 + t1.value), which evaluates 1 + 2 for every row:

Add
  Add
    Literal(1)
    Literal(2)
  Attribute (t1.value)

After (3 + t1.value), which evaluates 1 + 2 only once:

Add
  Literal(3)
  Attribute (t1.value)
Transform

• A transformation is defined as a Partial Function
• Partial Function: a function that is defined for a subset of its possible arguments

val expression: Expression = ...
expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x + y)
}

The case statement determines whether the partial function is defined for a given input: here the rule matches only an Add of two integer literals and folds it into a single Literal. (In real Catalyst code the matched values are typed, e.g. x: Int, so that x + y compiles.)
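To make the mechanics concrete, here is a small self-contained sketch (toy classes, not Catalyst's actual Expression hierarchy) of a bottom-up transform driven by a partial function:

// Toy sketch of transform; not Catalyst's real classes.
sealed trait Expr {
  // Apply `rule` bottom-up: rewrite children first, then this node.
  def transform(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = this match {
      case Add(l, r) => Add(l.transform(rule), r.transform(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(withNewChildren, identity[Expr])
  }
}
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// 1 + 2 + t1.value  ==>  3 + t1.value
val expr = Add(Add(Literal(1), Literal(2)), Attribute("t1.value"))
val folded = expr.transform {
  case Add(Literal(x), Literal(y)) => Literal(x + y)
}
// folded == Add(Literal(3), Attribute("t1.value"))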
Combining Multiple Rules

Predicate Pushdown moves the filter t2.id > 50 * 1000 below the join, so t2 is filtered before joining on t1.id = t2.id:

Aggregate [sum(v)]
  Project [t1.id, 1+2+t1.value AS v]
    Join [t1.id = t2.id]
      Scan (t1)
      Filter [t2.id > 50*1000]
        Scan (t2)
Constant Folding then evaluates constant expressions at planning time: 1+2+t1.value AS v becomes 3+t1.value AS v, and 50 * 1000 becomes 50000:

Aggregate [sum(v)]
  Project [t1.id, 3+t1.value AS v]
    Join [t1.id = t2.id]
      Scan (t1)
      Filter [t2.id > 50000]
        Scan (t2)
Column Pruning then inserts Projects so only the needed columns are read: t1.id and t1.value from t1, and t2.id from t2:

Aggregate [sum(v)]
  Project [t1.id, 3+t1.value AS v]
    Join [t1.id = t2.id]
      Project [t1.id, t1.value]
        Scan (t1)
      Filter [t2.id > 50000]
        Project [t2.id]
          Scan (t2)
Before transformations:

Aggregate [sum(v)]
  Project [t1.id, 1+2+t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50*1000]
      Join
        Scan (t1)
        Scan (t2)

After transformations:

Aggregate [sum(v)]
  Project [t1.id, 3+t1.value AS v]
    Join [t1.id = t2.id]
      Project [t1.id, t1.value]
        Scan (t1)
      Filter [t2.id > 50000]
        Project [t2.id]
          Scan (t2)
Combining Multiple Rules: Rule Executor

A Rule Executor transforms a Tree into another Tree of the same type by applying many rules, organized into batches (Batch 1, Batch 2, …, Batch n, each holding Rule 1, Rule 2, …).

• Every rule is implemented based on Transform
• Approaches for applying the rules of a batch:
  1. Fixed point (re-run the batch until the tree stops changing)
  2. Once
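Continuing the toy sketch from above (again, not Catalyst's real RuleExecutor, which lives in org.apache.spark.sql.catalyst.rules), batches with Once and fixed-point strategies might look like:

// Toy rule executor, reusing Expr/transform from the earlier sketch.
sealed trait ApplyStrategy
case object Once extends ApplyStrategy
case class FixedPoint(maxIterations: Int) extends ApplyStrategy

case class Batch(name: String, strategy: ApplyStrategy,
                 rules: PartialFunction[Expr, Expr]*)

def execute(plan: Expr, batches: Seq[Batch]): Expr =
  batches.foldLeft(plan) { (tree, batch) =>
    def runBatchOnce(e: Expr): Expr =
      batch.rules.foldLeft(e)((t, rule) => t.transform(rule))
    batch.strategy match {
      case Once => runBatchOnce(tree)
      case FixedPoint(max) =>
        // Re-apply the batch until the tree stops changing (or we hit the cap).
        var prev = tree
        var next = runBatchOnce(prev)
        var i = 1
        while (next != prev && i < max) { prev = next; next = runBatchOnce(next); i += 1 }
        next
    }
  }

// Usage, with `expr` from the earlier sketch:
val optimized = execute(expr, Seq(
  Batch("Constant Folding", FixedPoint(100),
    { case Add(Literal(x), Literal(y)) => Literal(x + y) })))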
From Logical Plan to Physical Plan

• A Logical Plan is transformed to a Physical Plan by applying a set of Strategies
• Every Strategy uses pattern matching to convert a Tree to another kind of Tree

object BasicOperators extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    …
    case logical.Project(projectList, child) =>
      execution.ProjectExec(projectList, planLater(child)) :: Nil
    case logical.Filter(condition, child) =>
      execution.FilterExec(condition, planLater(child)) :: Nil
    …
  }
}

planLater triggers other Strategies to plan the child subtree.
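The same idea in a self-contained toy (not Spark's classes): a strategy is a pattern match from one kind of tree to another, recursing into children the way planLater defers child planning:

// Toy strategy sketch: converts a "logical" tree into a "physical" tree.
sealed trait LogicalOp
case class LScan(table: String) extends LogicalOp
case class LFilter(predicate: String, child: LogicalOp) extends LogicalOp

sealed trait PhysicalOp
case class PScan(table: String) extends PhysicalOp
case class PFilter(predicate: String, child: PhysicalOp) extends PhysicalOp

def plan(op: LogicalOp): PhysicalOp = op match {
  case LScan(t)      => PScan(t)
  case LFilter(p, c) => PFilter(p, plan(c)) // plan(c) plays the role of planLater
}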
Putting it all together, a query moves through Catalyst as follows:

SQL AST / DataFrame / Dataset (Java/Scala)
  → Unresolved Logical Plan
  → [Analysis, using the Catalog]
  → Logical Plan
  → [Logical Optimization]
  → Optimized Logical Plan
  → [Physical Planning]
  → Physical Plans
  → [Cost Model]
  → Selected Physical Plan
  → RDDs
• Analysis (Rule Executor): transforms an Unresolved Logical Plan to a Resolved Logical Plan
  • Unresolved => Resolved: uses the Catalog to find where datasets and columns come from, and the types of columns
• Logical Optimization (Rule Executor): transforms a Resolved Logical Plan to an Optimized Logical Plan
• Physical Planning (Strategies + Rule Executor): transforms an Optimized Logical Plan to a Physical Plan
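All of these stages can be printed for any query; for example, with the tmp DataFrame from the earlier sketch, explain(true) in Spark 2.x shows each one:

// Print every Catalyst stage for a query:
tmp.explain(true)
// == Parsed Logical Plan ==
// == Analyzed Logical Plan ==
// == Optimized Logical Plan ==
// == Physical Plan ==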
Spark's Planner

• 1st phase: transforms the Logical Plan to the Physical Plan using Strategies
• 2nd phase: uses a Rule Executor to make the Physical Plan ready for execution
  • Prepare scalar subqueries
  • Ensure requirements on input rows
  • Apply physical optimizations
Ensure Requirements on Input Rows

A Sort-Merge Join on t1.id = t2.id requires its inputs to be sorted: rows from t1 by t1.id, and rows from t2 by t2.id. The preparation rules satisfy these requirements by inserting Sort operators:

Sort-Merge Join [t1.id = t2.id]
  Sort [t1.id]
    t1
  Sort [t2.id]
    t2

What if t1 has already been sorted by t1.id? The Sort over t1 is then unnecessary, and the planner drops it:

Sort-Merge Join [t1.id = t2.id]
  t1 [sorted by t1.id]
  Sort [t2.id]
    t2
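A toy sketch of this preparation step (not Spark's actual EnsureRequirements rule): each operator declares the ordering it needs from each child, and a Sort is inserted only when the child does not already provide it:

// Toy sketch: insert Sorts only where the required ordering is missing.
sealed trait PhysOp { def outputOrdering: Option[String] }
case class TableScan(table: String, outputOrdering: Option[String]) extends PhysOp
case class Sort(key: String, child: PhysOp) extends PhysOp {
  def outputOrdering: Option[String] = Some(key)
}
case class SortMergeJoin(key: String, left: PhysOp, right: PhysOp) extends PhysOp {
  def outputOrdering: Option[String] = Some(key)
}

def ensureSorted(child: PhysOp, key: String): PhysOp =
  if (child.outputOrdering.contains(key)) child else Sort(key, child)

def prepare(op: PhysOp): PhysOp = op match {
  case SortMergeJoin(k, l, r) =>
    SortMergeJoin(k, ensureSorted(prepare(l), k), ensureSorted(prepare(r), k))
  case other => other
}

// t1 is already sorted by id: only t2 gets a Sort inserted above it.
val planned = prepare(SortMergeJoin("id",
  TableScan("t1", outputOrdering = Some("id")),
  TableScan("t2", outputOrdering = None)))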
Catalyst in Apache Spark
Catalyst sits on top of Spark Core (RDD). SQL and DataFrame/Dataset (Spark SQL) are built on Catalyst, and higher-level libraries such as ML Pipelines, Structured Streaming, and GraphFrames, plus data sources ({ JSON }, JDBC, and more…), build on those in turn.

With Spark 2.0, we expect most users to migrate to the high-level APIs (SQL, DataFrame, and Dataset).
Where to Start

• Source code:
  • Trees: TreeNode, Expression, Logical Plan, and Physical Plan
  • Transformations: Analyzer, Optimizer, and Planner
• Check out previous pull requests
• Start to write code using Catalyst
SPARK-12032: ~200 lines of changes (95 lines are for tests); certain workloads can run hundreds of times faster.
SPARK-8992: ~250 lines of changes (99 lines are for tests); pivot table support.
Try Apache Spark with Databricks

• Try the latest version of Apache Spark and a preview of Spark 2.0

http://databricks.com/try
Thank you.

Office hour: 2:45pm–3:30pm @ Expo Hall