Catalyst optimizer
Presented by Ayub Mohammad, 08-03-2019
Agenda
• What is catalyst optimizer
• Why is it used
• How does it optimize
• Fundamentals of Apache Spark Catalyst Optimizer
• References
What is catalyst optimizer
• It optimizes all queries written with Spark SQL and the DataFrame API, typically making them run much faster than equivalent hand-written RDD code.
• Supports both rule-based and cost-based optimization.
• In rule-based optimization, the optimizer applies a set of rules to determine how to execute the query, while cost-based optimization finds the most suitable way to carry out the SQL statement: multiple plans are generated using rules and their costs are then compared.
What is catalyst optimizer
• It builds on Scala's pattern matching and quasiquotes.
• It also offers an extensible query optimizer that developers can add their own rules to.
Purpose of catalyst optimizer
Catalyst's extensible design had two purposes:
• First, make it easy to add new optimization techniques and features to Spark SQL, especially to tackle the problems seen with big data (e.g., semi-structured data and advanced analytics).
• Second, enable external developers to extend the optimizer, for example by adding data-source-specific rules that can push filtering or aggregation into external storage systems, or support for new data types.
How catalyst optimizer works
val user = spark.read.option("header", true).option("delimiter", "\t")
  .option("inferSchema", true).csv("user.txt")
val purchase = spark.read.option("header", true).option("delimiter", "\t")
  .option("inferSchema", true).csv("purchase.txt")
// keep amount in the projection so the filter can reference it
val joined = purchase.join(user, Seq("userid"), "leftouter")
  .select("pid", "location", "amount").filter("amount > 60").select("location")
joined.explain(true) // prints the parsed, analyzed, optimized and physical plans
Spark SQL Execution Plan
Spark uses Catalyst’s general tree transformation framework in 4 phases:
• Analysis
• Logical Optimization
• Physical planning
• Code generation
Analysis
Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST) returned by
a SQL parser, or from a DataFrame object constructed using the API. It starts by creating an unresolved
logical plan, and then apply the following steps for the sql query
joinedDF.registerTempTable("joinedTable");
spark.sql("select location from joinedTable where pid > 2").explain(true)
• Search relation BY NAME FROM CATALOG.
• Map the name attribute, for example, salary, to the input provided given operator’s children.
• Determine which attributes match to the same value to give them unique ID.
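A minimal sketch of that inspection (it assumes the joinedDF DataFrame and the spark session from the earlier example, and uses createOrReplaceTempView, the non-deprecated equivalent of registerTempTable):
joinedDF.createOrReplaceTempView("joinedTable")
val query = spark.sql("select location from joinedTable where pid > 2")
// the resolved (analyzed) logical plan, after names are resolved against the catalog
println(query.queryExecution.analyzed.numberedTreeString)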
Logical Optimization
• In this phase of Spark SQL optimization, standard rule-based optimizations are applied to the logical plan. These include:
1. Constant folding
2. Predicate pushdown
3. Projection pruning
4. Null propagation, and other rules.
A quick way to see constant folding in action is sketched below.
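As a minimal illustration of constant folding (a sketch, not from the original slides; it assumes a SparkSession named spark), the expression 1 + 2 is replaced by the literal 3 in the optimized plan:
val df = spark.range(1).selectExpr("1 + 2 AS three")
// the optimized logical plan shows Project [3 AS three], i.e. 1 + 2 was folded at planning time
println(df.queryExecution.optimizedPlan.numberedTreeString)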
Example: logical plan for the join query above, before optimization:
project [location]
  filter [purchase.amount > 60]
    project [pid, location, amount]
      join [user.userid == purchase.userid]
        scan table user
        scan table purchase
Optimized logical plan. Projection pruning pushes narrower projections below the join, so each branch carries only the columns it needs:
project [user.location]
  filter [purchase.amount > 60]
    join [user.userid == purchase.userid]
      project [purchase.userid] (over scan table purchase)
      project [user.userid, user.location] (over scan table user)
For comparison, the unoptimized plan from the previous slide:
project [location]
  filter [purchase.amount > 60]
    project [pid, location, amount]
      join [user.userid == purchase.userid]
        scan table user
        scan table purchase
Physical Planning
• After an optimized logical plan is generated, it is passed through a series of SparkStrategies that produce one or more physical plans.
• Catalyst then selects one plan using a cost model.
• Currently, cost-based optimization is only used to select join algorithms.
• The framework supports broader use of cost-based optimization, however, as costs can be estimated recursively for a whole tree using a rule, so richer cost-based optimization could be implemented in the future.
• Physical planning can also push operations from the logical plan into data sources that support predicate or projection pushdown.
The plans this phase produces can be inspected as sketched below.
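A minimal sketch of that inspection (it assumes the joined DataFrame built earlier in this deck):
// physical plan chosen by the cost model, before final preparations
println(joined.queryExecution.sparkPlan.numberedTreeString)
// executed plan, after preparations such as whole-stage code generation are applied
println(joined.queryExecution.executedPlan.numberedTreeString)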
Code Generation
• The final phase of query optimization involves generating Java bytecode to run on each machine.
• Because Spark SQL often operates on in-memory datasets, where processing is CPU-bound, supporting
code generation can speed up execution.
• Catalyst relies on a special feature of the Scala language, quasiquotes, to make code generation simpler.
• Quasiquotes allow the programmatic construction of abstract syntax trees (ASTs) in the Scala language,
which can then be fed to the Scala compiler at runtime to generate bytecode.
• Catalyst is used to transform a tree representing an expression in SQL into an AST for Scala code that evaluates that expression; the generated code is then compiled and run. A simplified sketch of such a code-generation rule follows.
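The idea can be sketched with the simplified example from the Databricks Catalyst blog post (see references). This is an illustration only, not Spark's actual code generator; TreeNode, Literal, Attribute and Add are the toy expression classes introduced in the Tree section below, and row is a hypothetical input-row variable referenced by the generated code:
sealed trait TreeNode
case class Literal(value: Int) extends TreeNode
case class Attribute(name: String) extends TreeNode
case class Add(left: TreeNode, right: TreeNode) extends TreeNode

import scala.reflect.runtime.universe._

// recursively build a Scala AST that evaluates the expression against `row`
def compile(node: TreeNode): Tree = node match {
  case Literal(value)   => q"$value"
  case Attribute(name)  => q"row.get($name)"
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"
}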
Fundamentals of Apache Spark Catalyst Optimizer
• At its core, Catalyst contains a general library for representing trees and applying rules to manipulate them.
Tree
• The main data type in Catalyst is a tree composed of node objects. Each node has a node type and zero or more children. New node types are defined in Scala as subclasses of the TreeNode class.
• Trees are immutable.
• As a simple example, suppose we have the following three node classes for a very simple expression language (sketched in code after this list):
• Literal(value: Int): a constant value
• Attribute(name: String): an attribute from an input row, e.g., "x"
• Add(left: TreeNode, right: TreeNode): the sum of two expressions.
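A minimal sketch of these classes as plain Scala case classes (an illustration only; Spark's real TreeNode is considerably richer):
sealed trait TreeNode
case class Literal(value: Int) extends TreeNode                    // a constant value
case class Attribute(name: String) extends TreeNode                // a named input attribute, e.g. "x"
case class Add(left: TreeNode, right: TreeNode) extends TreeNode   // sum of two expressions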
Tree example for the expression x+(1+2)
These classes can be used to build up trees; for example, the tree for the expression x+(1+2) would be represented in Scala code as follows:
Add(Attribute("x"), Add(Literal(1), Literal(2)))
Rules
• Trees can be manipulated using rules: functions from a tree to another tree. While a rule can run arbitrary code on its input tree, the most common approach is to use a set of pattern-matching functions that find and replace subtrees with a specific structure.
• Pattern matching is a feature of many functional languages that allows extracting values from potentially nested structures.
• Rules can contain arbitrary Scala code, which gives users the flexibility to add new rules easily.
• In Catalyst, trees offer a transform method that applies a pattern-matching function recursively on all nodes of the tree, transforming the ones that match each pattern to a result. For example, we could implement a rule that folds Add operations between constants as follows:
Rules
tree.transform {
case Add(Literal(c1), Literal(c2)) => Literal(c1+c2)
}
• Rules can match multiple patterns in the same transform call, making it very concise to implement multiple
transformations at once:
tree.transform {
case Add(Literal(c1), Literal(c2)) => Literal(c1+c2)
case Add(left, Literal(0)) => left
case Add(Literal(0), right) => right
}
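What these rules accomplish can be sketched against the toy classes above as a hypothetical recursive fold (Catalyst's actual transform handles the recursion generically for any TreeNode):
def fold(node: TreeNode): TreeNode = node match {
  case Add(left, right) =>
    (fold(left), fold(right)) match {
      case (Literal(c1), Literal(c2)) => Literal(c1 + c2)   // constant folding
      case (l, Literal(0))            => l                  // x + 0 => x
      case (Literal(0), r)            => r                  // 0 + x => x
      case (l, r)                     => Add(l, r)
    }
  case other => other
}
// fold(Add(Attribute("x"), Add(Literal(1), Literal(2)))) returns Add(Attribute("x"), Literal(3))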
Sample CombineFilters rule from Spark source code
object CombineFilters extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case ff @ Filter(fc, nf @ Filter(nc, grandChild)) => Filter(And(nc, fc), grandChild)
}
}
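CombineFilters collapses two adjacent Filter operators into a single Filter whose condition is the conjunction of both predicates, so the child is evaluated only once. A quick way to see it (a sketch; it assumes the purchase DataFrame loaded earlier):
val twice = purchase.filter("amount > 60").filter("pid > 2")
// the optimized plan shows one Filter ((amount > 60) AND (pid > 2)) over the relation
println(twice.queryExecution.optimizedPlan.numberedTreeString)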
Custom rules
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object MultiplyOptimizationRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // rewrite `x * 1.0` to just `x`
    case Multiply(left, right) if right.isInstanceOf[Literal] &&
        right.asInstanceOf[Literal].value.asInstanceOf[Double] == 1.0 =>
      println("optimization of one applied")
      left
  }
}
Custom rules
val purchase = spark.read.option("header", true).option("delimiter", "\t").csv("purchase.txt")
val purchaseamount = purchase.selectExpr("amount * 1")
println(purchaseamount.queryExecution.optimizedPlan.numberedTreeString)
00 Project [(cast(amount#3 as double) * 1.0) AS (amount * 1)#5]
01 +- Relation[tid#10,pid#11,userid#12,amount#3,itemdesc#14] csv
spark.experimental.extraOptimizations = Seq(MultiplyOptimizationRule)
val purchaseamount = purchase.selectExpr("amount * 1")
println(purchaseamount.queryExecution.optimizedPlan.numberedTreeString)
00 Project [cast(amount#3 as double) AS (amount * 1)#7]
01 +- Relation[tid#10,pid#11,userid#12,amount#3,itemdesc#14] csv
References
• https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
• http://blog.madhukaraphatak.com/introduction-to-spark-two-part-6/
• https://virtuslab.com/blog/spark-sql-hood-part-i/
• https://data-flair.training/blogs/spark-sql-optimization/
• https://www.tutorialkart.com/apache-spark/dag-and-physical-execution-plan/