Catalyst optimizer
Presented by Ayub Mohammad, 08-03-2019
Agenda
• What is catalyst optimizer
• Why is it used
• How does it optimize
• Fundamentals of Apache Spark Catalyst Optimizer
• References
What is catalyst optimizer
• It optimizes all queries written with Spark SQL and the DataFrame API, typically making them run much faster than equivalent hand-written RDD code.
• Supports both rule-based and cost-based optimization.
• In rule-based optimization, the optimizer applies a set of rules to determine how to execute the query, while cost-based optimization finds the most suitable way to carry out the SQL statement: multiple plans are generated using rules and their costs are then compared.
What is catalyst optimizer
• It builds on Scala's pattern matching and quasiquotes.
• It also offers an extensible query optimizer that developers can add their own rules to.
Purpose of catalyst optimizer
Catalyst's extensible design had two purposes:
• First, make it easy to add new optimization techniques and features to Spark SQL, especially to tackle the problems seen with big data (e.g., semi-structured data and advanced analytics).
• Second, enable external developers to extend the optimizer, for example by adding data-source-specific rules that can push filtering or aggregation into external storage systems, or support for new data types.
How catalyst optimizer works
val user = spark.read.option("header", true).option("delimiter", "\t")
  .option("inferSchema", true).csv("user.txt")
val purchase = spark.read.option("header", true).option("delimiter", "\t")
  .option("inferSchema", true).csv("purchase.txt")
// keep amount in the projection so the filter can reference it
val joined = purchase.join(user, Seq("userid"), "leftouter")
  .select("pid", "location", "amount").filter("amount > 60").select("location")
joined.explain(true) // prints the parsed, analyzed, optimized and physical plans
Spark SQL Execution Plan
Spark uses Catalyst’s general tree transformation framework in 4 phases:
• Analysis
• Logical Optimization
• Physical planning
• Code generation
Analysis
Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST) returned by
a SQL parser, or from a DataFrame object constructed using the API. It starts by creating an unresolved
logical plan, and then apply the following steps for the sql query
joinedDF.registerTempTable("joinedTable");
spark.sql("select location from joinedTable where pid > 2").explain(true)
• Search relation BY NAME FROM CATALOG.
• Map the name attribute, for example, salary, to the input provided given operator’s children.
• Determine which attributes match to the same value to give them unique ID.
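A minimal sketch of that inspection (it assumes the joinedDF DataFrame and the spark session from the earlier example, and uses createOrReplaceTempView, the non-deprecated equivalent of registerTempTable):
joinedDF.createOrReplaceTempView("joinedTable")
val query = spark.sql("select location from joinedTable where pid > 2")
// the resolved (analyzed) logical plan, after names are resolved against the catalog
println(query.queryExecution.analyzed.numberedTreeString)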
Logical Optimization
• In this phase of Spark SQL optimization, standard rule-based optimizations are applied to the logical plan. These include:
1. Constant folding
2. Predicate pushdown
3. Projection pruning
4. Null propagation, and other rules.
A quick way to see constant folding in action is sketched below.
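As a minimal illustration of constant folding (a sketch, not from the original slides; it assumes a SparkSession named spark), the expression 1 + 2 is replaced by the literal 3 in the optimized plan:
val df = spark.range(1).selectExpr("1 + 2 AS three")
// the optimized logical plan shows Project [3 AS three], i.e. 1 + 2 was folded at planning time
println(df.queryExecution.optimizedPlan.numberedTreeString)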
Example: logical plan for the join query above, before optimization:
project [location]
  filter [purchase.amount > 60]
    project [pid, location, amount]
      join [user.userid == purchase.userid]
        scan table user
        scan table purchase
Optimized logical plan. Projection pruning pushes narrower projections below the join, so each branch carries only the columns it needs:
project [user.location]
  filter [purchase.amount > 60]
    join [user.userid == purchase.userid]
      project [purchase.userid] (over scan table purchase)
      project [user.userid, user.location] (over scan table user)
For comparison, the unoptimized plan from the previous slide:
project [location]
  filter [purchase.amount > 60]
    project [pid, location, amount]
      join [user.userid == purchase.userid]
        scan table user
        scan table purchase
Physical Planning
• After an optimized logical plan is generated, it is passed through a series of SparkStrategies that produce one or more physical plans.
• Catalyst then selects one plan using a cost model.
• Currently, cost-based optimization is only used to select join algorithms.
• The framework supports broader use of cost-based optimization, however, as costs can be estimated recursively for a whole tree using a rule, so richer cost-based optimization could be implemented in the future.
• Physical planning can also push operations from the logical plan into data sources that support predicate or projection pushdown.
The plans this phase produces can be inspected as sketched below.
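A minimal sketch of that inspection (it assumes the joined DataFrame built earlier in this deck):
// physical plan chosen by the cost model, before final preparations
println(joined.queryExecution.sparkPlan.numberedTreeString)
// executed plan, after preparations such as whole-stage code generation are applied
println(joined.queryExecution.executedPlan.numberedTreeString)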
Code Generation
• The final phase of query optimization involves generating Java bytecode to run on each machine.
• Because Spark SQL often operates on in-memory datasets, where processing is CPU-bound, supporting
code generation can speed up execution.
• Catalyst relies on a special feature of the Scala language, quasiquotes, to make code generation simpler.
• Quasiquotes allow the programmatic construction of abstract syntax trees (ASTs) in the Scala language,
which can then be fed to the Scala compiler at runtime to generate bytecode.
• Catalyst is used to transform a tree representing an expression in SQL into an AST for Scala code that evaluates that expression; the generated code is then compiled and run. A simplified sketch of such a code-generation rule follows.
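The idea can be sketched with the simplified example from the Databricks Catalyst blog post (see references). This is an illustration only, not Spark's actual code generator; TreeNode, Literal, Attribute and Add are the toy expression classes introduced in the Tree section below, and row is a hypothetical input-row variable referenced by the generated code:
sealed trait TreeNode
case class Literal(value: Int) extends TreeNode
case class Attribute(name: String) extends TreeNode
case class Add(left: TreeNode, right: TreeNode) extends TreeNode

import scala.reflect.runtime.universe._

// recursively build a Scala AST that evaluates the expression against `row`
def compile(node: TreeNode): Tree = node match {
  case Literal(value)   => q"$value"
  case Attribute(name)  => q"row.get($name)"
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"
}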
Fundamentals of Apache Spark Catalyst Optimizer
• At its core, Catalyst contains a general library for representing trees and applying rules to manipulate them.
Tree
• The main data type in Catalyst is a tree composed of node objects. Each node has a node type and zero or more children. New node types are defined in Scala as subclasses of the TreeNode class.
• Trees are immutable.
• As a simple example, suppose we have the following three node classes for a very simple expression language (sketched in code after this list):
• Literal(value: Int): a constant value
• Attribute(name: String): an attribute from an input row, e.g., "x"
• Add(left: TreeNode, right: TreeNode): the sum of two expressions.
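A minimal sketch of these classes as plain Scala case classes (an illustration only; Spark's real TreeNode is considerably richer):
sealed trait TreeNode
case class Literal(value: Int) extends TreeNode                    // a constant value
case class Attribute(name: String) extends TreeNode                // a named input attribute, e.g. "x"
case class Add(left: TreeNode, right: TreeNode) extends TreeNode   // sum of two expressions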
Tree example for the expression x+(1+2)
These classes can be used to build up trees; for example, the tree for the expression x+(1+2) would be represented in Scala code as follows:
Add(Attribute("x"), Add(Literal(1), Literal(2)))
Rules
• Trees can be manipulated using rules: functions from a tree to another tree. While a rule can run arbitrary code on its input tree, the most common approach is to use a set of pattern-matching functions that find and replace subtrees with a specific structure.
• Pattern matching is a feature of many functional languages that allows extracting values from potentially nested structures.
• Rules can contain arbitrary Scala code, which gives users the flexibility to add new rules easily.
• In Catalyst, trees offer a transform method that applies a pattern-matching function recursively on all nodes of the tree, transforming the ones that match each pattern to a result. For example, we could implement a rule that folds Add operations between constants as follows:
Rules
tree.transform {
case Add(Literal(c1), Literal(c2)) => Literal(c1+c2)
}
• Rules can match multiple patterns in the same transform call, making it very concise to implement multiple
transformations at once:
tree.transform {
case Add(Literal(c1), Literal(c2)) => Literal(c1+c2)
case Add(left, Literal(0)) => left
case Add(Literal(0), right) => right
}
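What these rules accomplish can be sketched against the toy classes above as a hypothetical recursive fold (Catalyst's actual transform handles the recursion generically for any TreeNode):
def fold(node: TreeNode): TreeNode = node match {
  case Add(left, right) =>
    (fold(left), fold(right)) match {
      case (Literal(c1), Literal(c2)) => Literal(c1 + c2)   // constant folding
      case (l, Literal(0))            => l                  // x + 0 => x
      case (Literal(0), r)            => r                  // 0 + x => x
      case (l, r)                     => Add(l, r)
    }
  case other => other
}
// fold(Add(Attribute("x"), Add(Literal(1), Literal(2)))) returns Add(Attribute("x"), Literal(3))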
Sample CombineFilters rule from Spark source code
object CombineFilters extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case ff @ Filter(fc, nf @ Filter(nc, grandChild)) => Filter(And(nc, fc), grandChild)
}
}
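CombineFilters collapses two adjacent Filter operators into a single Filter whose condition is the conjunction of both predicates, so the child is evaluated only once. A quick way to see it (a sketch; it assumes the purchase DataFrame loaded earlier):
val twice = purchase.filter("amount > 60").filter("pid > 2")
// the optimized plan shows one Filter ((amount > 60) AND (pid > 2)) over the relation
println(twice.queryExecution.optimizedPlan.numberedTreeString)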
Custom rules
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object MultiplyOptimizationRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // rewrite `x * 1.0` to just `x`
    case Multiply(left, right) if right.isInstanceOf[Literal] &&
        right.asInstanceOf[Literal].value.asInstanceOf[Double] == 1.0 =>
      println("optimization of one applied")
      left
  }
}
Custom rules
val purchase = spark.read.option("header", true).option("delimiter", "\t").csv("purchase.txt")
val purchaseamount = purchase.selectExpr("amount * 1")
println(purchaseamount.queryExecution.optimizedPlan.numberedTreeString)
00 Project [(cast(amount#3 as double) * 1.0) AS (amount * 1)#5]
01 +- Relation[tid#10,pid#11,userid#12,amount#3,itemdesc#14] csv
spark.experimental.extraOptimizations = Seq(MultiplyOptimizationRule)
val purchaseamount = purchase.selectExpr("amount * 1")
println(purchaseamount.queryExecution.optimizedPlan.numberedTreeString)
00 Project [cast(amount#3 as double) AS (amount * 1)#7]
01 +- Relation[tid#10,pid#11,userid#12,amount#3,itemdesc#14] csv
References
• https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
• http://blog.madhukaraphatak.com/introduction-to-spark-two-part-6/
• https://virtuslab.com/blog/spark-sql-hood-part-i/
• https://data-flair.training/blogs/spark-sql-optimization/
• https://www.tutorialkart.com/apache-spark/dag-and-physical-execution-plan/