Kazuaki Ishizaki
IBM Research – Tokyo
Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan
What is Apache Spark?
▪ Framework for distributed computing that transforms distributed immutable in-memory structures using a set of parallel operations
▪ e.g. map(), filter(), reduce(), … (a minimal sketch follows below)
– Distributed immutable in-memory structures
▪ RDD (Resilient Distributed Dataset), DataFrame, Dataset
– SQL-based data types are supported
– Scala is the primary language for programming on Spark
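These parallel operations compose like ordinary Scala collection operations. A minimal sketch, assuming spark is an existing SparkSession:

import spark.implicits._                 // spark: an existing SparkSession

val ds = Seq(1, 2, 3, 4).toDS            // Dataset[Int]: distributed and immutable
val sum = ds.filter(_ % 2 == 0)          // parallel filter
            .map(_ * 10)                 // parallel map
            .reduce(_ + _)               // action: returns 60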
[Figure: the Spark stack — SparkSQL (SQL), Spark Streaming (real-time), MLlib (machine learning), and GraphX (graph) sit on the Spark Runtime (written in Java and Scala), which runs on the Java Virtual Machine. A Driver running the user program (val ds = ...; val ds1 = ...) sends tasks to Executors, which process Data from a Data Source (HDFS, DB, File, etc.) and send results back.]
Open source: http://spark.apache.org/
Latest version is 2.4, released in 2018/11
This talk focuses on executor behavior.
How is Code on Each Executor Generated?
▪ A program written in the embedded DSL is translated into Java code through analysis and optimization passes in Spark (a sketch of how to inspect the generated code follows below)
val ds: Dataset[Array[Int]] = Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds.filter(a => a(0) > 0).map(a => a)
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
ArrayData a = row.getArray(0);
…
}
[Figure: compilation flow — a Spark program (SQL, DataFrame, or Dataset) passes through the Analyzer, the Rule-based Optimizer, and the Code Generator; the generated Java code (the loop above) runs on the Java virtual machine. The cached data is columnar: Column 0 holds Row 0 = (0, 2) and Row 1 = (1, 3).]
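To inspect the Java code that the Code Generator emits for a plan, Spark SQL ships debug helpers. A minimal sketch, assuming spark is an existing SparkSession; in Spark 2.x, debugCodegen() becomes available on a Dataset after the debug import:

import spark.implicits._
import org.apache.spark.sql.execution.debug._  // adds debugCodegen() to Dataset

val ds = Seq(Array(0, 2), Array(1, 3)).toDS.cache
val ds1 = ds.filter(a => a(0) > 0).map(a => a)

// Prints each whole-stage-codegen subtree of the physical plan
// together with its generated Java source
ds1.debugCodegen()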
Motivating Example
▪ A simple Spark program that performs filter and map operations
val ds = Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds.filter(a => a(0) > 0)
.map(a => a)
[Figure: the cached Dataset in columnar storage — Column 0 holds Row 0 = (0, 2) and Row 1 = (1, 3).]
Motivating Example
▪ Spark generates complicated code from a simple program
val ds = Seq(
Array(0, 2),
Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
final class GeneratedIterator {
Iterator inputIterator = ...;
Row projectRow = new Row(1);
RowWriter rowWriter = new RowWriter(projectRow);
protected void processNext() {
while (inputIterator.hasNext()) {
Row inputRow = (Row) inputIterator.next();
ArrayData a = inputRow.getArray(0);
Object[] obj0 = new Object[a.length];
for (int i = 0; i < a.length; i++)
obj0[i] = new Integer(a.getInt(i));
ArrayData array_filter = new GenericArrayData(obj0);
int[] input_filter = array_filter.toIntArray();
boolean fvalue = (Boolean)filter_func.apply(input_filter);
if (!fvalue) continue;
Object[] obj1 = new Object[a.length];
for (int i = 0; i < a.length; i++)
obj1[i] = new Integer(a.getInt(i));
ArrayData array_map = new GenericArrayData(obj1);
int[] input_map = array_map.toIntArray();
int[] mvalue = (int[])map_func.apply(input_map);
ArrayData value = new GenericArrayData(mvalue);
rowWriter.write(0, value);
appendRow(projectRow);
}
}
}
Note: the code that is actually generated is even more complicated
Performance Issues in Generated Code
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions
final class GeneratedIterator {
Iterator inputIterator = ...;
Row projectRow = new Row(1);
RowWriter rowWriter = new RowWriter(projectRow);
protected void processNext() {
while (inputIterator.hasNext()) {
Row inputRow = (Row) inputIterator.next();
ArrayData a = inputRow.getArray(0);
Object[] obj0 = new Object[a.length];
for (int i = 0; i < a.length; i++)
obj0[i] = Integer.valueOf(a.getInt(i));
ArrayData array_filter = new GenericArrayData(obj0);
int[] input_filter = array_filter.toIntArray();
boolean fvalue = (Boolean)filter_func.apply(input_filter);
if (!fvalue) continue;
Object[] obj1 = new Object[a.length];
for (int i = 0; i < a.length; i++)
obj1[i] = Integer.valueOf(a.getInt(i));
ArrayData array_map = new GenericArrayData(obj1);
int[] input_map = array_map.toIntArray();
int[] mvalue = (int[])map_func.apply(input_map);
ArrayData value = new GenericArrayData(mvalue);
rowWriter.write(0, value);
appendRow(projectRow);
}
}
}
[Slide annotations locate the issues in the code above: P1 at the row iterator (copy from columnar to row-oriented data); P3 (boxing) at each Object[] fill loop; P2 at each GenericArrayData; P3 (unboxing) at each toIntArray() call.]
val ds = Seq(
Array(0, 2),
Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
Our Contributions
▪ Revealed performance issues in generated code from a Spark program
▪ Devised three optimizations
– to eliminate unnecessary data copy (Data-copy)
– to improve efficiency of data representation (Data-representation)
– to eliminate unnecessary data conversion (Data-conversion)
▪ Achieved up to 1.4x performance improvements
– 22 TPC-H queries
– Two machine learning programs
▪ Merged these optimizations into Spark 2.3 and later versions
These optimizations reduce the path length of handling data.
Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
Basic Compilation Strategy of an Operator
▪ Use Volcano style [Graefe93]
– Connect operators through an iterator, which makes it easy to add new operators (a sketch follows after the code below)
val ds = Seq((Array(0, 2), 10, “Tokyo”))
val ds1 = ds.filter(a => a.int_ > 0)
  .map(a => a)
[Figure: each operator pulls rows from its child through a row-based iterator; a row has an Array, an Int, and a String column, e.g. ((0, 2), 10, “Tokyo”).]
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// map(...)
...
}
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// filter(...)
...
}
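A minimal Scala sketch of the Volcano style, using hypothetical Row and Operator types rather than Spark's actual classes: every operator is itself a row iterator that pulls rows from its child, so new operators compose freely, but each row pays a virtual hasNext()/next() call per operator:

// Hypothetical types for illustration; Spark's internal operators differ.
case class Row(values: Array[Any]) {
  def getInteger(i: Int): Int = values(i).asInstanceOf[Int]
}

trait Operator extends Iterator[Row]  // every operator is a row-based iterator

class Scan(rows: Seq[Row]) extends Operator {
  private val it = rows.iterator
  def hasNext: Boolean = it.hasNext
  def next(): Row = it.next()
}

class FilterOp(child: Operator, p: Row => Boolean) extends Operator {
  private var pending: Option[Row] = None
  def hasNext: Boolean = {
    while (pending.isEmpty && child.hasNext) {
      val r = child.next()
      if (p(r)) pending = Some(r)  // buffer the next row that passes the predicate
    }
    pending.isDefined
  }
  def next(): Row = { val r = pending.get; pending = None; r }
}

class MapOp(child: Operator, f: Row => Row) extends Operator {
  def hasNext: Boolean = child.hasNext
  def next(): Row = f(child.next())
}

// filter on the Int column, then an identity map, chained through iterators
val data = Seq(Row(Array(Array(0, 2), 10, "Tokyo")))
val plan: Operator = new MapOp(new FilterOp(new Scan(data), _.getInteger(1) > 0), identity)
plan.foreach(r => println(r.getInteger(1)))  // prints 10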
Overview of Generated Code in Spark
▪ Put multiple operators into one loop [Neumann11] when possible
– Avoids the overhead of iterators
– Encourages compiler optimizations within the loop
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// map(...)
...
}
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// filter(...)
...
}
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// filter(...)
...
// map(...)
...
}
val ds1 = ds.filter(a => a(0) > 0)
.map(a => a)
[The two separate loops above show Volcano style; the single fused loop shows whole-stage code generation.]
Columnar Storage to Generated Code
▪ While the source uses columnar storage, the generated code requires data in row-based storage
val ds = Seq((Array(0, 2), 10, “Tokyo”),
             (Array(1, 3), 20, “Mumbai”)).toDS.cache
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  int x = row.getInteger(1);
  ...   // two operations in a loop
}
[Figure: the columnar storage holds three columns (Array, Int, String) with rows (0, 2) / 10 / “Tokyo” and (1, 3) / 20 / “Mumbai”; a row-based iterator feeds the operators one assembled row, e.g. ((0, 2), 10, “Tokyo”), at a time.]
Problem: Data Copy From Columnar Storage
▪ Copying a set of columns into a row occurs when the iterator is used
[Figure: the same pipeline as the previous slide, with the data copy highlighted — each next() call copies a set of columns out of the columnar storage into a row such as ((0, 2), 10, “Tokyo”) before the operators run.]
Solution: Generate Optimized Code
▪ If the new analysis identifies that the source is columnar storage,
– Use a counter-based loop without a row-based iterator
▪ The loop index identifies a row position in the columnar storage
– Get data from the columnar storage directly
Column column1 = df1.getColumn(1);
int sum = 0;
for (int i = 0; i < column1.numRows(); i++) {
  int x = column1.getInteger(i);
  ...
}
[Figure: the operators now read directly from the columnar storage (Array, Int, and String columns holding (0, 2) / 10 / “Tokyo” and (1, 3) / 20 / “Mumbai”), with no row-based iterator in between.]
Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
Overview of Old Internal Array Representation
▪ Use an Object array so that each element can hold either a NULL value for SQL or a primitive value (e.g. 0 or 2)
☺Easy to represent NULL
☹Requires a boxed object (e.g. an Integer object) to hold a primitive value (e.g. an int)
val ds = Seq(Array(0, 2, NULL))
[Figure: the Object-array representation, len = 3 — element [0] points to an Integer object holding 0, element [1] to an Integer holding 2, and element [2] is NULL.]
Problem: Boxing and Unboxing Occur
▪ A setter method (e.g. setInt(1, 2)) causes boxing (i.e. creates an object)
▪ A getter method (e.g. getInt(0)) causes unboxing
▪ The pointer to each object increases the memory footprint (a micro-benchmark sketch follows below)
int getInt(int i) {
return ((Integer) array[i]).intValue();
}
void setInt(int i, int v) {
array[i] = new Integer(v);
}
Unboxing: from an Integer object to an int value (in getInt)
Boxing: from an int value to an Integer object (in setInt)
[Figure: the same Object array as above — len = 3, elements [0] and [1] point to Integer objects holding 0 and 2, and element [2] is NULL.]
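To get a feel for the cost of this representation, a rough micro-benchmark sketch comparing the boxed path with a primitive path (illustrative only; a serious measurement would use a harness such as JMH):

def time[A](label: String)(f: => A): A = {
  val t0 = System.nanoTime()
  val r = f
  println(label + ": " + (System.nanoTime() - t0) / 1000000 + " ms")
  r
}

val n = 10000000

// Boxed path: fill an Object array (boxing), then unbox every element
time("boxed") {
  val obj = new Array[AnyRef](n)
  var i = 0
  while (i < n) { obj(i) = java.lang.Integer.valueOf(i); i += 1 }
  val out = new Array[Int](n)
  i = 0
  while (i < n) { out(i) = obj(i).asInstanceOf[java.lang.Integer].intValue(); i += 1 }
  out
}

// Primitive path: a plain int array, no per-element object creation
time("primitive") {
  val values = new Array[Int](n)
  var i = 0
  while (i < n) { values(i) = i; i += 1 }
  values.clone()
}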
Solution: Use Primitive Type When Possible
▪ Keep a value in a primitive field when possible, based on analysis
▪ Keep NULL in a separate bit field (a sketch of this layout follows below)
☺Avoid boxing and unboxing
☺Reduce memory footprint
int getInt(int i) {
return array[i];
}
void setInt(int i, int v) {
array[i] = v;
}
[Figure: the new representation, len = 3 — a primitive int array holds the values 0, 2, 0 at [0], [1], [2], and a separate bit field marks [0] and [1] as Non-Null and [2] as Null.]
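A minimal Scala sketch of this layout, with hypothetical names (Spark's actual UnsafeArrayData packs the null bits and values into a single byte region):

// Values live in a primitive int array; nullness lives in a separate bitmask.
class IntArrayDataSketch(len: Int) {
  private val values = new Array[Int](len)                 // primitive storage, no boxing
  private val nullBits = new Array[Long]((len + 63) / 64)  // 1 bit per element

  def isNullAt(i: Int): Boolean = (nullBits(i >> 6) & (1L << (i & 63))) != 0
  def setNullAt(i: Int): Unit = nullBits(i >> 6) |= (1L << (i & 63))

  def getInt(i: Int): Int = values(i)                      // no unboxing
  def setInt(i: Int, v: Int): Unit = {
    values(i) = v
    nullBits(i >> 6) &= ~(1L << (i & 63))                  // clear the null bit
  }

  // Conversion becomes a flat array copy instead of per-element unboxing
  def toIntArray(): Array[Int] = values.clone()
}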
Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
Problem: Boxing Occurs
▪ When converting the data representation from the Spark internal format to a Java object, boxing occurs in the generated code
☺Easy to handle NULL values
ArrayData a = …;
Object[] obj0 = new Object[a.length];
for (int i = 0; i < a.length; i++)
obj0[i] = new Integer(a.getInt(i)); // Boxing
ArrayData array_filter = new GenericArrayData(obj0);
int[] input_filter = array_filter.toIntArray();
Solution: Use Primitive Type When Possible
▪ When the analysis identifies that the array is a primitive type array without
NULL, generate code using a primitive array.
ArrayData a = …;
int[] array = a.toIntArray();
Note: the data-representation optimization also improves the efficiency of toIntArray()
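A sketch of the decision the modified code generator makes. Spark's code generator emits Java source as strings; the helper below is hypothetical, standing in for the real rule in the expression codegen:

// Hypothetical sketch: pick the conversion code based on the analysis result.
def genToIntArray(arrayExpr: String, mayContainNull: Boolean): String =
  if (!mayContainNull)
    s"int[] input = $arrayExpr.toIntArray();"  // fast path: no boxing
  else
    s"""Object[] obj = new Object[$arrayExpr.length];
       |for (int i = 0; i < $arrayExpr.length; i++)
       |  obj[i] = $arrayExpr.isNullAt(i) ? null : Integer.valueOf($arrayExpr.getInt(i));
       |""".stripMargin  // generic path: boxing needed to represent NULL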
Generated Code without Our Optimizations
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions
val ds =
Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
final class GeneratedIterator {
Iterator inputIterator = ...;
Row projectRow = new Row(1);
RowWriter rowWriter = new RowWriter(projectRow);
protected void processNext() {
while (inputIterator.hasNext()) {
Row inputRow = (Row) inputIterator.next(); // P1
ArrayData a = inputRow.getArray(0);
Object[] obj0 = new Object[a.length]; // P3
for (int i = 0; i < a.length; i++)
obj0[i] = new Integer(a.getInt(i));
ArrayData array_filter = new GenericArrayData(obj0); // P2
int[] input_filter = array_filter.toIntArray();
boolean fvalue = (Boolean)filter_func.apply(input_filter);
if (!fvalue) continue;
Object[] obj1 = new Object[a.length]; // P3
for (int i = 0; i < a.length; i++)
obj1[i] = new Integer(a.getInt(i));
ArrayData array_map = new GenericArrayData(obj1); // P2
int[] input_map = array_map.toIntArray();
int[] mvalue = (int[])map_func.apply(input_map);
ArrayData value = new GenericArrayData(mvalue); // P2
rowWriter.write(0, value);
appendRow(projectRow);
}
}
}
Generated Code with Our Optimizations
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions
val ds =
Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
final class GeneratedIterator {
Column column0 = ...getColumn(0);
Row projectRow = new Row(1);
RowWriter rowWriter = new RowWriter(projectRow);
protected void processNext() {
for (int i = 0; i < column0.numRows(); i++) {
// eliminated data copy (P1)
ArrayData a = column0.getArray(i);
// eliminated data conversion (P3)
int[] input_filter = a.toIntArray();
boolean fvalue = (Boolean)filter_func.apply(input_filter);
if (!fvalue) continue;
// eliminated data conversion (P3)
int[] input_map = a.toIntArray();
int[] mvalue = (int[])map_func.apply(input_map);
// use efficient data representation (P2)
ArrayData value = new IntArrayData(mvalue);
rowWriter.write(0, value);
appendRow(projectRow);
}
}
}
Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
Performance Evaluation Methodology
▪ Measured performance improvement of two types of applications using
our optimizations
– Database: TPC-H
– Machine learning: Logistic regression and k-means
▪ Experimental environment
– Five machines, each with a 16-core Intel Xeon E5-2683 v4 CPU (2.1 GHz) and 128 GB of RAM
▪ One for the driver and four for executors
– Spark 2.2
– OpenJDK 1.8.0_181 with a 96 GB heap using the default garbage collection policy
Performance Improvements of TPC-H queries
▪ Achieved up to a 1.41x performance improvement
– 1.10x on geometric mean
▪ Accomplished by the data-copy optimization alone
– No arrays are used in TPC-H
[Chart: performance improvement over no optimization (higher is better) for TPC-H queries Q1–Q22 with the data-copy optimization, scale factor = 10; the y-axis ranges from 1 to 1.5.]
Performance Improvements of ML applications
▪ Achieved up to a 1.42x performance improvement for logistic regression
– 1.21x on geometric mean
▪ Accomplished mainly by the optimization of the array representation
– The columnar-storage optimization contributed only slightly
[Chart: performance improvement over no optimization (higher is better) for k-means (5M data points with 200 dimensions) and logistic regression (32M data points with 200 dimensions), comparing the data-representation and data-conversion optimizations against all optimizations; the y-axis ranges from 1 to 1.5.]
Cycle Breakdown for Logistic Regression
▪ Data conversion consumed 28% of cycles without the data-representation and data-conversion optimizations
[Chart: percentage of consumed cycles —
                  w/o optimizations  w/ optimizations
Data conversion   28.1%              0%
Computation       60.1%              87.6%
Others            11.8%              12.4%]
Conclusion
▪ Revealed performance issues in generated code from a Spark program
▪ Devised three optimizations
– to eliminate unnecessary data copy (Data-copy)
– to improve efficiency of data representation (Data-representation)
– to eliminate unnecessary data conversion (Data-conversion)
▪ Achieved up to 1.4x performance improvements
– 22 TPC-H queries
– Two machine learning programs: logistic regression and k-means
▪ Merged these optimizations into Spark 2.3 and later versions
Acknowledgments
▪ Thanks to the Apache Spark community for suggestions on merging our optimizations into Apache Spark, especially
– Wenchen Fan, Herman van Hovell, Liang-Chi Hsieh, Takuya Ueshin, Sameer
Agarwal, Andrew Or, Davies Liu, Nong Li, and Reynold Xin
Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki30

Weitere ähnliche Inhalte

Was ist angesagt?

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...Databricks
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalDatabricks
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
 
Data stax academy
Data stax academyData stax academy
Data stax academyDuyhai Doan
 
Intro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the CloudIntro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the CloudDaniel Zivkovic
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark MLHolden Karau
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Lillian Pierson
 
Spark Programming
Spark ProgrammingSpark Programming
Spark ProgrammingTaewook Eom
 
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 Lessons from the Field, Episode II: Applying Best Practices to Your Apache S... Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...Databricks
 
Automated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative InfrastructureAutomated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative InfrastructureSpark Summit
 
GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014StampedeCon
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰Wayne Chen
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsDatabricks
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 

Was ist angesagt? (20)

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
Data stax academy
Data stax academyData stax academy
Data stax academy
 
Intro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the CloudIntro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the Cloud
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 Lessons from the Field, Episode II: Applying Best Practices to Your Apache S... Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S...
 
Automated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative InfrastructureAutomated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative Infrastructure
 
GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 

Ähnlich wie icpe2019_ishizaki_public

Demystifying DataFrame and Dataset with Kazuaki Ishizaki
Demystifying DataFrame and Dataset with Kazuaki IshizakiDemystifying DataFrame and Dataset with Kazuaki Ishizaki
Demystifying DataFrame and Dataset with Kazuaki IshizakiDatabricks
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Spark: Taming Big Data
Spark: Taming Big DataSpark: Taming Big Data
Spark: Taming Big DataLeonardo Gamas
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Summit
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一scalaconfjp
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Roger Huang
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 

Ähnlich wie icpe2019_ishizaki_public (20)

Demystifying DataFrame and Dataset with Kazuaki Ishizaki
Demystifying DataFrame and Dataset with Kazuaki IshizakiDemystifying DataFrame and Dataset with Kazuaki Ishizaki
Demystifying DataFrame and Dataset with Kazuaki Ishizaki
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Spark: Taming Big Data
Spark: Taming Big DataSpark: Taming Big Data
Spark: Taming Big Data
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
 
Scala 20140715
Scala 20140715Scala 20140715
Scala 20140715
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 

Mehr von Kazuaki Ishizaki

20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdf20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdfKazuaki Ishizaki
 
20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdf20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdfKazuaki Ishizaki
 
Introduction new features in Spark 3.0
Introduction new features in Spark 3.0Introduction new features in Spark 3.0
Introduction new features in Spark 3.0Kazuaki Ishizaki
 
20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_public20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_publicKazuaki Ishizaki
 
20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_public20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_publicKazuaki Ishizaki
 
Transparent GPU Exploitation for Java
Transparent GPU Exploitation for JavaTransparent GPU Exploitation for Java
Transparent GPU Exploitation for JavaKazuaki Ishizaki
 
Making Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to UseMaking Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to UseKazuaki Ishizaki
 
20160906 pplss ishizaki public
20160906 pplss ishizaki public20160906 pplss ishizaki public
20160906 pplss ishizaki publicKazuaki Ishizaki
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersKazuaki Ishizaki
 
20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_public20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_publicKazuaki Ishizaki
 
20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_publicKazuaki Ishizaki
 
Java Just-In-Timeコンパイラ
Java Just-In-TimeコンパイラJava Just-In-Timeコンパイラ
Java Just-In-TimeコンパイラKazuaki Ishizaki
 
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化Kazuaki Ishizaki
 

Mehr von Kazuaki Ishizaki (17)

20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdf20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdf
 
20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdf20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdf
 
Introduction new features in Spark 3.0
Introduction new features in Spark 3.0Introduction new features in Spark 3.0
Introduction new features in Spark 3.0
 
SparkTokyo2019NovIshizaki
SparkTokyo2019NovIshizakiSparkTokyo2019NovIshizaki
SparkTokyo2019NovIshizaki
 
hscj2019_ishizaki_public
hscj2019_ishizaki_publichscj2019_ishizaki_public
hscj2019_ishizaki_public
 
20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_public20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_public
 
20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_public20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_public
 
Transparent GPU Exploitation for Java
Transparent GPU Exploitation for JavaTransparent GPU Exploitation for Java
Transparent GPU Exploitation for Java
 
Making Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to UseMaking Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to Use
 
20160906 pplss ishizaki public
20160906 pplss ishizaki public20160906 pplss ishizaki public
20160906 pplss ishizaki public
 
Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java Programmers
 
Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
 
20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_public20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_public
 
20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public
 
Java Just-In-Timeコンパイラ
Java Just-In-TimeコンパイラJava Just-In-Timeコンパイラ
Java Just-In-Timeコンパイラ
 
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
 

Kürzlich hochgeladen

Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptrcbcrtm
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 

Kürzlich hochgeladen (20)

Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 

icpe2019_ishizaki_public

  • 1. Kazuaki Ishizaki IBM Research – Tokyo Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan 1
  • 2. What is Apache Spark? ▪ Framework that processes distributed computing by transforming distributed immutable memory structure using set of parallel operations ▪ e.g. map(), filter(), reduce(), … – Distributed immutable in-memory structures ▪ RDD (Resilient Distributed Dataset), DataFrame, Dataset – SQL-based data types are supported – Scala is primary language for programming on Spark Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki Spark Runtime (written in Java and Scala) Spark Streaming (real-time) GraphX (graph) SparkSQL (SQL) MLlib (machine learning) Java Virtual Machine tasks Executor Driver Executor results Executor Data Data Data Open source: http://spark.apache.org/ Data Source (HDFS, DB, File, etc.) Latest version is 2.4 released in 2018/11 2 val ds = ... val ds1 = ...
  • 3. What is Apache Spark? ▪ Framework that processes distributed computing by transforming distributed immutable memory structure using a set of parallel operations ▪ e.g. map(), filter(), reduce(), … – Distributed immutable in-memory structures ▪ RDD (Resilient Distributed Dataset), DataFrame, Dataset – SQL-based data types are supported – Scala is primary language for programming on Spark Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki Spark Runtime (written in Java and Scala) Spark Streaming (real-time) GraphX (graph) SparkSQL (SQL) MLlib (machine learning) Java Virtual Machine tasks Executor Driver Executor results Executor Data Data Data Open source: http://spark.apache.org/ Data Source (HDFS, DB, File, etc.) Latest version is 2.4 released in 2018/11 3 val ds = ... val ds1 = ... This talk focuses on executor behavior
  • 4. How Code on Each Executor is Generated? ▪ The program written as embedded DSL is translated to Java code thru analysis and optimizations in Spark Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki4 val ds: Dataset[Array[Int]] = Seq(Array(0, 2), Array(1, 3)) .toDS.cache val ds1 = ds.filter(a => a(0) > 0).map(a => a) while (rowIterator.hasNext()) { Row row = rowIterator.next; ArrayData a = row.getArray(0); … } SQL Analyzer Rule-based Optimizer Code Generator DataFrame Dataset (0, 2) (1, 3) Column 0 Row 0 Row 1 Java virtual machine Spark Program Generated Java code
  • 5. Motivating Example ▪ A simple Spark program that performs filter and map operations Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki5 val ds = Seq(Array(0, 2), Array(1, 3)) .toDS.cache val ds1 = ds.filter(a => a(0) > 0) .map(a => a) (0, 2) (1, 3) Column 0 Row 0 Row 1
  • 6. Motivating Example ▪ Generate complicated code from a simple Spark program Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki6 val ds = Seq( Array(0, 2), Array(1, 3)) .toDS.cache val ds1 = ds .filter(a => a(0) > 0) .map(a => a) final class GeneratedIterator { Iterator inputIterator = ...; Row projectRow = new Row(1); RowWriter rowWriter = new RowWriter(projectRow); protected void processNext() { while (inputIterator.hasNext()) { Row inputRow = (Row) inputIterator.next(); ArrayData a = inputRow.getArray(0); Object[] obj0 = new Object[a.length]; for (int i = 0; i < a.length; i++) obj0[i] = new Integer(a.getInt(i)); ArrayData array_filter = new GenericArrayData(obj0); int[] input_filter = array_filter.toIntArray(); boolean fvalue = (Boolean)filter_func.apply(input_filter); if (!fvalue) continue; Object[] obj1 = new Object[a.length]; for (int i = 0; i < a.length; i++) obj1[i] = new Intger(a.getInt(i)); ArrayData array_map = new GenericArrayData(obj1); int[] input_map = array_map.toIntArray(); int[] mvalue = (double[])map_func.apply(input_map); ArrayData value = new GenericArrayData(mvalue); rowWriter.write(0, value); appendRow(projectRow); } } } Note: Actually generated code is more complicated
  • 7. Performance Issues in Generated Code ▪ P1: Unnecessary data copy ▪ P2: Inefficient data representation ▪ P3: Unnecessary data conversions Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki7 final class GeneratedIterator { Iterator inputIterator = ...; Row projectRow = new Row(1); RowWriter rowWriter = new RowWriter(projectRow); protected void processNext() { while (inputIterator.hasNext()) { Row inputRow = (Row) inputIterator.next(); ArrayData a = inputRow.getArray(0); Object[] obj0 = new Object[a.length]; for (int i = 0; i < a.length; i++) obj0[i] = Integer.valueOf(a.getInt(i)); ArrayData array_filter = new GenericArrayData(obj0); int[] input_filter = array_filter.toIntArray(); boolean fvalue = (Boolean)filter_func.apply(input_filter); if (!fvalue) continue; Object[] obj1 = new Object[a.length]; for (int i = 0; i < a.length; i++) obj1[i] = Intger.valueOf(a.getInt(i)); ArrayData array_map = new GenericArrayData(obj1); int[] input_map = array_map.toIntArray(); int[] mvalue = (double[])map_func.apply(input_map); ArrayData value = new GenericArrayData(mvalue); rowWriter.write(0, value); appendRow(projectRow); } } } P1 (from columnar to row-oriented) P3 (Boxing) P2 P3 (Unboxing) P3 (Boxing) P2 P3 (Unboxing) P2 val ds = Seq( Array(0, 2), Array(1, 3)) .toDS.cache val ds1 = ds .filter(a => a(0) > 0) .map(a => a)
  • 8. Our Contributions ▪ Revealed performance issues in generated code from a Spark program ▪ Devised three optimizations – to eliminate unnecessary data copy (Data-copy) – to improve efficiency of data representation (Data-representation) – to eliminate unnecessary data conversion (Data-conversion) ▪ Achieved up to 1.4x performance improvements – 22 TPC-H queries – Two machine learning programs ▪ Merged these optimizations into Spark 2.3 and later versions Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki8 These optimizations reduce path length of handling data.
  • 9. Outline ▪ Problems ▪ Eliminate unnecessary data copy (Data-copy) ▪ Improve efficiency of data representation (Data-representation) ▪ Eliminate unnecessary data conversion (Data-conversion) ▪ Experiments Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki9
  • 10. Basic Compilation Strategy of an Operator ▪ Use Volcano style [Graefe93] – Connect operations using an iterator for easy adding of new operators Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki10 Row-based iterator Operator val ds = Seq((Array(0, 2), 10, “Tokyo”)) val ds1 = ds.filter(a => a.int_ > 0) .map(a => a) Row-based iterator Operator Array Int String (0, 2) 10 “Tokyo” while (rowIterator.hasNext()) { Row row = rowIterator.next(); int x = row.getInteger(1); // map(...) ... } while (rowIterator.hasNext()) { Row row = rowIterator.next(); int x = row.getInteger(1); // filter(...) ... }
  • 11. Overview of Generated Code in Spark ▪ Put multiple operator into one loop [Neumann11] when possible – Can avoid overhead of iterators – Encourage compiler optimizations in a loop Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki11 while (rowIterator.hasNext()) { Row row = rowIterator.next(); int x = row.getInteger(1); // map(...) ... } while (rowIterator.hasNext()) { Row row = rowIterator.next(); int x = row.getInteger(1); // filter(...) ... } while (rowIterator.hasNext()) { Row row = rowIterator.next(); int x = row.getInteger(1); // filter(...) ... // map(...) ... } val ds1 = ds.filter(a => a(0) > 0) .map(a => a) Whole-stage code generationVolcano style
  • 12. Columnar Storage to Generated Code ▪ While the source uses columnar storage, generated code requires data in row-based storage Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki12 Columnar Storage Row-based iterator Operators while (rowIterator.hasNext()) { Row row = rowIterator.next(); int x = row.getInteger(1); ... ... } Array Int String (0, 2) (1, 3) 10 20 “Tokyo” “Mumbay” val ds = Seq((Array(0, 2), 10, “Tokyo”), (Array(1, 3), 20, “Mumbay”)).toDS.cache (0, 2) 10 “Tokyo” Two operations in a loop
  • 13. Problem: Data Copy From Columnar Storage ▪ Copy a set of columns to a row occurs when the iterator is used Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki13 Columnar Storage Row-based iterator Operators Data copy while (rowIterator.hasNext()) { Row row = rowIterator.next(); int x = row.getInteger(1); ... ... } Array Int String (0, 2) (1, 3) 10 20 “Tokyo” “Mumbay” (0, 2) 10 “Tokyo” val ds = Seq((Array(0, 2), 10, “Tokyo”), (Array(1, 3), 20, “Mumbay”)).toDS.cache
• 14. Solution: Generate Optimized Code
▪ If the new analysis identifies that the source is columnar storage,
– use a counter-based loop without a row-based iterator
  ▪ an index identifies a row position in the columnar storage
– get data from the columnar storage directly

Columnar storage (Array | Int | String):
(0, 2)  10  "Tokyo"
(1, 3)  20  "Mumbai"

Column column1 = df1.getColumn(1);
int sum = 0;
for (int i = 0; i < column1.numRows(); i++) {
  int x = column1.getInteger(i);
  ...
}
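A runnable sketch under the same assumptions; IntColumn is a hypothetical, simplified stand-in for Spark's column abstraction:

// A column is just a primitive array plus accessors; the loop index i is
// the row position, so no intermediate Row object is ever materialized.
final class IntColumn {
  private final int[] values;
  IntColumn(int[] values) { this.values = values; }
  int numRows() { return values.length; }
  int getInteger(int i) { return values[i]; }
}

static long sumColumn(IntColumn column1) {
  long sum = 0;
  for (int i = 0; i < column1.numRows(); i++) {
    sum += column1.getInteger(i);   // reads the columnar storage directly
  }
  return sum;
}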
• 15. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
• 16. Overview of Old Internal Array Representation
▪ Use an Object array so that each element can hold a SQL NULL value in addition to a primitive value (e.g. 0 or 2)
☺ Easy to represent NULL
☹ Uses a boxed object (e.g. an Integer object) to hold a primitive value (e.g. an int)

val ds = Seq(Array(0, 2, NULL))

Object array (len = 3):
[0] -> Integer 0
[1] -> Integer 2
[2] -> NULL
• 17. Problem: Boxing and Unboxing Occur
▪ A setter method (e.g. setInt(1, 2)) causes boxing (i.e. creates an object)
▪ A getter method (e.g. getInt(0)) causes unboxing
▪ Increases the memory footprint because each element holds a pointer to an object

Object array (len = 3):
[0] -> Integer 0
[1] -> Integer 2
[2] -> NULL

int getInt(int i) {
  return ((Integer) array[i]).intValue();  // unboxing from Integer object to int value
}
void setInt(int i, int v) {
  array[i] = new Integer(v);               // boxing from int value to Integer object
}
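Putting the old representation together as a runnable sketch; BoxedArrayData is a hypothetical name (the Object-array-backed class appearing in the generated code is GenericArrayData):

// Every element is an Object: ints are boxed on write, unboxed on read,
// and a SQL NULL is simply a null reference.
final class BoxedArrayData {
  private final Object[] array;
  BoxedArrayData(int length) { array = new Object[length]; }
  boolean isNullAt(int i) { return array[i] == null; }
  int getInt(int i) { return ((Integer) array[i]).intValue(); }  // unboxing
  void setInt(int i, int v) { array[i] = Integer.valueOf(v); }   // boxing
  void setNull(int i) { array[i] = null; }
}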
• 18. Solution: Use Primitive Type When Possible
▪ Keep a value in a primitive field when possible, based on analysis
▪ Keep NULL in a separate bit field
☺ Avoids boxing and unboxing
☺ Reduces the memory footprint

int array (len = 3): [0] = 0, [1] = 2, [2] = 0 (unused)
bit field: Non-Null, Non-Null, Null

int getInt(int i) {
  return array[i];
}
void setInt(int i, int v) {
  array[i] = v;
}
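The same idea as a runnable sketch; PrimitiveIntArrayData and the 64-bit-word null bit field are illustrative assumptions, not Spark's exact layout:

// Values live in a primitive int[]; NULLs live in a separate bit field,
// so get/set never touch an Integer object.
final class PrimitiveIntArrayData {
  private final int[] array;
  private final long[] nullBits;   // 1 bit per element, packed into longs
  PrimitiveIntArrayData(int length) {
    array = new int[length];
    nullBits = new long[(length + 63) / 64];
  }
  boolean isNullAt(int i) { return (nullBits[i >> 6] & (1L << (i & 63))) != 0; }
  int getInt(int i) { return array[i]; }        // no unboxing
  void setInt(int i, int v) { array[i] = v; }   // no boxing
  void setNull(int i) { nullBits[i >> 6] |= 1L << (i & 63); }
}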
• 19. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
• 20. Problem: Boxing Occurs
▪ When converting the data representation from Spark's internal format to a Java object, boxing occurs in the generated code
☺ Easy to handle NULL values

ArrayData a = ...;
Object[] obj0 = new Object[a.length];
for (int i = 0; i < a.length; i++)
  obj0[i] = new Integer(a.getInt(i));   // boxing
ArrayData array_filter = new GenericArrayData(obj0);
int[] input_filter = array_filter.toIntArray();
• 21. Solution: Use Primitive Type When Possible
▪ When the analysis identifies that the array is a primitive-type array without NULLs, generate code that uses a primitive array

ArrayData a = ...;
int[] array = a.toIntArray();

Note: the data-representation optimization improves the efficiency of toIntArray()
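To see why the note holds, here is an illustrative pair of conversions (not Spark's actual ArrayData implementation): the boxed layout must unbox element by element, while the primitive layout reduces to one bulk copy.

// Old layout: per-element unboxing on every conversion.
static int[] toIntArray(Object[] boxed) {
  int[] result = new int[boxed.length];
  for (int i = 0; i < boxed.length; i++)
    result[i] = ((Integer) boxed[i]).intValue();   // unboxing
  return result;
}

// New layout: a single flat copy of the primitive backing array.
static int[] toIntArray(int[] primitive) {
  int[] result = new int[primitive.length];
  System.arraycopy(primitive, 0, result, 0, primitive.length);
  return result;
}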
• 22. Generated Code without Our Optimizations
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions

val ds = Seq(Array(0, 2), Array(1, 3))
  .toDS.cache
val ds1 = ds
  .filter(a => a(0) > 0)
  .map(a => a)

final class GeneratedIterator {
  Iterator inputIterator = ...;
  Row projectRow = new Row(1);
  RowWriter rowWriter = new RowWriter(projectRow);
  protected void processNext() {
    while (inputIterator.hasNext()) {
      Row inputRow = (Row) inputIterator.next();   // P1
      ArrayData a = inputRow.getArray(0);
      Object[] obj0 = new Object[a.length];        // P3
      for (int i = 0; i < a.length; i++)
        obj0[i] = new Integer(a.getInt(i));
      ArrayData array_filter = new GenericArrayData(obj0);  // P2
      int[] input_filter = array_filter.toIntArray();
      boolean fvalue = (Boolean)filter_func.apply(input_filter);
      if (!fvalue) continue;
      Object[] obj1 = new Object[a.length];        // P3
      for (int i = 0; i < a.length; i++)
        obj1[i] = new Integer(a.getInt(i));
      ArrayData array_map = new GenericArrayData(obj1);     // P2
      int[] input_map = array_map.toIntArray();
      int[] mvalue = (int[])map_func.apply(input_map);
      ArrayData value = new GenericArrayData(mvalue);       // P2
      rowWriter.write(0, value);
      appendRow(projectRow);
    }
  }
}
• 23. Generated Code with Our Optimizations
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions

val ds = Seq(Array(0, 2), Array(1, 3))
  .toDS.cache
val ds1 = ds
  .filter(a => a(0) > 0)
  .map(a => a)

final class GeneratedIterator {
  Column column0 = ...getColumn(0);
  Row projectRow = new Row(1);
  RowWriter rowWriter = new RowWriter(projectRow);
  protected void processNext() {
    for (int i = 0; i < column0.numRows(); i++) {
      // eliminated data copy (P1)
      ArrayData a = column0.getArray(i);
      // eliminated data conversion (P3)
      int[] input_filter = a.toIntArray();
      boolean fvalue = (Boolean)filter_func.apply(input_filter);
      if (!fvalue) continue;
      // eliminated data conversion (P3)
      int[] input_map = a.toIntArray();
      int[] mvalue = (int[])map_func.apply(input_map);
      // use efficient data representation (P2)
      ArrayData value = new IntArrayData(mvalue);
      rowWriter.write(0, value);
      appendRow(projectRow);
    }
  }
}
• 24. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
• 25. Performance Evaluation Methodology
▪ Measured the performance improvement from our optimizations for two types of applications
– Database: TPC-H
– Machine learning: logistic regression and k-means
▪ Experimental environment
– Five machines, each with a 16-core Intel Xeon E5-2683 v4 CPU (2.1 GHz) and 128 GB of RAM
  ▪ One for the driver and four for executors
– Spark 2.2
– OpenJDK 1.8.0_181 with a 96 GB heap, using the default garbage collection policy
• 26. Performance Improvements of TPC-H Queries
▪ Achieved up to 1.41x performance improvement
– 1.10x on geometric mean
▪ Accomplished by the data-copy optimization alone
– No array is used in TPC-H

[Bar chart: performance improvement over no optimization for Q1–Q22 with the data-copy optimization; scale factor = 10; higher is better]
• 27. Performance Improvements of ML Applications
▪ Achieved up to 1.42x performance improvement for logistic regression
– 1.21x on geometric mean
▪ Accomplished mainly by the optimization for array representation
– The columnar-storage (data-copy) optimization contributed only slightly

[Bar chart: performance improvement over no optimization for k-means (5M data points with 200 dimensions) and logistic regression (32M data points with 200 dimensions), comparing data-representation and data-conversion optimizations against all optimizations; higher is better]
• 28. Cycle Breakdown for Logistic Regression
▪ Data conversion consumed 28% of the cycles without the data-representation and data-conversion optimizations

Percentage of consumed cycles:
                   w/o optimizations   w/ optimizations
Data conversion    28.1%               0%
Computation        60.1%               87.6%
Others             11.8%               12.4%
• 29. Conclusion
▪ Revealed performance issues in the code generated from a Spark program
▪ Devised three optimizations
– to eliminate unnecessary data copy (Data-copy)
– to improve efficiency of data representation (Data-representation)
– to eliminate unnecessary data conversion (Data-conversion)
▪ Achieved up to 1.4x performance improvements
– 22 TPC-H queries
– Two machine learning programs: logistic regression and k-means
▪ Merged these optimizations into Spark 2.3 and later versions
• 30. Acknowledgments
▪ Thanks to the Apache Spark community for suggestions on merging our optimizations into Apache Spark, especially
– Wenchen Fan, Herman van Hovell, Liang-Chi Hsieh, Takuya Ueshin, Sameer Agarwal, Andrew Or, Davies Liu, Nong Li, and Reynold Xin