2. What is Apache Spark?
▪ Framework that performs distributed computing by transforming
distributed immutable in-memory structures using a set of parallel operations
▪ e.g. map(), filter(), reduce(), …
– Distributed immutable in-memory structures
▪ RDD (Resilient Distributed Dataset), DataFrame, Dataset
– SQL-based data types are supported
– Scala is the primary language for programming on Spark
Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki
[Figure: Spark architecture. Spark Streaming (real-time), GraphX (graph), SparkSQL (SQL), and MLlib (machine learning) run on the Spark Runtime (written in Java and Scala) on the Java Virtual Machine. A Driver sends tasks to Executors, which return results; Executors read data from a Data Source (HDFS, DB, File, etc.).]
Open source: http://spark.apache.org/
Latest version is 2.4, released in 2018/11
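As a plain-Java analogy (Java streams, not the Spark API), the map()/filter()/reduce() pattern of parallel operations can be sketched like this:

```java
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // The same functional pattern as Spark's parallel operations,
        // sketched on a local collection (an analogy, not the Spark API)
        List<Integer> data = List.of(0, 2, 1, 3);
        int sum = data.stream()
                .filter(x -> x > 0)       // keep positive elements: 2, 1, 3
                .map(x -> x * 2)          // transform each element: 4, 2, 6
                .reduce(0, Integer::sum); // aggregate into one value
        System.out.println(sum); // 12
    }
}
```

Spark applies the same transformations, but to immutable distributed datasets partitioned across executors.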
This talk focuses on executor behavior.
4. How Is Code on Each Executor Generated?
▪ A program written as an embedded DSL is translated to Java code through
analysis and optimizations in Spark
val ds: Dataset[Array[Int]] = Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds.filter(a => a(0) > 0).map(a => a)
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  ArrayData a = row.getArray(0);
  …
}
[Figure: The Spark program above, operating on DataFrame/Dataset, passes through the SQL Analyzer, Rule-based Optimizer, and Code Generator; the generated Java code runs on the Java virtual machine over columnar data (Column 0 holds Row 0 = (0, 2), Row 1 = (1, 3)).]
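To make the pipeline concrete, here is a toy code generator (entirely hypothetical, and far simpler than Spark's Code Generator) that emits the body of such a loop as Java source text:

```java
public class Main {
    // Toy code generator: builds Java source text for a filter loop.
    // The emitted structure mirrors the generated code shown above; the
    // generator itself is a simplified illustration, not Spark's implementation.
    static String generate(String filterCond) {
        return "while (rowIterator.hasNext()) {\n"
             + "  Row row = rowIterator.next();\n"
             + "  ArrayData a = row.getArray(0);\n"
             + "  if (!(" + filterCond + ")) continue;\n"
             + "  // map(...) and output follow here\n"
             + "}";
    }

    public static void main(String[] args) {
        // Predicate text derived from filter(a => a(0) > 0)
        String code = generate("a.getInt(0) > 0");
        System.out.println(code.contains("if (!(a.getInt(0) > 0)) continue;"));
    }
}
```

Spark's real generator works over an optimized query plan and compiles the emitted source with Janino at runtime.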
5. Motivating Example
▪ A simple Spark program that performs filter and map operations
val ds = Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds.filter(a => a(0) > 0)
.map(a => a)
[Figure: Cached columnar storage. Column 0 holds Row 0 = (0, 2) and Row 1 = (1, 3).]
6. Motivating Example
▪ Complicated code is generated from this simple Spark program
val ds = Seq(
Array(0, 2),
Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
final class GeneratedIterator {
  Iterator inputIterator = ...;
  Row projectRow = new Row(1);
  RowWriter rowWriter = new RowWriter(projectRow);
  protected void processNext() {
    while (inputIterator.hasNext()) {
      Row inputRow = (Row) inputIterator.next();
      ArrayData a = inputRow.getArray(0);
      Object[] obj0 = new Object[a.length];
      for (int i = 0; i < a.length; i++)
        obj0[i] = new Integer(a.getInt(i));
      ArrayData array_filter = new GenericArrayData(obj0);
      int[] input_filter = array_filter.toIntArray();
      boolean fvalue = (Boolean)filter_func.apply(input_filter);
      if (!fvalue) continue;
      Object[] obj1 = new Object[a.length];
      for (int i = 0; i < a.length; i++)
        obj1[i] = new Integer(a.getInt(i));
      ArrayData array_map = new GenericArrayData(obj1);
      int[] input_map = array_map.toIntArray();
      int[] mvalue = (int[])map_func.apply(input_map);
      ArrayData value = new GenericArrayData(mvalue);
      rowWriter.write(0, value);
      appendRow(projectRow);
    }
  }
}
Note: The actually generated code is more complicated.
7. Performance Issues in Generated Code
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions
final class GeneratedIterator {
  Iterator inputIterator = ...;
  Row projectRow = new Row(1);
  RowWriter rowWriter = new RowWriter(projectRow);
  protected void processNext() {
    while (inputIterator.hasNext()) {
      Row inputRow = (Row) inputIterator.next();           // P1 (from columnar to row-oriented)
      ArrayData a = inputRow.getArray(0);
      Object[] obj0 = new Object[a.length];
      for (int i = 0; i < a.length; i++)
        obj0[i] = Integer.valueOf(a.getInt(i));            // P3 (Boxing)
      ArrayData array_filter = new GenericArrayData(obj0); // P2
      int[] input_filter = array_filter.toIntArray();      // P3 (Unboxing)
      boolean fvalue = (Boolean)filter_func.apply(input_filter);
      if (!fvalue) continue;
      Object[] obj1 = new Object[a.length];
      for (int i = 0; i < a.length; i++)
        obj1[i] = Integer.valueOf(a.getInt(i));            // P3 (Boxing)
      ArrayData array_map = new GenericArrayData(obj1);    // P2
      int[] input_map = array_map.toIntArray();            // P3 (Unboxing)
      int[] mvalue = (int[])map_func.apply(input_map);
      ArrayData value = new GenericArrayData(mvalue);      // P2
      rowWriter.write(0, value);
      appendRow(projectRow);
    }
  }
}
val ds = Seq(
Array(0, 2),
Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
8. Our Contributions
▪ Revealed performance issues in generated code from a Spark program
▪ Devised three optimizations
– to eliminate unnecessary data copy (Data-copy)
– to improve efficiency of data representation (Data-representation)
– to eliminate unnecessary data conversion (Data-conversion)
▪ Achieved up to 1.4x performance improvements
– 22 TPC-H queries
– Two machine learning programs
▪ Merged these optimizations into Spark 2.3 and later versions
These optimizations reduce the path length of data handling.
9. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
10. Basic Compilation Strategy of an Operator
▪ Use Volcano style [Graefe93]
– Connect operators using an iterator, which makes it easy to add new operators
[Figure: Two operators connected by row-based iterators.]
val ds = Seq((Array(0, 2), 10, "Tokyo"))
val ds1 = ds.filter(a => a.int_ > 0)
  .map(a => a)
[Figure: A row with columns Array = (0, 2), Int = 10, String = "Tokyo".]
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// map(...)
...
}
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// filter(...)
...
}
11. Overview of Generated Code in Spark
▪ Put multiple operators into one loop [Neumann11] when possible
– Avoids the overhead of iterators
– Encourages compiler optimizations within a loop
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// map(...)
...
}
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// filter(...)
...
}
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// filter(...)
...
// map(...)
...
}
val ds1 = ds.filter(a => a(0) > 0)
.map(a => a)
Volcano style (the two loops above) vs. whole-stage code generation (the single fused loop)
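A minimal runnable sketch (plain Java, not Spark's actual generated code) of the fusion above: the Volcano-style two-pass version and the whole-stage fused loop compute the same result:

```java
import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        int[] rows = {0, 2, 1, 3}; // hypothetical single-column input

        // Volcano style: each operator runs its own pass over the data
        int[] filtered = Arrays.stream(rows).filter(x -> x > 0).toArray(); // filter(...)
        int volcano = Arrays.stream(filtered).map(x -> x * 2).sum();       // map(...) + sum

        // Whole-stage code generation: filter and map fused into one loop,
        // avoiding iterator overhead between operators
        int fused = 0;
        for (int x : rows) {
            if (!(x > 0)) continue; // filter(...)
            fused += x * 2;         // map(...) + sum
        }

        System.out.println(volcano); // 12
        System.out.println(fused);   // 12
    }
}
```

The fused loop also keeps intermediate values in local variables, which helps the JIT compiler optimize the loop body.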
12. Columnar Storage to Generated Code
▪ While the source uses columnar storage, the generated code requires
data in row-based storage
[Figure: Columnar storage feeds a row-based iterator, whose rows are consumed by operators (two operations in a loop). Columnar storage: Array column = (0, 2), (1, 3); Int column = 10, 20; String column = "Tokyo", "Mumbay". A materialized row: (0, 2), 10, "Tokyo".]
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  int x = row.getInteger(1);
  ...
}
val ds = Seq((Array(0, 2), 10, "Tokyo"),
  (Array(1, 3), 20, "Mumbay")).toDS.cache
13. Problem: Data Copy From Columnar Storage
▪ A copy from a set of columns into a row occurs when the iterator is used
[Figure: The same pipeline as before; the data copy happens where the row-based iterator materializes a set of columns into a row for the operators.]
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  int x = row.getInteger(1);
  ...
}
val ds = Seq((Array(0, 2), 10, "Tokyo"),
  (Array(1, 3), 20, "Mumbay")).toDS.cache
14. Solution: Generate Optimized Code
▪ If the new analysis identifies that the source is columnar storage,
– Use a counter-based loop without a row-based iterator
▪ The loop index identifies a row position in the columnar storage
– Get data from the columnar storage directly
[Figure: Operators read directly from columnar storage (Array column = (0, 2), (1, 3); Int column = 10, 20; String column = "Tokyo", "Mumbay").]
Column column1 = df1.getColumn(1);
int sum = 0;
for (int i = 0; i < column1.numRows(); i++) {
  int x = column1.getInteger(i);
  ...
}
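The counter-based loop can be sketched in plain Java (hypothetical column arrays, not Spark's Column class): indexing into the column directly means no row object is ever materialized:

```java
public class Main {
    public static void main(String[] args) {
        // Hypothetical columnar storage: one primitive array per column
        int[] intColumn = {10, 20};

        // Counter-based loop: the index i identifies the row position in the
        // columnar storage, so no row is built and no column-to-row copy occurs
        int sum = 0;
        for (int i = 0; i < intColumn.length; i++) {
            int x = intColumn[i]; // direct columnar access by row index
            sum += x;
        }
        System.out.println(sum); // 30
    }
}
```

Contrast this with the iterator version, which copies every accessed column into a Row object before the operator can read it.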
15. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
16. Overview of Old Internal Array Representation
▪ Use an Object array, whose elements can hold an SQL NULL value in
addition to a primitive value (e.g. 0 or 2)
☺ Easy to represent NULL
☹ Uses a boxed object (e.g. an Integer object) to hold a primitive value (e.g. int)
[Figure: val ds = Seq(Array(0, 2, NULL)) stored as an Object array with len = 3: [0] → Integer 0, [1] → Integer 2, [2] → NULL.]
17. Problem: Boxing and Unboxing Occur
▪ Boxing (i.e. creating an object) occurs in a setter method (e.g. setInt(1, 2))
▪ Unboxing occurs in a getter method (e.g. getInt(0))
▪ Memory footprint increases because each element holds a pointer to an object
int getInt(int i) {
  return ((Integer) array[i]).intValue(); // unboxing: from Integer object to int value
}
void setInt(int i, int v) {
  array[i] = new Integer(v); // boxing: from int value to Integer object
}
[Figure: the same Object array as before (len = 3): [0] → Integer 0, [1] → Integer 2, [2] → NULL.]
18. Solution: Use Primitive Type When Possible
▪ Keep a value in a primitive field when possible, based on analysis
▪ Keep NULL information in a separate bit field
☺ Avoids boxing and unboxing
☺ Reduces memory footprint
[Figure: a primitive int array (len = 3) holds values [0] = 0, [1] = 2, [2] = 0 (unused); a separate bit field marks [0] = Non-Null, [1] = Non-Null, [2] = Null.]
int getInt(int i) {
  return array[i];
}
void setInt(int i, int v) {
  array[i] = v;
}
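A self-contained sketch of this representation (class and method names are hypothetical, not Spark's internal API): values live in a primitive int[] and NULLs in a separate boolean array, so the accessors involve no boxing or unboxing:

```java
public class Main {
    // Hypothetical primitive-backed array representation with a null bit field
    static class IntArrayData {
        private final int[] values;
        private final boolean[] isNull; // separate "bit field" for NULLs

        IntArrayData(int length) {
            values = new int[length];
            isNull = new boolean[length];
        }
        void setInt(int i, int v) { values[i] = v; isNull[i] = false; } // no boxing
        void setNull(int i)       { isNull[i] = true; }
        boolean isNullAt(int i)   { return isNull[i]; }
        int getInt(int i)         { return values[i]; }                 // no unboxing
    }

    public static void main(String[] args) {
        IntArrayData a = new IntArrayData(3); // like Seq(Array(0, 2, NULL))
        a.setInt(0, 0);
        a.setInt(1, 2);
        a.setNull(2);
        System.out.println(a.getInt(1));   // 2
        System.out.println(a.isNullAt(2)); // true
    }
}
```

A production version would pack the null flags into a long[] bitmask rather than a boolean[] to shrink the footprint further.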
19. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
20. Problem: Boxing Occurs
▪ When the data representation is converted from the Spark internal format
to Java objects, boxing occurs in the generated code
☺ Easy to handle NULL values
ArrayData a = …;
Object[] obj0 = new Object[a.length];
for (int i = 0; i < a.length; i++)
  obj0[i] = new Integer(a.getInt(i)); // boxing
ArrayData array_filter = new GenericArrayData(obj0);
int[] input_filter = array_filter.toIntArray();
21. Solution: Use Primitive Type When Possible
▪ When the analysis identifies that the array is a primitive-type array
without NULL, generate code that uses a primitive array.
ArrayData a = …;
int[] array = a.toIntArray();
Note: data-representation optimization improves efficiency of toIntArray()
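The two conversion paths can be contrasted in plain Java (a sketch, not Spark's generated code): the unoptimized path boxes every element into an Object[] and then unboxes it back, while the optimized path copies the primitive array directly:

```java
import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        int[] src = {0, 2};

        // Unoptimized path: box into Object[], then unbox back to int[]
        Object[] boxed = new Object[src.length];
        for (int i = 0; i < src.length; i++)
            boxed[i] = Integer.valueOf(src[i]);    // boxing
        int[] viaBoxing = new int[boxed.length];
        for (int i = 0; i < boxed.length; i++)
            viaBoxing[i] = (Integer) boxed[i];     // unboxing

        // Optimized path: the array is known to be a primitive int array
        // without NULL, so a single primitive copy suffices
        int[] direct = Arrays.copyOf(src, src.length);

        System.out.println(Arrays.equals(viaBoxing, direct)); // true
    }
}
```

Both paths yield the same int[], but the optimized one allocates no wrapper objects at all.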
22. Generated Code without Our Optimizations
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions
val ds =
Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
final class GeneratedIterator {
  Iterator inputIterator = ...;
  Row projectRow = new Row(1);
  RowWriter rowWriter = new RowWriter(projectRow);
  protected void processNext() {
    while (inputIterator.hasNext()) {
      Row inputRow = (Row) inputIterator.next(); // P1
      ArrayData a = inputRow.getArray(0);
      Object[] obj0 = new Object[a.length]; // P3
      for (int i = 0; i < a.length; i++)
        obj0[i] = new Integer(a.getInt(i));
      ArrayData array_filter = new GenericArrayData(obj0); // P2
      int[] input_filter = array_filter.toIntArray();
      boolean fvalue = (Boolean)filter_func.apply(input_filter);
      if (!fvalue) continue;
      Object[] obj1 = new Object[a.length]; // P3
      for (int i = 0; i < a.length; i++)
        obj1[i] = new Integer(a.getInt(i));
      ArrayData array_map = new GenericArrayData(obj1); // P2
      int[] input_map = array_map.toIntArray();
      int[] mvalue = (int[])map_func.apply(input_map);
      ArrayData value = new GenericArrayData(mvalue); // P2
      rowWriter.write(0, value);
      appendRow(projectRow);
    }
  }
}
23. Generated Code with Our Optimizations
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions
val ds =
Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
final class GeneratedIterator {
  Column column0 = ...getColumn(0);
  Row projectRow = new Row(1);
  RowWriter rowWriter = new RowWriter(projectRow);
  protected void processNext() {
    for (int i = 0; i < column0.numRows(); i++) {
      // eliminated data copy (P1)
      ArrayData a = column0.getArray(i);
      // eliminated data conversion (P3)
      int[] input_filter = a.toIntArray();
      boolean fvalue = (Boolean)filter_func.apply(input_filter);
      if (!fvalue) continue;
      // eliminated data conversion (P3)
      int[] input_map = a.toIntArray();
      int[] mvalue = (int[])map_func.apply(input_map);
      // use efficient data representation (P2)
      ArrayData value = new IntArrayData(mvalue);
      rowWriter.write(0, value);
      appendRow(projectRow);
    }
  }
}
24. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
25. Performance Evaluation Methodology
▪ Measured the performance improvement of two types of applications using
our optimizations
– Database: TPC-H
– Machine learning: Logistic regression and k-means
▪ Experimental environment
– Five machines, each with a 16-core Intel Xeon E5-2683 v4 CPU (2.1 GHz)
and 128 GB of RAM
▪ One for the driver and four for executors
– Spark 2.2
– OpenJDK 1.8.0_181 with a 96 GB heap using the default collection policy
26. Performance Improvements of TPC-H queries
▪ Achieved up to 1.41x performance improvement
– 1.10x on geometric mean
▪ Accomplished by the data-copy optimization alone
– No array is used in TPC-H
[Chart: performance improvement over no optimization for TPC-H queries Q1–Q22 with the data-copy optimization; y-axis 1.0–1.5; scale factor = 10; higher is better.]
27. Performance Improvements of ML applications
▪ Achieved up to 1.42x performance improvement for logistic regression
– 1.21x on geometric mean
▪ Accomplished by the optimizations for array representation
– The columnar-storage optimization contributed only slightly
[Chart: performance improvement over no optimization for k-means (5M data points with 200 dimensions) and logistic regression (32M data points with 200 dimensions), comparing the data-representation and data-conversion optimizations against all optimizations; y-axis 1.0–1.5; higher is better.]
28. Cycle Breakdown for Logistic Regression
▪ Data conversion consumed 28% of cycles without the data-representation
and data-conversion optimizations
Percentage of consumed cycles:

                  w/o optimizations   w/ optimizations
Data conversion        28.1%                0%
Computation            60.1%               87.6%
Others                 11.8%               12.4%
29. Conclusion
▪ Revealed performance issues in generated code from a Spark program
▪ Devised three optimizations
– to eliminate unnecessary data copy (Data-copy)
– to improve efficiency of data representation (Data-representation)
– to eliminate unnecessary data conversion (Data-conversion)
▪ Achieved up to 1.4x performance improvements
– 22 TPC-H queries
– Two machine learning programs: logistic regression and k-means
▪ Merged these optimizations into Spark 2.3 and later versions
30. Acknowledgments
▪ Thanks to the Apache Spark community for suggestions on merging our
optimizations into Apache Spark, especially
– Wenchen Fan, Herman van Hovell, Liang-Chi Hsieh, Takuya Ueshin, Sameer
Agarwal, Andrew Or, Davies Liu, Nong Li, and Reynold Xin