Kazuaki Ishizaki discussed the evolution of in-memory storage in Apache Spark and its relationship to Apache Arrow. He highlighted talks about using Arrow to exchange data between Spark and other frameworks like R and .NET, as well as hardware accelerators. Arrow allows sharing columnar data formats and transferring data to improve performance and programmability when integrating Spark with other systems.
1. Kazuaki Ishizaki (石崎 一明)
IBM Research – Tokyo, 日本アイ・ビー・エム(株)東京基礎研究所
@kiszk
Introduction to the Spark In-Memory Talk and Related Sessions
2. About Me – Kazuaki Ishizaki
▪ Researcher at IBM Research – Tokyo
https://ibm.biz/ishizaki
– Compiler optimization, language runtime, and parallel processing
▪ Working on IBM Java (now OpenJ9) since 1996
– Technical lead for Just-in-time compiler for PowerPC
▪ Apache Spark committer since September 2018 (SQL module)
– One of four Apache Spark committers in Japan
▪ ACM Distinguished Member (2018-)
▪ Social media
– @kiszk
– ishizaki
Introduction to the Spark In-Memory Talk and Related Sessions – Kazuaki Ishizaki
3. Today’s topics
▪ Highlights from the talk “In-Memory Storage Evolution in Apache Spark”
▪ Relationship between Apache Spark and Apache Arrow
▪ Highlights from talks on using Apache Spark with Apache Arrow
4. In-Memory Storage Evolution in Apache Spark
▪ History of in-memory storage from Spark 1.3 to 2.4
– From Java objects to Spark's own managed memory format (Project Tungsten)
– Introduction of a columnar storage class
▪ Support for Apache Arrow
– Performance improvements for PySpark Pandas UDFs
▪ Refactoring of the internal data structure
– One public abstract class: ColumnVector
https://www.slideshare.net/ishizaki/in-memory-evolution-in-apache-spark
5. Why In-Memory Storage?
• In-memory storage is mandatory for high performance
• In-memory columnar storage is necessary to
– Support Parquet, a first-class columnar format
– Achieve a better compression ratio for the table cache
In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki
#UnifiedAnalytics #SparkAISummit
[Figure: row format vs. column format. In the row format, each row — e.g. (1, Spark, 2.0), (2, AI, 1.9), (3, Summit, 5000.0) — is stored contiguously at one memory address. In the column format, each column is stored contiguously instead: x (1, 2, 3), y (Spark, AI, Summit), z (2.0, 1.9, 5000.0).]
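The row/column trade-off on this slide can be sketched in plain Python (an illustrative stand-in using the slide's sample data, not Spark's actual memory format):

```python
# Three records (x, y, z), as in the slide's example.
rows = [
    (1, "Spark", 2.0),
    (2, "AI", 1.9),
    (3, "Summit", 5000.0),
]

# Row format keeps each record's fields together in memory.
# Column format keeps each column's values together instead.
columns = {
    "x": [r[0] for r in rows],
    "y": [r[1] for r in rows],
    "z": [r[2] for r in rows],
}

# A scan over one column (e.g. for compression or a vectorized
# filter) now touches only that column's contiguous values.
print(columns["y"])  # ['Spark', 'AI', 'Summit']
```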
6. In-Memory Storage Evolution (1/2)
[Figure: in-memory storage by Spark version]
– Up to Spark 1.3: RDD table cache, storing Java objects
– Spark 1.4 to 1.6: table cache with its own memory layout, managed by Project Tungsten
– Spark 2.0 to 2.2: Parquet vectorized reader with its own memory layout, but in a different class from the table cache
7. In-Memory Storage Evolution (2/2)
[Figure: in-memory storage by Spark version, continued (2.3 and 2.4)]
– Spark 2.3: ORC vectorized reader and Pandas UDF with Arrow; the ColumnVector class becomes public
– From Spark 2.3 onward, table cache, Parquet, ORC, and Arrow all use the common ColumnVector class
8. Performance among Spark Versions
• DataFrame table cache from Spark 2.0 to Spark 2.4
[Chart: relative elapsed time of df.filter("i % 16 == 0").count on a cached DataFrame, for Spark 2.0, 2.3, and 2.4; shorter is better]
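The benchmark counts rows satisfying i % 16 == 0. The predicate can be sketched in plain Python (a stand-in for the Scala DataFrame call above, with a hypothetical row count):

```python
# Hypothetical data: a column i of 1,000,000 consecutive integers.
n = 1_000_000

# Equivalent of df.filter("i % 16 == 0").count:
# exactly one row in every 16 passes the filter.
count = sum(1 for i in range(n) if i % 16 == 0)
print(count)  # 62500
```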
9. How Columnar Storage is Used in PySpark
• Share data between the columnar storage of Spark and Pandas
– No serialization or deserialization
– 3–100x performance improvement
Details in “Apache Arrow and Pandas UDF on Apache Spark” by Takuya Ueshin
Source: “Introducing Pandas UDF for PySpark”, Databricks blog
@pandas_udf('double')
def plus(v):
    return v + 1.2
[Figure: Spark's ColumnVector and Pandas exchange data through Apache Arrow]
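The plus UDF above receives whole columnar batches (pandas Series backed by Arrow buffers), not single rows. A minimal stdlib simulation of that batch-at-a-time execution model (plain lists stand in for the Arrow-backed Series):

```python
def plus(v):
    # In Spark this receives a pandas.Series backed by an Arrow
    # buffer; here a plain list stands in for one columnar batch.
    return [x + 1.2 for x in v]

# Two batches of the 'double' column, processed batch by batch
# with no per-row serialization or deserialization.
batches = [[0.0, 1.0], [2.0, 3.0]]
results = [plus(b) for b in batches]
print(results)
```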
11. Integrate Spark with Others
• Frameworks: deep learning / machine learning frameworks
• SPARK-24579
• SPARK-26413
• Resources: GPU, FPGA, ..
• SPARK-27396
• SAIS2019: “Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators”
[Images from rapids.ai: GPU and FPGA accelerators]
12. Presentation with Spark & Arrow in SAIS2019
▪ Language / Framework
– Running R at Scale with Apache Arrow on Spark
– Introducing .NET Bindings for Apache Spark
– Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark
– Make your PySpark Data Fly with Arrow!
▪ Hardware resources (e.g. GPU and FPGA)
– Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
– Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators
13. Exchange data between Spark and R
▪ Framework
Running R at Scale with Apache Arrow on Spark
https://www.slideshare.net/databricks/running-r-at-scale-with-apache-arrow-on-spark
14. Exchange data between Spark and .NET UDF
▪ Framework
Introducing .NET Bindings for Apache Spark
https://www.slideshare.net/databricks/introducing-net-bindings-for-apache-spark
15. Exchange data between Spark and TensorFlow
▪ Framework
Make your PySpark Data Fly with Arrow!
https://www.slideshare.net/databricks/make-your-pyspark-data-fly-with-arrow
16. ▪ Framework
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark
https://www.slideshare.net/databricks/updates-from-project-hydrogen-unifying-stateoftheart-ai-and-big-data-in-apache-spark
– Makes the Arrow format standard in Spark
17. Exchange data between Spark and RAPIDS library
▪ Framework
Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
https://www.slideshare.net/databricks/accelerating-machine-learning-workloads-and-apache-spark-applications-via-cuda-and-nccl
– PR #24795 (SPARK-27945) provides minimal support for columnar processing
18. Exchange data between Spark and accelerators
▪ Framework
Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators
https://www.slideshare.net/databricks/apache-arrowbased-unified-data-sharing-and-transferring-format-among-cpu-and-accelerators
19. Takeaway
▪ In-memory storage in Apache Spark has evolved while keeping the same APIs (e.g. DataFrame and Dataset)
– Improved performance by using columnar storage and its own memory format
– Added support for Apache Arrow
– Defined an API to increase generality and ease support for other data sources
▪ Using Apache Arrow improves the performance and programmability of exchanging data between Spark and
– Frameworks
– Hardware accelerators
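As a concrete example of the Arrow integration mentioned above, enabling Arrow-based Spark↔pandas data exchange in Spark 2.3/2.4 is a single configuration property (renamed in later releases):

```
# spark-defaults.conf (Spark 2.3/2.4): enable Arrow for
# Spark <-> pandas conversion (toPandas, Pandas UDFs)
spark.sql.execution.arrow.enabled  true
```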