Kazuaki Ishizaki discussed the evolution of in-memory storage in Apache Spark and its relationship to Apache Arrow. He highlighted talks about using Arrow to exchange data between Spark and other frameworks like R and .NET, as well as hardware accelerators. Arrow allows sharing columnar data formats and transferring data to improve performance and programmability when integrating Spark with other systems.
1. Kazuaki Ishizaki (石崎 一明)
IBM Research – Tokyo, 日本アイ・ビー・エム(株)東京基礎研究所
@kiszk
Introduction to the Spark In-Memory Talk and Related Sessions
2. About Me – Kazuaki Ishizaki
▪ Researcher at IBM Research – Tokyo
https://ibm.biz/ishizaki
– Compiler optimization, language runtime, and parallel processing
▪ Working on IBM Java (now OpenJ9) since 1996
– Technical lead for Just-in-time compiler for PowerPC
▪ Apache Spark committer since September 2018 (SQL module)
– One of four Apache Spark committers in Japan
▪ ACM Distinguished Member (2018-)
▪ Social media
– @kiszk
– ishizaki
Introduction to the Spark In-Memory Talk and Related Sessions – Kazuaki Ishizaki
3. Today’s topics
▪ Highlights from the talk “In-Memory Storage Evolution in Apache Spark”
▪ Relationship between Apache Spark and Apache Arrow
▪ Highlights from talks on using Apache Spark with Apache Arrow
4. In-Memory Storage Evolution in Apache Spark
▪ History of in-memory storage from Spark 1.3 to 2.4
– From Java objects to Spark's own managed memory format (Project Tungsten)
– Introduction of a columnar storage class
▪ Support for Apache Arrow
– Performance improvements for PySpark Pandas UDFs
▪ Refactoring of the internal data structure
– One public abstract class: ColumnVector
https://www.slideshare.net/ishizaki/in-memory-evolution-in-apache-spark
5. Why In-Memory Storage?
• In-memory storage is mandatory for high performance
• In-memory columnar storage is necessary to
– Support Parquet, a first-class columnar format
– Achieve a better compression ratio for the table cache
In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki
#UnifiedAnalytics #SparkAISummit
[Figure: row format vs. column format. In the row format, each row — e.g. (1, Spark, 2.0), (2, AI, 1.9), (3, Summit, 5000.0) — is stored contiguously at one memory address. In the column format, each column is stored contiguously instead: x (1, 2, 3), y (Spark, AI, Summit), z (2.0, 1.9, 5000.0).]
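The row/column trade-off on this slide can be sketched in plain Python (an illustrative stand-in using the slide's sample data, not Spark's actual memory format):

```python
# Three records (x, y, z), as in the slide's example.
rows = [
    (1, "Spark", 2.0),
    (2, "AI", 1.9),
    (3, "Summit", 5000.0),
]

# Row format keeps each record's fields together in memory.
# Column format keeps each column's values together instead.
columns = {
    "x": [r[0] for r in rows],
    "y": [r[1] for r in rows],
    "z": [r[2] for r in rows],
}

# A scan over one column (e.g. for compression or a vectorized
# filter) now touches only that column's contiguous values.
print(columns["y"])  # ['Spark', 'AI', 'Summit']
```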
6. In-Memory Storage Evolution (1/2)
[Figure: in-memory storage by Spark version]
– Up to Spark 1.3: RDD table cache, storing Java objects
– Spark 1.4 to 1.6: table cache with its own memory layout, managed by Project Tungsten
– Spark 2.0 to 2.2: Parquet vectorized reader with its own memory layout, but in a different class from the table cache
7. In-Memory Storage Evolution (2/2)
[Figure: in-memory storage by Spark version, continued (2.3 and 2.4)]
– Spark 2.3: ORC vectorized reader and Pandas UDF with Arrow; the ColumnVector class becomes public
– From Spark 2.3 onward, table cache, Parquet, ORC, and Arrow all use the common ColumnVector class
8. Performance among Spark Versions
• DataFrame table cache from Spark 2.0 to Spark 2.4
[Chart: relative elapsed time of df.filter("i % 16 == 0").count on a cached DataFrame, for Spark 2.0, 2.3, and 2.4; shorter is better]
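The benchmark counts rows satisfying i % 16 == 0. The predicate can be sketched in plain Python (a stand-in for the Scala DataFrame call above, with a hypothetical row count):

```python
# Hypothetical data: a column i of 1,000,000 consecutive integers.
n = 1_000_000

# Equivalent of df.filter("i % 16 == 0").count:
# exactly one row in every 16 passes the filter.
count = sum(1 for i in range(n) if i % 16 == 0)
print(count)  # 62500
```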
9. How Columnar Storage is Used in PySpark
• Share data between the columnar storage of Spark and Pandas
– No serialization or deserialization
– 3–100x performance improvement
Details in “Apache Arrow and Pandas UDF on Apache Spark” by Takuya Ueshin
Source: “Introducing Pandas UDF for PySpark”, Databricks blog
@pandas_udf('double')
def plus(v):
    return v + 1.2
[Figure: Spark's ColumnVector and Pandas exchange data through Apache Arrow]
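The plus UDF above receives whole columnar batches (pandas Series backed by Arrow buffers), not single rows. A minimal stdlib simulation of that batch-at-a-time execution model (plain lists stand in for the Arrow-backed Series):

```python
def plus(v):
    # In Spark this receives a pandas.Series backed by an Arrow
    # buffer; here a plain list stands in for one columnar batch.
    return [x + 1.2 for x in v]

# Two batches of the 'double' column, processed batch by batch
# with no per-row serialization or deserialization.
batches = [[0.0, 1.0], [2.0, 3.0]]
results = [plus(b) for b in batches]
print(results)
```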
11. Integrate Spark with Others
• Frameworks: deep learning / machine learning frameworks
• SPARK-24579
• SPARK-26413
• Resources: GPU, FPGA, ..
• SPARK-27396
• SAIS2019: “Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators”
[Images from rapids.ai: GPU and FPGA accelerators]
12. Presentation with Spark & Arrow in SAIS2019
▪ Language / Framework
– Running R at Scale with Apache Arrow on Spark
– Introducing .NET Bindings for Apache Spark
– Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark
– Make your PySpark Data Fly with Arrow!
▪ Hardware resources (e.g. GPU and FPGA)
– Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
– Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators
13. Exchange data between Spark and R
▪ Framework
Running R at Scale with Apache Arrow on Spark
https://www.slideshare.net/databricks/running-r-at-scale-with-apache-arrow-on-spark
14. Exchange data between Spark and .NET UDF
▪ Framework
Introducing .NET Bindings for Apache Spark
https://www.slideshare.net/databricks/introducing-net-bindings-for-apache-spark
15. Exchange data between Spark and TensorFlow
▪ Framework
Make your PySpark Data Fly with Arrow!
https://www.slideshare.net/databricks/make-your-pyspark-data-fly-with-arrow
16. ▪ Framework
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark
https://www.slideshare.net/databricks/updates-from-project-hydrogen-unifying-stateoftheart-ai-and-big-data-in-apache-spark
– Makes the Arrow format standard in Spark
17. Exchange data between Spark and RAPIDS library
▪ Framework
Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
https://www.slideshare.net/databricks/accelerating-machine-learning-workloads-and-apache-spark-applications-via-cuda-and-nccl
– PR #24795 (SPARK-27945) provides minimal support for columnar processing
18. Exchange data between Spark and accelerators
▪ Framework
Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators
https://www.slideshare.net/databricks/apache-arrowbased-unified-data-sharing-and-transferring-format-among-cpu-and-accelerators
19. Takeaway
▪ In-memory storage in Apache Spark has evolved while keeping the same APIs (e.g. DataFrame and Dataset)
– Improved performance by using columnar storage and its own memory format
– Added support for Apache Arrow
– Defined an API to increase generality and ease support for other data sources
▪ Using Apache Arrow improves the performance and programmability of exchanging data between Spark and
– Frameworks
– Hardware accelerators
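As a concrete example of the Arrow integration mentioned above, enabling Arrow-based Spark↔pandas data exchange in Spark 2.3/2.4 is a single configuration property (renamed in later releases):

```
# spark-defaults.conf (Spark 2.3/2.4): enable Arrow for
# Spark <-> pandas conversion (toPandas, Pandas UDFs)
spark.sql.execution.arrow.enabled  true
```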