Kazuaki Ishizaki (石崎 一明)
IBM Research – Tokyo, IBM Japan, Ltd., Tokyo Research Laboratory
@kiszk
Introduction to the Spark In-Memory Talk and Related Sessions
About Me – Kazuaki Ishizaki
▪ Researcher at IBM Research – Tokyo
https://ibm.biz/ishizaki
– Compiler optimization, language runtime, and parallel processing
▪ Working on IBM Java (now OpenJ9) since 1996
– Technical lead for the just-in-time compiler for PowerPC
▪ Apache Spark committer since September 2018 (SQL module)
– Four Apache Spark committers in Japan
▪ ACM Distinguished Member (2018-)
▪ SNS
– @kiszk
– ishizaki
Today’s topics
▪ Highlights of the talk "In-Memory Storage Evolution in Apache Spark"
▪ Relationship between Apache Spark and Apache Arrow
▪ Highlights of talks on Apache Spark with Apache Arrow
In-Memory Storage Evolution in Apache Spark
▪ History of in-memory storage from Spark 1.3 to 2.4
– From Java objects to Spark's own managed memory format (Project Tungsten)
– Introduction of a columnar storage class
▪ Support for Apache Arrow
– Performance improvements in PySpark with Pandas UDF
▪ Refactoring of internal data structure
– One public abstract class: ColumnVector
https://www.slideshare.net/ishizaki/in-memory-evolution-in-apache-spark
Why In-Memory Storage?
• In-memory storage is mandatory for high performance
• In-memory columnar storage is necessary to
– Support Parquet, a columnar format that is a first-class citizen in Spark
– Achieve a better compression ratio for the table cache
In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki
#UnifiedAnalytics #SparkAISummit
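[Figure: memory layout of three records with columns x, y, and z, ordered by memory address. Row format stores the values of each row contiguously (Row 0, Row 1, Row 2). Column format stores the values of each column contiguously (Column x: Spark, AI, Summit; Column y: 2.0, 1.9, 5000.0; Column z: 1, 2, 3).]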
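The difference between the two layouts in the figure can be sketched in plain Python. This is only an illustration under stated assumptions: the grouping of the sample values into rows is inferred from the figure, the helper name to_columns is hypothetical, and the standard-library array module stands in for a contiguous buffer. It is not how Spark lays out memory internally.

```python
from array import array

# Three records with fields (x, y, z). The grouping of values into rows
# is an assumption based on the sample values shown in the figure.
rows = [("Spark", 2.0, 1), ("AI", 1.9, 2), ("Summit", 5000.0, 3)]


def to_columns(records):
    """Transpose row-oriented records into one sequence per column."""
    xs = [r[0] for r in records]               # strings stay in a Python list
    ys = array("d", (r[1] for r in records))   # contiguous 8-byte doubles
    zs = array("q", (r[2] for r in records))   # contiguous 8-byte signed ints
    return {"x": xs, "y": ys, "z": zs}


columns = to_columns(rows)

# A query that needs only column y scans one contiguous buffer and never
# touches x or z, which is the access pattern columnar storage favors.
total_y = sum(columns["y"])
```

In the row layout, the same sum would have to step over the x and z values of every record to reach the next y value.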
In-Memory Storage Evolution (1/2)
[Figure: timeline of in-memory storage by Spark version]
• Up to 1.3: RDD table cache, stored as Java objects
• 1.4 to 1.6: Table cache, with its own memory layout from Project Tungsten
• 2.0 to 2.2: Parquet vectorized reader, with its own memory layout but a different class from the table cache
In-Memory Storage Evolution (2/2)
[Figure: timeline of in-memory storage by Spark version, 2.3 to 2.4: table cache, Parquet vectorized reader, ORC vectorized reader, and Pandas UDF with Arrow]
• ColumnVector becomes a public class from Spark 2.3
• Table cache, Parquet, ORC, and Arrow use the common ColumnVector class
Performance among Spark Versions
• DataFrame table cache from Spark 2.0 to Spark 2.4
[Figure: Performance comparison among different Spark versions (Spark 2.0, 2.3, and 2.4). Axis: relative elapsed time, 0 to 1; shorter is better.]
Query: df.filter("i % 16 == 0").count()
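As a rough picture of what the benchmark query computes, the sketch below counts the values of a single integer column that are multiples of 16. The function name count_multiples_of_16 and the 1000-element sample column are hypothetical, and the loop only models the scan over one column. It says nothing about how Spark generates or executes code for this query.

```python
from array import array


def count_multiples_of_16(column_i):
    """Count the values in one integer column that satisfy i % 16 == 0."""
    return sum(1 for value in column_i if value % 16 == 0)


# Hypothetical sample column: the integers 0..999 in a contiguous buffer.
column_i = array("q", range(1000))
matches = count_multiples_of_16(column_i)
```

Because the predicate reads only column i, a columnar table cache can answer the query by scanning that one buffer.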
How Columnar Storage is Used in PySpark
• Share data between the columnar storage of Spark and Pandas
– No serialization and deserialization
– 3-100x performance improvements
[Figure labels: ColumnVector, Apache Arrow]
    @pandas_udf('double')
    def plus(v):
        return v + 1.2
Details: "Apache Arrow and Pandas UDF on Apache Spark" by Takuya Ueshin
Source: "Introducing Pandas UDF for PySpark", Databricks blog
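The calling convention behind the plus example can be modeled without Spark. The sketch below is a plain-Python model, not PySpark or Arrow code: every function name in it is hypothetical, and array.array stands in for a column batch. It contrasts a UDF invoked once per value with one invoked once per batch of a column, which is the shape a Pandas UDF takes.

```python
from array import array


def plus_scalar(v):
    """Row-at-a-time UDF: invoked once per value."""
    return v + 1.2


def plus_batch(values):
    """Batch UDF in the spirit of a Pandas UDF: invoked once per column batch."""
    return array("d", (v + 1.2 for v in values))


def run_row_at_a_time(column):
    """Apply plus_scalar to every value and count the UDF invocations."""
    out, calls = array("d"), 0
    for v in column:
        out.append(plus_scalar(v))
        calls += 1
    return out, calls


def run_batched(column, batch_size):
    """Apply plus_batch to consecutive slices and count the UDF invocations."""
    out, calls = array("d"), 0
    for start in range(0, len(column), batch_size):
        out.extend(plus_batch(column[start:start + batch_size]))
        calls += 1
    return out, calls


y = array("d", [2.0, 1.9, 5000.0, 0.0])
row_result, row_calls = run_row_at_a_time(y)
batch_result, batch_calls = run_batched(y, batch_size=2)
```

Both paths produce the same values, but the batched path crosses the UDF boundary once per batch rather than once per row. In the real system the batch is handed over in Arrow format, which this model does not attempt to reproduce.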
How Columnar Storage is Used
• Table cache
    df = ...
    df.cache()
    df1 = df.selectExpr("y + 1.2")
• Parquet
    df = spark.read.parquet("c")
    df1 = df.selectExpr("y + 1.2")
• ORC
    df = spark.read.format("orc").load("c")
    df1 = df.selectExpr("y + 1.2")
• Pandas UDF
    @pandas_udf('double')
    def plus(v):
        return v + 1.2
    df1 = df.withColumn('yy', plus(df.y))
Integrate Spark with Others
• Frameworks: DL/ML frameworks
– SPARK-24579
– SPARK-26413
• Resources: GPU, FPGA, ...
– SPARK-27396
– SAIS2019: "Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators"
[Figure: GPU and FPGA accelerators; image from rapids.ai]
Presentations on Spark & Arrow at SAIS2019
▪ Language / Framework
– Running R at Scale with Apache Arrow on Spark
– Introducing .NET Bindings for Apache Spark
– Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark
– Make your PySpark Data Fly with Arrow!
▪ Hardware resources (e.g. GPU and FPGA)
– Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
– Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators
Exchange data between Spark and R
Running R at Scale with Apache Arrow on Spark
https://www.slideshare.net/databricks/running-r-at-scale-with-apache-arrow-on-spark
Exchange data between Spark and .NET UDF
Introducing .NET Bindings for Apache Spark
https://www.slideshare.net/databricks/introducing-net-bindings-for-apache-spark
Exchange data between Spark and TensorFlow
Make your PySpark Data Fly with Arrow!
https://www.slideshare.net/databricks/make-your-pyspark-data-fly-with-arrow
Make Arrow format standard in Spark
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark
https://www.slideshare.net/databricks/updates-from-project-hydrogen-unifying-stateoftheart-ai-and-big-data-in-apache-spark
Exchange data between Spark and RAPIDS library
Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
https://www.slideshare.net/databricks/accelerating-machine-learning-workloads-and-apache-spark-applications-via-cuda-and-nccl
#24795 (SPARK-27945) provides minimal support for columnar processing
Exchange data between Spark and accelerators
Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators
https://www.slideshare.net/databricks/apache-arrowbased-unified-data-sharing-and-transferring-format-among-cpu-and-accelerators
Takeaway
▪ In-memory storage in Apache Spark has evolved while keeping the same API (e.g. DataFrame and Dataset)
– Improved performance by using columnar storage and Spark's own memory format
– Added support for Apache Arrow
– Defined an API to increase generality and make it easier to support other data sources
▪ Apache Arrow improves the performance and programmability of data exchange between Spark and
– Frameworks
– Hardware accelerators
