This document summarizes a presentation about using the Crail distributed storage system to improve Spark performance on high-performance computing clusters with RDMA networking and NVMe flash storage. The key points are:
1) Traditional Spark storage and networking APIs do not bypass the operating system kernel, limiting performance on modern hardware.
2) The Crail system provides user-level APIs for RDMA networking and NVMe flash, speeding up Spark shuffle, join, and sorting workloads by 2-10x on a 128-node cluster.
3) Crail allows Spark workloads to fully utilize high-speed networks and disaggregate memory and flash storage across nodes without performance penalties.
15. How can we fix this...
● Not just for shuffle
– Also for broadcast, RDD transport, inter-job sharing, etc.
● Not just for RDMA and NVMe hardware
– But for any possible future high-performance I/O hardware
● Not just for co-located compute/storage
– Also for resource disaggregation, heterogeneous resource distribution, etc.
● Not just improve things
– Make it perform at the hardware limit
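The hardware-agnostic goal above (supporting not just RDMA and NVMe, but any future I/O hardware) can be modeled as a pluggable storage-tier abstraction. The following is a minimal, illustrative Python sketch; the class and function names are hypothetical and do not reflect Crail's actual API:

```python
from abc import ABC, abstractmethod

class StorageTier(ABC):
    """One I/O backend (DRAM, NVMe flash, or any future hardware)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0

    @abstractmethod
    def transport(self, local: bool) -> str:
        """Name of the data path used to reach this tier."""

class DramTier(StorageTier):
    def transport(self, local: bool) -> str:
        return "memcpy" if local else "RDMA"

class FlashTier(StorageTier):
    def transport(self, local: bool) -> str:
        return "NVMe" if local else "NVMeF"

def place(tiers, size: int):
    """Fill the highest-priority tier first; spill to the next one."""
    for tier in tiers:
        if tier.capacity - tier.used >= size:
            tier.used += size
            return tier
    raise RuntimeError("all tiers full")
```

Under this sketch, supporting a new class of hardware means adding one subclass; the placement logic and the rest of the stack stay unchanged.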
17. Example: Crail Shuffle (Map)
[Diagram: on the map side, each object is serialized by the Crail serializer directly into a CrailOutputStream with no intermediate buffering; the stream writes to local DRAM via memcpy, remote DRAM via RDMA, local flash via NVMe, or remote flash via NVMeF.]
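The map-side path on this slide (object → serializer → output stream → storage tier, with no intermediate buffering) can be sketched as follows. This is an illustrative Python model, not Crail's actual API; `transport` and `write_records` and the record layout are assumptions for the example:

```python
import io
import struct

def transport(medium: str, local: bool) -> str:
    """Pick the data path for a (medium, locality) pair, as in the diagram."""
    paths = {("DRAM", True): "memcpy", ("DRAM", False): "RDMA",
             ("Flash", True): "NVMe", ("Flash", False): "NVMeF"}
    return paths[(medium, local)]

def write_records(stream, records):
    """Serialize key/value pairs straight into the output stream,
    with no per-record staging buffer (mirroring the 'no buffering' note)."""
    for key, value in records:
        data = value.encode()
        stream.write(struct.pack(">ii", key, len(data)))  # [key][length]
        stream.write(data)                                # [payload]

out = io.BytesIO()  # stands in for a CrailOutputStream
write_records(out, [(1, "a"), (2, "bc")])
```

The point of the design is that serialization output lands directly in the stream, so the stream implementation alone decides which tier and transport the bytes hit.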
18. Example: Crail Shuffle (Reduce)
[Diagram: on the reduce side, the Crail serializer deserializes objects directly from a CrailInputStream, again with no buffering; the stream reads from local DRAM via memcpy, remote DRAM via RDMA, local flash via NVMe, or remote flash via NVMeF.]
19. Example: Crail Shuffle (Reduce)
[Diagram: same reduce-side data path as slide 18.]
The same approach applies to broadcast, input/output, RDD transport, etc.
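The reduce side reverses the map-side path: records are deserialized directly out of the input stream. A matching illustrative sketch (hypothetical names, same assumed `[key][length][payload]` record layout as the map-side example, not Crail's real API):

```python
import io
import struct

def read_records(stream):
    """Deserialize key/value pairs directly from the input stream,
    with no intermediate buffer (mirroring the reduce-side diagram)."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        key, length = struct.unpack(">ii", header)
        yield key, stream.read(length).decode()

# Round-trip against the assumed record layout: [key:int][len:int][bytes]
data = struct.pack(">ii", 7, 2) + b"hi"
assert list(read_records(io.BytesIO(data))) == [(7, "hi")]
```

As on the map side, where the bytes physically come from (local/remote DRAM or flash) is hidden behind the stream, which is what lets the same code cover shuffle, broadcast, and RDD transport.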
26. Storage Tiering: DRAM & NVMe
[Diagram: Spark executors Exec0..ExecN storing input/output/shuffle data in a memory tier backed by a flash tier across the network.]
With Crail, replacing memory with NVMe flash for storing input/output/shuffle data increases the 200GB sorting runtime by only 48%.
[Chart: sorting runtime in seconds (0-120), split into map and reduce phases, for varying memory/flash ratios, compared with vanilla Spark running 100% in-memory.]
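The tiering policy behind this result is that the DRAM tier is filled first, cluster-wide, and only the overflow spills to flash. A tiny illustrative model of that split (numbers and names are hypothetical, for intuition only):

```python
def tier_split(total_gb: float, dram_gb: float):
    """Fraction of a dataset served from DRAM vs flash when the DRAM
    tier is filled first and only the remainder spills to flash."""
    in_dram = min(total_gb, dram_gb)
    return in_dram / total_gb, (total_gb - in_dram) / total_gb

# A 200GB sort with a shrinking DRAM tier: the flash share grows,
# which is the memory/flash ratio axis on the slide's chart.
dram_frac, flash_frac = tier_split(200, 50)
assert (dram_frac, flash_frac) == (0.25, 0.75)
```

Because only the overflow pays the flash penalty, shrinking the DRAM tier degrades runtime gradually rather than falling off a cliff.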
27. DRAM & Flash Disaggregation
[Diagram: Spark executors Exec0..ExecN running on compute nodes, with DRAM and flash disaggregated onto separate storage nodes.]
Using Crail, a Spark 200GB sorting workload can be run with memory and flash disaggregated at no extra cost.
[Chart: sorting runtime in seconds (0-200), split into map and reduce phases, comparing Crail/DRAM, Crail/NVMe, and Vanilla/Alluxio for input/output/shuffle.]
28. Conclusions
● Effectively using high-performance I/O hardware in Spark is challenging
● Crail is an attempt to re-think how data processing frameworks (not only Spark) should interact with network and storage hardware
– User-level I/O, storage disaggregation, memory/flash convergence
● Spark's modular architecture allows Crail to be used seamlessly
29. Crail for Spark is Open Source
● www.crail.io
● github.com/zrlio/spark-io
● github.com/zrlio/crail
● github.com/zrlio/parquetgenerator
● github.com/zrlio/crail-terasort
30. Thank You.
Contributors:
Animesh Trivedi, Jonas Pfefferle, Michael Kaufmann, Bernard Metzler, Radu Stoica, Ioannis Koltsidas, Nikolas Ioannou, Christoph Hagleitner, Peter Hofstee, Evangelos Eleftheriou, ...