SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
Patrick Stuedi, IBM Research
Running Spark on a High-
Performance Cluster
using RDMA Networking
and NVMe Flash
Hardware Trends
community
target
20172010
1Gbps
50us
10Gbps
20us
100 MB/s
100ms
1000 MB/s
200us
Hardware Trends
2017
10 GB/s
50us
100Gbps
1us
community
target
our
target
20172010
1Gbps
50us
10Gbps
20us
100 MB/s
100ms
1000 MB/s
200us
User-Level APIs
Storage App
HDD
FileSystem
iSCSI
Block Layer
OS Kernel
Network App
NIC
Sockets
Ethernet
TCP/IP
OS Kernel
Storage App
SSD, NIC
SPDK
Network App
NIC
OS KernelRDMA, DPDK
Kernel
Kernel
Traditional / Kernel-based User-Level / Kernel-Bypass
Needed to
achieve
1us RTT!
Remote Data Access
512 B 4 KB 64 KB 1 MB
0
200
400
600
800
1000
1200
NVMe over Fabrics
iSCSI
data size
time[usec]
DRAM NVMe Flash
512 B 4 KB 64 KB 1 MB
0
50
100
150
200
250
RDMA
Sockets
data size
time[usec]
User-level APIs offer 2-10x better performance than
traditional kernel-based APIs
Let's Use it!
Spark
NVMe Flash
DRAM
High-Performance RDMA network
SQL Streaming MLlib GraphX
Performance!
Realtime!
Disaggregation!
Case Study: Sorting in Spark
MapCores ReduceInput Output
read & classify shuffle store
Net Ser
Net Ser
Net Ser
Sort
Sort
Sort
Experiment Setup
●
Total data size: 12.8 TB
●
Cluster size: 128 nodes
●
Cluster hardware:
– DRAM: 512 GB DDR 4
– Storage: 4x 1.2 TB NVMe SSD
– Network: 100GbE Mellanox RDMA
Flash bandwidth
per node matches
network bandwidth
How is the Network Used?
Hardware limit
0
20
40
60
80
100
0 100 200 300 400 500
Throughput[Gbit/s]
Elapsed time (seconds)
What is the Problem?
JVM
Netty
TeraSort
Sorter
Serializer
sockets
Spark
● Spark uses legacy networking and storage APIs: no kernel-bypass
● Spark itself introduces additional I/O layers: Netty, serializer, sorter, etc.
TCP/IP
Ethernet
NIC
filesystem
block layer
iSCSI
SSD
Example: Shuffle (Map)
Buffer
Buffer
Buffer
Sorter
Kryo
BufferedOutputStream
BufferJNI backend
BufferFile System Buffer Cache
Object
Example: Shuffle (Map)
Buffer
Buffer
Buffer
Sorter
Kryo
BufferJNI
BufferFS
Object
Stream
Example: Shuffle (Map+Reduce)
Buffer
Buffer
Buffer
Sorter
Kryo
BufferJNI
BufferFS
Object
Stream
Buffer
Buffer Buffer SocketBuffer
Netty
Buffer
Buffer
Kryo
Object
network
Sorter
Example: Shuffle (Map+Reduce)
Buffer
Buffer
Buffer
Sorter
Kryo
BufferJNI
BufferFS
Object
Stream
Buffer
Buffer Buffer SocketBuffer
Netty
Buffer
Buffer
Kryo
Object
network
Sorter
Difficult at
100 Gbps!
How can we fix this...
● Not just for shuffle
– Also for broadcast, RDD transport, inter-job sharing, etc.
● Not just for RDMA and NVMe hardware
– But for any possible future high-performance I/O hardware
● Not just for co-located compute/storage
– Also for resource disaggregation, heterogeneous resource distribution, etc.
● Not just improve things
– Make it perform at the hardware limit
The CRAIL Approach
Example: Crail Shuffle (Map)
Buffer
Crail Serializer
CrailOutputStream
Object
local
DRAM
remote
DRAM
local
Flash
remote
Flash
memcpy RDMA NVMe NVMeF
No
buffering
Example: Crail Shuffle (Reduce)
Buffer
Crail Serializer
CrailInputStream
local
DRAM
remote
DRAM
local
Flash
remote
Flash
memcpy RDMA NVMe NVMeF
No
buffering
Object
Example: Crail Shuffle (Reduce)
Buffer
Crail Serializer
CrailInputStream
local
DRAM
remote
DRAM
local
Flash
remote
Flash
memcpy RDMA NVMe NVMeF
No
buffering
Object
Similar for
broadcast,
Input/output,
RDD, etc.
Performance: Configuration
●
Experiments
– Memory-only: Broadcast, GroupBy, Sorting, SQL
– Memory/Flash: disaggregation, tiering
●
Cluster size: 8 nodes, except Sorting: 128 nodes
●
Cluster hardware:
– DRAM: 512 GB DDR 4
– Storage: 4x 1.2 TB NVMe SSD
– Network: 100GbE Mellanox RDMA
●
Spark 2.1.0, Alluxio 1.4, Crail 1.0
●
Hadoop.tmp.dir: RamFS for microbenchmarks, flash SSD for workloads
Spark Broadcast
val bcVar = sc.Broadcast(new Array[Byte](128))
sc.parallelize(1 to tasks, tasks).map(_ => {
bcVar.value.length
}).count
Spark
driver
Spark
Executor
/spark/broadcast/broadcast_0 Crail
Store
Broadcast writing
RPC+2xRDMA+RPC
CDF
Broadcast reading
RPC+RDMA1
2
3
4
Spark GroupBy (80M keys, 4K)
val pairs = sc.parallelize(1 to tasks, tasks).flatmap(_ => {
var values = new array[(Long,Array[Byte])](numKeys)
values = initValues(values)
}).cache().groupByKey().count()
Spark
Exec0
/shuffle/1/ f0 f1 fN
/shuffle/2/ f0 f1 fN
/shuffle/3 f0 f1 fN
Spark
Exec1
Spark
ExecN
Spark
Exec0
Spark
Exec1
Spark
ExecN
hash hash hash1
2
map: Hash
Append (file)
reduce: fetch
Bucket (dir)
Crail
Spark/
Vanilla
5x
2.5x
2x
0
20
40
60
80
100
0 10 20 30 40 50 60 70 80 90 100 110 120
Throughput(Gbit/s)
Elapsed time (seconds)
1 core
4 cores
8 cores
0
20
40
60
80
100
0 10 20 30 40 50 60 70 80 90 100 110 120
Throughput(Gbit/s)
Elapsed time (seconds)
1 core
4 cores
8 cores
Spark/
Crail
Sorting 12.8 TB on 128 nodes
Spark Spark/Crail
0
200
400
600
reduce
map
Spark/Crail Network UsageSorting Runtime
Task ID
Throughput(Gbit/s)
6x
time(sec)
How fast is this?
Spark/Crail Winner 2014 Winner 2016
Size (TB) 12.8 100 100
Time (sec) 98 1406 134
Total cores 2560 6592 10240
Network HW (Gbit/s) 100 10 100
Rate/core (GB/min) 3.13 0.66 4.4
Native C
distributed
sorting
benchmark
Spark
www.sortingbenchmark.org
Sorting rate of
Crail/Spark only 27%
slower than rate of
Winner 2016
Spark SQL JoinSpark
Exec0
/shuffle/1/ f0 f1 fN
/shuffle/2/ f0 f1 fN
/shuffle/3 f0 f1 fN
Spark
Exec1
Spark
ExecN
hash hash hash
1
2
map: Hash
Append (file)
val ds1 = sparkSession.read.parquet(“...”) //64GB dataset
val ds2 = sparkSession.read.parquet(“...”) //64GB dataset
val resultDS = ds1.joinWith(ds2, ds1(“key”) === ds2(“key”)) //~100GB
resultDS.write.format(“parquet”).mode(SaveMode.Overwrite).save(“...”)
parquet parquet parquet
read from
Hdfs/crail/alluxio
deser
sort
join
write
sort
join
write
sort
join
write
deser deser
3 Sort
Merge join
hdfs/crail/alluxio
4
CrailStore
0
10
20
30
40
50
output
join
hash
input
Time(sec)
Spark/
Alluxio
Spark/
Crail
Storage Tiering: DRAM & NVMe
Spark
Exec0
Spark
Exec1
Spark
ExecN
Memory Tier
Flash
Tier
Network
Input/Output/
Shuffle/
Spark
With Crail, replacing memory with NVMe flash
for storing input/output/shuffle reduces the 200GB sorting runtime by only 48%
0
20
40
60
80
100
120
Reduce
MapVanilla
Spark
(100%
in-memory)
memory/flash ratio
Time(sec)
sorting runtime
DRAM & Flash Disaggregation
Spark
Exec1
Spark
ExecN
Using Crail, a Spark 200GB sorting workload can be run with
memory and flash disaggregated at no extra cost
Spark
Exec0
storage
nodes
compute
nodes
0
40
80
120
160
200
Reduce
Map
Crail/
DRAM
Crail/
NVMe
Vanilla/
Alluxio
Input/Output/
Shuffle
Time(sec)
sorting runtime
Conclusions
●
Effectively using high-performance I/O hardware in
Spark is challenging
●
Crail is an attempt to re-think how data processing
frameworks (not only Spark) should interact with
network and storage hardware
– User-level I/O, storage disaggregation, memory/flash convergence
●
Spark's modular architecture allows Crail to be used
seamlessly
Crail for Spark is Open Source
●
www.crail.io
●
github.com/zrlio/spark-io
●
github.com/zrlio/crail
●
github.com/zrlio/parquetgenerator
●
github.com/zrlio/crail-terasort
Thank You.
Contributors:
Animesh Trivedi, Jonas Pfefferle, Michael Kaufmann, Bernard
Metzler, Radu Stoica, Ioannis Koltsidas, Nikolos Ioannou,
Christoph Hagleitner, Peter Hofstee, Evangelos Eleftheriou, ...

Weitere ähnliche Inhalte

Was ist angesagt?

How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Ceph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideCeph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideKaran Singh
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesYoshinori Matsunobu
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Databricks
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance DataWorks Summit/Hadoop Summit
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Databricks
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 

Was ist angesagt? (20)

How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Ceph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideCeph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing Guide
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 

Ähnlich wie Running Spark on High-Performance Hardware using Crail

Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Виталий Стародубцев
 
Red Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference ArchitecturesRed Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference ArchitecturesRed_Hat_Storage
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryDatabricks
 
Experiences with Oracle SPARC S7-2 Server
Experiences with Oracle SPARC S7-2 ServerExperiences with Oracle SPARC S7-2 Server
Experiences with Oracle SPARC S7-2 ServerJomaSoft
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheDavid Grier
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data DeduplicationRedWireServices
 
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Databricks
 
Offloading for Databases - Deep Dive
Offloading for Databases - Deep DiveOffloading for Databases - Deep Dive
Offloading for Databases - Deep DiveUniFabric
 
SOUG_SDM_OracleDB_V3
SOUG_SDM_OracleDB_V3SOUG_SDM_OracleDB_V3
SOUG_SDM_OracleDB_V3UniFabric
 
Storage and performance, Whiptail
Storage and performance, Whiptail Storage and performance, Whiptail
Storage and performance, Whiptail Internet World
 
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014Philippe Fierens
 
Orcl siebel-sun-s282213-oow2006
Orcl siebel-sun-s282213-oow2006Orcl siebel-sun-s282213-oow2006
Orcl siebel-sun-s282213-oow2006Sal Marcus
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Day 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConfDay 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConfRedis Labs
 
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseTackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseDatabricks
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 

Ähnlich wie Running Spark on High-Performance Hardware using Crail (20)

Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
 
Red Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference ArchitecturesRed Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference Architectures
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent Memory
 
Experiences with Oracle SPARC S7-2 Server
Experiences with Oracle SPARC S7-2 ServerExperiences with Oracle SPARC S7-2 Server
Experiences with Oracle SPARC S7-2 Server
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cache
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data Deduplication
 
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
 
Offloading for Databases - Deep Dive
Offloading for Databases - Deep DiveOffloading for Databases - Deep Dive
Offloading for Databases - Deep Dive
 
SOUG_SDM_OracleDB_V3
SOUG_SDM_OracleDB_V3SOUG_SDM_OracleDB_V3
SOUG_SDM_OracleDB_V3
 
Storage and performance, Whiptail
Storage and performance, Whiptail Storage and performance, Whiptail
Storage and performance, Whiptail
 
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
 
Orcl siebel-sun-s282213-oow2006
Orcl siebel-sun-s282213-oow2006Orcl siebel-sun-s282213-oow2006
Orcl siebel-sun-s282213-oow2006
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Day 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConfDay 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConf
 
Meetup Oracle Database MAD_BCN: 4 Saborea Exadata
Meetup Oracle Database MAD_BCN: 4 Saborea ExadataMeetup Oracle Database MAD_BCN: 4 Saborea Exadata
Meetup Oracle Database MAD_BCN: 4 Saborea Exadata
 
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseTackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 

Kürzlich hochgeladen (20)

办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 

Running Spark on High-Performance Hardware using Crail

  • 1. Patrick Stuedi, IBM Research Running Spark on a High- Performance Cluster using RDMA Networking and NVMe Flash
  • 4. User-Level APIs Storage App HDD FileSystem iSCSI Block Layer OS Kernel Network App NIC Sockets Ethernet TCP/IP OS Kernel Storage App SSD, NIC SPDK Network App NIC OS KernelRDMA, DPDK Kernel Kernel Traditional / Kernel-based User-Level / Kernel-Bypass Needed to achieve 1us RTT!
  • 5. Remote Data Access 512 B 4 KB 64 KB 1 MB 0 200 400 600 800 1000 1200 NVMe over Fabrics iSCSI data size time[usec] DRAM NVMe Flash 512 B 4 KB 64 KB 1 MB 0 50 100 150 200 250 RDMA Sockets data size time[usec] User-level APIs offer 2-10x better performance than traditional kernel-based APIs
  • 6. Let's Use it! Spark NVMe Flash DRAM High-Performance RDMA network SQL Streaming MLlib GraphX Performance! Realtime! Disaggregation!
  • 7. Case Study: Sorting in Spark MapCores ReduceInput Output read & classify shuffle store Net Ser Net Ser Net Ser Sort Sort Sort
  • 8. Experiment Setup ● Total data size: 12.8 TB ● Cluster size: 128 nodes ● Cluster hardware: – DRAM: 512 GB DDR 4 – Storage: 4x 1.2 TB NVMe SSD – Network: 100GbE Mellanox RDMA Flash bandwidth per node matches network bandwidth
  • 9. How is the Network Used? Hardware limit 0 20 40 60 80 100 0 100 200 300 400 500 Throughput[Gbit/s] Elapsed time (seconds)
  • 10. What is the Problem? JVM Netty TeraSort Sorter Serializer sockets Spark ● Spark uses legacy networking and storage APIs: no kernel-bypass ● Spark itself introduces additional I/O layers: Netty, serializer, sorter, etc. TCP/IP Ethernet NIC filesystem block layer iSCSI SSD
  • 14. Example: Shuffle (Map+Reduce) Buffer Buffer Buffer Sorter Kryo BufferJNI BufferFS Object Stream Buffer Buffer Buffer SocketBuffer Netty Buffer Buffer Kryo Object network Sorter Difficult at 100 Gbps!
  • 15. How can we fix this... ● Not just for shuffle – Also for broadcast, RDD transport, inter-job sharing, etc. ● Not just for RDMA and NVMe hardware – But for any possible future high-performance I/O hardware ● Not just for co-located compute/storage – Also for resource disaggregation, heterogeneous resource distribution, etc. ● Not just improve things – Make it perform at the hardware limit
  • 17. Example: Crail Shuffle (Map) Buffer Crail Serializer CrailOutputStream Object local DRAM remote DRAM local Flash remote Flash memcpy RDMA NVMe NVMeF No buffering
  • 18. Example: Crail Shuffle (Reduce) Buffer Crail Serializer CrailInputStream local DRAM remote DRAM local Flash remote Flash memcpy RDMA NVMe NVMeF No buffering Object
  • 19. Example: Crail Shuffle (Reduce) Buffer Crail Serializer CrailInputStream local DRAM remote DRAM local Flash remote Flash memcpy RDMA NVMe NVMeF No buffering Object Similar for broadcast, Input/output, RDD, etc.
  • 20. Performance: Configuration ● Experiments – Memory-only: Broadcast, GroupBy, Sorting, SQL – Memory/Flash: disaggregation, tiering ● Cluster size: 8 nodes, except Sorting: 128 nodes ● Cluster hardware: – DRAM: 512 GB DDR 4 – Storage: 4x 1.2 TB NVMe SSD – Network: 100GbE Mellanox RDMA ● Spark 2.1.0, Alluxio 1.4, Crail 1.0 ● Hadoop.tmp.dir: RamFS for microbenchmarks, flash SSD for workloads
  • 21. Spark Broadcast val bcVar = sc.Broadcast(new Array[Byte](128)) sc.parallelize(1 to tasks, tasks).map(_ => { bcVar.value.length }).count Spark driver Spark Executor /spark/broadcast/broadcast_0 Crail Store Broadcast writing RPC+2xRDMA+RPC CDF Broadcast reading RPC+RDMA1 2 3 4
  • 22. Spark GroupBy (80M keys, 4K) val pairs = sc.parallelize(1 to tasks, tasks).flatmap(_ => { var values = new array[(Long,Array[Byte])](numKeys) values = initValues(values) }).cache().groupByKey().count() Spark Exec0 /shuffle/1/ f0 f1 fN /shuffle/2/ f0 f1 fN /shuffle/3 f0 f1 fN Spark Exec1 Spark ExecN Spark Exec0 Spark Exec1 Spark ExecN hash hash hash1 2 map: Hash Append (file) reduce: fetch Bucket (dir) Crail Spark/ Vanilla 5x 2.5x 2x 0 20 40 60 80 100 0 10 20 30 40 50 60 70 80 90 100 110 120 Throughput(Gbit/s) Elapsed time (seconds) 1 core 4 cores 8 cores 0 20 40 60 80 100 0 10 20 30 40 50 60 70 80 90 100 110 120 Throughput(Gbit/s) Elapsed time (seconds) 1 core 4 cores 8 cores Spark/ Crail
  • 23. Sorting 12.8 TB on 128 nodes Spark Spark/Crail 0 200 400 600 reduce map Spark/Crail Network UsageSorting Runtime Task ID Throughput(Gbit/s) 6x time(sec)
  • 24. How fast is this? Spark/Crail Winner 2014 Winner 2016 Size (TB) 12.8 100 100 Time (sec) 98 1406 134 Total cores 2560 6592 10240 Network HW (Gbit/s) 100 10 100 Rate/core (GB/min) 3.13 0.66 4.4 Native C distributed sorting benchmark Spark www.sortingbenchmark.org Sorting rate of Crail/Spark only 27% slower than rate of Winner 2016
  • 25. Spark SQL JoinSpark Exec0 /shuffle/1/ f0 f1 fN /shuffle/2/ f0 f1 fN /shuffle/3 f0 f1 fN Spark Exec1 Spark ExecN hash hash hash 1 2 map: Hash Append (file) val ds1 = sparkSession.read.parquet(“...”) //64GB dataset val ds2 = sparkSession.read.parquet(“...”) //64GB dataset val resultDS = ds1.joinWith(ds2, ds1(“key”) === ds2(“key”)) //~100GB resultDS.write.format(“parquet”).mode(SaveMode.Overwrite).save(“...”) parquet parquet parquet read from Hdfs/crail/alluxio deser sort join write sort join write sort join write deser deser 3 Sort Merge join hdfs/crail/alluxio 4 CrailStore 0 10 20 30 40 50 output join hash input Time(sec) Spark/ Alluxio Spark/ Crail
  • 26. Storage Tiering: DRAM & NVMe Spark Exec0 Spark Exec1 Spark ExecN Memory Tier Flash Tier Network Input/Output/ Shuffle/ Spark With Crail, replacing memory with NVMe flash for storing input/output/shuffle reduces the 200GB sorting runtime by only 48% 0 20 40 60 80 100 120 Reduce MapVanilla Spark (100% in-memory) memory/flash ratio Time(sec) sorting runtime
  • 27. DRAM & Flash Disaggregation Spark Exec1 Spark ExecN Using Crail, a Spark 200GB sorting workload can be run with memory and flash disaggregated at no extra cost Spark Exec0 storage nodes compute nodes 0 40 80 120 160 200 Reduce Map Crail/ DRAM Crail/ NVMe Vanilla/ Alluxio Input/Output/ Shuffle Time(sec) sorting runtime
  • 28. Conclusions ● Effectively using high-performance I/O hardware in Spark is challenging ● Crail is an attempt to re-think how data processing frameworks (not only Spark) should interact with network and storage hardware – User-level I/O, storage disaggregation, memory/flash convergence ● Spark's modular architecture allows Crail to be used seamlessly
  • 29. Crail for Spark is Open Source ● www.crail.io ● github.com/zrlio/spark-io ● github.com/zrlio/crail ● github.com/zrlio/parquetgenerator ● github.com/zrlio/crail-terasort
  • 30. Thank You. Contributors: Animesh Trivedi, Jonas Pfefferle, Michael Kaufmann, Bernard Metzler, Radu Stoica, Ioannis Koltsidas, Nikolos Ioannou, Christoph Hagleitner, Peter Hofstee, Evangelos Eleftheriou, ...