This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, DataFrames and some other high-level topics, and can be used as an introduction to Apache Spark.
2. About me
Enterprise Architect @ Pivotal
• 7 years in data processing
• 5 years with MPP
• 4 years with Hadoop
• Spark contributor
• http://0x0fff.com
7. Spark Motivation
• Difficulty of programming directly in Hadoop MapReduce
• Performance bottlenecks, or batch not fitting the use cases
• Better support for iterative jobs typical for machine learning
8. Difficulty of Programming in MR
Word Count implementations
• Hadoop MR – 61 lines in Java
• Spark – 1 line in the interactive shell

sc.textFile('...').flatMap(lambda x: x.split()) \
  .map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y) \
  .saveAsTextFile('...')
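For comparison, the same word-count logic can be sketched in plain Python (no Spark required); `flatMap`, `map` and `reduceByKey` correspond naturally to a split, a pair-up and a dictionary-based reduction:

```python
from collections import defaultdict

def word_count(lines):
    """Plain-Python equivalent of the Spark chain:
    flatMap(split) -> map((word, 1)) -> reduceByKey(+)."""
    counts = defaultdict(int)
    for line in lines:              # flatMap: one line -> many words
        for word in line.split():
            counts[word] += 1       # reduceByKey: sum the 1s per word
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Spark distributes exactly this computation across partitions, which is why the whole job fits in one chained expression.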
9. Performance Bottlenecks
How many times is the data put to the HDD during a single
MapReduce job?
• One
• Two
• Three
• More
10. Performance Bottlenecks
How many times is the data put to the HDD during a single
MapReduce job?
• One
• Two
✓ Three
✓ More
15. Performance Bottlenecks
Consider Hive as the main SQL tool
• A typical Hive query is translated into 3-5 MR jobs
• Each MR job puts data to the HDD 3+ times
• Each put to the HDD – a write followed by a read
• Sums up to 18-30 scans of the data during a single
Hive query
18. Performance Bottlenecks
Spark offers you
• Lazy computations
– Optimize the job before executing it
• In-memory data caching
– Scan the HDD only once, then scan your RAM
• Efficient pipelining
– Avoids the data hitting the HDD by all means
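A rough analogy for lazy, pipelined computation, using plain Python generators rather than Spark itself: building the pipeline does no work, and consuming it runs the whole chain in a single pass without materializing intermediate results:

```python
log = []

def numbers(n):
    for i in range(n):
        log.append(i)          # record when each item is actually produced
        yield i

# Building the pipeline computes nothing yet (like RDD transformations)
pipeline = (x * x for x in numbers(5) if x % 2 == 0)
assert log == []               # no work has been done so far

# Consuming it (like a Spark action) triggers one pipelined pass
result = sum(pipeline)
print(result)  # 0 + 4 + 16 = 20
```

The same principle lets Spark fuse a chain of transformations into one scan of the input instead of writing each intermediate to disk as MapReduce does.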
22. Spark Pillars
Two main abstractions of Spark
• RDD – Resilient Distributed Dataset
• DAG – Directed Acyclic Graph
25. RDD
• Simple view
– RDD is a collection of data items split into partitions and
stored in memory on the worker nodes of the cluster
• Complex view
– RDD is an interface for data transformation
– RDD refers to data stored either in a persistent store
(HDFS, Cassandra, HBase, etc.), in cache (memory,
memory+disk, disk only, etc.), or in another RDD
31. RDD
• Complex view (cont'd)
– Partitions are recomputed on failure or cache eviction
– Metadata stored for the interface
▪ Partitions – set of data splits associated with this RDD
▪ Dependencies – list of parent RDDs involved in the computation
▪ Compute – function to compute a partition of the RDD given the
parent partitions from the Dependencies
▪ Preferred Locations – where is the best place to put
computations on this partition (data locality)
▪ Partitioner – how the data is split into partitions
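The interface described above can be sketched as a minimal Python class. This is an illustration of its shape, not Spark's actual implementation; the class and field names are chosen for clarity:

```python
class SketchRDD:
    """Toy illustration of the RDD interface: partitions, dependencies,
    a compute function, preferred locations and a partitioner."""
    def __init__(self, partitions, dependencies, compute_fn,
                 preferred_locations=None, partitioner=None):
        self.partitions = partitions            # set of data splits
        self.dependencies = dependencies        # parent RDDs
        self.compute_fn = compute_fn            # parents' partitions -> partition
        self.preferred_locations = preferred_locations or {}
        self.partitioner = partitioner          # how data is split

    def compute(self, idx):
        """Recompute partition idx from the parents, as happens on
        failure or cache eviction."""
        parent_parts = [dep.compute(idx) for dep in self.dependencies]
        return self.compute_fn(idx, parent_parts)

# A source RDD with no parents, and a mapped RDD that depends on it
source = SketchRDD([0, 1], [], lambda idx, _: list(range(idx * 3, idx * 3 + 3)))
mapped = SketchRDD([0, 1], [source],
                   lambda idx, parents: [x * 10 for x in parents[0]])
print(mapped.compute(1))  # [30, 40, 50]
```

Because every RDD knows its parents and its compute function, any lost partition can be rebuilt by walking the lineage back to a persistent source.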
32. RDD
• RDD is the main and only tool for data
manipulation in Spark
• Two classes of operations
– Transformations
– Actions
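The split between the two classes of operations can be illustrated with a toy lazy collection (plain Python, not Spark): transformations only record what to do, while actions force evaluation and return a concrete value:

```python
class LazyList:
    """Toy model: transformations build a plan, actions execute it."""
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []     # recorded operations, not yet executed

    # Transformations: return a new LazyList, compute nothing
    def map(self, f):
        return LazyList(self.data, self.plan + [("map", f)])

    def filter(self, f):
        return LazyList(self.data, self.plan + [("filter", f)])

    # Actions: run the recorded plan and return a concrete result
    def collect(self):
        out = self.data
        for op, f in self.plan:
            out = [f(x) for x in out] if op == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

nums = LazyList([1, 2, 3, 4, 5])
evens_sq = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
assert evens_sq.plan                # a plan exists, but nothing has run yet
print(evens_sq.collect())  # [4, 16]
```

In Spark, `map`, `filter`, `flatMap` and `reduceByKey` are transformations, while `collect`, `count` and `saveAsTextFile` are actions that trigger a job.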
38. DAG
Directed Acyclic Graph – sequence of computations
performed on data
• Node – RDD partition
• Edge – transformation on top of the data
• Acyclic – the graph cannot return to an older partition
• Directed – a transformation is an action that transitions a
data partition's state (from A to B)
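The word-count pipeline from earlier forms exactly such a graph. A topological sort, sketched below in plain Python, yields a valid execution order, which is only possible because the graph has no cycles:

```python
from collections import defaultdict

def topo_sort(edges):
    """Kahn's algorithm: returns an execution order for a DAG,
    or raises if the graph contains a cycle."""
    indeg, adj, nodes = defaultdict(int), defaultdict(list), set()
    for src, dst in edges:
        adj[src].append(dst)
        indeg[dst] += 1
        nodes |= {src, dst}
    ready = sorted(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:
        n = ready.pop(0)
        order.append(n)
        for m in adj[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("cycle detected: not a DAG")
    return order

# Edges are transformations; nodes stand for the resulting RDDs
edges = [("textFile", "flatMap"), ("flatMap", "map"),
         ("map", "reduceByKey"), ("reduceByKey", "saveAsTextFile")]
print(topo_sort(edges))
```

The acyclicity requirement is what guarantees the scheduler such an order always exists.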
52. Spark Cluster
• Driver
– Entry point of the Spark Shell (Scala, Python, R)
– The place where SparkContext is created
– Translates RDDs into the execution graph
– Splits the graph into stages
– Schedules tasks and controls their execution
– Stores metadata about all the RDDs and their partitions
– Brings up the Spark WebUI with job information
56. Spark Cluster
• Executor
– Stores the data in cache in the JVM heap or on HDDs
– Reads data from external sources
– Writes data to external sources
– Performs all the data processing
61. Application Decomposition
• Application
– A single instance of SparkContext that stores some data
processing logic and can schedule a series of jobs,
sequentially or in parallel (SparkContext is thread-safe)
• Job
– A complete set of transformations on an RDD that finishes
with an action or data saving, triggered by the driver
application
63. Application Decomposition
• Stage
– A set of transformations that can be pipelined and
executed by a single independent worker. Usually it is
all the transformations between "read", "shuffle",
"action", or "save"
• Task
– Execution of a stage on a single data partition. The basic
unit of scheduling
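The rule above — stages break at shuffle boundaries — can be sketched as a small function over a list of operations tagged as narrow (pipelinable) or wide (requiring a shuffle). This is a deliberate simplification of Spark's actual scheduler:

```python
def split_into_stages(ops):
    """Group pipelined (narrow) ops into stages; a wide op (shuffle)
    closes the current stage and starts a new one."""
    stages, current = [], []
    for name, kind in ops:
        current.append(name)
        if kind == "wide":          # shuffle boundary
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

# The word-count pipeline: reduceByKey is the only wide dependency
ops = [("textFile", "narrow"), ("flatMap", "narrow"), ("map", "narrow"),
       ("reduceByKey", "wide"), ("saveAsTextFile", "narrow")]
print(split_into_stages(ops))
# [['textFile', 'flatMap', 'map', 'reduceByKey'], ['saveAsTextFile']]
```

Each resulting stage is then executed as one task per data partition, which is why fewer shuffles directly mean fewer stages and less disk and network traffic.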
70. Persistence in Spark
Persistence Level    Description
MEMORY_ONLY          Store RDD as deserialized Java objects in the JVM. If the RDD does not
                     fit in memory, some partitions will not be cached and will be recomputed
                     on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK      Store RDD as deserialized Java objects in the JVM. If the RDD does not
                     fit in memory, store the partitions that don't fit on disk, and read
                     them from there when they're needed.
MEMORY_ONLY_SER      Store RDD as serialized Java objects (one byte array per partition).
                     This is generally more space-efficient than deserialized objects,
                     especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER  Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in
                     memory to disk instead of recomputing them on the fly each time they're
                     needed.
DISK_ONLY            Store the RDD partitions only on disk.
MEMORY_ONLY_2,       Same as the levels above, but replicate each partition on two cluster
DISK_ONLY_2, etc.    nodes.
71. Persistence in Spark
• Spark treats memory as a cache with LRU
eviction rules
• If "disk" is involved, data is evicted to disks

rdd = sc.parallelize(xrange(1000))
rdd.cache().count()
rdd.unpersist()  # the storage level cannot be changed while the RDD is persisted
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER).count()
rdd.unpersist()
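The LRU eviction rule mentioned above can be illustrated with a tiny cache built on Python's OrderedDict. This is an analogy for the policy only, not a model of Spark's block manager; the key names are hypothetical:

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used cache: on overflow, evict the oldest entry."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return None                    # a miss would trigger recomputation
        self.store.move_to_end(key)        # mark as recently used
        return self.store[key]

    def put(self, key, value):
        self.store[key] = value
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False) # evict the least recently used

cache = LRUCache(2)
cache.put("rdd_1_part_0", [1, 2])
cache.put("rdd_1_part_1", [3, 4])
cache.get("rdd_1_part_0")                  # touch: part_1 is now the oldest
cache.put("rdd_2_part_0", [5, 6])          # evicts rdd_1_part_1
print(list(cache.store))  # ['rdd_1_part_0', 'rdd_2_part_0']
```

An evicted partition is not lost: with a memory-only level it is recomputed from its lineage, and with a "+disk" level it is read back from disk.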
111. DataFrame Implementation
• Interface
– DataFrame is an RDD with a schema – field names, field
data types and statistics
– Unified transformation interface in all languages; all the
transformations are passed to the JVM
– Can be accessed as an RDD, in which case it is transformed into
an RDD of Row objects
112. DataFrame Implementation
• Internals
– Internally it is the same RDD
– Data is stored in a row-columnar format; the row chunk size
is set by spark.sql.inMemoryColumnarStorage.batchSize
– Each column in each partition stores min-max values
for partition pruning
– Allows a better compression ratio than a standard RDD
– Delivers faster performance for small subsets of
columns
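The min-max pruning described above can be sketched in plain Python (a simplification of Spark's in-memory columnar cache): each partition keeps per-column statistics, and a range filter can skip partitions whose min-max range cannot possibly match:

```python
def build_partition(rows, column):
    """Store one column of a partition together with its min/max stats."""
    values = [r[column] for r in rows]
    return {"values": values, "min": min(values), "max": max(values)}

def scan(partitions, lo, hi):
    """Return values in [lo, hi], skipping partitions via min-max pruning.
    Also reports how many partitions were actually read."""
    out, scanned = [], 0
    for part in partitions:
        if part["max"] < lo or part["min"] > hi:
            continue                      # pruned: no value in range
        scanned += 1
        out.extend(v for v in part["values"] if lo <= v <= hi)
    return out, scanned

# Three partitions of an "age" column with disjoint value ranges
parts = [build_partition([{"age": a} for a in chunk], "age")
         for chunk in ([20, 25, 30], [40, 45, 50], [60, 65, 70])]
result, scanned = scan(parts, 41, 49)
print(result, scanned)  # [45] 1 -- only the middle partition is read
```

Storing each column contiguously is also what enables the compression and column-subset speedups mentioned above: a scan touches only the bytes of the columns it needs.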