9. Components
sc = new SparkContext
f = sc.textFile("…")
f.filter(…).count()
...
Your program
Spark client (app master): RDD graph, Scheduler, Block tracker, Shuffle tracker
Cluster manager
Spark worker: Task threads, Block manager
HDFS, HBase, …
Spark Internals: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
10. Scheduling Process
rdd1.join(rdd2)
.groupBy(…)
.filter(…)
RDD Objects: build the operator DAG and pass the DAG to the DAGScheduler.
DAGScheduler: split the graph into stages of tasks and submit each stage as it becomes ready, passing a TaskSet to the TaskScheduler. Agnostic to operators!
TaskScheduler: launch tasks via the cluster manager and retry failed or straggling tasks; report "stage failed" back to the DAGScheduler. Doesn't know about stages.
Worker: execute tasks in threads; store and serve blocks through the block manager.
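The stage split can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Spark's DAGScheduler: `Node`, `Narrow`, `Shuffle` and `StageSketch` are hypothetical names, and the example assumes that `join` and `groupBy` each introduce a shuffle boundary while `filter` is pipelined with its parent.

```scala
// Hypothetical, simplified model of an operator DAG; these are not Spark's classes.
object StageSketch {
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep   // pipelined, e.g. filter
  final case class Shuffle(parent: Node) extends Dep  // stage boundary, e.g. join, groupBy

  final case class Node(name: String, deps: List[Dep] = Nil)

  /** Operators that run in the same stage as `node` (follows narrow deps only). */
  def stageOf(node: Node): Set[String] =
    node.deps.foldLeft(Set(node.name)) {
      case (acc, Narrow(p))  => acc ++ stageOf(p)
      case (acc, Shuffle(_)) => acc
    }

  /** Nodes sitting directly behind a shuffle boundary of `node`'s stage. */
  private def shuffleParents(node: Node): List[Node] =
    node.deps.flatMap {
      case Narrow(p)  => shuffleParents(p)
      case Shuffle(p) => List(p)
    }

  /** All stages, parents first, so a stage is listed after everything it waits for. */
  def stages(finalNode: Node): List[Set[String]] = {
    val parents = shuffleParents(finalNode).flatMap(stages)
    (parents :+ stageOf(finalNode)).distinct
  }

  /** The slide's pipeline: rdd1.join(rdd2).groupBy(…).filter(…) */
  def demo(): List[Set[String]] = {
    val rdd1     = Node("rdd1")
    val rdd2     = Node("rdd2")
    val joined   = Node("join", List(Shuffle(rdd1), Shuffle(rdd2)))
    val grouped  = Node("groupBy", List(Shuffle(joined)))
    val filtered = Node("filter", List(Narrow(grouped)))
    stages(filtered)
  }
}
```

In this model `StageSketch.demo()` yields four stages, with `groupBy` and `filter` sharing the last one. The split looks only at dependency types and never at what an operator computes, which is one way to read "agnostic to operators".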
11. Spark Internals
RDD
Reference
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin,
S. Shenker, I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing, NSDI 2012, April 2012
SparkContext
Contains the SparkConf and the scheduler; entry point for running jobs (runJob)
Dependency
Input RDDs
13. Spark Internals
RDD Iterator
First, check the local cache
If not found, compute the RDD
StorageLevel
Off-heap
distributed memory store
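The cache-then-compute lookup can be sketched in plain Scala. This is a minimal illustration, not Spark's code: `LocalCache`, `SketchRdd` and `computeCalls` are hypothetical names, and the cache is an in-memory map standing in for the worker's block manager.

```scala
import scala.collection.mutable

object IteratorSketch {
  // Hypothetical stand-in for the block manager: data keyed by (rddId, partition).
  final class LocalCache {
    private val blocks = mutable.Map.empty[(Int, Int), Vector[Int]]
    def get(key: (Int, Int)): Option[Vector[Int]] = blocks.get(key)
    def put(key: (Int, Int), data: Vector[Int]): Unit = blocks(key) = data
  }

  final class SketchRdd(val id: Int, cached: Boolean, computeFn: Int => Iterator[Int]) {
    var computeCalls = 0 // exposed only so the behaviour can be observed

    def iterator(partition: Int, cache: LocalCache): Iterator[Int] =
      if (!cached) compute(partition) // no storage level set: always compute
      else cache.get((id, partition)) match {
        case Some(data) => data.iterator          // found in the local cache
        case None =>
          val data = compute(partition).toVector  // not found: compute, then store
          cache.put((id, partition), data)
          data.iterator
      }

    private def compute(partition: Int): Iterator[Int] = {
      computeCalls += 1
      computeFn(partition)
    }
  }
}
```

Calling `iterator` twice for the same partition of a cached `SketchRdd` runs `computeFn` only once; the second call is served from the cache.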
16. Spark Internals
Delay Scheduling
Reference
M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker and I. Stoica. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, EuroSys 2010, April 2010.
Try to run tasks in the following order of preference:
Node local
Rack local
Any node
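The preference order can be turned into a small decision rule: accept a resource offer only if it is at least as local as the level the scheduler is still willing to wait for, and relax that level as time passes. The sketch below is an assumption-laden illustration, not Spark's TaskSetManager: `Host`, `Task`, `accept` and the 3000 ms wait per level are hypothetical choices.

```scala
object DelaySchedulingSketch {
  // Locality levels in order of preference (lower is better).
  val NodeLocal = 0
  val RackLocal = 1
  val AnyNode   = 2

  final case class Host(name: String, rack: String)
  final case class Task(id: Int, preferred: Set[Host])

  /** Locality level the task would get if it ran on `offer`. */
  def locality(task: Task, offer: Host): Int =
    if (task.preferred.isEmpty || task.preferred.contains(offer)) NodeLocal // no preference counts as local
    else if (task.preferred.exists(_.rack == offer.rack)) RackLocal
    else AnyNode

  /** Worst level currently acceptable; relaxed one step per `waitPerLevelMs` (must be > 0). */
  def allowedLevel(waitedMs: Long, waitPerLevelMs: Long): Int =
    math.min(AnyNode.toLong, waitedMs / waitPerLevelMs).toInt

  /** Accept the offer only if it is local enough for how long we have already waited. */
  def accept(task: Task, offer: Host, waitedMs: Long, waitPerLevelMs: Long = 3000L): Boolean =
    locality(task, offer) <= allowedLevel(waitedMs, waitPerLevelMs)
}
```

With these defaults, a task that prefers host `a` runs there immediately, accepts another host in the same rack after 3 seconds of waiting, and accepts any node after 6 seconds.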
19. Spark Internals
ClosureSerializer
Clean
A function in Scala is a closure
Closure: free variables + function body (compiled to a class)
class A$apply$1 extends Function1[T, U] {
  val $outer: A$outer
  def apply(input: T): U = …
}
class A$outer {
  val N = 100
  val M = (large object)
}
If the closure never accesses M, fill M with null, then serialize the closure.
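The effect of nulling the unused field can be reproduced with hand-written classes and Java serialization. This is a sketch of the idea only: `Outer`, `AddN`, `clean` and `serializedSize` are hypothetical names, the classes are written by hand rather than generated by the compiler, and the "unused" field is hard-coded here instead of being discovered by bytecode analysis.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object ClosureCleanSketch {
  // Hand-written counterpart of the slide's A$outer: N is read by the closure, M is not.
  class Outer(val N: Int, val M: Array[Byte]) extends Serializable

  // Hand-written counterpart of A$apply$1: it holds a reference to the whole outer object.
  class AddN(val outer: Outer) extends (Int => Int) with Serializable {
    def apply(x: Int): Int = x + outer.N
  }

  /** Number of bytes Java serialization produces for `obj`. */
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  /** Build a copy of the closure whose outer object has M nulled out; the original is untouched. */
  def clean(f: AddN): AddN = new AddN(new Outer(f.outer.N, null))
}
```

For an `Outer` holding a 1 MiB array, the uncleaned closure serializes to more than 1 MiB, while the cleaned copy is small and still computes `x + N`.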
20. Spark Internals
Traversing Byte Codes
A closure is compiled to a class in Scala
Traverse the bytecode to find accesses to outer variables
Uses the ASM4 library