2. Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and machine learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Core, SQL, GraphX, Streaming
6) Spark’s distributed programming model
3. Table of Contents
● Distributed programming introduction
● Programming models
● Dataflow systems and DAGs
● RDD
● Transformations, Actions, Persistence, Shared variables
4. Distributed programming
● reminder
○ unreliable network
○ ubiquitous failures
○ everything asynchronous
○ consistency, ordering and synchronisation expensive
○ local time
○ correctness properties: safety and liveness
○ ...
5. Two armies (generals)
● two armies, A (Red) and B (Blue)
● the separated parts A1 and A2 of army A must synchronize their attack to win
● consensus over an unreliable communication channel
● no node failures, no byzantine failures, …
● designated leader
6. Parallel programming models
● Parallel computing models
○ Different parallel computing problems
■ Easily parallelizable or communication needed
○ Shared memory
■ On one machine
● Multiple CPUs/GPUs share memory
■ On multiple machines
● Shared memory accessed via network
● Still much slower than local memory
■ OpenMP, Global Arrays, …
○ Shared nothing
■ Processes communicate by sending messages
■ Send(), Receive()
■ MPI
○ usually no fault tolerance (both models are sketched below)
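Both models fit in a few lines of plain Scala (threads on one JVM, not Spark); a toy sketch, where the 4-way split and all names are illustrative:

import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue}

val data = (1L to 1000L).toArray
val chunks = data.grouped(250).toSeq // 4 workers

// shared memory: all workers write one shared variable, so a lock is needed
var shared = 0L
val lock = new Object
val done = new CountDownLatch(chunks.size)
for (chunk <- chunks) new Thread(() => {
  val partial = chunk.sum
  lock.synchronized { shared += partial }
  done.countDown()
}).start()
done.await()
println(shared) // 500500

// shared nothing: workers only Send() their partial result as a message,
// and a reducer Receive()s one message per worker
val mailbox = new LinkedBlockingQueue[Long]
for (chunk <- chunks) new Thread(() => mailbox.put(chunk.sum)).start()
println(Seq.fill(chunks.size)(mailbox.take()).sum) // 500500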
7. Dataflow system
● term used to describe a general parallel programming approach
● in the traditional von Neumann architecture, instructions are executed sequentially by a worker (CPU) and the data does not move
● in Dataflow, workers have different tasks assigned to them and form an assembly line
● the program is represented by connections and black-box operations - a directed graph
● data moves between tasks
● a task is executed by a worker as soon as its inputs are available
● inherently parallel
● no shared state
● closer to functional programming
● not Spark specific (Stratosphere, MapReduce, Pregel, Giraph, Storm, ...)
8. MapReduce
● shows that Dataflow can be expressed in terms of map and reduce operations (see the word count sketch below)
● simple to parallelize
● but each map-reduce job is separate from the rest, so there is no global view of the pipeline to optimize
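For instance, the canonical word count is a single map phase followed by a reduce phase; a sketch in Spark's RDD notation rather than Hadoop MapReduce (sc is a SparkContext, the paths are illustrative):

sc.textFile("hdfs:///input")            // read lines
  .flatMap(_.split("\\s+"))             // map: split lines into words
  .map(word => (word, 1))               // emit (word, 1) pairs
  .reduceByKey(_ + _)                   // reduce: sum the counts per word
  .saveAsTextFile("hdfs:///wordcounts") // action: write the result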
9. Directed acyclic graph
● Spark is a Dataflow execution engine; its data flows form directed acyclic graphs (DAGs)
● the whole DAG is formed lazily (see the sketch below)
● allows global optimizations
● has the expressiveness of MPI
● lineage tracking
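A minimal sketch of the lazy DAG and lineage tracking, assuming a live SparkContext sc:

// transformations only add nodes to the DAG; no job runs yet
val evenSquares = sc
  .parallelize(1 to 1000)
  .map(x => x * x)
  .filter(_ % 2 == 0)

println(evenSquares.toDebugString) // prints the lineage: filter <- map <- parallelize
evenSquares.count() // the action triggers execution of the whole DAG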
10. Optimizations
● similar to the optimizations of an RDBMS (operation reordering, bushy join-order enumeration, aggregation push-down)
● however, DAGs are less restrictive than database queries, and it is difficult to optimize UDFs (the higher-order functions used in Spark, Flink)
● potentially a major performance improvement
● partial support for optimizing incremental algorithms (local changes) with sparse computational dependencies (GraphX)
11. Optimizations
case class Person(age: Int, height: Double)
val people = (0 to 100).map(x => Person(x, x))

// original pipeline: convert height to centimetres, then filter on age
sc
  .parallelize(people)
  .map(p => Person(p.age, p.height * 2.54))
  .filter(_.age < 35)

// reordered pipeline: the filter reads only age, which the map leaves
// untouched, so it can be pushed below the map - same result, less work
sc
  .parallelize(people)
  .filter(_.age < 35)
  .map(p => Person(p.age, p.height * 2.54))
12. Optimizations
case class Person(age: Int, height: Double)
val people = (0 to 100).map(x => Person(x, x))

// original pipeline: convert height to centimetres, then filter on height
sc
  .parallelize(people)
  .map(p => Person(p.age, p.height * 2.54))
  .filter(_.height < 170)

// reordered pipeline: the filter reads height, which the map rescales,
// so pushing the filter down changes which people survive
sc
  .parallelize(people)
  .filter(_.height < 170)
  .map(p => Person(p.age, p.height * 2.54))
??? - no: here the reorder is invalid, which is why the optimizer must analyze what each UDF actually reads and writes
13. Optimizations
1. logical rewriting: rules applied to trees of operators (e.g. filter push-down)
○ static code analysis (of each UDF's bytecode) to check the reordering rules
○ emits all valid reordered data flow alternatives
2. the logical representation is translated to a physical representation
○ chooses physical execution strategies for each alternative (partitioning, broadcasting, external sorts, merge and hash joins, …)
○ uses a cost-based optimizer (I/O, disk I/O, CPU costs, UDF costs, network)
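For comparison, Spark SQL's Catalyst optimizer performs the same kind of logical rewriting (e.g. filter push-down) on DataFrame queries, although it does not analyze UDF bytecode; a minimal sketch, assuming a Spark 2.x SparkSession named spark:

import spark.implicits._

case class Person(age: Int, height: Double)
val df = (0 to 100).map(x => Person(x, x.toDouble)).toDF()

df.select($"age", ($"height" * 2.54).as("height"))
  .filter($"age" < 35) // reads only age, untouched by the projection above
  .explain(true)       // the optimized logical plan shows the Filter pushed below the Project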
14. Stream optimizations
● similar, because in Spark streams are just mini-batches
● a few extra window and state operations
pageViews = readStream("http://...", "1s") // pseudocode: 1-second mini-batches of page-view events
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b) // running count per URL
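A runnable counterpart using the classic DStream API; a sketch in which the socket source, port, and checkpoint path stand in for the pseudocode's HTTP source:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("pageviews"), Seconds(1)) // 1 s mini-batches
ssc.checkpoint("/tmp/checkpoints") // stateful operators need a checkpoint directory

val pageViews = ssc.socketTextStream("localhost", 9999) // one URL per line
val ones = pageViews.map(url => (url, 1))
val counts = ones.updateStateByKey[Int]( // plays the role of runningReduce above
  (hits: Seq[Int], state: Option[Int]) => Some(state.getOrElse(0) + hits.sum))

counts.print()
ssc.start()
ssc.awaitTermination()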
15. Performance
                    Hadoop                 Spark    Spark
Data size           102.5 TB               100 TB   1000 TB
Time [min]          72                     23       234
Nodes               2100                   206      190
Cores               50400                  6592     6080
Rate/node [GB/min]  0.67                   20.7     22.5
Environment         dedicated data center  EC2      EC2
● fastest open source solution to sort 100 TB of data in the Daytona GraySort benchmark (http://sortbenchmark.org/)
● required some improvements to the shuffle approach
● a highly optimized sorting algorithm (cache locality, unsafe off-heap memory structures, GC, …)
● Databricks blog + presentation
22. Cache
● cache partitions so they can be reused by later actions on the dataset or on datasets derived from it
● the snapshot is used instead of recomputing the lineage
● fault tolerant: a lost partition is recomputed from its lineage
● cache(), persist() (see the sketch below)
● levels
○ memory: deserialized objects on the JVM heap (the default)
○ disk: partitions spilled to local disk
○ both: memory first, disk for whatever does not fit
○ serialized: stored as serialized bytes; more compact, but more CPU to read
○ replicated: each partition kept on two nodes
○ off-heap: stored outside the JVM heap, reducing GC pressure
● shuffle output is cached automatically
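A minimal sketch of persistence in use (the input path and the chosen level are illustrative):

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs:///logs").filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized in memory, spilling to disk
errors.count()  // the first action materializes the cache
errors.take(10) // later actions reuse the cached partitions
errors.unpersist()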
23. Shared variables - broadcast
● usually all variables used in a UDF are copied to each node with every task
● shared read/write variables would be very inefficient
● broadcast
○ read-only variables
○ an efficient broadcast algorithm can deliver the data cheaply to all nodes
val broadcastVar = sc.broadcast(Array(1, 2, 3)) // shipped once per node, not once per task
broadcastVar.value // Array(1, 2, 3)
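A small usage sketch, with an illustrative lookup table broadcast to all nodes:

val countryNames = sc.broadcast(Map("cz" -> "Czech Republic", "de" -> "Germany"))
sc.parallelize(Seq("cz", "de", "cz"))
  .map(code => countryNames.value.getOrElse(code, "unknown")) // executors read the broadcast copy
  .collect() // Array("Czech Republic", "Germany", "Czech Republic")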
24. Shared variables - accumulators
● accumulators
○ add only
○ use an associative operation, so they are efficient in parallel
○ only the driver program can read the value
○ exactly-once semantics are guaranteed only for updates made in actions (updates in transformations may be reapplied on failure and recalculation)
val accum = sc.accumulator(0, "My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) // executors add, the driver reads
accum.value // 10
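To illustrate the exactly-once caveat, a hedged sketch (names illustrative): an accumulator updated inside a transformation is applied again whenever the stage is recomputed.

val seen = sc.accumulator(0, "seen")
val data = sc.parallelize(1 to 4).map { x => seen += 1; x } // update inside a transformation
data.count() // seen.value is 4
data.count() // the uncached map is recomputed, so seen.value becomes 8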
26. Conclusion
● expressive and abstract programming model
● user defined functions
● based on research
● optimizations
● constraining in certain cases (computations spanning partition boundaries, functions of multiple variables, ...)
Speaker notes
● anything can fail (network, nodes, lost or damaged packets, …)
● liveness properties assert that something 'good' will eventually happen during execution
● safety properties assert that nothing 'bad' will ever happen during an execution (that is, that the program will never enter a 'bad' state)
● HPC: shared memory may or may not be a good fit; it depends on the communication patterns, and locks may be needed