In this talk we introduce the notion of distributed computing, then cover the advantages of Spark.
The Spark core content here is brief because the whole explanation was done live using a Spark Notebook (https://github.com/andypetrella/spark-notebook/blob/geek/conf/notebooks/Geek.snb).
This talk was given jointly by @xtordoir and myself at the University of Liège, Belgium.
2. Who are we?
Andy
@Noootsab, I am
@NextLab_be owner
@SparkNotebook creator
@Wajug co-driver
@Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool
Xavier
@xtordoir
SilicoCloud
-> Physics
-> Data analysis
-> Genomics
-> Scalable systems
-> ...
3. So what...
Part I
● What
○ distributed resources
○ data
○ managers
● Why:
○ fastest
○ smartest
○ biggest
● How:
○ Map Reduce
○ Limitations
○ Extensions
Part II
● Spark
○ Model
○ Caching and lineage
○ Master and Workers
○ Core example
● Beyond Processing
○ Streaming
○ SQL
○ GraphX
○ MLlib
○ Example
● Use cases
○ Parallel batch processing of time series
○ ADAM
5. What is a distributed environment
Computations need three kinds of resources:
● CPU
● MEM
● Data storage
However, it's hard to extend each of them at will on a single machine.
6. What is a distributed environment
Lacking one of these will result in higher response times
or reduced accuracy.
Unfortunately, it doesn't matter how parallelized the
algorithm is or how optimized the computations are.
If the solution can’t be inside, it must be outside.
8. Distributed File System
You have 100 nodes in your cluster, but only 1 dataset.
Will you replicate it on all nodes?
Extended case: what if your dataset is 1 zettabyte (10⁹ TB)?
The only solution:
● split the file across the nodes
● adapt the algorithm to access local data subsets
9. HDFS towards Tachyon
Hadoop Distributed File System
Implements GoogleFS
Stores and reads files split and replicated across nodes
1 ZB file ≈ 8 × 10¹² files of 128 MB (10²¹ B / 1.28 × 10⁸ B ≈ 7.8 × 10¹²)
IOPS are expensive and require more CPU clocks than
DRAM accesses
Hence... Tachyon: a memory-centric distributed file system
10. Management
Nodes will fail, jobs cannot: we need resilience.
Resources are generally fewer than the algorithms require: we need scheduling.
Requirements fluctuate: we need elasticity.
11. Mesos and Marathon
Mesos: highly available cluster manager
Nodes: attach or remove them on the fly
Nodes offer resources; applications accept them
Node crash: the application restarts the assigned tasks
Marathon: a meta-application running on Mesos
Application crash: automatically restarted on a different node
12. Why: for everybody and now?
Fastest:
1. Time to result
2. Near real-time processing
13. Why for everybody and now
Runtime is smaller, the dev lifecycle is shorter
→ no synchronization hell
It can even be truly interactive
→ consoles or notebook tools
14. Why for everybody and now
No bottlenecks → newly arriving data is readily available
for processing
This opens the door to online models!
15. Why for everybody and now
Smartest: train more and more models; ensembling lots of
them is no longer a problem.
More complex modelling can be tackled if required.
16. Why for everybody and now
Reaching a higher level of accuracy is tricky and might
require lots and lots of models.
Running a model takes quite some time, especially if the
data has to be read every single time.
Example: the Netflix contest winner (AT&T Labs) ensembled 500 models to gain 10% accuracy.
Although in 2009 it wasn't possible to use it in production, today this could change.
17. Why for everybody and now
Biggest: no need to sample big datasets
…
…
That’s it!
18. How!?
Google's papers stimulated the open-source software community,
hence competitive tools now exist.
In the area of computation in distributed environments, there
are two disruptive papers:
● Google's MapReduce
● Berkeley's Spark
19. How!?
MapReduce (Google white paper, 2004):
A programming model for distributed, data-intensive
computations
It handles parallelization, fault tolerance, data
distribution, and load balancing
20. Functions:
Map ≅ transform data into key-value pairs
Reduce ≅ aggregate key-value pairs per key (e.g. sum,
max, count)
Mappers and Reducers are sent to the data location (nodes); see the sketch below.
How!?
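A minimal sketch of the map/reduce idea on plain Scala collections (no cluster involved; the input lines are made up for illustration):

    // Map: turn each word into a key-value pair (word, 1)
    val lines = Seq("spark is fast", "spark is fun")
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
    // Reduce: aggregate the pairs per key, here with a sum
    val counts = pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2).sum) }
    // counts: Map(spark -> 2, is -> 2, fast -> 1, fun -> 1)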
21. Map
Reduce: apply a binary associative operator to all
elements
Image from RxJava: https://github.com/ReactiveX/RxJava/wiki/Transforming-Observables
How!?
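Associativity is what makes the reduce parallelizable: each partition can be reduced independently and the partial results merged. A tiny Scala check with made-up numbers:

    // Reducing per partition and then merging gives the same result
    // as one global reduce, because + is associative
    val partitions = Seq(Seq(1, 2), Seq(3, 4, 5))
    val merged = partitions.map(_.reduce(_ + _)).reduce(_ + _)  // 15
    val global = partitions.flatten.reduce(_ + _)               // 15, same answer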
22. The Hadoop implementation has some limitations
Mappers and Reducers ship functions to the data, while Java is not a functional
language
⇒ Composability is difficult and more IO/network operations are required
Iterative algorithms (e.g. stochastic gradient descent) have to read the data at each step
(even though the data has not changed, only the parameters)
How!?
23. How!?
Spark: MapReduce on steroids
I) Functional paradigm:
- processes are built lazily from simple concepts
- Map and Reduce are two of them
II) Caches data in memory. No more repeated IO.
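Both points in one minimal Spark sketch (assuming a SparkContext named sc, as in the Spark Notebook; the file path is hypothetical):

    // Transformations are lazy: this only builds the lineage, nothing runs yet
    val lines = sc.textFile("hdfs:///data/events.log")
    val errors = lines.filter(_.contains("ERROR"))
    errors.cache()                // keep the filtered data in memory after the first action
    val total = errors.count()    // first action: reads, filters, and caches
    val recent = errors.filter(_.contains("2015-03")).count() // served from memory, no re-read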
24. So what...
Part I
● What
○ distributed resources
○ data
○ managers
● Why:
○ fastest
○ smartest
○ biggest
● How:
○ Map Reduce
○ Limitations
○ Extensions
Part II
● Spark
○ Model
○ Caching and lineage
○ Master and Workers
○ Core example
● Beyond Processing
○ Streaming
○ SQL
○ GraphX
○ MLlib
○ Example (notebook)
● Use cases
○ Parallel batch processing of time series
○ ADAM
26. RDDs
Think of an RDD[T] as an immutable, distributed collection
of objects of type T
• Resilient => can be reconstructed in case of failure
• Distributed => transformations are parallelizable operations
• Dataset => data is loaded and partitioned across cluster nodes (executors)
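A small sketch of these three properties (assuming a SparkContext named sc; the numbers are illustrative):

    import org.apache.spark.rdd.RDD
    // Dataset: partitioned across the executors (8 partitions here)
    val nums: RDD[Int] = sc.parallelize(1 to 1000000, 8)
    // Distributed: the transformation runs in parallel, one task per partition
    val squares = nums.map(n => n.toLong * n)
    // Resilient: a lost partition is recomputed from this lineage, not restored from a copy
    val sum = squares.reduce(_ + _)  // action that triggers the computation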
31. Spark Streaming
When you have big fat streams behaving as one single
collection
[Diagram: a DStream[T] is a sequence of RDD[T] micro-batches along the time axis t]
DStreams: Discretized Streams (= sequence of RDDs)
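A minimal Spark Streaming sketch (the socket source, app name, and 5-second batch interval are made-up choices for illustration):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))     // one RDD[String] every 5 seconds
    val lines = ssc.socketTextStream("localhost", 9999)  // DStream[String]
    // The usual RDD-style transformations, applied to every micro-batch
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()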
35. Use case examples
- Parallel batch processing of time series
- Bayesian Network in financial market
- IoT platform (Lambda architecture)
- OpenStreetMap cities topologies classification
- Markov Chain in Land Use/Land Cover prediction
- Genomics: ADAM
41. Mashup
Prediction
Sample [NA20332] is in cluster #0 for population Some(ASW)
Sample [NA20334] is in cluster #2 for population Some(ASW)
Sample [HG00120] is in cluster #2 for population Some(GBR)
Sample [NA18560] is in cluster #1 for population Some(CHB)
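These assignments read like the output of a clustering model. A hedged sketch of how such lines could be produced with MLlib's KMeans (the feature vectors below are made up; the real ones were derived from 1000genomes genotypes, and sc is an assumed SparkContext):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Hypothetical genotype-derived feature vectors, one per sample
    val features = sc.parallelize(Seq(
      Vectors.dense(0.0, 1.0), Vectors.dense(0.1, 0.9), Vectors.dense(5.0, 4.8)
    )).cache()
    val model = KMeans.train(features, 3, 20)  // k = 3 clusters, 20 iterations
    val cluster = model.predict(Vectors.dense(0.05, 0.95))
    println(s"Sample [NA20332] is in cluster #$cluster")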
44. Eggo project (public genomics data in ADAM format on S3)
We…
1000genomes in ADAM format on S3
Open-source GA4GH interop services implementation
Machine learning on 1000genomes
Genomic data and distributed computing
45. The end (of the slides)
Thanks for your attention!
Xavier Tordoir
xavier@silicocloud.eu
Andy Petrella
andy.petrella@nextlab.be