BigData, newborn technologies
evolving fast. Why Apache Spark
outruns Apache Hadoop
Andy Petrella, Nextlab
Xavier Tordoir, SilicoCloud
Andy
@Noootsab, I am
@NextLab_be owner
@SparkNotebook creator
@Wajug co-driver
@Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool
Who are we?
Xavier
@xtordoir
SilicoCloud
-> Physics
-> Data analysis
-> genomics
-> scalable systems
-> ...
So what...
Part I
● What
○ distributed resources
○ data
○ managers
● Why:
○ fastest
○ smartest
○ biggest
● How:
○ Map Reduce
○ Limitations
○ Extensions
PART II
● Spark
○ Model
○ Caching and lineage
○ Master and Workers
○ Core example
● Beyond Processing
○ Streaming
○ SQL
○ GraphX
○ MLlib
○ Example
● Use cases
○ Parallel batch processing of
timeseries
○ ADAM
Part I: The Distributed Age
What is a distributed environment
Computations need three kinds of resources:
● CPU
● MEM
● Data storage
However, it’s hard to extend each of them at will on a single
machine
What is a distributed environment
A shortage of any one of these results in higher response time
or reduced accuracy.
Unfortunately, it doesn’t matter how parallelized the
algorithm is or how optimized the computations are
If the solution can’t be inside, it must be outside.
What is a distributed environment
Distributed File System
You have 100 nodes in your cluster, but only 1 dataset.
Will you replicate it on all nodes?
Extended case: what if your dataset is 1 zettabyte (10⁹ TB)?
The only solution:
● split the file across the nodes
● gear the algorithm towards accessing local data subsets
HDFS towards Tachyon
Hadoop Distributed File System
Implements the design of the Google File System (GoogleFS)
Stores and reads files split into blocks and replicated across nodes
1 ZB file ≈ 8E12 x 128 MB blocks
Disk IO operations are expensive and cost far more CPU cycles than
DRAM access
Hence... Tachyon: memory-centric distributed file system
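The 8E12 figure above can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal units (1 ZB = 10^21 bytes, 128 MB = 128 x 10^6 bytes). The replication factor of 3 is an assumption for illustration (the slide only says files are replicated), and block_count is a hypothetical helper name.

```python
ZETTABYTE = 10**21        # bytes, decimal SI units (assumption)
BLOCK_SIZE = 128 * 10**6  # bytes in one 128 MB block
REPLICATION = 3           # assumed replication factor, not given in the slides


def block_count(file_bytes, block_bytes):
    """Number of blocks needed to hold file_bytes (ceiling division)."""
    return -(-file_bytes // block_bytes)


blocks = block_count(ZETTABYTE, BLOCK_SIZE)
stored = blocks * REPLICATION  # block copies the cluster has to keep
print(f"{blocks:.2e} blocks, {stored:.2e} stored block copies")
```

With these assumptions the block count comes out slightly under 8E12, which matches the rounded figure on the slide.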
Nodes will fail, jobs cannot
We need resilience
Management
Resources are generally scarcer than the algorithm requires.
We need scheduling
The requirements are fluctuating
We need elasticity
Mesos and Marathon
Mesos: highly available cluster manager
Nodes: attach or remove them on the fly
Nodes are offering resources -- Applications accept them
Node crash: the application restarts the assigned tasks
Marathon: Meta application on Mesos
Application crash: automatically restarted on different node
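The offer cycle described above (nodes offer resources, applications accept them) can be pictured with a toy model. This is a plain-Python sketch, not the Mesos API: Offer, Task, accept_offers and the greedy first-fit policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    node: str
    cpus: int
    mem_gb: int


@dataclass
class Task:
    name: str
    cpus: int
    mem_gb: int


def accept_offers(offers, tasks):
    """Greedy first-fit: for each offer, launch every pending task that still fits.

    Returns (placement, pending): a task-name -> node mapping and the names
    of the tasks that no offer could host.
    """
    placement = {}
    pending = list(tasks)
    for offer in offers:
        cpus, mem = offer.cpus, offer.mem_gb
        still_pending = []
        for task in pending:
            if task.cpus <= cpus and task.mem_gb <= mem:
                placement[task.name] = offer.node
                cpus -= task.cpus
                mem -= task.mem_gb
            else:
                still_pending.append(task)
        pending = still_pending
    return placement, [task.name for task in pending]


offers = [Offer("node-1", cpus=4, mem_gb=15), Offer("node-2", cpus=2, mem_gb=8)]
tasks = [Task("a", 2, 8), Task("b", 2, 8), Task("c", 2, 4), Task("d", 8, 1)]
placement, pending = accept_offers(offers, tasks)
print(placement, pending)
```

In this toy model a node crash would be handled by putting the tasks placed on that node back into the pending list and waiting for the next round of offers.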
Why: for everybody and now?
Fastest:
1. Time to result
2. Near real time processing
Runtime is shorter, dev lifecycle is shorter
→ no synchronization-hell
It can even be really interactive
→ console or notebook tools.
Why for everybody and now
No bottlenecks → incoming data is readily available for
processing
Opens the doors for online models!
Why for everybody and now
Smartest: train more and more models, ensembling lots of
them is no longer a problem
More complex modelling can be tackled if required
Why for everybody and now
Reaching a higher level of accuracy is tricky and might
require lots and lots of models.
Running a model takes quite some time, especially if the
data has to be read every single time.
Example: the Netflix contest winner (AT&T Labs) ensembled 500 models to gain 10% accuracy.
Although it could not be used in production in 2009, today this could change.
Why for everybody and now
Biggest: no need for sampling big datasets
…
…
That’s it!
How!?
Google papers stimulated the open software community,
hence competitive tools now exist.
In the area of computation in distributed environments, there
are two disruptive papers:
● Google’s MapReduce
● Berkeley’s Spark
How!?
MapReduce (Google white paper 2004):
Programming model for distributed data intensive
computations
Helps deal with parallelization, fault-tolerance, data
distribution, load balancing
Functions:
Map ≅ transform data to key value pairs
Reduce ≅ aggregate key value pairs per key (e.g. sum,
max, count)
Mappers and Reducers are sent to data location (nodes)
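The two functions above can be illustrated with the classic word count. This is a single-process, plain-Python sketch of the programming model only, not Hadoop or Spark code: map_phase, shuffle, reduce_phase and word_count are hypothetical names, and the shuffle step (grouping values by key between Map and Reduce) is made explicit here even though the slide does not list it.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: turn one input line into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values that share a key, as happens between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, op):
    """Reduce: aggregate the values of each key with a binary operator."""
    return {key: reduce(op, values) for key, values in groups.items()}


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs), lambda a, b: a + b)


counts = word_count(["spark outruns hadoop", "Spark caches data"])
print(counts)
```

Swapping the lambda for max or min gives the other aggregations mentioned above without touching the map side.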
How!?
Map
Reduce: apply a binary associative operator on all
elements
Image from RxJava: https://github.com/ReactiveX/RxJava/wiki/Transforming-Observables
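Why the operator has to be associative: each node can then reduce its own partition and only the partial results need to be combined. A minimal plain-Python sketch, where partitioned_reduce is a hypothetical helper and the partitions are made-up data:

```python
import operator
from functools import reduce


def partitioned_reduce(partitions, op):
    """Reduce every non-empty partition locally, then combine the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


partitions = [[1, 2], [3], [4, 5, 6]]
flat = [x for part in partitions for x in part]

# Associative operators: partitioned and sequential results agree.
same_sum = partitioned_reduce(partitions, operator.add) == reduce(operator.add, flat)
same_max = partitioned_reduce(partitions, max) == reduce(max, flat)

# Subtraction is not associative: the grouping changes the answer.
same_sub = partitioned_reduce(partitions, operator.sub) == reduce(operator.sub, flat)
print(same_sum, same_max, same_sub)
```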
How!?
The Hadoop implementation has some limitations
Mappers and Reducers ship functions to the data, yet Java is not a functional
language
⇒ Composability is difficult and more IO/network operations are required
Iterative algorithms (e.g. stochastic gradient descent) have to re-read the data at each step
(although only the parameters have changed, not the data)
How!?
MapReduce on steroids
I) Functional paradigm:
- process built lazily based on simple concepts
- Map and Reduce are two of them
II) Cache data in memory: no repeated IO between steps.
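The benefit of caching for iterative algorithms can be made visible by counting reads. This is a toy, plain-Python sketch: CountingSource stands in for slow storage, and the algorithm is full-batch gradient descent on a one-parameter least-squares problem (chosen for brevity, not the stochastic variant). All names and data are assumptions for illustration.

```python
class CountingSource:
    """Stand-in for slow storage that counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def gradient_descent(get_data, steps=50, lr=0.1):
    """Minimise the mean of (w - x)^2; the optimum is the mean of the data."""
    w = 0.0
    for _ in range(steps):
        data = get_data()  # the data never changes, only w does
        grad = sum(2 * (w - x) for x in data) / len(data)
        w -= lr * grad
    return w


values = [1.0, 2.0, 3.0, 4.0]

reloading = CountingSource(values)
w_reload = gradient_descent(reloading.load)  # reads storage at every step

cached_source = CountingSource(values)
cached = cached_source.load()                # read once, keep in memory
w_cached = gradient_descent(lambda: cached)

print(reloading.reads, cached_source.reads, w_reload, w_cached)
```

Both runs reach the same parameter; only the number of reads differs.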
Part II: Spark to the Rescue
RDDs
Think of an RDD[T] as an immutable, distributed collection
of objects of type T
• Resilient => Can be reconstructed in case of failure
• Distributed => Transformations are parallelizable
operations
• Dataset => Data loaded and partitioned across cluster
nodes (executors)
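The "immutable" and "resilient" properties can be pictured with a toy collection that never stores results, only the recipe to rebuild them. This is a plain-Python sketch, not the Spark API: ToyDataset, its methods and the lineage tuple are hypothetical, and nothing here is distributed.

```python
class ToyDataset:
    """Immutable, lazily evaluated collection that remembers its lineage."""

    def __init__(self, compute, lineage):
        self._compute = compute  # zero-argument function that rebuilds the data
        self.lineage = lineage   # names of the steps that lead to this dataset

    @classmethod
    def from_list(cls, items):
        frozen = tuple(items)
        return cls(lambda: list(frozen), ("source",))

    def map(self, fn):
        # Returns a new dataset; nothing is computed yet.
        return ToyDataset(lambda: [fn(x) for x in self._compute()],
                          self.lineage + ("map",))

    def filter(self, pred):
        return ToyDataset(lambda: [x for x in self._compute() if pred(x)],
                          self.lineage + ("filter",))

    def collect(self):
        # Replays the whole lineage, so a lost result can always be rebuilt.
        return self._compute()


base = ToyDataset.from_list(range(1, 6))
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.lineage, even_squares.collect())
```

Because every transformation returns a new object, base is untouched and even_squares can be recomputed from its lineage as many times as needed.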
RDD[T]
Data distribution hierarchy:
- RDD[T]
- Executors
- Partitions
- Elements
[Figure: the elements x1 to x17 of an RDD[T] grouped into ten partitions, e.g. [ x1, x2 ], [ x8,x5,x6 ] and [ x15 ], spread over Executor 1 to Executor 4]
Execution
Execution is split in fundamental units: Tasks
Tasks running in parallel are grouped in Stages
Execution
[Figure: three cores working in parallel through Stage0, Stage1 and Stage2. In every stage Core1 runs Task0, Core2 runs Task1 and Core3 runs Task2; each task is a read/process/write step.]
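The picture above (one task per core, stage after stage) can be imitated on a single machine with a thread pool. This is a simplified plain-Python sketch, not Spark's scheduler: run_stage and run_job are hypothetical names, each task handles exactly one partition, and a stage only starts when the previous one has finished. In Spark itself stage boundaries come from shuffles, which this sketch does not model.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(task, partitions, cores=3):
    """Run one task per partition in parallel and keep the partition order."""
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(task, partitions))


def run_job(stages, partitions, cores=3):
    """Run the stages one after another; each consumes the previous output."""
    for task in stages:
        partitions = run_stage(task, partitions, cores)
    return partitions


stages = [
    lambda part: [x + 1 for x in part],  # Stage0
    lambda part: [x * 2 for x in part],  # Stage1
    lambda part: [sum(part)],            # Stage2
]
result = run_job(stages, [[1, 2], [3], [4, 5, 6]])
print(result)
```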
Master and Workers
Spark Streaming
When you have big fat streams behaving as one single
collection
[Figure: a DStream[T] drawn along the time axis t as a sequence of RDD[T]]
DStreams: Discretized Streams (= Sequence of RDDs)
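Discretizing a stream means cutting it into small batches by time interval. A minimal plain-Python sketch, not the Spark Streaming API: discretize is a hypothetical function, events are assumed to be (timestamp in seconds, value) pairs, and each returned list plays the role of one RDD.

```python
def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Intervals without events yield an empty batch, so the result is one
    batch per interval between the first and the last event.
    """
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // batch_seconds), []).append(value)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    return [batches.get(i, []) for i in range(first, last + 1)]


events = [(0.1, "a"), (0.9, "b"), (1.2, "c"), (3.5, "d")]
stream = discretize(events, batch_seconds=1)
print(stream)
```

Any batch-oriented function can then be applied to every element of the sequence, which is what lets a stream be handled like one single collection.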
Spark SQL
Mapping: RDD -> “table”, Element Field -> “column”
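The mapping above can be pictured with any SQL engine: a collection of records becomes a table, and each record field becomes a column. This sketch uses Python's built-in sqlite3 purely as an analogy and is not Spark SQL. The Reading record type, the readings table and the data are hypothetical.

```python
import sqlite3
from collections import namedtuple

# Hypothetical element type: one record per sensor reading.
Reading = namedtuple("Reading", ["sensor", "city", "value"])


def to_table(conn, name, records):
    """Create a table whose columns are the fields of the records."""
    fields = records[0]._fields
    conn.execute(f"CREATE TABLE {name} ({', '.join(fields)})")
    placeholders = ", ".join("?" for _ in fields)
    conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", records)


records = [
    Reading("s1", "Liege", 3.0),
    Reading("s2", "Liege", 5.0),
    Reading("s3", "Namur", 4.0),
]
conn = sqlite3.connect(":memory:")
to_table(conn, "readings", records)
rows = conn.execute(
    "SELECT city, COUNT(*), AVG(value) FROM readings GROUP BY city ORDER BY city"
).fetchall()
print(rows)
```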
MLlib: Distributed ML
Classification
● linear SVM, logistic regression, classification trees, naive Bayes models
Regression
● SVM, regression trees, linear regression (regularized)
Clustering & dimensionality reduction
● singular value decomposition, PCA, k-means clustering
“The library to teach them all”
GraphX
Connecting the dots
Graph processing at scale.
> Take edges
> Link nodes
> Combine/Send messages
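The three steps above (take edges, link nodes, combine/send messages) are enough to label connected components. This is a single-machine, plain-Python sketch in the spirit of message passing, not the GraphX API: connected_components is a hypothetical function, and every node ends up labelled with the smallest node id it can reach.

```python
def connected_components(edges):
    """Label each node with the smallest node id in its connected component."""
    # Link nodes: every node starts with its own id as label.
    labels = {}
    for a, b in edges:
        labels.setdefault(a, a)
        labels.setdefault(b, b)

    changed = True
    while changed:
        changed = False
        # Send: along every edge, each endpoint receives the other's label.
        inbox = {}
        for a, b in edges:
            inbox.setdefault(a, []).append(labels[b])
            inbox.setdefault(b, []).append(labels[a])
        # Combine: keep the smallest label received, if it improves.
        for node, messages in inbox.items():
            best = min(messages)
            if best < labels[node]:
                labels[node] = best
                changed = True
    return labels


components = connected_components([(1, 2), (2, 3), (4, 5)])
print(components)
```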
Use case examples
- Parallel batch processing of time series
- Bayesian Network in financial market
- IoT platform (Lambda architecture)
- OpenStreetMap cities topologies classification
- Markov Chain in Land Use/Land Cover prediction
- Genomics: ADAM
Genomics
Biological systems are very complex
One human sequence is 60Gb
ADAM
Credits: AmpLab (UC Berkeley)
Stratification using 1000Genomes
http://www.1000genomes.org/
ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
Machine Learning model
Clustering: KMeans
ref: http://en.wikipedia.org/wiki/K-means_clustering
Machine Learning model
MLlib, KMeans
MLlib:
● Machine Learning Algorithms
● Data structures (e.g. Vector)
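For readers who have not met k-means before, the algorithm itself fits in a few lines. This is a plain-Python sketch of the basic iteration (assign each point to the nearest centroid, then move each centroid to the mean of its points). It is not the MLlib implementation; the points, the initial centroids and the fixed iteration count are made up for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[assign(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids, assign((1, 1), centroids))
```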
Mashup
prediction
Sample [NA20332] is in cluster #0 for population Some(ASW)
Sample [NA20334] is in cluster #2 for population Some(ASW)
Sample [HG00120] is in cluster #2 for population Some(GBR)
Sample [NA18560] is in cluster #1 for population Some(CHB)
Mashup
Population   #0   #1   #2
GBR           0    0   89
ASW          54    0    7
CHB           0   97    0
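A table like the one above is a cross-tabulation of (population, cluster) pairs. The sketch below is plain Python with hypothetical helper names (crosstab, dominant_cluster). The four pairs come from the sample lines printed earlier and the counts are copied from the table, so no new data is introduced.

```python
from collections import Counter

# (population, cluster) pairs from the four sample lines shown earlier.
assignments = [("ASW", 0), ("ASW", 2), ("GBR", 2), ("CHB", 1)]


def crosstab(pairs):
    """Count how many samples fall in each (population, cluster) cell."""
    return Counter(pairs)


# Counts copied from the table: samples per cluster #0, #1, #2.
table = {"GBR": [0, 0, 89], "ASW": [54, 0, 7], "CHB": [0, 97, 0]}


def dominant_cluster(counts):
    """Return (cluster index, share of the population found in that cluster)."""
    best = max(range(len(counts)), key=lambda i: counts[i])
    return best, counts[best] / sum(counts)


for population, counts in table.items():
    cluster, share = dominant_cluster(counts)
    print(f"{population}: cluster #{cluster}, {share:.0%} of samples")
```

GBR and CHB each fall entirely into one cluster, while ASW is split between clusters #0 and #2.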
Cluster
40 m3.xlarge instances
160 cores + 600 GB RAM
Eggo project (public genomics data in ADAM format on S3)
We…
1000genomes in ADAM format on S3.
Open Source GA4GH Interop services implementation
Machine learning on 1000genomes
Genomic data and distributed computing
The end (of the slides)
Thanks for your attention!
Xavier Tordoir
xavier@silicocloud.eu
Andy Petrella
andy.petrella@nextlab.be

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

What is Distributed Computing, Why we use Apache Spark

  • 1. BigData, newborn technologies evolving fast. Why Apache Spark outruns Apache Hadoop Andy Petrella, Nextlab Xavier Tordoir, SilicoCloud
  • 2. Andy @Noootsab, I am @NextLab_be owner @SparkNotebook creator @Wajug co-driver @Devoxx4Kids organizer Maths & CS Data lover: geo, open, massive Fool Who are we? Xavier @xtordoir SilicoCloud -> Physics -> Data analysis -> genomics -> scalable systems -> ...
  • 3. So what... Part I ● What ○ distributed resources ○ data ○ managers ● Why: ○ fastest ○ smartest ○ biggest ● How: ○ Map Reduce ○ Limitations ○ Extensions PART II ● Spark ○ Model ○ Caching and lineage ○ Master and Workers ○ Core example ● Beyond Processing ○ Streaming ○ SQL ○ GraphX ○ MLlib ○ Example ● Use cases ○ Parallel batch processing of timeseries ○ ADAM
  • 4. Part I: The Distributed Age
  • 5. What is a distributed environment Computations need three kinds of resources: ● CPU ● MEM ● Data storage However, it’s hard to extend each of them at will on a single machine
  • 6. What is a distributed environment Lacking any one of these results in higher response time or reduced accuracy. Unfortunately, it doesn’t matter how parallelized the algorithm is or how optimized the computations are. If the solution can’t be inside, it must be outside.
  • 7. What is a distributed environment
  • 8. Distributed File System You have 100 nodes in your cluster, but only 1 dataset. Will you replicate it on all nodes? Extended case: your dataset is 1 Zettabyte (10⁹ TB)? The only solution: ● split the file across nodes ● adapt the algorithm to access local data subsets
  • 9. HDFS towards Tachyon Hadoop Distributed File System Implements GoogleFS Stores and reads files split and replicated across nodes 1 ZB file = 8E12 x 128 MB files IOPs are expensive and require more CPU clocks than DRAM access Hence... Tachyon: memory-centric distributed file system
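The block count quoted on slide 9 can be checked with a few lines of Python. This is a minimal sketch that assumes decimal units (1 ZB = 10^21 bytes, 1 MB = 10^6 bytes) and the 128 MB block size from the slide; the variable names are illustrative only.

```python
# Assumption: decimal units, matching "1 Zettabyte (10^9 TB)" on slide 8.
ZETTABYTE = 10**21            # bytes
BLOCK_SIZE = 128 * 10**6      # bytes, the 128 MB block size from slide 9

# Number of 128 MB blocks needed to hold a 1 ZB file (before replication).
blocks = ZETTABYTE // BLOCK_SIZE
print(blocks)  # 7812500000000, i.e. roughly 8E12
```

Replication would multiply this count by the replication factor, which the slides do not specify.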
  • 10. Management Nodes will fail, jobs cannot. We need resilience. Resources are generally fewer than the algorithm requires. We need scheduling. The requirements fluctuate. We need elasticity.
  • 11. Mesos and Marathon Mesos: Highly available cluster manager. Nodes: attach or remove them on the fly. Nodes offer resources, applications accept them. Node crash: the application restarts the assigned tasks. Marathon: meta-application on Mesos. Application crash: automatically restarted on a different node.
  • 12. Why: for everybody and now? Fastest: 1. Time to result 2. Near real-time processing
  • 13. Why for everybody and now Runtime is smaller, dev lifecycle is shorter → no synchronization-hell. It can even be really interactive → console or notebook tools.
  • 14. Why for everybody and now No bottlenecks → new-coming data are readily available for processing Opens the doors for online models!
  • 15. Why for everybody and now Smartest: train more and more models; ensembling lots of them is no longer a problem. More complex modelling can be tackled if required.
  • 16. Why for everybody and now Reaching a higher level of accuracy is tricky and might require lots and lots of models. Running a model takes quite some time, especially if the data has to be read every single time. Example: Netflix contest winner (AT&T labs) ensembled 500 models to gain 10% accuracy. Although in 2009 it wasn’t possible to use it in production, today this could change.
  • 17. Why for everybody and now Biggest: no need for sampling big datasets … … That’s it!
  • 18. How!? Google papers stimulated the open-source community, hence competitive tools now exist. In the area of computation in distributed environments, there are two disruptive papers: ● Google’s MapReduce ● Berkeley’s Spark
  • 19. How!? MapReduce (Google white paper 2004): Programming model for distributed data-intensive computations. Helps deal with parallelization, fault-tolerance, data distribution, load balancing
  • 20. How!? Functions: Map ≅ transform data to key-value pairs. Reduce ≅ aggregate key-value pairs per key (e.g. sum, max, count). Mappers and Reducers are sent to the data location (nodes).
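The Map and Reduce roles described on slide 20 can be sketched in plain Python. This is not Hadoop or Spark code; it is a single-process illustration, and the function names `map_phase` and `reduce_phase` as well as the input lines are made up for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce: aggregate the values of each key (here, a sum)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical input records; in a cluster each node would map its local lines.
lines = ["spark outruns hadoop", "hadoop stores data", "spark caches data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(pairs)
print(counts)
```

In a real cluster the grouping by key happens in a shuffle step between the two phases; here a dictionary plays that role.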
  • 21. How!? Map Reduce: apply a binary associative operator on all elements. Image from RxJava: https://github.com/ReactiveX/RxJava/wiki/Transforming-Observables
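Why the operator on slide 21 must be associative can be shown with a short Python sketch: each partition is reduced on its own, then the partial results are reduced again, and the answer matches a single sequential reduce. The partition contents are made-up values for illustration.

```python
from functools import reduce
import operator

# Hypothetical data, already split into three partitions (one per node).
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Step 1: each node reduces its own partition locally.
partials = [reduce(operator.add, part) for part in partitions]

# Step 2: the partial results are combined with the same operator.
total = reduce(operator.add, partials)

# Because addition is associative, grouping does not change the result.
flat = [x for part in partitions for x in part]
assert total == reduce(operator.add, flat)
print(partials, total)  # [6, 9, 30] 45
```

A non-associative operator such as subtraction would give different answers depending on how the data is partitioned.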
  • 22. How!? The Hadoop implementation has some limitations. Mappers and Reducers ship functions to data, while Java is not a functional language ⇒ composability is difficult and more IO/network operations are required. Iterative algorithms (e.g. stochastic gradient) have to read the data at each step (while the data has not changed, only the parameters).
  • 23. How!? MapReduce on steroids I) Functional paradigm: - process built lazily based on simple concepts - Map and Reduce are two of them II) Cache data in memory. No more IO.
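The two ideas on slide 23, a lazily built process and in-memory caching, can be illustrated with a toy class. This is a sketch in plain Python, not the Spark API: `LazyDataset`, `load` and `stats` are hypothetical names, and the read counter only stands in for expensive IO.

```python
class LazyDataset:
    """Records transformations and runs them only when collect() is called."""

    def __init__(self, source, steps=()):
        self._source = source      # callable that reads the raw data
        self._steps = steps        # recorded ("map" | "filter", function) pairs
        self._use_cache = False
        self._cached = None

    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def cache(self):
        self._use_cache = True     # keep the result in memory after first use
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached    # no IO, no recomputation
        data = list(self._source())
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._use_cache:
            self._cached = data
        return data

stats = {"reads": 0}

def load():
    stats["reads"] += 1            # stands in for an expensive disk read
    return range(10)

evens = LazyDataset(load).map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()
reads_before = stats["reads"]      # still 0: nothing has been executed yet
first = evens.collect()
second = evens.collect()           # served from memory
print(first, stats["reads"])       # [0, 4, 16, 36, 64] 1
```

An iterative algorithm that calls `collect()` at every step would read the source once instead of once per iteration, which is the limitation of the Hadoop implementation that slide 22 points out.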
  • 24. So what... Part I ● What ○ distributed resources ○ data ○ managers ● Why: ○ fastest ○ smartest ○ biggest ● How: ○ Map Reduce ○ Limitations ○ Extensions PART II ● Spark ○ Model ○ Caching and lineage ○ Master and Workers ○ Core example ● Beyond Processing ○ Streaming ○ SQL ○ GraphX ○ MLlib ○ Example (notebook) ● Use cases ○ Parallel batch processing of timeseries ○ ADAM
  • 25. Part II: Spark to the Rescue
  • 26. RDDs Think of an RDD[T] as an immutable, distributed collection of objects of type T • Resilient => Can be reconstructed in case of failure • Distributed => Transformations are parallelizable operations • Dataset => Data loaded and partitioned across cluster nodes (executors)
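The "Resilient" and "Distributed" properties on slide 26 can be sketched without Spark: data is split into partitions, the list of transformations (the lineage) is kept, and a lost partition is rebuilt by replaying that lineage on its source partition only. All names below are illustrative assumptions, not Spark identifiers.

```python
def partition(data, n):
    """Split a list into n partitions (round-robin)."""
    return [data[i::n] for i in range(n)]

# Source data spread over three partitions.
source = partition(list(range(1, 11)), 3)   # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]

# Lineage: the ordered transformations that produce the derived dataset.
lineage = [lambda x: x + 1, lambda x: x * 2]

def compute(part):
    """Replay the lineage on one partition."""
    for fn in lineage:
        part = [fn(x) for x in part]
    return part

derived = [compute(p) for p in source]

# Simulate a node failure: partition 1 of the derived data is lost.
derived[1] = None

# Only the lost partition is recomputed, from its source and the lineage.
derived[1] = compute(source[1])
print(derived[1])  # [6, 12, 18]
```

Immutability matters here: because the source partition never changes, replaying the same functions always yields the same result.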
  • 27. RDD[T] Data distribution hierarchy: RDD[T] - Executors - Partitions - Elements. [Diagram: elements x1 to x17 grouped into partitions such as [ x1, x2 ] and [ x8,x5,x6 ], spread over Executors 1 to 4]
  • 28. Execution Execution is split in fundamental units: Tasks Tasks running in parallel are grouped in Stages
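The relation between tasks and a stage on slide 28 can be sketched as one task per partition, all submitted together. This is a rough analogy using threads from the Python standard library in place of cluster executors; `run_task` and the partition values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of one dataset.
partitions = [[1, 2], [3, 4, 5], [6]]

def run_task(part):
    """One task: apply the stage's work to a single partition."""
    return [x * 10 for x in part]

# A stage: the group of tasks that can run in parallel, one per partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    stage_output = list(pool.map(run_task, partitions))

print(stage_output)  # [[10, 20], [30, 40, 50], [60]]
```

`pool.map` returns results in partition order, so the output lines up with the input partitions even though the tasks may finish in any order.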
  • 31. Spark Streaming When you have big fat streams behaving as one single collection. DStreams: Discretized Streams (= Sequence of RDDs) [Diagram: a DStream[T] drawn as successive RDD[T] along the time axis t]
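The "sequence of RDDs" idea on slide 31 can be approximated by cutting a stream into small batches and processing each batch as an ordinary collection. This sketch batches by element count for simplicity, whereas Spark Streaming batches by time interval; `discretize` is a hypothetical helper, not a Spark function.

```python
from itertools import islice

def discretize(stream, batch_size):
    """Cut an (possibly endless) iterable into consecutive lists of batch_size."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical event stream; each batch is then handled like a small dataset.
events = range(1, 8)
batch_sums = [sum(batch) for batch in discretize(events, 3)]
print(batch_sums)  # [6, 15, 7]
```

The same per-batch function could be reused on a static dataset, which is why a stream can be treated as "one single collection".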
  • 32. Spark SQL Mapping: RDD -> “table”, Element Field -> “column”
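The mapping on slide 32 (collection → table, element field → column) can be illustrated with Python's built-in `sqlite3` module standing in for the SQL engine. This is an analogy only and not Spark SQL code; the `Reading` record type and its values are made up for the example.

```python
import sqlite3
from collections import namedtuple

# Each element of the collection has named fields ...
Reading = namedtuple("Reading", ["sensor", "value"])
records = [Reading("a", 1.5), Reading("b", 2.0), Reading("a", 3.5)]

# ... and each field becomes a column of the table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", records)

rows = conn.execute(
    "SELECT sensor, SUM(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)  # [('a', 5.0), ('b', 2.0)]
conn.close()
```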
  • 33. MLlib: Distributed ML Classification ● linear SVM, logistic regression, classification trees, naive Bayes models Regression ● SVM, regression trees, linear regression (regularized) Clustering & dimensionality reduction ● singular value decomposition, PCA, k-means clustering “The library to teach them all”
  • 34. GraphX Connecting the dots Graph processing at scale. > Take edges > Link nodes > Combine/Send messages
  • 35. Use cases examples - Parallel batch processing of time series - Bayesian Network in financial market - IoT platform (Lambda architecture) - OpenStreetMap cities topologies classification - Markov Chain in Land Use/Land Cover prediction - Genomics: ADAM
  • 36. Genomics Biological systems are very complex One human sequence is 60Gb
  • 38. Stratification using 1000Genomes http://www.1000genomes.org/ ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
  • 39. Machine Learning model Clustering: KMeans ref: http://en.wikipedia.org/wiki/K-means_clustering
  • 40. Machine Learning model MLlib, KMeans MLlib: ● Machine Learning Algorithms ● Data structures (e.g. Vector)
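For readers unfamiliar with the clustering model named on slides 39 and 40, the following is a minimal single-machine sketch of k-means (Lloyd's iterations) in plain Python. It does not use MLlib, and the points and starting centroids are invented for illustration.

```python
import math

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)   # keep an empty cluster where it was
    return new_centroids

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        labels = assign(points, centroids)
        centroids = update(points, labels, centroids)
    return centroids, assign(points, centroids)

# Hypothetical 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The assignment step is independent per point, which is what makes the algorithm a natural fit for partitioned data; only the centroid update needs an aggregation across partitions.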
  • 41. Mashup prediction Sample [NA20332] is in cluster #0 for population Some( ASW) Sample [NA20334] is in cluster # 2 for population Some( ASW) Sample [HG00120] is in cluster # 2 for population Some( GBR) Sample [NA18560] is in cluster # 1 for population Some( CHB)
  • 42. Mashup
         #0  #1  #2
    GBR   0   0  89
    ASW  54   0   7
    CHB   0  97   0
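A population-by-cluster table like the one on slide 42 is a simple count over (population, cluster) pairs. The sketch below builds such a table from the four sample predictions printed on slide 41 only, so its counts are much smaller than those on slide 42; the variable names are illustrative.

```python
from collections import Counter

# (sample, population, cluster) triples taken from the output on slide 41.
predictions = [
    ("NA20332", "ASW", 0),
    ("NA20334", "ASW", 2),
    ("HG00120", "GBR", 2),
    ("NA18560", "CHB", 1),
]

# Count how many samples of each population fall in each cluster.
table = Counter((population, cluster) for _, population, cluster in predictions)

for population in sorted({p for _, p, _ in predictions}):
    row = [table[(population, k)] for k in range(3)]
    print(population, row)
```

Missing combinations simply count as zero, because `Counter` returns 0 for keys it has not seen.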
  • 44. Eggo project (public genomics data in ADAM format on S3) We… 1000genomes in ADAM format on S3. Open Source GA4GH Interop services implementation Machine learning on 1000genomes Genomic data and distributed computing
  • 45. The end (of the slides) Thanks for your attention! Xavier Tordoir xavier@silicocloud.eu Andy Petrella andy.petrella@nextlab.be