Spark’s distributed programming model
Martin Zapletal Cake Solutions
Apache Spark
Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and machine learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Core, SQL, GraphX, Streaming
6) Spark’s distributed programming model
Table of Contents
● Distributed programming introduction
● Programming models
● Dataflow systems and DAGs
● RDD
● Transformations, Actions, Persistence, Shared variables
Distributed programming
● reminder
○ unreliable network
○ ubiquitous failures
○ everything asynchronous
○ consistency, ordering and synchronisation expensive
○ local time
○ correctness properties: safety and liveness
○ ...
Two armies (generals)
● two armies, A (Red) and B (Blue)
● the two separated parts of army A, A1 and A2, must synchronize their attack to win
● consensus with unreliable communication channel
● no node failures, no byzantine failures, …
● designated leader
Parallel programming models
● Parallel computing models
○ Different parallel computing problems
■ Easily parallelizable or communication needed
○ Shared memory
■ On one machine
● Multiple CPUs/GPUs share memory
■ On multiple machines
● Shared memory accessed via network
● Still much slower than local memory
■ OpenMP, Global Arrays, …
○ Shared nothing
■ Processes communicate by sending messages
■ Send(), Receive()
■ MPI
○ usually no fault tolerance
Dataflow system
● term used to describe general parallel programming approach
● in the traditional von Neumann architecture, instructions are executed sequentially by a
worker (CPU) and data does not move
● in Dataflow, workers have different tasks assigned to them and form an assembly
line
● program represented by connections and black box operations - directed graph
● data moves between tasks
● task executed by worker as soon as inputs available
● inherently parallel
● no shared state
● closer to functional programming
● not Spark specific (Stratosphere, MapReduce, Pregel, Giraph, Storm, ...)
MapReduce
● shows that Dataflow can be expressed in terms of map and reduce
operations
● simple to parallelize
● but each map-reduce is separate from the rest
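As a rough illustration of the map and reduce phases, here is a minimal word-count sketch. It uses plain Scala collections as a stand-in for a distributed dataset; the object name `MapReduceSketch` and the input lines are made up for this example and do not come from the slides.

```scala
// Plain Scala collections stand in for a distributed dataset (assumption for illustration).
object MapReduceSketch {
  val lines = Seq("spark is fast", "spark is lazy")

  // map phase: emit a (key, value) pair for every word
  val mapped: Seq[(String, Int)] =
    lines.flatMap(_.split(" ")).map(word => (word, 1))

  // shuffle: bring all values with the same key together
  val grouped: Map[String, Seq[Int]] =
    mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

  // reduce phase: combine the values of each key independently of other keys
  val counts: Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.reduce(_ + _)) }

  def main(args: Array[String]): Unit =
    counts.toSeq.sortBy(_._1).foreach(println)
}
```

Each key is reduced independently, which is what makes the model simple to parallelize.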
Directed acyclic graph
● Spark is a Dataflow execution engine that supports cyclic data flows
● whole DAG is formed lazily
● allows global optimizations
● has the expressiveness of MPI
● lineage tracking
Optimizations
● similar to optimizations of RDBMS (operation reordering, bushy
join-order enumeration, aggregation push-down)
● however DAGs less restrictive than database queries and it is
difficult to optimize UDFs (higher order functions used in Spark,
Flink)
● potentially major performance improvement
● partial support for incremental algorithm optimization (local
change) with sparse computational dependencies (GraphX)
Optimizations
case class Person(age: Int, height: Double)
val people = (0 to 100).map(x => Person(x, x))

// map first, then filter on age
sc
  .parallelize(people)
  .map(p => Person(p.age, p.height * 2.54))
  .filter(_.age < 35)

// filter on age first, then map
sc
  .parallelize(people)
  .filter(_.age < 35)
  .map(p => Person(p.age, p.height * 2.54))
Optimizations
case class Person(age: Int, height: Double)
val people = (0 to 100).map(x => Person(x, x))

// map first, then filter on height
sc
  .parallelize(people)
  .map(p => Person(p.age, p.height * 2.54))
  .filter(_.height < 170)

// filter on height first, then map
sc
  .parallelize(people)
  .filter(_.height < 170)
  .map(p => Person(p.age, p.height * 2.54))
???
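One way to check whether the two reorderings are equivalent is to run both orders on plain Scala collections. The sketch below assumes nothing about Spark itself; `FilterPushDownSketch` and `scale` are names invented for this example, while `Person`, `people` and the predicates are taken from the slides.

```scala
// Plain collections instead of sc.parallelize, to compare the two operator orders.
object FilterPushDownSketch {
  case class Person(age: Int, height: Double)
  val people = (0 to 100).map(x => Person(x, x.toDouble))

  // The UDF from the slides: multiplies height by 2.54 and leaves age untouched.
  def scale(p: Person): Person = Person(p.age, p.height * 2.54)

  // Filter on age: the map never changes age, so both orders agree.
  val ageMapFirst    = people.map(scale).filter(_.age < 35)
  val ageFilterFirst = people.filter(_.age < 35).map(scale)

  // Filter on height: the map rewrites height, so the two orders disagree.
  val heightMapFirst    = people.map(scale).filter(_.height < 170)
  val heightFilterFirst = people.filter(_.height < 170).map(scale)

  def main(args: Array[String]): Unit = {
    println(ageMapFirst == ageFilterFirst) // true
    println(heightMapFirst.size)           // 67
    println(heightFilterFirst.size)        // 101
  }
}
```

The age filter can be pushed below the map because the map does not modify the field the predicate reads. The height filter cannot, which is why an optimizer has to inspect what each UDF reads and writes before reordering.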
Optimizations
1. logical rewriting applying rules to trees of operators (e.g. filter push down)
○ static code analysis (bytecode of each UDF) to check reordering rules
○ emits all valid reordered data flow alternatives
2. logical representation translated to physical representation
○ chooses physical execution strategies for each alternative (partitioning,
broadcasting, external sorts, merge and hash joins, …)
○ uses a cost based optimizer (I/O, disk I/O, CPU costs, UDF costs, network)
Stream optimizations
● similar, because in Spark streams are just mini batches
● a few extra window and state operations
pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
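The pseudocode above can be approximated on plain Scala collections to show how a running reduce carries state from one mini batch to the next. This is only a sketch under assumptions: the batches, the URLs and the names `MiniBatchSketch`, `countBatch` and `runningCounts` are hypothetical, and no Spark Streaming API is used.

```scala
// Each inner Seq models one mini batch (for example one second) of page-view URLs.
object MiniBatchSketch {
  val batches: Seq[Seq[String]] = Seq(
    Seq("/home", "/docs"),
    Seq("/home"),
    Seq("/docs", "/home")
  )

  // Per-batch step: map every event to (url, 1) and sum the ones per url.
  def countBatch(batch: Seq[String]): Map[String, Int] =
    batch.map(url => (url, 1))
      .groupBy(_._1)
      .map { case (url, ones) => (url, ones.map(_._2).sum) }

  // Running reduce: merge each batch's counts into the state from earlier batches.
  // scanLeft keeps every intermediate state; tail drops the empty initial state.
  val runningCounts: Seq[Map[String, Int]] =
    batches.scanLeft(Map.empty[String, Int]) { (state, batch) =>
      countBatch(batch).foldLeft(state) { case (acc, (url, n)) =>
        acc.updated(url, acc.getOrElse(url, 0) + n)
      }
    }.tail

  def main(args: Array[String]): Unit =
    runningCounts.foreach(println)
}
```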
Performance
                      Hadoop                  Spark     Spark
Data size             102.5 TB                100 TB    1000 TB
Time [min]            72                      23        234
Nodes                 2100                    206       190
Cores                 50400                   6592      6080
Rate/node [GB/min]    0.67                    20.7      22.5
Environment           dedicated data center   EC2       EC2
● fastest open source solution to sort 100 TB of data in the Daytona Gray Sort Benchmark
(http://sortbenchmark.org/)
● required some improvements in shuffle approach
● very optimized sorting algorithm (cache locality, unsafe off-heap memory structures, gc, …)
● Databricks blog + presentation
Spark programming model
● RDD
● parallelizing collections
● loading external datasets
● operations
○ transformations
○ actions
● persistence
● shared variables
RDD
● transformations
○ lazy, form the DAG
○ map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, sample, union,
intersection, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup,
repartition, cartesian, glom, ...
● actions
○ execute DAG
○ retrieve result
○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ...
● different categories of transformations with different complexity, performance and
semantics
● e.g. mapping, filtering, grouping, set operations, sorting, reducing, partitioning
● full list https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.rdd.RDD
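The split between lazy transformations and eager actions can be imitated with a Scala collection view. This is an analogy only, not Spark: `LazySketch` and its members are invented for the example, and a view has no DAG scheduler or partitions behind it.

```scala
// A collection view as a stand-in for an RDD: map and filter only describe work.
object LazySketch {
  var evaluated = 0 // counts how often the mapping function has run

  // "Transformations": nothing is computed when these lines execute.
  val pipeline = (1 to 10).view
    .map { x => evaluated += 1; x * 2 }
    .filter(_ % 4 == 0)

  val beforeAction = evaluated // still 0, the pipeline has not run

  // "Action": forces the whole pipeline and returns a plain value.
  val result = pipeline.sum

  val afterAction = evaluated // non-zero only now

  def main(args: Array[String]): Unit =
    println(s"before=$beforeAction result=$result after=$afterAction")
}
```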
Transformations with narrow deps
● map
● union
● join with copartitioned inputs
Transformations with wide deps
● groupBy
● join without copartitioned inputs
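The difference between narrow and wide dependencies can be sketched with partitions modelled as nested Scala sequences. Everything here is an assumption made for illustration: the data, the hash partitioning via `Math.floorMod`, and the names `DependencySketch`, `partitionOf`, `shuffled` and `grouped`. Spark's real partitioners and shuffle are considerably more involved.

```scala
// Two partitions of (key, value) records, as they might sit on two workers.
object DependencySketch {
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("a", 1), ("b", 2)),
    Seq(("a", 3), ("c", 4))
  )

  // Narrow dependency: every output partition is computed from exactly one
  // input partition, so no data has to leave its worker.
  val mapped: Seq[Seq[(String, Int)]] =
    partitions.map(_.map { case (k, v) => (k, v * 10) })

  // Wide dependency: grouping by key needs records from all input partitions,
  // so they are first redistributed by key hash (a simplified "shuffle").
  val numPartitions = 2
  def partitionOf(key: String): Int = Math.floorMod(key.hashCode, numPartitions)

  val shuffled: Seq[Seq[(String, Int)]] =
    (0 until numPartitions).map { target =>
      partitions.flatten.filter { case (k, _) => partitionOf(k) == target }
    }

  // After the shuffle, all values of a key live in one partition and can be grouped locally.
  val grouped: Seq[Map[String, Seq[Int]]] =
    shuffled.map(_.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) })

  def main(args: Array[String]): Unit =
    grouped.foreach(println)
}
```

If both inputs of a join are already partitioned by the same function, the shuffle step is unnecessary, which is why a join with copartitioned inputs counts as narrow.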
Actions - collect
● retrieves result to driver program
● no longer distributed
Actions - reduction
● associative, commutative operation
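Why the operation has to be associative can be seen by simulating a partitioned reduce on a plain Scala vector. This is a sketch, not Spark's implementation; `ReduceSketch` and `distributedReduce` are hypothetical names, and the partial results are combined in a fixed order here, whereas a real cluster may combine them in any order (hence the commutativity requirement).

```scala
// Reduce inside each partition first, then combine the partial results.
object ReduceSketch {
  val data = (1 to 8).toVector

  def distributedReduce(numPartitions: Int)(op: (Int, Int) => Int): Int = {
    val size = math.ceil(data.size.toDouble / numPartitions).toInt
    data.grouped(size).map(_.reduce(op)).reduce(op)
  }

  // Addition is associative and commutative: the partitioning does not matter.
  val sum2 = distributedReduce(2)(_ + _) // 36
  val sum4 = distributedReduce(4)(_ + _) // 36

  // Subtraction is neither: the result depends on how the data was split.
  val sub2 = distributedReduce(2)(_ - _) // 8
  val sub4 = distributedReduce(4)(_ - _) // 2

  def main(args: Array[String]): Unit =
    println(s"sum: $sum2 $sum4, sub: $sub2 $sub4")
}
```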
Cache
● caches a dataset's partitions for reuse in later actions on that dataset or on
datasets derived from it
● snapshot used instead of lineage recomputation
● fault tolerant
● cache(), persist()
● levels
○ memory
○ disk
○ both
○ serialized
○ replicated
○ off-heap
● automatic cache after shuffle
Shared variables - broadcast
● usually all variables used in UDF are copies on each node
● shared r/w variables would be very inefficient
● broadcast
○ read only variables
○ efficient broadcast algorithm, can deliver data cheaply to all nodes
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
Shared variables - accumulators
● accumulators
○ add only
○ use associative operation so efficient in parallel
○ only driver program can read the value
○ exactly once semantics only guaranteed for actions (in case of failure
and recalculation)
val accum = sc.accumulator(0, "My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
Shared variables - accumulators
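import org.apache.spark.AccumulatorParam

// Vector is assumed to be a user-defined class representing a mathematical vector
object VectorAccumulatorParam extends AccumulatorParam[Vector] {
  // identity element with the same size as the initial value
  def zero(initialValue: Vector): Vector = {
    Vector.zeros(initialValue.size)
  }
  // merges two partial results
  def addInPlace(v1: Vector, v2: Vector): Vector = {
    v1 += v2
  }
}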
Conclusion
● expressive and abstract programming model
● user defined functions
● based on research
● optimizations
● constraining in certain cases (spanning partition boundaries, functions of
multiple variables, ...)
Questions

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark InternalsKnoldus Inc.
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsCheng Lian
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Spark Summit
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 

Was ist angesagt? (20)

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark Internals
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 

Andere mochten auch

Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Martin Zapletal
 
Reloj inteligente para ciegos (smartwatch)
Reloj inteligente para ciegos (smartwatch)Reloj inteligente para ciegos (smartwatch)
Reloj inteligente para ciegos (smartwatch)AgustinaBarreto11
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformMartin Zapletal
 
Gadgets (I/O) for Disabled/Physically Challenged
Gadgets (I/O) for Disabled/Physically ChallengedGadgets (I/O) for Disabled/Physically Challenged
Gadgets (I/O) for Disabled/Physically ChallengedMujab Muneeb
 
[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607
[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607
[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607D.CAMP
 
Deep learning review
Deep learning reviewDeep learning review
Deep learning reviewManas Gaur
 
20151223application of deep learning in basic bio
20151223application of deep learning in basic bio 20151223application of deep learning in basic bio
20151223application of deep learning in basic bio Charlene Hsuan-Lin Her
 
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...Impetus Technologies
 
Data in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyData in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyMartin Zapletal
 
Introduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed ComputingIntroduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed ComputingSayed Chhattan Shah
 
Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginningsDaniel Leon
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksData Con LA
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn..."Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...Edge AI and Vision Alliance
 
Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012
Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012
Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012Justin Sutton
 
Deep Learning Jeff-Shomaker_1-20-17_Final_
Deep Learning Jeff-Shomaker_1-20-17_Final_Deep Learning Jeff-Shomaker_1-20-17_Final_
Deep Learning Jeff-Shomaker_1-20-17_Final_Jeffrey Shomaker
 
Apache spark linkedin
Apache spark linkedinApache spark linkedin
Apache spark linkedinYukti Kaura
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceJonathan Mugan
 

Andere mochten auch (20)

Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2
 
Reloj inteligente para ciegos (smartwatch)
Reloj inteligente para ciegos (smartwatch)Reloj inteligente para ciegos (smartwatch)
Reloj inteligente para ciegos (smartwatch)
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive Platform
 
Gadgets (I/O) for Disabled/Physically Challenged
Gadgets (I/O) for Disabled/Physically ChallengedGadgets (I/O) for Disabled/Physically Challenged
Gadgets (I/O) for Disabled/Physically Challenged
 
[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607
[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607
[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607
 
Deep learning review
Deep learning reviewDeep learning review
Deep learning review
 
20151223application of deep learning in basic bio
20151223application of deep learning in basic bio 20151223application of deep learning in basic bio
20151223application of deep learning in basic bio
 
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
 
Data in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyData in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data Efficiently
 
Introduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed ComputingIntroduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed Computing
 
Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginnings
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn..."Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012
Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012
Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012
 
Deep Learning Jeff-Shomaker_1-20-17_Final_
Deep Learning Jeff-Shomaker_1-20-17_Final_Deep Learning Jeff-Shomaker_1-20-17_Final_
Deep Learning Jeff-Shomaker_1-20-17_Final_
 
Apache spark linkedin
Apache spark linkedinApache spark linkedin
Apache spark linkedin
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial Intelligence
 

Ähnlich wie Apache spark - Spark's distributed programming model

Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache SparkLucian Neghina
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internalsAnton Kirillov
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive huguk
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
NetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaNetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaJosef Niedermeier
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxAishg4
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 

Ähnlich wie Apache spark - Spark's distributed programming model (20)

Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Spark
SparkSpark
Spark
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Spark Deep Dive
Spark Deep DiveSpark Deep Dive
Spark Deep Dive
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
NetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaNetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and Vertica
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 

Mehr von Martin Zapletal

How Disney+ uses fast data ubiquity to improve the customer experience
 How Disney+ uses fast data ubiquity to improve the customer experience  How Disney+ uses fast data ubiquity to improve the customer experience
How Disney+ uses fast data ubiquity to improve the customer experience Martin Zapletal
 
Customer experience at disney+ through data perspective
Customer experience at disney+ through data perspectiveCustomer experience at disney+ through data perspective
Customer experience at disney+ through data perspectiveMartin Zapletal
 
Intelligent System Optimizations
Intelligent System OptimizationsIntelligent System Optimizations
Intelligent System OptimizationsMartin Zapletal
 
Intelligent Distributed Systems Optimizations
Intelligent Distributed Systems OptimizationsIntelligent Distributed Systems Optimizations
Intelligent Distributed Systems OptimizationsMartin Zapletal
 
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Cassandra as an event sourced journal for big data analytics Cassandra Summit...Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Cassandra as an event sourced journal for big data analytics Cassandra Summit...Martin Zapletal
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Martin Zapletal
 

Mehr von Martin Zapletal (6)

How Disney+ uses fast data ubiquity to improve the customer experience
 How Disney+ uses fast data ubiquity to improve the customer experience  How Disney+ uses fast data ubiquity to improve the customer experience
How Disney+ uses fast data ubiquity to improve the customer experience
 
Customer experience at disney+ through data perspective
Customer experience at disney+ through data perspectiveCustomer experience at disney+ through data perspective
Customer experience at disney+ through data perspective
 
Intelligent System Optimizations
Intelligent System OptimizationsIntelligent System Optimizations
Intelligent System Optimizations
 
Intelligent Distributed Systems Optimizations
Intelligent Distributed Systems OptimizationsIntelligent Distributed Systems Optimizations
Intelligent Distributed Systems Optimizations
 
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Cassandra as an event sourced journal for big data analytics Cassandra Summit...Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
 

Kürzlich hochgeladen

What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
EY_Graph Database Powered Sustainability
Apache Spark - Spark's distributed programming model

  • 1. Spark’s distributed programming model Martin Zapletal Cake Solutions Apache Spark
  • 2. Apache Spark and Big Data 1) History and market overview 2) Installation 3) MLlib and machine learning on Spark 4) Porting R code to Scala and Spark 5) Concepts - Core, SQL, GraphX, Streaming 6) Spark’s distributed programming model
  • 3. Table of Contents ● Distributed programming introduction ● Programming models ● Dataflow systems and DAGs ● RDD ● Transformations, Actions, Persistence, Shared variables
  • 4. Distributed programming ● reminder ○ unreliable network ○ ubiquitous failures ○ everything asynchronous ○ consistency, ordering and synchronisation expensive ○ local time ○ correctness properties safety and liveness ○ ...
  • 5. Two armies (generals) ● two armies, A (Red) and B (Blue) ● separated parts A1 and A2 of A army must synchronize attack to win ● consensus with unreliable communication channel ● no node failures, no byzantine failures, … ● designated leader
  • 6. Parallel programming models ● Parallel computing models ○ Different parallel computing problems ■ Easily parallelizable or communication needed ○ Shared memory ■ On one machine ● Multiple CPUs/GPUs share memory ■ On multiple machines ● Shared memory accessed via network ● Still much slower compared to memory ■ OpenMP, Global Arrays, … ○ Share nothing ■ Processes communicate by sending messages ■ Send(), Receive() ■ MPI ○ usually no fault tolerance
  • 7. Dataflow system ● term used to describe general parallel programming approach ● in traditional von Neumann architecture instructions executed sequentially by a worker (cpu) and data do not move ● in Dataflow workers have different tasks assigned to them and form an assembly line ● program represented by connections and black box operations - directed graph ● data moves between tasks ● task executed by worker as soon as inputs available ● inherently parallel ● no shared state ● closer to functional programming ● not Spark specific (Stratosphere, MapReduce, Pregel, Giraph, Storm, ...)
  • 8. MapReduce ● shows that Dataflow can be expressed in terms of map and reduce operations ● simple to parallelize ● but each map-reduce is separate from the rest
  • 9. Directed acyclic graph ● Spark is a Dataflow execution engine that supports cyclic data flows ● whole DAG is formed lazily ● allows global optimizations ● has expressiveness of MPI ● lineage tracking
  • 10. Optimizations ● similar to optimizations of RDBMS (operation reordering, bushy join-order enumeration, aggregation push-down) ● however DAGs are less restrictive than database queries and it is difficult to optimize UDFs (higher order functions used in Spark, Flink) ● potentially major performance improvement ● partial support for incremental algorithm optimization (local change) with sparse computational dependencies (GraphX)
  • 11. Optimizations
        case class Person(age: Int, height: Double)
        val people = (0 to 100).map(x => Person(x, x))

        sc.parallelize(people)
          .map(p => Person(p.age, p.height * 2.54))
          .filter(_.age < 35)

        sc.parallelize(people)
          .filter(_.age < 35)
          .map(p => Person(p.age, p.height * 2.54))
  • 12. Optimizations
        case class Person(age: Int, height: Double)
        val people = (0 to 100).map(x => Person(x, x))

        sc.parallelize(people)
          .map(p => Person(p.age, p.height * 2.54))
          .filter(_.height < 170)

        sc.parallelize(people)
          .filter(_.height < 170)
          .map(p => Person(p.age, p.height * 2.54))

        ???
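    The "???" on slide 12 asks whether the filter may still be moved in front of the map. A minimal sketch of the difference between slides 11 and 12, using plain Scala collections instead of `sc.parallelize` (an assumption, so that it runs without a cluster); the names `FilterPushDown` and `toCm` are illustrative and not from the slides:

    ```scala
    case class Person(age: Int, height: Double)

    object FilterPushDown {
      val people: Seq[Person] = (0 to 100).map(x => Person(x, x.toDouble))

      // The map step from the slides: rescales height by 2.54, leaves age untouched.
      def toCm(p: Person): Person = Person(p.age, p.height * 2.54)

      // Slide 11: the predicate reads only `age`, which the map does not modify,
      // so both orders produce the same result.
      val ageMapThenFilter: Seq[Person] = people.map(toCm).filter(_.age < 35)
      val ageFilterThenMap: Seq[Person] = people.filter(_.age < 35).map(toCm)

      // Slide 12: the predicate reads `height`, which the map rewrites,
      // so moving the filter in front of the map changes the result.
      val heightMapThenFilter: Seq[Person] = people.map(toCm).filter(_.height < 170)
      val heightFilterThenMap: Seq[Person] = people.filter(_.height < 170).map(toCm)
    }
    ```

    Here the two age pipelines agree, while the height pipelines return 67 and 101 records respectively. An optimizer therefore has to know which fields a UDF reads and writes before it can reorder operators.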
  • 13. Optimizations 1. logical rewriting applying rules to trees of operators (e.g. filter push down) ○ static code analysis (bytecode of each UDF) to check reordering rules ○ emits all valid reordered data flow alternatives 2. logical representation translated to physical representation ○ chooses physical execution strategies for each alternative (partitioning, broadcasting, external sorts, merge and hash joins, …) ○ uses a cost based optimizer (I/O, disk I/O, CPU costs, UDF costs, network)
  • 14. Stream optimizations ● similar, because in Spark streams are just mini batches ● a few extra window, state operations
        pageViews = readStream("http://...", "1s")
        ones = pageViews.map(event => (event.url, 1))
        counts = ones.runningReduce((a, b) => a + b)
  • 15. Performance
                             Hadoop                  Spark     Spark
        Data size            102.5 TB                100 TB    1000 TB
        Time [min]           72                      23        234
        Nodes                2100                    206       190
        Cores                50400                   6592      6080
        Rate/node [GB/min]   0.67                    20.7      22.5
        Environment          dedicated data center   EC2       EC2
    ● fastest open source solution to sort 100 TB of data in the Daytona Gray Sort Benchmark (http://sortbenchmark.org/) ● required some improvements in shuffle approach ● very optimized sorting algorithm (cache locality, unsafe off-heap memory structures, gc, …) ● Databricks blog + presentation
  • 16. Spark programming model ● RDD ● parallelizing collections ● loading external datasets ● operations ○ transformations ○ actions ● persistence ● shared variables
  • 17. RDD ● transformations ○ lazy, form the DAG ○ map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, sample, union, intersection, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup, repartition, cartesian, glom, ... ● actions ○ execute DAG ○ retrieve result ○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ... ● different categories of transformations with different complexity, performance and semantics ● e.g. mapping, filtering, grouping, set operations, sorting, reducing, partitioning ● full list https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.rdd.RDD
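    The lazy-transformation / eager-action split can be imitated with a Scala collection view. This is only an analogy (a view is not an RDD and builds no DAG), and the names `LazyDemo`, `evaluated` and `pipeline` are made up for the illustration:

    ```scala
    object LazyDemo {
      // Counts how many times the map function has actually run.
      var evaluated = 0

      // "Transformations": building the pipeline on a view does no work yet.
      val pipeline = (1 to 10).view
        .map { x => evaluated += 1; x * 2 }
        .filter(_ % 3 == 0)

      // Still zero here, because nothing has forced the pipeline.
      val evaluatedBeforeAction: Int = evaluated

      // "Action": materialising the result runs the whole pipeline.
      val result: List[Int] = pipeline.toList
    }
    ```

    In Spark the roles are analogous: calls such as `map` and `filter` only record the computation, and a call such as `collect` or `count` triggers it.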
  • 18. Transformations with narrow deps ● map ● union ● join with copartitioned inputs
  • 19. Transformations with wide deps ● groupBy ● join without copartitioned inputs
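    A rough model of the narrow/wide distinction from slides 18 and 19, with partitions represented as nested `Vector`s in plain Scala (an assumption for illustration; Spark's real partitioning and shuffle machinery is not shown, and `ShuffleDemo` is a made-up name):

    ```scala
    object ShuffleDemo {
      // Two input partitions of (key, value) records; key "a" lives in both.
      val partitions: Vector[Vector[(String, Int)]] = Vector(
        Vector("a" -> 1, "b" -> 2),
        Vector("a" -> 3, "c" -> 4)
      )

      // Narrow dependency: each output partition is computed from exactly
      // one input partition, so no data has to leave its partition.
      val mapped: Vector[Vector[(String, Int)]] =
        partitions.map(_.map { case (k, v) => (k, v * 10) })

      // Wide dependency: grouping by key needs records from every input
      // partition that holds the key, so the data must be brought together.
      val grouped: Map[String, Vector[Int]] =
        partitions.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

      // Indices of the input partitions that contribute to key "a".
      val sourcesOfA: Seq[Int] =
        partitions.indices.filter(i => partitions(i).exists(_._1 == "a"))
    }
    ```

    The group for "a" draws on both input partitions, which is the situation a shuffle has to handle. If the inputs were already partitioned by key, the same grouping would only need narrow dependencies.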
  • 20. Actions collect ● retrieves result to driver program ● no longer distributed
  • 21. Actions reduction ● associative, commutative operation
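    Why the operation has to be associative (and commutative) can be seen by reducing chunk by chunk, roughly the way a distributed reduce combines per-partition results. This is a sketch in plain Scala; `chunkedReduce` is a hypothetical helper, not a Spark API:

    ```scala
    object ReduceDemo {
      // Reduce every chunk locally, then reduce the partial results.
      def chunkedReduce(data: Seq[Int], chunkSize: Int)(op: (Int, Int) => Int): Int =
        data.grouped(chunkSize).map(_.reduce(op)).reduce(op)

      val data: Seq[Int] = 1 to 8

      // Addition is associative and commutative: the chunking does not matter.
      val sumTwoChunks: Int = chunkedReduce(data, 4)(_ + _)
      val sumFourChunks: Int = chunkedReduce(data, 2)(_ + _)

      // Subtraction is neither: the result depends on how the data was split.
      val subTwoChunks: Int = chunkedReduce(data, 4)(_ - _)
      val subFourChunks: Int = chunkedReduce(data, 2)(_ - _)
    }
    ```

    Both sums are 36, while the subtraction yields 8 with two chunks and 2 with four chunks, so its result would depend on the partitioning.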
  • 22. Cache ● cache partitions to be reused in next actions on it or on datasets derived from it ● snapshot used instead of lineage recomputation ● fault tolerant ● cache(), persist() ● levels ○ memory ○ disk ○ both ○ serialized ○ replicated ○ off-heap ● automatic cache after shuffle
  • 23. Shared variables - broadcast ● usually all variables used in UDF are copies on each node ● shared r/w variables would be very inefficient ● broadcast ○ read only variables ○ efficient broadcast algorithm, can deliver data cheaply to all nodes
        val broadcastVar = sc.broadcast(Array(1, 2, 3))
        broadcastVar.value
  • 24. Shared variables - accumulators ● accumulators ○ add only ○ use associative operation so efficient in parallel ○ only driver program can read the value ○ exactly once semantics only guaranteed for actions (in case of failure and recalculation)
        val accum = sc.accumulator(0, "My Accumulator")
        sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
        accum.value
  • 25. Shared variables - accumulators
        object VectorAccumulatorParam extends AccumulatorParam[Vector] {
          def zero(initialValue: Vector): Vector = {
            Vector.zeros(initialValue.size)
          }
          def addInPlace(v1: Vector, v2: Vector): Vector = {
            v1 += v2
          }
        }
  • 26. Conclusion ● expressive and abstract programming model ● user defined functions ● based on research ● optimizations ● constraining in certain cases (spanning partition boundaries, functions of multiple variables, ...)

Editor's notes

  1. Anything can fail (network, nodes, lost or damaged packets, …). Liveness properties: assert that something ‘good’ will eventually happen during execution. Safety properties: assert that nothing ‘bad’ will ever happen during an execution (that is, that the program will never enter a ‘bad’ state).
  2. HPC: shared memory may or may not be a good fit, depending on communication patterns; locks may be needed.
  3. Describe each - e.g. serialized, off-heap, replicated