Noam Shaish
Spark Streaming
Scale
Fault tolerance
High throughput
Agenda
❖ Overview
❖ Architecture
❖ Fault-tolerance
❖ Why Spark Streaming? We have Storm
❖ Demo
Overview
❖ Spark Streaming is an extension of the core Spark API. It enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
❖ Connectors for most common data sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, TCP, etc.
❖ Spark Streaming differs from most online processing solutions by adopting a mini-batch approach instead of processing the data as a continuous stream.
❖ Based on the Discretized Streams paper:
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing
Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica
Berkeley EECS (2012-12-14)
www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Overview
Spark Streaming runs a streaming computation as a series of very small, deterministic batch jobs.
[Diagram: live data stream → Spark Streaming → batches of X milliseconds → Spark → processed results]
❖ Chops the live stream into batches of X milliseconds
❖ Spark treats each batch of data as RDDs
❖ Processed results of the RDD operations are returned in batches
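The mini-batch idea can be sketched in plain Scala, without Spark. This is only an illustrative model, not how Spark implements it: the `Record` type, the `chop` and `process` helpers, and the millisecond timestamps are all assumptions made for the example. It groups incoming records into fixed-length intervals and runs the same deterministic job on each interval.

```scala
// Plain-Scala model of micro-batching; no Spark dependency.
// All names here are hypothetical and exist only for illustration.
object MicroBatchSketch {
  // A record together with its arrival time in milliseconds.
  final case class Record(arrivalMs: Long, value: String)

  // Chop the stream into consecutive batches of `batchMs` milliseconds.
  // Simplification: intervals that received no data are skipped.
  def chop(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[String])] =
    records
      .groupBy(r => r.arrivalMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (idx, rs) => (idx, rs.sortBy(_.arrivalMs).map(_.value)) }

  // Run the same deterministic batch job on every batch.
  def process[A](batches: Seq[(Long, Seq[String])])(job: Seq[String] => A): Seq[(Long, A)] =
    batches.map { case (idx, values) => (idx, job(values)) }
}
```

For instance, `MicroBatchSketch.process(MicroBatchSketch.chop(records, 500))(_.size)` would count the records that arrived in each 500 ms interval.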
DStream, not just RDD
* DataStax Cassandra connector
Transformations
• map()
• flatMap()
• filter()
• count()
• repartition()
• union()
• reduce()
• countByValue()
• reduceByKey()
• join()
• cogroup()
• transform()
• updateStateByKey()
Output Operations
• print()
• foreachRDD()
• saveAsObjectFiles()
• saveAsTextFiles()
• saveAsHadoopFiles()
• *saveToCassandra()
Window Operations
• window()
• countByWindow()
• reduceByWindow()
• reduceByKeyAndWindow()
• countByValueAndWindow()
Example 1 - DStream to RDD
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
[Diagram: the Twitter Streaming API feeds the tweets DStream, a sequence of batches (batch @ t, batch @ t+1, batch @ t+2, batch @ t+3); each batch is stored in memory as an RDD (immutable, distributed)]
Example 1 - DStream to RDD relation
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
[Diagram: flatMap is applied to every batch of the tweets DStream (batch @ t to batch @ t+3), creating new RDDs for each batch; together they form the new hashTags DStream, e.g. (#hobbitch, #bilboleggins, …)]
Example 1 - DStream to RDD
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveToCassandra("keyspace", "tableName")
[Diagram: flatMap turns each batch of the tweets DStream into a batch of the hashTags DStream, e.g. (#hobbitch, #bilboleggins, …); every batch is saved to Cassandra]
Example 2 - DStream to RDD relation
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValue()
[Diagram: flatMap turns each batch of the tweets DStream into a batch of hashTags; countByValue then runs as map followed by reduceByKey on every batch, producing counts such as (#hobbitch, 10), (#bilboleggins, 34), …]
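The map and reduceByKey steps shown in the diagram can be modelled on a single batch in plain Scala. This is a rough sketch rather than Spark's implementation; the object and helper names are made up for the example, and a batch is represented as an ordinary `Seq[String]`.

```scala
// Plain-Scala model of countByValue on one batch; no Spark dependency.
// Names are hypothetical and chosen to mirror the diagram.
object CountByValueSketch {
  // Step 1 (map): pair every element with an initial count of 1.
  def toPairs(batch: Seq[String]): Seq[(String, Long)] =
    batch.map(tag => (tag, 1L))

  // Step 2 (reduceByKey): merge the counts of equal keys with `+`.
  def reduceByKey(pairs: Seq[(String, Long)]): Map[String, Long] =
    pairs.foldLeft(Map.empty[String, Long]) { case (acc, (tag, n)) =>
      acc.updated(tag, acc.getOrElse(tag, 0L) + n)
    }

  // countByValue on one batch is the composition of the two steps.
  def countByValue(batch: Seq[String]): Map[String, Long] =
    reduceByKey(toPairs(batch))
}
```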
Example 3 - Count the hash tags over last 10 minutes
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
window(...) is the sliding window operation: Minutes(10) is the window length and Seconds(1) is the sliding interval.
Example 3 - Count the hash tags over last 10 minutes
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
[Diagram: a sliding window moves over the hashTags batches at t-1, t, t+1, t+2, t+3; the count is computed over all data in the window]
Example 4 - Count hash tags over last 10 minutes smartly
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
[Diagram: as the sliding window moves over the hashTags batches (t-1 to t+3), the count of the new batch entering the window is added (+) and the count of the batch leaving the window is subtracted (-)]
A generalization of this smart window reduce exists:
reduceByKeyAndWindow(reduce, inverseReduce, window, interval)
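The difference between recounting the whole window and updating it incrementally can be sketched in plain Scala. This is a simplified model, not Spark code: the window length is expressed as a number of batches rather than a duration, and the names `add`, `subtract`, `naive`, and `incremental` are assumptions made for illustration. `add` plays the role of the reduce function and `subtract` the role of the inverse reduce function.

```scala
// Plain-Scala model of an incremental sliding-window count; no Spark dependency.
object WindowCountSketch {
  type Counts = Map[String, Long]

  // Count occurrences within one batch.
  def countBatch(batch: Seq[String]): Counts =
    batch.groupBy(identity).map { case (tag, occurrences) => (tag, occurrences.size.toLong) }

  // "reduce": add the counts of the batch entering the window.
  def add(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (tag, n)) => acc.updated(tag, acc.getOrElse(tag, 0L) + n) }

  // "inverseReduce": subtract the counts of the batch leaving the window,
  // dropping tags whose count falls to zero.
  def subtract(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (tag, n)) =>
      val remaining = acc.getOrElse(tag, 0L) - n
      if (remaining > 0) acc.updated(tag, remaining) else acc - tag
    }

  // Naive approach: recount every batch in the window on each slide.
  def naive(batches: Seq[Seq[String]], windowLen: Int): Seq[Counts] =
    batches.indices.map { i =>
      batches
        .slice(math.max(0, i - windowLen + 1), i + 1)
        .map(countBatch)
        .foldLeft(Map.empty[String, Long])(add)
    }

  // Incremental approach: previous window + new batch - batch that left.
  def incremental(batches: Seq[Seq[String]], windowLen: Int): Seq[Counts] = {
    val perBatch = batches.map(countBatch)
    perBatch.indices.scanLeft(Map.empty[String, Long]) { (window, i) =>
      val withNew = add(window, perBatch(i))
      if (i >= windowLen) subtract(withNew, perBatch(i - windowLen)) else withNew
    }.tail
  }
}
```

Both functions return the same counts. The incremental version touches only two batches per slide, however long the window is, which is why it pays off for a 10-minute window sliding every second. In Spark itself, the variants that take an inverse function need checkpointing to be enabled.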
Architecture
❖ Receivers divide the data into mini batches
❖ The size of the batches can be defined in milliseconds (best practice is greater than 500 milliseconds)
[Diagram: input streams → receivers (Spark Streaming) → batches of input RDDs → Spark engine → batches of output RDDs]
Fault-tolerance
❖ RDDs are not generated from a fault-tolerant source
❖ Received data is replicated among worker nodes (default replication factor of 2)
❖ Stateful jobs should use checkpoints
❖ Journaling, as in a database, can be activated
[Diagram: the input data of the tweets RDD is replicated in memory; flatMap derives the hashTags RDD; lost partitions are recomputed on other workers]
Fault-tolerance
❖ Two kinds of data to recover in the event of failure:
• Data received and replicated: this data survives the failure of a single worker node, since a copy of it exists on one of the other nodes.
• Data received but buffered for replication: as this data is not replicated, the only way to recover it is to get it from the source again.
Fault-tolerance
❖ Two receiver semantics:
• Reliable receiver: acknowledges the source only after the received data has been replicated. If the receiver fails, buffered data is not acknowledged to the source. When the receiver is restarted, the source resends the data, so no data is lost due to the failure.
• Unreliable receiver: such receivers can lose data when they fail due to worker or driver failures.
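The reliable-receiver behaviour described above can be modelled in a few lines of plain Scala. This is a toy simulation under stated assumptions, not Spark's receiver API: the `Source` class, the `run` function, and the `crashAfter` parameter are hypothetical. The source keeps every record until it is acknowledged, and the receiver acknowledges only what it managed to replicate.

```scala
// Toy model of reliable-receiver semantics; no Spark dependency.
object ReliableReceiverSketch {
  // A source that retains records until they are acknowledged.
  final class Source(records: Seq[String]) {
    private var pending: Vector[String] = records.toVector
    def unacknowledged: Vector[String] = pending
    // Acknowledge the first `n` pending records, in order.
    def ack(n: Int): Unit = { pending = pending.drop(n) }
  }

  // One receiver run. `crashAfter` is the number of records replicated
  // before a simulated crash; None means the run completes normally.
  // Returns the records that reached replicated storage in this run.
  def run(source: Source, crashAfter: Option[Int]): Vector[String] = {
    val buffered = source.unacknowledged
    val replicated = crashAfter.fold(buffered)(n => buffered.take(n))
    source.ack(replicated.size) // acknowledge only what was replicated
    replicated
  }
}
```

After a crash, the records that were buffered but not replicated are still pending at the source, so a restarted receiver picks them up again and nothing is lost.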
Fault-tolerance
Deployment scenario: without write-ahead log
• Receiver failure: buffered data lost with unreliable receivers; zero data lost with reliable receivers and files
• Driver failure: buffered data lost with unreliable receivers; past data lost with all receivers; zero data lost with files
Deployment scenario: with write-ahead log
• Receiver failure: zero data lost with receivers and files
• Driver failure: zero data lost with receivers and files
Why Spark Streaming? We have Storm
One model to rule them all
❖ Same model for offline AND online processing
❖ Common code base for offline AND online processing
❖ Fewer bugs due to duplication
❖ Fewer bugs due to framework differences
❖ Increased developer productivity
One stack to rule them all
❖ Explore data interactively using the Spark shell to identify a problem
❖ Use the same code in standalone Spark to identify the problem in the production environment
❖ Use similar code in Spark Streaming to monitor the problem online
$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> va

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
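The point of the three snippets above is that the filtering and mapping logic stays the same whether it runs on a whole log or on each mini batch. A plain-Scala sketch of that idea follows. It does not use Spark, and the log format `<host> <level> <message>` and the function names are assumptions made purely for the example.

```scala
// Plain-Scala sketch of sharing one piece of logic between
// offline and online processing; no Spark dependency.
object SharedLogicSketch {
  // Business logic written once, against plain collections.
  // Assumed (hypothetical) line format: "<host> <level> <message>".
  def errorCountsByHost(lines: Seq[String]): Map[String, Int] =
    lines
      .filter(_.contains("ERROR"))
      .groupBy(line => line.takeWhile(_ != ' '))
      .map { case (host, hostLines) => (host, hostLines.size) }

  // Offline: apply it to the whole log at once.
  def offline(log: Seq[String]): Map[String, Int] =
    errorCountsByHost(log)

  // Online: apply the very same function to every mini batch.
  def online(batches: Seq[Seq[String]]): Seq[Map[String, Int]] =
    batches.map(errorCountsByHost)
}
```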
Performance
❖ Higher throughput than Storm
• Spark Streaming: 670k records/second/node
• Storm: 115k records/second/node
[Charts: throughput per node (MB/s) for Grep and WordCount at record sizes of 100 and 1000 bytes, Spark vs. Storm]
Tested with 100 EC2 instances with 4 cores each
Comparison taken from Tathagata Das and Reynold Xin's Hadoop Summit 2013 presentation
Community
Monitoring
In addition, the StreamingListener interface provides additional information at various levels (application, job, task, etc.)
Language
vs
Utilization
❖ Spark 1.2 introduces dynamic cluster resource allocation
❖ Jobs can request more resources and release resources
❖ Available only on YARN
Demo
https://github.com/NoamShaish/spark-streaming-workshop.git

Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...kalichargn70th171
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxSasikiranMarri
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfmaor17
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfkalichargn70th171
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 

Kürzlich hochgeladen (20)

OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 

Spark streaming

  • 1. Noam Shaish. Spark Streaming: scale, fault tolerance, high throughput
  • 2. Agenda
    ❖ Overview
    ❖ Architecture
    ❖ Fault-tolerance
    ❖ Why Spark Streaming? We have Storm
    ❖ Demo
  • 3. Overview
    ❖ Spark Streaming is an extension of the core Spark API. It enables scalable, high-throughput, fault-tolerant processing of live data streams.
    ❖ Connectors exist for most common data sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, TCP, etc.
    ❖ Spark Streaming differs from most online processing solutions by taking a mini-batch approach instead of processing the stream record by record.
    ❖ Based on the Discretized Streams paper:
      Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica. Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing. Berkeley EECS (2012-12-14). www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
  • 4. Overview
    Spark Streaming runs a streaming computation as a series of very small, deterministic batch jobs.
    Diagram: live data stream → Spark Streaming → batches of X milliseconds → Spark → processed results
    ❖ Chops the live stream into batches of X milliseconds
    ❖ Spark treats each batch of data as an RDD
    ❖ Processed results of the RDD operations are returned in batches
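The batching idea on this slide can be modelled in a few lines of plain Scala. This is only a sketch of the concept, not the Spark Streaming API: the `Event` type, the `toBatches` and `process` helpers, and the use of arrival timestamps are all assumptions made for illustration.

```scala
// Plain-Scala model of micro-batching; none of these names come from Spark.
object MicroBatchSketch {
  // Hypothetical event: a payload plus its arrival time in milliseconds (assumed >= 0).
  final case class Event(arrivalMs: Long, payload: String)

  // Chop a stream into consecutive batches of `batchMs` milliseconds.
  // Batch k holds the events that arrived in [k * batchMs, (k + 1) * batchMs).
  // Intervals without any event produce no batch in this simplified model.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.arrivalMs / batchMs)
      .toSeq
      .sortBy(_._1)

  // Run the same deterministic job on every batch and return one result per batch.
  def process[A](batches: Seq[(Long, Seq[Event])])(job: Seq[Event] => A): Seq[(Long, A)] =
    batches.map { case (index, batch) => index -> job(batch) }
}
```

Because the job is deterministic and each batch is an immutable collection, rerunning a batch always gives the same result, which is the property the fault-tolerance slides later rely on.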
  • 5. DStream, not just RDD
    Transformations: map(), flatMap(), filter(), count(), repartition(), union(), reduce(), countByValue(), reduceByKey(), join(), cogroup(), transform(), updateStateByKey()
    Output operations: print(), foreachRDD(), saveAsObjectFiles(), saveAsTextFiles(), saveAsHadoopFiles(), saveToCassandra()*
    Window operations: window(), countByWindow(), reduceByWindow(), reduceByKeyAndWindow(), countByValueAndWindow()
    * Provided by the DataStax Cassandra connector
  • 6. Example 1 - DStream to RDD
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    Diagram: Twitter Streaming API → tweets DStream (batch @ t, batch @ t + 1, batch @ t + 2, batch @ t + 3). Each batch is stored in memory as an RDD (immutable, distributed).
  • 7. Example 1 - DStream to RDD relation
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    Diagram: flatMap is applied to every batch of the tweets DStream (batch @ t, batch @ t + 1, batch @ t + 2, batch @ t + 3). It creates new RDDs for each batch, and together they form the new hashTags DStream [#hobbitch, #bilboleggins, …].
  • 8. Example 1 - DStream to RDD
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveToCassandra("keyspace", "tableName")
    Diagram: tweets DStream → flatMap → hashTags DStream [#hobbitch, #bilboleggins, …] → save. Every batch is saved to Cassandra.
  • 9. Example 2 - DStream to RDD relation
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.countByValue()
    Diagram: tweets DStream → flatMap → hashTags → map → reduceByKey → tagCounts [(#hobbitch, 10), (#bilboleggins, 34), …]. Each step is applied to every batch.
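The diagram decomposes countByValue() into a map step followed by a reduceByKey step on each batch. A minimal plain-Scala model of those two steps on a single batch might look like the following; the helper names mirror the diagram labels but are written here for illustration and are not the Spark methods themselves.

```scala
// Plain-Scala model of countByValue on one batch; not the Spark API.
object CountByValueSketch {
  // Step 1 (map): pair every tag with a count of 1.
  def toPairs(tags: Seq[String]): Seq[(String, Long)] =
    tags.map(tag => (tag, 1L))

  // Step 2 (reduceByKey): sum the counts that share the same key.
  def reduceByKey(pairs: Seq[(String, Long)]): Map[String, Long] =
    pairs.foldLeft(Map.empty[String, Long]) { case (acc, (tag, n)) =>
      acc.updated(tag, acc.getOrElse(tag, 0L) + n)
    }

  // countByValue is the composition of the two steps.
  def countByValue(tags: Seq[String]): Map[String, Long] =
    reduceByKey(toPairs(tags))
}
```

For instance, `CountByValueSketch.countByValue(Seq("#a", "#b", "#a"))` counts `#a` twice and `#b` once.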
  • 10. Example 3 - Count the hash tags over the last 10 minutes
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
    window(...) is the sliding window operation: Minutes(10) is the window length and Seconds(1) is the sliding interval.
  • 11. Example 3 - Count the hash tags over the last 10 minutes
    val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
    Diagram: a sliding window moves over the hashTags batches t-1, t, t+1, t+2, t+3. The count is taken over all data in the window.
  • 12. Example 4 - Count hash tags over the last 10 minutes smartly
    val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
    Diagram: as the sliding window moves over the hashTags batches t-1, t, t+1, t+2, t+3, add the count of the new batch entering the window (+) and subtract the count of the batch leaving the window (-).
    A generalization of this smart window reduce exists: reduceByKeyAndWindow(reduce, inverseReduce, window, interval)
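The difference between recounting the whole window and updating it incrementally can be sketched in plain Scala. This is a model under simplifying assumptions, not Spark code: the window length is expressed as a number of batches, the window slides by one batch at a time, and `add` and `subtract` stand in for the reduce and inverse-reduce functions named on the slide.

```scala
// Plain-Scala model of naive versus incremental window counting; not the Spark API.
object SlidingWindowSketch {
  type Counts = Map[String, Long]

  // Count the tags of a single batch.
  def countBatch(tags: Seq[String]): Counts =
    tags.groupBy(identity).map { case (tag, occurrences) => tag -> occurrences.size.toLong }

  // Reduce: merge the counts of the batch entering the window.
  def add(window: Counts, batch: Counts): Counts =
    batch.foldLeft(window) { case (acc, (tag, n)) =>
      acc.updated(tag, acc.getOrElse(tag, 0L) + n)
    }

  // Inverse reduce: remove the counts of the batch leaving the window,
  // dropping tags whose count falls to zero.
  def subtract(window: Counts, batch: Counts): Counts =
    batch.foldLeft(window) { case (acc, (tag, n)) =>
      val remaining = acc.getOrElse(tag, 0L) - n
      if (remaining > 0) acc.updated(tag, remaining) else acc - tag
    }

  // Naive: for every slide, recount all batches currently in the window.
  def naive(batches: Seq[Seq[String]], windowLen: Int): Seq[Counts] =
    batches.indices.map { i =>
      countBatch(batches.slice(math.max(0, i - windowLen + 1), i + 1).flatten)
    }

  // Incremental: previous window + newest batch - batch that just fell out.
  def incremental(batches: Seq[Seq[String]], windowLen: Int): Seq[Counts] = {
    val perBatch = batches.map(countBatch)
    perBatch.indices.scanLeft(Map.empty[String, Long]) { (window, i) =>
      val grown = add(window, perBatch(i))
      if (i >= windowLen) subtract(grown, perBatch(i - windowLen)) else grown
    }.tail
  }
}
```

Both functions return the same counts for every slide, but the incremental version touches only two batches per slide instead of the whole window, which matters when a 10-minute window slides every second.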
  • 13. Architecture
    ❖ Receivers divide data into mini-batches
    ❖ The batch size is defined in milliseconds (best practice is greater than 500 milliseconds)
    Diagram: input streams → receivers (inside Spark Streaming) → batches of input RDDs → Spark engine → batches of output RDDs
  • 14. Fault-tolerance
    ❖ RDDs are not generated from a fault-tolerant source
    ❖ Data is replicated among worker nodes (default replication factor of 2)
    ❖ Stateful jobs should use checkpoints
    ❖ Journaling, as in a database, can be activated
    Diagram: tweets RDD → flatMap → hashTags RDD. Input data is replicated in memory; lost partitions are recomputed on other workers.
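The recovery path in the diagram (replicated input plus recomputation of lost partitions) can be illustrated with a small plain-Scala model. Everything here is an assumption made for the sketch: partitions are entries in a `Map`, `extractTags` plays the role of the flatMap in the diagram, and a worker failure is simulated by removing entries from the output map.

```scala
// Plain-Scala model of recomputing lost partitions from replicated input; not the Spark API.
object LineageSketch {
  // Replicated input: partition id -> records. Assumed to survive a single worker failure.
  type Input = Map[Int, Seq[String]]

  // The deterministic transformation applied to each partition (here: extract hashtags).
  def extractTags(records: Seq[String]): Seq[String] =
    records.flatMap(_.split(" ").toSeq).filter(_.startsWith("#"))

  // Materialise every output partition from the input.
  def computeAll(input: Input): Map[Int, Seq[String]] =
    input.map { case (id, records) => id -> extractTags(records) }

  // After a failure, recompute only the missing output partitions and keep the rest.
  def recover(input: Input, surviving: Map[Int, Seq[String]]): Map[Int, Seq[String]] =
    input.keys.foldLeft(surviving) { (acc, id) =>
      if (acc.contains(id)) acc else acc.updated(id, extractTags(input(id)))
    }
}
```

Since the transformation is deterministic, the recovered partitions are identical to the ones that were lost, so downstream results do not change.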
  • 15. Fault-tolerance
    ❖ Two kinds of data to recover in the event of failure:
      • Data received and replicated: this data survives the failure of a single worker node, since a copy of it exists on one of the other nodes.
      • Data received but buffered for replication: as this data is not replicated, the only way to recover it is to get it from the source again.
  • 16. Fault-tolerance
    ❖ Two receiver semantics:
      • Reliable receiver: acknowledges only after the received data is replicated. If the receiver fails, buffered data is not acknowledged to the source. When the receiver is restarted, the source resends the data, so no data is lost due to the failure.
      • Unreliable receiver: such receivers can lose data when they fail due to worker or driver failures.
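The two receiver semantics can be contrasted with a toy simulation in plain Scala. The `Source` class and `receive` function are hypothetical and exist only for this sketch. In particular, the unreliable case is modelled as the source forgetting data as soon as it has been handed over, which is one simple way to represent a source that cannot resend.

```scala
// Toy simulation of reliable versus unreliable receivers; not the Spark API.
object ReliableReceiverSketch {
  // Hypothetical source that keeps every record until it is acknowledged.
  final class Source(records: Seq[String]) {
    private var unacked: Vector[String] = records.toVector
    // Everything not yet acknowledged is (re)sent on each fetch.
    def pending: Seq[String] = unacked
    def ack(received: Seq[String]): Unit = {
      unacked = unacked.filterNot(r => received.contains(r))
    }
  }

  // One receive attempt. Returns the records that reached replicated storage.
  // `crashBeforeReplication` models a receiver failure while data is still buffered.
  def receive(source: Source, reliable: Boolean, crashBeforeReplication: Boolean): Seq[String] = {
    val buffered = source.pending
    // Unreliable model: the source drops the data as soon as it has been handed over.
    if (!reliable) source.ack(buffered)
    if (crashBeforeReplication) {
      Seq.empty // the buffer is lost together with the receiver
    } else {
      // Reliable model: acknowledge only after replication has succeeded.
      if (reliable) source.ack(buffered)
      buffered
    }
  }
}
```

With a reliable receiver, a crash followed by a restart still delivers every record, because the unacknowledged data is sent again. With the unreliable model, the records buffered at the time of the crash are gone.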
  • 17. Fault-tolerance
    Deployment scenario | Receiver failure | Driver failure
    Without write-ahead log | Buffered data lost with unreliable receivers; zero data loss with reliable receivers and files | Buffered data lost with unreliable receivers; past data lost with all receivers; zero data loss with files
    With write-ahead log | Zero data loss with receivers and files | Zero data loss with receivers and files
  • 18. Why Spark Streaming? We have Storm
  • 19. One model to rule them all
    ❖ Same model for offline AND online processing
    ❖ Common code base for offline AND online processing
    ❖ Fewer bugs due to duplication
    ❖ Fewer bugs caused by framework differences
    ❖ Increased developer productivity
  • 20. One stack to rule them all
    ❖ Explore data interactively using the Spark shell to identify the problem
    ❖ Use the same code in Spark standalone to identify the problem in the production environment
    ❖ Use similar code in Spark Streaming to monitor the problem online
    Spark shell:
      $ ./spark-shell
      scala> val file = sc.hadoopFile("smallLogs")
      ...
      scala> val filtered = file.filter(_.contains("ERROR"))
      ...
      scala> va
    Standalone job:
      object ProcessProductionData {
        def main(args: Array[String]) {
          val sc = new SparkContext(...)
          val file = sc.hadoopFile("productionLogs")
          val filtered = file.filter(_.contains("ERROR"))
          val mapped = filtered.map(...)
          ...
        }
      }
    Streaming job:
      object ProcessLiveStream {
        def main(args: Array[String]) {
          val sc = new StreamingContext(...)
          val stream = sc.kafkaStream(...)
          val filtered = stream.filter(_.contains("ERROR"))
          val mapped = filtered.map(...)
          ...
        }
      }
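The point of the three snippets is that the filtering logic is written once and reused offline and online. A plain-Scala sketch of that idea, with no Spark dependency, is shown here. The log format (error code as the last space-separated token) and all function names are assumptions for illustration only.

```scala
// Plain-Scala model of sharing one piece of logic between batch and streaming; not the Spark API.
object SharedLogicSketch {
  // The logic is written once, against a plain sequence of log lines.
  // Assumed log format: the error code is the last space-separated token.
  def errorCodes(lines: Seq[String]): Seq[String] =
    lines.filter(_.contains("ERROR")).map(_.split(" ").last)

  // Offline: apply it to all lines at once.
  def offline(allLines: Seq[String]): Seq[String] =
    errorCodes(allLines)

  // Online: apply the very same function to each micro-batch as it arrives.
  def online(batches: Seq[Seq[String]]): Seq[Seq[String]] =
    batches.map(errorCodes)
}
```

Because both paths call the same function, concatenating the per-batch results gives exactly what the offline run produces on the same lines.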
  • 21. Performance
    ❖ Higher throughput than Storm
      • Spark Streaming: 670k records/second/node
      • Storm: 115k records/second/node
    Charts (Grep and WordCount): throughput per node (MB/s) versus record size (100 and 1000 bytes), Spark versus Storm
    Tested with 100 EC2 instances with 4 cores each. Comparison taken from Tathagata Das and Reynold Xin's Hadoop Summit 2013 presentation.
  • 25. Monitoring
    In addition, the StreamingListener interface provides information at various levels (application, job, task, etc.).
  • 27. Utilization
    ❖ Spark 1.2 introduces dynamic cluster resource allocation
    ❖ Jobs can request more resources and release resources
    ❖ Available only on YARN