Four Things to Know about
Reliable Spark Streaming
Dean Wampler, Typesafe
Tathagata Das, Databricks
Agenda for today
• The Stream Processing Landscape
• How Spark Streaming Works - A Quick Overview
• Features in Spark Streaming that Help Prevent Data Loss
• Design Tips for Successful Streaming Applications
The Stream Processing Landscape
Stream Processors
Stream Storage
Stream Sources (e.g., MQTT)
How Spark Streaming Works:
A Quick Overview
Spark Streaming
Scalable, fault-tolerant stream processing system
Sources: Kafka, Flume, Kinesis, Twitter, HDFS/S3
Sinks: file systems, databases, dashboards
High-level API: joins, windows, … often 5x less code
Fault-tolerant: exactly-once semantics, even for stateful ops
Integration: integrates with MLlib, SQL, DataFrames, GraphX
Spark Streaming
Receivers receive data streams and chop them up into batches.
Spark processes the batches and pushes out the results.
[Diagram: data streams → receivers → batches → results]
Word Count with Kafka
val context = new StreamingContext(conf, Seconds(1))  // entry point of streaming functionality
val lines = KafkaUtils.createStream(context, ...)     // create DStream from Kafka data
Word Count with Kafka
val context = new StreamingContext(conf, Seconds(1))
val lines = KafkaUtils.createStream(context, ...)
val words = lines.flatMap(_.split(" "))               // split lines into words
Word Count with Kafka
val context = new StreamingContext(conf, Seconds(1))
val lines = KafkaUtils.createStream(context, ...)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1))
                      .reduceByKey(_ + _)             // count the words
wordCounts.print()                                    // print some counts on screen
context.start()                                       // start receiving and transforming the data
Word Count with Kafka
object WordCount {
  def main(args: Array[String]) {
    val context = new StreamingContext(new SparkConf(), Seconds(1))
    val lines = KafkaUtils.createStream(context, ...)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    context.start()
    context.awaitTermination()
  }
}
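For reference, a minimal runnable sketch of the same program with the imports filled in; the Zookeeper quorum, consumer group, and topic map below are assumed placeholders, not values from the talk:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object WordCount {
  def main(args: Array[String]) {
    val context = new StreamingContext(new SparkConf().setAppName("WordCount"), Seconds(1))
    // createStream yields (key, message) pairs; keep just the message text
    val lines = KafkaUtils.createStream(context,
      "zkhost:2181", "wordcount-group", Map("pagevisits" -> 1)).map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    context.start()
    context.awaitTermination()
  }
}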
Features in Spark Streaming
that Help Prevent Data Loss
A Deeper View of
Spark Streaming
Any Spark Application
[Diagram build, slides 16-18:]
User code runs in the Spark Driver process.
The driver launches Spark Executors in the cluster (YARN / Mesos / Spark Standalone).
Tasks are sent to the executors for processing data.
Spark Streaming Application: Receive data
[Diagram build, slides 19-21, over the WordCount code shown above:]
The driver runs receivers as long-running tasks on the executors.
The receiver divides the data stream into blocks and keeps them in memory.
The blocks are also replicated to another executor.
Spark Streaming Application: Process data
[Diagram build, slides 22-23:]
Every batch interval, the driver launches tasks to process the blocks.
The results are pushed out to a data store.
Fault Tolerance
and Reliability
Failures? Why care?
Many streaming applications need zero-data-loss guarantees despite any kind of failure in the system.
At-least-once guarantee – every record is processed at least once
Exactly-once guarantee – every record is processed exactly once
There are different kinds of failures – executor and driver.
Some failures and guarantee requirements need additional configuration and setup.
What if an executor fails?
[Diagram:] If an executor fails, its receiver is lost and all of its blocks are lost.
What if an executor fails?
Tasks and receivers are restarted by Spark automatically; no configuration is needed.
[Diagram:] The receiver is restarted on another executor, and tasks are restarted on the block replicas.
What if the driver fails?
[Diagram:] When the driver fails, all the executors fail. All computation and all received blocks are lost.
How do we recover?
Recovering Driver w/ DStream Checkpointing
DStream Checkpointing: periodically save the DAG of DStreams to fault-tolerant storage.
[Diagram:] The active driver checkpoints this info to HDFS / S3.
Recovering Driver w/ DStream Checkpointing
[Diagram build, slides 30-31:]
The failed driver can be restarted from the checkpoint information.
New executors are launched and the receivers are restarted.
Recovering Driver w/ DStream Checkpointing
1. Configure automatic driver restart (all cluster managers support this)
2. Set a checkpoint directory in an HDFS-compatible file system:
   streamingContext.checkpoint(hdfsDirectory)
3. Slightly restructure the code to use checkpoints for recovery
Configuring Automatic Driver Restart
Spark Standalone – use spark-submit in "cluster" mode with "--supervise"
(see http://spark.apache.org/docs/latest/spark-standalone.html)
YARN – use spark-submit in "cluster" mode
(see the YARN config "yarn.resourcemanager.am.max-attempts")
Mesos – Marathon can restart applications, or use the "--supervise" flag
Restructuring code for Checkpointing
Before – create + setup, then start:

val context = new StreamingContext(...)
val lines = KafkaUtils.createStream(...)
val words = lines.flatMap(...)
...
context.start()

Put all setup code into a function that returns a new StreamingContext:

def creatingFunc(): StreamingContext = {
  val context = new StreamingContext(...)
  val lines = KafkaUtils.createStream(...)
  val words = lines.flatMap(...)
  ...
  context.checkpoint(hdfsDir)
  context  // the function must return the fully configured context
}

Get the context set up from the HDFS dir OR create a new one with the function:

val context = StreamingContext.getOrCreate(
  hdfsDir, creatingFunc)
context.start()
Restructuring code for Checkpointing
StreamingContext.getOrCreate():
  if the HDFS directory has checkpoint info:
    recover the context from that info
  else:
    call creatingFunc() to create and set up a new context
The restarted process can figure out whether or not to recover using checkpoint info.
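Putting slides 34-37 together, a minimal end-to-end sketch of the pattern; the checkpoint path is an assumed placeholder, and the DStream setup is elided as on the slides:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ResilientWordCount {
  val hdfsDir = "hdfs://namenode:8020/checkpoints/wordcount"  // assumed path

  def creatingFunc(): StreamingContext = {
    val context = new StreamingContext(new SparkConf().setAppName("WordCount"), Seconds(1))
    context.checkpoint(hdfsDir)
    // ... build the DStreams (createStream, flatMap, ...) here, before returning ...
    context  // return the fully configured context
  }

  def main(args: Array[String]) {
    // Recovers from the checkpoint if one exists, else builds a fresh context
    val context = StreamingContext.getOrCreate(hdfsDir, creatingFunc)
    context.start()
    context.awaitTermination()
  }
}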
Received blocks lost on restart!
[Diagram:] After the driver is restarted and new executors are launched, there are no blocks: the in-memory blocks of buffered data are lost on driver restart.
Recovering data with Write Ahead Logs
Write Ahead Log (WAL): synchronously save received data to fault-tolerant storage.
[Diagram:] As the receiver buffers blocks in memory, the blocks are also saved to HDFS.
Recovering data with Write Ahead Logs
[Diagram:] After restart, the blocks are recovered from the Write Ahead Log.
Recovering data with Write Ahead Logs
1. Enable checkpointing; the logs are written in the checkpoint directory
2. Enable the WAL in the SparkConf configuration:
   sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
3. The receiver should also be reliable:
   Acknowledge the source only after data is saved to the WAL
   Unacked data will be replayed from the source by the restarted receiver
4. Disable in-memory replication (already replicated by HDFS):
   Use StorageLevel.MEMORY_AND_DISK_SER for input DStreams
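A minimal sketch combining the four steps above; the checkpoint path and Kafka arguments are assumed placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("WordCount")
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")  // step 2
val context = new StreamingContext(sparkConf, Seconds(1))
context.checkpoint("hdfs://namenode:8020/checkpoints/wordcount")        // step 1 (assumed path)
val lines = KafkaUtils.createStream(context,
  "zkhost:2181", "wordcount-group", Map("pagevisits" -> 1),             // assumed Kafka arguments
  StorageLevel.MEMORY_AND_DISK_SER)                                     // step 4: no in-memory replication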
RDD Checkpointing
• Stateful stream processing can lead to long RDD lineages
• Long lineage = bad for fault-tolerance, too much recomputation
• RDD checkpointing saves RDD data to the fault-tolerant storage to limit lineage and recomputation
• More: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
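As an example of the stateful case, a running word count keeps a growing lineage unless its state RDDs are checkpointed. A sketch, assuming the wordCounts pair DStream from earlier and an example 10-second interval:

val runningCounts = wordCounts.updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
}
// Persist the state RDDs periodically to truncate the lineage
// (requires context.checkpoint(...) to be set, as above)
runningCounts.checkpoint(Seconds(10))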
Fault-tolerance Semantics
Zero data loss = every stage processes each event at least once despite any failure.
[Pipeline diagram: Sources → Receiving → Transforming → Outputting → Sinks]
Fault-tolerance Semantics
[Built up over slides 44-46:]
Receiving:    at least once, w/ checkpointing + WAL + reliable receivers
Transforming: exactly once, as long as received data is not lost
Outputting:   exactly once, if outputs are idempotent or transactional
End-to-end semantics: at-least once
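A sketch of what an idempotent output can look like, again assuming the wordCounts DStream from earlier; the batch-time-based path convention is an assumed example, so a replayed batch targets the same location and duplicates are detectable rather than silently appended:

wordCounts.foreachRDD { (rdd, time) =>
  // Deterministic, batch-time-based output path per batch
  rdd.saveAsTextFile(s"hdfs://namenode:8020/counts/batch-${time.milliseconds}")
}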
Fault-tolerance Semantics
Exactly-once receiving with the new Kafka Direct approach:
Treats Kafka like a replicated log, reads it like a file
Does not use receivers
No need to create multiple DStreams and union them
No need to enable Write Ahead Logs

val directKafkaStream = KafkaUtils.createDirectStream(...)

https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
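A sketch of the direct approach with the type parameters spelled out; the broker list and topic are assumed placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  context, kafkaParams, Set("pagevisits"))
val lines = directKafkaStream.map(_._2)  // values of the (key, message) pairs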
Fault-tolerance Semantics
With the Kafka Direct approach:
Receiving:    exactly once
Transforming: exactly once, as long as received data is not lost
Outputting:   exactly once, if outputs are idempotent or transactional
End-to-end semantics: exactly once!
Design Tips for Successful
Streaming Applications
Areas for consideration
• Enhance resilience with additional components.
• Mini-batch vs. per-message handling.
• Exploit Reactive Streams.
Mini-batch vs. per-message handling
• Use Storm, Akka, Samza, etc. for handling individual messages, especially with sub-second latency requirements.
• Use Spark Streaming's mini-batch model for the Lambda architecture and highly-scalable analytics.
Enhance Resiliency with Additional Components
• Consider Kafka or Kinesis for resilient buffering in front of Spark Streaming.
  • Buffer for traffic spikes.
  • Re-retrieval of data if an RDD partition is lost and must be reconstructed from the source.
• Going to store the raw data anyway?
  • Do it first, then ingest to Spark from that storage.
Exploit Reactive Streams
• Spark Streaming v1.5 will have support for back pressure, to more easily build end-to-end reactive applications.
• Backpressure from consumer to producer:
  • Prevents buffer overflows.
  • Avoids unnecessary throttling.
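For reference, this support landed in Spark 1.5 behind a single configuration property; a one-line sketch, assuming a SparkConf named sparkConf as in the earlier sketches:

sparkConf.set("spark.streaming.backpressure.enabled", "true")  // let Spark adapt ingestion rates dynamically (1.5+)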
Exploit Reactive Streams
• On Spark Streaming v1.4? Buffer with Akka Streams:
[Diagram: Event Source → Akka App → Spark Streaming App, with events flowing forward and feedback (back pressure) flowing upstream]
Exploit Reactive Streams
• Spark Streaming v1.4 has a rate-limit property:
  • spark.streaming.receiver.maxRate
  • Consider setting it for long-running streaming apps with a variable input flow rate.
• Have a graph of Reactive Streams? Consider using an Akka app to buffer the data fed to Spark Streaming over a socket (until 1.5…).
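For example, capping each receiver at 10,000 records per second (the value is an assumed example), again on the sparkConf from the earlier sketches:

sparkConf.set("spark.streaming.receiver.maxRate", "10000")  // records/sec per receiver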
Thank you!
Dean Wampler, Typesafe
Tathagata Das, Databricks
What to do next?
Databricks is available as a hosted platform on AWS with a monthly subscription. Start with a free trial.
Typesafe now offers certified support for Spark, Mesos & DCOS. Read more about it.
REACTIVE PLATFORM
Full Lifecycle Support for Play, Akka, Scala and Spark
Give your project a boost with Reactive Platform:
• Monitor Message-Driven Apps
• Resolve Network Partitions Decisively
• Integrate Easily with Legacy Systems
• Eliminate Incompatibility & Security Risks
• Protect Apps Against Abuse
• Expert Support from Dedicated Product Teams
Enjoy learning? See about the availability of on-site training for Scala, Akka, Play and Spark!
Learn more about our offers. CONTACT US

Speaker notes
  1. You’ve heard from TD about Streaming internals and how to make your Spark Streaming apps resilient. Let’s expand the view a bit with additional tips.