SlideShare a Scribd company logo
1 of 79
Download to read offline
Top 5 mistakes when
writing a streaming app
Mark Grover (@mark_grover) - Software Engineer, Cloudera
Ted Malaska (@TedMalaska) - Group Technical Architect, Blizzard
tiny.cloudera.com/streaming-app
Book
•  Book signing, at 1pm today
–  Cloudera booth
•  hadooparchitecturebook.com
•  @hadooparchbook
•  github.com/hadooparchitecturebook
•  slideshare.com/hadooparchbook
Agenda
•  5 mistakes when writing streaming applications
#5 – Not monitoring and
managing jobs
Batch systems
Batch
processi
ng
HDFS/
S3
HDFS/
S3
Managing and monitoring batch
systems
Batch
processi
ng
HDFS/
S3
HDFS/
S3
●  Use Cron/Oozie/Azkaban/Luigi for orchestration
●  Validation logic in the job (e.g. if input dir is empty)
●  Logs aggregated at end of the job
○  Track times against previous runs, etc.
●  “Automatic” orchestration (micro-batching)
●  Long running driver process
Streaming systems systems
Stream
processi
ngContinuous
stream
Storage (e.g. HBase,
Cassandra, Solr, Kafka,
etc.)
•  YARN
– Doesn’t aggregate logs until job finishes
•  Spark
– Checkpoints can’t survive app or Spark upgrades
– Need to clear checkpoint directory during upgrade
Not originally built for streaming
1. Management
a. Where to run the driver?
b. How to restart the driver automatically if it fails?
c. How to pause the job?
2. Monitoring
a. How to prevent backlog i.e. make sure processing time
< batch interval?
b. How to keep tabs on health of the long running driver
process, etc.?
Big questions remain unanswered
Disclaimer
•  Most discussion that follows corresponds to
YARN but can be applied to other cluster
managers as well.
Recap: (YARN) Client mode
Recap: (YARN) Cluster mode
Recap: (YARN) Client mode
Client dies,
driver dies
i.e. job dies
Recap: (YARN) Cluster mode
Still OK if
the client
dies.
•  Run on YARN Cluster mode
– Driver will continue running when the client machine
goes down
1a. Where to run the driver?
•  Set up automatic restart
s
In spark configuration (e.g. spark-defaults.conf):
spark.yarn.maxAppAttempts=2
spark.yarn.am.attemptFailuresValidityInterval=1h
1b. How to restart driver?
•  If running on Mesos, use Marathon:
•  See “Graceful Shutdown” later for YARN, etc.
1c. How to pause a job?
Image source: https://raw.githubusercontent.com/mhausenblas/elsa/master/doc/elsa-marathon-deploy.png
Suspend
button!
2. Monitoring - Spark Streaming UI
Records
processed
Min rate
(records/
sec)
25th
percentile
50th
percentile
(Median)
75th
percentile
Max rate
(records/
sec)
Processing time,
scheduling delay,
total delay, etc.
Base image from: http://i.imgur.com/1ooDGhm.png
2. Monitoring - But I want more!
•  Spark has a configurable metrics system
–  Based on Dropwizard Metrics Library
•  Use Graphite/Grafana to dashboard metrics like:
–  Number of records processed
–  Memory/GC usage on driver and executors
–  Total delay
–  Utilization of the cluster
Summary - Managing and
monitoring
•  Afterthought
• But it’s possible
• In Structured Streaming, use
StreamingQueryListener (starting Apache
Spark 2.1)
•  Spark Streaming on YARN by Marcin Kuthan
http://mkuthan.github.io/blog/2016/09/30/spark-
streaming-on-yarn/
•  Marathon
https://mesosphere.github.io/marathon/
References
#4 – Not considering data
loss
Prevent data loss
•  Ok, my driver is automatically restarting
•  Can I lose data in between driver restarts?
No data loss!
As long as you do things the right way
How to do things the right way
Stream
processi
ng1. File
2. Receiver based source
(Flume, Kafka)
3. Kafka direct stream
Storage (e.g. HBase,
Cassandra, Solr, Kafka,
etc.)
1. File sources
Stream
processi
ng
Storage (e.g. HBase,
Cassandra, Solr, Kafka,
etc.)
HDFS/
S3
FileStre
am
1. File sources
Stream
processi
ng
Storage (e.g. HBase,
Cassandra, Solr, Kafka,
etc.)
HDFS/
S3
FileStre
am
●  Use checkpointing!
Checkpointing
// new context
val ssc = new StreamingContext(...)
...
// set checkpoint directory
ssc.checkpoint(checkpointDirectory)
What is checkpointing
• Metadata checkpointing (Configuration,
Incomplete batches, etc.)
– Recover from driver failures
• Data checkpointing
– Trim lineage when doing stateful transformations (e.g.
updateStateByKey)
•  Spark streaming checkpoints both data and
metadata
Checkpointing gotchas
• Checkpoints don’t work across app or Spark
upgrades
• Clear out (or change) checkpointing directory
across upgrades
Ted’s Rant
Spark
with
while
loop
Storage (e.g. HBase,
Cassandra, Solr, Kafka,
etc.)
HDFS/
S3
FileStre
am
Landin
g
In
Proces
s
Finishe
d
2. Receiver based sources
Image source: http://static.oschina.net/uploads/space/2015/0618/110032_9Fvp_1410765.png
Receiver based sources
• Enable checkpointing, AND
• Enable Write Ahead Log (WAL)
– Set the following property to true
spark.streaming.receiver.writeAheadLog.en
able
– Default is false!
Why do we need a WAL?
•  Data on Receiver is stored in executor memory
• Without WAL, a failure in the middle of operation
can lead to data loss
• With WAL, data is written to durable storage
(e.g. HDFS, S3) before being acknowledged
back to source.
WAL gotchas!
•  Makes a copy of all data on disk
•  Use StorageLevel.MEMORY_AND_DISK_SER storage
level for your DStream
– No need to replicate the data in memory across
Spark executors
•  For S3
spark.streaming.receiver.writeAheadLog.clo
seFileAfterWrite to true (similar for driver)
But what about Kafka?
•  Kafka already stores a copy of the data
•  Why store another copy in WAL?
●  Use “direct” connector
●  No need for a WAL with the direct connector!
3. Spark Streaming with Kafka
Stream
processi
ngContinuous
stream
Storage (e.g. HBase,
Cassandra, Solr, Kafka,
etc.)
Kafka with WAL
Image source:
https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-29-at-10.11.42-PM.png
Kafka direct connector
(without WAL)
Image source:
https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-29-at-10.14.11-PM.png
Why no WAL?
• No receiver process creating blocks
• Data is stored in Kafka, can be directly
recovered from there
Direct connector gotchas
•  Need to track offsets for driver recovery
•  Checkpoints?
–  No! Not recoverable across upgrades
•  Track them yourself
–  In ZooKeeper, HDFS, or a database
•  For accuracy
–  Processing needs to be idempotent, OR
–  Update offsets in the same transaction when updating results
Summary - prevent data loss
• Different for different sources
• Preventable if done right
• In Structured Streaming, state is stored in
memory (backed by HDFS/S3 WAL), starting
Spark 2.1
References
•  Improved fault-tolerance and zero data loss
https://databricks.com/blog/2015/01/15/improved-
driver-fault-tolerance-and-zero-data-loss-in-spark-
streaming.html
•  Tracking offsets when using direct connector
http://blog.cloudera.com/blog/2015/03/exactly-once-
spark-streaming-from-apache-kafka/
#3 – Use streaming for
everything
When do you use streaming?
When do you use streaming?
ALWAYS!
Do I need to use Spark Streaming
•  Think about your goals
Types of goals
•  Atomic Enrichment
•  Notifications
•  Joining
•  Partitioning
•  Ingestion
•  Counting
•  Incremental machine learning
•  Progressive Analysis
Types of goals
•  Atomic enrichment
–  No cache/context needed
Types of goals
•  Notifications
– NRT
– Atomic
– With Context
– With Global Summary
#2 - Partitioned cache data
Data is partitioned
based on field(s) and
then cached
#3 - External fetch
Data fetched from
external system
Types of goals
•  Joining
– Big Joins
– Broadcast Joins
– Streaming Joins
Types of goals
•  !!Ted to add a diagram!!
•  Partitioning
– Fan out
– Fan in
– Shuffling for key isolation
Types of goals
•  Ingestion
– File Systems or Block Stores
– No SQLs
– Lucene
– Kafka
Types of goals
•  Counting
– Streaming counting
– Count on query
– Aggregation on query
Types of goals
•  Machine Learning
Types of goals
•  Progressive Analysis
– Reading the stream
– SQL on the stream
Summary - Do I need to use
streaming?
•  Spark Streaming is great for
– Accurate counts
– Windowing aggregations
– Progressive Analysis
– Continuous Machine Learning
#2 – Assuming everything
adds up perfectly
(exactly-once)
Exactly once semantics
•  No duplicate records
•  Perfect counting
In the days of Lambda
Spark Streaming State
DStream
DStream
DStream
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count
Push inserts and
Updates to Kudu
Source Receiver
RDD
partitions
RDD
Parition
RDD
Single Pass
Filter Count
Pre-first
Batch
First
Batch
Second
Batch
Stateful RDD 1
Push inserts and
Updates to Kudu
Stateful RDD 2
Stateful RDD 1
Initial
Stateful RDD 1
Normal Spark
KuduInputFormat
Spark
Streaming
Normal
Spark
State within
Things can go wrong in many
places
Source Sink Dist. log Consumer
Spark
Streaming
OpenTSDB
KairosDB
Things can go wrong in many
places
Source Sink Dist. log Consumer
Spark
Streaming
OpenTSDB
KairosDB
Possible
Double
Possible
Resubmission
Need to manage
offsets
State needs to be
in sync with
batches
Always puts, no
increments
Always put,
must be resilient
Restarts needs to
get full state
No one else
writing to stream
aggregated fields
Seq Numbers
can help here
Consider large divergence
in event time retrieval
Summary - Exactly once semantics
•  Seq number tied to sources
•  Puts for external storage system
•  Consider large divergence in event time
retrieval
– Increase divergence window
– Retrieve state from external storage system *
– Ignore
– Process off line
– Source aggregation (pre-distributed log)
#1 – Not shutting down your
streaming app gracefully
•  Can we define graceful?
– Offsets known
– State stored externally
– Stopping at the right place (i.e. batch finished)
Graceful shutdown
How to be graceful?
•  Thread hooks
–  Check for an external flag every N seconds
/**
* Stop the execution of the streams, with option of ensuring all received data
* has been processed.
*
* @param stopSparkContext if true, stops the associated SparkContext. The underlying SparkContext
* will be stopped regardless of whether this StreamingContext has been
* started.
* @param stopGracefully if true, stops gracefully by waiting for the processing of all
* received data to be completed
*/
def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit = {
Under the hood of grace
receiverTracker.stop(processAllReceivedData) //default is to wait 10 second, grace waits until done
jobGenerator.stop(processAllReceivedData) // Will use spark.streaming.gracefulStopTimeout
jobExecutor.shutdown()
val terminated = if (processAllReceivedData) {
jobExecutor.awaitTermination(1, TimeUnit.HOURS) // just a very large period of time
} else {
jobExecutor.awaitTermination(2, TimeUnit.SECONDS)
}
if (!terminated) {
jobExecutor.shutdownNow()
}
How to be graceful?
•  cmd line
–  $SPARK_HOME_DIR/bin/spark-submit --master $MASTER_REST_URL --kill $DRIVER_ID
–  spark.streaming.stopGracefullyOnShutdown=true
private def stopOnShutdown(): Unit = {
val stopGracefully = conf.getBoolean("spark.streaming.stopGracefullyOnShutdown", false)
logInfo(s"Invoking stop(stopGracefully=$stopGracefully) from shutdown hook")
// Do not stop SparkContext, let its own shutdown hook stop it
stop(stopSparkContext = false, stopGracefully = stopGracefully)
}
How to be graceful?
•  By marker file
– Touch a file when starting the app on HDFS
– Remove the file when you want to stop
– Separate thread in Spark app, calls
streamingContext.stop(stopSparkContext =
true, stopGracefully = true)
•  Externally in ZK, HDFS, HBase, Database
•  Recover on restart
Storing offsets
Summary - Graceful shutdown
•  Use provided command line, or
•  Background thread with a marker file
Conclusion
Conclusion
•  #5 – How to monitor and manage jobs?
•  #4 – How to prevent data loss?
•  #3 – Do I need to use Spark Streaming?
•  #2 – How to achieve exactly/effectively once
semantics?
•  #1 – How to gracefully shutdown your app?
Book signing
•  Cloudera booth, right after this talk
Thank You.
tiny.cloudera.com/streaming-app

More Related Content

What's hot

Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014hadooparchbook
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoophadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applicationshadooparchbook
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialhadooparchbook
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Grouphadooparchbook
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platformhadooparchbook
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platformhadooparchbook
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopDataWorks Summit
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRmarkgrover
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoopmarkgrover
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneDouglas Moore
 

What's hot (20)

Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Group
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platform
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MR
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoop
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIne
 

Similar to Top 5 mistakes when writing Streaming applications

What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine LearningChris Fregly
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesAlexis Seigneurin
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanSpark Summit
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Yuval Itzchakov
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Chris Fregly
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 

Similar to Top 5 mistakes when writing Streaming applications (20)

What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 

More from hadooparchbook

Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationshadooparchbook
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
 

More from hadooparchbook (6)

Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 

Recently uploaded

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Recently uploaded (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Top 5 mistakes when writing Streaming applications

  • 1. Top 5 mistakes when writing a streaming app Mark Grover (@mark_grover) - Software Engineer, Cloudera Ted Malaska (@TedMalaska) - Group Technical Architect, Blizzard tiny.cloudera.com/streaming-app
  • 2. Book •  Book signing, at 1pm today –  Cloudera booth •  hadooparchitecturebook.com •  @hadooparchbook •  github.com/hadooparchitecturebook •  slideshare.com/hadooparchbook
  • 3. Agenda •  5 mistakes when writing streaming applications
  • 4. #5 – Not monitoring and managing jobs
  • 6. Managing and monitoring batch systems Batch processi ng HDFS/ S3 HDFS/ S3 ●  Use Cron/Oozie/Azkaban/Luigi for orchestration ●  Validation logic in the job (e.g. if input dir is empty) ●  Logs aggregated at end of the job ○  Track times against previous runs, etc.
  • 7. ●  “Automatic” orchestration (micro-batching) ●  Long running driver process Streaming systems systems Stream processi ngContinuous stream Storage (e.g. HBase, Cassandra, Solr, Kafka, etc.)
  • 8. •  YARN – Doesn’t aggregate logs until job finishes •  Spark – Checkpoints can’t survive app or Spark upgrades – Need to clear checkpoint directory during upgrade Not originally built for streaming
  • 9. 1. Management a. Where to run the driver? b. How to restart the driver automatically if it fails? c. How to pause the job? 2. Monitoring a. How to prevent backlog i.e. make sure processing time < batch interval? b. How to keep tabs on health of the long running driver process, etc.? Big questions remain unanswered
  • 10. Disclaimer •  Most discussion that follows corresponds to YARN but can be applied to other cluster managers as well.
  • 13. Recap: (YARN) Client mode Client dies, driver dies i.e. job dies
  • 14. Recap: (YARN) Cluster mode Still OK if the client dies.
  • 15. •  Run on YARN Cluster mode – Driver will continue running when the client machine goes down 1a. Where to run the driver?
  • 16. •  Set up automatic restart s In spark configuration (e.g. spark-defaults.conf): spark.yarn.maxAppAttempts=2 spark.yarn.am.attemptFailuresValidityInterval=1h 1b. How to restart driver?
  • 17. •  If running on Mesos, use Marathon: •  See “Graceful Shutdown” later for YARN, etc. 1c. How to pause a job? Image source: https://raw.githubusercontent.com/mhausenblas/elsa/master/doc/elsa-marathon-deploy.png Suspend button!
  • 18. 2. Monitoring - Spark Streaming UI Records processed Min rate (records/ sec) 25th percentile 50th percentile (Median) 75th percentile Max rate (records/ sec) Processing time, scheduling delay, total delay, etc. Base image from: http://i.imgur.com/1ooDGhm.png
  • 19. 2. Monitoring - But I want more! •  Spark has a configurable metrics system –  Based on Dropwizard Metrics Library •  Use Graphite/Grafana to dashboard metrics like: –  Number of records processed –  Memory/GC usage on driver and executors –  Total delay –  Utilization of the cluster
  • 20. Summary - Managing and monitoring •  Afterthought • But it’s possible • In Structured Streaming, use StreamingQueryListener (starting Apache Spark 2.1)
  • 21. •  Spark Streaming on YARN by Marcin Kuthan http://mkuthan.github.io/blog/2016/09/30/spark- streaming-on-yarn/ •  Marathon https://mesosphere.github.io/marathon/ References
  • 22. #4 – Not considering data loss
  • 23. Prevent data loss •  Ok, my driver is automatically restarting •  Can I lose data in between driver restarts?
  • 24. No data loss! As long as you do things the right way
  • 25. How to do things the right way Stream processi ng1. File 2. Receiver based source (Flume, Kafka) 3. Kafka direct stream Storage (e.g. HBase, Cassandra, Solr, Kafka, etc.)
  • 26. 1. File sources Stream processi ng Storage (e.g. HBase, Cassandra, Solr, Kafka, etc.) HDFS/ S3 FileStre am
  • 27. 1. File sources Stream processi ng Storage (e.g. HBase, Cassandra, Solr, Kafka, etc.) HDFS/ S3 FileStre am ●  Use checkpointing!
  • 28. Checkpointing // new context val ssc = new StreamingContext(...) ... // set checkpoint directory ssc.checkpoint(checkpointDirectory)
  • 29. What is checkpointing • Metadata checkpointing (Configuration, Incomplete batches, etc.) – Recover from driver failures • Data checkpointing – Trim lineage when doing stateful transformations (e.g. updateStateByKey) •  Spark streaming checkpoints both data and metadata
  • 30. Checkpointing gotchas • Checkpoints don’t work across app or Spark upgrades • Clear out (or change) checkpointing directory across upgrades
  • 31. Ted’s Rant Spark with while loop Storage (e.g. HBase, Cassandra, Solr, Kafka, etc.) HDFS/ S3 FileStre am Landin g In Proces s Finishe d
  • 32. 2. Receiver based sources Image source: http://static.oschina.net/uploads/space/2015/0618/110032_9Fvp_1410765.png
  • 33. Receiver based sources • Enable checkpointing, AND • Enable Write Ahead Log (WAL) – Set the following property to true spark.streaming.receiver.writeAheadLog.en able – Default is false!
  • 34. Why do we need a WAL? •  Data on Receiver is stored in executor memory • Without WAL, a failure in the middle of operation can lead to data loss • With WAL, data is written to durable storage (e.g. HDFS, S3) before being acknowledged back to source.
  • 35. WAL gotchas! •  Makes a copy of all data on disk •  Use StorageLevel.MEMORY_AND_DISK_SER storage level for your DStream – No need to replicate the data in memory across Spark executors •  For S3 spark.streaming.receiver.writeAheadLog.clo seFileAfterWrite to true (similar for driver)
  • 36. But what about Kafka? •  Kafka already stores a copy of the data •  Why store another copy in WAL?
  • 37. ●  Use “direct” connector ●  No need for a WAL with the direct connector! 3. Spark Streaming with Kafka Stream processi ngContinuous stream Storage (e.g. HBase, Cassandra, Solr, Kafka, etc.)
  • 38. Kafka with WAL Image source: https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-29-at-10.11.42-PM.png
  • 39. Kafka direct connector (without WAL) Image source: https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-29-at-10.14.11-PM.png
  • 40. Why no WAL? • No receiver process creating blocks • Data is stored in Kafka, can be directly recovered from there
  • 41. Direct connector gotchas •  Need to track offsets for driver recovery •  Checkpoints? –  No! Not recoverable across upgrades •  Track them yourself –  In ZooKeeper, HDFS, or a database •  For accuracy –  Processing needs to be idempotent, OR –  Update offsets in the same transaction when updating results
  • 42. Summary - prevent data loss • Different for different sources • Preventable if done right • In Structured Streaming, state is stored in memory (backed by HDFS/S3 WAL), starting Spark 2.1
  • 43. References •  Improved fault-tolerance and zero data loss https://databricks.com/blog/2015/01/15/improved- driver-fault-tolerance-and-zero-data-loss-in-spark- streaming.html •  Tracking offsets when using direct connector http://blog.cloudera.com/blog/2015/03/exactly-once- spark-streaming-from-apache-kafka/
  • 44. #3 – Use streaming for everything
  • 45. When do you use streaming?
  • 46. When do you use streaming? ALWAYS!
  • 47. Do I need to use Spark Streaming •  Think about your goals
  • 48. Types of goals •  Atomic Enrichment •  Notifications •  Joining •  Partitioning •  Ingestion •  Counting •  Incremental machine learning •  Progressive Analysis
  • 49. Types of goals •  Atomic enrichment –  No cache/context needed
  • 50. Types of goals •  Notifications – NRT – Atomic – With Context – With Global Summary
  • 51. #2 - Partitioned cache data Data is partitioned based on field(s) and then cached
  • 52. #3 - External fetch Data fetched from external system
  • 53. Types of goals •  Joining – Big Joins – Broadcast Joins – Streaming Joins
  • 54. Types of goals •  !!Ted to add a diagram!! •  Partitioning – Fan out – Fan in – Shuffling for key isolation
  • 55. Types of goals •  Ingestion – File Systems or Block Stores – No SQLs – Lucene – Kafka
  • 56. Types of goals •  Counting – Streaming counting – Count on query – Aggregation on query
  • 57. Types of goals •  Machine Learning
  • 58. Types of goals •  Progressive Analysis – Reading the stream – SQL on the stream
  • 59. Summary - Do I need to use streaming? •  Spark Streaming is great for – Accurate counts – Windowing aggregations – Progressive Analysis – Continuous Machine Learning
  • 60. #2 – Assuming everything adds up perfectly (exactly-once)
  • 61. Exactly once semantics •  No duplicate records •  Perfect counting
  • 62. In the days of Lambda
  • 63. Spark Streaming State DStream DStream DStream Single Pass Source Receiver RDD Source Receiver RDD RDD Filter Count Push inserts and Updates to Kudu Source Receiver RDD partitions RDD Parition RDD Single Pass Filter Count Pre-first Batch First Batch Second Batch Stateful RDD 1 Push inserts and Updates to Kudu Stateful RDD 2 Stateful RDD 1 Initial Stateful RDD 1 Normal Spark KuduInputFormat Spark Streaming Normal Spark
  • 65. Things can go wrong in many places Source Sink Dist. log Consumer Spark Streaming OpenTSDB KairosDB
  • 66. Things can go wrong in many places Source Sink Dist. log Consumer Spark Streaming OpenTSDB KairosDB Possible Double Possible Resubmission Need to manage offsets State needs to be in sync with batches Always puts, no increments Always put, must be resilient Restarts needs to get full state No one else writing to stream aggregated fields Seq Numbers can help here Consider large divergence in event time retrieval
  • 67. Summary - Exactly once semantics •  Seq number tied to sources •  Puts for external storage system •  Consider large divergence in event time retrieval – Increase divergence window – Retrieve state from external storage system * – Ignore – Process off line – Source aggregation (pre-distributed log)
  • 68. #1 – Not shutting down your streaming app gracefully
  • 69. •  Can we define graceful? – Offsets known – State stored externally – Stopping at the right place (i.e. batch finished) Graceful shutdown
  • 70. How to be graceful? •  Thread hooks –  Check for an external flag every N seconds /** * Stop the execution of the streams, with option of ensuring all received data * has been processed. * * @param stopSparkContext if true, stops the associated SparkContext. The underlying SparkContext * will be stopped regardless of whether this StreamingContext has been * started. * @param stopGracefully if true, stops gracefully by waiting for the processing of all * received data to be completed */ def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit = {
  • 71. Under the hood of grace receiverTracker.stop(processAllReceivedData) //default is to wait 10 second, grace waits until done jobGenerator.stop(processAllReceivedData) // Will use spark.streaming.gracefulStopTimeout jobExecutor.shutdown() val terminated = if (processAllReceivedData) { jobExecutor.awaitTermination(1, TimeUnit.HOURS) // just a very large period of time } else { jobExecutor.awaitTermination(2, TimeUnit.SECONDS) } if (!terminated) { jobExecutor.shutdownNow() }
  • 72. How to be graceful? •  cmd line –  $SPARK_HOME_DIR/bin/spark-submit --master $MASTER_REST_URL --kill $DRIVER_ID –  spark.streaming.stopGracefullyOnShutdown=true private def stopOnShutdown(): Unit = { val stopGracefully = conf.getBoolean("spark.streaming.stopGracefullyOnShutdown", false) logInfo(s"Invoking stop(stopGracefully=$stopGracefully) from shutdown hook") // Do not stop SparkContext, let its own shutdown hook stop it stop(stopSparkContext = false, stopGracefully = stopGracefully) }
  • 73. How to be graceful? •  By marker file – Touch a file when starting the app on HDFS – Remove the file when you want to stop – Separate thread in Spark app, calls streamingContext.stop(stopSparkContext = true, stopGracefully = true)
  • 74. •  Externally in ZK, HDFS, HBase, Database •  Recover on restart Storing offsets
  • 75. Summary - Graceful shutdown •  Use provided command line, or •  Background thread with a marker file
  • 77. Conclusion •  #5 – How to monitor and manage jobs? •  #4 – How to prevent data loss? •  #3 – Do I need to use Spark Streaming? •  #2 – How to achieve exactly/effectively once semantics? •  #1 – How to gracefully shutdown your app?
  • 78. Book signing •  Cloudera booth, right after this talk