What’s new in
Spark Streaming
Tathagata “TD” Das
Strata NY 2015
@tathadas
Who am I?
Project Management Committee (PMC) member of Spark
Started Spark Streaming in AMPLab, UC Berkeley
Current technical lead of Spark Streaming
Software engineer at Databricks
What is Databricks?
Founded by creators of Spark and remains largest contributor
Offers a hosted service
•  Spark on EC2
•  Notebooks
•  Plot visualizations
•  Cluster management
•  Scheduled jobs
Spark Streaming
Scalable, fault-tolerant stream processing system
Ingests from many sources: Kafka, Flume, Kinesis, Twitter, HDFS/S3
Pushes results to file systems, databases, dashboards
High-level API: joins, windows, … often 5x less code
Fault-tolerant: exactly-once semantics, even for stateful ops
Integration: integrates with MLlib, SQL, DataFrames, GraphX
What can you use it for?
Real-time fraud detection in transactions
React to anomalies in sensors in real-time
Cat videos in tweets as soon as they go viral
Spark Streaming
Receivers receive data streams and chop them up into batches
Spark processes the batches and pushes out the results
(Diagram: data streams → receivers → batches → results)
Word Count with Kafka
val context = new StreamingContext(conf, Seconds(1))   // entry point of streaming functionality
val lines = KafkaUtils.createStream(context, ...)      // create DStream from Kafka data
Word Count with Kafka
val context = new StreamingContext(conf, Seconds(1))
val lines = KafkaUtils.createStream(context, ...)
val words = lines.flatMap(_.split(" "))                // split lines into words
Word Count with Kafka
val context = new StreamingContext(conf, Seconds(1))
val lines = KafkaUtils.createStream(context, ...)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1))                // count the words
                      .reduceByKey(_ + _)
wordCounts.print()                                     // print some counts on screen
context.start()                                        // start receiving and transforming the data
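For reference, a minimal self-contained version of the word count above could look like the sketch below. It assumes the spark-streaming-kafka artifact is on the classpath; the ZooKeeper quorum, consumer group and topic name are placeholder values, not part of the original slides.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount")
    val context = new StreamingContext(conf, Seconds(1))

    // zkQuorum, consumer group and topic map below are placeholders
    val lines = KafkaUtils.createStream(
      context, "localhost:2181", "wordcount-group", Map("words" -> 1)
    ).map(_._2)                                     // keep only the message value

    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    context.start()
    context.awaitTermination()                      // keep the application running
  }
}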
Integrates with Spark Ecosystem
(Diagram: Spark Streaming, Spark SQL + DataFrames, MLlib and GraphX, all on top of Spark Core)
Combine batch and streaming processing
Join data streams with static data sets
// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")

// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset)
          .filter( ... )
}
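To make the join concrete: join on RDDs works on key-value pairs, so both sides need to be keyed. A minimal sketch under that assumption follows; the (userId, event) stream layout, the profiles file path and the filter condition are illustrative, not part of the original example.

// Static lookup table: (userId, profile), loaded once from a placeholder path
val userProfiles = sparkContext.textFile("hdfs:///profiles")
  .map(_.split(","))
  .map(fields => (fields(0), fields(1)))
  .cache()

// kafkaStream is assumed to carry (userId, event) pairs
val enriched = kafkaStream.transform { batchRDD =>
  batchRDD.join(userProfiles)                       // (userId, (event, profile))
          .filter { case (_, (event, _)) => event.nonEmpty }
}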
Combine machine learning with streaming
Learn models offline, apply them online
// Learn model offline
val model = KMeans.train(dataset, ...)

// Apply model online on stream
kafkaStream.map { event =>
  model.predict(event.feature)
}
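A slightly more concrete sketch of the same offline-train / online-predict pattern, assuming the training data is a text file of comma-separated features and that featureStream is a DStream[Vector] already extracted from the Kafka stream (both names are illustrative assumptions):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Learn model offline from a placeholder training file
val trainingData = sparkContext.textFile("hdfs:///training")
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
  .cache()
val model = KMeans.train(trainingData, 10, 20)      // k = 10, 20 iterations

// Apply the model online: featureStream is an assumed DStream[Vector]
val clusterIds = featureStream.map(features => model.predict(features))
clusterIds.print()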
Combine SQL with streaming
Interactively query streaming data with SQL and DataFrames
// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.toDF.registerTempTable("events")
}

// Interactively query table
sqlContext.sql("select * from events")
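A fuller sketch of the same pattern against the Spark 1.x SQL API, assuming each Kafka record is a comma-separated "id,action" string; the Event case class and its field names are made up for illustration.

import org.apache.spark.sql.SQLContext

case class Event(id: String, action: String)

val sqlContext = new SQLContext(sparkContext)
import sqlContext.implicits._

// Register each batch in the stream as a table named "events"
kafkaStream.map(_._2).foreachRDD { batchRDD =>
  val events = batchRDD.map { line =>
    val fields = line.split(",")
    Event(fields(0), fields(1))
  }
  events.toDF().registerTempTable("events")         // replaced on every batch
}

// Interactively query the latest batch, e.g. from a notebook
sqlContext.sql("select action, count(*) from events group by action").show()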
Spark Streaming Adoption
Spark Survey by Databricks
Survey of 1,417 individuals from 842 organizations
56% increase in Spark
Streaming users since 2014
Fastest rising component
in Spark
https://databricks.com/blog/2015/09/24/spark-survey-results-2015-are-now-available.html
Feedback from community
We have learnt a lot from our rapidly growing user base
Most of the development in the last few releases has been driven by community demands
What have we added recently?
Ease of use
Infrastructure
Libraries
Streaming MLlib algorithms
val model = new StreamingKMeans()
  .setK(10)
  .setDecayFactor(1.0)
  .setRandomCenters(4, 0.0)

// Train on one DStream
model.trainOn(trainingDStream)

// Predict on another DStream
model.predictOnValues(
  testDStream.map { lp =>
    (lp.label, lp.features)
  }
).print()
Continuous learning and prediction on
streaming data
StreamingLinearRegression [Spark 1.1]
StreamingKMeans [Spark 1.2]
StreamingLogisticRegression [Spark 1.3]
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
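For the StreamingLinearRegression entry above, the corresponding MLlib class is StreamingLinearRegressionWithSGD. A minimal sketch, assuming trainingDStream and testDStream both carry LabeledPoint records and an illustrative three features:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val numFeatures = 3                                 // illustrative
val regression = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

regression.trainOn(trainingDStream)                 // DStream[LabeledPoint]
regression.predictOnValues(
  testDStream.map(lp => (lp.label, lp.features))    // DStream[(Double, Vector)]
).print()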
Python API Improvements
Added Python API for Streaming ML algos [Spark 1.5]
Added Python API for various data sources
Kafka [Spark 1.3 - 1.5]
Flume, Kinesis, MQTT [Spark 1.5]
lines = KinesisUtils.createStream(streamingContext,
    appName, streamName, endpointUrl, regionName,
    InitialPositionInStream.LATEST, 2)

counts = lines.flatMap(lambda line: line.split(" "))
Ease of use
Infrastructure
Libraries
New Visualizations [Spark 1.4-1.5]
Stats over last 1000 batches
For stability:
Scheduling delay should be approx 0
Processing time should be approx < batch interval
New Visualizations [Spark 1.4-1.5]
Details of individual batches
Kafka offsets processed in each batch, can help in debugging bad data
List of Spark jobs in each batch
New Visualizations [Spark 1.4-1.5]
Full DAG of RDDs and stages generated by Spark Streaming
New Visualizations [Spark 1.4-1.5]
Memory usage of received data
Can be used to understand memory consumption across executors
Ease of use
Infrastructure
Libraries
Zero data loss
System stability
Zero data loss: Two cases
Non-replayable Sources
Sources that do not support replay from any position (e.g. Flume, etc.)
Spark Streaming saves received data to a Write Ahead Log (WAL) and replays data from the WAL on failure
Replayable Sources
Sources that allow data to be replayed from any position (e.g. Kafka, Kinesis, etc.)
Spark Streaming saves only the record identifiers and replays the data back directly from the source
Write Ahead Log (WAL) [Spark 1.3]
Save received data in a WAL in a fault-tolerant file system
(Diagram: receivers on the executors buffer incoming data in memory and write it to a WAL in HDFS; the driver runs the user code and launches tasks to process the received data)
Write Ahead Log (WAL) [Spark 1.3]
Replay unprocessed data from WAL if driver fails and restarts
(Diagram: after the failed driver is restarted, failed tasks are rerun on restarted executors and read the data back from the WAL in HDFS)
Write Ahead Log (WAL) [Spark 1.3]
WAL can be enabled by setting the Spark configuration spark.streaming.receiver.writeAheadLog.enable to true
Should use a reliable receiver, one that ensures data is written to the WAL before acknowledging the source
Reliable receiver + WAL gives an at-least-once guarantee
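A minimal sketch of turning the WAL on; the app name and paths below are placeholders, and checkpointing must be enabled because the WAL files live under the checkpoint directory.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-enabled-app")                    // placeholder app name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs:///checkpoints/wal-enabled-app")   // WAL files are stored under the checkpoint directory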
Kinesis [Spark 1.5]
Save the Kinesis sequence numbers instead of raw data
(Diagram: the executor receives records using the KCL, sequence number ranges are sent to the driver, and the driver saves the ranges to HDFS)
Kinesis [Spark 1.5]
Recover unprocessed data directly from Kinesis using recovered sequence numbers
(Diagram: the restarted driver recovers the sequence number ranges from HDFS; tasks are rerun on the restarted executor and re-read those ranges directly from Kinesis using the AWS SDK)
Kinesis [Spark 1.5]
After any failure, records are either recovered from saved
sequence numbers or replayed via KCL
No need to replicate received data in Spark Streaming
Provides end-to-end at least once guarantee
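The same Kinesis integration in Scala, as a rough sketch against the Spark 1.5 API; the application name, stream name, endpoint and region below are placeholders.

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils

val lines = KinesisUtils.createStream(
  streamingContext, "myKinesisApp", "myStream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.LATEST, Seconds(2), StorageLevel.MEMORY_AND_DISK_2
).map(bytes => new String(bytes, "UTF-8"))          // records arrive as Array[Byte]

val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)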
Kafka [1.3, graduated in 1.5]
A priori, decide the offset ranges to consume in the next batch
(Diagram: every batch interval, the driver fetches the latest offset info for each Kafka partition, decides the offset ranges for the next batch, and saves them to HDFS)
Kafka [1.3, graduated in 1.5]
A priori, decide the offset ranges to consume in the next batch
(Diagram: tasks on the executors read each offset range in parallel, directly from the Kafka brokers)
Direct Kafka API [Spark 1.5]
Does not use receivers, so no need for Spark Streaming to replicate data
Can provide up to 10x higher throughput than the earlier receiver-based approach
https://spark-summit.org/2015/events/towards-benchmarking-modern-distributed-streaming-systems/
Can provide exactly-once semantics
Output operation to external storage should be idempotent or transactional
Can run Spark batch jobs directly on Kafka
# RDD partitions = # Kafka partitions, easy to reason about
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
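A minimal sketch of the direct (receiver-less) API; the broker list and topic name are placeholders, and HasOffsetRanges shows where the per-batch offsets come from.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, Set("events"))

directStream.foreachRDD { rdd =>
  // One RDD partition per Kafka partition; offsets processed in this batch:
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach(r => println(s"${r.topic} ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
}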
System stability
Streaming applications may have to deal with variations in data rates and processing rates
For stability, any streaming application must receive data only as fast as it can process it
Since 1.1, Spark Streaming has allowed setting static limits on the ingestion rates of receivers to guard against spikes
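A sketch of that static limit: cap each receiver at a fixed number of records per second. The number and app name below are illustrative.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("rate-limited-app")
  .set("spark.streaming.receiver.maxRate", "10000")   // max records per second per receiver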
Backpressure [Spark 1.5]
System automatically and dynamically adapts rate limits to ensure stability under any processing conditions
If sinks slow down, then the system automatically pushes back on the source to slow down receiving
(Diagram: sources → receivers → sinks, with backpressure flowing from the sinks back toward the sources)
Backpressure [Spark 1.5]
System uses batch processing times and scheduling delays to set rate limits
Well-known PID controller theory (used in industrial control systems) is used to calculate appropriate rate limits
Contributed by Typesafe
Backpressure [Spark 1.5]
System uses batch processing times and scheduling delays to set rate limits
(Charts: the dynamic rate limit prevents receivers from receiving too fast, and the scheduling delay is kept in check by the rate limits)
Backpressure [Spark 1.5]
Experimental, so disabled by default in Spark 1.5
Enabled by setting Spark configuration spark.streaming.backpressure.enabled to true
Will be enabled by default in future releases
https://issues.apache.org/jira/browse/SPARK-7398
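A sketch of enabling it, per the configuration named above; the app name is a placeholder.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("backpressure-app")
  .set("spark.streaming.backpressure.enabled", "true")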
What’s next?
API and Libraries
Support for operations on event time and out of order data
Most demanded feature from the community
Tighter integration between Streaming and SQL + DataFrames
Helps leverage Project Tungsten
Infrastructure
Add native support for Dynamic Allocation for Streaming
Dynamically scale the cluster resources based on processing load
Will work in collaboration with backpressure to scale up/down while maintaining stability
Note: as of 1.5, the existing Dynamic Allocation is not optimized for streaming
But users can build their own scaling logic using the developer API
sparkContext.requestExecutors(), sparkContext.killExecutors()
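A rough sketch of what custom scaling logic on top of that developer API could look like; the delay threshold, executor count and executor id are purely illustrative, and the measurement of batch scheduling delay is left to the application.

import org.apache.spark.SparkContext

// lastBatchDelayMs would come from your own monitoring of batch scheduling delay
def rescale(sc: SparkContext, lastBatchDelayMs: Long): Unit = {
  if (lastBatchDelayMs > 10000) {
    sc.requestExecutors(2)                          // falling behind: ask for more executors
  } else if (lastBatchDelayMs == 0) {
    sc.killExecutors(Seq("executor-id-to-remove"))  // placeholder executor id
  }
}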
Infrastructure
Higher throughput and lower latency by leveraging
Project Tungsten
Specifically, improved performance of stateful ops
Fastest growing component in the Spark ecosystem
Significant improvements in fault-tolerance, stability,
visualizations and Python API
More community requested features to come
@tathadas

Weitere ähnliche Inhalte

Was ist angesagt?

Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesDatabricks
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0 Databricks
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark StreamingGerard Maas
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development Spark Summit
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabAbhinav Singh
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Databricks
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...CloudxLab
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 

Was ist angesagt? (20)

Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 

Andere mochten auch

Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
Apache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.comApache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.comknowbigdata
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangSpark Summit
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Scala overview
Scala overviewScala overview
Scala overviewSteve Min
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZDataFactZ
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesJen Aman
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataJetlore
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingDatabricks
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanSpark Summit
 
Apache hbase overview (20160427)
Apache hbase overview (20160427)Apache hbase overview (20160427)
Apache hbase overview (20160427)Steve Min
 
[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming OverviewStratio
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Databricks
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 

Andere mochten auch (20)

Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Apache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.comApache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.com
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Scala overview
Scala overviewScala overview
Scala overview
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 
Apache hbase overview (20160427)
Apache hbase overview (20160427)Apache hbase overview (20160427)
Apache hbase overview (20160427)
 
[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 

Ähnlich wie Spark Streaming: What's New in Spark Streaming

Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...DataWorks Summit
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streamingphanleson
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics RevisedSpark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics RevisedMichael Spector
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sigmoid
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafkaDori Waldman
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka Dori Waldman
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
 
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark StreamingReal-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark StreamingAbdelhamide EL ARIB
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R PackagesCraig Warman
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
From Zero to Stream Processing
From Zero to Stream ProcessingFrom Zero to Stream Processing
From Zero to Stream ProcessingEventador
 

Ähnlich wie Spark Streaming: What's New in Spark Streaming (20)

Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics RevisedSpark Streaming Recipes and "Exactly Once" Semantics Revised
Spark Streaming Recipes and "Exactly Once" Semantics Revised
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafka
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark StreamingReal-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
From Zero to Stream Processing
From Zero to Stream ProcessingFrom Zero to Stream Processing
From Zero to Stream Processing
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfmaor17
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 

Kürzlich hochgeladen (20)

Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 

Spark Streaming: What's New in Spark Streaming

  • 1. What’s new in Spark Streaming Tathagata “TD” Das Strata NY 2015 @tathadas
  • 2. Who am I? Project Management Committee (PMC) member of Spark Started Spark Streaming in AMPLab, UC Berkeley Current technical lead of Spark Streaming Software engineer at Databricks 2
  • 3. Founded by creators of Spark and remains largest contributor Offers a hosted service •  Spark on EC2 •  Notebooks •  Plot visualizations •  Cluster management •  Scheduled jobs What is Databricks? 3
  • 4. Spark Streaming Scalable, fault-tolerant stream processing system File systems Databases Dashboards Flume Kinesis HDFS/S3 Kafka Twitter High-level API joins, windows, … often 5x less code Fault-tolerant Exactly-once semantics, even for stateful ops Integration Integrates with MLlib, SQL, DataFrames, GraphX 4
  • 5. What can you use it for? Real-time fraud detection in transactions React to anomalies in sensors in real-time Cat videos in tweets as soon as they go viral 5
  • 6. Spark Streaming Receivers receive data streams and chop them up into batches Spark processes the batches and pushes out the results data streams receivers batches results 6
  • 7. Word Count with Kafka val  context  =  new  StreamingContext(conf,  Seconds(1))   val  lines  =  KafkaUtils.createStream(context,  ...)   entry point of streaming functionality create DStream from Kafka data 7
  • 8. Word Count with Kafka val  context  =  new  StreamingContext(conf,  Seconds(1))   val  lines  =  KafkaUtils.createStream(context,  ...)   val  words  =  lines.flatMap(_.split("  "))   split lines into words 8
  • 9. Word Count with Kafka val  context  =  new  StreamingContext(conf,  Seconds(1))   val  lines  =  KafkaUtils.createStream(context,  ...)   val  words  =  lines.flatMap(_.split("  "))   val  wordCounts  =  words.map(x  =>  (x,  1))                                              .reduceByKey(_  +  _)   wordCounts.print()   context.start()   print some counts on screen count the words start receiving and transforming the data 9
  • 10. Integrates with Spark Ecosystem 10 Spark Core Spark Streaming Spark SQL DataFrames MLlib GraphX
  • 11. Combine batch and streaming processing Join data streams with static data sets // Create data set from Hadoop file val dataset = sparkContext.hadoopFile("file") // Join each batch in stream with the dataset kafkaStream.transform { batchRDD => batchRDD.join(dataset).filter( ... ) } Spark Core Spark Streaming Spark SQL DataFrames MLlib GraphX 11
  • 12. Combine machine learning with streaming Learn models offline, apply them online //  Learn  model  offline   val  model  =  KMeans.train(dataset,  ...)     //  Apply  model  online  on  stream   kafkaStream.map  {  event  =>            model.predict(event.feature)     }     Spark Core Spark Streaming Spark SQL DataFrames MLlib GraphX 12
  • 13. Combine SQL with streaming Interactively query streaming data with SQL and DataFrames //  Register  each  batch  in  stream  as  table   kafkaStream.foreachRDD  {  batchRDD  =>        batchRDD.toDF.registerTempTable("events")   }     //  Interactively  query  table   sqlContext.sql("select  *  from  events")   Spark Core Spark Streaming Spark SQL DataFrames MLlib GraphX 13
  • 15. Spark Survey by Databricks Survey of 1417 individuals from 842 organizations. 56% increase in Spark Streaming users since 2014. Fastest rising component in Spark. https://databricks.com/blog/2015/09/24/spark-survey-results-2015-are-now-available.html 15
  • 16. Feedback from community We have learnt a lot from our rapidly growing user base. Most of the development in the last few releases has been driven by community demands. 16
  • 17. What have we added recently? 17  
  • 19. Streaming MLlib algorithms val  model  =  new  StreamingKMeans()      .setK(10)      .setDecayFactor(1.0)      .setRandomCenters(4,  0.0)     //  Train  on  one  DStream   model.trainOn(trainingDStream)     //  Predict  on  another  DStream   model.predictOnValues(      testDStream.map  {  lp  =>            (lp.label,  lp.features)        }   ).print()     19 Continuous learning and prediction on streaming data StreamingLinearRegression [Spark 1.1] StreamingKMeans [Spark 1.2] StreamingLogisticRegression [Spark 1.3] https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
  • 20. Python API Improvements Added Python API for Streaming ML algos [Spark 1.5] Added Python API for various data sources Kafka [Spark 1.3 - 1.5] Flume, Kinesis, MQTT [Spark 1.5] 20 lines  =  KinesisUtils.createStream(streamingContext,            appName,  streamName,  endpointUrl,  regionName,          InitialPositionInStream.LATEST,  2)       counts  =  lines.flatMap(lambda  line:  line.split("  "))    
  • 22. New Visualizations [Spark 1.4-1.5] 22 Stats over the last 1000 batches. For stability, the scheduling delay should stay near 0 and the processing time should stay below the batch interval.
  • 23. New Visualizations [Spark 1.4-1.5] 23 Details of individual batches: Kafka offsets processed in each batch (can help in debugging bad data) and the list of Spark jobs in each batch.
  • 24. New Visualizations [Spark 1.4-1.5] 24 Full DAG of RDDs and stages generated by Spark Streaming.
  • 25. New Visualizations [Spark 1.4-1.5] Memory usage of received data, which can be used to understand memory consumption across executors.
  • 28. Zero data loss: two cases. Non-replayable sources (e.g. Flume): sources that do not support replay from an arbitrary position; Spark Streaming saves received data to a Write Ahead Log (WAL) and replays data from the WAL on failure. Replayable sources (e.g. Kafka, Kinesis): sources that allow data to be replayed from an arbitrary position; Spark Streaming saves only the record identifiers and replays the data directly from the source.
  • 29. Cluster Write Ahead Log (WAL) [Spark 1.3] Save received data in a WAL in a fault-tolerant file system. 29 Diagram: the driver runs the user code, runs the receivers, and runs tasks to process received data; the receiver buffers data in memory on the executor and writes it to the WAL in HDFS.
  • 30. Cluster Write Ahead Log (WAL) [Spark 1.3] Replay unprocessed data from the WAL if the driver fails and restarts. 30 Diagram: the failed driver is restarted, failed tasks are rerun on restarted executors, and those tasks read the data back from the WAL in HDFS.
  • 31. Write Ahead Log (WAL) [Spark 1.3] The WAL can be enabled by setting the Spark configuration spark.streaming.receiver.writeAheadLog.enable to true. Should use a reliable receiver, which ensures data is written to the WAL before acknowledging the source. Reliable receiver + WAL gives an at-least-once guarantee (a minimal configuration sketch follows). 31
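A minimal configuration sketch for the slide above, assuming a receiver-based source. The application name and checkpoint path are hypothetical; the WAL is stored under the checkpoint directory, so checkpointing must be enabled.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Enable the receiver write ahead log (the configuration named on the slide)
val conf = new SparkConf()
  .setAppName("wal-example")                                      // hypothetical app name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(1))
// The WAL lives under the checkpoint directory, so a fault-tolerant
// checkpoint path must be set (the path below is a placeholder)
ssc.checkpoint("hdfs:///checkpoints/wal-example")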
  • 32. Kinesis [Spark 1.5] Save the Kinesis sequence numbers instead of raw data. 32 Diagram: the executor receives data using the KCL; sequence number ranges are sent to the driver and saved to HDFS.
  • 33. Kinesis [Spark 1.5] Recover unprocessed data directly from Kinesis using the recovered sequence numbers (via the AWS SDK). 33 Diagram: the restarted driver recovers the ranges from HDFS, and tasks are rerun on the restarted executor with the recovered ranges.
  • 34. Kinesis [Spark 1.5] After any failure, records are either recovered from the saved sequence numbers or replayed via the KCL. No need to replicate received data in Spark Streaming. Provides an end-to-end at-least-once guarantee (see the Scala sketch below). 34
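For reference, a sketch of creating the Kinesis stream from Scala, mirroring the Python call shown earlier. This is a sketch under the assumption that the Spark 1.5 Kinesis API takes the arguments in this order; the app name, stream name, endpoint, and region are placeholders.

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils

// Placeholders for illustration only; ssc is the StreamingContext
val lines = KinesisUtils.createStream(
  ssc, "myKinesisApp", "myStream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.LATEST,
  Seconds(2),                       // Kinesis checkpoint interval
  StorageLevel.MEMORY_AND_DISK_2)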
  • 35. Kafka [1.3, graduated in 1.5] Decide a priori the offset ranges to consume in the next batch. 35 Diagram: every batch interval, the driver fetches the latest offset info for each Kafka partition; the offset ranges for the next batch are decided and saved to HDFS.
  • 36. Kafka [1.3, graduated in 1.5] Decide a priori the offset ranges to consume in the next batch. 36 Diagram: every batch interval, the driver fetches the latest offset info for each Kafka partition, and executors run tasks that read each offset range in parallel from the Kafka brokers.
  • 37. Direct Kafka API [Spark 1.5] Does not use receivers, so Spark Streaming does not need to replicate received data. Can provide up to 10x higher throughput than the earlier receiver-based approach https://spark-summit.org/2015/events/towards-benchmarking-modern-distributed-streaming-systems/ Can provide exactly-once semantics: the output operation to external storage should be idempotent or transactional. Can run Spark batch jobs directly on Kafka. # RDD partitions = # Kafka partitions, easy to reason about (see the sketch below). 37 https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
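A minimal sketch of the direct approach, assuming a hypothetical broker list and topic name. The offset ranges of each batch can be read back via HasOffsetRanges, which is what makes idempotent or transactional output practical.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// Hypothetical brokers and topic; ssc is the StreamingContext
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events")

// No receivers: the driver decides offset ranges and executors read them directly
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

directStream.foreachRDD { rdd =>
  // Exact offsets processed in this batch, usable when writing output transactionally
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r => println(s"${r.topic}/${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
}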
  • 38. System stability Streaming applications may have to deal with variations in data rates and processing rates. For stability, any streaming application must receive data only as fast as it can process it. Since 1.1, Spark Streaming has allowed setting static limits on receiver ingestion rates to guard against spikes (see the configuration sketch below). 38
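The static limits mentioned above are plain Spark configurations; the sketch below shows where they are set, and the numeric values are arbitrary illustrations, not recommendations.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Cap each receiver at a fixed number of records per second
  .set("spark.streaming.receiver.maxRate", "10000")
  // Per-partition equivalent for the direct Kafka stream
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")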
  • 39. Backpressure [Spark 1.5] The system automatically and dynamically adapts rate limits to ensure stability under any processing conditions. If sinks slow down, the system automatically pushes back on the sources to slow down receiving. 39 Diagram: sources → receivers → sinks.
  • 40. Backpressure [Spark 1.5] The system uses batch processing times and scheduling delays to set rate limits. Well-known PID controller theory (used in industrial control systems) is used to calculate appropriate rate limits (a rough sketch of the idea follows). Contributed by Typesafe. 40
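The sketch below is only a rough illustration of the PID idea described on the slide, not Spark's internal rate estimator; the function name, parameters, and gains are made up for illustration.

// error:           how far the recent processing rate fell short of the current limit
// historicalError: accumulated backlog, expressed as a rate
// dError:          how fast the error is changing between batches
def nextRateLimit(latestRate: Double, error: Double, historicalError: Double,
                  dError: Double, kp: Double = 1.0, ki: Double = 0.2,
                  kd: Double = 0.0): Double =
  math.max(latestRate - kp * error - ki * historicalError - kd * dError, 0.0)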
  • 41. Backpressure [Spark 1.5] The system uses batch processing times and scheduling delays to set rate limits. 41 Chart annotations: the dynamic rate limit prevents receivers from receiving too fast, and the scheduling delay is kept in check by the rate limits.
  • 42. Backpressure [Spark 1.5] Experimental, so disabled by default in Spark 1.5 Enabled by setting Spark configuration spark.streaming.backpressure.enabled to true   Will be enabled by default in future releases https://issues.apache.org/jira/browse/SPARK-7398 42
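Enabling it is a single configuration line; a minimal sketch, assuming nothing beyond the setting named on the slide:

import org.apache.spark.SparkConf

// Opt in to backpressure (disabled by default in Spark 1.5)
val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")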
  • 44. API and Libraries Support for operations on event time and out-of-order data, the most demanded feature from the community. Tighter integration between Streaming and SQL + DataFrames, which helps leverage Project Tungsten. 44
  • 45. Infrastructure Add native support for Dynamic Allocation for Streaming: dynamically scale the cluster resources based on processing load. Will work in collaboration with backpressure to scale up/down while maintaining stability. Note: as of 1.5, the existing Dynamic Allocation is not optimized for streaming, but users can build their own scaling logic using the developer APIs sparkContext.requestExecutors() and sparkContext.killExecutors() (a rough sketch follows). 45
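A rough sketch of what such custom scaling logic could look like, built on a StreamingListener; the delay threshold and the one-executor step are arbitrary illustrations, not a recommended policy.

import org.apache.spark.SparkContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Request one extra executor whenever a batch completes with a large scheduling delay
class SimpleScaler(sc: SparkContext) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val delayMs = batch.batchInfo.schedulingDelay.getOrElse(0L)
    if (delayMs > 1000) {
      sc.requestExecutors(1)
    }
  }
}

// Registered on the streaming context, e.g.:
// ssc.addStreamingListener(new SimpleScaler(ssc.sparkContext))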
  • 46. Infrastructure Higher throughput and lower latency by leveraging Project Tungsten Specifically, improved performance of stateful ops 46
  • 48. Fastest growing component in the Spark ecosystem Significant improvements in fault-tolerance, stability, visualizations and the Python API More community-requested features to come @tathadas