SlideShare ist ein Scribd-Unternehmen logo
1 von 54
Downloaden Sie, um offline zu lesen


A Deep Dive into 



Structured Streaming

Burak Yavuz
Spark Meetup @ Milano
18 January 2017
Streaming in Apache Spark
Spark Streaming changed how people write streaming apps
2
SQL Streaming MLlib
Spark Core
GraphX
Functional, concise and expressive
Fault-tolerant state management
Unified stack with batch processing
More than 50% users consider most important part of Apache Spark
Streaming apps are
growing more complex
3
Streaming computations
don’t run in isolation
Need to interact with batch data,
interactive analysis, machine learning, etc.
Use case: IoT Device Monitoring
IoT events
from Kafka
ETL into long term storage
- Prevent data loss
- Prevent duplicatesStatus monitoring
- Handle late data
- Aggregate on windows
on event time
Interactively
debug issues
- consistency
event stream
Anomaly detection
- Learn models offline
- Use online + continuous
learning
Use case: IoT Device Monitoring
IoT events
from Kafka
ETL into long term storage
- Prevent data loss
- Prevent duplicatesStatus monitoring
- Handle late data
- Aggregate on windows
on event time
Interactively
debug issues
- consistency
event stream
Anomaly detection
- Learn models offline
- Use online + continuous
learningContinuous Applications
Not just streaming any more
1. Processing with event-time, dealing with late data
- DStream API exposes batch time, hard to incorporate event-time
2. Interoperate streaming with batch AND interactive
- RDD/DStream has similar API, but still requires translation
3. Reasoning about end-to-end guarantees
- Requires carefully constructing sinks that handle failures correctly
- Data consistency in the storage while being updated
Pain points with DStreams
Structured Streaming
The simplest way to perform streaming analytics

is not having to reason about streaming at all
New Model Trigger: every 1 sec
1 2 3
Time
data up

to 1
Input data up

to 2
data up
to 3
Query
Input: data from source as an
append-only table
Trigger: how frequently to check
input for new data
Query: operations on input
usual map/filter/reduce
new window, session ops
New Model Trigger: every 1 sec
1 2 3
result
for data
up to 1
Result
Query
Time
data up

to 1
Input data up

to 2
result
for data
up to 2
data up
to 3
result
for data
up to 3
Result: final operated table
updated every trigger interval
Output: what part of result to write
to data sink after every
trigger
Complete output: Write full result table every time
Output
[complete mode]
output all the rows in the result table
New Model Trigger: every 1 sec
1 2 3
result
for data
up to 1
Result
Query
Time
data up

to 1
Input data up

to 2
result
for data
up to 2
data up
to 3
result
for data
up to 3
Output
[append mode] output only new rows
since last trigger
Result: final operated table
updated every trigger
interval
Output: what part of result to write
to data sink after every
trigger
Complete output: Write full result table every time
Append output: Write only new rows that got added
to result table since previous batch
*Not all output modes are feasible with all queries
Static, bounded
data
Streaming, unbounded
data
Single API !
API - Dataset/DataFrame
14
Dataset
A distributed bounded or unbounded collection of data
with a known schema
A high-level abstraction for selecting, filtering, mapping,
reducing, and aggregating structured data
Common API across Scala, Java, Python, R
Standards based:
Supports most popular
constructs in HiveQL, ANSI
SQL
Compatible: Use JDBC to
connect with popular BI
tools
15
Datasets with SQL
SELECT type, avg(signal)
FROM devices
GROUP BY type
Concise: great for ad-hoc
interactive analysis
Interoperable: based on
R / pandas, easy to go
back and forth
16
Dynamic Datasets (DataFrames)
spark.table("device-data")
.groupBy("type")
.agg("type",
avg("signal"))
.map(lambda …)
.collect()
17
Statically-typed Datasets
No boilerplate:
automatically convert to/
from domain objects
Safe: compile time
checks for correctness
// Compute histogram of age by name.
val hist = ds.groupBy(_.type).mapGroups {
case (type, data: Iter[DeviceData]) =>
val buckets = new Array[Int](10)
data.map(_.signal).foreach { a =>
buckets(a / 10) += 1
}
(type, buckets)
}
case class DeviceData(type:String,
signal:Int)
// Convert data to Java objects
val ds: Dataset[DeviceData] =
spark.read.json("data.json").as[DeviceData]
Unified, Structured APIs in Spark 2.0
18
SQL DataFrames Datasets
Syntax
Errors
Analysis
Errors
Runtime Compile
Time
Runtime
Compile
Time
Compile
Time
Runtime
Batch ETL with DataFrames
input = spark.read
.format("json")
.load("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.save("dest-path")


Read from Json file
Select some devices
Write to parquet file
Streaming ETL with DataFrames
input = spark.readStream
.format("json")
.load("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.writeStream
.format("parquet")
.start("dest-path")


Read from Json file stream
Replace read with readStream
Select some devices
Code does not change
Write to Parquet file stream
Replace save() with start()
Streaming ETL with DataFrames
readStream…load() creates a streaming
DataFrame, does not start any of the
computation
writeStream…start() defines where & how
to output the data and starts the
processing
input = spark.readStream
.format("json")
.load("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.writeStream
.format("parquet")
.start("dest-path")


Streaming ETL with DataFrames
1 2 3
Result
Input
Output
[append mode]
new rows
in result
of 2
new rows
in result
of 3
input = spark.readStream
.format("json")
.load("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.writeStream
.format("parquet")
.start("dest-path")


Demo
Continuous Aggregations
Continuously compute average
signal across all devices
Continuously compute average
signal of each type of device
24
input.avg("signal")
input.groupBy("device_type")
.avg("signal")
Continuous Windowed Aggregations
25
input.groupBy(
$"device_type",
window($"event_time", "10 min"))
.avg("signal")
Continuously compute
average signal of each type
of device in last 10 minutes
using event_time column
Simplifies event-time stream processing (not possible in DStreams)
Works on both, streaming and batch jobs
Watermarking for handling late data

[Spark 2.1]
26
input.withWatermark("event_time", "15 min")
.groupBy(
$"device_type",
window($"event_time", "10 min"))
.avg("signal")
Watermark = threshold using event-time for classifying data as
"too late"
Continuously compute 10
minute average while
considering that data can
be up to 15 minutes late
Joining streams with static data
kafkaDataset = spark.readStream
 .format("kafka")
 .option("subscribe", "iot-updates")
 .option("kafka.bootstrap.servers", ...)
.load()
staticDataset = spark.read
 .jdbc("jdbc://", "iot-device-info")
joinedDataset =
 kafkaDataset.join(staticDataset, "device_type")
27
Join streaming data from Kafka with
static data via JDBC to enrich the
streaming data …
… without having to think that you
are joining streaming data
Output Modes
Defines what is outputted every time there is a trigger
Different output modes make sense for different queries
28
input.select("device", "signal")
.writeStream
.outputMode("append")
.format("parquet")
.start("dest-path")
Append mode with
non-aggregation queries
input.agg(count("*"))
.writeStream
.outputMode("complete")
.format("memory")
.queryName("my_table")
.start()
Complete mode with
aggregation queries
Query Management
query = result.writeStream
.format("parquet")
.outputMode("append")
.start("dest-path")
query.stop()
query.awaitTermination()
query.exception()
query.status()
query.recentProgress()
29
query: a handle to the running streaming
computation for managing it
- Stop it, wait for it to terminate
- Get status
- Get error, if terminated
Multiple queries can be active at the same time
Can ask for current status, and metrics
Logically:
Dataset operations on table
(i.e. as easy to understand as batch)
Physically:
Spark automatically runs the query in
streaming fashion
(i.e. incrementally and continuously)
DataFrame
Logical Plan
Continuous, 

incremental execution
Catalyst optimizer
Query Execution
Structured Streaming
High-level streaming API built on Datasets/DataFrames
Event time, windowing, sessions, sources & sinks
End-to-end exactly once semantics
Unifies streaming, interactive and batch queries
Aggregate data in a stream, then serve using JDBC
Add, remove, change queries at runtime
Build and apply ML models
What can you do with this that’s hard with other engines?
True unification
Combines batch + streaming + interactive workloads
Same code + same super-optimized engine for everything
Flexible API tightly integrated with the engine
Choose your own tool - Dataset/DataFrame/SQL
Greater debuggability and performance
Benefits of Spark
in-memory computing, elastic scaling, fault-tolerance, straggler mitigation, …
Underneath the Hood
Batch Execution on Spark SQL
34
DataFrame/
Dataset
Logical
Plan
Abstract
representation of
query
Batch Execution on Spark SQL
35
DataFrame/
Dataset
Logical
Plan
Planner
SQL AST
DataFrame
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
RDDs
Selected
Physical
Plan
Analysis
Logical
Optimization
Physical
Planning
CostModel
Physical 

Plans
Code
Generation
CatalogDataset
Helluvalot of magic!
Batch Execution on Spark SQL
36
DataFrame/
Dataset
Logical
Plan
Execution PlanPlanner
Run super-optimized Spark
jobs to compute results
Bytecode generation
JVM intrinsics, vectorization
Operations on serialized data
Code Optimizations Memory Optimizations
Compact and fast encoding
Off-heap memory
Project Tungsten - Phase 1 and 2
Continuous Incremental Execution
Planner knows how to convert
streaming logical plans to a continuous
series of incremental execution plans,
for each processing the next chunk of
streaming data
37
DataFrame/
Dataset
Logical
Plan
Incremental
Execution Plan 1
Incremental
Execution Plan 2
Incremental
Execution Plan 3
Planner
Incremental
Execution Plan 4
Continuous Incremental Execution
38
Planner
Incremental
Execution 2
Offsets: [106-197] Count: 92
Planner polls for
new data from
sources
Incremental
Execution 1
Offsets: [19-105] Count: 87
Incrementally executes new
data and writes to sink
Continuous Aggregations
Maintain running aggregate as in-memory state
backed by WAL in file system for fault-tolerance
39
state data generated and used
across incremental executions
Incremental
Execution 1
state:
87
Offsets: [19-105] Running Count: 87
memory
Incremental
Execution 2
state:
179
Offsets: [106-179] Count: 87+92 = 179
Fault-tolerance
All data and metadata in the
system needs to be
recoverable / replayable
st
at
e
Planner
source sink
Incremental
Execution 1
Incremental
Execution 2
Fault-tolerance
Fault-tolerant Planner
Tracks offsets by writing the offset
range of each execution to a write
ahead log (WAL) in HDFS
st
at
e
Planner
source sink
Offsets written to fault-
tolerant WAL
before execution
Incremental
Execution 2
Incremental
Execution 1
Fault-tolerance
Fault-tolerant Planner
Tracks offsets by writing the offset
range of each execution to a write
ahead log (WAL) in HDFS
st
at
e
Planner
source sink
Failed planner fails
current execution
Incremental
Execution 2
Incremental
Execution 1
Failed Execution
Failed
Planner
Fault-tolerance
Fault-tolerant Planner
Tracks offsets by writing the offset
range of each execution to a write
ahead log (WAL) in HDFS
Reads log to recover from failures,
and re-execute exact range of
offsets
st
at
e
Restarted
Planner
source sink
Offsets read back from
WAL
Incremental
Execution 1
Same executions
regenerated from offsets
Failed Execution
Incremental
Execution 2
Fault-tolerance
Fault-tolerant Sources
Structured streaming sources are
by design replayable (e.g. Kafka,
Kinesis, files) and generate the
exactly same data given offsets
recovered by planner
st
at
e
Planner
sink
Incremental
Execution 1
Incremental
Execution 2
source
Replayable
source
Fault-tolerance
Fault-tolerant State
Intermediate "state data" is a
maintained in versioned, key-
value maps in Spark workers,
backed by HDFS
Planner makes sure "correct
version" of state used to re-
execute after failure
Planner
source sink
Incremental
Execution 1
Incremental
Execution 2
st
at
e
state is fault-tolerant with WAL
Fault-tolerance
Fault-tolerant Sink
Sink are by design idempotent,
and handles re-executions to
avoid double committing the
output
Planner
source
Incremental
Execution 1
Incremental
Execution 2
st
at
e
sink
Idempotent by
design
47
offset tracking in WAL
+
state management
+
fault-tolerant sources and sinks
=
end-to-end
exactly-once
guarantees
48
Fast, fault-tolerant, exactly-once
stateful stream processing
without having to reason about streaming
Blogs for Deeper Technical Dive
Blog 1:
https://databricks.com/blog/2016/07/28/
continuous-applications-evolving-
streaming-in-apache-spark-2-0.html
Blog 2:
https://databricks.com/blog/2016/07/28/
structured-streaming-in-apache-
spark.html
Comparison with Other Engines
50Read the blog to understand this table
Previous Status: Spark 2.0.2
Basic infrastructure and API
- Event time, windows, aggregations
- Append and Complete output modes
- Support for a subset of batch queries
Source and sink
- Sources: Files, Kafka
- Sinks: Files, in-memory table,
unmanaged sinks (foreach)
Experimental release to set the
future direction
Not ready for production but
good to experiment with and
provide feedback
Current Status: Spark 2.1.0
Event-time watermarks and old state evictions
Status monitoring and metrics
- Current status and metrics (processing rates, state data size, etc.)
- Codahale/Dropwizard metrics (reporting to Ganglia, Graphite, etc.)
Stable future-proof checkpoint format upgrading code across Spark versions
Many, many, many stability and performance improvements
52
Future Direction
Stability, stability, stability
Support for more queries
Sessionization
More output modes
Support for more sinks
ML integrations
Make Structured
Streaming ready for
production workloads by
Spark 2.2/2.3
https://spark-summit.org/east-2017/

Weitere ähnliche Inhalte

Was ist angesagt?

Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibDatabricks
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrDatabricks
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Spark Summit
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...DataWorks Summit
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesRomi Kuntsman
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Stephan Ewen
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and DatasetKazuaki Ishizaki
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 
Patterns and Operational Insights from the First Users of Delta Lake
Patterns and Operational Insights from the First Users of Delta LakePatterns and Operational Insights from the First Users of Delta Lake
Patterns and Operational Insights from the First Users of Delta LakeDatabricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...Andrew Lamb
 

Was ist angesagt? (20)

Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Structured streaming in Spark
Structured streaming in SparkStructured streaming in Spark
Structured streaming in Spark
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkr
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframes
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
 
Spark etl
Spark etlSpark etl
Spark etl
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Patterns and Operational Insights from the First Users of Delta Lake
Patterns and Operational Insights from the First Users of Delta LakePatterns and Operational Insights from the First Users of Delta Lake
Patterns and Operational Insights from the First Users of Delta Lake
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
 

Andere mochten auch

Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured StreamingKnoldus Inc.
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Databricks
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streamingdatamantra
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
 
Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Sigmoid
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streamingdatamantra
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Databricks
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...Databricks
 

Andere mochten auch (8)

Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
 

Ähnlich wie A Deep Dive into Structured Streaming in Apache Spark

Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Anyscale
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesDatabricks
 
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...confluent
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Anyscale
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkDatabricks
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Source Code Analysis with SAST
Source Code Analysis with SASTSource Code Analysis with SAST
Source Code Analysis with SASTBlueinfy Solutions
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache BeamGDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache BeamImre Nagi
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flinkmxmxm
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream ProcessingSuneel Marthi
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in SparkReynold Xin
 

Ähnlich wie A Deep Dive into Structured Streaming in Apache Spark (20)

Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
 
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Source Code Analysis with SAST
Source Code Analysis with SASTSource Code Analysis with SAST
Source Code Analysis with SAST
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache BeamGDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream Processing
 
Google cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache FlinkGoogle cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache Flink
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
 

Mehr von Anyscale

ACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfAnyscale
 
What's Next for MLflow in 2019
What's Next for MLflow in 2019What's Next for MLflow in 2019
What's Next for MLflow in 2019Anyscale
 
Putting AI to Work on Apache Spark
Putting AI to Work on Apache SparkPutting AI to Work on Apache Spark
Putting AI to Work on Apache SparkAnyscale
 
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Anyscale
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Monitoring Error Logs at Databricks
Monitoring Error Logs at DatabricksMonitoring Error Logs at Databricks
Monitoring Error Logs at DatabricksAnyscale
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksAnyscale
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 

Mehr von Anyscale (9)

ACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdf
 
What's Next for MLflow in 2019
What's Next for MLflow in 2019What's Next for MLflow in 2019
What's Next for MLflow in 2019
 
Putting AI to Work on Apache Spark
Putting AI to Work on Apache SparkPutting AI to Work on Apache Spark
Putting AI to Work on Apache Spark
 
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Monitoring Error Logs at Databricks
Monitoring Error Logs at DatabricksMonitoring Error Logs at Databricks
Monitoring Error Logs at Databricks
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at Databricks
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 

Kürzlich hochgeladen

Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 

Kürzlich hochgeladen (20)

Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 

A Deep Dive into Structured Streaming in Apache Spark

  • 1. 
 A Deep Dive into 
 
 Structured Streaming
 Burak Yavuz Spark Meetup @ Milano 18 January 2017
  • 2. Streaming in Apache Spark Spark Streaming changed how people write streaming apps 2 SQL Streaming MLlib Spark Core GraphX Functional, concise and expressive Fault-tolerant state management Unified stack with batch processing More than 50% users consider most important part of Apache Spark
  • 3. Streaming apps are growing more complex 3
  • 4. Streaming computations don’t run in isolation Need to interact with batch data, interactive analysis, machine learning, etc.
  • 5. Use case: IoT Device Monitoring IoT events from Kafka ETL into long term storage - Prevent data loss - Prevent duplicatesStatus monitoring - Handle late data - Aggregate on windows on event time Interactively debug issues - consistency event stream Anomaly detection - Learn models offline - Use online + continuous learning
  • 6. Use case: IoT Device Monitoring IoT events from Kafka ETL into long term storage - Prevent data loss - Prevent duplicatesStatus monitoring - Handle late data - Aggregate on windows on event time Interactively debug issues - consistency event stream Anomaly detection - Learn models offline - Use online + continuous learningContinuous Applications Not just streaming any more
  • 7. 1. Processing with event-time, dealing with late data - DStream API exposes batch time, hard to incorporate event-time 2. Interoperate streaming with batch AND interactive - RDD/DStream has similar API, but still requires translation 3. Reasoning about end-to-end guarantees - Requires carefully constructing sinks that handle failures correctly - Data consistency in the storage while being updated Pain points with DStreams
  • 9. The simplest way to perform streaming analytics
 is not having to reason about streaming at all
  • 10. New Model Trigger: every 1 sec 1 2 3 Time data up
 to 1 Input data up
 to 2 data up to 3 Query Input: data from source as an append-only table Trigger: how frequently to check input for new data Query: operations on input usual map/filter/reduce new window, session ops
  • 11. New Model Trigger: every 1 sec 1 2 3 result for data up to 1 Result Query Time data up
 to 1 Input data up
 to 2 result for data up to 2 data up to 3 result for data up to 3 Result: final operated table updated every trigger interval Output: what part of result to write to data sink after every trigger Complete output: Write full result table every time Output [complete mode] output all the rows in the result table
  • 12. New Model Trigger: every 1 sec 1 2 3 result for data up to 1 Result Query Time data up
 to 1 Input data up
 to 2 result for data up to 2 data up to 3 result for data up to 3 Output [append mode] output only new rows since last trigger Result: final operated table updated every trigger interval Output: what part of result to write to data sink after every trigger Complete output: Write full result table every time Append output: Write only new rows that got added to result table since previous batch *Not all output modes are feasible with all queries
  • 14. 14 Dataset A distributed bounded or unbounded collection of data with a known schema A high-level abstraction for selecting, filtering, mapping, reducing, and aggregating structured data Common API across Scala, Java, Python, R
  • 15. Standards based: Supports most popular constructs in HiveQL, ANSI SQL Compatible: Use JDBC to connect with popular BI tools 15 Datasets with SQL SELECT type, avg(signal) FROM devices GROUP BY type
  • 16. Concise: great for ad-hoc interactive analysis Interoperable: based on R / pandas, easy to go back and forth 16 Dynamic Datasets (DataFrames) spark.table("device-data") .groupBy("type") .agg("type", avg("signal")) .map(lambda …) .collect()
  • 17. 17 Statically-typed Datasets No boilerplate: automatically convert to/ from domain objects Safe: compile time checks for correctness // Compute histogram of age by name. val hist = ds.groupBy(_.type).mapGroups { case (type, data: Iter[DeviceData]) => val buckets = new Array[Int](10) data.map(_.signal).foreach { a => buckets(a / 10) += 1 } (type, buckets) } case class DeviceData(type:String, signal:Int) // Convert data to Java objects val ds: Dataset[DeviceData] = spark.read.json("data.json").as[DeviceData]
  • 18. Unified, Structured APIs in Spark 2.0 18 SQL DataFrames Datasets Syntax Errors Analysis Errors Runtime Compile Time Runtime Compile Time Compile Time Runtime
  • 19. Batch ETL with DataFrames input = spark.read .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.write .format("parquet") .save("dest-path") 
 Read from Json file Select some devices Write to parquet file
  • 20. Streaming ETL with DataFrames input = spark.readStream .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.writeStream .format("parquet") .start("dest-path") 
 Read from Json file stream Replace read with readStream Select some devices Code does not change Write to Parquet file stream Replace save() with start()
  • 21. Streaming ETL with DataFrames readStream…load() creates a streaming DataFrame, does not start any of the computation writeStream…start() defines where & how to output the data and starts the processing input = spark.readStream .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.writeStream .format("parquet") .start("dest-path") 

  • 22. Streaming ETL with DataFrames 1 2 3 Result Input Output [append mode] new rows in result of 2 new rows in result of 3 input = spark.readStream .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.writeStream .format("parquet") .start("dest-path") 

  • 23. Demo
  • 24. Continuous Aggregations Continuously compute average signal across all devices Continuously compute average signal of each type of device 24 input.avg("signal") input.groupBy("device_type") .avg("signal")
  • 25. Continuous Windowed Aggregations 25 input.groupBy( $"device_type", window($"event_time", "10 min")) .avg("signal") Continuously compute average signal of each type of device in last 10 minutes using event_time column Simplifies event-time stream processing (not possible in DStreams) Works on both, streaming and batch jobs
  • 26. Watermarking for handling late data
 [Spark 2.1] 26 input.withWatermark("event_time", "15 min") .groupBy( $"device_type", window($"event_time", "10 min")) .avg("signal") Watermark = threshold using event-time for classifying data as "too late" Continuously compute 10 minute average while considering that data can be up to 15 minutes late
  • 27. Joining streams with static data kafkaDataset = spark.readStream  .format("kafka")  .option("subscribe", "iot-updates")  .option("kafka.bootstrap.servers", ...) .load() staticDataset = spark.read  .jdbc("jdbc://", "iot-device-info") joinedDataset =  kafkaDataset.join(staticDataset, "device_type") 27 Join streaming data from Kafka with static data via JDBC to enrich the streaming data … … without having to think that you are joining streaming data
  • 28. Output Modes Defines what is outputted every time there is a trigger Different output modes make sense for different queries 28 input.select("device", "signal") .writeStream .outputMode("append") .format("parquet") .start("dest-path") Append mode with non-aggregation queries input.agg(count("*")) .writeStream .outputMode("complete") .format("memory") .queryName("my_table") .start() Complete mode with aggregation queries
  • 29. Query Management query = result.writeStream .format("parquet") .outputMode("append") .start("dest-path") query.stop() query.awaitTermination() query.exception() query.status() query.recentProgress() 29 query: a handle to the running streaming computation for managing it - Stop it, wait for it to terminate - Get status - Get error, if terminated Multiple queries can be active at the same time Can ask for current status, and metrics
  • 30. Logically: Dataset operations on table (i.e. as easy to understand as batch) Physically: Spark automatically runs the query in streaming fashion (i.e. incrementally and continuously) DataFrame Logical Plan Continuous, 
 incremental execution Catalyst optimizer Query Execution
  • 31. Structured Streaming High-level streaming API built on Datasets/DataFrames Event time, windowing, sessions, sources & sinks End-to-end exactly once semantics Unifies streaming, interactive and batch queries Aggregate data in a stream, then serve using JDBC Add, remove, change queries at runtime Build and apply ML models
  • 32. What can you do with this that’s hard with other engines? True unification Combines batch + streaming + interactive workloads Same code + same super-optimized engine for everything Flexible API tightly integrated with the engine Choose your own tool - Dataset/DataFrame/SQL Greater debuggability and performance Benefits of Spark in-memory computing, elastic scaling, fault-tolerance, straggler mitigation, …
  • 34. Batch Execution on Spark SQL 34 DataFrame/ Dataset Logical Plan Abstract representation of query
  • 35. Batch Execution on Spark SQL 35 DataFrame/ Dataset Logical Plan Planner SQL AST DataFrame Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan Analysis Logical Optimization Physical Planning CostModel Physical 
 Plans Code Generation CatalogDataset Helluvalot of magic!
  • 36. Batch Execution on Spark SQL 36 DataFrame/ Dataset Logical Plan Execution PlanPlanner Run super-optimized Spark jobs to compute results Bytecode generation JVM intrinsics, vectorization Operations on serialized data Code Optimizations Memory Optimizations Compact and fast encoding Off-heap memory Project Tungsten - Phase 1 and 2
  • 37. Continuous Incremental Execution Planner knows how to convert streaming logical plans to a continuous series of incremental execution plans, for each processing the next chunk of streaming data 37 DataFrame/ Dataset Logical Plan Incremental Execution Plan 1 Incremental Execution Plan 2 Incremental Execution Plan 3 Planner Incremental Execution Plan 4
  • 38. Continuous Incremental Execution 38 Planner Incremental Execution 2 Offsets: [106-197] Count: 92 Planner polls for new data from sources Incremental Execution 1 Offsets: [19-105] Count: 87 Incrementally executes new data and writes to sink
  • 39. Continuous Aggregations Maintain running aggregate as in-memory state backed by WAL in file system for fault-tolerance 39 state data generated and used across incremental executions Incremental Execution 1 state: 87 Offsets: [19-105] Running Count: 87 memory Incremental Execution 2 state: 179 Offsets: [106-179] Count: 87+92 = 179
  • 40. Fault-tolerance All data and metadata in the system needs to be recoverable / replayable st at e Planner source sink Incremental Execution 1 Incremental Execution 2
  • 41. Fault-tolerance Fault-tolerant Planner Tracks offsets by writing the offset range of each execution to a write ahead log (WAL) in HDFS st at e Planner source sink Offsets written to fault- tolerant WAL before execution Incremental Execution 2 Incremental Execution 1
  • 42. Fault-tolerance Fault-tolerant Planner Tracks offsets by writing the offset range of each execution to a write ahead log (WAL) in HDFS st at e Planner source sink Failed planner fails current execution Incremental Execution 2 Incremental Execution 1 Failed Execution Failed Planner
  • 43. Fault-tolerance Fault-tolerant Planner Tracks offsets by writing the offset range of each execution to a write ahead log (WAL) in HDFS Reads log to recover from failures, and re-execute exact range of offsets st at e Restarted Planner source sink Offsets read back from WAL Incremental Execution 1 Same executions regenerated from offsets Failed Execution Incremental Execution 2
  • 44. Fault-tolerance Fault-tolerant Sources Structured streaming sources are by design replayable (e.g. Kafka, Kinesis, files) and generate the exactly same data given offsets recovered by planner st at e Planner sink Incremental Execution 1 Incremental Execution 2 source Replayable source
  • 45. Fault-tolerance Fault-tolerant State Intermediate "state data" is a maintained in versioned, key- value maps in Spark workers, backed by HDFS Planner makes sure "correct version" of state used to re- execute after failure Planner source sink Incremental Execution 1 Incremental Execution 2 st at e state is fault-tolerant with WAL
  • 46. Fault-tolerance Fault-tolerant Sink Sink are by design idempotent, and handles re-executions to avoid double committing the output Planner source Incremental Execution 1 Incremental Execution 2 st at e sink Idempotent by design
  • 47. 47 offset tracking in WAL + state management + fault-tolerant sources and sinks = end-to-end exactly-once guarantees
  • 48. 48 Fast, fault-tolerant, exactly-once stateful stream processing without having to reason about streaming
  • 49. Blogs for Deeper Technical Dive Blog 1: https://databricks.com/blog/2016/07/28/ continuous-applications-evolving- streaming-in-apache-spark-2-0.html Blog 2: https://databricks.com/blog/2016/07/28/ structured-streaming-in-apache- spark.html
  • 50. Comparison with Other Engines 50Read the blog to understand this table
  • 51. Previous Status: Spark 2.0.2 Basic infrastructure and API - Event time, windows, aggregations - Append and Complete output modes - Support for a subset of batch queries Source and sink - Sources: Files, Kafka - Sinks: Files, in-memory table, unmanaged sinks (foreach) Experimental release to set the future direction Not ready for production but good to experiment with and provide feedback
  • 52. Current Status: Spark 2.1.0 Event-time watermarks and old state evictions Status monitoring and metrics - Current status and metrics (processing rates, state data size, etc.) - Codahale/Dropwizard metrics (reporting to Ganglia, Graphite, etc.) Stable future-proof checkpoint format upgrading code across Spark versions Many, many, many stability and performance improvements 52
  • 53. Future Direction Stability, stability, stability Support for more queries Sessionization More output modes Support for more sinks ML integrations Make Structured Streaming ready for production workloads by Spark 2.2/2.3