When working with streaming data, stateful operations are a common use case. If you want to de-duplicate data, calculate aggregations over event-time windows, or track user activity across sessions, you are performing a stateful operation.
Apache Spark provides a high-level, easy-to-use DataFrame/Dataset API for working with both batch and streaming data. The funny thing about batch workloads is that people tend to run them over and over again. Structured Streaming lets users run those same workloads, with exactly the same business logic, in a streaming fashion, helping them answer questions at lower latencies.
In this talk, we will focus on stateful operations with Structured Streaming and demonstrate, through live demos, how NoSQL stores can be plugged in as fault-tolerant state stores for intermediate state, as well as used as streaming sinks, where output data can be stored indefinitely for downstream applications.
Scylla Summit 2017: Stateful Streaming Applications with Apache Spark
1. Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark™
Burak Yavuz
Software Engineer, Databricks
2. Burak Yavuz
- Software Engineer - Databricks: "We make your streams come true"
- Apache Spark Committer as of Feb 2017
- MS in Management Science & Engineering - Stanford University
- BS in Mechanical Engineering - Bogazici University, Istanbul
3. About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
4. Outline
- Structured Streaming Concepts
- Stateful Processing in Structured Streaming
- Use Cases and How NoSQL Stores Fit In
- Demos
5. The simplest way to perform streaming analytics is not having to reason about streaming at all
7. New Model
Input: data from the source as an append-only table
Trigger: how frequently to check the input for new data
Query: operations on the input - the usual map/filter/reduce, plus new window and session ops
[Diagram: with a trigger of every 1 sec, the input table grows over time ("data up to 1", "data up to 2", "data up to 3") and the query runs over each snapshot]
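To make the model concrete, here is a minimal sketch assuming Spark 2.2: the built-in rate source plays the role of the ever-growing input table, an ordinary DataFrame operation is the query, and the trigger controls how often the source is checked.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("new-model").getOrCreate()

// Input: every new record is appended to an unbounded input table.
val input = spark.readStream.format("rate").load()

// Query: ordinary DataFrame operations over the input table.
val evens = input.filter("value % 2 = 0")

// Trigger: how frequently to check the input for new data.
evens.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()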
8. New Model
Result: the final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
[Diagram: complete mode - every trigger outputs all the rows in the result table: "result for data up to 1", then "up to 2", then "up to 3"]
9. New Model
Result: the final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Append output: write only the new rows added to the result table since the previous batch
*Not all output modes are feasible with all queries
[Diagram: append mode - every trigger outputs only the rows that are new since the last trigger]
11. Output Modes
- Append mode (default): new rows added to the Result Table since the last trigger are output to the sink. Rows are output only once and cannot be rescinded. Example use cases: ETL
12. Output Modes
- Complete mode: the whole Result Table is output to the sink after every trigger. This is supported for aggregation queries. Example use cases: Monitoring
13. Output Modes
- Update mode (available since Spark 2.1.1): only the rows in the Result Table that were updated since the last trigger are output to the sink. Example use cases: Alerting, Sessionization
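Putting the three modes side by side - a sketch, assuming counts is an aggregated streaming DataFrame and events an unaggregated one (both hypothetical), written to the built-in console and parquet sinks:

// Complete mode: rewrite the whole result table each trigger (monitoring).
counts.writeStream.outputMode("complete").format("console").start()

// Update mode (Spark 2.1.1+): emit only the rows that changed since the
// last trigger (alerting, sessionization).
counts.writeStream.outputMode("update").format("console").start()

// Append mode (default): emit each row exactly once (ETL). Aggregations
// need a watermark in this mode so that windows can be finalized.
events.writeStream.outputMode("append").format("parquet")
  .option("path", "/tmp/etl")
  .option("checkpointLocation", "/tmp/etl-checkpoint")
  .start()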
14. Outline
- Structured Streaming Concepts
- Stateful Processing in Structured Streaming
- Use Cases and How NoSQL Stores Fit In
- Demos
15. Event-time Aggregations
Many use cases require aggregate statistics by event time, e.g. what's the number of errors in each system in 1-hour windows?
Many challenges: extracting event time from the data, and handling late, out-of-order data
The DStream APIs were insufficient for event-time operations
16. Event-time Aggregations
Windowing is just another type of grouping in Structured Streaming.

Number of records every hour:

parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .count()

Average signal strength of each device every 10 minutes:

parsedData
  .groupBy(
    $"device",
    window($"timestamp", "10 mins"))
  .avg("signal")

Use built-in functions to extract event time - no need for separate extractors.
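A self-contained sketch of the first query above, assuming a local socket source (e.g. nc -lk 9999) that emits one "epoch-millis,device,signal" line per event:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("windowed-count").getOrCreate()
import spark.implicits._

// Parse each line into (timestamp, device, signal) columns.
val parsedData = spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", "9999")
  .load()
  .as[String]
  .map { line =>
    val Array(ts, device, signal) = line.split(",")
    (new Timestamp(ts.toLong), device, signal.toDouble)
  }
  .toDF("timestamp", "device", "signal")

// Number of records every hour, keyed by event time, not arrival time.
parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()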
17. Advanced Aggregations
Powerful built-in aggregations:
variance, stddev, kurtosis, stddev_samp, collect_list, collect_set, corr, approx_count_distinct, ...

Multiple simultaneous aggregations:

parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .agg(avg("signal"), stddev("signal"), max("signal"))

Custom aggregations using reduceGroups and UDAFs:

// Compute a histogram of signal strength by device type.
val hist = ds.groupByKey(_.deviceType).mapGroups {
  case (deviceType, data: Iterator[DeviceData]) =>
    val buckets = new Array[Int](10)
    data.map(_.signal).foreach { a => buckets(a / 10) += 1 }
    (deviceType, buckets)
}
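And a small sketch of the reduceGroups route mentioned above, assuming ds is a Dataset[DeviceData] (hypothetical case class): keep the strongest reading per device type.

case class DeviceData(deviceType: String, signal: Int)

// Custom aggregation without a UDAF: pairwise-reduce each group down to
// the record with the highest signal.
val maxPerType = ds
  .groupByKey(_.deviceType)
  .reduceGroups((a: DeviceData, b: DeviceData) =>
    if (a.signal >= b.signal) a else b)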
18. Stateful Processing for Aggregations
In-memory streaming state is maintained for aggregations.
Keeping state allows late data to update the counts of old windows.
But the size of the state increases indefinitely if old windows are not dropped.
[Diagram: hourly window counts across successive triggers, with late data (red) updating the counts of earlier windows]
20. Watermarking and Late Data
Watermark [Spark 2.1]: a moving threshold that trails behind the max event time seen so far.
The trailing gap defines how late data is expected to be.
[Diagram: with a max event time of 12:30 PM and a trailing gap of 10 mins, the watermark sits at 12:20 PM; data older than the watermark is not expected]
21. Watermarking and Late Data
Data newer than the watermark may be late, but is allowed to aggregate.
Data older than the watermark is "too late" and is dropped.
State older than the watermark is automatically deleted to limit the amount of intermediate state.
[Diagram: late data between the watermark and the max event time is allowed to aggregate; data behind the watermark is too late and dropped]
22. Watermarking and Late Data
Control the tradeoff between state size and lateness requirements:
Handle more lateness → keep more state
Reduce state → handle less lateness

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

[Diagram: with an allowed lateness of 10 mins, late data between the watermark and the max event time is aggregated; data behind the watermark is dropped]
23. Watermarking to Limit State [Spark 2.1]

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

The system tracks the max observed event time. If the max event time is 12:14, the watermark is updated to 12:14 - 10m = 12:04 for the next trigger, and state older than 12:04 is deleted. Data that arrives late but with event time newer than the watermark (e.g. 12:08) is still considered in the counts; data with event time behind the watermark (e.g. 12:04) is too late - it is ignored in the counts and its state is dropped.
More details in the blog post!
25. Working With Time

df.withWatermark("timestampColumn", "5 hours")        // how late data can be
  .groupBy(window($"timestampColumn", "1 minute"))    // how to group data by time (same in streaming & batch)
  .count()
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))      // how often to emit updates

Separate processing details (output rate, late-data tolerance) from query semantics.
29. Arbitrary Stateful Operations [Spark 2.2]
mapGroupsWithState allows any user-defined stateful ops against a user-defined state.
Direct support for per-key timeouts in event time or processing time.
Supports Scala and Java.

ds.groupByKey(groupingFunc)
  .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

def mappingWithStateFunc(
    key: K,
    values: Iterator[V],
    state: GroupState[S]): U = {
  // update or remove state
  // set timeouts
  // return mapped value
}
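Filling in the template - a sketch with a hypothetical Event type and a running count per key as the state, using a processing-time timeout:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(id: Int, value: Long)

def countFunc(
    id: Int,
    events: Iterator[Event],
    state: GroupState[Long]): (Int, Long) = {
  if (state.hasTimedOut) {
    // No data for this key within the timeout: emit the final count
    // and drop the state.
    val count = state.get
    state.remove()
    (id, count)
  } else {
    val count = state.getOption.getOrElse(0L) + events.size
    state.update(count)                      // update the per-key state
    state.setTimeoutDuration("30 seconds")   // re-arm the per-key timeout
    (id, count)
  }
}

// ds is assumed to be a Dataset[Event], with spark.implicits._ in scope.
val counts = ds.groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(countFunc)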
30. flatMapGroupsWithState
- Applies the given function to each group of data, while maintaining a user-defined per-group state
- Invoked once per group in batch
- Invoked each trigger (when the group has data) in streaming
- Requires the user to provide an output mode for the function
31. flatMapGroupsWithState
- mapGroupsWithState is a special case with:
  - Output mode: Update
  - Output size: 1 row per group
- Supports both processing-time and event-time timeouts
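For contrast, a sketch where the function emits a variable number of rows - here one row per new per-key maximum, and nothing otherwise. Event and ds are the same hypothetical types as in the earlier sketch, with spark.implicits._ in scope for the encoders.

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val newMaxima = ds.groupByKey(_.id)
  .flatMapGroupsWithState[Long, (Int, Long)](
      OutputMode.Update, GroupStateTimeout.NoTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[Long]) =>
      val prevMax = state.getOption.getOrElse(Long.MinValue)
      val batchMax = events.map(_.value).max
      if (batchMax > prevMax) {
        state.update(batchMax)
        Iterator((id, batchMax))   // one output row for this group
      } else {
        Iterator.empty             // zero output rows for this group
      }
  }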
32. Outline
- Structured Streaming Concepts
- Stateful Processing in Structured Streaming
- Use Cases and How NoSQL Stores Fit In
- Demos
33. Alerting
Monitor a stream using custom stateful logic with timeouts.

val monitoring = stream
  .as[Event]
  .groupByKey(_.id)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .queryName("alerts")
  .foreach(new PagerdutySink(credentials))
  .start()
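The PagerdutySink above is a custom sink from the demo; one way to build such a sink is Structured Streaming's ForeachWriter. A sketch, with a hypothetical Alert row type and the actual PagerDuty call omitted:

import org.apache.spark.sql.ForeachWriter

case class Alert(id: Int, message: String)

class PagerdutySink(credentials: String) extends ForeachWriter[Alert] {
  // Called once per partition per trigger; open connections here.
  // Return false to skip processing this partition.
  def open(partitionId: Long, version: Long): Boolean = true

  // Called once for every output row of the streaming query.
  def process(alert: Alert): Unit = {
    // e.g. POST the alert to the PagerDuty Events API using `credentials`.
    println(s"ALERT: ${alert.message}")
  }

  // Called when the partition finishes or fails; close connections here.
  def close(errorOrNull: Throwable): Unit = ()
}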
34. Alerting
- Save your state to Scylla to power dashboards
- Have the stream trigger alerts ASAP
35. Sessionization
Analyze sessions of user/system behavior.

val monitoring = stream
  .as[Event]
  .groupByKey(_.session_id)
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .scylla("trips")
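The state type and update function are elided above; a hypothetical version that counts events per session and closes the session once the watermark passes a 30-minute timeout could look like this (it assumes Event carries a timestamp field, and that the stream has withWatermark set, which event-time timeouts require):

import org.apache.spark.sql.streaming.GroupState

case class Event(session_id: Int, timestamp: java.sql.Timestamp)
case class SessionInfo(numEvents: Int, active: Boolean)

def updateSession(
    id: Int,
    events: Iterator[Event],
    state: GroupState[SessionInfo]): SessionInfo = {
  if (state.hasTimedOut) {
    // The watermark passed the timeout timestamp: close the session.
    val closed = state.get.copy(active = false)
    state.remove()
    closed
  } else {
    val evs = events.toSeq
    val prev = state.getOption.getOrElse(SessionInfo(0, active = true))
    val updated = prev.copy(numEvents = prev.numEvents + evs.size)
    state.update(updated)
    // Expire the session 30 minutes after the latest event time seen.
    val maxEventTimeMs = evs.map(_.timestamp.getTime).max
    state.setTimeoutTimestamp(maxEventTimeMs, "30 minutes")
    updated
  }
}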
36. Sessionization
- Update sessions in your stream
- Save them to a NoSQL store like Scylla!
37. Demo
38. Try Spark 2.2 on Community Edition today!
https://databricks.com/try-databricks
39. Apache Spark's Structured Streaming at Scale Series
https://databricks.com/blog/category/engineering
Twitter: @databricks
40. We are hiring!
https://databricks.com/company/careers
41. THANK YOU
burak@databricks.com
"Does anyone have any questions for my answers?" - Henry Kissinger