When working with streaming data, stateful operations are a common use case. If you want to de-duplicate data, calculate aggregations over event-time windows, or track user activity across sessions, you are performing a stateful operation.
Apache Spark provides a high-level, easy-to-use DataFrame/Dataset API for working with both batch and streaming data. The funny thing about batch workloads is that people tend to run them over and over again. Structured Streaming lets users run those same workloads, with exactly the same business logic, in a streaming fashion, helping them answer questions at lower latencies.
In this talk, we will focus on stateful operations with Structured Streaming and demonstrate, through live demos, how NoSQL stores can be plugged in as fault-tolerant state stores for intermediate state, as well as used as streaming sinks, where output data can be stored indefinitely for downstream applications.
Scylla Summit 2017: Stateful Streaming Applications with Apache Spark
1. Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark™
Burak Yavuz
Software Engineer, Databricks
2. Burak Yavuz
- Software Engineer - Databricks: "We make your streams come true"
- Apache Spark Committer as of Feb 2017
- MS in Management Science & Engineering - Stanford University
- BS in Mechanical Engineering - Bogazici University, Istanbul
3. About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
4. Outline
- Structured Streaming Concepts
- Stateful Processing in Structured Streaming
- Use Cases and How NoSQL Stores Fit In
- Demos
5. The simplest way to perform streaming analytics is not having to reason about streaming at all
7. New Model
Input: data from the source as an append-only table
Trigger: how frequently to check the input for new data
Query: operations on the input - the usual map/filter/reduce, plus new window and session ops
[Diagram: with a trigger of every 1 sec, the input table grows over time ("data up to 1", "data up to 2", "data up to 3") and the query runs over each snapshot]
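To make the model concrete, here is a minimal sketch assuming Spark 2.2: the built-in rate source plays the role of the ever-growing input table, an ordinary DataFrame operation is the query, and the trigger controls how often the source is checked.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("new-model").getOrCreate()

// Input: every new record is appended to an unbounded input table.
val input = spark.readStream.format("rate").load()

// Query: ordinary DataFrame operations over the input table.
val evens = input.filter("value % 2 = 0")

// Trigger: how frequently to check the input for new data.
evens.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()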
8. New Model
Result: the final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
[Diagram: complete mode - every trigger outputs all the rows in the result table: "result for data up to 1", then "up to 2", then "up to 3"]
9. New Model
Result: the final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Append output: write only the new rows added to the result table since the previous batch
*Not all output modes are feasible with all queries
[Diagram: append mode - every trigger outputs only the rows that are new since the last trigger]
11. Output Modes
- Append mode (default): new rows added to the Result Table since the last trigger are output to the sink. Rows are output only once and cannot be rescinded. Example use cases: ETL
12. Output Modes
- Complete mode: the whole Result Table is output to the sink after every trigger. This is supported for aggregation queries. Example use cases: Monitoring
13. Output Modes
- Update mode (available since Spark 2.1.1): only the rows in the Result Table that were updated since the last trigger are output to the sink. Example use cases: Alerting, Sessionization
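Putting the three modes side by side - a sketch, assuming counts is an aggregated streaming DataFrame and events an unaggregated one (both hypothetical), written to the built-in console and parquet sinks:

// Complete mode: rewrite the whole result table each trigger (monitoring).
counts.writeStream.outputMode("complete").format("console").start()

// Update mode (Spark 2.1.1+): emit only the rows that changed since the
// last trigger (alerting, sessionization).
counts.writeStream.outputMode("update").format("console").start()

// Append mode (default): emit each row exactly once (ETL). Aggregations
// need a watermark in this mode so that windows can be finalized.
events.writeStream.outputMode("append").format("parquet")
  .option("path", "/tmp/etl")
  .option("checkpointLocation", "/tmp/etl-checkpoint")
  .start()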
14. Outline
- Structured Streaming Concepts
- Stateful Processing in Structured Streaming
- Use Cases and How NoSQL Stores Fit In
- Demos
15. Event-time Aggregations
Many use cases require aggregate statistics by event time, e.g. what's the number of errors in each system in 1-hour windows?
Many challenges: extracting event time from the data, and handling late, out-of-order data
The DStream APIs were insufficient for event-time operations
16. Event-time Aggregations
Windowing is just another type of grouping in Structured Streaming.

Number of records every hour:

parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .count()

Average signal strength of each device every 10 minutes:

parsedData
  .groupBy(
    $"device",
    window($"timestamp", "10 mins"))
  .avg("signal")

Use built-in functions to extract event time - no need for separate extractors.
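A self-contained sketch of the first query above, assuming a local socket source (e.g. nc -lk 9999) that emits one "epoch-millis,device,signal" line per event:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("windowed-count").getOrCreate()
import spark.implicits._

// Parse each line into (timestamp, device, signal) columns.
val parsedData = spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", "9999")
  .load()
  .as[String]
  .map { line =>
    val Array(ts, device, signal) = line.split(",")
    (new Timestamp(ts.toLong), device, signal.toDouble)
  }
  .toDF("timestamp", "device", "signal")

// Number of records every hour, keyed by event time, not arrival time.
parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()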
17. Advanced Aggregations
Powerful built-in aggregations:
variance, stddev, kurtosis, stddev_samp, collect_list, collect_set, corr, approx_count_distinct, ...

Multiple simultaneous aggregations:

parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .agg(avg("signal"), stddev("signal"), max("signal"))

Custom aggregations using reduceGroups and UDAFs:

// Compute a histogram of signal strength by device type.
val hist = ds.groupByKey(_.deviceType).mapGroups {
  case (deviceType, data: Iterator[DeviceData]) =>
    val buckets = new Array[Int](10)
    data.map(_.signal).foreach { a => buckets(a / 10) += 1 }
    (deviceType, buckets)
}
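And a small sketch of the reduceGroups route mentioned above, assuming ds is a Dataset[DeviceData] (hypothetical case class): keep the strongest reading per device type.

case class DeviceData(deviceType: String, signal: Int)

// Custom aggregation without a UDAF: pairwise-reduce each group down to
// the record with the highest signal.
val maxPerType = ds
  .groupByKey(_.deviceType)
  .reduceGroups((a: DeviceData, b: DeviceData) =>
    if (a.signal >= b.signal) a else b)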
18. Stateful Processing for Aggregations
In-memory streaming state is maintained for aggregations.
Keeping state allows late data to update the counts of old windows.
But the size of the state increases indefinitely if old windows are not dropped.
[Diagram: hourly window counts across successive triggers, with late data (red) updating the counts of earlier windows]
20. Watermarking and Late Data
Watermark [Spark 2.1]: a moving threshold that trails behind the max event time seen so far.
The trailing gap defines how late data is expected to be.
[Diagram: with a max event time of 12:30 PM and a trailing gap of 10 mins, the watermark sits at 12:20 PM; data older than the watermark is not expected]
21. Watermarking and Late Data
Data newer than the watermark may be late, but is allowed to aggregate.
Data older than the watermark is "too late" and is dropped.
State older than the watermark is automatically deleted to limit the amount of intermediate state.
[Diagram: late data between the watermark and the max event time is allowed to aggregate; data behind the watermark is too late and dropped]
22. Watermarking and Late Data
Control the tradeoff between state size and lateness requirements:
Handle more lateness → keep more state
Reduce state → handle less lateness

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

[Diagram: with an allowed lateness of 10 mins, late data between the watermark and the max event time is aggregated; data behind the watermark is dropped]
23. Watermarking to Limit State [Spark 2.1]

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

The system tracks the max observed event time. If the max event time is 12:14, the watermark is updated to 12:14 - 10m = 12:04 for the next trigger, and state older than 12:04 is deleted. Data that arrives late but with event time newer than the watermark (e.g. 12:08) is still considered in the counts; data with event time behind the watermark (e.g. 12:04) is too late - it is ignored in the counts and its state is dropped.
More details in the blog post!
25. Working With Time

df.withWatermark("timestampColumn", "5 hours")        // how late data can be
  .groupBy(window($"timestampColumn", "1 minute"))    // how to group data by time (same in streaming & batch)
  .count()
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))      // how often to emit updates

Separate processing details (output rate, late-data tolerance) from query semantics.
29. Arbitrary Stateful Operations [Spark 2.2]
mapGroupsWithState allows any user-defined stateful ops against a user-defined state.
Direct support for per-key timeouts in event time or processing time.
Supports Scala and Java.

ds.groupByKey(groupingFunc)
  .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

def mappingWithStateFunc(
    key: K,
    values: Iterator[V],
    state: GroupState[S]): U = {
  // update or remove state
  // set timeouts
  // return mapped value
}
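Filling in the template - a sketch with a hypothetical Event type and a running count per key as the state, using a processing-time timeout:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(id: Int, value: Long)

def countFunc(
    id: Int,
    events: Iterator[Event],
    state: GroupState[Long]): (Int, Long) = {
  if (state.hasTimedOut) {
    // No data for this key within the timeout: emit the final count
    // and drop the state.
    val count = state.get
    state.remove()
    (id, count)
  } else {
    val count = state.getOption.getOrElse(0L) + events.size
    state.update(count)                      // update the per-key state
    state.setTimeoutDuration("30 seconds")   // re-arm the per-key timeout
    (id, count)
  }
}

// ds is assumed to be a Dataset[Event], with spark.implicits._ in scope.
val counts = ds.groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(countFunc)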
30. flatMapGroupsWithState
- Applies the given function to each group of data, while maintaining a user-defined per-group state
- Invoked once per group in batch
- Invoked each trigger (when the group has data) in streaming
- Requires the user to provide an output mode for the function
31. flatMapGroupsWithState
- mapGroupsWithState is a special case with:
  - Output mode: Update
  - Output size: 1 row per group
- Supports both processing-time and event-time timeouts
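For contrast, a sketch where the function emits a variable number of rows - here one row per new per-key maximum, and nothing otherwise. Event and ds are the same hypothetical types as in the earlier sketch, with spark.implicits._ in scope for the encoders.

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val newMaxima = ds.groupByKey(_.id)
  .flatMapGroupsWithState[Long, (Int, Long)](
      OutputMode.Update, GroupStateTimeout.NoTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[Long]) =>
      val prevMax = state.getOption.getOrElse(Long.MinValue)
      val batchMax = events.map(_.value).max
      if (batchMax > prevMax) {
        state.update(batchMax)
        Iterator((id, batchMax))   // one output row for this group
      } else {
        Iterator.empty             // zero output rows for this group
      }
  }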
32. Outline
- Structured Streaming Concepts
- Stateful Processing in Structured Streaming
- Use Cases and How NoSQL Stores Fit In
- Demos
33. Alerting
Monitor a stream using custom stateful logic with timeouts.

val monitoring = stream
  .as[Event]
  .groupByKey(_.id)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .queryName("alerts")
  .foreach(new PagerdutySink(credentials))
  .start()
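The PagerdutySink above is a custom sink from the demo; one way to build such a sink is Structured Streaming's ForeachWriter. A sketch, with a hypothetical Alert row type and the actual PagerDuty call omitted:

import org.apache.spark.sql.ForeachWriter

case class Alert(id: Int, message: String)

class PagerdutySink(credentials: String) extends ForeachWriter[Alert] {
  // Called once per partition per trigger; open connections here.
  // Return false to skip processing this partition.
  def open(partitionId: Long, version: Long): Boolean = true

  // Called once for every output row of the streaming query.
  def process(alert: Alert): Unit = {
    // e.g. POST the alert to the PagerDuty Events API using `credentials`.
    println(s"ALERT: ${alert.message}")
  }

  // Called when the partition finishes or fails; close connections here.
  def close(errorOrNull: Throwable): Unit = ()
}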
34. Alerting
- Save your state to Scylla to power dashboards
- Have the stream trigger alerts ASAP
35. Sessionization
Analyze sessions of user/system behavior.

val monitoring = stream
  .as[Event]
  .groupByKey(_.session_id)
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .scylla("trips")
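The state type and update function are elided above; a hypothetical version that counts events per session and closes the session once the watermark passes a 30-minute timeout could look like this (it assumes Event carries a timestamp field, and that the stream has withWatermark set, which event-time timeouts require):

import org.apache.spark.sql.streaming.GroupState

case class Event(session_id: Int, timestamp: java.sql.Timestamp)
case class SessionInfo(numEvents: Int, active: Boolean)

def updateSession(
    id: Int,
    events: Iterator[Event],
    state: GroupState[SessionInfo]): SessionInfo = {
  if (state.hasTimedOut) {
    // The watermark passed the timeout timestamp: close the session.
    val closed = state.get.copy(active = false)
    state.remove()
    closed
  } else {
    val evs = events.toSeq
    val prev = state.getOption.getOrElse(SessionInfo(0, active = true))
    val updated = prev.copy(numEvents = prev.numEvents + evs.size)
    state.update(updated)
    // Expire the session 30 minutes after the latest event time seen.
    val maxEventTimeMs = evs.map(_.timestamp.getTime).max
    state.setTimeoutTimestamp(maxEventTimeMs, "30 minutes")
    updated
  }
}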
36. Sessionization
- Update sessions in your stream
- Save them to a NoSQL store like Scylla!
37. Demo
38. Try Spark 2.2 on Community Edition today!
https://databricks.com/try-databricks
39. Apache Spark's Structured Streaming at Scale Series
https://databricks.com/blog/category/engineering
Twitter: @databricks
40. We are hiring!
https://databricks.com/company/careers
41. THANK YOU
burak@databricks.com
"Does anyone have any questions for my answers?" - Henry Kissinger