2. Agenda
❖ Overview
❖ Architecture
❖ Fault-tolerance
❖ Why Spark Streaming? We have Storm
❖ Demo
3. Overview
❖ Spark Streaming is an extension of the core Spark API. It enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
❖ Connections for most of the common data sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, TCP sockets, etc.
❖ Spark Streaming differs from most online processing solutions by espousing a mini-batch approach instead of a record-at-a-time data stream.
❖ Based on the Discretized Streams paper:
❖ Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing
Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica
Berkeley EECS (2012-12-14)
www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
4. Overview
Spark Streaming runs a streaming computation as a series of very small, deterministic batch jobs.
[diagram: live data stream → Spark Streaming → batches of X milliseconds → Spark → processed results]
❖ Chops the live stream into batches of X milliseconds
❖ Spark treats each batch of data as an RDD
❖ Processed results of the RDD operations are returned in batches
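The mini-batch model can be sketched in plain Scala (no Spark here; batching is by record count rather than by time interval, purely to illustrate running the same deterministic batch job over each chunk):

```scala
object MiniBatchSketch {
  // The per-batch job: count occurrences of each record in the batch.
  def processBatch(batch: Seq[String]): Map[String, Int] =
    batch.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  def main(args: Array[String]): Unit = {
    val liveStream = Seq("a", "b", "a", "c", "b", "a")
    // Spark Streaming cuts batches by time interval; here we cut by
    // record count (2 records per batch) to keep the sketch simple.
    val batches = liveStream.grouped(2).toSeq
    // The same deterministic batch job runs over every batch.
    batches.map(processBatch).foreach(println)
  }
}
```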
6. Example 1 - DStream to RDD
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
[diagram: the Twitter Streaming API feeds the tweets DStream, which consists of batches at t, t+1, t+2, t+3; each batch is stored in memory as an RDD (immutable, distributed)]
7. Example 1 - DStream to RDD relation
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
[diagram: a flatMap is applied to each batch of the tweets DStream at t, t+1, t+2, t+3, producing a new RDD for each batch; together these form a new DStream, hashTags ([#hobbitch, #bilboleggins, …])]
8. Example 1 - DStream to RDD
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveToCassandra("keyspace", "tableName")
[diagram: flatMap applied to each batch of the tweets DStream produces the hashTags DStream ([#hobbitch, #bilboleggins, …]); every batch is saved to Cassandra]
9. Example 2 - DStream to RDD relation
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValue()
[diagram: each batch of the tweets DStream goes through flatMap (hashTags), then map and reduceByKey per batch, yielding tagCounts ([(#hobbitch, 10), (#bilboleggins, 34), …])]
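Per batch, countByValue behaves like a map to (tag, 1) followed by reduceByKey, as the diagram shows. A plain-Scala sketch of that per-batch pipeline (no Spark, illustration only):

```scala
object CountByValueSketch {
  // countByValue on one batch, expressed as map + reduceByKey.
  def countByValue(batch: Seq[String]): Map[String, Int] =
    batch
      .map(tag => (tag, 1))  // map: tag -> (tag, 1)
      .groupBy(_._1)         // shuffle records to their key
      .map { case (tag, pairs) => (tag, pairs.map(_._2).sum) } // reduceByKey(_ + _)

  def main(args: Array[String]): Unit =
    println(countByValue(Seq("#hobbitch", "#bilboleggins", "#hobbitch")))
}
```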
10. Example 3 - Count the hash tags over last 10 minutes
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
window is the sliding window operation: Minutes(10) is the window length, Seconds(1) is the sliding interval.
11. Example 3 - Count the hash tags over last 10 minutes
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
[diagram: batches at t-1, t, t+1, t+2, t+3; a sliding window moves over the hashTags DStream, and the count is taken over all data in the window]
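Conceptually, this plain window form recounts everything inside the window on every slide. A plain-Scala sketch of one such recount (no Spark; window length and slide expressed in batches, illustration only):

```scala
object WindowCountSketch {
  // One sliding-window evaluation: recount every element currently
  // inside the window, which is what the plain window() form does
  // each time the window slides.
  def countWindow(batchesInWindow: Seq[Seq[String]]): Map[String, Int] =
    batchesInWindow.flatten.groupBy(identity).map { case (t, ts) => (t, ts.size) }

  def main(args: Array[String]): Unit = {
    // Window length = 3 batches, sliding interval = 1 batch.
    val batches = Seq(Seq("#a", "#b"), Seq("#a"), Seq("#c"), Seq("#a", "#c"))
    val windows = batches.sliding(3).toSeq
    windows.map(countWindow).foreach(println)
  }
}
```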
12. Example 4 - Count hash tags over last 10 minutes smartly
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
[diagram: batches at t-1, t, t+1, t+2, t+3; as the window slides, the count of the new batch entering the window is added (+) and the count of the batch leaving the window is subtracted (-)]
A generalization of the smart window reduce exists: reduceByKeyAndWindow(reduce, inverseReduce, window, interval)
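The incremental update behind countByValueAndWindow / reduceByKeyAndWindow can be sketched in plain Scala (no Spark; `step` is a hypothetical helper for illustration, assuming every leaving batch was previously added):

```scala
object IncrementalWindowSketch {
  // Instead of recounting the whole window, add the counts of the
  // batch entering the window and subtract the counts of the batch
  // leaving it (the inverseReduce side).
  def step(counts: Map[String, Int],
           entering: Seq[String],
           leaving: Seq[String]): Map[String, Int] = {
    val added = entering.foldLeft(counts) { (m, t) =>
      m.updated(t, m.getOrElse(t, 0) + 1)
    }
    leaving.foldLeft(added) { (m, t) =>
      val n = m(t) - 1                       // t must have been added before
      if (n == 0) m - t else m.updated(t, n) // drop keys that reach zero
    }
  }

  def main(args: Array[String]): Unit = {
    var counts = Map.empty[String, Int]
    counts = step(counts, Seq("#a", "#b"), Nil)       // batch t enters
    counts = step(counts, Seq("#a"), Nil)             // batch t+1 enters
    counts = step(counts, Seq("#c"), Seq("#a", "#b")) // t+2 enters, t leaves
    println(counts) // contains #a -> 1 and #c -> 1
  }
}
```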
13. Architecture
❖ Receivers divide the data into mini batches
❖ The batch size can be defined in milliseconds (best practice is greater than 500 milliseconds)
[diagram: input streams → Spark Streaming receivers → batches of input RDDs → Spark engine → batches of output RDDs]
14. Fault-tolerance
❖ RDDs are not generated from a fault-tolerant source
❖ Data is replicated among worker nodes (default replication factor of 2)
❖ In stateful jobs, checkpoints should be used
❖ Journaling, as in a database, can be activated
[diagram: a flatMap turns the tweets RDD into the hashTags RDD; input data is replicated in memory, and lost partitions are recomputed on other workers]
15. Fault-tolerance
❖ Two kinds of data to recover in the event of failure:
• Data received and replicated - This data survives the failure of a single worker node, since a copy of it exists on one of the other nodes.
• Data received but buffered for replication - As this data is not replicated, the only way to recover it is to get it from the source again.
16. Fault-tolerance
❖ Two receiver semantics:
• Reliable receiver - Acknowledges only after received data is replicated. If the receiver fails, buffered data does not get acknowledged to the source. When the receiver is restarted, the source will resend the data, and therefore no data will be lost due to the failure.
• Unreliable receiver - Such receivers can lose data when they fail due to worker or driver failures.
17. Fault-tolerance
Deployment scenario     | Receiver failure                                 | Driver failure
Without write-ahead log | Buffered data lost with unreliable receivers;    | Buffered data lost with unreliable receivers;
                        | zero data lost with reliable receivers and files | past data lost with all receivers;
                        |                                                  | zero data lost with files
With write-ahead log    | Zero data lost with receivers and files          | Zero data lost with receivers and files
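As a sketch of how the write-ahead log rows are enabled (the property name below matches Spark 1.2 as I recall it; treat it as an assumption and verify against your version's configuration docs):

```properties
# spark-defaults.conf (sketch; verify the property name for your Spark version)
spark.streaming.receiver.writeAheadLog.enable  true
```

A checkpoint directory must also be set on the StreamingContext (ssc.checkpoint(checkpointDirectory)) so the log has durable storage to write to.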
19. One model to rule them all
❖ Same model for offline AND online processing
❖ Common code base for offline AND online processing
❖ Fewer bugs due to code duplication
❖ Fewer bugs due to framework differences
❖ Increased developer productivity
20. One stack to rule them all
❖ Explore data interactively using the Spark shell to identify the problem
❖ Use the same code in Spark standalone to identify the problem in the production environment
❖ Use similar code in Spark Streaming to monitor the problem online

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> va

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
21. Performance
❖ Higher throughput than Storm
• Spark Streaming: 670k records/second/node
• Storm: 115k records/second/node
[charts: throughput per node (MB/s) vs record size (100 and 1000 bytes) for Grep and WordCount, Spark vs Storm]
Tested with 100 EC2 instances with 4 cores each
Comparison taken from Tathagata Das and Reynold Xin's Hadoop Summit 2013 presentation
27. Utilization
❖ Spark 1.2 introduces dynamic cluster resource allocation
❖ Jobs can request more resources and release resources they no longer need
❖ Available only on YARN
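A sketch of how dynamic allocation is switched on (property names as introduced around Spark 1.2 on YARN; the executor bounds below are placeholder values, and all names should be verified against your version's configuration docs):

```properties
# spark-defaults.conf (sketch)
spark.dynamicAllocation.enabled       true
spark.dynamicAllocation.minExecutors  2
spark.dynamicAllocation.maxExecutors  20
# Needed so shuffle output survives executor removal:
spark.shuffle.service.enabled         true
```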