The document discusses real-time big data analytics using Storm, Cassandra, and in-memory computing. It describes how Storm is a popular open-source platform for real-time stream processing. It also explains how an in-memory data grid can interface with Cassandra for a complete solution, providing real-time analytics with high availability and elasticity. Real-time analytics require processing data within seconds, while batch analytics can occur over longer timeframes.
Real Time Big Data Analytics with Storm, Cassandra and In-Memory Computing
1. Real Time Big Data With Storm, Cassandra, and In-Memory Computing
DeWayne Filppi, @dfilppi
2. Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
3. The Two Vs of Big Data
Velocity
Volume
4. We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
5. The Flavors of Big Data Analytics
Counting
Correlating
Research
6. Analytics @ Twitter – Counting
§ How many signups, tweets, retweets for a topic?
§ What’s the average latency?
§ Demographics
  § Countries and cities
  § Gender
  § Age groups
  § Device types
  § …
7. Analytics @ Twitter – Correlating
§ What devices fail at the same time?
§ What features get users hooked?
§ What places on the globe are “happening”?
8. Analytics @ Twitter – Research
§ Sentiment analysis
  § “Obama is popular”
§ Trends
  § “People like to tweet after watching American Idol”
§ Spam patterns
  § How can you tell when a user spams?
9. It’s All about Timing
“Real time” (< few seconds)
Reasonably quick (seconds - minutes)
Batch (hours/days)
10. It’s All about Timing
• Event driven / stream processing
  • High resolution – every tweet gets counted
• Ad-hoc querying
  • Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce)
  • Low resolution (trends & patterns)
This is what we’re here to discuss: event driven / stream processing.
12. In Memory Data Grid Review
§ RAM is the new disk
§ Data partitioned across a cluster
§ Large “virtual” memory space
§ Transactional
§ Highly available
§ Code collocated with data
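The partitioning and collocation ideas on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `PartitionedGrid`, `partition_for`, and the CRC32 routing are hypothetical names and choices made for the sketch, not the XAP API, and a real grid would place each partition on a different node with backups.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a routing key to a partition index with a stable hash."""
    # zlib.crc32 is deterministic across processes, unlike built-in hash() on str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionedGrid:
    """Toy in-process stand-in for a partitioned in-memory data grid (hypothetical)."""

    def __init__(self, num_partitions: int):
        # One dict per partition; together they form one large "virtual" memory space.
        self.partitions = [dict() for _ in range(num_partitions)]

    def _owner(self, key: str) -> dict:
        return self.partitions[partition_for(key, len(self.partitions))]

    def put(self, key: str, value) -> None:
        self._owner(key)[key] = value

    def get(self, key: str):
        return self._owner(key).get(key)

    def execute(self, key: str, fn):
        """Run fn inside the partition that owns key (code collocated with data)."""
        return fn(self._owner(key), key)
```

For example, `grid.execute("user:1", lambda part, k: part[k]["tweets"] + 1)` runs the function against the owning partition instead of pulling the data to the caller.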
13. Data Grid + Cassandra: A Complete Solution
• Data flows through the in-memory cluster async to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
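A rough sketch of the asynchronous (write-behind) flow listed above: writes land in memory first, listeners are notified, and the backing store is updated later. All names here (`WriteBehindGrid`, `DictStore`, `flush`) are hypothetical; `DictStore` merely stands in for Cassandra, and a real grid would flush from a background process rather than on an explicit call.

```python
from collections import deque


class DictStore:
    """Stand-in for the Cassandra layer: anything with a write(key, value) method."""

    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value


class WriteBehindGrid:
    """Toy write-behind cache (hypothetical, not the XAP API)."""

    def __init__(self, store, listeners=None):
        self.memory = {}          # results are instantly available from here
        self.pending = deque()    # writes not yet persisted to the store
        self.store = store
        self.listeners = listeners or []

    def write(self, key, value):
        self.memory[key] = value
        self.pending.append((key, value))
        for listener in self.listeners:  # notify event listeners
            listener(key, value)

    def read(self, key):
        return self.memory.get(key)

    def flush(self, max_items=None):
        """Persist queued writes; returns how many were written."""
        written = 0
        while self.pending and (max_items is None or written < max_items):
            key, value = self.pending.popleft()
            self.store.write(key, value)
            written += 1
        return written
```

The point of the design is that readers never wait on the slower store: `read` is served from memory even while `pending` still holds unpersisted writes.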
14. Simplified Event Flow
15. Grid – Cassandra Interface
§ Hector and CQL based interface
§ In memory data must be mapped to column families.
  § Configurable class to column family mapping
§ Must serialize individual fields
  § Fixed fields can use defined types
  § Variable fields (for schemaless in-memory mode) need serializers
§ Object model flattening
  § By default, nested fields are flattened.
  § Can be overridden by custom serializer.
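To make the flattening bullet concrete, here is a minimal sketch of turning a nested object into flat column names. The dotted naming scheme and the `flatten` helper are assumptions made for illustration; the slide does not say how the XAP interface actually names flattened columns.

```python
def flatten(obj: dict, prefix: str = "", sep: str = ".") -> dict:
    """Flatten nested dicts into a single level of column-name -> value pairs."""
    columns = {}
    for name, value in obj.items():
        # Build the column name from the path to this field (assumed convention).
        column = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(value, dict):
            # Nested object: recurse and merge its fields into the same row.
            columns.update(flatten(value, column, sep))
        else:
            columns[column] = value
    return columns
```

Under these assumptions, `flatten({"id": 7, "user": {"name": "ann"}})` yields one row with the columns `id` and `user.name`. A custom serializer would replace the recursive branch, for instance by storing the nested object as a single blob.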
16. Virtues and Limitations
§ Could be faster: high availability has a cost
§ Complex flows not easy to assemble or understand with simple event handlers
§ Complete stack, not just two tools of many
§ Fast.
  § Microsecond latencies for in memory operations
  § Fast enough for almost anybody
§ Highly available/self healing
§ Elastic
17. Storm Background
§ Popular open source, real time, in-memory, streaming computation platform.
§ Includes distributed runtime and intuitive API for defining distributed processing flows.
§ Scalable and fault tolerant.
§ Developed at BackType, and open sourced by Twitter
18. Storm Abstractions
§ Streams
  § Unbounded sequence of tuples
§ Spouts
  § Source of streams (Queues)
§ Bolts
  § Functions, Filters, Joins, Aggregations
§ Topologies
[Diagram labels: Spout, Bolt, Topologies]
19. Streaming word count with Storm
§ Storm has a simple builder interface for creating stream processing topologies
§ Storm delegates persistence to external providers
§ Cassandra, because of its write performance, is commonly used
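The shape of a streaming word count can be sketched without Storm itself. The generators below only imitate the spout → bolt → bolt wiring in a single process; the function names are hypothetical, and this is not Storm's builder API, which would also distribute the bolts across workers and group tuples by word.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout stand-in: emits one single-field tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt stand-in: splits each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream):
    """Bolt stand-in: keeps running counts and emits (word, count) updates."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences):
    """Wire spout -> split -> count and return the latest count per word."""
    totals = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        totals[word] = count  # in a real deployment this update would be persisted
    return totals
```

The `totals` dictionary plays the role of the external persistence provider mentioned on the slide; in the deck's setup that role is filled by Cassandra or the data grid.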
20. Storm: Optimistic Processing
§ Storm (quite rationally) assumes success is normal
§ Storm uses batching and pipelining for performance
§ Therefore the spout must be able to replay tuples on demand in case of error.
§ Any kind of quasi-queue like data source can be fashioned into a spout.
§ No persistence is ever required, and speed is attained by minimizing network hops during topology processing.
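The replay requirement can be illustrated with a small queue-backed source that remembers every emitted tuple until it is acknowledged. The method names loosely mirror Storm's spout callbacks (`nextTuple`, `ack`, `fail`), but the class itself is a hypothetical single-process sketch, not Storm code.

```python
from collections import deque


class ReplayableSpout:
    """Toy replayable source: emitted tuples stay pending until acked."""

    def __init__(self, items):
        # Each payload gets a message id so it can be acked or failed later.
        self.queue = deque(enumerate(items))
        self.pending = {}

    def next_tuple(self):
        """Emit the next (msg_id, payload), or None when nothing is queued."""
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        """Processing succeeded: forget the tuple."""
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        """Processing failed: put the tuple back so it is emitted again."""
        if msg_id in self.pending:
            self.queue.append((msg_id, self.pending.pop(msg_id)))
```

Because success is the normal case, the only cost on the happy path is the `pending` bookkeeping; the replay path is exercised only when `fail` is called.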
21. Fast. Want to go faster?
§ Eliminate non-memory components
  § Replace the disk based queue with a reliable in-memory queue
  § Replace disk based state persistence with in-memory persistence
  § Asynchronously update disk based state (C*)
22. Sample Architecture
23. References
§ Try the Cloudify recipe
  § Download Cloudify: http://www.cloudifysource.org/
  § Download the Recipe (apps/xapstream, services/xapstream):
    – https://github.com/CloudifySource/cloudify-recipes
§ XAP – Cassandra Interface Details:
  § http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
§ Check out the source for the XAP Spout and a sample state implementation backed by XAP, and a Storm friendly streaming implementation on github:
  § https://github.com/Gigaspaces/storm-integration
§ For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
  § http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
  § http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
  § Part 3 coming soon.
25. Twitter Storm With Cassandra
26. Storm Overview
27. Storm Concepts
§ Streams
  § Unbounded sequence of tuples
§ Spouts
  § Source of streams (Queues)
§ Bolts
  § Functions, Filters, Joins, Aggregations
§ Topologies
[Diagram labels: Spouts, Bolt, Topologies]
28. Challenge – Word Count
[Diagram labels: Tweets, Count, Word:Count]
• Hottest topics
• URL mentions
• etc.
29. Streaming word count with Storm
30. Supercharging Storm
§ Storm doesn’t supply persistence, but provides for it
§ Storm optimizes IO to slow persistence (e.g. databases) using batching.
§ Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.
[Diagram label: Tweets, events, whatever…]
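Batching, as mentioned above, trades a little latency for far fewer round trips to a slow store. The sketch below shows the idea in isolation; `BatchingWriter` and its `sink` callable are hypothetical and do not correspond to any Storm class.

```python
class BatchingWriter:
    """Buffers items and hands them to a slow sink in groups (illustrative only)."""

    def __init__(self, sink, batch_size):
        self.sink = sink              # callable that receives a list of items
        self.batch_size = batch_size
        self.buffer = []

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as one batch; a no-op when empty."""
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []          # start a fresh list; the sink keeps the old one
```

With a batch size of 3, seven items cause two sink calls during `add` and a third on the final `flush`, instead of seven individual writes.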
31. XAP Real Time Analytics
32. Two Layer Approach
§ Advantage: Minimal “impedance mismatch” between layers.
  – Both NoSQL cluster technologies, with similar advantages
§ Grid layer serves as an in memory cache for interactive requests.
§ Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability.
[Diagram labels: Raw Event Stream (x3), In Memory Compute Cluster, NoSQL Cluster, Real Time Events, Raw And Derived Events, Reporting Engine, SCALE]
33. Simplified Architecture
34. Key Concepts
§ Flowing event streams through memory for side effects
§ Event driven architecture executing in-memory
§ Raw events flushed, aggregations/derivations retained
§ All layers horizontally scalable
§ All layers highly available
§ Real-time analytics & cached batch analytics on same scalable layer
§ Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
35. Keep Things In Memory
Facebook keeps 80% of its data in memory (Stanford research)
RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
36. Takeaways
§ A data grid can serve different needs for big data analytics:
§ Supercharge a dedicated stream processing cluster like Storm.
  – Provide fast, reliable, transactional tuple streams and state
§ Provide a general purpose analytics platform
  – Roll your own
§ Simplify overall architecture while enhancing scalability
  – Ultra high performance/low latency
  – Dynamically scalable processing and in-memory storage
  – Eliminate messaging tier
  – Eliminate or minimize need for RDBMS
37. References
§ Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
§ Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
§ Twitter Storm: http://storm-project.net
§ XAP + Storm Detailed Blog Post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/