3. WHY REAL TIME?
real time trends
Emerging breakout trends on Twitter (in the form of #hashtags)
real time conversations
Real time sports conversations related to a topic (recent goal or touchdown)
real time recommendations
Real time product recommendations based on your behavior & profile
real time search
Real time search of tweets with a budget < 200 ms
5. TYPES OF ANALYTICS
varieties
CUBE ANALYTICS
Business Intelligence
PREDICTIVE ANALYTICS
Statistics and Machine learning
6. DIMENSIONS OF ANALYTICS
variants
REAL TIME
Ability to analyze the data instantly
BATCH
Ability to provide insights after several hours/days when a query is posed
7. REAL TIME ANALYTICS
dichotomy
STREAMING
Analyze data as it is being produced
INTERACTIVE
Store data and provide results instantly when a query is posed
8. STREAMING VS. INTERACTIVE
dichotomy
INTERACTIVE ANALYTICS
[Diagram labels: Bulk load, Data, Data Storage, Database Server, Queries, Static Batch Results/Reports]
STREAMING ANALYTICS
[Diagram labels: Data Stream, Processing, Queries, Results, Data Storage]
Real time alerts, Real time analytics
Continuous visibility
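The dichotomy can be illustrated with a minimal Python sketch. The class names `InteractiveStore` and `StreamingCounter` are hypothetical and chosen only for this illustration: the first stores data and answers a query when it is posed, the second updates its result as each record is produced.

```python
from collections import Counter

class InteractiveStore:
    """Interactive analytics: load the data first, answer queries afterwards."""
    def __init__(self):
        self.rows = []

    def bulk_load(self, rows):
        self.rows.extend(rows)

    def query_count(self, key):
        # The stored data is scanned only when the query is posed.
        return sum(1 for row in self.rows if row == key)

class StreamingCounter:
    """Streaming analytics: update the result as each record is produced."""
    def __init__(self):
        self.counts = Counter()

    def on_record(self, record):
        self.counts[record] += 1
        return self.counts[record]  # result is continuously visible

events = ["goal", "touchdown", "goal"]

store = InteractiveStore()
store.bulk_load(events)

stream = StreamingCounter()
running = [stream.on_record(event) for event in events]
```

Both paths end with the same count for "goal"; the difference is when the work happens and when the result becomes visible.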
9. WHAT IS REAL TIME?
milli secs or secs or mins?
BATCH
high throughput, few hours/days, adhoc queries
REAL TIME
approximate, few secs, visibility
OLTP
latency sensitive, < 500 ms, deterministic workflows
10. STREAMING SYSTEMS
First generation - SQL
NiagaraCQ Query Engine [Chen et al., SIGMOD 2000]
STREAM: The Stanford Stream Data Manager [Arasu et al., SIGMOD 2003]
Aurora: A Data Stream Management Engine [Abadi et al., SIGMOD 2003]
The Design of the Borealis Stream Processing Engine [Abadi et al., CIDR 2005]
Cayuga: A general purpose event monitoring system [Demers et al., CIDR 2007]
14. STORM DATA MODEL
SPOUTS
Sources of data for the topology (e.g. Postgres/MySQL/Kafka/Kestrel)
BOLTS
Units of computation on data (e.g. filtering/aggregation/join/transformations)
TOPOLOGY
Directed acyclic graph - vertices = computation, edges = streams of data
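The data model above can be sketched in a few lines of Python. This is not Storm's API: the `Topology` class and its methods are hypothetical, and the vertex names are illustrative. Vertices are spouts and bolts, edges are streams, and a topological sort (Kahn's algorithm) confirms that the graph is acyclic.

```python
from collections import defaultdict, deque

class Topology:
    """Hypothetical container: vertices are spouts/bolts, edges are streams."""
    def __init__(self):
        self.kinds = {}                 # vertex name -> "spout" or "bolt"
        self.edges = defaultdict(list)  # upstream name -> downstream names

    def add_spout(self, name):
        self.kinds[name] = "spout"

    def add_bolt(self, name, inputs):
        self.kinds[name] = "bolt"
        for upstream in inputs:
            self.edges[upstream].append(name)

    def execution_order(self):
        """Topological order of the vertices; raises if there is a cycle."""
        indegree = {name: 0 for name in self.kinds}
        for downstreams in self.edges.values():
            for name in downstreams:
                indegree[name] += 1
        ready = deque(sorted(n for n, deg in indegree.items() if deg == 0))
        order = []
        while ready:
            node = ready.popleft()
            order.append(node)
            for name in self.edges[node]:
                indegree[name] -= 1
                if indegree[name] == 0:
                    ready.append(name)
        if len(order) != len(self.kinds):
            raise ValueError("topology contains a cycle")
        return order

topology = Topology()
topology.add_spout("tweet_spout")
topology.add_bolt("parse_tweet_bolt", inputs=["tweet_spout"])
topology.add_bolt("word_count_bolt", inputs=["parse_tweet_bolt"])
order = topology.execution_order()
```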
15. WORD COUNT TOPOLOGY
LOGICAL PLAN
TWEET SPOUT -> PARSE TWEET BOLT -> WORD COUNT BOLT
Input: Live stream of Tweets
Output: #worldcup : 1M, soccer: 400K, ….
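A minimal single-process sketch of this logical plan, using Python generators in place of Storm components. The function names mirror the slide; the sample tweets and the whitespace tokenizer are assumptions made for illustration only.

```python
from collections import Counter

def tweet_spout(tweets):
    """Spout: emits one tuple per tweet (a list stands in for the live stream)."""
    for tweet in tweets:
        yield tweet

def parse_tweet_bolt(tweet_stream):
    """Bolt: splits each tweet into lowercase words."""
    for tweet in tweet_stream:
        for word in tweet.lower().split():
            yield word

def word_count_bolt(word_stream):
    """Bolt: keeps a running count per word."""
    counts = Counter()
    for word in word_stream:
        counts[word] += 1
    return counts

sample = ["#worldcup soccer tonight", "Soccer fans love #worldcup", "#worldcup"]
counts = word_count_bolt(parse_tweet_bolt(tweet_spout(sample)))
```

In a real topology each stage runs as many parallel tasks, which is why the routing question on the next slide matters.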
16. WORD COUNT TOPOLOGY
TWEET SPOUT TASKS -> PARSE TWEET BOLT TASKS -> WORD COUNT BOLT TASKS
When a parse tweet bolt task emits a tuple, which word count bolt task should it send it to?
17. STREAM GROUPINGS
combining data
SHUFFLE GROUPING
Randomly distributes tuples to next stage bolt instances
FIELDS GROUPING
Groups tuples by a single column value or multiple column values
ALL GROUPING
Replicates each tuple to all next stage bolt instances
GLOBAL GROUPING
Sends all the tuples to a single next stage bolt instance
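The four groupings can be sketched as routing functions that map a tuple to the indices of the downstream tasks that should receive it. This is an illustrative sketch, not Storm's implementation: the function signatures, the use of a CRC32 hash for fields grouping, and the choice of task 0 for global grouping are all assumptions.

```python
import random
import zlib

def shuffle_grouping(tup, num_tasks, rng=random):
    """Pick one downstream task at random."""
    return [rng.randrange(num_tasks)]

def fields_grouping(tup, num_tasks, fields):
    """Hash the selected fields so equal values always reach the same task."""
    key = "|".join(str(tup[field]) for field in fields)
    # zlib.crc32 is stable across runs, unlike Python's salted str hash.
    return [zlib.crc32(key.encode("utf-8")) % num_tasks]

def all_grouping(tup, num_tasks):
    """Replicate the tuple to every downstream task."""
    return list(range(num_tasks))

def global_grouping(tup, num_tasks):
    """Send every tuple to a single task (task 0 in this sketch)."""
    return [0]

first = fields_grouping({"word": "soccer", "count": 1}, 4, ["word"])
second = fields_grouping({"word": "soccer", "count": 7}, 4, ["word"])
```

For the word count topology, fields grouping on the word is the natural choice: every occurrence of a word reaches the same word count bolt task, so each task can keep a complete count for its words.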
21. DATA FLOW IN STORM WORKERS
[Diagram: TCP Receive Buffer (Kernel) -> Global Receive Thread -> In Queues -> User Logic Threads -> Out Queues -> Send Threads -> Outgoing Message Buffer -> Global Send Thread -> TCP Send Buffer (Kernel). Legend: Disruptor Queues, 0mq Queues]
Queue Contention
Multiple Languages
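The hand-off between a receive thread, a user logic thread, and a send thread can be sketched with Python's standard `queue.Queue` standing in for the Disruptor and 0mq queues named on the slide. Everything here is a simplification for illustration: a single user logic thread, one in queue, one out queue, and hypothetical function names.

```python
import queue
import threading

STOP = object()  # sentinel that shuts each stage down in turn

def receive_thread(incoming, in_queue):
    """Moves incoming messages onto the in queue."""
    for message in incoming:
        in_queue.put(message)
    in_queue.put(STOP)

def user_logic_thread(in_queue, out_queue, logic):
    """Applies user logic to each message and places the result on the out queue."""
    while True:
        message = in_queue.get()
        if message is STOP:
            out_queue.put(STOP)
            return
        out_queue.put(logic(message))

def send_thread(out_queue, sent):
    """Drains the out queue; a list stands in for the outgoing buffer."""
    while True:
        message = out_queue.get()
        if message is STOP:
            return
        sent.append(message)

def run_worker(incoming, logic):
    in_queue, out_queue, sent = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=receive_thread, args=(incoming, in_queue)),
        threading.Thread(target=user_logic_thread, args=(in_queue, out_queue, logic)),
        threading.Thread(target=send_thread, args=(out_queue, sent)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sent

result = run_worker(["goal", "touchdown"], str.upper)
```

Every message crosses two queues and three threads before it leaves the worker, which gives a sense of where queue contention can arise.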
23. STORM @TWITTER
> THOUSANDS
Largest storm cluster
> HUNDREDS
Several topologies deployed
> 3B
Several billion messages every day
> 50 TB
Large amount of data produced every day
1 stage / 8 stages
35. FINDING “ANOMALOUS” NODES
R PACKAGE[1]
SEASONALITY AND TREND AWARE
Employs time series decomposition and robust statistics
KEY FEATURES
Filter/Expected values/Long term
WIDELY USED OUTSIDE TWITTER
[1] https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series
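The idea of combining decomposition with robust statistics can be sketched as follows. This is not the R package's algorithm; it is a simplified stand-in that removes a seasonal component (the per-phase median), ignores trend, and flags residuals that lie far from the median when measured in units of the median absolute deviation (MAD). The function name, the `period` parameter, and the threshold of 3 are assumptions.

```python
from statistics import median

def robust_anomalies(series, period, threshold=3.0):
    """Return indices whose deseasonalized value is a robust outlier."""
    # Seasonal component: median of the values that share the same phase.
    seasonal = [median(series[phase::period]) for phase in range(period)]
    residuals = [value - seasonal[i % period] for i, value in enumerate(series)]
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    if mad == 0:
        # Degenerate case: any residual that differs from the center stands out.
        return [i for i, r in enumerate(residuals) if r != center]
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return [i for i, r in enumerate(residuals)
            if abs(r - center) / scale > threshold]

# Synthetic metric: a repeating pattern of period 4 with small noise and one spike.
pattern = [10, 20, 30, 20]
series = [pattern[i % 4] + (i % 3) - 1 for i in range(24)]
series[13] += 50
anomalies = robust_anomalies(series, period=4)
```

Medians and the MAD are used instead of the mean and standard deviation so that the anomalies themselves do not distort the baseline they are compared against.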
36. FINDING “ANOMALOUS” NODES
LEVERAGE MULTIPLE METRICS
Minimize false positives
EXPLOIT CORRELATION/TOPOLOGY
Observed variables[1] and latent variables
R PACKAGE
Applicable to univariate time series
[1] "Automatic Failure Diagnosis in Distributed Large-Scale Software Systems based on Timing Behavior Anomaly Correlation", by Marwede, N. S., Rohr, M., van Hoorn, A. and Hasselbring, W. In European CSMR, March 24-27, 2009.
37. FINDING “ANOMALOUS” NODES
intersection analysis
SERVICE COMPONENT HEALTH
Determine the intersection of the sets of anomalies of each instance
HOST HEALTH
Determine the intersection of the sets of anomalies of each process
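Intersection analysis reduces to a set intersection over the anomalies detected per member, where a member is an instance (service component health) or a process (host health). In this sketch the function name is hypothetical and anomalies are represented by the timestamps at which they were detected, which is an assumption.

```python
def common_anomalies(anomalies_by_member):
    """Intersect the anomaly sets of all members (instances or processes)."""
    sets = [set(anomalies) for anomalies in anomalies_by_member.values()]
    return set.intersection(*sets) if sets else set()

# Anomaly timestamps (seconds) per instance of one service component.
instance_anomalies = {
    "instance-1": {100, 160, 220},
    "instance-2": {160, 220},
    "instance-3": {160, 220, 400},
}
component_wide = common_anomalies(instance_anomalies)
```

An anomaly that survives the intersection was seen by every member, so it more likely reflects the health of the component or host than a problem in one instance or process.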
38. FINDING “ANOMALOUS” NODES
intersection analysis - validation
ANOMALY TYPE - INPUT SPIKE
All metrics had sudden spikes
ANOMALY TYPE - CONTAINER DEATH
All metrics of instances on that container had drops