SERC – CADL
Indian Institute of Science
Bangalore, India
TWITTER STORM
Real Time, Fault Tolerant Distributed Framework
Created : 25th May, 2013
SONAL RAJ
National Institute of Technology,
Jamshedpur, India
Background
• Created by Nathan Marz @ BackType/Twitter
• Analyze tweets, links, users on Twitter
• Open-sourced in September 2011
• Eclipse Public License 1.0
• Storm 0.5.2
• 16k Java and 7k Clojure LOC
• Current stable release: 0.8.2
• 0.9.0: major core improvements
Background
• Active user group
• https://groups.google.com/group/storm-user
• https://github.com/nathanmarz/storm
• Most-watched Java repo on GitHub (>4k watchers)
• Used by over 30 companies
• Twitter, Groupon, Alibaba, GumGum, …
What led to Storm
Problems
• Scale is painful
• Poor fault tolerance
• Hadoop is stateful
• Coding is tedious
• Batch processing
• Long latency
• No real-time processing
Storm: problems solved
• Scalable and robust
• No persistent layer
• Guarantees no data loss
• Fault-tolerant
• Programming-language agnostic
• Use cases
• Stream processing
• Distributed RPC
• Continuous computation
STORM FEATURES
• Guaranteed data processing
• Horizontal scalability
• Fault tolerance
• No intermediate message brokers
• Higher-level abstraction than message passing
• "Just works"
Storm's edge over Hadoop
HADOOP
• Batch processing
• Jobs run to completion
• JobTracker is a SPOF*
• Stateful nodes
• Scalable
• Guarantees no data loss
• Open source
STORM
• Real-time processing
• Topologies run forever
• No single point of failure
• Stateless nodes
• Scalable
• Guarantees no data loss
• Open source
* Hadoop 0.21 added some checkpointing
SPOF: Single Point Of Failure
Streaming Computation
Paradigm of stream computation: queues and workers
General method (diagram): messages flow through queues to workers
• Message routing can be complex
Storm use cases
COMPONENTS
• The Nimbus daemon is the master; it is comparable to the Hadoop JobTracker.
• The Supervisor daemon spawns workers; it is comparable to the Hadoop TaskTracker.
• A worker is spawned by a supervisor, one per port defined in the storm.yaml configuration.
• A task runs as a thread in a worker.
• ZooKeeper is a distributed system used to store metadata.
• The Nimbus and Supervisor daemons are fail-fast and stateless. All state is kept in ZooKeeper.
• All communication between Nimbus and the Supervisors goes through ZooKeeper.
• On a cluster with 2k+1 ZooKeeper nodes, the system can recover when at most k nodes fail.
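The 2k+1 rule above follows from majority quorums. The following is a minimal sketch of that arithmetic in plain Python; the function names are illustrative and are not part of ZooKeeper or Storm.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes may fail while a majority is still alive."""
    return ensemble_size - quorum_size(ensemble_size)


def is_available(ensemble_size: int, failed: int) -> bool:
    """True if the surviving nodes still form a majority."""
    return ensemble_size - failed >= quorum_size(ensemble_size)


# An ensemble of 2k+1 nodes tolerates k failures; an even-sized
# ensemble tolerates no more than the next smaller odd-sized one.
for n in (3, 4, 5):
    print(n, "nodes: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

This is why ensembles are usually sized with an odd number of nodes: going from 3 to 4 nodes raises the quorum without raising the number of failures that can be survived.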
STORM ARCHITECTURE
• Nimbus: master node (similar to the Hadoop JobTracker)
• ZooKeeper: used for cluster coordination
• Supervisors: run worker nodes / processes
CONCEPTS
• Streams
• Topology
• A spout
• A bolt
• An edge represents a grouping
Streams
Spouts
• Example
• Read from logs, API calls, event data, queues, …
SPOUTS
• Interface ISpout
Method Summary
void ack(java.lang.Object msgId)
Storm has determined that the tuple emitted by this spout with the msgId identifier has been fully processed.
void activate()
Called when a spout has been activated out of a deactivated mode.
void close()
Called when an ISpout is going to be shut down.
void deactivate()
Called when a spout has been deactivated.
void fail(java.lang.Object msgId)
The tuple emitted by this spout with the msgId identifier has failed to be fully processed.
void nextTuple()
When this method is called, Storm is requesting that the Spout emit tuples to the output collector.
void open(java.util.Map conf, TopologyContext context, SpoutOutputCollector collector)
Called when a task for this component is initialized within a worker on the cluster.
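To make the ack/fail contract concrete, here is a toy, in-memory stand-in for a reliable spout. It is a sketch in plain Python, not Storm's ISpout: the class name QueueSpout, the simplified open() without arguments, and the integer message ids are all assumptions made for illustration.

```python
from collections import deque


class QueueSpout:
    """Toy reliable spout: remembers emitted messages until they are acked."""

    def __init__(self, messages):
        self.queue = deque(messages)   # messages waiting to be emitted
        self.pending = {}              # msg_id -> message awaiting ack/fail
        self.next_id = 0
        self.active = False

    def open(self):
        # Simplified: the real open() also receives config, context and collector.
        self.active = True

    def next_tuple(self):
        """Emit one message, or None if there is nothing to emit."""
        if not self.active or not self.queue:
            return None
        msg = self.queue.popleft()
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = msg
        return msg_id, msg

    def ack(self, msg_id):
        # Fully processed: the message can be forgotten.
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Processing failed or timed out: put the message back for replay.
        msg = self.pending.pop(msg_id, None)
        if msg is not None:
            self.queue.append(msg)
```

A failed message is re-emitted under a new id on a later next_tuple() call, which is what gives the at-least-once behaviour described later in the deck.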
Bolts
•Bolts
• Processes input streams and produces new streams
• Example
• Stream Joins, DBs, APIs, Filters, Aggregation, …
BOLTS
• Interface IBolt
TOPOLOGY
•Topology
• is a graph where each node is a spout or bolt, and the edges
indicate which bolts are subscribing to which streams.
TASKS
• Parallelism is implemented by running multiple instances of each spout and bolt, so that similar work proceeds simultaneously. Spouts and bolts execute as many tasks across the cluster.
• Managed by the supervisor daemon
Stream groupings
When a tuple is emitted, which task does it go to?
• Shuffle grouping: pick a random task
• Fields grouping: consistent hashing on a subset of tuple fields
• All grouping: send to all tasks
• Global grouping: pick the task with the lowest id
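The four groupings can be sketched as functions that map a tuple to a list of target task ids. This is an illustration in plain Python, not Storm's implementation: tuples are modelled as dicts, and the fields grouping uses a stable CRC32 hash modulo the number of tasks, which is a simplification of the hashing named on the slide.

```python
import random
import zlib


def shuffle_grouping(tasks, tup, rng=random):
    """Pick one task at random."""
    return [rng.choice(tasks)]


def fields_grouping(tasks, tup, fields):
    """Pick a task from a stable hash of the selected fields."""
    key = "|".join(str(tup[f]) for f in fields)
    index = zlib.crc32(key.encode("utf-8")) % len(tasks)
    return [tasks[index]]


def all_grouping(tasks, tup):
    """Send the tuple to every task."""
    return list(tasks)


def global_grouping(tasks, tup):
    """Send the tuple to the task with the lowest id."""
    return [min(tasks)]
```

The property that matters for the word count example is that fields_grouping always sends equal field values to the same task, so one task sees every occurrence of a given word.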
Example: streaming word count
• TopologyBuilder is used to construct topologies in Java.
• Define a spout in the topology with a parallelism of 5 tasks.
• The consumer decides what data it receives and how it gets grouped.
• Split sentences into words with a parallelism of 8 tasks.
• Create a word count stream.
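// Package names as of Storm 0.8.x
import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;

// Bolt that delegates sentence splitting to a Python script
public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        super("python", "splitsentence.py");
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

# splitsentence.py
import storm

class SplitSentenceBolt(storm.BasicBolt):
    def process(self, tup):
        # Emit one tuple per word of the incoming sentence
        words = tup.values[0].split(" ")
        for word in words:
            storm.emit([word])

SplitSentenceBolt().run()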
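INSIDE A BOLT
// Package names as of Storm 0.8.x
import java.util.HashMap;
import java.util.Map;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.IBasicBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public static class WordCount implements IBasicBolt {
    // Running count per word, kept in this task's memory
    Map<String, Integer> counts = new HashMap<String, Integer>();

    public void prepare(Map conf, TopologyContext context) {
    }

    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);
        // Emit the updated count for this word
        collector.emit(new Values(word, count));
    }

    public void cleanup() {
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}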
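The word count flow (split sentences, route each word by a fields grouping, count per task) can be simulated in one process. This is a sketch in plain Python under stated assumptions: the function names, the three counting tasks and the CRC32-modulo routing are illustrative choices, not taken from the slides or from Storm.

```python
import zlib
from collections import defaultdict


def split_sentence(sentence):
    """Same job as the SplitSentence bolt: one word per output tuple."""
    return sentence.split(" ")


def run_word_count(sentences, num_count_tasks=3):
    """Route each word to one counting task and keep per-task counts."""
    task_counts = [defaultdict(int) for _ in range(num_count_tasks)]
    emitted = []  # (word, count) tuples, in emission order
    for sentence in sentences:
        for word in split_sentence(sentence):
            # Fields grouping on "word": equal words always reach the same task
            task = zlib.crc32(word.encode("utf-8")) % num_count_tasks
            task_counts[task][word] += 1
            emitted.append((word, task_counts[task][word]))
    return task_counts, emitted


task_counts, emitted = run_word_count(["the cow jumped", "the moon"])
print(emitted)
```

Because every occurrence of a word lands on the same task, each task's local count is the global count for the words it owns, and no coordination between counting tasks is needed.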
• Submitting topologies to the cluster
• Running the topology in local mode
Fault-Tolerance
• ZooKeeper stores metadata in a very robust way.
• Nimbus and Supervisor are stateless and only need metadata from ZooKeeper to work or restart.
• When a node dies
• The tasks will time out and be reassigned to other workers by Nimbus.
• When a worker dies
• The supervisor will restart the worker.
• Nimbus will reassign the worker to another supervisor if no heartbeats are sent.
• If that is not possible (no free ports), the tasks will be run on other workers in the topology. If more capacity is added to the cluster later, Storm will automatically initialize a new worker and spread out the tasks.
• When Nimbus or a supervisor dies
• Workers will continue to run.
• Workers cannot be reassigned without Nimbus.
• Nimbus and Supervisor should be run under a process monitoring tool that restarts them automatically if they fail.
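The heartbeat-and-reassign behaviour described above can be sketched as two small functions. This is a simplified model in plain Python, not Nimbus's scheduler: the worker names, the timeout value and the least-loaded placement rule are assumptions made for illustration.

```python
def find_dead_workers(last_heartbeat, now, timeout_secs):
    """Workers whose last heartbeat is older than the timeout."""
    return sorted(w for w, t in last_heartbeat.items() if now - t > timeout_secs)


def reassign(assignments, dead, live_workers):
    """Move tasks from dead workers onto the least-loaded live workers.

    Assumes live_workers is non-empty and contains no dead worker.
    """
    new = {w: list(tasks) for w, tasks in assignments.items() if w not in dead}
    for w in live_workers:
        new.setdefault(w, [])
    for w in dead:
        for task in assignments.get(w, []):
            # Ties are broken by worker name so the result is deterministic
            target = min(sorted(new), key=lambda name: len(new[name]))
            new[target].append(task)
    return new


heartbeats = {"w1": 100, "w2": 70}
dead = find_dead_workers(heartbeats, now=105, timeout_secs=30)
print(reassign({"w1": [1, 2], "w2": [3, 4]}, dead, ["w1", "w3"]))
```

Here w2 has been silent for 35 seconds and is declared dead, so its tasks move to the empty worker w3 while w1 keeps running undisturbed.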
AT LEAST ONCE Processing
• Storm guarantees at-least-once processing of tuples.
• A message id is assigned to a tuple when it is emitted from a spout or bolt. It is 64 bits long.
• The tree of tuples is the set of tuples generated (directly and indirectly) from a spout tuple.
• Ack is called on the spout when the tree of tuples for a spout tuple is fully processed.
• Fail is called on the spout if one of the tuples in the tree fails, or if the tree is not fully processed within a specified timeout (default is 30 seconds).
• It is possible to specify the message id when emitting a tuple. This might be useful for replaying tuples from a queue.
• The ack/fail method is called when the tree of tuples has been fully processed, or has failed or timed out.
At least once processing
• Anchoring is used to copy the spout tuple message id(s) to the new tuples generated. In this way, every tuple knows the message id(s) of all its spout tuples.
• Multi-anchoring is when a tuple is anchored to multiple tuples. If the tuple tree fails, then multiple spout tuples will be replayed. Useful for doing streaming joins and more.
• Ack called from a bolt indicates that the tuple has been processed as intended.
• Fail called from a bolt replays the spout tuple(s).
• Every tuple must be acked or failed, or the task will run out of memory at some point.
_collector.emit(tuple, new Values(word));   // uses anchoring
_collector.emit(new Values(word));          // does NOT use anchoring
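The difference between the anchored and unanchored emit can be shown with a toy tracker that records, for each tuple, which spout messages it descends from. This is a sketch in plain Python; the class TupleTreeTracker and its methods are hypothetical and do not mirror Storm's internal bookkeeping.

```python
class TupleTreeTracker:
    """Tracks which spout messages still have un-acked descendants."""

    def __init__(self):
        self.pending = {}   # spout msg id -> set of un-acked tuple ids
        self.roots = {}     # tuple id -> set of spout msg ids it is anchored to
        self.next_id = 0

    def _new_id(self):
        self.next_id += 1
        return self.next_id

    def emit_from_spout(self, msg_id):
        tid = self._new_id()
        self.roots[tid] = {msg_id}
        self.pending[msg_id] = {tid}
        return tid

    def emit(self, anchors=()):
        """Emit a tuple; it inherits the spout ids of every anchor."""
        tid = self._new_id()
        roots = set()
        for anchor in anchors:
            roots |= self.roots[anchor]
        self.roots[tid] = roots
        for root in roots:
            self.pending[root].add(tid)
        return tid

    def ack(self, tid):
        """Ack a tuple; return the spout msg ids whose tree is now complete."""
        completed = []
        for root in self.roots.pop(tid):
            self.pending[root].discard(tid)
            if not self.pending[root]:
                del self.pending[root]
                completed.append(root)
        return sorted(completed)
```

An unanchored tuple has an empty set of roots, so neither its ack nor its failure affects any spout message. That is the sense in which unanchored emits give up the replay guarantee.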
Exactly-once processing
• Transactional topologies (TT) are an abstraction built on Storm primitives.
• TT guarantee exactly-once processing of tuples.
• Acking is optimized in TT; there is no need to do anchoring or acking manually.
• Bolts execute as new instances for each attempt at processing a batch.
• Example: a spout (task 1) feeds two bolt tasks (task 2 and task 3) through an all grouping.
1. A spout tuple is emitted to tasks 2 and 3.
2. The worker responsible for task 3 fails.
3. The supervisor restarts the worker.
4. The spout tuple is replayed and emitted to tasks 2 and 3.
5. Tasks 2 and 3 initiate new bolts because this is a new attempt.
Now there is no problem.
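Why a fresh bolt instance per attempt avoids double counting can be seen in a small simulation. This is a sketch in plain Python under stated assumptions: BatchCounter, process_batch and run_with_retry are hypothetical names, and a raised exception stands in for a worker failure; it does not use Storm's transactional API.

```python
class BatchCounter:
    """Hypothetical per-attempt bolt: counts the tuples of one batch attempt."""

    def __init__(self):
        self.count = 0

    def execute(self, tup):
        self.count += 1

    def finish_batch(self):
        return self.count


def process_batch(batch, fail_after=None):
    """Process one attempt with a brand-new bolt instance."""
    bolt = BatchCounter()
    for i, tup in enumerate(batch):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated worker failure")
        bolt.execute(tup)
    return bolt.finish_batch()


def run_with_retry(batch, failures):
    """Replay the batch until an attempt succeeds.

    failures[i] is the tuple index at which attempt i dies (if any).
    Returns (result, number_of_attempts).
    """
    attempt = 0
    while True:
        fail_after = failures[attempt] if attempt < len(failures) else None
        try:
            return process_batch(batch, fail_after), attempt + 1
        except RuntimeError:
            attempt += 1  # partial state of the failed attempt is discarded
```

If a single long-lived counter were reused across attempts, the two tuples processed before the failure would be counted again on replay. Discarding the instance with the failed attempt is what keeps the final result equal to the batch size.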
ABSTRACTION : DRPC
Distributed RPC architecture (diagram): the client sends "args" to the DRPC server and receives "result". The DRPC server passes ["request-id", "args", "return-info"] to the topology, and the topology returns ["request-id", "result"] to the DRPC server.
WHY DRPC?
• Before distributed RPC, time-sensitive queries relied on a pre-computed index.
• Storm does away with the indexing.
abstraction : DRPC example
• Calculating the "reach" of a URL on the fly (in real time)
• Written by Nathan Marz as an application of Storm
• Real-world application of Storm, open source, available at http://github.com/nathanmarz/storm
• Reach is the number of unique people exposed to a URL (tweet) on Twitter at any given time.
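The reach definition above can be written as a few lines of set arithmetic. This is a single-process sketch in plain Python, assuming that "exposed" means following someone who tweeted the URL; the lookup tables and user names are made-up sample data, and the distributed version would spread the follower lookups across many tasks.

```python
def compute_reach(url, tweeters_of, followers_of):
    """Number of distinct followers of everyone who tweeted the URL."""
    reached = set()
    for tweeter in tweeters_of.get(url, []):
        reached.update(followers_of.get(tweeter, []))
    return len(reached)


# Hypothetical sample data
tweeters_of = {"foo.com/blog/1": ["sally", "bob"]}
followers_of = {
    "sally": ["bob", "tim", "alice"],
    "bob": ["alice", "tim", "adam"],
}
print(compute_reach("foo.com/blog/1", tweeters_of, followers_of))
```

The set removes followers who would otherwise be counted once per tweeter they follow, which is the "unique people" part of the definition.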
abstraction : DRPC >> computing reach
abstraction : DRPC >> reach topology
Reach topology (diagram): surviving labels are the spout, a shuffle grouping, the ["follower-id"] field and a global grouping
• Create the topology for the DRPC implementation of the reach computation
// prepare(), cleanup(), declareOutputFields() and imports are omitted, as on the slide
public static class PartialUniquer implements IRichBolt, FinishedCallback {
    OutputCollector _collector;
    // Keep the set of followers for each request id in memory
    Map<Object, Set<String>> _sets = new HashMap<Object, Set<String>>();

    public void execute(Tuple tuple) {
        Object id = tuple.getValue(0);
        Set<String> curr = _sets.get(id);
        if (curr == null) {
            curr = new HashSet<String>();
            _sets.put(id, curr);
        }
        curr.add(tuple.getString(1));
        _collector.ack(tuple);
    }

    @Override
    public void finishedId(Object id) {
        // All tuples for this request id have arrived: emit the partial count
        Set<String> curr = _sets.remove(id);
        int count = 0;
        if (curr != null) count = curr.size();
        _collector.emit(new Values(id, count));
    }
}
Guaranteeing message processing
• Tuple tree: a spout tuple is not fully processed until all tuples in the tree have been completed.
• If the tuple tree is not completed within a specified timeout, the spout tuple is replayed.
• Storm provides a built-in reliability API for this.
• An ack marks a single node in the tree as complete.
• "Anchoring" creates a new edge in the tuple tree.
• Storm tracks tuple trees for you in an extremely efficient way.
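One way to track a whole tuple tree in constant space is to XOR tuple ids into a single value per spout tuple: every id is XORed in once when the tuple is created and once when it is acked, so the value returns to zero exactly when nothing is outstanding. The following is a sketch of that idea in plain Python; the Acker class and its method names are illustrative and are not Storm's code.

```python
import random


def new_tuple_id(rng=random):
    """Random 64-bit id; zero is avoided because it would be a no-op in XOR."""
    return rng.getrandbits(64) or 1


class Acker:
    """Keeps one 64-bit value per spout tuple instead of the whole tree."""

    def __init__(self):
        self.ack_vals = {}

    def init(self, root_id, tuple_id):
        # The spout registers the id of the tuple it just emitted
        self.ack_vals[root_id] = tuple_id

    def update(self, root_id, value):
        """XOR a bolt's report into the tree; True when the tree is complete.

        A bolt reports: (id of the tuple it acks) XOR (ids of all tuples
        it emitted anchored to that tuple).
        """
        self.ack_vals[root_id] ^= value
        if self.ack_vals[root_id] == 0:
            del self.ack_vals[root_id]
            return True
        return False
```

For example, if the spout emits tuple 1, a bolt acks it after emitting anchored tuples 2 and 4, and both are later acked with no children, the reports are 1^2^4, then 2, then 4, and the stored value goes 1, 6, 4, 0. With random 64-bit ids the chance of reaching zero before the tree is really complete is negligible, which is what makes the scheme practical.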
Running a Storm application
• Local mode
• Runs in a single JVM
• Used for development, testing and debugging
• Remote mode
• Submits our processes to a Storm cluster, which has many processes running on different machines
• Doesn't show debugging info, hence it is considered production mode
STORM UI
(Screenshot of the Storm UI: component summary, bolt stats and input stats (all time), with host, port, emitted, transferred, process latency (ms), acked and failed columns per task.)
DOCUMENTATION
(Screenshot of the nathanmarz/storm wiki home page on GitHub.)
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Read these first
• Rationale
• Setting up development environment
• Creating a new Storm project
• Tutorial
Getting help
Feel free to ask questions on Storm's mailing list (the storm-user Google group).
You can also come to the #storm-user room on freenode. You can usually find a Storm developer there to help you out.
STORM LIBRARIES . .
Storm uses a lot of libraries. The most prominent are:
• Clojure: a new Lisp programming language. Crash course follows.
• Jetty: an embedded web server, used to host the UI of Nimbus.
• Kryo: a fast serializer, used when sending tuples.
• Thrift: a framework to build services. Nimbus is a Thrift daemon.
• ZeroMQ: a very fast transport layer.
• ZooKeeper: a distributed system for storing metadata.
References
•Twitter Storm
• Nathan Marz
• http://www.storm-project.org
•Storm
• nathanmarz@github
• http://www.github.com/nathanmarz/storm
•Realtime Analytics with Storm and Hadoop
• Hadoop_Summit

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 
Storm presentation
Storm presentationStorm presentation
Storm presentationShyam Raj
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Robert Evans
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceP. Taylor Goetz
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleDung Ngua
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormMd. Shamsur Rahim
 
Spark vs storm
Spark vs stormSpark vs storm
Spark vs stormTrong Ton
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter StormUwe Printz
 
Distributed Realtime Computation using Apache Storm
Distributed Realtime Computation using Apache StormDistributed Realtime Computation using Apache Storm
Distributed Realtime Computation using Apache Stormthe100rabh
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridDataWorks Summit
 
Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Dan Lynn
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleDataWorks Summit/Hadoop Summit
 

Was ist angesagt? (20)

Introduction to Storm
Introduction to StormIntroduction to Storm
Introduction to Storm
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & Example
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache Storm
 
Spark vs storm
Spark vs stormSpark vs storm
Spark vs storm
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter Storm
 
Storm and Cassandra
Storm and Cassandra Storm and Cassandra
Storm and Cassandra
 
Distributed Realtime Computation using Apache Storm
Distributed Realtime Computation using Apache StormDistributed Realtime Computation using Apache Storm
Distributed Realtime Computation using Apache Storm
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop Grid
 
Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as example
 
STORM
STORMSTORM
STORM
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
 

Ähnlich wie SERC – CADL Real Time Distributed Framework

Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormDavorin Vukelic
 
Cleveland HUG - Storm
Cleveland HUG - StormCleveland HUG - Storm
Cleveland HUG - Stormjustinjleet
 
.NET Multithreading/Multitasking
.NET Multithreading/Multitasking.NET Multithreading/Multitasking
.NET Multithreading/MultitaskingSasha Kravchuk
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Design for Test [DFT]-1 (1).pdf DESIGN DFT
Design for Test [DFT]-1 (1).pdf DESIGN DFTDesign for Test [DFT]-1 (1).pdf DESIGN DFT
Design for Test [DFT]-1 (1).pdf DESIGN DFTjayasreenimmakuri777
 
Profiler Guided Java Performance Tuning
Profiler Guided Java Performance TuningProfiler Guided Java Performance Tuning
Profiler Guided Java Performance Tuningosa_ora
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.DECK36
 
storm-170531123446.dotx.pptx
storm-170531123446.dotx.pptxstorm-170531123446.dotx.pptx
storm-170531123446.dotx.pptxIbrahimBenhadhria
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachAlexandre Rafalovitch
 
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Lucidworks
 
Multithreading Presentation
Multithreading PresentationMultithreading Presentation
Multithreading PresentationNeeraj Kaushik
 
Medical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUsMedical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUsDaniel Blezek
 
BWB Meetup: Storm - distributed realtime computation system
BWB Meetup: Storm - distributed realtime computation systemBWB Meetup: Storm - distributed realtime computation system
BWB Meetup: Storm - distributed realtime computation systemAndrii Gakhov
 
Let's Talk Locks!
Let's Talk Locks!Let's Talk Locks!
Let's Talk Locks!C4Media
 
Multi threading
Multi threadingMulti threading
Multi threadinggndu
 
Prologue O/S - Improving the Odds of Job Success
Prologue O/S - Improving the Odds of Job SuccessPrologue O/S - Improving the Odds of Job Success
Prologue O/S - Improving the Odds of Job Successinside-BigData.com
 
JDD 2017: Brace yourself! Storm is coming! (Łukasz Gebel, Michał Koziorowski)
JDD 2017: Brace yourself! Storm is coming! (Łukasz Gebel, Michał Koziorowski)JDD 2017: Brace yourself! Storm is coming! (Łukasz Gebel, Michał Koziorowski)
JDD 2017: Brace yourself! Storm is coming! (Łukasz Gebel, Michał Koziorowski)PROIDEA
 

Ähnlich wie SERC – CADL Real Time Distributed Framework (20)

Storm
StormStorm
Storm
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
Cleveland HUG - Storm
Cleveland HUG - StormCleveland HUG - Storm
Cleveland HUG - Storm
 
.NET Multithreading/Multitasking
.NET Multithreading/Multitasking.NET Multithreading/Multitasking
.NET Multithreading/Multitasking
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Storm 0.8.2
Storm 0.8.2Storm 0.8.2
Storm 0.8.2
 
Design for Test [DFT]-1 (1).pdf DESIGN DFT
Design for Test [DFT]-1 (1).pdf DESIGN DFTDesign for Test [DFT]-1 (1).pdf DESIGN DFT
Design for Test [DFT]-1 (1).pdf DESIGN DFT
 
Profiler Guided Java Performance Tuning
Profiler Guided Java Performance TuningProfiler Guided Java Performance Tuning
Profiler Guided Java Performance Tuning
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
 
Storm
StormStorm
Storm
 
storm-170531123446.dotx.pptx
storm-170531123446.dotx.pptxstorm-170531123446.dotx.pptx
storm-170531123446.dotx.pptx
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approach
 
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
 
Multithreading Presentation
Multithreading PresentationMultithreading Presentation
Multithreading Presentation
 
Medical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUsMedical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUs
 
BWB Meetup: Storm - distributed realtime computation system
BWB Meetup: Storm - distributed realtime computation systemBWB Meetup: Storm - distributed realtime computation system
BWB Meetup: Storm - distributed realtime computation system
 
Let's Talk Locks!
Let's Talk Locks!Let's Talk Locks!
Let's Talk Locks!
 
Multi threading
Multi threadingMulti threading
Multi threading
 
Prologue O/S - Improving the Odds of Job Success
Prologue O/S - Improving the Odds of Job SuccessPrologue O/S - Improving the Odds of Job Success
Prologue O/S - Improving the Odds of Job Success
 
JDD 2017: Brace yourself! Storm is coming! (Łukasz Gebel, Michał Koziorowski)
JDD 2017: Brace yourself! Storm is coming! (Łukasz Gebel, Michał Koziorowski)JDD 2017: Brace yourself! Storm is coming! (Łukasz Gebel, Michał Koziorowski)
JDD 2017: Brace yourself! Storm is coming! (Łukasz Gebel, Michał Koziorowski)
 

Mehr von Sonal Raj

Internet of Things with Python & Serverless - PyCon MY 2019 - Kuala Lumpur, M...
Internet of Things with Python & Serverless - PyCon MY 2019 - Kuala Lumpur, M...Internet of Things with Python & Serverless - PyCon MY 2019 - Kuala Lumpur, M...
Internet of Things with Python & Serverless - PyCon MY 2019 - Kuala Lumpur, M...Sonal Raj
 
IOT and Home Automation with Serverless Computing | Serverless Days 2019 | So...
IOT and Home Automation with Serverless Computing | Serverless Days 2019 | So...IOT and Home Automation with Serverless Computing | Serverless Days 2019 | So...
IOT and Home Automation with Serverless Computing | Serverless Days 2019 | So...Sonal Raj
 
Internet of Python - IOT with Python and Serverless | Sonal Raj | HydPy Feb 2019
Internet of Python - IOT with Python and Serverless | Sonal Raj | HydPy Feb 2019Internet of Python - IOT with Python and Serverless | Sonal Raj | HydPy Feb 2019
Internet of Python - IOT with Python and Serverless | Sonal Raj | HydPy Feb 2019Sonal Raj
 
Progressive Javascript: Why React when you can Vue?
Progressive Javascript: Why React when you can Vue?Progressive Javascript: Why React when you can Vue?
Progressive Javascript: Why React when you can Vue?Sonal Raj
 
Alexa enabled smart home programming in Python - PyCon India 2018
Alexa enabled smart home programming in Python - PyCon India 2018Alexa enabled smart home programming in Python - PyCon India 2018
Alexa enabled smart home programming in Python - PyCon India 2018Sonal Raj
 
Startup Diagnostics: Reasons why startups can fail.
Startup Diagnostics: Reasons why startups can fail.Startup Diagnostics: Reasons why startups can fail.
Startup Diagnostics: Reasons why startups can fail.Sonal Raj
 
IT Quiz Mains
IT Quiz MainsIT Quiz Mains
IT Quiz MainsSonal Raj
 
IT Quiz Prelims
IT Quiz PrelimsIT Quiz Prelims
IT Quiz PrelimsSonal Raj
 
Spock the human computer interaction system - synopsis
Spock   the human computer interaction system - synopsisSpock   the human computer interaction system - synopsis
Spock the human computer interaction system - synopsisSonal Raj
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Sonal Raj
 

Mehr von Sonal Raj (10)

Internet of Things with Python & Serverless - PyCon MY 2019 - Kuala Lumpur, M...
Internet of Things with Python & Serverless - PyCon MY 2019 - Kuala Lumpur, M...Internet of Things with Python & Serverless - PyCon MY 2019 - Kuala Lumpur, M...
Internet of Things with Python & Serverless - PyCon MY 2019 - Kuala Lumpur, M...
 
IOT and Home Automation with Serverless Computing | Serverless Days 2019 | So...
IOT and Home Automation with Serverless Computing | Serverless Days 2019 | So...IOT and Home Automation with Serverless Computing | Serverless Days 2019 | So...
IOT and Home Automation with Serverless Computing | Serverless Days 2019 | So...
 
Internet of Python - IOT with Python and Serverless | Sonal Raj | HydPy Feb 2019
Internet of Python - IOT with Python and Serverless | Sonal Raj | HydPy Feb 2019Internet of Python - IOT with Python and Serverless | Sonal Raj | HydPy Feb 2019
Internet of Python - IOT with Python and Serverless | Sonal Raj | HydPy Feb 2019
 
Progressive Javascript: Why React when you can Vue?
Progressive Javascript: Why React when you can Vue?Progressive Javascript: Why React when you can Vue?
Progressive Javascript: Why React when you can Vue?
 
Alexa enabled smart home programming in Python - PyCon India 2018
Alexa enabled smart home programming in Python - PyCon India 2018Alexa enabled smart home programming in Python - PyCon India 2018
Alexa enabled smart home programming in Python - PyCon India 2018
 
Startup Diagnostics: Reasons why startups can fail.
Startup Diagnostics: Reasons why startups can fail.Startup Diagnostics: Reasons why startups can fail.
Startup Diagnostics: Reasons why startups can fail.
 
IT Quiz Mains
IT Quiz MainsIT Quiz Mains
IT Quiz Mains
 
IT Quiz Prelims
IT Quiz PrelimsIT Quiz Prelims
IT Quiz Prelims
 
Spock the human computer interaction system - synopsis
Spock   the human computer interaction system - synopsisSpock   the human computer interaction system - synopsis
Spock the human computer interaction system - synopsis
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
 

SERC – CADL Real Time Distributed Framework

  • 1. SERC – CADL Indian Institute of Science Bangalore, India TWITTER STORM Real Time, Fault Tolerant Distributed Framework Created : 25th May, 2013 SONAL RAJ National Institute of Technology, Jamshedpur, India
  • 2. Background • Created by Nathan Marz @ BackType/Twitter • Analyze tweets, links, users on Twitter • Open-sourced in Sep 2011 • Eclipse Public License 1.0 • Storm 0.5.2 • 16k Java and 7k Clojure LOC • Current stable release: 0.8.2 • 0.9.0: major core improvement
  • 3. Background • Active user group • https://groups.google.com/group/storm-user • https://github.com/nathanmarz/storm • Most-watched Java repo on GitHub (>4k watchers) • Used by over 30 companies • Twitter, Groupon, Alibaba, GumGum, ...
  • 4. What led to Storm ...
  • 5. Problems ... • Scale is painful • Poor fault-tolerance • Hadoop is stateful • Coding is tedious • Batch processing • Long latency • No realtime
  • 6. Storm ... Problems solved! • Scalable and robust • No persistent layer • Guarantees no data loss • Fault-tolerant • Programming-language agnostic • Use cases • Stream processing • Distributed RPC • Continuous computation
  • 7. STORM FEATURES • Guaranteed data processing • Horizontal scalability • Fault-tolerance • No intermediate message brokers! • Higher-level abstraction than message passing • "Just works"
  • 8. Storm's edge over Hadoop • HADOOP: batch processing; jobs run to completion; JobTracker is a SPOF*; stateful nodes; scalable; guarantees no data loss; open source • STORM: real-time processing; topologies run forever; no single point of failure; stateless nodes; scalable; guarantees no data loss; open source • * Hadoop 0.21 added some checkpointing • SPOF: Single Point Of Failure
  • 10. Paradigm of stream computation Queues /Workers
  • 12. general method Message routing can be complex Messages Queue
  • 14. COMPONENTS • The Nimbus daemon is comparable to the Hadoop JobTracker. It is the master. • The Supervisor daemon spawns workers; it is comparable to the Hadoop TaskTracker. • A worker is spawned by a supervisor, one per port defined in the storm.yaml configuration. • A task runs as a thread in a worker. • Zookeeper is a distributed system used to store metadata. The Nimbus and Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper. Note that all communication between Nimbus and the Supervisors is done through Zookeeper. On a cluster with 2k+1 Zookeeper nodes, the system can recover when at most k nodes fail.
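The 2k+1 rule on the Components slide follows from majority voting: the ensemble keeps working as long as a strict majority of its nodes is alive. The sketch below only illustrates that arithmetic; the class and method names are made up for this example and are not part of Zookeeper or Storm.

```java
// Illustrative arithmetic only: QuorumMath is a hypothetical helper, not a Zookeeper API.
final class QuorumMath {
    /** Smallest number of nodes that forms a strict majority of the ensemble. */
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** Largest number of nodes that can fail while a majority is still alive. */
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 7; n++) {
            System.out.println(n + " nodes: quorum " + quorumSize(n)
                    + ", tolerates " + tolerableFailures(n) + " failure(s)");
        }
    }
}
```

With 2k+1 nodes the quorum is k+1, so k failures are tolerated. An even-sized ensemble tolerates no more failures than the odd size just below it, which is one reason odd sizes are the usual choice.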
  • 16. STORM ARCHITECTURE Master Node (similar to Hadoop JobTracker)
  • 17. STORM ARCHITECTURE Used for Cluster Co-ordination
  • 18. STORM ARCHITECTURE Runs Worker Nodes / Processes
  • 19. CONCEPTS • Streams • Topology • A spout • A bolt • An edge represents a grouping
  • 21. spouts • Example • Read from logs, API calls, event data, queues, …
  • 22. SPOUTS • Interface ISpout, method summary • void ack(java.lang.Object msgId): Storm has determined that the tuple emitted by this spout with the msgId identifier has been fully processed. • void activate(): Called when a spout has been activated out of a deactivated mode. • void close(): Called when an ISpout is going to be shutdown. • void deactivate(): Called when a spout has been deactivated. • void fail(java.lang.Object msgId): The tuple emitted by this spout with the msgId identifier has failed to be fully processed. • void nextTuple(): When this method is called, Storm is requesting that the Spout emit tuples to the output collector. • void open(java.util.Map conf, TopologyContext context, SpoutOutputCollector collector): Called when a task for this component is initialized within a worker on the cluster.
  • 23. Bolts •Bolts • Processes input streams and produces new streams • Example • Stream Joins, DBs, APIs, Filters, Aggregation, …
  • 25. TOPOLOGY •Topology • is a graph where each node is a spout or bolt, and the edges indicate which bolts are subscribing to which streams.
  • 26. TASKS • Parallelism is implemented by running multiple instances of each spout and bolt, which handle similar work simultaneously. Each spout and bolt executes as many tasks across the cluster. • Managed by the supervisor daemon
  • 27. Stream groupings When a tuple is emitted, which task does it go to?
  • 28. Stream grouping • Shuffle grouping: pick a random task • Fields grouping: consistent hashing on a subset of tuple fields • All grouping: send to all tasks • Global grouping: pick task with lowest id
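The four groupings can be pictured as four ways of choosing target task ids for a tuple. The following is a stand-alone sketch and not Storm code: the class and method names are invented for illustration, and the fields grouping is modeled as a plain hash of the key values modulo the task count, which is an assumption standing in for whatever hashing Storm applies internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical model of task selection; none of these names come from the Storm API.
final class GroupingSketch {
    /** Shuffle grouping: any task, chosen at random. */
    static int shuffle(int numTasks, Random rng) {
        return rng.nextInt(numTasks);
    }

    /** Fields grouping (assumed hash-mod model): equal key values always map to the same task. */
    static int fields(List<?> keyValues, int numTasks) {
        return Math.floorMod(keyValues.hashCode(), numTasks);
    }

    /** All grouping: every task receives a copy of the tuple. */
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int t = 0; t < numTasks; t++) {
            targets.add(t);
        }
        return targets;
    }

    /** Global grouping: the single task with the lowest id. */
    static int global(List<Integer> taskIds) {
        return Collections.min(taskIds);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        System.out.println("shuffle -> task " + shuffle(numTasks, new Random()));
        System.out.println("fields(\"storm\") -> task " + fields(Arrays.asList("storm"), numTasks));
        System.out.println("all -> tasks " + all(numTasks));
        System.out.println("global -> task " + global(Arrays.asList(7, 3, 5)));
    }
}
```

The property that matters for the word-count example is the one shown by `fields`: every occurrence of the same word lands on the same counting task, so a per-task in-memory map gives correct totals.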
  • 29. example : streaming word count • TopologyBuilder is used to construct topologies in Java. • Define a Spout in the Topology with parallelism of 5 tasks.
  • 30. abstraction : DRPC Consumer decides what data it receives and how it gets grouped • Split Sentences into words with parallelism of 8 tasks. • Create a word count stream
  • 31. ABSTRACTION : DRPC
      public static class SplitSentence extends ShellBolt implements IRichBolt {
          public SplitSentence() {
              super("python", "splitsentence.py");
          }
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word"));
          }
      }

      import storm
      class SplitSentenceBolt(storm.BasicBolt):
          def process(self, tup):
              words = tup.values[0].split(" ")
              for word in words:
                  storm.emit([word])
  • 32. INSIDE A BOLT ..
      public static class WordCount implements IBasicBolt {
          Map<String, Integer> counts = new HashMap<String, Integer>();
          public void prepare(Map conf, TopologyContext context) { }
          public void execute(Tuple tuple, BasicOutputCollector collector) {
              String word = tuple.getString(0);
              Integer count = counts.get(word);
              if (count == null) count = 0;
              count++;
              counts.put(word, count);
              collector.emit(new Values(word, count));
          }
          public void cleanup() { }
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word", "count"));
          }
      }
  • 33. abstraction : DRPC • Submitting Topologies to the cluster
  • 34. abstraction : DRPC • Running the Topology in Local Mode
  • 35. Fault-Tolerance • Zookeeper stores metadata in a very robust way • Nimbus and Supervisor are stateless and only need metadata from ZK to work/restart • When a node dies • The tasks will time out and be reassigned to other workers by Nimbus. • When a worker dies • The supervisor will restart the worker. • Nimbus will reassign the worker to another supervisor if no heartbeats are sent. • If that is not possible (no free ports), the tasks will be run on other workers in the topology. If more capacity is added to the cluster later, Storm will automatically initialize a new worker and spread out the tasks. • When Nimbus or a supervisor dies • Workers will continue to run • Workers cannot be reassigned without Nimbus • Nimbus and Supervisor should be run under a process monitoring tool that restarts them automatically if they fail.
  • 36. AT LEAST ONCE Processing • Storm guarantees at-least-once processing of tuples. • A message id is assigned to a tuple when it is emitted from a spout or bolt. It is 64 bits long. • The tree of tuples is the set of tuples generated (directly and indirectly) from a spout tuple. • Ack is called on the spout when the tree of tuples for a spout tuple is fully processed. • Fail is called on the spout if one of the tuples in the tree fails or the tree is not fully processed within a specified timeout (default is 30 seconds). • It is possible to specify the message id when emitting a tuple. This might be useful for replaying tuples from a queue. • The ack/fail method is called when the tree of tuples has been fully processed or has failed / timed out.
  • 37. AT LEAST ONCE Processing • Anchoring is used to copy the spout tuple message id(s) to the new tuples generated. In this way, every tuple knows the message id(s) of all spout tuples. • Multi-anchoring is when multiple tuples are anchored. If the tuple tree fails, then multiple spout tuples will be replayed. Useful for doing streaming joins and more. • Ack called from a bolt indicates the tuple has been processed as intended • Fail called from a bolt replays the spout tuple(s) • Every tuple must be acked/failed or the task will run out of memory at some point. • _collector.emit(tuple, new Values(word)); uses anchoring • _collector.emit(new Values(word)); does NOT use anchoring
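The ack/fail contract described on these two slides can be modeled without Storm at all. The sketch below is a hypothetical in-memory source whose method names mirror `nextTuple`, `ack`, and `fail` from ISpout, but it does not implement the Storm interface; the class name, the `-1` return value, and the re-queue-on-fail policy are assumptions made for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Stand-alone model of at-least-once delivery; not a Storm spout.
final class ReplayingSource {
    private final Queue<String> toEmit = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>();
    private long nextId = 0;

    ReplayingSource(List<String> messages) {
        toEmit.addAll(messages);
    }

    /** Emits the next queued message and returns its message id, or -1 if nothing is queued. */
    long nextTuple() {
        String msg = toEmit.poll();
        if (msg == null) {
            return -1;
        }
        long id = nextId++;
        pending.put(id, msg); // remembered until acked or failed
        return id;
    }

    /** Payload of a message that is still pending, or null if it is not pending. */
    String payload(long id) {
        return pending.get(id);
    }

    /** The tuple tree completed: the message can be forgotten. */
    void ack(long id) {
        pending.remove(id);
    }

    /** The tuple tree failed or timed out: queue the message for replay. */
    void fail(long id) {
        String msg = pending.remove(id);
        if (msg != null) {
            toEmit.add(msg);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ReplayingSource source = new ReplayingSource(Arrays.asList("a", "b"));
        long first = source.nextTuple();
        long second = source.nextTuple();
        source.ack(first);   // "a" was fully processed
        source.fail(second); // "b" failed and is queued again
        long replay = source.nextTuple();
        System.out.println("replayed: " + source.payload(replay));
    }
}
```

Because a failed message is emitted again under a new id, downstream code may see it more than once. That is the "at least once" in the slide title.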
  • 38. exactly once processing • Transactional topologies (TT) is an abstraction built on STORM primitives. • TT guarantees exactly-once-processing of tuples. • Acking is optimized in TT, no need to do anchoring or acking manually. • Bolts execute as new instances per attempt of processing a batch • Example All grouping Spout Task: 1 Bolt Task: 2 Bolt Task: 3 1. A spout tuple is emitted to task 2 and 3 2. Worker responsible for task 3 fails 3. Supervisor restarts worker 4. Spout tuple is replayed and emitted to task 2 and 3 5. Task 2 and 3 initiate new bolts because of new attempt Now there is no problem
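The slide does not say how replayed batches avoid double counting. One common technique, shown here as an assumption rather than as what the slide describes, is to store the id of the last applied transaction next to the state and skip any batch whose id has already been applied. The sketch assumes batches are committed in strictly increasing transaction-id order; all names are hypothetical.

```java
// Hypothetical idempotent counter; not part of the Storm API.
final class TransactionalCount {
    private long count = 0;
    private long lastTxId = -1; // id of the last batch applied to count

    /**
     * Applies a batch delta unless this transaction id was already applied.
     * Returns true if the state changed, false if the batch was a replay.
     */
    boolean applyBatch(long txId, long delta) {
        if (txId <= lastTxId) {
            return false; // replayed batch: state already reflects it
        }
        count += delta;
        lastTxId = txId;
        return true;
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        TransactionalCount state = new TransactionalCount();
        state.applyBatch(1, 5);
        state.applyBatch(1, 5); // replay of batch 1 is ignored
        state.applyBatch(2, 3);
        System.out.println("count = " + state.count());
    }
}
```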
  • 39. ABSTRACTION : DRPC Distributed RPC Architecture (diagram): the client sends "args" to the DRPC server and receives "result"; the DRPC server passes ["request-id", "args", "return-info"] to the topology; the topology returns ["request-id", "result"] to the DRPC server.
  • 40. WHY DRPC ? Before Distributed RPC, time-sensitive queries relied on a pre-computed index Storm Does away with the indexing !!
  • 41. abstraction : DRPC example • Calculating the "reach" of a URL on the fly (in real time!) • Written by Nathan Marz as an example Storm implementation • Real-world application of Storm, open source, available at http://github.com/nathanmarz/storm • Reach is the number of unique people exposed to a URL (tweet) on Twitter at any given time.
  • 42. abstraction : DRPC >> computing reach
  • 43. ABSTRACTION : DRPC >> REACH TOPOLOGY (diagram: spout, shuffle, ["follower-id"], global)
  • 44. abstraction : DRPC >> Reach topology Create the Topology for the DRPC Implementation of Reach Computation
  • 45. ABSTRACTION : DRPC • Keep set of followers for each request id in memory
      public static class PartialUniquer implements IRichBolt, FinishedCallback {
          OutputCollector _collector;
          Map<Object, Set<String>> _sets = new HashMap<Object, Set<String>>();

          public void execute(Tuple tuple) {
              Object id = tuple.getValue(0);
              Set<String> curr = _sets.get(id);
              if (curr == null) {
                  curr = new HashSet<String>();
                  _sets.put(id, curr);
              }
              curr.add(tuple.getString(1));
              _collector.ack(tuple);
          }

          @Override
          public void finishedId(Object id) {
              Set<String> curr = _sets.remove(id);
              int count = 0;
              if (curr != null) count = curr.size();
              _collector.emit(new Values(id, count));
          }
      }
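The reach computation can be sketched outside Storm as a partitioned distinct count: followers are routed to a partition by hashing the follower id, each partition keeps its own set (the role PartialUniquer plays per request id), and the partial counts are summed. The class name, the follower map, and the user names below are hypothetical sample data, not taken from the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Stand-alone model of the reach computation; no Storm classes are used.
final class ReachSketch {
    /** Counts distinct followers of everyone who tweeted the URL, using hash partitions. */
    static int reach(List<String> tweeters, Map<String, List<String>> followersOf, int partitions) {
        List<Set<String>> partial = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            partial.add(new HashSet<>());
        }
        for (String tweeter : tweeters) {
            for (String follower : followersOf.getOrDefault(tweeter, Collections.emptyList())) {
                // The same follower always hashes to the same partition, so no one is counted twice.
                int p = Math.floorMod(follower.hashCode(), partitions);
                partial.get(p).add(follower);
            }
        }
        int total = 0;
        for (Set<String> set : partial) {
            total += set.size(); // sum of the partial unique counts
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, List<String>> followersOf = new HashMap<>();
        followersOf.put("alice", Arrays.asList("carol", "dave"));
        followersOf.put("bob", Arrays.asList("dave", "erin"));
        // carol, dave, and erin are the distinct followers, so this prints 3.
        System.out.println(reach(Arrays.asList("alice", "bob"), followersOf, 4));
    }
}
```

Summing the partial counts is only correct because partitioning is by follower id. With random partitioning, a follower shared by two tweeters could be counted in two partitions.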
  • 51. Guaranteeing message processing • A spout tuple is not fully processed until all tuples in the tree have been completed. • If the tuple tree is not completed within a specified timeout, the spout tuple is replayed. • Uses Storm's built-in reliability API.
  • 52. Guaranteeing message processing • Ack marks a single node in the tree as complete • "Anchoring" creates a new edge in the tuple tree • Storm tracks tuple trees for you in an extremely efficient way
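Storm's documentation describes the efficient tracking as keeping one 64-bit value per spout tuple and XOR-ing every tuple id into it, once when the tuple is created and once when it is acked; the value returns to zero when every created tuple has been acked. The sketch below models only that idea with hypothetical names. The `started` flag is an addition for this example so that an empty tracker does not report completion.

```java
// Hypothetical model of XOR-based tuple-tree tracking; not Storm's acker implementation.
final class AckTracker {
    private long ackVal = 0;
    private boolean started = false;

    /** A tuple with this id was created in the tree (anchoring adds an edge). */
    void emitted(long tupleId) {
        started = true;
        ackVal ^= tupleId;
    }

    /** The tuple with this id was acked (one node of the tree is complete). */
    void acked(long tupleId) {
        ackVal ^= tupleId;
    }

    /** True once every emitted id has also been acked, regardless of order. */
    boolean complete() {
        return started && ackVal == 0;
    }

    public static void main(String[] args) {
        AckTracker tree = new AckTracker();
        long root = 0x9E3779B97F4A7C15L;
        long childA = 0x1234567890ABCDEFL;
        long childB = 0x0FEDCBA098765432L;
        tree.emitted(root);
        tree.emitted(childA); // anchored to root
        tree.emitted(childB); // anchored to root
        tree.acked(root);
        tree.acked(childB);
        System.out.println("complete before last ack: " + tree.complete());
        tree.acked(childA);
        System.out.println("complete after last ack: " + tree.complete());
    }
}
```

The memory cost is constant per spout tuple no matter how large the tree grows. With random 64-bit ids, the value reaching zero before the tree is really finished is very unlikely.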
  • 53. Running a Storm application • Local Mode • Runs on a single JVM • Used for development, testing, and debugging • Remote Mode • Submit the topology to a Storm cluster, which has many processes running on different machines. • Doesn't show debugging info, hence it is considered Production Mode.
  • 54. STORM UI (screenshot of the Storm UI: component summary, bolt stats, input stats)
  • 55. DOCUMENTATION (nathanmarz / storm wiki home page) Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use! Read these first • Rationale • Setting up development environment • Creating a new Storm project • Tutorial Getting help • Feel free to ask questions on Storm's mailing list: http://groups.google.com/group/storm-user • You can also come to the #storm-user room on freenode. You can usually find a Storm developer there to help you out. Related projects
  • 56. STORM LIBRARIES Storm uses a lot of libraries. The most prominent are • Clojure: a new Lisp programming language. Crash course follows • Jetty: an embedded web server, used to host the UI of Nimbus • Kryo: a fast serializer, used when sending tuples • Thrift: a framework to build services. Nimbus is a Thrift daemon • ZeroMQ: a very fast transport layer • Zookeeper: a distributed system for storing metadata
  • 57. References • Twitter Storm • Nathan Marz • http://www.storm-project.org • Storm • nathanmarz@github • http://www.github.com/nathanmarz/storm • Realtime Analytics with Storm and Hadoop • Hadoop_Summit