3. Overview
• Apache Storm is a free and open source distributed real-time
computation system.
• Storm makes it easy to reliably process unbounded streams of
data, doing for real-time processing what Hadoop did for batch
processing.
• Storm is fast: a benchmark clocked it at over a million tuples
processed per second per node
• Can be used with any programming language
4. Overview (cont)
• Use cases:
• Real-time analytics,
• Online machine learning,
• Continuous computation
• …
• Integration: works with any queueing system and any database
system, such as:
• Kafka
• Kestrel
• RabbitMQ / AMQP
• JMS
• Amazon Kinesis
6. Core Storm Concepts: Topology (cont)
• Topology: a graph of computation, consisting of nodes
and edges.
• Nodes: represent individual computations.
• Edges: represent the data being passed between nodes.
7. Core Storm Concepts: Tuple (cont)
• Nodes in a topology send data in the form of tuples
• Tuple: an ordered list of values, where each value is
assigned a name
• The process of sending a tuple is called emitting the tuple
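In plain Python (not Storm's tuple class), a `namedtuple` is a close analogue: an ordered list of values where each value has a name. The field names and values here are hypothetical:

```python
from collections import namedtuple

# A Storm tuple is an ordered list of named values; a namedtuple
# lets us read each value by position or by name (illustration only).
Commit = namedtuple("Commit", ["commit_id", "email"])

t = Commit("064174b", "alice@example.com")
```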
8. Core Storm Concepts: Stream (cont)
• Stream: an unbounded sequence of tuples between two
nodes in a topology.
• A topology can contain any number of streams
9. Core Storm Concepts: Spout (cont)
• Spout: the source of a stream in a topology
• Reads data from an external data source and emits tuples into
the topology.
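A minimal spout-like sketch in plain Python (not Storm's `IRichSpout` interface): it reads lines from an external source and "emits" one tuple per line. The feed contents are invented for the example:

```python
# Spout sketch: read from a data source, emit one tuple per record.
def commit_spout(feed_lines):
    for line in feed_lines:
        yield (line,)  # emit a one-field tuple holding the raw commit string

feed = ["b20ea50 nathan@example.com", "064874b andy@example.com"]
tuples = list(commit_spout(feed))
```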
10. Core Storm Concepts: Bolt (cont)
• Bolt: accepts a tuple from its input stream, performs some
computation or transformation – filtering, aggregation, a join
– on the tuple, and optionally emits one or more new tuples
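A bolt-like sketch in plain Python (not Storm's `IRichBolt` interface), showing the filtering case: accept a tuple, apply a computation, and optionally emit a new tuple. The domain-based filtering rule is an assumption made up for the example:

```python
# Bolt sketch: accept a tuple, filter on an assumed rule,
# and optionally emit a new tuple.
def filter_bolt(commit_tuple, domain="example.com"):
    raw = commit_tuple[0]
    # Emit only commits whose author email is in the given domain.
    if raw.endswith("@" + domain):
        return (raw,)
    return None  # filtered out: nothing is emitted
```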
11. Core Storm Concepts: Stream Grouping
• Defines how tuples are sent between instances of spouts
and bolts.
• Two most common groupings: shuffle grouping and fields
grouping
• SHUFFLE GROUPING: a stream grouping in which tuples
are distributed to bolt instances at random.
• FIELDS GROUPING: ensures that tuples with the same
value for a particular field are always routed to the
same bolt instance.
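The two groupings can be sketched as routing functions that pick one of `n` bolt instances for a tuple (plain Python, not Storm's implementation):

```python
import random

# Shuffle grouping: any instance may receive the tuple, chosen at random.
def shuffle_grouping(n, rng=random):
    return rng.randrange(n)

# Fields grouping: hashing the field value guarantees the same value
# always reaches the same instance.
def fields_grouping(field_value, n):
    return hash(field_value) % n
```

The design point: shuffle grouping balances load evenly, while fields grouping trades some balance for the guarantee that all tuples sharing a key land on one instance, so per-key state (such as a counter) stays consistent.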
12. Components of Storm Cluster
• Two kinds of nodes: Master and Worker
• Master node runs daemon called Nimbus
• Worker node runs daemon called Supervisor
• All coordination between Nimbus and Supervisor is done
through Zookeeper.
14. Example: GitHub Commit Feed (cont)
• Each commit comes into the feed as a single string containing the
COMMIT_ID, followed by a SPACE, followed by the EMAIL.
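Parsing that format is a single split on the first space (the commit id and email below are made up):

```python
# Each feed entry is "COMMIT_ID EMAIL", separated by a single space.
def parse_commit(line):
    commit_id, email = line.split(" ", 1)
    return commit_id, email
```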
15. Breaking Down the Problem
• Component: reads from the live feed of
commits and produces a single
commit message
• Component: accepts a single commit
message, extracts the developer's
email from that commit, and produces
the email
• Component: accepts the developer's
email and updates an in-memory map
where the key is the email and the value is
the number of commits for that email.
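The three components above can be chained end to end as a plain-Python sketch (not the Storm API; the commit ids and emails are invented):

```python
from collections import defaultdict

# End-to-end sketch of the pipeline:
#   spout: produce one raw commit string per feed line
#   bolt:  extract the developer's email
#   bolt:  update an in-memory map of email -> commit count
def count_commits(feed_lines):
    counts = defaultdict(int)
    for line in feed_lines:            # spout: emit each raw commit
        _, email = line.split(" ", 1)  # bolt: extract the email
        counts[email] += 1             # bolt: update the count
    return dict(counts)

feed = [
    "b20ea50 nathan@example.com",
    "064874b andy@example.com",
    "28e2897 nathan@example.com",
]
```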