Storm is a free and open-source distributed real-time computation system. It is fault-tolerant, scalable, and guarantees that data will be processed. Storm topologies can integrate data streams from multiple sources and languages, and run computations across a cluster of machines. Companies use it for applications like stream processing, distributed RPC, and continuous computation.
4. Overview
• free and open source
• integrates with any queuing and
database system
• distributed and scalable
• fault-tolerant
• supports multiple languages
5. Scalable
Storm topologies are inherently parallel and run across a cluster of machines.
Different parts of the topology can be scaled individually by tweaking their
parallelism.
The "rebalance" command of the "storm" command line client can adjust the
parallelism of running topologies on the fly.
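For example, redistributing a running topology might look like this (the topology and component names here are hypothetical):

```shell
# Spread "mytopology" across 5 worker processes, and set the parallelism
# of the (hypothetical) components "tweet-spout" and "count-bolt"
# to 3 and 10 executors respectively -- without restarting the topology.
storm rebalance mytopology -n 5 -e tweet-spout=3 -e count-bolt=10
```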
6. Fault tolerant
When workers die, Storm will automatically restart them.
If a node dies, the worker will be restarted on another node.
The Storm daemons, Nimbus and the Supervisors, are designed to be stateless
and fail-fast.
7. Guarantees data processing
Storm guarantees every tuple will be fully processed. One of Storm's core
mechanisms is the ability to track the lineage of a tuple as it makes its way
through the topology in an extremely efficient way.
Messages are only replayed when there are failures. Storm's basic abstractions
provide an at-least-once processing guarantee, the same guarantee you get
when using a queueing system.
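The efficient lineage tracking rests on an XOR trick: the acker XORs the 64-bit ids of every tuple in a tuple tree into a single value, which returns to zero exactly when every emitted tuple has been acked. A toy Python sketch of the idea (not Storm's actual implementation):

```python
class AckerSketch:
    """Toy model of Storm's XOR-based tuple tracking (one ledger per spout tuple)."""

    def __init__(self):
        self.ack_val = 0  # XOR of all outstanding tuple ids

    def emitted(self, tuple_id):
        # A new tuple joined the tree: XOR its id in.
        self.ack_val ^= tuple_id

    def acked(self, tuple_id):
        # The tuple was fully processed: XOR its id out again.
        self.ack_val ^= tuple_id

    def tree_complete(self):
        # Zero means every emitted id was matched by an ack.
        return self.ack_val == 0

tracker = AckerSketch()
for t in [0xA1, 0xB2, 0xC3]:   # stand-ins for random 64-bit tuple ids
    tracker.emitted(t)
for t in [0xA1, 0xB2, 0xC3]:
    tracker.acked(t)
print(tracker.tree_complete())  # True once every ack has arrived
```

The point of the trick is that tracking an arbitrarily large tuple tree costs a constant amount of memory per spout tuple.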
8. Use with many languages
Storm was designed from the ground up to be usable with any programming
language.
Spouts and bolts can be defined in any language. Non-JVM spouts and bolts
communicate with Storm over a JSON-based protocol on stdin/stdout.
Adapters that implement this protocol exist for Ruby, Python, JavaScript, Perl,
and PHP.
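The wire format is simple: each message is a JSON payload followed by a line containing only `end`. The exact message fields are defined by Storm's multilang protocol; this sketch only illustrates the framing:

```python
import json

def encode_message(msg):
    """Frame a message the way the multilang protocol does:
    a JSON payload followed by an 'end' terminator line."""
    return json.dumps(msg) + "\nend\n"

def decode_messages(stream_text):
    """Split a stream of framed messages back into JSON objects."""
    out = []
    for chunk in stream_text.split("\nend\n"):
        if chunk.strip():
            out.append(json.loads(chunk))
    return out

# A non-JVM bolt acking a tuple would write something like this to stdout:
framed = encode_message({"command": "ack", "id": "some-tuple-id"})
print(decode_messages(framed))  # [{'command': 'ack', 'id': 'some-tuple-id'}]
```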
10. How Storm works: basic concepts
Topology
A topology is a graph of computation. A topology runs forever, or until you kill it.
Stream
A stream is an unbounded sequence of tuples.
Spout
A spout is a source of streams.
Bolt
A bolt is where computation happens. Bolts can run functions, filter tuples, do
streaming aggregations and joins, talk to databases, and more.
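As a plain-Python sketch of what a filter-plus-aggregation bolt does per tuple (real bolts implement Storm's bolt interfaces; this only models the logic):

```python
from collections import Counter

class WordCountBoltSketch:
    """Models a bolt that filters out short words and keeps a rolling count."""

    def __init__(self, min_length=3):
        self.min_length = min_length
        self.counts = Counter()

    def execute(self, tuple_values):
        # tuple_values stands in for the fields of an incoming Storm tuple.
        word = tuple_values[0]
        if len(word) < self.min_length:
            return None  # filtered out, nothing emitted downstream
        self.counts[word] += 1
        return (word, self.counts[word])  # what the bolt would emit

bolt = WordCountBoltSketch()
for w in ["storm", "is", "storm", "fast"]:
    emitted = bolt.execute((w,))
    if emitted:
        print(emitted)
```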
11. How Storm works: basic concepts
Worker process
A worker process executes a subset of a topology. A worker process belongs to
a specific topology and may run one or more executors for one or more
components (spouts or bolts) of this topology.
Executor (thread)
An executor is a thread spawned by a worker process. It may run one or more
tasks for the same component (spout or bolt), using a single thread for all of them.
Task
A task performs the actual data processing: each spout or bolt that you implement in
your code runs as some number of tasks across the cluster. The number of tasks for a
component stays the same throughout the lifetime of a topology, even though the
number of executors can change.
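The relationship between the levels can be sketched as a simple assignment: a component's fixed set of tasks is spread over however many executors are currently configured. This is a toy model, not Storm's actual scheduler:

```python
def assign_tasks(num_tasks, num_executors):
    """Round-robin a component's fixed set of tasks over its executors.
    Rebalancing changes num_executors; the task count itself never
    changes while the topology runs."""
    executors = [[] for _ in range(num_executors)]
    for task_id in range(num_tasks):
        executors[task_id % num_executors].append(task_id)
    return executors

# 6 tasks over 2 executor threads -> 3 tasks per executor
print(assign_tasks(6, 2))   # [[0, 2, 4], [1, 3, 5]]
# After rebalancing to 3 executors, the same 6 tasks spread differently
print(assign_tasks(6, 3))   # [[0, 3], [1, 4], [2, 5]]
```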
17. Read More about Storm
• Storm
http://storm-project.net/
• Example Storm Topologies
https://github.com/nathanmarz/storm-starter
• Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
• Understanding the Internal Message Buffers of Storm
http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
• Understanding the Parallelism of a Storm Topology
http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
20. Our data flow (simplified)
[Diagram: streams from Twitter, Facebook, Google+, blogs, comments, online media, offline media, and reviews flow through processing, classification, and analyzing stages into ElasticSearch.]
21. Problem overview
• we have a number of streams that spout items
• for every item we run different calculations
• after the calculations we save each item into one or more
storages – ElasticSearch, PostgreSQL, etc.
• if processing fails because of an environment issue,
we want to re-queue the item easily
• some of our calculations can run in parallel
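The easy-re-queue requirement maps naturally onto a reliable-queue pattern; with Redis this is typically done by moving items into a processing list with RPOPLPUSH. Here is a stdlib-only sketch of the idea (no Redis involved):

```python
from collections import deque

class ReliableQueueSketch:
    """Sketch of a reliable queue: popped items stay 'in flight' until
    acked, and a failed item is pushed back for another attempt."""

    def __init__(self, items=()):
        self.pending = deque(items)
        self.in_flight = {}
        self._next_token = 0

    def pop(self):
        if not self.pending:
            return None
        token = self._next_token
        self._next_token += 1
        self.in_flight[token] = self.pending.popleft()
        return token, self.in_flight[token]

    def ack(self, token):
        # Processing succeeded: drop the item for good.
        del self.in_flight[token]

    def fail(self, token):
        # Environment issue: put the item back at the head of the queue.
        self.pending.appendleft(self.in_flight.pop(token))

q = ReliableQueueSketch(["item-1", "item-2"])
token, item = q.pop()
q.fail(token)          # e.g. the storage backend was unreachable
print(q.pop()[1])      # "item-1" comes back for a retry
```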
22. Solution
• Redis-based queues for spouting
• 1-2 spouts per topology
• 1 bulk bolt for storage writing per worker
• Storm cluster with 2 nodes:
32 GB RAM, 4-core i7 CPU, Java 7, Ubuntu 12.04
• ~ 20 items per sec (could be increased)
• 3 worker slots per node, 198 tasks, 68 executors
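The "bulk bolt" amounts to buffering tuples and flushing them to storage in batches, amortizing round-trips to ElasticSearch or PostgreSQL. A minimal sketch (the batch size and flush hook are illustrative, not from the deck):

```python
class BulkWriterSketch:
    """Buffers items and writes them out in batches, the way a bulk
    storage bolt amortizes round-trips to the storage backend."""

    def __init__(self, flush_fn, batch_size=10):
        self.flush_fn = flush_fn    # e.g. a bulk-index call in real life
        self.batch_size = batch_size
        self.buffer = []

    def execute(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(list(self.buffer))
            self.buffer.clear()

batches = []
writer = BulkWriterSketch(batches.append, batch_size=3)
for i in range(7):
    writer.execute(i)
writer.flush()   # flush the trailing partial batch
print(batches)   # [[0, 1, 2], [3, 4, 5], [6]]
```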