Storm is a free and open-source distributed real-time computation system. It is fault-tolerant, scalable, and guarantees that data will be processed. Storm topologies can integrate data streams from multiple sources and languages, and run computations across a cluster of machines. Companies use it for applications like stream processing, distributed RPC, and continuous computation.
4. Overview
• free and open source
• integrates with any queuing and
database system
• distributed and scalable
• fault-tolerant
• supports multiple languages
5. Scalable
Storm topologies are inherently parallel and run across a cluster of machines.
Different parts of the topology can be scaled individually by tweaking their
parallelism.
The "rebalance" command of the "storm" command line client can adjust the
parallelism of running topologies on the fly.
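For example, redistributing a running topology might look like this (the topology and component names here are hypothetical):

```shell
# Spread "mytopology" across 5 worker processes, and set the parallelism
# of the (hypothetical) components "tweet-spout" and "count-bolt"
# to 3 and 10 executors respectively -- without restarting the topology.
storm rebalance mytopology -n 5 -e tweet-spout=3 -e count-bolt=10
```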
6. Fault tolerant
When workers die, Storm will automatically restart them.
If a node dies, the worker will be restarted on another node.
The Storm daemons, Nimbus and the Supervisors, are designed to be stateless
and fail-fast.
7. Guarantees data processing
Storm guarantees every tuple will be fully processed. One of Storm's core
mechanisms is the ability to track the lineage of a tuple as it makes its way
through the topology in an extremely efficient way.
Messages are only replayed when there are failures. Storm's basic abstractions
provide an at-least-once processing guarantee, the same guarantee you get
when using a queueing system.
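The efficient lineage tracking rests on an XOR trick: the acker XORs the 64-bit ids of every tuple in a tuple tree into a single value, which returns to zero exactly when every emitted tuple has been acked. A toy Python sketch of the idea (not Storm's actual implementation):

```python
class AckerSketch:
    """Toy model of Storm's XOR-based tuple tracking (one ledger per spout tuple)."""

    def __init__(self):
        self.ack_val = 0  # XOR of all outstanding tuple ids

    def emitted(self, tuple_id):
        # A new tuple joined the tree: XOR its id in.
        self.ack_val ^= tuple_id

    def acked(self, tuple_id):
        # The tuple was fully processed: XOR its id out again.
        self.ack_val ^= tuple_id

    def tree_complete(self):
        # Zero means every emitted id was matched by an ack.
        return self.ack_val == 0

tracker = AckerSketch()
for t in [0xA1, 0xB2, 0xC3]:   # stand-ins for random 64-bit tuple ids
    tracker.emitted(t)
for t in [0xA1, 0xB2, 0xC3]:
    tracker.acked(t)
print(tracker.tree_complete())  # True once every ack has arrived
```

The point of the trick is that tracking an arbitrarily large tuple tree costs a constant amount of memory per spout tuple.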
8. Use with many languages
Storm was designed from the ground up to be usable with any programming
language.
Spouts and bolts can be defined in any language. Non-JVM spouts and bolts
communicate with Storm over a JSON-based protocol on stdin/stdout.
Adapters that implement this protocol exist for Ruby, Python, JavaScript, Perl,
and PHP.
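The wire format is simple: each message is a JSON payload followed by a line containing only `end`. The exact message fields are defined by Storm's multilang protocol; this sketch only illustrates the framing:

```python
import json

def encode_message(msg):
    """Frame a message the way the multilang protocol does:
    a JSON payload followed by an 'end' terminator line."""
    return json.dumps(msg) + "\nend\n"

def decode_messages(stream_text):
    """Split a stream of framed messages back into JSON objects."""
    out = []
    for chunk in stream_text.split("\nend\n"):
        if chunk.strip():
            out.append(json.loads(chunk))
    return out

# A non-JVM bolt acking a tuple would write something like this to stdout:
framed = encode_message({"command": "ack", "id": "some-tuple-id"})
print(decode_messages(framed))  # [{'command': 'ack', 'id': 'some-tuple-id'}]
```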
10. How Storm works: basic concepts
Topology
A topology is a graph of computation. A topology runs forever, or until you kill it.
Stream
A stream is an unbounded sequence of tuples.
Spout
A spout is a source of streams.
Bolt
A bolt is where computation happens. Bolts can run functions, filter tuples, do
streaming aggregations and joins, talk to databases, and more.
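As a plain-Python sketch of what a filter-plus-aggregation bolt does per tuple (real bolts implement Storm's bolt interfaces; this only models the logic):

```python
from collections import Counter

class WordCountBoltSketch:
    """Models a bolt that filters out short words and keeps a rolling count."""

    def __init__(self, min_length=3):
        self.min_length = min_length
        self.counts = Counter()

    def execute(self, tuple_values):
        # tuple_values stands in for the fields of an incoming Storm tuple.
        word = tuple_values[0]
        if len(word) < self.min_length:
            return None  # filtered out, nothing emitted downstream
        self.counts[word] += 1
        return (word, self.counts[word])  # what the bolt would emit

bolt = WordCountBoltSketch()
for w in ["storm", "is", "storm", "fast"]:
    emitted = bolt.execute((w,))
    if emitted:
        print(emitted)
```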
11. How Storm works: basic concepts
Worker process
A worker process executes a subset of a topology. A worker process belongs to
a specific topology and may run one or more executors for one or more
components (spouts or bolts) of this topology.
Executor (thread)
An executor is a thread spawned by a worker process. It may run one or more
tasks for the same component (spout or bolt), using a single thread for all of them.
Task
A task performs the actual data processing: each spout or bolt that you implement in
your code runs as some number of tasks across the cluster. The number of tasks for a
component stays the same throughout the lifetime of a topology, even though the
number of executors can change.
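The relationship between the levels can be sketched as a simple assignment: a component's fixed set of tasks is spread over however many executors are currently configured. This is a toy model, not Storm's actual scheduler:

```python
def assign_tasks(num_tasks, num_executors):
    """Round-robin a component's fixed set of tasks over its executors.
    Rebalancing changes num_executors; the task count itself never
    changes while the topology runs."""
    executors = [[] for _ in range(num_executors)]
    for task_id in range(num_tasks):
        executors[task_id % num_executors].append(task_id)
    return executors

# 6 tasks over 2 executor threads -> 3 tasks per executor
print(assign_tasks(6, 2))   # [[0, 2, 4], [1, 3, 5]]
# After rebalancing to 3 executors, the same 6 tasks spread differently
print(assign_tasks(6, 3))   # [[0, 3], [1, 4], [2, 5]]
```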
17. Read More about Storm
• Storm
http://storm-project.net/
• Example Storm Topologies
https://github.com/nathanmarz/storm-starter
• Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
• Understanding the Internal Message Buffers of Storm
http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
• Understanding the Parallelism of a Storm Topology
http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
20. Our data flow (simplified)
[Diagram: streams from Twitter, Facebook, Google+, blogs, comments, online media, offline media, and reviews flow through processing, classification, and analyzing stages into ElasticSearch.]
21. Problem overview
• we have a number of streams that spout items
• for every item we run different calculations
• after the calculations we save each item into one or more
storages – ElasticSearch, PostgreSQL, etc.
• if processing fails because of an environment issue,
we want to re-queue the item easily
• some of our calculations can run in parallel
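The easy-re-queue requirement maps naturally onto a reliable-queue pattern; with Redis this is typically done by moving items into a processing list with RPOPLPUSH. Here is a stdlib-only sketch of the idea (no Redis involved):

```python
from collections import deque

class ReliableQueueSketch:
    """Sketch of a reliable queue: popped items stay 'in flight' until
    acked, and a failed item is pushed back for another attempt."""

    def __init__(self, items=()):
        self.pending = deque(items)
        self.in_flight = {}
        self._next_token = 0

    def pop(self):
        if not self.pending:
            return None
        token = self._next_token
        self._next_token += 1
        self.in_flight[token] = self.pending.popleft()
        return token, self.in_flight[token]

    def ack(self, token):
        # Processing succeeded: drop the item for good.
        del self.in_flight[token]

    def fail(self, token):
        # Environment issue: put the item back at the head of the queue.
        self.pending.appendleft(self.in_flight.pop(token))

q = ReliableQueueSketch(["item-1", "item-2"])
token, item = q.pop()
q.fail(token)          # e.g. the storage backend was unreachable
print(q.pop()[1])      # "item-1" comes back for a retry
```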
22. Solution
• Redis-based queues for spouting
• 1-2 spouts per topology
• 1 bulk bolt for storage writing per worker
• Storm cluster with 2 nodes:
32 GB RAM, 4-core i7 CPU, Java 7, Ubuntu 12.04
• ~ 20 items per sec (could be increased)
• 3 worker slots per node, 198 tasks, 68 executors
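The "bulk bolt" amounts to buffering tuples and flushing them to storage in batches, amortizing round-trips to ElasticSearch or PostgreSQL. A minimal sketch (the batch size and flush hook are illustrative, not from the deck):

```python
class BulkWriterSketch:
    """Buffers items and writes them out in batches, the way a bulk
    storage bolt amortizes round-trips to the storage backend."""

    def __init__(self, flush_fn, batch_size=10):
        self.flush_fn = flush_fn    # e.g. a bulk-index call in real life
        self.batch_size = batch_size
        self.buffer = []

    def execute(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(list(self.buffer))
            self.buffer.clear()

batches = []
writer = BulkWriterSketch(batches.append, batch_size=3)
for i in range(7):
    writer.execute(i)
writer.flush()   # flush the trailing partial batch
print(batches)   # [[0, 1, 2], [3, 4, 5], [6]]
```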