Diese Präsentation wurde erfolgreich gemeldet.
Die SlideShare-Präsentation wird heruntergeladen. ×
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Nächste SlideShare
Storm Real Time Computation
Storm Real Time Computation
Wird geladen in …3
×

Hier ansehen

1 von 84 Anzeige
Anzeige

Weitere Verwandte Inhalte

Diashows für Sie (20)

Andere mochten auch (20)

Anzeige

Ähnlich wie Storm (20)

Anzeige

Aktuellste (20)

Storm

  1. 1. Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter
  2. 2. Basic info • Open sourced September 19th • Implementation is 12,000 lines of code • Used by over 25 companies • >2280 watchers on Github (most watched JVM project) • Very active mailing list • >1700 messages • >520 members
  3. 3. Hadoop Batch computation Distributed Fault-tolerant
  4. 4. Storm Realtime computation Distributed Fault-tolerant
  5. 5. Hadoop • Large, finite jobs • Process a lot of data at once • High latency
  6. 6. Storm • Infinite computations called topologies • Process infinite streams of data • Tuple-at-a-time computational model • Low latency
  7. 7. Before Storm Queues Workers
  8. 8. Example (simplified)
  9. 9. Example Workers schemify tweets and append to Hadoop
  10. 10. Example Workers update statistics on URLs by incrementing counters in Cassandra
  11. 11. Problems • Scaling is painful • Poor fault-tolerance • Coding is tedious
  12. 12. What we want • Guaranteed data processing • Horizontal scalability • Fault-tolerance • No intermediate message brokers! • Higher level abstraction than message passing • “Just works”
  13. 13. Storm Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers! Higher level abstraction than message passing “Just works”
  14. 14. Use cases Stream Distributed Continuous processing RPC computation
  15. 15. Storm Cluster
  16. 16. Storm Cluster Master node (similar to Hadoop JobTracker)
  17. 17. Storm Cluster Used for cluster coordination
  18. 18. Storm Cluster Run worker processes
  19. 19. Starting a topology
  20. 20. Killing a topology
  21. 21. Concepts • Streams • Spouts • Bolts • Topologies
  22. 22. Streams Tuple Tuple Tuple Tuple Tuple Tuple Tuple Unbounded sequence of tuples
  23. 23. Spouts Source of streams
  24. 24. Spout examples • Read from Kestrel queue • Read from Twitter streaming API
  25. 25. Bolts Processes input streams and produces new streams
  26. 26. Bolts • Functions • Filters • Aggregation • Joins • Talk to databases
  27. 27. Topology Network of spouts and bolts
  28. 28. Tasks Spouts and bolts execute as many tasks across the cluster
  29. 29. Task execution Tasks are spread across the cluster
  30. 30. Task execution Tasks are spread across the cluster
  31. 31. Stream grouping When a tuple is emitted, which task does it go to?
  32. 32. Stream grouping • Shuffle grouping: pick a random task • Fields grouping: mod hashing on a subset of tuple fields • All grouping: send to all tasks • Global grouping: pick task with lowest id
  33. 33. Topology shuffle [“id1”, “id2”] shuffle [“url”] shuffle all
  34. 34. Streaming word count TopologyBuilder is used to construct topologies in Java
  35. 35. Streaming word count Define a spout in the topology with parallelism of 5 tasks
  36. 36. Streaming word count Split sentences into words with parallelism of 8 tasks
  37. 37. Streaming word count Consumer decides what data it receives and how it gets grouped Split sentences into words with parallelism of 8 tasks
  38. 38. Streaming word count Create a word count stream
  39. 39. Streaming word count splitsentence.py
  40. 40. Streaming word count
  41. 41. Streaming word count Submitting topology to a cluster
  42. 42. Streaming word count Running topology in local mode
  43. 43. Demo
  44. 44. Distributed RPC Data flow for Distributed RPC
  45. 45. DRPC Example Computing “reach” of a URL on the fly
  46. 46. Reach Reach is the number of unique people exposed to a URL on Twitter
  47. 47. Computing reach Follower Distinct Tweeter Follower follower Follower Distinct URL Tweeter follower Count Reach Follower Follower Distinct Tweeter follower Follower
  48. 48. Reach topology
  49. 49. Reach topology
  50. 50. Reach topology
  51. 51. Reach topology Keep set of followers for each request id in memory
  52. 52. Reach topology Update followers set when receive a new follower
  53. 53. Reach topology Emit partial count after receiving all followers for a request id
  54. 54. Demo
  55. 55. Guaranteeing message processing “Tuple tree”
  56. 56. Guaranteeing message processing • A spout tuple is not fully processed until all tuples in the tree have been completed
  57. 57. Guaranteeing message processing • If the tuple tree is not completed within a specified timeout, the spout tuple is replayed
  58. 58. Guaranteeing message processing Reliability API
  59. 59. Guaranteeing message processing “Anchoring” creates a new edge in the tuple tree
  60. 60. Guaranteeing message processing Marks a single node in the tree as complete
  61. 61. Guaranteeing message processing • Storm tracks tuple trees for you in an extremely efficient way
  62. 62. Transactional topologies How do you do idempotent counting with an at least once delivery guarantee?
  63. 63. Transactional topologies Won’t you overcount?
  64. 64. Transactional topologies Transactional topologies solve this problem
  65. 65. Transactional topologies Built completely on top of Storm’s primitives of streams, spouts, and bolts
  66. 66. Transactional topologies Batch 1 Batch 2 Batch 3 Process small batches of tuples
  67. 67. Transactional topologies Batch 1 Batch 2 Batch 3 If a batch fails, replay the whole batch
  68. 68. Transactional topologies Batch 1 Batch 2 Batch 3 Once a batch is completed, commit the batch
  69. 69. Transactional topologies Batch 1 Batch 2 Batch 3 Bolts can optionally implement “commit” method
  70. 70. Transactional topologies Commit 1 Commit 1 Commit 2 Commit 3 Commit 4 Commit 4 Commits are ordered. If there’s a failure during commit, the whole batch + commit is retried
  71. 71. Example
  72. 72. Example New instance of this object for every transaction attempt
  73. 73. Example Aggregate the count for this batch
  74. 74. Example Only update database if transaction ids differ
  75. 75. Example This enables idempotency since commits are ordered
  76. 76. Example (Credit goes to Kafka guys for figuring out this trick)
  77. 77. Transactional topologies Multiple batches can be processed in parallel, but commits are guaranteed to be ordered
  78. 78. Transactional topologies • Will be available in next version of Storm (0.7.0) • Requires a source queue that can replay identical batches of messages • Aiming for first TransactionalSpout implementation to use Kafka
  79. 79. Storm UI
  80. 80. Storm on EC2 https://github.com/nathanmarz/storm-deploy One-click deploy tool
  81. 81. Starter code https://github.com/nathanmarz/storm-starter Example topologies
  82. 82. Documentation
  83. 83. Ecosystem • Scala, JRuby, and Clojure DSL’s • Kestrel, AMQP, JMS, and other spout adapters • Serializers • Multilang adapters • Cassandra, MongoDB integration
  84. 84. Questions? http://github.com/nathanmarz/storm

×