Storm: distributed and fault-tolerant realtime computation


  1. Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter
  2. Storm at Twitter Twitter Web Analytics
  3. Before Storm Queues Workers
  4. Example (simplified)
  5. Example Workers schemify tweets and append to Hadoop
  6. Example Workers update statistics on URLs by incrementing counters in Cassandra
  7. Example Distribute tweets randomly on multiple queues
  8. Example Workers share the load of schemifying tweets
  9. Example Desire all updates for same URL go to same worker
  10. Message locality • Because: • No transactions in Cassandra (and no atomic increments at the time) • More effective batching of updates
  11. Implementing message locality • Have a queue for each consuming worker • Choose queue for a URL using consistent hashing
  12. Example Workers choose queue to enqueue to using hash/mod of URL
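The hash/mod scheme on this slide can be sketched in a few lines of Python. This is illustrative only; the function name is hypothetical, and a stable hash is used because Python's built-in `hash()` is salted per process:

```python
# Pick a queue for a URL with hash/mod so that all updates for the
# same URL land on the same consuming worker (message locality).
import hashlib

def queue_for(url: str, num_queues: int) -> int:
    # Stable, process-independent hash of the URL.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_queues

# Every call with the same URL yields the same queue index:
assert queue_for("http://example.com/a", 4) == queue_for("http://example.com/a", 4)
```

Note that plain hash/mod reassigns most URLs whenever the queue count changes, which is exactly why the slides above mention consistent hashing and why adding a worker (next slides) is painful.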
  13. Example All updates for same URL guaranteed to go to same worker
  14. Adding a worker
  15. Adding a worker Deploy Reconfigure/redeploy
  16. Problems • Scaling is painful • Poor fault-tolerance • Coding is tedious
  17. What we want • Guaranteed data processing • Horizontal scalability • Fault-tolerance • No intermediate message brokers! • Higher level abstraction than message passing • “Just works”
  18. Storm Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers! Higher level abstraction than message passing “Just works”
  19. Use cases • Stream processing • Continuous computation • Distributed RPC
  20. Storm Cluster
  21. Storm Cluster Master node (similar to Hadoop JobTracker)
  22. Storm Cluster Used for cluster coordination
  23. Storm Cluster Run worker processes
  24. Starting a topology
  25. Killing a topology
  26. Concepts • Streams • Spouts • Bolts • Topologies
  27. Streams Tuple Tuple Tuple Tuple Tuple Tuple Tuple Unbounded sequence of tuples
  28. Spouts Source of streams
  29. Spout examples • Read from Kestrel queue • Read from Twitter streaming API
  30. Bolts Process input streams and produce new streams
  31. Bolts • Functions • Filters • Aggregation • Joins • Talk to databases
  32. Topology Network of spouts and bolts
  33. Tasks Spouts and bolts execute as many tasks distributed across the cluster
  34. Stream grouping When a tuple is emitted, which task does it go to?
  35. Stream grouping • Shuffle grouping: pick a random task • Fields grouping: consistent hashing on a subset of tuple fields • All grouping: send to all tasks • Global grouping: pick task with lowest id
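The four grouping strategies above can be sketched as task-selection functions — a conceptual sketch of the semantics, not Storm's internals, with made-up helper names:

```python
# Each function answers: given the consumer task ids and the emitted
# tuple's fields, which task(s) receive the tuple?
import hashlib
import random

def shuffle_grouping(task_ids, tuple_fields):
    return [random.choice(task_ids)]          # one random task

def fields_grouping(task_ids, tuple_fields, keys):
    # Hash the selected subset of fields: equal values -> same task.
    key = "|".join(str(tuple_fields[k]) for k in keys)
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return [task_ids[digest % len(task_ids)]]

def all_grouping(task_ids, tuple_fields):
    return list(task_ids)                     # replicate to every task

def global_grouping(task_ids, tuple_fields):
    return [min(task_ids)]                    # task with the lowest id
```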
  36. Topology [diagram: a topology whose stream connections are labeled shuffle, fields([“id1”, “id2”]), shuffle, fields([“url”]), shuffle, and all groupings]
  37. Streaming word count TopologyBuilder is used to construct topologies in Java
  38. Streaming word count Define a spout in the topology with parallelism of 5 tasks
  39. Streaming word count Split sentences into words with parallelism of 8 tasks
  40. Streaming word count Consumer decides what data it receives and how it gets grouped Split sentences into words with parallelism of 8 tasks
  41. Streaming word count Create a word count stream
  42. Streaming word count splitsentence.py
  43. Streaming word count
  44. Streaming word count Submitting topology to a cluster
  45. Streaming word count Running topology in local mode
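The word-count pipeline walked through above — a spout emitting sentences, a split bolt, and counting tasks fed by a fields grouping on the word — can be simulated end to end in plain Python. This is a conceptual sketch of the data flow, not Storm's API:

```python
from collections import Counter
import hashlib

def split_sentence(sentence):
    # The split bolt: one output tuple per word (cf. splitsentence.py).
    return sentence.split()

def fields_task(word, num_tasks):
    # Fields grouping on "word": the same word always goes to the same task.
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % num_tasks

def run_local(sentences, num_count_tasks=3):
    # Each counting "task" keeps its own partial counts.
    per_task = [Counter() for _ in range(num_count_tasks)]
    for sentence in sentences:                        # spout emits sentences
        for word in split_sentence(sentence):         # split bolt
            per_task[fields_task(word, num_count_tasks)][word] += 1
    merged = Counter()
    for counts in per_task:
        merged.update(counts)
    return merged

counts = run_local(["the cow jumped over the moon",
                    "the man went to the store"])
# "the" is counted by exactly one task, so the total is correct:
assert counts["the"] == 4
```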
  46. Demo
  47. Traditional data processing
  48. Traditional data processing Intense processing (Hadoop, databases, etc.)
  49. Traditional data processing Light processing on a single machine to resolve queries
  50. Distributed RPC Distributed RPC lets you do intense processing at query-time
  51. Game changer
  52. Distributed RPC Data flow for Distributed RPC
  53. DRPC Example Computing “reach” of a URL on the fly
  54. Reach Reach is the number of unique people exposed to a URL on Twitter
  55. Computing reach [diagram: URL → get tweeters → get followers of each tweeter → distinct follower set → count → reach]
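The reach pipeline can be sketched with toy in-memory stand-ins for the database calls (`TWEETERS` and `FOLLOWERS` are hypothetical fixtures, not Twitter's data model):

```python
# For a URL: find everyone who tweeted it, union their follower
# sets (the "distinct" step), and count the result.
TWEETERS = {"http://t.co/x": ["alice", "bob"]}
FOLLOWERS = {"alice": {"carol", "dave"}, "bob": {"dave", "erin"}}

def reach(url):
    distinct = set()
    for tweeter in TWEETERS.get(url, []):            # GetTweeters
        distinct |= FOLLOWERS.get(tweeter, set())    # GetFollowers + distinct
    return len(distinct)                             # Count

assert reach("http://t.co/x") == 3   # carol, dave, erin (dave deduplicated)
```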
  56. Reach topology
  57. Guaranteeing message processing “Tuple tree”
  58. Guaranteeing message processing • A spout tuple is not fully processed until all tuples in the tree have been completed
  59. Guaranteeing message processing • If the tuple tree is not completed within a specified timeout, the spout tuple is replayed
  60. Guaranteeing message processing Reliability API
  61. Guaranteeing message processing “Anchoring” creates a new edge in the tuple tree
  62. Guaranteeing message processing Marks a single node in the tree as complete
  63. Guaranteeing message processing • Storm tracks tuple trees for you in an extremely efficient way
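The efficient tracking referred to above is usually described as an XOR trick: the acker keeps one running value per spout tuple, XORs in the id of every anchored tuple and XORs out the id of every acked tuple, and the tree is complete exactly when the value returns to zero. A sketch of that idea, using small fixed ids for reproducibility (Storm uses random 64-bit ids, so a false zero is vanishingly unlikely):

```python
# O(1)-space tracking of a tuple tree via XOR of tuple ids.
class AckTracker:
    def __init__(self, spout_tuple_id):
        self.val = spout_tuple_id          # XOR in the root on emit

    def anchor(self, child_id):
        self.val ^= child_id               # new edge in the tuple tree

    def ack(self, tuple_id):
        self.val ^= tuple_id               # node completed
        return self.val == 0               # zero -> fully processed

t = AckTracker(1)                          # spout emits the root tuple
t.anchor(2)                                # a bolt anchors two children
t.anchor(4)
assert t.ack(1) is False                   # root acked, children pending
assert t.ack(2) is False
assert t.ack(4) is True                    # whole tree complete
```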
  64. Storm UI
  65. Storm UI
  66. Storm UI
  67. Storm on EC2 https://github.com/nathanmarz/storm-deploy One-click deploy tool
  68. Documentation
  69. State spout (almost done) Synchronize a large amount of frequently changing state into a topology
  70. State spout (almost done) Optimizing reach topology by eliminating the database calls
  71. State spout (almost done) Each GetFollowers task keeps a synchronous cache of a subset of the social graph
  72. State spout (almost done) This works because GetFollowers repartitions the social graph the same way it partitions GetTweeter’s stream
  73. Future work • Storm on Mesos • “Swapping” • Auto-scaling • Higher level abstractions
  74. Questions? http://github.com/nathanmarz/storm
  75. What Storm does • Distributes code and configurations • Robust process management • Monitors topologies and reassigns failed tasks • Provides reliability by tracking tuple trees • Routing and partitioning of streams • Serialization • Fine-grained performance stats of topologies