Storm: distributed and fault-tolerant realtime computation
  1. Storm: Distributed and fault-tolerant realtime computation. Nathan Marz, Twitter
  2. Storm at Twitter: Twitter Web Analytics
  3. Before Storm: queues and workers
  4. Example (simplified)
  5. Example: Workers schemify tweets and append to Hadoop
  6. Example: Workers update statistics on URLs by incrementing counters in Cassandra
  7. Example: Distribute tweets randomly on multiple queues
  8. Example: Workers share the load of schemifying tweets
  9. Example: Desire that all updates for the same URL go to the same worker
  10. Message locality, because: no transactions in Cassandra (and no atomic increments at the time); more effective batching of updates
  11. Implementing message locality: have a queue for each consuming worker; choose queue for a URL using consistent hashing
  12. Example: Workers choose which queue to enqueue to using hash/mod of URL
  13. Example: All updates for the same URL are guaranteed to go to the same worker
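
The hash/mod queue selection described in slides 11 to 13 can be sketched as follows. This is a minimal illustration, not the code used at Twitter: the function names, the example URLs, and the choice of MD5 as a stable hash are all assumptions made for the sketch.

```python
import hashlib


def queue_for_url(url: str, num_queues: int) -> int:
    """Pick a queue index for a URL using hash/mod (hypothetical helper)."""
    # A stable digest is used because Python's built-in hash() for strings
    # is randomized per process, which would break message locality.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues


def fraction_remapped(urls, old_queues: int, new_queues: int) -> float:
    """Fraction of URLs whose queue changes when the queue count changes."""
    moved = sum(
        queue_for_url(u, old_queues) != queue_for_url(u, new_queues) for u in urls
    )
    return moved / len(urls)


# Made-up URLs: the first and third are identical, so they share a queue.
urls = ["http://a.example/1", "http://b.example/2", "http://a.example/1"]
assignments = [queue_for_url(u, 4) for u in urls]

# With plain hash/mod, going from 4 to 5 queues moves many URLs to a new queue.
sample = [f"http://example.com/{i}" for i in range(1000)]
moved = fraction_remapped(sample, 4, 5)
```

Because the queue index depends on the number of queues, changing the worker count reassigns many URLs, which is consistent with the reconfigure/redeploy step on the "Adding a worker" slides.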
  14. Adding a worker
  15. Adding a worker: deploy; reconfigure/redeploy
  16. Problems: scaling is painful; poor fault-tolerance; coding is tedious
  17. What we want: guaranteed data processing; horizontal scalability; fault-tolerance; no intermediate message brokers!; higher level abstraction than message passing; “just works”
  18. Storm: guaranteed data processing; horizontal scalability; fault-tolerance; no intermediate message brokers!; higher level abstraction than message passing; “just works”
  19. Use cases: stream processing; distributed RPC; continuous computation
  20. Storm Cluster
  21. Storm Cluster: Master node (similar to Hadoop JobTracker)
  22. Storm Cluster: Used for cluster coordination
  23. Storm Cluster: Run worker processes
  24. Starting a topology
  25. Killing a topology
  26. Concepts: streams; spouts; bolts; topologies
  27. Streams: Unbounded sequence of tuples
  28. Spouts: Source of streams
  29. Spout examples: read from Kestrel queue; read from Twitter streaming API
  30. Bolts: Process input streams and produce new streams
  31. Bolts: functions; filters; aggregation; joins; talk to databases
  32. Topology: Network of spouts and bolts
  33. Tasks: Spouts and bolts execute as many tasks across the cluster
  34. Stream grouping: When a tuple is emitted, which task does it go to?
  35. Stream grouping: shuffle grouping (pick a random task); fields grouping (consistent hashing on a subset of tuple fields); all grouping (send to all tasks); global grouping (pick task with lowest id)
  36. Topology: [diagram with edges labeled shuffle, [“id1”, “id2”], shuffle, [“url”], shuffle, all]
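
The four stream groupings listed on slide 35 can be modeled as functions that return the target task ids for a tuple. This is a sketch in plain Python, not Storm's API: the function names and the dict-based tuple are assumptions, and the fields grouping here uses a simple hash/mod as a stand-in for the hashing the slide mentions.

```python
import hashlib
import random


def shuffle_grouping(num_tasks, rng=random):
    """Shuffle grouping: pick a random task."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, fields, num_tasks):
    """Fields grouping: tuples with equal values in `fields` go to the same task."""
    key = "\x1f".join(str(tup[f]) for f in fields)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return [int.from_bytes(digest[:8], "big") % num_tasks]


def all_grouping(num_tasks):
    """All grouping: send to all tasks."""
    return list(range(num_tasks))


def global_grouping(task_ids):
    """Global grouping: pick the task with the lowest id."""
    return [min(task_ids)]


# Two tuples with the same "url" field land on the same task,
# regardless of their other fields.
first = fields_grouping({"url": "http://example.com", "count": 1}, ["url"], 8)
second = fields_grouping({"url": "http://example.com", "count": 2}, ["url"], 8)
```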
  37. Streaming word count: TopologyBuilder is used to construct topologies in Java
  38. Streaming word count: Define a spout in the topology with parallelism of 5 tasks
  39. Streaming word count: Split sentences into words with parallelism of 8 tasks
  40. Streaming word count: Consumer decides what data it receives and how it gets grouped. Split sentences into words with parallelism of 8 tasks
  41. Streaming word count: Create a word count stream
  42. Streaming word count: splitsentence.py
  43. Streaming word count
  44. Streaming word count: Submitting topology to a cluster
  45. Streaming word count: Running topology in local mode
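
The code shown on the streaming word count slides is not part of this transcript. As a rough stand-in, the data flow (sentences are split into words, and words are routed by a fields grouping to counting tasks) can be simulated in plain Python. Nothing below uses Storm's API: the function names, the number of counting tasks, and the byte-sum routing are assumptions chosen to keep the sketch short.

```python
from collections import Counter


def split_sentence(sentence):
    """Plays the role of the sentence-splitting bolt: one word per output tuple."""
    return sentence.split()


def word_count_stream(sentences, num_count_tasks=3):
    """Yield (word, running_count) pairs, routing each word to a fixed task."""
    # Each counting task keeps its own local state, as separate tasks would.
    tasks = [Counter() for _ in range(num_count_tasks)]
    for sentence in sentences:
        for word in split_sentence(sentence):
            # Stand-in for a fields grouping on the word: the same word
            # always maps to the same task index.
            idx = sum(word.encode("utf-8")) % num_count_tasks
            tasks[idx][word] += 1
            yield word, tasks[idx][word]


updates = list(word_count_stream(["the cow jumped", "the moon"]))
```

Routing by word matters here: if "the" could reach two different counting tasks, each would hold only a partial count.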
  46. Demo
  47. Traditional data processing
  48. Traditional data processing: Intense processing (Hadoop, databases, etc.)
  49. Traditional data processing: Light processing on a single machine to resolve queries
  50. Distributed RPC: Distributed RPC lets you do intense processing at query-time
  51. Game changer
  52. Distributed RPC: Data flow for Distributed RPC
  53. DRPC Example: Computing “reach” of a URL on the fly
  54. Reach: Reach is the number of unique people exposed to a URL on Twitter
  55. Computing reach: [diagram: URL → tweeters → followers → distinct followers → count → reach]
  56. Reach topology
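
The reach computation from slides 54 and 55 (URL to tweeters, tweeters to followers, distinct followers, count) can be written as a single-machine sketch. The in-memory dictionaries stand in for whatever stores the real topology queries, and all names and data are made up for illustration. The real topology distributes these steps across tasks.

```python
# Hypothetical sample data: who tweeted a URL, and who follows each tweeter.
TWEETERS_BY_URL = {"http://example.com/a": ["alice", "bob"]}
FOLLOWERS_BY_USER = {
    "alice": ["carol", "dave"],
    "bob": ["dave", "erin"],
}


def reach(url, tweeters_by_url, followers_by_user):
    """Count the distinct followers of everyone who tweeted the URL."""
    distinct_followers = set()
    for tweeter in tweeters_by_url.get(url, []):
        distinct_followers.update(followers_by_user.get(tweeter, []))
    return len(distinct_followers)


# dave follows both alice and bob but is counted once.
example_reach = reach("http://example.com/a", TWEETERS_BY_URL, FOLLOWERS_BY_USER)
```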
  57. Guaranteeing message processing: “Tuple tree”
  58. Guaranteeing message processing: A spout tuple is not fully processed until all tuples in the tree have been completed
  59. Guaranteeing message processing: If the tuple tree is not completed within a specified timeout, the spout tuple is replayed
  60. Guaranteeing message processing: Reliability API
  61. Guaranteeing message processing: “Anchoring” creates a new edge in the tuple tree
  62. Guaranteeing message processing: Marks a single node in the tree as complete
  63. Guaranteeing message processing: Storm tracks tuple trees for you in an extremely efficient way
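
The slides do not say how the tuple tree is tracked. Storm's documentation describes keeping one XOR'd value per spout tuple, and the sketch below is a simplified single-process model of that idea, not Storm's code. The class name, method names, and small integer ids are assumptions. Real tuple ids are random 64-bit values.

```python
class AckTracker:
    """Tracks tuple trees with one XOR'd value per spout tuple (simplified model)."""

    def __init__(self):
        self._ack_vals = {}  # root tuple id -> XOR of all pending tuple ids

    def spout_emit(self, root_id):
        """Start a new tuple tree; ids must be non-zero."""
        self._ack_vals[root_id] = root_id

    def anchor(self, root_id, new_id):
        """Anchoring adds a new node to the tree by XOR-ing in its id."""
        self._ack_vals[root_id] ^= new_id

    def ack(self, root_id, tuple_id):
        """Mark one node complete; return True once the whole tree is complete."""
        # In this sketch, children must be anchored before their parent is acked,
        # otherwise the value could reach zero too early.
        self._ack_vals[root_id] ^= tuple_id
        if self._ack_vals[root_id] == 0:
            del self._ack_vals[root_id]
            return True
        return False

    def pending(self):
        """Root ids whose trees are not yet complete."""
        return set(self._ack_vals)


tracker = AckTracker()
tracker.spout_emit(11)   # spout tuple
tracker.anchor(11, 22)   # first child tuple
tracker.anchor(11, 33)   # second child tuple
results = [tracker.ack(11, 11), tracker.ack(11, 22), tracker.ack(11, 33)]
```

Each id is XOR-ed in once when the tuple is created and once when it is acked, so the value returns to zero exactly when every node has been acked. Memory use per spout tuple stays constant no matter how large the tree grows.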
  64. Storm UI
  65. Storm UI
  66. Storm UI
  67. Storm on EC2: https://github.com/nathanmarz/storm-deploy (one-click deploy tool)
  68. Documentation
  69. State spout (almost done): Synchronize a large amount of frequently changing state into a topology
  70. State spout (almost done): Optimizing reach topology by eliminating the database calls
  71. State spout (almost done): Each GetFollowers task keeps a synchronous cache of a subset of the social graph
  72. State spout (almost done): This works because GetFollowers repartitions the social graph the same way it partitions GetTweeter’s stream
  73. Future work: Storm on Mesos; “swapping”; auto-scaling; higher level abstractions
  74. Questions? http://github.com/nathanmarz/storm
  75. What Storm does: distributes code and configurations; robust process management; monitors topologies and reassigns failed tasks; provides reliability by tracking tuple trees; routing and partitioning of streams; serialization; fine-grained performance stats of topologies