Introduction to Storm

  1. Introduction to Storm: a distributed, real-time, fault-tolerant framework. Eugene Dvorkin, Coding Architect, WebMD. edvorkin@gmail.com, #edvorkin, eugenedvorkin.com
  2. Big Data: “Big Data is the capability to manage a huge volume of disparate data, at the right speed, and within the right time frame to allow real-time analysis and reaction”
  3. Big Data: Velocity, Volume, Variety
  4. Enablers of Big Data: Map/Reduce frameworks (Hadoop); scalable storage (HDFS, NoSQL databases); cheap computing power (cloud computing).
  5. Why Real Time? Better end-user experience (e.g., view an ad, see the counter move). Operational intelligence: low-latency analysis, real-time dashboards. Event response: rule engine, personalization, predictions. Scalable analysis, e.g., trend analysis to recommend "hot" articles.
  6. Requirements: Scalable real-time processing requires a framework that is fast; scalable by process parallelization and distribution; fault-tolerant; guarantees data processing; easy to learn, code, and operate; robust.
  7. Storm: an open-source distributed real-time computation system. Developed by Nathan Marz at BackType, which was acquired by Twitter.
  8. Storm: fast; scalable by process parallelization and distribution; fault-tolerant; guarantees data processing; runs on the JVM; easy to learn, code, and operate; supports development in multiple languages.
  9. Hadoop and Storm. Storm for real-time processing: Storm is to real-time computation what Hadoop is to batch computation.
  10. Storm Use Cases
  11. Storm Use Cases: “Storm powers a wide variety of Twitter systems, ranging in applications from discovery, real-time analytics, personalization, search, revenue optimization, and many more.” “Storm empowers stream/micro-batch processing of user events, content feeds, and application logs” (Yahoo) “ETL – move data from MongoDB to BI”
  12. Storm Abstractions
  13. Storm cluster
  14. Storm Abstractions: Tuples, Streams, Spouts, Bolts, and Topologies
  15. Tuples: Storm's data structure, a list of elements, e.g. [“Colonoscopy”, 14106]
  16. Stream: an unbounded sequence of tuples, e.g. [“Colonoscopy”, 14091] [“Cancer”, 42651] [“Oncology”, 14417]
  17. Spout: reads from a stream of data (queues, web logs, API calls, databases) and emits streams of tuples.
  18. Bolts: process tuples and create new streams.
  19. Bolts: apply functions/transforms; calculate and aggregate data (word count!); access DBs, APIs, etc.; filter data; process tuples and create new streams.
  20. Topology
  21. Storm is Easy to Code: How to write Storm components? Storm is easy to use.
  22. Topology Example
  23. How to create a spout
  24. How to create a spout
  25. Spouts available on GitHub: integrations with Redis, Kafka, MongoDB, Amazon SQS, JMS, and some others are readily available.
  26. How to Create a Bolt
  27. HashTagFilterBolt
  28. HashTagCountBolt
  29. Creating Topology
  30. Problem: What about parallel processing?
  31. Topology Example
  32. Topology Example
  33. Topology Example
  34. Storm Scalability: Parallelism
  35. Storm cluster
  36. Storm Parallelism
  37. Storm rebalance: > storm rebalance demo -n 3 -e myspout=5 -e mybolt=1
  38. Creating Cluster Topology: > storm jar HashTagTopology.jar org.javameetup.topology.HashTagCountTopology
  39. Stream groupings. Shuffle grouping: tuples are randomly distributed across the bolt's tasks. Fields grouping: the stream is partitioned by the fields specified in the grouping. Custom grouping.
  40. Stream groupings
  41. Demo
  42. Storm Deployment
  43. Storm deployment
  44. Storm deployment: out-of-the-box configurations are suitable for production; one-click deploy to EC2 with the storm-deploy project; once deployed, easy to operate (designed to be robust); the Storm daemons, Nimbus and the Supervisors, are stateless and fail-fast; useful UI.
  45. Storm UI
  46. Storm UI
  47. Storm is Fault-Tolerant
  48. Normal operations
  49. Nimbus down: Processing will continue, but topology lifecycle operations and the reassignment facility are lost. Run Nimbus under system supervision.
  50. Worker node down: Nimbus will reassign tasks to other machines.
  51. Supervisor goes down: Processing will still continue, but assignments are never synchronized.
  52. Worker process down: The Supervisor will restart the worker process and processing will continue.
  53. Guaranteeing message processing
  54. Guaranteed Message Processing: “Tuple tree”
  55. Reliability API: When emitting a tuple, the spout provides a "message id" that will be used to identify the tuple later.
  56. Reliability API: Anchoring
  57. Reliability API: Finishing processing
  58. Spout: Reliability API
  59. Reliability API
  60. Reliability API
  61. Advanced Topics: Trident. Trident is a high-level abstraction for doing real-time computing on top of Storm.
  62. Trident, higher-level constructs: joins, aggregations, grouping, functions, filters; consistent, exactly-once semantics.
  63. Example: [Physicians, 79] [Oncology:78] [Cancer:237] …
  64. Example
  65. Example
  66. Example
  67. Example
  68. Example
  69. Example
  70. Example
  71. Example
  72. Example: [Physicians, 79] [Oncology:78] [Cancer:237] …
  73. Demo
  74. DRPC Server
  75. DRPC Server: We want to know the aggregate count of tweets with hashtags #cancer and #Physician at this moment.
  76. DRPC Server
  77. DRPC Server
  78. (no text)
  79. Conclusion: Storm allows us to solve a wide range of business problems in real time. Thriving open-source community.
  80. Resources: Storm Project wiki; Storm starter project; Storm contributions project; Running a Multi-Node Storm cluster tutorial; Implementing real-time trending topic; A Hadoop Alternative: Building a real-time data pipeline with Storm; Storm Use cases
  81. Resources (cont’d): Understanding the Parallelism of a Storm Topology; Trident – high level Storm abstraction; A practical Storm's Trident API; Storm online forum; Project source code; New York City Storm Meetup. Image credits: US NASA
  82. Questions. Eugene Dvorkin, Architect, WebMD. edvorkin@gmail.com, Twitter: #edvorkin. Introduction to Storm
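
Slide 39 distinguishes shuffle grouping (tuples spread randomly across a bolt's tasks) from fields grouping (the stream partitioned by chosen fields). The routing idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the CRC32 hash is a stand-in chosen for determinism, and none of this reproduces Storm's actual hashing or API.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng):
    """Pick a target task uniformly at random (the idea behind shuffle grouping)."""
    return rng.randrange(num_tasks)


def fields_grouping(tup, field_indexes, num_tasks):
    """Pick a target task from a stable hash of the selected fields.

    Tuples with equal values in the selected fields always map to the
    same task, which is what makes per-key counting possible.
    """
    key = "|".join(str(tup[i]) for i in field_indexes)
    return zlib.crc32(key.encode("utf-8")) % num_tasks


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    tuples = [("Cancer", 1), ("Oncology", 1), ("Cancer", 1), ("Colonoscopy", 1)]
    for t in tuples:
        print(t,
              "shuffle ->", shuffle_grouping(4, rng),
              "fields ->", fields_grouping(t, [0], 4))
```

In a hashtag-count topology like the one in the slides, a counting bolt would typically sit behind a fields grouping on the hashtag so that every occurrence of one hashtag reaches the same task, while a stateless filter bolt can use shuffle grouping.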

Editor's Notes

  • Average enterprises can now process and make sense of big data.
  • Variety: the various types of data. Velocity: how fast this data is processed. Volume: how much data there is.
  • Keeps running if a component dies, and is self-healing.
  • Stream processing: read tuples, do some processing, update a database, and drop the tuples; for example, move data from an operational DB into BI, or process log files (ETL processing). Online queries: ask Storm for a really expensive computation on demand, for example, how many events have arrived since last week. Trending topics or most popular articles.
  • A topology is a graph of spouts and bolts connected by streams.
  • Number of worker processes per cluster. You can change the number of workers and/or the number of executors for components using the "storm rebalance" command. The command storm rebalance demo -n 3 -e myspout=5 -e mybolt=1 changes the number of workers for the "demo" topology to 3, the number of executors for the "myspout" component to 5, and the number of executors for the "mybolt" component to 1. The number of executor threads can be changed after the topology has been started (see the storm rebalance command), but the number of tasks of a topology is static. So one reason for having 2+ tasks per executor thread is to give you the flexibility to scale up the topology through the storm rebalance command in the future without taking the topology offline. For instance, imagine you start out with a Storm cluster of 15 machines but already know that next week another 10 boxes will be added. Here you could opt for running the topology at the anticipated parallelism level of 25 machines already on the 15 initial boxes (which is of course slower than 25 boxes). Once the additional 10 boxes are integrated, you can then storm rebalance the topology to make full use of all 25 boxes without any downtime. Another reason to run 2+ tasks per executor is (primarily functional) testing. For instance, if your dev machine or CI server is only powerful enough to run, say, 2 executors alongside all the other stuff running on the machine, you can still run 30 tasks (here: 15 per executor) to see whether code such as your custom Storm grouping is working as expected.
  • Question
  • Submitter: uploads the topology JAR, with its dependencies, to the Nimbus inbox. Nimbus: makes assignments and starts the topology.
  • Storm considers a tuple coming off a spout fully processed when every message in the tree has been processed. A tuple is considered failed when its tree of messages fails to be fully processed within a configurable timeout. The default is 30 seconds.
  • For example, a MongoDB _id.
  • There are two things you have to do as a user to benefit from Storm's reliability capabilities. First, you need to tell Storm whenever you're creating a new link in the tree of tuples. Second, you need to tell Storm when you have finished processing an individual tuple. By doing both these things, Storm can detect when the tree of tuples is fully processed and can ack or fail the spout tuple appropriately. Storm's API provides a concise way of doing both of these tasks. Specifying a link in the tuple tree is called anchoring.
  • Second, you need to tell Storm when you have finished processing an individual tuple.
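
The two obligations in the notes above (anchor each new tuple into the tree, then ack each tuple when it is done) are enough for a tracker to tell when a spout tuple's whole tree has been processed. The toy model below shows one way that bookkeeping can work, using an XOR of random tuple ids so that the running value returns to zero once every anchored tuple has been acked. The class and function names are hypothetical and this is not Storm's API; it is a simplified sketch of the idea, and it leaves out the timeout that marks a tree as failed.

```python
import random


class AckTracker:
    """Toy tracker for one spout tuple's tree (hypothetical, not Storm's API)."""

    def __init__(self, root_id):
        # XOR of the ids of all tuples that were emitted but not yet acked.
        self.ack_val = root_id

    def anchor(self, child_id):
        """Record a new tuple linked (anchored) into the tree."""
        self.ack_val ^= child_id

    def ack(self, tuple_id):
        """Record that a tuple has finished processing."""
        self.ack_val ^= tuple_id

    def fully_processed(self):
        """True once every anchored tuple has also been acked."""
        return self.ack_val == 0


def new_id(rng):
    """Random 64-bit id; zero is avoided so a single id never cancels to 'done'."""
    return rng.getrandbits(64) or 1


if __name__ == "__main__":
    rng = random.Random(42)
    root = new_id(rng)
    tracker = AckTracker(root)

    # A bolt receives the root tuple, emits two anchored children, then acks the root.
    child_a, child_b = new_id(rng), new_id(rng)
    tracker.anchor(child_a)
    tracker.anchor(child_b)
    tracker.ack(root)

    tracker.ack(child_a)
    print("after first child ack:", tracker.fully_processed())
    tracker.ack(child_b)
    print("after second child ack:", tracker.fully_processed())
```

The appeal of this design is that the tracker needs constant memory per spout tuple regardless of how large the tree grows; the trade-off is a very small chance of a false "done" if unrelated ids happen to cancel.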