Stream Processing Overview

A complete overview of stream processing systems, characteristics and use cases.

  1. 1. Stream Processing Overview. Maycon Viana Bordin, Instituto de Informática, Universidade Federal do Rio Grande do Sul
  2. 2. HUGE amounts of data are being generated in real-time
  3. 3. 500M tweets are sent per day
  4. 4. 4.75B shares, 4.5B likes, 420M status updates, 300M photos EVERY DAY.
  5. 5. SOME APPLICATIONS
  6. 6. Traffic Monitoring and Route Planning
  7. 7. • Architecture for Stream and CEP processing • Input from buses and SCATS sensors • Use of crowdsourcing to resolve data source unreliability • Dataset of 13GB from Dublin city
  8. 8. System Architecture
  9. 9. Traffic flow estimated using Gaussian Process Regression
  10. 10. Network Monitoring
  11. 11. • 6 Billion records per day • 160 Million customers • Detect duplicates in a 15 day window • Records can’t be lost • Solution: InfoSphere Streams
  12. 12. Application Graph
  13. 13. Dashboards • Number of terminated calls by category in the last hour • Call termination reason for enterprise customers in the last hour
  14. 14. Smart Grids
  15. 15. • 1.4 Million consumers • Demand Response Optimization 1. Peak demand forecasting 2. Effective response selection • Data source: AMIs (Advanced Metering Infrastructure) • 3TB of data per day
  16. 16. System Architecture
  17. 17. Sensor Networks
  18. 18. • Detection of events: earthquakes, typhoons, etc. • Twitter users as sensors • Location estimation: Kalman and particle filtering • Detects 96% of earthquakes reported by the Japan Meteorological Agency
  19. 19. Results
  20. 20. Other Applications • Fraud detection • Process control in manufacturing • Surveillance systems • CDR processing • Healthcare monitoring
  25. 25. They need to process… large volumes of data in real-time continuously producing actionable information
  26. 26. And are categorized as Information Flow Processing technologies
  27. 27. Information Flow Processing • Active Databases: ECA (event-condition-action) rules, triggers • Continuous Queries: standing queries with query, trigger, and stop conditions • Publish-subscribe systems: decoupled components, topic- and content-based • Complex Event Processing: event detection based on rules and patterns • Stream Processing Systems
  32. 32. Stream Processing Concepts
  35. 35. Data Stream
  39. 39. Data from the stream source may or may not be structured
  40. 40. The amount of data is usually unbounded in size
  41. 41. The input rate is variable and typically unpredictable
  42. 42. Operators
  46. 46. Operators Classification
  51. 51. Operators are either stateless (map, filter) or stateful; stateful operators are either blocking (join, freq. itemset) or non-blocking (count, sum)
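To make the classification concrete, here is a minimal Python sketch that is not taken from the slides. `map_op` and `filter_op` are stateless because each output depends only on the current tuple, while `running_sum` is stateful but non-blocking because it can emit an updated result after every tuple. All function names are illustrative assumptions.

```python
def map_op(stream, fn):
    """Stateless: each output depends only on the current tuple."""
    for item in stream:
        yield fn(item)


def filter_op(stream, predicate):
    """Stateless: keeps or drops each tuple independently."""
    for item in stream:
        if predicate(item):
            yield item


def running_sum(stream):
    """Stateful, non-blocking: keeps a total and emits it after every tuple."""
    total = 0
    for value in stream:
        total += value
        yield total


readings = [3, -1, 4, -1, 5]
positives = filter_op(readings, lambda x: x > 0)
doubled = map_op(positives, lambda x: x * 2)
sums = list(running_sum(doubled))
print(sums)  # [6, 14, 24]
```

Because the operators are generators, each tuple flows through the whole chain before the next one is read, which is roughly how a record-at-a-time pipeline behaves.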
  52. 52. Blocking operators need all input in order to generate a result
  53. 53. but that’s not possible since data streams are unbounded
  54. 54. To solve this issue, tuples are grouped in windows
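  55. 55. A window is delimited by a window start (ws) and a window end (we); its range is given in time units or in number of tuples
  56. 56. When the window advances, the old ws and old we move forward to a new ws and a new we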
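The window mechanics can be sketched in a few lines of Python. This is a count-based variant under stated assumptions: `size` is the window range in tuples, `advance` is how far the window start moves each time, and `advance` must not exceed `size`. The function name and parameters are illustrative, not from the slides.

```python
def sliding_windows(stream, size, advance):
    """Yield count-based windows of `size` tuples, advancing by `advance` tuples."""
    if not 0 < advance <= size:
        raise ValueError("advance must be between 1 and size")
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) == size:
            yield list(buffer)      # window [ws, we] is complete
            del buffer[:advance]    # advance: drop the oldest tuples


# A blocking operator (here: sum) applied per window instead of per stream.
windows = list(sliding_windows(range(1, 8), size=4, advance=2))
window_sums = [sum(w) for w in windows]
print(windows)      # [[1, 2, 3, 4], [3, 4, 5, 6]]
print(window_sums)  # [10, 18]
```

With `advance == size` the windows do not overlap (a tumbling window); with `advance < size` consecutive windows share tuples (a sliding window).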
  57. 57. Operators Examples
  58. 58. • Parsing/Filtering/ETL • Aggregation: collection and summarization of tuples • Merging: combining of streams with different schemas • Splitting: partitioning of a stream into multiple ones for data/task parallelism or some logical reason • Data mining/Machine Learning/NLP: spam filtering, fraud detection, recommendation systems, data stream clustering, sentiment analysis … • Others: relational algebra, artificial intelligence and other custom operations
  59. 59. Traditional vs Data Stream Processing
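  60. 60. Traditional vs data stream processing:
      | | Traditional | Data Stream |
      |---|---|---|
      | Distributed | No | Yes |
      | Type of Result | Accurate | Approximate |
      | Memory Usage | Unlimited | Restricted |
      | Processing Time | Unlimited | Restricted |
      | No. of Passes | Multiple | Single |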
  61. 61. These differences gave way to a number of synopsis structures
  62. 62. • Sampling: classification, query estimation, order statistics estimation, distinct value queries • Wavelets: hierarchical decomposition and summarization • Clustering: knowledge discovery • Sketches: distinct count, heavy hitters, quantiles, change detection • Histograms: range queries, selectivity estimation
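As one example of a sketch-type synopsis, the following is a minimal count-min sketch in Python. It is an illustration under assumptions, not the deck's implementation: the `width` and `depth` defaults are arbitrary, and the per-row hash is derived from SHA-256 purely for simplicity. The structure uses fixed memory, processes each tuple in a single pass, and returns approximate counts that never underestimate the true count.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency summary; estimates may be too high, never too low."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, simulated by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch()
for tag in ["#cat", "#dog", "#cat", "#cat"]:
    sketch.add(tag)
print(sketch.estimate("#cat"), sketch.estimate("#dog"))
```

Memory stays at `width * depth` counters no matter how many distinct items arrive, which is the trade-off the traditional vs data stream comparison describes: restricted memory in exchange for approximate results.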
  63. 63. Programming Model
  64. 64. Applications are composed as data flow graphs
  65. 65. To illustrate, let’s look at the graph of a Trending Topics application
  66. 66. parse → extract hashtags → hashtag counter → sink
  67. 67. The graph above is the logical view of the application
  68. 68. The physical view displays the component instances and their location in the cluster
  69. 69. Physical view: a stream source, five extract instances, five countmin instances, and a file sink, placed on node-0, node-1, and node-2
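The logical view of such an application can be sketched as a chain of Python generators, one per operator. This is a single-process illustration only: the operator names follow the Trending Topics graph, but their bodies and the sample input are assumptions, and an exact counter stands in for the countmin operator shown in the physical view.

```python
from collections import Counter


def parse(lines):
    """Source-side operator: normalize raw input lines."""
    for line in lines:
        yield line.strip()


def extract_hashtags(texts):
    """Emit one tuple per hashtag found in each text."""
    for text in texts:
        for word in text.split():
            if word.startswith("#") and len(word) > 1:
                yield word.lower()


def hashtag_counter(tags):
    """Stateful operator: emit the updated count for every incoming tag."""
    counts = Counter()
    for tag in tags:
        counts[tag] += 1
        yield tag, counts[tag]


def sink(updates):
    """Terminal operator: keep the latest count seen for each tag."""
    latest = {}
    for tag, count in updates:
        latest[tag] = count
    return latest


raw = ["I love my #cat ", "my #dog and my #Cat", "no tags here"]
result = sink(hashtag_counter(extract_hashtags(parse(raw))))
print(result)  # {'#cat': 2, '#dog': 1}
```

In a real deployment each function would run as several instances spread over the cluster, which is what distinguishes the physical view from this logical one.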
  70. 70. Parallelism
  71. 71. Data parallelism: partitioning a data stream among the instances of an operator.
  72. 72. Task parallelism: partitioning one or more data streams among different operators.
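One common way to spread a stream over the instances of an operator is key-based (hash) partitioning, so that all tuples with the same key reach the same instance. The Python sketch below illustrates the idea; the function names and the choice of CRC32 as a stable hash are assumptions, not something the slides prescribe.

```python
import zlib


def partition(key, num_instances):
    """Map a key to an operator instance using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_instances


def route(stream, num_instances):
    """Distribute (key, value) tuples among per-instance input queues."""
    queues = [[] for _ in range(num_instances)]
    for key, value in stream:
        queues[partition(key, num_instances)].append((key, value))
    return queues


events = [("#cat", 1), ("#dog", 1), ("#cat", 1), ("#bird", 1)]
queues = route(events, num_instances=3)
for instance, queue in enumerate(queues):
    print(instance, queue)
```

Keeping equal keys together matters for stateful operators such as counters: each instance can then maintain the complete count for its keys without coordinating with the others.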
  73. 73. Stream Processing Scheduling
  74. 74. Provides and ensure the (latency and throughput)
  75. 75. Consists of two stages and of operators
  76. 76. • Architecture: independent, distributed, hybrid • Algorithm structure: centralized, decentralized • Metric: load, latency, bandwidth, hybrid, machine resources, operator importance • Operator-level operations: operator reuse, replication • Reconfiguration: types of changes (network, data, flow graph) and response strategy (dynamic, static) [Lakshmanan, 2008]
  77. 77. Stream Processing Fault Tolerance
  78. 78. Stream processing systems can suffer from and
  79. 79. These faults can be dealt with by and of components
  80. 80. Events are usually tracked in the following way…
  81. 81. Processed messages are acknowledged (acked) by downstream operators
  82. 82. If an ack is not received within a given amount of time, the event is considered lost
  83. 83. Lost events are replayed from upstream operators
  84. 84. Techniques Upstream Backup
  85. 85. The upstream component keeps the output tuples in a queue until they have been processed
  86. 86. If a downstream component fails…
  87. 87. the tuples are replayed to another component.
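Upstream backup can be sketched in Python as an output queue that is trimmed only when acks arrive. This is a simplified, single-process illustration under assumptions: the class and method names are invented here, failure is simulated with an `alive` flag, and timeouts are left out.

```python
from collections import OrderedDict


class UpstreamOperator:
    """Keeps every emitted tuple until the downstream operator acks it."""

    def __init__(self):
        self.pending = OrderedDict()  # tuple_id -> value, in emission order
        self.next_id = 0

    def emit(self, value, downstream):
        tuple_id = self.next_id
        self.next_id += 1
        self.pending[tuple_id] = value
        downstream.receive(tuple_id, value, self)

    def ack(self, tuple_id):
        self.pending.pop(tuple_id, None)  # safe to forget this tuple now

    def replay(self, downstream):
        # Copy the items first: acks remove entries while we iterate.
        for tuple_id, value in list(self.pending.items()):
            downstream.receive(tuple_id, value, self)


class DownstreamOperator:
    def __init__(self):
        self.alive = True
        self.processed = []

    def receive(self, tuple_id, value, upstream):
        if not self.alive:
            return  # a failed component neither processes nor acks
        self.processed.append(value)
        upstream.ack(tuple_id)


upstream = UpstreamOperator()
primary = DownstreamOperator()
upstream.emit("a", primary)
primary.alive = False            # simulate the downstream failure
upstream.emit("b", primary)
upstream.emit("c", primary)

backup = DownstreamOperator()
upstream.replay(backup)          # unacked tuples go to another component
print(primary.processed, backup.processed)  # ['a'] ['b', 'c']
```

The cost of this technique is the memory held by the output queue, which grows for as long as acks are delayed.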
  88. 88. Techniques Active Replication
  89. 89. Replicas of a component process the same data
  90. 90. The state is thus implicitly synchronized
  91. 91. Once the primary component fails…
  92. 92. a backup component takes over.
  93. 93. Techniques Passive Replication
  94. 94. Primary component saves its state periodically to a permanent shared storage
  95. 95. Secondary components synchronize their state through the shared storage
  96. 96. If the primary component fails, the secondary takes over…
  97. 97. sends the messages in the output queue and asks the upstream nodes for the messages it has not seen.
  98. 98. Techniques Checkpointing
  99. 99. An operator periodically saves its state in a storage
  100. 100. Upon a failure the component is restored to the previous consistent state
  101. 101. The periodicity is determined by the type of recovery protocol
  102. 102. In uncoordinated protocols each component decides when to do a checkpoint
  103. 103. This is simple to implement, but makes it hard to guarantee the consistency of the whole system
  104. 104. Coordinated protocols, on the other hand, organize the checkpoint moments between components
  105. 105. This ensures the consistency of the whole system at the cost of a more complex and more costly protocol
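A single operator that checkpoints on its own schedule can be sketched in Python as follows. Everything here is an illustrative assumption: the class name, the in-memory `storage` attribute standing in for durable storage, and the rule of checkpointing every three tuples. On recovery the operator restores the last saved state and asks for the input to be replayed from the saved offset.

```python
import copy


class CountingOperator:
    """Counts keys and periodically saves (state, input offset) to storage."""

    def __init__(self, checkpoint_every=3):
        self.counts = {}
        self.processed = 0            # offset of the next tuple to process
        self.checkpoint_every = checkpoint_every
        self.storage = None           # stand-in for durable storage

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.processed += 1
        if self.processed % self.checkpoint_every == 0:
            self.checkpoint()

    def checkpoint(self):
        self.storage = copy.deepcopy((self.counts, self.processed))

    def restore(self):
        """Return to the last saved state; report where replay must start."""
        if self.storage is None:
            self.counts, self.processed = {}, 0
        else:
            self.counts, self.processed = copy.deepcopy(self.storage)
        return self.processed


stream = ["a", "b", "a", "c", "a"]
op = CountingOperator()
for key in stream[:4]:
    op.process(key)                   # checkpoint taken after the 3rd tuple

op.counts, op.processed = {}, 0       # simulate a crash losing volatile state
resume_from = op.restore()
for key in stream[resume_from:]:      # replay everything after the checkpoint
    op.process(key)
print(resume_from, op.counts)         # 3 {'a': 3, 'b': 1, 'c': 1}
```

Saving the input offset together with the state is what keeps the restored state consistent: the fourth tuple is processed again after recovery, but it is not counted twice.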
  106. 106. Techniques Recovery
  107. 107. Precise recovery: failures are not visible, except for the increase in latency
  108. 108. Rollback recovery: failures may affect the system beyond latency, e.g. duplicated tuples
  109. 109. Gap recovery: as the components don’t save their state, tuples can be lost during recovery
  110. 110. Platforms History
  111. 111. Timeline, 2000 to 2014: NiagaraCQ, Cougar, TelegraphCQ, STREAM, Aurora/Medusa, StreamBase, Borealis, BusinessEvents, Oracle CEP, InfoSphere Streams, Stream Mill, Granules, S4, Storm, Samza, Spark Streaming, MillWheel, TimeStream, Flink Streaming
  112. 112. Platforms Spark Streaming
  113. 113. Discretized Stream Processing: run a streaming computation as a series of very small, deterministic batch jobs • Chop up the live stream into batches of X seconds • Spark treats each batch of data as an RDD and processes it using RDD operations • Finally, the processed results of the RDD operations are returned in batches
  114. 114. Discretized Stream Processing: run a streaming computation as a series of very small, deterministic batch jobs • Batch sizes as low as ½ second, latency ~ 1 second • Potential for combining batch processing and streaming processing in the same system
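The discretization step can be imitated in plain Python by grouping timestamped events into fixed-length batches and running the same deterministic batch function on each one. This is only a sketch of the idea, not Spark code: `discretize` is a hypothetical helper, the events carry explicit timestamps in seconds, and intervals with no events are simply skipped.

```python
from collections import Counter


def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for timestamp, value in events:
        batch_index = int(timestamp // batch_seconds)
        batches.setdefault(batch_index, []).append(value)
    return [batches[i] for i in sorted(batches)]


events = [(0.1, "#cat"), (0.4, "#dog"), (0.7, "#cat"), (1.2, "#cat")]
batches = discretize(events, batch_seconds=0.5)

# The "streaming" job is just the same batch job applied to every batch.
per_batch_counts = [dict(Counter(batch)) for batch in batches]
print(batches)           # [['#cat', '#dog'], ['#cat'], ['#cat']]
print(per_batch_counts)  # [{'#cat': 1, '#dog': 1}, {'#cat': 1}, {'#cat': 1}]
```

Because each batch is processed by a deterministic function, a lost result can be rebuilt by rerunning that function on the same input batch.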
  115. 115. Example 1 – Get hashtags from Twitter
        val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
      DStream: a sequence of RDDs representing a stream of data. The tweets DStream is fed by the Twitter Streaming API; each batch (batch @ t, batch @ t+1, batch @ t+2) is stored in memory as an RDD (immutable, distributed).
  116. 116. Example 1 – Get hashtags from Twitter
        val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
        val hashTags = tweets.flatMap(status => getTags(status))
      Transformation: modify data in one DStream to create another DStream. flatMap is applied to each batch of the tweets DStream, so new RDDs are created for every batch of the new hashTags DStream, e.g. [#cat, #dog, … ].
  117. 117. Example 1 – Get hashtags from Twitter
        val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
        val hashTags = tweets.flatMap(status => getTags(status))
        hashTags.saveAsHadoopFiles("hdfs://...")
      Output operation: push data to external storage. After flatMap, every batch of the hashTags DStream is saved to HDFS.
  118. 118. Java Example
      Scala:
        val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
        val hashTags = tweets.flatMap(status => getTags(status))
        hashTags.saveAsHadoopFiles("hdfs://...")
      Java:
        JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>);
        JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { });
        hashTags.saveAsHadoopFiles("hdfs://...");
      In Java, a Function object defines the transformation.
  119. 119. Fault-tolerance • An RDD remembers the sequence of operations that created it from the original fault-tolerant input data • Batches of input data are replicated in the memory of multiple worker nodes and are therefore fault-tolerant • Data lost due to a worker failure can be recomputed from the input data
  120. 120. Key concepts • DStream: sequence of RDDs representing a stream of data; sources include Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets • Transformations: modify data from one DStream to another; standard RDD operations (map, countByValue, reduce, join, …) and stateful operations (window, countByValueAndWindow, …) • Output operations: send data to an external entity; saveAsHadoopFiles saves to HDFS, foreach does anything with each batch of results
  121. 121. Example 2 – Count the hashtags
        val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
        val hashTags = tweets.flatMap(status => getTags(status))
        val tagCounts = hashTags.countByValue()
  122. 122. Example 3 – Count the hashtags over the last 10 mins
        val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
        val hashTags = tweets.flatMap(status => getTags(status))
        val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
      window is a sliding window operation: Minutes(10) is the window length and Seconds(1) is the sliding interval.
  123. 123. Example 3 – Counting the hashtags over the last 10 mins
        val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
  124. 124. Smart window-based countByValue
        val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
      The windowed count is updated incrementally: counts from the batch entering the window are added (+) and counts from the batch leaving the window are subtracted (–).
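The incremental idea behind a window-based count, adding the counts of the batch that enters the window and subtracting those of the batch that leaves it, can be sketched in plain Python. This is not Spark's implementation: `WindowedCounter` is a hypothetical class and the window length is expressed in number of batches rather than in time.

```python
from collections import Counter, deque


class WindowedCounter:
    """Maintains counts over the last `window_batches` batches incrementally."""

    def __init__(self, window_batches):
        self.window_batches = window_batches
        self.batches = deque()    # per-batch counts currently in the window
        self.counts = Counter()   # running counts for the whole window

    def add_batch(self, items):
        batch_counts = Counter(items)
        self.batches.append(batch_counts)
        self.counts.update(batch_counts)       # + batch entering the window
        if len(self.batches) > self.window_batches:
            expired = self.batches.popleft()
            self.counts.subtract(expired)      # - batch leaving the window
            self.counts += Counter()           # drop entries that fell to zero
        return dict(self.counts)


counter = WindowedCounter(window_batches=2)
first = counter.add_batch(["#cat", "#dog"])
second = counter.add_batch(["#cat"])
third = counter.add_batch(["#bird"])
print(first)   # {'#cat': 1, '#dog': 1}
print(second)  # {'#cat': 2, '#dog': 1}
print(third)   # {'#cat': 1, '#bird': 1}
```

Each update touches only two batches, whereas recomputing the count over the full window would reprocess every batch in it on every slide.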
  125. 125. Platforms Storm
  126. 126. Applications
  127. 127. Topology: a graph of spouts and bolts
  128. 128. Parallelism hint: each spout and bolt is given a parallelism hint (2, 5, 2, and 1 in the figure)
  129. 129. Architecture
  130. 130. A Storm cluster is composed of one Nimbus and a set of supervisors. (Figure: the master node runs the Nimbus process and the scheduler; worker nodes 1 to n each run a supervisor whose slots hold worker processes, which in turn run executors.)
  131. 131. The Nimbus assigns work to supervisors, manages failures, and monitors resource usage.
  132. 132. The number of slots of a supervisor is the maximum number of workers it can execute
  133. 133. Parallelism
  134. 134. Example with two worker processes: blue bolt with 2 executors, green bolt with 2 executors, yellow bolt with 6 executors; # executors = 10, i.e. 5 executors per worker. The green bolt was configured with 2 executors and 4 tasks
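The arithmetic of this example can be checked with a short Python snippet. The executor counts and the green bolt's 4 tasks come from the slide; the task counts for the blue and yellow bolts are an assumption (one task per executor, which matches the 12 tasks drawn in the figure), and the dictionary layout is purely illustrative.

```python
# Executors and tasks per component. Blue and yellow task counts are assumed
# to equal their executor counts (one task per executor).
components = {
    "blue": {"executors": 2, "tasks": 2},
    "green": {"executors": 2, "tasks": 4},
    "yellow": {"executors": 6, "tasks": 6},
}
num_workers = 2

total_executors = sum(c["executors"] for c in components.values())
total_tasks = sum(c["tasks"] for c in components.values())
executors_per_worker = total_executors // num_workers
tasks_per_executor = {
    name: c["tasks"] // c["executors"] for name, c in components.items()
}

print(total_executors)       # 10
print(executors_per_worker)  # 5
print(total_tasks)           # 12
print(tasks_per_executor)    # {'blue': 1, 'green': 2, 'yellow': 1}
```

The green bolt is the only component where an executor runs more than one task, which is why its configuration is called out separately on the slide.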
  135. 135. Platforms Comparison
  136. 136. Platform comparison (part 1)
      | Platform | Storm | Storm Trident | Spark Streaming | Samza | S4 |
      |---|---|---|---|---|---|
      | Processing Model | Record-at-a-time | Micro-batches | Micro-batches | Record-at-a-time | Record-at-a-time |
      | Programming Model | DAG | DAG | Monad | DAG | Actors |
      | Stream Partitioning | Yes | Yes | Yes | Yes | Yes |
      | Rebalancing | Yes | Yes | No | No | Yes |
      | Dynamic Cluster | Yes | Yes | Yes | Yes | No |
      | Resource Management | Standalone, YARN, Mesos | Standalone, YARN, Mesos | Standalone, YARN, Mesos | YARN, Mesos | Standalone |
      | Coordination | Zookeeper | Zookeeper | Built-in | Built-in | Zookeeper |
      | Programming Language | Java, any (via Thrift) | Java, any (via Thrift) | Java, Scala, Python | JVM-languages | Java |
  137. 137. Platform comparison (part 2)
      | Platform | Storm | Storm Trident | Spark Streaming | Samza | S4 |
      |---|---|---|---|---|---|
      | Implementation Language | Java, Clojure | Java | Scala, Java | Scala, Java | Java, Groovy |
      | Built-in Operators | No | Yes | Yes | No | No |
      | Deterministic | - | - | Yes | - | - |
      | Message System | Netty | Netty | Netty, Akka | Kafka | Netty |
      | Data Mobility | Pull | Pull | - | Pull | Push |
      | Fault Tolerance | Rollback recovery using upstream backup | - | Coordinated periodic checkpoint, replication, parallel recovery | Rollback recovery | Uncoordinated periodic checkpoint |
      | Dynamic Graph | No | No | No | Yes | Yes |
      | Persistent State | No | Yes | Yes | Yes | Yes |
      Delivery Guarantees (values as listed across the five platforms): At-most-once At-least-once Exactly-once At-most-once At-least-once Exactly-once Exactly-once At-most-once
  138. 138. Maycon Viana Bordin Advisor: Claudio Geyer
  139. 139. Datasets: dataset size per application and number of nodes
      | Application | 1 node | 2 nodes | 4 nodes | 8 nodes |
      |---|---|---|---|---|
      | word-count | 4GB | 8GB | 16GB | 26GB |
      | log-processing | 15GB | 30GB | 60GB | 120GB |
      | traffic-monitoring | 4GB | 8GB* | 16GB* | 32GB* |
      | machine-outlier | 4GB | 9GB | 18GB | 36GB |
      | spam-filter | 4GB* | 8GB* | 16GB* | 32GB* |
      | sentiment-analysis | 7GB | 15GB | 30GB | 60GB |
      | trending-topics | 7GB | 15GB | 30GB | 60GB |
      | click-analytics | 15GB | 30GB | 60GB | 120GB |
      | fraud-detection | 4GB† | 8GB† | 16GB† | 32GB† |
      | spike-detection | 4GB* | 8GB* | 16GB* | 32GB* |
      *replicated †generated
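  140. 140. Parallelism: base value and multipliers per operator for each configuration
      | Application | Operator | 1:1 base | 1:1 multipliers | Best base | Best multipliers | Best (only source) base | Best (only source) multipliers | Best (max mem) base | Best (max mem) multipliers |
      |---|---|---|---|---|---|---|---|---|---|
      | word-count | source | 1 | 1...6 | 1 | 1...3 | 1 | 2, 4, 8 | 3 | 1 |
      | word-count | splitter | 1 | 1...6 | 5 | 1...3 | 5 | 1 | 5 | 1 |
      | word-count | counter | 1 | 1...6 | 6 | 1...3 | 6 | 1 | 6 | 1, 2 |
      | word-count | sink | 1 | 1...6 | 3 | 1...3 | 3 | 1 | 3 | 1 |
      | log-processing | source | 1 | 1...6 | 4 | 1...3 | 1 | 1, 2, 8 | 4 | 1 |
      | log-processing | status-counter | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 1 |
      | log-processing | volume-counter | 1 | 1...6 | 2 | 1...3 | 2 | 1 | 2 | 1 |
      | log-processing | geo-locator | 1 | 1...6 | 4 | 1...3 | 4 | 1 | 4 | 1, 2 |
      | log-processing | geo-summarizer | 1 | 1...6 | 2 | 1...3 | 2 | 1 | 2 | 1 |
      | log-processing | sink | 1 | 1...6 | 4 | 1...3 | 4 | 1 | 4 | 1 |
      | traffic-monitoring | source | 1 | 1...6 | 1 | 1...3 | 1 | 2, 4, 8 | 1 | 1 |
      | traffic-monitoring | map-matcher | 1 | 1...6 | 2 | 1...3 | 2 | 1 | 2 | 1, 2 |
      | traffic-monitoring | speed-calculator | 1 | 1...6 | 2 | 1...3 | 2 | 1 | 2 | 1, 2 |
      | traffic-monitoring | sink | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 1 |
      | machine-outlier | source | 1 | 1...6 | 6 | 1...3 | 1 | 1, 2, 4, 8 | - | - |
      | machine-outlier | scorer | 1 | 1...6 | 1 | 1...3 | 1 | 1 | - | - |
      | machine-outlier | anomaly-scorer | 1 | 1...6 | 1 | 1...3 | 1 | 1 | - | - |
      | machine-outlier | alert-trigger | 1 | 1...6 | 4 | 1...3 | 4 | 1 | - | - |
      | machine-outlier | sink | 1 | 1...6 | 1 | 1...3 | 1 | 1 | - | - |
      | spam-filter | source | 1 | 1...6 | 1 | 1...3 | 1 | 2, 4, 8 | 1 | 1 |
      | spam-filter | tokenizer | 1 | 1...6 | 10 | 1...3 | 10 | 1 | 10 | 1, 2 |
      | spam-filter | word-probability | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 1 |
      | spam-filter | bayes-rule | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 1 |
      | spam-filter | sink | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 1 |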
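  141. 141. Parallelism (continued): base value and multipliers per operator for each configuration
      | Application | Operator | 1:1 base | 1:1 multipliers | Best base | Best multipliers | Best (only source) base | Best (only source) multipliers | Best (max mem) base | Best (max mem) multipliers |
      |---|---|---|---|---|---|---|---|---|---|
      | trending-topics | source | 1 | 1...6 | 9 | 1...3 | 1 | 1, 2, 4, 8 | 9 | 1 |
      | trending-topics | topic-extractor | 1 | 1...6 | 2 | 1...3 | 2 | 1 | 2 | 1 |
      | trending-topics | counter | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 2, 4 |
      | trending-topics | intermediate-ranker | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 1 |
      | trending-topics | total-ranker | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 1 |
      | trending-topics | sink | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 1 |
      | click-analytics | source | 1 | 1...6 | 2 | 1...3 | 2 | 2, 4, 8 | 2 | 1 |
      | click-analytics | repeat-visits | 1 | 1...6 | 2 | 1...3 | 2 | 1 | 2 | 1 |
      | click-analytics | total-visits | 1 | 1...6 | 2 | 1...3 | 2 | 1 | 2 | 1 |
      | click-analytics | geo-locator | 1 | 1...6 | 5 | 1...3 | 5 | 1 | 5 | 2, 4 |
      | click-analytics | geo-summarizer | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 1 |
      | click-analytics | sink-visits | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 1 |
      | click-analytics | sink-locations | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 1 |
      | fraud-detection | source | 1 | 1...6 | 8 | 1...3 | 1 | 1, 2, 4 | 8 | 1 |
      | fraud-detection | predictor | 1 | 1...6 | 3 | 1...3 | 3 | 1 | 3 | 2, 4 |
      | fraud-detection | sink | 1 | 1...6 | 2 | 1...3 | 2 | 1 | 2 | 1 |
      | spike-detection | source | 1 | 1...6 | 7 | 1...3 | 1 | 1, 2, 4, 8 | 7 | 1 |
      | spike-detection | moving-average | 1 | 1...6 | 3 | 1...3 | 3 | 1 | 3 | 2, 4 |
      | spike-detection | spike-detector | 1 | 1...6 | 2 | 1...3 | 2 | 1 | 2 | 1 |
      | spike-detection | sink | 1 | 1...6 | 1 | 1...3 | 1 | 1 | 1 | 1 |
      sentiment-analysis lists only two base/multipliers pairs per operator: the 1:1 pair and one further pair.
      | Operator | 1:1 base | 1:1 multipliers | Second pair base | Second pair multipliers |
      |---|---|---|---|---|
      | source | 1 | 1...6 | 1 | 2, 4, 8 |
      | tweet-filter | 1 | 1...6 | 1 | 1 |
      | text-filter | 1 | 1...6 | 1 | 1 |
      | stemmer | 1 | 1...6 | 1 | 1 |
      | positive-scorer | 1 | 1...6 | 1 | 1 |
      | negative-scorer | 1 | 1...6 | 1 | 1 |
      | joiner | 1 | 1...6 | 1 | 1 |
      | scorer | 1 | 1...6 | 1 | 1 |
      | sink | 1 | 1...6 | 1 | 1 |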
  142. 142. Architecture (Azure): Kafka with three brokers; the stream processing platform with one master and eight slaves; metrics
  143. 143. Tests: wordcount
  144. 144. [Plot] Throughput: nodes=1, parallelism=3. Throughput (tuples/sec) over time (seconds) for source, splitSentence, wordCount
  145. 145. [Plot] CPU usage
  146. 146. [Plot] Memory usage
  147. 147. [Plot] Network usage (MB/sec): net recv (MB/s), net sent (MB/s)
  148. 148. [Plot] HDD Read/Write – Kafka Broker (MBytes/sec): SDD_READ, SDD_WRITE, SDB_READ, SDB_WRITE
  149. 149. Nodes=1, parallelism=1
  150. 150. [Plot] Throughput for source, splitSentence, wordCount
  151. 151. [Plot] CPU usage
  152. 152. [Plot] Memory usage
  153. 153. Nodes=2, parallelism=1
  154. 154. Latency
  155. 155. Throughput
  156. 156. References
      • Heinze, Thomas, et al. "Tutorial: Cloud-based Data Stream Processing." (2014).
      • Artikis, Alexander, Matthias Weidlich, Francois Schnitzler, Ioannis Boutsis, Thomas Liebig, Nico Piatkowski, Christian Bockermann et al. "Heterogeneous Stream Processing and Crowdsourcing for Urban Traffic Management." In EDBT, pp. 712-723. 2014.
      • Bouillet, Eric, et al. "Processing 6 billion CDRs/day: from research to production (experience report)." Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems. ACM, 2012.
      • Lakshmanan, G. T., Li, Y., and Strom, R. "Placement strategies for internet-scale data stream systems." Internet Computing, IEEE 12, 6 (2008), 50–60.
      • Simmhan, Yogesh, et al. "An informatics approach to demand response optimization in smart grids." NATURAL GAS 31 (2011): 60.
      • Sakaki, Takeshi, Makoto Okazaki, and Yutaka Matsuo. "Earthquake shakes Twitter users: real-time event detection by social sensors." Proceedings of the 19th international conference on World wide web. ACM, 2010.