Streaming architecture patterns


  1. 1. Best practices for streaming applications • O’Reilly Webcast, June 21st/22nd, 2016 • Mark Grover | @mark_grover | Software Engineer • Ted Malaska | @TedMalaska | Principal Solutions Architect
  2. 2. About the presenters • Ted Malaska – Principal Solutions Architect at Cloudera – Has worked with Hadoop for 6 years – Worked with > 70 companies in 8 countries – Previously, lead architect at FINRA – Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark – Marvel fan boy, runner • Mark Grover – Software Engineer at Cloudera, working on Spark – Committer on Apache Bigtop, PMC member on Apache Sentry (incubating) – Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
  3. 3. About the book • @hadooparchbook • hadooparchitecturebook.com • github.com/hadooparchitecturebook • slideshare.com/hadooparchbook
  4. 4. Goal
  5. 5. Understand common use-cases for streaming and their architectures
  6. 6. What is streaming?
  7. 7. When to stream, and when not to • Real-time: constant low milliseconds & under • Near real-time: low milliseconds to seconds, delay in case of failures • Batch: 10s of seconds or more, re-run in case of failures
  8. 8. When to stream, and when not to • Real-time: constant low milliseconds & under • Near real-time: low milliseconds to seconds, delay in case of failures • Batch: 10s of seconds or more, re-run in case of failures
  9. 9. No free lunch • Real-time: constant low milliseconds & under • Near real-time: low milliseconds to seconds, delay in case of failures • Batch: 10s of seconds or more, re-run in case of failures • “Difficult” architectures, lower latency • “Easier” architectures, higher latency
  10. 10. Use-cases for streaming
  11. 11. Use-case categories • Ingestion • Simple transformations – Decision (e.g. anomaly detection) • Simple counts – Lambda, etc. • Advanced usage – Machine learning – Windowing
  12. 12. Ingestion & Transformations
  13. 13. What is ingestion? (diagram: Source systems -> Streaming engine -> Destination system)
  14. 14. But there are multiple sources (diagram: Source System 1, Source System 2, Source System 3 -> Ingest -> Streaming engine -> Destination system)
  15. 15. But... • Sources, sinks and ingestion channels may go down • Sources and sinks may produce/consume at different rates (buffering) • Regular maintenance windows may need to be scheduled • You need a resilient message broker (pub/sub)
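
The buffering role of a broker can be illustrated with a toy in-memory log. This is only a sketch under stated assumptions: the `Broker` class and its `publish`/`read` methods are hypothetical names invented for illustration, not the API of Kafka or any real broker. It shows a source that keeps producing while the sink is unavailable, and a consumer that later catches up from its own offset.

```python
class Broker:
    """Toy stand-in for a pub/sub broker: an append-only, in-memory log."""

    def __init__(self):
        self.log = []

    def publish(self, event):
        """Append an event and return its offset in the log."""
        self.log.append(event)
        return len(self.log) - 1

    def read(self, offset, max_events):
        """Return up to max_events events starting at offset."""
        return self.log[offset:offset + max_events]


broker = Broker()

# The source keeps producing even though no consumer is running yet
# (for example during a maintenance window on the destination).
for i in range(5):
    broker.publish(f"event-{i}")

# The consumer comes back later and drains the backlog at its own pace.
consumer_offset = 0
delivered = []
while True:
    batch = broker.read(consumer_offset, max_events=2)
    if not batch:
        break
    delivered.extend(batch)
    consumer_offset += len(batch)
```

Because the consumer tracks its own offset, producers and consumers are decoupled in both availability and rate, which is the point the slide makes about buffering.
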
  16. 16. Need for a message broker (diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Destination system)
  17. 17. Kafka (diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Destination system)
  18. 18. Destination systems (diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Destination system) • Most common “destination” is a storage system
  19. 19. Architecture diagram with a broker (diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Storage system)
  20. 20. Streaming engines (diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Storage system) • Kafka Connect • Apache Flume • Apache Beam (incubating)
  21. 21. Storage options (diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Storage system) • Kafka Connect • Apache Flume • Apache Beam (incubating)
  22. 22. Semantics • At most once • Exactly once • At least once
  23. 23. Semantic types • At most once – Not good for many cases – Only where performance/SLA is more important than accuracy • Exactly once – Expensive to achieve but desirable • At least once – Easiest to achieve
  24. 24. Review (diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Destination system)
  25. 25. Semantics of our architecture (diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming engine -> Push -> Destination system; annotations on the diagram: At least once, At least once, Ordered, Partitioned, It depends, It depends)
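
One common way the difference between at-most-once and at-least-once arises is the order in which a consumer commits its offset and processes the event. The sketch below is a simplified model, not taken from the slides: the `consume` function, its parameters and the single simulated crash are all assumptions made for illustration.

```python
def consume(log, commit_before_processing, crash_at):
    """Consume `log` from a committed offset, simulating one crash and restart.

    crash_at is the offset at which the consumer crashes once, between the
    commit step and the processing step (in whichever order they run).
    """
    committed = 0      # durable offset, survives the crash
    output = []        # what reached the destination
    crashed = False
    while committed < len(log):
        offset = committed
        event = log[offset]
        if commit_before_processing:
            committed = offset + 1            # commit first
            if offset == crash_at and not crashed:
                crashed = True
                continue                      # crash: event is never processed
            output.append(event)
        else:
            output.append(event)              # process first
            if offset == crash_at and not crashed:
                crashed = True
                continue                      # crash: offset not committed, event replays
            committed = offset + 1
    return output


log = ["a", "b", "c"]
at_most_once = consume(log, commit_before_processing=True, crash_at=1)
at_least_once = consume(log, commit_before_processing=False, crash_at=1)
```

In this model, committing first loses `"b"` (at most once), while processing first delivers `"b"` twice (at least once). Getting to exactly-once effects then requires something extra downstream, such as deduplication or idempotent writes.
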
  26. 26. Transforming data in flight
  27. 27. Streaming architecture for ingestion (diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process -> Push -> Storage system) • Kafka Connect • Apache Flume • Can be used to do simple transformations
  28. 28. Ingestion and/or Transformation 1. Zero Transformation – No transformation, plain ingest, no schema validation – Keep the original format (SequenceFiles, text, etc.) – Allows storing data that may have errors in the schema 2. Format Transformation – Simply change the format of a field, for example – Structured format, e.g. Avro – Which does schema validation 3. Enrichment Transformation – Atomic – Contextual
  29. 29. #3 - Enrichment transformations • Atomic – Need to work with one event at a time – Mask a credit card number – Add processing time or offset to the record • Contextual – Need to refer to external context – Example: convert zip code to state, by looking up a cache
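
The two kinds of enrichment can be sketched as plain functions over a single event. Everything named here is an assumption for illustration: the event fields (`card`, `zip`), the function names and the tiny `ZIP_TO_STATE` table are hypothetical, and a real pipeline would run equivalent logic inside its streaming engine.

```python
import time

# Hypothetical lookup table standing in for a cache of dimension data.
ZIP_TO_STATE = {"94105": "CA", "10001": "NY"}


def mask_card(event):
    """Atomic: needs only the event itself. Keep the last four digits."""
    card = event["card"]
    return {**event, "card": "*" * (len(card) - 4) + card[-4:]}


def add_processing_time(event, now=time.time):
    """Atomic: stamp the record with the time it was processed."""
    return {**event, "processed_at": now()}


def add_state(event, lookup=ZIP_TO_STATE):
    """Contextual: needs data from outside the event (the zip-to-state cache)."""
    return {**event, "state": lookup.get(event["zip"], "UNKNOWN")}


event = {"card": "4111111111111111", "zip": "94105"}
enriched = add_state(add_processing_time(mask_card(event)))
```

The atomic functions can run anywhere with no setup, whereas `add_state` only works where the lookup data is reachable, which is why the question of where to store the context matters.
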
  30. 30. Atomic transformations • Require no context • All streaming engines support it
  31. 31. Contextual transformations • Well supported by many streaming engines • Need to store the context somewhere
  32. 32. Where to store the context 1. Locally Broadcast Cached Dim Data – Local to Process (On Heap, Off Heap) – Local to Node (Off Process) 2. Partitioned Cache – Shuffle to move new data to partitioned cache 3. External Fetch Data (e.g. HBase, Memcached)
  33. 33. #1a - Locally broadcast cached data • Could be on heap or off heap
  34. 34. #1b - Off process cached data • Data is cached on the node, outside of the process, potentially in an external system like RocksDB
  35. 35. #2 - Partitioned cache data • Data is partitioned based on field(s) and then cached
  36. 36. #3 - External fetch • Data fetched from external system
  37. 37. A combination (partitioned cache + external)
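
A combination of a partitioned cache with an external fetch might look roughly like the following. This is a minimal sketch with invented names: `PartitionedCache`, `fetch_external` and the `EXTERNAL_STORE` dictionary are hypothetical stand-ins (the dictionary plays the role of a system such as HBase or Memcached), and the partitions are simulated in one process rather than spread across workers.

```python
import zlib

# Stand-in for an external system holding the full context data.
EXTERNAL_STORE = {"94105": "CA", "10001": "NY", "60601": "IL"}
external_fetches = []  # records every round trip, to show cache effectiveness


def fetch_external(key):
    """Simulated (slow) lookup against the external system."""
    external_fetches.append(key)
    return EXTERNAL_STORE.get(key)


class PartitionedCache:
    """Each partition caches only the keys that hash to it."""

    def __init__(self, num_partitions):
        self.partitions = [dict() for _ in range(num_partitions)]

    def partition_for(self, key):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def lookup(self, key):
        cache = self.partitions[self.partition_for(key)]
        if key not in cache:
            cache[key] = fetch_external(key)  # miss: fall back to external fetch
        return cache[key]


cache = PartitionedCache(num_partitions=4)
states = [cache.lookup(z) for z in ["94105", "10001", "94105", "94105"]]
```

Since a given key always lands in the same partition, each partition holds only a slice of the context, and the external system is hit once per key rather than once per event.
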
  38. 38. Anomaly detection using contextual transformations
  39. 39. Storage systems: when to use which one?
  40. 40. Storage Considerations • Throughput • Access Patterns – Scanning – Indexed – Reverse Indexed • Transaction Level – Record/Document – File
  41. 41. File Level • HDFS • S3
  42. 42. NoSQL • HBase • Cassandra • MongoDB
  43. 43. Search • Solr • Elasticsearch
  44. 44. NoSQL-SQL • Kudu
  45. 45. Streaming engines: comparison
  46. 46. © Cloudera, Inc. All rights reserved. Tricks With Producers • Send Source ID (requires partitioning in Kafka) • Seq • UUID • UUID plus time • Partition on SourceID • Watch out for repartitions and partition failovers
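
One way to read these producer tricks: if every record carries its source ID and a per-source sequence number, and all records from one source go to the same partition, a downstream consumer can drop replayed duplicates. The sketch below assumes exactly that. `Producer`, `make_record`, `partition` and `dedupe` are hypothetical names and do not correspond to the Kafka client API.

```python
import itertools
import uuid
import zlib


class Producer:
    """Tags each record with source ID, sequence number and a UUID."""

    def __init__(self, source_id, num_partitions):
        self.source_id = source_id
        self.num_partitions = num_partitions
        self._seq = itertools.count()

    def make_record(self, payload):
        return {
            "source_id": self.source_id,
            "seq": next(self._seq),
            "uuid": str(uuid.uuid4()),
            "payload": payload,
        }

    def partition(self):
        """Partition on source ID so one source's records stay in order."""
        return zlib.crc32(self.source_id.encode("utf-8")) % self.num_partitions


def dedupe(records):
    """Drop records whose seq was already seen for that source.

    Assumes records of a source arrive in order, which holds only while
    they stay within a single partition.
    """
    last_seen = {}
    unique = []
    for record in records:
        if record["seq"] > last_seen.get(record["source_id"], -1):
            last_seen[record["source_id"]] = record["seq"]
            unique.append(record)
    return unique


producer = Producer("sensor-7", num_partitions=8)
r0, r1, r2 = (producer.make_record(p) for p in ("a", "b", "c"))
received = [r0, r1, r1, r2]   # r1 was retried and delivered twice
unique = dedupe(received)
```

The ordering assumption is exactly what the last bullet warns about: after a repartition or a partition failover, records from one source may no longer arrive in sequence, and a check this simple would drop or admit the wrong records.
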
  47. 47. Streaming Engines • Consumer • Flume, Kafka Connect, Streaming Engine • Storm • Spark Streaming • Flink • Kafka Streams
  48. 48. Consumer: Flume, Kafka Connect • Simple and works • Low latency • High throughput • Interceptors • Transformations • Alerting • Ingestions
  49. 49. Consumer: Streaming Engines • Not so great at HDFS ingestion • But great for record storage systems • HBase • Cassandra • Kudu • Solr • Elasticsearch
  50. 50. Storm • Old gen • Low latency • Low throughput • At least once • Around forever • Topology based
  51. 51. Spark Streaming • The Juggernaut • Higher latency • High throughput • Exactly once • SQL • MLlib • Highly used • Easy to debug/unit test • Easy to transition from batch • Flow language • 600 commits in a month and about 100 meetups
  52. 52. Spark Streaming (diagram: in each batch, Source -> Receiver -> RDD; the DStream operations Filter -> Count -> Print run over the batch's RDDs in a single pass; shown for First Batch and Second Batch)
  53. 53. Spark Streaming (diagram: Source -> Receiver -> RDD partitions; DStream operations Filter -> Count -> Print in a single pass; shown for Pre-first Batch, First Batch and Second Batch, with Stateful RDD 1 and Stateful RDD 2 carried across batches)
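
The micro-batch model in these two diagrams can be imitated in plain Python, without Spark. This is a loose analogy under stated assumptions, not Spark Streaming code: `micro_batches` and `run` are hypothetical helpers, lists stand in for RDDs, and a `Counter` stands in for the stateful RDD that is carried from one batch to the next.

```python
from collections import Counter


def micro_batches(events, batch_size):
    """Cut the stream into fixed-size batches (Spark cuts by time interval)."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


def run(events, batch_size):
    state = Counter()          # plays the role of the stateful RDD
    per_batch_counts = []
    for batch in micro_batches(events, batch_size):
        errors = [e for e in batch if e.startswith("ERROR")]   # Filter
        per_batch_counts.append(len(errors))                   # Count
        # New state is derived from the previous state plus this batch,
        # rather than mutated in place.
        state = state + Counter(errors)
    return per_batch_counts, state


events = ["ERROR disk", "INFO ok", "ERROR disk", "ERROR net", "INFO ok"]
per_batch_counts, totals = run(events, batch_size=2)
```

The per-batch counts correspond to the stateless Filter/Count/Print pass, while `totals` corresponds to state that survives across batches.
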
  54. 54. Flink • I’m better than Spark, why doesn’t anyone use me? • Very much like Spark but not as feature rich • Lower latency • Micro Batch -> ABS • Asynchronous Barrier Snapshotting • Flow language • ~1/6th the commits and meetups • But Slim loves it ☺
  55. 55. Flink - ABS (diagram: Operator, Buffer)
  56. 56. Flink - ABS (diagram: two Operator/Buffer pairs) • Barrier 1A hit • Barrier 1B still behind
  57. 57. Flink - ABS (diagram: two Operator/Buffer pairs) • Barrier 1A hit • Barrier 1B still behind • Both barriers hit • Checkpoint
  58. 58. Flink - ABS (diagram: two Operator/Buffer pairs) • Both barriers hit • Checkpoint • Barrier is combined and can move on • Buffer can be flushed out
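
The barrier alignment shown across these slides can be modelled with a small operator that has two input channels. This is a simplified sketch, not Flink code: `AligningOperator`, its `receive` method and the `BARRIER` sentinel are hypothetical, the operator's state is just an event count, and only one barrier is in flight at a time.

```python
BARRIER = object()  # sentinel marking a checkpoint barrier in a channel


class AligningOperator:
    def __init__(self, channels):
        self.channels = set(channels)
        self.blocked = set()     # channels whose barrier has already arrived
        self.buffer = []         # post-barrier events held back during alignment
        self.count = 0           # operator state: number of events processed
        self.checkpoints = []    # state snapshots, one per completed barrier

    def _process(self, event):
        self.count += 1

    def receive(self, channel, item):
        if item is BARRIER:
            self.blocked.add(channel)
            if self.blocked == self.channels:
                # Both barriers hit: snapshot state, then release the buffer.
                self.checkpoints.append(self.count)
                self.blocked.clear()
                pending, self.buffer = self.buffer, []
                for event in pending:
                    self._process(event)
        elif channel in self.blocked:
            self.buffer.append(item)   # belongs to the next checkpoint epoch
        else:
            self._process(item)


op = AligningOperator(["A", "B"])
op.receive("A", "a1")
op.receive("A", BARRIER)   # barrier 1A hit, barrier 1B still behind
op.receive("A", "a2")      # held in the buffer
op.receive("B", "b1")      # still pre-barrier on B, processed normally
op.receive("B", BARRIER)   # both barriers hit: checkpoint, flush buffer
```

The snapshot taken at the barrier reflects `a1` and `b1` only. `a2` arrived after barrier 1A, so it is buffered until the checkpoint is taken and processed afterwards.
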
  59. 59. Kafka Streams • The new kid on the block • When you only have Kafka • Low latency • High throughput • Not exactly once • Very young • Flow language • Very different hardware profile than others • Not widely supported • Not widely used • Worries about separation of concerns
  60. 60. Summary about Engines • Ingestion – Flume and Kafka Connect • Super real time and special – Consumer • Counting, MLlib, SQL – Spark • Maybe future and cool – Flink and Kafka Streams • Odd man out – Storm
  61. 61. Abstractions • Code abstractions – Beam • SQL abstraction – SQL • UI abstraction – StreamSets • Streaming engines
  62. 62. Counting
  63. 63. Streaming and Counting • Counting is easy, right? • Back to only once
  64. 64. We started with Lambda (diagram: Pipe -> Speed Layer and Batch Layer -> Persist Results (Speed Results, Batch Results) -> Serving Layer)
  65. 65. Why did streaming suck? • Increments with Cassandra – Double increment – No strong consistency • Storm without Kafka – Not only once – Not at least once • Batch would have to re-process EVERY record to remove dups
  66. 66. We have come a long way • We don’t have to use increments any more and we can have consistency – HBase • We can have state in our streaming platform – Spark Streaming • We don’t lose data – Spark Streaming – Kafka – Other options • Full universe of deduping – Again HBase with versions
  67. 67. Increments
  68. 68. Puts with State
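
The contrast between increments and puts with state can be sketched as follows. This is a toy model built on assumptions, not the presenters' implementation: the batches are `(batch_id, key, count)` tuples, a replayed batch is simply repeated in the input, a dictionary stands in for the storage system, and checkpointing is reduced to remembering the state at the start of each batch.

```python
def apply_increments(batches):
    """Each batch sends a relative increment: a replay double-counts."""
    store = {}
    for _batch_id, key, count in batches:
        store[key] = store.get(key, 0) + count
    return store


def apply_puts_with_state(batches):
    """The engine keeps running totals and writes absolute values.

    On replay, state is restored to what it was before the batch, so the
    same total is recomputed and the put overwrites it with the same value.
    """
    store = {}
    state = {}
    checkpoints = {}   # batch_id -> state as of the start of that batch
    for batch_id, key, count in batches:
        if batch_id in checkpoints:
            state = dict(checkpoints[batch_id])   # replay: restore state
        else:
            checkpoints[batch_id] = dict(state)
        state[key] = state.get(key, 0) + count
        store[key] = state[key]                   # idempotent put
    return store


# Batch 2 is delivered twice, as can happen with at-least-once delivery.
batches = [(1, "clicks", 3), (2, "clicks", 2), (2, "clicks", 2)]
with_increments = apply_increments(batches)
with_puts = apply_puts_with_state(batches)
```

The true total is 5. Increments yield 7 because the replay is applied again, whereas the put rewrites the same absolute value, so the replay has no effect.
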
  69. 69. Advanced streaming: when to use which one?
  70. 70. Advanced Streaming • Ad-hoc will identify value • Ad-hoc will become batch • The value will demand less latency on batch • Batch will become streaming
  71. 71. Advanced Streaming • Requirements for ideal batch-to-streaming frameworks – Something that can span both paradigms – Something that can use the tools of ad-hoc: SQL, MLlib, R, Scala, Java – Development through a common IDE: debugging, unit testing – Common deployment model
  72. 72. Advanced Streaming • In Spark Streaming – A DStream is a collection of RDDs with respect to micro-batch intervals • If we can access RDDs in Spark Streaming – We can convert to Vectors: KMeans, principal component analysis – We can convert to LabeledPoint: NaiveBayes, Random Forest, linear support vector machines – We can convert to DataFrames: SQL, R
  73. 73. Wrap-up
  74. 74. Our original goal • Understand common use-cases for streaming and their architectures
  75. 75. Common streaming use-cases • Ingestion – Transformation • Counting – Lambda, etc. • Advanced streaming
  76. 76. Thank you! • Mark Grover | @mark_grover • Ted Malaska | @TedMalaska • @hadooparchbook • hadooparchitecturebook.com
  77. 77. Transformations with context
