
Designing Structured Streaming Pipelines—How to Architect Things Right

"Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark's built-in functions make it easy for developers to express complex computations. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem needs to be solved.

What are you trying to consume? Single source? Joining multiple streaming sources? Joining streaming with static data?
What are you trying to produce? What is the final output that the business wants? What type of queries does the business want to run on the final output?
When do you want it? When does the business want the data? What is the acceptable latency? Do you really need millisecond-level latency?
How much are you willing to pay for it? This is the ultimate question, and the answer significantly determines how feasible it is to solve the above questions.

These are the questions that we ask every customer in order to help them design their pipeline. In this talk, I am going to go through the decision tree of designing the right architecture for solving your problem."


  1. 1. Tathagata “TD” Das Designing Structured Streaming Pipelines How to Architect Things Right #UnifiedAnalytics #SparkAISummit @tathadas
  2. 2. Structured Streaming Distributed stream processing built on SQL engine High throughput, second-scale latencies Fault-tolerant, exactly-once Great set of connectors Philosophy: Data streams are unbounded tables Users write batch-like code on a table Spark will automatically run code incrementally on streams 2
  3. 3. Structured Streaming @ 1000s of customer streaming apps in production on Databricks 1000+ trillion rows processed in production
  4. 4. Structured Streaming 4 #UnifiedAnalytics #SparkAISummit Example Read JSON data from Kafka Parse nested JSON Store in structured Parquet table Get end-to-end failure guarantees ETL
  5. 5. Anatomy of a Streaming Query 5 spark.readStream.format("kafka") .option("kafka.bootstrap.servers",...) .option("subscribe", "topic") .load() Source Specify where to read data from Built-in support for Files / Kafka / Kinesis* Can include multiple sources of different types using join() / union() *Available only on Databricks Runtime creates a DataFrame with columns key, value, topic, partition, offset, timestamp, e.g. [binary] [binary] topicA 0 345 1486087873 and [binary] [binary] topicB 3 2890 1486086721
  6. 6. spark.readStream.format("kafka") .option("kafka.bootstrap.servers",...) .option("subscribe", "topic") .load() .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) Anatomy of a Streaming Query 6 #UnifiedAnalytics #SparkAISummit Transformations Cast bytes from Kafka to a string, parse it as JSON, and generate nested columns 100s of built-in, optimized SQL functions like from_json user-defined functions, lambdas, function literals with map, flatMap…
  7. 7. spark.readStream.format("kafka") .option("kafka.bootstrap.servers",...) .option("subscribe", "topic") .load() .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .writeStream .format("parquet") .option("path", "/parquetTable/") Anatomy of a Streaming Query 7 Sink Write transformed output to external storage systems Built-in support for Files / Kafka Use foreach to execute arbitrary code with the output data Some sinks are transactional and exactly once (e.g. files)
  8. 8. spark.readStream.format("kafka") .option("kafka.bootstrap.servers",...) .option("subscribe", "topic") .load() .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .writeStream .format("parquet") .option("path", "/parquetTable/") .trigger("1 minute") .option("checkpointLocation", "…") .start() Anatomy of a Streaming Query 8 #UnifiedAnalytics #SparkAISummit Processing Details Trigger: when to process data - Fixed interval micro-batches - As fast as possible micro-batches - Continuously (new in Spark 2.3) Checkpoint location: for tracking the progress of the query
  9. 9. DataFrames, Datasets, SQL Logical Plan Read from Kafka Project device, signal Filter signal > 15 Write to Parquet Spark automatically streamifies! Spark SQL converts batch-like query to a series of incremental execution plans operating on new batches of data Kafka Source Optimized Operator codegen, off-heap, etc. Parquet Sink Optimized Plan Series of Incremental Execution Plans process new data at t = 1, t = 2, t = 3, … spark.readStream.format("kafka") .option("kafka.bootstrap.servers",...) .option("subscribe", "topic") .load() .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .writeStream .format("parquet") .option("path", "/parquetTable/") .trigger("1 minute") .option("checkpointLocation", "…") .start()
  10. 10. Anatomy of a Streaming Query Raw data from Kafka available as structured data in seconds, ready for querying spark.readStream.format("kafka") .option("kafka.bootstrap.servers",...) .option("subscribe", "topic") .load() .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .writeStream .format("parquet") .option("path", "/parquetTable/") .trigger("1 minute") .option("checkpointLocation", "…") .start() STRUCTURED STREAMING Complex ETL
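Assembled end to end, the query walked through on the slides above looks roughly like the following Scala sketch. The schema, Kafka servers, and checkpoint path are placeholder assumptions, and the one-minute trigger is spelled out as Trigger.ProcessingTime, which is what the `.trigger("1 minute")` shorthand on the slide corresponds to.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("kafka-to-parquet-etl").getOrCreate()
import spark.implicits._

// Placeholder schema for the nested JSON payload (assumption)
val schema = new StructType()
  .add("id", LongType)
  .add("update", StringType)
  .add("value", DoubleType)

// Source: read a stream of binary key/value records from Kafka
val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")   // assumption
  .option("subscribe", "topic")
  .load()
  // Transformations: cast bytes to string, parse JSON, flatten the nested columns
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")

// Sink + processing details: Parquet output, 1-minute micro-batches, checkpointing
val query = parsed.writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .option("checkpointLocation", "/checkpoints/etl")             // assumption
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

query.awaitTermination()
```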
  11. 11. My past talks: Deep dives 11 #UnifiedAnalytics #SparkAISummit A Deep Dive into Stateful Stream Processing in Structured Streaming Spark + AI Summit 2018 A Deep Dive Into Structured Streaming Spark Summit 2016 STRUCTURED STREAMING Complex ETL
  12. 12. This talk: Streaming Design Patterns Zoomed out view of your data pipeline 12 STRUCTURED STREAMING Complex ETL DATA INSIGHTS
  13. 13. Another streaming design pattern talk? Most talks Focus on a pure streaming engine Explain one way of achieving the end goal 13 This talk Spark is more than a streaming engine Spark has multiple ways of achieving the end goal with tunable perf, cost and quality
  14. 14. This talk How to think about design Common design patterns How we are making this easier 14
  15. 15. Streaming Pipeline Design 15 Data streams Insights ????
  16. 16. 16 ???? What? Why? How?
  17. 17. What? 17 What is your input? What is your data? What format and system is your data in? What is your output? What results do you need? What throughput and latency do you need?
  18. 18. Why? 18 Why do you want this output in this way? Who is going to take actions based on it? When and how are they going to consume it? humans? computers?
  19. 19. Why? Common mistakes! #1 "I want my dashboard with counts to be updated every second" #2 "I want to generate automatic alerts with up-to-the-last second count" (but my input data is often delayed) 19 No point in updating every second if humans are going to take actions in minutes or hours No point in taking fast actions on low-quality data and results
  20. 20. Why? Common mistakes! #3 "I want to train machine learning models on the results" (but my results are in a key-value store) 20 Key-value stores are not great for large, repeated data scans which machine learning workloads perform
  21. 21. How? 21 ???? How to process the data? ???? How to store the results?
  22. 22. Streaming Design Patterns 22 What? Why? How? Complex ETL
  23. 23. Pattern 1: ETL 23 Input: unstructured input stream from files, Kafka, etc. What? Why? Query latest structured data interactively or with periodic jobs 01:06:45 WARN id = 1 , update failed 01:06:45 INFO id=23, update success 01:06:57 INFO id=87: update postpo … Output: structured tabular data
  24. 24. P1: ETL 24 Convert unstructured input to structured tabular data Latency: few minutes What? How? Why? Query latest structured data interactively or with periodic jobs Process: Use Structured Streaming query to transform unstructured, dirty data Run 24/7 on a cluster with default trigger Store: Save to structured scalable storage that supports data skipping, etc. E.g.: Parquet, ORC, or even better, Delta Lake STRUCTURED STREAMING ETL QUERY 01:06:45 WARN id = 1 , update failed 01:06:45 INFO id=23, update success 01:06:57 INFO id=87: update postpo …
  25. 25. P1: ETL 25 How? Store: Save to STRUCTURED STREAMING Read with snapshot guarantees while writes are in progress Concurrently reprocess data with full ACID guarantees Coalesce small files into larger files Fix mistakes in existing data REPROCESS ETL QUERY 01:06:45 WARN id = 1 , update failed 01:06:45 INFO id=23, update success 01:06:57 INFO id=87: update postpo …
  26. 26. P1: Cheaper ETL 26 Convert unstructured input to structured tabular data Latency: few minutes → hours No need to keep clusters up 24/7 What? How? Why? Query latest data interactively or with periodic jobs Cheaper solution Process: Still use Structured Streaming query! Run the streaming query with "trigger.once" to process all data available since the last batch Set up external schedule (every few hours?) to periodically start a cluster and run one batch STRUCTURED STREAMING RESTART ON SCHEDULE 01:06:45 WARN id = 1 , update failed 01:06:45 INFO id=23, update success 01:06:57 INFO id=87: update postpo …
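A minimal sketch of this "run on a schedule" variant, assuming the same `parsed` DataFrame and checkpoint path as in the earlier ETL sketch: only the trigger changes, so each scheduled run processes whatever accumulated since the last run and then stops.

```scala
import org.apache.spark.sql.streaming.Trigger

// Same ETL query, but it processes all data available since the last run and then stops,
// so the cluster only needs to be up while the batch runs.
val query = parsed.writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .option("checkpointLocation", "/checkpoints/etl")   // reuse the same checkpoint across runs
  .trigger(Trigger.Once())
  .start()

query.awaitTermination()   // returns once the single micro-batch has finished
```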
  27. 27. P1: Query faster than ETL! 27 Latency: hours → seconds What? How? Why? Query the latest, up-to-the-last-second data interactively Query data in Kafka directly using Spark SQL Can process up to the last records received by Kafka when the query was started SQL
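A sketch of querying Kafka directly with a batch (non-streaming) read, assuming the same topic, servers, schema, and SparkSession as in the earlier sketch; startingOffsets/endingOffsets bound the scan to what Kafka holds when the query runs.

```scala
import org.apache.spark.sql.functions._

// Batch read of the records currently retained in Kafka
val events = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")   // assumption
  .option("subscribe", "topic")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), schema).as("data"))
  .select("data.*")

// Query it interactively with Spark SQL
events.createOrReplaceTempView("latest_events")
spark.sql("SELECT id, count(*) AS cnt FROM latest_events GROUP BY id").show()
```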
  28. 28. Pattern 2: Key-value output 28 Input: new data for each key What? Why? Lookup latest value for key (dashboards, websites, etc.) OR Summary tables for querying interactively or with periodic jobs KEY LATEST VALUE key1 value2 key2 value3 { "key1": "value1" } { "key1": "value2" } { "key2": "value3" } Output: updated values for each key Aggregations (sum, count, …) Sessionizations
  29. 29. P2.1: Key-value output for lookup 29 Generate updated values for keys Latency: seconds/minutes What? How? Why? Lookup latest value for key STRUCTURED STREAMING Process: Use Structured Streaming with stateful operations for aggregation Store: Save in key-value stores optimized for single-key lookups STATEFUL AGGREGATION LOOKUP { "key1": "value1" } { "key1": "value2" } { "key2": "value3" }
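A sketch of the stateful-aggregation half of this pattern, assuming the `parsed` stream from the earlier sketch carries an event-time column named eventTime; the write into an external key-value store is left as a stub because the talk does not name a particular store or client API.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Stateful aggregation: running counts per key, with a watermark to bound state
val counts = parsed
  .withWatermark("eventTime", "10 minutes")                    // eventTime column is an assumption
  .groupBy(col("id"), window(col("eventTime"), "5 minutes"))
  .count()

// Upsert each micro-batch of updated counts into a key-value store.
// The store and its client API are deployment-specific, so this is just a stub.
def upsertToKeyValueStore(updated: DataFrame, batchId: Long): Unit = {
  updated.collect().foreach(row => println(s"upsert: $row"))
}

val query = counts.writeStream
  .outputMode("update")                                        // emit only keys updated in each micro-batch
  .foreachBatch(upsertToKeyValueStore _)
  .option("checkpointLocation", "/checkpoints/kv")             // path is an assumption
  .start()
```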
  30. 30. P2.2: Key-value output for analytics 30 Generate updated values for keys Latency: seconds/minutes What? How? Why? Lookup latest value for key Summary tables for analytics Process: Use Structured Streaming with stateful operations for aggregation Store: Save in Delta Lake! Managed Delta Lake supports upserts using SQL MERGE STRUCTURED STREAMING STATEFUL AGGREGATION SUMMARY { "key1": "value1" } { "key1": "value2" } { "key2": "value3" }
  31. 31. P2.2: Key-value output for analytics 31 How? STRUCTURED STREAMING STATEFUL AGGREGATION streamingDataFrame.foreachBatch { batchOutput => spark.sql(""" MERGE INTO deltaTable USING batchOutput WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ... """) }.start() SUMMARY stateful operations for aggregation Managed Delta Lake supports upserts using SQL MERGE Coming to OSS Delta Lake soon! { "key1": "value1" } { "key1": "value2" } { "key2": "value3" }
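A slightly expanded, hedged version of the MERGE snippet above, reusing the `counts` stream from the previous sketch: the micro-batch has to be registered as a temporary view before SQL can reference it, and the Delta table and column names (summary, key, value) are assumptions.

```scala
import org.apache.spark.sql.DataFrame

// Upsert one micro-batch of aggregated rows into a Delta table via SQL MERGE.
def mergeIntoSummary(batchOutput: DataFrame, batchId: Long): Unit = {
  batchOutput.createOrReplaceTempView("batch_output")   // make the micro-batch visible to SQL
  batchOutput.sparkSession.sql("""
    MERGE INTO summary s
    USING batch_output b
    ON s.key = b.key
    WHEN MATCHED THEN UPDATE SET s.value = b.value
    WHEN NOT MATCHED THEN INSERT (key, value) VALUES (b.key, b.value)
  """)
}

val query = counts.writeStream
  .outputMode("update")
  .foreachBatch(mergeIntoSummary _)
  .option("checkpointLocation", "/checkpoints/summary")   // path is an assumption
  .start()
```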
  32. 32. P2.2: Key-value output for analytics 32 How? Stateful aggregation requires setting a watermark to drop very late data Dropping some data leads to some inaccuracies Stateful operations for aggregation Managed Delta Lake supports upserts using SQL MERGE STRUCTURED STREAMING STATEFUL AGGREGATION SUMMARY { "key1": "value1" } { "key1": "value2" } { "key2": "value3" }
  33. 33. P2.3: Key-value aggregations for analytics 33 Generate aggregated values for keys Latency: hours/days Do not drop any late data What? How? Why? Summary tables for analytics Correct Process: ETL to structured table (no stateful aggregation) Store: Save in Delta Lake Post-process: Aggregate after all delayed data received STRUCTURED STREAMING STATEFUL AGGREGATION ETL AGGREGATION SUMMARY { "key1": "value1" } { "key1": "value2" } { "key2": "value3" }
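A sketch of the post-processing step in this pattern: a scheduled batch job that recomputes aggregates over the full ETL'd table once late data has had time to arrive. The Delta paths and column names are assumptions, and the Delta Lake library is assumed to be on the classpath.

```scala
import org.apache.spark.sql.functions._

// Scheduled batch job (e.g. nightly): aggregate the complete, late-data-inclusive table
val events = spark.read.format("delta").load("/delta/events")   // output of the ETL query (assumption)

events
  .groupBy(col("key"), window(col("eventTime"), "1 hour"))
  .agg(count("*").as("cnt"), sum(col("value")).as("total"))
  .write
  .format("delta")
  .mode("overwrite")                                            // rewrite the summary with corrected numbers
  .save("/delta/summary")
```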
  34. 34. Pattern 3: Joining multiple inputs 34 #UnifiedAnalytics #SparkAISummit { "id": 14, "name": "td", "v": 100.. { "id": 23, "name": "by", "v": -10.. { "id": 57, "name": "sz", "v": 34.. … 01:06:45 WARN id = 1 , update failed 01:06:45 INFO id=23, update success 01:06:57 INFO id=87: update postpo … id update value Input: Multiple data streams based on common key Output: Combined information What?
  35. 35. P3.1: Joining fast and slow data 35 Input: One fast stream of facts and one slow stream of dimension changes Output: Fast stream enriched by data from slow stream Example: product sales info enriched by more product info What + Why? { "id": 14, "name": "td", "v": 100.. { "id": 23, "name": "by", "v": -10.. { "id": 57, "name": "sz", "v": 34.. … 01:06:45 WARN id = 1 , update failed 01:06:45 INFO id=23, update success 01:06:57 INFO id=87: update postpo … id update value How? ETL slow stream to a dimension table Join fast stream with snapshots of the dimension table FAST ETL DIMENSION TABLE COMBINED TABLE JOIN SLOW ETL
  36. 36. P3.1: Joining fast and slow data 36 { "id": 14, "name": "td", "v": 100.. { "id": 23, "name": "by", "v": -10.. { "id": 57, "name": "sz", "v": 34.. … 01:06:45 WARN id = 1 , update failed 01:06:45 INFO id=23, update success 01:06:57 INFO id=87: update postpo … id update value SLOW ETL How? - Caveats FAST ETL JOIN COMBINED TABLE DIMENSION TABLE Caveat: Structured Streaming by default does not reload the dimension table snapshot, so changes made by the slow ETL won't be seen until restart Better Solution: Store the dimension table in Delta Lake Delta Lake's versioning allows changes to be detected and the snapshot automatically reloaded without restart** ** available only in Managed Delta Lake in Databricks Runtime
  37. 37. P3.1: Joining fast and slow data 37 How? - Caveats Delays in updates to the dimension table can cause joining with stale dimension data E.g. the sale of a product is received even before the product table has any info on the product Better Solution: Treat it as "joining fast and fast data"
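A sketch of the join step in this pattern, assuming the fast stream and the dimension table are both stored as Delta tables at the paths shown; whether the dimension snapshot is refreshed between micro-batches depends on the table format and runtime, as the caveats above note.

```scala
// Fast stream of facts (e.g. sales) joined against a slowly changing dimension table
val sales = spark.readStream.format("delta").load("/delta/sales")     // path is an assumption
val products = spark.read.format("delta").load("/delta/products")     // static side of the join

val enriched = sales.join(products, Seq("product_id"), "left_outer")  // keep sales with no product info yet

enriched.writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/enriched")
  .start("/delta/enriched_sales")
```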
  38. 38. P3.2: Joining fast and fast data 38 Input: Two fast streams where either stream may be delayed Output: Combined info even if one is delayed relative to the other Example: ad impressions and ad clicks Use stream-stream joins in Structured Streaming Data will be buffered as state Watermarks define how long to buffer before giving up on matching What + Why? id update value STREAM STREAM JOIN { "id": 14, "name": "td", "v": 100.. { "id": 23, "name": "by", "v": -10.. { "id": 57, "name": "sz", "v": 34.. … 01:06:45 WARN id = 1 , update failed 01:06:45 INFO id=23, update success 01:06:57 INFO id=87: update postpo … See my previous deep dive talk for more info
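A sketch of the stream-stream join for the ad-impressions/ad-clicks example, along the lines of the pattern in the Structured Streaming documentation; topic names, schemas, watermark durations, and the one-hour join window are assumptions.

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val impressionSchema = new StructType()
  .add("adId", StringType).add("impressionTime", TimestampType)
val clickSchema = new StructType()
  .add("adId", StringType).add("clickTime", TimestampType)

def fromKafka(topic: String, schema: StructType) =
  spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")             // assumption
    .option("subscribe", topic)
    .load()
    .select(from_json(col("value").cast("string"), schema).as("data"))
    .select("data.*")

// Watermarks bound how long unmatched rows are buffered as state
val impressions = fromKafka("impressions", impressionSchema)
  .withWatermark("impressionTime", "2 hours")
  .withColumnRenamed("adId", "impressionAdId")
val clicks = fromKafka("clicks", clickSchema)
  .withWatermark("clickTime", "3 hours")
  .withColumnRenamed("adId", "clickAdId")

// Join with an event-time constraint so buffered state can eventually be dropped
val matched = impressions.join(
  clicks,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))
```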
  39. 39. Pattern 4: Change data capture 39 #UnifiedAnalytics #SparkAISummit INSERT a, 1 INSERT b, 2 UPDATE a, 3 DELETE b INSERT b, 4 key value a 3 b 4 Input: Change data based on a primary key Output: Final table after changes End-to-end replication of transactional tables into analytical tables What? Why?
  40. 40. P4: Change data capture 40 How? Use Structured Streaming and Delta Lake! In each batch, apply changes to the Delta table using MERGE MERGE in Managed Delta Lake supports UPDATE, INSERT and DELETE Coming soon to OSS Delta Lake! INSERT a, 1 INSERT b, 2 UPDATE a, 3 DELETE b INSERT b, 4 STRUCTURED STREAMING streamingDataFrame.foreachBatch { batchOutput => spark.sql(""" MERGE INTO deltaTable USING batchOutput WHEN MATCHED ... THEN UPDATE ... WHEN MATCHED ... THEN DELETE ... WHEN NOT MATCHED THEN INSERT ... """) }.start() See Databricks docs for more info
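A hedged sketch of applying CDC with foreachBatch: deduplicate to the latest change per key within the micro-batch, then MERGE with conditional clauses so deletes, updates, and inserts are all handled. The change-record columns (key, value, op, changeTime), the `changeStream` DataFrame, and the target table name are assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Apply one micro-batch of change records to the Delta table.
def applyChanges(changes: DataFrame, batchId: Long): Unit = {
  // Keep only the most recent change per key within this micro-batch
  val latest = changes
    .withColumn("rn", row_number().over(
      Window.partitionBy("key").orderBy(col("changeTime").desc)))
    .where(col("rn") === 1)
    .drop("rn")

  latest.createOrReplaceTempView("changes")
  latest.sparkSession.sql("""
    MERGE INTO target t
    USING changes c
    ON t.key = c.key
    WHEN MATCHED AND c.op = 'DELETE' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.value = c.value
    WHEN NOT MATCHED AND c.op != 'DELETE' THEN INSERT (key, value) VALUES (c.key, c.value)
  """)
}

val query = changeStream.writeStream                 // changeStream: parsed stream of change records (assumption)
  .foreachBatch(applyChanges _)
  .option("checkpointLocation", "/checkpoints/cdc")
  .start()
```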
  41. 41. Pattern 5: Writing to multiple outputs 41 id val1 val2 id sum count RAW LOGS + SUMMARIES TABLE 2 TABLE 1 { "id": 14, "name": "td", "v": 100.. { "id": 23, "name": "by", "v": -10.. { "id": 57, "name": "sz", "v": 34.. … SUMMARY FOR ANALYTICS + SUMMARY FOR LOOKUP TABLE CHANGE LOG + UPDATED TABLE What?
  42. 42. P5: Serial or Parallel? 42 id val1 val2 id sum count TABLE 2 TABLE 1 { "id": 14, "name": "td", "v": 100.. { "id": 23, "name": "by", "v": -10.. { "id": 57, "name": "sz", "v": 34.. … Parallel How? id val1 val2 TABLE 1 { "id": 14, "name": "td", "v": 100.. { "id": 23, "name": "by", "v": -10.. { "id": 57, "name": "sz", "v": 34.. … id val1 val2 TABLE 2 Serial: writes table 1 and reads it again; cheap or expensive depending on the size and format of table 1; higher latency Parallel: reads the input twice, may have to parse the data twice; cheap or expensive depending on the size of the raw input + parsing cost
  43. 43. P5: Combination! 43 Combo 1: Multiple streaming queries id val1 val2 TABLE 1 { "id": 14, "name": "td", "v": 100.. { "id": 23, "name": "by", "v": -10.. { "id": 57, "name": "sz", "v": 34.. … id val1 val2 TABLE 2 id val1 val2 TABLE 3 Do expensive parsing once, write to table 1 Do cheaper follow-up processing from table 1 Good for ETL + multiple levels of summaries Change log + updated table Still writing + reading table 1 Combo 2: Single query + foreachBatch EXPENSIVE CHEAP { "id": 14, "name": "td", "v": 100.. { "id": 23, "name": "by", "v": -10.. { "id": 57, "name": "sz", "v": 34.. … id val1 val2 LOCATION 1 id val1 val2 LOCATION 2 Compute once, write multiple times Cheaper, but loses the exactly-once guarantee streamingDataFrame.foreachBatch { batchSummaryData => // write summary to Delta Lake // write summary to key value store }.start() How?
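A sketch of "Combo 2": compute each micro-batch once, cache it, and write it to two destinations from foreachBatch. The `summaries` stream, the paths, and the second sink's format are assumptions; as the slide notes, splitting the write across sinks trades away the single-sink exactly-once guarantee.

```scala
import org.apache.spark.sql.DataFrame

// Compute the summary once per micro-batch, cache it, and write it to two destinations.
def writeToTwoSinks(batch: DataFrame, batchId: Long): Unit = {
  batch.persist()                                                              // reuse the batch for both writes

  batch.write.format("delta").mode("append").save("/delta/summary")            // location 1: analytics table
  batch.write.format("json").mode("append").save("/exports/summary-for-kv")    // location 2: export for a KV-store loader

  batch.unpersist()
}

val query = summaries.writeStream                        // summaries: aggregated streaming DataFrame (assumption)
  .foreachBatch(writeToTwoSinks _)
  .option("checkpointLocation", "/checkpoints/multi-output")
  .start()
```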
  44. 44. Production Pipelines @ REPORTING STREAMING ANALYTICS Bronze Tables Silver Tables Gold Tables STRUCTURED STREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTUREDSTREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTUREDSTREAMING
  45. 45. Production Pipelines @ REPORTING STREAMING ANALYTICS Bronze Tables Silver Tables Gold Tables STRUCTURED STREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTUREDSTREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTUREDSTREAMING STRUCTURED STREAMING Productionizing a single pipeline requires the following: Unit testing Integration testing Deploying Monitoring Upgrading See Burak's talk Productionizing Structured Streaming Jobs
  46. 46. Production Pipelines @ Bronze Tables Silver Tables Gold TablesBut we really need to productionize the entire DAG of pipelines! REPORTING STREAMING ANALYTICS STRUCTURED STREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTUREDSTREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTUREDSTREAMING STRUCTURED STREAMING
  47. 47. Future direction Declarative APIs to specify and manage entire DAGs of pipelines 47
  48. 48. Delta Pipelines REPORTING STREAMING ANALYTICS STRUCTURED STREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTUREDSTREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTUREDSTREAMING STRUCTURED STREAMING 48 source( name = "raw", df = spark.readStream…) sink( name = "warehouse", (df) => df.writeStream… flow( name = "ingest", df = input("raw") .withColumn(…), dest = "warehouse", latency = "5 minutes") A small library on top of Apache Spark that lets users express their whole pipeline as a dataflow graph
  49. 49. Delta Pipelines 49 STRUCTURED STREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTUREDSTREAMING STRUCTURED STREAMING STRUCTURED STREAMING STRUCTUREDSTREAMING STRUCTURED STREAMING REPORTING STREAMING ANALYTICS A small library on top of Apache Spark that lets users express their whole pipeline as a dataflow graph Specify expectations with Delta tables Unit test DAG with sample input data Integration test DAG with production data Deploy/upgrade new code of the entire DAG Rollback code and/or data when needed Monitor entire DAG in one place
  50. 50. STAY TUNED!
  51. 51. Thanks for listening! AMA at 4:00 PM today #UnifiedAnalytics #SparkAISummit
