
Taking Spark Streaming to the Next Level with Datasets and DataFrames

Talk from Strata San Jose 2016 about the future of the Spark streaming API.


  1. 1. Taking Spark Streaming to the next level with Datasets and DataFrames Tathagata “TD” Das @tathadas Strata San Jose 2016
  2. 2. Streaming in Spark: Spark Streaming changed how people write streaming apps. Functional, concise and expressive; fault-tolerant state management; unified stack with batch processing (SQL, Streaming, MLlib, GraphX on Spark Core). More than 50% of users consider it the most important part of Spark.
  3. 3. Streaming apps are growing more complex
  4. 4. Streaming computations don't run in isolation. Need to interact with batch data, interactive analysis, machine learning, etc.
  5. 5. Use case: IoT Device Monitoring. IoT devices emit an event stream. ETL into long-term storage - prevent data loss, prevent duplicates. Status monitoring - handle late data, process based on event time. Interactively debug issues - consistency. Anomaly detection - learn models offline, use online + continuous learning.
  6. 6. Pain points with DStreams: 1. Processing with event time, dealing with late data - the DStream API exposes batch time, making it hard to incorporate event time. 2. Interoperating streaming with batch AND interactive - RDD/DStream have similar APIs but still require translation; hard to interoperate with DataFrames and Datasets. 3. Reasoning about end-to-end guarantees - requires carefully constructing sinks that handle failures correctly; data consistency in the storage while it is being updated.
  7. 7. Structured Streaming
  8. 8. The simplest way to perform streaming analytics is not having to reason about streaming at all
  9. 9. Model: the input is data from the source viewed as an append-only table; at each trigger (e.g. every 1 sec), the input table contains all data seen up to that point. Trigger: how frequently to check the input for new data. Query: operations on the input - the usual map/filter/reduce plus new window and session ops.
  10. 10. Model (continued): the result is the final operated table, updated at every trigger interval; the query produces output for the data seen up to each trigger. Output: what part of the result to write to the data sink after every trigger. Complete output: write the full result table every time.
  11. 11. Model (continued): output modes. Complete output: write the full result table every time. Delta output: write only the rows that changed in the result since the previous batch. Append output: write only new rows. *Not all output modes are feasible with all queries.
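A toy, Spark-free illustration of the three output modes (result_v1 and result_v2 are made-up running counts standing for the result table after two successive triggers):

    # Result table after trigger 1 and trigger 2 (hypothetical values)
    result_v1 = {"cat": 2, "dog": 1}
    result_v2 = {"cat": 2, "dog": 2, "cow": 1}

    # Complete output: write the full result table every time
    complete = result_v2

    # Delta output: write only the rows that changed since the previous batch
    delta = {k: v for k, v in result_v2.items() if result_v1.get(k) != v}
    # -> {"dog": 2, "cow": 1}

    # Append output: write only rows that are new in the result table
    append = {k: v for k, v in result_v2.items() if k not in result_v1}
    # -> {"cow": 1}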
  12. 12. Single API: a Dataset/DataFrame can represent both a static, bounded table and a streaming, unbounded table.
  13. 13. Batch ETL with DataFrame input = ctxt.read .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.write .format("parquet") .save("dest-path") Read from Json file Select some devices Write to parquet file
  14. 14. Streaming ETL with DataFrame input = ctxt.read .format("json") .stream("source-path") result = input .select("device", "signal") .where("signal > 15") result.write .format("parquet") .outputMode("append") .startStream("dest-path") Read from Json file stream Select some devices Write to parquet file stream
  15. 15. Streaming ETL with DataFrame input = ctxt.read .format("json") .stream("source-path") result = input .select("device", "signal") .where("signal > 15") result.write .format("parquet") .outputMode("append") .startStream("dest-path") read…stream() creates a streaming DataFrame, does not start any of the computation. write…startStream() defines where & how to output the data and starts the processing.
  16. 16. Streaming ETL with DataFrame input = ctxt.read .format("json") .stream("source-path") result = input .select("device", "signal") .where("signal > 15") result.write .format("parquet") .outputMode("append") .startStream("dest-path") The result is an append-only table; in append output mode, only the new rows added to the result at each trigger are written to the sink.
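A minimal runnable sketch of the same streaming ETL, assuming the API names that eventually shipped (readStream/writeStream/start() in place of the pre-release read…stream()/write…startStream() shown on the slides); the paths, schema, and checkpoint location are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

    # Streaming file sources need an explicit schema
    schema = StructType([
        StructField("device", StringType()),
        StructField("signal", DoubleType()),
    ])

    # Creates a streaming DataFrame; does not start any computation yet
    input = spark.readStream.schema(schema).json("source-path")

    result = input.select("device", "signal").where("signal > 15")

    # Defines where & how to output the data and starts the processing
    query = (result.writeStream
             .format("parquet")
             .option("path", "dest-path")
             .option("checkpointLocation", "checkpoint-path")
             .outputMode("append")
             .start())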
  17. 17. Continuous Aggregations. Continuously compute the average signal of each type of device: input.groupBy("device-type") .avg("signal") Continuously compute the average signal of each type of device over the last 10 minutes of event time: input.groupBy( window("event-time", "10min"), "device-type") .avg("signal") - Windowing is just a type of aggregation - Simple API for event-time based windowing
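A sketch of the two aggregations above, assuming the released pyspark.sql.functions.window helper and a streaming DataFrame `events` with illustrative deviceType, signal, and eventTime columns:

    from pyspark.sql.functions import avg, window

    # `events` is a streaming DataFrame read as in the previous sketch
    # Average signal per device type over the whole stream
    by_type = events.groupBy("deviceType").agg(avg("signal"))

    # Average signal per device type over 10-minute event-time windows;
    # windowing is expressed as just another grouping column
    by_window = (events
                 .groupBy(window("eventTime", "10 minutes"), "deviceType")
                 .agg(avg("signal")))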
  18. 18. Query Management query = result.write .format("parquet") .outputMode("append") .startStream("dest-path") query.stop() query.awaitTermination() query.exception() query.sourceStatuses() query.sinkStatus() query: a handle to the running streaming computation for managing it - stop it, wait for it to terminate - get status - get error, if terminated. Multiple queries can be active at the same time. Each query has a unique name for keeping track.
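A sketch of managing a running query, assuming the StreamingQuery handle as it appears in released Spark (some of the slide's methods, e.g. sourceStatuses()/sinkStatus(), were later replaced by status and lastProgress); `query` is the handle returned by start() in the earlier sketch:

    query.name                 # optional name set via queryName()
    query.isActive             # is the stream still running?
    query.status               # current status of the query
    query.lastProgress         # metrics for the most recent trigger
    query.exception()          # the error, if the query terminated with one

    query.awaitTermination()   # block until the query stops or fails
    query.stop()               # stop the streaming computation

    # Multiple queries can be active at the same time on one session
    spark.streams.active       # all currently active streaming queries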
  19. 19. Logically: DataFrame operations on a table (i.e. as easy to understand as batch). Physically: Spark automatically runs the query in streaming fashion (i.e. incrementally and continuously). DataFrame → Logical Plan → (Catalyst optimizer) → Continuous, incremental execution.
  20. 20. Structured Streaming: high-level streaming API built on the Spark SQL engine. Runs the same computation as batch queries in Datasets/DataFrames. Event time, windowing, sessions, sources & sinks. End-to-end exactly-once semantics. Unifies streaming, interactive and batch queries: aggregate data in a stream, then serve using JDBC; add, remove, change queries at runtime; build and apply ML models.
  21. 21. Advantages over DStreams: 1. Processing with event time, dealing with late data. 2. Exactly the same API for batch, streaming, and interactive. 3. End-to-end exactly-once guarantees from the system. 4. Performance through SQL optimizations - logical plan optimizations, Tungsten, codegen, etc.; faster state management for stateful stream processing.
  22. 22. Underneath the Hood
  23. 23. Batch Execution on Spark SQL: a DataFrame/Dataset is an abstract representation of a query as a logical plan. The planner performs logical plan optimization and execution plan generation (with Tungsten code generation), producing an execution plan that runs as RDD jobs.
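These stages can be inspected on any DataFrame; a quick example (any DataFrame such as `result` from the earlier sketch will do):

    # Prints the parsed/analyzed/optimized logical plans and the physical plan
    result.explain(True)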
  24. 24. Continuous Incremental Execution: the planner is extended to be aware of streaming logical plans. The planner generates a continuous series of incremental execution plans, each processing the next chunk of streaming data.
  25. 25. Streaming Source: a streaming data source where - records can be uniquely identified by an offset - an arbitrary segment of the stream data can be read based on an offset range - files, Kafka, Kinesis supported. More restricted than DStream receivers; can ensure end-to-end exactly-once guarantees.
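A hypothetical Python sketch of the contract described above (not Spark's actual internal Source interface); the class and method names are made up for illustration:

    from abc import ABC, abstractmethod

    class StreamingSource(ABC):
        """A source whose records are addressable by offset."""

        @abstractmethod
        def latest_offset(self):
            """Return the highest offset currently available."""

        @abstractmethod
        def get_batch(self, start_offset, end_offset):
            """Return all records in (start_offset, end_offset].
            Must be repeatable so a failed batch can be re-read,
            which is what enables exactly-once guarantees."""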
  26. 26. Streaming Sink: allows output to be pushed to external storage. Each batch of output should be written to storage atomically and idempotently. Both are needed to ensure end-to-end exactly-once guarantees AND data consistency.
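A matching hypothetical sketch of the sink contract (again, illustrative names, not Spark's internal API): each batch is written atomically and keyed by a batch id, so a replayed batch becomes a no-op:

    from abc import ABC, abstractmethod

    class StreamingSink(ABC):
        @abstractmethod
        def add_batch(self, batch_id, data):
            """Write `data` for `batch_id` atomically; if `batch_id`
            has already been committed, do nothing (idempotence)."""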
  27. 27. Incremental Execution: every trigger interval, the planner - asks the source for the next chunk of input data - generates an execution plan with that input - generates the output data - hands it to the sink for pushing out.
  28. 28. Offset Tracking with WAL: the planner saves the next offset range in a write-ahead log (WAL) on HDFS/S3 before incremental execution starts.
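A simplified, plain-Python sketch (not Spark internals) of how the pieces fit together in one trigger: the intended offset range is logged to the WAL before any processing, so a crash mid-batch can be replayed:

    def run_one_trigger(source, sink, offset_wal, process, batch_id, last_offset):
        # 1. Decide what input range to process next
        end_offset = source.latest_offset()
        # 2. Save the offset range to the WAL *before* processing starts
        offset_wal.write(batch_id, (last_offset, end_offset))
        # 3. Read exactly that range and run the incremental execution plan
        output = process(source.get_batch(last_offset, end_offset))
        # 4. Hand the output to the sink; the write is atomic and idempotent
        sink.add_batch(batch_id, output)
        return end_offset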
  29. 29. Recovery from WAL: after a failure, the restarted planner recovers the last offset range from the WAL and restarts the failed execution. The output is exactly the same as it would have been without the failure.
  30. 30. Streaming source + streaming sink + offset tracking = end-to-end exactly-once guarantees
  31. 31. Stateful Stream Processing: streaming aggregations require maintaining intermediate "state" data across batches (e.g. running counts per key carried from one batch to the next).
  32. 32. State Management: state needs to be fault-tolerant - both the input and the previous state are needed to recover from a failure.
  33. 33. Old State Management with DStreams: DStreams (i.e. updateStateByKey, mapWithState) represented state data as RDDs. Leveraged RDD lineage for fault tolerance and RDD checkpointing to prevent unbounded lineage. RDD checkpointing saves all state data to HDFS/S3 - inefficient when update rates are low, no "incremental" checkpointing.
  34. 34. New State Management: State Store. State Store API: any versioned key-value store that - allows a set of key-value updates to be transactionally committed with a version number (i.e. incremental checkpointing) - allows a specific version of the key-value data to be retrieved.
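A hypothetical in-memory sketch of the versioned key-value contract described above (names are illustrative): a set of updates commits atomically as a new version, and any committed version can be loaded back for recovery:

    class VersionedStateStore:
        def __init__(self):
            self._versions = {0: {}}      # version -> committed key/value map

        def load(self, version):
            """Return the state as of a previously committed version."""
            return dict(self._versions[version])

        def commit(self, version, updates):
            """Apply `updates` on top of `version` and commit them
            transactionally as version + 1 (incremental checkpointing)."""
            new_state = dict(self._versions[version])
            new_state.update(updates)
            self._versions[version + 1] = new_state
            return version + 1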
  35. 35. HDFS-backed State Store: implementation of the State Store API in Spark 2.0 - an in-memory hashmap backed by files in HDFS. Each executor has hashmap(s) containing versioned data; updates are committed as delta files in HDFS/S3; delta files are periodically collapsed into snapshots to improve recovery.
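A toy illustration of the delta-file scheme (plain Python, not the actual implementation): each committed version writes only a small delta, and recovery replays the deltas in order on top of the last snapshot:

    def recover(snapshot, deltas):
        """Rebuild the latest state from a snapshot plus later delta files."""
        state = dict(snapshot)
        for delta in deltas:            # deltas applied in version order
            state.update(delta)
        return state

    # e.g. recover({"cat": 3}, [{"dog": 1}, {"cat": 4, "cow": 1}])
    #      == {"cat": 4, "dog": 1, "cow": 1}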
  36. 36. Fault recovery of the HDFS State Store: the State Store version is also stored in the WAL. On fault recovery - recover input data from the source using the last offsets - recover state data using the last store version.
  37. 37. Fast, fault-tolerant, exactly-once stateful stream processing without having to reason about streaming
  38. 38. Plan for Spark 2.0: basic infrastructure and API - event time, windows, aggregations. Files as sources and sinks. Kafka as source, SQL as sink(?)
  39. 39. Plan for Spark 2.1+: more support for late data; dynamic scaling; public API for sources and sinks - extends the DataSource API; more sources and sinks; ML integrations. Spark XXX: "true streaming" engine - the API is already agnostic to micro-batching.
  40. 40. Simple and fast real-time analytics • Develop productively • Execute efficiently • Update automatically. Questions? Learn more: AMA today, 1:50-2:30. Follow me @tathadas
