Introduction to Structured Streaming
datamantra
Spark streaming 2.0 - Structured streaming
1.
© 2015 IBM Corporation
Agenda
- Spark Streaming 1.X
  • Features
  • Areas for Improvement
- Spark Streaming 2.0 - Structured Streaming
  • Addressing the Improvement Areas
  • API
  • Fault Tolerance
  • Event Time
  • Managing Streaming Queries
- Structured Streaming Examples
  https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming
- Summary thoughts
2.
Spark Streaming 1.X
Features of Spark Streaming:
- High-level API (stateful, joins, aggregates, windows etc.)
  • Overlap with RDD API (batch)
- Fault-tolerant (exactly-once semantics achievable)
- Back pressure
- Deep integration with the Spark ecosystem (MLlib, SQL, GraphX etc.)
Apache Hadoop Day 2015
3.
Spark Streaming 1.X - Areas of improvement
Fault tolerance:
- For end-to-end exactly-once guarantees, the user needs to do all the heavy lifting in the Sink
- Can that be handled in a very simple way for the end user?
4.
Fault-Tolerant Semantics
[Diagram: Source -> Receiver -> Transforming -> Outputting -> Sink]
- Exactly once, if outputs are idempotent or transactional
- Exactly once, as long as received data is not lost
- Exactly once needs replayable sources (e.g. Kafka Direct)
5.
Spark Streaming 1.X - Areas of improvement
Fault tolerance:
- For end-to-end exactly-once guarantees, the user needs to do all the heavy lifting in the Sink
API:
- Request for a more seamless API between Batch & Stream
- Reduce the complexities of a streaming app
No event-time support:
- Hard to support when processing time / batch time is exposed externally
Streaming query management
Micro-batch
6.
Spark Streaming 2.0 API
Built on top of the Spark SQL engine
Implicit benefits:
- Extend the primary Batch API to Streaming as well
- Gain an optimizer and all other enhancements done in Spark SQL
Challenge:
- Remove streaming complexities, or keep them to a minimum
7.
Let's dive in
8.
SQL Batch vs SQL Streaming - Conceptually
9.
Batch vs Streaming - Programmatically
10.
Output Modes - Sink
The output mode defines what gets written from the Result table to external storage (the Sink).
Output modes:
- Complete: the entire updated Result table is written to external storage.
- Append: only the new rows added to the Result table since the last incremental query execution are written to external storage.
- Update: only the rows updated in the Result table since the last incremental query execution are written to external storage. It is up to the storage connector implementation to decide how to write them.
* Aggregate queries support only Complete mode, and non-aggregate queries only Append mode.
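The three output modes can be illustrated with a small plain-Python model of a running word count. This is only a sketch of the idea on the slide, not the Spark API: `run_trigger` and the dictionary keys are hypothetical names, and the Result table is modelled as a `Counter`.

```python
from collections import Counter


def run_trigger(result, new_rows):
    """Fold one trigger's new words into the Result table (a Counter).

    Returns what each output mode would hand to the sink for this trigger.
    """
    before = dict(result)           # snapshot of the Result table before this trigger
    result.update(new_rows)         # Counter.update adds one count per word
    changed = {k: v for k, v in result.items() if before.get(k) != v}
    return {
        "complete": dict(result),   # the entire updated Result table
        "update": changed,          # rows that are new or whose value changed
        "append": {k: v for k, v in changed.items() if k not in before},  # new rows only
    }


result = Counter()
first = run_trigger(result, ["spark", "flink", "spark"])
second = run_trigger(result, ["spark", "beam"])
print(second["complete"])
print(second["update"])
print(second["append"])
```

In the second trigger the row for "spark" changes and the row for "beam" is new, so Update carries both while Append carries only "beam". An appended aggregate row can still change in a later trigger, which is one way to read the slide's note that aggregate queries support only Complete mode.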
11.
Supported Sinks & Modes in 2.0
[Table image not recovered; two entries are marked * DEBUG ONLY]
12.
Windowing in Structured Streaming
13.
Window operations
- Continuous time-based aggregations are the most common in streaming applications
  • Sliding window & tumbling window
  • E.g. top x hashtags on Twitter in the last half hour, every 5 minutes
- New function that treats windowing as a regular aggregation
- Used in a Group By clause
- Can be used in Batch as well
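Treating a window as a regular grouping key can be sketched in plain Python: each event is assigned to every window that contains its timestamp, and the (window, key) pair is then counted like any other group. This is a minimal model, not Spark's window function; `windows_for` and `windowed_counts` are hypothetical names, and timestamps are assumed to be integer seconds.

```python
from collections import Counter


def windows_for(ts, length, slide):
    """Return the [start, end) windows, in seconds, that contain timestamp ts."""
    # Latest window start at or before ts, aligned to the slide interval.
    last_start = ts - (ts % slide)
    # Walk back one slide at a time while the window still covers ts.
    starts = range(last_start, ts - length, -slide)
    return sorted((s, s + length) for s in starts)


def windowed_counts(events, length, slide):
    """Count (window, tag) groups for an iterable of (timestamp, tag) events."""
    counts = Counter()
    for ts, tag in events:
        for window in windows_for(ts, length, slide):
            counts[(window, tag)] += 1
    return counts


# Sliding: half-hour windows evaluated every 5 minutes -> each event falls in 6 windows.
print(len(windows_for(7500, 1800, 300)))

# Tumbling: length == slide, so each event falls in exactly one window.
events = [(7500, "#spark"), (7510, "#spark"), (7810, "#flink")]
print(windowed_counts(events, 300, 300))
```

A tumbling window is simply the case where the slide equals the window length, so the same grouping logic covers both kinds of window.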
14.
Event Time Windows
- Event time is the time embedded within the data itself; it is not the time Spark received the data
- What about processing-time windows, if you want them?
15.
Handling Late Arrival in Event-Time
- Since the 'Result' table is updated by Spark, late data is put in its correct window group
- Use a normal filter in the SQL?
- Watermarks
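The two ideas on this slide, late data landing in its correct window and a watermark bounding how late data may be, can be modelled in a few lines of plain Python. This is a simplified sketch and not how Spark implements watermarks: the class `EventTimeCounter`, the parameter `allowed_lateness`, and the per-event watermark update are all assumptions made for illustration.

```python
from collections import Counter


class EventTimeCounter:
    """Counts events per tumbling event-time window, dropping data behind the watermark."""

    def __init__(self, window, allowed_lateness):
        self.window = window
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.counts = Counter()   # window start -> number of events
        self.dropped = []         # event times rejected as too late

    def add(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness
        start = event_time - (event_time % self.window)
        if start + self.window <= watermark:
            # The whole window is already behind the watermark: too late to count.
            self.dropped.append(event_time)
            return
        # On time or acceptably late: update the window the event belongs to.
        self.counts[start] += 1


counter = EventTimeCounter(window=10, allowed_lateness=5)
for event_time in [12, 21, 15, 40, 18]:
    counter.add(event_time)
print(dict(counter.counts))
print(counter.dropped)
```

The event at time 15 arrives after the event at time 21 but is still counted in the window starting at 10, while the event at time 18 arrives after the watermark has passed that window and is dropped.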
16.
Fault Tolerance
- Why care?
- Different guarantees against data loss
  • At least once
  • Exactly once
- What can fail?
  • Driver
  • Executor
17.
Spark 1.x best fault tolerance - Kafka Direct API
Benefits of this approach:
  • Simplified parallelism
  • Less storage needed
  • Exactly-once semantics (source & processing)
18.
Fault Tolerance in Structured Streaming
[Diagram: active driver checkpointing to HDFS]
Structured Streaming checkpointing:
- The offset ranges decided for a trigger interval are logged to the checkpoint directory *before* any processing starts for that trigger
- The Nth entry in the log indicates the data that is currently being processed
- The (N-1)th entry in the log indicates offsets that have been idempotently written to the Sink
- Log entries are numbered with monotonically increasing integers
On recovery:
- Restart processing of the Nth entry in the WAL
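The write-ahead offset log described above can be sketched as follows. This is an in-memory stand-in for the checkpoint directory, written in plain Python; `OffsetLog`, `plan_next` and `recover` are hypothetical names, not Spark classes, and the source is assumed to be addressable by integer offsets.

```python
class OffsetLog:
    """In-memory stand-in for the offset log kept in the checkpoint directory."""

    def __init__(self):
        # Entry i holds the (start, end) offset range planned for batch i.
        self.entries = []

    def plan_next(self, available_end):
        """Log the next offset range *before* any processing starts for it."""
        start = self.entries[-1][1] if self.entries else 0
        self.entries.append((start, available_end))
        return len(self.entries) - 1, (start, available_end)

    def recover(self):
        """After a restart, the last logged entry is the batch to process again."""
        if not self.entries:
            return None
        return len(self.entries) - 1, self.entries[-1]


source = list(range(10))          # a replayable source addressed by offset
log = OffsetLog()

batch_id, (start, end) = log.plan_next(4)   # batch 0 completes normally
batch_id, (start, end) = log.plan_next(7)   # batch 1 is logged, then the driver fails

batch_id, (start, end) = log.recover()      # restart: redo the last logged batch
print(batch_id, source[start:end])
```

Because the range is logged before processing begins, a restarted driver reprocesses exactly the same offsets as the failed attempt, rather than choosing a new range.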
19.
Fault Tolerance in Structured Streaming
End-to-end exactly-once guarantees with:
- Idempotent Sinks (built-in for commonly used sinks, e.g. Files / JDBC)
- Built-in Sources will *mostly* be the only ones that support replay
https://issues.apache.org/jira/browse/SPARK-15842
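Why a replayable source plus an idempotent sink gives exactly-once output can be shown with a small plain-Python model. It is a sketch under stated assumptions, not a Spark sink: `IdempotentSink` is a hypothetical class that keys each write by batch id, and the source is a list that can be re-read by offset.

```python
class IdempotentSink:
    """Stores output keyed by batch id, so rewriting a batch replaces it."""

    def __init__(self):
        self.batches = {}

    def write(self, batch_id, rows):
        # Writing the same batch again overwrites it instead of appending.
        self.batches[batch_id] = list(rows)

    def all_rows(self):
        return [row for batch_id in sorted(self.batches) for row in self.batches[batch_id]]


source = ["a", "b", "c", "d", "e"]   # replayable: any offset range can be re-read
sink = IdempotentSink()
naive = []                           # a plain append-only sink, for contrast

# Batch 0 succeeds; batch 1 is written, then the job fails and batch 1 is replayed.
for batch_id, (start, end) in [(0, (0, 2)), (1, (2, 5)), (1, (2, 5))]:
    rows = source[start:end]
    sink.write(batch_id, rows)
    naive.extend(rows)

print(sink.all_rows())
print(naive)
```

The replayed batch leaves the idempotent sink unchanged, while the append-only list ends up with the rows of batch 1 twice.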
20.
Managing Streaming Queries
Streaming in 1.x was definitely lacking in:
- Starting / stopping individual streaming queries
- Changing the computation done in a query
- Handling an abnormally terminated streaming query more gracefully than an app crash
21.
Managing Streaming Queries
22.
Managing Streaming Queries
23.
Summary
Overall, it has a good set of features:
- Easier code sharing between Batch and Streaming (no different type hierarchies)
- Window not tied to the batch interval
- No Streaming context
- Optimizer now available for your queries
Getting started:
- The combination of 3 things (output mode, sink type and query type) needs some time to wrap your head around *
  * And there is not much control over those
- You only get runtime exceptions when you get the above wrong
How does it compare to Apache Beam?
24.
For Each Sink
25.
Thank you