SlideShare ist ein Scribd-Unternehmen logo
1 von 25
Downloaden Sie, um offline zu lesen
© 2015 IBM Corporation1
! Agenda
- Spark Streaming 1.X
•  Features
•  Areas for Improvement
- Spark Streaming 2.0 – Structured Streaming
•  Addressing the Improvement Areas
•  API
•  Fault Tolerance
•  Event Time
•  Managing Streaming queries
- Structured Streaming Examples
https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming
- Summary thoughts
© 2015 IBM Corporation2
Spark Streaming 1. X
! Features of Spark Streaming
-  High Level API (stateful, joins, aggregates, windows etc.)
•  Overlap with RDD API (batch)
-  Fault – Tolerant (exactly once semantics achievable)
-  Back Pressure
-  Deep Integration with Spark Ecosystem (MLlib, SQL, GraphX etc.)
!
Apache	
  Hadoop	
  Day	
  2015	
  
© 2015 IBM Corporation3
Spark Streaming 1. X – Areas of improvement
! Fault-tolerance
For end-2-end exactly once guarantees, user needs to do all the heavy lifting in
the Sink
Can that be handled in a very simple way for the end-user ?
Apache	
  Hadoop	
  Day	
  2015	
  
© 2015 IBM Corporation4
Fault-Tolerant Semantics
Exactly	
  Once,	
  If	
  Outputs	
  are	
  Idempotent	
  or	
  transac6onal	
  
Exactly	
  Once,	
  as	
  long	
  as	
  received	
  data	
  is	
  not	
  lost	
  
Exactly	
  Once	
  needs	
  re-­‐playable	
  sources	
  (e.g.	
  Ka?a	
  Direct)	
  
Source
Receiver
Transforming
Outputting
Sink
© 2015 IBM Corporation5
Spark Streaming 1. X – Areas of improvement
! Fault-tolerance
-  For end-2-end exactly once guarantees, user needs to do all the heavy lifting in the Sink
! API
-  Request for more seamless API between Batch & Stream
-  Reduce complexities of streaming app *
! No Event Time support
-  Hard to support when processing time/batch time exposed in externals
! Streaming Query Management
! Micro-batch
!
Apache	
  Hadoop	
  Day	
  2015	
  
© 2015 IBM Corporation6
Spark Streaming 2.0 API
! Built on top of Spark SQL Engine
! Implicit Benefits
- Extend the primary Batch API even to Streaming
- Gain an Optimizer and all other enhancements done in SparkSQL.
! Challenge
- Remove/Keep streaming complexities to minimum
!
© 2015 IBM Corporation7
Lets Dive in
© 2015 IBM Corporation8
SQL Batch vs SQL Streaming- Conceptually
© 2015 IBM Corporation9
Batch vs Streaming - Programmatically
© 2015 IBM Corporation10
Output Modes - Sink
! Defined as what gets written from the Result table to external storage (Sink)
! Output modes
-  Complete – Entire updated Result table is written to external storage.
-  Append – Only new rows added in the Result table since last incremental query execution is
written to external storage.
-  Update - Only the rows updated in the Result table since last incremental query execution is
written to external storage.
Upto implementation of Storage connector to decide how to write.
* Aggregate queries only support complete mode and non-aggregate queries append mode
© 2015 IBM Corporation11
Supported Sinks & Modes in 2.0
*DEBUG	
  ONLY	
  
*DEBUG	
  ONLY	
  
© 2015 IBM Corporation12
Windowing in Structured Streaming
© 2015 IBM Corporation13
Window operations
!  Continuous time based aggregations are most common in Streaming applications.
-  Sliding window & Tumbling window
E.g. Top x hashtags on Twitter in last half hour, every 5 minutes
! New function that treats windowing as a regular aggregation
!  Used in a Group By clause
Can be used in Batch as well
© 2015 IBM Corporation14
Event Time Windows
! Event-Time is time embedded within the data itself
It is not the time Spark received the data
! What about processing time windows if you want them
© 2015 IBM Corporation15
Handling Late Arrival in Event-Time
! Since the ‘Result’ table is updated by Spark, the late data is put in its correct
window group
! Use a normal filter in the SQL ?
! Watermarks
© 2015 IBM Corporation16
Fault Tolerance
! Why Care?
! Different guarantees for Data Loss
! Atleast Once
! Exactly Once
! What all can fail?
! Driver
! Executor
© 2015 IBM Corporation17
Spark 1.x Best Fault tolerance - Kafka Direct API
•  Simplified Parallelism
•  Less Storage Need
•  Exactly Once Semantics.
source & processing
Benefits	
  of	
  this	
  approach	
  
© 2015 IBM Corporation18
Fault Tolerance in Structured Streaming
Active
Driver
Checkpoint	
  to	
  HDFS	
  
! Structured Streaming Checkpointing
Decided Offsets ranges for a trigger interval is logged to checkpoint Directory *before* any
processing is started for that trigger
Nth record in log indicates data that is currently being processed
N-1 entry in log indicates offsets idempotent written to Sink
Log entries are monotonically increasing integers
! On Recovery
Restart processing of nth entry in WAL
© 2015 IBM Corporation19
Fault Tolerance in Structured Streaming
! End-to-End Exactly Once guarantees with
-  idempotent Sinks (built-in for commonly used sinks e.g. Files / JDBC)
-  Built-in Sources will *mostly* be only ones that support replay
https://issues.apache.org/jira/browse/SPARK-15842
© 2015 IBM Corporation20
Managing Streaming Queries
!  Streaming in 1.x was definetly lacking in
-  Starting / Stopping individual Streaming Queries
-  Changing the computation done in a Query.
-  When a Streaming Query abnormally terminates handle more gracefully than app crash.
© 2015 IBM Corporation21
Managing Streaming Queries
© 2015 IBM Corporation22
Managing Streaming Queries
© 2015 IBM Corporation23
Summary
!  Overall has a good set of features
-  Easier code share between Batch and Streaming (No different type hierarchies)
-  Window not tied to Batch interval
-  No Streaming context
-  Optimizer now available for your queries.
!  Getting started
-  Combining of 3 things (Output Mode & Sink Type & Query type) needs some time to wrap your head around *
And not much control over those.
-  Only get Runtime exceptions when you mess with above
!  How does it compare to Apache Beam ?
© 2015 IBM Corporation24
For Each Sink
© 2015 IBM Corporation25
Thank YOU

Weitere ähnliche Inhalte

Was ist angesagt?

Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Executiondatamantra
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark applicationdatamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 APIdatamantra
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Sparkdatamantra
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark MLdatamantra
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberHostedbyConfluent
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streamingdatamantra
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetesdatamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streamingdatamantra
 
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...Flink Forward
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Sparkdatamantra
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streamingdatamantra
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to datasetdatamantra
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkabandatamantra
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsShashank L
 
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache BeamPortable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beamconfluent
 
Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...
Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...
Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...Flink Forward
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureJoey Bolduc-Gilbert
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in ProductionRobert Sanders
 

Was ist angesagt? (20)

Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to dataset
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actors
 
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache BeamPortable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
 
Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...
Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...
Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 

Ähnlich wie Introduction to Structured Streaming

Comparison of various streaming technologies
Comparison of various streaming technologiesComparison of various streaming technologies
Comparison of various streaming technologiesSachin Aggarwal
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureJianfeng Zhang
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureRajesh Balamohan
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016Tim Ellison
 
Using Databases and Containers From Development to Deployment
Using Databases and Containers  From Development to DeploymentUsing Databases and Containers  From Development to Deployment
Using Databases and Containers From Development to DeploymentAerospike, Inc.
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveYifeng Jiang
 
QCon Shanghai: Trends in Application Development
QCon Shanghai: Trends in Application DevelopmentQCon Shanghai: Trends in Application Development
QCon Shanghai: Trends in Application DevelopmentChris Bailey
 
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...Flink Forward
 
DataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application MeetupDataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application MeetupThomas Weise
 
Couchbase and Apache Spark
Couchbase and Apache SparkCouchbase and Apache Spark
Couchbase and Apache SparkMatt Ingenthron
 
Enhancing Live Migration Process for CPU and/or memory intensive VMs running...
Enhancing Live Migration Process for CPU and/or  memory intensive VMs running...Enhancing Live Migration Process for CPU and/or  memory intensive VMs running...
Enhancing Live Migration Process for CPU and/or memory intensive VMs running...Benoit Hudzia
 
OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]
OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]
OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]vasuballa
 
Red Hat for IBM System z IBM Enterprise2014 Las Vegas
Red Hat for IBM System z IBM Enterprise2014 Las Vegas Red Hat for IBM System z IBM Enterprise2014 Las Vegas
Red Hat for IBM System z IBM Enterprise2014 Las Vegas Filipe Miranda
 
NTTs Journey with Openstack-final
NTTs Journey with Openstack-finalNTTs Journey with Openstack-final
NTTs Journey with Openstack-finalshintaro mizuno
 
Tech trends 2018 2019
Tech trends 2018 2019Tech trends 2018 2019
Tech trends 2018 2019Johan Norm
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterTim Ellison
 

Ähnlich wie Introduction to Structured Streaming (20)

Comparison of various streaming technologies
Comparison of various streaming technologiesComparison of various streaming technologies
Comparison of various streaming technologies
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016
 
SD Times - Docker v2
SD Times - Docker v2SD Times - Docker v2
SD Times - Docker v2
 
Using Databases and Containers From Development to Deployment
Using Databases and Containers  From Development to DeploymentUsing Databases and Containers  From Development to Deployment
Using Databases and Containers From Development to Deployment
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-dive
 
QCon Shanghai: Trends in Application Development
QCon Shanghai: Trends in Application DevelopmentQCon Shanghai: Trends in Application Development
QCon Shanghai: Trends in Application Development
 
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and...
 
DataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application MeetupDataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application Meetup
 
Couchbase and Apache Spark
Couchbase and Apache SparkCouchbase and Apache Spark
Couchbase and Apache Spark
 
Enhancing Live Migration Process for CPU and/or memory intensive VMs running...
Enhancing Live Migration Process for CPU and/or  memory intensive VMs running...Enhancing Live Migration Process for CPU and/or  memory intensive VMs running...
Enhancing Live Migration Process for CPU and/or memory intensive VMs running...
 
OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]
OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]
OOW16 - Getting Optimal Performance from Oracle E-Business Suite [CON6711]
 
Red Hat for IBM System z IBM Enterprise2014 Las Vegas
Red Hat for IBM System z IBM Enterprise2014 Las Vegas Red Hat for IBM System z IBM Enterprise2014 Las Vegas
Red Hat for IBM System z IBM Enterprise2014 Las Vegas
 
NTTs Journey with Openstack-final
NTTs Journey with Openstack-finalNTTs Journey with Openstack-final
NTTs Journey with Openstack-final
 
Tech trends 2018 2019
Tech trends 2018 2019Tech trends 2018 2019
Tech trends 2018 2019
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
 

Mehr von datamantra

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Telliusdatamantra
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2datamantra
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsdatamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle managementdatamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scaladatamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scaladatamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetesdatamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsdatamantra
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scaladatamantra
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scaledatamantra
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientistsdatamantra
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPdatamantra
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2datamantra
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalystdatamantra
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIdatamantra
 

Mehr von datamantra (15)

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 

Kürzlich hochgeladen

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Kürzlich hochgeladen (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Introduction to Structured Streaming

  • 1. © 2015 IBM Corporation1 ! Agenda - Spark Streaming 1.X •  Features •  Areas for Improvement - Spark Streaming 2.0 – Structured Streaming •  Addressing the Improvement Areas •  API •  Fault Tolerance •  Event Time •  Managing Streaming queries - Structured Streaming Examples https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming - Summary thoughts
  • 2. © 2015 IBM Corporation2 Spark Streaming 1. X ! Features of Spark Streaming -  High Level API (stateful, joins, aggregates, windows etc.) •  Overlap with RDD API (batch) -  Fault – Tolerant (exactly once semantics achievable) -  Back Pressure -  Deep Integration with Spark Ecosystem (MLlib, SQL, GraphX etc.) ! Apache  Hadoop  Day  2015  
  • 3. © 2015 IBM Corporation3 Spark Streaming 1. X – Areas of improvement ! Fault-tolerance For end-2-end exactly once guarantees, user needs to do all the heavy lifting in the Sink Can that be handled in a very simple way for the end-user ? Apache  Hadoop  Day  2015  
  • 4. © 2015 IBM Corporation4 Fault-Tolerant Semantics Exactly  Once,  If  Outputs  are  Idempotent  or  transac6onal   Exactly  Once,  as  long  as  received  data  is  not  lost   Exactly  Once  needs  re-­‐playable  sources  (e.g.  Ka?a  Direct)   Source Receiver Transforming Outputting Sink
  • 5. © 2015 IBM Corporation5 Spark Streaming 1. X – Areas of improvement ! Fault-tolerance -  For end-2-end exactly once guarantees, user needs to do all the heavy lifting in the Sink ! API -  Request for more seamless API between Batch & Stream -  Reduce complexities of streaming app * ! No Event Time support -  Hard to support when processing time/batch time exposed in externals ! Streaming Query Management ! Micro-batch ! Apache  Hadoop  Day  2015  
  • 6. © 2015 IBM Corporation6 Spark Streaming 2.0 API ! Built on top of Spark SQL Engine ! Implicit Benefits - Extend the primary Batch API even to Streaming - Gain an Optimizer and all other enhancements done in SparkSQL. ! Challenge - Remove/Keep streaming complexities to minimum !
  • 7. © 2015 IBM Corporation7 Lets Dive in
  • 8. © 2015 IBM Corporation8 SQL Batch vs SQL Streaming- Conceptually
  • 9. © 2015 IBM Corporation9 Batch vs Streaming - Programmatically
  • 10. © 2015 IBM Corporation10 Output Modes - Sink ! Defined as what gets written from the Result table to external storage (Sink) ! Output modes -  Complete – Entire updated Result table is written to external storage. -  Append – Only new rows added in the Result table since last incremental query execution is written to external storage. -  Update - Only the rows updated in the Result table since last incremental query execution is written to external storage. Upto implementation of Storage connector to decide how to write. * Aggregate queries only support complete mode and non-aggregate queries append mode
  • 11. © 2015 IBM Corporation11 Supported Sinks & Modes in 2.0 *DEBUG  ONLY   *DEBUG  ONLY  
  • 12. © 2015 IBM Corporation12 Windowing in Structured Streaming
  • 13. © 2015 IBM Corporation13 Window operations !  Continuous time based aggregations are most common in Streaming applications. -  Sliding window & Tumbling window E.g. Top x hashtags on Twitter in last half hour, every 5 minutes ! New function that treats windowing as a regular aggregation !  Used in a Group By clause Can be used in Batch as well
  • 14. © 2015 IBM Corporation14 Event Time Windows ! Event-Time is time embedded within the data itself It is not the time Spark received the data ! What about processing time windows if you want them
  • 15. © 2015 IBM Corporation15 Handling Late Arrival in Event-Time ! Since the ‘Result’ table is updated by Spark, the late data is put in its correct window group ! Use a normal filter in the SQL ? ! Watermarks
  • 16. © 2015 IBM Corporation16 Fault Tolerance ! Why Care? ! Different guarantees for Data Loss ! Atleast Once ! Exactly Once ! What all can fail? ! Driver ! Executor
  • 17. © 2015 IBM Corporation17 Spark 1.x Best Fault tolerance - Kafka Direct API •  Simplified Parallelism •  Less Storage Need •  Exactly Once Semantics. source & processing Benefits  of  this  approach  
  • 18. © 2015 IBM Corporation18 Fault Tolerance in Structured Streaming Active Driver Checkpoint  to  HDFS   ! Structured Streaming Checkpointing Decided Offsets ranges for a trigger interval is logged to checkpoint Directory *before* any processing is started for that trigger Nth record in log indicates data that is currently being processed N-1 entry in log indicates offsets idempotent written to Sink Log entries are monotonically increasing integers ! On Recovery Restart processing of nth entry in WAL
  • 19. © 2015 IBM Corporation19 Fault Tolerance in Structured Streaming ! End-to-End Exactly Once guarantees with -  idempotent Sinks (built-in for commonly used sinks e.g. Files / JDBC) -  Built-in Sources will *mostly* be only ones that support replay https://issues.apache.org/jira/browse/SPARK-15842
  • 20. © 2015 IBM Corporation20 Managing Streaming Queries !  Streaming in 1.x was definetly lacking in -  Starting / Stopping individual Streaming Queries -  Changing the computation done in a Query. -  When a Streaming Query abnormally terminates handle more gracefully than app crash.
  • 21. © 2015 IBM Corporation21 Managing Streaming Queries
  • 22. © 2015 IBM Corporation22 Managing Streaming Queries
  • 23. © 2015 IBM Corporation23 Summary !  Overall has a good set of features -  Easier code share between Batch and Streaming (No different type hierarchies) -  Window not tied to Batch interval -  No Streaming context -  Optimizer now available for your queries. !  Getting started -  Combining of 3 things (Output Mode & Sink Type & Query type) needs some time to wrap your head around * And not much control over those. -  Only get Runtime exceptions when you mess with above !  How does it compare to Apache Beam ?
  • 24. © 2015 IBM Corporation24 For Each Sink
  • 25. © 2015 IBM Corporation25 Thank YOU