Interactive Data Analysis in
Spark Streaming
Ways to interact with Spark Streaming applications
https://github.com/Shasidhar/dynamic-streaming
● Shashidhar E S
● Big data consultant and trainer at
datamantra.io
● www.shashidhare.com
Agenda
● Big data applications
● Streaming applications
● Categories in streaming applications
● Interactive streaming applications
● Different strategies
● Zookeeper
● Apache Curator
● Curator cache types
● Spark streaming context dynamic switch
Big data applications
● Big data applications are typically divided by workload
● Major divisions are
○ Batch applications
○ Streaming applications
● Most existing systems support both kinds of application
● A new category is on the rise: interactive applications
Big data interactive applications
● Ability to manipulate data in an interactive way
● Exploratory in nature
● Combines batch and streaming data
● For development
○ Zeppelin, Jupyter Notebook
● For production
○ Batch - Datameer, Tellius, Zoomdata
○ Streaming - Stratio Engine, WSO2
Streaming Engines
● Ability to process data in real time
● Streaming process includes
○ Collecting data
○ Processing data
● Types of streaming engines
○ Real time
○ Near real time - (Micro Batch)
● Spark Streaming provides near-real-time (micro-batch) stream processing
Streaming application types/categories
● Streaming ETL processes
● Decision engines
○ Rule based
○ Machine Learning based
■ Online learning
● Real time Dashboards
● Root cause analysis engines
○ Multiple Streams
○ Handling event times
Streaming applications in real world
● Static
○ Data scientist defines rules
○ Admin sets up dashboard
○ The behaviour of the streaming application cannot be modified
● Dynamic
○ User can add/delete/modify rules and so control the decisions
○ User can view and design charts
○ The behaviour of the streaming application can be modified
Generic Interactive Application Architecture
[Diagram components: Streaming data source, Config source, Spark Streaming, Downstream applications]
How do we make the configuration dynamic?
Spark streaming introduction
Spark Streaming is an extension of the core Spark API that enables scalable,
high-throughput, fault-tolerant stream processing of live data streams
Micro batch
● Spark Streaming is a fast batch processing system
● It collects stream data into small batches and runs batch processing on each
● The batch interval can range from about 1 s to multiple hours
● Spark job creation and execution overhead is low enough that a batch can complete in under a second
● Each batch is an RDD; the continuous sequence of batches is called a DStream
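The micro-batch idea can be sketched in a few lines of plain Python. This is only an illustration of how events are grouped by batch interval, not Spark code; the function name `micro_batches` and the sample events are assumptions made for the example.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Integer division assigns each event to the interval it falls in.
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the batches in time order; each list plays the role of one batch.
    return [batches[i] for i in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, batch_interval=1.0)
counts = [len(batch) for batch in batches]  # a batch job run once per batch
```

With a 1 s interval the five events fall into three batches, and the per-batch count is computed as an ordinary batch operation on each one.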
Spark Streaming application
[Flow diagram: Create Streaming Context → Define input streams → Define data processing → Define data sink → Start Streaming Context (runs micro batches)]
● Options to change behaviour
○ Restart context
○ Without restarting context
■ Control configuration data
Interactive Streaming Application Strategies
Using Kafka as configuration source
[Diagram components: Streaming data source, Config source (Kafka), Spark Streaming, Downstream applications]
Using Kafka as Configuration Source
● Easy to adopt, as Kafka is the de facto streaming store
● Streaming configuration source
○ A new stream tracks the configuration changes
● Spark Streaming
○ Maintain the configuration as in-memory state and apply it to the data
○ State needs to be checkpointed
○ Failure recovery strategies must be handled
● Drawbacks
○ Hard to handle deletes/updates in state
○ Tricky to handle state if configurations are complex
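A minimal sketch of keeping configuration as state built from a stream of change events, in plain Python rather than Spark or Kafka code. The event shape `(key, value)`, the use of `None` as a delete marker, and the `threshold` setting are all assumptions for illustration; they show why deletes and updates need an explicit convention.

```python
def apply_config_event(state, event):
    """Return a new configuration state after applying one change event.

    An event is (key, value). A value of None is a hypothetical tombstone
    meaning "delete this key"; any other value adds or updates the key.
    """
    key, value = event
    new_state = dict(state)
    if value is None:
        new_state.pop(key, None)
    else:
        new_state[key] = value
    return new_state


def process_batch(records, state):
    """Apply the current configuration: keep records above the threshold."""
    threshold = state.get("threshold", 0)
    return [record for record in records if record > threshold]


# Replay the configuration stream to build up the in-memory state.
state = {}
config_events = [("threshold", 10), ("threshold", 50), ("window", 5), ("window", None)]
for event in config_events:
    state = apply_config_event(state, event)

filtered = process_batch([5, 20, 75], state)
```

After the replay only `threshold` remains, set to its latest value, and the data batch is filtered with it. A real job would also need to checkpoint `state` so it survives a failure.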
Using Database as configuration source
[Diagram components: Streaming data source, Spark Streaming, Workers, Distributed Database, Downstream applications]
Interactive streaming application Strategies contd.
● Easy to start with databases, as people are familiar with them
● Configuration source
○ Distributed data source
● Spark Streaming
○ Read configuration from the database and apply it (polling)
○ The database needs to be consistent and fast
○ Configurations can be kept in a cache to avoid latency
● Drawbacks
○ Achieving distributed cache consistency is tricky
○ Adds an extra component if the database is used only for this purpose
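The polling-plus-cache approach can be sketched as follows. This is an illustrative stand-in, not a database client: `PollingConfigCache`, the `database` dictionary, and the injected clock are assumptions made so the example runs on its own.

```python
import time


class PollingConfigCache:
    """Cache a configuration and reload it once the cached copy is too old."""

    def __init__(self, load_config, ttl_seconds, clock=time.monotonic):
        self._load_config = load_config  # function that reads the config store
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            # Cache is empty or expired: poll the store again.
            self._cached = self._load_config()
            self._loaded_at = now
        return self._cached


# A dictionary stands in for the database; a fake clock makes time controllable.
database = {"threshold": 10}
fake_now = [0.0]
cache = PollingConfigCache(lambda: dict(database), ttl_seconds=30,
                           clock=lambda: fake_now[0])

first = cache.get()            # first call loads from the "database"
database["threshold"] = 99     # configuration changes in the store
stale = cache.get()            # still served from the cache
fake_now[0] = 31.0             # the TTL has now elapsed
fresh = cache.get()            # reloaded, so the change becomes visible
```

The `stale` read shows the consistency drawback listed above: until the cache expires, each worker keeps applying the old configuration.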
Using Zookeeper as configuration source
[Diagram components: Streaming data source, Spark Streaming, Zookeeper, Downstream applications]
Interactive streaming application Strategies contd.
● Readily available if Kafka is already used in the system, so no extra component
● Configuration source - Zookeeper
● Spark Streaming
○ Tracks configuration changes and takes action through async callbacks
○ Suitable for storing any type of configuration
○ Allows attaching listeners for configuration changes
○ Ensures cache consistency by default
● Drawbacks
○ Streaming context restart is not suitable for all systems
Apache Zookeeper
“Zookeeper allows distributed processes to coordinate with each other
through a shared hierarchical namespace of data registers”
● Distributed coordination service
● Hierarchical, file-system-like namespace
● Data is stored in ZNodes
● Can be thought of as a “distributed in-memory file system” with some limitations, such as the size of data per node; it is optimized for many reads and few writes
Zookeeper Architecture
Zookeeper data model
● Follows hierarchical namespace
● Each node is called a ZNode
○ Data saved as bytes
○ Can have children
○ Only accessible through absolute paths
○ Data size limited to 1MB
● Follows global ordering
ZNode
● Types
○ Persistent nodes
■ Exist until explicitly deleted
■ Can have children
○ Ephemeral nodes
■ Exist as long as the creating session is active
■ Cannot have children
● Data can be secured at the ZNode level with ACLs
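The data model rules from the last two slides can be captured in a small in-memory model. This is a toy sketch in plain Python, not the Zookeeper client API; the class `ZNodeTree`, its method names, and the session label `"s1"` are assumptions for illustration.

```python
MAX_DATA_BYTES = 1024 * 1024  # the 1 MB per-node limit mentioned above


class ZNodeTree:
    """Toy model of the ZNode namespace: absolute paths, bytes, two node types."""

    def __init__(self):
        # path -> (data, owning session); the owner is None for persistent nodes.
        self._nodes = {"/": (b"", None)}

    def create(self, path, data=b"", session=None):
        if not path.startswith("/") or path == "/" or path.endswith("/"):
            raise ValueError("path must be absolute, e.g. /config/rules")
        if len(data) > MAX_DATA_BYTES:
            raise ValueError("data exceeds the size limit")
        if path in self._nodes:
            raise KeyError("node already exists: " + path)
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._nodes:
            raise KeyError("parent does not exist: " + parent)
        if self._nodes[parent][1] is not None:
            raise ValueError("ephemeral nodes cannot have children")
        self._nodes[path] = (data, session)

    def get(self, path):
        return self._nodes[path][0]

    def children(self, path):
        prefix = path.rstrip("/") + "/"
        return sorted(p for p in self._nodes
                      if p != "/" and p.startswith(prefix)
                      and "/" not in p[len(prefix):])

    def close_session(self, session):
        # Ephemeral nodes disappear together with the session that created them.
        owned = [p for p, (_, owner) in self._nodes.items()
                 if owner is not None and owner == session]
        for path in owned:
            del self._nodes[path]


tree = ZNodeTree()
tree.create("/config")                                     # persistent
tree.create("/config/rules", b"threshold=10")              # persistent child
tree.create("/config/worker-1", b"alive", session="s1")    # ephemeral
before = tree.children("/config")
tree.close_session("s1")
after = tree.children("/config")
```

Closing the session removes only the ephemeral node, which is why ephemeral nodes are commonly used to signal that a process is alive.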
Data consistency
● Reads are served by the server the client is connected to
● Writes are synchronised through the leader
● Ensures sequential consistency
● Reads are atomic: a read returns all of a ZNode's data or fails
● All clients see the same view regardless of the server they are connected to
● Updates persist until another client overwrites them
Zookeeper Watches
● Available watches in Zookeeper
○ Node Children Changed
○ Node Changed
○ Node Data Changed
○ Node Deleted
● Watches are one-time triggers
● The client always receives the watch event before it sees the changed data
● A client can re-register the watch if it needs further notifications
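The one-time behaviour of watches can be sketched with a small stand-in class. This is plain Python, not the Zookeeper client: `WatchedNode` and its methods are hypothetical, and only the event name `NodeDataChanged` is borrowed from Zookeeper.

```python
class WatchedNode:
    """Toy node whose watches fire once and must then be registered again."""

    def __init__(self, data):
        self._data = data
        self._watchers = []

    def get_data(self, watcher=None):
        # Reading with a watcher registers a one-time watch on this node.
        if watcher is not None:
            self._watchers.append(watcher)
        return self._data

    def set_data(self, data):
        self._data = data
        # Clear the registered watches before notifying: each fires only once.
        watchers, self._watchers = self._watchers, []
        for watcher in watchers:
            watcher("NodeDataChanged")


# Case 1: the callback re-registers itself, so it sees every change.
seen = []
node = WatchedNode(b"v1")


def on_change(event):
    # The event carries no data; read the node again and watch it again.
    seen.append((event, node.get_data(watcher=on_change)))


node.get_data(watcher=on_change)
node.set_data(b"v2")
node.set_data(b"v3")

# Case 2: the callback does not re-register, so it misses the second change.
fired = []
other = WatchedNode(b"a")
other.get_data(watcher=fired.append)
other.set_data(b"b")
other.set_data(b"c")
```

Forgetting to re-register is an easy mistake with the raw API, and it is one of the chores that higher-level wrappers take over.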
Zookeeper Client example
ZK API issues
● Making client code thread safe is tricky
● Hard for programmers to use
● Exception handling is cumbersome
● Similar to the MapReduce API
The solution is “Apache Curator”
Apache Curator
● A Zookeeper Keeper
● Main components
○ Client - Wrapper for the ZK class, manages the Zookeeper connection
○ Framework - High-level API that encloses all ZK-related operations and handles all types of retries
○ Recipes - Implementations of common Zookeeper “recipes” built on top of the Curator framework
● User friendly API
Curator Hands on - Basic Operations
Git branch : zookeeperexamples
Apache Curator caches
Three types of caches
● Node Cache
○ Monitors a single ZNode
● Path Cache
○ Monitors the children of a ZNode
● Tree Cache
○ Monitors a whole ZK subtree, caching its data locally
Curator Hands on - Node Cache
Git branch : zookeeperexamples
Path Cache
● Monitor a ZNode
● Using Archaius, a dynamic property library
● Use a ConfigurationSource from Archaius to track changes
● Pair the configuration source with an UpdateListener
● See it in action
[Diagram: Watched DataSource → Update Listener]
Curator Hands on - Path Cache
Git branch : zookeeperlistener
Spark streaming dynamic restart
● Use the same WatchedSource to track any changes in configuration
● Track changes in Zookeeper with the path cache
● Trigger a streaming context restart when ZK data changes
[Diagram: Watched DataSource → Update Listener (restart context)]
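The restart-on-change flow can be sketched with stand-in classes. `FakeStreamingContext` and `RestartingController` are hypothetical names and do not reproduce the Spark or Archaius APIs. The sketch builds a new context on every restart because a stopped Spark StreamingContext cannot be started again.

```python
class FakeStreamingContext:
    """Hypothetical stand-in for a streaming context; not the Spark API."""

    def __init__(self, config):
        self.config = dict(config)
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class RestartingController:
    """Owns the current context and replaces it when the configuration changes."""

    def __init__(self, initial_config):
        self.restarts = 0
        self.context = self._launch(initial_config)

    def _launch(self, config):
        # Build a fresh context from the given configuration and start it.
        context = FakeStreamingContext(config)
        context.start()
        return context

    def on_config_update(self, new_config):
        """Update listener: restart only when the configuration really changed."""
        if new_config == self.context.config:
            return
        self.context.stop()
        self.context = self._launch(new_config)
        self.restarts += 1


controller = RestartingController({"threshold": 10})
first_context = controller.context
controller.on_config_update({"threshold": 10})   # unchanged: no restart
controller.on_config_update({"threshold": 50})   # changed: stop and relaunch
```

Comparing the new configuration with the running one avoids needless restarts when the listener fires for an update that changes nothing.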
Hands on - Streaming Restart
Git branch : streaming-listener
Ways to control data loss
● Enable checkpointing
● Track Kafka topic offsets manually
● Prefer direct Kafka input streams
● Use Kafka monitoring tools to check the status of data processing
● Always create the Spark streaming context from the checkpoint directory
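Manual offset tracking can be sketched as follows: the offset is committed only after a batch has been processed, so a restart resumes from the last committed position. This is plain Python, not the Kafka API; `OffsetStore`, `run_once`, and the simulated failure are assumptions for illustration.

```python
class OffsetStore:
    """Stand-in for durable offset storage (hypothetical, in-memory only)."""

    def __init__(self):
        self.offset = 0


def run_once(log, store, process, batch_size=2):
    """Process from the last committed offset, committing after each batch."""
    while store.offset < len(log):
        batch = log[store.offset:store.offset + batch_size]
        process(batch)                # if this raises, the offset stays put
        store.offset += len(batch)    # commit only after processing succeeded


log = ["e0", "e1", "e2", "e3", "e4"]
store = OffsetStore()
processed = []
failures = {"remaining": 1}


def flaky_process(batch):
    # Fail once on the batch containing "e2" to imitate a crash mid-stream.
    if "e2" in batch and failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise RuntimeError("simulated failure")
    processed.extend(batch)


try:
    run_once(log, store, flaky_process)
except RuntimeError:
    pass                              # the job "crashes" here
offset_after_crash = store.offset
run_once(log, store, flaky_process)   # the restart resumes from the committed offset
```

No event is lost across the simulated crash. If a failure happened after processing but before the commit, that batch would be processed again, so this pattern gives at-least-once rather than exactly-once behaviour.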
Next Steps
● Try to add some meaningful configurations
● Implement the same idea with Akka actors
References
● http://sysgears.com/articles/managing-configuration-of-distributed-system-with-apache-zookeeper/
● https://github.com/Netflix/archaius/wiki/ZooKeeper-Dynamic-Configuration

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to dataset
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
Improving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time SparkImproving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time Spark
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
 
Anatomy of in memory processing in Spark
Anatomy of in memory processing in SparkAnatomy of in memory processing in Spark
Anatomy of in memory processing in Spark
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
 

Andere mochten auch

Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 

Andere mochten auch (20)

Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
 
Kafka and Spark Streaming
Kafka and Spark StreamingKafka and Spark Streaming
Kafka and Spark Streaming
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache Spark
 
Actian Vector Whitepaper
 Actian Vector Whitepaper Actian Vector Whitepaper
Actian Vector Whitepaper
 
Actian Analytics Platform - Hadoop SQL Edition
Actian Analytics Platform - Hadoop SQL EditionActian Analytics Platform - Hadoop SQL Edition
Actian Analytics Platform - Hadoop SQL Edition
 
Data Science with Spark by Saeed Aghabozorgi
Data Science with Spark by Saeed Aghabozorgi Data Science with Spark by Saeed Aghabozorgi
Data Science with Spark by Saeed Aghabozorgi
 

Ähnlich wie Interactive Data Analysis in Spark Streaming

Ähnlich wie Interactive Data Analysis in Spark Streaming (20)

Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
 
Change data capture
Change data captureChange data capture
Change data capture
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Megastore by Google
Megastore by GoogleMegastore by Google
Megastore by Google
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
 
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanAnalytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache Apex
 
Scaling ELK Stack - DevOpsDays Singapore
Scaling ELK Stack - DevOpsDays SingaporeScaling ELK Stack - DevOpsDays Singapore
Scaling ELK Stack - DevOpsDays Singapore
 
Streamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User GroupStreamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User Group
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
 
Serverless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar FunctionsServerless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar Functions
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 

Mehr von datamantra (11)

State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 

Kürzlich hochgeladen

Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 

Kürzlich hochgeladen (20)

Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 

Interactive Data Analysis in Spark Streaming

  • 1. Interactive Data Analysis in Spark Streaming Ways to interact with Spark Streaming applications https://github.com/Shasidhar/dynamic-streaming
  • 2. ● Shashidhar E S ● Big data consultant and trainer at datamantra.io ● www.shashidhare.com
  • 3. Agenda ● Big data applications ● Streaming applications ● Categories in streaming applications ● Interactive streaming applications ● Different strategies ● Zookeeper ● Apache curator ● Curator cache types ● Spark streaming context dynamic switch
  • 4. Big data applications ● Typically applications in big data are divided depending on their work loads ● Major divisions are ○ Batch applications ○ Streaming applications ● Most of the existing systems support both of these applications ● But there is a new category of applications are in raise, they are known as interactive applications
  • 5. Big data interactive applications ● Ability to manipulate data in interactive way ● Exploratory in nature ● Combines batch and streaming data ● For development ○ Zeppelin, Jupiter Notebook ● For production ○ Batch - Datameer, Tellius, Zoomdata ○ Streaming - Stratio Engine, WSO2
  • 6. Streaming Engines ● Ability to process data in real time ● Streaming process includes ○ Collecting data ○ Processing data ● Types of streaming engines ○ Real time ○ Near real time - (Micro Batch) ● Spark allows near real time streaming processing
  • 7. Streaming application types/categories ● Streaming ETL processes ● Decision engines ○ Rule based ○ Machine Learning based ■ Online learning ● Real time Dashboards ● Root cause analysis engines ○ Multiple Streams ○ Handling event times
  • 8. Streaming applications in real world ● Static ○ Data scientist defines rules ○ Admin sets up dashboard ○ Not able to modify the behaviour of streaming application ● Dynamic ○ User can add/delete/modify rules , controls the decision ○ User can see some charts/ design charts ○ Ability to modify the behaviour of streaming application
  • 9. Generic Interactive Application Architecture Spark Streaming Streaming data source Streaming Config source Downstream applications How do we make the configuration dynamic?
  • 10. Spark streaming introduction Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
  • 11. Micro batch ● Spark streaming is a fast batch processing system ● Spark streaming collects stream data into small batch and runs batch processing on it ● Batch can be as small as 1s to as big as multiple hours ● Spark job creation and execution overhead is so low it can do all that under a sec ● These batches are called as DStreams
  • 12. Spark Streaming application ● Steps: create Streaming Context, define input streams, define data processing, define data sink, start Streaming Context (micro batches) ● Options to change behaviour ○ Restart context ○ Without restarting context ■ Control configuration data
  • 14. Using Kafka as configuration source (diagram components: Streaming data source, Config source (Kafka), Spark Streaming, Downstream applications)
  • 15. Using Kafka as Configuration Source ● Easy to adopt, as Kafka is the de facto streaming store ● Streaming configuration source ○ A new stream tracks the configuration changes ● Spark Streaming ○ Maintain configuration as state in memory and apply it ○ State needs to be checkpointed ○ Failure recovery strategies need to be taken care of ● Drawbacks ○ Hard to handle deletes/updates in state ○ Tricky to handle state if configurations are complex
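To make the "configuration as state" idea and its delete/update drawback concrete, here is a minimal sketch in plain Python. It does not use Kafka or Spark; the event shape `(key, value)` and the convention that a `None` value means "delete" (a tombstone) are assumptions chosen for illustration.

```python
def apply_config_event(state, event):
    """Fold one configuration event into the in-memory state (a dict)."""
    key, value = event
    if value is None:
        # Assumed convention: a None value is a tombstone that deletes the key.
        state.pop(key, None)
    else:
        # Inserts and updates look identical: last write wins.
        state[key] = value
    return state


# Hypothetical change stream, in arrival order.
config_events = [
    ("threshold", 10),
    ("window", 30),
    ("threshold", 25),  # update
    ("window", None),   # delete (tombstone)
]

state = {}
for event in config_events:
    apply_config_event(state, event)

print(state)
```

The state is only correct if every event is applied exactly in order, which is why the slide stresses checkpointing and failure recovery: a lost or replayed event leaves the in-memory configuration out of sync with what the user intended.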
  • 16. Using Database as configuration source (diagram components: Streaming data source, Distributed Database, Spark Streaming, Workers, Downstream applications)
  • 17. Interactive streaming application strategies, contd. ● Easy to start with databases, as people are familiar with them ● Configuration source ○ Distributed data source ● Spark Streaming ○ Read configuration from the database and apply it (polling) ○ Database needs to be consistent and fast ○ Configurations can be kept in a cache to avoid latency ● Drawbacks ○ Achieving distributed cache consistency is tricky ○ The database may be an extra component if it is used only for this purpose
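The polling-plus-cache approach above can be sketched as follows. This is an illustrative model in plain Python: `PolledConfig`, the fake database dict, and the injectable clock are all hypothetical names, not part of any database driver or of Spark.

```python
import time


class PolledConfig:
    """Cache a configuration value and reload it after a time-to-live."""

    def __init__(self, load_fn, ttl_seconds, clock=time.monotonic):
        self._load_fn = load_fn      # function that reads from the database
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the demo can fake time
        self._cached = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            # Cache is empty or expired: poll the database again.
            self._cached = self._load_fn()
            self._loaded_at = now
        return self._cached


# Hypothetical stand-ins for a database and a clock.
fake_db = {"threshold": 10}
fake_now = [0.0]
load_count = [0]


def load_from_db():
    load_count[0] += 1
    return dict(fake_db)


config = PolledConfig(load_from_db, ttl_seconds=5, clock=lambda: fake_now[0])

first = config.get()            # first call loads from the database
fake_db["threshold"] = 25       # the configuration changes in the database
stale = config.get()            # still within the TTL: cached value returned
fake_now[0] = 6.0               # time passes beyond the TTL
fresh = config.get()            # cache expired: reloaded from the database
```

The `stale` read shows the consistency drawback from the slide: between polls, each worker may keep applying an old configuration, and different workers may expire their caches at different times.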
  • 18. Using Zookeeper as configuration source (diagram components: Streaming data source, Zookeeper, Spark Streaming, Downstream applications)
  • 19. Interactive streaming application strategies, contd. ● Zookeeper is already available in any system that uses Kafka, so it adds no extra burden ● Configuration source - Zookeeper ● Spark Streaming ○ Can track configuration changes and take action through async callbacks ○ Suitable for storing any type of configuration ○ Allows attaching listeners for configuration changes ○ Ensures cache consistency by default ● Drawbacks ○ Restarting the streaming context is not suitable for all systems
  • 20. Apache Zookeeper “Zookeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers” ● Distributed coordination service ● Hierarchical file system ● Data is stored in ZNodes ● Can be thought of as a “distributed in-memory file system” with some limitations, such as a cap on data size and optimization for read-heavy, write-light workloads
  • 22. Zookeeper data model ● Follows a hierarchical namespace ● Each node is called a ZNode ○ Data is saved as bytes ○ Can have children ○ Only accessible through absolute paths ○ Data size is limited to 1MB ● Follows global ordering
  • 23. ZNode ● Types ○ Persistent nodes ■ Exist until explicitly deleted ■ Can have children ○ Ephemeral nodes ■ Exist as long as the session is active ■ Cannot have children ● Data can be secured at the ZNode level with ACLs
  • 24. Data consistency ● Reads are served from the local server ● Writes are synchronised through the leader ● Ensures sequential consistency ● Operations are atomic: data is read completely or the read fails ● All clients get the same result regardless of which server they are connected to ● Updates are persisted until overwritten by a client
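The data model and node types above can be modelled in a few lines. The sketch below is a toy in-memory tree in plain Python, not the ZooKeeper client API; the class `ZNodeTree`, its method names, and the session id `"s1"` are assumptions for illustration only.

```python
class ZNodeTree:
    """Toy model of a ZNode namespace with persistent and ephemeral nodes."""

    def __init__(self):
        # path -> (data bytes, owning session id, or None if persistent)
        self._nodes = {"/": (b"", None)}

    def create(self, path, data=b"", session=None):
        if not path.startswith("/") or path == "/" or path.endswith("/"):
            raise ValueError("path must be absolute, e.g. /app/config")
        if path in self._nodes:
            raise KeyError("node exists: " + path)
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._nodes:
            raise KeyError("missing parent: " + parent)
        if self._nodes[parent][1] is not None:
            raise ValueError("ephemeral nodes cannot have children")
        self._nodes[path] = (bytes(data), session)

    def children(self, path):
        prefix = path.rstrip("/") + "/"
        return sorted(
            p for p in self._nodes
            if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):]
        )

    def close_session(self, session):
        if session is None:
            return  # persistent nodes are never removed by a session ending
        for path in [p for p, (_, s) in self._nodes.items() if s == session]:
            # Ephemeral nodes disappear together with their session.
            del self._nodes[path]


tree = ZNodeTree()
tree.create("/app", b"root")
tree.create("/app/config", b"threshold=10")            # persistent
tree.create("/app/worker-1", b"alive", session="s1")   # ephemeral

before = tree.children("/app")

try:
    tree.create("/app/worker-1/child", session="s1")
    rejected = False
except ValueError:
    rejected = True

tree.close_session("s1")
after = tree.children("/app")
```

After the session closes, only the persistent `/app/config` node remains under `/app`, which is the behaviour that makes ephemeral nodes useful for tracking live workers.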
  • 25. Zookeeper Watches ● Available watch events in Zookeeper ○ Node Children Changed ○ Node Created ○ Node Data Changed ○ Node Deleted ● Watches are one-time triggers ● The client always receives the watch event before it sees the changed data ● A client can re-register the watch if it is needed again
  • 27. ZK API issues ● Making client code thread safe is tricky ● Hard for programmers to use ● Exception handling is awkward ● Low level, similar to the MapReduce API ● The solution is “Apache Curator”
  • 28. Apache Curator ● A Zookeeper Keeper ● Main components ○ Client - Wrapper for the ZK class, manages the Zookeeper connection ○ Framework - High level API that encloses all ZK related operations and handles all types of retries ○ Recipes - Implementations of common Zookeeper “recipes” built on top of the Curator framework ● User friendly API
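The one-time-trigger behaviour is easy to get wrong, so here is a minimal sketch of it. This is an in-memory model in plain Python, not the ZooKeeper client API: `WatchedStore` and the two callbacks are hypothetical. As in the slide, the event carries only the path, and the client reads the data afterwards.

```python
class WatchedStore:
    """Toy key-value store whose watches fire once and are then removed."""

    def __init__(self):
        self._data = {}
        self._watches = {}  # path -> list of callbacks, each fires at most once

    def get(self, path, watch=None):
        if watch is not None:
            self._watches.setdefault(path, []).append(watch)
        return self._data.get(path)

    def set(self, path, value):
        self._data[path] = value
        # pop() removes the watches before firing them: one-time triggers.
        for callback in self._watches.pop(path, []):
            callback(path)


store = WatchedStore()
seen_once = []
seen_always = []


def one_shot(path):
    # Reads the new data but does not register the watch again.
    seen_once.append(store.get(path))


def re_registering(path):
    # Reads the new data and re-registers itself in the same call.
    seen_always.append(store.get(path, watch=re_registering))


store.set("/config", "v1")
store.get("/config", watch=one_shot)
store.get("/config", watch=re_registering)

store.set("/config", "v2")  # both watches fire
store.set("/config", "v3")  # only the re-registered watch fires
```

The callback that does not re-register misses the second change, which is the kind of bookkeeping that Curator's cache recipes take over from the programmer.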
  • 29. Curator Hands on - Basic Operations Git branch : zookeeperexamples
  • 30. Apache Curator caches Three types of caches ● Node Cache ○ Monitors a single node ● Path Cache ○ Monitors the children of a ZNode ● Tree Cache ○ Monitors a whole ZK subtree by caching its data locally
  • 31. Curator Hands on - Node Cache Git branch : zookeeperexamples
  • 32. Path Cache ● Monitors a ZNode ● Uses archaius, a dynamic property library ● Use ConfigurationSource from archaius to track changes ● Pair the configuration source with an UpdateListener ● See it in action (diagram components: Watched DataSource, Update Listener)
  • 33. Curator Hands on - Path Cache Git branch : zookeeperlistener
  • 34. Spark streaming dynamic restart ● Use the same WatchedSource to track any changes in configuration ● Track changes on Zookeeper with the path cache ● Control the Streaming context restart on ZK data change (diagram components: Watched DataSource, Update Listener (restart context))
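The restart-on-change flow can be sketched as a listener that swaps one context for another. The code below is a hedged illustration in plain Python: `FakeStreamingContext` is a hypothetical stand-in and not the Spark `StreamingContext`, and `RestartController.on_config_change` plays the role of the update listener. The actual implementation lives in the `streaming-listener` branch of the repository and may differ.

```python
import threading


class FakeStreamingContext:
    """Hypothetical stand-in for a streaming context; not the Spark API."""

    def __init__(self, config):
        self.config = dict(config)
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class RestartController:
    """Owns the current context and restarts it when configuration changes."""

    def __init__(self, initial_config):
        self._lock = threading.Lock()  # listener callbacks may arrive on other threads
        self.restarts = 0
        self.context = FakeStreamingContext(initial_config)
        self.context.start()

    def on_config_change(self, new_config):
        """Update-listener callback: stop the old context, start a new one."""
        with self._lock:
            if new_config == self.context.config:
                return  # nothing changed, so avoid a needless restart
            self.context.stop()
            self.context = FakeStreamingContext(new_config)
            self.context.start()
            self.restarts += 1


controller = RestartController({"threshold": 10})
old_context = controller.context

controller.on_config_change({"threshold": 10})  # same config: ignored
controller.on_config_change({"threshold": 25})  # changed: context restarted
```

A new context is built instead of reusing the stopped one because, in this model as in Spark Streaming, the processing graph is fixed once the context has started.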
  • 35. Hands on - Streaming Restart Git branch : streaming-listener
  • 36. Ways to control data loss ● Enable checkpointing ● Track Kafka topic offsets manually ● Prefer direct Kafka input streams ● Use Kafka monitoring tools to see the status of data processing ● Always create the Spark streaming context from the checkpoint directory Next Steps ● Try to add some meaningful configurations ● Implement the same idea with Akka actors
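Tracking offsets manually, as suggested above, comes down to committing an offset only after its record has been processed. The following is a minimal sketch in plain Python with an in-memory list standing in for a Kafka topic partition; `process_from`, the `offsets` dict, and the failing handler are assumptions for illustration, not Kafka or Spark APIs.

```python
def process_from(log, offset_store, topic, handler):
    """Process records from the last committed offset onwards.

    The offset is committed only after the handler succeeds, so a crash
    replays the failed record instead of skipping it.
    """
    start = offset_store.get(topic, 0)
    for offset in range(start, len(log)):
        handler(log[offset])
        offset_store[topic] = offset + 1  # commit after successful processing
    return offset_store.get(topic, 0)


log = ["e0", "e1", "e2", "e3", "e4"]  # hypothetical topic partition
offsets = {}                          # hypothetical durable offset store
processed = []


def flaky_handler(record):
    # Simulate a crash (for example, a context restart) while handling e3.
    if record == "e3":
        raise RuntimeError("simulated failure")
    processed.append(record)


try:
    process_from(log, offsets, "events", flaky_handler)
except RuntimeError:
    pass

offset_after_crash = offsets["events"]

# After the restart, processing resumes from the committed offset.
final_offset = process_from(log, offsets, "events", processed.append)
```

No record is lost across the simulated restart. Note that this ordering gives at-least-once delivery: if the crash happened after the handler ran but before the commit, the record would be processed twice.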