State Management in
Structured Streaming
Chandan Prakash
Copyright 2018 © Qubole
Agenda
● Structured Streaming : Brief Intro
● Types of Stream Processing : Stateless vs Stateful
● State in Stream Processing
● State Store in Stream Processing
● State Management in Old Spark Streaming
● State Management in Structured Streaming
● Demo with Code Example
● Quiz, Food for Thought
What does this picture represent?
Batch Processing vs. Stream Processing
Structured Streaming : Brief Intro
● Built on Spark SQL engine
● Abstraction: the stream of incoming data is an unbounded Input Table, the processing
logic is a SQL query, and the output of processing is a Results Table
● Internally, the query is converted into incremental micro-batch processing
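A minimal sketch of the incremental idea, in plain Python rather than Spark: each micro-batch is folded into a running result table, and the outcome matches a batch query over everything received so far. The function name `run_micro_batch` and the sample words are hypothetical, chosen only for illustration.

```python
from collections import Counter


def run_micro_batch(result_table, batch):
    """Fold one micro-batch of words into the running result table."""
    # Only the newly arrived rows are scanned; earlier input is never re-read.
    result_table.update(batch)
    return result_table


# Hypothetical input: each inner list is the data that arrived in one trigger.
micro_batches = [["cat", "dog"], ["dog", "owl"], ["cat", "dog"]]

result_table = Counter()
for batch in micro_batches:
    run_micro_batch(result_table, batch)

# The same query run as a batch job over the whole "input table" seen so far.
full_table = [word for batch in micro_batches for word in batch]
assert result_table == Counter(full_table)
print(dict(result_table))
```

The running `Counter` plays the role of the Results Table; it is also a first example of state carried between micro-batches.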
Structured Streaming Query Example
Types of Stream Processing
● Stateless Streaming
○ Processing of every record is independent
○ Operations like map, filter
● Stateful Streaming
○ Processing of a record depends on
previous records
○ Operations like counting records
per distinct key, deduplicating records
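The difference can be sketched in plain Python (not Spark code). The names `stateless_filter_map` and `StatefulOps` are hypothetical; the stateless function needs nothing beyond the current records, while the stateful class must remember what earlier batches contained.

```python
def stateless_filter_map(records):
    """Stateless: each record is handled on its own (filter, then map)."""
    return [r.upper() for r in records if r != ""]


class StatefulOps:
    """Stateful: results depend on records seen in earlier batches."""

    def __init__(self):
        self.counts = {}   # state for count-per-key
        self.seen = set()  # state for deduplication

    def count_per_key(self, records):
        for key in records:
            self.counts[key] = self.counts.get(key, 0) + 1
        return dict(self.counts)

    def deduplicate(self, records):
        fresh = []
        for r in records:
            if r not in self.seen:
                self.seen.add(r)
                fresh.append(r)
        return fresh


ops = StatefulOps()
ops.count_per_key(["a", "b"])            # first micro-batch
later_counts = ops.count_per_key(["a"])  # second batch builds on the first
first_unique = ops.deduplicate(["x", "y", "x"])
second_unique = ops.deduplicate(["y", "z"])  # "y" was already seen
```

If `ops` were recreated between batches, both stateful results would be wrong, which is why this state has to be stored somewhere durable.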
State in Stream Processing
● State of Streaming Progress
○ Metadata of stream processing : offsets
○ Keeps track of how much data has been processed so far
○ Needed for fault tolerance
○ Present in both stateless and stateful processing
● State of Data
○ Intermediate data information between records
○ Operations like aggregation, deduplication
○ Present in Stateful Processing
Note: "State" on its own generally means the state of data used for processing. The
state of streaming progress is called metadata/offsets.
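To make the two kinds of state concrete, here is a small sketch in plain Python, assuming a replayable source modelled as a list. The function names and the sample log are hypothetical. The stateless job only needs its offset to resume; the stateful job needs the offset and the per-key counts.

```python
log = ["a", "b", "a", "c", "a"]  # hypothetical replayable source


def stateless_batch(log, offset, size):
    """Uppercase records; only the offset must survive a restart."""
    batch = log[offset:offset + size]
    return [r.upper() for r in batch], offset + len(batch)


def stateful_batch(log, offset, size, counts):
    """Count per key; both the offset and the counts must survive a restart."""
    batch = log[offset:offset + size]
    new_counts = dict(counts)
    for key in batch:
        new_counts[key] = new_counts.get(key, 0) + 1
    return new_counts, offset + len(batch)


# Stateless: resuming from the saved offset is enough.
out, offset = stateless_batch(log, 0, 2)
out2, offset = stateless_batch(log, offset, 2)

# Stateful: the counts from the first batch feed the second one.
counts, off = stateful_batch(log, 0, 3, {})
counts, off = stateful_batch(log, off, 3, counts)
```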
State Store in Streaming
● Reliable place providing read and write of
intermediate data (state)
● Can sustain streaming failures and restore
processing from the same point
● Options :
In-memory, File Systems, Storage Systems
In-Memory HashMap
State Management in Old (DStream) Spark Streaming
● RDD based Streaming
● Inefficient, flawed design
○ State persisted with offset metadata
○ Complete snapshot persistence every micro-batch
○ Tightly coupled, synchronous with Spark RDD tasks
○ No provision for incremental state persistence
○ Processing overhead, bottleneck as state grows
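A rough illustration of why full snapshots become a bottleneck as state grows, in plain Python. The JSON byte count is only a stand-in for real persistence cost, and the state size, key names, and one-key-per-batch update pattern are assumptions made for this sketch.

```python
import json


def serialized_size(data):
    """Bytes needed to persist `data` as JSON (a stand-in for real I/O cost)."""
    return len(json.dumps(data, sort_keys=True).encode("utf-8"))


state = {f"key{i}": 0 for i in range(1000)}  # assumed state size
full_cost = 0
delta_cost = 0

for batch in range(10):
    changed = {f"key{batch}": batch + 1}     # assume one key changes per batch
    state.update(changed)
    full_cost += serialized_size(state)      # complete snapshot every batch
    delta_cost += serialized_size(changed)   # incremental: only the changes

print(full_cost, delta_cost)
```

With a large state and few changes per batch, the full-snapshot cost scales with total state size, while the incremental cost scales only with what changed.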
State Management in Structured Streaming
Fundamental shift from Old Spark Streaming
● Decoupled from offsets/metadata checkpointing
● Asynchronous to Spark Tasks/Jobs
● Incremental State persistence
HDFS backed State Management
1. In-memory HashMap + HDFS
2. Versioned key-value store per
partition
3. Versioned delta file per partition
4. Partition task is scheduled on the same
executor that holds the previous state
5. Synchronous write to the HashMap and
the delta file output stream
6. Asynchronous daemon thread per
executor for snapshotting and for file
purging/deletion in HDFS
7. Only one thread in an executor can write
to a given delta file, but threads from
multiple executors can try to write to
the same delta file.
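The versioned delta and snapshot scheme listed above can be sketched as a toy store in plain Python. This is not Spark's implementation: local JSON files stand in for HDFS, the class name `DeltaStateStore` and the file names are illustrative, key removal is left out, and the snapshot runs on demand rather than in a background daemon thread.

```python
import json
import os
import tempfile


class DeltaStateStore:
    """Toy versioned key-value store for one partition."""

    def __init__(self, directory):
        self.directory = directory
        self.version = 0
        self.state = {}    # in-memory map holding the latest version
        self.pending = {}  # changes made since the last commit

    def put(self, key, value):
        self.state[key] = value
        self.pending[key] = value

    def commit(self):
        """Write only the changed keys as the next version's delta file."""
        self.version += 1
        self._write(f"{self.version}.delta", self.pending)
        self.pending = {}
        return self.version

    def snapshot(self):
        """Maintenance step: persist the full map for the current version."""
        self._write(f"{self.version}.snapshot", self.state)

    def _write(self, name, data):
        with open(os.path.join(self.directory, name), "w") as f:
            json.dump(data, f)

    @classmethod
    def restore(cls, directory, version):
        """Load the newest snapshot at or below `version`, then replay deltas."""
        store = cls(directory)
        base = 0
        for v in range(version, 0, -1):
            path = os.path.join(directory, f"{v}.snapshot")
            if os.path.exists(path):
                with open(path) as f:
                    store.state = json.load(f)
                base = v
                break
        for v in range(base + 1, version + 1):
            with open(os.path.join(directory, f"{v}.delta")) as f:
                store.state.update(json.load(f))
        store.version = version
        return store


workdir = tempfile.mkdtemp()
store = DeltaStateStore(workdir)
store.put("a", 1)
store.commit()      # version 1
store.put("b", 2)
store.commit()      # version 2
store.snapshot()    # full map written for version 2
store.put("a", 5)
store.commit()      # version 3
recovered = DeltaStateStore.restore(workdir, 3)
```

Restoring version 3 reads the version 2 snapshot and replays only the version 3 delta, so recovery does not need every delta since the start of the query.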
Code Entities in HDFS backed State Management
● StatefulOperators
○ define the computation logic to be executed against the state store for the set of rows in a partition
● StateStoreOps
○ prepares a StateStoreRDD that runs the computation logic passed by the stateful operator
against the state store
● storeUpdateFunction
○ contains the computation logic defining what to do against the state store with the data generated in a
partition task
● HDFSBackedStateStore
○ concrete implementation of the state store using a concurrent hash map, backed by the HDFS file system
for persistence
● HDFSBackedStateStoreProvider
○ contains methods to get a given store and to run maintenance tasks (snapshotting, purging,
deleting files, cleaning old state)
● StateStoreCoordinator
○ ensures the task for a partition is scheduled on an executor where its last versioned state is
held in the hash map
Code Flow of Stateful Structured Streaming
Quiz Time
Possible Issues with
the HDFS backed
implementation in
production ?
● State is constrained by executor
memory
● The same executor memory is shared
with RDD computation
● A single daemon thread is responsible for
snapshotting all state hash maps,
cleaning up files, etc.
Food for Thought
In-Memory HashMap: Possible Solution?
Embedded/Local Store : RocksDB
● Key-Value embedded data store
● Improved fork of LevelDB, open-sourced by
Facebook
● Bring Database close to Processing
● Pros :
○ No Memory Issues (HashMap)
○ No Network Latency (Cassandra)
○ Fast writes : Buffer + Sequential Transaction Log
○ Isolation
● Cons
○ Not Distributed
○ Not Replicated
○ Overhead of maintenance, non-JVM memory
● Architecture
○ Memtable : in-memory buffer
○ Change Log
○ SST files on disk
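The three architecture pieces can be sketched as a toy log-structured store in plain Python. This is not RocksDB's API: the class `TinyLSM` is hypothetical, Python lists stand in for the on-disk change log and SST files, and compaction is omitted.

```python
import bisect


class TinyLSM:
    """Toy store with a change log, a memtable, and sorted immutable runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.change_log = []  # sequential log, appended on every write
        self.memtable = {}    # in-memory write buffer
        self.sstables = []    # flushed sorted runs, newest last

    def put(self, key, value):
        self.change_log.append((key, value))  # cheap sequential append
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        """Turn the full memtable into a sorted, immutable run."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.change_log = []  # flushed entries no longer need the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest run wins
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # memtable is full, so it is flushed to a sorted run
db.put("a", 3)  # newer value stays in the memtable and shadows the run
```

Writes touch only the log and the buffer, which is the "Buffer + Sequential Transaction Log" point under Pros; reads check the buffer first and then the sorted runs from newest to oldest.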
RocksDB in Streaming Systems
● Apache Flink
https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html
● Apache Samza
https://samza.apache.org/learn/documentation/0.7.0/container/state-management.html
● Kafka Streams
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management
Summary
● What is Stateful Processing and State in Streaming
● Architecture of State Management in Stateful processing of Structured
Streaming
● Code Example
● Why Embedded Store like RocksDB is so important in Stream Processing
Thank You. Questions?
Qubole Blog : https://www.qubole.com/blog/

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 

State management in Structured Streaming

  • 1. State Management in Structured Streaming Chandan Prakash
  • 2. Agenda
      ● Structured Streaming: Brief Intro
      ● Types of Stream Processing: Stateless vs Stateful
      ● State in Stream Processing
      ● State Store in Stream Processing
      ● State Management in Old Spark Streaming
      ● State Management in Structured Streaming
      ● Demo with Code Example
      ● Quiz, Food for Thought
  • 3. What does this picture represent? Image Source: google
  • 4. Batch Processing, Stream Processing. Image Source: google
  • 5. Structured Streaming: Brief Intro. Image Source: google
      ● Built on the Spark SQL engine
      ● Illusion: stream of incoming data as an unbounded Input Table, processing logic as a SQL query, output of processing as a Results Table
      ● Internally the query gets converted into incremental micro-batch processing
  • 6. Structured Streaming Query Example
  • 7. Types of Stream Processing
      ● Stateless Streaming
          ○ Processing of every record is independent
          ○ Operations like map, filter
      ● Stateful Streaming
          ○ Processing of a record depends on previous records
          ○ Operations like aggregating the count of records per distinct key, deduplicating records
  • 8. State in Stream Processing
      ● State of Streaming Progress
          ○ Metadata of stream processing: offsets
          ○ Keeps track of how much data has been processed so far
          ○ Needed for fault tolerance
          ○ Present in both stateless and stateful processing
      ● State of Data
          ○ Intermediate data information between records
          ○ Operations like aggregation, deduplication
          ○ Present in stateful processing
      Note: When we say “State”, in general it means the state of data for processing. The other one is called metadata/offsets.
  • 9. State Store in Streaming
      ● Reliable place providing read and write of intermediate data (state)
      ● Can sustain streaming failures and restore processing from the same point
      ● Options: In-memory, File Systems, Storage Systems
      In-Memory HashMap
  • 10. State Management in Old (DStream) Spark Streaming
      ● RDD-based streaming
      ● Inefficient, flawed design
          ○ State persisted together with offset metadata
          ○ Complete snapshot persisted every micro-batch
          ○ Tightly coupled to, and synchronous with, Spark RDD tasks
          ○ No provision for incremental state persistence
          ○ Processing overhead becomes a bottleneck as state grows
  • 11. State Management in Structured Streaming: a fundamental shift from old Spark Streaming
      ● Decoupled from offsets/metadata checkpointing
      ● Asynchronous to Spark tasks/jobs
      ● Incremental state persistence
  • 12. HDFS-backed State Management
      1. In-memory HashMap + HDFS
      2. Versioned key-value store per partition
      3. Versioned delta file per partition
      4. Partition task is scheduled on the executor that holds the previous state
      5. Synchronous write to the HashMap and to the delta file output stream
      6. Asynchronous daemon thread per executor for snapshotting and file purging/deletion in HDFS
      7. Only one thread in an executor can write to a delta file, but threads from multiple executors can try to write to the same delta file
  • 13. Code Entities in HDFS-backed State Management
      ● StatefulOperators
          ○ define the computation logic to be executed against the state store with the set of rows in a partition
      ● StateStoreOps
          ○ prepares a StateStoreRDD for doing computations against the state store with the computation logic passed by the stateful operator
      ● storeUpdateFunction
          ○ contains the computation logic defining what to do against the state store with the data generated in a partition task
      ● HDFSBackedStateStore
          ○ concrete implementation of the state store using a concurrent hashmap, backed by the HDFS file system for persistence
      ● HDFSBackedStateStoreProvider
          ○ contains methods to get a given store and to execute maintenance tasks (snapshotting, purging, deleting files, cleaning old states)
      ● StateStoreCoordinator
          ○ ensures the task for a partition gets scheduled on an executor where its last versioned state is maintained in the hashmap
  • 14. Code Flow of Stateful Structured Streaming
  • 15. Quiz Time: Possible issues with the HDFS-backed implementation in production?
  • 16. Quiz Time: Possible issues with the HDFS-backed implementation in production?
      ● State is constrained by executor memory
      ● The same executor memory has to be shared with RDD computation
      ● A single daemon thread is responsible for snapshotting entire state hashmaps, file cleanup, etc.
  • 17. Food for Thought: In-Memory HashMap. Possible solution?
  • 18. Embedded/Local Store: RocksDB. Image Source: google
      ● Key-value embedded data store
      ● Improved LevelDB, open sourced by Facebook
      ● Brings the database close to processing
      ● Pros
          ○ No memory issues (HashMap)
          ○ No network latency (Cassandra)
          ○ Fast writes: buffer + sequential transaction log
          ○ Isolation
      ● Cons
          ○ Not distributed
          ○ Not replicated
          ○ Overhead of maintenance, non-JVM memory
      ● Architecture
          ○ Memtable: in-memory buffer
          ○ Change log
          ○ SST table on disk
  • 19. RocksDB in Streaming Systems
      ● Apache Flink: https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html
      ● Apache Samza: https://samza.apache.org/learn/documentation/0.7.0/container/state-management.html
      ● Kafka Streams: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management
  • 20. Summary
      ● What is stateful processing and state in streaming
      ● Architecture of state management in stateful processing of Structured Streaming
      ● Code example
      ● Why an embedded store like RocksDB is so important in stream processing
  • 21. Thank You. Questions? Qubole Blog : https://www.qubole.com/blog/
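
The HDFS-backed design on slide 12 (in-memory hashmap, one versioned delta file per micro-batch, occasional snapshots) can be illustrated with a minimal sketch. This is not Spark's implementation: the class name `DeltaBackedStateStore`, the JSON file format, and the local directory standing in for HDFS are all assumptions made for illustration, and the snapshot is triggered by an explicit call rather than by a background daemon thread.

```python
import json
import os
import tempfile


class DeltaBackedStateStore:
    """Toy versioned key-value store for one partition (hypothetical, not Spark code)."""

    def __init__(self, directory):
        self.directory = directory  # local directory standing in for HDFS
        self.version = 0
        self.state = {}  # in-memory hashmap with the latest committed state

    def _path(self, version, kind):
        return os.path.join(self.directory, f"{version}.{kind}")

    def commit(self, updates, removals=()):
        """Apply one micro-batch and persist only the changes as a delta file."""
        new_version = self.version + 1
        delta = {"put": updates, "remove": list(removals)}
        with open(self._path(new_version, "delta"), "w") as f:
            json.dump(delta, f)
        self.state.update(updates)
        for key in removals:
            self.state.pop(key, None)
        self.version = new_version
        return new_version

    def snapshot(self):
        """Maintenance step: write the full map so older deltas need not be replayed."""
        with open(self._path(self.version, "snapshot"), "w") as f:
            json.dump(self.state, f)

    @classmethod
    def restore(cls, directory, version):
        """Rebuild the state at `version` from the latest snapshot plus later deltas."""
        store = cls(directory)
        start = 0
        for v in range(version, 0, -1):
            if os.path.exists(store._path(v, "snapshot")):
                with open(store._path(v, "snapshot")) as f:
                    store.state = json.load(f)
                start = v
                break
        for v in range(start + 1, version + 1):
            with open(store._path(v, "delta")) as f:
                delta = json.load(f)
            store.state.update(delta["put"])
            for key in delta["remove"]:
                store.state.pop(key, None)
        store.version = version
        return store


workdir = tempfile.mkdtemp()
store = DeltaBackedStateStore(workdir)
store.commit({"a": 1, "b": 1})          # version 1
store.commit({"a": 2})                  # version 2
store.snapshot()                        # full copy of version 2
store.commit({"c": 1}, removals=["b"])  # version 3
recovered = DeltaBackedStateStore.restore(workdir, 3)
print(recovered.state)
```

Each commit writes only what changed in that batch, which is the incremental persistence contrasted with the full per-batch snapshot of the old DStream design; recovery cost is bounded by how often snapshots are taken.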

Editor's notes

  1. How many of you have some idea about streaming, have worked on any streaming system, or understand the term “state management”? This should be useful for every one of you. State is information about past input that can be used to influence the processing of future input; we will see this in detail. Feel free to ask questions at any point during the presentation.
  2. Why would you like to listen to this? Although the talk is specific to Spark Structured Streaming, the design, architecture, concepts and thought process behind why it is there and what is there will give you a good understanding of any streaming technology. They are all like distant cousins of the same family, and you will see many overlaps between different streaming systems. Understanding one helps you understand the others. Many of them copy, or say are inspired by, each other. This will give you the perspective of a streaming engine developer.
  3. *Quick question: What do you infer from this picture?
  4. *This pretty much sums up the difference between batch and stream processing. Batch is data at rest: you take a chunk of data each time you process. In streaming you keep getting data, and you need to process it as and when it comes.
  5. We will see a running version of this example on a Qubole Notebook after understanding state management. START THE CLUSTER. The objective of showing this code example is to give you an idea of stateful processing, so that when we talk about state management you can relate to it and understand it easily.
  6. Having given a rough idea of Structured Streaming, let's start with the actual topic we want to discuss today. By analogy to SQL, the select and where clauses of a query are usually stateless, but join, group by and aggregation functions like sum and count require state.
  7. Intermediate information in stream processing State of progress: offsets/commits
  8. Often easy to understand when compared with the predecessor; evolution is a constant process, and something new comes because of the limitations of the old. Story about my experience with stateless stream processing, maintaining offsets in ZooKeeper.
  9. This is the main part of the talk, which I want to go into in detail.
  10. Prepared a diagram based on my understanding of the internal code and how it works in the upcoming Spark 2.4. It is very important to note that concepts like incremental checkpointing and asynchronous state management are not specific to Spark Streaming. You will find them in other streaming systems like Flink, etc., under different names.
  11. Slide for those interested in checking out the code themselves: the classes/interfaces/methods involved in state management. I won't go into detail; instead I will show the code flow of state management in the next slide.
  12. The stateful operator is where the logic to interact with the state store resides. Show code.
  13. Before I go forward, do you have any questions here? Because now I have a question for you.
  14. Do you see any possible issues with this architecture? Honestly, I have not encountered any issues, but let's discuss what the possible issues with this approach could be.
  15. Go back to architecture diagram
  16. I had intentionally not talked about RocksDB at the start; now is the time. I really wanted to talk about this embedded storage, or local persistent store.
  17. Why embedded storage? It became popular in the flash memory/SSD era, when writing to local disks became much faster than the client-server model of going over the network to storage systems. Sequential read/write: analogy of an airport conveyor belt for spinning disks, where latency comes from rotation and from the seek time to reach the right sector of the data. Hadoop was about moving processing closer to data; RocksDB is about moving the database closer to processing. Improved LevelDB: multithreaded writes and compaction, Bloom filter support for scans when reading data, improved compaction logic similar to HBase.
  18. RocksDB is present in almost every recent streaming system that needs to keep unlimited state without the penalty of a network call. Storm: currently does not use local storage like RocksDB; it still relies on remote stores like Redis, HBase and Cassandra. Samza: features at LinkedIn, such as the personalized feed sent to your wall, are decided after joining a lot of information with the available feed using Samza. Kafka and Samza were written by the same people at LinkedIn, who later went on to found a company called Confluent, where they wrote Kafka Streams. So you will find many similarities.
  19. As said in the beginning, understanding one system helps us understand the others. Understanding RocksDB is one example; incremental checkpointing, snapshotting and asynchronous state management are other such concepts. Technologies and implementations might differ, but after all they are trying to solve similar problems of the distributed world, and the same challenges, limitations and expectations, like fault tolerance and exactly-once processing, will be there everywhere.
  20. Please keep a close watch on Qubole Engineering. We write a lot of interesting stuff on big data in the cloud: Spark, the open-sourced SparkLens, tuning, Hive, Presto, AWS.
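
The embedded-store architecture listed on slide 18 (memtable as in-memory buffer, change log, SST table on disk) can be sketched roughly as below. This is a toy illustration, not RocksDB's API or file format: `TinyEmbeddedStore`, the `memtable_limit` parameter and the JSON files are hypothetical, and real features such as compaction, Bloom filters and log replay after a crash are left out.

```python
import json
import os
import tempfile


class TinyEmbeddedStore:
    """Toy log-structured key-value store (hypothetical, not RocksDB code)."""

    def __init__(self, directory, memtable_limit=2):
        self.directory = directory
        self.memtable_limit = memtable_limit
        self.memtable = {}   # in-memory write buffer
        self.sst_files = []  # immutable sorted files on disk, oldest first
        self.log_path = os.path.join(directory, "change.log")
        self.log = open(self.log_path, "a")

    def put(self, key, value):
        # Sequential append to the change log first, then buffer in memory.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the buffer as a sorted, immutable file and start a fresh log.
        path = os.path.join(self.directory, f"{len(self.sst_files):06d}.sst")
        with open(path, "w") as f:
            json.dump(sorted(self.memtable.items()), f)
        self.sst_files.append(path)
        self.memtable = {}
        self.log.close()
        self.log = open(self.log_path, "w")

    def get(self, key):
        # Newest data wins: check the memtable, then files from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.sst_files):
            with open(path) as f:
                for k, v in json.load(f):
                    if k == key:
                        return v
        return None

    def close(self):
        self.log.close()


db = TinyEmbeddedStore(tempfile.mkdtemp(), memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # buffer is full, so it is flushed to the first sorted file
db.put("a", 3)  # newer value stays in the memtable and shadows the file
print(db.get("a"), db.get("b"), db.get("missing"))
db.close()
```

Writes touch only memory and an append-only file, which is the "buffer + sequential transaction log" point from the slide; state size is bounded by local disk rather than by executor heap.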