SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Downloaden Sie, um offline zu lesen
REX Real-time Data Pipeline:
Kafka Streams / Kafka Connect versus Spark Streaming
EL ARIB Abdelhamide , Data Engineer at
OUTLINE
• The Challenge
• Brief presentation of Spark & Kafka
• Kafka vs Spark :
1. New File Detection
2. Processing
3. Handling Failure
4. Deployment
5. Scaling
6. Monitoring
• 10 points not to forget
THE CHALLENGE
HDFS is our source of truth
Need to stream very big data (parquet
or Avro) into Kafka
Different schemas for each source
Need to scale up/down the speed of
streaming
Exactly-Once Copy (Hdfs & Kafka)
And the monitoring is, as always, a must
HDFS
/client1/crm
/client1/tracking
/client1/transactions
/client2/mainCrm
/client2/payments
….
BRIEF PRESENTATION OF SPARK & KAFKA
APACHE SPARK
Distributed, data processing, in-memory engine for batch & streaming mode.
Apache Spark Core
Standalone
Scheduler
MesosYarn k8s …Cluster Manager
Spark SQL
Spark
Streaming
Spark MLlibGraphXLibraries
SPARK STREAMING
Receiver
Records
Micro Batches
(RDDs)
Apache Spark Core
Standalone
Scheduler
MesosYarn k8s …
End-to-end latency ~ 100 ms
Exactly once guarantee
SPARK STRUCTURED STREAMING
Whenever you can, use Dataframe instead of Spark’s RDD primitive
DataFrames get the benefit of the Catalyst query optimizer.
Multiple types of Triggers
Receiver
Records
Micro Batches
(DataFrames)
Apache Spark Core
Standalone
Scheduler
MesosYarn k8s …
(Experimental in Spark 2.3) End to end latency ~ 1 ms with at least once guarantee
Spark SQL
(Catalyst)
KAFKA/ KAFKA STREAMS/ KAFKA CONNECT
Data
Source
Kafka
Connect
Kafka
Topic
Kafka
Streams
KAFKA CONNECT
Kafka Connect is a framework to stream data into & out of Kafka using connectors.
Two types of connectors: Sinks (Export) & Sources (Import)
poll.interval.ms (default 5ms)
Kafka Connect Cluster
Worker 1 Worker 2 Worker 3
Connector Elastic Sink
Instance
Conn ES, Task 1
Partitions: 1,2
Conn Elastic Sink, Task 2
Partitions: 3,4
Conn Elastic Sink, Task 3
Partitions: 5,6
1. NEW FILE DETECTION
1. NEW FILE DETECTION - SPARK
Can detect new files within a dir
Should have the same schema
Can’t create one stream for all source of the client X
N clients with M Source => N*M streams.
1 executor at least for each stream
HDFS
/client1/crm
/client1/tracking
/client1/transactions
…
1. NEW FILE DETECTION – KAFKA
1. NEW FILE DETECTION - HDFS-WATCHER
(INOTIFY)
/client1/crm
/client1/tracking
/client1/transactions
/client2/mainCrm
/client2/payments
….
HDFS-WATCHER
HDFS
Kafka
Topic
(Avro)
(source, client) -> path
Scanning for new files
(exclude dirs with _temporary)
2. PROCESSING
2. PROCESSING - SPARK
Read file as stream
Read partition by partition
At least once when writing to Kafka
2. PROCESSING - KAFKA
HDFS
Task 1
Kafka Connect
Kafka
Producer
Client_source
Topic
hdfs-source-
offset
Topic
GlobalKtable
(path, lastLineCommitted)
HDFS Watcher
Schema
Registry
poll()commitRecords()
(path, newLastLineCommitted)
FileDetection
Kafka topic
((source, client) -> filePath)
Get n (=batch.size) lines
Consuming new files to process & get keyPartitionKey from the metadata in their schemas
Kafka Connect Framework
Tracking the offset so we can handle the failure
1
2
3 5
6
7
8
4
3. HANDLING FAILURE
3. HANDLING FAILURE - SPARK
The HDFS receiver is not reliable receiver
.option("checkpointLocation", ”hdfs://checkpointPath") (Expensive)
spark.streaming.receiver.writeAheadLog.enable
spark.task.maxFailures = 0
trait StreamingListener {
…
/** Called when a receiver has reported an error */
def onReceiverError(receiverError: StreamingListenerReceiverError) { }
…
}
3. HANDLING FAILURE - KAFKA
GlobalKtable that track the last line commited.
KIP-298: Error Handling in Connect
• errors.retry.timeout
• errors.retry.delay.max.ms
• errors.tolerance (in a task) = none
Kafka producer config:
• acks = 1 (ou all)
• Retries > 0
• max.in.flight.requests.per.connection > 0( no need to preserve order)
• Batch.size (producer config) < number of lines * size of a single line
• linger.ms = 0 (Send the record as they arrive)
4. DEPLOYMENT
4. DEPLOYMENT - SPARK
Standalone
Cluster mode: Using spark-submit, the driver is in the cluster
Client mode : Start the app. The driver is in the Client machine.
4. DEPLOYMENT - KAFKA
Packager the connector (Classpath issues warning)
Make it in /plugins
Start Kafka connect cluster
Run via the The rest API
5. SCALING UP/DOWN
5. SCALING UP/DOWN- SPARK
Dynamic Allocation
Min*, max & init resources
Can’t scale down manually the resources for executor without stopping the driver.
Can’t scale up/down the number of executors manually.
5. SCALING UP/DOWN- KAFKA
Scale up/down easily the number of task.
Only Static allocation for the cluster resources.
Can’t update the cluster/workers resources without stopping it.
6. MONITORING
6. MONITORING - SPARK
Spark ui
Spark ui rest API : http://localhost:4040/api/v1
JMX with DropWizard
6. MONITORING - KAFKA
Kafka connect REST API : to monitor Tasks
Jmx metrics
Connector Metrics
Task Metrics Worker
Metrics Worker
Rebalance Metrics
See KIP-196
10 POINTS TO NOT FORGET (FROM OUR EXPERIENCE)
1. It depends on what kind of datasource & datasink you’re using.
2. Always check the Receiver (Spark Streaming) & The connector (Kafka Connect)
3. The monitoring is pretty good for both
4. Integration Test is challenging with Kafka Connect
5. You want to add more tasks, go for Kafka Connect
6. You want dynamic allocation, go for Spark Streaming
7. Deploying Spark Streaming app is simple
8. Naming conventions in Kafka Connect Framework may be confusing with the naming in Kafka core
9. You prefer conf over code, go for Kafka Connect
10. You want a distributed, copy from or to Kafka, without using a cluster Manager : Go for Kafka Connect
AS ALWAYS IT REALLY DEPENDS ON YOUR NEEDS ☺
THANK YOU

Weitere ähnliche Inhalte

Was ist angesagt?

Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovMaksud Ibrahimov
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Tuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsTuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsSamir Bessalah
 
How to scale your web app
How to scale your web appHow to scale your web app
How to scale your web appGeorgio_1999
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0Sigmoid
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...DataStax
 
TeraCache: Efficient Caching Over Fast Storage Devices
TeraCache: Efficient Caching Over Fast Storage DevicesTeraCache: Efficient Caching Over Fast Storage Devices
TeraCache: Efficient Caching Over Fast Storage DevicesDatabricks
 
Cassandra Summit 2014: Performance Tuning Cassandra in AWS
Cassandra Summit 2014: Performance Tuning Cassandra in AWSCassandra Summit 2014: Performance Tuning Cassandra in AWS
Cassandra Summit 2014: Performance Tuning Cassandra in AWSDataStax Academy
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is FailingDataWorks Summit
 
Flintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas ChammasFlintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas ChammasSpark Summit
 
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)Spark Summit
 
Scalding - Big Data Programming with Scala
Scalding - Big Data Programming with ScalaScalding - Big Data Programming with Scala
Scalding - Big Data Programming with ScalaTaewook Eom
 
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Yoshiyasu SAEKI
 

Was ist angesagt? (20)

Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Spark Tips & Tricks
Spark Tips & TricksSpark Tips & Tricks
Spark Tips & Tricks
 
Tuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsTuning tips for Apache Spark Jobs
Tuning tips for Apache Spark Jobs
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
How to scale your web app
How to scale your web appHow to scale your web app
How to scale your web app
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
 
TeraCache: Efficient Caching Over Fast Storage Devices
TeraCache: Efficient Caching Over Fast Storage DevicesTeraCache: Efficient Caching Over Fast Storage Devices
TeraCache: Efficient Caching Over Fast Storage Devices
 
Cassandra Summit 2014: Performance Tuning Cassandra in AWS
Cassandra Summit 2014: Performance Tuning Cassandra in AWSCassandra Summit 2014: Performance Tuning Cassandra in AWS
Cassandra Summit 2014: Performance Tuning Cassandra in AWS
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
Flintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas ChammasFlintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas Chammas
 
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)
 
Scalding - Big Data Programming with Scala
Scalding - Big Data Programming with ScalaScalding - Big Data Programming with Scala
Scalding - Big Data Programming with Scala
 
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
 

Ähnlich wie Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming

Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarTimothy Spann
 
Spark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim DowlingSpark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim DowlingSpark Summit
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN Jim Dowling
 
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU:  Deep Dive into Building Streaming Applications with Apache PulsarOSS EU:  Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU: Deep Dive into Building Streaming Applications with Apache PulsarTimothy Spann
 
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sigmoid
 
Apache Kafka with Spark Streaming: Real-time Analytics Redefined
Apache Kafka with Spark Streaming: Real-time Analytics RedefinedApache Kafka with Spark Streaming: Real-time Analytics Redefined
Apache Kafka with Spark Streaming: Real-time Analytics RedefinedEdureka!
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1Adam Muise
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Pat Patterson
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingDatabricks
 
Apache Kafka
Apache KafkaApache Kafka
Apache KafkaJoe Stein
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
 

Ähnlich wie Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming (20)

Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
 
Spark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim DowlingSpark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim Dowling
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN
 
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU:  Deep Dive into Building Streaming Applications with Apache PulsarOSS EU:  Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
 
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Spark 101
Spark 101Spark 101
Spark 101
 
Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)
 
Apache Kafka with Spark Streaming: Real-time Analytics Redefined
Apache Kafka with Spark Streaming: Real-time Analytics RedefinedApache Kafka with Spark Streaming: Real-time Analytics Redefined
Apache Kafka with Spark Streaming: Real-time Analytics Redefined
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 

Kürzlich hochgeladen

Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxolyaivanovalion
 

Kürzlich hochgeladen (20)

Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 

Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming

  • 1. REX Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming EL ARIB Abdelhamide , Data Engineer at
  • 2. OUTLINE • The Challenge • Brief presentation of Spark & Kafka • Kafka vs Spark : 1. New File Detection 2. Processing 3. Handling Failure 4. Deployment 5. Scaling 6. Monitoring • 10 points not to forget
  • 3. THE CHALLENGE HDFS is our source of truth Need to stream very big data (parquet or Avro) into Kafka Different schemas for each source Need to scale up/down the speed of streaming Exactly-Once Copy (Hdfs & Kafka) And the monitoring is, as always, a must HDFS /client1/crm /client1/tracking /client1/transactions /client2/mainCrm /client2/payments ….
  • 4. BRIEF PRESENTATION OF SPARK & KAFKA
  • 5. APACHE SPARK Distributed, data processing, in-memory engine for batch & streaming mode. Apache Spark Core Standalone Scheduler MesosYarn k8s …Cluster Manager Spark SQL Spark Streaming Spark MLlibGraphXLibraries
  • 6. SPARK STREAMING Receiver Records Micro Batches (RDDs) Apache Spark Core Standalone Scheduler MesosYarn k8s … End-to-end latency ~ 100 ms Exactly once guarantee
  • 7. SPARK STRUCTURED STREAMING Whenever you can, use Dataframe instead of Spark’s RDD primitive DataFrames get the benefit of the Catalyst query optimizer. Multiple types of Triggers Receiver Records Micro Batches (DataFrames) Apache Spark Core Standalone Scheduler MesosYarn k8s … (Experimental in Spark 2.3) End to end latency ~ 1 ms with at least once guarantee Spark SQL (Catalyst)
  • 8. KAFKA/ KAFKA STREAMS/ KAFKA CONNECT Data Source Kafka Connect Kafka Topic Kafka Streams
  • 9. KAFKA CONNECT Kafka Connect is a framework to stream data into & out of Kafka using connectors. Two types of connectors: Sinks (Export) & Sources (Import) poll.interval.ms (default 5ms)
  • 10. Kafka Connect Cluster Worker 1 Worker 2 Worker 3 Connector Elastic Sink Instance Conn ES, Task 1 Partitions: 1,2 Conn Elastic Sink, Task 2 Partitions: 3,4 Conn Elastic Sink, Task 3 Partitions: 5,6
  • 11. 1. NEW FILE DETECTION
  • 12. 1. NEW FILE DETECTION - SPARK Can detect new files within a dir Should have the same schema Can’t create one stream for all source of the client X N clients with M Source => N*M streams. 1 executor at least for each stream HDFS /client1/crm /client1/tracking /client1/transactions …
  • 13. 1. NEW FILE DETECTION – KAFKA
  • 14. 1. NEW FILE DETECTION - HDFS-WATCHER (INOTIFY) /client1/crm /client1/tracking /client1/transactions /client2/mainCrm /client2/payments …. HDFS-WATCHER HDFS Kafka Topic (Avro) (source, client) -> path Scanning for new files (exclude dirs with _temporary)
  • 16. 2. PROCESSING - SPARK Read file as stream Read partition by partition At least once when writing to Kafka
  • 17. 2. PROCESSING - KAFKA HDFS Task 1 Kafka Connect Kafka Producer Client_source Topic hdfs-source- offset Topic GlobalKtable (path, lastLineCommitted) HDFS Watcher Schema Registry poll()commitRecords() (path, newLastLineCommitted) FileDetection Kafka topic ((source, client) -> filePath) Get n (=batch.size) lines Consuming new files to process & get keyPartitionKey from the metadata in their schemas Kafka Connect Framework Tracking the offset so we can handle the failure 1 2 3 5 6 7 8 4
  • 19. 3. HANDLING FAILURE - SPARK The HDFS receiver is not reliable receiver .option("checkpointLocation", ”hdfs://checkpointPath") (Expensive) spark.streaming.receiver.writeAheadLog.enable spark.task.maxFailures = 0 trait StreamingListener { … /** Called when a receiver has reported an error */ def onReceiverError(receiverError: StreamingListenerReceiverError) { } … }
  • 20. 3. HANDLING FAILURE - KAFKA GlobalKtable that track the last line commited. KIP-298: Error Handling in Connect • errors.retry.timeout • errors.retry.delay.max.ms • errors.tolerance (in a task) = none Kafka producer config: • acks = 1 (ou all) • Retries > 0 • max.in.flight.requests.per.connection > 0( no need to preserve order) • Batch.size (producer config) < number of lines * size of a single line • linger.ms = 0 (Send the record as they arrive)
  • 22. 4. DEPLOYMENT - SPARK Standalone Cluster mode: Using spark-submit, the driver is in the cluster Client mode : Start the app. The driver is in the Client machine.
  • 23. 4. DEPLOYMENT - KAFKA Packager the connector (Classpath issues warning) Make it in /plugins Start Kafka connect cluster Run via the The rest API
  • 25. 5. SCALING UP/DOWN- SPARK Dynamic Allocation Min*, max & init resources Can’t scale down manually the resources for executor without stopping the driver. Can’t scale up/down the number of executors manually.
  • 26. 5. SCALING UP/DOWN- KAFKA Scale up/down easily the number of task. Only Static allocation for the cluster resources. Can’t update the cluster/workers resources without stopping it.
  • 28. 6. MONITORING - SPARK Spark ui Spark ui rest API : http://localhost:4040/api/v1 JMX with DropWizard
  • 29. 6. MONITORING - KAFKA Kafka connect REST API : to monitor Tasks Jmx metrics Connector Metrics Task Metrics Worker Metrics Worker Rebalance Metrics See KIP-196
  • 30. 10 POINTS TO NOT FORGET (FROM OUR EXPERIENCE) 1. It depends on what kind of datasource & datasink you’re using. 2. Always check the Receiver (Spark Streaming) & The connector (Kafka Connect) 3. The monitoring is pretty good for both 4. Integration Test is challenging with Kafka Connect 5. You want to add more tasks, go for Kafka Connect 6. You want dynamic allocation, go for Spark Streaming 7. Deploying Spark Streaming app is simple 8. Naming conventions in Kafka Connect Framework may be confusing with the naming in Kafka core 9. You prefer conf over code, go for Kafka Connect 10. You want a distributed, copy from or to Kafka, without using a cluster Manager : Go for Kafka Connect AS ALWAYS IT REALLY DEPENDS ON YOUR NEEDS ☺