Should you read Kafka as a stream or in batch? Batch processing vs streaming


This document discusses whether it is better to process data using a stream or batch approach. It describes how one company evolved their data pipeline from a micro-batch streaming process to a batch approach. The streaming process was very expensive, costing $400,000 per year to run. It also had issues with wasted resources during idle times, slow processing during bursts of data, and long recovery times from outages. The company rearchitected the process to use discrete time windows run in isolated batch jobs. This new batch approach reduced costs by 60% to $160,000 per year and improved processing efficiency and outage recovery times.


Should you read Kafka as a stream or in batch?

Stream
● Micro deliveries
● Many trips
● AND - Low waits

Batch
● Large deliveries
● Few trips
● BUT - Long waits
Photo by Charles J. Sharp

This is a story about a stream... gone batch

Agenda
o Project evolution
o Metrics and Results
o Insights for your projects

Who Are We?
Opher Dubrovsky, Big Data Dev Lead
Ido Nadler, Big Data Engineer
Data Pipelines, Serverless, Spark, Analytics

Nielsen Marketing Cloud
Cloud native
~6,000 nodes/day
~60 TB/day
5 PetaByte
~6 million/day
Create data for:
› Running campaigns
› Business decisions

Our Old Stream Project
Data Lake
AWS S3
Total: 60 TB/day
Biggest topic: 21 TB/day
Flow: Consume micro-batch → Transform → Write Parquet

Total Cost
~$400,000 per Year
Daily Costs ($)

[Chart: GB / Hour across three days; x-axis 0:00 to 22:00 in 2-hour steps for each day, y-axis ticks from 300 to 1,200]
Problems
Cluster is idle: wasted $$$

Data Bursts (& reprocessing)
Cluster is underpowered
Cluster Capacity
Long Queues!

Data Expiration
Processing has to be Quick

Slow Recovery == Ticking Timebomb
Risk to Your DATA !

Rearchitecting
Rearchitecting the stream to batch……

Using Discrete Time Slots
Timeline: 3:00 4:00 5:00 6:00 7:00 8:00 9:00
Task, Task, Task
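The idea of cutting the stream into discrete time slots can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function name `hourly_slots` and the example date are assumptions, and the one-hour slot size is taken from the timeline on the slide.

```python
from datetime import datetime, timedelta


def hourly_slots(start, end):
    """Split the half-open range [start, end) into consecutive one-hour slots."""
    slots = []
    cursor = start
    while cursor < end:
        # The last slot is clipped so it never runs past `end`.
        slot_end = min(cursor + timedelta(hours=1), end)
        slots.append((cursor, slot_end))
        cursor = slot_end
    return slots


# Hypothetical day; the hours 3:00 to 9:00 mirror the slide's timeline.
slots = hourly_slots(datetime(2024, 1, 1, 3), datetime(2024, 1, 1, 9))
for slot_start, slot_end in slots:
    print(f"{slot_start:%H:%M} - {slot_end:%H:%M}")
```

Each slot is a self-contained unit of work, so a task only needs its own start and end time to know what to read.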

Task (& Cluster) Independence
Processing hour: 3:00 4:00 5:00 6:00 7:00 8:00 9:00
Task duration: 1 hour, 2 hours, 3 hours
Some tasks are short
Some tasks >1 hour
No more paying for idle time!

Mechanics
Airflow AWS S3
{ period: 05:00 - 06:00 }
{ period: 06:00 - 07:00 }
{ period: 07:00 - 08:00 }
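One way to read this slide is that the scheduler hands each task a small `period` parameter and the task writes its output to S3. The sketch below builds such a parameter and a matching output location in plain Python. It does not use the Airflow API; the bucket name, topic name, and partition layout are all hypothetical, and only the `period` format comes from the slide.

```python
from datetime import datetime, timedelta


def period_payload(slot_start):
    """Build the per-task parameter in the slide's `{ period: HH:MM - HH:MM }` shape."""
    slot_end = slot_start + timedelta(hours=1)
    return {"period": f"{slot_start:%H:%M} - {slot_end:%H:%M}"}


def output_prefix(slot_start, bucket="example-data-lake", topic="example_topic"):
    """Hypothetical S3 layout: one prefix per topic, date, and hour."""
    return f"s3://{bucket}/{topic}/date={slot_start:%Y-%m-%d}/hour={slot_start:%H}/"


slot = datetime(2024, 1, 1, 5)  # hypothetical date, 05:00 slot from the slide
print(period_payload(slot))
print(output_prefix(slot))
```

Writing each period to its own prefix is an assumption made here so that a rerun of the same period can overwrite its earlier output without touching other hours.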

Comparing to Previous System
Cluster Capacity
Wasted
Processing
Efficiency was ONLY 30%!
New System is 60% Cheaper!

From Micro-Batch to Batch
Regular runs: 1 hour each
Missed run
Rerun: 2 hours
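The slide suggests that after a missed run, the next run covers two hours instead of one. A possible way to get that behavior is to always process from the end of the last completed period up to the latest full hour. This is one interpretation of the diagram, not a confirmed description of the authors' system; the function name `next_period` and the dates are hypothetical.

```python
from datetime import datetime


def next_period(last_done_end, now):
    """Return (start, end) covering every full hour since the last completed
    period, or None when no full hour has elapsed yet."""
    latest_full_hour = now.replace(minute=0, second=0, microsecond=0)
    if latest_full_hour <= last_done_end:
        return None
    return (last_done_end, latest_full_hour)


last_done = datetime(2024, 1, 1, 5)  # hypothetical: everything up to 05:00 is processed

# Normal case: the run shortly after 06:00 covers one hour.
print(next_period(last_done, datetime(2024, 1, 1, 6, 10)))
# Missed run: nothing ran after 06:00, so the run after 07:00 covers two hours.
print(next_period(last_done, datetime(2024, 1, 1, 7, 5)))
```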

Pay as You Go
Bytes per Hour
$ per Hour
$400k/year → $160k/year
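The arithmetic behind these figures can be checked with a short sketch. The yearly costs come from the slides. The hourly load values and the unit rate are made-up numbers, used only to show why a cluster sized for the peak hour costs more than one that scales with the data.

```python
def percent_reduction(old, new):
    """Percentage by which `new` is lower than `old`."""
    return (old - new) * 100 / old


def fixed_cluster_cost(hourly_gb, rate):
    """Fixed cluster: provisioned for the peak hour and paid for every hour."""
    return max(hourly_gb) * rate * len(hourly_gb)


def pay_as_you_go_cost(hourly_gb, rate):
    """Pay as you go: each hour costs in proportion to the data it processes."""
    return sum(hourly_gb) * rate


# Figures from the slides: $400k/year before, $160k/year after.
print(percent_reduction(400_000, 160_000))

# Hypothetical load profile in GB per hour and a hypothetical unit rate.
load = [300, 600, 1200, 900]
fixed = fixed_cluster_cost(load, rate=1.0)
scaled = pay_as_you_go_cost(load, rate=1.0)
print(fixed, scaled, scaled / fixed)
```

The last ratio is the share of the fixed cluster's capacity that does useful work under this toy profile; the rest is paid for but idle.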

Outage Handling
Old System: Outage start → 6 hours → Recovery completed
New System: Outage start → 40 min → Recovery completed
5 concurrent isolated workers
~85% Faster Recovery
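Recovery with several concurrent workers can be sketched as fanning the backlog of time slots out to a small pool. This is an illustration only: threads stand in for the isolated workers, `process_slot` is a hypothetical placeholder for the real consume, transform, and write step, and the worker count of 5 is taken from the slide.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def process_slot(slot):
    """Hypothetical placeholder for consuming, transforming, and writing one slot."""
    return f"done {slot}"


def recover(backlog, workers=5):
    """Process every backlog slot, running up to `workers` slots at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input slots.
        return list(pool.map(process_slot, backlog))


# Hypothetical backlog: six hourly slots missed during an outage.
backlog = [f"{hour:02d}:00" for hour in range(3, 9)]
results = recover(backlog)
print(results)

# If every slot took about the same time, the backlog would clear in this many rounds.
rounds = math.ceil(len(backlog) / 5)
print(rounds)
```

Because the slots do not depend on each other, adding workers shortens recovery without changing the per-slot logic.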

Conclusions and Insights
o Streaming is expensive
o Fixed clusters are never perfect
    o Off hours - $$$
    o High loads - Slow
o Recovery is critical
o Isolation and parallelism allow you to
    o Easily scale
    o Save on costs
    o Deal with loads

You Can Reach Us At:
Opher Dubrovsky /opher-dubrovsky
Ido Nadler /ido-nadler


1. Should you read Kafka as a stream or in batch?

2. Stream ● Micro deliveries ● Many trips ● AND - Low waits

3. Batch Photo by Charles J. Sharp ● Large deliveries ● Few trips ● BUT - Long waits

4. This is a story about a stream ….. gone batch

5. o Project evolution o Metrics and Results o Insights for your projects Agenda

6. Who Are We ? Opher Dubrovsky Big Data Dev Lead Ido Nadler Big Data Engineer Data Pipelines, Serverless, Spark, Analytics

7. Nielsen Marketing Cloud Cloud native ~6,000 nodes/day ~60 TB/day 5 PetaByte ~6 million/day Create data for: › Running campaigns › Business decisions

8. Our Old Stream Project Data Lake AWS S3 Total: 60 TB/day Biggest topic: 21 TB/day Flow: Consume micro-batch → Transform → Write Parquet

9. Reality Check Photo by Olia Nayda on

10. Total Cost ~$400,000 per Year Daily Costs ($)

11. [Chart: GB / Hour across three days; x-axis 0:00 to 22:00 in 2-hour steps for each day, y-axis ticks from 300 to 1,200] Problems: Cluster is idle, wasted $$$

12. Data Bursts (& reprocessing) Cluster is underpowered Cluster Capacity Long Queues !

13. Data Expiration Processing has to be Quick

14. Slow Recovery == Ticking Timebomb Risk to Your DATA !

15. Rearchitecting Rearchitecting the stream to batch……

16. Spread the work across isolated workers

17. Using Discrete Time Slots time 3:00 4:00 5:00 6:00 7:00 8:00 9:00 Task Task Task

18. Task (& Cluster) Independence Processing Hour 4:00 5:00 6:00 8:00 3:00 7:00 9:00 1 hour 2 hours 3 hours Some tasks are short No more paying for idle time ! Some tasks >1 hour

19. Mechanics Airflow AWS S3 { period: 05:00 - 06:00 } { period: 06:00 - 07:00 } { period: 07:00 - 08:00 }

20. Comparing to Previous System Cluster Capacity Wasted Processing Efficiency was 30% ONLY !! New System is 60% Cheaper !

21. Results

22. From Micro-Batch to Batch 1 hour 1 hour 1 hour 1 hour 1 hour Missed run Rerun 2 hours

23. Pay as You Go Bytes per Hour $ per Hour $400k/year → $160k/year

24. Outage Handling Outage start 6 hours Recovery completed Old System New System Outage start Recovery completed ~85% Faster Recovery 40 min 5 Concurrent isolated workers

25. Summary

26. Conclusions and Insights o Streaming is expensive o Fixed clusters are never perfect o Off hours - $$$ o High loads - Slow o Recovery is critical o Isolation and parallelism allow you to o Easily scale o Save on costs o Deal with loads

27. You Can Reach Us At: Opher Dubrovsky /opher-dubrovsky Ido Nadler /ido-nadler

28. Thank You