©2022, Imply
Building an Event Analytics Pipeline
with Kafka, ksqlDB, and Druid
Hellmar Becker, Senior Sales Engineer
About Me
hellmar.becker@imply.io
https://www.linkedin.com/in/hellmarbecker/
https://blog.hellmar-becker.de/
Hellmar Becker
Sr. Sales Engineer at Imply
Lives near Munich
Agenda
● The Case for Streaming Analytics
● How to Prepare Your Data: Streaming ETL
● How to Analyze Your Data: Streaming Analytics
● Apache Druid - A Streaming Analytics Database
● K2D - A Streaming Analytics Architecture
● Live Demo
● Q&A
The Case for Streaming Analytics
● Analytics - "the process of discovering, interpreting, and communicating significant
patterns in data."
● OLAP = Online Analytical Processing
● Classical: Source Data → Transactional Database (OLTP) → Batch ETL → Analytical Database (OLAP) → Client
But that's not enough anymore!
The Case for Streaming Analytics (contd.)
● The Big Data Hype gave us the Lambda Architecture
● Separate paths for batch and realtime
● One common serving layer
● Complex, hard to reconcile
Image source: https://www.ericsson.com/en/blog/2015/11/data-processing-architectures--lambda-and-kappa
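To make the "hard to reconcile" point concrete, here is a minimal sketch in plain Python, not tied to any real framework. The function names, the `page` field, and the sample events are all hypothetical. It assumes the batch layer and the speed layer each compute click counts on their own, and the serving layer merges the two views.

```python
from collections import Counter

def batch_layer(events):
    """Slow, complete path: recompute counts over all archived events."""
    return Counter(e["page"] for e in events)

def speed_layer(events):
    """Fast path: count only events that arrived after the last batch run.

    Note that this duplicates the batch logic in a second code path,
    which has to be kept consistent with the first one by hand.
    """
    return Counter(e["page"] for e in events)

def serving_layer(batch_view, realtime_view, page):
    """Single entry point for clients: merge both views for one key."""
    return batch_view[page] + realtime_view[page]

# Hypothetical sample data: archived events and events since the last batch.
archived = [{"page": "/home"}, {"page": "/home"}, {"page": "/news"}]
recent = [{"page": "/home"}]

batch_view = batch_layer(archived)
realtime_view = speed_layer(recent)
home_clicks = serving_layer(batch_view, realtime_view, "/home")
print(home_clicks)
```

If an event were counted in both views, or the two functions drifted apart, the merged result would be wrong. That is the reconciliation effort the slide refers to.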
The Case for Streaming Analytics (contd.)
● 2014 Jay Kreps: Kappa Architecture
● Avoids having separate code paths for batch and streaming
Image source: https://www.ericsson.com/en/blog/2015/11/data-processing-architectures--lambda-and-kappa
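The single code path can be sketched as follows. This is an illustration in plain Python under stated assumptions: the in-memory `event_log` list merely stands in for a replayable log such as a Kafka topic, and all names and sample events are hypothetical.

```python
from collections import Counter

def process(state, event):
    """The one and only processing path, used for live and replayed events."""
    state[event["page"]] += 1
    return state

event_log = []          # stands in for a replayable log such as a Kafka topic
live_state = Counter()  # the (near) real-time result

def ingest(event):
    event_log.append(event)     # retain the event for later replay
    process(live_state, event)  # update the live result immediately

for e in [{"page": "/home"}, {"page": "/news"}, {"page": "/home"}]:
    ingest(e)

# Reprocessing history means replaying the same log through the same code.
replayed_state = Counter()
for e in event_log:
    process(replayed_state, e)

print(live_state == replayed_state)
```

Because live and historical results come from the same function, there is no second result set to reconcile.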
How to prepare your Data: Streaming ETL
ETL = Extract, Transform, Load
Let's focus on the Transform part
Simple Event Processing = 1 event at a time
● Filter
● Transform
● Cleanse
Complex Event Processing = Relate events to each other
● Windowing
● Aggregations
● Joins
● Enrichment
ksqlDB is a tool by Confluent that does streaming ETL using streaming SQL
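As a rough illustration of simple event processing, the sketch below handles one event at a time and keeps no state between events. It is plain Python rather than ksqlDB, and the event fields (`type`, `url`, `uid`) are assumptions made up for this example.

```python
from typing import Iterable, Iterator, Optional

def process_event(event: dict) -> Optional[dict]:
    """Handle one event in isolation: filter, cleanse, transform."""
    # Filter: keep only click events (hypothetical "type" field).
    if event.get("type") != "click":
        return None
    # Cleanse: drop events without a usable URL.
    url = (event.get("url") or "").strip()
    if not url:
        return None
    # Transform: normalise the URL and rename the user field.
    return {"url": url.lower(), "user": event.get("uid", "unknown")}

def simple_event_processing(events: Iterable[dict]) -> Iterator[dict]:
    """Stateless pipeline: each output depends on exactly one input event."""
    for event in events:
        result = process_event(event)
        if result is not None:
            yield result

raw_events = [
    {"type": "click", "url": " https://Example.com/A ", "uid": "u1"},
    {"type": "view", "url": "https://example.com/b", "uid": "u2"},
    {"type": "click", "url": "", "uid": "u3"},
]
cleaned = list(simple_event_processing(raw_events))
print(cleaned)
```

Since no event depends on another, a pipeline like this can be restarted or scaled out without moving any state around.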
How to analyze your data: Streaming Analytics with Druid
For analytics applications that require:
1. Sub-second queries at any scale: interactive analytics on TB-PBs of data
2. High concurrency at the lowest cost: 100s to 1000s QPS via a highly efficient engine
3. Real-time and historical insights: true stream ingestion for Kafka and Kinesis
Plus, non-stop reliability with automated fault tolerance and continuous backup
Why do you need a Streaming Analytics
Database?
K2D Architecture - Kafka to Druid
[Diagram labels]
● Data sources: Files, App data, µService, Database
● Event Streaming Infrastructure: Messaging, CDC, Streams, Stream ETL (Stream Processor), Databases, Data Lake
● Event Analytics Infrastructure: Realtime Analytics
● Consumers: Custom visualizations, Dashboards & reports, Root-cause analysis, Data/Event driven Apps, BI tools
Preprocessing - What we are going to do today
● Filter out data by type
● Filter out data by field values
Use Case: Publisher Clickstream Data
Demo Architecture
Analytics Pipeline layers, from data production to visualisation:
● Data Production: Tracking, Transactions
● Delivery: Kafka as an event streaming platform
● Processing: Preprocessing with ksqlDB (Filter, Enrich, Transform)
● Storage: Apache Druid - Elastic storage model (in production backed by cloud storage). Highly scalable, built in DR
● Query: Apache Druid - Analytical queries against realtime, detail data
● Visualisation: Imply Pivot - a data exploration and adhoc analytics GUI for Druid
Live Demo
Learnings
● Kafka and Druid complement each other
● Use ksqlDB for
  ● Preprocessing
  ● Enrichment
  ● Materialized views
● Use Druid for
  ● Scalable analytical applications
  ● Adhoc data exploration
  ● OLAP style analysis
● Integration is easy with native integration APIs
Questions
hellmar.becker@imply.io
https://www.linkedin.com/in/hellmarbecker/
https://blog.hellmar-becker.de/

Building a Streaming Analytics Pipeline with Kafka and Druid - Big Data Europe, 2023


Editor's Notes

  1. Today we're going to talk about how to set up a streaming analytics pipeline with Apache Kafka and Apache Druid. My name is Hellmar, I am a Developer Advocate at Imply, the company that offers enterprise services and products based on Apache Druid. I live in Munich in Germany, and I blog about Druid on a regular basis. The QR code on this slide will take you directly to my blog.
  2. Here's what I am going to cover in today's session. First of all, I am going to compare streaming analytics to more traditional ways of doing analytics. I will show how streaming analytics evolved over the last 10 or so years. I will then proceed to the components of a streaming analytics pipeline, namely the ETL and the analytics proper. I assume that many people in this room have an idea what Kafka is about [or do a hands-up survey here, who knows Kafka?] But Druid is a bit less known so I am going to give you a very quick description of what Druid is and how it works. I am then going to describe to you a streaming analytics architecture pipeline template built out of Apache Kafka, Apache Druid, and Confluent's ksqlDB which is a streaming SQL engine that is available under a community license. Finally, I will demonstrate this approach in practice by building a clickstream analytics pipeline right in front of you. All the components I am using are open source or community licensed, so you can try this at home for free!
  3. This is how we used to do analytics, roughly 20 years ago. You would have your operational systems that collected data into transactional, or OLTP, databases. OLTP databases are built to process single updates or inserts very quickly. In traditional relational modeling this means you have to normalize your data model, ideally to a point where every item exists only once in a database. The downside is that when you want to run an analytical query that aggregates data from different parts of your database, the query requires complex joins and can become very expensive, hurting query times and interfering with the transactional performance of your database. Hence another type of database was conceived, one that is optimized for these analytical queries: OLAP databases. These come in different shapes and flavours, but generally a certain amount of denormalization and possibly preaggregation is applied to the data. The process that ships data from the transactional system to the OLAP database is called ETL - Extract, Transform, Load. It is a batch process that would run on a regular basis, for instance once a night or once every week. The frequency of the batch process determines how "fresh" your analytical data is. In the old world, where analytical users would be data analysts inside the enterprise itself, that was often good enough. But nowadays, in the age of connectivity, everyone is an analytics user. If you check your bank account balance and the list of transactions in your banking app on your smartphone, you are an analytics user. And if someone transfers funds to your account, you expect to see the result now and not two days later. A better way of processing data for analytics was needed. And we'll look at that now.
  4. About ten years ago, the big data craze came up around the Hadoop ecosystem. Hadoop brought with it the ability to handle historical data at a previously unknown scale, but it also already had real time [note: this is not hard real time like in embedded systems! If necessary elaborate] capability, with tooling like Kafka, Flume, and HBase. The first approach to getting analytics more up to date was the so-called lambda architecture, where incoming data would be sent across two parallel paths: a realtime layer with low latency and limited analytical capabilities, and a highly scalable but slower batch layer. This way, you would be able to retrieve at least some of the analytics results immediately, and get the full results the next day. A common serving layer would be the single entry point for clients. This architectural pattern did the job for a while but it has an intrinsic complexity that is somewhat hard to master. Also, when you have two different sources of results, you need to go through an extra effort to make sure that the results always match up.
  5. A better way needed to be found. It was created in the form of the kappa architecture. In the kappa architecture, there is only one data path and only one result for a given query. The same processing path gives (near) real time results and also fills up the storage for historical data. The kappa architecture handles incoming streaming data and historical data in a common, uniform way and is more robust than a lambda architecture. Ideally you still want to encapsulate the details of such an architecture and not concern the user with it. We will come to that in a moment.
  6. But first a few words about the part that I didn't cover in the last two slides: how to get the data out of the transactional systems into whatever analytics architecture you have. Instead of processing batches of data, streaming ETL has to be event driven. There are two ways of processing event data in a streaming ETL pipeline. Simple event processing looks at one event at a time. It is stateless, which makes it easy to implement but limits the things you can do with it. It is used for format transformations, filtering, or data cleansing, for instance. An example of a tool for simple event processing is Apache NiFi. Complex event processing looks at a collection of events over time, hence it is stateful and has to maintain a state store in the background. With that you can do things like windowed aggregations, such as sliding averages or session aggregations. You can also join various event streams (think orders and shipments), or enrich data with lookup data that is itself event based. Complex event processing is possible using frameworks like Spark Streaming, Flink, or Kafka Streams. In this demo, I will use ksqlDB. ksqlDB is a community-licensed SQL framework on top of Kafka Streams, by Confluent. With ksqlDB, you can write a complex event streaming application as simple SQL statements. ksqlDB queries are typically persistent: unlike database queries, they continue running until they are explicitly stopped, and they continue to emit new events as they process new input events in real time. ksqlDB abstracts away most of the detail of event and state handling.
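The difference between the two processing styles can be sketched in a few lines of plain Python. This is not ksqlDB or Kafka Streams code; the function names and the event fields (`ts`, `user`) are hypothetical, and the window shown is a fixed-size (tumbling) window, a simpler relative of the sliding and session windows mentioned above.

```python
from collections import defaultdict

def simple_processing(events):
    """Stateless: each event is cleansed and transformed on its own."""
    for e in events:
        if e.get("user") is None:
            continue  # cleanse: drop incomplete events
        yield {**e, "user": e["user"].lower()}  # transform: normalize case

def tumbling_window_counts(events, window_size):
    """Stateful: keep one counter per (window start, user) across events."""
    state = defaultdict(int)  # plays the role of the state store
    for e in events:
        window_start = e["ts"] - e["ts"] % window_size
        state[(window_start, e["user"])] += 1
    return dict(state)

# Hypothetical input events
events = [
    {"ts": 0, "user": "Alice"},
    {"ts": 4, "user": None},
    {"ts": 7, "user": "alice"},
    {"ts": 12, "user": "Bob"},
]
cleaned = list(simple_processing(events))
counts = tumbling_window_counts(cleaned, window_size=10)
```

The stateless step could run on any number of workers without coordination. The windowed count cannot, because its answer depends on what it has seen before, which is why a streaming framework has to manage and persist that state.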
  7. Apache Druid is a high-performance, real-time analytics database purpose-built for powering analytics apps at massive scale and concurrency on streaming and batch data. Devs build with Druid because of at least one of these three differentiators. Sub-second query response at any scale: Druid’s distributed architecture and storage engine deliver consistently fast response times for simple to complex (i.e. aggregation, group-by) queries on trillions of rows. High concurrency at the best value: Druid is designed with a highly efficient engine that minimizes CPU cycles per query, to support high concurrency at 100s to 1000s of queries per second at a fraction of the cost of other databases. Real-time and historical insights: Druid is built natively for true stream ingestion with historical context. Its native integration with Kafka and Kinesis enables event-by-event ingestion, query-on-arrival, and guaranteed consistency at massive scale, up to millions of events ingested per second. Big bonus: non-stop reliability. Druid is built for mission-critical use cases where HA and durability are paramount. It supports automated fault tolerance with auto-rebalancing nodes and continuous backup for zero data loss.
  8. Let's take a look at the internal architecture of the Druid database. Druid is heavily distributed and exceptionally scalable, and here is how that works. In Druid, there are three types of servers: master, query, and data servers. There is also deep storage (typically object storage, such as S3), and a relational database for metadata. Master servers handle data coordination, metadata processing, and service discovery. They know which bit of data lives where in the Druid cluster, and which processes and compute resources are available. Query servers serve as the entry point for clients. They receive a query, chop it up into partial queries that can each be handled by a single machine independently, and assign each partial query to a process on a data server. When the partial results come back, the query server assembles them, applies final processing such as sorting and aggregations, and returns the result to the caller. The heavy lifting is mostly done by machines called data servers. A data server handles both data ingestion and partial queries. Let's look at streaming ingestion. An indexer process consumes data directly from a Kafka stream. These data are stored in memory as a realtime segment. They are already queryable. When a configurable time interval has passed, the segment is closed off and a new segment is started. The finished segment is transformed into a columnar format. Within the segment, data is ordered by time. All alphanumeric data are dictionary compressed and bitmap indexed. The final result is binary compressed again and written to deep storage. Deep storage serves as an archive and the source of truth. From deep storage, segments are then loaded to the local storage of the data servers, typically replicated twice for resiliency and performance. Then they are available for querying by the historical processes. A query's result is collected from the realtime segments (via the indexers) and the historical segments. 
This encapsulates the kappa architecture and hides most of its detail from the database user.
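The query flow described above (fan out partial queries, then merge and sort) can be sketched in Python as follows. This is a conceptual illustration, not Druid code: the function names, the row fields (`page`, `country`), and the in-memory lists standing in for segments are all assumptions made for the example.

```python
from collections import Counter

def partial_query(segment, country):
    """Runs on a data server: count matching rows per page within one segment."""
    return Counter(row["page"] for row in segment if row["country"] == country)

def broker_query(segments, country, limit):
    """Runs on a query server: fan out, merge partial results, sort, and trim."""
    merged = Counter()
    for segment in segments:
        merged += partial_query(segment, country)
    # Final processing: order by count descending, then by page name
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

# Hypothetical data: one segment already handed off, one still being ingested
historical_segment = [
    {"page": "home", "country": "DE"},
    {"page": "cart", "country": "DE"},
    {"page": "home", "country": "US"},
]
realtime_segment = [
    {"page": "home", "country": "DE"},
]
top_pages = broker_query([historical_segment, realtime_segment], "DE", limit=2)
```

The caller passes realtime and historical segments through the same code path and gets one merged answer, which is the sense in which the database hides the kappa architecture from its users.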
  9. With that, we have everything we need. Kafka provides event delivery, and the framework for event processing. Event processing itself is done with ksqlDB applications. The curated result lands in Druid, where we can query the data and get fresh results at all times.
  10. Let's take a look at what I am going to show in the demo. I have a stream of clickstream data in Kafka. This stream contains events in two different formats: click events and session events. The first step will be to filter out one specific type of event. This shows how to handle semi-structured data in ksqlDB. After that, we have a defined schema to work with, so the next step is to reinterpret the result stream as structured data. And finally I will apply a data-based filter to that stream, to create the curated stream that goes into Druid.
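The three demo steps can be outlined in plain Python to show the shape of the pipeline. The demo itself uses ksqlDB; the code below is only a conceptual stand-in. The message layout, the field names (`type`, `ts`, `user`, `url`), and the filter condition are assumptions, since the notes do not specify them.

```python
import json

def filter_by_type(raw_messages, wanted_type):
    """Step 1: read each message as semi-structured JSON, keep one event type."""
    for raw in raw_messages:
        event = json.loads(raw)
        if event.get("type") == wanted_type:
            yield event

def apply_schema(events):
    """Step 2: reinterpret the filtered events with a fixed set of typed fields."""
    for e in events:
        yield {"ts": int(e["ts"]), "user": str(e["user"]), "url": str(e["url"])}

def curate(events):
    """Step 3: data-based filter (here: drop a hypothetical health-check URL)."""
    return [e for e in events if e["url"] != "/health"]

# Hypothetical mixed stream of click and session events
raw_messages = [
    '{"type": "click", "ts": "1", "user": "u1", "url": "/home"}',
    '{"type": "session", "ts": "2", "user": "u1", "duration": 30}',
    '{"type": "click", "ts": "3", "user": "u2", "url": "/health"}',
]
curated = curate(apply_schema(filter_by_type(raw_messages, "click")))
```

Filtering by type first matters: only after the session events are gone does every remaining message have the same fields, so a schema can be applied safely.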
  11. Red - open to changing. Green - it's working, no need or ability to replace.