Upleveling Analytics with Kafka
Amy Chen, Kafka Summit 2023
Amy Chen
dbt Labs
they/them
@notamyfromdbt
amy@dbtlabs.com
“Always a data practitioner,
never a stakeholder”
The framework of the conversation
Yes, it is that hard.
Who is Apache Kafka For?
● To my audience:
○ What is your experience level?
○ How long did it take for you to get to this level?
Folks with `Analyst` in their LinkedIn profile
22,100,000
Evidence - Merit
● Analytics engineers own:
○ The data contracts that dictate how a Kafka message relates to a dbt model
○ Debugging dbt issues and their upstream causes
○ Updating Kafka topics as needed
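To make the data contract idea concrete, here is a minimal sketch in Python. It assumes messages are JSON objects; the `FieldSpec` class, the `ORDER_CONTRACT` fields, and the `contract_violations` helper are all hypothetical names made up for illustration, not something shown in the talk.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One field the downstream dbt model expects in every message (hypothetical)."""
    name: str
    type_: type
    required: bool = True


# Hypothetical contract for an "orders" topic feeding a dbt staging model.
ORDER_CONTRACT = [
    FieldSpec("order_id", str),
    FieldSpec("amount", float),
    FieldSpec("coupon_code", str, required=False),
]


def contract_violations(raw_message: bytes, contract: list[FieldSpec]) -> list[str]:
    """Return human-readable violations; an empty list means the message conforms."""
    try:
        payload = json.loads(raw_message)
    except ValueError as exc:  # covers invalid JSON and undecodable bytes
        return [f"not valid JSON: {exc}"]
    if not isinstance(payload, dict):
        return ["payload is not a JSON object"]

    problems = []
    for spec in contract:
        if spec.name not in payload:
            if spec.required:
                problems.append(f"missing required field '{spec.name}'")
            continue
        value = payload[spec.name]
        if not isinstance(value, spec.type_):
            problems.append(
                f"field '{spec.name}' should be {spec.type_.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems
```

A check like this could run against a sample of messages before a topic change ships, so a breaking change is caught before the dbt model fails.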
Hypothesis: Apache Kafka is for Analysts
Why do you want an analyst to upskill?
● Career progression
● Upleveling the data team: more hands on deck
My first data stack (without even knowing it)
*Joke/Diagram courtesy of Xebia Data
The next data stack
So why am I telling you my life story?
● Because it doesn't start with Apache Kafka.
● Skills ahead of time:
○ Command line
○ AWS architecture: IAM Roles, VPNs, EC2
○ Data warehousing and modeling
○ Resource monitoring
Experimentation & Testing the Hypothesis
How to learn Kafka
What didn't work
● The book Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale
● Community Slacks
What worked
● Friends (the text messages were weird)
● Confluent and Snowflake's developer workshops
● Stack Overflow
● Medium posts
Analytics Engineering & Kafka: The experiment metrics
● Build a working streaming pipeline
● Apply analytics engineering best practices to make it production-ready
○ Testing
○ Documentation
○ Version control
○ Scalability
● Variable
○ Managed Kafka service: AWS MSK and Confluent Cloud
My first end-to-end streaming pipeline
Loading the Data: AWS MSK
Tools:
● CloudFormation script to set up the Amazon MSK (Managed Streaming for Apache Kafka) architecture
● Snowflake Kafka Connector
Actions:
● Used CloudFormation to set up the infrastructure, including an EC2 instance, Kafka cluster, Linux jump host, and IAM roles
● Installed the Kafka connector for Snowpipe Streaming
● Set up a producer and topic in the MSK cluster to ingest from a REST API
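The producer step above can be sketched in Python without tying it to a specific client library. This is an illustration under assumptions, not the script used in the talk: `to_kafka_records` and `publish` are hypothetical helpers, the REST response is assumed to be a list of JSON objects, and `send` stands in for whatever produce call your Kafka client exposes.

```python
import json
from typing import Callable, Iterable


def to_kafka_records(api_rows: Iterable[dict], key_field: str) -> list[tuple[bytes, bytes]]:
    """Turn rows from a REST API response into (key, value) byte pairs.

    Keying by a stable field keeps all events for one entity on the same partition.
    """
    records = []
    for row in api_rows:
        key = str(row[key_field]).encode("utf-8")
        # sort_keys gives a stable serialization, which makes messages easy to diff.
        value = json.dumps(row, sort_keys=True).encode("utf-8")
        records.append((key, value))
    return records


def publish(records: list[tuple[bytes, bytes]], send: Callable[[bytes, bytes], None]) -> int:
    """Hand each record to `send` (a stand-in for a real producer) and return the count."""
    for key, value in records:
        send(key, value)
    return len(records)


# Usage with an in-memory list instead of a real cluster:
outbox = []
rows = [{"id": 7, "city": "Oslo"}]
publish(to_kafka_records(rows, key_field="id"), lambda k, v: outbox.append((k, v)))
```

Keeping serialization separate from sending means the message shape can be tested locally, before security groups, IAM roles, and the jump host enter the picture.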
Insert in Kafka Picture
Look, data! ✨
Loading the Data: Confluent Cloud
Tools:
● Confluent Cloud
● Snowflake Snowpipe
Actions:
● Created the topic
● Created the Snowflake Sink as a connector in the UI
● Provided credentials
● Data in Snowflake
Loading the Data: the Metrics
● Tested: ✅ ❌
○ Used a console consumer for ad hoc checks
○ Up next: Schema Registry; no unit testing yet
● Documented: ✅
○ In the dbt project
● Version controlled: ❌
○ Terraform is overkill for pet projects
● Scalability: ✅
○ AWS MSK - easy to switch out
Loading the Data
Lessons Learned:
● No version control logic - how do I save configurations?
● Security access is a large determinant of success
● AWS MSK - know your bash commands well to debug
● Confluent - UI errors vs. third-party errors
● Personal issue: cross-regional dependencies
● Overall - know the bigger picture of the connections
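One lightweight answer to "how do I save configurations?" is to render the connector settings as a deterministic JSON file and commit that file. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the property keys follow my understanding of the self-managed Snowflake Kafka connector; Confluent Cloud's managed connector uses its own property names, so check them against the current documentation. Credentials are deliberately left out and would be injected at deploy time.

```python
import json


def snowflake_sink_config(topic: str, database: str, schema: str) -> dict:
    """Build a connector definition without secrets (hypothetical helper)."""
    return {
        "name": f"snowflake-sink-{topic}",
        "config": {
            # Property names assumed from the self-managed Snowflake connector docs.
            "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
            "topics": topic,
            "snowflake.database.name": database,
            "snowflake.schema.name": schema,
            "snowflake.ingestion.method": "SNOWPIPE_STREAMING",
        },
    }


def render_for_git(config: dict) -> str:
    """Serialize with sorted keys and fixed indentation so diffs stay small and stable."""
    return json.dumps(config, indent=2, sort_keys=True) + "\n"


# Usage: write the rendered text to a file in the repository and review changes as diffs.
text = render_for_git(snowflake_sink_config("orders", "RAW", "KAFKA"))
```

This is far less than Terraform, but it does give a reviewable history of what the connector was configured to do.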
Transforming the Data
Tools:
● dbt Cloud
● Snowflake
Actions:
● Used dbt to version control my logic in dbt models and create dynamic tables inside Snowflake
● Applied tests and documentation to my dbt models
Transforming the Data: the Metrics
● ✅ Tested:
○ dbt tests
● ✅ Documented:
○ In the dbt project
● ✅ Version controlled:
○ dbt GitHub repository
● ✅ Scalability:
○ Performance levers in place
Visualizing the Data
Tools:
● Hex
Actions:
● Created a notebook that selected from the dynamic table
Drawing Conclusions
The Metrics
✅ Data from Kafka topic to notebook
❌✅ Testing implemented
✅ Documented entire pipeline
❌✅ Version controlled
⏲ Scalability
Conclusion: Apache Kafka is for Analysts
● More hands on deck with business knowledge
● Career advancement
● Learn some software development best practices
What does an analyst actually need to know?
● Dependency management
○ Where is your source coming from? What happens if there's a change upstream?
○ Data contracts
● How to debug upstream
○ Your report broke - how do you work backwards?
● SQL/Git/CLI
○ Have to flatten that JSON blob somehow
○ Version control & speedy development
○ Debugging
● Cost/performance optimization
○ When do you need streaming?
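As a small illustration of "flatten that JSON blob", here is a sketch in Python. In the warehouse this would more likely be written in SQL; the `flatten` helper and the sample record are hypothetical and only show the idea of turning nested keys into flat column names.

```python
def flatten(record: dict, parent: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts into a single level, joining key paths with `sep`.

    Lists are left untouched; deciding how to explode them is a modeling choice.
    """
    flat = {}
    for key, value in record.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Usage with a made-up message payload:
message = {"id": 1, "user": {"name": "Ada", "geo": {"country": "NO"}}, "tags": ["a"]}
row = flatten(message)
# row == {"id": 1, "user_name": "Ada", "user_geo_country": "NO", "tags": ["a"]}
```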
What can be blockers?
● Security
○ How much access to the infrastructure do you have?
● Data governance
○ How do you maintain PII data?
● Cross-team reliance
○ How do other teams work?
○ Data contracts
Thank you!
Amy Chen
@notamyfromdbt
amy@dbtlabs.com
