SlideShare ist ein Scribd-Unternehmen logo
1 von 27
KAFKA@GS
When Streams Fail
Kafka Off the Shore
What If?
© 2017 Goldman Sachs.  This presentation should not be relied upon or considered investment advice. Goldman Sachs does not warrant
or guarantee to anyone the accuracy, completeness or efficacy of this presentation, and recipients should not rely on it except at their
own risk. This presentation may not be forwarded or disclosed except with this disclaimer intact.
  
These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we
consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on
as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed
may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you,
you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman
Sachs.
© Copyright 2017 The Goldman Sachs Group, Inc. All rights reserved.
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
streaming-kafka-spark
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
KAFKA@GS
ABOUT US
• Investment Management Division
• $1.4 Trillion Assets Under Management
• Core Front-Office Platform Team
• Interface with 15+ teams
• Kafka, Event Topology, Data Fabric, Akka, etc…
• Sounds fun? Talk to me!
• What if…
KAFKA@GS
What If? This presentation is boring
• Topology Strategy, Context & Deployment Goals
• What If? Hosts / Network / DC Failures
• Deployment & Monitoring Strategy
• What If? Framework & Software Failure
KAFKA@GS
KAFKA DEPLOYMENT USAGE PATTERNS
• Most used clusters serve ~1.5Tb a week to
consumers
• However message count relatively low – order of
millions per week; avg. several hundred a second
• At peak periods
• ~1,500 messages produced/second
• ~2.5Mb produced/second
• ~12.5Mb consumed/second
KAFKA@GS
DEPLOYMENT GOALS
• No data-loss even in case of DC outage
• No Primary/Back-Up notion
• No “failover” scenarios
• Minimize Outage time
KAFKA@GS
DATALOSS 101
• Tape Backup
• Nightly Batch Replication
• Async Replication
• Sync Replication (ex: SRDF)
KAFKA@GS
Speed of Light Network Roundtrips
NYC to SF ~60ms
Virginia to Ohio ~12ms
NYC to NJ ~4ms
KAFKA@GS
What If? Datacenters were near…
Treat multi-DC environment as
a single redundant data-center
where half could go down
KAFKA@GS
Broker ZK
Virtual Machine
Datacenter A
Broker ZK
Virtual Machine
Broker ZK
Virtual Machine
Broker ZK
Virtual Machine
Datacenter B
Broker ZK
Virtual Machine
Broker ZK
Virtual Machine
Conceptual cluster
Physical cluster
Broker
ZKZK ZK
BrokerBroker
Broker Broker
Broker
ZKZK ZK
DEPLOYMENT STRATEGY
KAFKA@GS
H1
Datacenter A
Partition assignment
EXAMPLE Single Topic Setup
• 1 topic
• 3 partitions (p1-p3)
• Replication factor of 4
• Ensure even replicas between DCs
• Min.Insync.Replicas 3
• Cross-DC latency is low!
p1
H3
p1
p1r1
H5
p2
p3
p3
Datacenter B
p2
H2
p1
H4
p1
p1r1
H6
p2
p3
p3
p2
KAFKA@GS
H1
Datacenter A
What If? A host fails
• 1-5 times / year
• No impact to producers/consumers
(still able to satisfy 3 ISR)
• No manual recovery beyond
replacing host
p1
H3
p1
p1r1
H5
p2
p3
p3
Datacenter B
p2
H2
p1
H4
p1
p1r1
H6
p2
p3
p3
p2
KAFKA@GS
H1
Datacenter A
What If? Two hosts fail
• 1/year depending where hosts are
(e.g. bad hypervisor)
• Processing for some topics will be
halted
• Short-term: Add replicas for
affected partitions on remaining
hosts
• ASAP: Replace bad hosts
• GS Dynamic Compute allows
seamless VM replace with no need
to re-point dns aliases/change
kafka config
p1
H3
p1
p1r1
H5
p2
p3
p3
Datacenter B
p2
H2
p1
H4
p1
p1r1
H6
p2
p3
p3
p2
KAFKA@GS
H1
Datacenter A
What If? Two hosts fail
• 1/year depending where hosts are
(e.g. bad hypervisor)
• Processing for some topics will be
halted
• Short-term: Add replicas for
affected partitions on remaining
hosts
• ASAP: Replace bad hosts
• GS Dynamic Compute allows
seamless VM replace with no need
to re-point dns aliases/change
kafka config
H3
p1r1
H5
p2 p3
Datacenter B
H2
p1
H4
p1
p1r1
H6
p3
p2
p2 p3
p1
pr2 p3
p1
p2 p3
p1
p1
KAFKA@GS
H1
Datacenter A
What If? Three hosts fail
• 1 / few years
• Cluster processing halted as cannot
satisfy in-sync replica requirements
• Proceed with immediate host
replacement
p1
H3
p1
p1r1
H5
p2
p3
p3
Datacenter B
p2
H2
p1
H4
p1
p1r1
H6
p2
p3
p3
p2
KAFKA@GS
H1
Datacenter A
What If? Datacenter Failed / Network Partition
• Once a 20-year event
• Short-term strategy: add additional
machines in Datacenter A
• Largest impact on recovery time is
how long to get new hosts
provisioned
p1
H3
p1
p1r1
H5
p2
p3
p3
Datacenter B
p2
H2
p1
H4
p1
p1r1
H6
p2
p3
p3
p2
KAFKA@GS
H1
Datacenter A
After adding additional hosts
What If? Datacenter Failed / Network Partition
p1
H3
p1
p1r1
H5
p2
p3
p3
Datacenter B
p2
H2
p1
H4
p1
p1r1
H6
p2
p3
p3
p2
H7
p1
H8
p1
p1r1
H9
p2
p3
p3
p2
• Short-term
strategy: add
additional
machines in
Datacenter A
• Largest impact on
recovery time is
how long to get
new hosts
provisioned
• Stand-by Hosts?
KAFKA@GS
Kafka
Broker
TOOLING OVERVIEW
Zookeeper
REST
Service
Metrics
Capture
Time-
Series

DB
GS CENTRALIZED
ALERTING
INFRASTRUCTURE
Phone/e-mail/IM
alerts to team
Images are shown for illustrative purposes only and should not be relied upon as representative of
actual or future information for Goldman Sachs products and services.
KAFKA@GS
TOOLING CLUSTER DASHBOARD
Endpoints include:
• View messages on
topic

• Consumer lag

• Leader & ISRs for
topic

• Highwatermark for
topic

• Run-Time Broker &
zookeeper
configuration
KAFKA@GS
TOOLING HEALTHCHECK
• Topic sizes monitoring
• Alert when at risk of losing data due to truncation
• If partitions breach threshold, escalate via GS alerting
infrastructure
KAFKA@GS
SAMPLE DEPLOYMENT
Direct Sink
Spark
Streaming
In-Memory

RDBMS
Data Lake
Kafka
Batch ETL
In-Memory
RDBMS
VertX/RESTAPIs
Order
Producers
KAFKA@GS
What If? I need to Replay
• Message Remains on Queue?
• Delivery Semantics
• == 1
• =< 1
• >=1
• Unique ID on everything
• Making Replay Easy
• KIP-98
KAFKA@GS
What If? Spark Streaming Fails
KAFKA@GS
SUMMARY
• Failure will occur
• Belt & Suspenders for everything
• Ultimate Throughput vs Ultimate Reliability
• Kafka has many Knobs, perhaps too many, hide
some
• Resilience through Transparency
KAFKA@GS
Q&A
text time…
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
streaming-kafka-spark

Weitere ähnliche Inhalte

Mehr von C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinC4Media
 

Mehr von C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Kürzlich hochgeladen

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

Kürzlich hochgeladen (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

When Streams Fail: From Flash Floods to Arroyos

  • 1. KAFKA@GS When Streams Fail Kafka Off the Shore What If? © 2017 Goldman Sachs.  This presentation should not be relied upon or considered investment advice. Goldman Sachs does not warrant or guarantee to anyone the accuracy, completeness or efficacy of this presentation, and recipients should not rely on it except at their own risk. This presentation may not be forwarded or disclosed except with this disclaimer intact.    These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete and/or up to date, and it should not be relied on as such. The Materials do not constitute advice nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition to Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2017 The Goldman Sachs Group, Inc. All rights reserved.
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ streaming-kafka-spark
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. KAFKA@GS ABOUT US • Investment Management Division • $1.4 Trillion Assets Under Management • Core Front-Office Platform Team • Interface with 15+ teams • Kafka, Event Topology, Data Fabric, Akka, etc… • Sounds fun? Talk to me! • What if…
  • 5. KAFKA@GS What If? This presentation is boring • Topology Strategy, Context & Deployment Goals • What If? Hosts / Network / DC Failures • Deployment & Monitoring Strategy • What If? Framework & Software Failure
  • 6. KAFKA@GS KAFKA DEPLOYMENT USAGE PATTERNS • Most used clusters serve ~1.5Tb a week to consumers • However message count relatively low – order of millions per week; avg. several hundred a second • At peak periods • ~1,500 messages produced/second • ~2.5Mb produced/second • ~12.5Mb consumed/second
  • 7. KAFKA@GS DEPLOYMENT GOALS • No data-loss even in case of DC outage • No Primary/Back-Up notion • No “failover” scenarios • Minimize Outage time
  • 8. KAFKA@GS DATALOSS 101 • Tape Backup • Nightly Batch Replication • Async Replication • Sync Replication (ex: SRDF)
  • 9. KAFKA@GS Speed of Light Network Roundtrips NYC to SF ~60ms Virginia to Ohio ~12ms NYC to NJ ~4ms
  • 10. KAFKA@GS What If? Datacenters were near… Treat multi-DC environment as a single redundant data-center where half could go down
  • 11. KAFKA@GS Broker ZK Virtual Machine Datacenter A Broker ZK Virtual Machine Broker ZK Virtual Machine Broker ZK Virtual Machine Datacenter B Broker ZK Virtual Machine Broker ZK Virtual Machine Conceptual cluster Physical cluster Broker ZKZK ZK BrokerBroker Broker Broker Broker ZKZK ZK DEPLOYMENT STRATEGY
  • 12. KAFKA@GS H1 Datacenter A Partition assignment EXAMPLE Single Topic Setup • 1 topic • 3 partitions (p1-p3) • Replication factor of 4 • Ensure even replicas between DCs • Min.Insync.Replicas 3 • Cross-DC latency is low! p1 H3 p1 p1r1 H5 p2 p3 p3 Datacenter B p2 H2 p1 H4 p1 p1r1 H6 p2 p3 p3 p2
  • 13. KAFKA@GS H1 Datacenter A What If? A host fails • 1-5 times / year • No impact to producers/consumers (still able to satisfy 3 ISR) • No manual recovery beyond replacing host p1 H3 p1 p1r1 H5 p2 p3 p3 Datacenter B p2 H2 p1 H4 p1 p1r1 H6 p2 p3 p3 p2
  • 14. KAFKA@GS H1 Datacenter A What If? Two hosts fail • 1/year depending where hosts are (e.g. bad hypervisor) • Processing for some topics will be halted • Short-term: Add replicas for affected partitions on remaining hosts • ASAP: Replace bad hosts • GS Dynamic Compute allows seamless VM replace with no need to re-point dns aliases/change kafka config p1 H3 p1 p1r1 H5 p2 p3 p3 Datacenter B p2 H2 p1 H4 p1 p1r1 H6 p2 p3 p3 p2
  • 15. KAFKA@GS H1 Datacenter A What If? Two hosts fail • 1/year depending where hosts are (e.g. bad hypervisor) • Processing for some topics will be halted • Short-term: Add replicas for affected partitions on remaining hosts • ASAP: Replace bad hosts • GS Dynamic Compute allows seamless VM replace with no need to re-point dns aliases/change kafka config H3 p1r1 H5 p2 p3 Datacenter B H2 p1 H4 p1 p1r1 H6 p3 p2 p2 p3 p1 pr2 p3 p1 p2 p3 p1 p1
  • 16. KAFKA@GS H1 Datacenter A What If? Three hosts fail • 1 / few years • Cluster processing halted as cannot satisfy in-sync replica requirements • Proceed with immediate host replacement p1 H3 p1 p1r1 H5 p2 p3 p3 Datacenter B p2 H2 p1 H4 p1 p1r1 H6 p2 p3 p3 p2
  • 17. KAFKA@GS H1 Datacenter A What If? Datacenter Failed / Network Partition • Once a 20-year event • Short-term strategy: add additional machines in Datacenter A • Largest impact on recovery time is how long to get new hosts provisioned p1 H3 p1 p1r1 H5 p2 p3 p3 Datacenter B p2 H2 p1 H4 p1 p1r1 H6 p2 p3 p3 p2
  • 18. KAFKA@GS H1 Datacenter A After adding additional hosts What If? Datacenter Failed / Network Partition p1 H3 p1 p1r1 H5 p2 p3 p3 Datacenter B p2 H2 p1 H4 p1 p1r1 H6 p2 p3 p3 p2 H7 p1 H8 p1 p1r1 H9 p2 p3 p3 p2 • Short-term strategy: add additional machines in Datacenter A • Largest impact on recovery time is how long to get new hosts provisioned • Stand-by Hosts?
  • 19. KAFKA@GS Kafka Broker TOOLING OVERVIEW Zookeeper REST Service Metrics Capture Time- Series
 DB GS CENTRALIZED ALERTING INFRASTRUCTURE Phone/e-mail/IM alerts to team Images are shown for illustrative purposes only and should not be relied upon as representative of actual or future information for Goldman Sachs products and services.
  • 20. KAFKA@GS TOOLING CLUSTER DASHBOARD Endpoints include: • View messages on topic
 • Consumer lag
 • Leader & ISRs for topic
 • Highwatermark for topic
 • Run-Time Broker & zookeeper configuration
  • 21. KAFKA@GS TOOLING HEALTHCHECK • Topic sizes monitoring • Alert when at risk of losing data due to truncation • If partitions breach threshold, escalate via GS alerting infrastructure
  • 22. KAFKA@GS SAMPLE DEPLOYMENT Direct Sink Spark Streaming In-Memory
 RDBMS Data Lake Kafka Batch ETL In-Memory RDBMS VertX/RESTAPIs Order Producers
  • 23. KAFKA@GS What If? I need to Replay • Message Remains on Queue? • Delivery Semantics • == 1 • =< 1 • >=1 • Unique ID on everything • Making Replay Easy • KIP-98
  • 24. KAFKA@GS What If? Spark Streaming Fails
  • 25. KAFKA@GS SUMMARY • Failure will occur • Belt & Suspenders for everything • Ultimate Throughput vs Ultimate Reliability • Kafka has many Knobs, perhaps too many, hide some • Resilience through Transparency
  • 27. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ streaming-kafka-spark