Designing Topic Structures for Data Resiliency and Disaster Recovery With Justin Lee | Current 2022

1. Designing Topic Structures for Data Resiliency and Disaster Recovery
   Justin Lee - Confluent
2. Before we start…
   Resiliency is really complex. It’s not just Kafka, it’s your entire data and application architecture.
   linkedin.com/in/leejustinr
3. See also: Cloudy with a Chance of Failure: How to Design a Kafka Architecture Resilient to Cloud Outages (presented by Julie Wiederhold)
4. Agenda
   ● Major Considerations
   ● Topic Layout and Naming Conventions
   ● Planning for Failure
   ● What’s Next?
5. Major Considerations
   You will likely have different requirements (with different architectures) for each use case.
   ● What is my overall architecture goal?
     ○ Single-Active (Active/Passive, Active/Standby, Primary/DR, etc.)
     ○ Dual-Active (Active/Active)
   ● What are my customer expectations?
     ○ Is it more important to be able to continue to produce new data?
     ○ Is it more important to have access to existing data?
   ● How am I replicating data? (see the sketch after this slide)
     ○ Cluster Linking
     ○ Replicator
     ○ MirrorMaker (2)
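To make the replication question concrete, here is a minimal sketch of a MirrorMaker 2 properties file, assuming two clusters aliased west and east and a single topic named clicks; the aliases, hostnames, and topic name are illustrative assumptions, not from the talk.

    # Minimal MirrorMaker 2 sketch; cluster aliases, hostnames, and topic name are hypothetical.
    clusters = west, east
    west.bootstrap.servers = west-kafka:9092
    east.bootstrap.servers = east-kafka:9092

    # Replicate in both directions between the two fault domains.
    west->east.enabled = true
    west->east.topics = clicks
    east->west.enabled = true
    east->west.topics = clicks

    # Replication factor for the mirrored topics created on the target cluster.
    replication.factor = 3

The default replication policy prefixes each mirrored topic with its source cluster alias (clicks from west appears on east as west.clicks), which lines up with the prefixing approach discussed on slide 8.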
6. More Considerations
   ● Where do my clients live?
     ○ In one fault domain? Both fault domains? Outside the fault domains?
   ● What are my clients doing?
     ○ Producers (if dual-active, producing to both fault domains, or just one?)
     ○ Consumers (if dual-active, consuming from both fault domains, or just one?)
     ○ Kafka Connect (sources vs. sinks; if dual-active, both fault domains or just one?)
     ○ Stream processors (produce and consume)
   ● How is my data keyed?
     ○ Ideally, for a given key, all data goes to only a single fault domain at a time, especially if ordering matters (see the producer sketch after this slide).
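As an illustration of keeping a key's data in a single fault domain, here is a minimal Java producer sketch (not from the talk); the bootstrap address, topic name west.clicks, and keys are hypothetical.

    // Minimal sketch: a producer in the "west" fault domain writing keyed click
    // events only to its local topic. Names are hypothetical.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class WestClickProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "west-kafka:9092"); // local (west) cluster only
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            props.put("acks", "all"); // favor durability within the fault domain

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // All events for key "user-42" go only to the west fault domain,
                // so their relative order is preserved within one partition.
                producer.send(new ProducerRecord<>("west.clicks", "user-42", "clicked:home"));
                producer.send(new ProducerRecord<>("west.clicks", "user-42", "clicked:checkout"));
            }
        }
    }

Because both events share the key user-42 and are produced only to the west topic, they land in the same partition and keep their relative order even though cross-cluster replication is asynchronous.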
7. Topic Layout and Naming Conventions
   When producing data to multiple fault domains ("dual-active"), bidirectional replication with a single topic is possible, but often problematic due to ordering (cross-cluster replication is asynchronous).
   [Diagram: producers writing to partitions 00-07 of the same topic in both the west and east clusters]
8. Topic Layout and Naming Conventions (cont.)
   Instead, produce to a distinct primary topic for each fault domain. Think of these as separate partitions, except you have separate topics. (Consumers can consume from both topics; see the consumer sketch after this slide!)
   For example, there are a few naming options:
   [Diagram: a producer in each cluster writes to its local topic, and each cluster also holds a replicated copy of the other cluster's topic]
   ● west cluster: local topic west.clicks (or clicks); replicated copy of the east topic as east.clicks (or remote.clicks)
   ● east cluster: local topic east.clicks (or clicks); replicated copy of the west topic as west.clicks (or remote.clicks)
   Of the three proposed options, it's easier/better to prefix during replication rather than rename; renaming west.clicks to remote.clicks isn't a recommended combination.
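To show that consumers can read both the local and the replicated topic, here is a minimal Java consumer sketch (not from the talk); the bootstrap address, group id, and topic names are hypothetical and assume the west.clicks / east.clicks naming above.

    // Minimal sketch: a consumer in the west fault domain reading the locally
    // produced topic plus the replicated copy of the east topic.
    import java.time.Duration;
    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class AllClicksConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "west-kafka:9092"); // consume from the local (west) cluster
            props.put("group.id", "clicks-analytics");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Subscribe to the local topic and the replicated remote topic.
                consumer.subscribe(Arrays.asList("west.clicks", "east.clicks"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("%s [%d] %s=%s%n",
                                record.topic(), record.partition(), record.key(), record.value());
                    }
                }
            }
        }
    }

Ordering is still only guaranteed per partition of each topic; an application that needs per-key ordering should rely on the keying approach from slide 6.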
9. Planning for Failure
   Build a run book for the full disaster lifecycle. It should include at least the following:
   ● What are the failure modes we want to handle (consider partial failures)?
   ● How do we detect each failure mode?
   ● How do we handle failure?
     ○ Infrastructure changes (DNS, load balancers, etc.)
     ○ Changes to producers and consumers
     ○ Changes to Kafka Streams and ksqlDB applications
     ○ Changes to Kafka topics (writeable topics vs. read-only topics)
   ● How do we handle recovery?
     ○ Fail back vs. fail forward
     ○ How do we handle new events generated during the disaster event?
   ● Who makes decisions, and who is in charge of communication?
10. Planning for Failure (cont.)
    During a disaster event, expect things to go wrong:
    ○ There may be panic, and people will both forget the process and make snap decisions
    ○ There may be edge cases you haven’t considered
    ○ You may be unable to provision new resources (cloud provider resource contention)
    Automate everything. Document everything. Test everything. The devil is in the details!
11. What’s Next?
    Again: this can be really complex, and is really a game of trade-offs. There is no one-size-fits-all solution.
    ● Rewatch Julie Wiederhold’s session: Cloudy with a Chance of Failure: How to Design a Kafka Architecture Resilient to Cloud Outages
    ● Sit down and figure out your requirements (on a per-use-case basis)
      ○ Business requirements
      ○ Technical requirements
      ○ Compliance requirements
    ● Whiteboard, document, test
    This is an iterative process; your architecture will likely evolve.
