Failure is inevitable in any distributed system, but anticipating failures and building systems that recover from them instantaneously makes a system highly resilient. At Capital One we process billions of events every day, leveraging cloud, microservices, streaming, and machine learning technologies to solve customer problems and provide the best customer experience.
In this session I will talk about a highly resilient streaming architecture that supports processing billions of events every day, along with strategies and best practices for building highly available, fault-tolerant systems using Kafka and cloud environments.
Digital transformation: Highly resilient streaming architecture and strategies | Sunil Kaitha, Capital One Services
1. Digital Transformation
Highly resilient event-driven microservices architectures and strategies
Presenter: Sunil Kaitha
Senior Manager - Lead Software Engineer @ Capital One
2. At a Glance
26-year-old, founder-led company
Nation’s largest direct bank
3rd largest credit card issuer in the U.S.
Largest financial-institution auto loan originator
9th largest bank based on U.S. deposits
More than 70 million customer accounts and 50,000+ associates
FORTUNE 100 company (NYSE: COF)
4. Resiliency
Failure is inevitable in a distributed platform, and the primary goal of resiliency is to recover from failures quickly and return to a fully functioning state without any data loss.
Disaster Recovery
High Availability
Fault-Tolerance
Source: Disaster Recovery (DR) Architecture on AWS
5. Streams
Event Producers → Messaging System → Event Consumers → Downstream Services
● Event Producers: publish events to the Messaging System
● Messaging System: a distributed event streaming platform
● Event Consumers: consume events from the Messaging System and process them
● Downstream Services: could be an API, S3, a database, a Kafka topic, etc.
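The event flow above can be sketched end to end. This is a minimal, illustrative model only: `queue.Queue` stands in for the messaging system (Kafka in this talk), a plain list stands in for the downstream service, and all names are hypothetical rather than Capital One's actual code.

```python
import queue

def producer(broker: queue.Queue, events) -> None:
    """Event producer: publishes events to the messaging system."""
    for event in events:
        broker.put(event)

def consumer(broker: queue.Queue, downstream: list) -> None:
    """Event consumer: reads events from the messaging system and hands
    them to a downstream service (an API, S3, DB, or another topic)."""
    while not broker.empty():
        downstream.append(broker.get())

# Wire the three stages together.
broker = queue.Queue()
downstream = []
producer(broker, ["payment.created", "payment.settled"])
consumer(broker, downstream)
```

In a real deployment each stage is a separate process or service and the broker is a replicated Kafka cluster; the failure scenarios that follow come from any of these hops breaking.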
12. Failure Scenarios
# | Producer | Messaging System | Consumer | Downstream
1 | Producer technical issues | Kafka healthy | Consumers miss those events | Downstreams miss those events
2 | Producer unable to publish events | Kafka cluster/topic issues | Consumers miss those events | Downstreams miss those events
3 | Producers healthy | Kafka healthy | Consumer event-processing issues | Downstream issues
4 | Producers healthy | Kafka healthy | Consumer issues | Missing or duplicate events received
5 | Region-wide outage (affects multiple systems)
6 | Connectivity issues between regions or clusters
7 | Connectivity issues between Producers / streaming platform / Consumers / Downstream services
Legend: impacted system · issue origin · system in healthy state · connectivity issue or outage impacting multiple systems
13. Producer - Data Publishing Failures
Primary reasons for failures
○ Kafka primary region unavailable
○ Kafka topics corrupted or deleted
○ Network issues between Producers and Kafka, or between Kafka nodes
Impact
○ Data loss
○ Duplicate events published
Solution
○ Configure the proper acknowledgement level (acks=all guarantees delivery and replication of the message)
○ Handle publishing errors
○ Retry publishing to the primary Kafka cluster topic
○ Always have failover configuration and publish data to a cross-region Kafka topic
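The retry-then-failover strategy above can be sketched as follows. This is a hedged sketch, not the actual producer code: `publish_primary` and `publish_failover` stand in for Kafka producer sends to the primary and cross-region clusters (each assumed to be configured with `acks=all`), and the retry count and backoff values are illustrative.

```python
import time

def publish_with_failover(event, publish_primary, publish_failover,
                          retries=3, backoff_s=0.0):
    """Try the primary cluster a few times; if it stays unreachable,
    publish to the cross-region failover topic so no event is lost."""
    for attempt in range(retries):
        try:
            return publish_primary(event)
        except ConnectionError:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    # Primary exhausted: fall back to the cross-region topic.
    return publish_failover(event)

# Simulate a primary region outage.
attempts = {"primary": 0}
def flaky_primary(event):
    attempts["primary"] += 1
    raise ConnectionError("primary region unavailable")

sent = []
def cross_region(event):
    sent.append(event)
    return "ok"

result = publish_with_failover("evt-1", flaky_primary, cross_region)
```

The key property is that the producer never silently drops the event: every code path either returns a successful acknowledgement or raises.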
15. Consumer failures: Scenario-1 with Data consuming issues
Primary reasons for failures
○ Kafka primary region unavailable
○ Kafka topics corrupted or deleted
○ Kafka consumer information corrupted
○ Network issues between Consumers and Kafka
○ Producer publishing data to a different topic version
Major Impacts
○ Data loss
○ Duplicate events
Solutions
○ Process primary and replica topic events, if required*
○ Handle duplicates, if required*
○ Process cross-region topic data (not recommended, but an option for certain batch-specific use cases)
○ Topic version upgrades must be planned ahead
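The first two solutions above, taken together, can be sketched like this: when a consumer reads both the primary topic and its cross-region replica, the same event may arrive twice, so it dedupes on a unique event id that the producer attaches. The event shape and names here are illustrative assumptions, not the actual schema.

```python
def merge_and_dedupe(primary_events, replica_events):
    """Yield each event exactly once, regardless of which topic delivered it,
    keyed on the producer-assigned unique event id."""
    seen_ids = set()
    for event in list(primary_events) + list(replica_events):
        if event["id"] in seen_ids:
            continue  # duplicate delivered via the other topic
        seen_ids.add(event["id"])
        yield event

# "e2" arrives on both the primary and the replica topic.
primary = [{"id": "e1", "payload": "a"}, {"id": "e2", "payload": "b"}]
replica = [{"id": "e2", "payload": "b"}, {"id": "e3", "payload": "c"}]
deduped = list(merge_and_dedupe(primary, replica))
```

In production the `seen_ids` set would need to be bounded (e.g., a TTL cache or an external store), since a long-lived consumer cannot hold every id in memory forever.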
16. Consumer failures: Scenario-1 with Data consuming issues
Solution
Note: This assumes downstream services are idempotent.
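The note above assumes idempotent downstream services. One common way to get idempotency is to key writes by event id, so reprocessing the same event overwrites rather than duplicates. In this sketch a dict stands in for a keyed store (e.g., a database upsert); the class and names are illustrative.

```python
class IdempotentStore:
    """Stand-in for a downstream service that upserts by event id."""

    def __init__(self):
        self._rows = {}

    def apply(self, event):
        # Upsert: applying the same event twice leaves exactly one row,
        # so redelivered duplicates are harmless.
        self._rows[event["id"]] = event["payload"]

    def count(self):
        return len(self._rows)

store = IdempotentStore()
store.apply({"id": "e1", "payload": "a"})
store.apply({"id": "e1", "payload": "a"})  # redelivered duplicate
```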
18. Consumer failures: Scenario-2 with Consumer issues
Primary reasons for failures
○ Data deserialization issues
○ Consumer group id changes
○ Improper handling of offsets (default autocommit is true)
○ Any other business or technical issues
○ Adding or removing a consumer in an existing consumer group
Major Impacts
○ Unable to process the data
○ Data loss
○ Duplicate event processing
○ Delayed data processing
Solutions
○ Handle data deserialization errors appropriately
○ Use a unique, standardized consumer group id
○ Commit offsets only after processing
○ Handle duplicates, if required*
○ Process events in parallel
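"Commit offsets only after processing" gives at-least-once delivery: with autocommit on, an offset can be committed before processing finishes, losing events on a crash. The sketch below shows the manual-commit pattern with in-memory stand-ins for Kafka's offset store; all names are illustrative.

```python
def consume_batch(events, handler, committed):
    """Process events in order; committed['offset'] only advances past
    events whose processing succeeded."""
    for offset, event in enumerate(events, start=committed["offset"] + 1):
        try:
            handler(event)
        except Exception:
            # Do not commit: this event (and the ones after it)
            # will be redelivered on the next poll.
            return
        committed["offset"] = offset  # commit only after success

processed = []
def handler(event):
    if event == "bad":
        raise ValueError("processing failed")
    processed.append(event)

committed = {"offset": -1}
consume_batch(["a", "b", "bad", "c"], handler, committed)
```

Because "bad" is never committed, it will be redelivered and retried rather than silently lost; the trade-off is possible duplicates, which is why the dedupe checks above pair with this pattern.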
20. Consumer failures: Scenario-3 with Downstream issues
Primary reasons for failures
● Network issues between Consumers and Downstream services
● Depending on the downstream service type, the reason could vary. Ex:
■ DB: database not available, not accepting new connections, etc.
■ API: not reachable, slow responses, or errors
Major Impact
○ Data loss
○ Duplicate data processing
Solutions
○ Adopt the circuit breaker pattern along with retries
○ Route data to a failover downstream service
○ Handle duplicates (the producer must attach a unique id to every event)
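A minimal circuit breaker for the "circuit breaker plus retry" recommendation can look like this: after a run of consecutive failures the breaker opens and calls fail fast, giving the struggling downstream service breathing room, until a reset window elapses. Thresholds and names are illustrative assumptions, not a specific library's API.

```python
import time

class CircuitBreaker:
    """Fail fast after repeated downstream errors; retry after a cooldown."""

    def __init__(self, max_failures=3, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: cooldown elapsed, allow a trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result

# Simulate a downstream outage tripping the breaker.
breaker = CircuitBreaker(max_failures=2, reset_after_s=60.0)
def unavailable_downstream():
    raise ConnectionError("DB not accepting new connections")

for _ in range(2):
    try:
        breaker.call(unavailable_downstream)
    except ConnectionError:
        pass  # counted as a failure

fast_failed = False
try:
    breaker.call(unavailable_downstream)  # breaker is now open
except RuntimeError:
    fast_failed = True
```

While the circuit is open, the consumer can pause the partition or route events to the failover downstream service instead of hammering the unhealthy one.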
24. Let’s look at the complete architecture
Note: There is no standard solution for all types of use cases, but having this kind of resilient architecture may help solve some common resilience challenges with event-driven microservices.
25. Best practices
● Control over offset commits (default autocommit is true)
● Dedupe checks (especially where exactly-once processing is required)
● Unique, standardized consumer group id
● Adopt the circuit breaker design pattern
● Build in retry capability
● Deploy components in multiple regions
● RVT (Recovery Verification Testing) in non-production and production environments
● Chaos testing
● Enable autoscaling of consumers
● Monitoring and alerting (systems should be able to leverage this data to make intelligent decisions and avoid service disruptions)
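For the retry best practice, a common refinement is exponential backoff with full jitter, so many consumers retrying at once do not hammer the broker or a downstream service in lockstep. The base and cap values below are illustrative.

```python
import random

def backoff_with_jitter(attempt, base_s=0.1, cap_s=30.0):
    """Return a randomized sleep time for the given retry attempt:
    uniform over [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap_s, base_s * (2 ** attempt)))

delay = backoff_with_jitter(3)  # somewhere in [0, 0.8] seconds
```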
26. REFERENCES
Kafka: https://www.confluent.io/
Capital One Awards and Recognition
Re-Imagine Autoscaling Stream Consumers in Cloud Environments - Confluent
https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html
https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/
https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.pdf
https://www.oreilly.com/library/view/apache-kafka-series/9781789346534/video4_12.html
https://docs.confluent.io/platform/current/multi-dc-deployments/multi-region.html
https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/disaster-recovery-dr-objectives.html