Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit SF 2022

Apache Pulsar is a highly available, distributed messaging system that provides guarantees of no message loss and strong message ordering with predictable read and write latency. This talk shows how these guarantees can be validated for Apache Pulsar deployments on Kubernetes. Various failures are injected using Chaos Mesh to simulate network and other infrastructure failure conditions. Failure scenarios raise many important questions whose answers can be hard to find. When a failure happens, how long does it take to recover? Does it cause unavailability? How does it impact throughput and latency? Are the guarantees of no message loss and strong message ordering kept, even when components fail? If a complete availability zone (AZ) fails, is the system configured correctly to handle AZ failures? This talk will help you find answers to these questions and apply the tooling and practices to your own testing and validation.

  1. Pulsar Summit San Francisco, Hotel Nikko, August 18, 2022. Tech Deep Dive: Validating Apache Pulsar’s Behavior under Failure Conditions. Lari Hotari, Engineering Coach, DataStax.
  2. Lari Hotari is an Apache Pulsar committer and PMC member. He has worked on the Java platform since 1997 and has contributed to open source for over 20 years. Lari Hotari, Engineering Coach, Streaming Customer Reliability Engineering, DataStax. Lari.Hotari@datastax.com, @lhotari
  3. Validating Apache Pulsar’s Behavior under Failure Conditions: “Apache Pulsar is a highly available, distributed messaging system that provides guarantees of no message loss and strong message ordering with predictable read and write latency.”
  4. Expectation: Provided service meets the service consumer’s requirements with very low downtime. Expectation: “two nines” (99% available) or more.
  5. Availability
      Availability %                       Downtime per day (24 hours)
      99% ("two nines")                    14.4 minutes
      99.5% ("two and a half nines")       7.20 minutes
      99.9% ("three nines")                1.44 minutes
      99.95% ("three and a half nines")    43.2 seconds
      99.99% ("four nines")                8.64 seconds
      99.995% ("four and a half nines")    4.32 seconds
      99.999% ("five nines")               864 milliseconds
      ● During uptime, the provided service meets the agreed level of operational quality and performance defined in operational SLA
      ● The service consumer’s needs are met when service disruptions don’t cause essential negative business impact.
      Some factors impacting the availability figures
      ● Reporting interval
      ● What is considered as downtime?
        ○ Total Failure vs Service Degradation / Partial Failure
        ○ High error rate? Exceeding latency requirements?
  6. Expectation: At-least-once message delivery. Published messages aren’t lost in the system in any case. Consuming state is preserved so that messages aren’t skipped when consuming. The system will redeliver messages that aren’t acknowledged.
  7. Expectation: Messages are delivered to a consumer in the same order in which the publisher published them to a single topic.
  8. Expectation: The messaging system can be used for use cases with a low latency requirement. Applications can expect messages to be published with low latency, and the end-to-end latency from publishing to consuming is expected to be low and predictable.
  9. Summary of Expectations
      ● Highly available
      ● No message loss
      ● Strong message ordering
      ● Predictable read and write latency
  10. Validating Apache Pulsar’s Behavior under Failure Conditions
  11. Failure Conditions: What could possibly go wrong?
  12. How to think about the different ways things can go wrong and decide what to validate?
      ● Learning from real production systems
        ○ Incident reports / post mortems
      ● System analysis methods coming from
        ○ Reliability Engineering
          ■ Reliability Modeling
        ○ Systems Reliability Theory
          ■ FMEA/FMECA (Failure mode and effects analysis)
        ○ Risk assessment theory
          ■ Risk analysis
  13. Examples of failure conditions for Pulsar validation
      ● Broker/Bookie/ZooKeeper node fails
      ● All components in an availability zone fail
      ● Network disconnected -> Network partitioning / Split-Brain
      ● Network limited bandwidth / increased latency
      ● Network flappy connectivity
      ● Network packet loss
      ● Bookie/ZooKeeper disk fails
  14. Examples of other conditions for Pulsar validation
      ● Broker scale-up / scale-down
      ● Bookie scale-up / scale-down
      ● Broker/Bookie/ZooKeeper software upgrade
      Performance / load testing related failure conditions:
      ● Message publishing overload
      ● Message consuming overload
  15. Unknown failure conditions: these will always exist
      “Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know. And if one looks throughout incident reports*, it is the latter category that tends to be the difficult ones.” Donald Rumsfeld
      * Adapted to SRE; the original quote reads “the history of our country and other free countries” in place of “incident reports”.
  16. Validating Apache Pulsar’s Behavior under Failure Conditions
  17. Test plans and test reports
      ● Useful for collaboration and communicating with stakeholders
      ● Written test plan with specific test cases and documented expectations
        ○ Test case descriptions include the definition of the failure condition
      ● Test reports that capture essential results for analysis
  18. Test plan example. Test case format:
      - Test case identifier + title
      - Description and intent
      - Procedure
      - Expected outcome
  19. Test report example: analysis and status update to stakeholders
  20. Validation approaches
      Test environment with test workload:
      ● Resilience Testing
      ● Chaos Testing
      Production environment with production workload:
      ● Resilience Engineering
      ● Chaos Testing
  21. Chaos Testing
      ● Requires test tooling for fault injection
      ● Fault injection puts specific infrastructure components into a failed or degraded state that the chaos testing framework controls
  22. Test workload
      ● Simulated workload created with test tooling
      ● Test applications in a test environment
      ● Anonymized / shadowed production traffic
  23. Test workload generation
      ● NoSQLBench, ASL 2.0 license, https://github.com/nosqlbench/nosqlbench
        ○ Originally created for testing NoSQL databases, but has since been adapted for testing messaging systems
      ● pulsar-perf
        ○ Comes with the Apache Pulsar distribution
      ● Custom test workload generator applications
  24. Tooling requirement for validating Pulsar’s behavior
      ● End-to-end observability
        ○ NoSQLBench Pulsar driver features:
          ■ Measure end-to-end message processing latency
          ■ Detect out-of-order messages, message loss, and message duplication
      Expectations: highly available, no message loss, strong message ordering, predictable read and write latency
  25. Example of NoSQLBench Pulsar driver metrics rendered with Grafana: end-to-end publish-to-consume latency and error metrics
  26. Message Error Rate (zoomed in)
  27. Detecting ordering issues. Pulsar Java client ordering issues fixed since Pulsar version 2.8.2:
      ● [Java Client] Remove data race in MultiTopicsConsumerImpl to ensure correct message order #12456
      ● [Java Client] Use epoch to version producer's cnx to prevent early delivery of messages #12779
  28. Automation choices
      ● No automation: interactive testing
      ● Custom script / in-house test framework
      ● Fallout
        ○ Open source test orchestration harness
        ○ Automates creation of environment, workload execution, data collection and analysis
        ○ Plugin architecture integrates with common tools
  29. Example of a testing setup for Pulsar validation
  30. Deployment view of example setup (k8s cluster)
      ● Chaos Mesh
      ● Pulsar deployment: brokers, bookies, zookeepers
      ● Test workload: NoSQLBench jobs run as k8s jobs on a dedicated k8s node pool
      ● Prometheus Graphite Exporter
      ● Prometheus
      ● Grafana
      ● Grafana dashboards
      ● Grafana renderer
      ● Test control scripts
  31. Demo recording
  32. Thank you! Lari Hotari, Lari.Hotari@datastax.com, @lhotari
  33. Backup slides
  34. Erik Hollnagel’s Four Cornerstones of Resilience
      ● Anticipation: knowing what to EXPECT
      ● Monitoring: knowing what to LOOK FOR
      ● Response: knowing what to DO
      ● Learning: knowing what has HAPPENED
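
The downtime figures in the availability table on slide 5 follow from simple arithmetic: allowed downtime per day is (1 - availability) multiplied by the 86,400 seconds in 24 hours. The following is a minimal sketch of that calculation; the function names are chosen here for illustration and do not come from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in the 24-hour reporting interval


def downtime_seconds_per_day(availability_percent: float) -> float:
    """Allowed downtime per 24 hours for a given availability percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_DAY


def format_duration(seconds: float) -> str:
    """Render a duration with a unit that suits its size."""
    if seconds >= 60:
        return f"{seconds / 60:.3g} minutes"
    if seconds >= 1:
        return f"{seconds:.3g} seconds"
    return f"{seconds * 1000:.3g} milliseconds"


if __name__ == "__main__":
    for pct in (99, 99.5, 99.9, 99.95, 99.99, 99.995, 99.999):
        print(f"{pct}% -> {format_duration(downtime_seconds_per_day(pct))}")
```

Changing `SECONDS_PER_DAY` to a week or a month of seconds shows how strongly the reporting interval, one of the factors named on the slide, affects the absolute downtime budget.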
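
Slide 21 describes fault injection that puts components into a failed or degraded state, and the example setup uses Chaos Mesh for this. As a rough sketch, the script below builds two Chaos Mesh experiment manifests (a `PodChaos` pod kill and a `NetworkChaos` delay) as Python dictionaries and prints them as JSON. The namespace `pulsar`, the label `component: bookie`, the experiment names, and the latency and duration values are assumptions made for illustration; they are not taken from the talk and depend on how Pulsar was deployed.

```python
import json

# Hypothetical values: adjust to the namespace and pod labels of the actual
# Pulsar deployment. They are not taken from the talk.
NAMESPACE = "pulsar"
BOOKIE_LABELS = {"component": "bookie"}


def pod_kill(name, namespace, labels):
    """PodChaos experiment that kills one pod matching the selector."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # pick a single matching pod
            "selector": {"namespaces": [namespace], "labelSelectors": labels},
        },
    }


def network_delay(name, namespace, labels, latency, duration):
    """NetworkChaos experiment that adds latency to all matching pods."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",  # apply to every matching pod
            "selector": {"namespaces": [namespace], "labelSelectors": labels},
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


if __name__ == "__main__":
    experiments = [
        pod_kill("kill-one-bookie", NAMESPACE, BOOKIE_LABELS),
        network_delay("slow-bookie-network", NAMESPACE, BOOKIE_LABELS, "100ms", "60s"),
    ]
    for manifest in experiments:
        print(json.dumps(manifest, indent=2))
```

Each printed JSON document can be saved to its own file and applied with `kubectl apply -f <file>`. Generating manifests from a script keeps the failure condition of each test case explicit and repeatable, which fits the written test plan approach on slides 17 and 18.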
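
Slide 24 lists detection of out-of-order messages, message loss, and message duplication as a tooling requirement. One common way to do this, sketched below, is to have a single producer stamp each message with a sequence number that starts at 0 and increases by 1, and to have the consumer check the numbers it receives. This is an illustrative simplification under those assumptions, not the NoSQLBench Pulsar driver's actual implementation; the class name `SequenceChecker` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SequenceChecker:
    """Tracks sequence numbers received from one producer on one topic.

    Assumes the producer numbers messages 0, 1, 2, ... without gaps.
    The set of seen numbers grows without bound, which is acceptable
    for a short test run but not for a long-lived consumer.
    """

    expected_next: int = 0
    seen: Set[int] = field(default_factory=set)
    out_of_order: int = 0
    duplicates: int = 0

    def observe(self, seq: int) -> None:
        if seq in self.seen:
            # Same message delivered again, e.g. redelivery after a failure.
            self.duplicates += 1
            return
        self.seen.add(seq)
        if seq < self.expected_next:
            # A message with a higher number was already received.
            self.out_of_order += 1
        else:
            self.expected_next = seq + 1

    def missing(self) -> List[int]:
        """Sequence numbers below the highest one seen that never arrived."""
        return [s for s in range(self.expected_next) if s not in self.seen]


if __name__ == "__main__":
    checker = SequenceChecker()
    for seq in [0, 1, 3, 2, 2, 5]:
        checker.observe(seq)
    print("out of order:", checker.out_of_order)
    print("duplicates:", checker.duplicates)
    print("missing so far:", checker.missing())
```

With at-least-once delivery (slide 6), duplicates after a failure are expected, so a test report would typically treat duplicates as informational and treat missing or out-of-order messages as violations. A number reported by `missing()` during a run may still arrive later; only the result after the workload has drained indicates loss.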
