Pulsar Summit San Francisco, Hotel Nikko, August 18, 2022
Tech Deep Dive
Validating Apache Pulsar’s Behavior under Failure Conditions
Lari Hotari
Engineering Coach • DataStax
Lari Hotari is an Apache Pulsar committer and PMC member. He has worked on the Java platform since 1997 and has contributed to open source for over 20 years.
Lari Hotari
Engineering Coach, Streaming
Customer Reliability Engineering
DataStax
Lari.Hotari@datastax.com
@lhotari
“Apache Pulsar is a highly available, distributed messaging system that provides guarantees of no message loss and strong message ordering with predictable read and write latency.”
Highly available
Expectation: The provided service meets the service consumer’s requirements with very low downtime.
Expectation: “two nines” (99% availability) or more.
Availability
| Availability % | Downtime per day (24 hours) |
|---|---|
| 99% ("two nines") | 14.4 minutes |
| 99.5% ("two and a half nines") | 7.20 minutes |
| 99.9% ("three nines") | 1.44 minutes |
| 99.95% ("three and a half nines") | 43.2 seconds |
| 99.99% ("four nines") | 8.64 seconds |
| 99.995% ("four and a half nines") | 4.32 seconds |
| 99.999% ("five nines") | 864 milliseconds |
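The downtime figures in the availability table follow from simple arithmetic: a day has 86,400 seconds, and the allowed downtime is the unavailable fraction of that. A minimal Python sketch that reproduces the table (the function names are illustrative, not from the talk):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def downtime_per_day(availability_percent: float) -> float:
    """Allowed downtime in seconds per 24-hour day for a given availability percentage."""
    return SECONDS_PER_DAY * (100.0 - availability_percent) / 100.0


def format_duration(seconds: float) -> str:
    """Format a duration using the same units as the availability table."""
    if seconds >= 60:
        value, unit = seconds / 60, "minutes"
    elif seconds >= 1:
        value, unit = seconds, "seconds"
    else:
        value, unit = seconds * 1000, "milliseconds"
    return f"{round(value, 2):g} {unit}"


for availability in (99.0, 99.5, 99.9, 99.95, 99.99, 99.995, 99.999):
    print(f"{availability}% -> {format_duration(downtime_per_day(availability))}")
```

The same function works for other reporting intervals by swapping the day length for a month or a year.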
● During uptime, the provided service meets the agreed level of operational quality and performance defined in the operational SLA.
● The service consumer’s needs are met when service disruptions don’t cause significant negative business impact.
Some factors impacting the availability figures:
● Reporting interval
● What is considered downtime?
○ Total failure vs. service degradation / partial failure
○ High error rate? Exceeding latency requirements?
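How downtime is defined changes the availability figure. As a sketch, assuming availability is measured in fixed intervals (for example one minute) from a constant synthetic test workload, an interval can be counted as down on total failure, or when its error rate or p99 latency exceeds a limit. The `IntervalSample` type and the default limits (1% errors, 100 ms p99) are hypothetical placeholders, not values from the talk or from any SLA.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class IntervalSample:
    """Aggregated test workload results for one measurement interval (hypothetical type)."""
    attempts: int          # operations the test workload attempted in the interval
    errors: int            # failed operations (publish errors, timeouts, ...)
    p99_latency_ms: float  # 99th percentile end-to-end latency in the interval


def is_down(s: IntervalSample, max_error_rate: float, max_p99_ms: float) -> bool:
    if s.attempts == 0:
        return True  # the workload is constant, so zero attempts means it was blocked
    if s.errors >= s.attempts:
        return True  # total failure
    # Service degradation counted as downtime: error rate or latency limit exceeded
    return s.errors / s.attempts > max_error_rate or s.p99_latency_ms > max_p99_ms


def availability_percent(samples: Sequence[IntervalSample],
                         max_error_rate: float = 0.01,       # hypothetical limit
                         max_p99_ms: float = 100.0) -> float:  # hypothetical limit
    down = sum(1 for s in samples if is_down(s, max_error_rate, max_p99_ms))
    return 100.0 * (len(samples) - down) / len(samples)
```

Tightening or loosening the two limits can move the same test run across the "two nines" line, which is one reason to write the definition of downtime into the test plan.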
No message loss
Expectation: At-least-once message delivery. Published messages are never lost in the system, under any circumstances.
● Consuming state is preserved so that messages aren’t skipped when consuming.
● The system redelivers messages that haven’t been acknowledged.
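Preserved consuming state plus redelivery of unacknowledged messages is what turns a consumer crash into a duplicate rather than a lost or skipped message. The toy model below illustrates that reasoning with a single durable cursor; it is a deliberately simplified assumption, not Pulsar's actual subscription implementation.

```python
def consume(log, crash_before_ack_of=None):
    """Toy model of a subscription with a durable cursor (not Pulsar's actual code).

    The cursor holds the position of the first unacknowledged message. Only an
    acknowledgment advances it, so a crash between delivery and acknowledgment
    causes a redelivery, never a skip.
    """
    cursor = 0       # durable consuming state
    delivered = []   # every delivery the application saw, including redeliveries
    crashed = False
    while cursor < len(log):
        msg = log[cursor]
        delivered.append(msg)
        if msg == crash_before_ack_of and not crashed:
            crashed = True  # consumer dies before acking; the cursor is not advanced
            continue        # a restarted consumer resumes reading from the durable cursor
        cursor += 1         # the acknowledgment persists the new cursor position
    return delivered


print(consume(["m1", "m2", "m3"], crash_before_ack_of="m2"))
```

The duplicate `m2` is the cost of at-least-once delivery; applications that cannot tolerate duplicates need idempotent processing or deduplication.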
Strong message ordering
Expectation: Messages are delivered to a consumer in the same order in which the publisher published them to a single topic.
Predictable read and write latency
Expectation: The messaging system can be used for use cases with low-latency requirements.
Applications can expect messages to be published with low latency, and the end-to-end latency from publishing to consuming is expected to be low and predictable.
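"Predictable" is mostly a statement about the latency tail rather than the average. A minimal sketch, assuming publish and receive timestamps come from synchronized clocks, computes end-to-end latency per message and reports nearest-rank percentiles. The function names are illustrative; for long runs a histogram is usually a better fit than sorting every sample.

```python
import math


def e2e_latencies_ms(publish_ts_ms, receive_ts_ms):
    """Per-message end-to-end latency: receive time minus publish time.
    Assumes producer and consumer clocks are synchronized (for example via NTP)."""
    return [received - published for published, received in zip(publish_ts_ms, receive_ts_ms)]


def percentile(values, p):
    """Nearest-rank percentile for 0 < p <= 100 over a non-empty sequence."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def latency_summary(latencies_ms):
    """Median, tail, and maximum latency of a test run."""
    return {name: percentile(latencies_ms, p)
            for name, p in (("p50", 50), ("p99", 99), ("max", 100))}
```

A run where p50 stays flat but p99 jumps during a fault injection is exactly the kind of result the test report should call out.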
Summary of Expectations
● Highly available
● No message loss
● Strong message ordering
● Predictable read and write latency
Failure Conditions
What could possibly go wrong?
How should we think about the different ways things can fail, and decide what to validate?
● Learning from real production systems
○ Incident reports / post-mortems
● System analysis methods from
○ Reliability engineering
■ Reliability modeling
○ Systems reliability theory
■ FMEA/FMECA (failure mode and effects analysis / failure mode, effects, and criticality analysis)
○ Risk assessment theory
■ Risk analysis
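Classic FMEA ranks failure modes by a Risk Priority Number (RPN), the product of severity, occurrence, and detection ratings, each commonly scored from 1 to 10. A minimal sketch of using the RPN to decide which failure conditions to validate first; the scores below are placeholders for illustration only, not an assessment of Pulsar.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (almost certainly detected) .. 10 (undetectable)

    @property
    def rpn(self) -> int:
        """Risk Priority Number as used in classic FMEA: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Order failure modes so that the riskiest ones are validated first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Placeholder scores for illustration only; real scores come from a team's own analysis.
candidates = [
    FailureMode("failure mode A", severity=5, occurrence=6, detection=2),
    FailureMode("failure mode B", severity=9, occurrence=3, detection=6),
    FailureMode("failure mode C", severity=8, occurrence=4, detection=3),
]
for mode in prioritize(candidates):
    print(mode.name, mode.rpn)
```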
Examples of failure conditions for Pulsar validation
● Broker/Bookie/ZooKeeper node fails
● All components in an availability zone fail
● Network disconnection -> network partitioning / split-brain
● Limited network bandwidth / increased network latency
● Flapping network connectivity
● Network packet loss
● Bookie/ZooKeeper disk fails
Examples of other conditions for Pulsar validation
● Broker scale-up / scale-down
● Bookie scale-up / scale-down
● Broker/Bookie/ZooKeeper software upgrade
Performance / load testing related failure conditions:
● Message publishing overload
● Message consuming overload
Unknown failure conditions - these will always exist
“Reports that say that something hasn't happened are always
interesting to me, because as we know, there are known knowns;
there are things we know we know. We also know there are known
unknowns; that is to say we know there are some things we do not
know. But there are also unknown unknowns—the ones we don't
know we don't know. And if one looks throughout the history of our
incident reports*, it is the latter category that tends to be the difficult ones.”
- Donald Rumsfeld
* Adapted to SRE; the original quote says “the history of our country and other free countries”.
Test plans and test reports
● Useful for collaborating and communicating with stakeholders
● A written test plan with specific test cases and documented expectations
○ Test case descriptions include the definition of the failure condition
● Test reports that capture the essential results for analysis
Test plan example
Test case format:
- Test case identifier + title
- Description and intent
- Procedure
- Expected outcome
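The test case format maps directly onto a small record type, which makes it easy to keep a test plan in version control and render it for stakeholders. A sketch with a hypothetical `FailureTestCase` type and an illustrative test case; the identifier, steps, and outcomes are made up for this example, not taken from the talk's test plan.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureTestCase:
    identifier: str
    title: str
    description: str  # description and intent, including the failure condition
    procedure: List[str] = field(default_factory=list)
    expected_outcome: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the test case in the plan's format for sharing with stakeholders."""
        lines = [f"{self.identifier}: {self.title}",
                 "Description and intent:", self.description, "Procedure:"]
        lines += [f"{step_no}. {step}" for step_no, step in enumerate(self.procedure, start=1)]
        lines.append("Expected outcome:")
        lines += [f"- {outcome}" for outcome in self.expected_outcome]
        return "\n".join(lines)


# Hypothetical example test case
tc = FailureTestCase(
    identifier="TC-01",
    title="Broker node fails",
    description="Verify that a single broker failure under a steady workload "
                "causes no message loss or reordering.",
    procedure=[
        "Start the test workload and let it reach a steady state",
        "Kill one broker pod",
        "Keep the workload running while the remaining brokers take over",
        "Stop the workload and collect the results",
    ],
    expected_outcome=[
        "No message loss",
        "No out-of-order messages",
        "Publishing and consuming resume without manual intervention",
    ],
)
print(tc.render())
```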
Test report example
Analysis and status update to stakeholders
Validation approaches
Test environment with test workload
● Resilience testing
● Chaos testing
Production environment with production workload
● Resilience engineering
● Chaos testing
Chaos Testing
● Requires test tooling for fault injection
● Fault injection can put specific infrastructure components into a failed or degraded state, controlled by the chaos testing framework
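The example setup in this deck uses Chaos Mesh for fault injection. As a sketch, a test control script could generate a Chaos Mesh `NetworkChaos` manifest that partitions brokers from bookies. The pod labels (`component: broker`, `component: bookie`), namespace, and duration are assumptions that depend on how Pulsar is deployed, and the schema should be checked against the Chaos Mesh version in use.

```python
import json


def network_partition_chaos(name, namespace, source_labels, target_labels, duration="5m"):
    """Build a Chaos Mesh NetworkChaos manifest that partitions two groups of pods."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "partition",
            "mode": "all",  # affect every pod matched by the selector
            "selector": {"namespaces": [namespace], "labelSelectors": source_labels},
            "direction": "both",  # drop traffic in both directions
            "target": {
                "mode": "all",
                "selector": {"namespaces": [namespace], "labelSelectors": target_labels},
            },
            "duration": duration,  # Chaos Mesh removes the fault after this time
        },
    }


# Assumed namespace and labels; adjust to the actual Pulsar deployment.
manifest = network_partition_chaos(
    "broker-bookie-partition", "pulsar", {"component": "broker"}, {"component": "bookie"}
)
print(json.dumps(manifest, indent=2))
```

The JSON output can be piped to `kubectl apply -f -` to start the fault, and the same manifest to `kubectl delete -f -` to end it early.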
Test workload
● Simulated workload created with test tooling
● Test applications in a test environment
● Anonymized / shadowed production traffic
Test workload generation
● NoSQLBench, ASL 2.0 license, https://github.com/nosqlbench/nosqlbench
○ Originally created for testing NoSQL databases, but has since been adapted for testing messaging systems
● pulsar-perf
○ Included in the Apache Pulsar distribution
● Custom test workload generator applications
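A custom workload generator can embed a sequence number and a send timestamp in each message, which is what makes end-to-end latency measurement and loss, duplication, and ordering checks possible on the consumer side. A minimal sketch of such a payload format; the 16-byte big-endian header is a hypothetical layout, not the format used by NoSQLBench or pulsar-perf.

```python
import struct
import time

# Hypothetical header layout: big-endian sequence number, then publish timestamp in ns
HEADER = struct.Struct(">QQ")


def make_payload(sequence: int, size: int = 100, now_ns=None) -> bytes:
    """Build a test message: 16-byte header followed by zero padding up to `size` bytes."""
    if size < HEADER.size:
        raise ValueError(f"size must be at least {HEADER.size} bytes")
    ts = time.time_ns() if now_ns is None else now_ns
    return HEADER.pack(sequence, ts) + bytes(size - HEADER.size)


def parse_payload(payload: bytes):
    """Return (sequence, publish_timestamp_ns) from a payload built by make_payload."""
    return HEADER.unpack_from(payload)
```

Padding to a fixed size keeps the message size under control of the test plan, independently of the header.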
Tooling requirements for validating Pulsar’s behavior
● End-to-end observability
○ NoSQLBench Pulsar driver features:
■ Measure end-to-end message processing latency
■ Detect out-of-order messages, message loss, and message duplication
Expectations covered: highly available, no message loss, strong message ordering, predictable read and write latency
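Given per-producer sequence numbers, detecting out-of-order delivery, loss, and duplication is a single pass over what the consumer received. This sketch assumes one producer numbering its messages 0 to N-1 in publish order and one consumer reading the whole topic; it illustrates the idea and is not the NoSQLBench Pulsar driver's implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SequenceCheckResult:
    out_of_order: List[int] = field(default_factory=list)
    duplicates: List[int] = field(default_factory=list)
    lost: List[int] = field(default_factory=list)


def check_sequence(received, expected_count):
    """Classify the sequence numbers one consumer received from one producer.

    Assumes the producer numbered its messages 0 .. expected_count - 1 in publish order.
    """
    result = SequenceCheckResult()
    seen = set()
    highest = -1
    for seq in received:
        if seq in seen:
            result.duplicates.append(seq)    # delivered more than once
            continue
        if seq < highest:
            result.out_of_order.append(seq)  # arrived after a later message
        seen.add(seq)
        highest = max(highest, seq)
    result.lost = sorted(set(range(expected_count)) - seen)  # never arrived
    return result
```

Under at-least-once delivery some duplicates after a failure can be expected, while out-of-order or lost messages indicate a violated expectation. Run the check only after the consumer has drained the topic, since messages still in flight would otherwise be reported as lost.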
Example of NoSQLBench Pulsar driver metrics rendered with Grafana
End-to-end publish-to-consume latency and error metrics
Message error rate (zoomed in)
Detecting ordering issues
Pulsar Java client ordering issues fixed since Pulsar version 2.8.2:
● [Java Client] Remove data race in MultiTopicsConsumerImpl to ensure correct message order #12456
● [Java Client] Use epoch to version producer's cnx to prevent early delivery of messages #12779
Automation choices
● No automation: interactive testing
● Custom script / in-house test framework
● Fallout
○ Open source test orchestration harness
○ Automates environment creation, workload execution, data collection, and analysis
○ Plugin architecture integrates with common tools
Example of a testing setup for Pulsar validation
Deployment view of example setup (k8s cluster)
● Chaos Mesh
● Pulsar deployment: brokers, bookies, ZooKeeper nodes
● Test workload: NoSQLBench jobs run as k8s jobs on a dedicated k8s node pool
● Prometheus Graphite Exporter
● Prometheus
● Grafana, Grafana dashboards, Grafana renderer
● Test control scripts
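Test control scripts can pull results from Prometheus for the test report through its HTTP API (`/api/v1/query` for instant queries). A minimal sketch using only the Python standard library; the Prometheus address and any PromQL expression passed in are placeholders for whatever metrics the deployment actually exports.

```python
import json
from urllib.parse import urlencode


def instant_query_url(prometheus_base_url: str, promql: str) -> str:
    """URL for the Prometheus HTTP API instant query endpoint."""
    return f"{prometheus_base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(response_body: str):
    """Turn an instant-vector response into a list of (labels, float value) pairs."""
    body = json.loads(response_body)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body.get('error')}")
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample value is a [unix_timestamp, "value as string"] pair.
    return [(r["metric"], float(r["value"][1])) for r in data["result"]]
```

In a script, the body would typically come from `urllib.request.urlopen(instant_query_url(...)).read().decode()`, queried once per test phase so that the report can compare baseline, fault, and recovery.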
Demo recording
Lari Hotari
Thank you!
Lari.Hotari@datastax.com
@lhotari
Backup slides
Erik Hollnagel’s Four Cornerstones of Resilience
● Anticipation: knowing what to EXPECT
● Monitoring: knowing what to LOOK FOR
● Response: knowing what to DO
● Learning: knowing what has HAPPENED

Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your QueriesExploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 

Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit SF 2022

  • 1. Pulsar Summit San Francisco, Hotel Nikko, August 18 2022
    Tech Deep Dive: Validating Apache Pulsar’s Behavior under Failure Conditions
    Lari Hotari, Engineering Coach, DataStax
  • 2. Lari Hotari is an Apache Pulsar committer and PMC member. He has worked on the Java platform since 1997 and has contributed to open source for over 20 years.
    Lari Hotari, Engineering Coach, Streaming Customer Reliability Engineering, DataStax
    Lari.Hotari@datastax.com, @lhotari
  • 3. “Apache Pulsar is a highly available, distributed messaging system that provides guarantees of no message loss and strong message ordering with predictable read and write latency.”
  • 4. “Highly available”
    Expectation: The provided service meets the service consumer’s requirements with very low downtime.
    Expectation: “two nines” (99% available) or more.
  • 5. Availability
    Availability %                        Downtime per day (24 hours)
    99% ("two nines")                     14.4 minutes
    99.5% ("two and a half nines")        7.20 minutes
    99.9% ("three nines")                 1.44 minutes
    99.95% ("three and a half nines")     43.2 seconds
    99.99% ("four nines")                 8.64 seconds
    99.995% ("four and a half nines")     4.32 seconds
    99.999% ("five nines")                864 milliseconds
    ● During uptime, the provided service meets the agreed level of operational quality and performance defined in the operational SLA.
    ● The service consumer’s needs are met when service disruptions don’t cause significant negative business impact.
    Some factors impacting the availability figures:
    ● Reporting interval
    ● What is considered downtime?
      ○ Total failure vs. service degradation / partial failure
      ○ High error rate? Exceeding latency requirements?
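The downtime figures in the availability table follow from allowed downtime = (1 - availability) × period; for 99.9% over 24 hours that is 0.001 × 1440 minutes = 1.44 minutes. A minimal Java sketch of that arithmetic (the class and method names are hypothetical, chosen for illustration) makes it easy to recompute the budget for other reporting intervals:

```java
import java.time.Duration;

/** Converts an availability target into the maximum downtime allowed per reporting period. */
final class DowntimeBudget {
    private DowntimeBudget() {}

    /**
     * Allowed downtime = (1 - availability) * period.
     * availabilityPercent is a percentage, e.g. 99.9 for "three nines".
     */
    static Duration allowedDowntime(double availabilityPercent, Duration period) {
        if (availabilityPercent < 0.0 || availabilityPercent > 100.0) {
            throw new IllegalArgumentException("availability must be between 0 and 100 percent");
        }
        double unavailableFraction = (100.0 - availabilityPercent) / 100.0;
        // Round to whole nanoseconds to absorb floating-point noise (e.g. 100 - 99.9 = 0.0999...)
        long nanos = Math.round(period.toNanos() * unavailableFraction);
        return Duration.ofNanos(nanos);
    }

    /** Convenience overload for the 24-hour period used in the table. */
    static Duration allowedDowntimePerDay(double availabilityPercent) {
        return allowedDowntime(availabilityPercent, Duration.ofDays(1));
    }
}
```

For example, `allowedDowntime(99.9, Duration.ofDays(30))` gives 43.2 minutes. This is why the reporting interval matters: a single 40-minute outage fits a 30-day "three nines" budget but breaks a 24-hour one.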
  • 6. “No message loss”
    Expectation: At-least-once message delivery. Published messages are not lost in the system under any circumstances.
    Consumption state is preserved so that messages aren’t skipped during consumption. The system redelivers messages that aren’t acknowledged.
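The at-least-once expectation can be illustrated with a toy, in-memory model. This is an assumption-laden sketch, not how Pulsar brokers track subscription state: it only shows that a message received but not acknowledged before a consumer failure is delivered again, while acknowledged messages are not. `ToySubscription` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Toy model of an at-least-once subscription (NOT Pulsar's implementation).
 * Message ids are indexes into the published log.
 */
final class ToySubscription {
    private final List<String> log = new ArrayList<>();   // published messages; index = message id
    private final Set<Integer> acked = new HashSet<>();   // individually acknowledged message ids
    private int readPosition = 0;                          // next id to dispatch in this consumer session

    int publish(String payload) {
        log.add(payload);
        return log.size() - 1;
    }

    /** Dispatches the next unacknowledged message id, or -1 if none remain. */
    int receive() {
        while (readPosition < log.size()) {
            int id = readPosition++;
            if (!acked.contains(id)) {
                return id;
            }
        }
        return -1;
    }

    void acknowledge(int id) {
        acked.add(id);
    }

    /**
     * Simulates a consumer crash and reconnect: dispatch restarts from the beginning,
     * skipping acknowledged messages, so every unacknowledged message is delivered again.
     */
    void reconnect() {
        readPosition = 0;
    }

    String payload(int id) {
        return log.get(id);
    }
}
```

A failure-condition test makes the same check against a real cluster: messages that were received but unacknowledged when a broker or consumer failed must show up again afterwards, which is also why duplicates are expected under at-least-once delivery.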
  • 7. “Strong message ordering”
    Expectation: Messages are delivered to a consumer in the same order in which the publisher published them to a single topic.
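One way to turn the ordering expectation into a check is to tag each message with the producer's name and a per-producer sequence number, then verify that, per producer, the numbers only increase. The sketch below assumes that tagging scheme (the `Received` record and `OrderingCheck` class are hypothetical) and deliberately does not constrain how messages from different producers interleave, since the expectation is stated for a single publisher.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A received message tagged with its producer and that producer's sequence number. */
record Received(String producerName, long sequenceId) {}

final class OrderingCheck {
    private OrderingCheck() {}

    /**
     * Returns true if, for every producer, sequence ids arrive in strictly increasing order.
     * A redelivered (duplicate) message also counts as a violation in this simple check.
     */
    static boolean inPublishOrder(List<Received> received) {
        Map<String, Long> lastSeen = new HashMap<>();
        for (Received r : received) {
            Long previous = lastSeen.put(r.producerName(), r.sequenceId());
            if (previous != null && r.sequenceId() <= previous) {
                return false;
            }
        }
        return true;
    }
}
```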
  • 8. “Predictable read and write latency”
    Expectation: The messaging system can be used for use cases with low-latency requirements.
    Applications can expect messages to be published with low latency, and the end-to-end latency from publishing to consuming is expected to be low and predictable.
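“Predictable” latency is usually judged by tail percentiles such as p99 rather than by the average. A minimal sketch, assuming the publish and consume timestamps come from comparable clocks (the same host or well-synchronized hosts), computes end-to-end latencies and a nearest-rank percentile; the class and method names are hypothetical:

```java
import java.util.Arrays;

final class LatencyStats {
    private LatencyStats() {}

    /** End-to-end latency per message: consume time minus publish time (clocks assumed comparable). */
    static long[] endToEndMillis(long[] publishMillis, long[] consumeMillis) {
        if (publishMillis.length != consumeMillis.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        long[] latencies = new long[publishMillis.length];
        for (int i = 0; i < latencies.length; i++) {
            latencies[i] = consumeMillis[i] - publishMillis[i];
        }
        return latencies;
    }

    /** Nearest-rank percentile, 0 < p <= 100, of the given latencies. */
    static long percentile(long[] latencies, double p) {
        if (latencies.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = latencies.clone();
        Arrays.sort(sorted);
        // 1-based rank; multiplying before dividing avoids rounding noise such as 0.99 * 100
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

Nearest-rank is used here only because it is simple and exact for small samples; load-testing tools commonly record latencies in histograms instead.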
  • 9. Summary of Expectations
    ● Highly available
    ● No message loss
    ● Strong message ordering
    ● Predictable read and write latency
  • 11. Failure Conditions: What could possibly go wrong?
  • 12. How to think about the different ways things can fail and decide what to validate?
    ● Learning from real production systems
      ○ Incident reports / post-mortems
    ● System analysis methods coming from
      ○ Reliability engineering
        ■ Reliability modeling
      ○ Systems reliability theory
        ■ FMEA/FMECA (failure mode and effects analysis / failure mode, effects, and criticality analysis)
      ○ Risk assessment theory
        ■ Risk analysis
  • 13. Examples of failure conditions for Pulsar validation
    ● Broker/Bookie/ZooKeeper node fails
    ● All components in an availability zone fail
    ● Network disconnected -> network partitioning / split-brain
    ● Limited network bandwidth / increased latency
    ● Flapping network connectivity
    ● Network packet loss
    ● Bookie/ZooKeeper disk fails
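FMEA, listed among the analysis methods on slide 12, can help decide which of these failure conditions to validate first. Classic FMEA rates each failure mode for severity (S), occurrence (O) and detection (D, where a high value means hard to detect) on a 1 to 10 scale and ranks modes by the risk priority number RPN = S × O × D. The sketch below applies that ranking; `FailureMode` and `FmeaRanking` are hypothetical names, and any ratings you feed it are team judgments, not measurements of Pulsar.

```java
import java.util.Comparator;
import java.util.List;

/** One failure mode with FMEA ratings on a 1-10 scale (10 = worst; for detection, hardest to detect). */
record FailureMode(String name, int severity, int occurrence, int detection) {
    FailureMode {
        for (int rating : new int[] {severity, occurrence, detection}) {
            if (rating < 1 || rating > 10) {
                throw new IllegalArgumentException("ratings must be 1-10: " + name);
            }
        }
    }

    /** Risk priority number as used in classic FMEA: S x O x D. */
    int riskPriorityNumber() {
        return severity * occurrence * detection;
    }
}

final class FmeaRanking {
    private FmeaRanking() {}

    /** Orders failure modes so that the highest-risk ones are validated first. */
    static List<FailureMode> byPriority(List<FailureMode> modes) {
        return modes.stream()
                .sorted(Comparator.comparingInt(FailureMode::riskPriorityNumber).reversed())
                .toList();
    }
}
```

With made-up ratings of (9, 3, 7) for network partitioning and (5, 6, 2) for a broker node failure, partitioning ranks first with an RPN of 189 against 60.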
  • 14. Examples of other conditions for Pulsar validation
    ● Broker scale-up / scale-down
    ● Bookie scale-up / scale-down
    ● Broker/Bookie/ZooKeeper software upgrade
    Performance / load testing related failure conditions:
    ● Message publishing overload
    ● Message consuming overload
  • 15. Unknown failure conditions: these will always exist
    “Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don't know we don't know. And if one looks throughout the history of incident reports*, it is the latter category that tends to be the difficult ones.”
    - Donald Rumsfeld (*adapted to SRE; the original says “the history of our country and other free countries”)
  • 17. Test plans and test reports
    ● Useful for collaboration and communicating with stakeholders
    ● Written test plan with specific test cases and documented expectations
      ○ Test case descriptions include the definition of the failure condition
    ● Test reports that capture essential results for analysis
  • 18. Test plan example
    Test case format:
    - Test case identifier + title
    - Description and intent
    - Procedure
    - Expected outcome
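If test plans are kept alongside automation code, the test case format above maps naturally onto a small data type. This is a hypothetical sketch (the `TestCase` record and its `render` method do not come from any tool mentioned in the talk) that keeps the four fields and renders them as plain text for a written plan:

```java
import java.util.List;

/** Hypothetical representation of the test case format: id + title, intent, procedure, expected outcome. */
record TestCase(String id, String title, String descriptionAndIntent,
                List<String> procedure, String expectedOutcome) {
    TestCase {
        procedure = List.copyOf(procedure); // immutable defensive copy
    }

    /** Renders the test case as plain text, numbering the procedure steps from 1. */
    String render() {
        StringBuilder sb = new StringBuilder();
        sb.append(id).append(": ").append(title).append('\n');
        sb.append("Description and intent: ").append(descriptionAndIntent).append('\n');
        sb.append("Procedure:\n");
        for (int i = 0; i < procedure.size(); i++) {
            sb.append("  ").append(i + 1).append(". ").append(procedure.get(i)).append('\n');
        }
        sb.append("Expected outcome: ").append(expectedOutcome).append('\n');
        return sb.toString();
    }
}
```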
  • 19. Test report example: analysis and status update to stakeholders
  • 20. Validation approaches
    Test environment with test workload:
    ● Resilience testing
    ● Chaos testing
    Production environment with production workload:
    ● Resilience engineering
    ● Chaos testing
  • 21. Chaos Testing
    ● Requires test tooling for fault injection
    ● Fault injection can be used to put specific infrastructure components into a failed or degraded state, controlled by the chaos testing framework
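To illustrate what a failed or degraded state controlled by the framework means, here is a toy, in-process fault injector. It is an assumption-laden sketch: tools such as Chaos Mesh inject faults at the infrastructure level (killing pods, delaying or partitioning network traffic, injecting I/O faults) rather than by wrapping calls in application code, and `FaultInjector` is a hypothetical name.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** States the (toy) chaos controller can put a component into. */
enum FaultState { HEALTHY, DEGRADED, FAILED }

/** Toy fault-injection wrapper around calls to a component. */
final class FaultInjector {
    private volatile FaultState state = FaultState.HEALTHY;
    private volatile Duration degradedDelay = Duration.ZERO;

    /** Every call fails until heal() is invoked. */
    void fail() {
        state = FaultState.FAILED;
    }

    /** Every call is delayed by extraLatency until heal() is invoked. */
    void degrade(Duration extraLatency) {
        degradedDelay = extraLatency;
        state = FaultState.DEGRADED;
    }

    void heal() {
        state = FaultState.HEALTHY;
    }

    /** Runs the operation, applying whatever fault is currently injected. */
    <T> T call(Callable<T> operation) throws Exception {
        switch (state) {
            case FAILED -> throw new IllegalStateException("injected fault: component unavailable");
            case DEGRADED -> Thread.sleep(degradedDelay.toMillis());
            case HEALTHY -> { }
        }
        return operation.call();
    }
}
```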
  • 22. Test workload
    ● Simulated workload created with test tooling
    ● Test applications in a test environment
    ● Anonymized / shadowed production traffic
  • 23. Test workload generation
    ● NoSQLBench, ASL 2.0 license, https://github.com/nosqlbench/nosqlbench
      ○ Originally created for testing NoSQL databases, but has since been adapted for testing messaging systems
    ● pulsar-perf
      ○ Comes with the Apache Pulsar distribution
    ● Custom test workload generator applications
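A custom workload generator needs to embed enough information in each message for the consuming side to compute end-to-end latency and detect loss or reordering. The sketch below assumes a hypothetical payload layout (an 8-byte sequence number, an 8-byte publish timestamp in epoch milliseconds, then zero padding up to the configured message size); it is not the format used by NoSQLBench or pulsar-perf.

```java
import java.nio.ByteBuffer;

/** Hypothetical test payload: [sequence: 8 bytes][publish epoch millis: 8 bytes][zero padding]. */
final class TestPayload {
    static final int HEADER_BYTES = Long.BYTES * 2;

    private TestPayload() {}

    static byte[] encode(long sequence, long publishEpochMillis, int totalSize) {
        if (totalSize < HEADER_BYTES) {
            throw new IllegalArgumentException("payload must be at least " + HEADER_BYTES + " bytes");
        }
        return ByteBuffer.allocate(totalSize)   // big-endian, zero-filled
                .putLong(sequence)
                .putLong(publishEpochMillis)
                .array();                       // remaining bytes stay zero as padding
    }

    static long sequence(byte[] payload) {
        return ByteBuffer.wrap(payload).getLong(0);
    }

    static long publishEpochMillis(byte[] payload) {
        return ByteBuffer.wrap(payload).getLong(Long.BYTES);
    }
}
```

Padding lets the generator keep a realistic message size while the header carries the validation data.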
  • 24. Tooling requirement for validating Pulsar’s behavior
    ● End-to-end observability
      ○ NoSQLBench Pulsar driver features:
        ■ Measure end-to-end message processing latency
        ■ Detect out-of-order messages, message loss, and message duplication
    Expectations: highly available, no message loss, strong message ordering, predictable read and write latency
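Detecting out-of-order messages, loss and duplication can be sketched for a single producer whose messages carry consecutive sequence numbers starting at 0. This illustrates the idea under that assumption and is not the NoSQLBench Pulsar driver's implementation; `StreamChecker` is hypothetical. Loss is only counted once the run has drained and the number of published messages is known, because while the test is still running a message that is merely late looks the same as a lost one.

```java
import java.util.BitSet;

/** Per-producer stream checker; assumes sequence numbers 0, 1, 2, ... */
final class StreamChecker {
    private final BitSet seen = new BitSet();   // sequence numbers received at least once
    private long highest = -1;                  // highest sequence number received so far
    private long duplicates = 0;
    private long outOfOrder = 0;

    void onReceive(long sequence) {
        int seq = Math.toIntExact(sequence);    // BitSet indexes are ints
        if (seen.get(seq)) {
            duplicates++;                       // delivered again: allowed under at-least-once, but counted
            return;
        }
        if (seq < highest) {
            outOfOrder++;                       // a later message arrived before this one
        }
        seen.set(seq);
        highest = Math.max(highest, seq);
    }

    long duplicates() {
        return duplicates;
    }

    long outOfOrder() {
        return outOfOrder;
    }

    /** Messages in [0, publishedCount) never received; call only after the run has drained. */
    long lost(long publishedCount) {
        int end = Math.toIntExact(publishedCount);
        return end - seen.get(0, end).cardinality();
    }
}
```

Counting duplicates separately from ordering violations keeps redeliveries after a failure, which at-least-once delivery permits, from being reported as ordering bugs.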
  • 25. Example of NoSQLBench Pulsar driver metrics rendered with Grafana: end-to-end publish-to-consume latency and error metrics
  • 26. Message Error Rate (zoomed in)
  • 27. Detecting ordering issues
    Pulsar Java client ordering issues fixed since Pulsar version 2.8.2:
    ● [Java Client] Remove data race in MultiTopicsConsumerImpl to ensure correct message order #12456
    ● [Java Client] Use epoch to version producer's cnx to prevent early delivery of messages #12779
  • 28. Automation choices
    ● No automation: interactive testing
    ● Custom script / in-house test framework
    ● Fallout
      ○ Open source test orchestration harness
      ○ Automates creation of environment, workload execution, data collection and analysis
      ○ Plugin architecture integrates with common tools
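Whichever automation option is chosen, a failure-condition test run tends to follow the same phases: provision the environment, start the workload, inject the fault, remove it, collect results and tear down. A minimal sketch of a custom in-house runner, with a hypothetical `TestPhases` interface (this is not Fallout's API), shows the property that matters most: fault removal, result collection and teardown still run when an earlier phase fails.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical phases of one automated failure-condition test run. */
interface TestPhases {
    void provisionEnvironment() throws Exception;
    void startWorkload() throws Exception;
    void injectFault() throws Exception;
    void removeFault() throws Exception;     // must be safe to call even if the fault was never injected
    void collectResults() throws Exception;
    void teardown() throws Exception;
}

/** Minimal runner: runs phases in order and always attempts cleanup, returning every error seen. */
final class FailureTestRunner {
    private FailureTestRunner() {}

    static List<Exception> run(TestPhases p) {
        List<Exception> errors = new ArrayList<>();
        try {
            p.provisionEnvironment();
            p.startWorkload();
            p.injectFault();
        } catch (Exception e) {
            errors.add(e);                   // stop the main sequence at the first failure
        } finally {
            attempt(p::removeFault, errors);
            attempt(p::collectResults, errors);
            attempt(p::teardown, errors);
        }
        return errors;
    }

    private interface Step {
        void run() throws Exception;
    }

    /** Runs one cleanup step, recording its error instead of aborting the remaining steps. */
    private static void attempt(Step step, List<Exception> errors) {
        try {
            step.run();
        } catch (Exception e) {
            errors.add(e);
        }
    }
}
```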
  • 29. Example of a testing setup for Pulsar validation
  • 30. Deployment view of example setup (k8s cluster)
    ● Chaos Mesh
    ● Pulsar deployment: brokers, bookies, ZooKeeper nodes
    ● Test workload: NoSQLBench jobs run as k8s jobs on a dedicated k8s node pool
    ● Prometheus Graphite Exporter
    ● Prometheus
    ● Grafana, Grafana dashboards, Grafana renderer
    ● Test control scripts
  • 31. Demo recording
  • 32. Thank you!
    Lari Hotari, Lari.Hotari@datastax.com, @lhotari
    Pulsar Summit San Francisco, Hotel Nikko, August 18 2022
  • 34. Erik Hollnagel’s Four Cornerstones of Resilience
    ● Anticipation: knowing what to EXPECT
    ● Monitoring: knowing what to LOOK FOR
    ● Response: knowing what to DO
    ● Learning: knowing what has HAPPENED