2. • Committer on the Apache Pulsar
project.
• Former Principal Software Engineer on
Splunk’s Pulsar-as-a-Service team.
• Global Streaming Practice Director at
Streamlio & Hortonworks
3. • Author of Pulsar in Action
• Co-author, Practical Hive
4. When Failure is Not an
Option
Introducing Pulsar’s Failover Client
5. Defining Availability
• Availability is measured as the ratio of uptime to total time (uptime plus downtime) within a year.
• Each layer builds on the previous one.
6. Multifaceted Availability
• Availability is a
concern across multiple
layers.
• Each of these has its own uptime metric.
• Application uptime is
equal to the lowest
uptime metric across all
layers.
8. Platform Availability Features
• Stateless brokers
• Redundant components
across all layers.
• Ability to leverage cloud-native features like StatefulSets to maintain a minimum replica count
9. Data Availability Features
• Self-healing replicated data storage.
• Rack placement policies
• Geo-replication of data
10. Application Availability Features
• Connection-aware clients
that automatically
detect and recover in
the event a client
disconnects from one of
the brokers.
• Completely transparent
to the application.
11. Availability in Pulsar Before 2.10
• Apache Pulsar could only provide high availability.
• Application availability is the weakest link.
12. What was missing?
• Up until now, Pulsar clients could only interact with a
single Pulsar cluster and were unable to detect and
respond to a cluster-level failure event.
• In the event of a complete cluster failure, these
clients cannot reroute their messages to a
secondary/standby cluster automatically.
• In such a scenario, any application that uses the Pulsar
client is vulnerable to a prolonged outage since the
clients could not establish a connection to an active
cluster.
13. Pre-2.10 Cluster Failover
• To redirect the clients from the “active” to the standby
cluster, the DNS entry for the Pulsar endpoint that the client
applications are using must be updated to point to the load
balancer of the standby cluster.
• Pulsar clients are
configured to use a single
static URL to connect
• The DNS record is updated
to point to the regional
load balancer
14. What is wrong with this approach?
• It requires your DevOps team to monitor the health of your
Pulsar clusters and manually update the DNS record to
point to the stand-by cluster when the active cluster is
down.
• This cutover is not automatic, and the recovery time is
determined by the response time of your DevOps team.
• Even after the DNS record has been changed, it will take
some additional time before the DNS cache is refreshed.
16. Two new approaches
• There are two new cluster failover strategies
included in the upcoming 2.10 release.
• One supports automatic failover in the event of a
cluster outage, while the other enables you to
control the switch-over through an HTTP endpoint.
17. Automated Failover
• The AutoClusterFailover failover strategy
automatically switches from the primary cluster to a
stand-by cluster in the event of a cluster outage.
• This behavior is controlled by a probe task that
monitors the primary cluster.
• When it finds the primary cluster is unavailable for
more than failoverDelayMs, it will switch the
client connections over to the secondary cluster.
19. Controlled Failover
• The ControlledClusterFailover strategy,
supports switching from the primary cluster to a
stand-by cluster in response to a signal sent
from an external service.
• This strategy enables your administrators to
trigger the cluster switch over.
23. What am I going to demo?
• Automatic Failover:
• Step 1: Start an application that uses the Automatic Failover client
to produce data to a topic.
• Step 2: Start consumers on both the active & standby clusters.
• Step 3: Stop the active Pulsar cluster
• Step 4: Observe the flow of data shift from the active to the standby
cluster
• Step 5: Restart the primary cluster
• Step 6: Observe the flow of data shift back to the primary cluster
24. What am I going to demo?
• Controlled Failover:
• Step 1: Start the REST Endpoint service.
• Step 2: Start an application that uses the Controlled Failover client to
produce data to a topic.
• Step 3: Start consumers on both the active & standby clusters.
• Step 4: Trigger the controller to switch to a different Pulsar cluster after
approximately 20 messages
• Step 5: Observe the flow of data shift from the active to the standby cluster
• Step 6: Trigger the controller to switch to the original Pulsar cluster after
approximately 30 messages
25. Summary
• Release 2.10 of Pulsar includes two new failover
clients that provide continuous availability for your
Pulsar applications
• I demonstrated how to configure and use the Automatic
failover client when producing messages.
• The Controlled Failover client is harder to implement
because it requires an additional service to be
written, but it does provide more flexibility.
26. Thanks for Attending
Scan the QR Code to
learn more about Apache
Pulsar.
Explore the Code
https://github.com/david-streamlio/cluster-failover-demo
Welcome to my talk entitled “When Failure Is Not an Option”.
Today I will be discussing the additions to the Apache Pulsar project that can help provide continuous availability for your applications that interact with Pulsar.
My name is David Kjerrumgaard, and I am proud to be a committer on the Apache Pulsar project.
I am currently a Developer Advocate at StreamNative, the company behind Apache Pulsar
Previously I was a principal software engineer at Splunk, where I worked on their Pulsar-as-a-Service team
I am also the author of Pulsar in Action, published by Manning.
And co-author of Practical Hive, published by Apress.
Developing a continuously-available application requires more than just utilizing fault-tolerant services such as Apache Pulsar in your software stack.
It also requires immediate failure detection and resolution including built-in failover when there are data center outages.
Up until now, Pulsar clients could only interact with a single Pulsar cluster and were unable to detect and respond to a cluster-level failure event. In the event of a complete cluster failure, these clients cannot reroute their messages to a secondary/standby cluster automatically.
This can lead to application failure, which for many is not an option.
Uptime is typically measured by calculating the ratio of uptime to total time within a year, then expressing that ratio as a percentage.
The concept of “five-nines” — availability of 99.999% — has been an industry gold standard for many years.
Systems that can only survive failures at the hardware layer (including individual server outages) are considered “fault-tolerant”.
Systems that can survive an AZ outage are considered “highly-available”
The ability to survive one or more regional outages is considered “continuously available”
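These thresholds translate directly into a yearly downtime budget, and because availability is layered, the application's overall availability is bounded by its weakest layer. A quick illustrative sketch (my own arithmetic, not from the talk):

```java
public class AvailabilityMath {
    // Minutes of allowed downtime per year for a given availability percentage.
    static double downtimeMinutesPerYear(double availabilityPercent) {
        double minutesPerYear = 365.0 * 24 * 60; // 525,600 minutes
        return minutesPerYear * (1.0 - availabilityPercent / 100.0);
    }

    // End-to-end availability is bounded by the weakest layer
    // (platform, data, or application).
    static double applicationAvailability(double... layerAvailabilities) {
        double min = 100.0;
        for (double a : layerAvailabilities) {
            min = Math.min(min, a);
        }
        return min;
    }

    public static void main(String[] args) {
        // "Five nines" allows roughly 5.26 minutes of downtime per year.
        System.out.printf("Five nines budget: %.2f minutes/year%n",
                downtimeMinutesPerYear(99.999));
        // The 99.9% layer drags the whole application down to 99.9%.
        System.out.printf("Application availability: %.3f%%%n",
                applicationAvailability(99.999, 99.99, 99.9));
    }
}
```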
When people use the term availability, they tend to think of only PLATFORM availability. i.e., is the system up or down?
This is because availability is generally considered a DevOps concern, but it is an APPLICATION and DATA concern as well.
one approach to providing high-availability is to distribute the platform resources across different zones and/or geographical regions.
While necessary, this isn’t enough. The data used by the system must be kept in sync across those zones and regions as well.
A system with a missing or incomplete dataset is often worse than not having the system available at all, as it can lead to incorrect information, duplicate processing, etc.
From an application perspective, it is incumbent upon your application to be able to immediately detect a failure in the system and automatically switch over to the “active” platform in a seamless manner.
Let’s start with a quick review of all of Pulsar’s availability features already inside the platform.
Let’s look at Pulsar’s platform availability features.
Pulsar’s multi-tiered design makes it highly-available by default.
Separating the serving layer from the data storage layer allows Pulsar’s brokers to be 100% stateless.
Consequently, any broker can serve data from any topic by reading the data from the separate storage layer instead of from a local disk (as other messaging systems such as Kafka do).
Additionally, stateless brokers that fail can be easily replaced with new broker instances without any additional setup steps.
Pulsar’s storage layer maintains multiple replicas of the data on different bookie nodes to ensure that the loss of one or more bookies does not result in a loss of the data.
From a Data availability perspective,
Pulsar’s storage layer is self-healing. It will automatically detect any under-replicated data and re-create new copies of the data for you.
This allows us to easily replace any failed bookies with new bookie instances and let the self-healing mechanism re-populate the new bookie with data.
This ensures data availability within an individual cluster.
Furthermore, Pulsar supports rack-placement to ensure that at least one replica of the data in the storage layer is stored in a different AZ within the same geographical region.
Pulsar’s geo-replication mechanism allows you to asynchronously replicate data across multiple clusters to maintain consistent copies of your datasets between regions.
These capabilities combine to provide continuous data availability.
At the application level, Pulsar provides connection-aware clients that insulate the application from intermittent network outages.
The Pulsar client automatically detects these network issues and re-establishes the connection rather than throwing an exception that (if uncaught) could cause the application to crash.
This behavior is completely hidden from the application code and provides resiliency to broker failures.
Prior to the 2.10 release Pulsar was able to provide continuous availability at only the platform and data level.
Pulsar’s geo-replication mechanism allows you to replicate the data across multiple geographic regions, ensuring that your data will be available even in the event of a region failure.
Similarly, Pulsar’s architecture supports multiple clusters spread across different geographical regions, ensuring that a complete Pulsar cluster will be readily available in the event of a region failure.
The one missing piece to the continuous availability story was the application layer.
This inability to fail over would eventually lead to prolonged outages at the application level.
Prior to the 2.10 release of Pulsar, the best you could do was to provide a single static endpoint for Pulsar as shown here.
Oftentimes, the connection URL to Pulsar is provided by a configuration file. This value is read once and remains static inside the application.
Then when a regional failure occurred, you had to manually change the DNS entry for that URL to point to the stand-by cluster.
Starting with release 2.10 of Pulsar, we have added a new feature called failover clients that solves these problems.
There are two distinct types of failover clients that are available in the 2.10 release
The first is one that will automatically reroute your client connections to a different Pulsar cluster as soon as it detects a cluster outage.
The second one allows you to trigger the failover through an exposed HTTP endpoint. This client will periodically invoke the exposed endpoint to get the connection details
of the cluster it is supposed to connect to. This approach allows your admins to have more control over the failover process.
So, let’s discuss the automated failover client first.
As the name implies, this failover client will automatically switch clients over to a designated standby cluster if and when it detects an outage on the primary cluster.
This is accomplished by a probe task that periodically interrogates the primary cluster to determine if it is running or not.
Once it has detected that the primary cluster is unavailable, it starts a timer to measure the length of the outage. This is to ensure that we don’t inadvertently switch over due to a transient network issue.
If the outage continues for longer than the user-configured duration, then the switch-over occurs.
Let’s look at how this automatic failover client is configured and used
The first thing to note is the creation of a separate set of authentication credentials for the secondary cluster.
Next, note that there are both a primary cluster URL property and a secondary property.
The primary property takes the broker URL for your preferred cluster connection, while the secondary takes a list of one or more alternative clusters to connect to.
This allows you to have multiple stand-by clusters, which matches Pulsar’s geo-replication capability of supporting multiple clusters.
The failoverDelay property specifies how long the primary cluster outage must be before switching over to the standby cluster.
The switchback property specifies how long the client waits to switch back to the primary cluster once it detects that the primary cluster is back up and running.
This is because the probe against the primary cluster will continue to run even after the client has failed over to the standby cluster. Once it has detected that the primary cluster is back up it will wait this long to switch back to the primary cluster
The checkInterval controls the frequency at which the probe is executed.
Finally, the failover configuration is then used to build a Pulsar client.
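Putting those properties together, the configuration looks roughly like this. The URLs are placeholders, and while the builder and package names follow the 2.10 Java client documentation, this is a sketch to check against the release Javadoc rather than a definitive implementation:

```java
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.ServiceUrlProvider;
import org.apache.pulsar.client.impl.AutoClusterFailover;

import java.util.Collections;
import java.util.concurrent.TimeUnit;

public class AutoFailoverExample {
    public static void main(String[] args) throws Exception {
        ServiceUrlProvider failover = AutoClusterFailover.builder()
                // Broker URL of the preferred (primary) cluster.
                .primary("pulsar://primary-cluster:6650")
                // One or more stand-by clusters, in preference order.
                .secondary(Collections.singletonList("pulsar://standby-cluster:6650"))
                // How long the primary outage must last before switching over.
                .failoverDelay(30, TimeUnit.SECONDS)
                // How long to wait, once the primary is healthy again,
                // before switching back to it.
                .switchBackDelay(60, TimeUnit.SECONDS)
                // How frequently the probe task checks the primary cluster.
                .checkInterval(1000, TimeUnit.MILLISECONDS)
                .build();

        // The failover configuration is then used to build the Pulsar client.
        PulsarClient client = PulsarClient.builder()
                .serviceUrlProvider(failover)
                .build();
    }
}
```

The separate credentials for the secondary cluster mentioned above can also be supplied through the same builder; this fragment omits authentication to keep the sketch short.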
Now let’s discuss the controlled failover client
As the name implies, this client allows you to control when and where your pulsar client will fail over to.
This is accomplished via a REST service that YOU must implement.
Let’s look at how this controlled failover client is configured and used
The first thing to note is the creation of a separate set of authentication credentials. These are for accessing the REST endpoint (NOT the standby cluster).
The default service URL property takes the broker URL for your preferred cluster connection.
The checkInterval controls the frequency at which the REST endpoint is invoked.
The urlProvider is where you specify the address of the REST service you implemented, and the urlHeader is where you provide the contents of the HTTP header.
The header can be used to provide authentication credentials, etc.
Finally, the failover configuration is then used to build a Pulsar client.
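The controlled variant wires those same four properties together along these lines. Again, the URLs and header value are placeholders, and the builder names are taken from the 2.10 Java client documentation, so treat this as a sketch:

```java
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.ServiceUrlProvider;
import org.apache.pulsar.client.impl.ControlledClusterFailover;

import java.util.Collections;
import java.util.concurrent.TimeUnit;

public class ControlledFailoverExample {
    public static void main(String[] args) throws Exception {
        ServiceUrlProvider provider = ControlledClusterFailover.builder()
                // Broker URL of the preferred cluster, used until the
                // REST endpoint says otherwise.
                .defaultServiceUrl("pulsar://primary-cluster:6650")
                // How often the client polls the REST endpoint.
                .checkInterval(5, TimeUnit.SECONDS)
                // Address of the REST service YOU implement (placeholder URL).
                .urlProvider("http://failover-controller:8080/pulsar-service-url")
                // HTTP header contents, e.g. credentials for the endpoint itself.
                .urlProviderHeader(Collections.singletonMap(
                        "Authorization", "Bearer <token>"))
                .build();

        // The failover configuration is then used to build the Pulsar client.
        PulsarClient client = PulsarClient.builder()
                .serviceUrlProvider(provider)
                .build();
    }
}
```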
Let’s look at a simple example of a REST endpoint service
First, notice that the expected return type is a JSON object that contains the four fields shown here.
This data structure allows you to provide all the necessary authentication credentials required to connect to a Pulsar cluster.
Also note that this information is generated dynamically in the code, so the service could, in theory, read this information from a database, etc.
This provides much more flexibility than the Automated failover client which requires you to provide a hard-coded list of Pulsar broker URLs.
In this example, I am forcing a switch over to a standby cluster based on the number of times the REST endpoint is called
This is to demonstrate a failover to a standby cluster and back to the active cluster, as we shall see.
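For illustration, here is a minimal sketch of such a service built on the JDK's own com.sun.net.httpserver, not the demo's actual implementation. The field names mirror the four shown on the slide, while the URLs, port, path, and the 20-call switch threshold are placeholder assumptions:

```java
import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

public class FailoverController {
    static final AtomicInteger calls = new AtomicInteger();

    // Build the JSON payload the failover client expects: the service URL
    // plus the authentication details needed to connect to that cluster.
    // After 20 calls, direct clients to the standby cluster (demo behavior).
    static String payload(int callCount) {
        String url = callCount < 20 ? "pulsar://primary-cluster:6650"
                                    : "pulsar://standby-cluster:6650";
        return "{"
                + "\"serviceUrl\":\"" + url + "\","
                + "\"tlsTrustCertsFilePath\":\"\","
                + "\"authPluginClassName\":\"\","
                + "\"authParamsString\":\"\""
                + "}";
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/pulsar-service-url", exchange -> {
            byte[] body = payload(calls.incrementAndGet())
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```

The failover client would poll this endpoint at its configured checkInterval and reconnect whenever the returned serviceUrl changes.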
Next I will demonstrate both of these failover clients in action.
For those of you that are interested, the source code for this demo is available in the GitHub repo shown here.