When Failure is Not an Option
David Kjerrumgaard
2022.07.29
• Committer on the Apache Pulsar
project.
• Former Principal Software Engineer on
Splunk’s Pulsar-as-a-Service team.
• Global Streaming Practice Director at
Streamlio & Hortonworks
• Author of Pulsar in Action
• Co-author, Practical Hive
When Failure is Not an
Option
Introducing Pulsar’s Failover Client
Defining Availability
• Availability is measured
as the percentage of a
year that a system is up.
• Each layer builds on
the previous one.
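To make the uptime percentage concrete, the sketch below converts an availability target into the downtime it allows per year. It assumes a 365-day year; the class and method names are made up for illustration and are not part of the talk.

```java
import java.time.Duration;

class DowntimeBudget {
    // Downtime allowed per 365-day year for a given availability percentage.
    static Duration allowedDowntimePerYear(double availabilityPercent) {
        double secondsPerYear = 365.0 * 24 * 60 * 60;
        double downFraction = 1.0 - availabilityPercent / 100.0;
        return Duration.ofMillis(Math.round(secondsPerYear * downFraction * 1000));
    }

    public static void main(String[] args) {
        // Print the yearly downtime budget for two through five nines.
        for (double target : new double[] {99.0, 99.9, 99.99, 99.999}) {
            Duration d = allowedDowntimePerYear(target);
            System.out.printf("%.3f%% allows about %d minutes of downtime per year%n",
                    target, d.toMinutes());
        }
    }
}
```

At five nines the yearly budget is only a few minutes, which is why a manual cutover is hard to fit inside it.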
Multifaceted Availability
• Availability is a
concern across multiple
layers.
• Each of these has its
own uptime metric.
• Application uptime is
equal to the lowest
uptime metric across all
layers.
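The weakest-link rule on this slide can be written down in a couple of lines. This is only a sketch of the rule as stated; the layer names and uptime figures are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class WeakestLink {
    // Applies the slide's rule: the application is only as available as its least available layer.
    static double applicationUptime(Map<String, Double> uptimePercentByLayer) {
        return Collections.min(uptimePercentByLayer.values());
    }

    public static void main(String[] args) {
        Map<String, Double> layers = new LinkedHashMap<>();
        layers.put("platform", 99.99);    // hypothetical figures
        layers.put("data", 99.999);
        layers.put("application", 99.9);
        System.out.println("Application uptime: " + applicationUptime(layers) + "%");
    }
}
```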
Pulsar’s Availability
Features
Platform Availability Features
• Stateless brokers
• Redundant components
across all layers.
• Ability to leverage cloud
native features like
stateful sets to maintain
minimum replica count
Data Availability Features
• Self-healing replicated data storage.
• Rack placement policies
• Geo-replication of data
Application Availability Features
• Connection-aware clients
that automatically
detect and recover in
the event a client
disconnects from one of
the brokers.
• Completely transparent
to the application.
Availability in Pulsar Before 2.10
• Apache Pulsar could only provide high availability.
• Application availability is the weakest link.
What was missing?
• Up until now, Pulsar clients could only interact with a
single Pulsar cluster and were unable to detect and
respond to a cluster-level failure event.
• In the event of a complete cluster failure, these
clients could not reroute their messages to a
secondary/standby cluster automatically.
• In such a scenario, any application that used the Pulsar
client was vulnerable to a prolonged outage, since the
clients could not establish a connection to an active
cluster.
Pre-2.10 Cluster Failover
• To redirect the clients from the “active” to the standby
cluster, the DNS entry for the Pulsar endpoint that the client
applications are using must be updated to point to the load
balancer of the standby cluster.
• Pulsar clients are
configured to use a single
static URL to connect
• The DNS record is updated
to point to the regional
load balancer
What is wrong with this approach?
• It requires your DevOps team to monitor the health of your
Pulsar clusters and manually update the DNS record to
point to the stand-by cluster when the active cluster is
down.
• This cutover is not automatic, and the recovery time is
determined by the response time of your DevOps team.
• Even after the DNS record has been changed, it will take
some additional time before the DNS cache is refreshed.
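One contributor to that extra delay can be the JVM itself, which caches successful DNS lookups. As a sketch (not something the talk prescribes), the standard `networkaddress.cache.ttl` security property caps how long the JVM keeps a resolved address. The 30-second value and the class name are illustrative assumptions, and the property typically has to be set before the first lookup to take effect.

```java
import java.security.Security;

class DnsCacheTtl {
    // Caps the JVM's positive DNS cache so that an updated DNS record is picked up sooner.
    static void capDnsCache(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
    }

    public static void main(String[] args) {
        capDnsCache(30); // illustrative value, not a recommendation from the talk
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

Even with a short TTL here, resolvers and load balancers outside the JVM may still hold the old record for a while.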
Failover Clients
Two new Cluster Cut-Over Strategies
Two new approaches
• There are two new cluster failover strategies
included in the upcoming 2.10 release.
• One supports automatic failover in the event of a
cluster outage, while the other enables you to
control the switch-over through an HTTP endpoint.
Automated Failover
• The AutoClusterFailover strategy
automatically switches from the primary cluster to a
stand-by cluster in the event of a cluster outage.
• This behavior is controlled by a probe task that
monitors the primary cluster.
• When it finds the primary cluster is unavailable for
more than failoverDelayMs, it will switch the
client connections over to the secondary cluster.
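The speaker notes for the next slide describe the configuration: a primary URL, a list of secondary URLs, a failover delay, a switch-back delay and a check interval. A minimal configuration fragment along those lines might look as follows. It assumes the `pulsar-client` 2.10+ library is on the classpath and that the builder methods are named as shown; the host names and durations are placeholders, and the secondary-cluster authentication settings mentioned in the notes are left out.

```java
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.ServiceUrlProvider;
import org.apache.pulsar.client.impl.AutoClusterFailover;

class AutoFailoverClientSketch {
    static PulsarClient build() throws PulsarClientException {
        ServiceUrlProvider failover = AutoClusterFailover.builder()
                // Preferred cluster (placeholder host name).
                .primary("pulsar://primary.example.com:6650")
                // One or more stand-by clusters (placeholder host names).
                .secondary(Arrays.asList(
                        "pulsar://standby-1.example.com:6650",
                        "pulsar://standby-2.example.com:6650"))
                // How long the primary must be down before switching over.
                .failoverDelay(30, TimeUnit.SECONDS)
                // How long the primary must be back up before switching back.
                .switchBackDelay(60, TimeUnit.SECONDS)
                // How often the probe runs.
                .checkInterval(1, TimeUnit.SECONDS)
                .build();

        // The client takes its service URL from the failover provider instead of a static URL.
        return PulsarClient.builder()
                .serviceUrlProvider(failover)
                .build();
    }
}
```

Producers and consumers created from this client need no failover-specific code; the provider is expected to move their connections when it switches clusters.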
Controlled Failover
• The ControlledClusterFailover strategy
supports switching from the primary cluster to a
stand-by cluster in response to a signal sent
from an external service.
• This strategy enables your administrators to
trigger the cluster switchover.
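According to the speaker notes, this client is configured with a default service URL, a check interval, the address of the REST service and an HTTP header for that service. A minimal configuration fragment might look as follows. It assumes the `pulsar-client` 2.10+ library is on the classpath and that the builder methods are named as shown; the URLs, the token and the interval are placeholders.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.ServiceUrlProvider;
import org.apache.pulsar.client.impl.ControlledClusterFailover;

class ControlledFailoverClientSketch {
    static PulsarClient build() throws IOException {
        // Header sent to the REST endpoint, e.g. to authenticate against it (placeholder token).
        Map<String, String> header = new HashMap<>();
        header.put("Authorization", "Bearer my-token");

        ServiceUrlProvider provider = ControlledClusterFailover.builder()
                // Cluster to use until the endpoint says otherwise (placeholder host name).
                .defaultServiceUrl("pulsar://primary.example.com:6650")
                // How often the client polls the endpoint.
                .checkInterval(1, TimeUnit.MINUTES)
                // Address of the REST service you implement (placeholder URL).
                .urlProvider("http://failover-controller.example.com/get")
                .urlProviderHeader(header)
                .build();

        return PulsarClient.builder()
                .serviceUrlProvider(provider)
                .build();
    }
}
```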
Demo Time!
https://github.com/david-streamlio/cluster-failover-demo
What am I going to demo?
• Automatic Failover:
• Step 1: Start an application that uses the Automatic Failover client
to produce data to a topic.
• Step 2: Start consumers on both the active & standby clusters.
• Step 3: Stop the active Pulsar cluster
• Step 4: Observe the flow of data shift from the active to the standby
cluster
• Step 5: Restart the primary cluster
• Step 6: Observe the flow of data shift back to the primary cluster
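The behavior to watch for in this demo follows from the probe logic described in the speaker notes: an outage must outlast the failover delay before the client switches, and a recovery must outlast the switch-back delay before it returns. The sketch below models that decision in plain Java. It is a simplified illustration, not Pulsar's implementation, and every name in it is hypothetical.

```java
class FailoverDecision {
    private final long failoverDelayMs;
    private final long switchBackDelayMs;
    private boolean onPrimary = true;
    private long timerStartMs = -1; // start of the current outage or recovery; -1 = no timer running

    FailoverDecision(long failoverDelayMs, long switchBackDelayMs) {
        this.failoverDelayMs = failoverDelayMs;
        this.switchBackDelayMs = switchBackDelayMs;
    }

    /** Feeds one probe result and returns true if the client should use the primary afterwards. */
    boolean onProbe(long nowMs, boolean primaryUp) {
        // A probe that agrees with the current choice cancels any running timer,
        // so a transient blip does not cause a switch.
        if (primaryUp == onPrimary) {
            timerStartMs = -1;
            return onPrimary;
        }
        if (timerStartMs < 0) {
            timerStartMs = nowMs; // start timing the outage (or the recovery)
        }
        long requiredMs = onPrimary ? failoverDelayMs : switchBackDelayMs;
        if (nowMs - timerStartMs > requiredMs) {
            onPrimary = !onPrimary;
            timerStartMs = -1;
        }
        return onPrimary;
    }

    public static void main(String[] args) {
        FailoverDecision decision = new FailoverDecision(3_000, 5_000);
        // Primary goes down at t=1s and comes back at t=10s; probe once per second.
        for (long t = 0; t <= 20_000; t += 1_000) {
            boolean primaryUp = t < 1_000 || t >= 10_000;
            System.out.println(t + " ms -> " + (decision.onProbe(t, primaryUp) ? "primary" : "standby"));
        }
    }
}
```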
What am I going to demo?
• Controlled Failover:
• Step 1: Start the REST Endpoint service.
• Step 2: Start an application that uses the Controlled Failover client to
produce data to a topic.
• Step 3: Start consumers on both the active & standby clusters.
• Step 4: Trigger the controller to switch to a different Pulsar cluster after
approximately 20 messages
• Step 5: Observe the flow of data shift from the active to the standby cluster
• Step 6: Trigger the controller to switch to the original Pulsar cluster after
approximately 30 messages
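The REST service behind this demo can be approximated by a counter that decides which cluster to hand out. The sketch below is a stand-in for that logic only: it omits the HTTP layer, and the thresholds (20 and 30 calls), the URLs and the class name are assumptions. The four JSON field names follow the Pulsar documentation as far as I recall it and should be checked against the client version in use; the TLS and authentication fields are left empty for illustration.

```java
class FailoverControllerSketch {
    static final String PRIMARY = "pulsar://primary.example.com:6650"; // placeholder
    static final String STANDBY = "pulsar://standby.example.com:6650"; // placeholder

    private int calls = 0;

    /** Returns the JSON body for the next poll: standby for calls 21 to 30, primary otherwise. */
    synchronized String nextResponse() {
        calls++;
        String target = (calls > 20 && calls <= 30) ? STANDBY : PRIMARY;
        return toJson(target);
    }

    // Builds the four-field response described in the speaker notes.
    static String toJson(String serviceUrl) {
        return "{"
                + "\"serviceUrl\":\"" + serviceUrl + "\","
                + "\"tlsTrustCertsFilePath\":\"\","
                + "\"authPluginClassName\":\"\","
                + "\"authParamsString\":\"\""
                + "}";
    }

    public static void main(String[] args) {
        FailoverControllerSketch controller = new FailoverControllerSketch();
        for (int i = 1; i <= 32; i++) {
            System.out.println("call " + i + ": " + controller.nextResponse());
        }
    }
}
```

Because the response is computed on every call, a real service could instead look the target cluster up in a database or an operator-controlled flag.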
Summary
• Release 2.10 of Pulsar includes two new failover
clients that provide continuous availability for your
Pulsar applications
• I demonstrated how to configure and use the Automatic
failover client when producing messages.
• The Controlled Failover client is harder to implement
because it requires an additional service to be
written, but it does provide more flexibility.
Thanks for Attending
Scan the QR Code to
learn more about Apache
Pulsar.
Explore the Code
https://github.com/david-streamlio/cluster-failover-demo
Let’s Keep
in Touch!
Thanks


Failover-Apachecon-Asia-2022.pptx


Editor's notes

  1. Welcome to my talk, entitled “When Failure is Not an Option”. Today I will be discussing the additions to the Apache Pulsar project that can help provide continuous availability for your applications that interact with Pulsar.
  2. My name is David Kjerrumgaard, and I am proud to be a committer on the Apache Pulsar project. I am currently a Developer Advocate at StreamNative, the company behind Apache Pulsar. Previously I was a principal software engineer at Splunk, where I worked on their Pulsar-as-a-Service team.
  3. I am also the author of Pulsar in Action, published by Manning, and co-author of Practical Hive, published by Apress.
  4. Developing a continuously available application requires more than just using fault-tolerant services such as Apache Pulsar in your software stack. It also requires immediate failure detection and resolution, including built-in failover when there are data center outages. Up until now, Pulsar clients could only interact with a single Pulsar cluster and were unable to detect and respond to a cluster-level failure event. In the event of a complete cluster failure, these clients could not reroute their messages to a secondary/standby cluster automatically. This can lead to application failure, which for many is not an option.
  5. Uptime is typically measured by calculating the fraction of a year that a system is up, then expressing that fraction as a percentage. The concept of “five-nines” (availability of 99.999%) has been an industry gold standard for many years. Systems that can only survive failures at the hardware layer (including individual server outages) are considered “fault-tolerant”. Systems that can survive an AZ outage are considered “highly available”. Systems that can survive one or more regional outages are considered “continuously available”.
  6. When people use the term availability, they tend to think only of PLATFORM availability, i.e., is the system up or down? This is because availability is generally considered a DevOps concern, but it is an APPLICATION and DATA concern as well. One approach to providing high availability is to distribute the platform resources across different zones and/or geographical regions. While necessary, this isn’t enough. The data used by the system must be kept in sync across those zones and regions as well. A system with a missing or incomplete dataset is often worse than not having the system available at all, as it can lead to incorrect information, duplicate processing, etc. From an application perspective, it is incumbent upon your application to immediately detect a failure in the system and automatically switch over to the “active” platform in a seamless manner.
  7. Let’s start with a quick review of all of Pulsar’s availability features already inside the platform.
  8. Let’s look at Pulsar’s platform availability features. Pulsar’s multi-tiered design makes it highly available by default. Separating the serving layer from the data storage layer allows Pulsar’s brokers to be 100% stateless. Consequently, any broker can serve data from any topic by reading the data from the separate storage layer instead of from local disk (as other messaging systems such as Kafka do). Additionally, stateless brokers that fail can be easily replaced with new broker instances without any additional setup steps. Pulsar’s storage layer maintains multiple replicas of the data on different bookie nodes to ensure that the loss of one or more bookies does not result in a loss of the data.
  9. From a data availability perspective, Pulsar’s storage layer is self-healing. It will automatically detect any under-replicated data and create new copies of the data for you. This allows us to easily replace any failed bookies with new bookie instances and let the self-healing mechanism repopulate the new bookies with data. This ensures data availability within an individual cluster. Furthermore, Pulsar supports rack placement to ensure that at least one replica of the data in the storage layer is stored in a different AZ within the same geographical region. Pulsar’s geo-replication mechanism allows you to asynchronously replicate data across multiple clusters to maintain consistent copies of your datasets between regions. These capabilities combine to provide continuous data availability.
  10. At the application level, Pulsar provides connection-aware clients that insulate the application from intermittent network outages. The Pulsar client automatically detects these network issues and re-establishes the connection rather than throwing an exception that (if uncaught) could cause the application to crash. This behavior is completely hidden from the application code and provides resiliency to broker failures.
  11. Prior to the 2.10 release, Pulsar was able to provide continuous availability at only the platform and data levels. Pulsar’s geo-replication mechanism allows you to replicate the data across multiple geographic regions, ensuring that your data will be available even in the event of a region failure. Similarly, Pulsar’s architecture supports multiple clusters spread across different geographical regions, ensuring that a complete Pulsar cluster will be readily available in the event of a region failure. The one missing piece of the continuous availability story was the application layer.
  12. READ SLIDE. Up until now, Pulsar clients could only interact with a single Pulsar cluster and were unable to detect and respond to a cluster-level failure event. In the event of a complete cluster failure, these clients could not reroute their messages to a secondary/standby cluster automatically. This would eventually lead to prolonged outages at the application level.
  13. Prior to the 2.10 release of Pulsar, the best you could do was to provide a single static endpoint for Pulsar as shown here. Oftentimes, the connection URL to Pulsar is provided by a configuration file. This value is read once and remains static inside the application. Then when a regional failure occurred, you had to manually change the DNS entry for that URL to point to the stand-by cluster.
  14. READ SLIDE
  15. Starting with release 2.10 of Pulsar, we have added a new feature called failover clients that solves these problems.
  16. There are two distinct types of failover clients available in the 2.10 release. The first is one that will automatically reroute your client connections to a different Pulsar cluster as soon as it detects a cluster outage. The second one allows you to trigger the failover through an exposed HTTP endpoint. This client will periodically invoke the exposed endpoint to get the connection details of the cluster it is supposed to connect to. This approach allows your admins to have more control over the failover process.
  17. So, let’s discuss the automated failover client first. As the name implies, this failover client will automatically switch clients over to a designated standby cluster if and when it detects an outage on the primary cluster. This is accomplished by a probe task that periodically interrogates the primary cluster to determine if it is running or not. Once it has detected that the primary cluster is unavailable, it starts a timer to measure the length of the outage. This is to ensure that we don’t inadvertently switch over due to a transient network issue. If the outage continues for longer than the user-configured duration, then the switch-over occurs.
  18. Let’s look at how this automatic failover client is configured and used. The first thing to note is the creation of a separate set of authentication credentials for the secondary cluster. Next, note that there are both a primary cluster URL property and a secondary property. The primary property takes the broker URL for your preferred cluster connection, while the secondary takes a list of one or more alternative clusters to connect to. This allows you to have multiple stand-by clusters, which matches Pulsar’s geo-replication capability to support multiple clusters. The failoverDelay property specifies how long the primary cluster outage must last before the client switches over to the standby cluster. The switchback property specifies how long the client waits to switch back to the primary cluster once it detects that the primary cluster is back up and running. This is possible because the probe against the primary cluster continues to run even after the client has failed over to the standby cluster. The checkInterval controls the frequency at which the probe is executed. Finally, the failover configuration is then used to build a Pulsar client.
  19. Now let’s discuss the controlled failover client. As the name implies, this client allows you to control when and where your Pulsar client will fail over to. This is accomplished via a REST service that YOU must implement.
  20. Let’s look at how this controlled failover client is configured and used. The first thing to note is the creation of a separate set of authentication credentials. These are for accessing the REST endpoint (NOT the standby cluster). The default service URL property takes the broker URL for your preferred cluster connection. The checkInterval controls the frequency at which the REST endpoint is called. The urlProvider is where you specify the address of the REST service you implemented, and the urlHeader is where you provide the contents of the HTTP header. The header can be used to provide authentication credentials, etc. Finally, the failover configuration is then used to build a Pulsar client.
  21. Let’s look at a simple example of a REST endpoint service. First, notice that the expected return type is a JSON object that contains the four fields shown here. This data structure allows you to provide all the necessary authentication credentials required to connect to a Pulsar cluster. Also note that this information is generated dynamically in the code, so it could in theory be read from a database, etc. This provides much more flexibility than the automated failover client, which requires you to provide a hard-coded list of Pulsar broker URLs. In this example, I am forcing a switch over to a standby cluster based on the number of times the REST endpoint is called. This is to demonstrate a failover to a standby cluster and back to the active cluster, as we shall see.
  22. Next I will demonstrate both of these failover clients in action. For those of you that are interested, the source code for this demo is available in the GitHub repo shown here.
  23. READ SLIDE