Intuit operates a highly distributed, multi-cluster, multi-region messaging platform to serve the queuing use cases of its applications and services. In this session we will talk about the journey of our messaging platform in the world of Apache Pulsar and share the experiences and learnings gained so far. As we adopted Pulsar for our next-generation platform and adapted it to Intuit-specific requirements, we faced and solved some intrinsic challenges that we would be happy to share and get feedback on. This journey has just begun, and we would like to learn and absorb recommended best practices and guidelines.
Building the Next-Generation Messaging Platform on Pulsar at Intuit - Pulsar Summit NA 2021 Keynote
1. Presenters: Madhavan Narayanan, Sajith Sebastian, Amit Kaushal, Gokul Sarangapani
Pulsar Journey@Intuit
Building our next-gen messaging platform
Topic : persistent://pulsar/intuit/our-migration-story
2. Intuit Confidential and Proprietary 2
Agenda
➢ Messaging at Intuit: background of the Intuit messaging platform and the current technology used
➢ Need for Migration: limitations of the current platform and our migration goals
➢ Messaging with Pulsar: feasibility study and the target architecture for the next-gen platform
➢ Challenges and Solutions: problems faced and the solutions
➢ Journey Ahead: the future roadmap items
4. Intuit Messaging Platform - Current State
[Diagram: Intuit products and their use cases (Tax Filing Workflow, Order Management, Payments Processing, Billing, …) are served by services (Dispatchers, Schedulers, Processors, Observers, …) that communicate through the Intuit Messaging Platform.]
The Platform
➢ Point-to-point queues
➢ Multi-subscription topics
➢ Persistent storage
➢ Multi-region support
➢ Active-active use cases
➢ Highly resilient
➢ ActiveMQ Network-of-Brokers
5. Messaging with ActiveMQ Network-of-Brokers
➢ ActiveMQ brokers distributed across 2 regions
➢ NLB in each region to route connections to brokers
➢ Route53 for latency-based routing to the closest NLB
➢ All brokers know each other and form a network. Not easily scalable
➢ Brokers store messages in local files
➢ Producers and consumers use JMS APIs and connect to the Route53 endpoint
[Diagram: Producers and Consumers (JMS API) connect through Route53 to an NLB in each region (WEST, EAST), which routes to Broker1 through Broker6. Each broker has a connection to every other broker.]
6. Active-Active support with ActiveMQ
➢ Producers and consumers connect to broker(s) in the local region
➢ Producers always see low latency
➢ Messages for a given topic can be stored in multiple brokers
➢ Messages are internally forwarded between brokers and find their way to consumers. There is no message replication
➢ Inefficient and wasteful of bandwidth due to the high volume of inter-broker traffic
➢ Highly resilient to individual broker failures
[Diagram: ActiveMQ Network-of-Brokers with Broker1 through Broker6 across WEST and EAST. A West Producer, an East Producer and an East Consumer connect to brokers in their local region.]
7. Handling Region failure with ActiveMQ
➢ Producers transparently reconnect to a broker in the remote region
➢ Producers now see a high publish latency
➢ However, producers can continue their operation without any other adverse impact
➢ Messages stored in the affected brokers are not available to consumers until the brokers come back online
➢ The network automatically recovers once the brokers are available
[Diagram: West brokers down. Broker1 through Broker6 across WEST and EAST, with a West Producer, an East Producer and an East Consumer.]
9. Why we were looking to migrate
Technology
➢ ActiveMQ is an outdated technology with architectural limitations
➢ Need to keep abreast of the latest modern, cloud-native technology
Scalability
➢ Scaling an ActiveMQ Network-of-Brokers (NoB) is non-trivial and complex
➢ Overheads increase significantly as more brokers are added to the network
Throughput
➢ Maximum throughput of an ActiveMQ NoB is limited, with little room to grow
➢ Need to be ready for future needs at Intuit, where significant growth in traffic is projected
Cost
➢ High price-to-performance ratio of NoB, with significant bandwidth lost to inter-broker traffic
➢ Need a solution that maximizes throughput with the available resources
Operations
➢ Lack of central management in NoB, leading to a high cost of maintenance operations
➢ Lack of cluster-level statistics and monitoring
10. Migration Focus
While we were evaluating multiple options against ActiveMQ capabilities, our focus was to
Retain
● Multi-Region support
● Active-Active support
● Resiliency to system failures
Improve
● Ease of scalability
● Ease of operations
● Throughput and performance
Avoid
● A single layer handling both storage and customer traffic
● Inefficient inter-broker traffic within the platform
● Duplicate message storage for each subscriber
12. Feasibility Study
What we did
➢ Set up a Pulsar cluster that was equivalent in cost to an ActiveMQ NoB
➢ Extended the Pulsar broker to encrypt/decrypt messages for parity with the existing system
➢ Verified all basic messaging functions for the queueing use case (produce/consume operations for persistent topics, single and multiple subscriptions)
➢ Verified scalability of the broker and proxy tiers
➢ Verified dynamic addition of bookies, rack placement strategies and namespace isolation
➢ Ran extensive performance tests
Results
➢ For nearly the same cost, a Pulsar cluster was able to support 3.5 times the throughput of an equivalent ActiveMQ NoB
➢ Highly consistent and contained publish latencies even under high-throughput traffic. Unlike with ActiveMQ brokers, producers were relatively unaffected by the presence of consumer connections
13. Next Gen Messaging Platform with Pulsar
➢ Global ZooKeeper spanning multiple regions
➢ Proxies, brokers and bookies connect to a local ZooKeeper
➢ Scalable and extensible proxy tier for managing traffic
➢ Scalable broker tier for serving messages
➢ A separate scalable storage tier with rack support
➢ JMS wrapper over the Pulsar client library
[Diagram: JMS Producers and JMS Consumers (JMS API over Pulsar Client) and Pulsar SDK Producers and Pulsar SDK Consumers (Pulsar Client) connect to the platform.]
15. Challenge #1 - ZooKeeper Quorum Issue
Issue
➢ Intuit operates primarily in 2 AWS regions in the US, namely us-west-2 and us-east-2
➢ The messaging platform also spans only these 2 regions
➢ When a region failure occurs within the platform, the entire Pulsar cluster collapses due to ZooKeeper failure
➢ The ZooKeeper instances lose majority quorum when one region is down and take the cluster down
➢ Our clients suddenly start failing since the cluster is unavailable. This is a regression
Solution
➢ We added one more region, us-east-1, to the cluster
➢ Only one ZooKeeper instance runs in us-east-1. No other components are deployed there
➢ us-east-1 is rarely used by Intuit services and doesn't have the same support/SLA from AWS as the other 2 regions
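The quorum arithmetic behind this challenge can be sketched in a few lines of Python. This is a minimal illustration, not the actual deployment: the instance counts per region are hypothetical, since the slides do not state the real ensemble size. It only shows why an ensemble split across two regions cannot survive every single-region failure, while adding one tie-breaker instance in a third region can.

```python
def has_quorum(placement, failed_region):
    """Return True if a strict majority of instances survives the region failure."""
    total = sum(placement.values())
    surviving = total - placement.get(failed_region, 0)
    return surviving > total // 2


def survives_any_single_region_failure(placement):
    """Check the quorum against the loss of each region in turn."""
    return all(has_quorum(placement, region) for region in placement)


# Hypothetical instance counts per region (assumption, not from the slides).
two_regions = {"us-west-2": 3, "us-east-2": 2}
three_regions = {"us-west-2": 2, "us-east-2": 2, "us-east-1": 1}

for name, placement in (("2 regions", two_regions), ("3 regions", three_regions)):
    print(name, survives_any_single_region_failure(placement))
```

With two regions, whichever region holds the majority (or either region, for an even split) is a single point of failure for the whole ensemble.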
16. Challenge #2 - ZooKeeper issue again
Issue
➢ With ZooKeeper in 3 regions, we started seeing frequent issues even during normal operation, i.e. when all 3 regions were active
➢ ZooKeeper instances would frequently seize up and stall, making the cluster unavailable
➢ The ZooKeeper instance in us-east-1 was becoming the leader most of the time, but was unable to moderate and keep the quorum working
➢ This was due to high network latency in us-east-1. Also, the overall cluster performance dropped significantly when an east ZooKeeper instance became the leader (most of the traffic is in west)
➢ We were unable to find any way to precisely control which instance becomes the leader in a ZK cluster
Solution
➢ After a lot of troubleshooting and experiments, we found that the ZooKeeper instance with a larger server id value had a higher probability of becoming the leader
➢ Now we just had to control the sequence of ZooKeeper server id values in the configuration, keeping the us-east-1 instance at the smallest value
➢ We never saw the issue again after this fix
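The observation about server ids is consistent with how ZooKeeper's fast leader election compares votes, as we understand it: the candidate with the newer epoch wins, then the one with the larger transaction id (zxid), and the larger server id breaks a tie. The sketch below models only that comparison. The `Vote` type, the region-to-id assignment and the numbers are hypothetical, and a real election also depends on which instances are reachable at the time.

```python
from typing import NamedTuple


class Vote(NamedTuple):
    epoch: int      # election epoch the candidate has seen
    zxid: int       # last transaction id the candidate has logged
    server_id: int  # value from the instance's myid file
    region: str     # illustration only, not part of a real vote


def preferred(votes):
    """Pick the vote that wins the (epoch, zxid, server_id) ordering."""
    return max(votes, key=lambda v: (v.epoch, v.zxid, v.server_id))


# Hypothetical id assignment: us-east-1 gets the smallest id, so it loses
# every tie when all instances are equally up to date.
votes = [
    Vote(epoch=5, zxid=100, server_id=1, region="us-east-1"),
    Vote(epoch=5, zxid=100, server_id=2, region="us-east-2"),
    Vote(epoch=5, zxid=100, server_id=3, region="us-west-2"),
]

print(preferred(votes).region)
```

Because the server id only breaks ties, this ordering makes a us-east-1 leader unlikely rather than impossible, which matches the "more probability" wording above.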
18. Challenge #3 - High publish latencies
Issue
➢ Pulsar's design assigns a single owner broker to a topic. All traffic for the topic is handled by this broker
➢ All message producers from both regions end up connected to this single broker (via proxies in their local region)
➢ This results in a latency disparity between producers in the same region as the broker and those in the remote region
➢ Cross-region latencies are as high as 50 ms on average. This was a serious regression compared to the ActiveMQ Network-of-Brokers
Solution
➢ Since our customers use region-agnostic topic names and expect active-active support from us, we had to implement region-level isolation of topics underneath
➢ Implemented a service discovery extension, configured in the proxy, to handle custom topic name lookups. Also used namespace isolation policies to pin topics to specific brokers
➢ Implemented a wrapper over the Pulsar client library that uses the extended lookup to transparently map the region-agnostic topic name to a region-specific sub-topic
➢ Consumers read messages from all the sub-topics
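The topic mapping can be illustrated with a small Python sketch. The naming scheme here is an assumption made for illustration only: the slides do not say how sub-topics are named, so the region suffix on the namespace, the `REGIONS` tuple and the function names are all hypothetical. The idea is that a producer publishes to the sub-topic of its local region, while a consumer subscribes to every regional sub-topic.

```python
REGIONS = ("west", "east")  # hypothetical region labels


def sub_topic(topic, region):
    """Map a region-agnostic topic name to a region-specific sub-topic.

    Assumed scheme: append the region to the namespace part of
    <scheme>://<tenant>/<namespace>/<name>.
    """
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    scheme, _, path = topic.partition("://")
    tenant, namespace, name = path.split("/", 2)
    return f"{scheme}://{tenant}/{namespace}-{region}/{name}"


def producer_topic(topic, local_region):
    """Producers publish only to the sub-topic in their own region."""
    return sub_topic(topic, local_region)


def consumer_topics(topic):
    """Consumers read from the sub-topics of all regions."""
    return [sub_topic(topic, region) for region in REGIONS]


topic = "persistent://pulsar/intuit/our-migration-story"
print(producer_topic(topic, "west"))
print(consumer_topics(topic))
```

Keeping the mapping inside the client wrapper means application code continues to use the region-agnostic name.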
19. Challenge #3 - High publish latencies - Solution
[Diagram: In each region (WEST, EAST), JMS Producers and JMS Consumers (JMS API over Pulsar Client) and Pulsar SDK Producers and Consumers (Pulsar Client) connect to a Pulsar Proxy with Service Discovery, which routes to that region's namespace brokers (West Namespace Brokers, East Namespace Brokers). Each region has a local bookie group (Bookie Group - West Local, Bookie Group - East Local) with Bookie1 and Bookie2 placed across Rack1, Rack2 and Rack3, labeled west-2a, west-2b, west-2c and east-2a, east-2b, east-2c. ZooKeepers sit between the two regions.]
20. Challenge #4 - Ledger recovery failure
Issue
➢ Messages for a topic are stored in a sequence of ledgers in BookKeeper. The owner broker for the topic manages the state of the ledgers
➢ Ledgers are replicated to multiple bookies based on the write quorum value
➢ When a bookie crashes, brokers close the open ledgers on it and create new ledgers using other available bookies. When a broker crashes, other brokers assume ownership of the abandoned topics and are able to re-open the ledgers
➢ However, multiple system failures that combine broker and bookie crashes can lead to a situation where the ledgers cannot be recovered. Topic producers are stalled and new messages cannot be published, which results in business impact
Solution
➢ For recovery, a quick restart of the bookies is needed. Due to the sync operation, a delayed restart can overshoot the SLA and result in customer impact
➢ We are working on a solution that uses our custom service discovery to detect this condition and redirect producers to the sub-topic for the remote region
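Since this redirect is described as work in progress, the following Python sketch is purely illustrative and does not reflect an actual implementation. It assumes a hypothetical `healthy` map, standing in for whatever signal the custom service discovery would use to detect stalled ledgers, and shows one possible selection rule: prefer the local sub-topic, and fall back to a remote one only when the local region is flagged unhealthy.

```python
def pick_publish_topic(sub_topics, local_region, healthy):
    """Choose the sub-topic a producer should publish to.

    sub_topics: dict mapping region -> sub-topic name
    healthy:    dict mapping region -> bool (hypothetical health signal)
    """
    if healthy.get(local_region, False):
        return sub_topics[local_region]
    # Local region is stalled: redirect to the first healthy remote region.
    for region, topic in sub_topics.items():
        if region != local_region and healthy.get(region, False):
            return topic
    raise RuntimeError("no healthy sub-topic available")


# Hypothetical sub-topic names for illustration.
sub_topics = {
    "west": "persistent://pulsar/intuit-west/our-migration-story",
    "east": "persistent://pulsar/intuit-east/our-migration-story",
}

print(pick_publish_topic(sub_topics, "west", {"west": True, "east": True}))
print(pick_publish_topic(sub_topics, "west", {"west": False, "east": True}))
```

The redirected producer pays the cross-region publish latency, but keeps publishing while the local ledgers are recovered.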
22. Journey Ahead
➢ We are in production now, with limited availability to a restricted set of customers
➢ As we move towards making the platform generally available to all customers, the following are some items of focus
○ Enhancing and fortifying the resiliency of the system
○ Enabling transaction support
○ Auto-scaling of brokers
○ Enabling Pulsar schema support using a custom schema registry
➢ We also have long-term plans to
○ Move the platform to Intuit's Kubernetes platform
○ Support multi-cloud messaging
23. Let us know your thoughts
Please write your feedback and comments to
● madhavan_narayanan@intuit.com
● gokul_s@intuit.com
● sajith_sebastian@intuit.com
● amit_kaushal@intuit.com