This document discusses best practices for optimizing RabbitMQ performance and stability, based on experience running thousands of RabbitMQ clusters. Key recommendations include using long-lived connections, with separate connections for publishers and consumers; keeping the connection and channel count low; limiting queue sizes; enabling lazy queues for predictable performance; splitting queues across cores; adjusting prefetch values; using a stable RabbitMQ version; disabling unused plugins; and deleting unused queues. A diagnostic tool is also described that checks for issues like connection leaks, large queues, and improper exchange usage.
5. ● Unstable RabbitMQ version
● Unoptimized configuration for a specific use case
➢ High availability
➢ High Performance
● Users (you?) are using RabbitMQ in a bad way
● Client libraries are using RabbitMQ in a bad way
● Things are not done in an optimal way
● Customer use cases
● Configuration mistakes
● Common mistakes
Client side problems
Server side problems
6. What we've learned from running thousands of production RabbitMQ clusters
8. Largest provider of managed RabbitMQ servers
● 23,000 running instances
● 7 clouds
● 75 regions
● Headquarters: Stockholm, Sweden
9. Recommendation number 1.
CONNECTIONS AND CHANNELS
Don’t use too many connections or channels
● Keep the connection and channel count low
● Each connection uses about 100 KB of RAM
● Thousands of connections can be a heavy burden on a RabbitMQ server
● Channel and connection leaks are among the most common errors that we see
10. Recommendation number 2.
CONNECTIONS AND CHANNELS
Don’t open and close connections or channels repeatedly
● Use long-lived connections
● Don’t open a channel every time you publish
● Opening an AMQP connection: 7 TCP packets
● Opening an AMQP channel: 2 TCP packets
● AMQP publish: 1 TCP packet
● Closing an AMQP channel: 2 TCP packets
● Closing an AMQP connection: 2 TCP packets
● Total: 14-19 packets (+ acks)
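The per-operation counts on this slide can be turned into a rough comparison between opening a connection for every publish and reusing one connection. This is a back-of-the-envelope sketch only: it uses the lower bound of 14 per cycle, ignores acks, and the function names are made up for illustration.

```python
# Counts per operation, taken from the slide (lower bound, acks ignored).
PACKETS = {
    "open_connection": 7,
    "open_channel": 2,
    "publish": 1,
    "close_channel": 2,
    "close_connection": 2,
}


def packets_with_churn(messages: int) -> int:
    """Every publish opens and closes its own connection and channel."""
    return messages * sum(PACKETS.values())


def packets_with_reuse(messages: int) -> int:
    """One connection and one channel are opened once and reused."""
    setup_and_teardown = sum(PACKETS.values()) - PACKETS["publish"]
    return setup_and_teardown + messages * PACKETS["publish"]


if __name__ == "__main__":
    for count in (1, 1000):
        print(count, packets_with_churn(count), packets_with_reuse(count))
```

For a single message the two approaches cost the same; the difference only shows up once the same connection carries many publishes.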
11. AMQProxy
● Some clients can’t keep long-lived connections (looking at you, PHP)
● Avoid connection churn by using a proxy that pools connections and channels for reuse
● Our benchmarks show that the proxy increases publishing speed by an order of magnitude or more
● https://github.com/cloudamqp/amqproxy
12. Recommendation number 3.
CONNECTIONS AND CHANNELS
Separate connections for publishers and consumers
● Back pressure: RabbitMQ can apply back pressure on the TCP connection when the publisher is sending too many messages
● Flow control: you might not be able to consume if the connection is in flow control
14. Recommendation number 4.
QUEUES
Don't have too large queues
● Keep fewer than 10,000 messages in one queue
● Large queues put a heavy load on RAM
○ To free up RAM, RabbitMQ starts paging messages out to disk
○ Paging blocks the queue from processing messages
● Large queues make it time-consuming to restart a cluster
● Limit queue size with TTL or max-length
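As a minimal sketch of the last bullet, the helper below builds the optional queue arguments RabbitMQ uses for a length cap (`x-max-length`) and a per-message TTL in milliseconds (`x-message-ttl`). The helper name and its parameters are hypothetical; with a client library such as pika, the resulting dict would typically be passed as the `arguments` parameter when declaring the queue.

```python
from typing import Optional


def bounded_queue_arguments(max_length: Optional[int] = None,
                            message_ttl_ms: Optional[int] = None) -> dict:
    """Build queue arguments that keep a queue from growing without bound.

    Hypothetical helper: only the argument keys come from RabbitMQ.
    """
    arguments = {}
    if max_length is not None:
        if max_length < 0:
            raise ValueError("max_length must be non-negative")
        # Maximum number of ready messages kept in the queue.
        arguments["x-max-length"] = max_length
    if message_ttl_ms is not None:
        if message_ttl_ms < 0:
            raise ValueError("message_ttl_ms must be non-negative")
        # Messages older than this many milliseconds are discarded.
        arguments["x-message-ttl"] = message_ttl_ms
    return arguments


if __name__ == "__main__":
    # Cap at the 10,000 messages mentioned on the slide, with a 60 s TTL.
    print(bounded_queue_arguments(max_length=10_000, message_ttl_ms=60_000))
```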
15. Recommendation number 5.
QUEUES
Enable lazy queues to get predictable performance
● Lazy queues were added in RabbitMQ 3.6
● Messages are written to disk immediately, which spreads the work out over time instead of risking a performance hit somewhere down the road
● More predictable and smoother performance curve
○ Messages are only loaded into memory when they are needed
Enable lazy queues if...
● the publisher is sending many messages at once
● the consumers are not keeping up with the speed of the publishers all the time
Ignore lazy queues if...
● you require high performance
● queues are always short
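The enable/ignore rules on this slide can be written down as a small decision helper. This is a sketch under assumptions: the function names and flags are invented for illustration, and letting the "ignore" conditions win when both sets apply is an interpretation, not something the slide states. `x-queue-mode` with the value `lazy` is the queue argument RabbitMQ 3.6 and 3.7 use to declare a lazy queue.

```python
def wants_lazy_queue(bursty_publishers: bool,
                     consumers_fall_behind: bool,
                     needs_high_performance: bool = False,
                     always_short: bool = False) -> bool:
    """Apply the slide's rules of thumb for choosing lazy mode."""
    # Assumption: the "ignore lazy queues if..." conditions take precedence.
    if needs_high_performance or always_short:
        return False
    return bursty_publishers or consumers_fall_behind


def queue_mode_arguments(lazy: bool) -> dict:
    """Return the queue arguments for the chosen mode."""
    return {"x-queue-mode": "lazy"} if lazy else {}


if __name__ == "__main__":
    lazy = wants_lazy_queue(bursty_publishers=True, consumers_fall_behind=False)
    print(queue_mode_arguments(lazy))
```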
16. Recommendation number 6.
QUEUES
Don’t set RabbitMQ Management statistics rate mode to detailed
● The RabbitMQ management plugin collects and calculates metrics for every queue, connection, and channel in the cluster
● This slows down the server if you have thousands upon thousands of active queues or consumers
17. Recommendation number 7.1
QUEUES
Split queues over different cores, and route messages to multiple queues
● A queue is single threaded
○ 50k messages/s
● Queue performance is limited to one CPU core
● All messages routed to a specific queue will end up on the node where that queue resides
Plugins
● The consistent hash exchange plugin
● RabbitMQ sharding
18. Recommendation number 7.2
QUEUES
The consistent hash exchange plugin
● Load-balance messages between queues
● Messages are consistently and equally distributed across many queues
● Consume from all queues
● https://github.com/rabbitmq/rabbitmq-consistent-hash-exchange
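To illustrate the idea of consistent hashing across queues, here is a generic hash-ring sketch. It is not the plugin's implementation: the class name, the number of ring points per queue, and the use of SHA-256 are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable 64-bit integer."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring that maps routing keys to queue names."""

    def __init__(self, queues, points_per_queue: int = 100):
        if not queues:
            raise ValueError("at least one queue is required")
        # Each queue owns several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{queue}#{i}"), queue)
            for queue in queues
            for i in range(points_per_queue)
        )
        self._points = [point for point, _ in self._ring]

    def queue_for(self, routing_key: str) -> str:
        """Return the queue owning the first ring point after the key's hash."""
        index = bisect.bisect(self._points, _hash(routing_key)) % len(self._ring)
        return self._ring[index][1]


if __name__ == "__main__":
    ring = HashRing(["q1", "q2", "q3", "q4"])
    for key in ("order-1", "order-2", "order-3"):
        print(key, "->", ring.queue_for(key))
```

The same routing key always lands on the same queue, which is why every queue needs a consumer.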
19. Recommendation number 7.3
QUEUES
RabbitMQ sharding
● Automatic partitioning of queues
● Queues are created on every cluster node and messages are sharded across them
● Shows one queue to the consumer, but it could be many queues running behind it in
the background
● https://github.com/rabbitmq/rabbitmq-sharding
20. Recommendation number 8.
QUEUES
Limit your use of priority queues
● Each priority level uses an internal queue on the Erlang VM, which takes up
resources.
● In most use cases it's sufficient to have no more than 5 priority levels.
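A small sketch of how the "no more than 5 levels" advice could be enforced when building queue arguments. `x-max-priority` is the RabbitMQ queue argument that enables priorities (accepted values are 1 to 255); the helper name and the `soft_limit` parameter are hypothetical.

```python
def priority_queue_arguments(max_priority: int, soft_limit: int = 5) -> dict:
    """Return queue arguments enabling priorities up to `max_priority`.

    Hypothetical helper: rejects values above `soft_limit` to follow the
    slide's advice, even though the broker itself would accept them.
    """
    if not 1 <= max_priority <= 255:
        raise ValueError("x-max-priority must be between 1 and 255")
    if max_priority > soft_limit:
        raise ValueError(
            f"{max_priority} priority levels exceeds the soft limit of {soft_limit}"
        )
    return {"x-max-priority": max_priority}


if __name__ == "__main__":
    print(priority_queue_arguments(5))
```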
21. Recommendation number 9.
QUEUES
Use persistent messages and durable queues
● Messages, exchanges, and queues that are not durable and persistent are lost during a broker restart
● For high performance, use transient messages and temporary or non-durable queues
22. Recommendation number 10.1
PREFETCH
Adjust prefetch value
● Limits how many messages the client can receive before acknowledging a message
● RabbitMQ default prefetch value - unlimited buffer
● RabbitMQ 3.7
○ Option to adjust the default prefetch
○ CloudAMQP servers have a default prefetch of 1000
25. Recommendation number 10.4
PREFETCH
Prefetch
● One single or few consumers with short processing time
○ prefetch many messages at once
● About the same processing time and a stable network
○ Estimate the prefetch value as the total round-trip time divided by the processing time on the client for each message
● Many consumers, and short processing time
○ A lower prefetch value than for one single or few consumers
● Many consumers, and/or long processing time
○ Set prefetch count to 1 so that messages are evenly distributed among all
your workers
● The prefetch value has no effect if your client auto-acks messages
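The rule of thumb for a stable network (round-trip time divided by per-message processing time) is simple arithmetic. A minimal sketch follows; the function name is made up, and rounding up with a floor of 1 is an assumption rather than something the slide specifies.

```python
import math


def estimate_prefetch(round_trip_ms: float, processing_ms: float) -> int:
    """Estimate a prefetch count as round-trip time / processing time.

    Assumption: round up, and never go below 1, so the consumer always
    has a message to work on while the next one is in flight.
    """
    if round_trip_ms < 0 or processing_ms <= 0:
        raise ValueError("round_trip_ms must be >= 0 and processing_ms > 0")
    return max(1, math.ceil(round_trip_ms / processing_ms))


if __name__ == "__main__":
    # 125 ms round trip with 5 ms of processing per message.
    print(estimate_prefetch(125, 5))
```

When processing takes longer than the round trip, the estimate falls to 1, which matches the advice for long processing times.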
26. Recommendation number 11.
HiPE
HiPE
● HiPE increases server throughput at the cost of increased start-up time
○ increases throughput by 20-80%
○ increases start-up time by about 1 to 3 minutes
● HiPE is recommended if you require high performance
● We don’t consider HiPE as experimental any longer
27. Recommendation number 12.
ACKS AND CONFIRMS
Acknowledgments and Confirms
● Pay attention to where in your consumer logic you’re acknowledging messages
● For the fastest possible throughput, manual acks should be disabled
● Publish confirms are required if the publisher needs messages to be processed at least once
28. Recommendation number 13.
VERSION
Use a stable RabbitMQ version
Great improvements are made to RabbitMQ all the time <3
● 3.7
○ Default prefetch
○ Individual vhost message stores
● 3.6
○ Many memory problems, up to version 3.6.14
○ Lazy queues
● 3.5
○ Still many customers on 3.5.7
Backward compatibility is really good in RabbitMQ
29. Recommendation number 14.
Plugins
Disable plugins you are not using
● Some plugins consume lots of resources
● Make sure to disable plugins that you are not using
30. Recommendation number 15.
Unused queues
Delete unused queues
● Unused queues take up some resources: queue index, management statistics, etc.
● Temporary queues should be auto-deleted
31. Recommendation number 16.
VHOST
Enable HA-vhost policy on custom vhosts
● Without it, messages can be lost on netsplits
● Needed to be able to upgrade without losing messages at CloudAMQP
32. Summary Overall
Server side problems
● Short queues
● Long lived connections
● Limited use of priority queues
● Use multiple queues and consumers
● Split your queues over different cores
● Stable Erlang and RabbitMQ version
● Disable plugins you are not using
● Channels on all connections
33. Summary Overall
Server side problems
● Separate connections for publishers and consumers
● Management statistics rate mode
● Delete unused queues
● Temporary queues should be auto deleted
34. Summary High Performance
Server side problems
● Short queues
○ max-length if possible
● Do not use lazy queues
● Send transient messages
● Disable manual acks and publish confirms
● Avoid multiple nodes (HA)
● Enable RabbitMQ HiPE
35. Summary High Availability
Server side problems
● Enable lazy queues
● RabbitMQ HA - 2 nodes
○ HA-policy on all vhosts
● Persistent messages, durable queues
● Do not enable HiPE
37. DIAGNOSTIC TOOL
Diagnostics Tool
● RabbitMQ and Erlang version
● Queue length
● Unused queues
● Persistent messages in durable queues
● No mirrored auto delete queues
● Limited use of priority queues
● Long lived connections
● Connection and channel leak
● Channels on all connections
● Insecure connections
● Client library
● AMQP Heartbeats
● Channel prefetch
● Management statistics rate mode
● Ensure that you are not using a topic exchange as a fanout
● Ensure that all published messages are routed
● Ensure that you have an HA policy on all vhosts
● Auto delete on temporary queues
● No transient messages in mirrored queues
● Separate connections for publishers and consumers