What we've learned from running thousands of production RabbitMQ clusters - Lovisa Johansson

What we've learned from running thousands
of production RabbitMQ clusters
Lovisa Johansson
lovisa@cloudamqp.com

● Unstable RabbitMQ version
● Unoptimized configuration for a specific use case
➢ High availability
➢ High Performance
● Users (you?) are using RabbitMQ in a bad way
● Client libraries are using RabbitMQ in a bad way
● Things are not done in an optimal way
● Customer use cases
● Configuration mistakes
● Common mistakes
Client side problems
Server side problems

Lovisa Johansson
Marketing Manager
Support Engineer
RabbitMQ Engineer
Umeå, Sweden

Largest provider of managed RabbitMQ servers
● 23,000 running instances
● 7 clouds
● 75 regions
● Headquarters: Stockholm, Sweden

CONNECTIONS AND CHANNELS
Recommendation number 1.
Don’t use too many connections or channels
● Keep connection/channel count low
● Each connection uses about 100 KB of RAM
● Thousands of connections can be a heavy burden on a RabbitMQ server
● Channel and connection leaks are among the most common errors that we see
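
To put the per-connection figure in perspective, here is a minimal sketch that turns a connection count into a rough RAM estimate. It assumes the roughly 100 KB per connection quoted on the slide and ignores channels, buffers, and TLS overhead; the function name is made up for illustration.

```python
# Approximate per-connection memory quoted on the slide (an estimate, not a guarantee).
KB_PER_CONNECTION = 100


def estimated_connection_ram_mb(connection_count: int) -> float:
    """Rough RAM used by open connections, in MB (1 MB = 1000 KB here)."""
    if connection_count < 0:
        raise ValueError("connection_count must be non-negative")
    return connection_count * KB_PER_CONNECTION / 1000


if __name__ == "__main__":
    for count in (10, 1_000, 10_000):
        ram = estimated_connection_ram_mb(count)
        print(f"{count:>6} connections ~ {ram:,.0f} MB")
```

Under these assumptions, ten thousand leaked connections would tie up about 1 GB of RAM before a single message is stored.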

Don’t open and close connections or channels repeatedly
● Use long-lived connections
● Don’t open a channel every time you publish
● Open AMQP connection: 7 TCP packets
● Open AMQP channel: 2 TCP packets
● AMQP publish: 1 TCP packet
● Close AMQP channel: 2 TCP packets
● Close AMQP connection: 2 TCP packets
Total 14-19 packets (+ acks)
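
The cost of connection churn follows directly from the per-operation counts above. The sketch below compares opening and closing everything around each publish with reusing one connection and channel. It uses only the lower-bound counts from the slide and ignores TCP acks; the function names are illustrative.

```python
# TCP packet counts per AMQP operation, as listed on the slide (lower bounds).
PACKETS = {
    "open_connection": 7,
    "open_channel": 2,
    "publish": 1,
    "close_channel": 2,
    "close_connection": 2,
}


def packets_with_churn(messages: int) -> int:
    """Open and close a connection and a channel around every single publish."""
    return messages * sum(PACKETS.values())


def packets_with_reuse(messages: int) -> int:
    """Open one connection and channel, publish everything, then close both."""
    setup_and_teardown = sum(PACKETS.values()) - PACKETS["publish"]
    return setup_and_teardown + messages * PACKETS["publish"]


if __name__ == "__main__":
    for n in (1, 100, 10_000):
        print(n, packets_with_churn(n), packets_with_reuse(n))
```

For a single message the two strategies cost the same; the gap grows linearly with every additional publish.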

AMQProxy
● Some clients can’t keep long-lived connections (looking at you, PHP)
● Avoid connection churn by using a proxy that pools connections and channels for reuse
● https://github.com/cloudamqp/amqproxy
● Our benchmarks show that the proxy increases publishing speed by an order of magnitude or more

Separate connections for publishers and consumers
● Flow control: a client might not be able to consume if its connection is in flow control
● Back pressure: RabbitMQ can apply back pressure on the TCP connection when the publisher is sending too many messages

QUEUES
Don't let queues grow too large
● Keep fewer than 10,000 messages in any one queue
● Long queues put a heavy load on RAM
○ To free up RAM, RabbitMQ starts paging messages out to disk
○ Paging blocks the queue from processing messages
● Long queues make it time-consuming to restart a cluster
● Limit queue size with TTL or max-length
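
One way to apply the TTL and max-length limits is through optional queue arguments set when the queue is declared. The sketch below only builds the argument table; `x-max-length` and `x-message-ttl` are RabbitMQ's argument names, while the helper name and the example values are assumptions chosen for illustration.

```python
from typing import Optional


def bounded_queue_arguments(max_length: Optional[int] = None,
                            message_ttl_ms: Optional[int] = None) -> dict:
    """Build optional queue arguments that cap queue length and message age."""
    arguments = {}
    if max_length is not None:
        if max_length < 0:
            raise ValueError("max_length must be non-negative")
        arguments["x-max-length"] = max_length  # maximum number of ready messages
    if message_ttl_ms is not None:
        if message_ttl_ms < 0:
            raise ValueError("message_ttl_ms must be non-negative")
        arguments["x-message-ttl"] = message_ttl_ms  # per-message TTL in milliseconds
    return arguments


if __name__ == "__main__":
    # Hypothetical limits: 10,000 messages, each kept for at most one minute.
    print(bounded_queue_arguments(max_length=10_000, message_ttl_ms=60_000))
```

The resulting dictionary would be passed as the arguments table of a queue declaration in whichever client library is in use. The same limits can also be set with a policy, which avoids redeclaring queues.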

QUEUES
Enable lazy queues to get predictable performance
● Lazy queues were added in RabbitMQ 3.6
● Messages are written to disk immediately, which spreads the work out over time instead of risking a performance hit somewhere down the road
● More predictable and smooth performance curve
○ Messages are only loaded into memory when they are needed
Enable lazy queues if...
● the publisher is sending many messages at once
● the consumers are not keeping up with the speed of the publishers all the time
Ignore lazy queues if...
● you require high performance
● queues are always short
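
Lazy mode can be switched on for a group of queues with a policy. The sketch below builds the JSON body that such a policy would carry when sent to the management HTTP API. `queue-mode` is the policy key for classic queues in the 3.6/3.7 versions discussed here; the helper name and the `^bulk\.` pattern are hypothetical.

```python
import json


def lazy_queue_policy(pattern: str, priority: int = 0) -> str:
    """Return a JSON policy body that makes matching queues lazy."""
    body = {
        "pattern": pattern,                    # regex matched against queue names
        "definition": {"queue-mode": "lazy"},  # write messages to disk right away
        "priority": priority,
        "apply-to": "queues",
    }
    return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical naming scheme: bulk import queues start with "bulk."
    print(lazy_queue_policy(r"^bulk\."))
```

Scoping the pattern to the queues that actually receive bursts keeps short, latency-sensitive queues in the default mode.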

QUEUES
Don’t set RabbitMQ Management statistics rate mode to detailed
● The RabbitMQ management plugin collects and calculates metrics for every queue, connection, and channel in the cluster
● This slows down the server if you have thousands upon thousands of active queues or consumers

QUEUES
Recommendation number 7.1
Split queues over different cores, and route messages to multiple queues
● A queue is single threaded
○ Up to about 50k messages/s per queue
● Queue performance is limited to one CPU core
● All messages routed to a specific queue will end up on the node where that queue resides
Plugins
● The consistent hash exchange plugin
● RabbitMQ sharding

QUEUES
The consistent hash exchange plugin
● Load-balance messages between queues
● Messages are consistently and equally distributed across many queues
● Consume from all queues
● https://github.com/rabbitmq/rabbitmq-consistent-hash-exchange
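
The idea behind consistent distribution can be illustrated with a small hash ring. This is a generic sketch of consistent hashing, not the plugin's actual implementation: the hash function, the number of points per queue, and the class name are all assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring that maps routing keys to queue names."""

    def __init__(self, queues, points_per_queue: int = 100):
        if not queues:
            raise ValueError("at least one queue is required")
        # Each queue owns several points on the ring to even out the spread.
        self._ring = sorted(
            (_hash(f"{queue}#{i}"), queue)
            for queue in queues
            for i in range(points_per_queue)
        )
        self._points = [point for point, _ in self._ring]

    def queue_for(self, routing_key: str) -> str:
        """Pick the queue owning the first ring point after the key's hash."""
        index = bisect.bisect(self._points, _hash(routing_key)) % len(self._ring)
        return self._ring[index][1]


if __name__ == "__main__":
    ring = HashRing(["q1", "q2", "q3", "q4"])
    for key in ("user-1", "user-2", "user-3"):
        print(key, "->", ring.queue_for(key))
```

The same routing key always lands on the same queue, and removing a queue only moves the keys that queue owned. That is why consumers need to read from every queue behind the exchange.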

QUEUES
RabbitMQ sharding
● Automatic partitioning of queues
● Queues are created on every cluster node and messages are sharded across them
● Shows one queue to the consumer, but it could be many queues running behind it in
the background
● https://github.com/rabbitmq/rabbitmq-sharding

QUEUES
Limit the use of priority queues
● Each priority level uses an internal queue on the Erlang VM, which takes up
resources.
● In most use cases it's sufficient to have no more than 5 priority levels.

QUEUES
Send persistent messages and use durable queues
● Messages, exchanges, and queues that are not durable and persistent are lost during a broker restart
● For high performance, use transient messages and temporary or non-durable queues

PREFETCH
Adjust prefetch value
● Limits how many messages the client can receive before acknowledging a message
● RabbitMQ default prefetch value: unlimited buffer
● RabbitMQ 3.7
○ Option to adjust the default prefetch
○ CloudAMQP servers have a default prefetch of 1000

PREFETCH
Prefetch - Too small prefetch value
● RabbitMQ spends most of its time waiting for permission to send more messages

PREFETCH
Prefetch - Too large prefetch value

PREFETCH
Prefetch
● One single or few consumers with short processing time
○ prefetch many messages at once
● About the same processing time and a stable network
○ estimate the prefetch value by dividing the total round trip time by the
processing time on the client for each message
● Many consumers, and short processing time
○ A lower prefetch value than for one single or few consumers
● Many consumers, and/or long processing time
○ Set prefetch count to 1 so that messages are evenly distributed among all
your workers
● The prefetch value has no effect if your client auto-acks messages
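
The round-trip rule of thumb above can be written as a one-line calculation. This is a sketch under the slide's assumptions (similar processing time per message, stable network); the function name and the sample timings are made up.

```python
import math


def estimate_prefetch(round_trip_ms: float, processing_ms: float) -> int:
    """Estimate prefetch as round trip time divided by per-message processing time."""
    if round_trip_ms < 0 or processing_ms <= 0:
        raise ValueError("round_trip_ms must be >= 0 and processing_ms must be > 0")
    # Round up, and never go below 1 so the consumer always has work available.
    return max(1, math.ceil(round_trip_ms / processing_ms))


if __name__ == "__main__":
    # Hypothetical timings: 125 ms round trip, 5 ms of work per message.
    print(estimate_prefetch(round_trip_ms=125, processing_ms=5))
```

When processing takes far longer than the round trip, the estimate falls to 1, which matches the advice for many consumers or long processing times.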

HiPE
HiPE
● HiPE increases server throughput at the cost of increased start-up time
○ increases throughput by 20-80%
○ increases start-up time by about 1 to 3 minutes
● HiPE is recommended if you require high performance
● We don’t consider HiPE experimental any longer

ACKS AND CONFIRMS
Acknowledgments and Confirms
● Pay attention to where in your consumer logic you’re acknowledging messages
● For the fastest possible throughput, disable manual acks
● Publisher confirms are required if the publisher needs messages to be processed at least once
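
Why the position of the ack matters can be shown with a toy in-memory queue. This is a simulation only, with no broker or client library involved; every class and function name in it is hypothetical.

```python
from collections import deque


class InMemoryQueue:
    """Toy stand-in for a broker queue with manual acknowledgements."""

    def __init__(self, messages):
        self._ready = deque(messages)
        self._unacked = {}
        self._next_tag = 1

    def deliver(self):
        """Hand out the next message with a delivery tag, or None if empty."""
        if not self._ready:
            return None
        tag = self._next_tag
        self._next_tag += 1
        self._unacked[tag] = self._ready.popleft()
        return tag, self._unacked[tag]

    def ack(self, tag):
        """Forget an acknowledged message for good."""
        del self._unacked[tag]

    def requeue_unacked(self):
        """Simulate a closed channel: unacked messages go back to the queue."""
        for tag in sorted(self._unacked, reverse=True):
            self._ready.appendleft(self._unacked.pop(tag))

    def __len__(self):
        return len(self._ready) + len(self._unacked)


def consume_once(queue, handler, ack_first: bool) -> None:
    """Process one message, acking either before or after the handler runs."""
    delivery = queue.deliver()
    if delivery is None:
        return
    tag, body = delivery
    try:
        if ack_first:
            queue.ack(tag)
            handler(body)
        else:
            handler(body)
            queue.ack(tag)
    except RuntimeError:
        queue.requeue_unacked()  # simulated consumer crash


def failing_handler(body):
    raise RuntimeError("consumer crashed while processing")
```

With `ack_first=True` a crash in the handler loses the message, because it was acknowledged before any work was done. With `ack_first=False` the message returns to the queue and is delivered again.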

VERSION
Use a stable RabbitMQ version
Great improvements are made to RabbitMQ, all the time <3
● 3.7
○ Default prefetch
○ Individual vhost message stores
● 3.6
○ Many memory problems, up to version 3.6.14
○ Lazy queues
● 3.5
○ Still many customers on 3.5.7
Backward compatibility is really good in RabbitMQ

Plugins
Disable plugins you are not using
● Some plugins consume lots of resources
● Make sure to disable plugins that you are not using

Unused queues
Delete unused queues
● Unused queues take up resources (queue index, management statistics, etc.)
● Temporary queues should be auto deleted
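
Finding deletion candidates can be as simple as filtering a queue listing. The sketch below assumes each queue is a dictionary with `name`, `consumers`, and `messages` fields, as found in a management API queue listing. Treating "no consumers and no messages" as unused is an assumption made for this example, and the sample data is invented.

```python
def find_unused_queues(queues):
    """Return the names of queues that have no consumers and no messages."""
    return [
        queue["name"]
        for queue in queues
        if queue.get("consumers", 0) == 0 and queue.get("messages", 0) == 0
    ]


if __name__ == "__main__":
    # Hypothetical listing; real data would come from the broker.
    sample = [
        {"name": "orders", "consumers": 3, "messages": 12},
        {"name": "tmp.report-2017", "consumers": 0, "messages": 0},
        {"name": "backlog", "consumers": 0, "messages": 450},
    ]
    print(find_unused_queues(sample))
```

A queue with messages but no consumers is left out on purpose, since it may be a backlog that still matters rather than a leftover.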

VHOST
Enable HA-vhost policy on custom vhosts
● Message loss on netsplits
● Needed at CloudAMQP to be able to upgrade without losing messages
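
An HA policy is set per vhost, so each custom vhost needs its own. The sketch below builds the management HTTP API path and JSON body for such a policy. The `ha-mode` and `ha-sync-mode` keys belong to classic mirrored queues in the 3.x versions discussed here; the policy name, the catch-all pattern, and the choice of mirroring to all nodes are assumptions.

```python
import json
from urllib.parse import quote


def ha_policy_request(vhost: str, policy_name: str = "ha-all"):
    """Return the API path and JSON body for an HA policy on one vhost."""
    # Vhost names must be percent-encoded; the default vhost "/" becomes "%2F".
    path = f"/api/policies/{quote(vhost, safe='')}/{quote(policy_name, safe='')}"
    body = {
        "pattern": ".*",  # assumed: apply to every queue in the vhost
        "definition": {"ha-mode": "all", "ha-sync-mode": "automatic"},
        "apply-to": "queues",
    }
    return path, json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    for vhost in ("/", "customer-a"):  # "customer-a" is a made-up vhost name
        path, body = ha_policy_request(vhost)
        print("PUT", path, body)
```

Generating the request for every vhost in one loop makes it harder to forget a newly created vhost.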

Summary Overall
● Short queues
● Long lived connections
● Limited use of priority queues
● Use multiple queues and consumers
● Split your queues over different cores
● Stable Erlang and RabbitMQ version
● Disable plugins you are not using
● Channels on all connections

Summary Overall
● Separate connections for publishers and
consumers
● Management statistics rate mode
● Delete unused queues
● Temporary queues should be auto deleted

Summary High Performance
● Short queues
○ max-length if possible
● Do not use lazy queues
● Send transient messages
● Disable manual acks and publish confirms
● Avoid multiple nodes (HA)
● Enable RabbitMQ HiPE

Summary High Availability
● Enable lazy queues
● RabbitMQ HA - 2 nodes
○ HA-policy on all vhosts
● Persistent messages, durable queues
● Do not enable HiPE

DIAGNOSTIC TOOL
Diagnostics Tool
● RabbitMQ and Erlang version
● Queue length
● Unused queues
● Persistent messages in durable queues
● No mirrored auto delete queues
● Long lived connections
● Connection and channel leak
● Channels on all connections
● Insecure connections
● Client library
● AMQP Heartbeats
● Channel prefetch
● Management statistics rate mode
● Ensure that you are not using topic exchange as fanout
● Ensure that all published messages are routed
● Ensure that you have a HA-policy on all vhosts
● Auto delete on temporary queues
● No transient messages in mirrored queues
● Separate connections for publishers and consumers

It should be easier to do things right!

Questions?
Visit the blog, documentation, and FAQ at www.cloudamqp.com for more info
