Chartbeat measures and monetizes attention on the web. The team was seeing slow data delivery and TCP retransmissions caused by default system settings. Tuning TCP, NGINX, and EC2 ELB settings, such as increasing buffers, disabling Nagle's algorithm, and enabling HTTP keep-alive, resolved the issues and improved performance. This included tuning net.ipv4.tcp_max_syn_backlog, net.core.somaxconn, and the nginx listen backlog.
2. Who are we?
Chartbeat measures and monetizes attention on the web. Working with 80% of
the top US news sites and global media sites in 50 countries, Chartbeat brings
together editors and advertisers to identify in real time the active time an
audience consumes articles, videos, paid content, and display advertising.
3. ● Founded in 2009
● Hosted on AWS, 400-500 servers depending on time of day
● Around 180k-220k req/sec
● 6-9 million concurrents
chartbeat.com/totaltotal
4. Who am I?
● Sr Web Operations Engineer
● Previously worked at
○ Bitly
○ TheStreet.com
○ Promotions.com
6. Problem
● Reports of slowness from some customers
● Taking 3 seconds to send data
Default Retransmission Timeout
RFC 1122: Section 4.2.3.1
The following values SHOULD be used to initialize the
estimation parameters for a new connection:
(a) RTT = 0 seconds.
(b) RTO = 3 seconds. (The smoothed variance is to be
initialized to the value that will result in this RTO).
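You can watch the retransmission timer the kernel is actually using per connection with ss from iproute2 (a quick check, not from the original slides; the rto field is reported in milliseconds):

$ ss -ti
# the rto: field shows the current retransmission timeout for each
# connection; before any RTT sample exists, the kernel falls back to
# its default initial RTO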
15. Sources of info
● netstat -s
"tcp_active_connections_openings",
"tcp_connections_aborted_due_to_timeout",
"tcp_data_loss_events",
"tcp_failed_connection_attempts",
"tcp_other_tcp_timeouts",
"tcp_passive_connection_openings",
"tcp_segments_retransmited",
"tcp_segments_send_out",
"tcp_syns_to_listen_sockets_dropped",
"tcp_times_the_listen_queue_of_a_socket_overflowed",
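As a rough sketch, the raw counters behind these metric names can be pulled straight from netstat (the grep pattern is illustrative; "retransmited" mirrors netstat's own spelling):

$ netstat -s | grep -iE 'retransmited|overflowed|SYNs to LISTEN|timeout'
# push the numbers into ganglia/graphite on an interval to get
# per-second rates you can graph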
[Graph slides: the counters above plotted over time, annotated “Values > 1, can’t be good” and “Confirmed what we suspected”.]
21. Further settings explored
net.ipv4.tcp_slow_start_after_idle
net.ipv4.tcp_max_tw_buckets
net.ipv4.tcp_rmem / net.ipv4.tcp_wmem
net.ipv4.tcp_fin_timeout
net.ipv4.tcp_mem
22. net.ipv4.tcp_slow_start_after_idle
Set to 0 to ensure connections don’t go back to
default window size after being idle too long.
Example: HTTP KeepAlive
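A minimal sketch of applying and persisting the change (the sysctl.conf location is the usual convention, not from the slides):

$ sysctl -w net.ipv4.tcp_slow_start_after_idle=0
# persist across reboots in /etc/sysctl.conf:
net.ipv4.tcp_slow_start_after_idle = 0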
23. net.ipv4.tcp_max_tw_buckets
Max number of sockets in TIME_WAIT. We actually set this very high, since before we moved instances behind an ELB it was normal to have 200k+ sockets in TIME_WAIT state.
Exceeding this limit leads to sockets being torn down until the count is back under it
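To see where you stand against the limit (a quick check, not from the slides; the wc count includes ss's header line):

$ ss -tn state time-wait | wc -l      # sockets currently in TIME_WAIT
$ sysctl net.ipv4.tcp_max_tw_buckets  # the configured ceiling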
24. net.ipv4.tcp_rmem / net.ipv4.tcp_wmem
Format: min default max (in bytes)
The kernel will autotune the number of bytes to use for each socket based on these settings. It will start at default and work between the min and max
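For illustration, a hedged example of the three-value format (these particular numbers are a common starting point for larger transfers, not a recommendation from the talk):

net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# min default max, in bytes; the kernel autotunes within [min, max]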
25. net.ipv4.tcp_fin_timeout
The time a connection should spend in FIN_WAIT_2 state. Default is 60 seconds; lowering this will free memory more quickly and transition the socket to TIME_WAIT.
This will NOT reduce the time a socket is in TIME_WAIT, which is set to 2 * MSL (max segment lifetime)
26. net.ipv4.tcp_fin_timeout continued...
MSL is hardcoded in the kernel at 60 seconds!
https://github.com/torvalds/linux/blob/master/include/net/tcp.h#L115
#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT state, about 60 seconds */
27. net.ipv4.tcp_mem
Format: low pressure max (in pages!)
Below low, the kernel won't put pressure on sockets to reduce memory usage. Once usage hits pressure, sockets reduce memory until low is reached. If max is hit, no new sockets.
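Because the units are pages, a quick sketch for translating the current values into bytes (getconf is standard POSIX; not from the slides):

$ sysctl net.ipv4.tcp_mem
$ getconf PAGESIZE   # multiply each tcp_mem value by this to get bytes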
30. net.ipv4.tcp_tw_recycle (DANGEROUS)
● Clients behind NAT/stateful firewalls will get dropped
● *99.99999999% of the time it should never be enabled
* Probably 100%, but there may be a valid case out there
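A one-line sanity check that it is off (sysctl accepts multiple keys; tcp_tw_reuse shown alongside for comparison):

$ sysctl net.ipv4.tcp_tw_recycle net.ipv4.tcp_tw_reuse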
32. Recycle vs Reuse Deep Dive
http://bit.ly/tcp-time-wait
33. One last thing…
TCP Congestion Window - initcwnd (initial)
Starting in kernel 2.6.39, set to 10. Previous default was 3!
http://research.google.com/pubs/pub36640.html
Older kernel?
$ ip route change default via 192.168.1.1 dev eth0 proto static initcwnd 10
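To confirm the route picked up the larger initial window (a quick check, not from the slides):

$ ip route show | grep initcwnd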
35. listen statement
● backlog
○ limited by net.core.somaxconn
● deferred
○ TCP_DEFER_ACCEPT - wait until we receive a data packet before passing the socket to the server. Completing the TCP handshake alone won't trigger an accept() (see the sketch below)
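A hedged nginx sketch combining both options (the backlog value and server name are illustrative; the backlog is still capped by net.core.somaxconn):

server {
    listen 80 backlog=4096 deferred;  # deferred sets TCP_DEFER_ACCEPT on Linux
    server_name example.com;          # hypothetical
}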
36. server block
● sendfile
○ Saves context switching from userspace on read/write
○ “zero copy”, happens in kernel space
● tcp_nopush
○ TCP_CORK
○ Allows the application to control building of the packet, e.g. pack a packet with a full HTTP response
● tcp_nodelay
○ Disables Nagle’s algorithm
○ Only affects keep-alive connections
● multi_accept
○ Accept all connections on the listen queue at once (careful, can overwhelm workers)
(see the config sketch below)
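A hedged mapping of these directives into nginx config (note multi_accept actually lives in the events block; the on/off choices mirror the notes at the end: tcp_nopush and tcp_nodelay on, multi_accept off):

events {
    multi_accept off;  # a constant stream of connections can overwhelm workers
}
http {
    sendfile    on;    # zero-copy in kernel space
    tcp_nopush  on;    # TCP_CORK: build full packets before sending
    tcp_nodelay on;    # disable Nagle; affects keep-alive connections
}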
38. HTTP Keep-Alive
● Enabled once behind an ELB
● Given our small payload and 15 seconds between data, it was a waste of resources for us to enable it while exposed directly to the net
39. HTTP Keep-Alive cont...
● Also enabled on upstream proxies
○ Available since nginx 1.1.4
○ *cough* had to upgrade nginx and fix a memory leak dealing with libevent and keepalives before we could get this fully set up
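A hedged sketch of upstream keepalive (the upstream name and connection count are illustrative; HTTP/1.1 and a cleared Connection header are needed for the keepalive pool to be used):

upstream app_backend {        # hypothetical name
    server 10.0.0.10:8080;
    keepalive 32;             # idle connections kept open per worker
}
server {
    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}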
41. Cross-Zone load balancing
Ensures requests to each ELB in each AZ go to ALL instances in ALL AZs
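On the classic ELB API this is a single attribute; a hedged CLI sketch (the load balancer name is hypothetical):

$ aws elb modify-load-balancer-attributes \
    --load-balancer-name my-elb \
    --load-balancer-attributes "{\"CrossZoneLoadBalancing\":{\"Enabled\":true}}"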
42. Idle Connection Timeout
● Defaults to 60 seconds
● Finally tunable via the API
● Tweak if doing anything long-lived, e.g. WebSockets, or ensure you are sending “pings”
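Same API surface as the cross-zone example (the 300-second value is illustrative):

$ aws elb modify-load-balancer-attributes \
    --load-balancer-name my-elb \
    --load-balancer-attributes "{\"ConnectionSettings\":{\"IdleTimeout\":300}}"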
43. Connection draining
“Graceful” removal of a node from the ELB; ensures existing connections can finish instead of a hard cutoff (the old behavior)
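And a hedged sketch of enabling draining (the timeout, in seconds, is illustrative):

$ aws elb modify-load-balancer-attributes \
    --load-balancer-name my-elb \
    --load-balancer-attributes "{\"ConnectionDraining\":{\"Enabled\":true,\"Timeout\":300}}"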
44. Metrics to monitor
● SurgeQueueLength (Not Good)
A count of the total number of requests that are pending submission to a registered instance.
● SpilloverCount (BAD)
A count of the total number of requests that were rejected due to the queue being full.
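Both live in CloudWatch under the AWS/ELB namespace; a hedged CLI sketch for pulling SpilloverCount (the load balancer name, dates, and period are illustrative):

$ aws cloudwatch get-metric-statistics --namespace AWS/ELB \
    --metric-name SpilloverCount --statistics Sum --period 60 \
    --dimensions Name=LoadBalancerName,Value=my-elb \
    --start-time 2014-07-01T00:00:00Z --end-time 2014-07-01T01:00:00Z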
45. Conclusions
● The internet is full of lies
● With enough traffic, tweaking system and application defaults becomes necessary
● Find trusted sources (me? maybe?) for settings and test in staging environments
● Measure impact and understand what metrics may be impacted by your tweaks
● Don’t get lost in all the sysctl settings
● TCP is complicated
46. FIN
FIN_WAIT_1
FIN_WAIT_2
TIME_WAIT
47. Resources and References
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
man tcp(7)
Record traffic during the US election and the World Cup (USA vs Germany): 10+ million concurrents. The presidential election at the time was 2x our normal traffic
High packet rate, low bandwidth. 43 bytes is the small empty image we can send. We need this for error-handling purposes on the frontend side; we can’t send an empty response
Reports from users about slowness in sending “pings” to our servers. Slow clients don’t really affect our numbers too much as long as data is arriving in < 5 seconds. Asked for some numbers: pings were taking around 3 seconds. That number sets off some alarms
These two numbers should raise alarms when you are doing any troubleshooting of TCP connections. 3 seconds is the default timeout for retrying a connection; it will back off and retry after 6 seconds, so 9 seconds total for a connection
Maybe on the client side? How do we know it’s on us? At the time our Pingdom monitoring didn’t show anything unusual; we later learned this is definitely not enough
Especially if you don’t have a good baseline for some metrics, you will end up chasing oddities in graphs that may be completely irrelevant
Only graphed relevant info from netstat -s; there are a ton of metrics that may be useful for debugging other issues, but we started with these since they appeared most related to the issue at hand. For example, “fast retransmit”, while relevant, wouldn’t indicate a delay of 3 seconds, since it bypasses the timeout. Push these to ganglia/graphite
Had to enable logging, discarding after an hour due to space constraints. Rotation impacts performance; switching to ext4 on the log volume helped, with no compression
Confirmed the issues but didn’t give us a source, just some symptoms to look into.
Two queues: the first for half-established connections; you can make it large to help with SYN floods, although given today’s flood attacks, probably not much help. The second queue holds established connections for your app to pluck off. net.ipv4.tcp_max_syn_backlog is system-wide; net.core.somaxconn caps the backlog of each listening socket
Still not 100% sure what this controls; from looking at the kernel source, it appears to be this. The nginx backlog originally wasn’t covered in the documentation; I had to find it in the source code and from googling. Didn’t know about the nginx listen backlog at first. Initially changed the first three values and saw a slight decrease in timeouts and listen queue overflows; it took a bunch of reading until I learned that each application has to set its own backlog queue, and even further research to find what nginx’s default value was
The kernel will reuse sockets in TIME_WAIT when it can; a socket in TIME_WAIT state actually doesn’t take up any resources
Tweak if dealing with sending/receiving large amounts of data to improve throughput
We changed these, but our throughput is fairly low per server, so we didn’t see any measurable impact
Internet gets this wrong a lot. TIME_WAIT takes up no memory
Everyone on the internet gets this wrong! If you really want to change the TIME_WAIT time, see ip_conntrack_tcp_timeout_time_wait in the ip_conntrack module
The pressure relates to the rmem and wmem settings we set earlier.
Definitions wrong, harmful settings recommended; I’ve even seen this in a lot of books when searching books.google.com for settings
If you are reading any blog/book that recommends enabling this, run far away
Amazing read into why recycle is bad, and why TIME_WAIT exists
Allows for more data in flight; if you are serving up larger content, you will see nice improvements here
Defer saves resources where a handshake occurs but no data is sent or data is delayed. Leaves nginx free to deal with connections already sending a data payload
We set both to on. Small payloads
tcp_nopush = application controls things
tcp_nodelay = seamless to developer, just happens
multi_accept we have off; given our constant stream of connections, it can overwhelm downstream
Lower CPU utilization as well
Previous behavior: if a request hit the ELB in AZ us-east-1d, it would only get routed to instances behind it in that AZ. This change really smoothed out distribution for us
Indicates capacity issues
It’s easy to get carried away and tune too many things (premature optimization) or settings which may have little to no effect for you.
Don’t trust random blogs; they’re filled with terrible information, with sysctl settings defined wrong or extremely vague