SlideShare a Scribd company logo
1 of 49
Download to read offline
Tuning TCP and NGINX on 
EC2
Who are we? 
Chartbeat measures and monetizes attention on the web. Working with 80% of 
the top US news sites and global media sites in 50 countries, Chartbeat brings 
together editors and advertisers to identify in real time the active time an 
audience consumes articles, videos, paid content, and display advertising. 
2 | TCP/NGINX Tuning on EC2
● Founded in 2009 
● Hosted on AWS , 400-500 servers 
depending on time of day 
● Around 180k - 220k req/sec 
● 6 - 9 million concurrents 
chartbeat.com/totaltotal 
2 | TCP/NGINX Tuning on EC2
Who am I? 
● Sr Web Operations Engineer 
● Previously worked at 
○ Bitly 
○ TheStreet.com 
○ Promotions.com 
2 | TCP/NGINX Tuning on EC2
Traffic Characteristics 
Every 15 seconds 
213 byte, request size 
43 byte, response size 
2 | TCP/NGINX Tuning on EC2
Problem 
● Reports of slowness from some customers 
● Taking 3 seconds to send data 
2 | TCP/NGINX Tuning on EC2 
Default Retransmission Timeout 
RFC 1122: Section 4.2.3.1 
The following values SHOULD be used to initialize the 
estimation parameters for a new connection: 
(a) RTT = 0 seconds. 
(b) RTO = 3 seconds. (The smoothed variance is to be 
initialized to the value that will result in this RTO).
2 | TCP/NGINX Tuning on EC2 
flickr: wallyg
2 | TCP/NGINX Tuning on EC2 
flickr: oregondot
Now what? 
TCPDump + Wireshark confirms retransmissions 
2 | TCP/NGINX Tuning on EC2
DON’T GRAPH ALL THE THINGS 
● Graph only relevant metrics 
○ you’ll end up with a ton of red herrings 
2 | TCP/NGINX Tuning on EC2
Sources of info 
● ss -s 
○ summary of socket statistics 
TCP: 10678 (estab 2503, closed 8167, orphaned 0, synrecv 0, timewait 8167/0), 
ports 0 
● netstat -s 
"tcp_active_connections_openings", 
"tcp_connections_aborted_due_to_timeout", 
"tcp_data_loss_events", 
"tcp_failed_connection_attempts", 
"tcp_other_tcp_timeouts", 
"tcp_passive_connection_openings", 
"tcp_segments_retransmited", 
"tcp_segments_send_out", 
"tcp_syns_to_listen_sockets_dropped", 
"tcp_times_the_listen_queue_of_a_socket_overflowed", 
● 
2 | TCP/NGINX Tuning on EC2
2 | TCP/NGINX Tuning on EC2 
TCP/IP 
Illustrated 
Volume 1 
Second Ed.
2 | TCP/NGINX Tuning on EC2 
Logster + Graphite 
https://github.com/etsy/logster 
Tails logs, generates metrics and 
outputs to Graphite or Ganglia
FINDINGS 
2 | TCP/NGINX Tuning on EC2
Sources of info 
● netstat -s 
"tcp_active_connections_openings", 
"tcp_connections_aborted_due_to_timeout", 
"tcp_data_loss_events", 
"tcp_failed_connection_attempts", 
"tcp_other_tcp_timeouts", 
"tcp_passive_connection_openings", 
"tcp_segments_retransmited", 
"tcp_segments_send_out", 
"tcp_syns_to_listen_sockets_dropped", 
"tcp_times_the_listen_queue_of_a_socket_overflowed", 
2 | TCP/NGINX Tuning on EC2 
Values > 1, can’t be 
good 
Confirmed what we 
suspected 
WHUT
2 | TCP/NGINX Tuning on EC2
net.ipv4.tcp_max_syn_backlog 
Systems Performance 
Enterprise and the Cloud by Brendan Gregg, pg 492 
2 | TCP/NGINX Tuning on EC2 
net.core.somaxconn 
Nginx: listen backlog=####
Insane Defaults 
● net.core.netdev_max_backlog = 1000 
○ Per CPU backlog? 
○ Network Frames 
● net.ipv4.tcp_max_syn_backlog = 128 
● net.core.somaxconn = 128 
● nginx listen backlog = 511 ?!? 
○ Silently truncated to somaxconn value 
2 | TCP/NGINX Tuning on EC2
New Values 
● net.core.netdev_max_backlog = 16384 
● net.ipv4.tcp_max_syn_backlog = 65536 
● net.core.somaxconn = 16384 
● nginx listen backlog = 16384 
○ should be <= somaxconn 
2 | TCP/NGINX Tuning on EC2
2 | TCP/NGINX Tuning on EC2 
Results
Further settings explored 
net.ipv4.tcp_slow_start_after_idle 
net.ipv4.tcp_max_tw_buckets 
net.ipv4.tcp_rmem/wrem 
net.ipv4.tcp_fin_timeout 
net.ipv4.tcp_mem 
2 | TCP/NGINX Tuning on EC2
net.ipv4.tcp_slow_start_after_idle 
Set to 0 to ensure connections don’t go back to 
default window size after being idle too long. 
Example: HTTP KeepAlive 
2 | TCP/NGINX Tuning on EC2
net.ipv4.tcp_max_tw_buckets 
Max number of sockets in TIME_WAIT. We 
actually set this very high, since before we 
moved instances behind an ELB it was normal 
to have 200k+ sockets in TIME_WAIT state. 
Exceeding this leads to sockets being torn 
down until under limit 
2 | TCP/NGINX Tuning on EC2
net.ipv4.tcp_rmem/wrem 
Format: min default max ( in bytes) 
The kernel will autotune the number of bytes to 
use for each socket based on these settings. It 
will start at default and work between the 
min and max 
2 | TCP/NGINX Tuning on EC2
net.ipv4.tcp_fin_timeout 
The time a connection should spend in 
FIN_WAIT_2 state. Default is 60 seconds, 
lowering this will free memory more quickly and 
transition the socket to TIME_WAIT. 
This will NOT reduce the time a socket is in 
TIME_WAIT which is set to 2 * MSL (max 
segment lifetime) 
2 | TCP/NGINX Tuning on EC2
net.ipv4.tcp_fin_timeout continued... 
MSL is hardcoded in the kernel at 60 seconds! 
https://github. 
com/torvalds/linux/blob/master/include/net/tcp. 
h#L115 
#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy 
TIME-WAIT * state, about 60 seconds */ 
2 | TCP/NGINX Tuning on EC2
net.ipv4.tcp_mem 
Format: low pressure max (in pages!) 
Below low, Kernel won’t put pressure on 
sockets to reduce mem usage. Once pressure 
hits, sockets reduce memory until low is hit. If 
max hit, no new sockets. 
2 | TCP/NGINX Tuning on EC2
2 | TCP/NGINX Tuning on EC2
2 | TCP/NGINX Tuning on EC2
net.ipv4.tcp_tw_recycle (DANGEROUS) 
● Clients behind NAT/Stateful FW will get 
dropped 
● *99.99999999% of time should never be 
enabled 
* Probably 100% but there may be a valid case out there 
2 | TCP/NGINX Tuning on EC2
net.ipv4.tcp_tw_reuse 
● Makes a safer attempt at freeing sockets in 
TIME_WAIT state. 
2 | TCP/NGINX Tuning on EC2
Recycle vs Reuse Deep Dive 
http://bit.ly/tcp-time-wait 
2 | TCP/NGINX Tuning on EC2
One last thing… 
TCP Congestion Window - initcwnd (initial) 
2 | TCP/NGINX Tuning on EC2 
Starting in Kernel 2.6.39 , set to 10 
Previous default was 3! 
http://research.google.com/pubs/pub36640.html 
Older Kernel? 
$ ip route change default via 192.168.1.1 dev eth0 proto static initcwnd 10
2 | TCP/NGINX Tuning on EC2 
NGINX
listen statement 
● backlog 
○ limited by net.core.somaxconn 
● defer 
○ TCP_DEFER_ACCEPT - Wait till we receive data 
packet before passing socket to server. Completing 
TCP Handshake won’t trigger an accept() 
2 | TCP/NGINX Tuning on EC2
server block 
● sendfile 
○ Saves context switching from userspace on 
read/write. 
○ “zero copy” , happens in kernel space 
● tcp_nopush 
○ TCP_CORK 
○ allows application to control building of packet, e.g 
pack a packet with full HTTP response 
● tcp_nodelay 
○ Nagle’s Algorithm 
○ Only affects keep-alive connections 
● multi_accept 
○ Accept all connections on listen queue at once 
(careful, can overwhelm workers) 
2 | TCP/NGINX Tuning on EC2
Nagle’s Algorithm (tcp_nopush) 
Small payload + need for low latency? 
Disable 
2 | TCP/NGINX Tuning on EC2
HTTP Keep-Alive 
● Enabled once behind ELB 
● Given small payload and 15 seconds between 
data, waste of resources for us to enable 
exposed directly to net 
2 | TCP/NGINX Tuning on EC2
HTTP Keep-Alive Cont.. 
● Also enable on upstream proxies 
○ Available since 1.1.4 
○ *cough* had to upgrade Nginx and Fix memory leak 
dealing with libevent and keepalives before we could 
get this fully setup 
2 | TCP/NGINX Tuning on EC2
2 | TCP/NGINX Tuning on EC2 
ELB
Cross-Zone load balancing 
Ensures requests to 
each ELB in each AZ 
go to ALL instances 
in ALL AZs 
2 | TCP/NGINX Tuning on EC2
Idle Connection Timeout 
● Defaults to 60 seconds 
● Finally tunable via API. 
● Tweak if doing anything long lived , e.g. 
Websockets, or ensure you are sending 
“pings” 
2 | TCP/NGINX Tuning on EC2
Connection draining 
“Graceful” removal of node from ELB, will 
ensure existing connections can finish instead 
of a hard cutoff (old behavior) 
2 | TCP/NGINX Tuning on EC2
Metrics to monitor 
● SurgeQueueLength (Not Good) 
A count of the total number of requests that are 
pending submission to a registered instance. 
● SpilloverCount (BAD) 
A count of the total number of requests that 
were rejected due to the queue being full. 
2 | TCP/NGINX Tuning on EC2
Conclusions 
● The internet is full of lies 
● With enough traffic, tweaking system and 
application defaults are a necessary 
● Find trusted sources (Me? Maybe?) for 
settings and test in staging environments 
● Measure impact and understand what metrics 
may be impacted by your tweaks 
● Don’t get lost in all the sysctl settings 
● TCP is complicated 
2 | TCP/NGINX Tuning on EC2
2 | TCP/NGINX Tuning on EC2 
FIN 
FIN_WAIT_1 
FIN_WAIT_2 
TIME_WAIT
Resources and References 
https://www.kernel. 
org/doc/Documentation/networking/ip-sysctl.txt 
2 | TCP/NGINX Tuning on EC2 
man tcp(7)
Additional reading 
http://engineering.chartbeat.com 
Full story about experiences with our 
architecture and material discussed in 
slides 
2 | TCP/NGINX Tuning on EC2
Questions / Comments? 
2 | TCP/NGINX Tuning on EC2 
@Lintzston 
justin@chartbeat.com

More Related Content

What's hot

MHA for MySQLとDeNAのオープンソースの話
MHA for MySQLとDeNAのオープンソースの話MHA for MySQLとDeNAのオープンソースの話
MHA for MySQLとDeNAのオープンソースの話
Yoshinori Matsunobu
 
Twitterのsnowflakeについて
TwitterのsnowflakeについてTwitterのsnowflakeについて
Twitterのsnowflakeについて
moai kids
 

What's hot (20)

NGINX: Basics and Best Practices
NGINX: Basics and Best PracticesNGINX: Basics and Best Practices
NGINX: Basics and Best Practices
 
Docker Compose 徹底解説
Docker Compose 徹底解説Docker Compose 徹底解説
Docker Compose 徹底解説
 
大規模サービスを支えるネットワークインフラの全貌
大規模サービスを支えるネットワークインフラの全貌大規模サービスを支えるネットワークインフラの全貌
大規模サービスを支えるネットワークインフラの全貌
 
MHA for MySQLとDeNAのオープンソースの話
MHA for MySQLとDeNAのオープンソースの話MHA for MySQLとDeNAのオープンソースの話
MHA for MySQLとDeNAのオープンソースの話
 
コンテナ時代のOpenStack
コンテナ時代のOpenStackコンテナ時代のOpenStack
コンテナ時代のOpenStack
 
AWSでDockerを扱うためのベストプラクティス
AWSでDockerを扱うためのベストプラクティスAWSでDockerを扱うためのベストプラクティス
AWSでDockerを扱うためのベストプラクティス
 
さいきんの InnoDB Adaptive Flushing (仮)
さいきんの InnoDB Adaptive Flushing (仮)さいきんの InnoDB Adaptive Flushing (仮)
さいきんの InnoDB Adaptive Flushing (仮)
 
密かに話題のBufferbloat
密かに話題のBufferbloat密かに話題のBufferbloat
密かに話題のBufferbloat
 
PostgreSQL: XID周回問題に潜む別の問題
PostgreSQL: XID周回問題に潜む別の問題PostgreSQL: XID周回問題に潜む別の問題
PostgreSQL: XID周回問題に潜む別の問題
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行
 
ML2/OVN アーキテクチャ概観
ML2/OVN アーキテクチャ概観ML2/OVN アーキテクチャ概観
ML2/OVN アーキテクチャ概観
 
Harbor RegistryのReplication機能
Harbor RegistryのReplication機能Harbor RegistryのReplication機能
Harbor RegistryのReplication機能
 
Twitterのsnowflakeについて
TwitterのsnowflakeについてTwitterのsnowflakeについて
Twitterのsnowflakeについて
 
Zabbix最新情報 ~Zabbix 6.0に向けて~ @OSC2021 Online/Fall
Zabbix最新情報 ~Zabbix 6.0に向けて~ @OSC2021 Online/FallZabbix最新情報 ~Zabbix 6.0に向けて~ @OSC2021 Online/Fall
Zabbix最新情報 ~Zabbix 6.0に向けて~ @OSC2021 Online/Fall
 
乗っ取れコンテナ!!開発者から見たコンテナセキュリティの考え方(CloudNative Days Tokyo 2021 発表資料)
乗っ取れコンテナ!!開発者から見たコンテナセキュリティの考え方(CloudNative Days Tokyo 2021 発表資料)乗っ取れコンテナ!!開発者から見たコンテナセキュリティの考え方(CloudNative Days Tokyo 2021 発表資料)
乗っ取れコンテナ!!開発者から見たコンテナセキュリティの考え方(CloudNative Days Tokyo 2021 発表資料)
 
KubernetesバックアップツールVeleroとちょっとした苦労話
KubernetesバックアップツールVeleroとちょっとした苦労話KubernetesバックアップツールVeleroとちょっとした苦労話
KubernetesバックアップツールVeleroとちょっとした苦労話
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行
 
DockerとPodmanの比較
DockerとPodmanの比較DockerとPodmanの比較
DockerとPodmanの比較
 
ヤフー社内でやってるMySQLチューニングセミナー大公開
ヤフー社内でやってるMySQLチューニングセミナー大公開ヤフー社内でやってるMySQLチューニングセミナー大公開
ヤフー社内でやってるMySQLチューニングセミナー大公開
 
外部キー制約に伴うロックの小話
外部キー制約に伴うロックの小話外部キー制約に伴うロックの小話
外部キー制約に伴うロックの小話
 

Viewers also liked

Naxsi, an open source WAF for Nginx
Naxsi, an open source WAF  for NginxNaxsi, an open source WAF  for Nginx
Naxsi, an open source WAF for Nginx
Positive Hack Days
 
How to measure everything - a million metrics per second with minimal develop...
How to measure everything - a million metrics per second with minimal develop...How to measure everything - a million metrics per second with minimal develop...
How to measure everything - a million metrics per second with minimal develop...
Jos Boumans
 

Viewers also liked (20)

Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 Instances
 
NGINX Installation and Tuning
NGINX Installation and TuningNGINX Installation and Tuning
NGINX Installation and Tuning
 
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
 
Lcu14 Lightning Talk- NGINX
Lcu14 Lightning Talk- NGINXLcu14 Lightning Talk- NGINX
Lcu14 Lightning Talk- NGINX
 
Learn nginx in 90mins
Learn nginx in 90minsLearn nginx in 90mins
Learn nginx in 90mins
 
DevConf 2014 Kernel Networking Walkthrough
DevConf 2014   Kernel Networking WalkthroughDevConf 2014   Kernel Networking Walkthrough
DevConf 2014 Kernel Networking Walkthrough
 
How to secure your web applications with NGINX
How to secure your web applications with NGINXHow to secure your web applications with NGINX
How to secure your web applications with NGINX
 
The 3 Models in the NGINX Microservices Reference Architecture
The 3 Models in the NGINX Microservices Reference ArchitectureThe 3 Models in the NGINX Microservices Reference Architecture
The 3 Models in the NGINX Microservices Reference Architecture
 
AWS Black Belt Techシリーズ 2015 Amazon Elastic Block Store (EBS)
AWS Black Belt Techシリーズ 2015 Amazon Elastic Block Store (EBS)AWS Black Belt Techシリーズ 2015 Amazon Elastic Block Store (EBS)
AWS Black Belt Techシリーズ 2015 Amazon Elastic Block Store (EBS)
 
Nginx Internals
Nginx InternalsNginx Internals
Nginx Internals
 
(SDD423) Elastic Load Balancing Deep Dive and Best Practices | AWS re:Invent ...
(SDD423) Elastic Load Balancing Deep Dive and Best Practices | AWS re:Invent ...(SDD423) Elastic Load Balancing Deep Dive and Best Practices | AWS re:Invent ...
(SDD423) Elastic Load Balancing Deep Dive and Best Practices | AWS re:Invent ...
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 
Marian Marinov, 1H Ltd.
Marian Marinov, 1H Ltd.Marian Marinov, 1H Ltd.
Marian Marinov, 1H Ltd.
 
Monitoring NGINX (plus): key metrics and how-to
Monitoring NGINX (plus): key metrics and how-toMonitoring NGINX (plus): key metrics and how-to
Monitoring NGINX (plus): key metrics and how-to
 
Nginx monitoring with graphite
Nginx monitoring with graphiteNginx monitoring with graphite
Nginx monitoring with graphite
 
Naxsi, an open source WAF for Nginx
Naxsi, an open source WAF  for NginxNaxsi, an open source WAF  for Nginx
Naxsi, an open source WAF for Nginx
 
Devops training in Hyderabad
Devops training in HyderabadDevops training in Hyderabad
Devops training in Hyderabad
 
Fluentd and docker monitoring
Fluentd and docker monitoringFluentd and docker monitoring
Fluentd and docker monitoring
 
Tuning 17 march
Tuning 17 marchTuning 17 march
Tuning 17 march
 
How to measure everything - a million metrics per second with minimal develop...
How to measure everything - a million metrics per second with minimal develop...How to measure everything - a million metrics per second with minimal develop...
How to measure everything - a million metrics per second with minimal develop...
 

Similar to Tuning TCP and NGINX on EC2

Programming TCP for responsiveness
Programming TCP for responsivenessProgramming TCP for responsiveness
Programming TCP for responsiveness
Kazuho Oku
 

Similar to Tuning TCP and NGINX on EC2 (20)

Reconsider TCPdump for Modern Troubleshooting
Reconsider TCPdump for Modern TroubleshootingReconsider TCPdump for Modern Troubleshooting
Reconsider TCPdump for Modern Troubleshooting
 
Abandon Decades-Old TCPdump for Modern Troubleshooting
Abandon Decades-Old TCPdump for Modern TroubleshootingAbandon Decades-Old TCPdump for Modern Troubleshooting
Abandon Decades-Old TCPdump for Modern Troubleshooting
 
Real-time in the real world: DIRT in production
Real-time in the real world: DIRT in productionReal-time in the real world: DIRT in production
Real-time in the real world: DIRT in production
 
Transaction TCP
Transaction TCPTransaction TCP
Transaction TCP
 
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverterKernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverter
 
Make container without_docker_7
Make container without_docker_7Make container without_docker_7
Make container without_docker_7
 
Programming TCP for responsiveness
Programming TCP for responsivenessProgramming TCP for responsiveness
Programming TCP for responsiveness
 
IRJET- Modeling a New Startup Algorithm for TCP New Reno
IRJET- Modeling a New Startup Algorithm for TCP New RenoIRJET- Modeling a New Startup Algorithm for TCP New Reno
IRJET- Modeling a New Startup Algorithm for TCP New Reno
 
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
 SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK... SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
 
FPGA based 10G Performance Tester for HW OpenFlow Switch
FPGA based 10G Performance Tester for HW OpenFlow SwitchFPGA based 10G Performance Tester for HW OpenFlow Switch
FPGA based 10G Performance Tester for HW OpenFlow Switch
 
Linux Kernel vs DPDK: HTTP Performance Showdown
Linux Kernel vs DPDK: HTTP Performance ShowdownLinux Kernel vs DPDK: HTTP Performance Showdown
Linux Kernel vs DPDK: HTTP Performance Showdown
 
LF_OVS_17_OVS/OVS-DPDK connection tracking for Mobile usecases
LF_OVS_17_OVS/OVS-DPDK connection tracking for Mobile usecasesLF_OVS_17_OVS/OVS-DPDK connection tracking for Mobile usecases
LF_OVS_17_OVS/OVS-DPDK connection tracking for Mobile usecases
 
XS Boston 2008 Network Topology
XS Boston 2008 Network TopologyXS Boston 2008 Network Topology
XS Boston 2008 Network Topology
 
ECS19 - Ingo Gegenwarth - Running Exchange in large environment
ECS19 - Ingo Gegenwarth -  Running Exchangein large environmentECS19 - Ingo Gegenwarth -  Running Exchangein large environment
ECS19 - Ingo Gegenwarth - Running Exchange in large environment
 
Linux HTTPS/TCP/IP Stack for the Fast and Secure Web
Linux HTTPS/TCP/IP Stack for the Fast and Secure WebLinux HTTPS/TCP/IP Stack for the Fast and Secure Web
Linux HTTPS/TCP/IP Stack for the Fast and Secure Web
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
High perf-networking
High perf-networkingHigh perf-networking
High perf-networking
 
TCP and Mobile Networks Turbulent Relationship
TCP and Mobile Networks Turbulent RelationshipTCP and Mobile Networks Turbulent Relationship
TCP and Mobile Networks Turbulent Relationship
 
Programming TCP for responsiveness
Programming TCP for responsivenessProgramming TCP for responsiveness
Programming TCP for responsiveness
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 

More from Chartbeat

Tom Germeau: Mobile Apps: Finding the balance between performance and flexibi...
Tom Germeau: Mobile Apps: Finding the balance between performance and flexibi...Tom Germeau: Mobile Apps: Finding the balance between performance and flexibi...
Tom Germeau: Mobile Apps: Finding the balance between performance and flexibi...
Chartbeat
 

More from Chartbeat (12)

Chartbeat: Discovery & Engagement in a Mobile-first World
Chartbeat: Discovery & Engagement in a Mobile-first WorldChartbeat: Discovery & Engagement in a Mobile-first World
Chartbeat: Discovery & Engagement in a Mobile-first World
 
Audience Building in the Age of Platforms
Audience Building in the Age of PlatformsAudience Building in the Age of Platforms
Audience Building in the Age of Platforms
 
Data in the Newsroom: Content Solutions for Advertising Challenges
Data in the Newsroom: Content Solutions for Advertising ChallengesData in the Newsroom: Content Solutions for Advertising Challenges
Data in the Newsroom: Content Solutions for Advertising Challenges
 
ONA 2015 – Can You Have Their Attention Please?
ONA 2015 – Can You Have Their Attention Please?ONA 2015 – Can You Have Their Attention Please?
ONA 2015 – Can You Have Their Attention Please?
 
Insights from Around the World
Insights from Around the WorldInsights from Around the World
Insights from Around the World
 
A Data State of the Union: Can We Make Quality Pay Online?
A Data State of the Union: Can We Make Quality Pay Online?A Data State of the Union: Can We Make Quality Pay Online?
A Data State of the Union: Can We Make Quality Pay Online?
 
Understanding Your Traffic Sources
Understanding Your Traffic Sources Understanding Your Traffic Sources
Understanding Your Traffic Sources
 
An Introduction to Video Analytics
An Introduction to Video Analytics An Introduction to Video Analytics
An Introduction to Video Analytics
 
Top Trends in Online Journalism for 2014
Top Trends in Online Journalism for 2014Top Trends in Online Journalism for 2014
Top Trends in Online Journalism for 2014
 
Say Hello to the New Chartbeat Publishing
Say Hello to the New Chartbeat Publishing Say Hello to the New Chartbeat Publishing
Say Hello to the New Chartbeat Publishing
 
Building a Loyal and Returning Audience with the New Chartbeat Publishing
Building a Loyal and Returning Audience with the New Chartbeat PublishingBuilding a Loyal and Returning Audience with the New Chartbeat Publishing
Building a Loyal and Returning Audience with the New Chartbeat Publishing
 
Tom Germeau: Mobile Apps: Finding the balance between performance and flexibi...
Tom Germeau: Mobile Apps: Finding the balance between performance and flexibi...Tom Germeau: Mobile Apps: Finding the balance between performance and flexibi...
Tom Germeau: Mobile Apps: Finding the balance between performance and flexibi...
 

Recently uploaded

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 

Tuning TCP and NGINX on EC2

  • 1. Tuning TCP and NGINX on EC2
  • 2. Who are we? Chartbeat measures and monetizes attention on the web. Working with 80% of the top US news sites and global media sites in 50 countries, Chartbeat brings together editors and advertisers to identify in real time the active time an audience consumes articles, videos, paid content, and display advertising. 2 | TCP/NGINX Tuning on EC2
  • 3. ● Founded in 2009 ● Hosted on AWS , 400-500 servers depending on time of day ● Around 180k - 220k req/sec ● 6 - 9 million concurrents chartbeat.com/totaltotal 2 | TCP/NGINX Tuning on EC2
  • 4. Who am I? ● Sr Web Operations Engineer ● Previously worked at ○ Bitly ○ TheStreet.com ○ Promotions.com 2 | TCP/NGINX Tuning on EC2
  • 5. Traffic Characteristics Every 15 seconds 213 byte, request size 43 byte, response size 2 | TCP/NGINX Tuning on EC2
  • 6. Problem ● Reports of slowness from some customers ● Taking 3 seconds to send data 2 | TCP/NGINX Tuning on EC2 Default Retransmission Timeout RFC 1122: Section 4.2.3.1 The following values SHOULD be used to initialize the estimation parameters for a new connection: (a) RTT = 0 seconds. (b) RTO = 3 seconds. (The smoothed variance is to be initialized to the value that will result in this RTO).
  • 7. 2 | TCP/NGINX Tuning on EC2 flickr: wallyg
  • 8. 2 | TCP/NGINX Tuning on EC2 flickr: oregondot
  • 9. Now what? TCPDump + Wireshark confirms retransmissions 2 | TCP/NGINX Tuning on EC2
  • 10. DON’T GRAPH ALL THE THINGS ● Graph only relevant metrics ○ you’ll end up with a ton of red herrings 2 | TCP/NGINX Tuning on EC2
  • 11. Sources of info ● ss -s ○ summary of socket statistics TCP: 10678 (estab 2503, closed 8167, orphaned 0, synrecv 0, timewait 8167/0), ports 0 ● netstat -s "tcp_active_connections_openings", "tcp_connections_aborted_due_to_timeout", "tcp_data_loss_events", "tcp_failed_connection_attempts", "tcp_other_tcp_timeouts", "tcp_passive_connection_openings", "tcp_segments_retransmited", "tcp_segments_send_out", "tcp_syns_to_listen_sockets_dropped", "tcp_times_the_listen_queue_of_a_socket_overflowed", ● 2 | TCP/NGINX Tuning on EC2
  • 12. 2 | TCP/NGINX Tuning on EC2 TCP/IP Illustrated Volume 1 Second Ed.
  • 13. 2 | TCP/NGINX Tuning on EC2 Logster + Graphite https://github.com/etsy/logster Tails logs, generates metrics and outputs to Graphite or Ganglia
  • 14. FINDINGS 2 | TCP/NGINX Tuning on EC2
  • 15. Sources of info ● netstat -s "tcp_active_connections_openings", "tcp_connections_aborted_due_to_timeout", "tcp_data_loss_events", "tcp_failed_connection_attempts", "tcp_other_tcp_timeouts", "tcp_passive_connection_openings", "tcp_segments_retransmited", "tcp_segments_send_out", "tcp_syns_to_listen_sockets_dropped", "tcp_times_the_listen_queue_of_a_socket_overflowed", 2 | TCP/NGINX Tuning on EC2 Values > 1, can’t be good Confirmed what we suspected WHUT
  • 16. 2 | TCP/NGINX Tuning on EC2
  • 17. net.ipv4.tcp_max_syn_backlog Systems Performance Enterprise and the Cloud by Brendan Gregg, pg 492 2 | TCP/NGINX Tuning on EC2 net.core.somaxconn Nginx: listen backlog=####
  • 18. Insane Defaults ● net.core.netdev_max_backlog = 1000 ○ Per CPU backlog? ○ Network Frames ● net.ipv4.tcp_max_syn_backlog = 128 ● net.core.somaxconn = 128 ● nginx listen backlog = 511 ?!? ○ Silently truncated to somaxconn value 2 | TCP/NGINX Tuning on EC2
  • 19. New Values ● net.core.netdev_max_backlog = 16384 ● net.ipv4.tcp_max_syn_backlog = 65536 ● net.core.somaxconn = 16384 ● nginx listen backlog = 16384 ○ should be <= somaxconn 2 | TCP/NGINX Tuning on EC2
  • 20. 2 | TCP/NGINX Tuning on EC2 Results
  • 21. Further settings explored net.ipv4.tcp_slow_start_after_idle net.ipv4.tcp_max_tw_buckets net.ipv4.tcp_rmem/wrem net.ipv4.tcp_fin_timeout net.ipv4.tcp_mem 2 | TCP/NGINX Tuning on EC2
  • 22. net.ipv4.tcp_slow_start_after_idle Set to 0 to ensure connections don’t go back to default window size after being idle too long. Example: HTTP KeepAlive 2 | TCP/NGINX Tuning on EC2
  • 23. net.ipv4.tcp_max_tw_buckets Max number of sockets in TIME_WAIT. We actually set this very high, since before we moved instances behind an ELB it was normal to have 200k+ sockets in TIME_WAIT state. Exceeding this leads to sockets being torn down until under limit 2 | TCP/NGINX Tuning on EC2
  • 24. net.ipv4.tcp_rmem/wrem Format: min default max ( in bytes) The kernel will autotune the number of bytes to use for each socket based on these settings. It will start at default and work between the min and max 2 | TCP/NGINX Tuning on EC2
  • 25. net.ipv4.tcp_fin_timeout The time a connection should spend in FIN_WAIT_2 state. Default is 60 seconds, lowering this will free memory more quickly and transition the socket to TIME_WAIT. This will NOT reduce the time a socket is in TIME_WAIT which is set to 2 * MSL (max segment lifetime) 2 | TCP/NGINX Tuning on EC2
  • 26. net.ipv4.tcp_fin_timeout continued... MSL is hardcoded in the kernel at 60 seconds! https://github. com/torvalds/linux/blob/master/include/net/tcp. h#L115 #define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT * state, about 60 seconds */ 2 | TCP/NGINX Tuning on EC2
  • 27. net.ipv4.tcp_mem Format: low pressure max (in pages!) Below low, Kernel won’t put pressure on sockets to reduce mem usage. Once pressure hits, sockets reduce memory until low is hit. If max hit, no new sockets. 2 | TCP/NGINX Tuning on EC2
  • 28. 2 | TCP/NGINX Tuning on EC2
  • 29. 2 | TCP/NGINX Tuning on EC2
  • 30. net.ipv4.tcp_tw_recycle (DANGEROUS) ● Clients behind NAT/Stateful FW will get dropped ● *99.99999999% of time should never be enabled * Probably 100% but there may be a valid case out there 2 | TCP/NGINX Tuning on EC2
  • 31. net.ipv4.tcp_tw_reuse ● Makes a safer attempt at freeing sockets in TIME_WAIT state. 2 | TCP/NGINX Tuning on EC2
  • 32. Recycle vs Reuse Deep Dive http://bit.ly/tcp-time-wait 2 | TCP/NGINX Tuning on EC2
  • 33. One last thing… TCP Congestion Window - initcwnd (initial) 2 | TCP/NGINX Tuning on EC2 Starting in Kernel 2.6.39 , set to 10 Previous default was 3! http://research.google.com/pubs/pub36640.html Older Kernel? $ ip route change default via 192.168.1.1 dev eth0 proto static initcwnd 10
  • 34. 2 | TCP/NGINX Tuning on EC2 NGINX
  • 35. listen statement ● backlog ○ limited by net.core.somaxconn ● defer ○ TCP_DEFER_ACCEPT - Wait till we receive data packet before passing socket to server. Completing TCP Handshake won’t trigger an accept() 2 | TCP/NGINX Tuning on EC2
  • 36. server block ● sendfile ○ Saves context switching from userspace on read/write. ○ “zero copy” , happens in kernel space ● tcp_nopush ○ TCP_CORK ○ allows application to control building of packet, e.g pack a packet with full HTTP response ● tcp_nodelay ○ Nagle’s Algorithm ○ Only affects keep-alive connections ● multi_accept ○ Accept all connections on listen queue at once (careful, can overwhelm workers) 2 | TCP/NGINX Tuning on EC2
  • 37. Nagle’s Algorithm (tcp_nopush) Small payload + need for low latency? Disable 2 | TCP/NGINX Tuning on EC2
  • 38. HTTP Keep-Alive ● Enabled once behind ELB ● Given small payload and 15 seconds between data, waste of resources for us to enable exposed directly to net 2 | TCP/NGINX Tuning on EC2
  • 39. HTTP Keep-Alive Cont.. ● Also enable on upstream proxies ○ Available since 1.1.4 ○ *cough* had to upgrade Nginx and Fix memory leak dealing with libevent and keepalives before we could get this fully setup 2 | TCP/NGINX Tuning on EC2
  • 40. 2 | TCP/NGINX Tuning on EC2 ELB
  • 41. Cross-Zone load balancing Ensures requests to each ELB in each AZ go to ALL instances in ALL AZs 2 | TCP/NGINX Tuning on EC2
  • 42. Idle Connection Timeout ● Defaults to 60 seconds ● Finally tunable via API. ● Tweak if doing anything long lived , e.g. Websockets, or ensure you are sending “pings” 2 | TCP/NGINX Tuning on EC2
  • 43. Connection draining “Graceful” removal of node from ELB, will ensure existing connections can finish instead of a hard cutoff (old behavior) 2 | TCP/NGINX Tuning on EC2
  • 44. Metrics to monitor ● SurgeQueueLength (Not Good) A count of the total number of requests that are pending submission to a registered instance. ● SpilloverCount (BAD) A count of the total number of requests that were rejected due to the queue being full. 2 | TCP/NGINX Tuning on EC2
  • 45. Conclusions ● The internet is full of lies ● With enough traffic, tweaking system and application defaults are a necessary ● Find trusted sources (Me? Maybe?) for settings and test in staging environments ● Measure impact and understand what metrics may be impacted by your tweaks ● Don’t get lost in all the sysctl settings ● TCP is complicated 2 | TCP/NGINX Tuning on EC2
  • 46. 2 | TCP/NGINX Tuning on EC2 FIN FIN_WAIT_1 FIN_WAIT_2 TIME_WAIT
  • 47. Resources and References https://www.kernel. org/doc/Documentation/networking/ip-sysctl.txt 2 | TCP/NGINX Tuning on EC2 man tcp(7)
  • 48. Additional reading http://engineering.chartbeat.com Full story about experiences with our architecture and material discussed in slides 2 | TCP/NGINX Tuning on EC2
  • 49. Questions / Comments? 2 | TCP/NGINX Tuning on EC2 @Lintzston justin@chartbeat.com

Editor's Notes

  1. Record traffic during US Election and World Cup, USA vs Germany 10+ million concurrents. Presidential election at the time was 2x our normal traffic
  2. High packet rate, low bandwidth. 43 bytes is small empty image we can send. Need this for error handling purposes on frontend side. Can’t send empty response
  3. Reports from users about slowness in sending “pings” to our servers. Slow clients, slowness doesn’t really affect our numbers too much as long as its arriving < 5 seconds. Asked for some numbers, seeing pings taking around 3 seconds . That number sets off some alarms
  4. These two numbers should raise alarms when you are doing any troubleshooting with TCP connections. 3 seconds is the default timeout for re-trying a connection, will backoff and re-try after 6 seconds, so 9 seconds total for a connection
  5. Maybe on client side? How do we know it’s on us? At the time our Pingdom monitoring didn’t show anything unusual, later learned this is definitely not enough
  6. Especially if you don’t have a good baseline for some metrics, you will end up chasing oddities in graphs that may be completely irrelevant
  7. Only graphed relevant info from netstat -s, there is a ton of metrics that may be useful for debugging other issues, but we started with these since they appeared to be most related with the issue at hand. For example, “fast retransmit” , while relevant, wouldnt indicate a delay of 3 seconds, since it bypasses the timeout. Push these to ganglia/graphite
  8. Had to enable logging, discard after hour, space contraints. Rotation impacts performance, switch to ext4 on log volume helped, no compress
  9. Confirmed issues, didn’t give us a source. But some symptoms to look into.
  10. Two queues, first for half established connections , can make large to help with SYN floods, although given todays flood attacks, probably not much help Second queue for established connections for your app to pluck off max_syn_backlog = system wide net.core.somaxconn = per process
  11. Still not 100% sure what this controls, from looking at Kernel source , it appears to be this Nginx backlog originally wasn’t documented in documentation, I had to find this in the source code and from googling
  12. Didnt know about nginx listen backlog at first. Initially changed first three values, saw a slight decrease in timeouts and listen queue overflows ,took a bunch of reading till I learned about the fact that each application has to set its own backlog queue and even further research to find what nginxs default value was
  13. Kernel will reuse sockets in TIME_WAIT when it can, a socket in a TIME_WAIT state actually doesn’t take up any resources
  14. Tweak if dealing with sending/receiving large amounts of data to improve throughput We changed but our throughput is fairly low per server so didn’t see any measurable impact
  15. Internet gets this wrong a lot. TIME_WAIT takes up no memory
  16. everyone on the internet gets this wrong! If you really want to change TIME_WAIT time , see ip_conntrack_tcp_timeout_time_wait in ip_conntrack module
  17. The pressure relates to the rmem and wmem settings we set earlier.
  18. Definitions wrong, harmful settings recommended, even seen this in a lot of books when searching books.google.com for settings
  19. If you are reading any blog/book that recommends enabling this, run far away
  20. Amazing read into why recycle is bad, and why TIME_WAIT exists
  21. Allows for more data in flight, if you are serving up larger content, you will see nice improvements here
  22. Defer , saves on resources where handshake occurs but no data is sent or data is delayed. Leaves Nginx free to deal with connections already sending a data payload
  23. we set both to on. Small payloads tcp_nopush = application controls things tcp_nodelay = seamless to developer, just happens multi_accept, we have off, given our constant stream of connections, it can overwhelm downstream
  24. Lower CPU utilization as well
  25. Previous behavior, if request hit ELB in AZ us-east-1d, would only get routed to instances behind there. This change really smoothed out distribution for us
  26. Indicates capacity issues
  27. It’s easy to get carried away and tune too many things (premature optimization) or settings which may have little to no effect for you.
  28. Dont trust random blogs, filled with terrible information. Sysctl settings defined wrong or extremely vague