View full webinar on demand at http://nginx.com/resources/webinars/nginx-high-availability-monitoring/
NGINX High Availability and Monitoring
1. NGINX High Availability and Monitoring
Introduced by Andrew Alexeev
Presented by Owen Garrett
NGINX, Inc.
2. About this webinar
No one likes a broken website. Learn about some of the techniques that NGINX
users employ to ensure that server failures are detected and worked around, so that
you too can build large-scale, highly-available web services.
4. The causes of downtime
“Through 2015, 80% of outages impacting mission-critical services will be caused by people and process issues, and more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues.”
Ronni J. Colville and George Spafford, Gartner: Configuration Management for Virtual and Cloud Infrastructures
(Chart: outage causes split between “People and Process” and “Hardware failures, disasters”)
6. What is NGINX?
• Proxy: caching, load balancing of HTTP traffic
• Web Server: serve content from disk
• Application Server: FastCGI, uWSGI, Passenger…
• Application Acceleration: SSL and SPDY termination
• Performance Monitoring
• High Availability
• Advanced features: bandwidth management, content-based routing, request manipulation, response rewriting, authentication, video delivery, mail proxy, geolocation
8. NGINX usage: 22% of the top 1 million websites, 37% of the top 1,000 websites
9. NGINX and NGINX Plus
NGINX F/OSS (nginx.org): large community of >100 third-party modules
10. NGINX and NGINX Plus
NGINX Plus adds advanced load-balancing features, ease of management, and commercial support on top of NGINX F/OSS
12. Quick review of load balancing
server {
    listen 80;
    location / {
        proxy_pass http://backend;
    }
}
upstream backend {
    server webserver1:80;
    server webserver2:80;
    server webserver3:80;
    server webserver4:80;
}
13. Three NGINX Techniques for High Availability
1. NGINX: basic error checks
2. NGINX Plus: advanced health checks
3. Live software upgrades
14. 1. Basic Error Checks
• Monitor transactions as they happen
– Retry transactions that ‘fail’ where possible
– Mark failed servers as dead
15. Basic Error Checks
server {
    listen 80;
    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout;  # also: http_500, http_502, http_503, http_504, off
    }
}
upstream backend {
    server webserver1:80 max_fails=1 fail_timeout=10s;
    server webserver2:80 max_fails=1 fail_timeout=10s;
    server webserver3:80 max_fails=1 fail_timeout=10s;
    server webserver4:80 max_fails=1 fail_timeout=10s;
}
16. More sophisticated retries
server {
    listen 80;
    location / {
        # On error/timeout, try the upstream group one more time
        error_page 502 504 = @fallback;
        proxy_pass http://backend;
        proxy_next_upstream off;
    }
    location @fallback {
        proxy_pass http://backend;
        proxy_next_upstream off;
    }
}
17. 2. Advanced Health Checks
• “Synthetic Transactions”
– Probe server health
– Complex, custom tests are possible
– Available in NGINX Plus
18. Advanced Health Checks
server {
    listen 80;
    location / {
        proxy_pass http://backend;
        health_check;
    }
}
upstream backend {
    zone backend 64k;
    server webserver1:80;
    server webserver2:80;
    server webserver3:80;
    server webserver4:80;
}
health_check parameters:
interval = period between checks
fails = failure count before a server is marked dead
passes = pass count before a server is marked alive again
uri = custom URI to probe
Defaults: interval=5s, fails=1, passes=1, uri=/
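As a sketch of these parameters in use (the /healthz URI and the values are illustrative, not the defaults):

```nginx
location / {
    proxy_pass http://backend;
    # Probe a hypothetical /healthz every 10s; 3 consecutive failures mark
    # the server dead, 2 consecutive passes bring it back
    health_check interval=10 fails=3 passes=2 uri=/healthz;
}
```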
19. Advanced usage
server {
    listen 80;
    location / {
        proxy_pass http://backend;
        health_check uri=/test.php match=statusok;
        proxy_set_header Host www.foo.com;
    }
}
match statusok {
    # Used for /test.php health check
    status 200;
    header Content-Type = text/html;
    body ~ "Server[0-9]+ is alive";
}
Health checks inherit all parameters from the enclosing location block. match blocks define the success criteria for a health check.
20. Edge cases – variables in configuration
server {
    location / {
        proxy_pass http://backend;
        health_check;
        proxy_set_header Host $host;
    }
}
This may not work as expected. Remember: the health_check probes run in the context of the enclosing location, and variables such as $host are derived from a client request that a synthetic probe does not have.
21. Edge cases – variables in configuration
server {
    location / {
        proxy_pass http://backend;
        health_check;
        proxy_set_header Host $host;
    }
}
This may not work as expected. Remember: the health_check probes run in the context of the enclosing location.
server {
    location /internal-check {
        internal;
        proxy_pass http://backend;
        health_check;
        proxy_set_header Host www.foo.com;
    }
}
This is the common alternative: use a custom URI for the location, tag the location as internal, and set headers manually. Also useful for authentication.
22. Examples of using health checks
• Verify that pages don’t contain errors
• Run internal tests (e.g. test.php => DB connect)
• Managed removal of servers:
$ touch $DOCROOT/isactive.txt
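The managed-removal idea can be sketched in configuration: point the health check at a file that operators create or delete on each backend. The file name isactive.txt matches the slide; everything else here is illustrative.

```nginx
upstream backend {
    zone backend 64k;
    server webserver1:80;
    server webserver2:80;
}
server {
    location / {
        proxy_pass http://backend;
        # A server receives traffic only while isactive.txt exists in its
        # docroot; delete the file to drain it without touching this config
        health_check uri=/isactive.txt;
    }
}
```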
23. Advantages of ‘Health Checks’
• Run tests asynchronously (find errors faster)
• Custom tests (not related to ‘real’ traffic)
• More flexibility to specify success/error
25. Slow start
• When a failed server recovers (whether detected by basic error checks or advanced health checks), ramp traffic up to it gradually:
upstream backends {
    zone backends 64k;
    server webserver1 slow_start=30s;
}
26. NGINX Plus status monitoring
http://demo.nginx.com/ and http://demo.nginx.com/status
The status dashboard (web) and /status endpoint (JSON) report:
Total data and connections
Current data and connections
Split per ‘server zone’
Cache statistics
Upstream statistics: traffic, health and error status
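The JSON endpoint lends itself to scripted alerting. The exact schema differs between NGINX Plus versions, so this is only a sketch that assumes a simplified shape (an "upstreams" map whose peers carry "server" and "state" fields); adapt the field names to the output of your own /status endpoint.

```python
import json

def down_peers(status_json):
    """Return (upstream, server) pairs for peers whose state is not 'up'.

    Assumes a simplified shape: {"upstreams": {name: {"peers": [...]}}}.
    Real NGINX Plus status output varies by version; adjust field names.
    """
    status = json.loads(status_json)
    failed = []
    for name, upstream in status.get("upstreams", {}).items():
        for peer in upstream.get("peers", []):
            if peer.get("state") != "up":
                failed.append((name, peer.get("server")))
    return failed

# Illustrative sample, not real /status output
sample = json.dumps({
    "upstreams": {
        "backend": {
            "peers": [
                {"server": "webserver1:80", "state": "up"},
                {"server": "webserver2:80", "state": "unhealthy"},
            ]
        }
    }
})
print(down_peers(sample))  # [('backend', 'webserver2:80')]
```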
27. 3. Live software upgrades
• Upgrade your NGINX binary on-the-fly
– No downtime
– No dropped connections
28. No downtime – ever!
• Reload configuration with SIGHUP
# nginx -s reload
• Re-exec binary with copy-and-signal
http://nginx.org/en/docs/control.html#upgrade
(Diagram: one NGINX parent process managing several sets of NGINX workers)
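The copy-and-signal procedure from the page above can be sketched as a signal sequence. The pid-file path is a common default and may differ on your system; run this only against a live master process, after installing the new binary over the old path.

```shell
PID=/var/run/nginx.pid
kill -USR2 "$(cat $PID)"          # start a new master; old pid moves to nginx.pid.oldbin
kill -WINCH "$(cat $PID.oldbin)"  # gracefully shut down the old workers
# verify the new binary is serving traffic, then retire the old master:
kill -QUIT "$(cat $PID.oldbin)"
# to roll back instead: send HUP to the old master and QUIT to the new one
```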
29. In summary...
NGINX F/OSS: basic error checks and retry logic; on-the-fly upgrades
NGINX Plus: advanced health checks with slow start; extended status monitoring
Compared to other load balancers and ADCs, NGINX Plus is uniquely well-suited
to a devops-driven environment.
30. Closing thoughts
• 37% of the busiest websites use NGINX
– In most situations, it’s a drop-in extension
• Check out the blogs on nginx.com
• Future webinars: nginx.com/webinars
Try NGINX F/OSS (nginx.org) or NGINX Plus (nginx.com)
Editor’s notes
Story starts with a single guy, Igor Sysoev
What was originally a tool for managing concurrency has evolved into a web application accelerator
Not because of vision but user driven innovation
http://www.networkworld.com/careers/2004/0105man.html
http://www.evolven.com/blog/downtime-outages-and-failures-understanding-their-true-costs.html
Cost of downtime: reputation, PPC ads, job losses; e.g. £1m/hour of downtime for one UK service
How do we reduce this 80%? We need infrastructure that works with our processes, is tightly integrated with our devops practices, that we can work with rather than battle against.
http://www.evolven.com/blog/downtime-outages-and-failures-understanding-their-true-costs.html
Misconfigurations Have Major Impact on Performance
The IT Process Institute's Visible Ops Handbook reports that "80% of unplanned outages are due to ill-planned changes made by administrators ("operations staff") or developers." (Visible Ops). Getting to the bottom of the matter, the Enterprise Management Association reports that 60% of availability and performance errors are the result of misconfigurations. The little changes that are implemented to the environment and system configuration parameters all the time.
A recent Gartner study projected that "Through 2015, 80% of outages impacting mission-critical services will be caused by people and process issues, and more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues." (Ronni J. Colville and George Spafford Configuration Management for Virtual and Cloud Infrastructures)
Manual configuration errors can cost companies up to $72,000 per hour in Web application downtime. While application maintenance costs are increasing at a rate of 20% annually, 35% of those polled said at least one-quarter of their downtime was caused by configuration errors. (How much will you spend on application downtime this year?)
As we go through this presentation, we’ll highlight some of the new features that are specific to nginx plus
proxy_next_upstream error timeout is default
server max_fails=1 fail_timeout=10s is default
502 = Bad Gateway; 504 = Gateway Timeout
Health-check defaults: 5s interval, 1 fail, 1 pass, URI = /
Remember that a significant proportion of failures occur because of errors in process.
NGINX Plus is more flexible and can be more easily accommodated by your standard devops process.