As a growing company, Wix has tried many monitoring solutions; some worked better than others. In this talk we will go over the lessons we learned at Wix about what to monitor and how to monitor production systems, when to trigger alerts, and when not to.
We will go over some of the tools we use and also some of the tools we built to help us sleep better at night while doing 400 deployments to production every month.
http://www.youtube.com/watch?v=OLPA2KOWJ8I
Lessons Learned Monitoring Production
1. Red Alert Or False Alarm
Monitoring Production Systems
Aviran Mordo
Head Of Back-End Engineering @ Wix
@aviranm
http://www.linkedin.com/in/aviran
http://www.aviransplace.com
01:21
3. Wix in Numbers
• 40,000,000 users
– Adding over 1,000,000 new users each month
• Static storage is over 200TB of data
– Adding over 1TB of files every day
• 3 Data centers + 2 Clouds (Google AE, Amazon)
– Around 300 servers
• 400 Deployments a month (Continuous Delivery)
• Over 100,000,000 Server API calls per day
• Over 450 people work at Wix
– ~ 150 people in R&D
8. Pingdom
Cons
• No early warning (alerts only when the site is down)
• Does not tell you what the problem is
• Does not monitor APIs
Pros
• 24 / 7 uptime monitoring
• Different geo locations
9. Keynote
Cons
• Flows must be recorded manually
• Does not monitor internal servers
Pros
• Transaction monitoring from a real user's perspective
• Supports Flash
• Different geo locations
10. Monitor Hardware and OS
Cons
• Monitors at the OS level, not the application level*
• Does not know when there is a
problem with the application (the
Pros
• Monitor machine health
• Built-in integration with Graphite
• Custom checks
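A custom check can be sketched as follows. This assumes the tool on this slide follows the Nagios plugin convention (Nagios is named later in the deck), where a check prints one status line and exits with 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN. The function names and the 80% / 90% thresholds are hypothetical, not taken from the talk.

```python
import shutil

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(value, warn, crit):
    """Map a measured value to a status code (higher values are worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def disk_used_percent(path="/"):
    """Return the percentage of disk space used on the given path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def main(path="/", warn=80.0, crit=90.0):
    """Print a one-line status and return the plugin exit code."""
    try:
        used = disk_used_percent(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    status = classify(used, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    print(f"DISK {label} - {used:.1f}% used on {path}")
    return status
```

A wrapper script would end with `sys.exit(main())` so the monitoring system can read the exit code.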
12. Server Logs
Cons
• Too much information
• Hard to read, not friendly to developers
• Pinpointing the problem takes a long time
• Server cluster need log
Pros
• Verbose and flexible
13. Log collections
• Client & server logs are collected with Flume and Syslog-ng
• Storm + Esper analyze log events and feed Graphite
• Logs are stored in Hadoop + HBase for in-depth analysis
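The "analyze log events and feed Graphite" step can be illustrated with a small stand-in. This is plain Python, not Storm or Esper, and it only shows the idea of turning raw log events into per-minute counters. The event tuple layout, the `logs` prefix, and the service name used in the usage note are assumptions for illustration.

```python
from collections import Counter


def minute_bucket(epoch_seconds):
    """Round a timestamp down to the start of its minute."""
    seconds = int(epoch_seconds)
    return seconds - seconds % 60


def count_events(events):
    """Count events per (minute, service, level).

    events is an iterable of (epoch_seconds, service, level) tuples.
    """
    counts = Counter()
    for ts, service, level in events:
        counts[(minute_bucket(ts), service, level)] += 1
    return counts


def to_metrics(counts, prefix="logs"):
    """Turn the counters into sorted (metric_path, value, timestamp) tuples."""
    return sorted(
        (f"{prefix}.{service}.{level.lower()}.count", n, minute)
        for (minute, service, level), n in counts.items()
    )
```

For example, two `ERROR` events from a hypothetical `editor` service within the same minute become a single `logs.editor.error.count` data point with value 2.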
17. App-Info Monitoring
Cons
• Coarse-grained information for an overview
• Too much information
Pros
• Detailed and easy view of a server
• Almost no need to look at logs
18. Graphite
• All systems feed Graphite with metrics (Nagios, App-info, Storm)
• Nagios queries Graphite and triggers alerts
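Feeding Graphite can be as simple as writing text lines to Carbon, its ingestion daemon. The sketch below uses Carbon's plaintext protocol, one `<path> <value> <timestamp>` line per data point, with 2003 as the default plaintext port. The host name and metric path are placeholders, and the talk does not say which client Wix used.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Carbon's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="graphite.example.com", port=2003, timeout=5.0):
    """Send already formatted lines to a Carbon plaintext listener.

    The host is a placeholder; point it at a real Carbon instance.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

A caller would build lines with `format_metric("some.metric.path", 42)` and pass them to `send_metrics` in batches.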
19. Graphite
Cons
• Not a dashboard (you can build a dashboard on top of it)
• Design data schema (hierarchy) in
Pros
• Numerous formulas available
• Share graphs
• Easy to create new graphs
21. New Relic
Pros
• Easy to use – developer friendly
• Service level overview (both
cluster and single server)
• Customizable dashboards
• JVM profiler on production
• Code instrumentation
• Real User Monitoring
22. New Relic
Cons
• No distributed transaction trace
for specific server
• No exception classification
• A lot of false alarms due to
misbehaving bots
• False alarms for low throughput
services
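One common way to reduce false alarms on low-throughput services is to require a minimum number of requests before evaluating an error rate. The sketch below shows that guard. It is not a New Relic feature or the approach described in the talk, and both thresholds are hypothetical.

```python
def should_alert(errors, requests, max_error_rate=0.05, min_requests=100):
    """Decide whether an error-rate alert should fire for one time window.

    Windows with fewer than min_requests requests are ignored, because a
    single failure in a tiny sample would look like a huge error rate.
    """
    if requests < min_requests:
        return False
    return errors / requests > max_error_rate
```

With these defaults, 1 error out of 2 requests stays silent, while 10 errors out of 100 requests triggers an alert.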