New Relic provides application performance monitoring that enables a company to quickly detect and resolve issues with their website. This is demonstrated through two case studies:
In the first case, New Relic detected JavaScript errors on the live site caused by some corporate networks blocking content from CDNs. This allowed the issue to be identified and resolved without any user reports.
In the second case, New Relic provided advance warning of high server load before a crash occurred. This allowed the DevOps team to begin resolving the issue before any downtime. When the server did crash, New Relic instantly alerted support. The server was rebooted and service was restored after 29 minutes of downtime, roughly the time it took to arrange meetings about the downtime.
7. Scenario:
● Nobody has reported an issue (yet).
● We didn’t pick up on any issues in testing.
● Critical issue - the website won’t work without it.
Recent changes:
● Loading jQuery from a Content Delivery Network.
● Feature-detect based embedding of jQuery.
8. Data inspection time...
It happens predominantly in Internet Explorer,
but this time the browser version is not to blame.
11. ● Some Corporate (or Educational Institution) networks will be ‘protected’ by
disallowing external resources from Content Delivery Networks.
● This will result in 404 Errors when requesting files, which would explain the
errors we see in New Relic.
● Our solution: put a fallback version of the file locally to continue supporting
these customers.
Result!
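The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code used on the site: the local path, the `ensureJQuery` name and the injected `appendScript` callback are all assumptions made for the example. In a browser, the callback would create a `<script>` element and append it to the page.

```typescript
// Hypothetical local path; the slides do not say where the fallback copy lives.
const LOCAL_JQUERY_SRC = "/js/vendor/jquery.min.js";

// Callback that adds a <script src="..."> to the page (injected so the
// sketch can run outside a browser).
type AppendScript = (src: string) => void;

/**
 * Call this after the CDN <script> tag has had its chance to run.
 * If the CDN request was blocked, the jQuery global is missing and the
 * local copy is requested instead. Returns true when the fallback was used.
 */
function ensureJQuery(
  globalScope: { jQuery?: unknown },
  appendScript: AppendScript,
  localSrc: string = LOCAL_JQUERY_SRC,
): boolean {
  if (typeof globalScope.jQuery !== "undefined") {
    return false; // CDN copy loaded fine, nothing to do
  }
  appendScript(localSrc);
  return true;
}

// Simulated run: the CDN was blocked, so the global scope has no jQuery.
const requestedScripts: string[] = [];
const usedFallback = ensureJQuery({}, (src) => requestedScripts.push(src));
```

A script appended this way loads asynchronously, so any code that depends on jQuery would need to wait for it to finish loading.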
12. Most importantly, we’ve identified issues that real users are
experiencing, debugged and resolved them without the customer or
the client having to report any issues or provide any details.
Which also means we fixed the issue in a fraction of the time!
15. End Users
Technical Contact
Product Owner
DevOps
Development Team
Customer Support
At the time it all kicks off… (11am)
Product Owner & Technical Contact AWOL
(everyone’s allowed a lunch break)
16. 11:00am
Warning: High load on server
DevOps team gets advance
warning of high load on the server
and begins investigating.
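The difference between an advance warning and a full alert can be illustrated with a simple two-level threshold check. This is a generic sketch only: the threshold values, the type names and the `classifyLoad` function are assumptions for illustration and do not reflect New Relic's actual configuration or API.

```typescript
type LoadStatus = "ok" | "warning" | "critical";

// Hypothetical thresholds, expressed as a fraction of server capacity.
interface LoadThresholds {
  warning: number;
  critical: number;
}

// Map a single load sample to a status, checking the most severe level first.
function classifyLoad(load: number, limits: LoadThresholds): LoadStatus {
  if (load >= limits.critical) {
    return "critical";
  }
  if (load >= limits.warning) {
    return "warning";
  }
  return "ok";
}

// Made-up samples: load climbs past the warning level before it becomes critical.
const limits: LoadThresholds = { warning: 0.7, critical: 0.9 };
const loadSamples = [0.4, 0.75, 0.95];
const loadStatuses = loadSamples.map((sample) => classifyLoad(sample, limits));
// loadStatuses is ["ok", "warning", "critical"]
```

The warning level is what buys the team time: it fires while the server is still up.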
17. 11:31am
Alert: Server unavailable
Alert: Downtime
Server Crash!
Email alerts for everyone!
Customer Support team knows
of the downtime instantly.
500 Error
18. Fortunately, DevOps have been aware of the issue
for 30 minutes already and have been attempting to fix it.
When the server finally dies, they talk to the Development Team
about the possibility of simply rebooting the server.
Solution & implications agreed: Server is rebooted.
19. 12:00pm
Rebooted server comes online
and restores service.
20. Server Crash: Fallout
● First instance of downtime since we started using New Relic.
● Server ‘officially’ down for 29 minutes.
(Customer Support were only aware of the downtime for the final 11 minutes)
● Enhanced visibility of server health meant remediation steps were
underway before the downtime started.
● Downtime issue was resolved in the same time it took for meetings
about the downtime to be arranged.
● Once normal operation is resumed, we can use New Relic data &
server logs to perform a ‘post mortem’ on the incident.
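The timeline figures can be checked with a little clock arithmetic. The sketch below uses the three timestamps from the slides (warning at 11:00, crash at 11:31, service restored at 12:00); the helper names are made up for the example.

```typescript
// Convert a 24-hour "HH:MM" string to minutes since midnight.
function toMinutes(clock: string): number {
  const parts = clock.split(":");
  return Number(parts[0]) * 60 + Number(parts[1]);
}

// Minutes elapsed between two times on the same day.
function minutesBetween(start: string, end: string): number {
  return toMinutes(end) - toMinutes(start);
}

const warningLead = minutesBetween("11:00", "11:31"); // 31 minutes of advance warning
const downtime = minutesBetween("11:31", "12:00"); // 29 minutes of official downtime
```

The 31 minutes of lead time corresponds to the "30 minutes already" mentioned earlier, and the 29 minutes matches the official downtime figure.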
21. “Real user data is much, much
better than artificially-created
lab results”