Teach your sites to call for help
Automate problem reporting for online services
Dan Poirier
Caktus Consulting Group
https://caktusgroup.com
Since 1992...
1992-2011: IBM
2011-present: Caktus
I started working in this business in 1992 at IBM, and moved to Caktus in 2011. Things I've worked on have almost always affected real
customers and included having to diagnose and fix problems they run into.
Speaker notes
Libyan Voter Registration
I'd like to use a concrete example for this talk. I always find it easier to understand things from concrete examples, in addition to the abstract
theory.
One of Caktus's projects that provided a lot of experience relevant to this talk was helping to build a voter registration system for Libya.
After the fall of Gaddafi, there was an urgent desire to bootstrap a democracy. One part of that was registering citizens to vote. Caktus
helped develop a text-message-based voter registration system, and later a web site for checking registrations and administering the
system. Over 1.5 million citizens used the system to register for two elections a few years ago.
Registration was opened again just this past December, and a million more people had registered as of the beginning of February.
Keep it running
If a web site or online service is worth setting up, it's worth keeping it running.
Mostly they just keep going and don't need much work.
So if 2 months after you set up a site, it goes down, you might not notice right away. Your site users might not notify you - and if your site is
completely down, they might not be able to. And more likely, they'll just go to another site.
We can keep checking the site to see if it's all working - but that's boring, and we're likely to get tired of it and do a poor job or stop doing it
completely. This is the kind of thing computers are good at.
So, let's use computers to let us know when something's wrong.
That's all pretty obvious, of course. As they say, the devil's in the details.
Is my site down?
Let's start with what's really the simplest - getting notified when your site goes completely down. In that case, we really can't rely on anything
in our site code, or even on our site's servers, to notify us - they're not working right. So we need an external service.
Like most of the tools I'm going to talk about today, there are multiple options, and I make no pretense of having done a comprehensive
survey. I'll talk about tools that have worked for us, but I make no claim that there aren't better alternatives.
What I'd like you to take away today are the kinds of things you should consider and watch out for when deciding what to use for your own
site.
For getting notified when your site goes down, the most well-known service is pingdom.com. Other choices include New Relic and the
newer statuscake.com.
Site down
Any of these can notify you in various ways when your site goes down. Both pricing and other services offered vary. Your decision here is
likely to depend somewhat on your choices in other areas of monitoring your site.
For example, New Relic can check your site's reachability from places all around the world. If for some reason your site is working fine from
North America, but not from Europe, they can let you know. New Relic is a more expensive option, but they offer a *lot* of other services.
Just saying...
And StatusCake does the same thing.
Considerations for web page monitoring
Test non-trivial pages:
behind login
database access required
Don't get flooded with alerts
A home page is the obvious page to point a "site down" checker at, but it's often your most trivial page. It might work fine even when things
are so broken no other pages work. Or maybe people can't log in, or only the pages behind logins are broken.
You might want to set up a special page just for the site monitor to check. It might do some database access or whatever would give you
confidence that all the important parts of your site are healthy.
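As a rough sketch of such a monitor-only page, the function below runs a trivial query against the database and returns a status code the checker can key on. It uses the standard library's sqlite3 only so that it runs anywhere; the name `health_check`, the `connect` parameter, and the 200/503 convention are illustrative assumptions, not something a particular monitoring service requires. In a Django site the same idea would live in a small view.

```python
import sqlite3


def health_check(connect):
    """Return (status_code, body) after exercising the database.

    `connect` is a hypothetical zero-argument callable that opens a
    database connection; swap in whatever your site really depends on.
    """
    try:
        conn = connect()
        try:
            # A trivial query proves the database is reachable and answering.
            cursor = conn.cursor()
            cursor.execute("SELECT 1")
            row = cursor.fetchone()
        finally:
            conn.close()
    except Exception as exc:  # any failure at all means "unhealthy"
        return 503, "database check failed: %s" % exc
    if row is None or row[0] != 1:
        return 503, "database returned an unexpected result"
    return 200, "ok"


if __name__ == "__main__":
    print(health_check(lambda: sqlite3.connect(":memory:")))
```

Pointing the external checker at a page like this, rather than at the home page, means a dead database shows up as a failed check.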
You don't want repeated notices every time the service sees a failure to load a page. I know someone with a site that has a small bug that
only shows up for one hour a year - the hour when daylight saving time causes the same hour to happen twice. And for every second of
that hour, he gets an alert that his site is broken. Then it stops for another year... so he puts off fixing it... and then another year has gone by
and it happens again :-)
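If you end up writing your own checker, one simple way to avoid that flood is a cooldown per check. The sketch below is only an illustration of the idea; the class name `AlertThrottle` and its parameters are made up for this example, and hosted monitoring services have their own settings for this.

```python
import time


class AlertThrottle:
    """Suppress repeat alerts for the same check within a cooldown window."""

    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock        # injectable so the logic can be tested
        self._last_sent = {}      # check name -> time the last alert went out

    def should_alert(self, check_name):
        now = self.clock()
        last = self._last_sent.get(check_name)
        if last is not None and now - last < self.cooldown_seconds:
            return False          # already alerted recently; stay quiet
        self._last_sent[check_name] = now
        return True
```

With a one-hour cooldown, the once-a-year bug above would produce one alert instead of thousands.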
Site outage in Libya
Tell story of site going down a couple of weeks ago, project lead getting notified and working with sys admin on Slack to diagnose. Turned
out to be the ISP having blackholed our IP address after a DoS attack.
Luckily, the web site is not the most important part of the Libya project and the text messages continued being processed without
interruption.
Performance
What if your site is responding, but abnormally slowly? You'd like to know that, but a simple check that it's reachable won't tell you. Most of
the same services that can let you know your site is down can also tell you if it's working poorly. StatusCake can let you know that. New
Relic can not only let you know that, but show you exactly where in your request processing the slowdown is.
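To make the difference between "reachable" and "fast enough" concrete, here is a minimal sketch that times one request and classifies the result. The function names and the two-second threshold are assumptions chosen for illustration; the hosted services do this from many locations and keep history, which this does not.

```python
import time
import urllib.error
import urllib.request


def timed_fetch(url, timeout=10.0):
    """Fetch url and return (HTTP status, elapsed seconds).

    Connection failures and timeouts are left to propagate to the caller.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # urlopen raises on 4xx/5xx; keep the status code
    return status, time.monotonic() - start


def classify(status, elapsed, slow_after=2.0):
    """Turn one measurement into 'error', 'slow', or 'ok'."""
    if status >= 400:
        return "error"
    if elapsed > slow_after:
        return "slow"
    return "ok"
```

A checker would call `classify(*timed_fetch(url))` on a schedule and alert on anything other than "ok".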
Memory, CPU, Disk
Another thing to think about is whether you can get advance notice that trouble is coming. It's a good idea to set up warnings for things like
disk space and CPU usage going over some threshold. Then you can take action - enlarge your disk, add servers, etc - before it starts
causing problems for your users. New Relic and StatusCake can both do this. Here's a screenshot from StatusCake.
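The threshold idea itself is simple enough to sketch with the standard library. The function names and the 90% default below are assumptions for illustration, and a monitoring service will do more than this, such as tracking trends over time.

```python
import shutil


def over_threshold(used, total, threshold):
    """True when the used fraction has reached the warning threshold."""
    if total <= 0:
        return False
    return used / total >= threshold


def disk_warning(path="/", threshold=0.90):
    """Return a warning string if the disk holding `path` is too full, else None."""
    usage = shutil.disk_usage(path)
    if over_threshold(usage.used, usage.total, threshold):
        percent = 100.0 * usage.used / usage.total
        return "disk at %s is %.0f%% full" % (path, percent)
    return None
```

Run from cron, a non-None result would be the cue to send a warning while there is still time to act.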
Domain expiration
And then there's the infamous problem of having your domain expire without noticing. Microsoft let passport.com expire in 1999, causing a
total outage for Hotmail. You don't want to make the news for something like that. You guessed it, there are tools to warn you about that.
Certificate expiration
And the same thing for SSL certificates.
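If you want to see what such a tool is checking, Python's standard library can read a server's certificate and work out how long it has left. This is a sketch under stated assumptions: the function names are made up for this example, and the 'notAfter' date format is the one Python's ssl module reports.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=10.0):
    """Return the certificate's 'notAfter' string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Whole days from `now` (seconds since the epoch) until expiry."""
    if now is None:
        now = time.time()
    expires = ssl.cert_time_to_seconds(not_after)
    return int((expires - now) // 86400)
```

Warning when `days_until_expiry(fetch_not_after(host))` drops below something like 30 leaves time to renew. A certificate that has already expired fails the TLS handshake in `fetch_not_after` rather than returning a date, which is itself a signal worth alerting on.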
Gather your logs
When you're trying to figure out a problem, sooner or later you'll end up looking at logs.
By default, every application that helps keep your site or service running will put its log messages somewhere different, and you'll have to
log in to every system individually to look at them. This is a nuisance even for a small, single-server site, and nearly unworkable as your site
grows beyond that.
You might also not want your developers even to be able to log into your production servers, depending on your service's security
requirements.
The solution is to gather your logs in a central place.
You can actually do this without any additional software. Unix logging systems have had the capability for decades to forward messages to
other servers and to accept incoming ones and store them in a single place, or route them based on various criteria.
In these days of the cloud, there are lots of online services where you can forward your log messages, and they'll store them, make them
easily searchable, and let you set up patterns that'll trigger notifications. Some can even analyze the messages and extract data about your
site.
We've used Papertrail, a paid cloud service with a free tier if you don't have a large volume of log messages. ELK can also do this, which I'll
talk about later.
View logs from any server
Save searches
View messages
Be alerted when searches match
Details for central logging
1. Security: logs sent to an append-only service cannot be modified by an attacker to hide their trail.
2. Security: if the logs are accessible elsewhere, you might not even need to give your developers login access to the servers.
3. There are different kinds of logs, so consider treating them individually. For example, some might warrant long storage, others fairly short.
Logging on the Libya project
We have over a dozen servers running haproxy, vumi,
postgresql, nginx, Django, and other services that all have their
own logs. It wouldn't be feasible to check those logs directly on
every server.
Configuring log message forwarding
Python logging configuration
{
    ...
    'handlers': {
        'SysLog': {
            'level': 'DEBUG',
            'class': 'logging.handlers.SysLo
            'formatter': 'simple',
            'address': ('logsN.papertrailapp
        },
        ...
    }
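The slide's configuration is cut off at the right edge, so here is a complete sketch along the same lines using the standard library's `logging.handlers.SysLogHandler`. The host and port passed in are placeholders, not the values from the slide; use whatever your log service assigns. The `simple` format string and the `build_logging_config` helper are likewise assumptions for this example.

```python
import logging
import logging.config


def build_logging_config(host, port):
    """Return a dictConfig dict that forwards records to a remote syslog endpoint."""
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "simple": {"format": "%(asctime)s %(name)s %(levelname)s %(message)s"},
        },
        "handlers": {
            "SysLog": {
                "level": "DEBUG",
                "class": "logging.handlers.SysLogHandler",
                "formatter": "simple",
                "address": (host, port),  # sent over UDP by default
            },
        },
        "root": {"handlers": ["SysLog"], "level": "INFO"},
    }


if __name__ == "__main__":
    # Placeholder endpoint for a local trial run.
    logging.config.dictConfig(build_logging_config("127.0.0.1", 514))
    logging.getLogger(__name__).info("central logging configured")
```

In a Django project, the same dictionary would be assigned to the `LOGGING` setting.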
The dog in the nighttime
In the Sherlock Holmes story "Silver Blaze", Holmes is the only one who notices what *didn't* happen but should have -- the dog didn't bark
in the nighttime.
The lesson for us is that we tend not to notice when things that are supposed to happen, don't. Like our automated backups. Or sending out
monthly invoices. Often important stuff.
This is a really good thing to assign a computer to do for us.
There are numerous online services that will help you with this. The one I've used is healthchecks.io, which is free for everything I've needed
to do with it.
You tell it you want to monitor some periodic event, and how often it's supposed to happen. It assigns a unique URL, and you arrange to
ping that URL when the event completes successfully. For example, you might add "&& wget <URL>" to a cron command.
If healthchecks.io *doesn't* hear from your task when it should, it notifies you.
Healthchecks.io and remote logging
Example crontab:
# crontab
@daily /usr/local/bin/daily | logger -n log.example.com -p XXX && curl https://hchk.io/8fae4...
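We run a daily command with cron.
Adding "&& curl <URL>" pings that URL only if what precedes it succeeds.
We can also pipe the output to "logger", a Unix utility that sends its input to syslog. In this case we're sending it to our remote log server.
One caveat: with the pipe in place, the shell judges success by the exit status of the last command in the pipeline, which here is logger rather than the daily job.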
healthchecks.io from Python
import requests
requests.get("https://hchk.io/327b...")
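In practice you want the ping to happen only when the job succeeds, and you don't want a monitoring hiccup to fail the job. A minimal sketch of that wrapper follows, using only the standard library; `run_and_ping` and its parameters are names made up for this example, and the ping URL is whatever the service assigned to your check.

```python
import urllib.request


def ping(url, timeout=10.0):
    """Send the success signal with a plain GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_and_ping(task, ping_url, send_ping=ping):
    """Run task(); ping only if it finishes without raising."""
    result = task()  # an exception here propagates and skips the ping
    try:
        send_ping(ping_url)
    except OSError:
        # Network trouble reaching the monitor should not fail a job
        # that succeeded; the missing ping will trigger its own alert.
        pass
    return result
```

Because a failed task never reaches the ping, the service's silence is the alarm, which is the "dog in the nighttime" idea.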
Recap
Site down
Site slow
Possible upcoming problems:
1. Resources: Memory, CPU, Disk, ...
2. Expirations: Domains, SSL certificates
Collect logs
Monitor periodic tasks
So far we've been talking about things that can affect our entire site or service, and how to get advance notice of upcoming problems, and
urgent notice when problems actually occur. We've gone beyond just seeing if the home page works, to checking the site from around the
world, watching things like disk space, cpu, and memory, and even getting reminded when a domain or certificate needs to be renewed.
Errors on your site
But there are also problems that only affect some of your users, or only when users do certain things. Again, some users might let you know,
but most won't. You can't rely on it.
At Caktus, we use the Django framework to build our web sites using Python. Django has a neat feature where it can email the site admins
whenever an uncaught exception happens during request processing. That won't tell you if a user is getting shown a wrong answer, but it
will let you know if a user is getting an error page. The email message includes the exception, the stack trace, and details of the request and
the site settings.
Django's error emails
From: django@yoursite.com
To: your_inbox
Subject: Someone hit an error
Request URL & headers
Response status
Stack trace
Django settings
This is great during testing, and when you're just getting your site going. It not only notifies you, it gathers a lot of information that is often all
you need to figure out exactly what happened.
If you're using Django, I recommend enabling this during your site development. When your testers hit problems, it'll gather more information
than your tester could have gotten for you at the time.
I imagine there are ways to do something similar when errors occur if you're not using Django.
Once your site starts getting any significant traffic, though, you should turn this off and replace it with something better.
A flood of error emails
Imagine this scenario. Your site is getting hundreds or thousands of requests per hour. That's per *hour*, not minute or second; it's not really
all that much traffic. Overnight, something happens that starts making every request fail. Maybe your database goes down, or somebody
changes the password on the search backend - whatever. And every time a request fails, the site sends you another email message. Picture
your inbox when you come in the next morning. It won't be pretty - thousands or tens of thousands of these error messages. And as you try
to fix the problem, they'll keep pouring in. And after you fix the problem, more will keep coming, and you'll notice they are dated hours ago,
and that the flood of email was so great that a lot of it is backed up somewhere in the internet and is going to keep getting delivered for hours
and hours more. Your SMTP provider (handling the outgoing mail from your servers) might not be too happy with you either. As you might
gather, I've been there. And when it happens, I kick myself because I should have set up Sentry earlier.
Sentry
Sentry is an open source service to sit between your site's error reports and you. You can subscribe to a hosted service, or set up your own
server. Then you configure your site to notify Sentry about problems instead of sending emails to you.
Sentry is smart. It not only gathers *more* information about the problem than Django includes in its emails, it also recognizes when the
same problem happens over and over, and only notifies you *once*.
It keeps track, of course. You can go to the Sentry web interface for that problem and see how many times people have hit that problem,
how long it's been happening, and whether it's getting more frequent. Then you can drill down into one of the occurrences to see the
exception, the stack trace, the request data, plus local variables and function arguments.
Once you think you've fixed a problem, you can mark it _resolved_ in Sentry. Once you do that, if it ever happens again, Sentry will send
you a fresh notice, and also let you know this problem has happened before and so this is a possible regression.
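To illustrate the "notify once, count the rest" idea, here is a toy sketch that groups exceptions by a fingerprint of their type and traceback locations. This is not how Sentry itself groups events; the `ErrorGrouper` class and its `notify` callback are assumptions invented for this example.

```python
import hashlib
import traceback


class ErrorGrouper:
    """Notify on the first occurrence of a problem and count the repeats."""

    def __init__(self, notify):
        self.notify = notify  # hypothetical callable, e.g. sends one email
        self.counts = {}      # fingerprint -> number of occurrences

    @staticmethod
    def fingerprint(exc):
        """Hash the exception type plus the code locations in its traceback."""
        frames = traceback.extract_tb(exc.__traceback__)
        parts = [type(exc).__name__]
        parts += ["%s:%s:%s" % (f.filename, f.name, f.lineno) for f in frames]
        return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

    def report(self, exc):
        key = self.fingerprint(exc)
        self.counts[key] = self.counts.get(key, 0) + 1
        if self.counts[key] == 1:
            self.notify(exc)  # first sighting only
        return key
```

A thousand failures from the same line of code then produce one notification and a count of 1000, instead of a thousand emails.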
Configure Sentry
# Python package
$ pip install raven --upgrade
# Django config
INSTALLED_APPS = (
    ...
    'raven.contrib.django.raven_compat',
)
RAVEN_CONFIG = {
    'dsn': 'https://<key>:<secret>@sentry.io
}
Sentry in Libya
We have our own Sentry server in the datacenter in Libya that
handles all the errors from Django and Vumi, and notifies us of
problems while gathering diagnostic information.
New Relic
Now I'm going to say a little more about New Relic, which I've already mentioned. New Relic is the gold standard of monitoring tools. A
surprising number of well-known sites use it, because it's worth it - it's expensive, but it does almost everything I've mentioned here, and
does it as well as or better than any other tool I know of.
It can check that your site is up from places around the world.
It can notify you when there's a performance or resource problem.
It's so well integrated into your application that if, for example, a request is slow, New Relic can show you which part of the code processing
that request is taking most of the time. It's incredibly helpful to know whether a query is slow, or you're doing a query more times than you
need to, or maybe rendering some templates is taking a long time, or some 3rd-party service is responding slowly.
There's a free tier that provides enough of the New Relic service to get a feel for it.
Configuring New Relic to monitor your Python-based server
# Install
$ pip install newrelic
# Create config file
$ newrelic-admin generate-config YOUR_LICENS
    <path>/newrelic.ini
# Run your server
$ NEW_RELIC_CONFIG_FILE=<path>/newrelic.ini \
    newrelic-admin run-program \
    COMMAND_THAT_STARTS_YOUR_SERVER
ELK: Elasticsearch, Logstash, Kibana
I'm going to finish up by talking about another set of open source tools that we've been trying out lately that shows some real promise. It's
the combination of Elasticsearch, Logstash, and Kibana, or more briefly, "ELK".
These tools have been around for a while, but putting them together was left as an exercise. Now the developers are providing a container
with the whole set configured to work together.
Logstash collects and processes information gathered from your servers, both log messages and statistics, and Elasticsearch stores and indexes it.
Then Kibana provides a very flexible front end to analyze and visualize all that data, querying Elasticsearch to process it efficiently.
There are agents to install on your servers that forward log messages or gather metrics.
So ELK can replace several other services that run on your servers.
ELK monitors resources and gathers logs
Like Papertrail, ELK can collect your logs, search them, etc. Like StatusCake or New Relic, ELK can monitor memory, disk, etc.
ELK gathers data
Send any numbers you want to ELK
Then you can list them, graph them, set up alerts from them,
compare them to other metrics, etc.
ELK and Libya project
Performance
Resources: CPU, Memory, Disk
Kibana complexity
Where ELK still needs some work: it's extremely powerful, but unfortunately even the simple things require a pretty detailed understanding
of how it all works.
Recap
Get notified when users hit errors
Don't get flooded - use Sentry
For more in-depth data, New Relic
Build it yourself using ELK
Image sources
Libyan Voter Registration: https://www.caktusgroup.com/case-study/worlds-first-sms-voter-registration-system/
Track maintenance vehicle: https://commons.wikimedia.org/wiki/File:UP_track_maintenance_ve
Logs: https://commons.wikimedia.org/wiki/File:Logs-Port-of-Burnie-20160208-004.jpg
Dog in the nighttime: https://pixabay.com/en/dog-howl-moon-tree-sky-star-647533/
Great Wave: https://en.wikipedia.org/wiki/File:The_Great_Wave_off_Kanagawa.j
Questions?

Weitere ähnliche Inhalte

Ähnlich wie Teach Your Sites to Call for Help: Automated Problem Reporting for Online Services

Footprinting-and-the-basics-of-hacking
Footprinting-and-the-basics-of-hackingFootprinting-and-the-basics-of-hacking
Footprinting-and-the-basics-of-hacking
Sathishkumar A
 
What Are We Still Doing Wrong
What Are We Still Doing WrongWhat Are We Still Doing Wrong
What Are We Still Doing Wrong
afa reg
 
ASAE Tech: Data Data Everywhere
ASAE Tech: Data Data EverywhereASAE Tech: Data Data Everywhere
ASAE Tech: Data Data Everywhere
mjgoldsmith
 
Security panel-western-mass-drupal-camp
Security panel-western-mass-drupal-campSecurity panel-western-mass-drupal-camp
Security panel-western-mass-drupal-camp
cwworks
 

Ähnlich wie Teach Your Sites to Call for Help: Automated Problem Reporting for Online Services (20)

Footprinting-and-the-basics-of-hacking
Footprinting-and-the-basics-of-hackingFootprinting-and-the-basics-of-hacking
Footprinting-and-the-basics-of-hacking
 
Join 2017_Deep Dive_Workflows with Zapier
Join 2017_Deep Dive_Workflows with ZapierJoin 2017_Deep Dive_Workflows with Zapier
Join 2017_Deep Dive_Workflows with Zapier
 
Web Server Application Logs LTEC2013
Web Server Application Logs LTEC2013Web Server Application Logs LTEC2013
Web Server Application Logs LTEC2013
 
Keynote3
Keynote3Keynote3
Keynote3
 
SEO for Large Websites
SEO for Large WebsitesSEO for Large Websites
SEO for Large Websites
 
Chaos Engineering Without Observability ... Is Just Chaos
Chaos Engineering Without Observability ... Is Just ChaosChaos Engineering Without Observability ... Is Just Chaos
Chaos Engineering Without Observability ... Is Just Chaos
 
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSkynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
 
Using wikto
Using wiktoUsing wikto
Using wikto
 
From 🤦 to 🐿️
From 🤦 to 🐿️From 🤦 to 🐿️
From 🤦 to 🐿️
 
Developing a Globally Distributed Purging System
Developing a Globally Distributed Purging SystemDeveloping a Globally Distributed Purging System
Developing a Globally Distributed Purging System
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
What Are We Still Doing Wrong
What Are We Still Doing WrongWhat Are We Still Doing Wrong
What Are We Still Doing Wrong
 
Windows logging workshop - BSides Austin 2014
Windows logging workshop - BSides Austin 2014Windows logging workshop - BSides Austin 2014
Windows logging workshop - BSides Austin 2014
 
CHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
CHAPTER 3 BASIC DYNAMIC ANALYSIS.pptCHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
CHAPTER 3 BASIC DYNAMIC ANALYSIS.ppt
 
ASAE Tech: Data Data Everywhere
ASAE Tech: Data Data EverywhereASAE Tech: Data Data Everywhere
ASAE Tech: Data Data Everywhere
 
Security panel-western-mass-drupal-camp
Security panel-western-mass-drupal-campSecurity panel-western-mass-drupal-camp
Security panel-western-mass-drupal-camp
 
Web tips
Web tipsWeb tips
Web tips
 
SELJE - VFP and IT Security.pdf
SELJE - VFP and IT Security.pdfSELJE - VFP and IT Security.pdf
SELJE - VFP and IT Security.pdf
 
Production debugging web applications
Production debugging web applicationsProduction debugging web applications
Production debugging web applications
 
Web Page Speed - A Most Important Feature
Web Page Speed - A Most Important FeatureWeb Page Speed - A Most Important Feature
Web Page Speed - A Most Important Feature
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Kürzlich hochgeladen (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 

Teach Your Sites to Call for Help: Automated Problem Reporting for Online Services

  • 1. Teach your sites to call for helpTeach your sites to call for help Automate problem reporting for online servicesAutomate problem reporting for online services Dan Poirier Caktus Consulting Group https://caktusgroup.com
  • 2. Since 1992...Since 1992... 1992-2011: IBM 2011-present: Caktus
  • 3. I started working in this business in 1992 at IBM, and moved to Caktus in 2011. Things I've worked on have almost always affected real customers and included having to diagnose and fix problems they run into. Speaker notes
  • 4. Libyan Voter RegistrationLibyan Voter Registration
  • 5. I'd like to use a concrete example for this talk. I always find it easier to understand things from concrete examples, in addition to the abstract theory. One of Caktus's projects that provided a lot of experience relevant to this talk was helping to build a voter registration system for Libya. After the fall of Khadaffi, there was an urgent desire to bootstrap a democracy. One part of that was registering citizens to vote. Caktus helped develop a text-message-based voter registration system, and later a web site for checking registrations and to administer the system. Over 1.5 million citizens used the system to register for two elections a few years ago. Registration was opened again just this past December, and a million more people had registered as of the beginning of February. Speaker notes
  • 6. Keep it runningKeep it running
  • 7. If a web site or online service is worth setting up, it's worth keeping it running. Mostly they just keep going and don't need much work. So if 2 months after you set up a site, it goes down, you might not notice right away. Your site users might not notify you - and if your site is completely down, they might not be able to. And more likely, they'll just go to another site. We can keep checking the site to see if it's all working - but that's boring, and we're likely to get tired of it and do a poor job or stop doing it completely. This is the kind of thing computers are good at. So, let's use computers to let us know when something's wrong. That's all pretty obvious, of course. As they say, the devil's in the details. Speaker notes
  • 8. Is my site down?Is my site down?
  • 9. Let's start with what's really the simplest - getting notified when your site goes completely down. In that case, we really can't rely on anything in our site code, or even on our site's servers, to notify us - they're not working right. So we need an external service. Like most of the tools I'm going to talk about today, there are multiple options, and I make no pretense of having done a comprehensive survey. I'll talk about tools that have worked for us, but I make no claim that there aren't better alternatives. What I'd like you to take away today are the kinds of things you should consider and watch out for when deciding what to use for your own site. For getting notified when your site goes down, the most well-known service is pingdom.com. Some other choices are New Relic, and the new statuscake.com. Speaker notes
  • 11. Any of these can notify you in various ways when your site goes down. Both pricing and other services offered vary. Your decision here is likely to depend somewhat on your choices in other areas of monitoring your site. For example, New Relic can check your sites' reachability from places all around the world. If for some reason your site is working fine from North America, but not from Europe, they can let you know. New Relic is a more expensive option, but they offer a *lot* of other services. Just saying... And status cake does the same thing. Speaker notes
  • 12. Considerations for web page monitoringConsiderations for web page monitoring Test non-trivial pages: behind login database access required Don't get ooded with alerts
  • 13. A home page is the obvious page to point a "site down" checker at, but it's often your most trivial page. It might work fine even when things are so broken no other pages work. Or maybe people can't log in, or only the pages behind logins are broken. You might want to set up a special page just for the site monitor to check. It might do some database access or whatever would give you confidence that all the important parts of your site are healthy. You don't want repeated notices every time the service sees a failure to load a page. I know someone with a site that has a small bug that only shows up for one hour a year - that hours when daylight saving time causes the same hour to happen twice. And for every second of that hour, he gets an alert that his site is broken. Then it stops for another year... so he puts off fixing it.... and then another year has gone by and it happens again :-) Speaker notes
  • 14. Site outage in Libya
  • 15. Tell story of site going down a couple of weeks ago, project lead getting notified and working with sys admin on Slack to diagnose. Turned out to be the ISP having blackholed our IP address after a DOS attack. Luckily, the web site is not the most important part of the Libya project and the text messages continued being processed without interruption. Speaker notes
  • 17. What if your site is responding, but abnormally slowly? You'd like to know that, but a simple check that it's reachable won't tell you. Most of the same services that can let you know your site is down, can also tell you if it's working poorly. Statuscake can let you know that. New Relic can not only let you know that, but show you exactly where in your request processing the slowdown is. Speaker notes
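To make the difference between "down" and "slow" concrete, here's a minimal sketch of the kind of check a monitoring service performs. The function names and the two-second threshold are assumptions for illustration; hosted services do far more than this, such as checking from several locations.

```python
import time
import urllib.request


def fetch(url, timeout=10):
    """Fetch a URL and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def timed_check(do_request, slow_after_seconds=2.0, clock=time.monotonic):
    """Call do_request() and classify the result as 'ok', 'slow', or 'down'."""
    started = clock()
    try:
        do_request()
    except Exception:
        # Any failure to get a response counts as the site being down.
        return "down", clock() - started
    elapsed = clock() - started
    return ("slow" if elapsed > slow_after_seconds else "ok"), elapsed
```

You might call it as `timed_check(lambda: fetch("https://example.com/health"))`, where the URL is a placeholder.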
  • 19. Another thing to think about is whether you can get advance notice that trouble is coming. It's a good idea to set up warnings for things like disk space and CPU usage going over some threshold. Then you can take action - enlarge your disk, add servers, etc - before it starts causing problems for your users. New Relic and Status Cake can both do this. Here's a screenshot from Status Cake. Speaker notes
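The threshold idea is simple enough to sketch in a few lines. This is only an illustration of the concept using the standard library; the function names and the 80% threshold are assumptions, and it is not how New Relic or Status Cake implement their checks.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def warning_for(name, value, threshold):
    """Return a warning string if value exceeds threshold, else None."""
    if value > threshold:
        return "%s at %.0f%% (threshold %.0f%%)" % (name, value, threshold)
    return None
```

Running `warning_for("disk", disk_usage_percent("/"), 80.0)` from a scheduled job gives you something to alert on well before the disk is actually full.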
  • 21. And then there's the infamous problem of having your domain expire without noticing. Microsoft let passport.com expire in 1999, causing a total outage for Hotmail. You don't want to make the news for something like that. You guessed it, there are tools to warn you about that. Speaker notes
  • 22. Certificate expiration
  • 23. And the same thing for SSL certificates. Speaker notes
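If you want to see what such a check involves, here's a minimal sketch using Python's standard `ssl` module. The function names are made up for this example, and a hosted tool will handle many cases this ignores. Note that a certificate that has already expired will normally make the connection itself fail, which is a signal too.

```python
import socket
import ssl
import time


def fetch_not_after(hostname, port=443, timeout=10):
    """Return the server certificate's notAfter string,
    in the format 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Return whole days left before the certificate expires (negative if past)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return int((expires_at - now) // 86400)
```

A daily job could call `days_until_expiry(fetch_not_after("example.com"))` and warn when the result drops below, say, 30; both the hostname and the 30-day figure are placeholders.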
  • 25. When you're trying to figure out a problem, sooner or later you'll end up looking at logs. By default, every application that helps keep your site or service running will put its log messages somewhere different, and you'll have to log in to every system individually to look at them. This is a nuisance even for a small, single-server site, and almost unusable as your site grows beyond that. You might also not want your developers even to be able to log into your production servers, depending on your service's security requirements. The solution is to gather your logs in a central place. You can actually do this without any additional software. Unix logging systems have had the capability for decades to forward messages to other servers and to accept incoming ones and store them in a single place, or route them based on various criteria. In these days of the cloud, there are lots of online services where you can forward your log messages, and they'll store them, make them easily searchable, and let you set up patterns that trigger notifications. Some can even analyze the messages and extract data about your site. We've used Papertrail, a paid cloud service with a free tier if you don't have a large volume of log messages. ELK can also do this, which I'll talk about later. Speaker notes
  • 26. View logs from any server
  • 29. Be alerted when searches match
  • 30. Details for central logging 1. Security: logs sent to a service that is append-only cannot be modified by an attacker to hide their trail 2. Security: if the logs are accessible elsewhere, you might not even need to let your developers have login access to the servers 3. There are different kinds of logs, so consider treating them individually. Some might warrant long storage, others fairly short, for example.
  • 31. Logging on the Libya project We have over a dozen servers running haproxy, vumi, postgresql, nginx, Django, and other services that all have their own logs. It wouldn't be feasible to check those logs directly on every server.
  • 32. Configuring log message forwarding Python logging configuration { ... 'handlers': { 'SysLog': { 'level': 'DEBUG', 'class': 'logging.handlers.SysLo 'formatter': 'simple', 'address': ('logsN.papertrailapp }, ... }
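The slide text is cut off, so here is a complete sketch of a similar configuration. It assumes the handler is the standard library's `logging.handlers.SysLogHandler`; the `build_logging_config` helper, the format string, and the host and port you pass in are illustrative placeholders, not the project's real values.

```python
import logging
import logging.config


def build_logging_config(host, port):
    """Return a dictConfig-style dict that forwards log records to syslog."""
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "simple": {
                "format": "%(asctime)s myapp %(levelname)s: %(message)s",
            },
        },
        "handlers": {
            "SysLog": {
                "level": "DEBUG",
                "class": "logging.handlers.SysLogHandler",
                "formatter": "simple",
                # (host, port) of the remote syslog receiver; UDP by default.
                "address": (host, port),
            },
        },
        "root": {"handlers": ["SysLog"], "level": "INFO"},
    }
```

You would apply it with `logging.config.dictConfig(build_logging_config(host, port))`, using the host and port your log service assigns you. In a Django project the same dict would typically be assigned to the `LOGGING` setting.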
  • 33. The dog in the nighttime
  • 34. In the Sherlock Holmes story "Silver Blaze", Holmes is the only one who notices what *didn't* happen but should have -- the dog didn't bark in the nighttime. The lesson for us is that we tend not to notice when things that are supposed to happen, don't. Like our automated backups. Or sending out monthly invoices. Often important stuff. This is a really good thing to assign a computer to do for us. There are numerous online services that will help you with this. The one I've used is healthchecks.io, which is free for everything I've needed to do with it. You tell it you want to monitor some periodic event, and how often it's supposed to happen. It assigns a unique URL, and you arrange to ping that URL when the event completes successfully. For example, you might add "&& wget <URL>" to a cron command. If healthchecks.io *doesn't* hear from your task when it should, it notifies you. Speaker notes
  • 35. Healthchecks.io and remote logging Example crontab: # crontab @daily /usr/local/bin/daily | logger -n log.example.com -p XXX && curl https://hchk.io/8fae4...
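  • 36. We run a daily command with cron. Adding "&& curl <URL>" pings that URL only if what precedes it succeeds. We can also pipe the output to "logger", a Unix utility that sends its input to syslog. In this case we're sending it to our remote log server. One thing to watch: when the command is piped into logger, "&&" tests the exit status of the last command in the pipeline (logger) unless the shell has an option like pipefail enabled, so a failure in the daily command itself may not stop the ping. Speaker notes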
  • 37. healthchecks.io from Python import requests requests.get("https://hchk.io/327b...")
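In practice you want the ping to go out only when the task really finished. Here's a minimal sketch of that pattern. It uses the standard library's `urllib` instead of `requests` so it stays self-contained, and the `run_and_report` helper is a made-up name for this example.

```python
import urllib.request


def ping(url, timeout=10):
    """Send a simple GET request to the check URL."""
    urllib.request.urlopen(url, timeout=timeout).close()


def run_and_report(task, ping_url, send_ping=ping):
    """Run task(); ping the check URL only if it returns without raising."""
    result = task()  # an exception here propagates, so no ping is sent
    try:
        send_ping(ping_url)
    except OSError:
        # A failed ping should not make the task itself look failed.
        pass
    return result
```

If the task raises, the ping never happens, and the monitoring service notices the silence, which is exactly the dog that didn't bark.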
  • 38. Recap Site down Site slow Possible upcoming problems: 1. Resources: Memory, CPU, Disk, ... 2. Expirations: Domains, SSL certificates Collect logs Monitor periodic tasks
  • 39. So far we've been talking about things that can affect our entire site or service, and how to get advance notice of upcoming problems, and urgent notice when problems actually occur. We've gone beyond just seeing if the home page works, to checking the site from around the world, watching things like disk space, cpu, and memory, and even getting reminded when a domain or certificate needs to be renewed. Speaker notes
  • 40. Errors on your site
  • 41. But there are also problems that only affect some of your users, or only when users do certain things. Again, some users might let you know, but most won't. You can't rely on it. At Caktus, we use the Django framework to build our web sites using Python. Django has a neat feature where it can email the site admins whenever an uncaught exception happens during request processing. That won't tell you if a user is getting shown a wrong answer, but it will let you know if a user is getting an error page. The email message includes the exception, the stack trace, and details of the request and the site settings. Speaker notes
  • 42. Django's error emails From: django@yoursite.com To: your_inbox Subject: Someone hit an error Request URL & headers Response status Stack trace Django settings
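For reference, the settings involved look roughly like the fragment below. The names and addresses are placeholders, and it assumes outgoing email is already configured for the site. The `(name, email)` tuple format for `ADMINS` is the one used by Django versions from around the time of this talk, so check the documentation for the version you run.

```python
# settings.py fragment (Django project settings)

# Error emails are only sent when DEBUG is off.
DEBUG = False

# Who receives the error reports; name and address are placeholders.
ADMINS = [("Site Admin", "admin@example.com")]

# The From: address used on the error emails (placeholder).
SERVER_EMAIL = "django@example.com"
```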
  • 43. This is great during testing, and when you're just getting your site going. It not only notifies you, it gathers a lot of information that is often all you need to figure out exactly what happened. If you're using Django, I recommend enabling this during your site development. When your testers hit problems, it'll gather more information than your tester could have gotten for you at the time. I imagine there are ways to do something similar when errors occur if you're not using Django. Once your site starts getting any significant traffic, though, you should turn this off and replace it with something better. Speaker notes
  • 44. A flood of error emails
  • 45. Imagine this scenario. Your site is getting hundreds or thousands of requests per hour. That's per *hour*, not minute or second; it's not really all that much traffic. Overnight, something happens that starts making every request fail. Maybe your database goes down, or somebody changes the password on the search backend - whatever. And every time a request fails, the site sends you another email message. Picture your inbox when you come in the next morning. It won't be pretty - thousands or tens of thousands of these error messages. And as you try to fix the problem, they'll keep pouring in. And after you fix the problem, more will keep coming, and you'll notice they are dated hours ago, and that the flood of email was so great that a lot of it is backed up somewhere in the internet and is going to keep getting delivered for hours and hours more. Your SMTP provider (handling the outgoing mail from your servers) might not be too happy with you either. As you might gather, I've been there. And when it happens, I kick myself because I should have set up Sentry earlier. Speaker notes
  • 47. Sentry is an open source service that sits between your site's error reports and you. You can subscribe to a hosted service, or set up your own server. Then you configure your site to notify Sentry about problems instead of sending emails to you. Sentry is smart. It not only gathers *more* information about the problem than Django includes in its emails, it also recognizes when the same problem happens over and over, and only notifies you *once*. It keeps track, of course. You can go to the Sentry web interface for that problem and see how many times people have hit that problem, how long it's been happening, and whether it's getting more frequent. Then you can drill down into one of the occurrences to see the exception, the stack trace, the request data, plus local variables and function arguments. Once you think you've fixed a problem, you can mark it _resolved_ in Sentry. Once you do that, if it ever happens again, Sentry will send you a fresh notice, and also let you know this problem has happened before and so this is a possible regression. Speaker notes
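To show why grouping matters, here's a toy sketch of the idea: count every occurrence, but notify only on the first one, and again if a resolved problem comes back. This is not how Sentry is implemented; the `ErrorAggregator` class and the fingerprint rule (exception type plus the function that raised it) are assumptions made up for this illustration.

```python
import traceback


def fingerprint(exc):
    """Group errors by exception type and the location that raised them."""
    frames = traceback.extract_tb(exc.__traceback__)
    where = (frames[-1].filename, frames[-1].name) if frames else ("?", "?")
    return (type(exc).__name__,) + where


class ErrorAggregator:
    def __init__(self, notify):
        self.notify = notify      # callable that delivers one message
        self.counts = {}          # fingerprint -> number of occurrences
        self.resolved = set()     # fingerprints marked as fixed

    def record(self, exc):
        key = fingerprint(exc)
        first_time = key not in self.counts
        regression = key in self.resolved
        self.counts[key] = self.counts.get(key, 0) + 1
        if regression:
            self.resolved.discard(key)
            self.notify("possible regression: %r" % (exc,))
        elif first_time:
            self.notify("new error: %r" % (exc,))

    def resolve(self, exc):
        """Mark this kind of error as fixed; a repeat becomes a regression."""
        self.resolved.add(fingerprint(exc))
```

With this in place, a thousand identical failures overnight produce one notification and a count of 1000, instead of a thousand emails.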
  • 48. Configure Sentry # Python package $ pip install raven --upgrade # Django config INSTALLED_APPS = ( ... 'raven.contrib.django.raven_compat' ) RAVEN_CONFIG = { 'dsn': 'https://<key>:<secret>@sentry.io }
  • 49. Sentry in Libya We have our own sentry server in the datacenter in Libya that handles all the errors from Django and Vumi, and notifies us of problems while gathering diagnostic information.
  • 51. Now I'm going to say a little more about New Relic, which I've already mentioned. New Relic is the gold standard of monitoring tools. A surprising number of well-known sites use it, because it's worth it - it's expensive, but it does almost everything I've mentioned here, and does it as well or better than any other tool I know of. It can check that your site is up from places around the world. It can notify you when there's a performance or resource problem. It's so well integrated into your application that if, for example, a request is slow, New Relic can show you which part of the code processing that request is taking most of the time. It's incredibly helpful to know whether a query is slow, or you're doing a query more times than you need to, or maybe rendering some templates is taking a long time, or some 3rd-party service is responding slowly. There's a free tier that provides enough of the New Relic service to get a feel for it. Speaker notes
  • 52. Configuring New Relic to monitor your Python-based server # Install $ pip install newrelic # Create config file $ newrelic-admin generate-config YOUR_LICENS <path>/newrelic.ini # Run your server $ NEW_RELIC_CONFIG_FILE=<path>/newrelic.ini newrelic-admin run-program COMMAND_THAT_STARTS_YOUR_SERVER
  • 53. ELK: ElasticSearch, Logstash, Kibana
  • 54. I'm going to finish up by talking about another set of open source tools that we've been trying out lately that shows some real promise. It's the combination of Elasticsearch, Logstash, and Kibana, or more briefly, "ELK". These tools have been around for a while, but putting them together was left as an exercise. Now the developers are providing a container with the whole set configured to work together. Logstash takes in information gathered from your servers, both log messages and statistics, and Elasticsearch stores and indexes it. Then Kibana provides a very flexible front-end to analyze and visualize all that data, using Elasticsearch to efficiently query it. There are agents to install on your servers that forward log messages, or gather metrics. So ELK can replace several other services that run on your servers. Speaker notes
  • 55. ELK monitors resources and gathers logs
  • 56. Like papertrail, ELK can collect your logs, search them, etc. Like status cake or New Relic, ELK can monitor memory, disk, etc. Speaker notes
  • 57. ELK gathers data Send any numbers you want to ELK Then you can list them, graph them, set up alerts from them, compare them to other metrics, etc.
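As one way of sending your own numbers, here's a minimal sketch that builds a JSON event and sends it as a UDP datagram. It assumes you have configured a Logstash pipeline with a udp input and a json codec on a port of your choosing. The field names, function names, and the metric used in the example are all made up for illustration.

```python
import json
import socket
from datetime import datetime, timezone


def build_metric_event(name, value, **tags):
    """Return a JSON document describing one measurement."""
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "metric": name,
        "value": value,
    }
    event.update(tags)  # extra fields such as server="web1"
    return json.dumps(event)


def send_event(event_json, host, port):
    """Send one JSON document as a single UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(event_json.encode("utf-8"), (host, port))
```

A call like `send_event(build_metric_event("registrations_per_minute", 12, server="web1"), host, port)` is fire-and-forget, so a monitoring outage can't slow down the site itself; the trade-off is that UDP gives no delivery guarantee.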
  • 58. ELK and Libya project Performance Resources: CPU, Memory, Disk
  • 60. Where ELK still needs some work - It's extremely powerful, but unfortunately even the simple things require a pretty detailed understanding of how it all works. Speaker notes
  • 61. Recap Get notified when users hit errors Don't get flooded - use Sentry For more in-depth data, New Relic Build it yourself using ELK
  • 62. Image sources Libyan Voter Registration: https://www.caktusgroup.com/case-study/worlds-first-sms-voter-registration-system/ Track maintenance vehicle: https://commons.wikimedia.org/wiki/File:UP_track_maintenance_ve Logs: https://commons.wikimedia.org/wiki/File:Logs-Port-of-Burnie-20160208-004.jpg Dog in the nighttime: https://pixabay.com/en/dog-howl-moon-tree-sky-star-647533/ Great Wave: https://en.wikipedia.org/wiki/File:The_Great_Wave_off_Kanagawa.j