Teach Your Sites to Call for Help: Automated Problem Reporting for Online Services

Teach your sites to call for help
Automate problem reporting for online services
Dan Poirier
Caktus Consulting Group
https://caktusgroup.com

Since 1992...
1992-2011: IBM
2011-present: Caktus

I started working in this business in 1992 at IBM, and moved to Caktus in 2011. Almost everything I've worked on has affected real
customers, and has included diagnosing and fixing the problems they run into.

Libyan Voter Registration

I'd like to use a concrete example for this talk. I always find it easier to understand things from concrete examples, in addition to the abstract
theory.
One of Caktus's projects that provided a lot of experience relevant to this talk was helping to build a voter registration system for Libya.
After the fall of Gaddafi, there was an urgent desire to bootstrap a democracy. One part of that was registering citizens to vote. Caktus
helped develop a text-message-based voter registration system, and later a web site for checking registrations and administering the
system. Over 1.5 million citizens used the system to register for two elections a few years ago.
Registration was opened again just this past December, and a million more people had registered as of the beginning of February.

Keep it running

If a web site or online service is worth setting up, it's worth keeping it running.
Mostly they just keep going and don't need much work.
So if 2 months after you set up a site, it goes down, you might not notice right away. Your site users might not notify you - and if your site is
completely down, they might not be able to. And more likely, they'll just go to another site.
We can keep checking the site to see if it's all working - but that's boring, and we're likely to get tired of it and do a poor job or stop doing it
completely. This is the kind of thing computers are good at.
So, let's use computers to let us know when something's wrong.
That's all pretty obvious, of course. As they say, the devil's in the details.

Is my site down?

Let's start with what's really the simplest - getting notified when your site goes completely down. In that case, we really can't rely on anything
in our site code, or even on our site's servers, to notify us - they're not working right. So we need an external service.
Like most of the tools I'm going to talk about today, there are multiple options, and I make no pretense of having done a comprehensive
survey. I'll talk about tools that have worked for us, but I make no claim that there aren't better alternatives.
What I'd like you to take away today are the kinds of things you should consider and watch out for when deciding what to use for your own
site.
For getting notified when your site goes down, the best-known service is pingdom.com. Some other choices are New Relic and the
newer statuscake.com.

Any of these can notify you in various ways when your site goes down. Both pricing and other services offered vary. Your decision here is
likely to depend somewhat on your choices in other areas of monitoring your site.
For example, New Relic can check your sites' reachability from places all around the world. If for some reason your site is working fine from
North America, but not from Europe, they can let you know. New Relic is a more expensive option, but they offer a *lot* of other services.
Just saying...
StatusCake does the same thing.

Considerations for web page monitoring
Test non-trivial pages:
    behind login
    database access required
Don't get flooded with alerts

A home page is the obvious page to point a "site down" checker at, but it's often your most trivial page. It might work fine even when things
are so broken no other pages work. Or maybe people can't log in, or only the pages behind logins are broken.
You might want to set up a special page just for the site monitor to check. It might do some database access or whatever would give you
confidence that all the important parts of your site are healthy.
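One way such a page could look is sketched below: a tiny WSGI endpoint that runs a trivial database query and answers 503 when the query fails, so the monitor sees trouble even while the home page still renders. The names check_database and health_app are made up for this sketch, and the in-memory SQLite connection is only a stand-in for whatever database your site really uses.

```python
import sqlite3


def check_database(connect=lambda: sqlite3.connect(":memory:")):
    """Return True if a trivial query succeeds on a fresh connection.

    The default in-memory SQLite connection is a placeholder; pass a
    callable that connects to the database your site really depends on.
    """
    try:
        conn = connect()
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return True
    except Exception:
        return False


def health_app(environ, start_response, db_check=check_database):
    """WSGI app for the monitor to poll: 200 if healthy, 503 otherwise."""
    healthy = db_check()
    status = "200 OK" if healthy else "503 Service Unavailable"
    body = b"ok\n" if healthy else b"database check failed\n"
    start_response(status, [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(body)))])
    return [body]
```

In a Django project the same idea would more naturally be an ordinary view; the point is only that the page exercises the parts of the stack you care about.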
You don't want repeated notices every time the service sees a failure to load a page. I know someone with a site that has a small bug that
only shows up for one hour a year - the hour when daylight saving time ends and the same hour happens twice. And for every second of
that hour, he gets an alert that his site is broken. Then it stops for another year... so he puts off fixing it... and then another year has gone by
and it happens again :-)
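If you end up writing your own check, one simple way to avoid the flood is to notify only when the state changes, not on every failed probe. This is a minimal sketch of that idea; the class name StateChangeNotifier is invented here, and hosted monitoring services have their own settings for the same purpose.

```python
class StateChangeNotifier:
    """Call `notify` only when a check flips between passing and failing."""

    def __init__(self, notify):
        self.notify = notify
        self.last_ok = True  # assume the site starts out healthy

    def record(self, ok):
        """Record one probe result; send a message only on a transition."""
        if ok != self.last_ok:
            self.notify("recovered" if ok else "DOWN")
            self.last_ok = ok
```

Feeding it one result per probe means an hour of failures produces a single "DOWN" message and, later, a single "recovered" message.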

Site outage in Libya

Tell story of site going down a couple of weeks ago, project lead getting notified and working with sys admin on Slack to diagnose. Turned
out to be the ISP having blackholed our IP address after a DoS attack.
Luckily, the web site is not the most important part of the Libya project and the text messages continued being processed without
interruption.

What if your site is responding, but abnormally slowly? You'd like to know that, but a simple check that it's reachable won't tell you. Most of
the same services that can let you know your site is down can also tell you if it's working poorly. StatusCake can let you know that. New
Relic can not only let you know that, but show you exactly where in your request processing the slowdown is.
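To make the difference between "down" and "slow" concrete, here is a rough sketch of a check that times a page load and classifies the result. The function name timed_check and the 2-second threshold are arbitrary choices for illustration, not anything the services above prescribe.

```python
import time


def timed_check(fetch, slow_after=2.0, clock=time.monotonic):
    """Run `fetch()` and classify the result as 'ok', 'slow', or 'down'.

    `fetch` is any callable that loads the page and raises on failure.
    `slow_after` is a threshold in seconds (2.0 is just an example).
    Returns a (state, elapsed_seconds) tuple.
    """
    start = clock()
    try:
        fetch()
    except Exception:
        return "down", clock() - start
    elapsed = clock() - start
    return ("slow" if elapsed > slow_after else "ok"), elapsed
```

For a real page, `fetch` could be something like `lambda: urllib.request.urlopen(url, timeout=10).read()`, run from a machine outside your own servers.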

Memory, CPU, Disk

Another thing to think about is whether you can get advance notice that trouble is coming. It's a good idea to set up warnings for things like
disk space and CPU usage going over some threshold. Then you can take action - enlarge your disk, add servers, etc. - before it starts
causing problems for your users. New Relic and StatusCake can both do this. Here's a screenshot from StatusCake.
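The threshold idea itself is simple enough to sketch with the Python standard library. The function names and the 80% default below are illustrative assumptions; a monitoring service would track this for you and keep history.

```python
import shutil


def disk_percent_used(path="/"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_warning(path="/", threshold=80.0):
    """Return a warning string if usage is at or above `threshold` percent, else None."""
    percent = disk_percent_used(path)
    if percent >= threshold:
        return f"{path} is {percent:.0f}% full (threshold {threshold:.0f}%)"
    return None
```

Run from cron, a non-None result could be logged or emailed well before the disk actually fills.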

Domain expiration

And then there's the infamous problem of having your domain expire without noticing. Microsoft let passport.com expire in 1999, causing a
total outage for Hotmail. You don't want to make the news for something like that. You guessed it, there are tools to warn you about that.

Certificate expiration

And the same thing for SSL certificates.
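For certificates, the expiry date can be read straight from the server, so a homemade check is feasible. The sketch below uses only the standard library; the function names are made up for this example, and a hosted tool is still the easier route.

```python
import socket
import ssl
import time


def days_until(not_after, now=None):
    """Days from `now` (epoch seconds) until a certificate's notAfter time.

    `not_after` uses the format found in certificates,
    for example "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(hostname, port=443, timeout=10):
    """Connect to the server and return its certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

A daily job could call `days_until(fetch_not_after("example.com"))` and warn when the result drops below, say, 30. Note that with the default context an already-expired certificate fails the handshake, which is itself a signal worth alerting on.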

Gather your logs

When you're trying to figure out a problem, sooner or later you'll end up looking at logs.
By default, every application that helps keep your site or service running will put its log messages somewhere different, and you'll have to
log in to every system individually to look at them. This is a nuisance even for a small, single-server site, and almost unworkable as your site
grows beyond that.
You might also not want your developers even to be able to log into your production servers, depending on your service's security
requirements.
The solution is to gather your logs in a central place.
You can actually do this without any additional software. Unix logging systems have had the capability for decades to forward messages to
other servers and to accept incoming ones and store them in a single place, or route them based on various criteria.
In these days of the cloud, there are lots of online services where you can forward your log messages, and they'll store them, make them
easily searchable, and let you set up patterns that'll trigger notifications. Some can even analyze the messages and extract data about your
site.
We've used Papertrail, a paid cloud service with a free tier if you don't have a large volume of log messages. ELK can also do this, which I'll
talk about later.

View logs from any server

Be alerted when searches match

Details for central logging
1. Security: logs sent to a service that is append-only cannot be modified by an attacker to hide their trail
2. Security: if the logs are accessible elsewhere, you might not even need to let your developers have login access to the servers
3. There are different kinds of logs, so consider treating them individually. Some might warrant long storage, others fairly short, for example.

Logging on the Libya project
We have over a dozen servers running haproxy, vumi,
postgresql, nginx, Django, and other services that all have their
own logs. It wouldn't be feasible to check those logs directly on
every server.

Configuring log message forwarding
Python logging configuration
{
    ...
    'handlers': {
        'SysLog': {
            'level': 'DEBUG',
            'class': 'logging.handlers.SysLogHandler',
            'formatter': 'simple',
            'address': ('logsN.papertrailapp
        },
        ...
    }
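Since the slide shows only a fragment, here is a complete configuration along the same lines that can be run as-is. It is a sketch under assumptions: the ("localhost", 514) address is a placeholder for the host and port your log service assigns you, and the "simple" format string and the "myapp" logger name are invented for the example. In a Django project this dictionary would be the LOGGING setting and Django would apply it for you.

```python
import logging
import logging.config

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {"format": "%(name)s %(levelname)s %(message)s"},
    },
    "handlers": {
        "SysLog": {
            "level": "DEBUG",
            "class": "logging.handlers.SysLogHandler",
            "formatter": "simple",
            # Placeholder destination: replace with the host and port
            # of your own log server. Messages go over UDP by default.
            "address": ("localhost", 514),
        },
    },
    # Route everything at INFO and above through the syslog handler.
    "root": {"handlers": ["SysLog"], "level": "INFO"},
}

logging.config.dictConfig(LOGGING)
logging.getLogger("myapp").info("central logging configured")
```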

The dog in the nighttime

In the Sherlock Holmes story "Silver Blaze", Holmes is the only one who notices what *didn't* happen but should have -- the dog didn't bark
in the nighttime.
The lesson for us is that we tend not to notice when things that are supposed to happen, don't. Like our automated backups. Or sending out
monthly invoices. Often important stuff.
This is a really good thing to assign a computer to do for us.
There are numerous online services that will help you with this. The one I've used is healthchecks.io, which is free for everything I've needed
to do with it.
You tell it you want to monitor some periodic event, and how often it's supposed to happen. It assigns a unique URL, and you arrange to
ping that URL when the event completes successfully. For example, you might add "&& wget <URL>" to a cron command.
If healthchecks.io *doesn't* hear from your task when it should, it notifies you.

Healthchecks.io and remote logging
Example crontab:
# crontab
@daily /usr/local/bin/daily | logger -n log.example.com -p XXX && curl https://hchk.io/8fae4...

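We run a daily command with cron.
Adding "&& curl <URL>" pings that URL only if what comes before it succeeds.
We can also pipe the output to "logger", a Unix utility that sends its input to syslog. In this case we're sending it to our remote log server.
One caveat when combining the two, as on this slide: the shell then tests the exit status of the last command in the pipeline (logger), not of the daily script itself.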

healthchecks.io from Python
import requests
requests.get("https://hchk.io/327b...")
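In a real task you would want the ping to happen only after the work succeeds, and you would not want a failed ping to crash the task. Here is one possible shape for that, using only the standard library; the names ping and run_and_report are hypothetical, and the URL you pass is whatever your check's settings page gives you.

```python
import urllib.request


def ping(url, timeout=10):
    """Send the check-in request; never let a monitoring hiccup break the task."""
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
        return True
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def run_and_report(task, ping_url, send_ping=ping):
    """Run `task()`; check in only if it finishes without raising."""
    result = task()  # an exception here propagates and skips the ping
    send_ping(ping_url)
    return result
```

If the task raises, no ping is sent, and the missing check-in is what triggers the notification.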

Recap
Site down
Site slow
Possible upcoming problems:
1. Resources: Memory, CPU, Disk, ...
2. Expirations: Domains, SSL certificates
Collect logs
Monitor periodic tasks

So far we've been talking about things that can affect our entire site or service, and how to get advance notice of upcoming problems, and
urgent notice when problems actually occur. We've gone beyond just seeing if the home page works, to checking the site from around the
world, watching things like disk space, cpu, and memory, and even getting reminded when a domain or certificate needs to be renewed.

Errors on your site

But there are also problems that only affect some of your users, or only when users do certain things. Again, some users might let you know,
but most won't. You can't rely on it.
At Caktus, we use the Django framework to build our web sites using Python. Django has a neat feature where it can email the site admins
whenever an uncaught exception happens during request processing. That won't tell you if a user is getting shown a wrong answer, but it
will let you know if a user is getting an error page. The email message includes the exception, the stack trace, and details of the request and
the site settings.

Django's error emails
From: django@yoursite.com
To: your_inbox
Subject: Someone hit an error
Request URL & headers
Response status
Stack trace
Django settings

This is great during testing, and when you're just getting your site going. It not only notifies you, it gathers a lot of information that is often all
you need to figure out exactly what happened.
If you're using Django, I recommend enabling this during your site development. When your testers hit problems, it'll gather more information
than your tester could have gotten for you at the time.
I imagine there are ways to do something similar when errors occur if you're not using Django.
Once your site starts getting any significant traffic, though, you should turn this off and replace it with something better.

A flood of error emails

Imagine this scenario. Your site is getting hundreds or thousands of requests per hour. That's per *hour*, not minute or second; it's not really
all that much traffic. Overnight, something happens that starts making every request fail. Maybe your database goes down, or somebody
changes the password on the search backend - whatever. And every time a request fails, the site sends you another email message. Picture
your inbox when you come in the next morning. It won't be pretty - thousands or tens of thousands of these error messages. And as you try
to fix the problem, they'll keep pouring in. And after you fix the problem, more will keep coming, and you'll notice they are dated hours ago,
and that the flood of email was so great that a lot of it is backed up somewhere in the internet and is going to keep getting delivered for hours
and hours more. Your SMTP provider (handling the outgoing mail from your servers) might not be too happy with you either. As you might
gather, I've been there. And when it happens, I kick myself because I should have set up Sentry earlier.

Sentry is an open source service to sit between your site's error reports and you. You can subscribe to a hosted service, or set up your own
server. Then you configure your site to notify Sentry about problems instead of sending emails to you.
Sentry is smart. It not only gathers *more* information about the problem than Django includes in its emails, it also recognizes when the
same problem happens over and over, and only notifies you *once*.
It keeps track, of course. You can go to the Sentry web interface for that problem and see how many times people have hit that problem,
how long it's been happening, and whether it's getting more frequent. Then you can drill down into one of the occurrences to see the
exception, the stack trace, the request data, plus local variables and function arguments.
Once you think you've fixed a problem, you can mark it _resolved_ in Sentry. Once you do that, if it ever happens again, Sentry will send
you a fresh notice, and also let you know this problem has happened before and so this is a possible regression.
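The core idea of grouping repeats and notifying once can be illustrated in a few lines. This is emphatically not Sentry's actual grouping algorithm, just a toy sketch of the principle; the class name ErrorGrouper and its fingerprint rule (exception type plus the innermost frame's file and function) are assumptions made for the example.

```python
import traceback
from collections import Counter


class ErrorGrouper:
    """Count every occurrence of an error but notify only on the first one."""

    def __init__(self, notify):
        self.notify = notify
        self.counts = Counter()

    def fingerprint(self, exc):
        # Group by exception type plus the innermost frame's file and function.
        frames = traceback.extract_tb(exc.__traceback__)
        where = (frames[-1].filename, frames[-1].name) if frames else ("?", "?")
        return (type(exc).__name__,) + where

    def capture(self, exc):
        """Record one occurrence; return the group key it was filed under."""
        key = self.fingerprint(exc)
        self.counts[key] += 1
        if self.counts[key] == 1:
            self.notify(f"New error: {type(exc).__name__}: {exc}")
        return key
```

With something like this in front of your inbox, ten thousand failures from one broken database connection become one message and a counter.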

Configure Sentry
# Python package
$ pip install raven --upgrade
# Django config
INSTALLED_APPS = (
    ...
    'raven.contrib.django.raven_compat',
)
RAVEN_CONFIG = {
    'dsn': 'https://<key>:<secret>@sentry.io
}

Sentry in Libya
We have our own Sentry server in the datacenter in Libya that
handles all the errors from Django and Vumi, and notifies us of
problems while gathering diagnostic information.

Now I'm going to say a little more about New Relic, which I've already mentioned. New Relic is the gold standard of monitoring tools. A
surprising number of well-known sites use it, because it's worth it - it's expensive, but it does almost everything I've mentioned here, and
does it as well as or better than any other tool I know of.
It can check that your site is up from places around the world.
It can notify you when there's a performance or resource problem.
It's so well integrated into your application that if, for example, a request is slow, New Relic can show you which part of the code processing
that request is taking most of the time. It's incredibly helpful to know whether a query is slow, or you're doing a query more times than you
need to, or maybe rendering some templates is taking a long time, or some 3rd-party service is responding slowly.
There's a free tier that provides enough of the New Relic service to get a feel for it.

Configuring New Relic to monitor your Python-based server
# Install
$ pip install newrelic
# Create config file
$ newrelic-admin generate-config YOUR_LICENSE_KEY <path>/newrelic.ini
# Run your server
$ NEW_RELIC_CONFIG_FILE=<path>/newrelic.ini newrelic-admin run-program COMMAND_THAT_STARTS_YOUR_SERVER

ELK: Elasticsearch, Logstash, Kibana

I'm going to finish up by talking about another set of open source tools that we've been trying out lately that shows some real promise. It's
the combination of Elasticsearch, Logstash, and Kibana, or more briefly, "ELK".
These tools have been around for a while, but putting them together was left as an exercise. Now the developers are providing a container
with the whole set configured to work together.
Logstash collects and processes information gathered from your servers, both log messages and statistics, and Elasticsearch stores and indexes it.
Then Kibana provides a very flexible front-end to analyze and visualize all that data, using Elasticsearch to efficiently query it.
There are agents to install on your servers that forward log messages, or gather metrics.
So ELK can replace several other services that run on your servers.

ELK monitors resources and gathers logs

Like Papertrail, ELK can collect your logs, search them, etc. Like StatusCake or New Relic, ELK can monitor memory, disk, etc.

ELK gathers data
Send any numbers you want to ELK
Then you can list them, graph them, set up alerts from them,
compare them to other metrics, etc.

ELK and Libya project
Performance
Resources: CPU, Memory, Disk

Kibana complexity

Here's where ELK still needs some work: it's extremely powerful, but unfortunately even the simple things require a pretty detailed understanding
of how it all works.

Recap
Get notified when users hit errors
Don't get flooded - use Sentry
For more in-depth data, New Relic
Build it yourself using ELK

Image sources
Libyan Voter Registration: https://www.caktusgroup.com/case-study/worlds-first-sms-voter-registration-system/
Track maintenance vehicle: https://commons.wikimedia.org/wiki/File:UP_track_maintenance_ve
Logs: https://commons.wikimedia.org/wiki/File:Logs-Port-of-Burnie-20160208-004.jpg
Dog in the nighttime: https://pixabay.com/en/dog-howl-moon-tree-sky-star-647533/
Great Wave: https://en.wikipedia.org/wiki/File:The_Great_Wave_off_Kanagawa.j
