1. Barry Laffoy – Senior DevOps Engineer
Scaling a Monitoring Strategy
For a Microservices Architecture
Thanks to Our Sponsors
http://community.kloia.co.uk – Join Our Community Slack Channel
2. Monitoring in a Microservices Environment
Or how to scale your alerting strategy with your team and application
3. Who Am I?
Why Should You Listen to Me?
Physics
Actuarial Science
Build Engineering
Experience building and maintaining Excel and Jenkins
DevOps at ClearScore
4. Who Are ClearScore?
Aim to Solve Money for the World
Present people with their data in a beautiful way, to empower financial decision making
Committed to best-in-class technical solutions
Committed to having fun while we do it
17. Not So Great for Microservices
Instrumented inside the container (not 12-factor)
Paying for a license per process (not scalable)
Manual configuration of alerting rules
Limited language support
Tracing from service to service very difficult
Alerting on “abnormal traffic” limited by a simple statistical model
22. Off the Shelf
External synthetics with Pingdom
Container security scanning with quay.io
Dependency security scanning with maven/npm
AMI security scanning with Inspector
Performance monitoring as part of the CI pipeline
Internal synthetics with consul-alerts / liveness-readiness probes (sketched below)
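To make the internal-synthetics bullet concrete, here is a minimal sketch of such a check in Python. The service name, endpoint, and timeout are invented for illustration; the real checks are driven by consul-alerts or probe commands, not this exact script.

import sys

import requests  # third-party HTTP client

HEALTH_URL = "http://scores-service.internal/health"  # hypothetical endpoint

def main() -> int:
    # Exit 0 when the service answers healthily, non-zero otherwise:
    # the same contract a liveness/readiness probe command expects.
    try:
        resp = requests.get(HEALTH_URL, timeout=5)
    except requests.RequestException:
        return 1  # connection failure: report unhealthy
    return 0 if resp.ok else 1

if __name__ == "__main__":
    sys.exit(main())

A scheduler (cron, a Kubernetes CronJob, or consul-alerts itself) would run this on an interval and alert on non-zero exits.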
23. Highly Customizable
Cloud native with CloudWatch
Annotating releases in Grafana
Self-managed with statsd
Infrastructure metrics
Custom application metrics (see the statsd sketch after this list)
Third-party integration monitoring
Alerting rules are “all or nothing”
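Custom application metrics boil down to a few statsd calls from inside the service. A rough sketch with the Python statsd client; the host, prefix, metric names, and helper function are all invented for illustration.

import statsd  # pip install statsd

# Daemon address and prefix are illustrative; real values live in config.
stats = statsd.StatsClient("statsd.internal", 8125, prefix="clearscore.api")

def fetch_score(user_id: str) -> int:
    # Stand-in for the real downstream call.
    return 720

def handle_score_request(user_id: str) -> dict:
    # Count every request so dashboards can graph traffic.
    stats.incr("score.requests")
    # Time the handler; shows up as latency percentiles in Grafana.
    with stats.timer("score.latency"):
        score = fetch_score(user_id)
    return {"user": user_id, "score": score}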
27. Traditional Vendors
Poor support for distributed microservices
Poor language support (Scala/Akka)
Mixed results on configurability
28. Enter Instana
Discovered quite by accident
Beautiful UI
Extremely easy to set up
Covered most of our desired features out of the box
Infrastructure monitoring
Microservice APM
End-user monitoring
33. You Build It, You Run It!
Delivery teams own their microservices
Responsible for performance and monitoring in dev/ci/stg environments
Ideally, incidents alert the responsible dev team (see the routing sketch below)
Unfortunately, we don’t quite do that
Sophisticated routing system: <picture of me>
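The aspiration behind “incidents alert the responsible dev team” is essentially a lookup from service to owning team. A toy sketch in Python; every service, team, and field name here is invented, and today this mapping mostly lives in one engineer's head.

# Hypothetical service-to-owning-team routing table.
OWNERS = {
    "scores-service": "team-scores",
    "offers-service": "team-offers",
}
DEFAULT_TARGET = "platform-oncall"

def route_alert(alert: dict) -> str:
    # Pick the channel/pager target for an incoming alert,
    # falling back to the platform on-call for unowned services.
    return OWNERS.get(alert.get("service", ""), DEFAULT_TARGET)

print(route_alert({"service": "scores-service", "severity": "critical"}))
# -> team-scores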
35. People Cause Problems
Things go wrong when people change things
Luckily, this means things go wrong during business hours (mostly)
Everyone is empowered to inspect the monitoring tools
The on-call team supports problem resolution; it doesn’t fix everything
Understanding teams and services drives platform improvement
36. Alert Grooming
Lots of noise on alert channels
Alert fatigue
“Boy who cried wolf” syndrome
Requires proactive maintenance of alerts
Fix ALL annoying alerts, even if that means fixing the alert, not the underlying service
The investment takes time, but pays dividends in productivity
37. Major Incidents
Zero-blame retros
Involve stakeholders
Generate action points with owners (and follow up)
Detailed incident reports with business-friendly summaries and cost estimates
39. Replatforming
HashiCorp platform
Great choice to get us to the cloud
Focused on supporting zillions of containers in an HPC environment
Limiting our scalability and speed of delivery
Encouraged the anti-pattern of integrating platform details into services
Kubernetes migration
Solves many of our problems
Natively supports blue-green deployments
Instana support for cluster health monitoring
Prometheus on-cluster monitoring
What to do with our statsd? (one Prometheus-shaped answer sketched below)
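One possible answer to the statsd question is to expose the same signals directly to Prometheus once we are on-cluster. A sketch with the official Python client; the port, metric, and label names are invented for illustration.

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# The same counters/timers we push to statsd today, exposed in
# Prometheus's pull model instead.
REQUESTS = Counter("api_requests_total", "API requests handled", ["endpoint"])
LATENCY = Histogram("api_request_seconds", "Request latency", ["endpoint"])

def handle_request(endpoint: str) -> None:
    REQUESTS.labels(endpoint=endpoint).inc()
    with LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(9100)  # serves /metrics for Prometheus to scrape
    while True:
        handle_request("score")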
40. Continuous Deployment 2.0
Investigating CD platforms
Spinnaker/Concourse/Drone
Routing non-prod alerts to development teams
Performance, tracing, and vulnerability issues should be flagged
42. Serverless
Functions as a service (on AWS Lambda)
Horizontal auto-scaling
“No Ops”
Cheap
Unsupported by traditional monitoring/tracing solutions (hence the push-metrics sketch below)
X-Ray tracing features with Instana
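Because the usual agents cannot run alongside a function, metrics have to be pushed from inside the handler instead. A sketch of a Python Lambda reporting a custom CloudWatch metric via boto3; the namespace, metric name, and event shape are invented for illustration.

import boto3

cloudwatch = boto3.client("cloudwatch")

def do_work(event) -> int:
    # Stand-in for the function's real job.
    return len(event.get("records", []))

def handler(event, context):
    # Example AWS Lambda entry point that reports its own business metric.
    processed = do_work(event)
    # No agent can run beside the function, so push the datapoint directly.
    cloudwatch.put_metric_data(
        Namespace="ClearScore/Serverless",  # invented namespace
        MetricData=[{
            "MetricName": "RecordsProcessed",
            "Value": processed,
            "Unit": "Count",
        }],
    )
    return {"processed": processed}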