1. Barry Laffoy – Senior DevOps Engineer
Scaling a Monitoring Strategy
For a Microservices Architecture
Thanks to Our Sponsors
http://community.kloia.co.uk – Join Our Community Slack Channel
2. Monitoring in a Microservices Environment
Or how to scale your alerting strategy with your team and application
3. Who Am I?
Why Should You Listen to Me?
Physics
Actuarial Science
Build Engineering
Experience building and maintaining Excel and Jenkins
DevOps at ClearScore
4. Who Are ClearScore?
Aim to Solve Money for the World
Present people with their data in a beautiful way, to empower financial decision making
Committed to best-in-class technical solutions
Committed to having fun while we do it
17. Not So Great for Microservices
Instrumented inside the container (not 12-factor)
Paying for a license per process (not scalable)
Manual configuration of alerting rules
Limited language support
Tracing from service to service very difficult
Alerting on “abnormal traffic” limited by a simple statistical model
22. Off the Shelf
External synthetics with Pingdom
Container security scanning with quay.io
Dependency security scanning with maven/npm
AMI security scanning with Inspector
Performance monitoring as part of the CI pipeline
Internal synthetics with consul-alerts / liveness-readiness probes (sketched below)
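To make the internal-synthetics bullet concrete, here is a minimal sketch of such a check in Python. The service name, endpoint, and timeout are invented for illustration; the real checks are driven by consul-alerts or probe commands, not this exact script.

import sys

import requests  # third-party HTTP client

HEALTH_URL = "http://scores-service.internal/health"  # hypothetical endpoint

def main() -> int:
    # Exit 0 when the service answers healthily, non-zero otherwise:
    # the same contract a liveness/readiness probe command expects.
    try:
        resp = requests.get(HEALTH_URL, timeout=5)
    except requests.RequestException:
        return 1  # connection failure: report unhealthy
    return 0 if resp.ok else 1

if __name__ == "__main__":
    sys.exit(main())

A scheduler (cron, a Kubernetes CronJob, or consul-alerts itself) would run this on an interval and alert on non-zero exits.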
23. Highly Customizable
Cloud native with CloudWatch
Annotating releases in Grafana
Self-managed with statsd
Infrastructure metrics
Custom application metrics (see the statsd sketch after this list)
Third-party integration monitoring
Alerting rules are “all or nothing”
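Custom application metrics boil down to a few statsd calls from inside the service. A rough sketch with the Python statsd client; the host, prefix, metric names, and helper function are all invented for illustration.

import statsd  # pip install statsd

# Daemon address and prefix are illustrative; real values live in config.
stats = statsd.StatsClient("statsd.internal", 8125, prefix="clearscore.api")

def fetch_score(user_id: str) -> int:
    # Stand-in for the real downstream call.
    return 720

def handle_score_request(user_id: str) -> dict:
    # Count every request so dashboards can graph traffic.
    stats.incr("score.requests")
    # Time the handler; shows up as latency percentiles in Grafana.
    with stats.timer("score.latency"):
        score = fetch_score(user_id)
    return {"user": user_id, "score": score}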
27. Traditional Vendors
Poor support for distributed microservices
Poor language support (Scala/Akka)
Mixed results on configurability
28. Enter Instana
Discovered quite by accident
Beautiful UI
Extremely easy to set up
Covered most of our desired features out of the box
Infrastructure monitoring
Microservice APM
End-user monitoring
33. You Build It, You Run It!
Delivery teams own their microservices
Responsible for performance and monitoring in dev/ci/stg environments
Ideally, incidents alert the responsible dev team (see the routing sketch below)
Unfortunately, we don’t quite do that
Sophisticated routing system: <picture of me>
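The aspiration behind “incidents alert the responsible dev team” is essentially a lookup from service to owning team. A toy sketch in Python; every service, team, and field name here is invented, and today this mapping mostly lives in one engineer's head.

# Hypothetical service-to-owning-team routing table.
OWNERS = {
    "scores-service": "team-scores",
    "offers-service": "team-offers",
}
DEFAULT_TARGET = "platform-oncall"

def route_alert(alert: dict) -> str:
    # Pick the channel/pager target for an incoming alert,
    # falling back to the platform on-call for unowned services.
    return OWNERS.get(alert.get("service", ""), DEFAULT_TARGET)

print(route_alert({"service": "scores-service", "severity": "critical"}))
# -> team-scores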
35. People Cause Problems
Things go wrong when people change things
Luckily, this means things go wrong during business hours (mostly)
Everyone is empowered to inspect the monitoring tools
The on-call team supports problem resolution; it doesn’t fix everything
Understanding teams and services drives platform improvement
36. Alert Grooming
Lots of noise on alert channels
Alert fatigue
“Boy who cried wolf” syndrome
Requires proactive maintenance of alerts
Fix ALL annoying alerts, even if that means fixing the alert, not the underlying service
The investment takes time, but pays dividends in productivity
37. Major Incidents
Zero-blame retros
Involve stakeholders
Generate action points with owners (and follow up)
Detailed incident reports with business-friendly summaries and cost estimates
39. Replatforming
HashiCorp platform
Great choice to get us to the cloud
Focused on supporting zillions of containers in an HPC environment
Limiting our scalability and speed of delivery
Encouraged the anti-pattern of integrating platform details into services
Kubernetes migration
Solves many of our problems
Natively supports blue-green deployments
Instana support for cluster health monitoring
Prometheus on-cluster monitoring
What to do with our statsd? (one Prometheus-shaped answer sketched below)
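One possible answer to the statsd question is to expose the same signals directly to Prometheus once we are on-cluster. A sketch with the official Python client; the port, metric, and label names are invented for illustration.

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# The same counters/timers we push to statsd today, exposed in
# Prometheus's pull model instead.
REQUESTS = Counter("api_requests_total", "API requests handled", ["endpoint"])
LATENCY = Histogram("api_request_seconds", "Request latency", ["endpoint"])

def handle_request(endpoint: str) -> None:
    REQUESTS.labels(endpoint=endpoint).inc()
    with LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(9100)  # serves /metrics for Prometheus to scrape
    while True:
        handle_request("score")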
40. Continuous Deployment 2.0
Investigating CD platforms
Spinnaker/Concourse/Drone
Routing non-prod alerts to development teams
Performance, tracing, and vulnerability issues should be flagged
42. Serverless
Functions as a service (on AWS Lambda)
Horizontal auto-scaling
“No Ops”
Cheap
Unsupported by traditional monitoring/tracing solutions (hence the push-metrics sketch below)
X-Ray tracing features with Instana
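Because the usual agents cannot run alongside a function, metrics have to be pushed from inside the handler instead. A sketch of a Python Lambda reporting a custom CloudWatch metric via boto3; the namespace, metric name, and event shape are invented for illustration.

import boto3

cloudwatch = boto3.client("cloudwatch")

def do_work(event) -> int:
    # Stand-in for the function's real job.
    return len(event.get("records", []))

def handler(event, context):
    # Example AWS Lambda entry point that reports its own business metric.
    processed = do_work(event)
    # No agent can run beside the function, so push the datapoint directly.
    cloudwatch.put_metric_data(
        Namespace="ClearScore/Serverless",  # invented namespace
        MetricData=[{
            "MetricName": "RecordsProcessed",
            "Value": processed,
            "Unit": "Count",
        }],
    )
    return {"processed": processed}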