This document surveys approaches to monitoring systems, from manual and reactive practices to proactive monitoring built on container orchestration tools. It gives examples of metrics to monitor at the host/hardware, networking, application, and orchestration layers, and it emphasizes the principles of observability: structured logging, events and traces enriched with metadata, and monitoring the monitoring systems themselves. The quoted speakers share best practices around failure prediction, understanding failure modes, and using chaos engineering to build system resilience.
4. Manual
● User initiated
● Interactive, command-line tools, simple scripts
● Checklist and process driven
Reactive
● Hardware-centric data collection
● Simple metric and log collection
● Siloed tools and information
● Manual analysis and remediation
Proactive
● Application-centric data collection
● End-to-end observability
● Key metrics and thresholds well understood
● Semi-automated analysis and remediation
7. The ‘What’
Blackbox monitoring — that is, monitoring a system from the outside by treating it as a blackbox — is something I find very good at answering the what, and at alerting about a problem that's already occurring (and ideally end-user-impacting).
Cindy Sridharan
Engineer @ Apple
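A minimal sketch of such a blackbox probe, assuming a hypothetical /healthz endpoint and illustrative status/latency thresholds; it observes only what an end user could see from the outside:

import time
import urllib.request

def probe(url, timeout=5.0):
    # Measure the service purely from the outside, as a user would.
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"up": resp.status == 200, "status": resp.status,
                    "latency_s": time.monotonic() - start}
    except Exception as exc:
        return {"up": False, "error": str(exc),
                "latency_s": time.monotonic() - start}

result = probe("https://shop.example.com/healthz")   # placeholder URL
if not result["up"] or result["latency_s"] > 2.0:    # illustrative SLO
    print("ALERT: end-user-visible symptom:", result)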
12. The USE Model
For every resource, check Utilization, Saturation, and Errors.
Resource: all physical server functional components (CPUs, disks, busses, ...)
Utilization: the average time that the resource was busy servicing work
Saturation: the degree to which the resource has extra work which it can't service, often queued
Errors: the count of error events
Brendan Gregg
Performance Engineer @ Netflix
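As a concrete illustration, here is a sketch of the USE checklist applied to one resource (the CPUs) on Linux, using only the Python standard library; the EDAC error-counter path is an assumption that is only populated on hardware with EDAC support:

import glob
import os
import time

def cpu_use_snapshot():
    # Utilization: fraction of time the CPUs were busy, from /proc/stat deltas.
    def busy_total():
        with open("/proc/stat") as f:
            vals = list(map(int, f.readline().split()[1:]))
        idle = vals[3] + vals[4]              # idle + iowait are not-busy time
        return sum(vals) - idle, sum(vals)
    b0, t0 = busy_total()
    time.sleep(1)
    b1, t1 = busy_total()
    utilization = (b1 - b0) / (t1 - t0)

    # Saturation: runnable work beyond what the CPUs can service (queued).
    load1, _, _ = os.getloadavg()
    saturation = max(0.0, load1 - os.cpu_count())

    # Errors: corrected hardware error counts, where EDAC exposes them.
    errors = sum(int(open(path).read())
                 for path in glob.glob("/sys/devices/system/edac/mc/mc*/ce_count"))
    return {"utilization": utilization, "saturation": saturation, "errors": errors}

print(cpu_use_snapshot())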
17. Evolving Workloads
As highly available cloud native infrastructure and
application workloads become more prevalent, more
care needs to be taken to get the monitoring systems
right, and to be sure that you are using dependable
metrics to dynamically manage your environments.
Adrian Cockcroft
VP Cloud Architecture @ AWS
20. The RED Model
Measure, for every microservice in your architecture:
(Request) Rate: the number of requests, per second, your services are serving.
(Request) Errors: the number of failed requests per second.
(Request) Duration: distributions of the amount of time each request takes.
Tom Wilkie
VP Product @ Grafana (Prev. @ Weaveworks)
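A minimal sketch of RED instrumentation with the Python prometheus_client library; the metric and handler names are illustrative, and the request rate is derived at query time from the request counter (e.g. rate(http_requests_total[1m]) in PromQL):

import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total requests", ["handler"])
ERRORS = Counter("http_request_errors_total", "Failed requests", ["handler"])
DURATION = Histogram("http_request_duration_seconds", "Request latency", ["handler"])

def handle(handler="checkout"):
    REQUESTS.labels(handler).inc()
    with DURATION.labels(handler).time():           # records the latency distribution
        try:
            time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work
            if random.random() < 0.02:              # simulated failure rate
                raise RuntimeError("downstream failure")
        except RuntimeError:
            ERRORS.labels(handler).inc()

start_http_server(8000)   # exposes /metrics for Prometheus to scrape
while True:
    handle()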
44. Metadata / Context
[Google has a] concept called tags. Tags are arbitrary key-value pairs we propagate all across the stack. Tags are propagated from top to very bottom, and each layer can add more to the context.
Tags often carry the originator library name, originator RPC name, etc. Once we retrieve instrumentation data from the low-end services, we can easily filter and point out which specific services, libraries or RPCs contributed to the state of things.
Jaana B. Dogan
Engineer @ Google
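The sketch below mimics this tag-propagation idea with Python's contextvars; it is not Google's implementation (OpenTelemetry "baggage" is the closest open-source analogue), and every name in it is illustrative:

import contextvars

TAGS = contextvars.ContextVar("tags", default={})

def with_tags(**extra):
    # Each layer merges its own tags in without disturbing the caller's set.
    return TAGS.set({**TAGS.get(), **extra})

def storage_layer():
    # The bottom of the stack still sees who originated the request.
    print("instrumentation tags:", TAGS.get())

def rpc_layer():
    token = with_tags(originator_rpc="GetUser")
    storage_layer()
    TAGS.reset(token)

token = with_tags(originator_library="frontend")
rpc_layer()   # prints both the originating library and the RPC name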
46. Chaos Engineering
"Chaos Engineering is the discipline of experimenting on
a distributed system in order to build confidence in the
system’s capability to withstand turbulent conditions in
production."
… from http://principlesofchaos.org/
Lorin Hochstein
Chaos Engineering @ Netflix
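A toy experiment in that spirit, with the injected fault and the success threshold as illustrative assumptions: state a steady-state hypothesis, inject turbulence, and verify the hypothesis still holds.

import random

def call_dependency(inject_fault=False):
    if inject_fault and random.random() < 0.3:
        raise TimeoutError("injected dependency timeout")
    return "ok"

def call_with_fallback(inject_fault):
    # The resilience mechanism under test: degrade to a cache, don't fail.
    try:
        return call_dependency(inject_fault)
    except TimeoutError:
        return "cached-fallback"

def steady_state(inject_fault, n=1000):
    # Steady state: fraction of requests that were served an answer.
    served = sum(call_with_fallback(inject_fault) in ("ok", "cached-fallback")
                 for _ in range(n))
    return served / n

assert steady_state(inject_fault=False) == 1.0   # baseline is healthy
assert steady_state(inject_fault=True) >= 0.99   # hypothesis survives the faults
print("steady state held under injected dependency timeouts")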
48. Monitoring the Monitoring
The first thing that would be useful is to have a
monitoring system that has failure modes which are
uncorrelated with the infrastructure it is monitoring. For
efficiency it is common to co-locate a monitoring system
with the infrastructure, in the same datacenter or cloud
region, but that sets up common dependencies that
could cause both to fail together.
Adrian Cockcroft
VP Cloud Architecture @ AWS
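One pattern that follows from this is a "dead man's switch": an independent watchdog, run from a different region or provider than the monitoring stack it watches, that pages when the monitoring system itself stops answering. A minimal sketch, with the health URL and the paging hook as placeholder assumptions:

import time
import urllib.request

MONITORING_HEALTH_URL = "https://monitoring.example.com/-/healthy"  # placeholder

def page_oncall(reason):
    print("PAGE:", reason)   # stand-in for a real paging integration

failures = 0
while True:
    try:
        with urllib.request.urlopen(MONITORING_HEALTH_URL, timeout=5) as resp:
            failures = 0 if resp.status == 200 else failures + 1
    except Exception:
        failures += 1
    if failures >= 3:        # tolerate transient blips before paging
        page_oncall("monitoring system unreachable from independent watchdog")
    time.sleep(60)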