The keynote discussed how containers can provide robustness and improved resource utilization. Containers isolate applications and allow groups of containers, called pods, to run together with shared resources. The key challenges discussed were unpredictable interference between containers, low resource utilization, and the difficulty of enforcing isolation. The solutions presented were using cgroups for isolation, allowing "slack" resources to be used by lower-priority tasks, and moving enforcement directly into the kernel. Kubernetes was introduced as an open-source project for orchestrating replicated pods across multiple machines by reconciling actual state with desired state.
2. Why we Love Containers
Over 2B containers launched per week
(even our VMs run inside containers)
1) Application-centric, not machine-centric view
• It is easier, more natural, and more productive
2) Essentially the way Google works internally:
• Signed static bundles + Linux containers (resolve dependencies up front)
We evolved here over the last decade…
but Docker made it exciting and much easier to use (thanks!)
3. First Problem: Unpredictable Interference
Containers interfere with each other
• Unimportant things break important things
• We want fair use among equally important things
Solution: resource & performance isolation
Series of open-source solutions:
2005: cpusets + “fake” NUMA to partition cores, memory
2006: cgroups for general task hierarchies
2009: bandwidth fair use, QoS levels
2010: memcg for better memory accounting, enforcement
Status: isolation works well in practice (if you use these tools)
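As a concrete illustration of the cgroup mechanism above (a minimal sketch, not LMCTFY itself), the Go program below creates a memory cgroup, sets a hard limit, and moves the current process into it. It assumes cgroup v1 controllers mounted at /sys/fs/cgroup and requires root; the path and the 512 MiB limit are illustrative only.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Illustrative path; assumes the v1 memory controller is mounted here.
	cg := "/sys/fs/cgroup/memory/demo"
	if err := os.MkdirAll(cg, 0755); err != nil {
		panic(err)
	}
	// Hard memory limit enforced by the kernel (memcg), not by a control loop.
	limit := []byte(fmt.Sprintf("%d", 512*1024*1024))
	if err := os.WriteFile(filepath.Join(cg, "memory.limit_in_bytes"), limit, 0644); err != nil {
		panic(err)
	}
	// Move this process into the cgroup so the limit applies to it.
	pid := []byte(fmt.Sprintf("%d", os.Getpid()))
	if err := os.WriteFile(filepath.Join(cg, "tasks"), pid, 0644); err != nil {
		panic(err)
	}
	fmt.Println("running under a 512 MiB memory limit in", cg)
}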
4. Second Problem: Low Utilization
Tier 1: Live services (e.g. search engine)
• Provision for peak load (2-10x higher than average)
• High priority, always get resources when needed
Tier 2: Batch jobs (e.g. MapReduce)
• Run in the leftovers, never displace Tier 1
• Lots of capacity — rarely at peak load
If you partition resources, utilization goes down…
Solution: controlled use of slack resources (free $$)
Status: Our OSS container solutions support this well
Note: Google does not overcommit customer VMs — you get the whole VM all the time
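A hedged Go sketch of the two-tier slack idea (not Google's actual scheduler): batch work may use whatever Tier 1 has reserved but is not currently using, and is squeezed out as Tier 1 usage rises. All numbers are made up.

package main

import "fmt"

type Machine struct {
	CapacityCPU   float64 // total cores on the machine
	Tier1Reserved float64 // provisioned for peak live-service load
	Tier1Used     float64 // actual live-service usage right now
}

// BatchBudget returns the cores batch jobs may use: the unreserved capacity
// plus the slack inside Tier 1's reservation. Tier 1 can reclaim it anytime.
func BatchBudget(m Machine) float64 {
	slack := m.Tier1Reserved - m.Tier1Used
	if slack < 0 {
		slack = 0
	}
	return (m.CapacityCPU - m.Tier1Reserved) + slack
}

func main() {
	m := Machine{CapacityCPU: 32, Tier1Reserved: 24, Tier1Used: 6} // illustrative
	fmt.Printf("batch may use %.0f of %.0f cores right now\n", BatchBudget(m), m.CapacityCPU)
}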
5. Third Problem: Hard to Enforce Isolation
Bad way: control loop (see LPC 2011)
• Read stats, verify allocation, tune knobs, repeat
• Slow response time, fragile
Right way:
• Direct enforcement in the kernel
• Many patches to make this happen… (e.g. memcg)
Status: enforcement now mostly in the kernel
• Caches, memory bandwidth can still cause interference
• Challenges getting these changes accepted upstream
• Meta control loop: detect interference and migrate tasks (see CPI2)
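A rough Go sketch of that meta control loop, in the spirit of CPI2 but not its actual implementation: flag tasks whose cycles-per-instruction drift far above their baseline and migrate them. The 2x threshold and the sample data are illustrative assumptions.

package main

import "fmt"

type Task struct {
	Name        string
	BaselineCPI float64 // expected cycles per instruction for this job
	CurrentCPI  float64 // measured via hardware performance counters
}

// interfered flags a task whose CPI is far above its baseline
// (2x is an illustrative threshold, not the real policy).
func interfered(t Task) bool {
	return t.CurrentCPI > 2.0*t.BaselineCPI
}

func main() {
	tasks := []Task{
		{"websearch-17", 1.1, 1.2},
		{"websearch-42", 1.1, 3.0}, // suffering from a noisy neighbor
	}
	for _, t := range tasks {
		if interfered(t) {
			fmt.Println("migrate", t.Name) // or throttle the antagonist instead
		}
	}
}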
6. “Let Me Contain That For You” (LMCTFY = “L-M-C-T-fee”)
You want this, but didn’t know it
• Declarative allocation, prioritization of resources
• Enforces resource isolation, with multiple hierarchies
• Many resources: CPU, memory, bandwidth, latency, disk I/O, …
• Enables better utilization
• Stable API, as kernel mechanisms continue to evolve
• Released as OSS in 2013 (see LPC 2013)
OSS containers based on Docker are a core foundation for the future
• Many contributors over the decade: SGI, LXC, RedHat, Parallels, Docker, …
• We want to move LMCTFY functionality into Docker’s libcontainer
• Released for Docker Hackathon: cAdvisor for container stats & alerts (written in Go)
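A hedged Go sketch of pulling container stats from a local cAdvisor instance. It assumes cAdvisor is listening on localhost:8080 and serving its v1.x REST API under /api/v1.3/containers/; adjust the port and path for the version you actually run.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Assumed endpoint; check your cAdvisor version for the exact API path.
	resp, err := http.Get("http://localhost:8080/api/v1.3/containers/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Decode into a generic map rather than guessing cAdvisor's exact types.
	var info map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		panic(err)
	}
	fmt.Printf("root container info has %d top-level fields\n", len(info))
}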
7. Pods (or how we really use containers)
We actually use groups of nested containers = pods
• Use LMCTFY for nesting, isolation & utilization
• Many things implemented as helpers:
• Logging and log rotation
• Content management system + webserver
Pod attributes:
• Deployed together (in a parent container!)
• Shared local volumes
• Individual IP address (even if multiple pods per VM)
• Ensures clean port allocation
OK, we don’t use a single IP per pod, but we should have…
Without this, you need to track and distribute port allocations, since ports must be late bound...
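A small Go sketch that models the pod attributes listed above (illustrative only, not the real Kubernetes API types): containers deployed together, shared local volumes, and a single per-pod IP so each container can bind well-known ports without late-bound allocation.

package main

import "fmt"

type Container struct {
	Name  string
	Image string
	Port  int // well-known port, safe because the pod has its own IP
}

type Pod struct {
	Name       string
	IP         string      // one IP per pod, even with multiple pods per VM
	Volumes    []string    // local volumes shared by all containers in the pod
	Containers []Container // main app plus helpers (logging, CMS, ...)
	Labels     map[string]string
}

func main() {
	p := Pod{ // all names and values are illustrative
		Name:    "web-1",
		IP:      "10.0.0.7",
		Volumes: []string{"/data/content"},
		Containers: []Container{
			{Name: "webserver", Image: "nginx", Port: 80},
			{Name: "content-manager", Image: "cms", Port: 9000},
			{Name: "log-rotator", Image: "logrotate"},
		},
		Labels: map[string]string{"name": "web", "stage": "production"},
	}
	fmt.Printf("%s serves on %s:%d\n", p.Name, p.IP, p.Containers[0].Port)
}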
8. Kubernetes “koo ber NAY tace” — Greek for “helmsman”
New OSS release: orchestrating replicated pods across multiple nodes
Craig McLuckie, Brendan Burns to cover at 2pm today
Master:
• Manages worker pods dynamically
• Uses etcd to track desired configuration
Workers (k nodes):
• Replicated Docker image
• Parameterized: arguments passed in via environment variables
• Shared view of load-balanced services
[Diagram: Master = API Server + Replica Controller + etcd; each worker node = Kubelet + Service Proxy + Docker]
9. Concept 1: Labels and Services
Service = load-balanced replica set
• Pod labels ⇒ the services they implement
• Pods access services via localhost:<port>
• (Local) proxy sends traffic to member of set
• Ports are the service “names”
Service Definition (JSON):
{
  "id": "redisslave",
  "port": 10001,
  "labels": {
    "name": "redisslave"
  }
}
Partial pod definition (JSON):
"labels": {
  "name": "redisslave"
}
Pods have labels
Many overlapping sets of labels:
• stage: production
• zone: west
• name: redis
• version: 2.6
Replica set = a group of pods with the same labels
The set is defined by a query (not a static list)
(because entropy happens)
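A short Go sketch of the “set is a query” idea: a selector is a conjunction of required label key/value pairs, and membership is computed on the fly by matching pod labels. The pods and labels below are illustrative.

package main

import "fmt"

type Pod struct {
	Name   string
	Labels map[string]string
}

// matches reports whether a pod carries every label in the selector
// (a simple conjunction, as described above).
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	pods := []Pod{
		{"redisslave-1", map[string]string{"name": "redisslave", "stage": "production", "zone": "west"}},
		{"redisslave-2", map[string]string{"name": "redisslave", "stage": "test"}},
	}
	selector := map[string]string{"name": "redisslave", "stage": "production"}
	for _, p := range pods {
		if matches(selector, p.Labels) {
			fmt.Println("in service:", p.Name) // only redisslave-1 qualifies
		}
	}
}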
10. Concept 2: The Reconciler Model
Key idea: Declare the desired state
Loop { // the reconciler loop, run by the master
• Query the actual state of the system
• Compare with desired state
• Implement corrections (if any) // reconcile reality with desired state
}
In Kubernetes:
desiredState:
  replicas: 2
(if we lose a replica for some reason, add one)
Having an explicit desired state is a good idea!
Otherwise you can’t tell whether the desire changed or the actual state changed
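A minimal Go sketch of the reconciler loop above: compare the actual replica count with the desired count and correct the difference. The in-memory “cluster” and the start/stop hooks are stand-ins for the real Kubernetes machinery (etcd, the replica controller, kubelets).

package main

import "fmt"

// reconcile brings the actual replica set to the desired size.
func reconcile(desired int, actual []string, start func() string, stop func(string)) []string {
	for len(actual) < desired { // a replica was lost: add one
		actual = append(actual, start())
	}
	for len(actual) > desired { // too many: remove one
		stop(actual[len(actual)-1])
		actual = actual[:len(actual)-1]
	}
	return actual
}

func main() {
	desired := 2                  // desiredState: replicas: 2
	actual := []string{"pod-a"}   // one replica disappeared ("entropy happens")
	n := 0
	start := func() string { n++; return fmt.Sprintf("pod-new-%d", n) }
	stop := func(name string) { fmt.Println("stopping", name) }

	// In the real system the master runs this periodically against etcd state.
	actual = reconcile(desired, actual, start, stop)
	fmt.Println("replicas now:", actual)
}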
11. Robust Containers
Docker (used well) ⇒ clean, repeatable images
Single Node (pods):
• Allocate ports per pod (conflict free!)
• Attach data-only containers to the pod (as volumes)
(clean sharing of data)
• “Parameterized containers” using environment variables
Multi-Node:
• Labels for time-varying overlapping sets
• Services are load-balanced groups of replicated pods
• The Reconciler Model recovers from changes (expected or not)
(actually used at worker level and master level)
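A small Go sketch of a “parameterized container” as described above: the same image changes behavior based on environment variables passed in at pod creation. The variable names REDIS_MASTER_HOST and SHARD_ID are hypothetical.

package main

import (
	"fmt"
	"os"
)

// envOr reads an environment variable, falling back to a default.
func envOr(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

func main() {
	// Hypothetical parameters injected by the pod definition.
	master := envOr("REDIS_MASTER_HOST", "localhost:6379")
	shard := envOr("SHARD_ID", "0")
	fmt.Printf("starting worker for shard %s, replicating from %s\n", shard, master)
}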
12. Containers are the Path to “Cloud Native”
Pods as a building block
• Clean port namespace
• Shared volumes
• Isolation, prioritization, tools for utilization
• Auto restart (don’t run supervisord k times)
• Liveness probes, stats for load balancing (see the probe sketch below)
• sshd in environment (not in your container)
Application-level cloud events per container or pod
• Start, stop, restart
• Notification of migration, resizing, new shards, ...
• Resource alerts, OOM management
Services and labels
Reconciliation
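A hedged Go sketch of the liveness probe mentioned above: the application exposes a health endpoint that the platform polls, restarting the pod if it stops answering. The /healthz path and port 8080 are illustrative, not a fixed contract.

package main

import "net/http"

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// Report healthy; a real app would check its own dependencies here.
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", nil)
}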
13. Summary
We are standardizing around the Docker container image
• Pushing for usable, scalable, open containers
• Isolation, nesting, utilization, enforcement
• Moving to Go to simplify integration (and because we like it)
Thanks to Docker…
for making containers lightweight, easy to use, and exciting!
We look forward to creating a great robust space together
News today:
• Kubernetes: see Craig & Brendan at 2pm today
• Docker on GAE: see Ekaterina Volkova at 2:50pm today
• cAdvisor: stats & alerts for containers
Editor’s Notes
Static binaries prevent changes in behavior due to changes in libraries, much like Docker containers pre-resolve the file system and packages.
Signing the binaries adds security by preventing tampering with the binaries once produced.
2004: SGI started cpusets, influenced by Solaris and others. They were the first to need to deal with more cores. http://man7.org/linux/man-pages/man7/cpuset.7.html
2005: NUMA aware memory usage essentially put physical memory into different groups so that you could allocate memory from the right place (nearby). Fake NUMA makes up artificial groups as a way to limit memory usage to a group.
2006: cpusets had some support for hierarchies for CPUs; cgroups generalized it. Then each subsystem was modified to make use of cgroups for allocation and/or accounting.
2010: memcg (=memory cgroup) adds better control over memory allocation and accounting and put enforcement in the kernel.
We actually use more than two levels; LMCTFY supports four. Two gets across the core idea, but you get higher utilization as you get more sophisticated with policies.
By the way, we generally have to fight to get these container changes accepted upstream. For example, we would like OOM handling to be more flexible. See http://lwn.net/Articles/591990/ for a discussion about how to handle OOM management in the kernel (or not). See http://lwn.net/Articles/589404/ for a representative patch set for user-level OOM handling and its motivation.
Really important to have a declarative, stable API about how to allocate and prioritize resources. The direct use of kernel knobs, although more powerful, leads to a wide, fragile API, extra complexity, and poor evolution. It hinders both the app AND the kernel, since the kernel must support the wide complex API for a very long time.
cAdvisor released on June 6th, 2014 for Docker Hackathon as part of LMCTFY: https://github.com/google/lmctfy/tree/master/cadvisor
It may move to its own project to facilitate integration.
Pods that offer the service “myService” need to have the labels “stage: production” AND “zone: west” (simple conjunction for now). A more realistic use would have labels for a service name and version as well.
The service “myService” can be found on localhost:8000. That request goes to the local proxy (nginx), which forwards it to the best pod.