Join Laura Frank and Stephen Day as they explain and examine the technical concepts behind container orchestration systems, such as distributed consensus, object models, and node topology. These concepts form the foundation of every modern orchestration system, and each technical explanation is illustrated using Docker’s SwarmKit as a real-world example. Gain a deeper understanding of how orchestration systems like SwarmKit work in practice, and walk away with more insight into your production applications.
6. Orchestration
A control system for your cluster

[Diagram: control loop. The orchestrator O compares the desired state D with the observed cluster state S_t and applies operations Δ to the cluster C; the cluster's new state feeds back into the loop.]

D = Desired State
O = Orchestrator
C = Cluster
S_t = State at time t
Δ = Operations to converge S_t to D

https://en.wikipedia.org/wiki/Control_theory
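To make the loop concrete, here is a minimal Go sketch of an orchestrator as a control system. The State type and the diff/observe/apply functions are illustrative stand-ins under these assumptions, not SwarmKit's actual API.

package main

import (
	"fmt"
	"time"
)

// State is a simplified snapshot of the cluster: how many replicas of
// each service are running. The real object model is far richer.
type State map[string]int

// diff computes the operations Δ needed to converge the current state
// S_t toward the desired state D. Scale-down is omitted for brevity.
func diff(desired, current State) (ops []string) {
	for svc, want := range desired {
		for have := current[svc]; have < want; have++ {
			ops = append(ops, "start one task of "+svc)
		}
	}
	return ops
}

// runControlLoop is the essence of an orchestrator: repeatedly observe
// S_t, compare it to D, and apply Δ to converge toward desired state.
func runControlLoop(desired State, observe func() State, apply func(op string)) {
	for {
		for _, op := range diff(desired, observe()) {
			apply(op)
		}
		time.Sleep(time.Second)
	}
}

func main() {
	current := State{"redis": 1}
	runControlLoop(
		State{"redis": 3}, // D: three replicas of redis
		func() State { return current },
		func(op string) {
			fmt.Println(op)
			current["redis"]++ // pretend the operation succeeded
		},
	)
}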
9. Data Model Requirements
- Represent differences in cluster state
- Maximize observability
- Support convergence
- Do this while being extensible and reliable
10. Show me your data structures and I’ll show you your orchestration system
11. Services
- Express desired state of the cluster
- Abstraction to control a set of containers (sketched below)
- Enumerate resources, network availability, and placement
- Leave the details of runtime to the container process
- Implement these services by distributing processes across a cluster

[Diagram: a service's tasks distributed across Node 1, Node 2, and Node 3]
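As a hedged sketch of what such a service object might carry, the Go struct below names fields for replicas, networking, resources, and placement. Every field name here is an illustrative simplification; SwarmKit's real definitions are protobuf-generated in its api package.

package main

import "fmt"

// ServiceSpec is a simplified sketch of the declarative object a user
// hands to the orchestrator.
type ServiceSpec struct {
	Name      string
	Image     string   // which image to run; runtime details are left to the container process
	Replicas  int      // how many tasks to spread across the cluster
	Networks  []string // networks the tasks attach to
	Ports     []PortConfig
	Resources Resources
	Placement []string // placement constraints, e.g. "node.role == worker"
}

// PortConfig maps a port published on the cluster to a container port.
type PortConfig struct {
	Published int
	Target    int
}

// Resources enumerates what each task may reserve.
type Resources struct {
	NanoCPUs    int64 // 1e9 means one full core
	MemoryBytes int64
}

func main() {
	// Roughly what the declarative CLI example on the next slide expresses.
	spec := ServiceSpec{
		Name:     "serene_euler",
		Image:    "redis",
		Replicas: 3,
		Networks: []string{"backend"},
		Ports:    []PortConfig{{Published: 6379, Target: 6379}},
	}
	fmt.Printf("%+v\n", spec)
}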
12. Declarative

$ docker network create -d overlay backend
vpd5z57ig445ugtcibr11kiiz

$ docker service create -p 6379:6379 --network backend redis
pe5kyzetw2wuynfnt7so1jsyl

$ docker service scale serene_euler=3
serene_euler scaled to 3

$ docker service ls
ID            NAME          REPLICAS  IMAGE  COMMAND
pe5kyzetw2wu  serene_euler  3/3       redis
23. Orchestration
A control system for your cluster

[Diagram: the same control loop as slide 6. The orchestrator O compares the desired state D with the observed cluster state S_t and applies operations Δ to the cluster C.]

D = Desired State
O = Orchestrator
C = Cluster
S_t = State at time t
Δ = Operations to converge S_t to D

https://en.wikipedia.org/wiki/Control_theory
28. Push Model vs Pull Model

[Diagram: in the push model, (1) the Worker registers with a discovery system such as ZooKeeper, (2) the Manager discovers the Worker, and (3) the Manager pushes the payload to it; in the pull model, the Worker connects directly to the Manager for both registration and payload.]
29. Push Model vs Pull Model

Push Model
  Pros:
  - Provides better control over communication rate
    (Managers decide when to contact Workers)
  Cons:
  - Requires a discovery mechanism
  - More failure scenarios
  - Harder to troubleshoot

Pull Model
  Pros:
  - Simpler to operate
  - Workers connect to Managers and don’t need to bind
  - Can easily traverse networks
  - Easier to secure
  - Fewer moving parts
  Cons:
  - Workers must maintain a connection to Managers at all times
30. Push vs Pull
• SwarmKit adopted the Pull model
• Favored operational simplicity
• Engineered solutions to provide rate control in pull mode
32. Rate Control: Heartbeats
• Manager dictates the heartbeat rate to Workers
• Rate is configurable (though not by the end user)
• Managers agree on the same rate by consensus via Raft
• Managers add jitter so pings are spread over time, avoiding bursts (see the sketch below)

[Diagram: Worker pings the Manager ("Ping? Pong!"); the Manager replies "Ping me back in 5.2 seconds"]
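A minimal Go sketch of this exchange, assuming a ping callback that stands in for the real heartbeat RPC; the retry fallback and jitter computation are illustrative, not SwarmKit's actual code.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// heartbeat loops forever: ping the manager, then sleep for however long
// the manager said to wait before the next ping.
func heartbeat(ping func() (next time.Duration, err error)) {
	for {
		next, err := ping()
		if err != nil {
			next = time.Second // crude retry delay; real code would reconnect
		}
		time.Sleep(next)
	}
}

// managerPeriod is what the manager side might compute: the cluster-wide
// base rate (agreed on via Raft) plus per-response jitter, so pings from
// many workers spread out instead of arriving in bursts.
func managerPeriod(base time.Duration) time.Duration {
	jitter := time.Duration(rand.Int63n(int64(base / 4)))
	return base + jitter // e.g. 5s base -> "ping me back in 5.2 seconds"
}

func main() {
	heartbeat(func() (time.Duration, error) {
		next := managerPeriod(5 * time.Second)
		fmt.Println("pong! ping me back in", next)
		return next, nil
	})
}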
33. Rate Control: Workloads
• Worker opens a gRPC stream to receive workloads
• Manager can send data whenever it wants to
• Manager sends data in batches
• Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first (see the sketch below)
• Adds little delay (at most 100 ms) but drastically reduces the amount of communication

[Diagram: Worker asks the Manager "Give me work to do"; the Manager answers over the stream with batches: 100ms - [Batch of 12], 200ms - [Batch of 26], 300ms - [Batch of 32], 340ms - [Batch of 100], 360ms - [Batch of 100], 460ms - [Batch of 42], 560ms - [Batch of 23]]
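The batching rule above (flush at 100 changes or 100 ms, whichever comes first) can be sketched in Go with a buffered collector; the channel types and demo sizes are illustrative assumptions, not SwarmKit's implementation.

package main

import (
	"fmt"
	"time"
)

const (
	maxBatch = 100                    // flush when the buffer reaches 100 changes...
	maxDelay = 100 * time.Millisecond // ...or every 100 ms, whichever comes first
)

// batcher buffers incoming changes and hands them to send in batches,
// trading at most maxDelay of latency for far fewer messages on the wire.
func batcher(changes <-chan string, send func(batch []string)) {
	var buf []string
	timer := time.NewTimer(maxDelay)
	defer timer.Stop()
	for {
		select {
		case c, ok := <-changes:
			if !ok { // stream closed: flush whatever is left
				if len(buf) > 0 {
					send(buf)
				}
				return
			}
			buf = append(buf, c)
			if len(buf) >= maxBatch {
				send(buf)
				buf = nil
			}
		case <-timer.C:
			if len(buf) > 0 {
				send(buf)
				buf = nil
			}
			timer.Reset(maxDelay)
		}
	}
}

func main() {
	ch := make(chan string)
	go func() {
		for i := 0; i < 250; i++ {
			ch <- fmt.Sprintf("task update %d", i)
		}
		close(ch)
	}()
	batcher(ch, func(batch []string) {
		fmt.Printf("sending batch of %d\n", len(batch))
	})
}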
36. Replication

[Diagram: three Managers (one Leader, two Followers) with Workers connected across all of them]

• Followers multiplex all of their workers to the Leader using a single connection
• Backed by gRPC channels (HTTP/2 streams)
• Reduces the Leader's networking load by spreading the connections evenly

Example: on a cluster with 10,000 workers and 5 managers, each manager only has to handle about 2,000 connections, and each follower forwards its 2,000 workers to the leader over a single socket. The sketch below illustrates the fan-in.
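A toy Go sketch of the fan-in idea: a follower forwards messages from its local worker sessions onto one upstream channel, standing in for its single gRPC (HTTP/2) connection to the leader. Names and types are assumptions for illustration.

package main

import (
	"fmt"
	"sync"
)

// follower fans in messages from many worker sessions onto a single
// channel to the leader: many workers, one socket.
func follower(workers []<-chan string, toLeader chan<- string) {
	var wg sync.WaitGroup
	for _, w := range workers {
		wg.Add(1)
		go func(w <-chan string) {
			defer wg.Done()
			for msg := range w {
				toLeader <- msg // multiplexed onto the single connection
			}
		}(w)
	}
	wg.Wait()
	close(toLeader)
}

func main() {
	// Three workers attached to this follower; in the slide's example a
	// follower would carry ~2,000 of the cluster's 10,000 workers.
	outs := make([]<-chan string, 3)
	for i := range outs {
		w := make(chan string, 1)
		w <- fmt.Sprintf("status from worker %d", i)
		close(w)
		outs[i] = w
	}
	toLeader := make(chan string)
	go follower(outs, toLeader)
	for msg := range toLeader {
		fmt.Println("leader received:", msg)
	}
}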
37. Replication

[Diagram: the same Leader/Follower topology as slide 36, with Workers attached]

• Upon Leader failure, a new one is elected
• All managers start redirecting worker traffic to the new Leader
• Transparent to workers
43. Presence
• Leader commits Worker state (Up vs Down) into Raft
  − Propagates to all managers
  − Recoverable in case of leader re-election
• Heartbeat TTLs are kept in Leader memory
  − Too expensive to store “last ping time” in Raft: every ping would result in a quorum write
  − Leader keeps worker<->TTL in a heap (time.AfterFunc), as sketched below
  − Upon leader failover, workers are given a grace period to reconnect
    • Workers are considered Unknown until they reconnect
    • If they do, they move back to Up
    • If they don’t, they move to Down
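A minimal Go sketch of leader-side presence tracking with time.AfterFunc, which the slide mentions; the presence type, TTL value, and callback are illustrative assumptions, and the heap and grace-period handling are omitted.

package main

import (
	"fmt"
	"sync"
	"time"
)

// presence tracks worker liveness in leader memory only: each ping rearms
// a TTL timer rather than writing to Raft, since a quorum write per ping
// would be far too expensive. Only Up/Down transitions would be committed.
type presence struct {
	mu     sync.Mutex
	timers map[string]*time.Timer
	ttl    time.Duration
	onDown func(worker string)
}

func (p *presence) heartbeat(worker string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if t, ok := p.timers[worker]; ok {
		t.Reset(p.ttl) // worker pinged in time: push its deadline out
		return
	}
	p.timers[worker] = time.AfterFunc(p.ttl, func() { p.onDown(worker) })
}

func main() {
	p := &presence{
		timers: make(map[string]*time.Timer),
		ttl:    100 * time.Millisecond,
		onDown: func(w string) { fmt.Println(w, "is Down") },
	}
	p.heartbeat("worker-1")
	p.heartbeat("worker-2")
	time.Sleep(50 * time.Millisecond)
	p.heartbeat("worker-1") // only worker-1 keeps pinging; its TTL is rearmed
	time.Sleep(300 * time.Millisecond)
}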
45. The Raft Consensus Algorithm

Orchestration systems typically use some kind of service to maintain state in a distributed system:
- etcd
- ZooKeeper
- …

Many of these services are backed by the Raft consensus algorithm.
46. SwarmKit and Raft

Docker chose to implement the algorithm directly:
- Fast
- Don’t have to set up a separate service to get started with orchestration
- Differentiator between SwarmKit/Docker and other orchestration systems
49. Sequencer
● Every object in the store has a Version field
● Version stores the Raft index when the object was last updated
● Updates must provide a base Version and are rejected if it is out of date
● Similar to compare-and-swap (CAS); sketched below
● Also exposed through API calls that change objects in the store
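A hedged Go sketch of the sequencer idea: each object carries a Version standing in for the Raft index at which it was last updated, and updates that present a stale base version are rejected. The Store type and method names are illustrative, not SwarmKit's store API.

package main

import (
	"errors"
	"fmt"
	"sync"
)

// Object carries a Version: the (stand-in) Raft index of its last update.
type Object struct {
	Value   string
	Version uint64
}

var ErrSequenceConflict = errors.New("update out of sequence: base version is stale")

// Store rejects updates whose base Version no longer matches, CAS-style.
type Store struct {
	mu      sync.Mutex
	objects map[string]Object
	index   uint64 // stand-in for the Raft log index
}

func (s *Store) Update(id string, base uint64, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur := s.objects[id]
	if cur.Version != base {
		return ErrSequenceConflict // caller must re-read and retry
	}
	s.index++
	s.objects[id] = Object{Value: value, Version: s.index}
	return nil
}

func main() {
	s := &Store{objects: map[string]Object{"svc": {Value: "replicas=1"}}}
	fmt.Println(s.Update("svc", 0, "replicas=3")) // <nil>: base matched
	fmt.Println(s.Update("svc", 0, "replicas=5")) // conflict: base is stale
}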