An algorithm that makes it possible to monitor full-mesh clusters of up to 1000 nodes without applying fast-timer supervision between all node pairs. Failure discovery time is by default between one and two seconds, but can be configured to be shorter.
2. PURPOSE
When a cluster node becomes unresponsive due to crash, reboot or lost connectivity, we want to:
• Have all affected connections on the remaining nodes aborted
• Inform other users who have subscribed to cluster connectivity events
• Do both within a well-defined, short interval from the occurrence of the event
3. COMMON SOLUTIONS
1) Crank up the connection keepalive timer (see the sketch below)
  ◦ Network and CPU load quickly get out of hand when there are thousands of connections
  ◦ Does not provide a neighbor monitoring service that can be used by others
2) Dedicated full-mesh framework of per-node daemons with frequently probed connections
  ◦ Even here, monitoring traffic becomes overwhelming when cluster size exceeds ~100 nodes
  ◦ Does not automatically abort any other connections
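To make option 1) concrete, here is a minimal sketch of aggressive per-connection keepalive on one Linux TCP socket, using the standard SO_KEEPALIVE/TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT options; the chosen values are arbitrary. Multiply this probing by thousands of connections per node and the load problem becomes obvious.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Illustration only: aggressive per-connection keepalive tuning.
 * Every connection carries its own probe traffic, so the total
 * load grows with the number of connections, not nodes. */
static int set_aggressive_keepalive(int sock)
{
	int on = 1, idle = 1, intvl = 1, cnt = 2;

	if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)))
		return -1;
	if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)))
		return -1;   /* start probing after 1 s of idle time */
	if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)))
		return -1;   /* probe once per second */
	/* declare the peer dead after 2 missed probes */
	return setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}

int main(void)
{
	int sock = socket(AF_INET, SOCK_STREAM, 0);

	return sock >= 0 ? set_aggressive_keepalive(sock) : 1;
}
```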
4. TIPC SOLUTION: HIERARCHY + FULL MESH
• Full-mesh framework of frequently probed node-to-node "links"
  ◦ At kernel level
  ◦ Provides a generic neighbor monitoring service
• Each link endpoint keeps track of all connections to the peer node
  ◦ Issues an "ABORT" message to its local socket endpoints when connectivity to the peer node is lost (sketched below)
• Even this solution causes excessive traffic beyond ~100 nodes
  ◦ CPU load per node grows with ~N
  ◦ Network load grows with ~N×(N-1)
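A minimal sketch of the per-link connection tracking described above; all type and function names here are hypothetical and do not reflect the actual TIPC kernel code:

```c
#include <stdio.h>

/* Each link endpoint tracks the connections it carries, so that
 * all of them can be aborted at once when the peer node is lost. */
struct connection {
	struct connection *next;
	int local_sock;
};

struct link {
	unsigned int peer_node;
	struct connection *conns;     /* all connections carried by this link */
};

static void send_abort(int sock)
{
	printf("ABORT delivered to local socket %d\n", sock); /* stand-in */
}

/* Called when link supervision declares the peer node lost */
static void link_peer_lost(struct link *l)
{
	for (struct connection *c = l->conns; c; c = c->next)
		send_abort(c->local_sock);
	l->conns = NULL;
}

int main(void)
{
	struct connection c2 = { 0, 8 }, c1 = { &c2, 7 };
	struct link l = { 42, &c1 };

	link_peer_lost(&l);
	return 0;
}
```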
5. OTHER SOLUTION: RING
• Each node monitors its two nearest neighbors by heartbeats
  ◦ Low monitoring network overhead; it increases by only ~2×N
• Node loss can also be detected through loss of an iterating token
  ◦ Both solutions are offered by Corosync
• Hard to handle accidental network partitioning
  ◦ How do we detect loss of nodes not adjacent to the fracture point in the opposite partition?
  ◦ Consensus on the ring topology is required
6. OTHER SOLUTION: GOSSIP PROTOCOL
• Each node periodically transmits its known network view to a randomly selected set of known neighbors
  ◦ Each node knows and monitors only a subset of all nodes
  ◦ Scales extremely well
  ◦ Used by the BitTorrent client Tribler
• Non-deterministic delay until all cluster nodes are informed (see the simulation sketch below)
  ◦ Potentially very long because of the periodic and random nature of event propagation
  ◦ Unpredictable number of generations to reach the last node
  ◦ Extra network overhead because of duplicate information spreading
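The non-determinism can be illustrated with a toy push-gossip simulation. The fanout of 3 and fully random peer selection are assumptions for illustration, not Tribler's actual protocol; re-running with different seeds yields different round counts, and duplicate deliveries are the extra network overhead mentioned above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy push-gossip: every informed node forwards the event to
 * FANOUT randomly chosen peers each round, until all N know. */
#define N 800
#define FANOUT 3

int main(void)
{
	char informed[N] = { 1 };   /* node 0 detects the event */
	int rounds;

	srand(1);                   /* change the seed: the round count changes */
	for (rounds = 1; ; rounds++) {
		char next[N];
		int i, f, done = 1;

		memcpy(next, informed, sizeof(next));
		for (i = 0; i < N; i++) {
			if (!informed[i])
				continue;
			for (f = 0; f < FANOUT; f++)
				next[rand() % N] = 1;   /* duplicates = wasted traffic */
		}
		memcpy(informed, next, sizeof(informed));
		for (i = 0; i < N; i++)
			done &= informed[i];
		if (done)
			break;
	}
	printf("all %d nodes informed after %d rounds\n", N, rounds);
	return 0;
}
```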
7. THE CHALLENGE
Finding an algorithm which:
• Has the scalability of Gossip, but with
  ◦ A deterministic set of peer nodes to monitor and update from each node
  ◦ A predictable number of propagation generations before all nodes are reached
  ◦ Predictable, well-defined and short event propagation delay
• Has the light-weight properties of ring monitoring, but
  ◦ Is able to handle accidental network partitioning
• Has the full-mesh link connectivity of TIPC, but
  ◦ Does not require full-mesh active monitoring
8. THE ANSWER: OVERLAPPING RING MONITORING
• Sort all cluster nodes into a circular list
  ◦ All nodes use the same algorithm and sorting criteria
• Select the next ⌈√N⌉ - 1 downstream nodes in the list as the "local domain" to be actively monitored
  ◦ CPU load increases by only ~√N
• Distribute a record describing the local domain to all other nodes in the cluster
• Select and monitor a set of "head" nodes outside the local domain, so that no node is more than two active monitoring hops away
  ◦ There will be ⌈√N⌉ - 1 such nodes
  ◦ This guarantees failure discovery even at accidental network partitioning
• Each node now monitors 2 × (⌈√N⌉ - 1) neighbors
  ▪ 6 neighbors in a 16-node cluster
  ▪ 56 neighbors in an 800-node cluster
• All nodes use this algorithm (a selection sketch follows below)
• In total, 2 × (⌈√N⌉ - 1) × N actively monitored links
  ▪ 96 links in a 16-node cluster
  ▪ 44,800 links in an 800-node cluster
[Figure: ((⌈√N⌉ - 1) local domain destinations + (⌈√N⌉ - 1) remote "head" destinations) × N = 2 × N × (⌈√N⌉ - 1) actively monitored links]
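A self-contained sketch of the selection rule just described, for a 16-node cluster: sort the nodes into the common circular list, take the next ⌈√N⌉ - 1 downstream nodes as the local domain, and pick roughly every ⌈√N⌉-th node after that as a head. Node addresses and helper names are made up; the real implementation lives in the kernel's net/tipc/monitor.c. For n = 16 it prints 3 domain members and 3 heads, i.e. the 6 monitored neighbors from the slide.

```c
#include <stdio.h>
#include <stdlib.h>

/* smallest i with i * i >= n, i.e. the "[sqrt(N)]" used on the slide */
static int dom_size(int n)
{
	int i = 0;

	while (i * i < n)
		i++;
	return i;
}

static int cmp_addr(const void *a, const void *b)
{
	unsigned int x = *(const unsigned int *)a;
	unsigned int y = *(const unsigned int *)b;

	return x < y ? -1 : x > y;
}

int main(void)
{
	unsigned int nodes[16];
	int n = 16, m, i, self = 0;

	for (i = 0; i < n; i++)
		nodes[i] = 1000 + 7 * ((i * 5) % n);   /* fake, unsorted addresses */
	qsort(nodes, n, sizeof(nodes[0]), cmp_addr); /* same circular list on all nodes */
	m = dom_size(n);                             /* 4 for n = 16 */

	printf("local domain:");                     /* next m - 1 downstream nodes */
	for (i = 1; i < m; i++)
		printf(" %u", nodes[(self + i) % n]);
	printf("\nheads:");                          /* one head per remote domain */
	for (i = m; i < n; i += m)
		printf(" %u", nodes[(self + i) % n]);
	printf("\n");
	return 0;
}
```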
9. LOSS OF LOCAL DOMAIN NODE
[Figure: a state change of a local domain node is detected]
• A domain record is sent to all other nodes in the cluster when any state change (discovery, loss, re-establishment) is detected in a local domain node
• The record carries a generation id, so the receiver can tell whether it really contains a change before it starts parsing and applying it (a record sketch follows below)
• It is piggy-backed on the regular unicast link state/probe messages, which must always be sent out after a domain state change
• It may be sent several times, until the receiver acknowledges reception of the current generation
• Because probing is driven by a background timer, it may take up to 375 ms (configurable) until all nodes are updated
[Figure: the domain record is distributed to all other nodes in the cluster]
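A sketch of what such a domain record might look like, loosely modeled on struct tipc_mon_domain in the kernel's net/tipc/monitor.c; the field set here is a reconstruction and may not match the actual source exactly.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_DOMAIN 64

struct domain_record {
	uint16_t len;                  /* total record length */
	uint16_t gen;                  /* bumped on every local domain state change */
	uint16_t ack_gen;              /* latest generation acked by the peer */
	uint16_t member_cnt;           /* number of domain members listed */
	uint64_t up_map;               /* one up/down bit per member */
	uint32_t members[MAX_DOMAIN];  /* members in circular-list order */
};

/* Receiver side: parse and apply only if the record is really newer */
static bool record_is_new(const struct domain_record *r, uint16_t have_gen)
{
	return (int16_t)(r->gen - have_gen) > 0;   /* wrap-around safe */
}

int main(void)
{
	struct domain_record r = { .gen = 2 };

	return record_is_new(&r, 1) ? 0 : 1;       /* gen 2 > gen 1: apply it */
}
```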
10. LOSS OF ACTIVELY MONITORED HEAD NODE
[Figure: node failure detected → brief confirmation probing of the lost node's domain members → monitoring view after recalculation; actively monitored nodes outside the local domain are highlighted]
• The two-hop criterion plus confirmation probing (sketched below) eliminates the network partitioning problem
• If we really have a partition, the worst-case failure detection time will be
  ◦ Tfail_max = 2 × active failure detection time
• The active failure detection time is configurable
  ◦ 50 ms – 10 s
  ◦ Default 1.5 s in TIPC/Linux 4.7, i.e. a 3 s worst case during an actual partition
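A sketch of the confirmation step, with hypothetical names: the last domain record received from the lost head tells the survivor exactly which members to probe before trusting that record's view of the far side.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_DOMAIN 64

struct domain_record {
	uint16_t member_cnt;
	uint32_t members[MAX_DOMAIN];
};

static void start_confirmation_probe(uint32_t addr)
{
	printf("briefly probing node %u\n", addr);   /* stand-in for a real probe */
}

/* Head node declared lost: probe its domain members directly, so a
 * network partition cannot masquerade as a string of node failures */
static void head_lost(const struct domain_record *last_rec)
{
	for (int i = 0; i < last_rec->member_cnt; i++)
		start_confirmation_probe(last_rec->members[i]);
}

int main(void)
{
	struct domain_record rec = { 3, { 2001, 2002, 2003 } };

	head_lost(&rec);
	return 0;
}
```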
11. LOSS OF INDIRECTLY MONITORED NODE
[Figure: actively monitoring neighbors discover the failure, then report it to the rest of the cluster]
• Max one event propagation hop
• Near-uniform failure detection time across the whole cluster
  ◦ Tfail_max = active failure detection time + (1 × event propagation hop time)
12. DIFFERING NETWORK VIEWS
When a node has discovered a peer that nobody else is monitoring:
• Actively monitor that node
• Add it to the circular list according to the algorithm (as a local domain member or "head")
• Handle its domain members according to the algorithm ("applied" or "non-applied")
• Continue calculating the monitoring view from the next peer
When a node is unable to discover a peer that others are monitoring:
• Don't add the peer to the circular list
• Ignore it during the calculation of the monitoring view
• Keep it as "non-applied" in the copies of received domain records
• Apply it to the monitoring view if it is discovered at a later moment
Transiently, this happens all the time, and it must be considered a normal situation (see the reconciliation sketch below).
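A sketch of how the "applied"/"non-applied" reconciliation might look; all names are hypothetical. A member listed in a received domain record joins our monitoring view only once we have discovered it ourselves; until then it stays non-applied.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct peer_state {
	uint32_t addr;
	bool discovered;   /* have we seen this node ourselves? */
	bool applied;      /* is it part of our monitoring view? */
};

/* Reconcile a received domain record against the local network view */
static void apply_record(struct peer_state *peers, int n_peers,
			 const uint32_t *members, int cnt)
{
	for (int i = 0; i < n_peers; i++)
		for (int j = 0; j < cnt; j++)
			if (peers[i].addr == members[j])
				peers[i].applied = peers[i].discovered;
}

int main(void)
{
	struct peer_state peers[] = {
		{ 1001, true,  false },
		{ 1002, false, false },  /* reported by others, not yet seen by us */
	};
	uint32_t members[] = { 1001, 1002 };

	apply_record(peers, 2, members, 2);
	for (int i = 0; i < 2; i++)
		printf("%u: %s\n", peers[i].addr,
		       peers[i].applied ? "applied" : "non-applied");

	/* Later: we discover 1002 ourselves, so it can now be applied */
	peers[1].discovered = true;
	apply_record(peers, 2, members, 2);
	printf("%u after discovery: %s\n", peers[1].addr,
	       peers[1].applied ? "applied" : "non-applied");
	return 0;
}
```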