A shallow look at all one needs to know when dealing with Distributed Systems, such as the CAP theorem, Harvest/Yield metrics, Partitioning vs. Replication, and Consensus Algorithms.
2. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
RUBEN TAN LONG ZHENG
▸ CTO of Neuroware (R1 Dot My Sdn Bhd)
▸ We Do Blockchain Stuff™
▸ Co-founder of Javascript Developers Malaysia
▸ Proud owner of 2 useless cats
▸ rubentan.com
▸ @roguejs
3. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
SESSION OVERVIEW
▸ Defining Distributed Systems
▸ Eight Fallacies Of Distributed Systems
▸ CAP Theorem
▸ Harvest / Yield
▸ Replication vs. Partitioning
▸ Consensus Algorithms
5. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS HAPPEN WHEN DEMAND OUTPACES YOUR INFRASTRUCTURE
6. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DEFINING DISTRIBUTED SYSTEMS
▸ Distributed System
▸ A bunch of processes in a networked environment
▸ Communicates by passing messages
▸ Observed as one single entity by outsiders
[Diagram: five interconnected nodes]
7. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DEFINING DISTRIBUTED SYSTEMS
▸ Centralized vs. Decentralized
▸ A topology for control
▸ Centralized distributed system - has an authoritative entity to ensure correctness
▸ Decentralized distributed system - no leader; every node operates independently
8. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DEFINING DISTRIBUTED SYSTEMS
▸ General Characteristics
▸ Networked - each node is connected in a network
▸ Independent Failure - each node can fail independently
▸ Concurrent - computation happens simultaneously across nodes
▸ No Global Clock - nodes cannot rely on a single shared clock
10. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
A PRACTICAL EXAMPLE - CONFERENCE DIRECTORY
[Diagram: a few users (P) requesting the web directory from a single server backed by one database]
11. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
A PRACTICAL EXAMPLE - CONFERENCE DIRECTORY
[Diagram: the user count grows sharply; the same single server and database now take many more requests]
12. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
A PRACTICAL EXAMPLE - CONFERENCE DIRECTORY
[Diagram: a load balancer now spreads users across three servers, all sharing one database]
13. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
A PRACTICAL EXAMPLE - CONFERENCE DIRECTORY
[Diagram: the same load-balanced topology; the single shared database remains]
14. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
A PRACTICAL EXAMPLE - CONFERENCE DIRECTORY
[Diagram: the database tier is scaled out too, with three databases behind the three servers]
15. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
A PRACTICAL EXAMPLE - CONFERENCE DIRECTORY
[Diagram: the full topology - load balancer, three servers, three databases]
Congratulations, you now have a distributed system!
16. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DEFINING DISTRIBUTED SYSTEMS
▸ Why learn about distributed systems?
▸ Microservices - learn how to evaluate your topology
▸ Load planning - understand how to measure and plan for load
▸ Failure management - eliminate or mitigate single points of failure
▸ Evaluate products - understand which product to use and what exactly it brings to the table
18. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ The Network Is Reliable
▸ Latency Is Zero
▸ Bandwidth Is Infinite
▸ The Network Is Secure
▸ Topology Does Not Change
▸ There Is One Administrator
▸ Transport Cost Is Zero
▸ The Network Is Homogeneous
21. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ The Network Is Reliable
▸ Hardware failure
▸ Human error
▸ Datacenter/cloud failure
▸ DDoS
22. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ The Network Is Reliable
▸ Identify critical components and SPoFs
▸ Chaos Monkey at Netflix
▸ Monitor with heartbeats (see the sketch below)
▸ Simplify the failure model
▸ Watch out for shared state
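A minimal sketch of heartbeat monitoring, assuming Node 18+ (global fetch, AbortSignal.timeout) and hypothetical peer URLs with a made-up /health endpoint; real deployments use proper failure detectors, but the shape is the same:

```typescript
// Heartbeat-based failure detection (illustrative sketch).
// node-1/node-2 and the /health endpoint are invented for this example.
type PeerStatus = { url: string; lastSeen: number; alive: boolean };

const SUSPECT_AFTER_MS = 3000; // suspect a peer after 3s of silence

const peers: PeerStatus[] = [
  { url: "http://node-1:8080", lastSeen: Date.now(), alive: true },
  { url: "http://node-2:8080", lastSeen: Date.now(), alive: true },
];

async function pingOnce(peer: PeerStatus): Promise<void> {
  try {
    const res = await fetch(`${peer.url}/health`, {
      signal: AbortSignal.timeout(1000), // a heartbeat must be fast
    });
    if (res.ok) {
      peer.lastSeen = Date.now();
      peer.alive = true;
    }
  } catch {
    // One missed beat is not proof of death: it may just be the network.
  }
  if (Date.now() - peer.lastSeen > SUSPECT_AFTER_MS) {
    peer.alive = false; // suspect the peer and route traffic away from it
  }
}

setInterval(() => peers.forEach((p) => void pingOnce(p)), 1000);
```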
24. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ Latency Is Zero
[Diagram: a user's request fans out to SERVER A (10ms), SERVER B (50ms), and SERVER C (100ms)]
25. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ Latency Is Zero
▸ Identify potential race conditions
▸ Avoid sequential operations (see the sketch below)
▸ Plan timeouts to terminate long-blocking requests
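A sketch of both points in TypeScript (Node 18+ assumed; the three service URLs are invented): independent calls are issued concurrently instead of sequentially, and each is bounded by a timeout so one slow node cannot lock the whole request:

```typescript
// Bound every remote call with a timeout so it cannot block forever.
async function fetchWithTimeout(url: string, ms: number): Promise<unknown> {
  const res = await fetch(url, { signal: AbortSignal.timeout(ms) });
  if (!res.ok) throw new Error(`${url} responded ${res.status}`);
  return res.json();
}

async function buildPage(): Promise<void> {
  // Done sequentially this would cost 10ms + 50ms + 100ms;
  // done in parallel it costs roughly max(10, 50, 100)ms.
  const [a, b, c] = await Promise.all([
    fetchWithTimeout("http://server-a/profile", 200),
    fetchWithTimeout("http://server-b/feed", 200),
    fetchWithTimeout("http://server-c/ads", 200),
  ]);
  console.log(a, b, c);
}
```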
28. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ Bandwidth Is Infinite
▸ Not as big a fallacy as others
▸ Made worse because more bandwidth is almost always immediately consumed
▸ Plan for unpredictable bandwidth
▸ Graceful degradation
30. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ The Network Is Secure
▸ The SWIFT network lost 81 million USD to a cyber heist in 2016
▸ LinkedIn was breached; more than 117 million accounts were compromised
31. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ The Network Is Secure
▸ Harden infrastructure as early as possible
▸ Adopt industry best practices on access control
▸ Plan for Byzantine faults, or at least detect them
33. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ Topology Does Not Change
[Diagram: the original single server + database topology beside the later topology with a load balancer, three servers, and three databases]
34. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ Topology Does Not Change
▸ Topology change is the most commonly encountered fallacy
▸ Small changes can cause massive paradigm shifts
▸ Crucial to understand distributed principles
37. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ There Is One Administrator
▸ Conflict between system administration and infrastructure design
▸ Access control can often cause unexpected failures
▸ System administrators have a different focus compared to software developers
▸ Think about management tools, software-defined networking, etc.
40. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ Transport Cost Is Zero
▸ Business decisions can become hard constraints
▸ More powerful hardware can yield minimal results
▸ The transport layer may incur additional resource costs
▸ Different protocols (TCP/UDP) have different performance tradeoffs
42. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ The Network Is Homogeneous
[Diagram: Linux and Windows machines on the same network]
43. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
DISTRIBUTED SYSTEMS FALLACIES
▸ The Network Is Homogeneous
▸ Avoid proprietary protocols/formats
▸ Focus on software/hardware that allows interoperability
▸ Not that big of a deal in the modern world
45. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
CAP THEOREM
▸ Consistency - most up-to-date data upon request - from weak to strong
▸ Availability - able to respond to a request - from low to high
▸ Partition-tolerance - able to continue operating in the event of a network partition - mandatory
47. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
CAP THEOREM
[CAP triangle: Consistency, Availability, Partition-tolerance]
Strong Consistency + Partition Tolerant
48. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
CAP THEOREM
[CAP triangle: Consistency, Availability, Partition-tolerance]
Strong Consistency + Partition Tolerant
• Mission-critical systems
• Financial systems
49. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
CAP THEOREM
[CAP triangle: Consistency, Availability, Partition-tolerance]
High Availability + Partition Tolerant
50. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
CAP THEOREM
[CAP triangle: Consistency, Availability, Partition-tolerance]
High Availability + Partition Tolerant
• “Webscale” systems
• Most web service backends
52. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
CAP THEOREM
[CAP triangle: Consistency, Availability, Partition-tolerance]
Consistency + Availability: also known as NOT a distributed system
53. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
CAP THEOREM
[CAP triangle: Consistency, Availability, Partition-tolerance]
Most systems are tuneable, aka a TRADEOFF
54. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
CAP THEOREM
Also, there are no absolutes in the system
[Spectrum: Absolute Consistency ↔ Absolute Availability]
55. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
CAP THEOREM
Also, there are no absolutes in the system
[Spectrum: Absolute Consistency ↔ Absolute Availability]
• The CAP Theorem describes how a system acts when a network partition is encountered
• Understand what consistency and availability mean
• Sacrifice some consistency for more availability, or vice versa
57. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
HARVEST / YIELD
▸ Harvest - the completeness of the response to a query
▸ Yield - the probability of completing a request
▸ A response to the CAP Theorem being widely misunderstood and misused
▸ Armando Fox, Eric A. Brewer - Harvest, Yield, and Scalable Tolerant Systems (1999)
▸ How much harvest/yield to sacrifice in the event of a network partition
58. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
HARVEST / YIELD
▸ Harvest - the completeness of the response to a query
▸ Harvest = total data available / total data
▸ Harvest is an abstract idea - it depends on what you define as completeness
▸ Examples:
▸ Pagination on large datasets
▸ Returning a partial dataset on shard failure
▸ Returning less accurate search results on node failure
59. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
HARVEST / YIELD
▸ Yield - the probability of completing a request
▸ Yield = total responses / total requests - a value from 0 to 1
▸ Example:
▸ Total responses = 999
▸ Total requests = 1000
▸ Yield = 0.999
▸ Yield is NOT uptime!
60. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
HARVEST / YIELD
[Tradeoff: sacrifice harvest to gain yield]
• You can trade harvest for yield - Probabilistic Availability (see the sketch below)
• Examples
• Returning stale data
• Prioritising the most critical data
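A hedged sketch of trading harvest for yield: the query fans out to every shard, but the response is built from whatever arrived in time, so the request itself never fails. The shard URLs and the string[] result shape are assumptions for illustration:

```typescript
// Degrade harvest, preserve yield: answer with partial results.
async function searchAllShards(query: string, shards: string[]) {
  const results = await Promise.allSettled(
    shards.map((shard) =>
      fetch(`${shard}/search?q=${encodeURIComponent(query)}`, {
        signal: AbortSignal.timeout(500), // slow shards are dropped, not awaited
      }).then((res) => res.json() as Promise<string[]>)
    )
  );
  const ok = results.filter(
    (r): r is PromiseFulfilledResult<string[]> => r.status === "fulfilled"
  );
  return {
    hits: ok.flatMap((r) => r.value),
    harvest: ok.length / shards.length, // fraction of the data actually consulted
  };
}
```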
61. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
HARVEST / YIELD
[Tradeoff: sacrifice yield to gain harvest]
• You can trade yield for harvest (see the sketch below)
• Examples
• Database transactional locks
• Returning an error on network failure instead
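The mirror-image sketch, trading yield for harvest: if any shard fails, the whole request fails, so every response that does go out is complete:

```typescript
// Preserve harvest, degrade yield: fail fast on any shard error.
async function searchStrict(query: string, shards: string[]): Promise<string[]> {
  const pages = await Promise.all(
    shards.map(async (shard) => {
      const res = await fetch(`${shard}/search?q=${encodeURIComponent(query)}`);
      if (!res.ok) throw new Error(`shard ${shard} failed`); // abort, don't degrade
      return (await res.json()) as string[];
    })
  );
  return pages.flat(); // full harvest, but yield drops whenever a shard is down
}
```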
62. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
HARVEST / YIELD
▸ Distributed systems can be evaluated by their decisions to reduce harvest or yield under network partitions
▸ Some architectures utilise different harvest/yield tradeoffs in individual components
▸ A better representation of the kind of tradeoffs one will make compared to the CAP Theorem
64. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
PARTITIONING & REPLICATION
▸ Strategies - replication and partitioning are two different strategies for scaling a distributed system
▸ Partitioning - dividing data to improve yield during high loads
▸ Replication - creating redundant copies of data to improve harvest in the event of node failures
▸ Both strategies are used together in some combinations
65. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
PARTITIONING
▸ Partitioning - dividing data to improve yield during high loads
▸ Data can be divided using deterministic indexing strategies (see the sketch below)
▸ Examples:
▸ By geography (Asia, Europe, North America)
▸ By hash (3xf8ca8e, etc)
▸ By category (hot/cold data)
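A minimal sketch of deterministic indexing by hash, using Node's built-in crypto; the key and node count are illustrative. The same key always lands on the same node, so no lookup table is needed:

```typescript
import { createHash } from "crypto";

// Deterministic hash partitioning: key -> stable node index.
function partitionFor(key: string, nodeCount: number): number {
  const digest = createHash("sha256").update(key).digest();
  return digest.readUInt32BE(0) % nodeCount; // stable while nodeCount is fixed
}

console.log(partitionFor("Justice League", 6)); // always the same node
```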
67. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
PARTITIONING
[Diagram: a web server backed by a single node holding the entire dataset]
When load becomes greater than a single node can handle, we need to partition the data
68. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
PARTITIONING
[Diagram: the web server now fans out to six nodes]
Each node contains a shard of the original dataset
69. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
PARTITIONING
[Diagram: the six nodes are assigned key ranges A-D, E-H, I-M, N-Q, R-U, V-Z]
70. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
PARTITIONING
[Diagram: the same sharded layout]
Search: “Justice League”
71. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
PARTITIONING
[Diagram: the query “Justice League” is routed directly to the shard that owns its key range]
72. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
PARTITIONING
[Diagram: the same sharded layout]
Consistent (deterministic) hashing is used so a query on a sharded dataset can be quickly mapped to the node that contains it
73. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
PARTITIONING
[Diagram: the sharded layout, with a new empty node being added]
Bonus question: what happens when you need to add a new partition?
74. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
PARTITIONING
[Diagram: the new node with an undetermined key range (?-?)]
What key range do you use? (A consistent-hashing sketch follows.)
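One common answer is consistent hashing, sketched below without the virtual nodes a production ring would add: nodes and keys share one hash space, a key belongs to the first node clockwise from its hash, and a new node claims only the arc between itself and its predecessor, so most keys stay put (a naive hash % nodeCount scheme would reshuffle nearly everything):

```typescript
import { createHash } from "crypto";

// Hash both node names and keys into the same 32-bit ring.
const hash32 = (s: string): number =>
  createHash("sha256").update(s).digest().readUInt32BE(0);

class Ring {
  private points: { h: number; node: string }[] = [];

  add(node: string): void {
    this.points.push({ h: hash32(node), node });
    this.points.sort((a, b) => a.h - b.h); // keep the ring ordered
  }

  // A key belongs to the first node at or after its hash (wrapping around).
  nodeFor(key: string): string {
    const h = hash32(key);
    const point = this.points.find((p) => p.h >= h) ?? this.points[0];
    return point.node;
  }
}

const ring = new Ring();
["node-1", "node-2", "node-3"].forEach((n) => ring.add(n));
console.log(ring.nodeFor("Justice League"));
ring.add("node-4"); // only keys on node-4's new arc move to it
```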
75. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
PARTITIONING & REPLICATION
▸ Partitioning does not improve system resilience against network partitions or node failures
▸ Replication - used in conjunction with partitioning to improve data redundancy
▸ However, as we replicate data, we improve read yield at the cost of write yield, if we care about strong consistency
76. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
REPLICATION
[Diagram: a web server fanning out to six partitioned nodes]
77. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
REPLICATION
[Diagram: a request is routed to one specific node]
A request requires data from a specific partition
78. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
REPLICATION
[Diagram: the node holding that partition fails]
The node fails, and your data is gone
79. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
REPLICATION
[Diagram: the partition is copied onto two more nodes]
Replicate to 2 more nodes to improve data redundancy
80. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
REPLICATION
[Diagram: the same layout, now with replicas]
What happens when the data is updated?
81. BASIC DISTRIBUTED SYSTEMS PRINCIPLES
REPLICATION
[Diagram: the replicas now hold diverging copies]
Your data is now inconsistent. The solution is to implement a consensus algorithm amongst the replicas
83. CONSENSUS OVERVIEW
▸ Achieving Consensus = distributed system acting as one entity
▸ Consensus Problem = getting nodes in a distributed system to
agree on something (value, operation, etc)
▸ Common Examples
▸ Commit transactions to a database
▸ Synchronising clocks
▸ Replicating logs
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
84. REALITIES OF DISTRIBUTED SYSTEMS
▸ Distributed systems fail often (more often than you think)
▸ Development of distributed systems costs more
▸ Consensus/coordination is a hard problem
▸ Problems usually bigger than available memory
▸ Debugging a distributed system? Good luck
▸ Monitoring a distributed system? Good luck
▸ Learn to live with imperfections and partial availability
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
85. FAILURE MODES
▸ Fail-stop = a node dies
▸ Fail-recover = a node dies and comes back later (Jesus/Zombie)
▸ Byzantine = a node misbehaves
▸ The scary part? The symptoms are the same!
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
86. FLP IMPOSSIBILITY PROOF
▸ Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson
▸ Impossibility of Distributed Consensus with One Faulty Process (1985) - Dijkstra (dike-stra) Award (2001)
▸ In synchronous settings, it is possible to reach consensus at the cost of time
▸ Consensus is impossible in an asynchronous setting even when only 1 node may crash
▸ Why is this important? Because math > your arguments!
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
87. BYZANTINE GENERAL’S PROBLEM
▸ Originated from the Two Generals' Problem (1975)
▸ Explored in detail in the Leslie Lamport, Robert Shostak, and Marshall Pease paper: The Byzantine Generals Problem (1982)
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
90. BYZANTINE FAULT TOLERANCE
▸ Byzantine Fault
▸ Any fault that presents different symptoms to different observers (some generals attack, some generals retreat)
▸ Byzantine Failure
▸ The loss of a system service reliant on consensus due to a Byzantine Fault
▸ Byzantine Fault Tolerance
▸ A system that is resilient/tolerant of a Byzantine Fault
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
91. SOLVING THE CONSENSUS PROBLEM
▸ Strong consensus follows these properties:
▸ Termination - all nodes eventually decide on a value
▸ Agreement - all nodes decide on the same value
▸ Integrity - each node decides on at most 1 value, and this value must be one that was proposed
▸ Validity - if all correct nodes propose the same value, then all nodes decide on that value
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
93. 2 PHASE COMMIT
▸ Simplest consensus protocol
▸ Phase 1 - Proposal
▸ A node (called the coordinator) proposes a value to all other nodes, then gathers votes
▸ Phase 2 - Commit-or-abort
▸ The coordinator sends:
▸ Commit if all nodes voted yes - all nodes commit the new value
▸ Abort if 1 or more nodes voted no - all nodes abort the value
▸ (A coordinator sketch follows)
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
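A toy in-process sketch of the two phases; the Participant interface with its vote/commit/abort methods is invented for illustration, and there is no networking or crash recovery:

```typescript
// Two-phase commit, coordinator side (illustrative only).
interface Participant {
  vote(value: string): Promise<boolean>; // phase 1: yes/no
  commit(value: string): Promise<void>;  // phase 2: commit
  abort(): Promise<void>;                // phase 2: abort
}

async function twoPhaseCommit(value: string, nodes: Participant[]): Promise<boolean> {
  // Phase 1 - Proposal: send the value and gather votes.
  const votes = await Promise.all(nodes.map((n) => n.vote(value)));

  // Phase 2 - Commit-or-abort: commit only on a unanimous yes.
  if (votes.every(Boolean)) {
    await Promise.all(nodes.map((n) => n.commit(value)));
    return true;
  }
  await Promise.all(nodes.map((n) => n.abort()));
  return false; // at least one node voted no
}
```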
97. 2 PHASE COMMIT
▸ Agreement - every node accepts the value from the coordinator at phase 2 = YES
▸ Integrity - commit/abort originates from the coordinator = YES
▸ Termination - no loops in the steps, doesn’t run forever = YES
▸ Validity - all correct nodes accept the correct proposed value = YES
▸ Therefore, 2 phase commit fulfils the requirements of a consensus protocol
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
98. 2 PHASE COMMIT
▸ Blocking failure when the coordinator fails before sending the proposal to all nodes
[Diagram: the coordinator (COOR.) proposes a value to three nodes]
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
99. 2 PHASE COMMIT
▸ Blocking failure when the coordinator fails before sending the proposal to all nodes
[Diagram: a node receives the proposed value, votes yes, and is now waiting for the commit]
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
100. 2 PHASE COMMIT
▸ Blocking failure when the coordinator fails before sending the proposal to all nodes
[Diagram: the coordinator crashes… and a different coordinator comes in to propose a different value]
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
101. 2 PHASE COMMIT
▸ Blocking failure when the coordinator fails before sending the proposal to all nodes
[Diagram: the node cannot accept the new proposal because it is waiting on a commit, and cannot abort because the first coordinator might recover]
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
102. 2 PHASE COMMIT
▸ Guarantees safety, but not liveness
▸ Safety = all nodes agree on a value proposed by a node
▸ Liveness = the system should still be able to function when some nodes crash
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
103. 3 PHASE COMMIT
▸ Similar to 2 Phase Commit, with an extra phase (duh)
▸ Phase 1 - canCommit - same as 2PC; nodes reply with Yes
▸ Phase 2 - preCommit - similar to 2PC commit-or-abort, but nodes reply with ACK instead
▸ Phase 3 - doCommit - now the nodes commit
▸ Tolerant of node crashes, but not network partitions
▸ Won’t cover in detail
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
110. [Diagram: a coordinator and four nodes]
Guarantee 1: if ANY node receives a preCommit, we can safely assume that ALL nodes have replied YES, in which case they can safely assume that the value is agreed upon by all nodes
111. [Diagram: a coordinator and four nodes]
Guarantee 2: if ANY node receives a doCommit, we can safely assume that ALL nodes have replied ACK, in which case even if the coordinator fails, they can safely assume their commit is correct
112. PAXOS
▸ Presented by Leslie Lamport in The Part-Time Parliament (1998)
▸ Named after the legislature of the fictional Paxos civilisation
▸ Remains:
▸ The hardest to understand in theory
▸ The hardest to implement
▸ The closest we get to reaching ideal consensus
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
113. PAXOS
▸ Used in:
▸ Apache Zookeeper
▸ Google Chubby (BigTable)
▸ Google Spanner
▸ Apache Mesos
▸ Apache Cassandra
▸ etc
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
114. BASIC PAXOS
▸ Components:
▸ Proposers
▸ Proposes values to other nodes
▸ Acceptors
▸ Respond to proposers with votes
▸ Commits chosen value & decision state
▸ A server can run both a Proposer and an Acceptor
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
115. BASIC PAXOS
▸ Revolves around two important properties: proposal number and time
▸ Proposal numbers are unique, and a higher proposal number has priority over a lower one
▸ Proposal number needs to be persisted on each node
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
116. BASIC PAXOS
▸ Uses a two-phase approach:
▸ Broadcast Prepare
▸ Find out if there’s already a chosen value
▸ Block older proposals that have yet to be completed
▸ Broadcast Accept
▸ Ask acceptors to accept a value
▸ Impossible to have an algorithm that completes in one cycle
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
117. BASIC PAXOS
▸ Proposal Phase
▸ Proposer generates a proposal number p
▸ Proposer broadcasts p and a value v
▸ Acceptor checks whether p is higher than its min-p, and updates min-p if so
▸ Acceptor replies with any accepted-p and accepted-v
▸ Proposer waits for a majority (quorum) to reply
▸ If any reply carries an accepted-p, take the highest one and replace v with its accepted-v
▸ If no quorum, generate a new proposal, using accepted-p as a base
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
118. PAXOS
▸ Accept Phase
▸ Proposer sends p and v to all acceptors
▸ Acceptors check if p is lower than min-p, and ignore it if so; otherwise accepted-p = min-p = p and accepted-v = v, and they return the min-p
▸ Acceptors reply accepted or rejected
▸ If a majority accepted, terminate with v; otherwise, restart the Proposal Phase with a new p
▸ (An acceptor sketch follows)
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
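A single-process sketch of the acceptor side, using the slides' min-p / accepted-p / accepted-v names; plain numbers stand in for the slides' round.serverId proposal numbers, and all networking is omitted:

```typescript
// Basic Paxos acceptor state and handlers (illustrative only).
type Proposal = number; // stands in for round.serverId pairs like "1.1"

class Acceptor {
  minP: Proposal = 0;
  acceptedP: Proposal | null = null;
  acceptedV: string | null = null;

  // Proposal phase: promise to ignore anything below p,
  // and report any value already accepted.
  prepare(p: Proposal) {
    const promised = p > this.minP;
    if (promised) this.minP = p; // block older, incomplete proposals
    return { promised, acceptedP: this.acceptedP, acceptedV: this.acceptedV };
  }

  // Accept phase: reject stale proposals, otherwise record the value.
  accept(p: Proposal, v: string): boolean {
    if (p < this.minP) return false; // a newer proposal has been promised
    this.minP = p;
    this.acceptedP = p;
    this.acceptedV = v;
    return true;
  }
}
```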
119. PROPOSAL PHASE
S1 proposes X using proposal number 1.1, sending P1.1 with value X to A1 and A2
A1: min-P=0, acc-P=-, acc-V=-
A2: min-P=0, acc-P=-, acc-V=-
A3: min-P=0, acc-P=-, acc-V=-
120. PROPOSAL PHASE
Both A1 and A2 set their min-P, and reply with their (still empty) acc-P and acc-V
A1: min-P=1.1, acc-P=-, acc-V=-
A2: min-P=1.1, acc-P=-, acc-V=-
A3: min-P=0, acc-P=-, acc-V=-
121. PROPOSAL PHASE
S1 notices that the highest returned acc-P is not higher than its own P, so it keeps its own value
A1: min-P=1.1, acc-P=-, acc-V=-
A2: min-P=1.1, acc-P=-, acc-V=-
A3: min-P=0, acc-P=-, acc-V=-
122. COMMIT PHASE
S1 issues an accept command using the same proposal and value
A1: min-P=1.1, acc-P=1.1, acc-V=X
A2: min-P=1.1, acc-P=1.1, acc-V=X
A3: min-P=1.1, acc-P=1.1, acc-V=X
123. PAXOS - MULTI PROPOSERS
▸ What if there were multiple proposers?
▸ Brace yourself, It’s Complicated™ (not really)
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
124. PROPOSAL PHASE
S1 proposed X using proposal number 1.1, and 2 out of 3 nodes have already accepted. S2 now proposes Y using proposal number 2.1
A1: min-P=1.1, acc-P=1.1, acc-V=X
A2: min-P=1.1, acc-P=1.1, acc-V=X
A3: min-P=0, acc-P=-, acc-V=-
125. PROPOSAL PHASE
A3 will return null for acc-P and acc-V…
A1: min-P=1.1, acc-P=1.1, acc-V=X
A2: min-P=1.1, acc-P=1.1, acc-V=X
A3: min-P=0, acc-P=-, acc-V=-
126. PROPOSAL PHASE
…but A2 will return an acc-P of 1.1, and an acc-V of X
A1: min-P=1.1, acc-P=1.1, acc-V=X
A2: min-P=1.1, acc-P=1.1, acc-V=X
A3: min-P=0, acc-P=-, acc-V=-
127. PROPOSAL PHASE
What is the highest acc-P? 1.1
A1: min-P=1.1, acc-P=1.1, acc-V=X
A2: min-P=1.1, acc-P=1.1, acc-V=X
A3: min-P=0, acc-P=-, acc-V=-
128. PROPOSAL PHASE
S2 changes its value to X, and sends it back as a commit
A1: min-P=1.1, acc-P=1.1, acc-V=X
A2: min-P=1.1, acc-P=1.1, acc-V=X
A3: min-P=0, acc-P=-, acc-V=-
129. COMMIT PHASE
S2 sends P2.1 with value X to all three acceptors
A1: min-P=1.1, acc-P=1.1, acc-V=X
A2: min-P=1.1, acc-P=1.1, acc-V=X
A3: min-P=0, acc-P=-, acc-V=-
130. COMMIT PHASE
All three acceptors accept P2.1 with value X
A1: min-P=2.1, acc-P=2.1, acc-V=X
A2: min-P=2.1, acc-P=2.1, acc-V=X
A3: min-P=2.1, acc-P=2.1, acc-V=X
All values are in sync now
133. BASIC PAXOS
▸ This is BASIC Paxos: 2PC with a twist (quorum)
▸ It has vulnerabilities!
▸ Best of 2PC (safety), with strong liveness
▸ Most consensus algorithms are variants of Paxos
▸ Forms the basis of distributed consensus research
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
134. CLOSING…
▸ Basic Paxos is not Byzantine Fault Tolerant, but more advanced variants can be (e.g. PBFT)
▸ It is a challenge to create a consensus protocol (termination, agreement, validity) that is Byzantine Fault Tolerant
▸ Further developments: Multi-Paxos, Raft, Byzantine Fault Tolerant Paxos, etc…
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
135. BITCOIN CONSENSUS
▸ Why do you need to know this?
▸ Bitcoin
▸ Litecoin
▸ Dogecoin
▸ etc
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
136. BITCOIN CONSENSUS
▸ Requirements:
▸ Anybody can access the ledger
▸ Anybody can modify the ledger
▸ Everybody must have the same truth
▸ Nobody exerts sole authority over the truth
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
137. [Diagram: a chain of blocks, BLK-1 through BLK-4, each holding a list of transactions (T) and timestamped to consecutive hours on 2017 Feb 23]
138. [Diagram: the same chain, BLK-1 through BLK-4]
1 - Each block contains a list of transactions
2 - Each block contains a “hash” of its parent, the previous block
3 - Each block is timestamped to a specific time
139. [Diagram: the same chain, plus a pool of pending transactions T1-T8]
New transactions arrive into a memory pool
140. [Diagram: MINER-1, MINER-2, and MINER-3 each assemble the pooled transactions into candidate blocks BLK-X, BLK-Y, and BLK-Z]
All miners receive these transactions via gossip, and collect them into blocks
141. [Diagram: each miner repeatedly hashes its candidate block]
Miners hash the block and race to match the hash against a pattern. This pattern has a difficulty that roughly determines the number of hashes required to solve it. (A toy mining loop follows.)
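A toy mining loop using Node's crypto; requiring a run of leading zero hex digits stands in for Bitcoin's real numeric target comparison, and the block string is made up:

```typescript
import { createHash } from "crypto";

// Proof of work: vary a nonce until the block hash matches the pattern.
function mine(blockData: string, difficulty: number): { nonce: number; hash: string } {
  const target = "0".repeat(difficulty); // more zeros = exponentially more hashes
  for (let nonce = 0; ; nonce++) {
    const hash = createHash("sha256").update(blockData + nonce).digest("hex");
    if (hash.startsWith(target)) return { nonce, hash }; // solved: broadcast it
  }
}

// ~16^4 ≈ 65k hashes on average at difficulty 4.
console.log(mine("BLK-5|prev-hash|T1,T3,T5", 4));
```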
142. [Diagram: MINER-1 finds a matching hash]
A match is found! Broadcast the solution, hash, and block
143. [Diagram: BLK-5 is appended to the chain at hour 5; MINER-1 collects the reward ($$$) while MINER-2 and MINER-3 get nothing]
144. [Diagram: two competing versions of BLK-5 appear at hour 5]
So… what if 2 blocks are discovered at the same time?
145. [Diagram: the chain forks into BLK-5A and BLK-5B]
Fork it. The next block chooses the branch that has the most work
146. [Diagram: MINER-1 builds on BLK-5A]
5A it is!
147. [Diagram: BLK-6 at hour 6 extends BLK-5A]
All future blocks will only choose the longest chain, so 5B is orphaned
148. [Diagram: the transactions from BLK-5B drop back into the pool]
Transactions in 5B eventually get returned to the mempool, to be included in a different block
149. [Diagram: a single chain, BLK-1 through BLK-6]
Consensus achieved, one single version of truth!
150. BITCOIN CONSENSUS
▸ Achieves consensus through proof of work
▸ An economic solution to a distributed problem
▸ Expensive to attack, even when attack vectors are known
▸ Nodes are incentivised to play nice
▸ Great basis for cryptocurrencies
▸ Tradeoffs
▸ Limited number of transactions per second
▸ Improvements limited by politics
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
151. CLOSING…
▸ Understanding distributed models will level up your perspective when developing
▸ Gives you the tools to evaluate technologies to see if they fit your problems
▸ Allows you to reliably tell what kind of tradeoffs a technology makes, and whether you are okay with that sacrifice
BASIC DISTRIBUTED SYSTEMS PRINCIPLES
156. TAKEAWAYS
▸ If you can solve it in memory, don’t go distributed
▸ If you can afford a monolithic architecture, don’t go microservices
▸ If you insist on microservices, use them as a service abstraction, not a scaling method
▸ Scale vertically first, then horizontally
▸ When you need to scale horizontally, use these principles to evaluate solutions and design your system
BASIC DISTRIBUTED SYSTEMS PRINCIPLES