NonStop Hadoop - Applying the Paxos Family of Protocols to make Critical Hadoop Services Continuously Available
1. Non-Stop Hadoop
Applying Paxos to make critical Hadoop
services Continuously Available
Jagane Sundar - CTO, WANdisco
Brett Rudenstein – Senior Product Manager, WANdisco
2. WANdisco Background
WANdisco: Wide Area Network Distributed Computing
Enterprise ready, high availability software solutions that enable globally distributed organizations to meet today’s
data challenges of secure storage, scalability and availability
Leader in tools for software engineers – Subversion
Apache Software Foundation sponsor
Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)
US patented active-active replication technology granted, November 2012
Global locations
- San Ramon (CA)
- Chengdu (China)
- Tokyo (Japan)
- Boston (MA)
- Sheffield (UK)
- Belfast (UK)
5. Elementary Server Software:
Single thread processing client requests in a loop
Diagram: a single-threaded server process loops: get client request (e.g. HBase put), make change to state (db), send return value to client.
6. Multi-threaded Server Software:
Multiple threads processing client requests in a loop
Diagram: threads 1, 2, and 3 each loop within the same server process: get client request (e.g. HBase put), acquire lock, make change to state (db), release lock, send return value to client.
8. Problem
How do we ensure that the three servers contain exactly the same data?
In other words, how do we achieve strongly consistent replication?
9. Two parts to the solution:
(Multiple Replicas of) A Deterministic State Machine
The exact same sequence of operations, to be applied to each replica of the DSM
10. A Deterministic State Machine
A state machine in which a given operation always results in the same end state
Non-deterministic factors cannot play a role in the end state of any operation in a DSM. Examples of non-deterministic factors:
- Time
- Random numbers
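To make this concrete, here is a minimal sketch of a deterministic state machine in Java (illustrative names, not WANdisco code): nothing in apply() depends on time, randomness, or thread scheduling, so every replica that applies the same operations in the same order ends up with identical state.

import java.util.Map;
import java.util.TreeMap;

// Minimal deterministic state machine: a sorted key-value store.
// Applying the same ordered operations on any replica yields identical state.
final class KeyValueDsm {
    // An operation carries everything needed to apply it; no clocks, no randomness.
    record Op(String key, String value) {}

    private final Map<String, String> state = new TreeMap<>();

    void apply(Op op) {
        state.put(op.key(), op.value()); // outcome depends only on the op and prior state
    }

    Map<String, String> snapshot() {
        return Map.copyOf(state);
    }
}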
11. Creating three replicated servers
Apply all modify operations in the exact same sequence in each replicated server = multiple servers with exactly the same replicated data
Diagram: server1, server2, and server3 each run a Server Process (DSM) consuming the same ordered stream of operations.
12. Problem:
How do we achieve consensus between these servers on the sequence of operations to perform?
Paxos is the answer
- Algorithm for reaching consensus in a network of unreliable processors
13. Three replicated servers
Diagram: server1, server2, and server3 each run a Server Process on top of a Distributed Coordination Engine; clients send operations to any server, and the engines use Paxos to agree on a single global order of operations.
15. Paxos
Paxos is an Algorithm for building Replicated Servers with strong consistency
1. Synod algorithm for achieving consensus among a network of unreliable processes
2. The application of consensus to the task of replicating a Deterministic State Machine
Paxos does not
- Specify a network protocol
- Invent a new language
- Restrict use to a specific language
16. Replicated State Machine
Installed on each node that participates in the distributed system
All nodes function as peers to deliver and ensure that the same transaction order occurs on every system
- Achieve Consistent Replication
Consensus
- Roles
• Proposers, Acceptors, Learners
- Phases
• Election of a node to be the proposer
• Broadcast of the proposal to peers
• Acceptance of the proposal by a majority
17. Paxos Roles
Proposer
- The client or a proxy for the client
- Proposes a change to the Deterministic State Machine
Acceptor
- Acceptors are the ‘memory’ of Paxos
- Quorum is established amongst acceptors
Learner
- The DSM (Replicated, of course)
- Each Learner applies the exact same sequence of operations as proposed by the Proposers, and accepted by a majority quorum of Acceptors
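As a hedged sketch of the three roles (interfaces invented for illustration; this is not DConE's API):

// Illustrative role interfaces; the names and signatures are stand-ins.
interface Proposer {
    // Propose an operation on behalf of a client; returns the agreed slot in the
    // global sequence once consensus is reached.
    long propose(byte[] operation);
}

interface Acceptor {
    // Phase 1: promise not to accept proposals below 'round' (Acceptors are Paxos's memory).
    boolean prepare(long round);
    // Phase 2: accept a value for a round, unless a higher round was already promised.
    boolean accept(long round, byte[] operation);
}

interface Learner {
    // The replicated DSM: applies agreed operations strictly in slot order.
    void learn(long slot, byte[] operation);
}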
18. Paxos - Ordering
Proposers issue a new sequence number higher than the last sequence number they know of
A majority agrees this number has not been seen
Consensus must be reached on the current proposal
20. DConE Innovations
Beyond Paxos
Quorum Configurations
- Majority, Singleton, Unanimous
- Distinguished Node – Tie Breaker
- Quorum Rotations – follow the sun
- Emergency Reconfigure
Concurrent agreement handling
- Paxos only allows agreement on one proposal at a time
• Slow performance in a high transaction volume environment
- DConE allows simultaneous proposals from multiple proposers
21. DConE Innovations
Beyond Paxos
Dynamic group evolution
- Add and remove nodes
- Add and remove sites
- No interruption of current operations
Distributed garbage collection
- Safely discard state on disk and in memory when it is no longer required to assist in recovery
- Messages are sent to peers at pre-defined intervals to determine the highest common agreement
- Agreements and agreed proposals that are no longer needed for recovery are deleted
22. DConE Innovations
Beyond Paxos
Backoff and collision avoidance
- Avoids repeated pre-emption of proposers by their peers
- Prevents thrashing which can severely degrade performance.
- When a round is pre-empted, a backoff delay is computed
23. Self Healing
Automatic Back up and Recovery
All nodes are mirrors/replicas of each other
- Any node can be used as a helper to bring a failed node back
Read access without Quorum
- Cluster is still accessible for reads
- Writes are disallowed, preventing split brain
Automatic catch up
- Servers that have been offline learn of transactions that were agreed on while they were unavailable
- The missing transactions are played back and, once caught up, the servers become fully participating members of the distributed system again (see the sketch after this list)
Servers can be updated without down time
- Allows for rolling upgrades
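A minimal sketch of the catch-up step (names invented for illustration): a server that was offline asks a peer for all agreements above its last applied GSN and replays them in order.

import java.util.List;

// Illustrative catch-up: replay agreements missed while this node was offline.
final class CatchUp {
    record Agreement(long gsn, byte[] operation) {}

    interface Peer {
        // Agreements with GSN strictly greater than 'afterGsn', in ascending GSN order.
        List<Agreement> agreementsAfter(long afterGsn);
    }

    interface StateMachine {
        long lastAppliedGsn();
        void apply(Agreement agreement);
    }

    static void recover(StateMachine local, Peer helper) {
        for (Agreement a : helper.agreementsAfter(local.lastAppliedGsn())) {
            local.apply(a); // replay in order; the node then rejoins as a full participant
        }
    }
}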
25. Co-ordinating intent
Diagram (co-ordinating intent): the proposal to mkdir /a and the proposal to createFile /a are both put through Paxos before either is applied; every server applies the agreed operation, and the conflicting operation fails identically on both servers.
Diagram (co-ordinating outcome, e.g. WAL, HDFS Edits Log): server1 and server2 apply mkdir /a and createFile /a locally before coordinating; server1 state is wrong and the mkdir /a operation needs to be undone.
27. HDFS Architecture
HDFS metadata is decoupled from data
- Namespace is a hierarchy of files and directories represented by INodes
- INodes record attributes: permissions, quotas, timestamps, replication
NameNode keeps its entire state in RAM
- Memory state: the namespace tree and the mapping of blocks to DataNodes
- Persistent state: recent checkpoint of the namespace and journal log
File data is divided into blocks (default 128MB)
- Each block is independently replicated on multiple DataNodes (default 3)
- Block replicas stored on DataNodes as local files on local drives
Reliable distributed file system for storing very large data sets
28. HDFS Cluster
Single active NameNode
Thousands of DataNodes
Tens of thousands of HDFS clients
Active-Standby Architecture
29. Standard HDFS operations
Active NameNode workflow
1. Receive request from a client,
2. Apply the update to its memory state,
3. Record the update as a journal transaction in persistent storage,
4. Return result to the client
HDFS Client (read or write to a file)
- Send request to the NameNode, receive replica locations
- Read or write data from or to DataNodes
DataNode
- Data transfer to / from clients and between DataNodes
- Report replica state change to NameNode(s): new, deleted, corrupt
- Report its state to NameNode(s): heartbeats, block reports
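Pulling the Active NameNode workflow above into a rough sketch (hypothetical class and method names, not actual HDFS code):

// Illustrative sketch of the active NameNode update path; all types are stubs.
final class ActiveNameNodeSketch {
    static final class ClientRequest {}
    static final class Update {}
    static final class Namespace { Update apply(ClientRequest r) { return new Update(); } }
    static final class Journal { void log(Update u) { /* durably record the edit */ } }

    private final Namespace memoryState = new Namespace(); // in-RAM namespace tree
    private final Journal editLog = new Journal();          // persistent journal

    Update handle(ClientRequest request) {          // 1. receive request from a client
        Update update = memoryState.apply(request); // 2. apply the update to memory state
        editLog.log(update);                         // 3. record a journal transaction
        return update;                               // 4. return the result to the client
    }
}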
31. Replicated Namespace
Replicated NameNode is called a ConsensusNode or CNode
ConsensusNodes play equal active role on the cluster
- Provide write and read access to the namespace
The namespace replicas are consistent with each other
- Each CNode maintains a copy of the same namespace
- Namespace updates applied to one CNode are propagated to the others
Coordination Engine establishes the global order of namespace updates
- All CNodes apply the same deterministic updates in the same deterministic order
- Starting from the same initial state and applying the same updates = consistency
Coordination Engine provides consistency of multiple namespace replicas
32. Coordinated HDFS Cluster
Independent CNodes – the same namespace
Load balancing client requests
Proposal, Agreement
Coordinated updates
Multiple active Consensus Nodes share namespace via Coordination Engine
33. Coordinated HDFS operations
ConsensusNode workflow
1. Receive request from a client
2. Submit a proposal for the update to the Coordination Engine and wait for agreement
3. Apply the agreed update to its memory state,
4. Record the update as a journal transaction in persistent storage (optional)
5. Return result to the client
HDFS Client and DataNode operations remain the same
Updates to the namespace when a file or a directory is created are coordinated
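The coordinated workflow above differs only in the proposal step, as in this hedged sketch (invented names, not the ConsensusNode implementation): the update is proposed to the coordination engine and applied only once it comes back as an agreement.

import java.util.concurrent.CompletableFuture;

// Illustrative sketch of a coordinated namespace update.
final class ConsensusNodeSketch {
    record Agreement(long gsn, byte[] update) {}

    interface CoordinationEngine {
        // Submit a proposal; the future completes when a global agreement is reached.
        CompletableFuture<Agreement> propose(byte[] update);
    }

    private final CoordinationEngine engine;

    ConsensusNodeSketch(CoordinationEngine engine) { this.engine = engine; }

    byte[] handle(byte[] update) throws Exception {      // 1. receive request from a client
        Agreement agreed = engine.propose(update).get(); // 2. propose and wait for agreement
        applyToMemoryState(agreed);                       // 3. apply the agreed update
        // 4. optionally record the update in the journal
        return agreed.update();                           // 5. return the result to the client
    }

    private void applyToMemoryState(Agreement agreed) { /* apply in GSN order */ }
}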
34. Strict Consistency Model
Coordination Engine transforms namespace modification proposals into the global sequence of agreements
- Applied to namespace replicas in the order of their Global Sequence Number
ConsensusNodes may have different states at a given moment of “clock” time
- As the rate of consuming agreements may vary
CNodes have the same namespace state when they reach the same GSN
One-copy-equivalence
- Each replica is presented to the client as if it were the only copy
One-Copy-Equivalence as known in replicated databases
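To make the GSN ordering concrete, here is a small illustrative sketch (invented names): agreements may arrive out of order, but each replica buffers them and applies them strictly by GSN, so any two replicas that have reached the same GSN hold identical state.

import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative: apply agreements strictly in GSN order even if delivered out of order.
final class GsnOrderedApplier {
    record Agreement(long gsn, Runnable update) {}

    private final PriorityQueue<Agreement> pending =
            new PriorityQueue<>(Comparator.comparingLong(Agreement::gsn));
    private long lastAppliedGsn = 0;

    synchronized void deliver(Agreement agreement) {
        pending.add(agreement);
        // Drain while the next expected GSN is at the head of the queue.
        while (!pending.isEmpty() && pending.peek().gsn() == lastAppliedGsn + 1) {
            pending.poll().update().run();
            lastAppliedGsn++;
        }
    }

    synchronized long gsn() { return lastAppliedGsn; }
}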
35. Consensus Node Proxy
CNodeProxyProvider – a pluggable substitute for FailoverProxyProvider
- Defined via Configuration
Main features
- Randomly chooses CNode when client is instantiated
- Sticky until a timeout occurs
- Fails over to another CNode
- Smart enough to avoid SafeMode
Further improvements
- Take into account network proximity
Reads do not modify the namespace and can be directed to any ConsensusNode
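A minimal sketch of the proxy behaviour (an illustration, not the CNodeProxyProvider implementation): pick a CNode at random when the client starts, stay sticky to it, and move to another CNode on failover.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative client-side selection: random initial CNode, sticky, fails over on error.
final class CNodeSelectorSketch {
    private final List<String> cnodes; // e.g. hypothetical addresses "cnode1:8020", "cnode2:8020"
    private int current;

    CNodeSelectorSketch(List<String> cnodes) {
        this.cnodes = cnodes;
        this.current = ThreadLocalRandom.current().nextInt(cnodes.size()); // random initial choice
    }

    String currentCNode() { return cnodes.get(current); } // sticky until a failover is triggered

    String failover() {                                    // switch to the next CNode
        current = (current + 1) % cnodes.size();
        return currentCNode();
    }
}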
37. Using a TCP Connection to send data to three replicated servers (Load Balancer)
Diagram: a client sends operations through a load balancer to server1, server2, and server3; server3 has received fewer operations than the other two.
38. Problems with using a Load Balancer
Load balancer becomes the single point of failure
- Need to make the LB highly available and distributed
Since Paxos is not employed to reach consensus between the three replicas, strong
consistency cannot be guaranteed
- Replicas will quickly diverge
39. HBase WAL or HDFS Edits Log replication
State Machine (HRegion contents, HDFS NameNode metadata, etc.) is modified first
Modification Log (HBase WAL or HDFS Edits Log) is sent to a Highly Available shared storage, QJM, etc.
Standby Server(s) read the edits log and serve as warm standby servers, ready to take over should the active server fail
40. HBase WAL or HDFS Edits Log replication
Diagram: a single active server (server1) writes its WAL/Edits Log to shared storage; server2 acts as a standby server, reading the log from shared storage.
41. Only one active server is possible
Failover takes time
Failover is error prone, with intricate fencing etc.
The cost of reaching consensus already has to be paid for an HDFS Edits Log entry to be deemed safely stored, so why not pay that cost before modifying the state and thereby have multiple active servers?
HBase WAL or HDFS Edits Log tailing
45. NonStopRegionServer:
Diagram: NonStopRegionServer 1 and NonStopRegionServer 2 each wrap an HRegionServer, exposing the client service (e.g. multi) and embedding the DConE library.
1. HBase client calls HRegionServer multi
2. NonStopRegionServer intercepts the call
3. NonStopRegionServer makes a Paxos proposal using the DConE library
4. The proposal comes back as an agreement on all NonStopRegionServers
5. NonStopRegionServer calls super.multi on all nodes; state changes are recorded
6. NonStopRegionServer 1 alone sends the response back to the client
Subclassing the HRegionServer
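Schematically, the interception looks like the sketch below. This is illustrative only: the real HRegionServer.multi signature and the DConE API differ, and all types here are stand-ins.

// Stand-in for HRegionServer with a multi-style batch call.
class RegionServerSketch {
    Object multi(Object batchRequest) { /* apply the batch locally */ return new Object(); }
}

// Illustrative NonStopRegionServer: coordinate first, then apply, then reply once.
final class NonStopRegionServerSketch extends RegionServerSketch {
    interface Coordinator {                  // stand-in for the DConE library
        Object agree(Object proposal);       // returns the proposal once globally agreed
    }

    private final Coordinator coordinator;
    private final boolean originatingNode;   // only the node the client called responds

    NonStopRegionServerSketch(Coordinator coordinator, boolean originatingNode) {
        this.coordinator = coordinator;
        this.originatingNode = originatingNode;
    }

    @Override
    Object multi(Object batchRequest) {
        Object agreed = coordinator.agree(batchRequest); // 3-4. propose and wait for agreement
        Object result = super.multi(agreed);             // 5. every replica applies the agreed batch
        return originatingNode ? result : null;          // 6. only the originating node replies
    }
}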
46. HBase RegionServer replication using WANdisco DConE
Shared nothing architecture
HFiles, WALs etc. are not shared
Replica count is tuned
Snapshots of HFiles do not need to be created
Messy details of WAL tailing are not necessary
Sequenced set of operations
Proposers (nodes that propose) issue a new sequence number of a higher value based on the last sequence number they are aware of
A majority agrees that this higher number has not been seen and, if so, allows the transaction to complete
Consensus must be reached on the current proposal
Seven key innovations over Paxos
Distributed garbage collection
Any system that deals with distributed state should be able to safely discard state information on disk and in memory for efficient resource utilization. The point at which it is safe to do so is the point at which the state information is no longer required to assist in the recovery of a node at any site. Each DConE instance sends messages to its peers at other nodes at pre-defined intervals to determine the highest contiguously populated agreement common to all of them. It then deletes all agreements from the agreement log, and all agreed proposals from the proposal log that are no longer needed for recovery.
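A hedged sketch of that garbage collection (invented names, not DConE's implementation): each node reports the highest agreement it has applied with no gaps below it, the minimum across all peers is the safe prune point, and everything at or below that point is discarded.

import java.util.Collection;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative distributed GC over an agreement log keyed by agreement number.
final class AgreementLogGc {
    private final NavigableMap<Long, byte[]> agreementLog = new TreeMap<>();

    void record(long agreementNumber, byte[] agreement) {
        agreementLog.put(agreementNumber, agreement);
    }

    // Each peer reports its highest contiguously populated agreement number.
    void prune(Collection<Long> peerHighestContiguousAgreements) {
        long safePoint = peerHighestContiguousAgreements.stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(0L);
        // Entries at or below the safe point are no longer needed for any node's recovery.
        agreementLog.headMap(safePoint, true).clear();
    }
}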
Distinguished and fair round numbers for proposals
DConE’s use of distinguished and fair round numbers in the process of achieving consensus avoids the contention that would otherwise arise when multiple proposals are submitted simultaneously by different nodes using the same round number. If this option is used, the round number will consist of three components: (1) a monotonically increasing component which is simply the increment of the last monotonic component; (2) a distinguished component which is a component specific to each proposer and (3) a random component. If two proposers clash on the first component, then the random component is evaluated, and the proposer whose number has the larger random number component wins. If there is still no winner, then the distinguished component is compared, and the winner is the one with the largest distinguished component. Without this approach the competing nodes could end up simply incrementing the last attempted round number and resubmitting their proposals. This could lead to thrashing that would negatively impact the performance of the distributed system. This approach also ensures fairness in the sense that it prevents any node from always winning.
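A small sketch of that comparison (the exact DConE encoding is not shown here, so these names are illustrative): a round number carries a monotonically increasing component, a random component, and a per-proposer distinguished component, and clashes are resolved in that order.

import java.util.Comparator;

// Illustrative round number: monotonic, then random, then distinguished component.
record RoundNumber(long monotonic, long random, long distinguished) {
    static final Comparator<RoundNumber> ORDER =
            Comparator.comparingLong(RoundNumber::monotonic)
                      .thenComparingLong(RoundNumber::random)
                      .thenComparingLong(RoundNumber::distinguished);

    // The proposal whose round number compares greater wins a clash.
    boolean beats(RoundNumber other) {
        return ORDER.compare(this, other) > 0;
    }
}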
Weak Reservations
DConE provides an optional weak reservation mechanism to eliminate pre-emption of proposers under high transaction volume scenarios. For example, if there are three proposers - one, two and three - the proposer’s number determines which range of agreement numbers that proposer will drive. This avoids any possibility of collisions among the multiple proposals from each proposer that are proceeding in parallel across the distributed system.
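One possible shape of such a reservation, as a toy illustration (not DConE's actual scheme): with three proposers, agreement numbers can be partitioned so that each proposer only ever drives numbers from its own disjoint set.

// Toy weak reservation: proposer i drives agreement numbers where
// (agreementNumber % proposerCount) == i, so proposers never collide.
final class WeakReservationSketch {
    private final int proposerCount;
    private long next;

    WeakReservationSketch(int proposerId, int proposerCount) {
        this.proposerCount = proposerCount;
        this.next = proposerId; // first reserved agreement number for this proposer
    }

    long nextReservedAgreementNumber() {
        long n = next;
        next += proposerCount;  // skip ahead to this proposer's next reserved slot
        return n;
    }
}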
Dynamic group evolution
DConE supports the concept of dynamic group evolution, allowing a distributed system to scale to support new sites and users. New nodes can be added to a distributed system, or existing nodes can be removed without interrupting the operation of the remaining nodes.
Backoff and collision avoidance
DConE provides a backoff mechanism for avoiding repeated pre-emption of proposers by their peers. Conventional replicated state machines allow the preempted proposer to immediately initiate a new round with an agreement number higher than that of the pre-emptor. This approach can lead an agreement protocol to thrash for an extended period of time and severely degrade performance.
With DConE, when a round is pre-empted, the DConE instance which initiated the proposal computes the duration of a backoff delay. The proposer then waits for this duration before initiating the next round. DConE uses an approach similar to Carrier Sense Multiple Access/Collision Detection (CSMA/CD) protocols for non-switched Ethernet.
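A rough sketch of a CSMA/CD-style backoff (a generic illustration, not DConE's actual delay computation): after each pre-emption the proposer waits a randomized delay whose upper bound grows, then retries.

import java.util.concurrent.ThreadLocalRandom;

// Illustrative backoff: randomized delay whose bound doubles per pre-emption, capped.
final class ProposalBackoff {
    private static final long BASE_MILLIS = 10;
    private static final long MAX_MILLIS = 5_000;
    private int preemptions;

    long nextDelayMillis() {
        preemptions++;
        long bound = Math.min(MAX_MILLIS, BASE_MILLIS << Math.min(preemptions, 16));
        return ThreadLocalRandom.current().nextLong(bound + 1); // wait in [0, bound] ms
    }

    void reset() { preemptions = 0; } // call after a round completes without pre-emption
}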
Multiple proposers
Both say "I want tx 179", so they are competing.
Collision avoidance
Paxos round: sends out a read message to the acceptors
Disadvantages:
1. Resources used to support Standby
2. Single NN is a bottleneck
3. Failover: complex, still outage
Can do better than that with consistent replication
Double determinism is important
If NameNodes start from the same state and apply the same deterministic updates in the same deterministic order, their states are consistent.
Independent NameNodes don’t know about each other