NonStop Hadoop - Applying the Paxos Family of Protocols to make Critical Hadoop Services Continuously Available
1. Non-Stop Hadoop
Applying Paxos to make critical Hadoop
services Continuously Available
Jagane Sundar - CTO, WANdisco
Brett Rudenstein – Senior Product Manager, WANdisco
2. WANdisco Background
WANdisco: Wide Area Network Distributed Computing
Enterprise ready, high availability software solutions that enable globally distributed organizations to meet today’s
data challenges of secure storage, scalability and availability
Leader in tools for software engineers – Subversion
Apache Software Foundation sponsor
Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)
US patented active-active replication technology granted, November 2012
Global locations
- San Ramon (CA)
- Chengdu (China)
- Tokyo (Japan)
- Boston (MA)
- Sheffield (UK)
- Belfast (UK)
5. Elementary Server Software:
Single thread processing client requests in a loop
Diagram: a single-threaded server process loops: get client request (e.g. HBase put), make change to state (db), send return value to client.
6. Multi-threaded Server Software:
Multiple threads processing client requests in a loop
Diagram: threads 1, 2, and 3 each loop within the same server process: get client request (e.g. HBase put), acquire lock, make change to state (db), release lock, send return value to client.
8. Problem
How do we ensure that the three servers contain exactly the same data?
In other words, how do we achieve strongly consistent replication?
9. Two parts to the solution:
(Multiple Replicas of) A Deterministic State Machine
The exact same sequence of operations, to be applied to each replica of the DSM
10. A Deterministic State Machine
A state machine in which a given operation always results in the same end state
Non-deterministic factors cannot play a role in the end state of any operation in a DSM. Examples of non-deterministic factors:
- Time
- Random numbers
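To make this concrete, here is a minimal sketch of a deterministic state machine in Java (illustrative names, not WANdisco code): nothing in apply() depends on time, randomness, or thread scheduling, so every replica that applies the same operations in the same order ends up with identical state.

import java.util.Map;
import java.util.TreeMap;

// Minimal deterministic state machine: a sorted key-value store.
// Applying the same ordered operations on any replica yields identical state.
final class KeyValueDsm {
    // An operation carries everything needed to apply it; no clocks, no randomness.
    record Op(String key, String value) {}

    private final Map<String, String> state = new TreeMap<>();

    void apply(Op op) {
        state.put(op.key(), op.value()); // outcome depends only on the op and prior state
    }

    Map<String, String> snapshot() {
        return Map.copyOf(state);
    }
}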
11. Creating three replicated servers
Apply all modify operations in the exact same sequence in each replicated server = multiple servers with exactly the same replicated data
Diagram: server1, server2, and server3 each run a Server Process (DSM) consuming the same ordered stream of operations.
12. Problem:
How do we achieve consensus between these servers on the sequence of operations to perform?
Paxos is the answer
- Algorithm for reaching consensus in a network of unreliable processors
13. Three replicated servers
Diagram: server1, server2, and server3 each run a Server Process on top of a Distributed Coordination Engine; clients send operations to any server, and the engines use Paxos to agree on a single global order of operations.
15. Paxos
Paxos is an Algorithm for building Replicated Servers with strong consistency
1. Synod algorithm for achieving consensus among a network of unreliable processes
2. The application of consensus to the task of replicating a Deterministic State Machine
Paxos does not
- Specify a network protocol
- Invent a new language
- Restrict use to a specific language
16. Replicated State Machine
Installed on each node that participates in the distributed system
All nodes function as peers to deliver and ensure that the same transaction order occurs on every system
- Achieve Consistent Replication
Consensus
- Roles
• Proposers, Acceptors, Learners
- Phases
• Election of a node to be the proposer
• Broadcast of the proposal to peers
• Acceptance of the proposal by a majority
17. Paxos Roles
Proposer
- The client or a proxy for the client
- Proposes a change to the Deterministic State Machine
Acceptor
- Acceptors are the ‘memory’ of Paxos
- Quorum is established amongst acceptors
Learner
- The DSM (Replicated, of course)
- Each Learner applies the exact same sequence of operations as proposed by the Proposers, and accepted by a majority quorum of Acceptors
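As a hedged sketch of the three roles (interfaces invented for illustration; this is not DConE's API):

// Illustrative role interfaces; the names and signatures are stand-ins.
interface Proposer {
    // Propose an operation on behalf of a client; returns the agreed slot in the
    // global sequence once consensus is reached.
    long propose(byte[] operation);
}

interface Acceptor {
    // Phase 1: promise not to accept proposals below 'round' (Acceptors are Paxos's memory).
    boolean prepare(long round);
    // Phase 2: accept a value for a round, unless a higher round was already promised.
    boolean accept(long round, byte[] operation);
}

interface Learner {
    // The replicated DSM: applies agreed operations strictly in slot order.
    void learn(long slot, byte[] operation);
}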
18. Paxos - Ordering
Proposers issue a new sequence number higher than the last sequence number they know of
A majority agrees this number has not been seen
Consensus must be reached on the current proposal
20. DConE Innovations
Beyond Paxos
Quorum Configurations
- Majority, Singleton, Unanimous
- Distinguished Node – Tie Breaker
- Quorum Rotations – follow the sun
- Emergency Reconfigure
Concurrent agreement handling
- Paxos only allows agreement on one proposal at a time
• Slow performance in a high transaction volume environment
- DConE allows simultaneous proposals from multiple proposers
21. DConE Innovations
Beyond Paxos
Dynamic group evolution
- Add and remove nodes
- Add and remove sites
- No interruption of current operations
Distributed garbage collection
- Safely discard state on disk and in memory when it is no longer required to assist in recovery
- Messages are sent to peers at pre-defined intervals to determine the highest common agreement
- Agreements and agreed proposals that are no longer needed for recovery are deleted
22. DConE Innovations
Beyond Paxos
Backoff and collision avoidance
- Avoids repeated pre-emption of proposers by their peers
- Prevents thrashing which can severely degrade performance.
- When a round is pre-empted, a backoff delay is computed
23. Self Healing
Automatic Back up and Recovery
All nodes are mirrors/replicas of each other
- Any node can be used as a helper to bring a failed node back
Read access without Quorum
- Cluster is still accessible for reads
- Writes are disallowed, preventing split brain
Automatic catch up
- Servers that have been offline learn of transactions that were agreed on while they were unavailable
- The missing transactions are played back and, once caught up, the servers become fully participating members of the distributed system again (see the sketch after this list)
Servers can be updated without down time
- Allows for rolling upgrades
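A minimal sketch of the catch-up step (names invented for illustration): a server that was offline asks a peer for all agreements above its last applied GSN and replays them in order.

import java.util.List;

// Illustrative catch-up: replay agreements missed while this node was offline.
final class CatchUp {
    record Agreement(long gsn, byte[] operation) {}

    interface Peer {
        // Agreements with GSN strictly greater than 'afterGsn', in ascending GSN order.
        List<Agreement> agreementsAfter(long afterGsn);
    }

    interface StateMachine {
        long lastAppliedGsn();
        void apply(Agreement agreement);
    }

    static void recover(StateMachine local, Peer helper) {
        for (Agreement a : helper.agreementsAfter(local.lastAppliedGsn())) {
            local.apply(a); // replay in order; the node then rejoins as a full participant
        }
    }
}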
25. Co-ordinating intent
Diagram (co-ordinating intent): the proposal to mkdir /a and the proposal to createFile /a are both put through Paxos before either is applied; every server applies the agreed operation, and the conflicting operation fails identically on both servers.
Diagram (co-ordinating outcome, e.g. WAL, HDFS Edits Log): server1 and server2 apply mkdir /a and createFile /a locally before coordinating; server1 state is wrong and the mkdir /a operation needs to be undone.
27. HDFS Architecture
HDFS metadata is decoupled from data
- Namespace is a hierarchy of files and directories represented by INodes
- INodes record attributes: permissions, quotas, timestamps, replication
NameNode keeps its entire state in RAM
- Memory state: the namespace tree and the mapping of blocks to DataNodes
- Persistent state: recent checkpoint of the namespace and journal log
File data is divided into blocks (default 128MB)
- Each block is independently replicated on multiple DataNodes (default 3)
- Block replicas stored on DataNodes as local files on local drives
Reliable distributed file system for storing very large data sets
28. HDFS Cluster
Single active NameNode
Thousands of DataNodes
Tens of thousands of HDFS clients
Active-Standby Architecture
29. Standard HDFS operations
Active NameNode workflow
1. Receive request from a client,
2. Apply the update to its memory state,
3. Record the update as a journal transaction in persistent storage,
4. Return result to the client
HDFS Client (read or write to a file)
- Send request to the NameNode, receive replica locations
- Read or write data from or to DataNodes
DataNode
- Data transfer to / from clients and between DataNodes
- Report replica state change to NameNode(s): new, deleted, corrupt
- Report its state to NameNode(s): heartbeats, block reports
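Pulling the Active NameNode workflow above into a rough sketch (hypothetical class and method names, not actual HDFS code):

// Illustrative sketch of the active NameNode update path; all types are stubs.
final class ActiveNameNodeSketch {
    static final class ClientRequest {}
    static final class Update {}
    static final class Namespace { Update apply(ClientRequest r) { return new Update(); } }
    static final class Journal { void log(Update u) { /* durably record the edit */ } }

    private final Namespace memoryState = new Namespace(); // in-RAM namespace tree
    private final Journal editLog = new Journal();          // persistent journal

    Update handle(ClientRequest request) {          // 1. receive request from a client
        Update update = memoryState.apply(request); // 2. apply the update to memory state
        editLog.log(update);                         // 3. record a journal transaction
        return update;                               // 4. return the result to the client
    }
}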
31. Replicated Namespace
Replicated NameNode is called a ConsensusNode or CNode
ConsensusNodes play equal active role on the cluster
- Provide write and read access to the namespace
The namespace replicas are consistent with each other
- Each CNode maintains a copy of the same namespace
- Namespace updates applied to one CNode are propagated to the others
Coordination Engine establishes the global order of namespace updates
- All CNodes apply the same deterministic updates in the same deterministic order
- Starting from the same initial state and applying the same updates = consistency
Coordination Engine provides consistency of multiple namespace replicas
32. Coordinated HDFS Cluster
Independent CNodes – the same namespace
Load balancing client requests
Proposal, Agreement
Coordinated updates
Multiple active Consensus Nodes share namespace via Coordination Engine
33. Coordinated HDFS operations
ConsensusNode workflow
1. Receive request from a client
2. Submit a proposal for the update to the Coordination Engine and wait for agreement
3. Apply the agreed update to its memory state,
4. Record the update as a journal transaction in persistent storage (optional)
5. Return result to the client
HDFS Client and DataNode operations remain the same
Updates to the namespace when a file or a directory is created are coordinated
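The coordinated workflow above differs only in the proposal step, as in this hedged sketch (invented names, not the ConsensusNode implementation): the update is proposed to the coordination engine and applied only once it comes back as an agreement.

import java.util.concurrent.CompletableFuture;

// Illustrative sketch of a coordinated namespace update.
final class ConsensusNodeSketch {
    record Agreement(long gsn, byte[] update) {}

    interface CoordinationEngine {
        // Submit a proposal; the future completes when a global agreement is reached.
        CompletableFuture<Agreement> propose(byte[] update);
    }

    private final CoordinationEngine engine;

    ConsensusNodeSketch(CoordinationEngine engine) { this.engine = engine; }

    byte[] handle(byte[] update) throws Exception {      // 1. receive request from a client
        Agreement agreed = engine.propose(update).get(); // 2. propose and wait for agreement
        applyToMemoryState(agreed);                       // 3. apply the agreed update
        // 4. optionally record the update in the journal
        return agreed.update();                           // 5. return the result to the client
    }

    private void applyToMemoryState(Agreement agreed) { /* apply in GSN order */ }
}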
34. Strict Consistency Model
Coordination Engine transforms namespace modification proposals into the global sequence of agreements
- Applied to namespace replicas in the order of their Global Sequence Number
ConsensusNodes may have different states at a given moment of “clock” time
- As the rate of consuming agreements may vary
CNodes have the same namespace state when they reach the same GSN
One-copy-equivalence
- Each replica is presented to the client as if it were the only copy
One-Copy-Equivalence as known in replicated databases
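To make the GSN ordering concrete, here is a small illustrative sketch (invented names): agreements may arrive out of order, but each replica buffers them and applies them strictly by GSN, so any two replicas that have reached the same GSN hold identical state.

import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative: apply agreements strictly in GSN order even if delivered out of order.
final class GsnOrderedApplier {
    record Agreement(long gsn, Runnable update) {}

    private final PriorityQueue<Agreement> pending =
            new PriorityQueue<>(Comparator.comparingLong(Agreement::gsn));
    private long lastAppliedGsn = 0;

    synchronized void deliver(Agreement agreement) {
        pending.add(agreement);
        // Drain while the next expected GSN is at the head of the queue.
        while (!pending.isEmpty() && pending.peek().gsn() == lastAppliedGsn + 1) {
            pending.poll().update().run();
            lastAppliedGsn++;
        }
    }

    synchronized long gsn() { return lastAppliedGsn; }
}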
35. Consensus Node Proxy
CNodeProxyProvider – a pluggable substitute for FailoverProxyProvider
- Defined via Configuration
Main features
- Randomly chooses CNode when client is instantiated
- Sticky until a timeout occurs
- Fails over to another CNode
- Smart enough to avoid SafeMode
Further improvements
- Take into account network proximity
Reads do not modify the namespace and can be directed to any ConsensusNode
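A minimal sketch of the proxy behaviour (an illustration, not the CNodeProxyProvider implementation): pick a CNode at random when the client starts, stay sticky to it, and move to another CNode on failover.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative client-side selection: random initial CNode, sticky, fails over on error.
final class CNodeSelectorSketch {
    private final List<String> cnodes; // e.g. hypothetical addresses "cnode1:8020", "cnode2:8020"
    private int current;

    CNodeSelectorSketch(List<String> cnodes) {
        this.cnodes = cnodes;
        this.current = ThreadLocalRandom.current().nextInt(cnodes.size()); // random initial choice
    }

    String currentCNode() { return cnodes.get(current); } // sticky until a failover is triggered

    String failover() {                                    // switch to the next CNode
        current = (current + 1) % cnodes.size();
        return currentCNode();
    }
}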
37. Using a TCP Connection to send data to three replicated servers (Load Balancer)
Diagram: a client sends operations through a load balancer to server1, server2, and server3; server3 has received fewer operations than the other two.
38. Problems with using a Load Balancer
Load balancer becomes the single point of failure
- Need to make the LB highly available and distributed
Since Paxos is not employed to reach consensus between the three replicas, strong
consistency cannot be guaranteed
- Replicas will quickly diverge
39. HBase WAL or HDFS Edits Log replication
State Machine (HRegion contents, HDFS NameNode metadata, etc.) is modified first
Modification Log (HBase WAL or HDFS Edits Log) is sent to a Highly Available shared storage, QJM, etc.
Standby Server(s) read the edits log and serve as warm standby servers, ready to take over should the active server fail
40. HBase WAL or HDFS Edits Log replication
Diagram: a single active server (server1) writes its WAL/Edits Log to shared storage; server2 acts as a standby server, reading the log from shared storage.
41. Only one active server is possible
Failover takes time
Failover is error prone, with intricate fencing etc.
The cost of reaching consensus already has to be paid for an HDFS Edits Log entry to be deemed safely stored, so why not pay that cost before modifying the state and thereby have multiple active servers?
HBase WAL or HDFS Edits Log tailing
45. NonStopRegionServer:
Diagram: NonStopRegionServer 1 and NonStopRegionServer 2 each wrap an HRegionServer, exposing the client service (e.g. multi) and embedding the DConE library.
1. HBase client calls HRegionServer multi
2. NonStopRegionServer intercepts the call
3. NonStopRegionServer makes a Paxos proposal using the DConE library
4. The proposal comes back as an agreement on all NonStopRegionServers
5. NonStopRegionServer calls super.multi on all nodes; state changes are recorded
6. NonStopRegionServer 1 alone sends the response back to the client
Subclassing the HRegionServer
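Schematically, the interception looks like the sketch below. This is illustrative only: the real HRegionServer.multi signature and the DConE API differ, and all types here are stand-ins.

// Stand-in for HRegionServer with a multi-style batch call.
class RegionServerSketch {
    Object multi(Object batchRequest) { /* apply the batch locally */ return new Object(); }
}

// Illustrative NonStopRegionServer: coordinate first, then apply, then reply once.
final class NonStopRegionServerSketch extends RegionServerSketch {
    interface Coordinator {                  // stand-in for the DConE library
        Object agree(Object proposal);       // returns the proposal once globally agreed
    }

    private final Coordinator coordinator;
    private final boolean originatingNode;   // only the node the client called responds

    NonStopRegionServerSketch(Coordinator coordinator, boolean originatingNode) {
        this.coordinator = coordinator;
        this.originatingNode = originatingNode;
    }

    @Override
    Object multi(Object batchRequest) {
        Object agreed = coordinator.agree(batchRequest); // 3-4. propose and wait for agreement
        Object result = super.multi(agreed);             // 5. every replica applies the agreed batch
        return originatingNode ? result : null;          // 6. only the originating node replies
    }
}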
46. HBase RegionServer replication using WANdisco DConE
Shared nothing architecture
HFiles, WALs etc. are not shared
Replica count is tuned
Snapshots of HFiles do not need to be created
Messy details of WAL tailing are not necessary
Sequenced set of operations
Proposers (nodes that propose) issue a new sequence number of a higher value based on the last sequence number they are aware of
A majority agrees that this higher number has not been seen and, if so, allows the transaction to complete
Consensus must be reached on the current proposal
Seven key innovations over Paxos
Distributed garbage collection
Any system that deals with distributed state should be able to safely discard state information on disk and in memory for efficient resource utilization. The point at which it is safe to do so is the point at which the state information is no longer required to assist in the recovery of a node at any site. Each DConE instance sends messages to its peers at other nodes at pre-defined intervals to determine the highest contiguously populated agreement common to all of them. It then deletes all agreements from the agreement log, and all agreed proposals from the proposal log that are no longer needed for recovery.
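A hedged sketch of that garbage collection (invented names, not DConE's implementation): each node reports the highest agreement it has applied with no gaps below it, the minimum across all peers is the safe prune point, and everything at or below that point is discarded.

import java.util.Collection;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative distributed GC over an agreement log keyed by agreement number.
final class AgreementLogGc {
    private final NavigableMap<Long, byte[]> agreementLog = new TreeMap<>();

    void record(long agreementNumber, byte[] agreement) {
        agreementLog.put(agreementNumber, agreement);
    }

    // Each peer reports its highest contiguously populated agreement number.
    void prune(Collection<Long> peerHighestContiguousAgreements) {
        long safePoint = peerHighestContiguousAgreements.stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(0L);
        // Entries at or below the safe point are no longer needed for any node's recovery.
        agreementLog.headMap(safePoint, true).clear();
    }
}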
Distinguished and fair round numbers for proposals
DConE’s use of distinguished and fair round numbers in the process of achieving consensus avoids the contention that would otherwise arise when multiple proposals are submitted simultaneously by different nodes using the same round number. If this option is used, the round number will consist of three components: (1) a monotonically increasing component which is simply the increment of the last monotonic component; (2) a distinguished component which is a component specific to each proposer and (3) a random component. If two proposers clash on the first component, then the random component is evaluated, and the proposer whose number has the larger random number component wins. If there is still no winner, then the distinguished component is compared, and the winner is the one with the largest distinguished component. Without this approach the competing nodes could end up simply incrementing the last attempted round number and resubmitting their proposals. This could lead to thrashing that would negatively impact the performance of the distributed system. This approach also ensures fairness in the sense that it prevents any node from always winning.
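A small sketch of that comparison (the exact DConE encoding is not shown here, so these names are illustrative): a round number carries a monotonically increasing component, a random component, and a per-proposer distinguished component, and clashes are resolved in that order.

import java.util.Comparator;

// Illustrative round number: monotonic, then random, then distinguished component.
record RoundNumber(long monotonic, long random, long distinguished) {
    static final Comparator<RoundNumber> ORDER =
            Comparator.comparingLong(RoundNumber::monotonic)
                      .thenComparingLong(RoundNumber::random)
                      .thenComparingLong(RoundNumber::distinguished);

    // The proposal whose round number compares greater wins a clash.
    boolean beats(RoundNumber other) {
        return ORDER.compare(this, other) > 0;
    }
}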
Weak Reservations
DConE provides an optional weak reservation mechanism to eliminate pre-emption of proposers under high transaction volume scenarios. For example, if there are three proposers - one, two and three - the proposer’s number determines which range of agreement numbers that proposer will drive. This avoids any possibility of collisions among the multiple proposals from each proposer that are proceeding in parallel across the distributed system.
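One possible shape of such a reservation, as a toy illustration (not DConE's actual scheme): with three proposers, agreement numbers can be partitioned so that each proposer only ever drives numbers from its own disjoint set.

// Toy weak reservation: proposer i drives agreement numbers where
// (agreementNumber % proposerCount) == i, so proposers never collide.
final class WeakReservationSketch {
    private final int proposerCount;
    private long next;

    WeakReservationSketch(int proposerId, int proposerCount) {
        this.proposerCount = proposerCount;
        this.next = proposerId; // first reserved agreement number for this proposer
    }

    long nextReservedAgreementNumber() {
        long n = next;
        next += proposerCount;  // skip ahead to this proposer's next reserved slot
        return n;
    }
}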
Dynamic group evolution
DConE supports the concept of dynamic group evolution, allowing a distributed system to scale to support new sites and users. New nodes can be added to a distributed system, or existing nodes can be removed without interrupting the operation of the remaining nodes.
Backoff and collision avoidance
DConE provides a backoff mechanism for avoiding repeated pre-emption of proposers by their peers. Conventional replicated state machines allow the preempted proposer to immediately initiate a new round with an agreement number higher than that of the pre-emptor. This approach can lead an agreement protocol to thrash for an extended period of time and severely degrade performance.
With DConE, when a round is pre-empted, the DConE instance which initiated the proposal computes the duration of a backoff delay. The proposer then waits for this duration before initiating the next round. DConE uses an approach similar to Carrier Sense Multiple Access/Collision Detection (CSMA/CD) protocols for non-switched Ethernet.
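A rough sketch of a CSMA/CD-style backoff (a generic illustration, not DConE's actual delay computation): after each pre-emption the proposer waits a randomized delay whose upper bound grows, then retries.

import java.util.concurrent.ThreadLocalRandom;

// Illustrative backoff: randomized delay whose bound doubles per pre-emption, capped.
final class ProposalBackoff {
    private static final long BASE_MILLIS = 10;
    private static final long MAX_MILLIS = 5_000;
    private int preemptions;

    long nextDelayMillis() {
        preemptions++;
        long bound = Math.min(MAX_MILLIS, BASE_MILLIS << Math.min(preemptions, 16));
        return ThreadLocalRandom.current().nextLong(bound + 1); // wait in [0, bound] ms
    }

    void reset() { preemptions = 0; } // call after a round completes without pre-emption
}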
Multiple proposers
Both say "I want tx 179", so they are competing.
Collision avoidance
Paxos round: sends out a read message to the acceptors
Disadvantages:
1. Resources used to support Standby
2. Single NN is a bottleneck
3. Failover: complex, still outage
Can do better than that with consistent replication
Double determinism is important
If NameNodes start from the same state and apply the same deterministic updates in the same deterministic order, their states are consistent.
Independent NameNodes don’t know about each other