Coordinating Metadata Replication
Survival Strategy for Distributed Systems
Konstantin V. Shvachko
April 3, 2014
Amsterdam, Netherlands
Introduction to Survival
 Clusters of servers as communities of machines
- United by a common mission
 Communication
- Act as a whole towards a single goal
 Coordination is the key for survival
- Got a problem – propose a solution
- Make a decision – come to an agreement
- Execute the decision
 Typical for their type of species
- Computers propose Numbers, and
- Use Algorithms to reach an agreement
In the world of computers
2
Contents
 Coordination Engine – the survival basics
 Replicated Namespace for HDFS
 Geographically-distributed HDFS
 Replicated Regions for HBase
 Potential Benefits – more survival tools
3
Coordination Engine
 The Survival Basics
4
Coordination Engine
 Coordination Engine allows nodes to agree on the order of events submitted to the engine by multiple proposers
- Anybody can Propose
- Single Agreement every time
- Learners observe the same agreements in the same order
Sequencer: Determines the order in which a number of operations occur
5
Proposer
Crash course in Coordination Engines: submit Proposal
6
Acceptor
Crash course in Coordination Engines: produce an Agreement
7
Learner
Crash course in Coordination Engines: take action based on the Agreement
8
Single-Node Coordination Engine
 Easy to Coordinate
- Single NameNode is an example of a simple Coordination Engine
- Performance and availability bottleneck
- Single point of failure
Simple but lacks reliability
9
Distributed Coordination Engine
 Not so easy to Coordinate
A reliable approach is to use multiple nodes
10
Challenges of Distributed Coordination
 Coordinating single value (the battle start time)
- Design for failures
Two Generals’ Problem
11
Challenges of Distributed Coordination
 Failures in distributed systems
- Any node can fail at any time
- Any failed node can recover at any time
- Messages can be lost, duplicated, reordered, or delayed arbitrarily long
Anything that can go wrong will go wrong: Murphy's law (the law of entropy)
12
Distributed Coordination Engine
 Distributed Coordination Engine consists of nodes
- Node Roles: Proposer, Learner, and Acceptor
- Each node can combine multiple roles
 Distributed coordination
- Multiple nodes submit events as proposals to a quorum of acceptors
- Acceptors agree on the order of each event in the global sequence of events
- Learners learn agreements in the same deterministic order
A reliable approach is to use multiple nodes
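A minimal sketch of the quorum idea, assuming a simple majority quorum (the slides do not state which quorum rule the engine uses). The helper names `quorum_size`, `tolerated_failures`, and `quorums_always_intersect` are illustrative only. The point is that any two majorities of the same acceptor set share at least one acceptor, which is what lets a later quorum learn what an earlier one decided.

```python
from itertools import combinations


def quorum_size(num_acceptors: int) -> int:
    """Smallest majority of the acceptors (assumed quorum rule)."""
    if num_acceptors < 1:
        raise ValueError("need at least one acceptor")
    return num_acceptors // 2 + 1


def tolerated_failures(num_acceptors: int) -> int:
    """Acceptors that may fail while a majority can still be formed."""
    return num_acceptors - quorum_size(num_acceptors)


def quorums_always_intersect(num_acceptors: int) -> bool:
    """Brute-force check that every pair of majority quorums overlaps."""
    size = quorum_size(num_acceptors)
    quorums = [set(c) for c in combinations(range(num_acceptors), size)]
    return all(a & b for a in quorums for b in quorums)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n), quorums_always_intersect(n))
```

Under this rule an even number of acceptors tolerates no more failures than the odd number just below it, which is one common reason to deploy an odd number of acceptors.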
13
Consensus Algorithms
 Coordination Engine guarantees the same state of the learners at a given GSN
- Each agreement is assigned a unique Global Sequence Number (GSN)
- GSNs form a monotonically increasing number series – the order of agreements
- Learners start from the same initial state
- Learners apply the same deterministic agreements in the same deterministic order
- GSN represents “logical” time in the coordinated system
 PAXOS is a consensus algorithm proven to tolerate a variety of failures
- Quorum-based Consensus
- Deterministic State Machine
- Leslie Lamport: Part-Time Parliament (1990)
Consensus is the process of agreeing on one result among a group of participants
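The guarantee that learners have the same state at a given GSN can be illustrated with a small sketch. It assumes GSNs are contiguous integers starting at 1 (the slides only say they increase monotonically), and the `Learner` class and its fields are hypothetical names, not part of any real engine. Agreements that arrive early are buffered and duplicates are ignored, so the apply order is always the GSN order.

```python
from dataclasses import dataclass, field


@dataclass
class Learner:
    """Applies agreements strictly in GSN order, buffering any that arrive early."""
    state: dict = field(default_factory=dict)
    last_gsn: int = 0                          # GSN of the last applied agreement
    pending: dict = field(default_factory=dict)

    def deliver(self, gsn: int, key: str, value: str) -> None:
        if gsn <= self.last_gsn or gsn in self.pending:
            return                             # duplicate delivery: ignore
        self.pending[gsn] = (key, value)
        # Apply every buffered agreement that is now contiguous with last_gsn.
        while self.last_gsn + 1 in self.pending:
            k, v = self.pending.pop(self.last_gsn + 1)
            self.state[k] = v                  # deterministic update
            self.last_gsn += 1


agreements = [(1, "/a", "dir"), (2, "/a/f", "file"), (3, "/a", "renamed")]

in_order, shuffled = Learner(), Learner()
for agreement in agreements:
    in_order.deliver(*agreement)
# Same agreements, but reordered and with one duplicate.
for agreement in [agreements[2], agreements[0], agreements[0], agreements[1]]:
    shuffled.deliver(*agreement)

print(in_order.state == shuffled.state, in_order.last_gsn, shuffled.last_gsn)
```

Both learners start from the same empty state and end with identical state once they have reached the same GSN, regardless of delivery order.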
14
HDFS
 State of the Art
15
HDFS Architecture
 HDFS metadata is decoupled from data
- Namespace is a hierarchy of files and directories represented by INodes
- INodes record attributes: permissions, quotas, timestamps, replication
 NameNode keeps its entire state in RAM
- Memory state: the namespace tree and the mapping of blocks to DataNodes
- Persistent state: recent checkpoint of the namespace and journal log
 File data is divided into blocks (default 128MB)
- Each block is independently replicated on multiple DataNodes (default 3)
- Block replicas stored on DataNodes as local files on local drives
Reliable distributed file system for storing very large data sets
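As a worked example of the block arithmetic, the sketch below uses the defaults quoted on the slide (128 MB blocks, replication 3). The function names are illustrative. The storage helper relies on the fact that a replica is kept as a local file, so the final, partially filled block only occupies its actual length.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # default block size from the slide, in bytes
REPLICATION = 3                  # default replication factor from the slide


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file of file_size bytes is divided into."""
    return (file_size + block_size - 1) // block_size   # integer ceiling


def replica_count(file_size: int, block_size: int = BLOCK_SIZE,
                  replication: int = REPLICATION) -> int:
    """Total block replicas stored across DataNodes for one file."""
    return block_count(file_size, block_size) * replication


def stored_bytes(file_size: int, replication: int = REPLICATION) -> int:
    """Bytes occupied on DataNodes; the last block is not padded."""
    return file_size * replication


one_gib = 1024 ** 3
print(block_count(one_gib), replica_count(one_gib), stored_bytes(one_gib))
print(block_count(one_gib + 1))   # one extra byte starts a new block
```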
16
HDFS Cluster
 Single active NameNode
 Thousands of DataNodes
 Tens of thousands of HDFS clients
Active-Standby Architecture
17
Standard HDFS operations
 Active NameNode workflow
1. Receive request from a client,
2. Apply the update to its memory state,
3. Record the update as a journal transaction in persistent storage,
4. Return result to the client
 HDFS Client (read or write to a file)
- Send request to the NameNode, receive replica locations
- Read or write data from or to DataNodes
 DataNode
- Data transfer to / from clients and between DataNodes
- Report replica state change to NameNode(s): new, deleted, corrupt
- Report its state to NameNode(s): heartbeats, block reports
18
Consensus Node
 Coordinated Replication of HDFS Namespace
19
Replicated Namespace
 Replicated NameNode is called a ConsensusNode or CNode
 ConsensusNodes play an equal, active role on the cluster
- Provide write and read access to the namespace
 The namespace replicas are consistent with each other
- Each CNode maintains a copy of the same namespace
- Namespace updates applied to one CNode are propagated to the others
 Coordination Engine establishes the global order of namespace updates
- All CNodes apply the same deterministic updates in the same deterministic order
- Starting from the same initial state and applying the same updates = consistency
Coordination Engine provides consistency of multiple namespace replicas
20
Coordinated HDFS Cluster
 Independent CNodes – the same namespace
 Load balancing client requests
 Proposal, Agreement
 Coordinated updates
Multiple active Consensus Nodes share namespace via Coordination Engine
21
Coordinated HDFS operations
 ConsensusNode workflow
1. Receive request from a client
2. Submit the update as a proposal to the Coordination Engine, then wait for agreement
3. Apply the agreed update to its memory state,
4. Record the update as a journal transaction in persistent storage (optional)
5. Return result to the client
 HDFS Client and DataNode operations remain the same
Updates to the namespace when a file or a directory is created are coordinated
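The ConsensusNode workflow can be sketched as follows. This is a single-process stand-in, not the real engine: `CoordinationEngine`, `CNode`, `handle_mkdir`, and `learn` are hypothetical names, agreement is reached trivially by a local counter, and delivery is synchronous, so "wait for agreement" collapses into a function call.

```python
class CoordinationEngine:
    """Single-process stand-in: orders proposals and delivers each agreement to every learner."""

    def __init__(self):
        self.learners = []
        self.next_gsn = 1

    def register(self, learner):
        self.learners.append(learner)

    def propose(self, update):
        gsn = self.next_gsn
        self.next_gsn += 1
        for learner in self.learners:        # every CNode learns every agreement
            learner.learn(gsn, update)
        return gsn


class CNode:
    def __init__(self, name, engine):
        self.name = name
        self.engine = engine
        self.namespace = set()               # memory state: a set of paths
        self.journal = []                    # stand-in for the persistent journal
        engine.register(self)

    def handle_mkdir(self, path):
        # Step 1: request received. Step 2: submit proposal, wait for agreement.
        gsn = self.engine.propose(("mkdir", path))
        # Steps 3 and 4 ran inside learn() when the agreement was delivered.
        # Step 5: return the result to the client.
        return {"path": path, "gsn": gsn, "created": path in self.namespace}

    def learn(self, gsn, update):
        op, path = update
        if op == "mkdir":
            self.namespace.add(path)         # step 3: apply the agreed update
        self.journal.append((gsn, update))   # step 4: record a journal transaction


engine = CoordinationEngine()
cn1, cn2 = CNode("cn1", engine), CNode("cn2", engine)
r1 = cn1.handle_mkdir("/user/alice")         # client talks to cn1
r2 = cn2.handle_mkdir("/tmp")                # another client talks to cn2
print(cn1.namespace == cn2.namespace, r1["gsn"], r2["gsn"])
```

The key design point is that a CNode never applies its own update directly; it only applies what comes back as an agreement, so the node that received the request and the other nodes follow the same code path.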
22
Strict Consistency Model
 Coordination Engine transforms namespace modification proposals into the global sequence of agreements
- Applied to namespace replicas in the order of their Global Sequence Number
 ConsensusNodes may have different states at a given moment of “clock” time
- As the rate of consuming agreements may vary
 CNodes have the same namespace state when they reach the same GSN
 One-copy-equivalence
- Replicas are presented to the client as if there were only one copy
One-Copy-Equivalence as known in replicated databases
23
Consensus Node Proxy
 CNodeProxyProvider – a pluggable substitute of FailoverProxyProvider
- Defined via Configuration
 Main features
- Randomly chooses CNode when client is instantiated
- Sticky until a timeout occurs
- Fails over to another CNode
- Smart enough to avoid SafeMode
 Further improvements
- Take into account network proximity
Reads do not modify the namespace, so they can be directed to any ConsensusNode
24
Stale Read Problem
1. Same client fails over to a CNode, which has an older namespace state (GSN)
- CNode1 at GSN 900: mkdir(p) –> ls(p) –> failover to CNode2
- CNode2 at GSN 890: ls(p) –> directory not found
2. One client modifies namespace, which needs to be seen by other clients
MapReduce use case:
- JobClient to CNode1: create job.xml
- MapTask to CNode2: read job.xml –> FileNotFoundException
1) Client connects to CNode only if its GSN >= last seen
- May need to wait until CNode catches up
2) Must coordinate file read
- Special files only: configuration-defined regexp
- CNode coordinates read only once per file, then marks it as coordinated
- Coordinated read: submit proposal, wait for agreement
A few read requests must be coordinated
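Both remedies can be sketched on the client side. Everything named here is an assumption for illustration: `CNodeStub` exposes only the GSN a node has applied, `Client` raises `StaleCNodeError` instead of waiting for the node to catch up, and the regular expression stands in for the configuration-defined pattern.

```python
import re


class StaleCNodeError(Exception):
    """Raised when a CNode is behind the state this client has already seen."""


class CNodeStub:
    """Hypothetical stand-in exposing only the GSN a CNode has applied."""

    def __init__(self, name, gsn):
        self.name, self.gsn = name, gsn


class Client:
    def __init__(self, coordinated_read_pattern):
        self.last_seen_gsn = 0
        self.coordinated = re.compile(coordinated_read_pattern)

    def connect(self, cnode):
        # Remedy 1: refuse a CNode whose GSN is below the last one seen.
        if cnode.gsn < self.last_seen_gsn:
            raise StaleCNodeError(
                f"{cnode.name} at GSN {cnode.gsn} is behind {self.last_seen_gsn}")
        self.last_seen_gsn = cnode.gsn
        return cnode

    def needs_coordinated_read(self, path):
        # Remedy 2: only paths matching the configured regexp are coordinated.
        return self.coordinated.search(path) is not None


client = Client(r"/job\.xml$")                # hypothetical pattern
client.connect(CNodeStub("CNode1", 900))
try:
    client.connect(CNodeStub("CNode2", 890))  # the failover case from the slide
except StaleCNodeError as err:
    print("retry later:", err)
print(client.needs_coordinated_read("/staging/job_1/job.xml"))
```

A caller that prefers the "wait until the CNode catches up" behaviour could catch the exception and retry after a delay.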
25
Block Management
 Block Manager keeps information about file blocks and DataNodes
- Blocks Map, DataNode Manager, Replication Queues
- New block generation: <blockId, generationStamp, locations>
- Replication and deletion of replicas that are under- or over-replicated
 Consistent block management problem
- Collision of BlockIDs if the same ID generated by different CNodes
- Over-replicated block: if CNodes decide to replicate same block simultaneously
- Missing block data: if CNodes decide to delete different replicas of the block
 New block generation – all CNodes
- Assign nextID and nextGenerationStamp while processing the agreement
 Replication and deletion of replicas – a designated CNode (Block Replicator)
- Block Replicator is elected using Coordination Engine and re-elected if it fails
- Only Block Replicator sends DataNode commands to transfer or delete replicas
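A small sketch of why assigning IDs while processing the agreement avoids collisions: each CNode keeps its own counters, but they advance only when an agreement is applied, so every CNode derives the same values for the same agreement. The class name, the starting values, and the rule of incrementing both counters by one are assumptions for illustration, not the actual HDFS scheme.

```python
class BlockIdGenerator:
    """Per-CNode counters that advance only while an agreement is processed."""

    def __init__(self, next_id=1, next_generation_stamp=1000):
        self.next_id = next_id
        self.next_generation_stamp = next_generation_stamp

    def on_add_block_agreement(self):
        block = (self.next_id, self.next_generation_stamp)
        self.next_id += 1
        self.next_generation_stamp += 1
        return block


# The same agreed sequence of addBlock operations reaches both CNodes.
agreements = ["addBlock /a", "addBlock /b", "addBlock /a"]
cnode1, cnode2 = BlockIdGenerator(), BlockIdGenerator()
ids1 = [cnode1.on_add_block_agreement() for _ in agreements]
ids2 = [cnode2.on_add_block_agreement() for _ in agreements]
print(ids1 == ids2, ids1)
```

If a CNode instead picked an ID when it received the client request, two CNodes serving different clients could hand out the same ID; tying the assignment to the agreed order rules that out.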
26
GeoNode
 Geographically Distributed HDFS
27
Scaling HDFS Across Data Centers
 The system should appear, act, and be operated as a single cluster
- Instant and automatic replication of data and metadata
 Parts of the cluster on different data centers should have equal roles
- Data could be ingested or accessed through any of the centers
 Data creation and access should typically be at the LAN speed
- A job executed on one data center should run as fast as if there were no other data centers
 Failure scenarios: the system should provide service and remain consistent
- Any GeoNode can fail
- GeoNodes can fail simultaneously on two or more data centers
- An entire data center can fail, e.g. due to WAN partitioning
- Tolerate DataNode failures traditional for HDFS
• Simultaneous failure of two DataNodes
• Failure of an entire rack
Continuous Availability, and Disaster Recovery over WAN
28
WAN HDFS Cluster
 Wide Area Network replication
 Metadata – online
 Data – offline
Multiple active GeoNodes share namespace via Coordination Engine over WAN
29
Architecture Principles
 Foreign vs Native notations for nodes, block replicas, clients
- Foreign means on a different data center. Native – on the same data center
 Metadata between GeoNodes is coordinated instantaneously via agreements
 Data between data centers replicated asynchronously in the background
- Replicas of a block can be stored on DataNodes of one or multiple data centers
- GeoNode may learn about the file and its blocks before native replicas are created
 DataNodes do not communicate with foreign GeoNodes
- Heartbeats, block reports, etc. sent to native GeoNodes only
- GeoNodes cannot send direct commands to foreign DataNodes
 DataNodes copy blocks over the WAN to provide foreign block replication
- Copy single replica over the WAN, allow native replication of that replica
 Clients optimized to access or create data on native DataNodes
- But can read data from another data center if needed
30
File Create
 Client: choose a native GeoNode, send request
 GeoNode: propose, wait for agreement
- All GeoNodes execute the agreement
- addBlock: choose native locations only
 Client: create pipeline, write data
 DataNodes report replicas to native GeoNodes
 Client can continue adding next block or closing file
- Clients do not wait until the transfer over the WAN is completed
 Foreign block replication handled asynchronously
Create file entry, add blocks, write data, close file
31
Foreign Block Replication
Block is created on one data center and replicated to others asynchronously
32
FRR
Foreign Replica Report
 Foreign Replica Report is submitted in two cases
1. Count of native block replicas reaches full replication for the data center or
2. Count of native block replicas is reduced to 0
 If block replica is lost it is replicated natively first
- New locations reported to other data centers after full replication is reached
 If no native replicas left, one should be transferred from another data center
 Replication factor of a given block can vary on different data centers
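The two conditions for submitting a Foreign Replica Report can be written as a predicate on a change in the native replica count. This is a sketch under an assumption the slides do not spell out: the report is sent on the transition (the count newly reaches full replication, or newly drops to 0), not on every heartbeat while the condition holds. The function and parameter names are illustrative.

```python
def should_send_foreign_replica_report(previous_native: int,
                                       current_native: int,
                                       full_replication: int) -> bool:
    """True when the native replica count newly reaches full replication or drops to 0."""
    if current_native == previous_native:
        return False                          # nothing changed, nothing to report
    reached_full = current_native >= full_replication > previous_native
    lost_all = current_native == 0 and previous_native > 0
    return reached_full or lost_all


# (previous, current, full replication for this data center)
for case in [(2, 3, 3), (1, 2, 3), (1, 0, 3), (3, 2, 3)]:
    print(case, should_send_foreign_replica_report(*case))
```

The case `(3, 2, 3)` returns `False`: a single lost replica is repaired natively first, and other data centers hear about the new locations only once full replication is reached again.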
33
Foreign Block Management
 Designated BlockReplicator per data center
- Elected among GeoNodes of its data center
- Handles deletion and replication of blocks among native DataNodes
- Schedules foreign replica transfers
 BlockMonitor analyses foreign block replication in addition to native replicas
- Periodically scans blocks
- If a block with no foreign replicas is found, schedule WAN replication
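One pass of such a scan might look like the sketch below. The data layout (a mapping from block ID to the data centers holding a replica) and the function name are assumptions made for illustration; the sketch also ignores selective replication rules that could exclude a block from foreign replication.

```python
def blocks_needing_wan_replication(replica_locations, native_dc):
    """Return IDs of blocks that have native replicas but no foreign replica.

    replica_locations maps a block ID to the list of data center names
    that currently hold a replica of that block (hypothetical layout).
    """
    return sorted(
        block_id
        for block_id, data_centers in replica_locations.items()
        # Skip blocks with no replicas at all: there is nothing to copy.
        if data_centers and all(dc == native_dc for dc in data_centers)
    )


locations = {
    "blk_1": ["dc1", "dc1", "dc1"],   # native only: schedule a WAN copy
    "blk_2": ["dc1", "dc2"],          # already has a foreign replica
    "blk_3": [],                      # no replicas known yet
}
print(blocks_needing_wan_replication(locations, native_dc="dc1"))
```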
34
Selective Data Replication
 Requirements / use cases:
- Directory is replicated to and accessible from all data centers
- Directory is readable from all DCs but is writeable only at a given site
- Directory is replicated on some (single) DCs, but never replicated to others
 Asymmetric Block Replication
- Allow different replications for files on different data centers
- Files are created with the default replication for the data center
- setReplication() is applied to a single data center
- Special case of replication = 0: data is not replicated on that data center
 Datacenter-specific attributes
- Default replication: per directory
- Permissions - Selective visibility of files and directories
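Asymmetric replication can be pictured as a per-data-center replication factor attached to each file. The sketch below is illustrative only: `FileReplication`, the data center names, and the defaults are hypothetical, and it simplifies the per-directory default to a single default per data center.

```python
class FileReplication:
    """Replication factor of one file, tracked separately per data center."""

    def __init__(self, dc_defaults):
        # A new file starts with each data center's own default replication.
        self.per_dc = dict(dc_defaults)

    def set_replication(self, data_center, replication):
        if data_center not in self.per_dc:
            raise KeyError(f"unknown data center: {data_center}")
        if replication < 0:
            raise ValueError("replication must be >= 0")
        self.per_dc[data_center] = replication   # other data centers are untouched

    def data_centers_with_data(self):
        # replication == 0 means the data is not stored on that data center.
        return sorted(dc for dc, r in self.per_dc.items() if r > 0)


dc_defaults = {"dc-east": 3, "dc-west": 2}       # hypothetical names and defaults
f = FileReplication(dc_defaults)
f.set_replication("dc-west", 0)                  # keep this file out of dc-west
print(f.per_dc, f.data_centers_with_data())
```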
35
HBase
 Replicated Regions
36
Apache HBase
 Table: big, sparse, loosely structured
- Collection of rows, sorted by row keys
- Rows can have arbitrary number of columns
- Columns grouped into Column families
 Table is split into Regions by row ranges
- Dynamic Table partitioning
- Region Servers serve regions to applications
 Distributed Cache:
- Regions are loaded in nodes’ RAM
- Real-time access to data
A distributed key-value store for real-time access to semi-structured data
37
[Diagram labels: NameNode, Resource Manager, HBase Master; DataNode (x3), Node Manager (x3), RegionServer (x3)]
HBase Operations
 Administrative functions
- Create, delete, list tables
- Create, update, delete columns, column families
- Split, compact, flush
 Access table data
- Get – retrieve cells of a row
- Put – update a row
- Delete – delete cells/row
 Mass operations
- Scanner – scan col family
- Variety of Filters
HBase client connects to a server, which serves the region
38
HBase Challenge
 Failure of a region requires failover
- Regions reassigned to other Region Servers
- Clients failover and reconnect to new servers
 Regions in high demand
- Many client connections to one server introduce a bottleneck
 Good idea to replicate regions on multiple Region Servers
 Open Problem: consistent updates of region replicas
- Solution: Coordinated updates
39
Coordinated Put
 Client sends Put request to one of the Region Servers hosting the region
 The Region Server submits Put proposal to update the row
 Coordination Engine decides the order of the update
 Agreement delivered to all Region Servers hosting the region
- The row is deterministically updated on multiple Region Servers
40
Future Benefits
 Atomic update of multiple rows
- Proposal to Update <row1, row2, row3>
- Agreement delivered to all regions containing any of the rows
- Consistent state of regions at the same GSN
 Secondary indexes
- Agreement to update is delivered to both primary and secondary tables
 Distributed Transactions
- ACID: atomicity, consistency, isolation, durability
- Atomic multi-row and multi-table updates are transactions
Coordination Engine as a tool for the evolution of distributed systems
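The multi-row update idea can be sketched as one agreement that carries several rows and is delivered to every region holding any of them. This only illustrates the delivery rule; `Region`, `deliver`, and the key ranges are hypothetical, and a real implementation would need more (for example, isolation from concurrent readers) to provide full transactions.

```python
class Region:
    """A region serving the half-open row-key range [start_key, end_key)."""

    def __init__(self, start_key, end_key):
        self.start_key, self.end_key = start_key, end_key
        self.rows = {}
        self.gsn = 0                      # GSN of the last agreement applied

    def contains(self, row_key):
        return self.start_key <= row_key < self.end_key

    def apply(self, gsn, updates):
        # Apply only the rows that fall into this region's key range.
        for row_key, value in updates.items():
            if self.contains(row_key):
                self.rows[row_key] = value
        self.gsn = gsn


def deliver(gsn, updates, regions):
    """Deliver one multi-row agreement to every region containing any of its rows."""
    touched = [r for r in regions if any(r.contains(k) for k in updates)]
    for region in touched:
        region.apply(gsn, updates)
    return touched


regions = [Region("a", "m"), Region("m", "t"), Region("t", "~")]
touched = deliver(7, {"apple": 1, "melon": 2}, regions)
print(len(touched), regions[0].rows, regions[1].rows, regions[2].gsn)
```

After delivery, the two affected regions are both at GSN 7 and each holds its part of the update, while the third region is untouched.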
41
Konstantin V Shvachko
Thank You
Plamen Jeliazkov
Tao Luo
Keith Pak
Jagane Sundar
Henry Wang
Byron Wong
Dasha Boudnik
Sergey Melnykov
Virginia Wang
Yeturu Aahlad
Naeem Akhtar
Ben Horowitz
Michael Parkin
Mikhail Antonov
Konstantin Boudnik
Sergey Soldatov
Siva G
Durai Ezhilarasan
Gordon Hamilton
Trevor Lorimer
David McCauley
Guru Yeleswarapu
Robert Budas
Brett Rudenstein
Warren Harper

The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Recently uploaded (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Coordinating Metadata Replication: Survival Strategy for Distributed Systems

  • 1. Coordinating Metadata Replication: Survival Strategy for Distributed Systems. Konstantin V. Shvachko, April 3, 2014, Amsterdam, Netherlands
  • 2. Introduction to Survival
      • Clusters of servers as communities of machines
        - United by a common mission
      • Communication
        - Act as a whole towards a single goal
      • Coordination is the key for survival
        - Got a problem: propose a solution
        - Make a decision: come to an agreement
        - Execute the decision
      • Typical for their type of species
        - Computers propose Numbers, and
        - Use Algorithms to reach an agreement
      In the world of computers
  • 3. Contents
      • Coordination Engine: the survival basics
      • Replicated Namespace for HDFS
      • Geographically-distributed HDFS
      • Replicated Regions for HBase
      • Potential Benefits: more survival tools
  • 4. Coordination Engine: The Survival Basics
  • 5. Coordination Engine
      • A Coordination Engine lets multiple proposers agree on the order of the events they submit to the engine
        - Anybody can Propose
        - Single Agreement every time
        - Learners observe the same agreements in the same order
      Sequencer: determines the order in which a number of operations occur
  • 6. Proposer. Crash course in Coordination Engines: submit a Proposal
  • 7. Acceptor. Crash course in Coordination Engines: produce an Agreement
  • 8. Learner. Crash course in Coordination Engines: take action based on the Agreement
  • 9. Single-Node Coordination Engine
      • Easy to Coordinate
        - A single NameNode is an example of a simple Coordination Engine
        - Performance and availability bottleneck
        - Single point of failure
      Simple but lacks reliability
  • 10. Distributed Coordination Engine
      • Not so easy to Coordinate
      A reliable approach is to use multiple nodes
  • 11. Challenges of Distributed Coordination
      • Coordinating a single value (the battle start time)
        - Design for failures
      Two Generals' Problem
  • 12. Challenges of Distributed Coordination
      • Failures in distributed systems
        - Any node can fail at any time
        - Any failed node can recover at any time
        - Messages can be lost, duplicated, reordered, or delayed arbitrarily long
      Anything that can go wrong will go wrong: Murphy's law (the law of entropy)
  • 13. Distributed Coordination Engine
      • A Distributed Coordination Engine consists of nodes
        - Node Roles: Proposer, Learner, and Acceptor
        - Each node can combine multiple roles
      • Distributed coordination
        - Multiple nodes submit events as proposals to a quorum of acceptors
        - Acceptors agree on the order of each event in the global sequence of events
        - Learners learn agreements in the same deterministic order
      A reliable approach is to use multiple nodes
  • 14. Consensus Algorithms
      • The Coordination Engine guarantees the same state of the learners at a given GSN
        - Each agreement is assigned a unique Global Sequence Number (GSN)
        - GSNs form a monotonically increasing number series: the order of agreements
        - Learners start from the same initial state
        - Learners apply the same deterministic agreements in the same deterministic order
        - GSN represents "logical" time in the coordinated system
      • Paxos is a consensus algorithm proven to tolerate a variety of failures
        - Quorum-based Consensus
        - Deterministic State Machine
        - Leslie Lamport: The Part-Time Parliament (1990)
      Consensus is the process of agreeing on one result among a group of participants
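
The learner guarantee on this slide can be illustrated with a minimal Python sketch. It is not the engine described in the talk: the `Agreement` and `Learner` classes are hypothetical, and the sketch assumes GSNs are consecutive integers starting at 1 (the slide only says they increase monotonically). Each learner buffers agreements that arrive early and applies them strictly in GSN order, so two learners that receive the same agreements in different network order end in the same state.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Agreement:
    gsn: int      # Global Sequence Number (assumed consecutive, starting at 1)
    key: str      # illustrative payload: a path ...
    value: str    # ... and the value recorded for it


@dataclass
class Learner:
    state: dict = field(default_factory=dict)
    last_gsn: int = 0
    pending: dict = field(default_factory=dict)

    def deliver(self, agreement):
        """Buffer the agreement, then apply everything that is next in GSN order."""
        self.pending[agreement.gsn] = agreement
        while self.last_gsn + 1 in self.pending:
            nxt = self.pending.pop(self.last_gsn + 1)
            self.state[nxt.key] = nxt.value   # deterministic update
            self.last_gsn = nxt.gsn


agreements = [
    Agreement(1, "/a", "dir"),
    Agreement(2, "/a/f", "file"),
    Agreement(3, "/a", "renamed"),
]

fast, slow = Learner(), Learner()
for a in agreements:             # delivered in order
    fast.deliver(a)
for a in reversed(agreements):   # delivered out of order
    slow.deliver(a)

print(fast.state == slow.state, fast.last_gsn, slow.last_gsn)
```

Both learners reach GSN 3 with identical state, which is the property the later slides rely on for namespace and region replicas.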
  • 15. HDFS: State of the Art
  • 16. HDFS Architecture
      • HDFS metadata is decoupled from data
        - Namespace is a hierarchy of files and directories represented by INodes
        - INodes record attributes: permissions, quotas, timestamps, replication
      • NameNode keeps its entire state in RAM
        - Memory state: the namespace tree and the mapping of blocks to DataNodes
        - Persistent state: a recent checkpoint of the namespace and the journal log
      • File data is divided into blocks (default 128 MB)
        - Each block is independently replicated on multiple DataNodes (default 3)
        - Block replicas are stored on DataNodes as local files on local drives
      Reliable distributed file system for storing very large data sets
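
As a worked example of the block arithmetic on this slide, the following Python sketch computes how many blocks and replicas a file produces with the defaults quoted above (128 MB blocks, replication 3). The function name `block_layout` is illustrative only, and the sketch assumes the last block occupies only the bytes actually written.

```python
BLOCK_SIZE = 128 * 1024**2   # default block size quoted on the slide
REPLICATION = 3              # default replication quoted on the slide


def block_layout(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (blocks, replicas, raw bytes stored across the cluster)."""
    blocks = -(-file_size_bytes // block_size)    # integer ceiling division
    replicas = blocks * replication               # every block is replicated independently
    raw_bytes = file_size_bytes * replication     # last block is not padded (assumption)
    return blocks, replicas, raw_bytes


one_gib = 1024**3
print(block_layout(one_gib))        # 8 blocks, 24 replicas
print(block_layout(one_gib + 1))    # one extra byte starts a ninth block
```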
  • 17. HDFS Cluster
      • Single active NameNode
      • Thousands of DataNodes
      • Tens of thousands of HDFS clients
      Active-Standby Architecture
  • 18. Standard HDFS operations
      • Active NameNode workflow
        1. Receive request from a client
        2. Apply the update to its memory state
        3. Record the update as a journal transaction in persistent storage
        4. Return result to the client
      • HDFS Client (read or write to a file)
        - Send request to the NameNode, receive replica locations
        - Read or write data from or to DataNodes
      • DataNode
        - Data transfer to / from clients and between DataNodes
        - Report replica state change to NameNode(s): new, deleted, corrupt
        - Report its state to NameNode(s): heartbeats, block reports
  • 19. Consensus Node: Coordinated Replication of HDFS Namespace
  • 20. Replicated Namespace
      • A replicated NameNode is called a ConsensusNode or CNode
      • ConsensusNodes play an equal, active role in the cluster
        - Provide write and read access to the namespace
      • The namespace replicas are consistent with each other
        - Each CNode maintains a copy of the same namespace
        - Namespace updates applied to one CNode are propagated to the others
      • Coordination Engine establishes the global order of namespace updates
        - All CNodes apply the same deterministic updates in the same deterministic order
        - Starting from the same initial state and applying the same updates = consistency
      Coordination Engine provides consistency of multiple namespace replicas
  • 21. Coordinated HDFS Cluster
      • Independent CNodes, the same namespace
      • Load balancing of client requests
      • Proposal, Agreement
      • Coordinated updates
      Multiple active Consensus Nodes share the namespace via the Coordination Engine
  • 22. Coordinated HDFS operations
      • ConsensusNode workflow
        1. Receive request from a client
        2. Submit the update as a proposal to the Coordination Engine and wait for agreement
        3. Apply the agreed update to its memory state
        4. Record the update as a journal transaction in persistent storage (optional)
        5. Return result to the client
      • HDFS Client and DataNode operations remain the same
      Updates to the namespace, such as creating a file or a directory, are coordinated
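
The five-step ConsensusNode workflow can be sketched as follows. This is a toy model under stated assumptions, not the implementation from the talk: `ToyCoordinationEngine` is a single in-process sequencer with no fault tolerance that delivers each agreement synchronously, so "wait for agreement" is implicit, and the `CNode` class and its `handle` method are hypothetical names.

```python
class ToyCoordinationEngine:
    """Stand-in for a real engine: one in-process sequencer, no quorum, no failures."""

    def __init__(self):
        self.next_gsn = 1
        self.learners = []

    def propose(self, proposer, op, path):
        gsn = self.next_gsn              # the agreement gets the next GSN
        self.next_gsn += 1
        outcome = None
        for learner in self.learners:    # every CNode learns the same agreement
            result = learner.on_agreement(gsn, op, path)
            if learner is proposer:
                outcome = result
        return gsn, outcome


class CNode:
    def __init__(self, name, engine):
        self.name = name
        self.engine = engine
        self.namespace = set()           # in-memory namespace (just a set of paths here)
        self.journal = []                # stand-in for the persistent journal
        engine.learners.append(self)

    def handle(self, op, path):
        # Steps 1-2: receive the request, submit a proposal, wait for the agreement.
        _gsn, outcome = self.engine.propose(self, op, path)
        # Step 5: return the result to the client.
        return outcome

    def on_agreement(self, gsn, op, path):
        # Step 3: apply the agreed update to the memory state.
        if op == "mkdir":
            ok = path not in self.namespace
            self.namespace.add(path)
        elif op == "delete":
            ok = path in self.namespace
            self.namespace.discard(path)
        else:
            ok = False
        # Step 4: record the update as a journal transaction.
        self.journal.append((gsn, op, path))
        return ok


engine = ToyCoordinationEngine()
cnode1, cnode2 = CNode("cnode1", engine), CNode("cnode2", engine)

created = cnode1.handle("mkdir", "/user")      # first mkdir succeeds
duplicate = cnode2.handle("mkdir", "/user")    # same path via another CNode: already exists
cnode2.handle("mkdir", "/tmp")

print(created, duplicate, cnode1.namespace == cnode2.namespace)
```

Because both CNodes apply the same agreements in the same order, their namespaces and journals are identical regardless of which node received each client request.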
  • 23. Strict Consistency Model
      • Coordination Engine transforms namespace modification proposals into the global sequence of agreements
        - Applied to namespace replicas in the order of their Global Sequence Number
      • ConsensusNodes may have different states at a given moment of "clock" time
        - As the rate of consuming agreements may vary
      • CNodes have the same namespace state when they reach the same GSN
      • One-copy-equivalence
        - The replicated namespace is presented to the client as if there were only one copy
      One-Copy-Equivalence as known in replicated databases
  • 24. Consensus Node Proxy
      • CNodeProxyProvider: a pluggable substitute for FailoverProxyProvider
        - Defined via Configuration
      • Main features
        - Randomly chooses a CNode when the client is instantiated
        - Sticky until a timeout occurs
        - Fails over to another CNode
        - Smart enough to avoid CNodes in SafeMode
      • Further improvements
        - Take into account network proximity
      Reads do not modify the namespace and can be directed to any ConsensusNode
  • 25. Stale Read Problem
      • Problem 1: the same client fails over to a CNode that has an older namespace state (GSN)
        - CNode1 at GSN 900: mkdir(p) –> ls(p) –> failover to CNode2
        - CNode2 at GSN 890: ls(p) –> directory not found
      • Problem 2: one client modifies the namespace, and the change needs to be seen by other clients
        - MapReduce use case:
        - JobClient to CNode1: create job.xml
        - MapTask to CNode2: read job.xml –> FileNotFoundException
      • Solution 1: the client connects to a CNode only if its GSN >= the last GSN the client has seen
        - May need to wait until the CNode catches up
      • Solution 2: must coordinate the file read
        - Special files only: configuration-defined regexp
        - CNode coordinates the read only once per file, then marks it as coordinated
        - Coordinated read: submit proposal, wait for agreement
      A few read requests must be coordinated
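
The first remedy (serve a client only from a CNode whose GSN is at least the last GSN that client has seen) can be sketched in Python. All names here (`CNodeStub`, `Client`, `StaleCNodeError`, `catch_up`) are hypothetical, and the sketch rejects a stale read with an exception where the slide says the client may instead wait for the CNode to catch up.

```python
class StaleCNodeError(Exception):
    """Raised when a CNode is behind the state the client has already observed."""


class CNodeStub:
    def __init__(self, name, gsn, dirs):
        self.name = name
        self.gsn = gsn            # GSN of the last agreement this CNode applied
        self.dirs = set(dirs)

    def ls(self, path, min_gsn):
        if self.gsn < min_gsn:    # serving now could show the client an older namespace
            raise StaleCNodeError(f"{self.name} at GSN {self.gsn} < {min_gsn}")
        return path in self.dirs, self.gsn

    def catch_up(self, gsn, new_dirs):
        """Stand-in for applying the agreements this CNode had not yet consumed."""
        self.dirs |= set(new_dirs)
        self.gsn = gsn


class Client:
    def __init__(self):
        self.last_seen_gsn = 0

    def ls(self, cnode, path):
        found, gsn = cnode.ls(path, self.last_seen_gsn)
        self.last_seen_gsn = max(self.last_seen_gsn, gsn)
        return found


client = Client()
cnode1 = CNodeStub("CNode1", gsn=900, dirs={"/p"})
cnode2 = CNodeStub("CNode2", gsn=890, dirs=set())

first = client.ls(cnode1, "/p")              # served at GSN 900
try:
    client.ls(cnode2, "/p")                  # CNode2 is behind what the client saw
    rejected = False
except StaleCNodeError:
    rejected = True

cnode2.catch_up(gsn=900, new_dirs={"/p"})    # CNode2 consumes the missing agreements
second = client.ls(cnode2, "/p")

print(first, rejected, second)
```

Without the GSN check, the failover read on CNode2 would have returned "directory not found" for a directory the same client had just created.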
  • 26. Block Management
      • Block Manager keeps information about file blocks and DataNodes
        - Blocks Map, DataNode Manager, Replication Queues
        - New block generation: <blockId, generationStamp, locations>
        - Replication and deletion of replicas that are under- or over-replicated
      • Consistent block management problem
        - Collision of BlockIDs if the same ID is generated by different CNodes
        - Over-replicated block: if CNodes decide to replicate the same block simultaneously
        - Missing block data: if CNodes decide to delete different replicas of the block
      • New block generation: all CNodes
        - Assign nextID and nextGenerationStamp while processing the agreement
      • Replication and deletion of replicas: a designated CNode (Block Replicator)
        - Block Replicator is elected using the Coordination Engine and re-elected if it fails
        - Only the Block Replicator sends DataNode commands to transfer or delete replicas
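
The rule for new block generation can be illustrated with a short Python sketch: if every CNode assigns the next ID and generation stamp only while processing an agreement, all CNodes hand out identical values and no two blocks collide. The class `BlockIdGenerator`, its starting values, and the file names are assumptions made for the example.

```python
class BlockIdGenerator:
    """Per-CNode counters, advanced only while processing an addBlock agreement."""

    def __init__(self, next_id=1, next_generation_stamp=1000):
        # Every CNode starts from the same initial state (assumed values).
        self.next_id = next_id
        self.next_generation_stamp = next_generation_stamp

    def on_add_block_agreement(self):
        block_id, stamp = self.next_id, self.next_generation_stamp
        self.next_id += 1
        self.next_generation_stamp += 1
        return block_id, stamp


# The same agreements, in the same GSN order, reach every CNode.
agreements = [(1, "/data/a"), (2, "/data/b"), (3, "/data/a")]

assigned = {}
for cnode in ("cnode1", "cnode2"):
    generator = BlockIdGenerator()
    assigned[cnode] = [
        (gsn, path, *generator.on_add_block_agreement()) for gsn, path in agreements
    ]

print(assigned["cnode1"] == assigned["cnode2"])
```

If each CNode instead advanced its counter when it received a client request, two CNodes serving different clients could hand out the same block ID, which is the collision described above.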
  • 28. Scaling HDFS Across Data Centers
      • The system should appear, act, and be operated as a single cluster
        - Instant and automatic replication of data and metadata
      • Parts of the cluster in different data centers should have equal roles
        - Data could be ingested or accessed through any of the centers
      • Data creation and access should typically be at LAN speed
        - Running time of a job executed in one data center should be as if there were no other centers
      • Failure scenarios: the system should provide service and remain consistent
        - Any GeoNode can fail
        - GeoNodes can fail simultaneously in two or more data centers
        - An entire data center can fail, e.g. due to WAN partitioning
        - Tolerate DataNode failures traditional for HDFS
          • Simultaneous failure of two DataNodes
          • Failure of an entire rack
      Continuous Availability and Disaster Recovery over WAN
  • 29. WAN HDFS Cluster
      • Wide Area Network replication
      • Metadata: online
      • Data: offline
      Multiple active GeoNodes share the namespace via the Coordination Engine over WAN
  • 30. Architecture Principles
      • Foreign vs Native notation for nodes, block replicas, clients
        - Foreign means in a different data center; Native means in the same data center
      • Metadata between GeoNodes is coordinated instantaneously via agreements
      • Data between data centers is replicated asynchronously in the background
        - Replicas of a block can be stored on DataNodes of one or multiple data centers
        - A GeoNode may learn about a file and its blocks before native replicas are created
      • DataNodes do not communicate with foreign GeoNodes
        - Heartbeats, block reports, etc. are sent to native GeoNodes only
        - GeoNodes cannot send direct commands to foreign DataNodes
      • DataNodes copy blocks over the WAN to provide foreign block replication
        - Copy a single replica over the WAN, then allow native replication of that replica
      • Clients are optimized to access or create data on native DataNodes
        - But can read data from another data center if needed
  • 31. File Create
      • Client: choose a native GeoNode, send request
      • GeoNode: propose, wait for agreement
        - All GeoNodes execute the agreement
        - addBlock: choose native locations only
      • Client: create pipeline, write data
      • DataNodes report replicas to native GeoNodes
      • Client can continue adding the next block or closing the file
        - Clients do not wait until the transfer over the WAN is completed
      • Foreign block replication is handled asynchronously
      Create file entry, add blocks, write data, close file
  • 32. Foreign Block Replication
      [Figure: diagram with steps numbered 1 to 5 and an "FRR" label]
      A block is created in one data center and replicated to the others asynchronously
  • 33. Foreign Replica Report
      • A Foreign Replica Report is submitted in two cases
        1. The count of native block replicas reaches full replication for the data center, or
        2. The count of native block replicas is reduced to 0
      • If a block replica is lost, it is replicated natively first
        - New locations are reported to other data centers after full replication is reached
      • If no native replicas are left, one should be transferred from another data center
      • The replication factor of a given block can vary between data centers
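
The two reporting conditions can be expressed as a small predicate. This is a sketch under an assumption the slide does not state: the report is modeled as edge-triggered, fired when the native replica count crosses into full replication or drops to zero, rather than on every heartbeat while the condition holds. The function name `should_submit_frr` is hypothetical.

```python
def should_submit_frr(previous_count, current_count, dc_replication):
    """Decide whether a change in the native replica count triggers a report.

    previous_count / current_count: native replicas of the block before and after the change.
    dc_replication: target replication of the block in this data center.
    """
    reached_full = previous_count < dc_replication <= current_count
    dropped_to_zero = previous_count > 0 and current_count == 0
    return reached_full or dropped_to_zero


# Target replication of 3 in this data center.
print(should_submit_frr(2, 3, 3))   # third replica written: report
print(should_submit_frr(3, 2, 3))   # one replica lost: re-replicate natively first, no report
print(should_submit_frr(1, 0, 3))   # last native replica lost: report
```

The middle case matches the rule above: a lost replica is repaired inside the data center, and other data centers hear about new locations only once full replication is reached again.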
  • 34. Foreign Block Management
      • Designated BlockReplicator per data center
        - Elected among the GeoNodes of its data center
        - Handles deletion and replication of blocks among native DataNodes
        - Schedules foreign replica transfers
      • BlockMonitor analyses foreign block replication in addition to native replicas
        - Periodically scans blocks
        - If a block with no foreign replicas is found, schedules WAN replication
  • 35. Selective Data Replication
      • Requirements / use cases:
        - Directory is replicated to and accessible from all data centers
        - Directory is readable from all DCs but is writeable only at a given site
        - Directory is replicated on some DCs (or a single DC), but never replicated to others
      • Asymmetric Block Replication
        - Allow different replication factors for files in different data centers
        - Files are created with the default replication of the data center
        - setReplication() is applied to a single data center
        - Special case of replication = 0: data is not replicated to that data center
      • Datacenter-specific attributes
        - Default replication: per directory
        - Permissions
        - Selective visibility of files and directories
  • 37. Apache HBase
      • Table: big, sparse, loosely structured
        - Collection of rows, sorted by row keys
        - Rows can have an arbitrary number of columns
        - Columns are grouped into Column families
      • Table is split into Regions by row ranges
        - Dynamic Table partitioning
        - Region Servers serve regions to applications
      • Distributed Cache:
        - Regions are loaded in nodes' RAM
        - Real-time access to data
      A distributed key-value store for real-time access to semi-structured data
      [Figure: cluster diagram with NameNode, Resource Manager, and HBase Master, plus three worker nodes each running a DataNode, a RegionServer, and a Node Manager]
  • 38. HBase Operations
      • Administrative functions
        - Create, delete, list tables
        - Create, update, delete columns, column families
        - Split, compact, flush
      • Access table data
        - Get: retrieve cells of a row
        - Put: update a row
        - Delete: delete cells/row
      • Mass operations
        - Scanner: scan a column family
        - Variety of Filters
      HBase client connects to the server that serves the region
  • 39. HBase Challenge
      • Failure of a region requires failover
        - Regions are reassigned to other Region Servers
        - Clients fail over and reconnect to the new servers
      • Regions in high demand
        - Many client connections to one server introduce a bottleneck
      • Good idea to replicate regions on multiple Region Servers
      • Open Problem: consistent updates of region replicas
        - Solution: Coordinated updates
  • 40. Coordinated Put
      • Client sends a Put request to one of the Region Servers hosting the region
      • The Region Server submits a Put proposal to update the row
      • Coordination Engine decides the order of the update
      • Agreement is delivered to all Region Servers hosting the region
        - The row is deterministically updated on multiple Region Servers
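
A minimal Python sketch of a coordinated Put, assuming a toy in-process sequencer in place of the Coordination Engine. `ToySequencer` and `RegionReplica` are hypothetical names and are not HBase APIs. Two clients write the same cell through different Region Servers; because both replicas apply the Puts in the agreed order, they end with the same value.

```python
import itertools


class ToySequencer:
    """Stand-in for the Coordination Engine: orders Puts and delivers them to all replicas."""

    def __init__(self):
        self._gsn = itertools.count(1)
        self.replicas = []

    def propose_put(self, row, column, value):
        gsn = next(self._gsn)                  # the engine decides the order
        for replica in self.replicas:          # agreement delivered to every hosting server
            replica.apply_put(gsn, row, column, value)
        return gsn


class RegionReplica:
    def __init__(self, server, sequencer):
        self.server = server
        self.sequencer = sequencer
        self.rows = {}
        self.applied_gsn = 0
        sequencer.replicas.append(self)

    def put(self, row, column, value):
        """Client-facing Put: submitted as a proposal, not applied locally first."""
        return self.sequencer.propose_put(row, column, value)

    def apply_put(self, gsn, row, column, value):
        self.rows.setdefault(row, {})[column] = value
        self.applied_gsn = gsn

    def get(self, row):
        return dict(self.rows.get(row, {}))


sequencer = ToySequencer()
rs1 = RegionReplica("regionserver1", sequencer)
rs2 = RegionReplica("regionserver2", sequencer)

rs1.put("row1", "cf:a", "x")    # client A writes through the first server
rs2.put("row1", "cf:a", "y")    # client B writes the same cell through the second

print(rs1.get("row1"), rs2.get("row1"))
```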
  • 41. Future Benefits
      • Atomic update of multiple rows
        - Proposal to Update <row1, row2, row3>
        - Agreement is delivered to all regions containing any of the rows
        - Consistent state of regions at the same GSN
      • Secondary indexes
        - Agreement to update is delivered to both primary and secondary tables
      • Distributed Transactions
        - ACID: atomicity, consistency, isolation, durability
        - Atomic multi-row and multi-table updates are transactions
      Coordination Engine as a tool for the evolution of distributed systems
  • 42. Konstantin V Shvachko Thank You Plamen Jeliazkov Tao Luo Keith Pak Jagane Sundar Henry Wang Byron Wong Dasha Boudnik Sergey Melnykov Virginia Wang Yeturu Aahlad Naeem Akhtar Ben Horowitz Michael Parkin Mikhail Antonov Konstantin Boudnik Sergey Soldatov Siva G Durai Ezhilarasan Gordon Hamilton Trevor Lorimer David McCauley Guru Yeleswarapu Robert Budas Brett Rudenstein Warren Harper