HDFS High Availability
1. A High Availability story for HDFS
AvatarNode
Dhruba Borthakur & Dmytro Molkov
dhruba@apache.org & dms@facebook.com
Presented at The Hadoop User Group Meeting,
Sept 29, 2010
2. How infrequently does the NameNode (NN) stop?
  Hadoop Software Bugs
–  Two directories in fs.name.dir, but when a write to the first
directory failed, the NN ignored the second one (once)
–  Upgrade from 0.17 to 0.18 caused data corruption
(once)
  Configuration errors
–  Fsimage partition ran out of space (once)
–  Network Load Anomalies (about 10 times)
  Maintenance:
–  Deploy new patches (once every month)
3. What does the SecondaryNameNode do?
  Periodically merges Transaction logs
  Requires the same amount of memory as NN
  Why is it separate from NN?
–  Avoids fine-grain locking of NN data structures
–  Avoids implementing copy-on-write for NN data
structures
  Renamed as CheckpointNode (CN) in 0.21 release.
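As a rough illustration of what merging transaction logs involves, the sketch below replays an edit log onto a copy of the last image to produce a new checkpoint. The dictionary image, the (op, path) log entries, and the checkpoint function are hypothetical stand-ins, not the HDFS on-disk formats or APIs.

```python
def checkpoint(image, edit_log):
    """Return a new image with every logged transaction applied in order."""
    merged = dict(image)  # work on a copy, so the source image is left untouched
    for op, path in edit_log:
        if op == "create":
            merged[path] = {"blocks": []}
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return merged


# Hypothetical data: one existing file, then two logged transactions.
image = {"/a": {"blocks": []}}
edits = [("create", "/b"), ("delete", "/a")]
new_image = checkpoint(image, edits)
```

Working on a copy is the point of running this on a separate machine: the merge never has to lock the live NN data structures.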
4. Shortcomings of the SecondaryNameNode?
  Does not have a copy of the latest transaction log
  Checkpointing is periodic, not continuous
–  Configured to run every hour
  If the NN dies, the SecondaryNameNode does not take
over the responsibilities of the NN
5. BackupNode (BN)
  NN streams transaction log to BackupNode
  BackupNode applies log to in-memory and disk image
  BN always commits to disk before reporting success to NN
  If BN restarts, it has to catch up with NN
  Available in HDFS 0.21 release
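The ordering described above (persist first, then acknowledge) can be sketched as follows. This is a minimal sketch under stated assumptions: BackupNodeSketch, its journal file, and the (op, path) transaction format are hypothetical and do not reflect the actual BackupNode code.

```python
import json
import os
import tempfile


class BackupNodeSketch:
    """Hypothetical stand-in for a BackupNode; not the HDFS class."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.namespace = set()  # in-memory image: just a set of paths

    def receive(self, txn):
        """Persist one (op, path) transaction, apply it, then acknowledge."""
        op, path = txn
        if op not in ("create", "delete"):
            raise ValueError(f"unknown op: {op}")
        # Commit to disk first, so an acknowledged transaction is never lost.
        with open(self.journal_path, "a") as journal:
            journal.write(json.dumps([op, path]) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        # Then update the in-memory image.
        if op == "create":
            self.namespace.add(path)
        else:
            self.namespace.discard(path)
        return True  # the "success" reply sent back to the NN


journal_file = os.path.join(tempfile.mkdtemp(), "edits.log")
bn = BackupNodeSketch(journal_file)
acks = [
    bn.receive(("create", "/a")),
    bn.receive(("create", "/b")),
    bn.receive(("delete", "/a")),
]
```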
6. Limitations of BackupNode (BN)
  Maximum of one BackupNode per NN
–  Supports only two-machine failure
  NN does not forward block reports to BackupNode
  Time to restart from a 12 GB image, 70M files + 100M
blocks
–  3 – 5 minutes to read the image from disk
–  20 min to process block reports
–  BN will still take 25+ minutes to failover!
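The failover estimate is simply the sum of the two restart phases quoted on this slide. A small worked calculation, with variable names chosen here for illustration:

```python
image_load_minutes = (3, 5)   # range quoted on the slide
block_report_minutes = 20     # quoted on the slide

fastest = image_load_minutes[0] + block_report_minutes
slowest = image_load_minutes[1] + block_report_minutes
print(f"restart takes roughly {fastest} to {slowest} minutes")
```

The upper end of that range is where the 25+ minute figure comes from.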
7. Overlapping Clusters for HA
  “Always available for write” model
  Two logical clusters each with their own NN
  Each physical machine runs two instances of DataNode
  Two DataNode instances share the same physical storage
device
  Application has logic to failover writes from one HDFS
cluster to another
  More details at
http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
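The application-side failover logic can be sketched roughly as below. The function and the two fake cluster writers are hypothetical; a real application would call an HDFS client for each cluster instead.

```python
def write_with_failover(record, cluster_writers):
    """Try each cluster's writer in order; return the first successful result."""
    last_error = None
    for write in cluster_writers:
        try:
            return write(record)
        except OSError as err:  # treat I/O errors as "this cluster is unavailable"
            last_error = err
    raise RuntimeError("all clusters rejected the write") from last_error


# Hypothetical writers standing in for the two logical HDFS clusters.
def cluster_a(record):
    raise OSError("cluster A NameNode is down")


def cluster_b(record):
    return f"B stored {record}"


result = write_with_failover("log line 1", [cluster_a, cluster_b])
```

This only covers writes, which matches the "always available for write" model: readers still need to know which cluster holds a given record.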
8. HDFS+ZooKeeper
  HDFS can store transaction logs in ZooKeeper/BookKeeper
–  http://issues.apache.org/jira/browse/HDFS-234
  Transaction log need not be stored on an NFS filer
  A new NN will still have to process block reports
–  Not good for HA yet, because NN failover will take 30 minutes
9. Our use case for High Availability
  Failover should occur in less than a minute
  Failovers are needed only for new software upgrades
10. Challenges
  DataNodes send block location information to only one NameNode
  NameNode needs block locations in memory to serve clients
  The in-memory metadata for 100 million files could be 60 GB,
huge!
[Diagram: DataNodes send block location messages (“yes, I have blockid 123”) to the Primary NameNode; the Client retrieves block locations from the NameNode.]
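To put the 60 GB estimate on this slide in per-file terms, a quick calculation (assuming decimal gigabytes, which the slide does not specify):

```python
files = 100_000_000
metadata_bytes = 60 * 10**9   # 60 GB, assuming decimal gigabytes

bytes_per_file = metadata_bytes // files
print(f"about {bytes_per_file} bytes of NameNode memory per file")
```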
11. Introduction to AvatarNode
  Active-Standby Pair
–  Coordinated via ZooKeeper
–  Failover in a few seconds
–  Wrapper over NameNode
  Active AvatarNode
–  Writes transaction log to filer
  Standby AvatarNode
–  Reads transactions from filer
–  Latest metadata in memory
http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html
[Diagram: the Active AvatarNode (NameNode) writes transactions to the NFS filer and the Standby AvatarNode (NameNode) reads them; DataNodes send block location messages to both AvatarNodes; the Client retrieves block locations from the Primary or Standby.]
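A minimal sketch of how a Standby could keep its metadata current by reading the shared transaction log. A plain Python list stands in for the log on the NFS filer, and StandbyAvatarSketch and its (op, path) entries are hypothetical, not the AvatarNode implementation.

```python
class StandbyAvatarSketch:
    """Hypothetical Standby that replays a shared log into memory."""

    def __init__(self):
        self.namespace = set()  # in-memory metadata: just a set of paths
        self.applied = 0        # number of transactions already consumed

    def catch_up(self, shared_log):
        """Apply every transaction not yet seen; return how many were applied."""
        pending = shared_log[self.applied:]
        for op, path in pending:
            if op == "create":
                self.namespace.add(path)
            elif op == "delete":
                self.namespace.discard(path)
            else:
                raise ValueError(f"unknown op: {op}")
            self.applied += 1
        return len(pending)


shared_log = []                      # stand-in for the log on the filer
standby = StandbyAvatarSketch()

shared_log.append(("create", "/a"))  # the Active writes transactions
shared_log.append(("create", "/b"))
first = standby.catch_up(shared_log)

shared_log.append(("delete", "/a"))
second = standby.catch_up(shared_log)
```

Because the Standby remembers how far it has read, each catch-up only replays the tail of the log, which is what keeps failover down to seconds.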
12. ZooKeeper integration for Clients
  DistributedAvatarFileSystem:
–  Connects to ZooKeeper to figure out who the Primary node is.
There is a znode in ZooKeeper that has the current address of the
primary. Clients read it on creation and during failover.
–  Is aware of the failover state and pauses until it is over. If the
znode is empty, the cluster is failing over to the new Primary and
the client just waits for that to finish.
–  Handles failures in calls to the NameNode. If a call fails with a
network exception, it checks whether a failover is in progress and
retries the call after the new Primary is up.
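The client behaviour described on this slide can be sketched as follows. This is a simplified sketch under stated assumptions: a plain dictionary stands in for ZooKeeper, the connect callable stands in for opening an RPC connection, and FailoverAwareClient with its parameters is hypothetical rather than the DistributedAvatarFileSystem API.

```python
import time


class FailoverAwareClient:
    """Hypothetical client that looks up the Primary in a znode-like store."""

    def __init__(self, znodes, znode_path, connect, poll_interval=0.01, max_polls=100):
        self.znodes = znodes            # dict standing in for ZooKeeper
        self.znode_path = znode_path
        self.connect = connect          # address -> callable(method, *args)
        self.poll_interval = poll_interval
        self.max_polls = max_polls

    def _wait_for_primary(self):
        """Pause while the znode is empty, i.e. while a failover is running."""
        for _ in range(self.max_polls):
            address = self.znodes.get(self.znode_path)
            if address:
                return address
            time.sleep(self.poll_interval)
        raise TimeoutError("failover did not finish in time")

    def call(self, method, *args):
        address = self._wait_for_primary()
        try:
            return self.connect(address)(method, *args)
        except ConnectionError:
            # Network failure: re-read the znode (waiting if a failover is in
            # progress) and retry once against whoever is Primary now.
            address = self._wait_for_primary()
            return self.connect(address)(method, *args)
```

The sketch retries only once; how many retries are appropriate, and which calls are safe to repeat, is a design decision the slides do not cover.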
13. Four steps to failover
  Wipe ZooKeeper entry. Clients will know the failover is in
progress. (0 seconds)
  Stop the primary namenode. Last bits of data will be
flushed to Transaction Log and it will die. (Seconds)
  Switch Standby to Primary. It will consume the rest of the
Transaction log and get out of safemode ready to serve
traffic. (Seconds)
  Update the entry in ZooKeeper. All the clients waiting for
failover will pick up the new connection. (0 seconds)
  After: Start the first node in Standby mode (takes a
while, but the cluster is up and running)
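The four steps, plus the follow-up, can be sketched as a short script. A dictionary stands in for ZooKeeper; NodeSketch, fail_over, the znode path, and the addresses are hypothetical names used only for illustration.

```python
class NodeSketch:
    """Hypothetical AvatarNode stand-in that only tracks its role."""

    def __init__(self, address, role):
        self.address = address
        self.role = role


def fail_over(znodes, znode_path, primary, standby):
    """Run the four failover steps in order against an in-memory znode store."""
    znodes[znode_path] = ""               # 1. wipe the entry: clients now pause
    primary.role = "stopped"              # 2. stop the old primary
    standby.role = "primary"              # 3. switch the Standby to Primary
    znodes[znode_path] = standby.address  # 4. publish the new address


znodes = {"/primary": "node1:9000"}
node1 = NodeSketch("node1:9000", "primary")
node2 = NodeSketch("node2:9000", "standby")

fail_over(znodes, "/primary", node1, node2)
node1.role = "standby"  # After: restart the first node in Standby mode
```

Wiping the entry before stopping the primary matters: clients see the empty znode and wait, instead of sending calls to a node that is shutting down.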
14. Why add ZooKeeper to the mix?
  Provides a clean way to execute failovers in the application
layer.
  Centralized control of all the clients, which gives us the
ability to pause clients until the failover is done.
  A good stepping stone for future improvements needed to
perform automatic failover:
–  Nodes voting on who will be the primary
–  DataNodes knowing who has the authority to delete blocks
15. ZooKeeper is an option
  Failover can also be implemented using IP failover based on the
existing infrastructure.
  With IP failover, the clients will not know whether the failover is
done and whether it is safe to make the call again.
  IP failover works well in a tightly coupled system (both nodes
in one rack), so the Single Point of Failure is still there (rack
switch)
  There is no need to run a dedicated ZooKeeper cluster.
16. It is not all about failover
  The Standby node has a lot of CPU that is not used
  The Standby node has a full (but slightly delayed) picture of
Blocks and the Namesystem.
  Send all reads that can deal with stale data to Standby.
  Have pluggable services run as a part of Standby node and
use the metadata of the filesystem directly from the
shared memory instead of querying the namenode all the
time.
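Routing stale-tolerant reads to the Standby could look roughly like this. The pick_node function and its flags are hypothetical; the slides do not say how a caller would declare that stale data is acceptable.

```python
def pick_node(is_read, tolerates_stale_data, primary, standby):
    """Send reads that accept slightly old metadata to the Standby."""
    if is_read and tolerates_stale_data:
        return standby
    return primary  # writes and strict reads must see the latest state


# Hypothetical addresses for the two AvatarNodes.
target_for_report = pick_node(True, True, "node1:9000", "node2:9000")
target_for_write = pick_node(False, False, "node1:9000", "node2:9000")
```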
17.   My Hadoop Blog:
–  http://hadoopblog.blogspot.com/
–  http://www.facebook.com/hadoopfs