The document discusses how HDFS architecture has evolved to meet new requirements for higher scalability, higher availability, and better random-read performance. It summarizes the key aspects of the 2010 HDFS architecture and its limitations, and the improvements made since then, such as read-pipeline optimizations, federated namespaces, and high-availability NameNodes. It also outlines future directions for HDFS architecture.
5. ✛ Each cluster has…
> A single Name Node
∗ Stores file system metadata
∗ Stores “Block ID” -> Data Node mapping
> Many Data Nodes
∗ Store actual file data
> Clients of HDFS…
∗ Communicate with the Name Node to browse the file system and get block locations for files
∗ Communicate directly with Data Nodes to read/write file data (see the sketch below)
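As a quick illustration of this division of labor, here is a minimal Java sketch using the standard Hadoop FileSystem API; the Name Node address and file path are made up for the example. Metadata calls such as getFileStatus go to the Name Node, while open/read streams the bytes from Data Nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical Name Node address; normally this comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/example.txt");  // hypothetical path

    // Metadata (file length, block locations) is served by the Name Node.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("File length: " + status.getLen());

    // The actual bytes are read directly from the Data Nodes holding the blocks.
    try (FSDataInputStream in = fs.open(file)) {
      byte[] buf = new byte[4096];
      int n = in.read(buf);
      System.out.println("Read " + n + " bytes");
    }
  }
}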
7. ✛ Want to support larger clusters
> ~4,000 node limit with 2010 architecture
> New nodes beefier than old nodes
∗ 2009: 8 cores, 16GB RAM, 4x1TB disks
∗ 2012: 16 cores, 48GB RAM, 12x3TB disks
✛ Want to increase availability
> With rise of HBase, HDFS now serving live traffic
> Downtime means immediate user-facing impact
✛ Want to improve random read performance
> HBase usually does small, random reads, not bulk
8. ✛ Single Name Node
> If Name Node goes offline, cluster is unavailable
> Name Node must fit all FS metadata in memory
✛ Inefficiencies in read pipeline
> Designed for large, streaming reads
> Not small, random reads (like HBase use case)
9. ✛ Fine for offline, batch-oriented applications
✛ If cluster goes offline, external customers don’t notice
✛ Can always use separate clusters for different groups
✛ HBase didn’t exist when Hadoop was first created
> MapReduce was the only client application
11. HDFS CPU Improvements: Checksumming
• HDFS checksums every piece of data in/out
• Significant CPU overhead
• Measured by putting ~1GB in HDFS and reading the file back (cat) in a loop
• 0.20.2: ~30-50% of CPU time is CRC32 computation!
• Optimizations (illustrated below):
• Switch to the “bulk” API: verify/compute 64KB at a time instead of 512 bytes (better instruction cache locality, amortized JNI overhead)
• Switch to the CRC32C polynomial, SSE4.2, and highly tuned assembly (~8 bytes per cycle with instruction-level parallelism!)
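To make the bulk-API idea concrete, here is a rough, pure-Java sketch of the chunk layout: HDFS still keeps one checksum per 512-byte chunk, but the optimized path hands a whole 64KB buffer (128 chunks) to a single verify/compute call, amortizing call and JNI overhead. This uses the JDK's CRC32C class (Java 9+) purely for illustration; it is not the actual HDFS code path, which uses tuned native CRC32C where available.

import java.util.zip.CRC32C;  // JDK 9+; illustrative stand-in for HDFS's own CRC32C code

public class BulkChecksumSketch {
  static final int CHUNK = 512;        // HDFS stores one checksum per 512-byte chunk
  static final int BULK = 64 * 1024;   // optimized path processes 64KB of chunks per call

  // Compute all per-chunk checksums for one 64KB buffer in a single pass.
  // In HDFS this whole buffer goes through one bulk (possibly native, SSE4.2) call,
  // instead of 128 separate 512-byte calls.
  static int[] checksumBulk(byte[] buf) {
    int chunks = buf.length / CHUNK;
    int[] sums = new int[chunks];
    CRC32C crc = new CRC32C();
    for (int i = 0; i < chunks; i++) {
      crc.reset();
      crc.update(buf, i * CHUNK, CHUNK);
      sums[i] = (int) crc.getValue();
    }
    return sums;
  }

  public static void main(String[] args) {
    byte[] buf = new byte[BULK];
    new java.util.Random(42).nextBytes(buf);
    System.out.println("Computed " + checksumBulk(buf).length + " chunk checksums");
  }
}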
12. Checksum improvements (lower is better)
[Chart: CDH3u0 vs Optimized across random-read latency, random-read CPU usage, and sequential-read CPU usage; random-read latency drops from 1360us to 760us]
Post-optimization: only 16% overhead vs un-checksummed access
Maintain ~800MB/sec from a single thread reading OS cache
13. HDFS Random access
• 0.20.2:
• Each individual read operation reconnects to the DataNode
• Significant TCP handshake overhead, thread creation, etc.
• 2.0.0:
• Clients cache open sockets to each DataNode (like HTTP keepalive; see the sketch below)
• Local readers can bypass the DataNode in some circumstances to read data directly
• Rewritten BlockReader to eliminate a data copy
• Eliminated lock contention in the DataNode’s FSDataset class
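The socket caching change is essentially HTTP keepalive applied to DataNode connections: the client keeps recently used sockets open, keyed by DataNode address, and reuses them for the next read instead of paying a fresh TCP handshake each time. Below is a toy Java sketch of that pattern; the real logic lives inside the HDFS client's socket cache, and the class and method names here are invented for illustration.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy keepalive-style cache (not the HDFS implementation): reuse an idle socket
// to a DataNode if one exists, otherwise pay the TCP handshake once and keep the
// socket around for later reads.
public class DataNodeSocketCache {
  private final Map<InetSocketAddress, Deque<Socket>> idle = new HashMap<>();

  public synchronized Socket getOrConnect(InetSocketAddress dataNode) throws IOException {
    Deque<Socket> sockets = idle.get(dataNode);
    while (sockets != null && !sockets.isEmpty()) {
      Socket s = sockets.pop();
      if (!s.isClosed()) {
        return s;  // cache hit: no new handshake, no new DataNode-side thread
      }
    }
    return new Socket(dataNode.getAddress(), dataNode.getPort());  // cache miss
  }

  public synchronized void release(InetSocketAddress dataNode, Socket s) {
    idle.computeIfAbsent(dataNode, k -> new ArrayDeque<>()).push(s);
  }
}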
14. Random-read micro benchmark (higher is better)
Speed (MB/sec):
                     0.20.2   Trunk (no native)   Trunk (native)
4 threads, 1 file      106         253                 299
16 threads, 1 file     247         488                 635
8 threads, 2 files     187         477                 633
TestParallelRead benchmark, modified to 100% random read proportion.
Quad core Core i7 Q820 @ 1.73GHz
15. Random-read macro benchmark (HBase YCSB)
[Chart: reads/sec over time, CDH4 vs CDH3u1]
17. ✛ Instead of one Name Node per cluster, several
> Before: Only one Name Node, many Data Nodes
> Now: A handful of Name Nodes, many Data Nodes
✛ Distribute file system metadata between the NNs
✛ Each Name Node operates independently
> Potentially overlapping ranges of block IDs
> Introduce a new concept: block pool ID
> Each Name Node manages a single block pool
19. ✛ Improve scalability to 6,000+ Data Nodes
> Bumping into single Data Node scalability now
✛ Allow for better isolation
> Could locate HBase dirs on dedicated Name Node
> Could locate /user dirs on dedicated Name Node
✛ Clients still see a unified view of the FS namespace
> Use ViewFS, a client-side mount table configuration (example below)
Note: Federation != Increased Availability
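For illustration, a sketch of a ViewFS client-side mount table expressed as Java configuration calls, assuming two federated Name Nodes, one dedicated to /hbase and one to /user. The cluster and host names are made up, and real deployments normally put these keys in core-site.xml rather than code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The default FS is the ViewFS mount table named "clusterX" (hypothetical name).
    conf.set("fs.defaultFS", "viewfs://clusterX/");
    // Route /hbase to one Name Node and /user to another (hosts are illustrative).
    conf.set("fs.viewfs.mounttable.clusterX.link./hbase",
             "hdfs://nn-hbase.example.com:8020/hbase");
    conf.set("fs.viewfs.mounttable.clusterX.link./user",
             "hdfs://nn-user.example.com:8020/user");

    FileSystem fs = FileSystem.get(conf);
    // The client sees a single namespace; ViewFS routes each path to the right Name Node.
    System.out.println(fs.makeQualified(new Path("/user/alice")));
  }
}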
21. Current HDFS Availability & Data Integrity
• Simple design, storage fault tolerance
• Storage: rely on the OS’s file system rather than raw disk
• Storage fault tolerance: multiple replicas, active monitoring
• Single NameNode Master
• Persistent state: multiple copies + checkpoints
• Restart on failure
22. Current HDFS Availability & Data Integrity
• How well did it work?
• Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009
• 7-9’s of reliability, and that bug was fixed in 0.20
• 18-month study: 22 failures on 25 clusters, i.e. 0.58 failures per year per cluster
• Only 8 would have benefitted from HA failover!! (0.23 failures per cluster-year)
23. So why build an HA NameNode?
• Most cluster downtime in practice is planned downtime
• Cluster restart for a NN configuration change (e.g. new JVM configs, new HDFS configs)
• Cluster restart for a NN hardware upgrade/repair
• Cluster restart for a NN software upgrade (e.g. new Hadoop, new kernel, new JVM)
• Planned downtime causes the vast majority of outages!
• Manual failover solves all of the above!
• Failover to NN2, fix NN1, fail back to NN1: zero downtime
24. Approach and Terminology
• Initial goal: Active-Standby with hot failover
• Terminology
• Active NN: actively serves read/write operations from clients
• Standby NN: waits, becomes active when the Active dies or is unhealthy
• Hot failover: standby able to take over instantly
25. HDFS Architecture: High Availability
• Single NN configuration; no failover
• Active and Standby with manual failover
• Addresses downtime during upgrades, the main cause of unavailability
• Active and Standby with automatic failover
• Addresses downtime during unplanned outages (kernel panics, bad memory, double PDU failure, etc.)
• See HDFS-1623 for detailed use cases
• With Federation, each namespace volume has an active-standby NameNode pair
26. HDFS Architecture: High Availability
• Failover controller outside the NN
• Parallel block reports to the Active and Standby
• NNs share namespace state via a shared edit log
• NAS or Journal Nodes
• Like RDBMS “log shipping replication”
• Client failover
• Smart clients (e.g. configuration, or ZooKeeper for coordination); see the example config below
• IP Failover in the future
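As an example of the smart-client approach, here is a sketch of the client-side HA configuration, assuming a nameservice called mycluster with NameNodes nn1 and nn2. The host names are invented, and real deployments keep these keys in hdfs-site.xml; the point is that the client-side failover proxy provider is what retries against the other NameNode when the active one goes away.

import org.apache.hadoop.conf.Configuration;

public class HaClientConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Logical nameservice that clients address instead of a single NameNode host.
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    // Physical addresses of the two NameNodes (illustrative host names).
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
    // The failover proxy provider makes the client "smart": on failover it
    // transparently retries operations against the other NameNode.
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
    conf.set("fs.defaultFS", "hdfs://mycluster");
    System.out.println("Clients now address " + conf.get("fs.defaultFS"));
  }
}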
29. ✛ Increase scalability of single Data Node
> Currently the most-noticed scalability limit
✛ Support for point-in-time snapshots
> To better support DR, backups
✛ Completely separate block / namespace layers
> Increase scalability even further, new use cases
✛ Fully distributed NN metadata
> No pre-determined “special nodes” in the system