Hadoop 1.0 is a significant milestone: the most stable and robust Hadoop release, tested in production against a variety of applications. It offers improved performance, support for HBase, disk-fail-in-place, WebHDFS, and more over previous releases. The next major release, Hadoop 2.0, offers several significant HDFS improvements, including a new append pipeline, federation, wire compatibility, NameNode HA, and further performance improvements. We describe how to take advantage of the new features and their benefits. We also discuss some of the misconceptions and myths about HDFS.
Strata + Hadoop World 2012: HDFS: Now and Future
1. HDFS:
Now and Future
Todd Lipcon (todd@cloudera.com)
Sanjay Radia (sanjay@hortonworks.com)
2. Outline
Part 1 – Todd Lipcon (Cloudera)
• Namenode HA
• HDFS Performance improvements
• Taking advantage of next-gen hardware
• Storage Efficiency (RAID and compression)
Part 2 - Sanjay Radia (Hortonworks)
• Federation and Generalized storage service
– Leverage it for further innovation
• Snapshots
• Other
– WebHDFS
– Wire compatibility
2 O'Reilly Strata & Hadoop World
3. HDFS HA in Hadoop 2.0.0
• Initial implementation last year
– Introduced Standby NameNode and manual hot
failover (see Hadoop World 2011 presentation)
• Handled planned maintenance (e.g. upgrades) but not unplanned outages
– Required a highly-available NFS filer to store
NameNode metadata
• Complicated and expensive to set up
4. HDFS HA Phase 2
• Automatic failover
– Uses Apache ZooKeeper to automatically detect
NameNode failures and trigger a failover
– Ops may invoke manual failover for planned
maintenance windows
• Removed dependency on NFS storage
– HDFS HA is entirely self-contained
– No special hardware or software required
– No SPOF anywhere in the system
5. Automatic Failover
• Each NameNode has a new process called
ZooKeeperFailoverController (ZKFC)
– Maintains a session to ZooKeeper
– Periodically runs a health-check against its local NameNode to verify
that it is running properly
• Triggers failover if the health check fails or the ZK session expires
• Operators may still issue manual failover commands for planned
maintenance
• Failover time: 30-40 seconds unplanned; 0-3 seconds planned.
• Handles all types of faults: machine, software, network, etc.
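As a sketch of what enabling automatic failover looks like in practice (property names are from the Hadoop 2 HA configuration; hostnames are placeholders):

```xml
<!-- hdfs-site.xml: let the ZKFCs trigger failover automatically -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml: the ZooKeeper ensemble the ZKFCs use for
     failure detection and leader election (example hostnames) -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```

For a planned maintenance window, operators can still force a failover manually with `hdfs haadmin -failover`.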
6. Removed NFS/filer dependency
• Shared storage on NFS practical for some
organizations, but difficult for others
– Complex configuration, custom fencing scripts
– Filer itself must be highly available
– Expensive to buy, expensive to support
– Buggy NFS clients in Linux
• Introduced new system for reliable edit log
storage: QuorumJournalManager
7. QuorumJournalManager
• Run 3 or 5 JournalNodes, co-located on existing hardware (no new machines needed)
• Each edit must be committed to a majority of the nodes (i.e. a quorum)
– A minority of nodes may crash or be slow without affecting
system availability
– Run N nodes to tolerate (N-1)/2 failures (same as ZooKeeper)
• Built into HDFS
– Designed for existing Hadoop ops teams to understand
– Hadoop Metrics support, full Kerberos support, etc.
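The quorum arithmetic on this slide is easy to sanity-check; a minimal sketch (function names are ours, not from the HDFS code base):

```python
def tolerated_failures(journal_nodes: int) -> int:
    """A write must reach a majority, so a quorum system of N nodes
    survives the loss of any minority: floor((N - 1) / 2) nodes."""
    return (journal_nodes - 1) // 2

def quorum_size(journal_nodes: int) -> int:
    """Smallest majority that must acknowledge each edit."""
    return journal_nodes // 2 + 1

# 3 JournalNodes tolerate 1 failure; 5 tolerate 2 -- the same rule
# ZooKeeper itself follows.
```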
8. HDFS HA Architecture
(with Automatic Failover and QuorumJournalManager)
[Architecture diagram: a three-node ZooKeeper quorum receives heartbeats from a FailoverController running beside each NameNode; each FailoverController monitors the health of its local NameNode (NN process, OS, hardware). The Active and Standby NameNodes share edit-log state through a quorum of JournalNodes (JN). DataNodes send block reports to both the Active and Standby NameNodes; DN fencing ensures they obey commands only from the Active.]
9. HA Improvements Summary
• Automatic failover
– Avoid both planned and unplanned downtime
• Non-NFS Shared Storage
– No need to buy or configure a filer
• Result: HA with no external dependencies
• Available now in HDFS trunk and CDH4.1
• Come to our 5pm talk in this room for more
details on these HA improvements!
10. HDFS Performance Update: 2.x vs 1.x
• Significant speedups from SSE4.2 hardware checksum
calculation (2.5-3x less CPU on read path)
• Rewritten read path for fewer memory copies
• Short-circuit past datanodes for 2-3x faster random
read (HBase workloads)
• I/O scheduling improvements: push down hints to
Linux using posix_fadvise()
• Covered in my presentation from Hadoop World 2011
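The fadvise pushdown mentioned above can be illustrated with the raw syscall; a sketch (Python stands in for HDFS's native code, and the hint only has effect on Linux):

```python
import os

def read_then_drop_cache(path: str, chunk: int = 65536) -> int:
    """Stream a file sequentially, then hint the kernel that its
    pages won't be re-read -- the posix_fadvise() hint HDFS pushes
    down after servicing sequential reads. Returns bytes read."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            total += len(buf)
        # POSIX_FADV_DONTNEED: release this file's page-cache pages.
        if hasattr(os, "posix_fadvise"):  # Linux-only advisory hint
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```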
11. HDFS Performance: Recent Work
• Completed
– Zero-copy read for libhdfs (2-3x improvement for C++
clients like Impala reading cached data)
– Expose mapping of blocks to disks: 2x improvement by
avoiding contention on slower drives (HDFS-3672)
• In progress
– Using native checksum computation on write path
– Avoiding copies and allocation on write path
12. HDFS Performance Benchmarks
[Chart (as of June 2012): read and write throughput (MB/sec) for raw ext4, HDFS, and HDFS with disk awareness.]
Dual quad-core, 12x2T 7200RPM drives; measured max disk throughput: 900MB/sec.
Write throughput is CPU bound; improvements in progress bring it to max disk throughput as well.
Easily saturates SATA3 bus bandwidth on common hardware.
13. Hardware Trends
• Denser storage
– 36T per node already common
– Millions of blocks per DN
• New need to invest in scaling DataNode memory usage
• More RAM
– 64GB common today; 256GB will soon be inexpensive
– Customers want to explicitly pin recently ingested data in RAM
(especially with efficient query engines like Impala)
• Solid state storage (SSD, FusionIO, etc)
– HDFS should transparently or explicitly migrate hot random-access data to/from flash
– Hierarchical storage management
14. HDFS Storage Efficiency
• Many customers are expanding their clusters simply to add storage
– How can we better utilize the disks they already have?
• RAID (Reed-Solomon coding)
– Store blocks at low replication, keep parity blocks to allow
reconstruction if they are lost
– Effective replication: 1.5x with same durability, less locality
• Transparent compression
– Automatically detect infrequently used files, transparently re-compress with Snappy, GZip, bz2, or LZMA
– Cloudera workload traces indicate 10% of files accessed 90% of the
time!
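The effective-replication arithmetic works out as follows; a sketch (the (data, parity) split is an illustrative choice, not the exact HDFS-RAID layout):

```python
def effective_replication(data_blocks: int, parity_blocks: int) -> float:
    """Storage used per logical byte when blocks are kept at
    replication 1 plus Reed-Solomon parity for reconstruction."""
    return (data_blocks + parity_blocks) / data_blocks

# One parity block per two data blocks gives the 1.5x on the slide,
# versus 3.0x for plain triple replication.
```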
15. Outline
Part 1 – Todd Lipcon (Cloudera)
• Namenode HA
• HDFS Performance improvements
• Taking advantage of next-gen hardware
• Storage Efficiency (RAID and compression)
Part 2 - Sanjay Radia (Hortonworks)
• Federation and Generalized storage service
– Leverage it for further innovation
• Snapshots
• Other
– WebHDFS
– Wire compatibility
• HA in Hadoop 1!
16. Federation: Generalized Block Storage
[Diagram: NameNodes NN-1 … NN-n each manage a namespace (NS1 … NSn, including foreign namespaces) backed by its own block pool (Pool 1 … Pool n); DataNodes DN 1 … DN m provide the common block storage shared by all pools.]
• Block Storage as a generic storage service
– Set of blocks for a Namespace Volume is called a Block Pool
– DNs store blocks for all the Namespace Volumes – no partitioning
• Multiple independent Namenodes and Namespace Volumes in a cluster
– Namespace Volume = Namespace + Block Pool
17. HDFS’ Generic Storage Service: Opportunities for Innovation
[Diagram: the HDFS namespace, HBase, MR tmp storage, and alternate NN implementations all layered on the common block storage service.]
• Federation – distributed (partitioned) namespace
– Simple and robust due to independent masters
– Scalability, isolation, availability
• New services – independent block pools
– New FS – partial namespace in memory
– MR tmp storage directly on block storage
– Shadow file system – caches HDFS, NFS, S3
• Future: move block management into the DataNodes
– Simplifies namespace/application implementation
– A distributed namenode becomes significantly simpler
18. Managing Namespaces
[Diagram: a client-side mount table maps /data, /project, /home, and /tmp onto namespaces NS1–NS4.]
• Federation has multiple namespaces
• Don’t you need a single global namespace?
– Some tenants want a private namespace
– Do you create a single DB or a single table?
– Many volumes; share what you want
– Global? The key is to share the data and the names used to access the data
• A client-side mount table can implement global or private namespaces
– Shared mount table => “global” shared view
– Personalized mount table => per-application view
• Share the data that matters by mounting it
• Client-side implementation of mount tables
– xInclude from a shared place – global view
– No single point of failure
– No hotspot for root and top-level directories
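A client-side mount table of this kind shipped in Hadoop as ViewFs; a configuration sketch (cluster name, namespace hosts, and mount points are placeholders):

```xml
<!-- core-site.xml on the client: default FS is the mount table -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://clusterX</value>
</property>

<!-- Mount /data and /home onto two different namespace volumes -->
<property>
  <name>fs.viewfs.mounttable.clusterX.link./data</name>
  <value>hdfs://nn1.example.com:8020/data</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterX.link./home</name>
  <value>hdfs://nn2.example.com:8020/home</value>
</property>
```

Because the table lives in client configuration (shareable via xInclude), there is no central root NameNode to become a single point of failure or hotspot.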
19. Next Steps… first-class support for volumes
[Diagram: NameServers act as containers of namespaces, layered over a storage layer of DataNodes.]
• NameServer – a container for namespaces
– Lots of small namespace volumes
• Chosen per user, tenant, or data feed
• Management policies (quotas, …)
– Mount tables for a unified namespace
• Centrally managed (xInclude, ZK, …)
• Keep only the working set of each namespace in memory
– Break away from the old NN’s full-namespace-in-memory model
– Faster startup, billions of names, hundreds of volumes
• Number of NameServers determined by
– Sum of (namespace working sets)
– Sum of (namespace throughput)
– Move namespaces for balancing
20. Snapshots
• Take snapshot of any directory
– Multiple snapshots allowed
• Snapshot metadata info stored in the NameNode
– Datanodes have no knowledge
– Blocks are shared
• All regular commands/apis can be used against
snapshots
– cp /foo/bar/.snapshot/x/y /a/b/z
• New CLIs to create and delete snapshots
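The `.snapshot` path convention above can be captured in a small helper; a sketch (the function is ours, but the path layout matches the slide's example):

```python
def snapshot_path(snapshottable_dir: str, snapshot_name: str,
                  subpath: str = "") -> str:
    """Build the read-only path under which a snapshot of
    `snapshottable_dir` is exposed, e.g. /foo/bar/.snapshot/x/y."""
    base = f"{snapshottable_dir.rstrip('/')}/.snapshot/{snapshot_name}"
    return f"{base}/{subpath.lstrip('/')}" if subpath else base

# The slide's copy source: snapshot_path("/foo/bar", "x", "y")
```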
21. Snapshots - Status
• HDFS-2802 (feature branch)
– Initial design and prototype – March 2012
– Development active
• Updated design document and test plan posted
– Review meeting – 1st week November
• 15+ patches
– Expected completion – early December!
22. Enterprise Use Cases
• Storage fault-tolerance – built into HDFS Architecture
– Over seven 9s of data reliability
• High Availability
• Standard Interfaces
– WebHDFS (REST), FUSE, and NFS access
• HTTPFS – WebHDFS as a farm of proxy servers
• libWebhdfs – a pure C library for HDFS
• Wire protocol compatibility
– Protocol buffers
• Rolling upgrades
– Rolling upgrades for dot-releases
• Snapshots - Under active development
• Disaster Recovery
– Distcp does parallel and incremental copies across cluster
• Future - Enhance using journal interface & Snapshots
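WebHDFS exposes these operations over plain HTTP; a sketch of how a client forms a request URL (host, port, and user are placeholders; `/webhdfs/v1` and the `op` query parameter are the documented REST shape):

```python
from urllib.parse import urlencode

def webhdfs_url(host: str, port: int, path: str, op: str, **params) -> str:
    """Compose a WebHDFS REST URL:
    http://<host>:<port>/webhdfs/v1<path>?op=<OP>&..."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# e.g. open a file for reading as user 'hdfs':
# webhdfs_url("nn1.example.com", 50070, "/user/foo/part-0",
#             "OPEN", **{"user.name": "hdfs"})
```

Because the protocol is plain REST, any HTTP client works, which is what makes the HTTPFS proxy farm and libWebhdfs straightforward to build on top.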
23. Summary
• HA for Namenode
– Hot failover, shared storage not required (QJM)
• Performance improvements
• Utilize today’s and tomorrow’s hardware to full potential
• Federation and Generalized storage layer
– Opportunities for innovation
• Partial namespace in memory, shadow/caching file system, MR tmp, etc.
• Wire compatibility, WebHdfs, …
• Snapshots - Development well in progress