"An elephant can't jump. But it can carry a heavy load."
Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations: Amazon.com, Apple, eBay, IBM, ImageShack, LinkedIn, Microsoft, Twitter, The New York Times...
1. Students: An Du – Tan Tran – Toan Do – Vinh Nguyen
Instructor: Professor Lothar Piepmayer
HDFS at a glance
2. Agenda
1. Design of HDFS
2.1. HDFS Concepts – Blocks
2.2. HDFS Concepts - Namenode and Datanodes
3.1 Dataflow - Anatomy of a file read
3.2 Dataflow - Anatomy of a file write
3.3 Dataflow - Coherency model
4. Parallel copying
5. Demo - Command line
3. The Design of HDFS
Very large distributed file system
Up to 10K nodes, 1 billion files, 100PB
Streaming data access
Write once, read many times
Commodity hardware
Files are replicated to handle hardware failure
Detect failures and recover from them
5. HDFS Blocks
Blocks in a normal filesystem are a few kilobytes
HDFS has a large block size
Default: 64MB
Typically configured: 128MB
Unlike a filesystem for a single disk, a file in HDFS that is
smaller than a single block does not occupy a full block's worth of underlying storage
6. HDFS Blocks
A file is stored as blocks on various nodes in a Hadoop cluster.
HDFS creates several replicas of each data block.
Each data block is replicated to multiple nodes
across the cluster.
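To make the block-and-replica arithmetic concrete, here is a small sketch. The 64MB block size is the HDFS default mentioned above; the replication factor of 3 is the HDFS default, assumed here since the slides don't state it:

```python
import math

block_size_mb = 64     # HDFS default block size (see above)
replication = 3        # HDFS default replication factor (assumption)
file_size_mb = 1024    # example: a 1GB file

# The file is split into fixed-size blocks; each block is replicated.
num_blocks = math.ceil(file_size_mb / block_size_mb)
total_block_copies = num_blocks * replication
raw_storage_mb = file_size_mb * replication

print(num_blocks)          # 16 blocks
print(total_block_copies)  # 48 block copies spread across the cluster
print(raw_storage_mb)      # 3072 MB of raw storage consumed
```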
8. Why are HDFS blocks so large?
Minimize the cost of seeks
=> Seek time becomes a small fraction of transfer time, so reads run at close to the disk transfer rate
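A back-of-the-envelope check of this reasoning, assuming a 10ms average seek time and a 100MB/s sustained transfer rate (illustrative figures, not measurements from this cluster):

```python
# How large must a block be so that seek overhead is only 1%
# of the total transfer time?
seek_time_s = 0.010          # assumed average disk seek time: 10 ms
transfer_rate_mb_s = 100     # assumed sustained transfer rate: 100 MB/s
target_seek_fraction = 0.01  # seeks should cost at most 1%

# If the seek is 1% of the transfer time, the transfer takes 1 s.
transfer_time_s = seek_time_s / target_seek_fraction
block_size_mb = transfer_rate_mb_s * transfer_time_s
print(block_size_mb)  # 100.0 -> a block of roughly 100 MB
```

This is why HDFS blocks sit in the tens-to-hundreds of megabytes rather than the kilobytes of an ordinary filesystem.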
9. Benefit of Block abstraction
A file can be larger than any single disk in the network
Simplify the storage subsystem
Providing fault tolerance and availability
11. Namenode & Datanodes
Namenode (master)
– manages the filesystem namespace
– maintains the filesystem tree and metadata for all the
files and directories in the tree.
Datanodes (slaves)
– store data in the local filesystem
– periodically report back to the namenode with lists of
the blocks they are storing
Clients communicate with both namenode and datanodes.
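A toy in-memory model of this split of responsibilities (purely illustrative; real HDFS keeps the namespace in the namenode's memory and the block data on datanode disks, with clients talking RPC to both):

```python
# Namenode holds only metadata: file path -> ordered list of block IDs.
namenode_namespace = {
    "/foo/data.txt": ["blk_1", "blk_2"],
}

# Datanodes hold actual block contents; blk_1 has two replicas.
datanodes = {
    "dn1": {"blk_1": b"first 64MB...", "blk_2": b"rest..."},
    "dn2": {"blk_1": b"first 64MB..."},
}

def read_file(path):
    """Client asks the namenode for the block list, then fetches
    each block from any datanode that holds a replica of it."""
    data = b""
    for block_id in namenode_namespace[path]:
        for blocks in datanodes.values():
            if block_id in blocks:
                data += blocks[block_id]
                break
    return data

print(read_file("/foo/data.txt"))
```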
18. Parallel copying in HDFS
Transfer data between clusters
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
Implemented as a MapReduce job, with one file per map
Each map copies at least 256MB of data
The default maximum is 20 maps per node
Copying between different HDFS versions is only supported via the webhdfs protocol:
% hadoop distcp webhdfs://namenode1:50070/foo
webhdfs://namenode2:50070/bar
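A rough sketch of how the number of maps works out under these defaults (illustrative arithmetic, not distcp's exact scheduling logic):

```python
import math

total_data_mb = 10 * 1024  # example: copying 10 GB between clusters
min_per_map_mb = 256       # each map copies at least 256 MB
nodes = 3                  # example cluster size
max_maps_per_node = 20     # default cap from the slide above

# Map count is limited by the per-map minimum and the per-node cap.
maps_by_size = math.ceil(total_data_mb / min_per_map_mb)
maps_cap = nodes * max_maps_per_node
num_maps = min(maps_by_size, maps_cap)
print(num_maps)  # 40 maps (size-limited, under the cap of 60)
```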
19. Setup
Cluster with 3 nodes, each with:
4 GB RAM
2 CPUs @ 2.0GHz+
100GB HDD
Running as VMware virtual machines on 3 different servers
Network: 100Mbps
Operating System: Ubuntu 11.04
Windows: not tested
23. Command Line
Sorting: the standard method to test a cluster
TeraGen: generates dummy data
TeraSort: sorts the data
TeraValidate: validates the sort result
Command Line:
hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar
terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41
24. Benchmark Result
2 Nodes, 1GB data: 0:03:38
3 Nodes, 1GB data: 0:03:13
2 Nodes, 10GB data: 0:38:07
3 Nodes, 10GB data: 0:31:28
The virtual machines' hard disks are the bottleneck
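The bottleneck claim is supported by simple arithmetic on the reported wall-clock times: going from 2 to 3 nodes on the 10GB run gives far less than the ideal 1.5x speedup.

```python
# Speedup of the 10GB TeraSort when adding a third node,
# using the wall-clock times reported above.
def to_seconds(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

t2 = to_seconds("0:38:07")  # 2 nodes, 10GB data
t3 = to_seconds("0:31:28")  # 3 nodes, 10GB data
speedup = t2 / t3
print(round(speedup, 2))  # ~1.21x, well below the ideal 1.5x
```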