3. • What is it?
A distributed file system designed to store very large files with streaming data access patterns
• Why is it needed?
Very large files
Streaming data access
Commodity hardware
• Traditional design limits
RAC and MPP systems bring the data to the computation, so the network becomes the bottleneck
• Trade-offs
High latency data access
Not good for lots of small files
Write once: multiple writers to the same file are not supported
7. Network Distances in Hadoop
• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
• distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
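The four distances above follow one rule: count the hops from each node up to their closest common ancestor and add them. A minimal Python sketch of that rule, assuming locations are written as /datacenter/rack/node paths as on this slide (the function name network_distance is hypothetical, not a Hadoop API):

```python
def network_distance(a: str, b: str) -> int:
    """Sum of hops from each location up to the closest common ancestor."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")

    # Count how many leading components (data center, rack, node) match.
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1

    # Each non-shared component is one hop up the tree.
    return (len(path_a) - common) + (len(path_b) - common)


if __name__ == "__main__":
    pairs = [
        ("/d1/r1/n1", "/d1/r1/n1"),
        ("/d1/r1/n1", "/d1/r1/n2"),
        ("/d1/r1/n1", "/d1/r2/n3"),
        ("/d1/r1/n1", "/d2/r3/n4"),
    ]
    for a, b in pairs:
        print(a, b, network_distance(a, b))
```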
8. • HDFS blocks: default size 128 MB (large, so that seek time stays small relative to transfer time), default replication 3x
• NameNode: stores metadata for all blocks in the cluster; its storage location is configured with dfs.namenode.name.dir, default /dfs/xx
• DataNodes: store data blocks, and also hold metadata related to their local blocks
• POSIX-like (almost) permissions: rw(x), users, groups, mode
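To make the block size and replication defaults concrete, here is a rough Python sketch of how a file maps to blocks and raw storage. It assumes the defaults from this slide (128 MB blocks, 3 replicas); the function name block_layout and the returned keys are hypothetical. Note that a final block smaller than the block size only occupies its actual length on disk.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default block size from the slide
REPLICATION = 3                 # default replication factor from the slide


def block_layout(file_size: int,
                 block_size: int = BLOCK_SIZE,
                 replication: int = REPLICATION) -> dict:
    """Estimate block count and raw storage for a file of file_size bytes."""
    # Ceiling division: a partial last block still counts as one block.
    blocks = -(-file_size // block_size)
    last_block = file_size - (blocks - 1) * block_size if blocks else 0
    return {
        "blocks": blocks,
        "last_block_bytes": last_block,
        "block_replicas": blocks * replication,
        "raw_bytes": file_size * replication,
    }


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(block_layout(one_gib))            # 8 blocks, 24 replicas
    print(block_layout(300 * 1024 * 1024))  # 3 blocks, last one 44 MiB
```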
9. • HDFS logs and web interface: port 50070 (NameNode), port 50075 (DataNode)
• WebHDFS / HttpFS REST interface
http://sabtu:50070/webhdfs/v1/tmp?user.name=hdfs&op=GETFILESTATUS
{"FileStatus":{"accessTime":0,"blockSize":0,"childrenNum":4,"fileId":16386,"group":"supergroup","length":0,"modificationTime":1467099643710,"owner":"hdfs","pathSuffix":"","permission":"1777","replication":0,"type":"DIRECTORY"}}
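The request and response above can be reproduced from a script. The Python sketch below builds the same GETFILESTATUS URL and parses the sample response shown on this slide. The helper names webhdfs_url and summarize are hypothetical; the host sabtu and port 50070 come from the slide and would differ on another cluster. Actually sending the request (for example with urllib.request.urlopen) needs a reachable NameNode, so it is left as a comment.

```python
import json
from urllib.parse import urlencode


def webhdfs_url(host: str, path: str, op: str, user: str, port: int = 50070) -> str:
    """Build a WebHDFS URL of the form used on the slide."""
    query = urlencode({"user.name": user, "op": op})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"


# Response body copied from the slide.
SAMPLE_RESPONSE = (
    '{"FileStatus":{"accessTime":0,"blockSize":0,"childrenNum":4,'
    '"fileId":16386,"group":"supergroup","length":0,'
    '"modificationTime":1467099643710,"owner":"hdfs","pathSuffix":"",'
    '"permission":"1777","replication":0,"type":"DIRECTORY"}}'
)


def summarize(body: str) -> dict:
    """Pick out the fields most often needed from a FileStatus response."""
    status = json.loads(body)["FileStatus"]
    return {
        "type": status["type"],
        "owner": status["owner"],
        "group": status["group"],
        "permission": status["permission"],
    }


if __name__ == "__main__":
    url = webhdfs_url("sabtu", "/tmp", "GETFILESTATUS", "hdfs")
    print(url)
    # With a live cluster: body = urllib.request.urlopen(url).read().decode()
    print(summarize(SAMPLE_RESPONSE))
```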
10. Some Features
• High Availability mode
• HDFS federation, a concept similar to namespace / database sharding
• HDFS balancer
• Safe mode
• Distributed copy (distcp)
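12. Common Commands
• start cluster
$HADOOP_PREFIX_HOME/sbin/start-dfs.sh
• stop cluster
$HADOOP_PREFIX_HOME/sbin/stop-dfs.sh
• file operations
hdfs dfs -cp x y (copy x to y within HDFS)
hdfs dfs -ls x (list x)
hdfs dfs -cat x (print the contents of x)
hdfs dfs -put x y (upload local file x to HDFS path y)
hdfs dfs -get x y (download HDFS file x to local path y)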
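When these file operations need to run from a script, one option is a thin wrapper around the hdfs client. The Python sketch below is only an illustration: the helper names hdfs_dfs_argv and run_hdfs_dfs are hypothetical, and it assumes the hdfs command is on PATH.

```python
import subprocess


def hdfs_dfs_argv(op: str, *paths: str) -> list:
    """Build the argument list for `hdfs dfs -<op> <paths...>`."""
    return ["hdfs", "dfs", f"-{op}", *paths]


def run_hdfs_dfs(op: str, *paths: str) -> str:
    """Run the command and return its stdout.

    Requires the hdfs client on PATH; raises CalledProcessError
    if the command exits with a non-zero status.
    """
    result = subprocess.run(
        hdfs_dfs_argv(op, *paths),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the commands instead of running them, so this works without a cluster.
    print(hdfs_dfs_argv("put", "x", "y"))
    print(hdfs_dfs_argv("ls", "x"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.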