2. Local File System
[Diagram: a directory maps FileA, FileB, FileC to Inode-n, Inode-m, Inode-p; each inode holds the file attributes plus the addresses of blocks 0, 1, 2, ... The DISK layout holds the MBR with the partition table, then the boot block, super block, free-space tracking, i-nodes, root dir, and file blocks.]
Note: the file block size is based on what is used when the FS is defined.
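As a quick illustration of that note, on a Linux box with GNU coreutils the block size a local file system was formatted with can be read back; the paths here are only examples:

    # -f queries the file system itself; %S is its fundamental block size
    stat -f -c 'fundamental block size: %S bytes' /
    # For ext2/3/4 the mkfs-time value is also visible via:
    # tune2fs -l /dev/sda1 | grep 'Block size'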
3. Hadoop Distributed File System
[Diagram: the Master host (NN) holds the HDFS directory, mapping files to host:block pairs - FileA → H1:blk0, H2:blk1; FileB → H3:blk0, H1:blk1; FileC → H2:blk0, H3:blk1. Hosts 1-3 each run a local file system whose directory maps the block files to local inodes and DISK blocks: Host 1 has FileA0→Inode-x and FileB1→Inode-y, Host 2 has FileA1→Inode-a and FileC0→Inode-n, Host 3 has FileB0→Inode-r and FileC1→Inode-c.]
Note: the files created on the local FS are of a size equal to the HDFS block size.
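The file-to-block-to-host mapping sketched above can be printed for any real file with fsck; the path is an example:

    # Shows each block id, its size, and the data nodes holding the replicas
    hdfs fsck /user/me/file.txt -files -blocks -locations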
4. HDFS
[Diagram: the Name Node stores ${dfs.name.dir}/current/VERSION plus /edits, /fsimage, /fstime; the Secondary Name Node stores ${fs.checkpoint.dir}/current/VERSION plus /edits, /fsimage, /fstime. Each Data Node stores ${dfs.data.dir}/current/VERSION plus /blk_<id_1>, /blk_<id_1>.meta, /..., /subdir2/. Data Nodes replicate to each other over the HDFS Data Transfer Protocol and HTTP/S; the Hadoop CLI reaches the NN and the DNs over RPC, while WebHDFS and the HDFS UI use HTTP.]
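The on-disk layout above is plain files, so it can be inspected directly on a data node; the dfs.data.dir value below is site-specific and only illustrative:

    # Block payloads and their .meta checksum companions under the data dir
    find /data/hdfs/current -maxdepth 3 \( -name 'blk_*' -o -name VERSION \) | head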
5. HDFS Config Files and Ports
• Default configuration
  – core-default.xml, hdfs-default.xml
• Site-specific configuration
  – core-site.xml, hdfs-site.xml under conf
• Configuration of daemon processes
  – hadoop-env.sh under conf
• List of slave/data nodes
  – "slaves" file under conf
• Ports
  – Default NN UI port 50070 (HTTP), 50470 (HTTPS)
  – Default NN port 8020/9000
  – Default DN UI port 50075 (HTTP), 50475 (HTTPS)
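A convenient way to see what the defaults overlaid with the site files actually resolved to on a given node (Hadoop 2.x):

    hdfs getconf -confKey fs.defaultFS               # NN address, e.g. port 8020/9000
    hdfs getconf -confKey dfs.namenode.http-address  # NN UI, default port 50070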
6. HDFS - Write Flow
[Diagram: Client ↔ Name Node (Namespace MetaData, Blockmap, Fsimage/Edit files) and a pipeline of three Data Nodes; the arrows 1-8 correspond to the steps below.]
1. The client requests to open a file for writing through the fs.create() call. This will overwrite an existing file.
2. The name node responds with a lease on the file path.
3. The client writes locally, and when the data reaches the block size it requests a write from the name node.
4. The name node responds with a new block id and the destination data nodes for the write and its replication.
5. The client sends the first data node the data and the checksum generated on the data to be written.
6. The first data node writes the data and checksum, and in parallel pipelines the replications to the other DNs.
7. Each data node where the data is replicated responds with success/failure to the first DN.
8. The first data node in turn informs the name node that the write request for the block is complete, and the name node updates its block map.
Note: there can be only one write at a time on a file.
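The whole flow can be exercised from the shell; -put goes through fs.create() under the covers, and fsck then shows the blocks and replicas the NN recorded (paths are examples):

    hdfs dfs -put localfile.txt /user/me/file.txt
    hdfs fsck /user/me/file.txt -files -blocks -locations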
7. HDFS - Read Flow
[Diagram: Client ↔ Name Node (Namespace MetaData, Blockmap, Fsimage/Edit files) and three Data Nodes; the arrows 1-7 correspond to the steps below.]
1. The client requests to open a file for reading through the fs.open() call.
2. The name node responds with a lease on the file path.
3. The client requests to read the data in the file.
4. The name node responds with the block ids in sequence and the corresponding data nodes.
5. The client reaches out directly to the DNs for each block of data in the file.
6. When the DNs send back data along with its checksum, the client verifies it by generating a checksum of its own.
7. If the checksum verification fails, the client reaches out to the other DNs where there is a replica.
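Similarly, a read from the shell runs steps 1-7, and the checksum the client verifies can also be requested explicitly:

    hdfs dfs -cat /user/me/file.txt | head    # opens the file, reads blocks from the DNs
    hdfs dfs -checksum /user/me/file.txt      # file checksum derived from block checksums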
8. HDFS - Name Node
Fsimage (MetaData): namespace, ownership, permissions, create/mod/access time, is-hidden flag.
EditFile (Journal): changes to metadata.
BlockMap (in-memory): details on file blocks and where they are stored.
1. The name node manages the HDFS file system using the fsimage/edit-file and block-map data structures.
2. Fsimage and edit-file data are stored on disk. When HDFS starts they are read, merged, and held in memory.
3. Data nodes send details about the blocks they are storing when they start and also at regular intervals.
4. The name node uses the block reports sent by the data nodes to build the BlockMap data structure.
5. The BlockMap data is used when read requests on files come to the file system.
6. The BlockMap data is also used to identify the under/over-replicated files that require correction.
7. At no point does the name node store file data locally, nor is it directly involved in transferring data from files to clients.
8. A client reading/writing data receives the metadata details from the NN and then works directly with the DNs.
9. Name nodes require large memory, since they need to hold all the in-memory data structures.
10. If the NN is lost, the data in the file system can't be accessed.
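Both on-disk structures can be examined offline with the viewers that ship with HDFS; the file names under ${dfs.name.dir}/current carry transaction ids and are illustrative here:

    hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml   # offline image viewer
    hdfs oev -p XML -i edits_inprogress_0000000000000000043 -o edits.xml  # offline edits viewer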
9. FS Meta Data Change Management
[Diagram: at start-up the NameNode merges Fsimage (MetaData) and EditFile (Journal) into Fsimage_1 and EditFile_1; periodically the Secondary NameNode performs the same merge and copies the result back.]
1. When HDFS is up and running, changes to file-system metadata are stored in edit files.
2. When the NN starts, it looks for edit files in the system and merges their content with the fsimage on disk.
3. The merging process creates a new fsimage and edit file, and discards the old fsimage and edit files.
4. Since the edit files can be large for a very active HDFS cluster, the NN start-up can take a long time.
5. The secondary name node, at a regular interval or after a certain edit-file size, merges the edit file and the fsimage file.
6. The merge process creates a new fsimage file and an edit file. The secondary NN copies the new fsimage file back to the NN.
7. This shortens the NN start-up process, and the fsimage can also be used to restore after a failure of the NN server.
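The same merge can also be forced by hand, which is handy before maintenance; this is the admin-initiated equivalent of the secondary NN's periodic checkpoint:

    hdfs dfsadmin -safemode enter     # writes must stop while the image is saved
    hdfs dfsadmin -saveNamespace      # merge edits into a fresh fsimage, reset edits
    hdfs dfsadmin -safemode leave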
10. HDFS - Data Node
[Diagram: Data Nodes send heartbeats / block reports to the Name Node (MetaData, BlockMap).]
1. Data nodes store blocks of data for each file stored in HDFS; the default block size is 128 MB.
2. Blocks of data are replicated n times; by default n is 3.
3. A data node periodically sends a heartbeat to the name node to inform the NN that it is alive.
4. If the NN doesn't receive a heartbeat, it marks the DN as dead and stops sending further requests to it.
5. Also at periodic intervals, a data node sends out a block report that includes all the file blocks it stores.
6. When a DN is dead, all the files that had blocks stored on that DN get marked as under-replicated.
7. The NN rectifies under-replication by replicating the blocks to other data nodes.
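Heartbeat and block-report health is summarized by the NN and visible from the CLI; per-file replication can also be adjusted after the fact (path is an example):

    hdfs dfsadmin -report                      # live/dead DNs, capacity, block counts
    hdfs dfs -setrep -w 2 /user/me/file.txt    # change replication; -w waits for completion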
11. Ensuring Data Integrity
• Through replication/replication assurance
  – First replica closer to the client node
  – Second replica on a different rack
  – Third replica on the same rack as the second replica
• File-system checks run manually
• Block scanning over a period of time
• Storing checksums along with block data
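The manual file-system check mentioned above, with replica and rack detail:

    # / can be any path; -racks annotates each replica with its rack
    hdfs fsck / -files -blocks -racks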
12. Permission and Quotas
• Files and directories use much of the POSIX model
  – Associated with an owner and a group
  – Permissions for owner, group and others
  – r for read, w for append, on files
  – r for listing files, w for delete/create of files, in dirs
  – x to access child directories
  – Sticky bit on dirs prevents deletions by others
  – User identification can be simple (OS) or Kerberos
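The usual POSIX-style commands apply; the path, owner and group are examples, and the leading 1 in the octal mode sets the sticky bit:

    hdfs dfs -chown alice:analysts /data/reports
    hdfs dfs -chmod 1775 /data/reports   # rwxrwxr-x plus sticky: others can't delete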
13. Permission and Quotas
• Quota for the number of files
  – Name quota
  – dfsadmin -setQuota <N> <dir>...<dir>
  – dfsadmin -clrQuota <dir>...<dir>
• Quota on the size of data
  – A space quota can be set to restrict space usage
  – dfsadmin -setSpaceQuota <N> <dir>...<dir>
• Replicated data also consumes quota
  – dfsadmin -clrSpaceQuota <dir>...<dir>
• Reporting
  – fs -count -q <dir>...<dir>
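For example, capping a project directory at one million names and one terabyte of raw space (with replication 3, that is roughly 333 GB of file data); the path is illustrative:

    hdfs dfsadmin -setQuota 1000000 /data/proj
    hdfs dfsadmin -setSpaceQuota 1t /data/proj
    hdfs dfs -count -q /data/proj    # quota, remaining quota, dir/file counts, bytes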
14. HDFS snapshot
• No copy of data blocks. Only the metadata (block list and file names) is copied
• Allow snapshots on a directory
  – hdfs dfsadmin -allowSnapshot <path>
• Create a snapshot
  – hdfs dfs -createSnapshot <path> [<name>]
  – Default name is 's' + timestamp
• Verify the snapshot
  – hadoop fs -ls <path>/.snapshot
• A directory with snapshots can't be deleted or renamed
• Disallow snapshots
  – hdfs dfsadmin -disallowSnapshot <path>
  – All existing snapshots need to be deleted before disallowing
• Delete a snapshot
  – hdfs dfs -deleteSnapshot <path> <name>
• Rename a snapshot
  – hdfs dfs -renameSnapshot <path> <oldname> <newname>
• Snapshot differences
  – hdfs snapshotDiff <path> <starting snapshot name> <ending snapshot name>
• List all snapshottable directories
  – hdfs lsSnapshottableDir
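A typical round trip, assuming an illustrative project directory; note that snapshotDiff accepts "." for the current state:

    hdfs dfsadmin -allowSnapshot /data/proj
    hdfs dfs -createSnapshot /data/proj before-cleanup
    hdfs dfs -rm -r /data/proj/tmp                     # some change
    hdfs snapshotDiff /data/proj before-cleanup .      # what changed since the snapshot
    hdfs dfs -cp /data/proj/.snapshot/before-cleanup/tmp /data/proj/tmp   # recover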
15. HDFS back-up using snapshot
• Create a snapshot on the source cluster
• Perform a "distcp" of the snapshot to the backup cluster
• Create a snapshot of the copy on the backup cluster
• Clean up any old back-up copies to comply with the enterprise retention policy
• The reverse can be followed to recover data from the backup
  – Data needs to be removed on the production cluster before the restore
  – During deletion, the -skipTrash option of "rm" will help reduce space usage
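A minimal sketch of that flow; the NN host names and paths are illustrative:

    hdfs dfs -createSnapshot /data/proj nightly
    hadoop distcp hdfs://prod-nn:8020/data/proj/.snapshot/nightly \
                  hdfs://backup-nn:8020/backups/proj/nightly
    hdfs dfs -rm -r -skipTrash /backups/proj/last-week   # retention cleanup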
16. distcp
• Tool to perform inter- and intra-cluster copies of data
• Utilizes MapReduce to perform the copy
• It can be used to
  – Copy data within a cluster
  – Copy data between clusters
  – Copy files or directories
  – Copy data from multiple sources
• Can be used to create a backup cluster
• Starts up containers on both source and target
• Consumes network traffic between clusters
• Needs to be scheduled at an appropriate time
• Can control resource utilization using parameters
17. distcp
• hadoop distcp [options] <srcURL> ... <srcURL> <destURL>
  – Source paths need to be absolute
  – The destination directory will be created if not present
  – The "update" option will update only the changed files
  – The "skipcrccheck" option disables the checksum check
  – The "overwrite" option overwrites existing files, which by default are skipped if present
  – The "delete" option deletes files in the destination that are not in the source
  – The "hftp" fs needs to be used to copy between different versions of HDFS
  – The "m" option specifies the number of mappers
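For instance, pulling from an older cluster read-only over hftp (which rides the NN HTTP port, 50070 by default); host names are examples:

    hadoop distcp hftp://old-nn:50070/data/proj hdfs://new-nn:8020/data/proj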
18. distcp
  – The "atomic" option commits all changes or none
  – "async" runs distcp asynchronously, i.e. non-blocking
  – The "i" option ignores failures during the copy
  – "log": a directory on DFS where the logs are to be saved
  – "p [rbugp]": preserve file status (replication, block size, user, group, permissions) as in the source
  – "strategy [static|dynamic]"
  – "bandwidth [MB]": bandwidth per map in MB
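Putting several of the options together; the URLs and values are illustrative:

    # Incremental sync: copy only changed files, drop files gone from the source,
    # 20 mappers, each throttled to 50 MB of bandwidth
    hadoop distcp -update -delete -m 20 -bandwidth 50 \
        hdfs://nn1:8020/data/proj hdfs://nn2:8020/backups/proj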
20. HDFS Federation
(Diagram source: hadoop.apache.org - JIRA HDFS-1052)
HDFS without Federation
- Namespace management and block management are together
- Supports one namespace
- Hinders scalability above 4000 nodes
- Doesn't support some multi-tenancy requirements
HDFS with Federation
- Namespace management and block management are separated
- Block management can be on its own node
- Supports more than one namespace/NN
- Scalable beyond 4000 nodes and millions of files
- Can deploy multi-tenancy requirements, like an NN for specific departments, with isolation
- A namespace plus its block pool is called a namespace volume
21. Enabling HDFS federation
• Identify a unique cluster id
• Identify nameservice ids for the name nodes
• Add dfs.nameservices to hdfs-site.xml
  – Comma-separated nameservice (ns) names
• Update hdfs-site.xml on all NNs and DNs
  – dfs.namenode.rpc-address.ns
  – dfs.namenode.http-address.ns
  – dfs.namenode.servicerpc-address.ns
  – dfs.namenode.https-address.ns
  – dfs.namenode.secondary.http-address.ns
  – dfs.namenode.backup.address.ns
• Format all name nodes using the cluster id
  – hdfs namenode -format -clusterId <cluster id>
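After the update, the per-nameservice keys can be sanity-checked on any node; the nameservice and host values below are illustrative:

    hdfs getconf -confKey dfs.nameservices              # e.g. ns1,ns2
    hdfs getconf -confKey dfs.namenode.rpc-address.ns1  # e.g. nn1.example.com:8020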
22. HDFS Rack Awareness
• Rack awareness enables efficient data placement
  – Data writes
  – Balancer
  – Decommissioning/commissioning of nodes
• Each node is assigned to a rack (rack id)
  – The rack id is used in the path names
• Data placement
  – The first block replica is placed near the client, or on a random node/rack
  – The second replica of the block is placed on a node in a second rack
  – The third replica is placed on a different node in the second rack
  – If HDFS is not rack aware, the second and third replicas are placed on random nodes
23. Enabling HDFS Rack Awareness
• Update core-site.xml with the topology properties
  – topology.script.file.name
    • The script can be a shell script, Python, Java
  – topology.script.number.args
• Copy the script to the conf directory
• Distribute the script and core-site.xml
• Stop and start the name node
• Verify that the racks are recognized by HDFS
  – hdfs fsck / -racks
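A minimal topology script sketch; the IP ranges and rack names are made up. Hadoop invokes the script with one or more DN host/IP arguments and expects one rack path per argument, one per line:

    #!/bin/bash
    # Map each argument (DN IP or hostname) to a rack id used in path names
    for host in "$@"; do
      case "$host" in
        10.1.1.*) echo "/dc1/rack1" ;;
        10.1.2.*) echo "/dc1/rack2" ;;
        *)        echo "/default-rack" ;;
      esac
    done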
24. HDFS NFS Gateway
• Allows HDFS to be mounted as part of the local FS
• A stateless daemon translates NFS to the HDFS access protocol
• The DFSClient is part of the gateway daemon
  – Averages 30 MB/s for writes
• Multiple gateways can be used for scalability
• The gateway machine requires all the software and configs of an HDFS client
  – The gateway can be run on HDFS cluster nodes
• Random writes are not supported
[Diagram: HDFS client → NFSv3 → NFS Gateway (DFSClient) → RPC → HDFS cluster (NN, DN, DN, DN).]
25. HDFS NFS Gateway Configuration
• Consists of two daemons - portmap and nfs3
• Configuration
  – dfs.namenode.accesstime.precision; 3600000 (1 hr)
    • Requires a name node restart
  – dfs.nfs3.dump.dir; dir to store out-of-sequence data
    • Needs enough space to store data for all concurrent file writes
    • Use NFS for smaller file transfers, on the order of 1 GB
  – dfs.nfs.exports.allowed.hosts; host access
    • client*.abc.com r;client*.xyc.com rw
  – Update the log4j.properties file
    • log4j.logger.org.apache.hadoop.hdfs.nfs=DEBUG
    • log4j.logger.org.apache.hadoop.oncrpc=DEBUG
26. HDFS NFS Gateway Configuration
• Stop the nfs & rpcbind services provided by the OS
  – service nfs stop
  – service rpcbind stop
• Start the hadoop portmap as root
  – hadoop-daemon.sh start portmap
  – To stop, use "stop" instead of "start" as the parameter
• Start mountd and nfsd as the user starting HDFS
  – hadoop-daemon.sh start nfs3
  – To stop, use "stop" instead of "start" as the parameter
27. HDFS NFS Gateway Configuration
• Validate that the NFS services are running
  – rpcinfo -p $nfs_server_ip
  – Should see entries for mountd, portmapper & nfs
• Verify that the HDFS namespace is exported for mount
  – showmount -e $nfs_server_ip
  – Should see the export list
• Mount HDFS on the client
  – Create a mount point as root
  – Change ownership of the mount point to the user running the HDFS cluster
  – mount -t nfs -o vers=3,proto=tcp,nolock $nfs_server:/ $mount_point
  – The client sends the UID of the user to NFS
  – NFS looks up the username for the UID and uses it to access HDFS
  – The username and UID should be the same on the client and on NFS
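End to end from a client, with illustrative host and mount-point names (run the first two commands as root):

    mkdir -p /hdfs
    chown hdfsuser /hdfs    # same username/UID as the user running the HDFS cluster
    mount -t nfs -o vers=3,proto=tcp,nolock nfs-gw:/ /hdfs
    ls /hdfs                # HDFS namespace browsable as local files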
28. HDFS Name Node HA
[Diagram: Active and Passive Name Nodes, each monitored by a ZKFC, share storage; the ZKFCs talk to a Zookeeper quorum (ZK x3); Data Nodes send heartbeats (HB) to both NNs.]
• Zookeeper does failure detection and helps with active name node election
• ZKFC - ZooKeeper Failover Controller
  • Monitors the health of its name node
  • Holds a session open on ZK, and a lock when its NN is active
  • If no other NN holds the ZK lock, it tries to acquire the lock to make its NN active
• Shared storage can be an NFS mount or a quorum of journal storage
• Fencing is defined to prevent the split-brain scenario of two NNs writing
29. HDFS NN HA Configuration
• Define dfs.nameservices
  – The nameservice id
• Define dfs.ha.namenodes.[nameservice id]
  – Comma-separated list of name nodes
• Define dfs.namenode.rpc-address.[nameservice id].[name node id]
  – Fully qualified machine name and port
• Define dfs.namenode.http-address.[nameservice id].[name node id]
  – Fully qualified machine name and port
• Define dfs.namenode.shared.edits.dir
  – For NFS: file:///mnt/...
  – For journal nodes: qjournal://node1:8485;node2:8485;...
• Define dfs.client.failover.proxy.provider.[nameservice id]
  – org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
• Define dfs.ha.fencing.methods
  – sshfence; requires passwordless ssh into the name nodes from one another
  – shell
• Define fs.defaultFS as the HA-enabled logical URI
• For journal nodes
  – Define dfs.journalnode.edits.dir, where the edits and other local state used by the JNs will be stored
30. HDFS NN HA Configuration
• Define dfs.ha.automatic-failover.enabled
  – Set to true
• Define ha.zookeeper.quorum
  – Host and port of the ZK nodes
• To enable HA in an existing cluster
  – Run hdfs dfsadmin -safemode enter
  – Run hdfs dfsadmin -saveNamespace
  – Stop the HDFS cluster: stop-dfs.sh
  – Start the journal node daemons: hadoop-daemon.sh start journalnode
  – Run hdfs zkfc -formatZK on the existing NN
  – Run hdfs namenode -initializeSharedEdits on the existing NN
  – Run hdfs namenode -bootstrapStandby on the new NN
  – Delete the secondary name node
  – Start the HDFS cluster: start-dfs.sh
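Once both NNs are up, state can be checked and a graceful failover exercised; nn1/nn2 stand for whatever ids were listed in dfs.ha.namenodes:

    hdfs haadmin -getServiceState nn1   # prints active or standby
    hdfs haadmin -failover nn1 nn2      # fails over, applying the configured fencing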
35. Adding New Nodes
• Add the node address to the dfs.hosts file
  – Update the mapred.hosts file if using mapred
• Update the namenode with the new set of nodes
  – hadoop dfsadmin -refreshNodes
  – Update the jobtracker with the new set of nodes
    • hadoop mradmin -refreshNodes
• Update the "slaves" file with the new node names
• Start the new datanodes (and tasktrackers)
• Check the availability of the new nodes in the UI
• Run the balancer so that data is distributed
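The sequence, assuming an illustrative include-file location and node name:

    echo dn4.example.com >> /etc/hadoop/conf/dfs.hosts
    hadoop dfsadmin -refreshNodes
    hadoop-daemon.sh start datanode    # run on the new node itself
    hdfs balancer -threshold 10        # redistribute blocks onto the new node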
36. Decommissioning Nodes
• Add the node address to the exclude files
  – dfs.hosts.exclude
  – mapred.hosts.exclude
• Update the namenode (and jobtracker)
  – hadoop dfsadmin -refreshNodes
  – hadoop mradmin -refreshNodes
• Verify that all the nodes are decommissioned (UI)
• Remove the nodes from the dfs.hosts (and mapred.hosts) files
• Update the namenode (and jobtracker)
• Remove the nodes from the "slaves" file
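Besides the UI, decommission progress is visible in the admin report:

    # Per-DN status: Normal, Decommission in progress, or Decommissioned
    hdfs dfsadmin -report | grep 'Decommission Status'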
37. HDFS Upgrade
• No file-system layout change
  – Install the new version of HDFS (and MapReduce)
  – Stop the old daemons
  – Update the configuration files
  – Start the new daemons
  – Update clients to use the new libraries
  – Remove the old install and the configuration files
  – Update application code for deprecated APIs
38. HDFS Upgrade
• With file-system layout changes
  – When there is a layout change, the NN will not start
  – Run fsck to make sure that the FS is healthy
  – Keep a copy of the fsck output for verification
  – Clear HDFS and MapReduce temporary files
  – Make sure that any previous upgrade is finalized
  – Shut down MapReduce and kill orphaned tasks
  – Shut down HDFS and make a copy of the NN directories
  – Install the new versions of HDFS and MapReduce
  – Start HDFS with the -upgrade option
    • start-dfs.sh -upgrade
  – Once the upgrade is complete, perform manual spot checks
    • hadoop dfsadmin -upgradeProgress status
  – Start MapReduce
  – Roll back or finalize the upgrade
    • stop-dfs.sh; start-dfs.sh -rollback
    • hadoop dfsadmin -finalizeUpgrade
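The fsck bookkeeping around the upgrade fits in two commands; the output paths are examples:

    hdfs fsck / -files -blocks > /var/tmp/fsck-before.txt   # before the upgrade
    hdfs fsck / -files -blocks > /var/tmp/fsck-after.txt    # after; diff the two for spot checks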
39. Key Parameters

Parameter                      | Description                                                         | Default Value
dfs.blocksize                  | File block size                                                     | 128 MB
dfs.replication                | File block replication count                                        | 3
dfs.datanode.numblocks         | No. of blocks after which a new sub-directory gets created in the DN |
io.bytes.per.checksum          | Number of data bytes for which a checksum is calculated             | 512
dfs.datanode.scan.period.hours | Timeframe in hours to complete block scanning                       | 504 (3 weeks)
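The values actually in effect on a given cluster can be read back directly:

    for key in dfs.blocksize dfs.replication io.bytes.per.checksum; do
      echo "$key = $(hdfs getconf -confKey $key)"
    done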