Xiaomi is a Chinese technology company that shipped more than 100 million smartphones worldwide in 2018 and also runs one of the world's largest IoT device platforms. Xiaomi builds dozens of mobile apps and Internet services on top of its intelligent devices, including ads, news feeds, financial services, games, music, video, personal cloud storage and so on. The rapid growth of the business has driven exponential growth of the data analytics infrastructure. The amount of data has grown more than 20-fold in the past 3 years, which poses big challenges to HDFS scalability.
In this talk, we introduce how we scale HDFS to support hundreds of petabytes of data on thousands of nodes:
1. How Xiaomi uses Hadoop and the characteristics of our usage
2. How we made an HDFS federation cluster usable like a single cluster, so that most applications do not need to change any code to migrate from a single cluster to a federation cluster. Our work includes a wrapper FileSystem compatible with DistributedFileSystem, support for rename across different namespaces, and a ZooKeeper-based mount table renewer.
3. Our experience tuning the NameNode to improve scalability
4. How we maintain hundreds of HDFS clusters, and the client-side optimizations that let users and programs access these clusters easily and with high performance
2. Outline
• Introduction of Xiaomi
• Scenarios and challenges
• Improvements on HDFS federation
• Experience on scaling up single NameNode
• Efficient management of hundreds of clusters
11. Improvements on HDFS Federation
• Problems of HDFS Federation as of late 2016
– NameNodes are independent; metadata is not shared
– Client-side MountTable config is hard to maintain
– MountTable doesn't support nested mount points
– ViewFileSystem is not compatible with DistributedFileSystem
– RBF was not stable and not fully functional in late 2016
12. Improvements on HDFS Federation
[Diagram: original HDFS Federation architecture: independent NameNodes NN-1 … NN-k … NN-n, each with its own namespace (NS 1 … NS n) and block pool, sharing common DataNode storage (Datanode 1 … Datanode m), with viewfs on top. The example mount tree shows /user, /yarn, /hive, per-service directories and many small directories, each of which needs its own client-side mount point.]
13. Improvements on HDFS Federation
[Diagram: Support Nested MountPoints: the same federation architecture with a default namespace NS 0 / NN-0 added alongside NS 1 … NS n, sharing the common DataNode storage; the mount tree (/user, /yarn, /hive, service1, service2) is served through nested mount points.]
• hdfs:// -> FederatedDFSFileSystem (extends DistributedFileSystem)
• Add a default NameSpace
• Support rename across NameSpaces
• Compatible with hdfs://, so no code changes are needed
• Update MountTable config from ZK
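The mount table renewer can be pictured with a small sketch. This is an illustrative assumption of how a ZooKeeper-based renewer might look, not the actual Xiaomi implementation; the znode path, session timeout and class name are made up, and the parse/swap step is left as a comment.

```java
// A minimal sketch, under stated assumptions, of a ZooKeeper-based mount table
// renewer: fetch the mount table from a znode at client start-up and re-fetch
// it whenever the znode changes. The znode path, timeout and class name are
// illustrative; parsing and swapping the config is left as a comment.
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class MountTableRenewer implements Watcher {
  private static final String MOUNT_TABLE_ZNODE = "/hdfs/federation/mounttable"; // hypothetical path
  private final ZooKeeper zk;
  private final AtomicReference<String> mountTableConfig = new AtomicReference<>();

  public MountTableRenewer(String zkQuorum) throws Exception {
    this.zk = new ZooKeeper(zkQuorum, 30000, this);   // 30s session timeout, illustrative
    refresh();
  }

  /** Fetch the latest mount table config and re-register the watcher. */
  private void refresh() throws Exception {
    byte[] data = zk.getData(MOUNT_TABLE_ZNODE, this, null);
    mountTableConfig.set(new String(data, StandardCharsets.UTF_8));
    // In the real system the config would be parsed here and swapped
    // atomically into the client-side mount table.
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeDataChanged) {
      try {
        refresh();                     // pick up the updated mount table
      } catch (Exception e) {
        // retry/backoff omitted in this sketch
      }
    }
  }

  public String currentConfig() {
    return mountTableConfig.get();
  }
}
```

Re-registering the watcher on every read is the standard ZooKeeper pattern for picking up config changes without polling, which matches the "update MountTable config from ZK" idea above.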
14. Nested Mount table and Default NameSpace
1. Xiaomi is not only a hardware company but also an Internet company, and it is growing very fast
2. There are more than 100 Internet services, and new businesses and services emerge quickly, built on our smart devices and more than 300 million users
3. It's hard for us to use a fixed, pre-divided mount table
15. Nested Mount table and Default NameSpace
[Diagram: NameNodes NN-0 … NN-n with namespaces NS 0 … NS n; the mount tree (/user, /yarn, /hive, service1, service2) plus new paths such as /some_new_nosql_service, /user/live_show_services and /user/short_video_services that fall into the default NameSpaces]
1. At first, we divided the initial mount points by data volume and QPS. We only need to configure a dozen mount points for the largest services; everything else falls into the default NameSpaces
2. When new infrastructure services and Internet services emerge, the whole mount table doesn't need any updates
3. HADOOP-13055 supports linkFallback, but our solution is more flexible
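To make the resolution rule concrete, here is a minimal sketch, under stated assumptions, of longest-prefix mount-point matching with a fallback to a default namespace. The class MountTableResolver and its methods are hypothetical names, not the real FederatedDFSFileSystem code.

```java
// A minimal sketch, assuming hypothetical names, of the resolution rule behind
// nested mount points and the default namespace: longest-prefix match against
// the mount table, with unmatched paths falling into a default namespace.
// This is not the actual FederatedDFSFileSystem code.
import java.io.IOException;
import java.net.URI;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MountTableResolver {
  private final TreeMap<String, URI> mountTable = new TreeMap<>();  // mount point -> namespace URI
  private final URI defaultNameSpace;
  private final Configuration conf;

  public MountTableResolver(Map<String, URI> mounts, URI defaultNs, Configuration conf) {
    this.mountTable.putAll(mounts);
    this.defaultNameSpace = defaultNs;
    this.conf = conf;
  }

  /** Longest-prefix match; paths without a mount point fall into the default namespace. */
  public URI resolve(String path) {
    String best = null;
    for (String mountPoint : mountTable.keySet()) {
      if (path.equals(mountPoint) || path.startsWith(mountPoint + "/")) {
        if (best == null || mountPoint.length() > best.length()) {
          best = mountPoint;
        }
      }
    }
    return best == null ? defaultNameSpace : mountTable.get(best);
  }

  /** Open a file on whichever NameNode owns the path. */
  public FSDataInputStream open(Path p) throws IOException {
    FileSystem fs = FileSystem.get(resolve(p.toUri().getPath()), conf);
    return fs.open(p);
  }
}
```

With this rule, a configured path such as /user/service1 resolves through its own mount point, while a brand-new path like /some_new_nosql_service lands in a default namespace without any mount table update.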
19. Rename Across NameSpaces in Detail
Source Phase 1
1. Sanity Check.
• Existence
• Permission
• Can’t be reserved directory
• Can’t be symlink
• Not in encryption zones
2. Serialize the inode-tree and block information with ProtoBuf
• Name
• Permissions
• mtime/atime
• Replication factor
• Block locations
• Acl / Xattr / Quota …
20. Rename Across NameSpaces in Detail
Source Phase 1
3. Lock the directory
• Add a FederationRenameFeature. Record information about the renameId, source and destination paths
• With the FederationRenameFeature, all sub-directories and files in this directory, and all inodes on the parent path, are not writable
4. Add a federation-rename record
5. Return the serialized data to client
21. Rename Across NameSpaces in Detail
Dest Phase 1
1. Sanity Check
• permission, quota, not in encryption zones
2. Deserialize the inode-tree, graft it to the destination path
• Allocate inode id for each inode
• Allocate block id and new GS for each block
• Update acl and other features
22. Rename Across NameSpaces in Detail
Dest Phase 1
3. Lock the directory
• Also use FederationRenameFeature
4. Update quota count
5. Add a federation-rename record
6. Return a list of block information, including:
• srcBlockId, destBlockId, blockSize, srcGenStamp, destGenStamp for each block
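For illustration only, the per-block record returned after Dest Phase 1 could be modeled as a simple value class like the sketch below; the class and field names just mirror the list above and are not the actual wire format.

```java
// A simple value class mirroring the per-block record listed above; the field
// names follow the slide, but the actual wire format is not shown here.
public class FederationRenameBlockInfo {
  public final long srcBlockId;
  public final long destBlockId;
  public final long blockSize;
  public final long srcGenStamp;   // generation stamp on the source namespace
  public final long destGenStamp;  // new generation stamp allocated by the destination

  public FederationRenameBlockInfo(long srcBlockId, long destBlockId, long blockSize,
                                   long srcGenStamp, long destGenStamp) {
    this.srcBlockId = srcBlockId;
    this.destBlockId = destBlockId;
    this.blockSize = blockSize;
    this.srcGenStamp = srcGenStamp;
    this.destGenStamp = destGenStamp;
  }
}
```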
23. Rename Across NameSpaces in Detail
Link Block
1. For each DN, send request in batch
• Create new block file by hardlink, one by one
• With a total operation timeout
2. Using a ThreadPoolExecutor
3. For each block, count it as complete if at least 2/3 of the replicas succeed
• A slow DN will not affect the overall progress
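A rough sketch of this step is shown below. It is an assumption-laden illustration, not the production code: DnClient and BlockLink are hypothetical types, while the batching per DataNode, the overall timeout and the 2/3-replica completion rule follow the description above.

```java
// A rough, assumption-laden sketch of the link-block step (not the production
// code): one batched hardlink request per DataNode is submitted to a thread
// pool with an overall timeout, and a block counts as complete once at least
// 2/3 of its replicas are linked. DnClient and BlockLink are hypothetical types.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BlockLinker {
  /** One hardlink request: srcBlockId -> destBlockId on a given DataNode. */
  public static class BlockLink {
    final long srcBlockId;
    final long destBlockId;
    BlockLink(long src, long dest) { this.srcBlockId = src; this.destBlockId = dest; }
  }

  /** Hypothetical RPC client that asks one DataNode to hardlink its block files. */
  public interface DnClient {
    void linkBlocks(List<BlockLink> batch) throws Exception;
  }

  private final ExecutorService pool = Executors.newFixedThreadPool(16);
  private final Map<Long, Integer> linkedReplicas = new ConcurrentHashMap<>();  // blockId -> linked replicas

  public void linkAll(Map<DnClient, List<BlockLink>> batches, long timeoutSeconds)
      throws InterruptedException {
    for (Map.Entry<DnClient, List<BlockLink>> e : batches.entrySet()) {
      pool.submit(() -> {
        try {
          e.getKey().linkBlocks(e.getValue());            // one batched request per DataNode
          for (BlockLink b : e.getValue()) {
            linkedReplicas.merge(b.srcBlockId, 1, Integer::sum);
          }
        } catch (Exception ignored) {
          // a failed or slow DataNode only loses its own replicas
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);  // total operation timeout
  }

  /** A block counts as complete once at least 2/3 of its replicas are linked. */
  public boolean isComplete(long blockId, int replication) {
    return linkedReplicas.getOrDefault(blockId, 0) * 3 >= replication * 2;
  }
}
```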
24. Rename Across NameSpaces in Detail
Source Phase 2
1. Delete the source directory/file
2. Delete all the inodes and blocks asynchronously
3. Remove federation-rename record
Dest Phase 2
1. Remove the FederationRenameFeature, making the target directory visible
2. Remove federation-rename record
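Putting the phases together, a simplified driver might look like the sketch below. All interfaces and method names here are hypothetical, not the real RPC protocol; the actual system persists enough state (the FederationRenameFeature and federation-rename records) for a "NameNode Fixer" to redo the remaining steps after a failure, as the next slide shows.

```java
// A simplified, hypothetical driver showing the order of the five steps
// described above. The interfaces and method names are assumptions, not the
// real RPC protocol.
public class FederationRenameDriver {
  public interface NameNodeClient {
    byte[] sourcePhase1(String src, long renameId) throws Exception;                      // check, serialize, lock
    long[] destPhase1(byte[] serializedTree, String dst, long renameId) throws Exception; // graft, lock
    void sourcePhase2(String src, long renameId) throws Exception;                        // delete source tree
    void destPhase2(String dst, long renameId) throws Exception;                          // make target visible
  }

  public interface BlockLinkService {
    void linkBlocks(long[] blockIds, long renameId) throws Exception;                     // hardlink block files on DNs
  }

  public void rename(NameNodeClient srcNN, NameNodeClient dstNN, BlockLinkService linker,
                     String src, String dst, long renameId) throws Exception {
    byte[] tree = srcNN.sourcePhase1(src, renameId);        // 1. Source Phase 1
    long[] blocks = dstNN.destPhase1(tree, dst, renameId);  // 2. Dest Phase 1
    linker.linkBlocks(blocks, renameId);                    // 3. Link Block
    srcNN.sourcePhase2(src, renameId);                      // 4. Source Phase 2
    dstNN.destPhase2(dst, renameId);                        // 5. Dest Phase 2
  }
}
```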
25. Error Handling
Failed at      | How to Handle                                                | Result
Source Phase 1 | Fail the request                                             | Fail
Dest Phase 1   | Cancel Source Phase 1                                        | Fail
Link Block     | Request fails; NameNode Fixer will redo the remaining steps  | Will succeed finally
Source Phase 2 | Request fails; NameNode Fixer will redo the remaining steps  | Will succeed finally
Dest Phase 2   | Request fails; NameNode Fixer will redo the remaining steps  | Will succeed finally
26. Error Handling
NameNode Failover and Restart
1. All operations have edit logs
2. The FederationRenameFeature is serialized into the FsImage
3. Federation-rename records are not serialized into the FsImage; they are rebuilt from edit log replay or FsImage loading (if an inode has a FederationRenameFeature, a federation-rename record is added)
27. Scaling up NameNodes
Our Largest NameNode
1. 150GB heap
2. Use CMS GC
3. More than 500 million objects (240 million files and 260 million
blocks)
4. More than 20000 QPS
28. Scaling up NameNodes
Experience
• Throttle
– BlockReport / Incremental-BlockReport throttle
– Concurrent GetContentSummary throttle
• Lock optimization
• Config optimization
• Add more tracing information
29. Block Report Throttle
• Problem: Full GC when the NameNode starts up
[Diagram: thousands of DataNodes send block reports at almost the same time after startup; normal heap usage is about 60%, and the NameNode can only process one block report at a time]
• Solution: throttle the maximum number of concurrent block reports; extra reports are rejected, and the DNs retry later
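A minimal sketch of the throttling idea, assuming a simple semaphore-based limiter (not the actual NameNode patch): only a fixed number of block reports are processed at a time, and a rejected report is retried later by the DataNode instead of accumulating in NameNode memory.

```java
// A minimal sketch of the throttling idea (assumption, not the actual patch):
// at most maxConcurrent block reports are processed concurrently; extra
// reports are rejected so the DataNode retries later.
import java.util.concurrent.Semaphore;

public class BlockReportThrottler {
  private final Semaphore permits;

  public BlockReportThrottler(int maxConcurrent) {
    this.permits = new Semaphore(maxConcurrent);
  }

  /** Returns false when no permit is free: the report is rejected and the DN retries later. */
  public boolean tryStartReport() {
    return permits.tryAcquire();
  }

  /** Called after a block report has been fully processed. */
  public void finishReport() {
    permits.release();
  }
}
```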
30. Other optimization
• Lock optimization for heavyweight operations
– When processing a block report, release and re-acquire the lock for every storage
– When processing getContentSummary, release the lock every N files
• Config optimization (see the sketch after this list)
– More handlers
– Longer heartbeat interval
– Longer full block report interval
– Disable retry cache and access time
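The keys below are the standard HDFS configuration settings that correspond to these knobs; the values are purely illustrative assumptions, not Xiaomi's production numbers.

```java
// Illustrative only: these are standard HDFS configuration keys, but the
// values shown are examples, not Xiaomi's production settings.
import org.apache.hadoop.conf.Configuration;

public class NameNodeTuning {
  public static Configuration tuned() {
    Configuration conf = new Configuration();
    conf.setInt("dfs.namenode.handler.count", 256);                   // more RPC handlers
    conf.setInt("dfs.heartbeat.interval", 6);                         // longer DN heartbeat (seconds)
    conf.setLong("dfs.blockreport.intervalMsec", 12L * 3600 * 1000);  // longer full block report interval
    conf.setBoolean("dfs.namenode.enable.retrycache", false);         // disable the retry cache
    conf.setLong("dfs.namenode.accesstime.precision", 0);             // disable access-time updates
    return conf;
  }
}
```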
31. More tracing information
• Record operations that hold the FSNamesystem lock too long (see the sketch after this list)
• Add QPS monitoring on both the server side and the client side, and push the data to our internal monitoring system
• Record failure reasons and statistics for block allocation failures
• Add log for slow block report processing
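A minimal sketch of the lock-hold-time tracing idea, assuming a wrapper around the namesystem lock (not the actual patch): measure how long the write lock was held and log the operation name when it exceeds a threshold.

```java
// A minimal sketch (not the actual patch) of logging operations that hold the
// write lock too long: measure the hold time on unlock and warn when it
// exceeds a threshold. The threshold and class name are illustrative.
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TimedFsLock {
  private static final Logger LOG = LoggerFactory.getLogger(TimedFsLock.class);
  private static final long THRESHOLD_MS = 1000;   // illustrative threshold

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private volatile long writeLockHeldAt;

  public void writeLock() {
    lock.writeLock().lock();
    writeLockHeldAt = System.currentTimeMillis();
  }

  public void writeUnlock(String operationName) {
    long heldMs = System.currentTimeMillis() - writeLockHeldAt;
    lock.writeLock().unlock();
    if (heldMs > THRESHOLD_MS) {
      LOG.warn("Operation {} held the write lock for {} ms", operationName, heldMs);
    }
  }
}
```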
32. How We Efficiently Manage 100+ Clusters
• We use HBase heavily in Xiaomi
• 20~30 HBase clusters for sensitive services and businesses in each
datacenter
• With the rapid growth of the global business, there are now more than 5 datacenters distributed around the world
• The total number of clusters also grows very quickly, which makes them hard to maintain
33. How We Efficiently Manage 100+ Clusters
• Initially, each cluster (cluster-1, cluster-2, cluster-3, …, cluster-n) ran its own Canary
Introduce myself
Today I'll share some of the work we did on scaling HDFS
spoken English
Introduction to Xiaomi
Phone sales
Our main markets are India and China; we also have a good market share in Southeast Asia and Europe
Not in America
IoT sales
A variety of smart devices
They sell very well in China
Based on these phones and devices, we build lots of Internet services and businesses
These are the most important ones
On this page, most services are well known; I'll introduce some of the services developed by us
Talos is a data integration and distribution system
FDS is an object storage system, quite similar to AWS S3; EMQ is a cloud message queue, also similar to its AWS counterpart
Our clusters can be divided into two parts: online vs. offline
These two scenarios are quite different, which brings us different challenges
For online services, most HDFS clusters are deployed for HBase; we use HBase heavily. There are more than 100 online HDFS clusters and more than 3000 nodes
The biggest challenge for online clusters is latency, especially the impact of slow nodes and slow disks
This part does not belong to this session, so I won't introduce it in detail
On the other hand, for offline analysis we build several huge clusters; for these clusters, the biggest challenge is scalability, i.e. how to serve more data and files
Let's take a look at the data growth
This is the chart for our largest cluster
4 years ago
By the end of last year
Everybody knows what this means for an HDFS cluster
A single NameNode can hardly serve this much data
With the rapid growth, we hit the scalability limit in 2016
After a bunch of work, we managed to make the NameNode stable, but that would not last long; we had to enable federation
But the dependencies are too complex, and it's almost impossible to divide the data into different namespaces
It's also very hard for us to ask users to change their code to use viewfs
So the only way that makes sense for us is to build a huge single cluster
More accurately, we need to modify federation to make it work like a single HDFS cluster
How did we do that?
Let's first take a look at the defects of federation
In this solution, for every directory you need to assign a namespace, you have to add a mount point
If a path is not in the mount table, it is mapped to one of the default namespaces
In addition, to make federation work like a single cluster, we support rename across NameNodes
To avoid code changes, we created a new FileSystem that wraps viewfs inside it
Finally, we moved the mount table to ZooKeeper and can update it automatically, so users don't need to worry about the mount table
This is our whole solution to make federation work as a single cluster; next, I'll introduce each part in detail
First, we created a wrapper FileSystem that extends DistributedFileSystem
Our users don't need to change any code, just update some configs
When the client initializes, it fetches the mount table from ZooKeeper
In addition, we add a watcher, so clients get the latest config whenever it is updated
Finally, we made an admin tool to operate the mount table config on ZooKeeper
To make federation transparent to users, there is still a lot of work to do
Here are some of them
Another improvement worth mentioning is the trash optimization
By default, every user has only one trash folder. Since moveToTrash is a rename operation and we support rename across NameNodes, a user's delete operation on another namespace may cause a rename across NameNodes. This operation is expensive, and we don't want it triggered too frequently just for removing trash data, so we did some optimization
I'll first give the overview, and then introduce some details
It's very complex; I'll try to explain it as clearly as I can
There are 5 steps to complete a federation rename
OK, next is some experience of tuning a single NameNode
Let me show the reason first. Assume that in the normal case, heap usage is 60%. When the NN restarts, it starts receiving a lot of block reports
Block reports waiting to be processed are stored in memory; the reporting speed is much higher than the processing speed, so the reports in memory keep accumulating until the heap is full.