NameNode Analytics - Querying HDFS Namespace in Real Time

Who Am I?
Bachelor of Science in Computer Science from UC San Diego (Eleanor Roosevelt College).
I have been fortunate to work alongside Konstantin Shvachko, one of the original architects of the HDFS NameNode from
Yahoo!, for several years.
I have spent 6 years working on HDFS internals and related projects at eBay, WANdisco, and now PayPal.
Hadoop open source contributor:
• HDFS-3107: Introduce truncate to HDFS.
• HDFS-4456: Add concat to HttpFS and WebHDFS.
• HADOOP-10641: Introduce coordination / consensus interface to HDFS.
• MAPREDUCE-2669: Add StandardDev, Mean, and Mode examples to MapReduce.
• Various bug fixes.
Work on NameNode internals and distributed File System design.
  - Giraffa File System: https://github.com/GiraffaFS/giraffa
  - GeoDistributed File System (WANdisco Patent): https://patents.justia.com/patent/20150278244
©2015 PayPal Inc. Confidential and proprietary.
Plamen Jeliazkov.

Background
HDFS was created as a means of storing petabytes of data reliably (through replication).
By virtue of being a distributed file system, HDFS is seen as a safe haven for any type of data.
However, HDFS does have its own scaling limitations:
• “Limits are around 10,000 clients working on around 200 million files and directories, totaling around 500 million file
system objects (inodes and blocks). Typically capping out around 20 PBs, though larger clusters do exist.”
  - Konstantin Shvachko, https://www.usenix.org/publications/login/april-2010-volume-35-number-2/hdfs-scalability-limits-growth
Therefore, HDFS is best used as a system for storing large individual files.
• The best case is large files with large block sizes, so that the NameNode stores less metadata per byte of raw
storage.
Because such files are large and sequential, HDFS is also best suited to batch analytics and to
applications that benefit from sequential reads / writes.
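To make the "less metadata per raw storage" point concrete, here is a rough arithmetic sketch. It counts one inode per file plus one object per block, ignores directories, and assumes a 128 MB block size; the class and method names are made up for illustration.

```java
// Rough arithmetic sketch: how many namespace objects (file inodes + blocks)
// the NameNode must track for the same amount of raw data.
// Assumptions: equal-sized files, directories ignored, 128 MB blocks.
class NamespaceObjectEstimate {
    static final long MB = 1024L * 1024;
    static final long GB = 1024 * MB;
    static final long TB = 1024 * GB;

    /** Blocks needed for one file: ceil(fileSize / blockSize); empty files have no blocks. */
    static long blocksPerFile(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) {
            return 0;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    /** Total objects when totalBytes is stored as files of fileSizeBytes each. */
    static long objectsFor(long totalBytes, long fileSizeBytes, long blockSizeBytes) {
        long files = totalBytes / fileSizeBytes;
        return files * (1 + blocksPerFile(fileSizeBytes, blockSizeBytes));
    }

    public static void main(String[] args) {
        long block = 128 * MB; // assumed block size
        System.out.println("1 TB as 1 GB files: " + objectsFor(TB, GB, block) + " objects");
        System.out.println("1 TB as 1 MB files: " + objectsFor(TB, MB, block) + " objects");
    }
}
```

Under these assumptions the same terabyte costs 9,216 objects as 1 GB files and 2,097,152 objects as 1 MB files, which is why small files put so much more pressure on the NameNode.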
The Hadoop Distributed File System.

HDFS @ PayPal
Customers tend to see HDFS as a giant black box. Dump and forget.
Customers just want to store their data in the easiest manner. No storage optimization or security.
• Do not like to build any sort of “clean-up” or TTL mechanisms into their applications.
When space issues arise, Hadoop Management lacks context:
• What took up that space? (RCA required)
• Who took up that space? (RCA required)
• What targets can we look at for deleting quickly? (Small files, old files, empty files, specific user, etc.)
Even in the event we catch wind of a data issue:
• Difficult to determine which team or person is responsible.
• Difficult to determine which datasets were affected.
• Damage is already done. (Cluster performance degraded; quota hit; application deployed; etc. It’s already too late…)
• Difficult to be proactive, so we end up being reactive instead, and often react very late.
My observations of HDFS data management pain points.

Previous Architecture(s)
The Old World
[Diagram: Active NN and Standby NN (FsImage) -> Legacy FsImage -> Offline Image Viewer -> Processed Image -> Kibana / Elasticsearch.
Stage timings labeled on the diagram: 3 mins, 90 mins, 30 mins.]
* This assumes a large enterprise Hadoop environment where the FsImage is larger than 20 GB. For smaller image sizes, this is trivial.
* This architecture usually leads to generation of daily reports. This diagram is representative of the fastest possible report generation.

HDFS Usage Analytics Today
Standby NameNode is forced to create a legacy FSImage.
• This requires additional work by Standby NameNode to achieve.
• This legacy image is created in addition to the regular, Protobuf’d, FSImage created for the active NN.
• Storage redundancy solely for the purpose of performing analytics later.
(We end up creating 2 FSImages per checkpoint – double storage cost, double IO cost, no instant benefit).
• Legacy image retains less metadata than the Protobuf image. (No XAttrs, tokens, storage policies).
Legacy-format FSImage is parsed and uploaded to Kibana or ElasticSearch.
• This process typically happens once a day.
• It takes approximately 15 to 20 minutes to fully parse a 25 GB FSImage, which is about the current size of a large cluster's FSImage.
We have seen FSImages of over 30 GB when things are bad.
• Requires pulling the FSImage off the Standby NameNode. The network cost is not very high, however.
Making this process more frequent will increase network cost on the Standby. RPC issues are seen if bandwidth is saturated.
• Image dump -> Parsing -> Processing can take anywhere between 2-3 hours. Only about 4-6 reports per day at best.
Other third party solutions tend to follow the architecture described on this slide.
My observations on the current “standard”.

Engineering A New Solution
To query in near real time, you need something like a constantly updating NameNode.
• Attempting to do so in any distributed manner involves solving distributed atomic rename or coordination.
(Think HBase region transitions.)
• We cannot rely on parsing the FSImage and EditLogs, as that adds too much processing time.
  - 15-30 minutes to parse a legacy FSImage and 1-2 minutes per large EditLog.
  - Protobuf parsing means loading the entire INode set into memory.
To filter or query effectively requires parallel processing.
• Assuming we can’t utilize a distributed system effectively, can we work with a single node? Yes.
We can also utilize multiple CPU cores…
• The Java 8 Stream API allows simple filter, map, reduce, and collect operations to run in parallel over large in-memory data structures.
• A single NameNode stores the entire metadata set in-memory already in such a structure.
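As a minimal sketch of the idea in the bullets above, the following uses a parallel stream to filter and reduce an in-memory collection. `FileInode` is a hypothetical, simplified stand-in and not the real HDFS INode class; the real NNA query engine works against the NameNode's own structures.

```java
import java.util.ArrayList;
import java.util.List;

class ParallelInodeQuery {
    /** Hypothetical, simplified stand-in for an HDFS file inode. */
    static final class FileInode {
        final String owner;
        final long sizeBytes;
        final short replication;

        FileInode(String owner, long sizeBytes, short replication) {
            this.owner = owner;
            this.sizeBytes = sizeBytes;
            this.replication = replication;
        }
    }

    /** filter + count; the parallel stream spreads the work across CPU cores. */
    static long countEmptyFiles(List<FileInode> inodes) {
        return inodes.parallelStream()
                .filter(f -> f.sizeBytes == 0)
                .count();
    }

    /** filter + map + reduce: raw disk space (size x replication) owned by one user. */
    static long diskspaceConsumedBy(List<FileInode> inodes, String owner) {
        return inodes.parallelStream()
                .filter(f -> f.owner.equals(owner))
                .mapToLong(f -> f.sizeBytes * f.replication)
                .sum();
    }

    /** Tiny sample data set, for illustration only. */
    static List<FileInode> sample() {
        List<FileInode> list = new ArrayList<>();
        list.add(new FileInode("alice", 0L, (short) 3));
        list.add(new FileInode("alice", 100L, (short) 3));
        list.add(new FileInode("bob", 0L, (short) 3));
        list.add(new FileInode("bob", 50L, (short) 2));
        return list;
    }

    public static void main(String[] args) {
        System.out.println("empty files: " + countEmptyFiles(sample()));
        System.out.println("alice bytes on disk: " + diskspaceConsumedBy(sample(), "alice"));
    }
}
```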
Do we need to build a whole new system? No.
• We need to write some custom query engine logic but can re-use most HDFS data structures and logic.
• We can keep our “NameNode” up to date using live cluster Journal Nodes.
• We can simplify further by removing the RPC Server. No need for DataNodes or clients to connect to our
“NameNode”.
Combining old knowledge and new ideas.

Inspiration from Dr. Elephant
Dr. Elephant is a tool from LinkedIn that provides “self-help” suggestions on how to tune YARN applications so that they
free up queue capacity and perform better. NNA was also conceived as a “self-help” tool.
Ideas inspiring other ideas.

NameNode Analytics
It can best be described as:
“A modified, isolated, read-only, Standby NameNode, with no RPC Server,
but with a Web Server and custom query engine embedded inside it.”

Architecture
Basic high-level view.
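Components:
• Client
• NameNode Analytics (off the cluster; isolated and read-only NN)
• JournalNodes (on the cluster)
• NameNode (on the cluster)
Flow:
• (0*) One-time bootstrap call from NNA to the NameNode (fetch remote FsImage)
• (1) Query from the Client to NNA
• (2) Processing inside NNA
• (3) Response from NNA to the Client
• (*) The NameNode writes its EditLog to the JournalNodes
• (*) NNA tails the EditLog from the JournalNodes
* = conditional or “in the background”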

Architecture
Deep dive view into NNA.
Layers inside NameNode Analytics, from top to bottom:
• REST API (Spark Java Web Server): receives the query and returns the response
• Java 8 Stream API (query processing)
• NameNode FSNamesystem (image loading; EditLog tailing / updating; and in-memory set)
• NameNode in-memory metadata set (INode tree, GSet): kept current by EditLog-Tailer updates

NNA @ PayPal
NNA provides the information, and an internal TICK stack keeps the historical data, visualizes it, and takes action.
(The TICK stack is Telegraf, InfluxDB, Chronograf, and Kapacitor.)
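The slides do not say how NNA results reach the TICK stack. One plausible path, sketched here purely as an assumption, is to turn a per-user histogram into InfluxDB line protocol (`measurement,tag=value field=value timestamp`) that Telegraf or InfluxDB can ingest. The measurement name `nna_empty_files`, the `user` tag, and the `count` field are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class HistogramToLineProtocol {
    /** Escape the characters that line protocol treats specially in tag values. */
    static String escapeTag(String value) {
        return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ");
    }

    /** One line per user; the trailing "i" marks the field as an integer. */
    static String toLines(String measurement, Map<String, Long> countsByUser, long timestampNanos) {
        // TreeMap gives a stable, sorted output order.
        return new TreeMap<>(countsByUser).entrySet().stream()
                .map(e -> measurement + ",user=" + escapeTag(e.getKey())
                        + " count=" + e.getValue() + "i " + timestampNanos)
                .collect(Collectors.joining("\n"));
    }

    /** Hypothetical histogram result, for illustration only. */
    static Map<String, Long> sample() {
        Map<String, Long> counts = new HashMap<>();
        counts.put("bob", 7L);
        counts.put("alice", 42L);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(toLines("nna_empty_files", sample(), System.currentTimeMillis() * 1_000_000L));
    }
}
```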
How do we utilize this?

NNA @ PayPal
Who is creating the most empty files?
Who is creating the most empty directories?
Who are the biggest users of the file system in terms of file count or space usage?
What are the largest directories in terms of file count or space usage?
Who is creating small files? (Greater than 0 bytes but much less than 1 block size).
Who has the most “open permission” files? (chmod 777 abusers).
What is the average file size under a particular directory?
What files are open / being written to right now?

NNA @ PayPal
Tracking of quota usage.
Tracking of old files.
Tracking of small files / areas for archival or compression and compaction.
Tracking of the date each user was last issued a delegation token.
Tracking of File types (extensions).
Per user usage reports and suggestions.
Query against any dimension available in the HDFS INode(s).
(In progress) AUTOMATED HDFS DATA MANAGEMENT.

First detect, then fix.
NNA is your detection tool.

Understanding NNA API
NNA first asks you to define a set to work with; either the set of all files, or the set of all directories.
Depending on which set you pick, different options are available to you.
From there you build a set of filters to apply to that set, and finally pick the result you want to reduce to: the sum.
• Take this example: /filter?set=files&filters=fileSize:eq:0&sum=count
• "Starting with the set of all files, get all those that have a file size equal to zero, and count how many there are."
• Or this example: /filter?set=files&filters=modTime:olderThanYears:1&sum=diskspaceConsumed
• "Starting with the set of all files, get all those with a modification time older than 1 year, and sum up their diskspace
usage."
From there we allow even more complex groupings via a /histogram endpoint:
• For example: /histogram?set=files&filters=fileSize:eq:0&type=user&sum=count
• "Starting with the set of all files, get all those that have a file size equal to zero, group them by user, and count how
many there are."
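To illustrate how a `filters=field:op:value` string can become an executable query, here is a minimal sketch that parses the triples into predicates and applies a `sum`. It is not NNA's actual parser: only the `eq` and `lte` operators and two fields are modelled, `FileInode` is a hypothetical stand-in, and treating comma-separated filters as a logical AND is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToLongFunction;

class FilterExpression {
    /** Hypothetical, simplified stand-in for an HDFS file inode. */
    static final class FileInode {
        final String path;
        final long fileSize;
        final long diskspaceConsumed;

        FileInode(String path, long fileSize, long diskspaceConsumed) {
            this.path = path;
            this.fileSize = fileSize;
            this.diskspaceConsumed = diskspaceConsumed;
        }
    }

    /** Map a field name from the query string to a getter (only two fields modelled). */
    static ToLongFunction<FileInode> field(String name) {
        switch (name) {
            case "fileSize": return f -> f.fileSize;
            case "diskspaceConsumed": return f -> f.diskspaceConsumed;
            default: throw new IllegalArgumentException("Unknown field: " + name);
        }
    }

    /** Turn one "field:op:value" triple into a predicate. */
    static Predicate<FileInode> parseOne(String expr) {
        String[] parts = expr.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected field:op:value but got: " + expr);
        }
        ToLongFunction<FileInode> getter = field(parts[0]);
        long value = Long.parseLong(parts[2]);
        switch (parts[1]) {
            case "eq": return f -> getter.applyAsLong(f) == value;
            case "lte": return f -> getter.applyAsLong(f) <= value;
            default: throw new IllegalArgumentException("Unknown operator: " + parts[1]);
        }
    }

    /** Comma-separated filters are ANDed together (an assumption about the semantics). */
    static Predicate<FileInode> parse(String filters) {
        return Arrays.stream(filters.split(","))
                .map(FilterExpression::parseOne)
                .reduce(f -> true, Predicate::and);
    }

    /** sum=count counts the matches; any other value sums that field over the matches. */
    static long run(List<FileInode> files, String filters, String sum) {
        Predicate<FileInode> matches = parse(filters);
        if ("count".equals(sum)) {
            return files.parallelStream().filter(matches).count();
        }
        return files.parallelStream().filter(matches).mapToLong(field(sum)).sum();
    }

    /** Tiny sample data set, for illustration only. */
    static List<FileInode> sample() {
        return Arrays.asList(
                new FileInode("/a", 0L, 0L),
                new FileInode("/b", 0L, 0L),
                new FileInode("/c", 1024L, 3072L),
                new FileInode("/d", 4096L, 12288L));
    }

    public static void main(String[] args) {
        System.out.println(run(sample(), "fileSize:eq:0", "count"));
        System.out.println(run(sample(), "fileSize:lte:1024", "diskspaceConsumed"));
    }
}
```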
What do queries look like?

Some Pictures
For example…
Graphing:
Users by # of empty files they own
/histogram?set=files&filters=fileSize:eq:0&type=user&sum=count
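The grouping behind a query like the one above can be sketched with `Collectors.groupingBy`. This is an illustration on a hypothetical `FileInode` class, not NNA's own histogram code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class UserHistogram {
    /** Hypothetical, simplified stand-in for an HDFS file inode. */
    static final class FileInode {
        private final String owner;
        private final long fileSize;

        FileInode(String owner, long fileSize) {
            this.owner = owner;
            this.fileSize = fileSize;
        }

        String getOwner() { return owner; }
        long getFileSize() { return fileSize; }
    }

    /** Filter to empty files, group by owner, count each group (sorted by user name). */
    static Map<String, Long> emptyFilesByUser(List<FileInode> files) {
        return files.parallelStream()
                .filter(f -> f.getFileSize() == 0)
                .collect(Collectors.groupingBy(FileInode::getOwner, TreeMap::new, Collectors.counting()));
    }

    /** Tiny sample data set, for illustration only. */
    static List<FileInode> sample() {
        return Arrays.asList(
                new FileInode("alice", 0L),
                new FileInode("alice", 0L),
                new FileInode("bob", 0L),
                new FileInode("bob", 10L),
                new FileInode("carol", 5L));
    }

    public static void main(String[] args) {
        System.out.println(emptyFilesByUser(sample()));
    }
}
```

Users with no empty files (carol in the sample) simply do not appear in the result.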

Some Pictures
For example…
Graphing:
Users by # of empty directories they own
/histogram?set=dirs&filters=dirNumChildren:eq:0&type=user&sum=count

Some Pictures
For example…
Graphing:
Users by # of small files
/histogram?set=files&filters=fileSize:lte:1024&type=user&sum=count

Some Pictures
For example…
Dumping:
Files currently being written to
/filter?set=files&filters=isUnderConstruction:eq:true&limit=1000

Some Pictures
For example…
Histogram Binning:
Size of Files vs Disk space consumed by Files
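A rough sketch of what binning by file size could look like: each file falls into a size bin, and the disk space consumed (size times replication) is summed per bin. The slide does not list the bins NNA uses, so the bin boundaries below, like the `FileInode` class, are assumptions.

```java
import java.util.Arrays;
import java.util.List;

class FileSizeBinning {
    static final long MB = 1024L * 1024;
    // Assumed bin upper bounds (exclusive); anything at or above the last bound goes in the final bin.
    static final long[] UPPER_BOUNDS = {1L, MB, 128 * MB};
    static final String[] LABELS = {"empty", "<1MB", "<128MB", ">=128MB"};

    /** Hypothetical, simplified stand-in for an HDFS file inode. */
    static final class FileInode {
        final long fileSize;
        final short replication;

        FileInode(long fileSize, short replication) {
            this.fileSize = fileSize;
            this.replication = replication;
        }
    }

    /** Index of the first bin whose upper bound is greater than the file size. */
    static int binOf(long fileSize) {
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (fileSize < UPPER_BOUNDS[i]) {
                return i;
            }
        }
        return UPPER_BOUNDS.length;
    }

    /** Sum of disk space consumed (size x replication) per bin, indexed like LABELS. */
    static long[] diskspaceByBin(List<FileInode> files) {
        long[] totals = new long[LABELS.length];
        for (FileInode f : files) {
            totals[binOf(f.fileSize)] += f.fileSize * f.replication;
        }
        return totals;
    }

    /** Tiny sample data set, for illustration only. */
    static List<FileInode> sample() {
        return Arrays.asList(
                new FileInode(0L, (short) 3),
                new FileInode(512 * 1024L, (short) 3),
                new FileInode(64 * MB, (short) 3),
                new FileInode(256 * MB, (short) 2));
    }

    public static void main(String[] args) {
        long[] totals = diskspaceByBin(sample());
        for (int i = 0; i < LABELS.length; i++) {
            System.out.println(LABELS[i] + ": " + totals[i] + " bytes");
        }
    }
}
```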

Some Pictures
For example…
Histogram Binning:
Disk space consumed by different replication factors

Some Pictures
For example…
Histogram Binning:
File Type Extensions
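A file-type histogram comes down to deriving an extension from each path and counting per extension. The sketch below shows one way to do that; how NNA itself derives file types is not stated in the slides, so the rules here (lowercasing, and a `NONE` bucket for names without an extension) are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class FileTypeHistogram {
    /** Extension after the last dot of the file name, lowercased; "NONE" if there is none. */
    static String extensionOf(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        // No dot, a leading dot (hidden file such as ".staging"), or a trailing dot: no extension.
        if (dot <= 0 || dot == name.length() - 1) {
            return "NONE";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Count paths per extension, sorted by extension. */
    static Map<String, Long> countByExtension(List<String> paths) {
        return paths.stream()
                .collect(Collectors.groupingBy(FileTypeHistogram::extensionOf, TreeMap::new, Collectors.counting()));
    }

    /** Made-up sample paths, for illustration only. */
    static List<String> sample() {
        return Arrays.asList(
                "/data/a.orc",
                "/data/b.ORC",
                "/logs/app.log.gz",
                "/tmp/_SUCCESS",
                "/user/x/.staging");
    }

    public static void main(String[] args) {
        System.out.println(countByExtension(sample()));
    }
}
```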

Story Time!
HDFS-11419
Slow addBlock operation on NameNode due to users writing into WARM StoragePolicy directories.
Difficult to find all the WARM directories; impossible from the legacy FsImage alone; very simple on NNA.
Dump all WARM directory paths from the API: /filter?set=dirs&filters=storageType:eq:WARM
NameNode Pushing Scalability Limits
We were pushing the limits of the NameNode and close to going full GC. 400+ million files. 800+ million total file system objects.
Difficult to find datasets to delete and little time.
Find old datasets to delete: /histogram?set=files&filters=accessTime:olderThanYears:2&type=parentDir&sum=count
Small File Prevention
Midway through an initiative to find and clean up small files from HDFS, we found users were creating small files at the same rate we were
compressing and cleaning them.
Difficult to find which users are creating small files.
Find users by small files: /histogram?set=files&filters=fileSize:lte:1048576,accessTime:hoursAgo:24&type=user&sum=count
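The time-based filters in these queries (`olderThanYears`, `hoursAgo`) reduce to comparing a file's timestamp against a cutoff. The sketch below shows that comparison with `java.time`. The exact semantics NNA applies are not spelled out in the slides: reading `olderThanYears:N` as "before now minus N calendar years (UTC)" and `hoursAgo:N` as "within the last N hours" are both assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

class TimeFilters {
    /** Assumed meaning of olderThanYears:N, i.e. the timestamp is before (now minus N years). */
    static boolean olderThanYears(long epochMillis, int years, Instant now) {
        Instant cutoff = now.atZone(ZoneOffset.UTC).minusYears(years).toInstant();
        return Instant.ofEpochMilli(epochMillis).isBefore(cutoff);
    }

    /** Assumed meaning of hoursAgo:N, i.e. the timestamp falls within the last N hours. */
    static boolean withinLastHours(long epochMillis, int hours, Instant now) {
        Instant cutoff = now.minus(Duration.ofHours(hours));
        return !Instant.ofEpochMilli(epochMillis).isBefore(cutoff);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        long threeYearsAgo = now.atZone(ZoneOffset.UTC).minusYears(3).toInstant().toEpochMilli();
        long oneHourAgo = now.minus(Duration.ofHours(1)).toEpochMilli();
        System.out.println("old dataset candidate: " + olderThanYears(threeYearsAgo, 2, now));
        System.out.println("touched in last 24h: " + withinLastHours(oneHourAgo, 24, now));
    }
}
```

Passing `now` in as a parameter keeps the comparison deterministic, which matters when the same cutoff has to apply to every inode in one query.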
When has NNA saved us?

Successes
Near real-time analysis.
‘nough said.
For anyone wondering - the magic is in skipping the FSNamesystem lock and introducing multi-core processing.
Easy to install and maintain.
NNA’s Gradle build can construct RPM packages.
Difficulty is about equal to that of bringing up a new, additional, Standby NameNode.
Scalable?
While NNA is not a distributed system, it is a replicated read-only copy.
If you require more analytical throughput you could spin up multiple NNA instances.
The Journal Nodes can handle many readers.
Where has NNA won?

Flaws
It is still a NameNode.
NNA is subject to all the faults and flaws of a regular HDFS NameNode.
If you have too many files and blocks, your NNA instance will operate slower as a result.
Interactive queries that don’t reduce the working set are not great for NNA.
It is not a distributed system.
While NNA can serve cached reports very frequently, it cannot handle many interactive queries at the same time.
Queries are best used by admins while reports are best used by end users.
It is “one of those” single-person projects.
While I had assistance in coding, NNA was mostly a one-person show.
I have been fixing bugs and adding features for nearly a year and a half now.
There is plenty of work still to do and things to improve.
NNA is not Perfect.

Future Work
Where can NNA go from here?
HDFS-6382 : TTL In HDFS
Discussion about TTL living outside the NameNode. Desire to not introduce TTL management due to additional thread resource requirements
on active NameNode. NNA could be extended to provide a routine TTL service on top of it.
HDFS-13150 : Faster Tailing of Edits from Journal Nodes
Part of the work to make Standby NameNode(s) service reads is to reduce the latency between when an EditLog transaction is applied on the
Active vs on the Standby. Reducing this latency means NNA queries become even closer to real time as well.
HDFS Cluster Management Integration
NNA is simple enough to install that it should be easy to create an Ambari package, Cloudera Parcel, or other integration package for
your flavor of management console.
Web & Security
NNA supports LDAP only at the moment. It uses JSON Web Tokens to maintain sessions. Would any security experts like to lend a hand?
Support for Kerberos authentication would be great!

Demo
Example Local Cluster from Code

END
(Q & A?)
