This document discusses cluster monitoring metrics and tools. It provides an introduction to metrics and monitoring and describes the Hadoop/HBase metrics framework, including common metric types and examples. It then covers specific metrics for the master, region servers, JVM, RPCs and more. Finally, it discusses tools like Ganglia, Nagios and JMX for collecting and viewing metrics and provides instructions for a hands-on exercise with Ganglia.
Agenda
Course Credit
Introduction
Metrics Framework
Tools
Tools on wiki
http://wiki.spn.tw.trendnet.org/wiki/Hadoop_Related_Web_Site_List
Course Credit
Attendance: 30 points
Asking questions: 5 points per question
Hands-on: 40 points
70 points are required to pass this course
Course credit is calculated once for each course finished
The course credit will be sent to you and your supervisor by mail
Introduction – (1/2)
Using a cluster without monitoring and metrics is the same as driving a car while blindfolded
It is great to run load tests against your HBase cluster, but you need to correlate the cluster’s performance with what the system is doing under the hood
Introduction – (2/2)
Graphing
Captures the exposed metrics of a system and
displays them in visual charts
A picture speaks a thousand words
Good for historical, quantitative data
Monitoring
With graphs alone, it is still difficult to see what a system is doing right now
Qualitative data is needed, which is handled by monitoring systems
Sends out emails to various recipients
Sends SMS messages to telephones
Runs customized scripts
The Metrics Framework – Metric Types – (1/3)
Metric Type | Description
Integer value (IV) | An integer counter. Only updated when the value changes
Long value (LV) | A long counter. Only updated when the value changes
Rate (R) | A float value representing a rate.
1. The rate is calculated as number of operations / elapsed time in seconds.
2. The rate is stored in the previous value field.
3. The internal counter is reset to zero.
4. The last polled timestamp is set to the current time.
5. The computed rate is returned to the caller.
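The five steps listed for the rate metric can be sketched in a few lines of Python. This is only an illustration of the polling logic, not the actual Hadoop/HBase implementation; the class name `RateMetric`, its methods, and the injectable `clock` parameter are hypothetical names chosen for this sketch.

```python
import time


class RateMetric:
    """Illustrative rate metric (hypothetical class, not the HBase one)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable clock so the sketch is testable
        self._count = 0              # internal operation counter
        self._last_polled = clock()  # timestamp of the last poll
        self.previous_value = 0.0    # last computed rate

    def inc(self, n=1):
        """Record n operations."""
        self._count += n

    def poll(self):
        now = self._clock()
        elapsed = now - self._last_polled
        # Step 1: rate = number of operations / elapsed time in seconds
        rate = self._count / elapsed if elapsed > 0 else 0.0
        self.previous_value = rate   # Step 2: store in the previous value field
        self._count = 0              # Step 3: reset the internal counter
        self._last_polled = now      # Step 4: remember when we were polled
        return rate                  # Step 5: return the rate to the caller
```

With a fake clock, recording 50 operations and polling 10 seconds later yields a rate of 5.0 operations per second; a second poll with no new operations yields 0.0.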
The Metrics Framework – Metric Types – (2/3)
Metric Type | Description
String (S) | Static, text-based information that is never reset or changed, e.g., the HBase version number, build date, and so on.
Time varying integer (TVI) | The context keeps aggregating the value. When the value is polled, it returns the accrued integer value and resets to zero until it is polled again
Time varying long (TVL) | Same as TVI, but uses a long
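The accumulate-then-reset behavior of a time varying integer can be sketched as follows. The class `TimeVaryingInt` and its method names are hypothetical and exist only for this illustration; Python integers are unbounded, so the same sketch also stands in for the long variant.

```python
class TimeVaryingInt:
    """Illustrative time varying counter (hypothetical, not the HBase class)."""

    def __init__(self):
        self._value = 0

    def inc(self, n=1):
        """Aggregate n into the running value."""
        self._value += n

    def poll(self):
        """Return the accrued value and reset it to zero."""
        value, self._value = self._value, 0
        return value
```

After `inc(3)` and `inc(4)`, the first `poll()` returns 7 and an immediate second `poll()` returns 0.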
The Metrics Framework – Metric Types – (3/3)
Metric Type | Description
Time varying rate (TVR) | The number of operations or events and the time they required to complete. The values for operation count and time accrued are reset once the metric is polled
Persistent time varying rate (PTVR) | Same as TVR, but NOT reset for every poll
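The difference between TVR and PTVR comes down to whether polling clears the accumulated values. A minimal sketch, assuming a single hypothetical class `TimeVaryingRate` with a `persistent` flag (the real framework may structure this differently):

```python
class TimeVaryingRate:
    """Illustrative TVR/PTVR metric (hypothetical, not the HBase class)."""

    def __init__(self, persistent=False):
        self.persistent = persistent  # True models PTVR, False models TVR
        self.ops = 0                  # number of operations or events
        self.time_ms = 0              # time they required to complete

    def record(self, ops, time_ms):
        """Accrue a batch of operations and the time they took."""
        self.ops += ops
        self.time_ms += time_ms

    def poll(self):
        """Return (operation count, accrued time); reset unless persistent."""
        snapshot = (self.ops, self.time_ms)
        if not self.persistent:
            self.ops = 0
            self.time_ms = 0
        return snapshot
```

Polling a TVR twice in a row returns the accrued values and then `(0, 0)`, while a PTVR returns the same accrued values both times.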
The Metrics Framework – Master Metrics
The master process exposes all metrics relating to its role in a cluster
Metric | Property Name | Description
Cluster requests (R) | hbase.master.cluster_requests | The total number of requests to the cluster, aggregated across all region servers
Split time (PTVR) | hbase.master.splitTime | The time it took to split the write-ahead log files after a restart
Split size (PTVR) | hbase.master.splitSize | The total size of the write-ahead log files that were split
The Metrics Framework – Region Server Metrics
The region servers expose a substantial number of metrics
They include details about different parts of the overall architecture inside the server
They fall into the following groups
Block cache metrics
Compaction metrics
Memstore metrics
Store metrics
I/O metrics
Miscellaneous metrics
Region Server Metrics – Block cache metrics – (1/2)
Metric | Property Name | Description
count (LV) | hbase.regionserver.blockCacheCount | The number of blocks currently in the cache
size (LV) | hbase.regionserver.blockCacheSize | The Java heap space occupied by the blocks currently in the cache
free (LV) | hbase.regionserver.blockCacheFree | Remaining heap for the cache
evicted (LV) | hbase.regionserver.blockCacheEvictedCount | The number of blocks that had to be removed because of heap size constraints
Region Server Metrics – Block cache metrics – (2/2)
Metric | Property Name | Description
cache hit (LV) | hbase.regionserver.blockCacheHitCount | The number of block cache hits
miss (LV) | hbase.regionserver.blockCacheMissCount | The number of block cache misses
hit ratio (IV) | hbase.regionserver.blockCacheHitRatio | The number of cache hits in relation to the total number of requests to the cache
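The hit ratio relates cache hits to the total number of cache requests, which can be derived from the hit and miss counters. The sketch below assumes the ratio is reported as a whole-number percentage (suggested by its integer value type, but not confirmed here); the function name `block_cache_hit_ratio` is hypothetical.

```python
def block_cache_hit_ratio(hit_count, miss_count):
    """Return cache hits as an integer percentage of all cache requests.

    Assumption: total requests = hits + misses, and the result is
    truncated to a whole percent.
    """
    total = hit_count + miss_count
    if total == 0:
        return 0  # no requests yet, so there is no meaningful ratio
    return 100 * hit_count // total
```

For example, 90 hits and 10 misses give a ratio of 90, and 1 hit with 2 misses gives 33.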
Region Server Metrics – Compaction metrics
Metric | Property Name | Description
compaction size (PTVR) | hbase.regionserver.compactionSize | The total size (in bytes) of the storage files that have been compacted
compaction time (PTVR) | hbase.regionserver.compactionTime | How long the compaction took. Both metrics above are reported after a completed compaction run
compaction queue size (IV) | hbase.regionserver.compactionQueueSize | How many files a region server currently has queued up for compaction (recommended for monitoring)
Region Server Metrics – Memstore metrics
Metric | Property Name | Description
memstore size MB (IV) | hbase.regionserver.memstoreSizeMB | The total heap space occupied by all memstores (in online regions) for the server in megabytes
flush queue size (IV) | hbase.regionserver.flushQueueSize | The number of enqueued regions that are being flushed next (recommended for monitoring)
flush size (PTVR) | hbase.regionserver.flushSize | The total size (in bytes) of the memstore that has been flushed
flush time (PTVR) | hbase.regionserver.flushTime | The total time it took to flush the memstore
Region Server Metrics – Store metrics
Metric | Property Name | Description
store files (IV) | hbase.regionserver.storefiles | The total number of storage files, spread across all stores (regions) managed by the current server
stores (IV) | hbase.regionserver.stores | The total number of stores for the server, across all regions
store file index size MB (IV) | hbase.regionserver.storefileIndexSizeMB | The sum of the block index, and optional meta index, for all store files in megabytes
Region Server Metrics – I/O metrics
Metric | Property Name | Description
fs read latency (TVR) | hbase.regionserver.fsReadLatency | Filesystem read latency, e.g., the time it takes to load a block from the storage files
fs write latency (TVR) | hbase.regionserver.fsWriteLatency | The same as above, but for write operations, including the storage files and write-ahead log
fs sync latency (TVR) | hbase.regionserver.fsSyncLatency | The latency to sync the write-ahead log records to the filesystem
All numbers in milliseconds
Region Server Metrics – Miscellaneous metrics
Metric | Property Name | Description
read request count (LV) | hbase.regionserver.readRequestCount | The total number of read (such as get()) operations
write request count (LV) | hbase.regionserver.writeRequestCount | The total number of write (such as put()) operations
requests (R) | hbase.regionserver.requests | The actual request rate per second
regions (IV) | hbase.regionserver.regions | The number of regions that are currently online and hosted by this region server
The Metrics Framework – RPC Metrics
Metric | Property Name | Description
RPC Process Time | rpc.metrics.RpcProcessingTime | The average time it took to process the RPCs on the server side
RPC Queue Time | rpc.metrics.RpcQueueTime | The time between a call's arrival and when it is actually processed, i.e., the queue time (recommended for monitoring)
The Metrics Framework – JVM Metrics
To tune the JVM settings and optimize your HBase setup, you need to know what is going on in the cluster
The JVM metrics fall into the following groups
Memory usage metrics
Garbage collection metrics
Thread metrics
System event metrics
JVM Metrics – Memory usage metrics
Metric | Property Name
Non-heap used memory | jvm.RegionServer.metrics.memNonHeapUsedM
Non-heap committed memory | jvm.RegionServer.metrics.memNonHeapCommittedM
Heap used memory | jvm.RegionServer.metrics.memHeapUsedM
Heap committed memory | jvm.RegionServer.metrics.memHeapCommittedM
Description (applies to all rows): for what used versus committed memory means, see http://docs.oracle.com/javase/6/docs/api/java/lang/management/MemoryUsage.html
JVM Metrics – Garbage collection metrics
• The garbage collection process causes so-called stop-the-world pauses in certain steps
• These pauses are difficult to handle when a system is bound by tight SLAs
• Problems arise when these pauses approach the multiminute range, because this can cause a region server to miss its ZooKeeper lease renewal, forcing the master to take evasive actions
• This is the so-called “Juliet Pause”
Metric | Property Name | Description
gc count | jvm.RegionServer.metrics.gcCount | The number of garbage collections
gc time millis | jvm.RegionServer.metrics.gcTimeMillis | The accumulated time spent in garbage collection
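Because the GC time is an accumulated value, one way to spot long pauses is to compare two successive readings and see what share of the interval went to garbage collection. The sketch below assumes the reading only ever grows between polls; the function name `gc_overhead` and its parameters are hypothetical.

```python
def gc_overhead(prev_gc_millis, curr_gc_millis, interval_seconds):
    """Return the fraction of a polling interval spent in garbage collection.

    Assumption: both readings come from the same accumulated GC time
    metric, taken interval_seconds apart.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta_millis = curr_gc_millis - prev_gc_millis
    return delta_millis / (interval_seconds * 1000.0)
```

If the accumulated GC time rises from 1000 ms to 4000 ms over a 60-second interval, the overhead is 3000 / 60000, or 5% of the interval.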
JVM Metrics – Thread metrics
Metric | Property Name
new state | jvm.RegionServer.metrics.threadsNew
runnable state | jvm.RegionServer.metrics.threadsRunnable
blocked state | jvm.RegionServer.metrics.threadsBlocked
waiting state | jvm.RegionServer.metrics.threadsWaiting
timed waiting state | jvm.RegionServer.metrics.threadsTimedWaiting
terminated state | jvm.RegionServer.metrics.threadsTerminated
Description (applies to all rows): the count for each possible thread state, including new, runnable, blocked, and so on. For details on thread states, refer to the following docs
http://www.programcreek.com/2009/03/thread-status/
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.State.html
JVM Metrics – System event metrics
Metric | Property Name
log fatal | jvm.RegionServer.metrics.logFatal
log error | jvm.RegionServer.metrics.logError
log warn | jvm.RegionServer.metrics.logWarn
log info | jvm.RegionServer.metrics.logInfo
Description (applies to all rows): system event metrics provide counts for various log-level events, e.g., the log error metric provides the number of log events that occurred on the error level.
The Metrics Framework
If you find other metrics not listed here, please refer to the API docs directly:
http://hbase.apache.org/apidocs/index.html?overview-summary.html
Tools - Ganglia
A distributed, scalable monitoring system
suitable for large cluster systems
HBase inherits its native support for Ganglia
directly from Hadoop
Ganglia – Three components
Ganglia monitoring daemon (gmond)
Runs on every machine that is monitored
Collects the local data and prepares the statistics to be
polled by other systems
Ganglia meta daemon (gmetad)
Is installed on a central node
Acts as the federation node to the entire cluster
Polls from one or more monitoring daemons to receive the
current cluster status
Ganglia PHP web frontend
Retrieves the combined statistics from the meta daemon and presents them as HTML
Tools - Nagios
Polls current metrics on a regular basis and compares them with given thresholds
Once the thresholds are exceeded, it starts evasive actions
These range from sending out emails or SMS messages to telephones, to triggering scripts, or even physically rebooting the server when necessary
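The threshold comparison at the heart of such a check can be sketched as below. It follows the common Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), and 2 (CRITICAL); the function name `check_threshold` and the example thresholds are hypothetical, and how the metric value is actually fetched is left out.

```python
# Conventional Nagios plugin status codes
OK, WARNING, CRITICAL = 0, 1, 2


def check_threshold(value, warn, crit):
    """Map a polled metric value to a status code.

    Assumption: higher values are worse, as with a queue size.
    """
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK
```

For instance, with hypothetical thresholds of 10 (warning) and 20 (critical) on a compaction queue size, a reading of 3 maps to OK, 12 to WARNING, and 25 to CRITICAL. A real plugin would print a status line and exit with the returned code.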
Tools - JMX
Java Management Extensions technology
The standard for Java applications to
export their status
Also has the ability to provide operations
Common tools for JMX
JConsole
JMXToolkit
http://hbase.apache.org/metrics.html
Hands-on
Use the Ganglia “Aggregate Graphs” feature
Title the graph with your name
Include 5 hosts
Use any two metrics
Crop the image file, as in the sample
Put the image file into Git
YOUR_HOME="${GIT_ROOT}/hbase-training/005/hands-on/<your_name>"
mkdir -p "${YOUR_HOME}"
Put your hands-on into ${YOUR_HOME}