005 cluster monitoring

Cluster Monitoring
2012/07/26
Scott Miao


Agenda
• Course Credit
• Introduction
• Metrics Framework
• Tools
  • Tools on wiki
    http://wiki.spn.tw.trendnet.org/wiki/Hadoop_Related_Web_Site_List


Course Credit
• Show up: 30 points
• Ask a question: 5 points per question
• Hands-on: 40 points
• 70 points are needed to pass this course

• Course credit is calculated once for each finished course
• The course credit will be sent to you and your supervisor by mail


Introduction – (1/2)
• Using a cluster without monitoring and metrics is…
  • the same as driving a car while blindfolded
• It is great to run load tests against your HBase cluster
  • but you need to correlate the cluster's performance with what the system is doing under the hood


Introduction – (2/2)
• Graphing
  • Captures the exposed metrics of a system and displays them in visual charts
  • A picture speaks a thousand words
  • Good for historical, quantitative data

• Monitoring
  • With graphs alone, it is still difficult to see what a system is doing right now
  • Qualitative data is needed, which is handled by monitoring systems
  • Sends out emails to various recipients
  • Sends SMS messages to telephones
  • Runs customized scripts


The Metrics Framework – Basic Classes from Hadoop


Extended Classes in HBase


Classes Collaboration


Metric Types – (1/3)
Metric Type | Description
Integer value (IV) | An integer counter. Only updated when the value changes.
Long value (LV) | A long counter. Only updated when the value changes.
Rate (R) | A float value representing a rate. When the metric is polled:
  1. The rate is calculated as number of operations / elapsed time in seconds.
  2. The rate is stored in the previous value field.
  3. The internal counter is reset to zero.
  4. The last polled timestamp is set to the current time.
  5. The computed rate is returned to the caller.
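
The five steps of the Rate (R) type can be sketched roughly as follows. This is only an illustration in Python, not the Hadoop or HBase implementation; the class name `RateMetric`, its method names, and the injected `clock` callable are all assumptions made for the example.

```python
class RateMetric:
    """Illustrative rate metric: counts operations, reports ops/second on poll."""

    def __init__(self, clock):
        self._clock = clock            # hypothetical: callable returning seconds
        self._count = 0                # operations seen since the last poll
        self._last_poll = clock()      # timestamp of the last poll
        self.previous_value = 0.0      # rate computed by the last poll

    def inc(self, n=1):
        """Record n operations."""
        self._count += n

    def poll(self):
        now = self._clock()
        elapsed = now - self._last_poll
        # Step 1: operations / elapsed seconds (guard against a zero interval)
        rate = self._count / elapsed if elapsed > 0 else 0.0
        self.previous_value = rate     # Step 2: store in the previous value field
        self._count = 0                # Step 3: reset the internal counter
        self._last_poll = now          # Step 4: remember when we were polled
        return rate                    # Step 5: return the computed rate


if __name__ == "__main__":
    fake_now = [0.0]                   # fake clock so the example is deterministic
    metric = RateMetric(clock=lambda: fake_now[0])
    metric.inc(50)
    fake_now[0] = 10.0                 # pretend 10 seconds have passed
    print(metric.poll())               # 50 operations / 10 seconds
```

Injecting the clock is a design choice for the sketch only: it keeps the example testable without sleeping.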



Metric Types – (2/3)
Metric Type | Description
String (S) | Static, text-based information that is never reset or changed. E.g., HBase version number, build date, and so on.
Time varying integer (TVI) | The context keeps aggregating the value. When the value is polled, it returns the accrued integer value and resets to zero, then accrues again until the next poll.
Time varying long (TVL) | Same as TVI, but uses a long.



Metric Types – (3/3)
Metric Type | Description
Time varying rate (TVR) | The number of operations or events and the time they required to complete. The values for operation count and time accrued are reset once the metric is polled.
Persistent time varying rate (PTVR) | Same as TVR, but NOT reset for every poll.
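
The difference between TVR and PTVR comes down to whether a poll clears the accrued values. The sketch below illustrates that with a single hypothetical class and a `persistent` flag; neither the class name nor the flag comes from HBase, which uses separate metric classes.

```python
class TimeVaryingRate:
    """Illustrative TVR/PTVR: accrues an operation count and the time spent."""

    def __init__(self, persistent=False):
        self.persistent = persistent   # hypothetical flag: True behaves like PTVR
        self.ops = 0                   # accrued number of operations
        self.total_time_ms = 0         # accrued time those operations took

    def inc(self, ops, time_ms):
        """Record a batch of operations and how long they took."""
        self.ops += ops
        self.total_time_ms += time_ms

    def poll(self):
        """Return (ops, total_time_ms); reset afterwards unless persistent."""
        snapshot = (self.ops, self.total_time_ms)
        if not self.persistent:
            self.ops = 0
            self.total_time_ms = 0
        return snapshot


if __name__ == "__main__":
    tvr = TimeVaryingRate()
    tvr.inc(2, 30)
    print(tvr.poll(), tvr.poll())      # second poll sees reset values

    ptvr = TimeVaryingRate(persistent=True)
    ptvr.inc(1, 500)
    print(ptvr.poll(), ptvr.poll())    # values survive the poll
```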


Master Metrics
• The master process exposes all metrics relating to its role in a cluster

Metric | Property Name | Description
Cluster requests (R) | hbase.master.cluster_requests | The total number of requests to the cluster, aggregated across all region servers
Split time (PTVR) | hbase.master.splitTime | The time it took to split the write-ahead log files after a restart
Split size (PTVR) | hbase.master.splitSize | The total size of the write-ahead log files that were split


Region Server Metrics
• A substantial number of metrics are exposed here
  • They include details about different parts of the overall architecture inside the server
• They fall into the following groups
  • Block cache metrics
  • Compaction metrics
  • Memstore metrics
  • Store metrics
  • I/O metrics
  • Miscellaneous metrics


Region Server Metrics – Block cache metrics – (1/2)

Metric | Property Name | Description
count (LV) | hbase.regionserver.blockCacheCount | The number of blocks currently in the cache
size (LV) | hbase.regionserver.blockCacheSize | The total size of the blocks currently in the cache, that is, the Java heap space they occupy
free (LV) | hbase.regionserver.blockCacheFree | Remaining heap for the cache
evicted (LV) | hbase.regionserver.blockCacheEvictedCount | The number of blocks that had to be removed because of heap size constraints


Block cache metrics – (2/2)

Metric | Property Name | Description
cache hit (LV) | hbase.regionserver.blockCacheHitCount | The number of block cache hits
miss (LV) | hbase.regionserver.blockCacheMissCount | The number of block cache misses
hit ratio (IV) | hbase.regionserver.blockCacheHitRatio | The number of cache hits in relation to the total number of requests to the cache
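
As a quick illustration of how the hit ratio relates to the hit and miss counts, the sketch below derives it from the two counters. It assumes the total number of cache requests is hits plus misses and that the ratio is expressed as a whole-number percentage (which would fit the IV type); the function name is made up for this example.

```python
def hit_ratio_percent(hit_count, miss_count):
    """Cache hits as a whole-number percentage of all cache requests.

    Assumption for this sketch: total requests = hits + misses.
    """
    total = hit_count + miss_count
    if total == 0:
        return 0                       # no requests yet, avoid dividing by zero
    return 100 * hit_count // total    # integer percentage, rounded down


if __name__ == "__main__":
    print(hit_ratio_percent(900, 100))   # 900 hits out of 1000 requests
```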


Compaction metrics

Metric | Property Name | Description
compaction size (PTVR) | hbase.regionserver.compactionSize | The total size (in bytes) of the storage files that have been compacted
compaction time (PTVR) | hbase.regionserver.compactionTime | How long that operation took. Both metrics above are reported after a completed compaction run
compaction queue size (IV) | hbase.regionserver.compactionQueueSize | How many files a region server currently has queued up for compaction (recommended for monitoring)


Memstore metrics

Metric | Property Name | Description
memstore size MB metric (IV) | hbase.regionserver.memstoreSizeMB | The total heap space occupied by all memstores (in online regions) for the server, in megabytes
flush queue size (IV) | hbase.regionserver.flushQueueSize | The number of enqueued regions that are being flushed next (recommended for monitoring)
flush size (PTVR) | hbase.regionserver.flushSize | The total size (in bytes) of the memstore that has been flushed
flush time (PTVR) | hbase.regionserver.flushTime | The total time it took to flush the memstore


Store metrics

Metric | Property Name | Description
store files (IV) | hbase.regionserver.storefiles | The total number of storage files, spread across all stores (regions) managed by the current server
stores (IV) | hbase.regionserver.stores | The total number of stores for the server, across all regions
store file index size MB metric (IV) | hbase.regionserver.storefileIndexSizeMB | The sum of the block index, and optional meta index, for all store files, in megabytes


I/O metrics

Metric | Property Name | Description
fs read latency (TVR) | hbase.regionserver.fsReadLatency | Filesystem read latency, e.g., the time it takes to load a block from the storage files
fs write latency (TVR) | hbase.regionserver.fsWriteLatency | The same as above, but for write operations, including the storage files and write-ahead log
fs sync latency (TVR) | hbase.regionserver.fsSyncLatency | The latency to sync the write-ahead log records to the filesystem

All numbers are in milliseconds


Miscellaneous metrics

Metric | Property Name | Description
read request count (LV) | hbase.regionserver.readRequestCount | The total number of read (such as get()) operations
write request count (LV) | hbase.regionserver.writeRequestCount | The total number of write (such as put()) operations
requests (R) | hbase.regionserver.requests | The actual request rate per second
regions (IV) | hbase.regionserver.regions | The number of regions that are currently online and hosted by this region server


RPC Metrics

Metric | Property Name | Description
RPC Process Time | rpc.metrics.RpcProcessingTime | The average time it took to process the RPCs on the server side
RPC Queue Time | rpc.metrics.RpcQueueTime | The time between when a call arrives and when it is actually processed, that is, the queue time (recommended for monitoring)


JVM Metrics
• Tuning the JVM settings to optimize your HBase setup
  • You first need to know what is going on in the cluster
• The metrics fall into the following groups
  • Memory usage metrics
  • Garbage collection metrics
  • Thread metrics
  • System event metrics


JVM Metrics – Memory usage metrics

Metric | Property Name
Non-heap used memory | jvm.RegionServer.metrics.memNonHeapUsedM
Non-heap committed memory | jvm.RegionServer.metrics.memNonHeapCommittedM
Heap used memory | jvm.RegionServer.metrics.memHeapUsedM
Heap committed memory | jvm.RegionServer.metrics.memHeapCommittedM

For what used versus committed memory means, see
http://docs.oracle.com/javase/6/docs/api/java/lang/management/MemoryUsage.html


JVM Metrics – Garbage collection metrics
• The garbage collection process causes so-called stop-the-world pauses in certain steps
• These pauses are difficult to handle when a system is bound by tight SLAs
• It gets worse when these pauses approach the multiminute range, because this can cause a region server to miss its ZooKeeper lease renewal, forcing the master to take evasive actions
• The so-called "Juliet Pause"

Metric | Property Name | Description
gc count | jvm.RegionServer.metrics.gcCount | The number of garbage collections
gc time millis | jvm.RegionServer.metrics.gcTimeMillis | The accumulated time spent in garbage collection
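
Because both GC values accumulate, one way to spot long pauses is to compare two consecutive samples. The sketch below is only an illustration under that assumption; the function name, the dictionary layout of a sample, and the polling interval are hypothetical and not part of HBase.

```python
def gc_delta(prev, curr, interval_ms):
    """Summarize GC activity between two samples of the accumulated metrics.

    prev/curr: dicts with 'gcCount' and 'gcTimeMillis' (hypothetical layout).
    Returns (collections, average pause in ms, fraction of interval in GC).
    """
    collections = curr["gcCount"] - prev["gcCount"]
    time_ms = curr["gcTimeMillis"] - prev["gcTimeMillis"]
    # Average time per collection; zero when no collection happened
    avg_pause_ms = time_ms / collections if collections else 0.0
    # Share of the polling interval that was spent in garbage collection
    gc_fraction = time_ms / interval_ms
    return collections, avg_pause_ms, gc_fraction


if __name__ == "__main__":
    before = {"gcCount": 100, "gcTimeMillis": 5000}
    after = {"gcCount": 104, "gcTimeMillis": 5600}
    print(gc_delta(before, after, interval_ms=60000))
```

An average is a coarse signal: a single very long pause can hide behind many short ones, so the accumulated time per interval is worth watching as well.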


JVM Metrics – Thread metrics

Metric | Property Name
new state | jvm.RegionServer.metrics.threadsNew
runnable state | jvm.RegionServer.metrics.threadsRunnable
blocked state | jvm.RegionServer.metrics.threadsBlocked
waiting state | jvm.RegionServer.metrics.threadsWaiting
timed waiting state | jvm.RegionServer.metrics.threadsTimedWaiting
terminated state | jvm.RegionServer.metrics.threadsTerminated

These metrics give the count for each possible thread state, including new, runnable, blocked, and so on. For details, refer to the following docs:
http://www.programcreek.com/2009/03/thread-status/
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.State.html


JVM Metrics – System event metrics

Metric | Property Name
log fatal | jvm.RegionServer.metrics.logFatal
log error | jvm.RegionServer.metrics.logError
log warn | jvm.RegionServer.metrics.logWarn
log info | jvm.RegionServer.metrics.logInfo

System event metrics provide counts for various log-level events. E.g., the log error metric provides the number of log events that occurred on the error level.


Info Metrics
• Only accessible through JMX


The Metrics Framework
• If you come across metrics that are not listed here
  • Please refer to the API docs directly
  • http://hbase.apache.org/apidocs/index.html?overview-summary.html


Tools - Ganglia
• A distributed, scalable monitoring system suitable for large cluster systems
• HBase inherits its native support for Ganglia directly from Hadoop


Ganglia – Three components
• Ganglia monitoring daemon (gmond)
  • Runs on every machine that is monitored
  • Collects the local data and prepares the statistics to be polled by other systems

• Ganglia meta daemon (gmetad)
  • Is installed on a central node
  • Acts as the federation node to the entire cluster
  • Polls one or more monitoring daemons to receive the current cluster status

• Ganglia PHP web frontend
  • Retrieves the combined statistics from the meta daemon and presents them as HTML


Ganglia - Installation

http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_quick_start


Tools - Nagios
• Polls current metrics on a regular basis and compares them with given thresholds
• Once the thresholds are exceeded, it starts evasive actions
• These range from sending out emails and SMS messages to telephones, to triggering scripts, or even physically rebooting the server when necessary
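
The threshold comparison can be sketched as a small check in the style of a Nagios plugin, which reports its result through the exit code (0 = OK, 1 = WARNING, 2 = CRITICAL). The function name, the metric used in the demo, and the threshold values below are hypothetical; how the metric value is fetched is left out on purpose.

```python
import sys

# Exit codes used by Nagios plugins
OK, WARNING, CRITICAL = 0, 1, 2
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_threshold(value, warn, crit):
    """Compare a polled metric value with warning and critical thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


if __name__ == "__main__":
    # Hypothetical sample: a compaction queue size of 12 polled from a server,
    # with made-up thresholds of 10 (warning) and 20 (critical)
    queue_size = 12
    status = check_threshold(queue_size, warn=10, crit=20)
    print("%s - compactionQueueSize=%d" % (STATUS_NAMES[status], queue_size))
    sys.exit(status)
```

The metrics marked "recommended for monitoring" in the earlier tables, such as the compaction and flush queue sizes, are natural candidates for this kind of check.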


Tools - JMX
• Java Management Extensions technology
• The standard for Java applications to export their status
• Also has the ability to provide operations
• Common tools for JMX
  • JConsole
  • JMXToolkit

http://hbase.apache.org/metrics.html


Hands-on
• Use the Ganglia “Aggregate Graphs” feature
  • Title the graph with your name
  • Include 5 hosts
  • Use any two metrics
  • Crop the image file, just like this sample

• Put the image file into Git
  • YOUR_HOME=${GIT_ROOT}/hbase-training/005/hands-on/<your_name>
  • mkdir ${YOUR_HOME}
  • Put your hands-on into ${YOUR_HOME}
