Cluster
Monitoring
2012/07/26
Scott Miao
2




Agenda
 Course    Credit

 Introduction


 Metrics   Framework

 Tools
   Tools on wiki
  http://wiki.spn.tw.trendnet.org/wiki/Hadoop_Related_Web_Site_List
3




Course Credit
   Attendance: 30 points
   Asking questions: 5 points per question
   Hands-on: 40 points
   70 points are needed to pass this course

   Course credit is calculated once
    for each course finished
   Your course credit will be sent to you and your
    supervisor by mail
4




Introduction – (1/2)
 Using a cluster without monitoring and
  metrics is…
      the same as driving a car while blindfolded
 It is great to run load tests against your
  HBase cluster
      but you need to correlate the cluster's performance
       with what the system is doing under the
       hood
5




Introduction – (2/2)
   Graphing
       Captures the exposed metrics of a system and
        displays them in visual charts
       A picture is worth a thousand words
       Good for historical, quantitative data


   Monitoring
       With graphs alone, it is still difficult to see what a
        system is doing right now
       Qualitative data is needed, which is handled by
        monitoring systems
       Sends out emails to various recipients
       Sends SMS messages to telephones
       Runs customized scripts
6




The Metrics Framework –
Basic Classes from Hadoop
7




The Metrics Framework –
Extended Classes in HBase
8




The Metrics Framework –
Classes Collaboration
9



        The Metrics Framework –
        Metric Types – (1/3)
Metric Type          Description

Integer value (IV)   An integer counter. Only updated when the value
                     changes

Long value (LV)      A long counter. Only updated when the value
                     changes

Rate (R)             A float value representing a rate. When the metric
                     is polled:
                     1. The rate is calculated as number of operations /
                        elapsed time in seconds.
                     2. The rate is stored in the previous value field.
                     3. The internal counter is reset to zero.
                     4. The last polled timestamp is set to the current time.
                     5. The computed rate is returned to the caller.
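
The five polling steps of the Rate type can be sketched roughly as follows. This is only an illustration in Python, not the Hadoop or HBase class; the names `RateMetric`, `inc`, `poll`, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class RateMetric:
    """Illustrative rate metric (hypothetical class, not the Hadoop one)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock            # injectable clock, an assumption for testing
        self._count = 0                # internal operation counter
        self._last_polled = clock()    # last polled timestamp
        self.previous_value = 0.0      # rate computed at the last poll

    def inc(self, n=1):
        """Record n operations."""
        self._count += n

    def poll(self):
        now = self._clock()
        elapsed = now - self._last_polled
        # 1. rate = number of operations / elapsed time in seconds
        rate = self._count / elapsed if elapsed > 0 else 0.0
        # 2. store the rate in the previous value field
        self.previous_value = rate
        # 3. reset the internal counter to zero
        self._count = 0
        # 4. set the last polled timestamp to the current time
        self._last_polled = now
        # 5. return the computed rate to the caller
        return rate
```

Because the counter is reset on every poll, each value covers only the interval since the previous poll.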
10




         The Metrics Framework –
         Metric Types – (2/3)
Metric Type                  Description

String (S)                   Static, text-based information that is never reset or
                             changed, e.g., the HBase version number, build date,
                             and so on.

Time varying integer (TVI)   The context keeps aggregating the value. When the value
                             is polled, it returns the accrued integer value and
                             resets to zero until it is polled again

Time varying long (TVL)      Same as TVI, but uses a long
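
The accrue-then-reset behavior of TVI and TVL could look like the minimal sketch below. The class and method names are hypothetical, and since Python integers are unbounded the same code stands in for both types.

```python
class TimeVaryingInt:
    """Illustrative time-varying counter (hypothetical, not the Hadoop class)."""

    def __init__(self):
        self._accrued = 0

    def inc(self, n=1):
        """Aggregate n more events into the running value."""
        self._accrued += n

    def poll(self):
        """Return the value accrued since the last poll and reset it to zero."""
        value, self._accrued = self._accrued, 0
        return value
```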
11


       The Metrics Framework –
       Metric Types – (3/3)
Metric Type                           Description

Time varying rate (TVR)               The number of operations or events and the time
                                      they required to complete. The values for
                                      operation count and accrued time are reset once
                                      the metric is polled

Persistent time varying rate (PTVR)   Same as TVR, but NOT reset on every poll
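
The difference between TVR and PTVR comes down to whether a poll clears the accumulated values. A hedged sketch, with a hypothetical `persistent` flag standing in for the two types:

```python
class TimeVaryingRate:
    """Illustrative TVR/PTVR metric (hypothetical, not the Hadoop class)."""

    def __init__(self, persistent=False):
        self.persistent = persistent   # True models PTVR, False models TVR
        self.num_ops = 0               # operations or events seen so far
        self.total_time_ms = 0         # time those operations took

    def inc(self, num_ops, time_ms):
        """Record num_ops operations that took time_ms in total."""
        self.num_ops += num_ops
        self.total_time_ms += time_ms

    def poll(self):
        """Return (operation count, accrued time); reset only when not persistent."""
        snapshot = (self.num_ops, self.total_time_ms)
        if not self.persistent:
            self.num_ops = 0
            self.total_time_ms = 0
        return snapshot
```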
12



         The Metrics Framework –
         Master Metrics
    The   master process exposes all metrics relating to its
      role in a cluster
Metric                 Property Name                   Description

Cluster requests (R)   hbase.master.cluster_requests   The total number of requests to the
                                                       cluster, aggregated across all region
                                                       servers
Split time (PTVR)      hbase.master.splitTime          The time it took to split the write-ahead
                                                       log files after a restart
Split size (PTVR)      hbase.master.splitSize          The total size of the write-ahead log
                                                       files that were split
13




      The Metrics Framework –
      Region Server Metrics
 A substantial number of metrics here
 Includes details about different parts of the overall
  architecture inside the server
 Grouped as follows
     Block cache metrics
     Compaction metrics
     Memstore metrics
     Store metrics
     I/O metrics
     Miscellaneous metrics
14




         Region Server Metrics –
         Block cache metrics – (1/2)
Metric         Property Name                               Description

count (LV)     hbase.regionserver.blockCacheCount          The number of blocks currently in
                                                           the cache
size (LV)      hbase.regionserver.blockCacheSize           The Java heap space occupied by
                                                           the blocks currently in the cache
free (LV)      hbase.regionserver.blockCacheFree           Remaining heap for the cache
evicted (LV)   hbase.regionserver.blockCacheEvictedCount   The number of blocks that had to
                                                           be removed because of heap size
                                                           constraints
15




       Region Server Metrics –
       Block cache metrics – (2/2)
Metric           Property Name                            Description

cache hit (LV)   hbase.regionserver.blockCacheHitCount    The number of block cache hits
miss (LV)        hbase.regionserver.blockCacheMissCount   The number of block cache misses
hit ratio (IV)   hbase.regionserver.blockCacheHitRatio    The number of cache hits in relation
                                                          to the total number of requests to
                                                          the cache
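
As a rough illustration of how the hit ratio relates to the hit and miss counters, the sketch below assumes that the total number of cache requests equals hits plus misses and that the ratio is reported as a whole-number percentage (the metric is an integer value). Both points are assumptions of this example, and the function name is hypothetical.

```python
def block_cache_hit_ratio(hit_count, miss_count):
    """Return cache hits as an integer percentage of all cache requests."""
    total = hit_count + miss_count   # assumption: every request is a hit or a miss
    if total == 0:
        return 0                     # no requests yet, so report 0 instead of dividing by zero
    return hit_count * 100 // total
```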
16




         Region Server Metrics –
         Compaction metrics
Metric                       Property Name                            Description

compaction size (PTVR)       hbase.regionserver.compactionSize        The total size (in bytes) of the storage
                                                                      files that have been compacted
compaction time (PTVR)       hbase.regionserver.compactionTime        How long that operation took
compaction queue size (IV)   hbase.regionserver.compactionQueueSize   How many files a region server
                                                                      currently has queued up for compaction
                                                                      (recommended for monitoring)

Compaction size and time are reported after a completed compaction run
17


         Region Server Metrics –
         Memstore metrics
Metric                  Property Name                       Description

memstore size MB (IV)   hbase.regionserver.memstoreSizeMB   The total heap space occupied by
                                                            all memstores (in online regions) for
                                                            the server, in megabytes
flush queue size (IV)   hbase.regionserver.flushQueueSize   The number of enqueued regions
                                                            that are to be flushed next
                                                            (recommended for monitoring)
flush size (PTVR)       hbase.regionserver.flushSize        The total size (in bytes) of the
                                                            memstore that has been flushed
flush time (PTVR)       hbase.regionserver.flushTime        The total time it took to flush
                                                            the memstore
18




       Region Server Metrics –
       Store metrics
Metric                          Property Name                             Description

store files (IV)                hbase.regionserver.storefiles             The total number of storage files,
                                                                          spread across all stores (regions)
                                                                          managed by the current server
stores (IV)                     hbase.regionserver.stores                 The total number of stores for the
                                                                          server, across all regions
store file index size MB (IV)   hbase.regionserver.storefileIndexSizeMB   The sum of the block index
                                                                          and optional meta index for all
                                                                          store files, in megabytes
19




         Region Server Metrics –
         I/O metrics
Metric                   Property Name                       Description

fs read latency (TVR)    hbase.regionserver.fsReadLatency    Filesystem read latency, e.g., the time it
                                                             takes to load a block from the storage files
fs write latency (TVR)   hbase.regionserver.fsWriteLatency   The same as above, but for write
                                                             operations, including the storage files
                                                             and write-ahead log
fs sync latency (TVR)    hbase.regionserver.fsSyncLatency    The latency to sync the write-ahead log
                                                             records to the filesystem

All numbers are in milliseconds
20


      Region Server Metrics –
      Miscellaneous metrics
Metric                     Property Name                          Description

read request count (LV)    hbase.regionserver.readRequestCount    The total number of read (such as
                                                                  get()) operations
write request count (LV)   hbase.regionserver.writeRequestCount   The total number of write (such as
                                                                  put()) operations
requests (R)               hbase.regionserver.requests            The actual request rate per second
regions (IV)               hbase.regionserver.regions             The number of regions that are
                                                                  currently online and hosted by this
                                                                  region server
21




   The Metrics Framework –
   RPC Metrics
Metric                Property Name                   Description

RPC Processing Time   rpc.metrics.RpcProcessingTime   The average time it took to
                                                      process the RPCs on the server
                                                      side
RPC Queue Time        rpc.metrics.RpcQueueTime        The time between a call's arrival
                                                      and when it is actually processed,
                                                      which is the queue time
                                                      (recommended for monitoring)
22




The Metrics Framework –
JVM Metrics
 When tuning the JVM settings to optimize your
  HBase setup
     You need to know what is going on in the
      cluster
 Grouped as follows
     Memory usage metrics
     Garbage collection metrics
     Thread metrics
     System event metrics
23


         JVM Metrics –
         Memory usage metrics
Metric                      Property Name

Non-heap used memory        jvm.RegionServer.metrics.memNonHeapUsedM
Non-heap committed memory   jvm.RegionServer.metrics.memNonHeapCommittedM
Heap used memory            jvm.RegionServer.metrics.memHeapUsedM
Heap committed memory       jvm.RegionServer.metrics.memHeapCommittedM

For what used versus committed memory means, see
http://docs.oracle.com/javase/6/docs/api/java/lang/management/MemoryUsage.html
24


     JVM Metrics –
     Garbage collection metrics
• The garbage collection process causes so-called stop-the-world pauses
  in certain steps

    • These are difficult to handle when a system is bound by tight SLAs

    • Problems arise when these pauses approach the multiminute range, because this can
       cause a region server to miss its ZooKeeper lease renewal,
       forcing the master to take evasive actions
         • The so-called "Juliet Pause"
Metric           Property Name                           Description
gc count         jvm.RegionServer.metrics.gcCount        The number of garbage collections
gc time millis   jvm.RegionServer.metrics.gcTimeMillis   The accumulated time spent in
                                                         garbage collection
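
Because gc time millis is an accumulated value, what happened between two polls only shows up when consecutive samples are subtracted. A minimal sketch, assuming each sample is a `(gc_count, gc_time_millis)` pair read from the two metrics above; the function name and tuple layout are hypothetical.

```python
def gc_delta(prev, curr):
    """Compare two (gc_count, gc_time_millis) samples taken at different times.

    Returns (collections, time_ms, avg_pause_ms) for the interval between them.
    """
    collections = curr[0] - prev[0]
    time_ms = curr[1] - prev[1]
    # Average time per collection in the interval; 0.0 when nothing was collected.
    avg_pause_ms = time_ms / collections if collections > 0 else 0.0
    return collections, time_ms, avg_pause_ms
```

The average can hide a single long pause, so it is worth looking at the total time for the interval as well.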
25




     JVM Metrics – Thread metrics
Metric                Property Name

new state             jvm.RegionServer.metrics.threadsNew
runnable state        jvm.RegionServer.metrics.threadsRunnable
blocked state         jvm.RegionServer.metrics.threadsBlocked
waiting state         jvm.RegionServer.metrics.threadsWaiting
timed waiting state   jvm.RegionServer.metrics.threadsTimedWaiting
terminated state      jvm.RegionServer.metrics.threadsTerminated

The count for each possible thread state, including new, runnable, blocked, and so on.
For details on thread states, see
http://www.programcreek.com/2009/03/thread-status/
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.State.html
26




     JVM Metrics –
     System event metrics
Metric      Property Name

log fatal   jvm.RegionServer.metrics.logFatal
log error   jvm.RegionServer.metrics.logError
log warn    jvm.RegionServer.metrics.logWarn
log info    jvm.RegionServer.metrics.logInfo

System event metrics provide counts for various log-level events,
e.g., the log error metric provides the number of log events that
occurred on the error level.
27



The Metrics Framework –
Info Metrics
 Only   accessible through JMX
28




The Metrics Framework
 If you find other metrics not listed here
      Please refer to the API docs directly…
      http://hbase.apache.org/apidocs/index.html?overview-summary.html
29




 Tools - Ganglia

A  distributed, scalable monitoring system
 suitable for large cluster systems

 HBase inherits its native support for Ganglia
 directly from Hadoop
30




     Ganglia – Three components
   Ganglia monitoring daemon (gmond)
       Runs on every machine that is monitored
       Collects the local data and prepares the statistics to be
        polled by other systems

   Ganglia meta daemon (gmetad)
       Is installed on a central node
       Acts as the federation node to the entire cluster
       Polls from one or more monitoring daemons to receive the
        current cluster status

   Ganglia PHP web frontend
       Retrieves the combined statistics from the meta daemon
        and presents them as HTML
31



  Ganglia - Installation




http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_quick_start
32




Tools - Nagios
 Polls current metrics on a regular basis
  and compares them with given thresholds
 Once the thresholds are exceeded, it
  will start evasive actions
     Ranging from sending out emails or SMS
      messages to telephones, to triggering
      scripts, or even physically rebooting the
      server when necessary
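
The compare-against-thresholds step can be sketched as below. The status codes follow the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL); the function name, the message format, and the example thresholds are hypothetical, and how the metric value is obtained is left out.

```python
OK, WARNING, CRITICAL = 0, 1, 2   # conventional Nagios plugin status codes


def check_threshold(name, value, warn, crit):
    """Compare one polled metric value with warning and critical thresholds.

    Returns (status_code, message); a real check would exit with status_code.
    """
    if value >= crit:
        return CRITICAL, f"CRITICAL - {name}={value} (threshold {crit})"
    if value >= warn:
        return WARNING, f"WARNING - {name}={value} (threshold {warn})"
    return OK, f"OK - {name}={value}"


if __name__ == "__main__":
    # Example with made-up numbers for the compaction queue size metric.
    status, message = check_threshold("compactionQueueSize", 12, warn=10, crit=20)
    print(message)
```

The queue-size metrics marked "recommended for monitoring" earlier are natural candidates for this kind of check.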
33




Tools - JMX
 Java Management Extensions technology
     The standard for Java applications to
      export their status
     Also has the ability to provide operations
 Common tools for JMX
     JConsole
     JMXToolkit


http://hbase.apache.org/metrics.html
34


     Hands-on
 Use the Ganglia “Aggregate Graphs” feature
    Title the graph with your name
    Include 5 hosts
    Use any two metrics
    Crop the image file, just like this sample


 Put the image file into Git
    YOUR_HOME=${GIT_ROOT}/hbase-training/005/hands-on/<your_name>
    mkdir -p ${YOUR_HOME}
    Put your hands-on into ${YOUR_HOME}

005 cluster monitoring

  • 2. 2 Agenda  Course Credit  Introduction  Metrics Framework  Tools  Tools on wiki http://wiki.spn.tw.trendnet.org/wiki/Hadoop_ Related_Web_Site_List
  • 3. 3 Course Credit  Show up, 30 scores  Ask question, each question earns 5 scores  Hands-on, 40 scores  70 scores will pass this course  Each course credit will be calculated once for each course finished  The course credit will be sent to you and your supervisor by mail
  • 4. 4 Introduction – (1/2)  Using a cluster without monitoring and metrics is…  the same as driving a car while blindfolded  It is great to run load tests against your HBase cluster  need to correlate the cluster’s performance with what the system is doing under the hood
  • 5. 5 Introduction – (2/2)  Graphing  Captures the exposed metrics of a system and displays them in visual charts  A picture speaks a thousand words  Are good for historical, quantitative data  Monitoring  Still difficult to see what a system is doing right now  Qualitative data is needed, which is handled by the monitoring kind of support systems  Sends out emails to various recipients  SMS messages to telephones  Does something by customized scripts
  • 6. 6 The Metrics Framework – Basic Classes from Hadoop
  • 7. 7 The Metrics Framework – Extended Classes in HBase
  • 8. 8 The Metrics Framework – Classes Collaboration
  • 9. 9 The Metrics Framework – Metric Types – (1/3) Metric Type Description Integer value (IV) An integer counter. Only updated when the value changes Long value (LV) A long counter. Only updated when the value changes Rate (R) A float value representing a rate. 1. The rate is calculated as number of operations / elapsed time in seconds. 2. The rate is stored in the previous value field. 3. The internal counter is reset to zero. 4. The last polled timestamp is set to the current time. 5. The computed rate is returned to the caller.
  • 10. 10 The Metrics Framework – Metric Types – (2/3) Metric Type Description String (S) Static, text-based information and never reset nor changed. E.g., HBase version number, build date, and so on. Time varying The context keeps aggregating the value. When the value is integer (TVI) polled it returns the accrued integer value, and resets to zero, until it is polled again Time varying Same as TVI, but uses Long long (TVL)
  • 11. 11 The Metrics Framework – Metric Types – (3/3) Metric Type Description Time varying The number of operations or events and the time they rate (TVR) required to complete. The values for operation count and time accrued are reset once the metric is polled Persistent time Same as TVR, but NOT reset for every poll varying rate (PTVR)
  • 12. 12 The Metrics Framework – Master Metrics  The master process exposes all metrics relating to its role in a cluster Metric Property Name Description Cluster hbase.master.clust The total number of requests to the requests (R) er_requests cluster, aggregated across all region servers Split time hbase.master.splitTi The time it took to split the write-ahead (PTVR) me log files after a restart Split size hbase.master.splitSi The total size of the write-ahead log files (PTVR) ze that were split
  • 13. The Metrics Framework – Region Server Metrics
      • The region servers expose a substantial number of metrics
      • They include details about different parts of the overall architecture inside the server
      • They fall into the following groups
          • Block cache metrics
          • Compaction metrics
          • Memstore metrics
          • Store metrics
          • I/O metrics
          • Miscellaneous metrics
  • 14. Region Server Metrics – Block cache metrics – (1/2)
      Metric | Property Name | Description
      count (LV) | hbase.regionserver.blockCacheCount | The number of blocks currently in the cache
      size (LV) | hbase.regionserver.blockCacheSize | The size of the blocks currently in the cache, i.e., the Java heap space they occupy
      free (LV) | hbase.regionserver.blockCacheFree | Remaining heap for the cache
      evicted (LV) | hbase.regionserver.blockCacheEvictedCount | The number of blocks that had to be removed because of heap size constraints
  • 15. Region Server Metrics – Block cache metrics – (2/2)
      Metric | Property Name | Description
      cache hit (LV) | hbase.regionserver.blockCacheHitCount | The number of cache block hits
      miss (LV) | hbase.regionserver.blockCacheMissCount | The number of cache block misses
      hit ratio (IV) | hbase.regionserver.blockCacheHitRatio | The number of cache hits in relation to the total number of requests to the cache
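The hit ratio relates the hit count to all cache requests (hits plus misses). A small sketch of that arithmetic follows. The function name is hypothetical, and reporting the ratio as a truncated integer percentage is an assumption based on the metric being an integer value (IV).

```python
def block_cache_hit_ratio(hit_count, miss_count):
    """Return cache hits as an integer percentage of all cache requests.

    Assumption: the ratio is a whole percentage, truncated toward zero.
    """
    total = hit_count + miss_count
    if total == 0:
        return 0  # no requests yet, so there is nothing to relate hits to
    return hit_count * 100 // total
```

For example, 90 hits and 10 misses give a ratio of 90.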
  • 16. Region Server Metrics – Compaction metrics
      Metric | Property Name | Description
      compaction size (PTVR) | hbase.regionserver.compactionSize | The total size (in bytes) of the storage files that have been compacted
      compaction time (PTVR) | hbase.regionserver.compactionTime | How long that operation took
      compaction queue size (IV) | hbase.regionserver.compactionQueueSize | How many files a region server currently has queued up for compaction (recommended for monitoring)
      Note: compaction size and compaction time are reported after a completed compaction run.
  • 17. Region Server Metrics – Memstore metrics
      Metric | Property Name | Description
      memstore size MB (IV) | hbase.regionserver.memstoreSizeMB | The total heap space occupied by all memstores (in online regions) for the server, in megabytes
      flush queue size (IV) | hbase.regionserver.flushQueueSize | The number of enqueued regions that are being flushed next (recommended for monitoring)
      flush size (PTVR) | hbase.regionserver.flushSize | The total size (in bytes) of the memstore that has been flushed
      flush time (PTVR) | hbase.regionserver.flushTime | The total time it took to flush the memstore
  • 18. Region Server Metrics – Store metrics
      Metric | Property Name | Description
      store files (IV) | hbase.regionserver.storefiles | The total number of storage files, spread across all stores (regions) managed by the current server
      stores (IV) | hbase.regionserver.stores | The total number of stores for the server, across all regions
      store file index size MB (IV) | hbase.regionserver.storefileIndexSizeMB | The sum of the block index, and optional meta index, for all store files, in megabytes
  • 19. Region Server Metrics – I/O metrics
      Metric | Property Name | Description
      fs read latency (TVR) | hbase.regionserver.fsReadLatency | Filesystem read latency, e.g., the time it takes to load a block from the storage files
      fs write latency (TVR) | hbase.regionserver.fsWriteLatency | The same as above, but for write operations, including the storage files and write-ahead log
      fs sync latency (TVR) | hbase.regionserver.fsSyncLatency | The latency to sync the write-ahead log records to the filesystem
      Note: all numbers are in milliseconds.
  • 20. Region Server Metrics – Miscellaneous metrics
      Metric | Property Name | Description
      read request count (LV) | hbase.regionserver.readRequestCount | The total number of read (such as get()) operations
      write request count (LV) | hbase.regionserver.writeRequestCount | The total number of write (such as put()) operations
      requests (R) | hbase.regionserver.requests | The actual request rate per second
      regions (IV) | hbase.regionserver.regions | The number of regions that are currently online and hosted by this region server
  • 21. The Metrics Framework – RPC Metrics
      Metric | Property Name | Description
      RPC Process Time | rpc.metrics.RpcProcessingTime | The average time it took to process the RPCs on the server side
      RPC Queue Time | rpc.metrics.RpcQueueTime | The queue time, i.e., the time between a call's arrival and the moment it is actually processed (recommended for monitoring)
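To make the two RPC timings concrete, the sketch below derives both from three timestamps per call. The `RpcCall` record, its field names, and the `average` helper are hypothetical and only illustrate the definitions above. They are not part of the HBase RPC code.

```python
from dataclasses import dataclass


@dataclass
class RpcCall:
    """Hypothetical record of one RPC, with timestamps in milliseconds."""
    arrived_ms: int    # when the call arrived at the server
    started_ms: int    # when a handler actually started processing it
    finished_ms: int   # when processing completed

    @property
    def queue_time_ms(self):
        # Time spent waiting between arrival and the start of processing
        return self.started_ms - self.arrived_ms

    @property
    def processing_time_ms(self):
        # Time spent processing on the server side
        return self.finished_ms - self.started_ms


def average(values):
    """Average of an iterable of numbers, or 0.0 if it is empty."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

A growing average queue time suggests that calls arrive faster than the handlers can pick them up, which is a plausible reason for the slide recommending it for monitoring.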
  • 22. The Metrics Framework – JVM Metrics
      • To tune the JVM settings and optimize your HBase setup, you need to know what is going on in the cluster
      • The JVM metrics fall into the following groups
          • Memory usage metrics
          • Garbage collection metrics
          • Thread metrics
          • System event metrics
  • 23. JVM Metrics – Memory usage metrics
      Metric | Property Name
      Non-heap used memory | jvm.RegionServer.metrics.memNonHeapUsedM
      Non-heap committed memory | jvm.RegionServer.metrics.memNonHeapCommittedM
      Heap used memory | jvm.RegionServer.metrics.memHeapUsedM
      Heap committed memory | jvm.RegionServer.metrics.memHeapCommittedM
      Description (all rows): for what used versus committed memory means, see http://docs.oracle.com/javase/6/docs/api/java/lang/management/MemoryUsage.html
  • 24. JVM Metrics – Garbage collection metrics
      • The garbage collection process causes so-called stop-the-world pauses in certain steps
      • These pauses are difficult to handle when a system is bound by tight SLAs
      • Pauses that approach the multiminute range can cause a region server to miss its ZooKeeper lease renewal, forcing the master to take evasive actions
      • This is the so-called "Juliet Pause"
      Metric | Property Name | Description
      gc count | jvm.RegionServer.metrics.gcCount | The number of garbage collections
      gc time millis | jvm.RegionServer.metrics.gcTimeMillis | The accumulated time spent in garbage collection
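Since gc time millis is an accumulated value, a monitor has to compare two consecutive samples to see how much time was spent in garbage collection during the last interval. The sketch below does that and rates the result against a ZooKeeper session timeout. The function names, the `warn_fraction` parameter, and the idea of using the session timeout as the critical threshold are assumptions for illustration. Note that the delta is only an upper bound on the longest single pause in the interval.

```python
def gc_time_delta(prev_gc_time_ms, curr_gc_time_ms):
    """Milliseconds spent in GC between two samples of the accumulated value.

    A negative difference (e.g. after a process restart) is treated as 0.
    """
    return max(0, curr_gc_time_ms - prev_gc_time_ms)


def gc_risk(prev_gc_time_ms, curr_gc_time_ms, session_timeout_ms,
            warn_fraction=0.5):
    """Classify the GC time of the last interval as ok, warning, or critical.

    Assumption: GC time at or above the ZooKeeper session timeout is critical,
    and GC time at or above warn_fraction of it is worth a warning.
    """
    delta = gc_time_delta(prev_gc_time_ms, curr_gc_time_ms)
    if delta >= session_timeout_ms:
        return "critical"
    if delta >= session_timeout_ms * warn_fraction:
        return "warning"
    return "ok"
```

The timeout passed in should match whatever session timeout the cluster actually uses. No particular value is implied here.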
  • 25. JVM Metrics – Thread metrics
      Metric | Property Name
      new state | jvm.RegionServer.metrics.threadsNew
      runnable state | jvm.RegionServer.metrics.threadsRunnable
      blocked state | jvm.RegionServer.metrics.threadsBlocked
      waiting state | jvm.RegionServer.metrics.threadsWaiting
      timed waiting state | jvm.RegionServer.metrics.threadsTimedWaiting
      terminated state | jvm.RegionServer.metrics.threadsTerminated
      Description (all rows): the count for each possible thread state, including new, runnable, blocked, and so on. You can refer to the following docs:
          http://www.programcreek.com/2009/03/thread-status/
          http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.State.html
  • 26. JVM Metrics – System event metrics
      Metric | Property Name
      log fatal | jvm.RegionServer.metrics.logFatal
      log error | jvm.RegionServer.metrics.logError
      log warn | jvm.RegionServer.metrics.logWarn
      log info | jvm.RegionServer.metrics.logInfo
      Description (all rows): system event metrics provide counts for various log-level events, e.g., the log error metric provides the number of log events that occurred on the error level.
  • 27. The Metrics Framework – Info Metrics
      • Only accessible through JMX
  • 28. The Metrics Framework
      • If you find other metrics not listed here, please refer to the API docs directly
      • http://hbase.apache.org/apidocs/index.html?overview-summary.html
  • 29. Tools - Ganglia
      • A distributed, scalable monitoring system suitable for large cluster systems
      • HBase inherits its native support for Ganglia directly from Hadoop
  • 30. Ganglia – Three components
      • Ganglia monitoring daemon (gmond)
          • Runs on every machine that is monitored
          • Collects the local data and prepares the statistics to be polled by other systems
      • Ganglia meta daemon (gmetad)
          • Is installed on a central node
          • Acts as the federation node to the entire cluster
          • Polls one or more monitoring daemons to receive the current cluster status
      • Ganglia PHP web frontend
          • Retrieves the combined statistics from the meta daemon and presents them as HTML
  • 31. Ganglia - Installation
      • http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_quick_start
  • 32. Tools - Nagios
      • Polls current metrics on a regular basis and compares them with given thresholds
      • Once the thresholds are exceeded, it starts evasive actions
      • These range from sending out emails and SMS messages to telephones, to triggering scripts, or even physically rebooting the server when necessary
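The compare-with-thresholds step can be sketched as a small check function in the style of a Nagios plugin. The exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN) follow the usual Nagios plugin convention. The function name, the message format, and the example metric and thresholds are assumptions for illustration, and how the metric value is obtained is left out.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_threshold(name, value, warn, crit):
    """Compare one metric value with its thresholds.

    Returns (state, message), where state is a Nagios-style exit code.
    A value of None means the metric could not be read.
    """
    if value is None:
        return UNKNOWN, f"UNKNOWN - {name} not available"
    if value >= crit:
        state = CRITICAL
    elif value >= warn:
        state = WARNING
    else:
        state = OK
    return state, f"{LABELS[state]} - {name}={value}"
```

A real plugin script would print the message and exit with the returned state, for example after checking one of the metrics recommended for monitoring earlier, such as the compaction queue size. Nagios then decides which notification or script to trigger.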
  • 33. Tools - JMX
      • Java Management Extensions technology
      • The standard for Java applications to export their status
      • Also has the ability to provide operations
      • Common tools for JMX
          • JConsole
          • JMXToolkit
      • http://hbase.apache.org/metrics.html
  • 34. Hands-on
      • Use the Ganglia “Aggregate Graphs” feature
          • Title the graph with your name
          • Include 5 hosts
          • Use any two metrics
      • Crop the image file, just like this sample
      • Put the image file into Git
          • YOUR_HOME=${GIT_ROOT}/hbase-training/005/hands-on/<your_name>
          • mkdir ${YOUR_HOME}
          • Put your hands-on into ${YOUR_HOME}