Monitoring and Troubleshooting
  7/6/2012

© 2012 MapR Technologies   Troubleshooting 1
Monitoring & Troubleshooting
   Agenda
   • Cluster Monitoring Tools
   • Troubleshooting MapReduce Jobs
   • Troubleshooting Scenarios
   • Working with MapR Support
   • Things to Avoid




Monitoring & Troubleshooting
   Objectives
   At the end of this module you will be able to:
   • Identify the tools you can use to monitor your cluster
   • Explain how MapR central logging can help you monitor MapReduce jobs
   • Describe several common troubleshooting scenarios and how to resolve
     issues based on these scenarios
   • List the tools you can use to work with MapR Support




Cluster Monitoring Tools




Monitoring Tools

         Built-In Tools
          – MapR Control System
          – MapR Metrics

         3rd Party Tools
          – Nagios
          – Ganglia




MapR Control System

         MapR Control System
          –   Dashboard with cluster overview
              • Node health
              • MapR-FS and available disks
              • Resource utilization
                  –   bandwidth
                  –   disk space
                  –   CPU
              • MapReduce job status
              • Alarms




MapR Control System




MapR Metrics

         MapR Metrics
          –   View performance information about Hadoop jobs
              • Predict cluster usage
              • Measure which jobs consume resources
              • Troubleshoot failures & performance issues
          –   Metrics provided on
              •   Cumulative CPU/memory usage
              •   # of running/failed tasks/attempts
              •   Speed of input, output, and shuffle
              •   Duration of task attempts
              •   Data read, written, or shuffled
              •   Memory in use
              •   Number of records skipped/spilled

MapR Metrics




3rd Party Tools

          Nagios
           –   Configuration script generator
          Ganglia
           –   CLDB reports the metrics
           –   MapRGangliaContext
           –   Only need gmond on CLDB node




MapR Service Logs

          /opt/mapr/logs
          For example:
           – CLDB
           – Warden
           – FileServer (mfs)
           – NFS
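A quick way to triage the service logs is to count error lines per file. The sketch below assumes the logs end in .log and mark severity with the strings ERROR and FATAL (a log4j-style convention); the function name scan_logs is hypothetical and not part of MapR.

```shell
# Count ERROR and FATAL lines in each *.log file under a directory.
# Usage: scan_logs [log_dir]   (defaults to /opt/mapr/logs)
scan_logs() {
    log_dir="${1:-/opt/mapr/logs}"
    for f in "$log_dir"/*.log; do
        # Skip the unexpanded glob when the directory has no .log files.
        [ -f "$f" ] || continue
        # grep -c exits non-zero on zero matches but still prints 0.
        n=$(grep -c -E 'ERROR|FATAL' "$f" || true)
        printf '%s %s\n' "$n" "$f"
    done
}

# Example: list the noisiest logs first.
#   scan_logs /opt/mapr/logs | sort -rn | head
```

Files with a high count are a reasonable place to start reading; a count of zero does not prove the service is healthy.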




Troubleshooting MapReduce Jobs



Central Logging

          MapR 2.0 introduces central logging
           –   Log files written to “local” volume on MapR-FS
               •   replication factor = 1
                   –   I/O confined to node
           – /var/mapr/local/<host>/logs/mapred/userlogs
           – Configurable via JobTracker variable
               •   mapr.localvolumes.path




Central Logging

          New CLI for MapReduce logs
               maprcli job linklogs -jobid <jobPattern> -todir
               <maprfsDir> [ -jobconf <pathToJobXml>]
           – Create a job-centric view of all logs on all involved TaskTracker nodes
           – Creates the following structure under <maprfsDir> for all <jobid>’s
             matching <jobPattern>
               •   <jobid>/hosts/<host>/
                   –   symbolic links to log directories of tasks executed for <jobid> on <host>
               •   <jobid>/mappers/
                   –   symbolic links to log directories of all map task attempts for <jobid> across the
                       cluster
               •   <jobid>/reducers/
                   –   symbolic links to log directories of all reduce task attempts for <jobid> across the
                       cluster
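To make the resulting layout concrete, the sketch below prints the directories described above for one job. The job ID, target directory, and host names are hypothetical, and linklogs_layout is an illustrative helper, not a MapR tool; the maprcli call itself appears only as a comment because it needs a running cluster.

```shell
# Hypothetical values for illustration only.
JOB_ID="job_201207060000_0001"
LINK_DIR="/myvolume/joblogs"

# The command from the slide with the placeholders filled in (not run here):
#   maprcli job linklogs -jobid "$JOB_ID" -todir "$LINK_DIR"

# Print the directories the slide lists under <maprfsDir> for one job.
# Usage: linklogs_layout <maprfsDir> <jobid> [host ...]
linklogs_layout() {
    dir="$1"; job="$2"; shift 2
    for host in "$@"; do
        printf '%s/%s/hosts/%s/\n' "$dir" "$job" "$host"
    done
    printf '%s/%s/mappers/\n' "$dir" "$job"
    printf '%s/%s/reducers/\n' "$dir" "$job"
}

linklogs_layout "$LINK_DIR" "$JOB_ID" node1 node2
```

Browsing mappers/ or reducers/ gives a cluster-wide view of one task type, while hosts/<host>/ narrows the view to a single suspect node.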


Troubleshooting Scenarios



Troubleshooting Scenarios

          Slow nodes
          Out of memory
          Out of disk space
          Time skew
          No ZooKeeper quorum
          Contention for resources
          Requirements not met




Identifying Slow Nodes

          Before installation:
           –   Use dd to benchmark raw disk speed (writing to /dev/sd<x> destroys its contents)
               •   dd bs=4M if=/dev/zero of=/dev/sd<x>

           –   Compare performance across nodes to test network throughput:
               •   dd bs=4M if=/dev/zero | sudo ssh root@node 'dd bs=4M of=/dev/foo'

          After installation:
           – Look at task starting and completion times
           – Look in system logs for memory or CPU problems
           – Look at the performance of writes to the local volume
             (where intermediate data goes)
          Slow disks identified based on a threshold in mfs.conf
           –   May really be slow NIC
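For a rough comparison between nodes that does not touch a raw device, dd can write to a scratch file instead. This is a minimal sketch assuming GNU dd (for bs=1M and conv=fsync); bench_write is a hypothetical helper name, and the figure it reports includes filesystem overhead, so treat it as relative rather than absolute.

```shell
# Write zeros to a scratch file and print dd's summary line (size, time, rate).
# Usage: bench_write <directory> [megabytes]
bench_write() {
    target_dir="$1"
    mb="${2:-64}"
    scratch="$target_dir/dd_bench.$$"
    # conv=fsync flushes data to disk before dd reports,
    # so the page cache does not inflate the result.
    dd if=/dev/zero of="$scratch" bs=1M count="$mb" conv=fsync 2>&1 | tail -n 1
    rm -f "$scratch"
}

# Example: run the same test on every node and compare the reported rates.
#   bench_write /tmp 256
```

A node whose rate is far below its peers is a candidate slow node; repeating the run a few times helps rule out transient load.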


Out of Memory

          Make sure there is enough swap space
          See if a memory-intensive job is running
          Use ulimit to make sure there are no limits on the number of file
           descriptors, resource usage, and the number of processes
          Garbage collection can result in out-of-memory errors
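The limits and swap space mentioned above can be read directly from the shell. The sketch below assumes Linux (it reads /proc/meminfo) and reports the limits of the current shell only, which may differ from those of the account that runs the MapR services; show_limits is a hypothetical helper name.

```shell
# Print file-descriptor and process limits plus total swap for this shell.
show_limits() {
    printf 'open files (ulimit -n): %s\n' "$(ulimit -n)"
    # Some shells do not support -u, so fall back to a note instead of failing.
    procs=$(ulimit -u 2>/dev/null) || procs="unavailable in this shell"
    printf 'max processes (ulimit -u): %s\n' "$procs"
    if [ -r /proc/meminfo ]; then
        awk '/^SwapTotal:/ { print "swap total (kB): " $2 }' /proc/meminfo
    fi
}

show_limits
```

To see the limits that apply to a service, run the same function as the user that starts it.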




Out of Disk Space

          MapR logs go to /opt/mapr/logs
           – If this partition is too small, space can run out
           – Set up a cron job to clean out old logs
           – Move to a larger partition
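A cleanup job of the kind suggested above could look like the following sketch. The 30-day retention, the script path in the crontab line, and the function name clean_old_logs are assumptions for illustration; the -delete action requires GNU or BSD find.

```shell
# Delete regular files last modified more than max_days ago under log_dir,
# printing each path as it is removed.
# Usage: clean_old_logs <log_dir> [max_days]
clean_old_logs() {
    log_dir="$1"
    max_days="${2:-30}"
    find "$log_dir" -type f -mtime +"$max_days" -print -delete
}

# Example crontab entry (hypothetical script path), daily at 02:00:
#   0 2 * * * /usr/local/bin/clean_mapr_logs.sh
# where the script calls: clean_old_logs /opt/mapr/logs 30
```

Running the find command without -delete first shows what would be removed, which is a sensible check before scheduling it.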




Time Skew

          NTP is your friend
          Maximum allowed time difference between nodes is 20 seconds
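The 20-second limit can be checked by comparing epoch timestamps from two nodes. This is a minimal sketch: skew_ok is a hypothetical helper, the host name node2 is a placeholder, and the ssh round trip adds a little to the measured difference.

```shell
# Succeed when two epoch timestamps differ by at most max_skew seconds.
# Usage: skew_ok <local_epoch> <remote_epoch> [max_skew]   (default 20)
skew_ok() {
    diff=$(( $1 - $2 ))
    # Take the absolute value so the order of the arguments does not matter.
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    [ "$diff" -le "${3:-20}" ]
}

# Example against a hypothetical host (requires ssh access):
#   skew_ok "$(date +%s)" "$(ssh node2 date +%s)" || echo "node2 skew exceeds 20 s"
```

With NTP running on every node the difference should normally be well under a second, so a failure here usually points at a stopped or misconfigured NTP service.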




No ZooKeeper Quorum

          Not enough ZooKeepers running
          configure.sh run improperly
           –   Different ZooKeeper or CLDB nodes specified
          Network problem
           –   Hostname resolution
           –   Physical connection down
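"Not enough ZooKeepers running" has a precise meaning: a quorum needs a strict majority of the configured servers. The helper names below are hypothetical; the arithmetic is the standard majority rule.

```shell
# Smallest number of running servers that forms a quorum (strict majority).
# Usage: quorum_size <total_servers>
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Succeed when <running> of <total> servers are enough for a quorum.
# Usage: has_quorum <running> <total>
has_quorum() {
    [ "$1" -ge "$(quorum_size "$2")" ]
}

quorum_size 3   # prints 2
quorum_size 5   # prints 3
```

This is why an ensemble of three tolerates one failed ZooKeeper and an ensemble of five tolerates two, while an even count adds no extra tolerance.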




Contention for Resources

          Make sure there’s no limit on file descriptors, processes
          Make sure the service layout follows good guidelines
           – Don’t run ZooKeeper with CLDB or JobTracker
           – Fewer task slots when running TaskTracker with CLDB or ZooKeeper
           – Avoid running the active JobTracker on the primary CLDB node

          Don’t run other random things on cluster nodes
          Don’t mix distributions




Requirements Not Met

          Use Sun Java JDK
          Same users/groups with same UID/GID numbers on all nodes
          Proper licensing
          Host resolution between all nodes
           –   DNS or /etc/hosts
          Keyless ssh between all nodes for the root user
          All necessary ports open
           –   Watch out for iptables and selinux
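The UID/GID requirement is easy to verify once the IDs from each node are collected in one place. In this sketch the function uid_mismatches, the user name mapr, and the host names are all hypothetical; the input format is one "host uid gid" line per node, with the first line treated as the reference.

```shell
# Read "host uid gid" lines on stdin and print every host whose
# uid or gid differs from the first line.
uid_mismatches() {
    awk 'NR == 1 { uid = $2; gid = $3; next }
         $2 != uid || $3 != gid { print $1 }'
}

# Gathering the input over ssh for a hypothetical user and host list:
#   for h in node1 node2 node3; do
#       echo "$h $(ssh "$h" id -u mapr) $(ssh "$h" id -g mapr)"
#   done | uid_mismatches
```

An empty result means every node matches the first one; any host printed needs its user or group recreated with the reference IDs.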




Working with MapR Support



Working with MapR Support

          mapr-support-collect and mapr-support-dump
          fsck and gfsck




Things to Avoid




Things to Avoid

          Removing ZooKeeper data manually
          Formatting disks (unless you are sure)
          Running configure.sh incorrectly
          Using dd on an installed node
          Modifying configuration files
           – Without a good reason
           – Inconsistently




Questions





Weitere ähnliche Inhalte

Was ist angesagt?

NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsJason Shao
 
MapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR HadoopMapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR Hadoopabord
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityEdureka!
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distributionmcsrivas
 
Spark tunning in Apache Kylin
Spark tunning in Apache KylinSpark tunning in Apache Kylin
Spark tunning in Apache KylinShi Shao Feng
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to YarnOmid Vahdaty
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterAltoros
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7Ted Dunning
 
Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Shivkumar Babshetty
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoopmcsrivas
 
Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HAHortonworks
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranMapR Technologies
 
Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014Tsuyoshi OZAWA
 

Was ist angesagt? (20)

12a architecture
12a architecture12a architecture
12a architecture
 
Hadoop Internals
Hadoop InternalsHadoop Internals
Hadoop Internals
 
10c introduction
10c introduction10c introduction
10c introduction
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
MapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR HadoopMapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR Hadoop
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High Availability
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distribution
 
Spark tunning in Apache Kylin
Spark tunning in Apache KylinSpark tunning in Apache Kylin
Spark tunning in Apache Kylin
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop Cluster
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoop
 
Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HA
 
Anatomy of Hadoop YARN
Anatomy of Hadoop YARNAnatomy of Hadoop YARN
Anatomy of Hadoop YARN
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
 
Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014
 

Andere mochten auch

Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniquesijsrd.com
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008athusoo
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaCloudera, Inc.
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 HiveNamit Jain
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on HadoopPaco Nathan
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideIBM
 

Andere mochten auch (8)

Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 

Ähnlich wie 70a monitoring & troubleshooting

10c introduction
10c introduction10c introduction
10c introductionInyoung Cho
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Taming Latency: Case Studies in MapReduce Data Analytics
Taming Latency: Case Studies in MapReduce Data AnalyticsTaming Latency: Case Studies in MapReduce Data Analytics
Taming Latency: Case Studies in MapReduce Data AnalyticsEMC
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Etu Solution
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...Big Data Montreal
 
BDAS RDD study report v1.2
BDAS RDD study report v1.2BDAS RDD study report v1.2
BDAS RDD study report v1.2Stefanie Zhao
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillTomer Shiran
 
Coredns nodecache - A highly-available Node-cache DNS server
Coredns nodecache - A highly-available Node-cache DNS serverCoredns nodecache - A highly-available Node-cache DNS server
Coredns nodecache - A highly-available Node-cache DNS serverYann Hamon
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around HadoopDataWorks Summit
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014cdmaxime
 

Ähnlich wie 70a monitoring & troubleshooting (20)

48a tuning
48a tuning48a tuning
48a tuning
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
22 configuration
22 configuration22 configuration
22 configuration
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
10c introduction
10c introduction10c introduction
10c introduction
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Taming Latency: Case Studies in MapReduce Data Analytics
Taming Latency: Case Studies in MapReduce Data AnalyticsTaming Latency: Case Studies in MapReduce Data Analytics
Taming Latency: Case Studies in MapReduce Data Analytics
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
HBase with MapR
HBase with MapRHBase with MapR
HBase with MapR
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
 
BDAS RDD study report v1.2
BDAS RDD study report v1.2BDAS RDD study report v1.2
BDAS RDD study report v1.2
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Coredns nodecache - A highly-available Node-cache DNS server
Coredns nodecache - A highly-available Node-cache DNS serverCoredns nodecache - A highly-available Node-cache DNS server
Coredns nodecache - A highly-available Node-cache DNS server
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
 

Mehr von mapr-academy

42 lab-managing services
42 lab-managing services42 lab-managing services
42 lab-managing servicesmapr-academy
 
41a managing services
41a managing services41a managing services
41a managing servicesmapr-academy
 
30a accessing your cluster
30a accessing your cluster30a accessing your cluster
30a accessing your clustermapr-academy
 
3 map r installation & setup administration course description
3 map r installation & setup administration course description3 map r installation & setup administration course description
3 map r installation & setup administration course descriptionmapr-academy
 

Mehr von mapr-academy (8)

53 lab-nfs
53 lab-nfs53 lab-nfs
53 lab-nfs
 
51 lab-volumes
51 lab-volumes51 lab-volumes
51 lab-volumes
 
50a volumes
50a volumes50a volumes
50a volumes
 
42 lab-managing services
42 lab-managing services42 lab-managing services
42 lab-managing services
 
41a managing services
41a managing services41a managing services
41a managing services
 
30a accessing your cluster
30a accessing your cluster30a accessing your cluster
30a accessing your cluster
 
14 lab-planing
14 lab-planing14 lab-planing
14 lab-planing
 
3 map r installation & setup administration course description
3 map r installation & setup administration course description3 map r installation & setup administration course description
3 map r installation & setup administration course description
 

Kürzlich hochgeladen

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 

Kürzlich hochgeladen (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 

70a monitoring & troubleshooting

  • 1. Monitoring and Troubleshooting 7/6/2012 © 2012 MapR Technologies Troubleshooting 1
  • 2. Monitoring & Troubleshooting: Agenda
      – Cluster Monitoring Tools
      – Troubleshooting MapReduce Jobs
      – Troubleshooting Scenarios
      – Working with MapR Support
      – Things to Avoid
  • 3. Monitoring & Troubleshooting: Objectives
      At the end of this module you will be able to:
      – Identify the tools you can use to monitor your cluster
      – Explain how MapR central logging can help you monitor MapReduce jobs
      – Describe several common troubleshooting scenarios and how to resolve issues based on these scenarios
      – List the tools you can use to work with MapR Support
  • 4. Cluster Monitoring Tools
  • 5. Monitoring Tools
      – Built-In Tools
          • MapR Control System
          • MapR Metrics
      – 3rd Party Tools
          • Nagios
          • Ganglia
  • 6. MapR Control System
      – Dashboard with cluster overview
          • Node health
          • MapR-FS and available disks
          • Resource utilization: bandwidth, disk space, CPU
          • MapReduce job status
          • Alarms
  • 7. MapR Control System
  • 8. MapR Metrics
      – View performance information about Hadoop jobs
          • Predict cluster usage
          • Measure which jobs consume resources
          • Troubleshoot failures & performance issues
      – Metrics provided on
          • Cumulative CPU/memory usage
          • # of running/failed tasks/attempts
          • Speed of input, output, and shuffle
          • Duration of task attempts
          • Data read, written, or shuffled
          • Memory in use
          • Number of records skipped/spilled
  • 9. MapR Metrics
  • 10. 3rd Party Tools
      – Nagios
          • Configuration script generator
      – Ganglia
          • The CLDB reports the metrics
          • MapRGangliaContext
          • gmond is needed only on the CLDB node
  • 11. MapR Service Logs
      – /opt/mapr/logs
      – For example:
          • CLDB
          • Warden
          • FileServer (mfs)
          • NFS
  • 12. Troubleshooting MapReduce Jobs
  • 13. Central Logging
      – MapR 2.0 introduces central logging
          • Log files written to “local” volume on MapR-FS
              – replication factor = 1
          • I/O confined to node
          • /var/mapr/local/<host>/logs/mapred/userlogs
          • Configurable via JobTracker variable
              – mapr.localvolumes.path
  • 14. Central Logging
      – New CLI for MapReduce logs:
          maprcli job linklogs -jobid <jobPattern> -todir <maprfsDir> [ -jobconf <pathToJobXml> ]
      – Creates a job-centric view of all logs on all involved TaskTracker nodes
      – Creates the following structure under <maprfsDir> for every <jobid> matching <jobPattern>:
          • <jobid>/hosts/<host>/: symbolic links to log directories of tasks executed for <jobid> on <host>
          • <jobid>/mappers/: symbolic links to log directories of all map task attempts for <jobid> across the cluster
          • <jobid>/reducers/: symbolic links to log directories of all reduce task attempts for <jobid> across the cluster
  • 15. Troubleshooting Scenarios
  • 16. Troubleshooting Scenarios
      – Slow nodes
      – Out of memory
      – Out of disk space
      – Time skew
      – No ZooKeeper quorum
      – Contention for resources
      – Requirements not met
  • 17. Identifying Slow Nodes
      – Before installation:
          • Use dd to benchmark read/write speed; for example, a write test (this overwrites the disk):
              – dd bs=4M if=/dev/zero of=/dev/sd<x>
          • Compare performance across nodes to test network throughput:
              – dd bs=4M if=/dev/zero | sudo ssh root@node 'dd bs=4M of=/dev/foo'
      – After installation:
          • Look at task starting and completion times
          • Look in system logs for memory or CPU problems
          • Look at the performance of writes to the local volume (where intermediate data goes)
      – Slow disks are identified based on a threshold in mfs.conf
          • The real cause may be a slow NIC
  • 18. Out of Memory
      – Make sure there is enough swap space
      – See if a memory-intensive job is running
      – Use ulimit to make sure there are no limits on the number of file descriptors, resource usage, and the number of processes
      – Garbage collection can result in out-of-memory errors
  • 19. Out of Disk Space
      – MapR logs go to /opt/mapr/logs
          • If this partition is too small, space can run out
          • Set up a cron job to clean out old logs
          • Move to a larger partition
  • 20. Time Skew
      – NTP is your friend: keep the clocks on all nodes synchronized
      – The maximum allowed time difference between nodes is 20 seconds
  • 21. No ZooKeeper Quorum
      – Not enough ZooKeepers running
      – configure.sh run improperly
          • Different ZooKeeper or CLDB nodes specified
      – Network problem
          • Hostname resolution
          • Physical connection down
  • 22. Contention for Resources
      – Make sure there’s no limit on file descriptors, processes
      – Make sure the service layout follows good guidelines
          • Don’t run ZooKeeper with CLDB or JobTracker
          • Fewer task slots when running TaskTracker with CLDB or ZooKeeper
          • Avoid running the active JobTracker on the primary CLDB node
      – Don’t run other random things on cluster nodes
      – Don’t mix distributions
  • 23. Requirements Not Met
      – Use Sun Java JDK
      – Same users/groups with same UID/GID numbers on all nodes
      – Proper licensing
      – Host resolution between all nodes
          • DNS or /etc/hosts
      – Keyless ssh between all nodes for the root user
      – All necessary ports open
          • Watch out for iptables and selinux
  • 24. Working with MapR Support
  • 25. Working with MapR Support
      – mapr-support-collect and mapr-support-dump
      – fsck and gfsck
  • 26. Things to Avoid
  • 27. Things to Avoid
      – Remove ZooKeeper data manually
      – Format disks (unless you are sure)
      – Run configure.sh incorrectly
      – Use dd on an installed node
      – Modify configuration files
          • Without a good reason
          • Inconsistently
  • 28. Questions
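
Two of the scenarios above reduce to simple arithmetic: the time skew limit (at most 20 seconds between nodes) and the ZooKeeper quorum (a strict majority of the ensemble must be running). The following is a minimal sketch of both checks in POSIX shell. The function names skew_ok and zk_has_quorum and their arguments are hypothetical helpers written for illustration; they are not part of MapR or ZooKeeper tooling.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not MapR or ZooKeeper commands.

# skew_ok A B [MAX]
# A and B are clock readings in epoch seconds taken on two nodes.
# Succeeds (exit status 0) when |A - B| <= MAX; MAX defaults to 20,
# the limit quoted on the Time Skew slide.
skew_ok() {
    a=$1
    b=$2
    max=${3:-20}
    diff=$((a - b))
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    [ "$diff" -le "$max" ]
}

# zk_has_quorum RUNNING TOTAL
# Succeeds when strictly more than half of the TOTAL ZooKeeper
# servers are running, which is what a quorum requires.
zk_has_quorum() {
    running=$1
    total=$2
    [ $((running * 2)) -gt "$total" ]
}
```

One possible use, assuming keyless ssh is already set up as listed under Requirements Not Met and that date +%s is available on both hosts: skew_ok "$(date +%s)" "$(ssh root@node date +%s)" || echo "clock skew too large". The ssh round trip adds some delay to the remote reading, so treat this as a rough check and not as a substitute for NTP.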