glideinWMS training



          Solving Grid problems
        through glidein monitoring
            i.e. The Grid debugging part of G.Factory operations

                           by Igor Sfiligoi (UCSD)




glideinWMS training              Grid debugging                    1
Glidein Factory Operations
●    Factory node operations
●    Serving VO Frontend Admin requests
●    Keeping up with changes in the Grid
●    Debugging Grid problems
      ●   The most time consuming part
      ●   Effectively we help solve Grid problems,
          through glidein monitoring

Reminder - Glideins
 ●   A glidein is a properly configured Condor startd
     daemon submitted as a Grid job


     [Diagram: Frontend, Submit node, Central manager, Factory, CE, and
     Worker node (glidein, Startd, Job). Arrow labels: "Monitor Condor",
     "Match", "Request glideins", "Submit glideins".]

What can go wrong in the Grid?
 ●   Many places where things can go wrong
      ●   Essentially at any of the arrows below


     [Diagram: Submit node, Central manager, Factory, CE, and
     Worker node (glidein, Startd, Job), connected by arrows]

What can go wrong in the Grid?
 ●   In particular
      ●   CE may refuse to accept glideins



What can go wrong in the Grid?
 ●   In particular
      ●   CE may not start glideins
      ●   Or fail to tell us what the status of the job is

What can go wrong in the Grid?
 ●   In particular
      ●   The worker node may be broken/misconfigured
            –   Thus validation will fail
      ●   Many reasons

What can go wrong in the Grid?
 ●   In particular
      ●   The WAN networking may not work properly
      ●   The CM never hears from the Startd
      ●   Or the Schedd cannot talk to the Startd
      ●   Can be selective

What can go wrong in the Grid?
 ●   In particular
      ●   Or the security infrastructure could be broken
            –   CAs missing
            –   Time discrepancies
            –   Etc.

What can go wrong in the Grid?
 ●   In particular
      ●   The site may refuse to start the user job
            –   e.g. glexec


What can go wrong with glideins?
 ●   And there are also non-Grid problems
      ●   Jobs not matching
 ●   But that's beyond the scope of this document

Problem classification
 ●   Most often we see WN problems (typically easy to diagnose)
      ●   Followed by CEs refusing glideins
 ●   Then there are misbehaving CEs
      ●   Very hard to diagnose!
 ●   Everything else quite rare
      ●   But usually hard to diagnose as well




Grid debugging




                      Validation problems
                       i.e. Problems on Worker Nodes




WN problems
 ●   The glidein startup script runs
     a list of validation scripts
      ●   If any of them fails, the WN is considered broken
      ●   This way user jobs never get to broken WNs
 ●   Two sources of tests
      ●   Glidein Factory
      ●   VO Frontend
 ●   Of course, if the validation script cannot be fetched
     from either Web server, it is considered a failure
     as well
Types of tests
 ●   The glideinWMS SW comes with a set of
     standard tests (provided by the factory):
      ●   Grid environment present (e.g. CAs)
      ●   Some free disk on $PWD and on /tmp
      ●   Enough FE-provided proxy lifetime remaining
      ●   gLExec related tests
      ●   OS type
 ●   Each VO may have its own needs, e.g.:
      ●   Is VO SW pre-installed and accessible?
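
As an illustration of the free disk test listed above, here is a minimal sketch in Python. The slides do not show the actual glideinWMS validation scripts, so the function names and the 1 GiB threshold are assumptions made only for this example.

```python
import os
import shutil

# Hypothetical threshold: the slides do not say how much free disk is required.
MIN_FREE_BYTES = 1024**3  # 1 GiB


def free_disk_errors(paths, min_free_bytes=MIN_FREE_BYTES):
    """Return one error message per path with less than min_free_bytes free."""
    errors = []
    for path in paths:
        free = shutil.disk_usage(path).free
        if free < min_free_bytes:
            errors.append(f"Not enough free disk on {path}: {free} bytes free")
    return errors


def default_paths():
    """The two locations named in the standard tests: $PWD and /tmp."""
    return [os.getcwd(), "/tmp"]
```

A validation script built this way would print the messages returned by `free_disk_errors(default_paths())`, so that they can be delivered back to the factory; how the failure itself is signalled (for example a non-zero exit code) is left open here.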

Discovering the problems
 ●   Any error message printed out by the validation
     script will be delivered back to the factory
      ●   After the glidein terminates
 ●   Most validation scripts provide a clear indication
     of what went wrong
      ●   And we strive to get all to do it!
 ●   New machine readable format being introduced
      ●   With v2_6_2


Typical ops
 ●   Noticing that a large fraction of glideins for a
     site are failing is easy
      ●   Just look at the monitoring
      ●   And we are getting a daily email as well
 ●   Discovering what exactly is broken not too
     difficult either
      ●   Just parse the logs (with appropriate tools)
      ●   Will get even easier when all scripts
          return machine readable information
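
The "large fraction of glideins failing" check can be sketched as follows. This is only an illustration: the input shape, a list of `(site, succeeded)` pairs already extracted from the logs, and the 50% threshold are assumptions, not the format or policy of the real factory tools.

```python
from collections import defaultdict


def failure_fractions(records):
    """Map each site to the fraction of its glideins that failed validation.

    `records` is an iterable of (site, succeeded) pairs; this input shape is
    an assumption, not the format of the real factory logs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for site, succeeded in records:
        totals[site] += 1
        if not succeeded:
            failures[site] += 1
    return {site: failures[site] / totals[site] for site in totals}


def sites_to_investigate(records, threshold=0.5):
    """Return the sorted sites whose failure fraction is at least `threshold`."""
    fractions = failure_fractions(records)
    return sorted(site for site, frac in fractions.items() if frac >= threshold)
```

Once every validation script returns machine readable information, producing the `(site, succeeded)` pairs should become the easy part.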

Action items
 ●   Not much we can do directly
      ●   Unless this is the result of a misconfiguration on our part
 ●   Typically, we open a ticket with the site
      ●   Provide the list of nodes where it happens
          (rare to have the whole site broken)
      ●   A concise but complete error report
          essential for a speedy resolution
 ●   In a minority of cases we have to contact the
     VO FE admin, e.g.
      ●   Unclear error messages
      ●   Non-WN specific validation errors

Black hole nodes
 ●   There is one further WN problem
      ●   Black hole WNs
      ●   WNs that accept glidein jobs, but don't execute them
 ●   glidein_startup never has the chance
     to log anything
      ●   Not even the node it is running on
      ●   Thus, empty log files!
 ●   We can infer we have a black hole node at a site
     by looking at job timing (in Condor-G logs)
      ●   Good jobs run for at least 20 mins
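
The timing-based inference can be sketched like this. The 20 minute limit comes from the slide; the input shape (runtimes in seconds per site, as one might extract them from the Condor-G logs) and the 50% cut-off are assumptions for illustration only.

```python
GOOD_RUNTIME_SECONDS = 20 * 60  # "Good jobs run for at least 20 mins"


def suspected_black_hole_sites(runtimes_by_site,
                               min_runtime=GOOD_RUNTIME_SECONDS,
                               max_short_fraction=0.5):
    """Return the sorted sites where too many glideins ended too quickly.

    `runtimes_by_site` maps a site name to a list of glidein runtimes in
    seconds. Both this shape and `max_short_fraction` are hypothetical.
    """
    suspects = []
    for site, runtimes in runtimes_by_site.items():
        if not runtimes:
            continue  # no data, nothing to infer
        short = sum(1 for runtime in runtimes if runtime < min_runtime)
        if short / len(runtimes) > max_short_fraction:
            suspects.append(site)
    return sorted(suspects)
```

Since the log files of such glideins are empty, this only points to a site; the specific node still has to be found with the help of the site admins.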

Grid debugging




                CE refusing the glideins




CE Refusing the glideins
 ●   CE admin has the right to refuse anyone
      ●   But usually does not change his mind overnight
      ●   Accessing a site for the first time is an issue of its own
            –   Not covered here
 ●   When things go wrong, the typical reason is
      ●   CE service down,
      ●   Problems in the Security/Auth infrastructure,
      ●   CE seriously misconfigured/broken


Expected vs Unexpected
 ●   Some “problems” are expected
      ●   e.g. the CE is down for scheduled maintenance
      ●   Nothing to do in this case!
            –   Just a monitoring issue
      ●   So, checking the maintenance DB is important!
 ●   If the problem is not expected, we have to notify the site
      ●   The VO FEs are not getting the CPU slots
          they are asking for



Discovering the problem
 ●   Condor-G reacts in two different ways
      ●   Does nothing – We still have monitoring showing
          the job did not progress from Waiting→Pending
      ●   Puts the job on Hold
 ●   The G.Factory will react on Held jobs
      ●   Releasing them a few times → Condor-G retries
      ●   Removing them after a while
            –   Just to be replaced with identical glideins
 ●   Most non-trivial problems do not resolve by themselves
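
The release-then-remove reaction to Held jobs can be sketched as a small policy function. The limit of three releases is a hypothetical value (the slide only says "a few times"), and the function names are made up for this example; in a real setup the two actions would correspond to releasing and removing the job in Condor-G.

```python
# Hypothetical limit: the slides only say "a few times".
MAX_RELEASES = 3


def next_action(release_count, max_releases=MAX_RELEASES):
    """Decide what to do with a held glidein already released release_count times."""
    if release_count < max_releases:
        return "release"  # let Condor-G retry the submission
    return "remove"       # give up; an identical glidein will replace it


def handle_held_glidein(still_held_after_release, max_releases=MAX_RELEASES):
    """Replay the policy and return the list of actions taken.

    still_held_after_release(n) reports whether the glidein went back on
    hold after the n-th release.
    """
    actions = []
    releases = 0
    while next_action(releases, max_releases) == "release":
        actions.append("release")
        releases += 1
        if not still_held_after_release(releases):
            return actions  # the retry worked, nothing more to do
    actions.append("remove")
    return actions
```

For a persistent problem the result is always the same sequence of releases followed by a removal, which is why a ticket with the site is usually needed.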

Action items
                            (for unexpected problems)



 ●   Most of the time, not much we can do directly
      ●   Will just open ticket with site
      ●   If any useful info in the HoldReason, we pass it on
      ●   DN of the proxy the most valuable info
 ●   But it could be our problem, too
      ●   Found many Condor-G problems in the past
      ●   Comparing the behavior of many G.Factory
          instances can confirm or exclude this
      ●   Ad-hoc solutions needed if this is the case


Grid debugging



                        CE not properly
                      handling the glideins



Problematic CE
 ●   Three basic types of problems:
      ●   Glideins not starting
      ●   Improper monitoring information
      ●   Output files not being delivered to client
 ●   And there is one more
      ●   Unexpected policies that kill glideins




Glideins not starting
 ●   The CE scheduling policy is not available to us
      ●   So often not obvious if we are just low priority or
          something else is going on
      ●   GF/Condor-G does not see it as an error condition
 ●   We usually don't act on it, unless
      ●   The VO FE admin complains, or
      ●   We have been given explicit guidance of the
          expected startup rates
 ●   Not much for us to investigate
      ●   Just tell the site admin “Jobs are not starting”
Glideins being killed by the site
 ●   Ideally, our glideins should fit within
     the policies of the site
     (but getting this info is not trivial, remember?)
      ●   But sometimes they don't
      ●   So they get killed hard
 ●   Discovering this from our side very hard
      ●   We often just notice empty log files
      ●   Not an error for Condor-G
      ●   Often learn of this because the VO complains
 ●   If and when we understand the problem,
     we can deal with it ourselves
      ●   i.e. we configure the glideins to stay within the limits
Preemption
 ●   Some sites will preempt our glideins
     if higher priority jobs get into the queue
      ●   Effectively killing our glideins
 ●   Not an actual error
      ●   Sites have the right to do it!
 ●   But it can mess up our monitoring/ops
      ●   We may see killed glideins, or
      ●   We may see glideins that seem to run for
          a very long time (when automatically rescheduled on the CE)
 ●   We have to efficiently filter these events out
Improper monitoring info from CE
 ●   A CE may not provide reliable information
 ●   Each VO FE provides us with monitoring
     information about its central manager
      ●   By comparing what it tells us, with what
          the CE tells us, we can infer if there are problems
 ●   A large, consistent discrepancy typically signals
     problems in the CE monitoring
 ●   Very difficult to figure out what is going on
      ●   We have no direct detailed data to act upon
      ●   Mostly ad-hoc detective work, prodding the black box
      ●   Often inconclusive
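
The comparison between what the CE reports and what the VO FE reports can be sketched as below. The sample format (pairs of running-glidein counts taken at successive times), the 30% threshold and the three-sample window are all assumptions chosen for illustration.

```python
def relative_discrepancy(ce_count, fe_count):
    """Relative difference between the two counts, in the range 0..1."""
    largest = max(ce_count, fe_count)
    if largest == 0:
        return 0.0
    return abs(ce_count - fe_count) / largest


def consistently_discrepant(samples, threshold=0.3, min_samples=3):
    """True if each of the last min_samples samples disagrees by more than threshold.

    `samples` is a time-ordered list of (ce_running, fe_running) pairs:
    what the CE claims is running versus what the VO's central manager
    actually sees. The shape and the default values are hypothetical.
    """
    recent = samples[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough history to call it consistent
    return all(relative_discrepancy(ce, fe) > threshold for ce, fe in recent)
```

Requiring several consecutive samples reflects the "large, consistent discrepancy" wording: a single mismatch may just be glideins starting up or shutting down.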
Lack of output files
●    The glidein output files contain
      ●   Accounting information
      ●   Detailed logging
●    Without other problems, mostly an annoyance
●    But much more often paired with glideins failing
      ●   Making failure diagnostics close to impossible
●    Extremely hard to diagnose the root cause
      ●   Sometimes we may infer it (black holes, killed glideins, ...)
      ●   For actual CE problems it requires help from many
          parties, including us, the site admins and SW developers
Grid debugging




                      Networking problems




Glideins are network heavy
 ●   Each glidein opens several
     long‑lived TCP connections (in CCB mode)
      ●   Can overwhelm networking gear
            –   e.g. NATs can run out of spare ports
 ●   Problems can have non-linear behavior
      ●   Will work fine on small scale
      ●   Will degrade after a while
            –   Not necessarily a step function, though
 ●   Outright denials due to firewalls are also a problem

Diagnostics and action items
 ●   Not trivial to detect
      ●   Errors often in the glidein logs
      ●   But difficult to interpret
      ●   And we are lacking tools for automatically detecting this
 ●   Not much we can do directly
      ●   A problem between the VO services and the site
            –   So we notify both
 ●   However
      ●   we usually end up assisting as experts


Grid debugging




               Authentication problems




Security is delicate stuff
 ●   Grid security mechanisms paranoid by design
      ●   “Availability” is the last to be considered
      ●   The main focus is keeping the “bad guys” out
 ●   So they are extremely delicate
      ●   If any piece of the chain breaks, everything breaks
 ●   Things that can go wrong (non exhaustive list):
      ●   Missing CA(s)
      ●   Expired CRLs
      ●   Expired glidein proxy
      ●   Wrong system time (clock skew)
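
Two of the items above, an expired proxy and a wrong system time, show up the same way: the current time falls outside the validity window of the credential. A minimal sketch of that check follows; the function names, the returned messages and the one hour minimum lifetime are assumptions for illustration, and reading the validity window out of a real proxy is not shown.

```python
from datetime import datetime, timedelta, timezone


def utc_now():
    """Current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


def proxy_problem(not_before, not_after, now,
                  min_remaining=timedelta(hours=1)):
    """Return a short description of the problem, or None if the proxy is usable.

    not_before / not_after delimit the validity window of the proxy.
    min_remaining is a hypothetical minimum lifetime.
    """
    if now < not_before:
        # A proxy "from the future" usually means the local clock is behind.
        return "proxy not yet valid (possible clock skew)"
    if now >= not_after:
        return "proxy expired"
    if not_after - now < min_remaining:
        return "proxy lifetime too short"
    return None
```

Note that a node with a badly skewed clock can report "proxy expired" for a perfectly good proxy, which is one reason these failures are hard to attribute to the site or to the VO.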
Diagnostics and action items
 ●   Finding the root cause usually hard
      ●   Errors are in the glidein logs
      ●   But usually do not provide enough info
          (to avoid giving away too much info to a hypothetical attacker)
      ●   And we are lacking tools for automatically detecting this
 ●   Have to distinguish between
     site problems and VO problems, too
      ●   Only obvious if only a fraction fails (→ WN problem)
      ●   Else, may need to get both sides involved to
          properly diagnose the root cause


Grid debugging




                      Job startup problems




gLExec (1)



 ●   The biggest source of problems, by far,
     is gLExec refusing to accept a user proxy
      ●   Resulting in jobs not starting
      ●   BTW, Condor is not good at handling gLExec denials
 ●   We can only partially test gLExec
     during validation
      ●   May behave differently based on the proxy used
      ●   Its behavior can change in time
 ●   And final users may be the source of the problem
      ●   e.g. by letting the proxy expire
            –   Condor could catch these, and hopefully soon will
gLExec (2)



 ●   Not trivial to detect
      ●   Errors are in the glidein logs
      ●   But we lack the tools to extract them
 ●   Finding the root cause impossible without
     site admin help
      ●   gLExec policies are a site secret
      ●   We thus just notify the site,
          providing the failing user DN



Configuration problems
 ●   Condor can be configured to run a wrapper
     around the user job
      ●   To customize the user environment
      ●   Usually provided by the VO FE
 ●   If that fails, the user job fails with it
 ●   Luckily, failures are rare
      ●   If we notice them, we notify the VO FE admins
      ●   However, they often notice before we do


Other job startup problems
 ●   By default, we validate the node
     only at glidein startup
      ●   WN conditions may change by the time a job
          is scheduled to run
            –   e.g. the disk fills up
      ●   We should do better: Condor supports periodic
          validation tests, we just don't use them right now
 ●   The errors are usually only
     seen by the final users
      ●   So we hardly ever notice
          these kinds of problems


Summary
●   The Grid world is a good approximation of
    a chaotic system
     ●   There are thus many failure modes
●   The pilot paradigm hides most of the failures
    from the final users
     ●   But the failures are still there
     ●   Resulting in wasted/underused CPU cycles
●   The G.Factory operators are in the best position to
    diagnose the root cause of the failures
     ●   By having a global view
     ●   However, they cannot solve the problems by themselves
Acknowledgments
 ●   This document was sponsored by grants from
     the US NSF and US DOE,
     and by the UC system




glideinWMS training        Grid debugging         44

Weitere ähnliche Inhalte

Ähnlich wie Solving Grid problems through glidein monitoring

glideinWMS Frontend Internals - glideinWMS Training Jan 2012
glideinWMS Frontend Internals - glideinWMS Training Jan 2012glideinWMS Frontend Internals - glideinWMS Training Jan 2012
glideinWMS Frontend Internals - glideinWMS Training Jan 2012Igor Sfiligoi
 
Glidein Factory Operations
Glidein Factory OperationsGlidein Factory Operations
Glidein Factory OperationsIgor Sfiligoi
 
glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
 glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM... glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...Igor Sfiligoi
 
glideinWMS Architecture - glideinWMS Training Jan 2012
glideinWMS Architecture - glideinWMS Training Jan 2012glideinWMS Architecture - glideinWMS Training Jan 2012
glideinWMS Architecture - glideinWMS Training Jan 2012Igor Sfiligoi
 
Monitoring and troubleshooting a glideinWMS-based HTCondor pool
Monitoring and troubleshooting a glideinWMS-based HTCondor poolMonitoring and troubleshooting a glideinWMS-based HTCondor pool
Monitoring and troubleshooting a glideinWMS-based HTCondor poolIgor Sfiligoi
 
Introduction to glideinWMS
Introduction to glideinWMSIntroduction to glideinWMS
Introduction to glideinWMSIgor Sfiligoi
 
glideinWMS validation scirpts - glideinWMS Training Jan 2012
glideinWMS validation scirpts - glideinWMS Training Jan 2012glideinWMS validation scirpts - glideinWMS Training Jan 2012
glideinWMS validation scirpts - glideinWMS Training Jan 2012Igor Sfiligoi
 
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...Igor Sfiligoi
 
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...Igor Sfiligoi
 
O futuro do cloud deployment
O futuro do cloud deploymentO futuro do cloud deployment
O futuro do cloud deploymentSidnei Da Silva
 
An argument for moving the requirements out of user hands - The CMS Experience
An argument for moving the requirements out of user hands - The CMS ExperienceAn argument for moving the requirements out of user hands - The CMS Experience
An argument for moving the requirements out of user hands - The CMS ExperienceIgor Sfiligoi
 

Ähnlich wie Solving Grid problems through glidein monitoring (12)

Glidein internals
Glidein internalsGlidein internals
Glidein internals
 
glideinWMS Frontend Internals - glideinWMS Training Jan 2012
glideinWMS Frontend Internals - glideinWMS Training Jan 2012glideinWMS Frontend Internals - glideinWMS Training Jan 2012
glideinWMS Frontend Internals - glideinWMS Training Jan 2012
 
Glidein Factory Operations
Glidein Factory OperationsGlidein Factory Operations
Glidein Factory Operations
 
glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
 glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM... glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
glideinWMS Frontend Installation - Part 2 - Frontend Installation -glideinWM...
 
glideinWMS Architecture - glideinWMS Training Jan 2012
glideinWMS Architecture - glideinWMS Training Jan 2012glideinWMS Architecture - glideinWMS Training Jan 2012
glideinWMS Architecture - glideinWMS Training Jan 2012
 
Monitoring and troubleshooting a glideinWMS-based HTCondor pool
Monitoring and troubleshooting a glideinWMS-based HTCondor poolMonitoring and troubleshooting a glideinWMS-based HTCondor pool
Monitoring and troubleshooting a glideinWMS-based HTCondor pool
 
Introduction to glideinWMS
Introduction to glideinWMSIntroduction to glideinWMS
Introduction to glideinWMS
 
glideinWMS validation scirpts - glideinWMS Training Jan 2012
glideinWMS validation scirpts - glideinWMS Training Jan 2012glideinWMS validation scirpts - glideinWMS Training Jan 2012
glideinWMS validation scirpts - glideinWMS Training Jan 2012
 
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan...
 
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
glideinWMS Frontend Installation - Part 1 - Condor Installation - glideinWMS ...
 
O futuro do cloud deployment
O futuro do cloud deploymentO futuro do cloud deployment
O futuro do cloud deployment
 
An argument for moving the requirements out of user hands - The CMS Experience
An argument for moving the requirements out of user hands - The CMS ExperienceAn argument for moving the requirements out of user hands - The CMS Experience
An argument for moving the requirements out of user hands - The CMS Experience
 

Mehr von Igor Sfiligoi

Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROIgor Sfiligoi
 
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...Igor Sfiligoi
 
Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Igor Sfiligoi
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingIgor Sfiligoi
 
Auto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesAuto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesIgor Sfiligoi
 
Speeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateSpeeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateIgor Sfiligoi
 
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsPerformance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsIgor Sfiligoi
 
Comparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeComparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeIgor Sfiligoi
 
Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Igor Sfiligoi
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessIgor Sfiligoi
 
Using A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputUsing A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputIgor Sfiligoi
 
Using commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsUsing commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsIgor Sfiligoi
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROIgor Sfiligoi
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstIgor Sfiligoi
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyIgor Sfiligoi
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCIgor Sfiligoi
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Igor Sfiligoi
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsIgor Sfiligoi
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsIgor Sfiligoi
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksIgor Sfiligoi
 

Solving Grid problems through glidein monitoring

  • 1. glideinWMS training: Solving Grid problems through glidein monitoring, i.e. the Grid debugging part of G.Factory operations, by Igor Sfiligoi (UCSD)
  • 2. Glidein Factory Operations ● Factory node operations ● Serving VO Frontend Admin requests ● Keeping up with changes in the Grid ● Debugging Grid problems ● The most time-consuming part ● Effectively, we help solve Grid problems through glidein monitoring
  • 3. Reminder - Glideins ● A glidein is a properly configured Condor startd daemon submitted as a Grid job [Diagram: Frontend, Factory, Submit node, Central manager, CE, and Worker node running the glidein (Startd, Job); arrows labeled Monitor Condor, Request glideins, Submit glideins, Match]
  • 4. What can go wrong in the Grid? ● Many places where things can go wrong ● Essentially at any of the arrows below [Diagram: Factory, Submit node, Central manager, CE, and Worker node running the glidein (Startd, Job)]
  • 5. What can go wrong in the Grid? ● In particular ● The CE may refuse to accept glideins
  • 6. What can go wrong in the Grid? ● In particular ● The CE may not start glideins ● Or fail to tell us what the status of the job is
  • 7. What can go wrong in the Grid? ● In particular ● The worker node may be broken/misconfigured – Thus validation will fail ● Many reasons
  • 8. What can go wrong in the Grid? ● In particular ● The WAN networking may not work properly ● The CM never hears from the Startd ● Or the Schedd cannot talk to the Startd ● Can be selective
  • 9. What can go wrong in the Grid? ● In particular ● Or the security infrastructure could be broken – CAs missing – Time discrepancies – Etc.
  • 10. What can go wrong in the Grid? ● In particular ● The site may refuse to start the user job – e.g. gLExec
  • 11. What can go wrong with glideins? ● And there are also non-Grid problems ● Jobs not matching ● But that's beyond the scope of this document
  • 12. Problem classification ● Most often we see WN problems (typically easy to diagnose) ● Followed by CEs refusing glideins ● Then there are misbehaving CEs ● Very hard to diagnose! ● Everything else is quite rare ● But usually hard to diagnose as well
  • 13. Grid debugging: Validation problems, i.e. problems on Worker Nodes
  • 14. WN problems ● The glidein startup script runs a list of validation scripts ● If any of them fails, the WN is considered broken ● This way user jobs never get to broken WNs ● Two sources of tests ● Glidein Factory ● VO Frontend ● Of course, if a validation script cannot be fetched from either Web server, it is considered a failure as well
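The fail-fast behaviour described on this slide can be sketched in a few lines of Python. This is only an illustration, not the actual glidein startup script: the `run_validation` name and the convention that each check returns a `(passed, message)` pair are assumptions made for the example. A check that cannot even run (for instance because its script could not be fetched) is treated as a failure, as the slide states.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical convention: a check returns (passed, message).
Check = Callable[[], Tuple[bool, str]]


def run_validation(checks: List[Tuple[str, Check]]) -> Tuple[bool, Optional[str]]:
    """Run the checks in order and stop at the first failure.

    Returns (True, None) if every check passes, otherwise
    (False, "<check name>: <reason>").
    """
    for name, check in checks:
        try:
            passed, message = check()
        except Exception as exc:
            # A check that cannot run (e.g. it could not be fetched)
            # counts as a failed validation.
            passed, message = False, f"could not run: {exc}"
        if not passed:
            return False, f"{name}: {message}"
    return True, None
```

The returned reason string is what an operator would want delivered back to the factory so the failing node can be reported.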
  • 15. Types of tests ● The glideinWMS SW comes with a set of standard tests (provided by the factory): ● Grid environment present (e.g. CAs) ● Some free disk on $PWD and on /tmp ● Enough FE-provided proxy lifetime remaining ● gLExec related tests ● OS type ● Each VO may have its own needs, e.g.: ● Is VO SW pre-installed and accessible?
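As an illustration of what one of these standard tests might look like, here is a minimal free-disk check in Python. It is a sketch under stated assumptions: the function name `low_disk_paths` and the byte thresholds are hypothetical, and the real glideinWMS tests are separate scripts that are not reproduced here.

```python
import shutil
from typing import Dict, List


def low_disk_paths(min_free_bytes: Dict[str, int]) -> List[str]:
    """Return the paths whose free space is below the requested minimum.

    min_free_bytes maps a directory (e.g. the working directory or /tmp)
    to the number of free bytes the caller considers sufficient.
    """
    failing = []
    for path, needed in min_free_bytes.items():
        if shutil.disk_usage(path).free < needed:
            failing.append(path)
    return failing
```

A caller could pass something like `{".": 1 << 30, "/tmp": 1 << 28}` and treat a non-empty result as a validation failure; those numbers are placeholders, not values taken from the slides.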
  • 16. Discovering the problems ● Any error message printed out by the validation script will be delivered back to the factory ● After the glidein terminates ● Most validation scripts provide a clear indication of what went wrong ● And we strive to get all of them to do it! ● New machine-readable format being introduced ● With v2_6_2
  • 17. Typical ops ● Noticing that a large fraction of glideins for a site are failing is easy ● Just look at the monitoring ● And we are getting a daily email as well ● Discovering what exactly is broken is not too difficult either ● Just parse the logs ● Will get even easier when all scripts return machine-readable information [Callout: With appropriate tools]
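To make the "large fraction of glideins failing" idea concrete, the following Python sketch aggregates per-glidein outcomes by site and lists the failing nodes for sites above a threshold. The record layout `(site, node, ok)`, the function name and the 50% default are all assumptions for illustration; the real factory monitoring and log formats are not shown in the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def failing_sites(
    records: Iterable[Tuple[str, str, bool]],
    min_fraction: float = 0.5,
) -> Dict[str, List[str]]:
    """Map each site whose failure fraction is >= min_fraction
    to the sorted list of nodes where validation failed.

    Each record is (site, node, ok) for one finished glidein.
    """
    total: Dict[str, int] = defaultdict(int)
    failed: Dict[str, int] = defaultdict(int)
    bad_nodes: Dict[str, set] = defaultdict(set)
    for site, node, ok in records:
        total[site] += 1
        if not ok:
            failed[site] += 1
            bad_nodes[site].add(node)
    return {
        site: sorted(bad_nodes[site])
        for site in total
        if failed[site] / total[site] >= min_fraction
    }
```

The node list is the piece that would go into a ticket for the site.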
  • 18. Action items ● Not much we can do directly (unless this is the result of a misconfiguration on our part) ● Typically, we open a ticket with the site ● Provide the list of nodes where it happens (it is rare to have the whole site broken) ● A concise but complete error report is essential for a speedy resolution ● In a minority of cases we have to contact the VO FE admin, e.g. ● Unclear error messages ● Non-WN specific validation errors
  • 19. Black hole nodes ● There is one further WN problem ● Black hole WNs ● WNs that accept glidein jobs, but don't execute them ● glidein_startup never has the chance to log anything ● Not even the node it is running on ● Thus, empty log files! ● We can infer we have a black hole node at a site by looking at job timing (in Condor-G logs) ● Good jobs run for at least 20 mins
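The timing heuristic can be sketched as follows. This is a hedged example: it assumes the job runtimes (in seconds) have already been extracted from the Condor-G logs by some other means, and the function name is hypothetical. Only the 20-minute figure comes from the slide.

```python
from typing import Sequence


def short_run_fraction(
    runtimes_s: Sequence[float],
    min_good_runtime_s: float = 20 * 60,
) -> float:
    """Fraction of glideins at a site that ran for less than the
    minimum runtime expected of a healthy glidein."""
    if not runtimes_s:
        return 0.0
    short = sum(1 for r in runtimes_s if r < min_good_runtime_s)
    return short / len(runtimes_s)
```

A high fraction together with empty log files would hint at black hole nodes, though it cannot say which node is responsible.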
  • 20. Grid debugging: CE refusing the glideins
  • 21. CE refusing the glideins ● The CE admin has the right to refuse anyone ● But usually does not change their mind overnight ● Accessing a site for the first time is an issue of its own – Not covered here ● When things go wrong, the typical reasons are ● CE service down ● Problems in the Security/Auth infrastructure ● CE seriously misconfigured/broken
  • 22. Expected vs Unexpected ● Some “problems” are expected ● e.g. the CE is down for scheduled maintenance ● Nothing to do in this case! – Just a monitoring issue ● So, checking the maintenance DB is important! ● If the problem is not expected, we have to notify the site ● The VO FEs are not getting the CPU slots they are asking for
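Separating expected from unexpected outages boils down to checking a failure time against the known maintenance windows. The sketch below assumes the windows have already been obtained from the maintenance DB as `(start, end)` timestamp pairs; how that DB is queried is not covered by the slides, and the function name is hypothetical.

```python
from typing import Iterable, Tuple


def in_scheduled_downtime(
    timestamp: float,
    windows: Iterable[Tuple[float, float]],
) -> bool:
    """True if timestamp falls inside any [start, end) maintenance window."""
    return any(start <= timestamp < end for start, end in windows)
```

Failures inside a window would only need to be filtered out of the monitoring; failures outside one are candidates for a ticket.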
  • 23. Discovering the problem ● Condor-G reacts in two different ways ● Does nothing – We still have monitoring showing the job did not progress from Waiting→Pending ● Puts the job on Hold ● The G.Factory will react to Held jobs ● Releasing them a few times → Condor-G retries ● Removing them after a while – Just to be replaced with identical glideins [Callout: Most non-trivial problems do not resolve by themselves]
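The release-then-remove policy for Held jobs can be written as a tiny decision function. This is a sketch only: the limit of three releases and the function name are assumptions, since the slide says just "a few times" and "after a while", and no Condor-G command is invoked here.

```python
def held_job_action(num_releases: int, max_releases: int = 3) -> str:
    """Decide what to do with a Held glidein.

    Release it (so Condor-G retries) until it has been released
    max_releases times, then remove it so it can be replaced.
    """
    if num_releases < 0:
        raise ValueError("num_releases must be non-negative")
    return "release" if num_releases < max_releases else "remove"
```

Because the replacement glideins are identical, this loop alone does not fix a persistent problem; it only keeps the queue from filling with Held jobs.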
  • 24. Action items (for unexpected problems) ● Most of the time, not much we can do directly ● Will just open a ticket with the site ● If there is any useful info in the HoldReason, we pass it on ● The DN of the proxy is the most valuable info ● But it could be our problem, too ● Found many Condor-G problems in the past ● Comparing the behavior of many G.Factory instances can confirm or exclude this [Callout: Ad hoc solutions needed if this is the case]
  • 25. Grid debugging: CE not properly handling the glideins
  • 26. Problematic CE ● Three basic types of problems: ● Glideins not starting ● Improper monitoring information ● Output files not being delivered to client ● And there are two more ● Unexpected policies that kill glideins
  • 27. Glideins not starting ● The CE scheduling policy is not available to us ● So it is often not obvious if we are just low priority or something else is going on ● GF/Condor-G does not see it as an error condition ● We usually don't act on it, unless ● The VO FE admin complains, or ● We have been given explicit guidance on the expected startup rates ● Not much for us to investigate ● Just tell the site admin “Jobs are not starting”
  • 28. Glideins being killed by the site ● Ideally, our glideins should fit within the policies of the site (but getting this info is not trivial, remember?) ● But sometimes they don't ● So they get killed hard ● Discovering this from our side is very hard ● We often just notice empty log files ● Not an error for Condor-G ● Often learn of this because the VO complains ● If and when we understand the problem, we can deal with it ourselves ● i.e. we configure the glideins to stay within the limits
  • 29. Preemption ● Some sites will preempt our glideins if higher priority jobs get into the queue ● Effectively killing our glideins ● Not an actual error ● Sites have the right to do it! ● But it can mess up our monitoring/ops ● We may see killed glideins, or ● We may see glideins that seem to run for a very long time (when automatically rescheduled on the CE) ● We have to efficiently filter these events out
  • 30. Improper monitoring info from CE ● A CE may not provide reliable information ● Each VO FE provides us with monitoring information about its central manager ● By comparing what it tells us with what the CE tells us, we can infer if there are problems ● A large, consistent discrepancy typically signals problems in the CE monitoring ● Very difficult to figure out what is going on ● We have no direct detailed data to act upon ● Mostly ad-hoc detective work, prodding the black box ● Often inconclusive
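The comparison between the two views can be sketched as a check for a "large, consistent discrepancy". Everything concrete here is an assumption for illustration: the sample layout (running glideins according to the CE, and according to the VO central manager), the 30% relative threshold, and the requirement of three consecutive samples.

```python
from typing import Sequence, Tuple


def consistent_discrepancy(
    samples: Sequence[Tuple[int, int]],
    rel_threshold: float = 0.3,
    min_samples: int = 3,
) -> bool:
    """True if each of the last min_samples (ce_running, cm_running)
    pairs differs by at least rel_threshold, relative to the larger value."""
    if len(samples) < min_samples:
        return False
    for ce_running, cm_running in samples[-min_samples:]:
        largest = max(ce_running, cm_running)
        if largest == 0:
            return False  # nothing running on either side, nothing to compare
        if abs(ce_running - cm_running) / largest < rel_threshold:
            return False
    return True
```

A positive result only says that the two views disagree; as the slide notes, finding out why is mostly detective work.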
  • 31. Lack of output files ● The glidein output files contain ● Accounting information ● Detailed logging ● Without other problems, mostly an annoyance ● But much more often paired with glideins failing ● Making failure diagnostics close to impossible ● Extremely hard to diagnose the root cause ● Sometimes we may infer it (black holes, killed glideins, ...) ● For actual CE problems it requires help from many parties, including us, the site admins and SW developers
  • 32. Grid debugging: Networking problems
  • 33. Glideins are network heavy ● Each glidein opens several long-lived TCP connections (in CCB mode) ● Can overwhelm networking gear – e.g. NATs can run out of spare ports ● Problems can have non-linear behavior ● Will work fine on small scale ● Will degrade after a while – Not necessarily a step function, though [Callout: Although outright denials due to firewalls are also a problem]
  • 34. Diagnostics and action items ● Not trivial to detect ● Errors often in the glidein logs ● But difficult to interpret [Callout: And we are lacking tools for automatically detecting this.] ● Not much we can do directly ● A problem between the VO services and the site – So we notify both ● However ● We usually end up assisting as experts
  • 35. Grid debugging: Authentication problems
  • 36. Security is delicate stuff ● Grid security mechanisms are paranoid by design ● “Availability” is the last to be considered ● The main focus is keeping the “bad guys” out ● So they are extremely delicate ● If any piece of the chain breaks, everything breaks ● Things that can go wrong (non-exhaustive list): ● Missing CA(s) ● Expired CRLs ● Expired glidein proxy ● Wrong system time (clock skew)
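Two of the listed failure causes, an expiring proxy and clock skew, are simple enough to show as arithmetic on timestamps. The sketch below is hypothetical throughout: it assumes the proxy expiry time and a trusted reference time are already available as epoch seconds, and the one-hour lifetime and five-minute skew limits are placeholder values, not Grid requirements.

```python
from typing import List


def security_problems(
    now: float,
    reference_now: float,
    proxy_not_after: float,
    min_lifetime_s: float = 3600,
    max_skew_s: float = 300,
) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if abs(now - reference_now) > max_skew_s:
        problems.append("clock skew")
    if proxy_not_after - now < min_lifetime_s:
        problems.append("proxy lifetime too short")
    return problems
```

Missing CAs and expired CRLs need access to the actual certificate files and are not covered by this sketch.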
  • 37. Diagnostics and action items ● Finding the root cause is usually hard ● Errors are in the glidein logs ● But usually do not provide enough info (to avoid giving up too much info to a hypothetical attacker) [Callout: And we are lacking tools for automatically detecting this.] ● Have to distinguish between site problems and VO problems, too ● Only obvious if only a fraction fails (→ WN problem) ● Else, may need to get both sides involved to properly diagnose the root cause
  • 38. Grid debugging: Job startup problems
  • 39. gLExec (1) ● The biggest source of problems, by far, is gLExec refusing to accept a user proxy ● Resulting in jobs not starting ● BTW, Condor is not good at handling gLExec denials ● We can only partially test gLExec during validation ● May behave differently based on the proxy used ● Its behavior can change in time ● And final users may be the source of the problem ● e.g. by letting the proxy expire [Callout: Condor could catch these, and hopefully soon will]
  • 40. gLExec (2) ● Non-trivial to detect ● Errors are in the glidein logs ● But we lack the tools to extract them ● Finding the root cause is impossible without site admin help ● gLExec policies are a site secret ● We thus just notify the site, providing the failing user DN
  • 41. Configuration problems ● Condor can be configured to run a wrapper around the user job ● To customize the user environment ● Usually provided by the VO FE ● If that fails, the user job fails with it ● Luckily, failures are rare ● If we notice them, we notify the VO FE admins ● However, they often notice before we do
  • 42. Other job startup problems ● By default, we validate the node only at glidein startup ● WN conditions may change by the time a job is scheduled to run – e.g. the disk fills up ● The errors are usually only seen by the final users ● So we hardly ever notice these kinds of problems [Callout: We should do better. Condor supports periodic validation tests, we just don't use them right now.]
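The idea of re-validating a node periodically, rather than only at startup, can be illustrated with a small loop. This sketch does not use Condor's own periodic test mechanism, which the slides mention but do not describe; the function name, the ten-minute interval and the injectable `sleep` parameter are assumptions made so the example stays self-contained.

```python
import time
from typing import Callable, Optional


def first_failing_round(
    check: Callable[[], bool],
    rounds: int,
    interval_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Run check up to `rounds` times, waiting interval_s between runs.

    Returns the zero-based index of the first failing round,
    or None if every round passed.
    """
    for i in range(rounds):
        if not check():
            return i
        if i < rounds - 1:
            sleep(interval_s)
    return None
```

With a check such as "is there still enough free disk", a node that degrades after startup would be caught before the next user job lands on it.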
  • 43. Summary ● The Grid world is a good approximation of a chaotic system ● There are thus many failure modes ● The pilot paradigm hides most of the failures from the final users ● But the failures are still there ● Resulting in wasted/underused CPU cycles ● The G.Factory operators are in the best position to diagnose the root cause of the failures ● By having a global view ● However, they cannot solve the problems by themselves
  • 44. Acknowledgments ● This document was sponsored by grants from the US NSF and US DOE, and by the UC system