This document discusses tools for monitoring and troubleshooting jobs in a glideinWMS-based HTCondor pool. It describes how to determine where jobs are running, why jobs may not be starting, and why jobs are taking a long time to finish. The key tools covered are condor_q, condor_history, and the job event log. The document also provides guidance on checking job requirements, user priorities, supported sites, and job restarts.
Monitoring and troubleshooting a glideinWMS-based HTCondor pool
1. glideinWMS for users: Monitoring and troubleshooting a glideinWMS-based HTCondor pool
by Igor Sfiligoi (UCSD)
2. Scope of this talk
This talk describes what information is available when troubleshooting in a glideinWMS-based HTCondor pool, and what tools you can use to mine it.
The reader is expected to already have a basic understanding of HTCondor and glideinWMS.
3. HTCondor Architecture
● As a reminder
(Architecture diagram: the VO Frontend (VO FE) and Glidein Factories (G.F.) operate on the Grid; the HTCondor pool consists of a Central manager running the Negotiator, Submit nodes each running a Schedd, and Execute nodes.)
4. Typical user questions addressed in this talk
● Where is/was my job running?
● Why are my jobs not starting?
● Why do my jobs take forever to finish?
5. Where is/was my job running?
6. Job progress monitoring
● HTCondor provides two basic means to monitor job progress
● Querying the system for current status
– Using the cmdline condor_q/condor_history
● Parsing the job event log
– Either plain text or XML formatted
– Starting with 7.9.1, condor_history can be used to extract the last known state
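A minimal sketch of both approaches (the cluster ID 20327 and the log file name job.log are illustrative; the log must have been requested in the submit file with log = job.log):
# Poll the schedd for the job's current state
condor_q 20327
# Once the job has left the queue, ask the history file instead
condor_history 20327
# Or watch the state transitions appear in the job event log
tail -f job.log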
7. Job status
● Each job has a status associated with it
● An integer attribute called JobStatus
– But with well-known semantics associated with each value
● Jobs start in the Idle state
● Become Running if everything works fine
● Completed when they terminate
● If anything goes wrong, a job will go into the Held state
● If removed before completion, it will be Removed
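The numeric codes behind those names are fixed in HTCondor, so they can be queried directly; a quick sketch:
# JobStatus codes: 1=Idle, 2=Running, 3=Removed, 4=Completed, 5=Held
condor_q -format '%d.' ClusterId -format '%d ' ProcId -format '%d\n' JobStatus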
8. Monitoring the Job Status
● Idle/Running/Held jobs can be polled with condor_q
● Will query the Schedd daemon
● Once they terminate, or are removed, they leave the Schedd queue
● Are put into a file on disk
● Can use condor_history to retrieve the last ClassAd
(One exception: if a job was running when it was removed, but the execute node does not confirm the job was killed remotely, the job will be kept in the Schedd.)
● The job event log has all the state transitions (of course)
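A minimal example of pulling the last known ClassAd from the history file (the job ID is illustrative):
# Retrieve the full last ClassAd of a job that has left the queue
condor_history -l 20327.2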
9. So, where is the job running?
● Easy to get the machine name and/or IP
● Standard HTCondor attributes RemoteHost & StartdIpAddr
● But these may not necessarily make sense to you
● Do you recognize all network domains?
● And they could be on a private network!
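A sketch of extracting those attributes for all currently running jobs (JobStatus == 2 selects Running jobs):
# Show where each running job landed
condor_q -constraint 'JobStatus == 2' \
  -format '%d.' ClusterId -format '%d ' ProcId \
  -format '%s ' RemoteHost -format '%s\n' StartdIpAddr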
10. Getting glidein attributes
● Glideins have many more attributes that describe them
● e.g. the symbolic site name GLIDEIN_CMSSite
● However, by default, you do not get this info in the job ClassAd
● But it is easy to add
● <my attr> = $$(<glidein attr>:Unknown)
– Will get the info in MATCH_EXP_<my attr>
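A minimal submit-file sketch (the attribute name JOB_Site is an arbitrary example, not a standard name):
# Ask the matchmaker to expand a glidein attribute into the job ad,
# falling back to "Unknown" if the glidein does not define it
+JOB_Site = "$$(GLIDEIN_CMSSite:Unknown)"
# After the match, condor_q will show it as MATCH_EXP_JOB_Site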
11. Standard attributes
● Standard glideinWMS attributes
● JOB_GLIDEIN_Entry_Name = "$$(GLIDEIN_Entry_Name:Unknown)"
● JOB_GLIDEIN_Name = "$$(GLIDEIN_Name:Unknown)"
● JOB_GLIDEIN_Factory = "$$(GLIDEIN_Factory:Unknown)"
● JOB_GLIDEIN_Schedd = "$$(GLIDEIN_Schedd:Unknown)"
● JOB_GLIDEIN_ClusterId = "$$(GLIDEIN_ClusterId:...)"
● JOB_GLIDEIN_ProcId = "$$(GLIDEIN_ProcId:Unknown)"
● JOB_GLIDEIN_Site = "$$(GLIDEIN_Site:Unknown)"
(Useful for in-depth debugging)
● Standard CMS glideinWMS attribute
● JOB_CMSSite = "$$(GLIDEIN_CMSSite:Unknown)"
Configured by the HTCondor admin, no need for the user to do anything:
SUBMIT_EXPRS = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, ...
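Once configured, the expanded values can be read back from the job ClassAd; a sketch (job ID illustrative):
# Read the matched site name back from a job that has started
condor_q -format '%s\n' MATCH_EXP_JOB_CMSSite 20327.2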
12. Getting them in the event log
● You (or the admins) can also propagate the attributes into the event log
job_ad_information_attrs = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, …
● As a result you get “Job Ad” events:
...
001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
...
028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
TriggerEventTypeNumber = 1
Cluster = 20327
EventTypeNumber = 28
ExecuteHost = "<193.48.85.94:38749>"
JOB_CMSSite = "T2_FR_IPHC"
EventTime = "2012-12-03T00:46:33"
TriggerEventTypeName = "ULOG_EXECUTE"
Proc = 2
Subproc = 0
CurrentTime = time()
MyType = "ExecuteEvent"
...
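A quick way to fish those values out of the log (the file name job.log is illustrative):
# Pull the matched CMS site names out of the job event log
grep 'JOB_CMSSite' job.log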
13. Why is my job not starting?
14. Troubleshooting process
● First question
● Do my jobs match any (logical) resource?
● Once you are sure of that
● Are there jobs from higher-priority users?
● Are the desired sites just too busy?
● Are there problems at the desired site(s)?
● If nothing gives a satisfying answer
● It may be a glideinWMS misconfiguration; seek help from the VO FE admins
15. How do I know if my jobs match?
● Good question!
● Unfortunately, the answer is not trivial
● The FE matching policy is not “public”
● And, of course, there are no tools to probe for it
● You will have to rely on the FE admins to “explain” the policy
● Hopefully in a human-readable format
● Hopefully without conversion errors!
16. An example FE policy
● See the CMS FE talk for an actual high-level view
● The actual FE policy is a Python expression
A simple example – could be much more complex:
(glidein["attrs"]["GLIDEIN_CMSSite"]
 in job["DESIRED_Sites"].split(",")) and
((glidein["attrs"].get("GLIDEIN_Is_HTPC")=="True")
 == (job.get("DESIRES_HTPC")==1))
● And then there is the matching HTCondor one
(stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",")=?=True) &&
((GLIDEIN_Is_HTPC=?=True)=?=(DESIRES_HTPC=?=True))
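To see how the Python side evaluates, here is a self-contained sketch with made-up glidein and job ads (the dictionary contents are illustrative, not real FE data):
# Illustrative stand-ins for the FE's glidein and job ads
glidein = {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC", "GLIDEIN_Is_HTPC": "False"}}
job = {"DESIRED_Sites": "T2_FR_IPHC,T2_US_UCSD"}  # no DESIRES_HTPC attribute

# The matching expression from the slide above
match = ((glidein["attrs"]["GLIDEIN_CMSSite"]
          in job["DESIRED_Sites"].split(",")) and
         ((glidein["attrs"].get("GLIDEIN_Is_HTPC") == "True")
          == (job.get("DESIRES_HTPC") == 1)))
print(match)  # True: the site matches and neither side wants HTPC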
17. A word about HTCondor matching
● Once glideins start, you can probe their policy
$ condor_status -format '%s\n' START
( ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true )
&& ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) )
&& ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) )
( ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true )
&& ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) )
&& ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) )
...
● But there are no tools to help you understand the matchmaking
● The closest is condor_q -analyze
– But it only looks at the job Requirements
– So it does not really help when all/most of the policy is in the glideins
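Typical invocations, for what they are worth (the job ID is illustrative):
# Ask the schedd why job 20327.2 is not matching
condor_q -analyze 20327.2
# Newer versions also offer a more verbose variant
condor_q -better-analyze 20327.2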
18. User priorities
● So, jobs should be matching, but are not starting
● And there are plenty of matching glideins in the system
● Likely there are other higher-priority jobs in the system
● Possibly from a different user: check with condor_userprio (warning: slow!)
● Possibly on a different schedd: check with condor_status -submitters
● No tools to give you the easy answer
● If you need the answer, you will have to investigate
19. Unclaimed glideins
● If you see plenty of Unclaimed glideins, but no matching jobs from other users
● You have either reached the schedd limit MAX_JOBS_RUNNING
● Or something bad is going on!
● You can only ask your FE admin for help
● But first double check that your jobs should indeed be matching, at least on paper
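Two quick sanity checks, as a sketch:
# Count the Unclaimed glidein slots currently in the pool
condor_status -constraint 'State == "Unclaimed"' -total
# Check the schedd's configured running-job limit
condor_config_val MAX_JOBS_RUNNING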
20. Supported Sites
● What should you do if there are no (new) glideins coming from an expected site?
● First off, see if the site is even supported by the glideinWMS instance!
● Each Entry has a ClassAd
condor_status -any -const 'MyType=="glideresource"'
● Look for the attributes your FE is matching on, e.g. GLIDEIN_CMSSite
Site not there? Notify your FE admin!
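A sketch of listing the site names the entries advertise (GLIDEIN_CMSSite per the CMS convention above):
# List the CMS site name from every glideresource ClassAd
condor_status -any -const 'MyType=="glideresource"' -format '%s\n' GLIDEIN_CMSSite | sort -u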
21. Is the FE even asking for them?
● You are sure that your jobs should be matching?
● But what if you are wrong?
● Check it out
… -format '%i\n' GlideFactoryMonitorRequestedIdle
But remember: it is not just your jobs.
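The elided prefix is presumably the glideresource query from the previous slide; a hedged sketch of the full command (the same prefix presumably applies to the Pending/Running queries on the next slides):
# How many idle glideins is the FE requesting per entry? (assumed full command)
condor_status -any -const 'MyType=="glideresource"' \
  -format '%s ' GLIDEIN_CMSSite -format '%i\n' GlideFactoryMonitorRequestedIdle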
22. Maybe the site is just busy?
● Glideins have to compete with other Grid jobs at most sites
● Maybe the site is just busy?
● Check if glideinWMS has put any glideins in the Grid queues
… -format '%i\n' GlideFactoryMonitorStatusPending
If you find zeros, notify your FE admin!
23. Site problems?
● The glideins will validate the worker node before talking to the Central Manager (C.M.)
● If the test fails, the glidein will “waste” 20 mins on the node to prevent other jobs from failing on it again
● You can check if there are “Running” glideins in glideinWMS, even though you see none (or few) in the C.M.
… -format '%i\n' GlideFactoryMonitorStatusRunning
If you find a discrepancy, notify your FE admin!
24. Still no clue?
● If all your detective work fails
● Notify your VO FE admin
● They have access to information you don't
25. Why do my jobs take forever to finish?
26. My jobs are running, but...
● Great, your jobs are happily running
● But you are getting no results back!
● i.e. the jobs are not finishing in the expected time
● Two main likely reasons
● They are being restarted
● You miscalculated the needed time
27. Jobs re-starting
● HTCondor tries to be user friendly
● If a job gets preempted, for almost any reason, it will try to restart it with the hope it will finish on the next try
● And will not ever give up! (by default)
● You can easily check how many times it started
condor_q -format '%i\n' NumJobStarts
● You may want to cap the number with periodic_hold/remove
http://research.cs.wisc.edu/htcondor/manual/v7.8/condor_submit.html#condor-submit-periodic-remove
http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html#param:SystemPeriodicRemove
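A submit-file sketch capping restarts (the threshold of 5 is an arbitrary example):
# Give up after 5 starts instead of retrying forever
periodic_remove = NumJobStarts > 5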
28. Why is it restarting?
● OK, I now know it is restarting... but why?
● Most likely, the glidein was killed
● Was it due to your job “misbehaving”?
● Most Grid sites have limits on resource use
● Including CPU, memory and disk
● If you exceed them, the glidein (and you) will be killed
● Glideins should be configured to detect and hold/remove your job if you “misbehave”
● Thus you would not be restarted
● If you see many restarts, notify your FE admin; likely there is a policy rule missing
29. What is my job doing?
● What if it is not restarting... just running forever (or until hitting the time limit)?
● HTCondor allows for peeking at a running job
● A cmdline tool called condor_ssh_to_job
● Unfortunately, it needs implicit permission from the site
– And about half of the sites don't allow it
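Usage is straightforward where the site allows it (the job ID is illustrative):
# Open an interactive shell next to the running job on the worker node
condor_ssh_to_job 20327.2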
31. Pointers
● glideinWMS Home Page
http://tinyurl.com/glideinWMS
● HTCondor Home Page
http://research.cs.wisc.edu/htcondor/
● HTCondor support
htcondor-users@cs.wisc.edu
htcondor-admin@cs.wisc.edu
● glideinWMS support
glideinwms-support@fnal.gov
32. Acknowledgments
● The creation of this document was sponsored by grants from the US NSF and US DOE, and by the University of California system