This document discusses tools for monitoring and troubleshooting jobs in a glideinWMS-based HTCondor pool. It describes how to determine where jobs are running, why jobs may not be starting, and why jobs are taking a long time to finish. The key tools covered are condor_q, condor_history, and the job event log. The document also provides guidance on checking job requirements, user priorities, supported sites, and job restarts.
Monitoring and troubleshooting a glideinWMS-based HTCondor pool
1. glideinWMS for users: Monitoring and troubleshooting a glideinWMS-based HTCondor pool
by Igor Sfiligoi (UCSD)
2. Scope of this talk
This talk describes what information is available when troubleshooting in a glideinWMS-based HTCondor pool, and what tools you can use to mine it.
The reader is expected to already have a basic understanding of HTCondor and glideinWMS.
3. HTCondor Architecture
● As a reminder
(Architecture diagram: the VO Frontend (VO FE) and Glidein Factories (G.F.) operate on the Grid; the HTCondor pool consists of a Central manager running the Negotiator, Submit nodes each running a Schedd, and Execute nodes.)
4. Typical user questions addressed in this talk
● Where is/was my job running?
● Why are my jobs not starting?
● Why do my jobs take forever to finish?
5. Where is/was my job running?
6. Job progress monitoring
● HTCondor provides two basic means to monitor job progress
● Querying the system for current status
– Using the cmdline condor_q/condor_history
● Parsing the job event log
– Either plain text or XML formatted
– Starting with 7.9.1, condor_history can be used to extract the last known state
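A minimal sketch of both approaches (the cluster ID 20327 and the log file name job.log are illustrative; the log must have been requested in the submit file with log = job.log):
# Poll the schedd for the job's current state
condor_q 20327
# Once the job has left the queue, ask the history file instead
condor_history 20327
# Or watch the state transitions appear in the job event log
tail -f job.log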
7. Job status
● Each job has a status associated with it
● An integer attribute called JobStatus
– But with well-known semantics associated with each value
● Jobs start in the Idle state
● Become Running if everything works fine
● Completed when they terminate
● If anything goes wrong, a job will go into the Held state
● If removed before completion, it will be Removed
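The numeric codes behind those names are fixed in HTCondor, so they can be queried directly; a quick sketch:
# JobStatus codes: 1=Idle, 2=Running, 3=Removed, 4=Completed, 5=Held
condor_q -format '%d.' ClusterId -format '%d ' ProcId -format '%d\n' JobStatus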
8. Monitoring the Job Status
● Idle/Running/Held jobs can be polled with condor_q
● Will query the Schedd daemon
● Once they terminate, or are removed, they leave the Schedd queue
● Are put into a file on disk
● Can use condor_history to retrieve the last ClassAd
(One exception: if a job was running when it was removed, but the execute node does not confirm the job was killed remotely, the job will be kept in the Schedd.)
● The job event log has all the state transitions (of course)
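A minimal example of pulling the last known ClassAd from the history file (the job ID is illustrative):
# Retrieve the full last ClassAd of a job that has left the queue
condor_history -l 20327.2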
9. So, where is the job running?
● Easy to get the machine name and/or IP
● Standard HTCondor attributes RemoteHost & StartdIpAddr
● But these may not necessarily make sense to you
● Do you recognize all network domains?
● And they could be on a private network!
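A sketch of extracting those attributes for all currently running jobs (JobStatus == 2 selects Running jobs):
# Show where each running job landed
condor_q -constraint 'JobStatus == 2' \
  -format '%d.' ClusterId -format '%d ' ProcId \
  -format '%s ' RemoteHost -format '%s\n' StartdIpAddr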
10. Getting glidein attributes
● Glideins have many more attributes that describe them
● e.g. the symbolic site name GLIDEIN_CMSSite
● However, by default, you do not get this info in the job ClassAd
● But it is easy to add
● <my attr> = $$(<glidein attr>:Unknown)
– Will get the info in MATCH_EXP_<my attr>
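A minimal submit-file sketch (the attribute name JOB_Site is an arbitrary example, not a standard name):
# Ask the matchmaker to expand a glidein attribute into the job ad,
# falling back to "Unknown" if the glidein does not define it
+JOB_Site = "$$(GLIDEIN_CMSSite:Unknown)"
# After the match, condor_q will show it as MATCH_EXP_JOB_Site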
11. Standard attributes
● Standard glideinWMS attributes
● JOB_GLIDEIN_Entry_Name = "$$(GLIDEIN_Entry_Name:Unknown)"
● JOB_GLIDEIN_Name = "$$(GLIDEIN_Name:Unknown)"
● JOB_GLIDEIN_Factory = "$$(GLIDEIN_Factory:Unknown)"
● JOB_GLIDEIN_Schedd = "$$(GLIDEIN_Schedd:Unknown)"
● JOB_GLIDEIN_ClusterId = "$$(GLIDEIN_ClusterId:...)"
● JOB_GLIDEIN_ProcId = "$$(GLIDEIN_ProcId:Unknown)"
● JOB_GLIDEIN_Site = "$$(GLIDEIN_Site:Unknown)"
(Useful for in-depth debugging)
● Standard CMS glideinWMS attribute
● JOB_CMSSite = "$$(GLIDEIN_CMSSite:Unknown)"
Configured by the HTCondor admin, no need for the user to do anything:
SUBMIT_EXPRS = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, ...
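Once configured, the expanded values can be read back from the job ClassAd; a sketch (job ID illustrative):
# Read the matched site name back from a job that has started
condor_q -format '%s\n' MATCH_EXP_JOB_CMSSite 20327.2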
12. Getting them in the event log
● You (or the admins) can also propagate the attributes into the event log
job_ad_information_attrs = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, …
● As a result you get “Job Ad” events:
...
001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
...
028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
TriggerEventTypeNumber = 1
Cluster = 20327
EventTypeNumber = 28
ExecuteHost = "<193.48.85.94:38749>"
JOB_CMSSite = "T2_FR_IPHC"
EventTime = "2012-12-03T00:46:33"
TriggerEventTypeName = "ULOG_EXECUTE"
Proc = 2
Subproc = 0
CurrentTime = time()
MyType = "ExecuteEvent"
...
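A quick way to fish those values out of the log (the file name job.log is illustrative):
# Pull the matched CMS site names out of the job event log
grep 'JOB_CMSSite' job.log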
13. Why is my job not starting?
14. Troubleshooting process
● First question
● Do my jobs match any (logical) resource?
● Once you are sure of that
● Are there jobs from higher-priority users?
● Are the desired sites just too busy?
● Are there problems at the desired site(s)?
● If nothing gives a satisfying answer
● It may be a glideinWMS misconfiguration; seek help from the VO FE admins
15. How do I know if my jobs match?
● Good question!
● Unfortunately, the answer is not trivial
● The FE matching policy is not “public”
● And, of course, there are no tools to probe for it
● You will have to rely on the FE admins to “explain” the policy
● Hopefully in a human-readable format
● Hopefully without conversion errors!
16. An example FE policy
● See the CMS FE talk for an actual high-level view
● The actual FE policy is a Python expression
A simple example – could be much more complex:
(glidein["attrs"]["GLIDEIN_CMSSite"]
 in job["DESIRED_Sites"].split(",")) and
((glidein["attrs"].get("GLIDEIN_Is_HTPC")=="True")
 == (job.get("DESIRES_HTPC")==1))
● And then there is the matching HTCondor one
(stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",")=?=True) &&
((GLIDEIN_Is_HTPC=?=True)=?=(DESIRES_HTPC=?=True))
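To see how the Python side evaluates, here is a self-contained sketch with made-up glidein and job ads (the dictionary contents are illustrative, not real FE data):
# Illustrative stand-ins for the FE's glidein and job ads
glidein = {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC", "GLIDEIN_Is_HTPC": "False"}}
job = {"DESIRED_Sites": "T2_FR_IPHC,T2_US_UCSD"}  # no DESIRES_HTPC attribute

# The matching expression from the slide above
match = ((glidein["attrs"]["GLIDEIN_CMSSite"]
          in job["DESIRED_Sites"].split(",")) and
         ((glidein["attrs"].get("GLIDEIN_Is_HTPC") == "True")
          == (job.get("DESIRES_HTPC") == 1)))
print(match)  # True: the site matches and neither side wants HTPC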
17. A word about HTCondor matching
● Once glideins start, you can probe their policy
$ condor_status -format '%s\n' START
( ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true )
&& ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) )
&& ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) )
( ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true )
&& ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) )
&& ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) )
...
● But there are no tools to help you understand the matchmaking
● The closest is condor_q -analyze
– But it only looks at the job Requirements
– So it does not really help when all/most of the policy is in the glideins
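Typical invocations, for what they are worth (the job ID is illustrative):
# Ask the schedd why job 20327.2 is not matching
condor_q -analyze 20327.2
# Newer versions also offer a more verbose variant
condor_q -better-analyze 20327.2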
18. User priorities
● So, jobs should be matching, but are not starting
● And there are plenty of matching glideins in the system
● Likely there are other higher-priority jobs in the system
● Possibly from a different user: check with condor_userprio (warning: slow!)
● Possibly on a different schedd: check with condor_status -submitters
● No tools to give you the easy answer
● If you need the answer, you will have to investigate
19. Unclaimed glideins
● If you see plenty of Unclaimed glideins, but no matching jobs from other users
● You have either reached the schedd limit MAX_JOBS_RUNNING
● Or something bad is going on!
● You can only ask your FE admin for help
● But first double check that your jobs should indeed be matching, at least on paper
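Two quick sanity checks, as a sketch:
# Count the Unclaimed glidein slots currently in the pool
condor_status -constraint 'State == "Unclaimed"' -total
# Check the schedd's configured running-job limit
condor_config_val MAX_JOBS_RUNNING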
20. Supported Sites
● What should you do if there are no (new) glideins coming from an expected site?
● First off, see if the site is even supported by the glideinWMS instance!
● Each Entry has a ClassAd
condor_status -any -const 'MyType=="glideresource"'
● Look for the attributes your FE is matching on, e.g. GLIDEIN_CMSSite
Site not there? Notify your FE admin!
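A sketch of listing the site names the entries advertise (GLIDEIN_CMSSite per the CMS convention above):
# List the CMS site name from every glideresource ClassAd
condor_status -any -const 'MyType=="glideresource"' -format '%s\n' GLIDEIN_CMSSite | sort -u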
21. Is the FE even asking for them?
● You are sure that your jobs should be matching?
● But what if you are wrong?
● Check it out
… -format '%i\n' GlideFactoryMonitorRequestedIdle
But remember: it is not just your jobs.
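The elided prefix is presumably the glideresource query from the previous slide; a hedged sketch of the full command (the same prefix presumably applies to the Pending/Running queries on the next slides):
# How many idle glideins is the FE requesting per entry? (assumed full command)
condor_status -any -const 'MyType=="glideresource"' \
  -format '%s ' GLIDEIN_CMSSite -format '%i\n' GlideFactoryMonitorRequestedIdle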
22. Maybe the site is just busy?
● Glideins have to compete with other Grid jobs at most sites
● Maybe the site is just busy?
● Check if glideinWMS has put any glideins in the Grid queues
… -format '%i\n' GlideFactoryMonitorStatusPending
If you find zeros, notify your FE admin!
23. Site problems?
● The glideins will validate the worker node before talking to the Central Manager (C.M.)
● If the test fails, the glidein will “waste” 20 mins on the node to prevent other jobs from failing on it again
● You can check if there are “Running” glideins in glideinWMS, even though you see none (or few) in the C.M.
… -format '%i\n' GlideFactoryMonitorStatusRunning
If you find a discrepancy, notify your FE admin!
24. Still no clue?
● If all your detective work fails
● Notify your VO FE admin
● They have access to information you don't
25. Why do my jobs take forever to finish?
26. My jobs are running, but...
● Great, your jobs are happily running
● But you are getting no results back!
● i.e. the jobs are not finishing in the expected time
● Two main likely reasons
● They are being restarted
● You miscalculated the needed time
27. Jobs re-starting
● HTCondor tries to be user friendly
● If a job gets preempted, for almost any reason, it will try to restart it with the hope it will finish on the next try
● And will not ever give up! (by default)
● You can easily check how many times it started
condor_q -format '%i\n' NumJobStarts
● You may want to cap the number with periodic_hold/remove
http://research.cs.wisc.edu/htcondor/manual/v7.8/condor_submit.html#condor-submit-periodic-remove
http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html#param:SystemPeriodicRemove
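A submit-file sketch capping restarts (the threshold of 5 is an arbitrary example):
# Give up after 5 starts instead of retrying forever
periodic_remove = NumJobStarts > 5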
28. Why is it restarting?
● OK, I now know it is restarting... but why?
● Most likely, the glidein was killed
● Was it due to your job “misbehaving”?
● Most Grid sites have limits on resource use
● Including CPU, memory and disk
● If you exceed them, the glidein (and you) will be killed
● Glideins should be configured to detect and hold/remove your job if you “misbehave”
● Thus you would not be restarted
● If you see many restarts, notify your FE admin; likely there is a policy rule missing
29. What is my job doing?
● What if it is not restarting... just running forever (or until hitting the time limit)?
● HTCondor allows for peeking at a running job
● A cmdline tool called condor_ssh_to_job
● Unfortunately, it needs implicit permission from the site
– And about half of the sites don't allow it
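Usage is straightforward where the site allows it (the job ID is illustrative):
# Open an interactive shell next to the running job on the worker node
condor_ssh_to_job 20327.2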
31. Pointers
● glideinWMS Home Page
http://tinyurl.com/glideinWMS
● HTCondor Home Page
http://research.cs.wisc.edu/htcondor/
● HTCondor support
htcondor-users@cs.wisc.edu
htcondor-admin@cs.wisc.edu
● glideinWMS support
glideinwms-support@fnal.gov
32. Acknowledgments
● The creation of this document was sponsored by grants from the US NSF and US DOE, and by the University of California system