The document discusses a study conducted on the Beocat HPC cluster at Kansas State University to understand why users terminate jobs early. The study found that user terminated jobs accounted for around 10% of total CPU time and 12.75% of user wait time, representing significant wasted resources. The top reasons for job termination included exploring the system, system errors, jobs not finishing on time, and jobs converging or not converging earlier than expected. The study suggests ways to address the top reasons and reduce wastage through techniques like improving system reliability, helping users estimate job runtimes better, and automating convergence detection. Repeating such studies on other clusters could help understand wastage in different HPC environments.
1. Why do users kill HPC jobs?
Venkatesh-Prasad Ranganath, Daniel Andresen
December 17-20, 2018
2. Context
⢠HPC clusters are worth millions of dollars
⢠Critical computations depend on HPC
⢠Numerous eďŹorts have explored HPC ROI
⢠Most eďŹorts have focused on improving non-human ROI
⢠System monitoring and management
⢠failure, power quality, temperature
⢠Programming support
⢠novel abstractions, exascale debugging, couplings
between experiments
3. Context
⢠Very few eďŹorts have explored human ROI
⢠Understand how software engineering aspects
inďŹuence development and use of scientiďŹc software
⢠Propose methods to model and measure human ROI
⢠No observational studies of user triggered wastage
Human ROI/productivity
⢠EďŹort expended by users to use HPC clusters
⢠Gains/Losses incurred by HPC users
4. Study: Questions
1. For what reasons do users terminate HPC jobs?
2. How often do users terminate jobs?
3. How much compute resource is wasted due to user terminated jobs?
4. How do user terminated jobs compare to system and scheduler terminated jobs and all jobs executed on the cluster in terms of consumed compute resources?
5. How does wasted computation translate into user wait times?
6. How do user terminated jobs compare to system and scheduler terminated jobs and all jobs executed on the cluster in terms of user wait times?
5. Study: Environment
1. Beocat cluster at Kansas State University
   1. XSEDE Federation (XF) Level 3 cluster
   2. ~7900 processor cores / 300+ nodes
   3. 16-80 cores per node
   4. 32GB-1.5TB RAM per node
2. Sun Grid Engine (SGE) was used for job scheduling
3. Around 400 unique users (students + researchers)
4. Supported by 1 application scientist and 2 sys admins
8. Study: Execution
⢠Conducted between Aug 15 2016 thru Dec 31 2017
⢠Used intervention to encourage users to participate in the
study; participation was voluntary and IRB approved
⢠Manually aggregated collected free-form reasons
⢠Used SGE accounting ďŹles as the source of runtime
information
⢠Analyzed collected reasons and runtime info using Awk :)
⢠Artifacts and scripts available at https://bitbucket.org/
rvprasad/why-do-users-kill-hpc-jobs
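The accounting-file step can be illustrated with a short Python sketch. This is not the study's Awk code. It assumes the colon-separated record layout described in the SGE accounting(5) manual page, with `failed`, `exit_status`, and `ru_wallclock` at zero-based positions 11, 12, and 13 and `cpu` at position 36; these positions should be checked against the local installation. Treating a record as a normal exit only when both `failed` and `exit_status` are 0 is also an assumption, since the slides do not state the rule the study used. The function name `summarize` is hypothetical.

```python
from collections import defaultdict

# Assumed zero-based field positions in a colon-separated SGE accounting
# record; verify against the accounting(5) page of your installation.
FAILED, EXIT_STATUS, RU_WALLCLOCK, CPU = 11, 12, 13, 36


def summarize(lines):
    """Sum CPU and wall-clock seconds, grouped by normal (Y) vs. abnormal (N) exit."""
    totals = defaultdict(lambda: {"cpu": 0.0, "wc": 0.0, "jobs": 0})
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split(":")
        if len(fields) <= CPU:
            continue  # skip records too short to carry the fields we need
        # Assumption: "normal exit" means failed == 0 and exit_status == 0.
        normal = fields[FAILED] == "0" and fields[EXIT_STATUS] == "0"
        bucket = totals["Y" if normal else "N"]
        bucket["cpu"] += float(fields[CPU])
        bucket["wc"] += float(fields[RU_WALLCLOCK])
        bucket["jobs"] += 1
    return dict(totals)
```

Passing an open accounting file to `summarize` yields per-class totals of the kind reported on the Job Costs slide; identifying which abnormal exits were user terminated needs the separately collected reasons and is not attempted here.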
9. Job Costs
Normal Exit    CPU Time (s)      WC Time (s)
Y              59,664,147,967    13,865,524,891
N              17,452,418,827     3,088,336,839
Total          77,116,566,794    16,953,861,730
Terminated      7,375,029,412     2,162,356,250
9.56% of total CPU time was wasted
12.75% of total WC (User) time was wasted
42.25% of total abnormal exit CPU time was wasted
70.02% of total abnormal exit WC (User) time was wasted
639,102 (649,542) jobs were executed (submitted)
26,967 jobs were terminated by users
13,598 jobs were executing during termination
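The wastage percentages follow directly from the table. The sketch below recomputes them, assuming (as the slide's wording implies) that the "Terminated" row is a subset of the abnormal-exit ("N") row. The names `share` and `wastage` are illustrative only.

```python
# Figures copied from the Job Costs table (seconds).
CPU = {"normal": 59_664_147_967, "abnormal": 17_452_418_827, "terminated": 7_375_029_412}
WC = {"normal": 13_865_524_891, "abnormal": 3_088_336_839, "terminated": 2_162_356_250}


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


def wastage(cost):
    """Share of user-terminated time in all jobs and in abnormal exits."""
    total = cost["normal"] + cost["abnormal"]
    return {
        "of_total": share(cost["terminated"], total),
        "of_abnormal": share(cost["terminated"], cost["abnormal"]),
    }


for name, cost in (("CPU", CPU), ("WC", WC)):
    w = wastage(cost)
    print(f"{name}: {w['of_total']:.2f}% of total, {w['of_abnormal']:.2f}% of abnormal exits")
```

With two-decimal rounding, the abnormal-exit CPU share prints as 42.26 rather than the slide's 42.25; the slide value appears to be truncated rather than rounded. The other three figures match.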
10. Reasons & Their Costs
Reasons for User Triggered Terminations CPU Time % WC Time %
1 Exploring and testing Beocat 10.41 32.50
2 System errors 10.10 6.06
3 Incorrect application parameters
4 Decided to change application parameters
5 Computation has converged 4.99
6 Computation is not converging 3.98
7 Application code crashed or encountered errors
8 Job script encountered errors 5.46
9 Decided to change job parameters
10 Issues with requested amount of memory
11 Job will not finish on time 3.08 5.23
12 Testing or debugging code
13 External user error
14 Conflicts with other submitted jobs 4.98
15 Unable to understand the provided reason 9.79 3.83
16 Inefficient use of resources
17 No reasons were provided 45.57 37.13
Total (seconds) 7,375,029,412 2,162,356,250
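To give the percentages a more tangible scale, the sketch below converts them into hours using the "Total (seconds)" row. It covers only the rows where both a CPU and a WC percentage are legible; rows with a single surviving figure are left out because the column that figure belongs to cannot be determined from the table as shown. The helper names are illustrative.

```python
TOTAL_CPU_S = 7_375_029_412  # "Total (seconds)" row, CPU time
TOTAL_WC_S = 2_162_356_250   # "Total (seconds)" row, WC time

# Only rows whose CPU and WC percentages are both legible in the table.
SHARES = {
    "Exploring and testing Beocat": (10.41, 32.50),
    "System errors": (10.10, 6.06),
    "Job will not finish on time": (3.08, 5.23),
    "Unable to understand the provided reason": (9.79, 3.83),
    "No reasons were provided": (45.57, 37.13),
}


def hours(percent, total_seconds):
    """Convert a percentage of a total given in seconds into hours."""
    return percent / 100.0 * total_seconds / 3600.0


def cost_table(shares):
    """Return (reason, CPU hours, WC hours) rows, largest CPU cost first."""
    rows = [(r, hours(c, TOTAL_CPU_S), hours(w, TOTAL_WC_S)) for r, (c, w) in shares.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for reason, cpu_h, wc_h in cost_table(SHARES):
    print(f"{reason:<42} {cpu_h:>12,.0f} CPU h {wc_h:>10,.0f} WC h")
```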
11. Remediations for Top Reasons
⢠System errors: Improve cluster reliability and reduce system
failures
⢠ConďŹicts with other submitted jobs: Help users identify and
use useful conďŹgurations
⢠Computation has converged: Use automation to detect
convergence
⢠Computation is not converging: Use automation to avoid/
detect divergent computations
⢠Job will not ďŹnish on time: Help users to better estimate time
required for jobs
⢠Exploring and testing Beocat: Limit compute time or use
dedicated testing sub-cluster or job queue with diďŹerent SLA
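As one illustration of what automated convergence and divergence detection could look like, the sketch below classifies a history of residuals from an iterative computation. It is a generic heuristic, not a technique taken from the study; the function name `classify` and the thresholds `tol`, `window`, and `blowup` are hypothetical and would need tuning for a real application.

```python
import math


def classify(residuals, tol=1e-6, window=5, blowup=1e3):
    """Classify a residual history as converged, diverging, stalled, or running.

    All thresholds are illustrative defaults, not values from the study.
    """
    if not residuals:
        return "running"
    latest = residuals[-1]
    if math.isnan(latest) or math.isinf(latest):
        return "diverging"  # numerical breakdown
    if latest <= tol:
        return "converged"
    if latest > blowup * min(residuals):
        return "diverging"  # residual has grown far past its best value
    # No improvement over the best earlier residual within the last `window` steps.
    if len(residuals) > window and min(residuals[-window:]) >= min(residuals[:-window]):
        return "stalled"
    return "running"
```

A job could call such a check after each iteration and exit (or checkpoint and exit) once the result is anything other than "running", instead of relying on the user to notice and kill the job by hand.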
12. Possible Data Quality Issues
⢠Missing Reasons
⢠Incomprehensible reasons / No reasons are provided
⢠Ungathered Reasons
⢠Crashed or unterminated jobs whose results were
discarded
⢠Inconsistent Reasons
⢠DiďŹering reasons for same kind of jobs or situations
⢠MisclassiďŹed Reasons
⢠Rater biases and human error
14. Current and Future Work
⢠Educate users about using Beocat
⢠Reduce wastage using existing techniques
⢠E.g., explore use of checkpointing solutions
⢠Revamp monitoring and data collection on Beocat
⢠Explore options to address data quality
⢠Repeat the study on other clusters similar to Beocat
⢠E.g., other XSEDE (XF) level 3 clusters
⢠Repeat the study on clusters not similar to Beocat
⢠E.g., XSEDE (XF) level 1 and 2 clusters
15. Takeaway
Call to Action
⢠User terminated HPC jobs contribute non-trivial amount of
wasted computation, e.g., 10% of execution time
⢠Top reasons for users to terminate HPC jobs can
⢠be tackled with existing techniques or
⢠serve as good research directions to improve HPC
⢠Repeat the study on your clusters to understand the kinds of
wastage in diďŹerent HPC environments
⢠Explore human (soft) aspects in HPC
https://bitbucket.org/rvprasad/why-do-users-kill-hpc-jobs