Why do users kill HPC jobs?
Venkatesh-Prasad Ranganath, Daniel Andresen
December 17-20, 2018
Context
• HPC clusters are worth millions of dollars

• Critical computations depend on HPC

• Numerous efforts have explored HPC ROI

• Most efforts have focused on improving non-human ROI

  • System monitoring and management

    • failure, power quality, temperature

  • Programming support

    • novel abstractions, exascale debugging, couplings between experiments
Context
• Very few efforts have explored human ROI

• Understand how software engineering aspects
influence development and use of scientific software

• Propose methods to model and measure human ROI

• No observational studies of user triggered wastage

Human ROI/productivity

• Effort expended by users to use HPC clusters

• Gains/Losses incurred by HPC users
Study: Questions
1. For what reasons do users terminate HPC jobs?

2. How often do users terminate jobs?

3. How much compute resource is wasted due to user-terminated jobs?

4. How do user-terminated jobs compare to system- and
scheduler-terminated jobs and all jobs executed on the
cluster in terms of consumed compute resources?

5. How does wasted computation translate into user wait
times?

6. How do user-terminated jobs compare to system- and
scheduler-terminated jobs and all jobs executed on the
cluster in terms of user wait times?
Study: Environment
1. Beocat cluster at Kansas State University

   1. XSEDE Federation (XF) Level 3 cluster

   2. ~7900 processor cores / 300+ nodes

   3. 16-80 cores per node

   4. 32GB-1.5TB RAM per node

2. Sun Grid Engine (SGE) was used for job scheduling

3. Around 400 unique users (students + researchers)

4. Supported by 1 application scientist and 2 sys admins
Study: Offline Design
$ qdel 1234
Study: Online Design
$ qdel 1234 -issue "scripting error" -app VASP
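As a rough illustration of the online design, a wrapper in front of `qdel` could peel off the reason flags before handing the remaining arguments to the real command. The sketch below is hypothetical: the function name `split_qdel_args` and the record layout are assumptions, and only the `-issue` and `-app` flags come from the slide. It is not the study's actual implementation.

```python
def split_qdel_args(argv):
    """Separate reason flags from the arguments meant for the real qdel.

    Hypothetical helper: only the flag names -issue and -app come from
    the slide; everything else here is an illustrative assumption.
    """
    reason = {"issue": None, "app": None}
    passthrough = []
    args = iter(argv)
    for arg in args:
        if arg in ("-issue", "-app"):
            # The value is the next token; None if the flag is dangling.
            reason[arg[1:]] = next(args, None)
        else:
            passthrough.append(arg)
    return passthrough, reason


if __name__ == "__main__":
    rest, reason = split_qdel_args(
        ["1234", "-issue", "scripting error", "-app", "VASP"]
    )
    print(rest, reason)
```

A real wrapper would then log `reason` alongside the job id and invoke the scheduler's `qdel` with `rest`.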
Study: Execution
• Conducted from Aug 15, 2016 through Dec 31, 2017

• Used an intervention to encourage users to participate in the
study; participation was voluntary and IRB approved

• Manually aggregated collected free-form reasons

• Used SGE accounting files as the source of runtime
information

• Analyzed collected reasons and runtime info using Awk :)

• Artifacts and scripts available at https://bitbucket.org/rvprasad/why-do-users-kill-hpc-jobs
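The study used Awk for this analysis; the sketch below shows the same kind of aggregation in Python instead. It assumes the colon-separated record layout described in the Grid Engine accounting(5) man page (0-based positions 11, 12, and 13 for `failed`, `exit_status`, and `ru_wallclock`); verify these positions against your own installation. Treating "normal exit" as `failed == 0` and `exit_status == 0` is also an assumption, not the authors' stated definition.

```python
from collections import defaultdict

# Assumed 0-based field positions in an SGE accounting record
# (per accounting(5)); check them against your scheduler version.
FAILED, EXIT_STATUS, RU_WALLCLOCK = 11, 12, 13


def wallclock_by_exit(lines):
    """Sum wall-clock seconds for normal and abnormal exits."""
    totals = defaultdict(float)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split(":")
        if len(fields) <= RU_WALLCLOCK:
            continue  # skip truncated records
        # Assumption: a normal exit has failed == 0 and exit_status == 0.
        normal = fields[FAILED] == "0" and fields[EXIT_STATUS] == "0"
        key = "normal" if normal else "abnormal"
        totals[key] += float(fields[RU_WALLCLOCK])
    return dict(totals)
```

The function accepts any iterable of lines, so it can be fed an open accounting file directly.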
Job Costs
Normal Exit   CPU Time (s)      WC Time (s)
Y             59,664,147,967    13,865,524,891
N             17,452,418,827     3,088,336,839
Total         77,116,566,794    16,953,861,730
Terminated     7,375,029,412     2,162,356,250
9.56% of total CPU time was wasted

12.75% of total WC (User) time was wasted

42.25% of total abnormal exit CPU time was wasted

70.02% of total abnormal exit WC (User) time was wasted
639,102 (649,542) jobs were executed (submitted)

26,967 jobs were terminated by users

13,598 jobs were executing during termination
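The percentages on this slide can be rechecked from the table's raw totals. The sketch below assumes that the "Terminated" row is a subset of the abnormal-exit (N) row and that WC means wall-clock time. The dictionary keys and the function name are illustrative.

```python
# Totals copied from the Job Costs table (seconds).
CPU = {"normal": 59_664_147_967, "abnormal": 17_452_418_827,
       "user_terminated": 7_375_029_412}
WC = {"normal": 13_865_524_891, "abnormal": 3_088_336_839,
      "user_terminated": 2_162_356_250}

JOBS_EXECUTED = 639_102
JOBS_USER_TERMINATED = 26_967


def wasted_share(costs):
    """Percent of time spent in user-terminated jobs."""
    total = costs["normal"] + costs["abnormal"]
    return {
        "of_total": 100 * costs["user_terminated"] / total,
        "of_abnormal": 100 * costs["user_terminated"] / costs["abnormal"],
    }


if __name__ == "__main__":
    for name, costs in (("CPU", CPU), ("WC", WC)):
        share = wasted_share(costs)
        print(f"{name}: {share['of_total']:.2f}% of total, "
              f"{share['of_abnormal']:.2f}% of abnormal exits")
    pct_jobs = 100 * JOBS_USER_TERMINATED / JOBS_EXECUTED
    print(f"User-terminated jobs: {pct_jobs:.2f}% of executed jobs")
```

Small differences in the last digit against the slide's figures can come from rounding versus truncation.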
Reasons & Their Costs
Reasons for User Triggered Terminations CPU Time % WC Time %
1 Exploring and testing Beocat 10.41 32.50
2 System errors 10.10 6.06
3 Incorrect application parameters
4 Decided to change application parameters
5 Computation has converged 4.99
6 Computation is not converging 3.98
7 Application code crashed or encountered errors
8 Job script encountered errors 5.46
9 Decided to change job parameters
10 Issues with requested amount of memory
11 Job will not nish on time 3.08 5.23
12 Testing or debugging code
13 External user error
14 Conflicts with other submitted jobs 4.98
15 Unable to understand the provided reason 9.79 3.83
16 Inecient use of resources
17 No reasons were provided 45.57 37.13
Total (seconds) 7,375,029,412 2,162,356,250
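To get a feel for the scale behind these percentages, they can be converted back into seconds using the totals in the last row. The sketch below covers only the rows where both percentages survive in the table above. The CPU-year conversion assumes a 365-day year and that one CPU second is one core busy for one second. All names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Totals from the last row of the table (seconds).
TOTAL_CPU_S = 7_375_029_412
TOTAL_WC_S = 2_162_356_250

# (CPU time %, WC time %) for rows with both values present.
SHARES = {
    "Exploring and testing Beocat": (10.41, 32.50),
    "System errors": (10.10, 6.06),
    "Job will not finish on time": (3.08, 5.23),
    "Unable to understand the provided reason": (9.79, 3.83),
    "No reasons were provided": (45.57, 37.13),
}


def to_seconds(cpu_pct, wc_pct):
    """Convert percentage shares into CPU and wall-clock seconds."""
    return cpu_pct / 100 * TOTAL_CPU_S, wc_pct / 100 * TOTAL_WC_S


def cpu_years(seconds):
    """Express CPU seconds as CPU-years (365-day year assumed)."""
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    for reason, (cpu_pct, wc_pct) in SHARES.items():
        cpu_s, wc_s = to_seconds(cpu_pct, wc_pct)
        print(f"{reason}: {cpu_years(cpu_s):.1f} CPU-years, "
              f"{wc_s / 3600:.0f} wall-clock hours")
```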
Remediations for Top Reasons
• System errors: Improve cluster reliability and reduce system
failures

• Conflicts with other submitted jobs: Help users identify and
use useful configurations

• Computation has converged: Use automation to detect
convergence

• Computation is not converging: Use automation to avoid/
detect divergent computations

• Job will not finish on time: Help users to better estimate time
required for jobs

• Exploring and testing Beocat: Limit compute time or use
dedicated testing sub-cluster or job queue with different SLA
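As one possible reading of the convergence-related remediations, a job could periodically inspect its own residual history and stop itself rather than wait for the user to notice. The sketch below is a generic heuristic, not a technique from the study. The function name, the tolerance, and the window size are all assumptions that would need tuning per application.

```python
def classify_progress(residuals, tol=1e-6, window=3):
    """Label a residual history as 'converged', 'diverging', or 'running'.

    Heuristic sketch: 'converged' means the last `window` residuals are
    all below `tol`; 'diverging' means the residual grew strictly at
    each of the last `window` steps.
    """
    if len(residuals) < window + 1:
        return "running"  # not enough history to decide
    recent = residuals[-(window + 1):]
    if all(r < tol for r in recent[1:]):
        return "converged"
    if all(later > earlier for earlier, later in zip(recent, recent[1:])):
        return "diverging"
    return "running"
```

A solver loop could call this after each iteration and exit cleanly, or checkpoint, once the label is no longer `running`.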
Possible Data Quality Issues
• Missing Reasons
  • Incomprehensible reasons / No reasons are provided

• Ungathered Reasons
  • Crashed or unterminated jobs whose results were discarded

• Inconsistent Reasons
  • Differing reasons for same kind of jobs or situations

• Misclassified Reasons
  • Rater biases and human error
Offline Design vs Online Design
Current and Future Work
• Educate users about using Beocat

• Reduce wastage using existing techniques

  • E.g., explore use of checkpointing solutions

• Revamp monitoring and data collection on Beocat

• Explore options to address data quality

• Repeat the study on other clusters similar to Beocat

  • E.g., other XSEDE (XF) level 3 clusters

• Repeat the study on clusters not similar to Beocat

  • E.g., XSEDE (XF) level 1 and 2 clusters
Takeaway
Call to Action
• User-terminated HPC jobs contribute a non-trivial amount of
wasted computation, e.g., 10% of execution time

• Top reasons for users to terminate HPC jobs can

  • be tackled with existing techniques or

  • serve as good research directions to improve HPC

• Repeat the study on your clusters to understand the kinds of
wastage in different HPC environments

• Explore human (soft) aspects in HPC
https://bitbucket.org/rvprasad/why-do-users-kill-hpc-jobs
