In this deck from the 2015 PBS Works User Group, Andrew Wellington, Senior HPC Systems Administrator at NCI, presents: PBS and Scheduling at NCI: The past, present and future.
Watch the video presentation: http://insidehpc.com/2015/09/video-pbs-and-scheduling-at-nci-the-past-present-and-future/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
3. nci.org.au
NCI - An Overview
• NCI is Australia’s national high-performance computing service
• comprehensive, vertically-integrated research service
• providing national access on priority and merit
• driven by research objectives
• Operates as a formal collaboration of ANU, CSIRO, the Australian Bureau of Meteorology and Geoscience Australia
• In partnership with a number of research-intensive universities, supported by the Australian Research Council
4. nci.org.au
Where are we located?
• Nation’s capital — Canberra, ACT
• At its National University — The Australian National University (ANU)
5. nci.org.au
Research Communities
Research focus areas
• Climate Science and Earth System Science
• Astronomy (optical and theoretical)
• Geosciences: Geophysics, Earth Observation
• Biosciences & Bioinformatics
• Computational Sciences
• Engineering
• Chemistry
• Physics
• Social Sciences
• Growing emphasis on data-intensive computation
• Cloud Services
• Earth System Grid
8. nci.org.au
Past Machines - SC
• 126 nodes with 4x 1GHz Alpha CPUs (504 CPUs)
• Between 4 - 16GB RAM per node
• Total 700GB RAM, 2.88TB global disk, 13.1TB total disk
• Theoretical peak over 1 Tflop
• Linpack at 820 Gflops
• Quadrics “Elan3” interconnect
• 250 Mbytes/sec bidirectional
9. nci.org.au
Past Machines - LC
• 152 nodes with 2.66GHz Pentium 4
• 1GB RAM per node
• Theoretical peak over 800 Gflops
• 1.4TB global storage, 16TB total disk
• Gigabit ethernet interconnect
10. nci.org.au
Past Machines - AC
• 1928 1.6GHz Itanium2 processors
• Grouped into 30 partitions with 64 processors each
• Total 5.6TB RAM, 30TB global disk, 47TB total disk
• Theoretical peak over 11 Tflops
• SGI NUMAlink4 interconnect
11. nci.org.au
Past Machines - XE
• 156 nodes with 2x quad-core 3.0GHz Xeon Harpertown (1248 cores)
• Between 16 - 32GB RAM per node
• Total 2.7TB RAM, 54TB global disk, 130TB total disk
• 32 NVIDIA Tesla Fermi GPUs (16 Tflops)
• Theoretical peak almost 15 Tflops
• DDR Infiniband interconnect
12. nci.org.au
Past Machines - Vayu
• 1492 nodes with 2x quad-core 2.93GHz Xeon Nehalem (11,936 cores)
• Between 24 - 96GB RAM per node
• Total 37TB RAM, 800TB global disk
• Theoretical peak approx 140 Tflops
• QDR Infiniband interconnect
13. nci.org.au
Current Machine - Raijin
• 3592 nodes with 2x 8-core 2.6GHz Xeon Sandy Bridge (57,472 cores)
• Between 32 - 128GB RAM per node
• Total 160TB RAM, 10 PB global disk
• Theoretical peak approx 1.2 PFlops
• FDR Infiniband interconnect
• Around 52km (32 miles) of IB cabling
• 1.5 MW power; 100 tonnes of water in cooling
• Access to global Lustre filesystems (21 PB+)
14. nci.org.au
Current Machine - Tenjin Cloud
• Dell C8000 based high performance cloud
• 100 nodes with 2x 8-core 2.6GHz Xeon Sandy Bridge (1600 cores)
• 128GB RAM per node
• Over 12TB main memory, 650TB Ceph global storage
• Access to global Lustre filesystems (21 PB+)
• OpenStack management
15. nci.org.au
The Past - ANUPBS
• A heavily customised fork of OpenPBS v2.3
• Adds a lot of new commands
• jobnodes, jobs_on_node, nqstat, pbs_rusage, pbsrsh, pestat, qcat, qcp, qls, qps, qwait
• Modified a number of commands to add new options
• pbsdsh, pbsnodes, qalter, qdel, qorder, qrerun, qrun, qsub
• Unique scheduling algorithm
• Tight integration with local accounting and allocation system
• License shadow daemon integration for tracking usage of licensed software
• Support for local “jobfs” filesystem
• Support for cpusets (part of cgroups now)
16. nci.org.au
The Past - More Features
• The concept of draining the system or individual nodes
• Basically dedicated time options but for individual nodes
• Configuring nodes with maximum walltimes for jobs
• E.g. this node only runs jobs with a walltime of less than 2 hours
17. nci.org.au
The Past - Accounting
• ANUPBS tightly integrated with “RASH” (Resource Accounting Shell)
• Allowed ANUPBS to make scheduling decisions based on accounting data
• RASH integration with systems allowed users access to be linked to accounting
• Tight integration meant it was difficult to port RASH forward to PBS Pro
18. nci.org.au
The Past - Suspend/Resume
• Scheduler only thinks in terms of suspend/resume
• Every job has the possibility to suspend other jobs, not just “express” jobs
• Advantages of suspend/resume:
• Large parallel jobs not requiring reserved nodes
• Debug or express jobs not requiring reserved nodes
• Long running jobs not preventing other jobs from running
• Disadvantages
• Possible excessive paging if not managed correctly
• Too many suspended jobs with too few queued jobs may leave free nodes
19. nci.org.au
The Past - Suspend/Resume
• Operation of the suspend/resume scheduler
• In general jobs in the same queue have the same priority
• A pairwise comparison (preemptor/preemptee) of all job pairs
• Consider many factors:
• Relative walltime, ncpus, time already suspended, existing resource usage by user and project, how close jobs are to completion, etc
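Below is a minimal Python sketch of the idea behind such a pairwise comparison. The field names, weights and threshold are purely illustrative assumptions, not the actual ANUPBS algorithm:

```python
# Hypothetical sketch of an ANUPBS-style pairwise preemption check.
# Field names, weights and thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class Job:
    priority: float        # queue/derived priority
    ncpus: int             # cores requested
    walltime: int          # seconds requested
    used: int              # seconds consumed so far
    suspended: int         # seconds already spent suspended
    project_usage: float   # fraction of project allocation consumed

def preemption_score(preemptor: Job, preemptee: Job) -> float:
    """Positive score means the preemptor may suspend the preemptee."""
    score = preemptor.priority - preemptee.priority
    # Prefer suspending jobs that are far from completion...
    remaining = 1.0 - preemptee.used / max(preemptee.walltime, 1)
    score += 10.0 * remaining
    # ...but protect jobs that have already been suspended a lot.
    score -= preemptee.suspended / 3600.0
    # Penalise preemptors from projects that are already heavy users.
    score -= 5.0 * preemptor.project_usage
    return score

def should_preempt(preemptor: Job, preemptee: Job, threshold: float = 5.0) -> bool:
    return preemption_score(preemptor, preemptee) > threshold
```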
20. nci.org.au
The Present - PBS Pro
• Using a custom branch based on PBS Pro 12.0
• Using backfill-based scheduling
• Customisations of our PBS Pro installation:
• Allocation / accounting system “alloc_db”
• Priority calculation scripts (an illustrative sketch follows this list)
• Running CPU, memory, walltime limits
• Support for MUNGE authentication
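As a purely hypothetical illustration of what a priority calculation script might do (the formula and constants below are assumptions, not NCI's actual policy), priority could fall as a project consumes its quarterly grant and rise slightly with queue wait time:

```python
# Purely illustrative priority calculation in the spirit of the slides:
# projects that have consumed less of their quarterly allocation get a
# higher priority. The formula and constants are assumptions, not NCI's.
def job_priority(project_used_hours: float,
                 project_quarter_grant_hours: float,
                 queued_seconds: int,
                 base: float = 100.0) -> float:
    share_used = project_used_hours / max(project_quarter_grant_hours, 1.0)
    fairshare = base * (1.0 - min(share_used, 1.0))   # drops to 0 once over quota
    aging = queued_seconds / 3600.0                    # small boost for waiting jobs
    return fairshare + aging

# Example: a project halfway through its grant, job queued for 2 hours
print(job_priority(50_000, 100_000, 7200))   # -> 52.0
```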
21. nci.org.au
The Present - Why PBS Pro?
• During acceptance testing of Raijin…
• Lead developer of ANUPBS left NCI
• PBS Pro offered a supported scheduler and resource manager
• Same heritage — more familiar to users
• Altair work with us to customise PBS Pro to our needs
• Good time to change — moving to a new machine
22. nci.org.au
The Present - Suspend / Resume
• Looked at suspend/resume but found issues
• More CPUs being suspended than required to run a job
• Less flexibility in selection of jobs to suspend
• At one point the scheduler was trying to suspend jobs even when there were CPUs free to run the new job!
• Jobs can end up suspended “forever” as their node keeps getting selected
• Only works well with suspending for high-priority “express” jobs
• Can only specify target jobs for preemption as a binary option
23. nci.org.au
The Present - Suspend / Resume
• A small number of jobs as preemption targets means those jobs get “picked on”
• Sample jobs run through our test cluster (based on a day of jobs in the real cluster)
Job    Walltime    Used Time    Suspend Time    % vs Used    % vs Request
A      2:24:00     20:47        3:21:20         968.72%      139.81%
B      04:48:00    1:57:20      7:05:56         363.01%      147.89%
C      2:00:00     1:15:56      1:06:47         87.95%       55.65%
D      1:00:00     13:17        49:59           376.29%      83.31%
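Assuming the percentage columns are simply suspend time divided by used time and by requested walltime respectively, the figures for job A can be reproduced as follows:

```python
# Reproduce the percentage columns for job A, assuming
# "% vs Used"    = suspend_time / used_time and
# "% vs Request" = suspend_time / requested_walltime.
def hms(t: str) -> int:
    """Convert [H:]MM:SS to seconds."""
    parts = [int(p) for p in t.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)
    h, m, s = parts
    return h * 3600 + m * 60 + s

walltime, used, suspended = hms("2:24:00"), hms("20:47"), hms("3:21:20")
print(f"{suspended / used:.2%}")      # 968.72% vs used time
print(f"{suspended / walltime:.2%}")  # 139.81% vs requested walltime
```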
24. nci.org.au
The Present - Accounting
• NCI allocates compute hours to projects across quarters
• System parses accounting logs provided by PBS (a minimal parsing sketch follows this list)
• Accounting logs don’t include resource usage for jobs that get deleted due to MOMs going down
• Further information is extracted from the job history via a cron job (qstat -fx)
• Once a project is out of quota for the quarter, jobs may be allowed in “bonus”
• Somewhat manual in some areas of the system, especially reporting
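PBS accounting logs are semicolon-delimited records of the form "date;type;jobid;key=value ...". A minimal sketch of pulling usage out of job end ("E") records is below; the example values are made up, and real records need more careful handling (e.g. values containing spaces):

```python
# Minimal sketch of parsing PBS accounting log "E" (job end) records.
# Real records can contain values with spaces (e.g. Variable_List),
# so production parsing needs more care than this.
def parse_end_record(line: str):
    timestamp, rec_type, jobid, message = line.rstrip("\n").split(";", 3)
    if rec_type != "E":
        return None
    fields = dict(kv.split("=", 1) for kv in message.split() if "=" in kv)
    return {
        "jobid": jobid,
        "project": fields.get("account"),
        "ncpus": int(fields.get("resources_used.ncpus", 0)),
        "walltime": fields.get("resources_used.walltime"),
    }

line = ("04/20/2015 10:12:41;E;12345.server;user=abc123 account=x99 "
        "resources_used.ncpus=16 resources_used.walltime=02:10:33")
print(parse_end_record(line))
```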
25. nci.org.au
The Present - License Daemon
• Our own custom License Shadow Daemon (“lsd”)
• One lsd for multiple clusters
• Tracks usage of licenses across many commercial packages
• Has knowledge of pattern of license usage for different resource requests
• Users should request software in their job scripts (-l software=fluent)
• Jobs that request licenses have them “reserved” by lsd
• Can detect “rogue” jobs using licenses with no PBS request
• Hooks integrate with lsd to reject running jobs when licenses aren’t free
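A hedged sketch of what such a license-checking runjob hook could look like is below. The lsd client interface (socket path and wire format) is entirely hypothetical; only the pbs.event() calls follow the standard PBS Pro hook API:

```python
# Sketch of a runjob hook that defers jobs when licenses are not free.
# The local lsd client interface (socket path, wire format) is hypothetical.
import json
import socket
import pbs

LSD_SOCKET = "/var/run/lsd-client.sock"   # hypothetical path

def licenses_available(software, ncpus):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(LSD_SOCKET)
        s.sendall(json.dumps({"software": software, "ncpus": ncpus}).encode())
        return json.loads(s.recv(4096).decode()).get("free", False)
    finally:
        s.close()

e = pbs.event()
job = e.job
software = job.Resource_List["software"]
if software is None:
    e.accept()                       # job did not request licensed software
elif licenses_available(str(software), int(job.Resource_List["ncpus"] or 1)):
    e.accept()
else:
    e.reject("licenses for %s are not currently free" % software)
```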
26. nci.org.au
The Present - Local “jobfs”
• We allow users to request temporary storage that is node local
• Handled with 3 hooks (a sketch follows this slide’s list):
• Run job hook creating temporary folder on local disk
• Periodic hook checking usage of temporary folder
• End job hook deleting temporary folder when complete
• Some current issues:
• Nothing automates cleanup if it fails (a script handles this manually)
• Periodic hook is not completely accurate
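A minimal sketch of the begin/end pair of jobfs hooks is below, assuming a hypothetical /jobfs local filesystem and a custom “jobfs” resource name; the periodic usage-checking hook is omitted:

```python
# Sketch of the execjob_begin / execjob_end pair for node-local "jobfs".
# The base path and the "jobfs" custom resource name are assumptions;
# the hook would be registered for both events via qmgr.
import os
import pwd
import shutil
import pbs

JOBFS_ROOT = "/jobfs"                     # hypothetical local filesystem

e = pbs.event()
job = e.job
jobdir = os.path.join(JOBFS_ROOT, job.id)

if e.type == pbs.EXECJOB_BEGIN:
    if job.Resource_List["jobfs"] is not None and not os.path.isdir(jobdir):
        os.makedirs(jobdir)
        owner = pwd.getpwnam(str(job.Job_Owner).split("@")[0])
        os.chown(jobdir, owner.pw_uid, owner.pw_gid)
elif e.type == pbs.EXECJOB_END:
    shutil.rmtree(jobdir, ignore_errors=True)
```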
27. nci.org.au
The Present - Other Features
• Hooks implementing:
• Node health check before running a job (a hedged sketch follows this list)
• Job resource summary at the end of the job output file
• Enable/disable hyperthreading if the job requests it
• Old features not implemented in our current system:
• Process containment with cpusets / cgroups
• Full suspend/resume scheduling
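As a hedged sketch, the node health check mentioned above could be an execjob_begin hook along these lines; the mount points and threshold are assumptions, not NCI's actual checks:

```python
# Sketch of an execjob_begin node health check. Paths and thresholds
# are assumptions; rejection stops the job starting on this node.
import os
import pbs

MIN_FREE_JOBFS_BYTES = 10 * 1024 ** 3     # hypothetical 10 GB threshold

e = pbs.event()

def free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

if not os.path.ismount("/scratch"):            # hypothetical Lustre mount point
    e.reject("node health check failed: /scratch not mounted")
if free_bytes("/jobfs") < MIN_FREE_JOBFS_BYTES:
    e.reject("node health check failed: local jobfs nearly full")
e.accept()
```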
28. nci.org.au
The Future - Lessons from Present
• Slowness in scheduling cycles contributed to by:
• Time taken for some of the hooks we use (optimisation required)
• Ordering of hooks
• Calculation of start time for top jobs (sometimes)
• At times the server blocks waiting for MOM responses
29. nci.org.au
The Future - PBS Pro 13
• Currently early stages of testing PBS Pro 13.0
• New PBS features we’re most interested in:
• Support for cgroups
• Node health hooks
• Hook configuration files
• Asynchronous scheduling
30. nci.org.au
The Future - Our Customisations
• Enhancing suspend/resume support in PBS Pro
• Investigating use of an anti-express queue to allow all bonus jobs to be preempted by non-bonus jobs
• Better ways of targeting suspension
• Allocation management upgrades
• Speed enhancements
• Further automation
31. nci.org.au
The Future - Cloud
• Integration with our cloud environments for job submission
• Requires MUNGE to support multiple realms so we don’t leak our key
• Opportunistic scheduling of jobs from Raijin HPC to Cloud
• What jobs are suitable to be moved?
• What cloud environments can we use to do this?
• Local OpenStack (Tenjin)
• Australian Federated NeCTAR Cloud (OpenStack)
• Public cloud (Amazon, Azure, etc)
33. nci.org.au
Providing Australian researchers with world-class computing services
NCI Contacts
General enquiries: +61 2 6125 9800
Media enquiries: +61 2 6125 4389
Help desk: help@nci.org.au
Address:
NCI, Building 143, Ward Road
The Australian National University
Canberra ACT 0200
Editor's Notes
Vayu is the Hindu god of wind
Raijin is the Japanese god of thunder and lightning in the Shinto religion
A 2-hour drain is used when bringing nodes back from failures, to ensure that if the fault returns, less compute is impacted
Do not think in terms of backfill at all. Backfill has nothing to do with this algorithm.
Express-only preemption is quite different to how we worked with ANUPBS
Suspend-forever may be fixable with starving job parameters, but it’s very hard for us to define a real time at which a job counts as starving. Targeting jobs for preemption is done with the preempt_targets resource and a binary “suspendable” flag. This doesn’t allow us to do something like “preempt jobs with priority < 100”, and it also doesn’t allow “normal” jobs to suspend other “normal” jobs
The jobs used here all essentially have walltimes that are 10% of those of real jobs
All these jobs were submitted in one go at the start; their priorities didn’t change significantly over the run. Earlier jobs with more suspended time had more competition as the higher-priority work got run; in production, jobs arrive more spread out over time.
Accounting logs not containing everything is annoying for us, as we charge for time used even if the system causes a job to fail.
Not as tightly integrated as the ANUPBS / RASH schedulers
We have to run a client daemon on the PBS servers to maintain a persistent connection to lsd; this continues what the old ANUPBS did. Hooks talk to the local daemon.
The periodic hook doesn’t pick up the unlink-then-fill disk pattern (it only checks files still present in the directory)
Tentative deployment date will be early 2016
Anti-express queues would potentially require setting soft limits for projects (maybe automatically once a project is over quota in the allocation system?)
We have a MUNGE Patch in test that provides multiple realms
NeCTAR is an Australian government funded cloud environment operated by a number of institutions (NCI hosts a node)