In this deck from the 2015 PBS Works User Group, Andrew Wellington, Senior HPC Systems Administrator at NCI, presents: PBS and Scheduling at NCI: The past, present and future.
Watch the video presentation: http://insidehpc.com/2015/09/video-pbs-and-scheduling-at-nci-the-past-present-and-future/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
3. nci.org.au
NCI - An Overview
• NCI is Australia’s national high-performance computing service
• comprehensive, vertically-integrated research service
• providing national access on priority and merit
• driven by research objectives
• Operates as a formal collaboration of ANU, CSIRO, the Australian Bureau of Meteorology and Geoscience Australia
• In partnership with a number of research-intensive universities, supported by the Australian Research Council
4. nci.org.au
Where are we located?
• Nation’s capital — Canberra, ACT
• At its National University — The Australian National University (ANU)
5. nci.org.au
Research Communities
Research focus areas
• Climate Science and Earth System Science
• Astronomy (optical and theoretical)
• Geosciences: Geophysics, Earth Observation
• Biosciences & Bioinformatics
• Computational Sciences
• Engineering
• Chemistry
• Physics
• Social Sciences
• Growing emphasis on data-intensive computation
• Cloud Services
• Earth System Grid
8. nci.org.au
Past Machines - SC
• 126 nodes with 4x 1GHz Alpha CPUs (504 CPUs)
• Between 4 - 16GB RAM per node
• Total 700GB RAM, 2.88TB global disk, 13.1TB total disk
• Theoretical peak over 1 Tflop
• Linpack at 820 Gflops
• Quadrics “Elan3” interconnect
• 250 Mbytes/sec bidirectional
9. nci.org.au
Past Machines - LC
• 152 nodes with 2.66GHz Pentium 4
• 1GB RAM per node
• Theoretical peak over 800 Gflops
• 1.4TB global storage, 16TB total disk
• Gigabit ethernet interconnect
10. nci.org.au
Past Machines - AC
• 1928 1.6GHz Itanium2 processors
• Grouped into 30 partitions with 64 processors each
• Total 5.6TB RAM, 30TB global disk, 47TB total disk
• Theoretical peak over 11 Tflops
• SGI NUMAlink4 interconnect
11. nci.org.au
Past Machines - XE
• 156 nodes with 2x quad-core 3.0GHz Xeon Harpertown (1248 cores)
• Between 16 - 32GB RAM per node
• Total 2.7TB RAM, 54TB global disk, 130TB total disk
• 32 NVIDIA Tesla Fermi GPUs (16 Tflops)
• Theoretical peak almost 15 Tflops
• DDR Infiniband interconnect
12. nci.org.au
Past Machines - Vayu
• 1492 nodes with 2x quad-core 2.93GHz Xeon Nehalem (11,936 cores)
• Between 24 - 96GB RAM per node
• Total 37TB RAM, 800TB global disk
• Theoretical peak approx 140 Tflops
• QDR Infiniband interconnect
13. nci.org.au
Current Machine - Raijin
• 3592 nodes with 2x 8-core 2.6GHz Xeon Sandy Bridge (57,472 cores)
• Between 32 - 128GB RAM per node
• Total 160TB RAM, 10 PB global disk
• Theoretical peak approx 1.2 PFlops
• FDR Infiniband interconnect
• Around 52km (32 miles) of IB cabling
• 1.5 MW power; 100 tonnes of water in cooling
• Access to global Lustre filesystems (21 PB+)
14. nci.org.au
Current Machine - Tenjin Cloud
• Dell C8000 based high performance cloud
• 100 nodes with 2x 8-core 2.6GHz Xeon Sandy Bridge (1600 cores)
• 128GB RAM per node
• Over 12TB main memory, 650TB Ceph global storage
• Access to global Lustre filesystems (21 PB+)
• OpenStack management
15. nci.org.au
The Past - ANUPBS
• A heavily customised fork of OpenPBS v2.3
• Adds a lot of new commands
• jobnodes, jobs_on_node, nqstat, pbs_rusage, pbsrsh, pestat, qcat, qcp, qls, qps, qwait
• Modified a number of commands to add new options
• pbsdsh, pbsnodes, qalter, qdel, qorder, qrerun, qrun, qsub
• Unique scheduling algorithm
• Tight integration with local accounting and allocation system
• License shadow daemon integration for tracking usage of licensed software
• Support for local “jobfs” filesystem
• Support for cpusets (part of cgroups now)
16. nci.org.au
The Past - More Features
• The concept of draining the system or individual nodes
• Basically dedicated time options but for individual nodes
• Configuring nodes with maximum walltimes for jobs
• E.g. this node only runs jobs with a walltime of less than 2 hours
17. nci.org.au
The Past - Accounting
• ANUPBS tightly integrated with “RASH” (Resource Accounting Shell)
• Allowed ANUPBS to make scheduling decisions based on accounting data
• RASH integration with systems allowed users access to be linked to accounting
• Tight integration meant it was difficult to port RASH forward to PBS Pro
18. nci.org.au
The Past - Suspend/Resume
• Scheduler only thinks in terms of suspend/resume
• Every job has the possibility to suspend other jobs, not just “express” jobs
• Advantages of suspend/resume:
• Large parallel jobs not requiring reserved nodes
• Debug or express jobs not requiring reserved nodes
• Long running jobs not preventing other jobs from running
• Disadvantages
• Possible excessive paging if not managed correctly
• Too many suspended jobs with too few queued jobs may leave free nodes
19. nci.org.au
The Past - Suspend/Resume
• Operation of the suspend/resume scheduler
• In general jobs in the same queue have the same priority
• A pairwise comparison (preemptor/preemptee) of all job pairs
• Consider many factors:
• Relative walltime, ncpus, time already suspended, existing resource usage by user and project, how close jobs are to completion, etc
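Below is a minimal Python sketch of the idea behind such a pairwise comparison. The field names, weights and threshold are purely illustrative assumptions, not the actual ANUPBS algorithm:

```python
# Hypothetical sketch of an ANUPBS-style pairwise preemption check.
# Field names, weights and thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class Job:
    priority: float        # queue/derived priority
    ncpus: int             # cores requested
    walltime: int          # seconds requested
    used: int              # seconds consumed so far
    suspended: int         # seconds already spent suspended
    project_usage: float   # fraction of project allocation consumed

def preemption_score(preemptor: Job, preemptee: Job) -> float:
    """Positive score means the preemptor may suspend the preemptee."""
    score = preemptor.priority - preemptee.priority
    # Prefer suspending jobs that are far from completion...
    remaining = 1.0 - preemptee.used / max(preemptee.walltime, 1)
    score += 10.0 * remaining
    # ...but protect jobs that have already been suspended a lot.
    score -= preemptee.suspended / 3600.0
    # Penalise preemptors from projects that are already heavy users.
    score -= 5.0 * preemptor.project_usage
    return score

def should_preempt(preemptor: Job, preemptee: Job, threshold: float = 5.0) -> bool:
    return preemption_score(preemptor, preemptee) > threshold
```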
20. nci.org.au
The Present - PBS Pro
• Using a custom branch based on PBS Pro 12.0
• Using backfill-based scheduling
• Customisations of our PBS Pro installation:
• Allocation / accounting system “alloc_db”
• Priority calculation scripts (an illustrative sketch follows this list)
• Running CPU, memory, walltime limits
• Support for MUNGE authentication
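As a purely hypothetical illustration of what a priority calculation script might do (the formula and constants below are assumptions, not NCI's actual policy), priority could fall as a project consumes its quarterly grant and rise slightly with queue wait time:

```python
# Purely illustrative priority calculation in the spirit of the slides:
# projects that have consumed less of their quarterly allocation get a
# higher priority. The formula and constants are assumptions, not NCI's.
def job_priority(project_used_hours: float,
                 project_quarter_grant_hours: float,
                 queued_seconds: int,
                 base: float = 100.0) -> float:
    share_used = project_used_hours / max(project_quarter_grant_hours, 1.0)
    fairshare = base * (1.0 - min(share_used, 1.0))   # drops to 0 once over quota
    aging = queued_seconds / 3600.0                    # small boost for waiting jobs
    return fairshare + aging

# Example: a project halfway through its grant, job queued for 2 hours
print(job_priority(50_000, 100_000, 7200))   # -> 52.0
```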
21. nci.org.au
The Present - Why PBS Pro?
• During acceptance testing of Raijin…
• Lead developer of ANUPBS left NCI
• PBS Pro offered a supported scheduler and resource manager
• Same heritage — more familiar to users
• Altair work with us to customise PBS Pro to our needs
• Good time to change — moving to a new machine
22. nci.org.au
The Present - Suspend / Resume
• Looked at suspend/resume but found issues
• More CPUs being suspended than required to run a job
• Less flexibility in selection of jobs to suspend
• At one point the scheduler was trying to suspend jobs even when there were CPUs free to run the new job!
• Jobs can end up suspended “forever” as their node keeps getting selected
• Only works well with suspending for high-priority “express” jobs
• Can only specify target jobs for preemption as a binary option
23. nci.org.au
The Present - Suspend / Resume
• A small number of jobs as preemption targets means those jobs get “picked on”
• Sample jobs run through our test cluster (based on a day of jobs in the real cluster)
Job    Walltime    Used Time    Suspend Time    % vs Used    % vs Request
A      2:24:00     20:47        3:21:20         968.72%      139.81%
B      04:48:00    1:57:20      7:05:56         363.01%      147.89%
C      2:00:00     1:15:56      1:06:47         87.95%       55.65%
D      1:00:00     13:17        49:59           376.29%      83.31%
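Assuming the percentage columns are simply suspend time divided by used time and by requested walltime respectively, the figures for job A can be reproduced as follows:

```python
# Reproduce the percentage columns for job A, assuming
# "% vs Used"    = suspend_time / used_time and
# "% vs Request" = suspend_time / requested_walltime.
def hms(t: str) -> int:
    """Convert [H:]MM:SS to seconds."""
    parts = [int(p) for p in t.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)
    h, m, s = parts
    return h * 3600 + m * 60 + s

walltime, used, suspended = hms("2:24:00"), hms("20:47"), hms("3:21:20")
print(f"{suspended / used:.2%}")      # 968.72% vs used time
print(f"{suspended / walltime:.2%}")  # 139.81% vs requested walltime
```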
24. nci.org.au
The Present - Accounting
• NCI allocates compute hours to projects across quarters
• System parses accounting logs provided by PBS (a minimal parsing sketch follows this list)
• Accounting logs don’t include resource usage for jobs that get deleted due to MOMs going down
• Further information is extracted from the job history via a cron job (qstat -fx)
• Once a project is out of quota for the quarter, jobs may be allowed in “bonus”
• Somewhat manual in some areas of the system, especially reporting
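PBS accounting logs are semicolon-delimited records of the form "date;type;jobid;key=value ...". A minimal sketch of pulling usage out of job end ("E") records is below; the example values are made up, and real records need more careful handling (e.g. values containing spaces):

```python
# Minimal sketch of parsing PBS accounting log "E" (job end) records.
# Real records can contain values with spaces (e.g. Variable_List),
# so production parsing needs more care than this.
def parse_end_record(line: str):
    timestamp, rec_type, jobid, message = line.rstrip("\n").split(";", 3)
    if rec_type != "E":
        return None
    fields = dict(kv.split("=", 1) for kv in message.split() if "=" in kv)
    return {
        "jobid": jobid,
        "project": fields.get("account"),
        "ncpus": int(fields.get("resources_used.ncpus", 0)),
        "walltime": fields.get("resources_used.walltime"),
    }

line = ("04/20/2015 10:12:41;E;12345.server;user=abc123 account=x99 "
        "resources_used.ncpus=16 resources_used.walltime=02:10:33")
print(parse_end_record(line))
```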
25. nci.org.au
The Present - License Daemon
• Our own custom License Shadow Daemon (“lsd”)
• One lsd for multiple clusters
• Tracks usage of licenses across many commercial packages
• Has knowledge of pattern of license usage for different resource requests
• Users should request software in their job scripts (-l software=fluent)
• Jobs that request licenses have them “reserved” by lsd
• Can detect “rogue” jobs using licenses with no PBS request
• Hooks integrate with lsd to reject running jobs when licenses aren’t free
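A hedged sketch of what such a license-checking runjob hook could look like is below. The lsd client interface (socket path and wire format) is entirely hypothetical; only the pbs.event() calls follow the standard PBS Pro hook API:

```python
# Sketch of a runjob hook that defers jobs when licenses are not free.
# The local lsd client interface (socket path, wire format) is hypothetical.
import json
import socket
import pbs

LSD_SOCKET = "/var/run/lsd-client.sock"   # hypothetical path

def licenses_available(software, ncpus):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(LSD_SOCKET)
        s.sendall(json.dumps({"software": software, "ncpus": ncpus}).encode())
        return json.loads(s.recv(4096).decode()).get("free", False)
    finally:
        s.close()

e = pbs.event()
job = e.job
software = job.Resource_List["software"]
if software is None:
    e.accept()                       # job did not request licensed software
elif licenses_available(str(software), int(job.Resource_List["ncpus"] or 1)):
    e.accept()
else:
    e.reject("licenses for %s are not currently free" % software)
```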
26. nci.org.au
The Present - Local “jobfs”
• We allow users to request temporary storage that is node local
• Handled with 3 hooks (a sketch follows this slide’s list):
• Run job hook creating temporary folder on local disk
• Periodic hook checking usage of temporary folder
• End job hook deleting temporary folder when complete
• Some current issues:
• Nothing automates cleanup if it fails (a script handles this manually)
• Periodic hook is not completely accurate
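A minimal sketch of the begin/end pair of jobfs hooks is below, assuming a hypothetical /jobfs local filesystem and a custom “jobfs” resource name; the periodic usage-checking hook is omitted:

```python
# Sketch of the execjob_begin / execjob_end pair for node-local "jobfs".
# The base path and the "jobfs" custom resource name are assumptions;
# the hook would be registered for both events via qmgr.
import os
import pwd
import shutil
import pbs

JOBFS_ROOT = "/jobfs"                     # hypothetical local filesystem

e = pbs.event()
job = e.job
jobdir = os.path.join(JOBFS_ROOT, job.id)

if e.type == pbs.EXECJOB_BEGIN:
    if job.Resource_List["jobfs"] is not None and not os.path.isdir(jobdir):
        os.makedirs(jobdir)
        owner = pwd.getpwnam(str(job.Job_Owner).split("@")[0])
        os.chown(jobdir, owner.pw_uid, owner.pw_gid)
elif e.type == pbs.EXECJOB_END:
    shutil.rmtree(jobdir, ignore_errors=True)
```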
27. nci.org.au
The Present - Other Features
• Hooks implementing:
• Node health check before running a job (a hedged sketch follows this list)
• Job resource summary at the end of the job output file
• Enable/disable hyperthreading if the job requests it
• Old features not implemented in our current system:
• Process containment with cpusets / cgroups
• Full suspend/resume scheduling
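As a hedged sketch, the node health check mentioned above could be an execjob_begin hook along these lines; the mount points and threshold are assumptions, not NCI's actual checks:

```python
# Sketch of an execjob_begin node health check. Paths and thresholds
# are assumptions; rejection stops the job starting on this node.
import os
import pbs

MIN_FREE_JOBFS_BYTES = 10 * 1024 ** 3     # hypothetical 10 GB threshold

e = pbs.event()

def free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

if not os.path.ismount("/scratch"):            # hypothetical Lustre mount point
    e.reject("node health check failed: /scratch not mounted")
if free_bytes("/jobfs") < MIN_FREE_JOBFS_BYTES:
    e.reject("node health check failed: local jobfs nearly full")
e.accept()
```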
28. nci.org.au
The Future - Lessons from Present
• Slowness in scheduling cycles contributed to by:
• Time taken for some of the hooks we use (optimisation required)
• Ordering of hooks
• Calculation of start time for top jobs (sometimes)
• At times the server blocks waiting for MOM responses
29. nci.org.au
The Future - PBS Pro 13
• Currently early stages of testing PBS Pro 13.0
• New PBS features we’re most interested in:
• Support for cgroups
• Node health hooks
• Hook configuration files
• Asynchronous scheduling
30. nci.org.au
The Future - Our Customisations
• Enhancing suspend/resume support in PBS Pro
• Investigating use of an anti-express queue to allow all bonus jobs to be preempted by non-bonus jobs
• Better ways of targeting suspension
• Allocation management upgrades
• Speed enhancements
• Further automation
31. nci.org.au
The Future - Cloud
• Integration with our cloud environments for job submission
• Requires MUNGE to support multiple realms so we don’t leak our key
• Opportunistic scheduling of jobs from Raijin HPC to Cloud
• What jobs are suitable to be moved?
• What cloud environments can we use to do this?
• Local OpenStack (Tenjin)
• Australian Federated NeCTAR Cloud (OpenStack)
• Public cloud (Amazon, Azure, etc)
33. nci.org.au
Providing Australian researchers with world-class computing services
NCI Contacts
General enquiries: +61 2 6125 9800
Media enquiries: +61 2 6125 4389
Help desk: help@nci.org.au
Address:
NCI, Building 143, Ward Road
The Australian National University
Canberra ACT 0200
Editor's Notes
Vayu is the Hindu god of wind
Raijin is the Japanese god of thunder and lightning in the Shinto religion
A 2-hour drain is used when bringing nodes back from failures, to ensure that if the fault returns, less compute is impacted
Do not think in terms of backfill at all. Backfill has nothing to do with this algorithm.
Express-only preemption is quite different to how we worked with ANUPBS
Suspend-forever may be fixable with starving job parameters, but it’s very hard for us to define a real time at which a job counts as starving. Targeting jobs for preemption is done with the preempt_targets resource and a binary “suspendable” flag. This doesn’t allow us to do something like “preempt jobs with priority < 100”, and it also doesn’t allow “normal” jobs to suspend other “normal” jobs
The jobs used here all essentially have walltimes that are 10% of those of real jobs
All these jobs were submitted in one go at the start; their priorities didn’t change significantly over the run. Earlier jobs with more suspended time had more competition as the higher-priority work got run; in production, jobs arrive more spread out over time.
Accounting logs not containing everything is annoying for us, as we charge for time used even if the system causes a job to fail.
Not as tightly integrated as the ANUPBS / RASH schedulers
We have to run a client daemon on the PBS servers to maintain a persistent connection to lsd; this continues what the old ANUPBS did. Hooks talk to the local daemon.
The periodic hook doesn’t pick up the unlink-then-fill disk pattern (it only checks files still present in the directory)
Tentative deployment date will be early 2016
Anti-express queues would potentially require setting soft limits for projects (maybe automatically once a project is over quota in the allocation system?)
We have a MUNGE Patch in test that provides multiple realms
NeCTAR is an Australian government funded cloud environment operated by a number of institutions (NCI hosts a node)