
Queue, Scheduling and Charging Policies

The Blue Waters queue and scheduling policies for production are now implemented, as are accounting and charging.

 

Resource Features

To target specific node types, we have implemented three node features: xe, xk and x. The default feature is xe (the XE nodes). Specify xk to request XK nodes, specify both xe and xk for a multi-req job that uses both node types explicitly, or use the x feature to request nodes of either type without specifying how many of each.

A small number of XE and XK nodes (96 of each) offer double the usual amount of memory: 128 GB per XE node and 64 GB per XK node. To target these nodes for a job, append himem to the xe or xk feature in the #PBS -l nodes=... line. See the examples below, followed by a minimal batch script sketch.

Examples of using features:

For XE node specification: #PBS -l nodes=1024:ppn=32:xe

For XK node specification: #PBS -l nodes=1024:ppn=16:xk

For both XE and XK: #PBS -l nodes=1024:ppn=32:xe+1024:ppn=16:xk

For a node-type-agnostic (x feature) specification: #PBS -l nodes=1024:ppn=16:x

For XE large memory nodes: #PBS -l nodes=64:ppn=32:xehimem

For XK large memory nodes: #PBS -l nodes=64:ppn=16:xkhimem
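
These feature specifications go into an ordinary batch script along with a queue request and a wall time. Below is a minimal sketch for an XE job; the job name, executable and total rank count are placeholders (not from this page) and should be adjusted to match your application.

#!/bin/bash
# Request 1024 XE nodes with 32 PEs each for 4 hours in the normal queue.
#PBS -q normal
#PBS -l nodes=1024:ppn=32:xe
#PBS -l walltime=04:00:00
#PBS -N example_xe_job

# Launch the application; the executable name and rank count are placeholders
# (1024 nodes x 32 PEs per node = 32768 ranks).
cd $PBS_O_WORKDIR
aprun -n 32768 ./my_app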

Queues

A queue based system is used to establish initial job priorities and charging.

To specify a queue: #PBS -q queue_name

Queue Name   Property   Maximum wall time   Default wall time   Maximum number of nodes   Charge Factor‡

normal       default    48:00:00            00:30               26864                     1
high                    48:00:00            00:30               26864                     2
low†                    48:00:00            00:30               26864                     0.5
debug**                 00:30:00            00:30               7000                      1
noalloc§                48:00:00            00:30               26864                     0

** -  The debug queue is limited to one job per user either running or eligible to run. If a user has a job running in the debug queue then all of that user's other jobs in the debug queue will be in the blocked state as shown by the showq command. If a user does not have a job running in the debug queue then only the user's largest job in the debug queue will be eligible for scheduling and the user's smaller debug jobs will be blocked. Blocked jobs will not run even if appropriate nodes are available. For this reason, when running a set of tests at different node counts it is best to submit only the largest job to the debug queue and the smaller jobs to the high queue with a short walltime limit to allow backfilling as nodes are cleared for the debug job.
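
For example, to test the same binary at two node counts, you might submit the larger run to the debug queue and the smaller one to the high queue with a short wall time (the script names and node counts here are placeholders):

qsub -q debug -l nodes=4096:ppn=32:xe,walltime=00:30:00 test_large.pbs
qsub -q high -l nodes=512:ppn=32:xe,walltime=00:30:00 test_small.pbs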

† - The low queue is configured to run jobs only when there are nodes not being reserved by higher-priority jobs. The priority of a low queue job is such that it may not run for weeks or months, so put jobs in the low queue only when they do not need to complete within any particular time frame. Low queue jobs will backfill only when there are no higher-priority jobs eligible to backfill.

§ - The noalloc queue is configured to run similarly to the low queue but with a much lower priority. Jobs submitted to the noalloc queue are also subject to preemption after 1 hour of wall clock time. Only users in projects with No Remaining Allocation are eligible to submit jobs to the noalloc queue.

‡ Current charge factor discounts: We currently have several discounts available that can reduce a job's charge factor. Please see the blog entry Charge Factor Discounts for jobs on Blue Waters for more information on how to take advantage of them.

Moving Jobs Among Queues:

After being queued, a job may be moved from one queue to another by its submitter. You might do this if you realize you put the job in the wrong queue, or if you need the job to run sooner. The command to do this is "qmove"; find out about its options using "man qmove".
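
For example, to move a queued job into the high queue (the job ID below is a placeholder):

qmove high 1234567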

Schedule Configuration (How do I make my jobs more likely to run?)

The Blue Waters project doesn't publish the exact configuration of our scheduling system. We change it from time to time, so we don't want to guarantee any specific behavior. However, here is a list of general considerations for choosing your job parameters.

Larger jobs generally get priority over smaller jobs.

Wall time of a job no longer factors into priority calculations on Blue Waters.  For both size and wall-time considerations, see the "Why isn't my job running?" page in this section and its discussion on backfill; smaller and shorter jobs fit into backfill better.

Jobs accumulate priority while they are in the eligible state in the queue, so if you have a job that isn't running it is better to leave it queued than to re-submit it. You can check whether a job is eligible or blocked with showq, as shown after this list.

Jobs submitted to the "high" or "debug" queues have higher starting priority than jobs in the normal queue with the same parameters; see above for tradeoffs for using those queues. 
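
For instance, to list only your own jobs and their states (the username is a placeholder; showq with no arguments lists all jobs):

showq -u jdoe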

Fair Share

As of October 21, 2014, we have implemented fair share in the Blue Waters scheduler.  Collaborations that are using more than a certain fraction of Blue Waters will have their submitted job priorities lowered.  Such jobs will not lose eligibility, but other jobs will tend to run first.

This policy accounts for the usage of an entire project and affects all users of that project equally. All allocation groups on Blue Waters are treated the same under this policy; the scheduler applies these adjustments automatically.

Job Scheduling Limits

There is a limit on the total node count that a single user can have in the queue (larger than the total node count of Blue Waters). There is also a limit on total queued nodes per project, which is higher than the per-user limit but less than double it; as a result, one user cannot prevent other users in the same project from having eligible jobs, but two users together can. There is also a very large upper limit on running jobs per allocation, which most groups will not reach unless their jobs are very small.

Charging

Charging is based on the aggregate node-hours for a job, scaled by the charge factor of the queue used by the job. The normal queue has a charge factor of 1; other queues have higher or lower factors depending on properties such as priority or preemptibility. We currently have several discounts available that can reduce a job's charge factor. Please see the blog entry Charge Factor Discounts for jobs on Blue Waters for more information on how to take advantage of the discounts.

Compute nodes are allocated in an exclusive manner; jobs do not share nodes. The use of one node for one hour has a usage of one node-hour scaled by the queue charging factor for the job to which the node is allocated. The number of PEs (processing elements) or number of threads on the node is not a factor in usage.
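
As a worked example with illustrative numbers: a job that runs on 512 nodes for 4 hours in the high queue (charge factor 2) is charged 512 × 4 × 2 = 4,096 node-hours, regardless of how many PEs or threads it uses on each node; the same job in the low queue (charge factor 0.5) would be charged 1,024 node-hours.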

The usage command will report aggregate node-hours taking into account the queue charging factor for each contributing job. The portal provides charge information on individual past jobs as well.

As is discussed in the User Guide overview and the System Summary, there are 16 cores (AMD Bulldozer compute cores) per XE node and 8 cores per XK node.

Refunds

The current refund policy is that, on a system of this size, it is impractical in regular operations to address refund requests because of the time it takes to determine the cause of a job's termination. We strongly recommend that users implement an efficient checkpoint strategy in their application and use the recommended checkpoint interval calculator to determine the time between checkpoints based on node count and the time to write a checkpoint. In extraordinary cases refunds might be considered; send email to help+bw@ncsa.illinois.edu for more information.