Monitoring Jobs
System Commands
qstat : shows the status of PBS batch jobs.
qstat -a lists jobs in submission order.
qstat -f <jobid> produces a full/detailed report for the job, including its working directory (init_work_dir).
qpeek : This command is deprecated.
- Job stdout and stderr files are accessible in the submission directory of the job as ${PBS_JOBID}.OU and ${PBS_JOBID}.ER while the job is running.
- qdel: deletes a job from the queue, or ends a job that's already running.
apstat : Shows the number of up nodes and idle nodes and a list of current pending and running jobs.
apstat -r displays all the node reservations.
apstat -c adds info on each partition to the "Compute node summary" section at the beginning of the output (XT = whole system, 32 = XE nodes, 16 = XK nodes)
showq : Lists jobs in priority order in three categories: active jobs, eligible jobs and blocked jobs.
showq -r lists details of all running/active jobs.
showq -i lists details of all eligible jobs, including their priorities.
showq -b lists details of all blocked jobs.
qs : Another utility that shows a lot of info on queued and running jobs. This one has a column for the type of nodes used by a job (xe or xk).
showstart <jobid> : takes a jobid as its argument and displays an estimated start time for the job based on current reservations.
checkjob <jobid> : takes a jobid as its argument and displays the current job state and whether nodes are available to run the job.
xtnodestat : shows the current allocation and status of the system's nodes and gives information about each running job. The output displays the position of each node in the network.
xtnodestat -m prints only the mesh display.
xtnodestat -j prints only the job display.
For more information on the above commands, see the corresponding man pages.
Scripts
The following scripts have been written to combine/simplify/beautify some of the functionality of the system commands above. These scripts are available automatically upon logging in; no modules need to be loaded.
apstat_system.pl : A perl script that displays the system status by partition (it basically wraps "apstat -c" and adds some other info)
qstat.pl : A perl script that displays queue info similar to the default qstat output with the addition of the node type and count
showqgpu.pl : A perl script that displays only XK jobs in a format similar to the default showq output
showqxe.pl : A perl script that displays only XE jobs in a format similar to the default showq output
xkqueue.pl : A perl script showing queued XK jobs in order of their priority
xequeue.pl : A perl script showing queued XE jobs in order of their priority
Monitoring memory usage
Compile and link into your application a routine with getrusage() similar to the following and call it from Fortran or C at points in your code where you would like to monitor memory usage. You may want to disable the MPI_Barrier() for performance and/or only call the routine from selected ranks in order to constrain the output. See "man getrusage" for more information about the struct returned--you may also monitor user and system time and various other metrics provided by the OS kernel.
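#include <stdio.h>
#include <mpi.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Print this process's peak resident set size, tagged with rank and message. */
void memtrack(const char *message, int *myrank)
{
    struct rusage myrusage;

    // MPI_Barrier(MPI_COMM_WORLD); // optional, depending on your use case
    getrusage(RUSAGE_SELF, &myrusage);
    /* On Linux, ru_maxrss is reported in kilobytes; dividing by 1024 gives MB. */
    printf("%d: %s: maxrss=%.1fMB\n",
           *myrank, message, myrusage.ru_maxrss/1024.0);
}

/* Fortran-callable wrapper: many Fortran compilers append an underscore to
   external names. It passes an empty message. */
void memtrack_(int *myrank)
{
    memtrack("", myrank);
}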
Sample calls from Fortran (first two lines) or C (last line):
print *, 'in main after MPI setup'
call memtrack(rank)
memtrack("in main after MPI setup",&rank);
Sample output:
0: in main after MPI setup: maxrss=18.0MB
2: in main after MPI setup: maxrss=18.0MB
3: in main after MPI setup: maxrss=18.0MB
1: in main after MPI setup: maxrss=18.0MB
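The text above notes that the struct filled in by getrusage() also carries user and system time, and that output can be constrained by reporting from selected ranks only. The following is a minimal sketch of both ideas, not part of the original routine. The helper names tv_seconds, should_report and usage_report are hypothetical, the "every stride-th rank" rule is just one possible selection policy, and the kilobyte unit for ru_maxrss is a Linux assumption. MPI is left out so the sketch stays self-contained; the rank is passed in as a plain integer.

```c
#include <stdio.h>
#include <stddef.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Convert a struct timeval (seconds + microseconds) to seconds. */
double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1.0e6;
}

/* Hypothetical helper: return 1 if this rank should report, i.e. every
   stride-th rank starting from rank 0. A stride below 1 is treated as 1. */
int should_report(int rank, int stride)
{
    if (stride < 1)
        stride = 1;
    return rank % stride == 0;
}

/* Hypothetical helper: write one usage line into buf.
   Returns 0 on success, -1 if getrusage() fails. */
int usage_report(const char *message, int rank, char *buf, size_t len)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;

    /* Assumption: Linux, where ru_maxrss is reported in kilobytes. */
    snprintf(buf, len, "%d: %s: maxrss=%.1fMB utime=%.3fs stime=%.3fs",
             rank, message, ru.ru_maxrss / 1024.0,
             tv_seconds(ru.ru_utime), tv_seconds(ru.ru_stime));
    return 0;
}
```

In an MPI code, one might guard the report with something like should_report(rank, 64) and print buf only when usage_report() returns 0, so that a large job emits one line per 64 ranks instead of one per rank.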