Monitoring Jobs
Here you can see how to manage and monitor your jobs on our HPC cluster. Whether you're running batch jobs, interactive sessions, or array jobs, these tools and commands will help you keep track of your work and manage your resources.
Checking Job Status
Using squeue
The squeue
command is possibly your most common tool for viewing the status of jobs in the queue. Here's a basic usage:
squeue
Sample output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234 batch my_job jsmith R 5:23 1 cn01
1235 batch arr_job jdoe R 2:45 1 cn02
1236 gpu gpu_task asmith PD 0:00 1 (Resources)
To see only your job:
squeue -u your_username
To see jobs on a specific partition:
squeue -p partition_name
Jobs States
These are common job states that you might see under the ST
column of squeue
's output:
- R: Running
- PD: Pending
- CG: Completing
- CD: Completed
- F: Failed
- TO: Timeout
- CA: Cancelled
Detailed Job Information
Using scontrol
To get detailed informatnio about a specific job:
scontrol show job job_id
Sample output:
JobId=1234 JobName=my_job
UserId=jsmith(1001) GroupId=users(1001) MCS_label=N/A
Priority=4294901758 Nice=0 Account=default QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:10:12 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2023-06-01T10:00:00 EligibleTime=2023-06-01T10:00:00
AccrueTime=2023-06-01T10:00:00
StartTime=2023-06-01T10:05:00 EndTime=2023-06-01T11:05:00 Deadline=N/A
PreemptEligibleTime=2023-06-01T10:05:00 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-01T10:05:00
Partition=batch AllocNode:Sid=login01:12345
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cn01
BatchHost=cn01
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=4G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/jsmith/my_script.sh
WorkDir=/home/jsmith
StdErr=/home/jsmith/my_job.err
StdIn=/dev/null
StdOut=/home/jsmith/my_job.out
Power=
Cancelling Jobs
To cancel a job:
scancel job_id
To cancel all your jobs:
scancel -u your_username
To cancel all your pending jobs:
scancel -t PENDING -u your_username
Modifying Jobs
If you initially submit a job and then remember some attribute needs to be changed, you don't need to cacnel and resubmit the whole job. You can modify certain attributes of a job that's already in the queue using the scontrol update
command.
For example, to change the time limit of a job:
scontrol update JobId=job_id TimeLimit=2:00:00
To change the number of CPUs:
scontrol update JobId=job_id NumCPUs=4
Monitoring Resource Usage
Using sstat
For running jobs, you can use sstat
to get resource usage statistics:
sstat -j job_id --format=JobID,AveCPU,AveRSS,AveVMSize
Sample output:
JobID AveCPU AveRSS AveVMSize
-------- ------------ ------------ -----------
1234.0 00:05:23 1234K 4567K
Using sacct
For completed jobs, use sacct
o view accounting data:
sacct -j job_id --format=JobID,JobName,MaxRSS,Elapsed
Sample output:
JobID JobName MaxRSS Elapsed
------------ ---------- ---------- ----------
1234 my_job 4096K 00:15:23
Monitoring Cluster Status
Using sinfo
To see the overall status of the cluster:
sinfo
Sample output:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 1 mix gpu1
defq* up infinite 2 idle cn01,gpu2
To see more detailed node information on for example gpu1
:
sinfo -n gpu1 -o "%n %c %m %t %f %G %D %P %C %O"
Sample output:
HOSTNAMES CPUS MEMORY STATE AVAIL_FEATURES GRES NODES PARTITION CPUS(A/I/O/T) CPU_LOAD
gpu1 128 1 mix location=local (null) 1 defq* 5/123/0/128 1.67
Job Arrays
For job arrays, you can use most of the above commands with some modifications.
To see the status of all tasks in a job array:
squeue -j array_job_id
Sample output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234_1 batch array_job jdoe R 5:23 1 cn01
1234_2 batch array_job jdoe R 5:23 1 cn02
1234_3 batch array_job jdoe PD 0:00 1 (Resources)
Troubleshooting
If a job fails, try checking the following:
- Look at the job's output and error files.
- Check the job state and exit code:
sacct --brief
Sample output:
JobID State ExitCode ------------ ---------- -------- 1040 TIMEOUT 0:0 1041 FAILED 6:0 1042 TIMEOUT 0:0 1043 FAILED 1:0 1046 COMPLETED 0:0 1047 RUNNING 0:0
FAILED
indicates the process terminated with with a non-zero exit code. The first number in the ExitCode column is the exit code and the number after the colon is the signal that caused the process to terminate if it was terminated by a signal. - Check the job's resource usage with
sacct
- Verify that you requested sufficient resources, and your job did not get terminated due to needing more resources than requested.
If you face persistent issues, please do not hesitate to reach out to us for help.