Useful Slurm commands

Overview

| Command | Description |
| sinfo | show node and partition information; see sinfo --help for more options |
| sinfo -o "%C %P" | CPU counts per partition in the form allocated/idle/other/total |
| squeue | view information about jobs in the scheduling queue |
| scontrol show jobid JobID | status of job JobID |
| scontrol show jobid -dd JobID | detailed job information, helpful for job troubleshooting |
| sstat -j JobID | information about the running job JobID |
| scancel JobID | abort job JobID |
| scancel -n JobName | cancel all jobs with job name JobName |
| sprio -l | priorities of pending jobs (long format) |
| sshare -a | share information about all users |
| sacct -j JobID -o 'JobID,state,MaxVMSize,MaxRSS,Elapsed' | information on completed jobs, here for the specific job JobID |
| sacct --helpformat | list the format fields available for sacct |
| sacctmgr show user -s | user account information |
| sreport -tminper cluster utilization --tres="cpu,gres/gpu" start=2019-12-01 | check utilisation of resources |
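
The %C field of sinfo prints four CPU counts separated by slashes (allocated/idle/other/total), which is hard to read at a glance. The following is a minimal sketch of a filter that turns this into an idle percentage per partition. The function name idle_cpu_summary is made up for this example, and the sketch assumes one "counts partition" pair per line.

```shell
# Summarise the output of: sinfo -h -o "%C %P"
# Each input line is assumed to look like "A/I/O/T partition".
# idle_cpu_summary is a hypothetical helper name, not a Slurm command.
idle_cpu_summary() {
    awk '{
        n = split($1, c, "/")
        # Skip malformed lines, a header line, and partitions without CPUs.
        if (n != 4 || c[4] + 0 <= 0) next
        printf "%s idle=%d total=%d (%.1f%%)\n", $2, c[2], c[4], 100 * c[2] / c[4]
    }'
}

# Usage on a login node (requires Slurm):
#   sinfo -h -o "%C %P" | idle_cpu_summary
```

The function only reads standard input, so it can also be tried on saved sinfo output.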

How to check your past and current jobs' memory requirements

To choose sensible memory requirements for a job, it is important to understand how much memory your jobs actually use. The critical metric is the job's maximal resident set size (MaxRSS), i.e. the maximal amount of memory that the job occupies in the physical RAM of the node. This is the value on which to base Slurm request flags like --mem-per-cpu.
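
As an illustration, a measured MaxRSS value can be turned into a request in megabytes (the default unit of --mem-per-cpu) with some headroom. This is only a sketch: the helper name mem_request_mb and the 20% default margin are arbitrary choices for this example, and it assumes the value carries a K, M, G or T suffix, or no suffix for plain bytes.

```shell
# Convert a memory value such as "1536000K" or "1.5G" to whole megabytes,
# rounded up, after adding a safety margin in percent.
# mem_request_mb is a hypothetical helper; the 20% default is an assumption.
mem_request_mb() {
    awk -v v="$1" -v margin="${2:-20}" 'BEGIN {
        unit = toupper(substr(v, length(v)))
        num = v + 0                      # leading number, suffix ignored
        if      (unit == "K") mb = num / 1024
        else if (unit == "M") mb = num
        else if (unit == "G") mb = num * 1024
        else if (unit == "T") mb = num * 1024 * 1024
        else                  mb = num / (1024 * 1024)   # no suffix: assume bytes
        mb = mb * (100 + margin) / 100
        r = int(mb); if (r < mb) r++     # round up to a whole MB
        print r
    }'
}

# Usage:
#   mem_request_mb 1536000K      # 1500 MB measured, plus 20% margin
#   mem_request_mb 2G 0          # no margin
```

MaxRSS is measured per task, so for a task that uses several CPUs the result would still need to be divided by the number of CPUs per task before it is passed to --mem-per-cpu.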

You can use sacct as in the following line to inspect your past and current jobs.

sacct --format="JobID%16,User%12,State%16,partition,time,elapsed,ReqMem,MaxRss,MaxVMSize,ncpus,nnodes,reqcpus,reqnode,Start,End,NodeList"

By default only today's jobs are shown. To see older jobs, add a start time such as -S 2021-05-25. You can also list specific jobs by passing the job ID with the -j flag:

sacct --format="JobID%16,User%12,State%16,partition,time,elapsed,ReqMem,MaxRss,MaxVMSize,ncpus,nnodes,reqcpus,reqnode,Start,End,NodeList" -j $YOUR_JOB_ID
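
sacct prints one line per job step, so the largest MaxRSS has to be picked out among several rows. A possible way to do this, sketched under the assumption that the output was produced with --parsable2 --noheader and the two fields JobID,MaxRSS, is shown below. The function name peak_maxrss_step is made up for this example.

```shell
# Read "JobID|MaxRSS" lines from stdin and print the step with the
# largest MaxRSS, normalised to kilobytes.
# peak_maxrss_step is a hypothetical helper name.
peak_maxrss_step() {
    awk -F'|' '
        function to_kb(v,   unit, num) {
            unit = toupper(substr(v, length(v)))
            num = v + 0
            if (unit == "K") return num
            if (unit == "M") return num * 1024
            if (unit == "G") return num * 1024 * 1024
            return num / 1024            # no suffix: assume bytes
        }
        $2 != "" {                       # the job allocation line has no MaxRSS
            kb = to_kb($2)
            if (kb > max) { max = kb; step = $1 }
        }
        END { if (step != "") printf "%s %dK\n", step, max }
    '
}

# Usage (requires Slurm):
#   sacct -j "$YOUR_JOB_ID" --format=JobID,MaxRSS --parsable2 --noheader | peak_maxrss_step
```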

The total maximal memory consumed by your job may be larger, but this does not matter if most of it can be kept in virtual memory that is staged out to disk and need not be accessed frequently. The situation changes if that staged-out memory must be continually read back, which leads to swapping: the node is so busy staging your virtual memory in and out that it can do almost no work for you in "user space" and spends most of its time in "kernel space". In tools like top, the processes of such jobs usually appear in the D state (uninterruptible sleep).
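
To look for such processes without watching top interactively, the process list can be filtered by state. The following is a minimal sketch; the function name d_state_procs is made up for this example, and it assumes input in the column layout of ps -eo pid,stat,comm, where the second column is the process state.

```shell
# Keep the header line and every process whose state starts with "D"
# (uninterruptible sleep, typically waiting for I/O such as swap).
# d_state_procs is a hypothetical helper name.
d_state_procs() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage on the compute node where the job runs:
#   ps -eo pid,stat,comm | d_state_procs
```

A process that shows up in the D state once is not necessarily swapping; it is the processes that stay there over repeated checks that deserve a closer look.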

Topic revision: r3 - 2021-06-01 - DerekFeichtinger
 