Slurm Batch system usage
This is an introduction to the T3 Slurm configuration. Slurm is a modern job scheduler for Linux clusters.
Please use the User Interface nodes t3ui01-03 mainly for development and small, quick tests.
For intensive computational work one should use the Compute Nodes.
There are two types of Compute Nodes in T3 Slurm: Worker Nodes for CPU usage and GPU machines. All new hardware is equipped with 256 GB of RAM and a 10 GbE network:
| Compute Node | Processor Type | Computing Resources: Cores/GPUs |
|---|---|---|
| t3ui01-03 (login nodes) | Intel Xeon E5-2697 (2.30 GHz) | 72 |
| t3gpu0[1-2] | Intel Xeon E5-2630 v4 (2.20 GHz) | 8 × GeForce GTX 1080 Ti |
| t3wn60-63 | Intel Xeon Gold 6148 (2.40 GHz) | 80 |
| t3wn51-59 | Intel Xeon E5-2698 (2.30 GHz) | 64 |
| t3wn41-43,45-48 | AMD Opteron 6272 (2.1 GHz) | 32 |
| t3wn30-36,38-39 | Intel Xeon E5-2670 (2.6 GHz) | 16 |
Access to the Compute Nodes is controlled by Slurm.
Currently the maximum number of CPU jobs each user is allowed to run is
500 (about 40% of the CPU resources).
There are four partitions (similar to SGE queues) implemented, two for CPU and two for GPU usage (a sinfo check of these limits follows the list):
- quick: for short CPU jobs; default time is 30 min, max 1 hour
- wn: for longer CPU jobs
- qgpu: for short GPU jobs; default time is 30 min, max 1 hour and 1 GPU per user
- gpu: for GPU resources; max 15 GPUs per user
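These limits can also be read from Slurm itself; the format string below is one possible choice:
sinfo -o "%P %l %D" # partition name, configured time limit, number of nodes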
Here are a few useful commands to start working with Slurm:
sinfo # monitor nodes and partitions queue information
sbatch # submit a batch script
squeue # view information about jobs in the scheduling queue
scontrol show job -dd JobID # detailed information about job JobID, helpful for troubleshooting
sstat -j JobID # information about running jobs (or specific job JobID)
scancel JobID # abort job JobID
scancel -n JobName # cancel all jobs with the job name JobName
sprio -l # priority of your jobs
sshare -a # share information about all users
sacct -j JobID --format=JobID,JobName,MaxRSS,Elapsed # accounting information on the completed job JobID
sacct --helpformat # see format options for sacct
To submit a job to the wn partition, issue:
sbatch -p wn --account=t3 job.sh
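On success sbatch prints the assigned JobID; the job can then be followed with squeue, for example filtered to your own user name:
squeue -u $USER # show only your own pending and running jobs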
One can create a shell script with a set of directives, each starting with the
#SBATCH
string, like in the following examples.
GPU Example
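A minimal sketch of such a script: the qgpu partition, t3 account, and one-GPU limit come from this page, while the output file name and the nvidia-smi payload are placeholders for your own program:
#!/bin/bash
#SBATCH -p qgpu                # short GPU partition: max 1 hour, 1 GPU per user
#SBATCH --account=t3
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --time=00:30:00        # stay within the qgpu limit
#SBATCH -o slurm-%j.out        # output file; %j expands to the JobID

nvidia-smi                     # placeholder: print the allocated GPU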
CPU Example
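A minimal single-core sketch for the wn partition; the memory value and program name are illustrative:
#!/bin/bash
#SBATCH -p wn                  # partition for longer CPU jobs
#SBATCH --account=t3
#SBATCH --time=02:00:00        # requesting less time raises priority
#SBATCH --mem=2000M            # illustrative memory request
#SBATCH -o slurm-%j.out

echo "Running on $(hostname)"
./my_analysis                  # hypothetical user program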
CPU Example for using multiple processors (threads) on a single physical computer
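A sketch for a multithreaded job: --cpus-per-task keeps all threads on one node, and the thread count of 8 is just an example:
#!/bin/bash
#SBATCH -p wn
#SBATCH --account=t3
#SBATCH --nodes=1              # all threads on a single physical machine
#SBATCH --ntasks=1             # one task...
#SBATCH --cpus-per-task=8      # ...with eight cores
#SBATCH --time=02:00:00
#SBATCH -o slurm-%j.out

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # hint for OpenMP-style programs
./my_threaded_app              # hypothetical multithreaded program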
One can check the Slurm configuration (information about nodes, partitions, etc.)
in /etc/slurm/slurm.conf.
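The same information can also be queried live with scontrol, without reading the file:
scontrol show config # full running configuration of the Slurm daemons
scontrol show partition wn # limits of a single partition, here wn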
Slurm itself calculates job priorities taking into account:
- Age: how long the job has been waiting in the queue
- FairShare: the user's past usage of the cluster
- Job Size: the requested resources (CPU, memory)
It is therefore useful to declare the required time in the submission script (the less time requested, the higher the priority) with the time option like
--time=...
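Accepted formats for the limit include minutes, HH:MM:SS, and D-HH:MM:SS; it can also be given on the sbatch command line, where it overrides the value in the script:
sbatch -p wn --account=t3 --time=01:00:00 job.sh # one-hour limit for this submission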