Slurm Batch system usage

This is an introduction to the T3 Slurm configuration. Slurm is a modern job scheduler for Linux clusters.

Please use the User Interface nodes t3ui01-03 mainly for development and small, quick tests.

For intensive computational work one should use the Compute Nodes. There are two types of Compute Nodes in T3 Slurm: Worker Nodes for CPU usage and GPU machines. All new hardware is equipped with 256 GB of RAM and a 10 GbE network:

| Compute Node | Processor Type | Computing Resources: Cores/GPUs |
| t3ui01-03 (login nodes) | Intel Xeon E5-2697 (2.30 GHz) | 72 |
| t3gpu0[1-2] | Intel Xeon E5-2630 v4 (2.20 GHz) | 8 x GeForce GTX 1080 Ti |
| t3wn60-63 | Intel Xeon Gold 6148 (2.40 GHz) | 80 |
| t3wn51-59 | Intel Xeon E5-2698 (2.30 GHz) | 64 |
| t3wn41-43,45-48 | AMD Opteron 6272 (2.1 GHz) | 32 |
| t3wn30-36,38-39 | Intel Xeon E5-2670 (2.6 GHz) | 16 |

Access to the Compute Nodes is controlled by Slurm.
Currently the maximum number of CPU jobs each user is allowed to run is 500 (about 40% of the CPU resources).
There are four partitions (similar to SGE queues): two for CPU and two for GPU usage:

  • quick for short CPU jobs; default time is 30 min, max 1 hour
  • wn for longer CPU jobs
  • qgpu for short GPU jobs; default time is 30 min, max 1 hour and 1 GPU per user
  • gpu for GPU resources; max 15 GPUs per user

Here are a few useful commands to start working with Slurm:

sinfo           # view information about nodes and partitions
sbatch          # submit a batch script
squeue          # view information about jobs in the scheduling queue
scontrol show jobid -dd JobID  # detailed job information, helpful for troubleshooting
sstat -j JobID     # information about the running job JobID
scancel JobID      # cancel job JobID
scancel -n JobName # cancel all jobs with job name JobName
sprio -l        # priority of your jobs
sshare -a       # share information about all users
sacct -j JobID --format=JobID,JobName,MaxRSS,Elapsed  # information on the completed job JobID
sacct --helpformat # list the format options for sacct
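
For example, a typical session might look like this (the job ID 123456 is purely illustrative):

sbatch -p quick --account=t3 job.sh   # -> Submitted batch job 123456
squeue -u $USER                       # check the state of your jobs
sacct -j 123456 --format=JobID,JobName,MaxRSS,Elapsed   # resource usage once the job has finished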

To submit a job to the wn partition, issue: sbatch -p wn --account=t3 job.sh

One can create a shell script containing a set of directives, each starting with the #SBATCH string, as in the following examples.

GPU Example
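
Below is a minimal sketch of such a GPU job script; the job name, wall time, and the program name your_gpu_program are placeholders to adapt, and it assumes GPUs are requested with the standard --gres=gpu:N syntax:

#!/bin/bash
#SBATCH --job-name=gpu_test          # hypothetical job name
#SBATCH --partition=gpu              # GPU partition (use qgpu for short tests)
#SBATCH --account=t3
#SBATCH --gres=gpu:1                 # request one GPU
#SBATCH --time=04:00:00              # requested wall time
#SBATCH --output=%x-%j.out           # output file: %x = job name, %j = job ID

./your_gpu_program                   # placeholder for the actual application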

CPU Example
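
A minimal sketch of a single-core CPU job script; the job name, wall time, memory request, and your_cpu_program are placeholders to adjust:

#!/bin/bash
#SBATCH --job-name=cpu_test          # hypothetical job name
#SBATCH --partition=wn               # use quick for jobs under 1 hour
#SBATCH --account=t3
#SBATCH --time=01:00:00              # requested wall time
#SBATCH --mem=2000M                  # requested memory (example value)
#SBATCH --output=%x-%j.out           # output file: %x = job name, %j = job ID

./your_cpu_program                   # placeholder for the actual application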

CPU Example for using multiple processors (threads) on a single physical computer
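
A minimal sketch of a multi-threaded job script; --nodes=1 with --cpus-per-task keeps all threads on one physical machine, and the thread count and your_multithreaded_program are placeholders:

#!/bin/bash
#SBATCH --job-name=multithread_test  # hypothetical job name
#SBATCH --partition=wn
#SBATCH --account=t3
#SBATCH --nodes=1                    # all threads on a single node
#SBATCH --ntasks=1                   # one task ...
#SBATCH --cpus-per-task=8            # ... with 8 CPUs (threads); adjust as needed
#SBATCH --time=02:00:00              # requested wall time
#SBATCH --output=%x-%j.out           # output file: %x = job name, %j = job ID

# pass the allocated CPU count to the application, e.g. for OpenMP
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./your_multithreaded_program         # placeholder for the actual application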

One can check the Slurm configuration (information about nodes, partitions, etc.) in /etc/slurm/slurm.conf
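
The same information can also be queried with scontrol, for example:

scontrol show config       # full Slurm configuration
scontrol show partitions   # limits and node lists of all partitions
scontrol show nodes        # state and resources of all nodes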

Slurm calculates job priorities taking into account
- Job age: how long the job has been waiting in the queue
- Fair share: the user's past usage of the cluster
- Job size: the requested resources (CPU, memory)

It is therefore useful to declare the time resource in the submission script (the less time requested, the higher the priority) with the time option --time=...
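
For example, a job known to finish within two hours can be submitted as

sbatch -p wn --account=t3 --time=02:00:00 job.sh

or, equivalently, with the directive #SBATCH --time=02:00:00 inside the script.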
