Slurm Batch system usage

This is an introduction to the test configuration of Slurm - a modern job scheduler for Linux clusters - at T3.

Currently t3ui07 is the single login node for Slurm. Like any User Interface node, it should be used mostly for development and small, quick tests.

For intensive computational work one should use the Compute Nodes. There are two types of Compute Nodes: Worker Nodes for CPU usage and GPU machines. All new hardware is equipped with 256 GB of RAM and a 10 GbE network:

Compute Node          Processor Type                     Computing Resources: Cores/GPUs
t3ui04 - login node   AMD Opteron 6272 (2.1 GHz)         32
t3gpu0[1-2]           Intel Xeon E5-2630 v4 (2.20 GHz)   8 * GeForce GTX 1080 Ti
t3wn60,63             Intel Xeon Gold 6148 (2.40 GHz)    80
t3wn51-58             Intel Xeon E5-2698 (2.30 GHz)      64
t3wn48                AMD Opteron 6272 (2.1 GHz)         32
t3wn38                Intel Xeon E5-2670 (2.6 GHz)       16

Access to the Compute Nodes is controlled by Slurm.
Corresponding to these computing resources, two partitions (similar to SGE queues) are implemented: wn and gpu.

Here are a few useful commands to start working with Slurm:

sinfo           # view information about Slurm nodes and partitions
sbatch          # submit a batch script 
squeue          # view information about jobs in the scheduling queue
sacct (-j X)    # view detailed information about jobs (or specific job X)
scancel X       # abort job X
scancel -n X    # cancel all jobs with job name X
sprio -l        # view the priority of your pending jobs
sshare -a       # fair-share information for all users

To submit a job to the wn partition, issue: sbatch -p wn job.sh
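
Similarly for the gpu partition (the --gres request is an assumption and depends on the local GRES configuration), followed by a check of the queue:

sbatch -p gpu --gres=gpu:1 job.sh   # submit to the GPU machines, requesting one GPU
squeue -u $USER                     # check that the job is pending or running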

One might create a shell script containing a set of directives, each starting with the #SBATCH string, as in the following examples.

GPU Example
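
A minimal sketch of a GPU batch script, assuming the GPUs on t3gpu0[1-2] are exposed under the generic resource name gpu; the job name, memory, and time values are placeholders:

#!/bin/bash
#SBATCH --job-name=gpu_test      # name shown in squeue
#SBATCH --partition=gpu          # GPU machines t3gpu0[1-2]
#SBATCH --gres=gpu:1             # request one GPU (GRES name "gpu" is assumed)
#SBATCH --mem=4000M              # memory per node
#SBATCH --time=01:00:00          # walltime limit, hh:mm:ss
#SBATCH --output=%x-%j.out       # stdout and stderr file

echo "Running on $(hostname)"
nvidia-smi                       # show the GPU(s) assigned to the job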

CPU Example
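
A minimal sketch of a single-core batch script for the wn partition; the program name and the requested limits are placeholders:

#!/bin/bash
#SBATCH --job-name=cpu_test      # name shown in squeue
#SBATCH --partition=wn           # CPU Worker Nodes
#SBATCH --ntasks=1               # one task (one core)
#SBATCH --mem=2000M              # memory per node
#SBATCH --time=02:00:00          # walltime limit, hh:mm:ss
#SBATCH --output=%x-%j.out       # stdout and stderr file

echo "Running on $(hostname)"
./my_analysis.sh                 # placeholder for the actual workload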

CPU Example for using multiple processors (threads) on a single physical computer
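
A minimal sketch of a multi-threaded job that keeps all requested cores on a single node; the count of 8 threads is only an example:

#!/bin/bash
#SBATCH --job-name=smp_test      # name shown in squeue
#SBATCH --partition=wn           # CPU Worker Nodes
#SBATCH --nodes=1                # all cores on one physical machine
#SBATCH --ntasks=1               # a single task ...
#SBATCH --cpus-per-task=8        # ... using 8 cores/threads
#SBATCH --mem=8000M              # memory per node
#SBATCH --time=02:00:00          # walltime limit, hh:mm:ss

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # pass the core count to an OpenMP program
./my_threaded_program                         # placeholder for the actual workload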

One can check the Slurm configuration (information about Nodes, Partitions, etc.) in /etc/slurm/slurm.conf
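
The running configuration can also be queried with scontrol, for example:

scontrol show config              # dump the configuration Slurm is actually using
scontrol show partition wn        # limits and node list of the wn partition
scontrol show node t3wn38         # details of a single compute node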

Currently the maximum number of jobs each user is allowed to run is 400 (about 60% of the CPU resources).

Slurm itself calculates job priorities taking into account:

- Age of Job: how long the job has been waiting in the queue
- FairShare: the user's past usage of the cluster
- Job Size: the requested resources (CPU, memory, time)

Therefore it is useful to declare the time resource in the submission script (the less time required, the higher the priority), e.g. --time=...
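
For example, a one-hour limit can be set either in the script or at submission time:

#SBATCH --time=01:00:00               # inside the submission script
sbatch --time=01:00:00 -p wn job.sh   # or directly on the command line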

-- NinaLoktionova - 2019-05-08
