<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
# * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
# * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
# * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup,Main.CMSAdminReaderGroup
-->
---+ Slurm Batch System Usage

This is an introduction to the T3 Slurm configuration. Slurm is a modern job scheduler for Linux clusters.

Please use the User Interface nodes t3ui01-03 mainly for development and small quick tests. Intensive computational work should be run on the Compute Nodes. There are two types of Compute Nodes in T3 Slurm: Worker Nodes for CPU usage, and GPU machines. All new hardware is equipped with 256 GB of RAM and a 10 GbE network:

| *Compute Node* | *Processor Type* | *Computing Resources: Cores/GPUs* | *Memory, GB* |
| t3ui01-03 - login nodes | Intel Xeon E5-2697 (2.30 GHz) | 72 | 128 |
| t3gpu0[1-2] | Intel Xeon E5-2630 v4 (2.20 GHz) | 8 * !GeForce GTX 1080 Ti | 256 |
| t3wn60-63 | Intel Xeon Gold 6148 (2.40 GHz) | 80 | 256 |
| t3wn51-59 | Intel Xeon E5-2698 (2.30 GHz) | 64 | 128 |
| t3wn41-43,45-48 | AMD Opteron 6272 (2.1 GHz) | 32 | 96 |
| t3wn30-36,38-39 | Intel Xeon E5-2670 (2.6 GHz) | 16 | 48 |

Access to the Compute Nodes is controlled by Slurm. Currently the maximum number of CPU jobs each user is allowed to run is *500* (about 40% of the CPU resources).

There are four partitions: two for CPU and two for GPU usage:
   * *quick* for short CPU jobs; default time is 30 min, max 1 hour
   * *wn* for longer CPU jobs
   * *qgpu* for short GPU jobs; default time is 30 min, max 1 hour and 1 GPU/user
   * *gpu* for GPU resources; max 15 GPUs/user

To submit a job to the wn partition, issue:

=sbatch -p wn --account=t3 --mem=3000 job.py=

One might instead create a shell script containing all directives as lines starting with =#SBATCH=, as in the following examples (a minimal sketch is also given at the bottom of this topic).

[[GPU Example][GPU Example]]

[[CPU Example][CPU Example]]

[[CPU Example for using multiple processors (threads) on a single physical computer][CPU Example for using multiple processors (threads) on a single physical computer]]

Here are a few [[SlurmMonitoringCommands][useful commands to check Slurm job and node status]]; see also the [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/SlurmUtilisation][T3 Slurm Monitoring]] page.

One can check the Slurm configuration (information about nodes, partitions, etc.) in =/etc/slurm/slurm.conf= (a few query commands are sketched at the bottom of this topic).

Slurm calculates *job priorities* taking into account:
   * *Age of Job*: how long the job has been waiting in the queue
   * *FairShare*: the user's past usage of the cluster
   * *Job Size*: the requested resources (CPU, memory)

It is therefore useful to declare the required run time in the submission script with the ==--time=...== option: the less time requested, the higher the priority.
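Below is a minimal sketch of such a batch script for the =wn= partition. The job name, output file, time and memory values are placeholders to adapt, and =my_analysis.py= stands for whatever payload you actually run; the linked CPU/GPU examples above remain the reference for site-specific details.

<verbatim>
#!/bin/bash
#SBATCH --partition=wn          # CPU partition for longer jobs
#SBATCH --account=t3            # account, as in the sbatch command above
#SBATCH --job-name=my_test      # placeholder job name
#SBATCH --mem=3000              # memory in MB
#SBATCH --time=01:00:00         # requested wall time; asking for less improves priority
#SBATCH --output=%x-%j.out      # stdout/stderr go to <jobname>-<jobid>.out

echo "Running on $(hostname)"
./my_analysis.py                # placeholder for the actual work
</verbatim>

Submit it with =sbatch job.sh= and check its state with =squeue -u $USER=.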
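For the =qgpu= and =gpu= partitions listed above, a GPU normally has to be requested explicitly in addition to choosing the partition. The T3-specific recipe is documented in the [[GPU Example][GPU Example]]; in generic Slurm syntax such a request could look like the following, where =gpu_job.sh= is a hypothetical batch script:

<verbatim>
# request one GPU on the short GPU partition (generic Slurm --gres syntax;
# see the GPU Example page for the exact T3 recipe)
sbatch -p qgpu --account=t3 --gres=gpu:1 --time=00:30:00 gpu_job.sh
</verbatim>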
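Besides reading =/etc/slurm/slurm.conf= directly, the partition and node settings can also be queried with standard Slurm commands, for example:

<verbatim>
sinfo -s                      # summary of partitions, their state and node counts
scontrol show partition wn    # full settings of the wn partition (default/max time, limits, ...)
scontrol show node t3wn60     # details of a single worker node from the table above
</verbatim>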