<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup,Main.CMSAdminReaderGroup
-->
%TOC%

---+ Policies for the resource allocation on the PSI Tier-3

These policies were agreed upon in the [[SteerBoardMeeting01][first]] and [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/SteerBoardMeeting02][second]] Steering Board meetings.

   1 We organize the users along [[PhysicsGroupsOverview][Physics Groups]].
   1 For the purpose of the cluster organization, *every user must be mapped to exactly one physics group* (even though the user may work for several).
   1 The *resources available to a physics group* consist of the summed resources of its members. How these resources are used is up to the physics group's internal organization.
   1 Each physics group has one *Responsible User* who
      * takes care of managing the group's resources (e.g. deciding which data sets to delete),
      * is the single point of contact for the cluster administrators on organizational issues,
      * can propose a guest user (see below).
   1 The resources are equipartitioned between users.
      * The *NFS home directory space* allocated to a user is 400 GB (note that quota calculation in ZFS has a certain latency, so slight time delays can result). A user may request a bigger quota in extraordinary circumstances. Requests must be sent to the admin mailing list, and the steering board members will be notified about the request.
      * To mitigate the limited scratch space on the *User Interface machines*, we distribute the users across the different UIs according to their institute allegiance (people in the same institute can easily negotiate among each other in cases of contention). This is by convention and is not enforced in any hard way, to preserve flexibility.
   1 Each group may define two *guest users*, who may be external users (i.e. not belonging to ETHZ, PSI, or !UniZ).
      * A guest user does not get own resources, but can use the group's resources.
      * The group's Responsible User is responsible for the guest user.
      * The guest user's account will be of limited duration.

---++ User Interface ( UIs ) policies

%STARTSECTION{name="UisPerGroup" type="section"}%
| *OS* | *UI Hostname* | *users group* | *Notes* |
| SL6 | t3ui01 | PSI | 132 GB RAM, 72 cores, 4 TB =/scratch= ( type [[http://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_10_.28RAID_1.2B0.29][RAID1+0]] ) |
| SL6 | t3ui02 | ETHZ | 132 GB RAM, 72 cores, 4 TB =/scratch= ( type [[http://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_10_.28RAID_1.2B0.29][RAID1+0]] ) |
| SL6 | t3ui03 | UNIZ | 132 GB RAM, 72 cores, 4 TB =/scratch= ( type [[http://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_10_.28RAID_1.2B0.29][RAID1+0]] ) |
%ENDSECTION{name="UisPerGroup" type="section"}%

---++ =/shome= policies

Each T3 user owns a dedicated =/shome/$USER= filesystem ( it is not a simple directory ) featuring:
   * The [[https://en.wikipedia.org/wiki/ZFS][ZFS]] filesystem.
   * 400 GB of user quota, meant as the sum of the *current* user data plus the *past* user data ( i.e. backups, or snapshots ); a T3 user should keep at most 120 GB of current data in order to make room for his/her own snapshots.
   * The last 36 hours of snapshots in =/shome/$USER/.zfs/snapshot= ; to recover a file, or a whole directory, simply use the =cp= command ; no interaction with the T3 admins is needed !
<!-- * The last 5 days of snapshots in =/shome/$USER/.zfs/snapshot= ; again to recover a file use =cp= -->
   * The 400 GB user quota is used to store both the current user data *AND* the recently deleted data still referenced by some =/shome/$USER/.zfs/snapshot= ; as an extreme example, downloading a 400 GB file into =/shome/$USER=, deleting it and then trying to download the same file again will immediately fail with an =out of space= error ; if a T3 user runs out of space, only the T3 admins can recover space, by serially deleting his/her oldest snapshots.
   * Each T3 user can verify his/her current/past =/shome/$USER= usage at this %RED%[[http://t3mon.psi.ch/ganglia/PSIT3-custom/space.report][URL]]%ENDCOLOR% :<pre>
$ lynx --dump --width=800 %RED%http://t3mon.psi.ch/ganglia/PSIT3-custom/space.report%ENDCOLOR% | egrep "NAME|$USER"
NAME                        QUOTA  AVAIL  RESERV  USED   USEDDS  USEDSNAP  SSCOUNT  RATIO  CREATION
data01/shome/martinelli_f   800G   796G   10G     4.22G  4.22G   3.53M     46       1.25x  Mon Dec  7 18:49 2015
</pre>

---++ Batch system policies

These policies were discussed and endorsed in the [[SteerBoardMeeting03][steering board meeting of 2011-11-03]].

---+++ Aims

   * The T3 CPU and storage resources must be shared in a fair way.
   * All users are treated equally, so resources are accorded per user, not per group.
   * We want to ensure that we keep an adequate part of the resources for short-turnaround jobs. Longer jobs (like for bigger private MC production) should be possible, but during the main office hours the shorter jobs will have priority and longer jobs will be throttled by a quota.
   * Scheduling policies have the greatest impact when job turnaround times are small, so we would like to favor short queue jobs over long queue jobs:
      * short queue jobs should be able to run on all slots of the cluster;
      * the short queue job runtime should cover the majority of use cases;
      * long queue jobs should only be able to saturate part of the cluster;
      * since we have tens of users, we would like to limit the number of slots a single user can fill, especially on the long queue.
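The concrete limits that implement these aims are spelled out in the next section; they can also be queried directly on a UI with the standard Grid Engine client ( a minimal sketch, assuming =qconf= is available in the default path on the =t3ui*= machines ):
<pre>
# list all configured cluster queues (standard SGE/Grid Engine command)
$ qconf -sql

# show the configured run-time limits ( s_rt / h_rt ) of a given queue, e.g. short.q
$ qconf -sq short.q | grep -E '^[sh]_rt'
</pre>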
---+++ Resource quota limits for enforcing policies

%STARTSECTION{name="SchedPolicies" type="section"}%
Explicit scheduling policies:
   1 Queue job runtime limits:
      * =short.q=: 90 min
      * =all.q=: 10 h ( this is the default queue used by a =qsub= command )
      * =long.q=: 96 h
   1 Queue job amount limits. How many jobs can be running in each of the queues:
      * =short.q=: can run on all the 1040 available job slots.
      * =all.q= and =long.q= together: max 740 job slots.
      * =long.q=: max 360 job slots.
   1 User job amount limits. The maximum number of jobs a user can have running in each queue:
      * =short.q=: max 460 jobs.
      * =all.q=: max 400 jobs.
      * =long.q=: max 340 jobs.
      * A user can only ever have 500 running jobs in total, independent of the queues.
   1 Users with justified requests for large numbers of very long jobs can be accorded resources on special request (mail to the steering board).
   1 *All these policies are relaxed at night and on weekends*, so that the cluster can be taken by a bigger number of jobs:
      * The user job amount limits are turned *off*.
      * Night time is defined as weekdays from 19:00 to 04:00; weekend time is defined as Sat 04:00 to Mon 04:00.
%ENDSECTION{name="SchedPolicies" type="section"}%

Other resource limits affecting job submission:
   1 *Job RAM limit, default %RED%3GB%ENDCOLOR%:*
      * By default 3 GB of RAM are reserved on the assigned =t3wn= server; if the job uses more than 3 GB it will be killed; [[http://linux.die.net/man/5/sge_queue_conf][read about h_vmem]].
      * If you need more than 3 GB, use the =qsub= option =-l h_vmem=nG=, with =n= <= 6 ( 6 GByte ); the more RAM you request, the fewer jobs can run on a =t3wn= server, so check whether you really need that much RAM ( all the CMS worldwide grid centres tolerate a max of 2 GB of RAM ! ).
      * By running =qstat -j JOBID= you will see the =h_vmem= RAM value that was requested at submission time, either by default or by you.

---+++ How to check the current batch system policies

The command =qquota= reports the batch system quota usage, either for a single user or for all users; the batch system policies are published on each =t3ui1*= server in =/gridware/sge_ce/tier3-policies/= ; for instance, during the day they are:
%TWISTY%
<pre>
$ grep -A 100000 -B 10000 --color TRUE /gridware/sge_ce/tier3-policies/day
{
   name         max_jobs_per_sun_host
   description  Allow maximally 8 jobs per bl6270 host
   enabled      TRUE
   limit        queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts {@bl6270} to slots=8
}
{
   name         max_jobs_per_intel_host
   description  Allow maximally 16 jobs per intel host
   enabled      TRUE
   limit        queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts {@wnintel} to slots=16
}
{
   name         max_jobs_per_intel2_host
   description  Allow maximally 64 jobs per intel2 host
   enabled      TRUE
   limit        queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts {@wnintel2} to slots=64
}
{
   name         max_jobs_per_supermicro_host
   description  Allow maximally 32 jobs per supermicro host
   enabled      TRUE
   limit        queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts {@wnsupermicro} to slots=32
}
{
   name         max_jobs_per_t3vm03
   description  NONE
   enabled      FALSE
   limit        queues all.q,short.q hosts t3vm03.psi.ch to slots=2
}
{
   name         test-rqs-admin2
   description  limit maximal number of jobs of a user in the admin queue
   enabled      FALSE
   limit        users {*} queues all.q.admin to slots=40
}
{
   name         test-rqs-admin
   description  limit admin queue to 30 slots total
   enabled      FALSE
   limit        queues all.q.admin to slots=30
}
{
   name         max_allq_jobs
   description  limit all.q and long.q to a maximal number of common slots
   enabled      TRUE
   limit        queues all.q,long.q to slots=740
}
{
   name         max_longq_jobs
   description  limit long.q to a maximal number of slots
   enabled      TRUE
   limit        queues long.q to slots=360
}
{
   name         max_sherpagen_jobs
   description  limit sherpa.gen.q to a maximal number of slots
   enabled      TRUE
   limit        queues sherpa.gen.q to slots=50
}
{
   name         max_sherpaintlong_jobs
   description  limit sherpa.int.long.q to a maximal number of slots
   enabled      TRUE
   limit        queues sherpa.int.long.q to slots=32
}
{
   name         max_sherpaintvlong_jobs
   description  limit sherpa.int.vlong.q to a maximal number of slots
   enabled      TRUE
   limit        queues sherpa.int.vlong.q to slots=32
}
{
   name         max_user_jobs_per_queue
   description  Limit a user to a maximal number of concurrent jobs in each \
                queue
   enabled      TRUE
   limit        users {*} queues all.q to slots=400
   limit        users {*} queues short.q to slots=460
   limit        users {*} queues long.q to slots=340
   limit        users {*} queues sherpa.gen.q to slots=32
   limit        users {*} queues sherpa.int.long.q to slots=32
   limit        users {*} queues sherpa.int.vlong.q to slots=32
}
{
   name         max_jobs_per_user
   description  Limit the total number of concurrent jobs a user can run on \
                the cluster
   enabled      TRUE
   limit        users {*} queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q to slots=500
}
</pre>%ENDTWISTY%

All the current quotas, as reported by =qquota -u \*= :
%TWISTY%<pre>
resource quota rule            limit            filter
--------------------------------------------------------------------------------
max_jobs_per_sun_host/1        slots=2/8        queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn17
max_jobs_per_sun_host/1        slots=1/8        queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn25
max_jobs_per_sun_host/1        slots=1/8        queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn15
max_jobs_per_intel_host/1      slots=11/16      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn39
max_jobs_per_intel_host/1      slots=13/16      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn37
max_jobs_per_intel_host/1      slots=12/16      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn30
max_jobs_per_intel_host/1      slots=13/16      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn38
max_jobs_per_intel_host/1      slots=12/16      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn40
max_jobs_per_intel_host/1      slots=13/16      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn34
max_jobs_per_intel_host/1      slots=12/16      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn36
max_jobs_per_intel_host/1      slots=12/16      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn33
max_jobs_per_intel_host/1      slots=12/16      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn35
max_jobs_per_intel_host/1      slots=12/16      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn32
max_jobs_per_intel_host/1      slots=12/16      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn31
max_jobs_per_intel2_host/1     slots=25/64      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn52
max_jobs_per_intel2_host/1     slots=25/64      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn51
max_jobs_per_intel2_host/1     slots=24/64      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn55
max_jobs_per_intel2_host/1     slots=25/64      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn54
max_jobs_per_intel2_host/1     slots=24/64      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn53
max_jobs_per_intel2_host/1     slots=24/64      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn56
max_jobs_per_intel2_host/1     slots=24/64      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn57
max_jobs_per_intel2_host/1     slots=24/64      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn58
max_jobs_per_intel2_host/1     slots=10/64      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn59
max_jobs_per_supermicro_host/1 slots=1/32       queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn50
max_allq_jobs/1                slots=344/740    queues all.q,long.q
max_longq_jobs/1               slots=340/360    queues long.q
max_user_jobs_per_queue/1      slots=4/400      users %BLUE%ggiannin%ENDCOLOR% queues all.q
max_user_jobs_per_queue/3      slots=340/340    users %RED%wiederkehr_s%ENDCOLOR% queues long.q
max_jobs_per_user/1            slots=340/500    users %RED%wiederkehr_s%ENDCOLOR% queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q
max_jobs_per_user/1            slots=4/500      users %BLUE%ggiannin%ENDCOLOR% queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q
</pre>%ENDTWISTY%

---+++ Availability of an interactive queue to debug the user jobs

A special *debug.q* queue allowing interactive sessions is available. Please consult the wiki page HowToDebugJobs. [[https://indico.cern.ch/event/375163/][There was also a presentation in 2015]].
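A minimal sketch of how such an interactive session is typically opened with SGE is shown below; the exact options required on the T3 are documented in HowToDebugJobs:
<pre>
# request an interactive shell through the debug queue (standard SGE qlogin;
# consult HowToDebugJobs for the exact options supported on the T3)
$ qlogin -q debug.q
# ...inspect and debug on the assigned worker node, then
$ exit
</pre>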
---++ UIs and WNs =/tmp= and =/scratch= user quota

Sometimes we have found the =/tmp= or the =/scratch= partitions of UIs and WNs full because a user had filled them with big and later forgotten files/dirs, or simply because a job went crazy; we have therefore introduced disk quotas to detect and stop at least these gross errors. The disk quotas, so far, are not designed to manage the case where many users each stay within their individually allowed amount of space but together fill all the space; it is up to you and your group to make room regularly on the shared filesystems.

For people not familiar with the Linux quota terms, here are the official definitions:
   * *Hard limit*: a hard limit is the absolute maximum number of disk blocks (or inodes) that can be temporarily used by a user (or group). Any attempt to use a single block or inode above the hard limit fails.
   * *Soft limit*: the soft limit is set below the hard limit. This allows users to temporarily exceed their soft limit, permitting them to finish whatever they were doing, and giving them some time in which to go through their files and trim back their usage to below their soft limit.
   * *Grace period*: as stated earlier, any disk usage above the soft limit is temporary. The grace period determines the length of time that a user (or group) can extend their usage beyond their soft limit and toward their hard limit. If a user continues to use more than the soft limit and the grace period expires, no additional disk usage will be permitted until the user (or group) has reduced their usage to a point below the soft limit.

The UIs and WNs disk quotas enforced at T3 are the following:
| ** | *soft limit* | *hard limit* | *grace* |
| */tmp* | 40% | 50% | 7 days |
| */scratch* | 80% | 90% | 7 days |

The UIs and WNs disk policies enforced at T3 are the following:
   * If you overuse a partition ( as described above ), you will receive a personal e-mail at 1pm reporting the overuse, and the grace period will be 7 days ( counted from the moment you exceeded the soft limit ).
   * To make room on a UI, simply connect and remove your files.
   * To make room on a WN, you must use the dedicated [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/HowToDebugJobs][debug.q]] queue to log into the WN and delete your files in =/tmp= and/or =/scratch=.
   * Once you have made enough room that your space usage is again below the soft limit, the quota system will forget the overuse ( no past memory ). Obviously, if you fill all the space at, for instance, 2pm, you will receive the warning e-mail only the day after; that e-mail must therefore be considered just a reminder, and you are expected to delete your files as soon as your code or shell session reports the following error:
<pre>
$ dd if=/dev/zero of=/tmp/zero.$USER
sda5: warning, user block quota exceeded.
sda5: write failed, user block limit reached.
dd: writing to `/tmp/zero': Disk quota exceeded
</pre>

---+++ Warning email example
%TWISTY%
<pre>
Dear T3 User

your disk usage has exceeded the agreed limits on this server,
have a look to this page to check the actual T3 usage policies:
https://wiki.chipp.ch/twiki/bin/view/CmsTier3/Tier3Policies

Please delete any unnecessary files on following filesystems:

The /tmp filesystem (/dev/sda5)
                        Block limits               File limits
Filesystem           used    soft     hard  grace  used  soft  hard  grace
/dev/sda5      +- 1486100  891660  1486100  6days     1     0     0

The T3 Administrators.
</pre>%ENDTWISTY%

---+++ How to check your =/tmp= or =/scratch= quota
%TWISTY%
<pre>
[auser@t3ui07 ~]$ quota -s -f /tmp
Disk quotas for user auser (uid 515): none

[auser@t3ui07 ~]$ quota -s -f /scratch
Disk quotas for user auser (uid 515):
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
      /dev/sda8  39772M  93662M    115G           23463       0       0
</pre>%ENDTWISTY%

---+++ How to check the other users' =/tmp= or =/scratch= quota
%TWISTY%
<pre>
[auser@t3ui06 ~]$ %BLUE%sudo%ENDCOLOR% /usr/sbin/repquota -s /scratch
*** Report for user quotas on device /dev/sdb1
Block grace time: 7days; Inode grace time: 7days
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --    188M       0       0              4     0     0
User1     --       4    106G    120G              1     0     0
User2     --   6385M    106G    120G            122     0     0
User3     --  11149M    106G    120G             20     0     0

[auser@t3ui06 ~]$ %BLUE%sudo%ENDCOLOR% /usr/sbin/repquota -s /tmp
*** Report for user quotas on device /dev/sda6
Block grace time: 7days; Inode grace time: 7days
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --    151M       0       0              5     0     0
xfs       --       0       0       0              1     0     0
nagios    --       4       0       0              2     0     0
User1     --       4   3876M   4844M              1     0     0
User2     --    1112   3876M   4844M             28     0     0
User3     --      72   3876M   4844M             17     0     0
</pre>%ENDTWISTY%

If you discover an overuse by a colleague, write to or call him/her and invite him/her to clean up.
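To find out what you yourself are occupying on a machine's =/tmp= and =/scratch= before the quota bites ( on a WN, from within a =debug.q= session as described above ), something along these lines can be used; a minimal sketch, where the =old_job_output= path is only a hypothetical example:
<pre>
# list your own largest files on the local /scratch and /tmp (GNU find)
$ find /scratch /tmp -user $USER -type f -printf '%s\t%p\n' 2>/dev/null | sort -n | tail -20

# then remove what is no longer needed, e.g.
$ rm -rf /scratch/$USER/old_job_output    # hypothetical example path
</pre>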