<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup,Main.CMSAdminReaderGroup
-->

%TOC%

---+ Policies for the resource allocation on the PSI Tier-3

These policies were agreed upon in the [[SteerBoardMeeting01][first]] and [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/SteerBoardMeeting02][second]] Steering Board meetings.

   1 We organize the users along [[PhysicsGroupsOverview][Physics Groups]].
   1 For the purpose of the cluster organization, *every user must be mapped to exactly one physics group* (even though the user may work for several).
   1 The *resources available to a physics group* are the sum of the resources of its members. How these resources are used is up to the physics group's internal organization.
   1 Each physics group has one *Responsible User* who
      * takes care of managing the group's resources (e.g. deciding what data sets to delete).
      * is the single point of contact for the cluster administrators for organizational issues.
      * can propose a guest user (see below).
   1 The resources are equipartitioned between users.
<!--
   1 Each group may define two *guest users* who may be an external user (i.e. not belonging to ETHZ, PSI, or !UniZ).
      * The guest user will not get own resources, but he can use the group's resources
      * The group's Responsible User is responsible for the guest user.
      * The Guest user's account will be of limited duration
-->

---++ User Interface ( UIs ) policies

%STARTSECTION{name="UisPerGroup" type="section"}%
| *OS* | *UI Hostname* | *Users group* | *Notes* |
| SL6 | t3ui01 | PSI | 132GB RAM, 72 cores, 4TB =/scratch= |
| SL6 | t3ui02 | ETHZ | 132GB RAM, 72 cores, 4TB =/scratch= |
| SL6 | t3ui03 | UNIZ | 132GB RAM, 72 cores, 4TB =/scratch= |
%ENDSECTION{name="UisPerGroup" type="section"}%

---++ Home policies

   * Every user has a %GREEN%home directory /t3home/${USER} with a default 10GB quota%ENDCOLOR% for development of software, documentation, etc.<br /> It is configured with 1 daily, 1 weekly and 1 monthly snapshot, kept outside the quota at /t3home/username/.snapshot <br /> To check your quota, use the following command: <pre>
~% quota -s -f /t3home
Disk quotas for user username (uid user_id):
     Filesystem  blocks   %GREEN%quota%ENDCOLOR%   limit   grace   files   quota   limit   grace
t3nfs:/t3/home1    704M  %GREEN%10240M%ENDCOLOR%  11264M             927   4295m   4295m
</pre>
   * For large files there is the %BLUE%/work/${USER} volume with effective space ~100GB%ENDCOLOR% and 2 daily snapshots.<br /> =/mnt/t3nfs01/data01/shome= is a link to =/work= <br /> Snapshots count against the work quota. If the quota is completely full, only an administrator can help; requests must be sent to the admin mailing list.

---++ Batch system policies

These policies were discussed and endorsed in the [[SteerBoardMeeting03][steering board meeting of 2011-11-03]].

---+++ Aims

   * The T3 CPU and storage resources must be shared in a fair way.
   * All users are treated equally, so resources are allocated per user, not per group.
   * We want to keep an adequate part of the resources for short turnaround jobs. Longer jobs (e.g. for bigger private MC production) should be possible, but during the main office hours the shorter jobs will have priority and longer jobs will be throttled by a quota.
   * Scheduling policies have the greatest impact when job turnaround times are short, so we would like to favor short queue jobs over long queue jobs.
   * Short queue jobs should be able to run on all slots of the cluster.
   * The short queue job runtime should cover the majority of use cases.
   * Long queue jobs should only be able to saturate part of the cluster.
   * Since we have tens of users, we would like to reduce the number of slots a single user can fill, especially on the long queue.

---+++ Resource quota limits for enforcing policies

%STARTSECTION{name="SchedPolicies" type="section"}%
Explicit scheduling policies:
   1 Queue job runtime limits.
      * =short.q=: 90min
      * =all.q=: 10h (this is the default queue used by a =qsub= command)
      * =long.q=: 96h
   1 Queue job count limits. How many jobs can be running in each of the queues:
      * =short.q=: can run on all the 1040 available job slots.
      * =all.q= and =long.q= together: max 740 job slots.
      * =long.q=: max 360 job slots.
   1 User job count limits. The maximum number of jobs a user can have running in each queue:
      * =short.q=: max 460 jobs.
      * =all.q=: max 400 jobs.
      * =long.q=: max 340 jobs.
      * A user can never have more than 500 running jobs in total, independent of the queues.
   1 Users with justified requests for large numbers of very long jobs can be granted resources on special request (mail to the steering board).
   1 *All these policies are relaxed at night and on weekends*, so that the cluster can take a larger number of jobs:
      * The user job count limits are turned *off*.
      * Night time is defined as weekdays from 19:00 to 4:00; weekend time is defined as Sat 4:00 to Mon 4:00.
%ENDSECTION{name="SchedPolicies" type="section"}%

Other resource limits affecting job submission:
   1 *Job RAM limit, default %RED%3GB%ENDCOLOR%:*
      * By default 3GB of RAM is reserved on the assigned =t3wn= server, and a job that uses more than 3GB is killed; [[http://linux.die.net/man/5/sge_queue_conf][read about h_vmem]].
      * If you need more than 3GB, use the =qsub= option =-l h_vmem=nG=, with =n= <= 6 (6 GByte). The more RAM you request, the fewer jobs can run on a =t3wn= server, so check whether you really need that much RAM (CMS grid centres worldwide tolerate a maximum of 2GB of RAM!).
      * Running =qstat -j JOBID= shows the =h_vmem= RAM value that was requested at submission time, either by default or by you.

---+++ How to check the current batch system policies

The command =qquota= reports the batch system quota usage, for a single user or for all users. The batch system policies are published on each =t3ui1*= server in =/gridware/sge_ce/tier3-policies/=; for instance, during the day they are:<br />
%TWISTY%
<pre>
$ grep -A 100000 -B 10000 --color TRUE /gridware/sge_ce/tier3-policies/day
{
   name         max_jobs_per_sun_host
   description  Allow maximally 8 jobs per bl6270 host
   enabled      TRUE
   limit        queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts {@bl6270} to slots=8
}
{
   name         max_jobs_per_intel_host
   description  Allow maximally 16 jobs per intel host
   enabled      TRUE
   limit        queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts {@wnintel} to slots=16
}
{
   name         max_jobs_per_intel2_host
   description  Allow maximally 64 jobs per intel2 host
   enabled      TRUE
   limit        queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts {@wnintel2} to slots=64
}
{
   name         max_jobs_per_supermicro_host
   description  Allow maximally 32 jobs per supermicro host
   enabled      TRUE
   limit        queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts {@wnsupermicro} to slots=32
}
{
   name         max_jobs_per_t3vm03
   description  NONE
   enabled      FALSE
   limit        queues all.q,short.q hosts t3vm03.psi.ch to slots=2
}
{
   name         test-rqs-admin2
   description  limit maximal number of jobs of a user in the admin queue
   enabled      FALSE
   limit        users {*} queues all.q.admin to slots=40
}
{
   name         test-rqs-admin
   description  limit admin queue to 30 slots total
   enabled      FALSE
   limit        queues all.q.admin to slots=30
}
{
   name         max_allq_jobs
   description  limit all.q and long.q to a maximal number of common slots
   enabled      TRUE
   limit        queues all.q,long.q to slots=740
}
{
   name         max_longq_jobs
   description  limit long.q to a maximal number of slots
   enabled      TRUE
   limit        queues long.q to slots=360
}
{
   name         max_sherpagen_jobs
   description  limit sherpa.gen.q to a maximal number of slots
   enabled      TRUE
   limit        queues sherpa.gen.q to slots=50
}
{
   name         max_sherpaintlong_jobs
   description  limit sherpa.int.long.q to a maximal number of slots
   enabled      TRUE
   limit        queues sherpa.int.long.q to slots=32
}
{
   name         max_sherpaintvlong_jobs
   description  limit sherpa.int.vlong.q to a maximal number of slots
   enabled      TRUE
   limit        queues sherpa.int.vlong.q to slots=32
}
{
   name         max_user_jobs_per_queue
   description  Limit a user to a maximal number of concurrent jobs in each \
                queue
   enabled      TRUE
   limit        users {*} queues all.q to slots=400
   limit        users {*} queues short.q to slots=460
   limit        users {*} queues long.q to slots=340
   limit        users {*} queues sherpa.gen.q to slots=32
   limit        users {*} queues sherpa.int.long.q to slots=32
   limit        users {*} queues sherpa.int.vlong.q to slots=32
}
{
   name         max_jobs_per_user
   description  Limit the total number of concurrent jobs a user can run on \
                the cluster
   enabled      TRUE
   limit        users {*} queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q to slots=500
}
</pre>%ENDTWISTY%<br />
All the current quotas, as reported by =qquota -u \*=:<br />
%TWISTY%<pre>
resource quota rule limit                filter
--------------------------------------------------------------------------------
max_jobs_per_sun_host/1 slots=2/8 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn17
max_jobs_per_sun_host/1 slots=1/8 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn25
max_jobs_per_sun_host/1 slots=1/8 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn15
max_jobs_per_intel_host/1 slots=11/16 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn39
max_jobs_per_intel_host/1 slots=13/16 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn37
max_jobs_per_intel_host/1 slots=12/16 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn30
max_jobs_per_intel_host/1 slots=13/16 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn38
max_jobs_per_intel_host/1 slots=12/16 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn40
max_jobs_per_intel_host/1 slots=13/16 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn34
max_jobs_per_intel_host/1 slots=12/16 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn36
max_jobs_per_intel_host/1 slots=12/16 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn33
max_jobs_per_intel_host/1 slots=12/16 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn35
max_jobs_per_intel_host/1 slots=12/16 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn32
max_jobs_per_intel_host/1 slots=12/16 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn31
max_jobs_per_intel2_host/1 slots=25/64 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn52
max_jobs_per_intel2_host/1 slots=25/64 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn51
max_jobs_per_intel2_host/1 slots=24/64 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn55
max_jobs_per_intel2_host/1 slots=25/64 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn54
max_jobs_per_intel2_host/1 slots=24/64 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn53
max_jobs_per_intel2_host/1 slots=24/64 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn56
max_jobs_per_intel2_host/1 slots=24/64 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn57
max_jobs_per_intel2_host/1 slots=24/64 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn58
max_jobs_per_intel2_host/1 slots=10/64 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn59
max_jobs_per_supermicro_host/1 slots=1/32 queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn50
max_allq_jobs/1 slots=344/740 queues all.q,long.q
max_longq_jobs/1 slots=340/360 queues long.q
max_user_jobs_per_queue/1 slots=4/400 users %BLUE%ggiannin%ENDCOLOR% queues all.q
max_user_jobs_per_queue/3 slots=340/340 users %RED%wiederkehr_s%ENDCOLOR% queues long.q
max_jobs_per_user/1 slots=340/500 users %RED%wiederkehr_s%ENDCOLOR% queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q
max_jobs_per_user/1 slots=4/500 users %BLUE%ggiannin%ENDCOLOR% queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q
</pre>%ENDTWISTY%

---+++ Availability of an interactive queue to debug the user jobs

A special *debug.q* queue allowing interactive sessions is available. Please consult the wiki page HowToDebugJobs.
[[https://indico.cern.ch/event/375163/][There was also a presentation in 2015]].

---++ UIs and WNs =/tmp /scratch= usage

The shared =/tmp= and =/scratch= partitions of UIs and WNs are sometimes found full, either because a user filled them with large files or directories and later forgot them, or simply because a job went out of control. Users clearly asked to manage this space themselves, so there is no automatic cleaning. Please clean up promptly and remember that */scratch is the least protected area and is not intended for long-term storage of important data*.

If you discover abuse, write to or call your colleague and ask him/her to clean up; otherwise your peers' work could be blocked. Administrators can help when group members still need outdated files that another user left in scratch. In that case the ownership has to be explicitly transferred to a new responsible person.
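
Since there is no automatic cleaning, it can help to review your own large files from time to time. The following is a minimal sketch, not an official cluster tool: it assumes GNU =find= (needed for =-printf=), and the function name =list_big_files= and its default 1G threshold are illustrative choices.

```shell
#!/bin/bash
# Hypothetical helper (not part of the cluster setup): list files under a
# directory that belong to one user and exceed a size threshold, largest first.
# Assumes GNU find, whose -printf action prints the size in bytes and the path.
list_big_files() {
    local dir="${1:-/scratch}"      # directory to scan
    local min_size="${2:-1G}"       # threshold in find(1) -size units (k, M, G)
    local owner="${3:-$(id -u)}"    # user name or numeric uid; defaults to yourself

    # -xdev keeps find on one filesystem; unreadable directories are skipped quietly.
    find "$dir" -xdev -type f -user "$owner" -size "+${min_size}" \
        -printf '%s\t%p\n' 2>/dev/null | sort -rn
}
```

For example, =list_big_files /scratch 500M= prints the size in bytes and the path of each of your own files above that threshold. Nothing is deleted, so you can review the list before removing anything yourself.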
Topic revision: r59 - 2019-02-27 - NinaLoktionova