<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

%TOC%

%ICON{arrowleft}% Go to [[CMSTier3LogXX][previous page]] / [[CMSTier3LogXX][next page]] of Tier3 site log %M%

---+ 26. 07. 2012 Enforcing flexible memory limits on SGE

---++ Change proposal

Until now we have not applied memory limits on SGE, but after five server crashes in the last few days it is time to enforce them. I propose to:
   * Assign an =h_vmem= value to each host according to its memory capacity, for instance =25G= to =t3wn[10-29]= and =50G= to =t3wn[30-40]=. For each WN we run:
      * =qconf -se WN=
      * =qconf -aattr exechost complex_values h_vmem=25G WN=
      * =qconf -se WN=
   * Configure the =h_vmem= complex as a consumable with a default value, i.e. switch:
      * its =consumable= property from =NO= to =YES=
      * its default value from =0= to =3G=
      * Run =qconf -sc= to see the complexes before and after the change.
   * For each queue =Q=, set the hard limit =h_vmem= to =6G=, so users can request more memory than the default =3G=, up to a maximum of =6G=.
      * The queue-level =h_vmem= limit applies per slot and is an effective way to enforce an upper bound on =h_vmem=. Even better, we could write a [[http://docs.oracle.com/cd/E24901_01/doc.62/e21978/configuration.htm#autoId53][JSV]] script that rejects unsatisfiable job requests instead of leaving them *queued forever* in the SGE queues.
      * Our =JSV= can be installed as a forced check inside =/gridware/sge/util/sge_request=.
   * We add the memory limit '@cms as 6500000' (address space, in KB) to =/etc/security/limits.conf= as a final safety net, in case SGE stops honouring its =h_vmem= limits or an SGE admin misconfigures them.

---++ Some logs collected during the change

---+++ Configuring the =h_vmem= limit per each host

<pre>
[root@t3ce02 ~]# for i in `seq 10 29` ; do qconf -aattr exechost complex_values h_vmem=25G t3wn$i ; done
root@t3ce02.psi.ch modified "t3wn10.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn11.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn12.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn13.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn14.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn15.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn16.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn17.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn18.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn19.psi.ch" in exechost list
No modification because "h_vmem" already exists in "complex_values" of "exechost"
root@t3ce02.psi.ch modified "t3wn21.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn22.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn23.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn24.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn25.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn26.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn27.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn28.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn29.psi.ch" in exechost list
[root@t3ce02 ~]# for i in `seq 30 40` ; do qconf -aattr exechost complex_values h_vmem=50G t3wn$i ; done
root@t3ce02.psi.ch modified "t3wn30.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn31.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn32.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn33.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn34.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn35.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn36.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn37.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn38.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn39.psi.ch" in exechost list
root@t3ce02.psi.ch modified "t3wn40.psi.ch" in exechost list
</pre>

---+++ After the h_vmem complex change

<pre>
[root@t3ce02 ~]# qconf -sc | egrep 'h_vmem|shortcut'
#name    shortcut  type    relop  requestable  consumable  default  urgency
h_vmem   h_vmem    MEMORY  <=     YES          YES         3G       0
</pre>

Interestingly, since a default value is now enforced, the jobs that were already running are accounted for as well, as =qhost= reports:

<pre>
[root@t3ce02 ~]# qhost -F h_vmem | grep h_vmem
Host Resource(s): hc:h_vmem=16.000G   <-- WN10
Host Resource(s): hc:h_vmem=19.000G
Host Resource(s): hc:h_vmem=19.000G
Host Resource(s): hc:h_vmem=19.000G
Host Resource(s): hc:h_vmem=16.000G
Host Resource(s): hc:h_vmem=19.000G
Host Resource(s): hc:h_vmem=16.000G
Host Resource(s): hc:h_vmem=16.000G
Host Resource(s): hc:h_vmem=16.000G
Host Resource(s): hc:h_vmem=16.000G
Host Resource(s): hc:h_vmem=25.000G
Host Resource(s): hc:h_vmem=16.000G
Host Resource(s): hc:h_vmem=16.000G
Host Resource(s): hc:h_vmem=19.000G
Host Resource(s): hc:h_vmem=19.000G
Host Resource(s): hc:h_vmem=19.000G
Host Resource(s): hc:h_vmem=19.000G
Host Resource(s): hc:h_vmem=16.000G
Host Resource(s): hc:h_vmem=16.000G
Host Resource(s): hc:h_vmem=19.000G
Host Resource(s): hc:h_vmem=2.000G    <-- WN30, 16 jobs running => 16*3G=48G, and 50G-48G = 2.000G
Host Resource(s): hc:h_vmem=2.000G
Host Resource(s): hc:h_vmem=2.000G
Host Resource(s): hc:h_vmem=2.000G
Host Resource(s): hc:h_vmem=2.000G
Host Resource(s): hc:h_vmem=2.000G
Host Resource(s): hc:h_vmem=2.000G
Host Resource(s): hc:h_vmem=2.000G
Host Resource(s): hc:h_vmem=2.000G
Host Resource(s): hc:h_vmem=2.000G
Host Resource(s): hc:h_vmem=2.000G    <-- WN40
</pre>

Strangely, =qstat -j JOBID= does not report the implicit =h_vmem=3G= request. The limit is nevertheless taken into account, both when SGE decides whether a new job can start and, once the job is running, as a =ulimit -v= memory limit.

---+++ Soft and Hard limit for the corner case 6G

<pre>
[root@t3ce02 ~]# qconf -sq all.q | grep vmem
s_vmem                5.9G
h_vmem                6G
[root@t3ce02 ~]# qconf -sq short.q | grep vmem
s_vmem                5.9G
h_vmem                6G
[root@t3ce02 ~]# qconf -sq long.q | grep vmem
s_vmem                5.9G
h_vmem                6G
[root@t3ce02 ~]# qconf -sq all.q.admin | grep vmem
s_vmem                5.9G
h_vmem                6G
</pre>

---+++ 3GB ulimit applied

<pre>
[root@t3wn40 ~]# ps fax | grep --color 2644280 -A 2
14192 ?        S      0:00  \_ sge_shepherd-2644280 -bg
14193 ?        Ss     0:00  |   \_ /bin/bash /gridware/sge/default/spool/t3wn40/job_scripts/2644280 /shome/fronga/work/CBAF8prod/Plotting/
14245 ?        D      1:48  |       \_ ./Selective_Plot_Generator_14193.exec --results
[root@t3wn40 ~]# cd /proc/14245
[root@t3wn40 14245]# cat limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             401408               401408               processes
Max open files            1024                 1024                 files
Max locked memory         32768                32768                bytes
Max address space         3221225472           3221225472           bytes     <-------
Max file locks            unlimited            unlimited            locks
Max pending signals       401408               401408               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
</pre>

-- Main.FabioMartinelli - 2012-07-26
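
The figures logged above (free =h_vmem= per host, the 3221225472-byte address-space limit, and the =6G= queue ceiling) can be cross-checked with a small shell sketch. This is only an illustration under stated assumptions: the function names =to_bytes=, =remaining_gib= and =fits_queue_limit= are hypothetical helpers, not SGE commands; upper-case =K/M/G= suffixes are taken as binary multiples, which is consistent with the =3G= default showing up as 3221225472 bytes; only integer values are handled, so a value such as =5.9G= is out of scope.

```shell
#!/bin/bash
# Hypothetical helpers to reproduce the h_vmem arithmetic from the logs.
# Assumption: upper-case K/M/G are binary multiples (1G = 1024^3 bytes).
# Limitation: integer values only (e.g. 3G, 25G); 5.9G is not supported.

# Convert a memory value such as 3G, 512M or 1024 to bytes.
to_bytes() {
    local value=$1
    case $value in
        *G) echo $(( ${value%G} * 1024 * 1024 * 1024 )) ;;
        *M) echo $(( ${value%M} * 1024 * 1024 )) ;;
        *K) echo $(( ${value%K} * 1024 )) ;;
        *)  echo "$value" ;;
    esac
}

# Free consumable h_vmem on a host, in whole GiB:
# host capacity minus (running jobs * per-job request).
remaining_gib() {
    local capacity=$1 jobs=$2 per_job=$3
    local cap_bytes job_bytes
    cap_bytes=$(to_bytes "$capacity")
    job_bytes=$(to_bytes "$per_job")
    echo $(( (cap_bytes - jobs * job_bytes) / 1024 / 1024 / 1024 ))
}

# Succeed if a requested h_vmem does not exceed the queue hard limit.
fits_queue_limit() {
    local request=$1 limit=$2
    [ "$(to_bytes "$request")" -le "$(to_bytes "$limit")" ]
}

# 50G host with 16 jobs at the 3G default, as on WN30-WN40
echo "free on a 50G host with 16 jobs: $(remaining_gib 50G 16 3G)G"
# 25G host with 3 jobs at the 3G default
echo "free on a 25G host with 3 jobs: $(remaining_gib 25G 3 3G)G"
# The default request expressed in bytes, as seen in /proc/PID/limits
echo "3G in bytes: $(to_bytes 3G)"
# A 7G request against the 6G queue limit
if fits_queue_limit 7G 6G; then echo "7G accepted"; else echo "7G exceeds the 6G queue limit"; fi
```

The last check is the kind of comparison a JSV script could apply at submission time to reject a request that no queue can ever satisfy; the actual JSV interface is not shown here.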
Topic revision: r4 - 2012-09-03 - FabioMartinelli