<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

%TOC%

%ICON{arrowleft}% Go to [[CMSTier3LogXX][previous page]] / [[CMSTier3LogXX][next page]] of Tier3 site log %M%

---+ 03. 05. 2015 Son of Grid Engine 8.1.8 cpuset error and fix

Ref: http://linux.oracle.com/documentation/EL6/Red_Hat_Enterprise_Linux-6-Resource_Management_Guide-en-US.pdf

Catching the error with =strace=. The =stat= on the per-job cpuset directory =/dev/cpuset/sge/18.1= fails with =ENOENT=:
<pre>
[root@t3vmui01 ~]# %BLUE%strace -ff -p `pidof sge_execd` -o ./log%ENDCOLOR%
[root@t3vmui01 ~]# grep cpuse log.*
log.3070:read(4, "v/cpuset cgroup rw,relatime,cpus"..., 1024) = 89
log.3070:read(4, "1:cpuset:/\n", 1024) = 11
log.3070:openat(3, "dev/cpuset//cpus", O_RDONLY) = 4
log.3070:openat(3, "dev/cpuset//mems", O_RDONLY) = 4
log.3070:read(5, "v/cpuset cgroup rw,relatime,cpus"..., 1024) = 89
log.3070:stat("/dev/cpuset/sge", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
log.3070:stat("/dev/cpuset/sge/cpuset.mems", 0x7fff55e0b650) = -1 ENOENT (No such file or directory)
log.3070:open("/dev/cpuset/sge/mems", O_RDONLY) = 5
log.3070:open("/dev/cpuset/sge/cpus", O_RDONLY) = 5
log.3070:stat("/dev/cpuset/sge/%RED%18.1%ENDCOLOR%", 0x7fff55e0c7c0) = -1 %RED%ENOENT%ENDCOLOR% (No such file or directory)
log.3159:read(3, "1:cpuset:/\n", 1048576) = 11
log.3159:write(1, "1:cpuset:/\n", 11) = 11
log.3161:read(3, "1:cpuset:/\n", 1048576) = 11
log.3161:write(1, "1:cpuset:/\n", 11) = 11
[root@t3vmui01 ~]#
</pre>

Fixed by:
<pre>
[root@t3vmui01 ~]# cat /etc/sysconfig/sgeexecd
export SGE_CGROUP_DIR=/dev/cpuset/sge
</pre>

and
<pre>
[root@t3vmui01 ~]# grep -Hn setup-cgroups-etc /etc/init.d/sgeexecd.p6444
/etc/init.d/sgeexecd.p6444:427:     /opt/sge/util/resources/scripts/setup-cgroups-etc start
</pre>

plus a couple of =sgeexecd= service stop/start cycles.

Some logs showing proper behaviour:
<pre>
05/03/2015 23:12:27 [0:7176]: shepherd called with uid = 0, euid = 0
05/03/2015 23:12:27 [0:7176]: starting up 8.1.8
05/03/2015 23:12:27 [0:7176]: setpgid(7176, 7176) returned 0
05/03/2015 23:12:27 [0:7176]: %BLUE%do_core_binding: explicit%ENDCOLOR%
05/03/2015 23:12:27 [0:7176]: %BLUE%bind_process_to_mask: SGE_BINDING env var created%ENDCOLOR%
05/03/2015 23:12:27 [0:7176]: %BLUE%do_core_binding: explicit: binding done%ENDCOLOR%
05/03/2015 23:12:27 [0:7176]: %BLUE%do_core_binding: finishing%ENDCOLOR%
05/03/2015 23:12:27 [0:7176]: %BLUE%set cpuset cpus per core binding%ENDCOLOR%
05/03/2015 23:12:27 [0:7176]: no prolog script to start
05/03/2015 23:12:27 [0:7176]: parent: forked "job" with pid 7177
05/03/2015 23:12:27 [0:7176]: parent: job-pid: 7177
05/03/2015 23:12:27 [0:7177]: child: starting son(job, /opt/sge/default/spool/t3vmui01/job_scripts/%GREEN%26%ENDCOLOR%, 0, 4096);
...
05/03/2015 23:21:24 [0:7176]: writing usage file to "usage"
05/03/2015 23:21:24 [0:7176]: no epilog script to start
</pre>

<pre>
[root@t3vmui01 ~]# grep cpuset /opt/sge/default/spool/t3vmui01/messages
...
05/03/2015 23:21:25|  main|t3vmui01|I|%BLUE%removing task cpuset /dev/cpuset/sge/%ENDCOLOR%%GREEN%26.1%ENDCOLOR%
</pre>

And again, some files showing proper behaviour:
<pre>
[root@t3vmui01 ~]# find /dev/cpuset/sge
/dev/cpuset/sge
/dev/cpuset/sge/32.1
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/memory_spread_slab
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/memory_spread_page
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/memory_pressure
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/memory_migrate
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/sched_relax_domain_level
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/sched_load_balance
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/mem_hardwall
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/mem_exclusive
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/cpu_exclusive
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/mems
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/cpus   <--------------- %RED%0%ENDCOLOR% inside, nice.
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/cgroup.event_control
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/notify_on_release
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/cgroup.procs
/dev/cpuset/sge/32.1/%ORANGE%5094%ENDCOLOR%/tasks  <---- %ORANGE%5094%ENDCOLOR% %BLUE%5095 5180 5182 5184 5185%ENDCOLOR% (namely the processes created by my SGE job)
/dev/cpuset/sge/32.1/0
/dev/cpuset/sge/32.1/0/memory_spread_slab
/dev/cpuset/sge/32.1/0/memory_spread_page
/dev/cpuset/sge/32.1/0/memory_pressure
/dev/cpuset/sge/32.1/0/memory_migrate
/dev/cpuset/sge/32.1/0/sched_relax_domain_level
/dev/cpuset/sge/32.1/0/sched_load_balance
/dev/cpuset/sge/32.1/0/mem_hardwall
/dev/cpuset/sge/32.1/0/mem_exclusive
/dev/cpuset/sge/32.1/0/cpu_exclusive
/dev/cpuset/sge/32.1/0/mems
/dev/cpuset/sge/32.1/0/cpus
/dev/cpuset/sge/32.1/0/cgroup.event_control
/dev/cpuset/sge/32.1/0/notify_on_release
/dev/cpuset/sge/32.1/0/cgroup.procs
/dev/cpuset/sge/32.1/0/tasks
/dev/cpuset/sge/32.1/memory_spread_slab
/dev/cpuset/sge/32.1/memory_spread_page
/dev/cpuset/sge/32.1/memory_pressure
/dev/cpuset/sge/32.1/memory_migrate
/dev/cpuset/sge/32.1/sched_relax_domain_level
/dev/cpuset/sge/32.1/sched_load_balance
/dev/cpuset/sge/32.1/mem_hardwall
/dev/cpuset/sge/32.1/mem_exclusive
/dev/cpuset/sge/32.1/cpu_exclusive
/dev/cpuset/sge/32.1/mems
/dev/cpuset/sge/32.1/cpus   <------------------ 0-7 inside
/dev/cpuset/sge/32.1/cgroup.event_control
/dev/cpuset/sge/32.1/notify_on_release
/dev/cpuset/sge/32.1/cgroup.procs
/dev/cpuset/sge/32.1/tasks
</pre>

Checking whether all the processes created by my SGE job are running on the same CPU core (the =PSR= column):
<pre>
[root@t3wn42 5094]# ps -F %ORANGE%5094%ENDCOLOR% %BLUE%5095 5180 5182 5184 5185%ENDCOLOR%
UID        PID  PPID  C    SZ   RSS %RED%PSR%ENDCOLOR% STIME TTY      STAT   TIME CMD
root      %ORANGE%5094%ENDCOLOR%  5058  0 15418  6412   %RED%0%ENDCOLOR% 15:29 ?        S      0:00 sge_shepherd-32 -bg
2980      %BLUE%5095%ENDCOLOR%  5094  0 26833  1460   %RED%0%ENDCOLOR% 15:29 ?        Ss     0:00 -sh /opt/sge/default/spool/t3wn42/job_scripts/32
2980      %BLUE%5180%ENDCOLOR%     1  4 28074  1212   %RED%0%ENDCOLOR% 15:29 ?        D      0:55 find /bla
2980      %BLUE%5182%ENDCOLOR%     1  1 28070  1200   %RED%0%ENDCOLOR% 15:29 ?        D      0:21 find /blabla
2980      %BLUE%5185%ENDCOLOR%  5095  0 25226   564   %RED%0%ENDCOLOR% 15:29 ?        S      0:00 sleep 20000
</pre>

For MPI, [[http://wiki.hp-see.eu/index.php/System_software,_middleware_and_programming_environments#cpuset_integration_on_SGI_UV_1000_machine][this]] seems interesting, but I didn't check it.
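
The manual =ps= check of the =PSR= column can also be scripted. What follows is only a minimal sketch in POSIX shell, not something taken from SGE: it assumes the cgroup v1 cpuset layout shown in the =find= listing (a =cpus= and a =tasks= file in each cpuset directory) and a procps =ps= that accepts =-o psr==. The function names =expand_cpus= and =check_cpuset= are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SGE) for inspecting a cgroup v1 cpuset.

# Expand a cpuset list such as "0-3,6" into "0 1 2 3 6".
expand_cpus() {
    out=""
    old_ifs=$IFS
    IFS=,
    for part in $1; do
        case $part in
            *-*)
                lo=${part%-*}
                hi=${part#*-}
                i=$lo
                while [ "$i" -le "$hi" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *)
                out="$out $part"
                ;;
        esac
    done
    IFS=$old_ifs
    echo "${out# }"
}

# For every ID listed in <dir>/tasks, print the CPU it last ran on (ps PSR)
# and whether that CPU is in <dir>/cpus. Returns 1 if any task is outside.
check_cpuset() {
    dir=$1
    allowed=" $(expand_cpus "$(cat "$dir/cpus")") "
    rc=0
    while read -r pid; do
        psr=$(ps -o psr= -p "$pid" | tr -d ' ')
        [ -n "$psr" ] || continue    # task already gone, or not visible to ps -p
        case $allowed in
            *" $psr "*) echo "$pid on CPU $psr: ok" ;;
            *) echo "$pid on CPU $psr: OUTSIDE cpuset"; rc=1 ;;
        esac
    done < "$dir/tasks"
    return $rc
}
```

For the job above, the call would be =check_cpuset /dev/cpuset/sge/32.1/5094=. Keep in mind that =PSR= is only the processor a task last ran on, so each run is a snapshot, and that the =tasks= file lists thread IDs, which =ps -p= may not resolve for non-leader threads.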
Topic revision: r3 - 2015-05-08 - FabioMartinelli