<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup,Main.CMSAdminReaderGroup
-->
---+ !!How to interactively debug jobs on a worker node

%TOC%

---++ Important note

The queue *debug.q* allows you to debug jobs running on the worker nodes. *Please do not abuse this facility*, since it has the potential to interfere with other users' jobs. No CPU-intensive task should ever be run on this queue. The use cases for this queue are:

   1 Peeking at the output and log files of a running job
   1 Manual emergency cleanup of your scratch area, when you discover that your jobs filled it up due to some misconfiguration in your code, causing problems for others
   1 Debugging a troublesome job in place by looking at its processes

Be aware that by default each T3 job requests 3GB of RAM, so you have to request less RAM by setting %RED%-l h_vmem=400M%ENDCOLOR%; otherwise your debug job can easily get blocked in the queue.

---++ Interactive access by =qlogin=

To debug a job running on =t3wn22=:

<pre>$ qlogin -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR%
Your job 1442007 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 1442007 has been successfully scheduled.
Establishing builtin session to host t3wn12.psi.ch ...
$ cd /scratch/$USER
...
</pre>

---++ Running an arbitrary command by =qrsh=

Some examples:

---+++ How to list your =/scratch= directory content

<pre>$ qrsh -q debug.q -l hostname=t3wn14 %RED%-l h_vmem=400M%ENDCOLOR% ls /scratch/$USER</pre>

---+++ How to run a =tail=

Do not abuse this by keeping a =tail -f= attached for prolonged times:

<pre>$ qrsh -q debug.q -l hostname=t3wn14 %RED%-l h_vmem=400M%ENDCOLOR% tail /scratch/$USER/MyDIR/MyLogFile.log</pre>

---+++ How to check a =WN:/tmp= disk quota

<pre>$ qrsh -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR% sudo /usr/sbin/repquota -s /tmp</pre>

The output could be:

<pre>
[t3ui12:auser] $ qrsh -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR% sudo /usr/sbin/repquota -s /tmp
cat: sudo: No such file or directory
*** Report for user quotas on device /dev/md3
Block grace time: 7days; Inode grace time: 7days
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --      52       0       0              4     0     0
nagios    --       4    774M    968M              2     0     0
auser     --       4    774M    968M              2     0     0
</pre>

---+++ How to erase ALL your files and dirs from a =WN:/tmp/=

You want to do this when you are probably exceeding your [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/Tier3Policies][/tmp disk quota]] on that particular WN.
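Before erasing anything, it can be safer to first see what you actually own on that node's =/tmp=. A minimal sketch of such a preview step, assuming the same =qrsh= / =debug.q= setup shown on this page (the function name and the node name are illustrative, not an official tool):

```shell
# Preview (do NOT delete) the files you own on a worker node's /tmp.
# Assumptions: SGE's qrsh and the debug.q queue as described on this page;
# the function name "preview_wn_tmp" is purely illustrative.
preview_wn_tmp() {
    wn=$1
    qrsh -q debug.q -l hostname="$wn" -l h_vmem=400M \
        find /tmp -user "$USER" -ls
}

# Example (hypothetical node name):
# preview_wn_tmp t3wn10
```

If the listing confirms the files are yours and disposable, you can then run the deletion command below.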
<pre>$ qrsh -q debug.q -l hostname=t3wn10 %RED%-l h_vmem=400M%ENDCOLOR% find /tmp -user $USER -exec rm -rf {} \;</pre>

---+++ How to check a =WN:/scratch= disk quota

<pre>$ qrsh -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR% sudo /usr/sbin/repquota -s /scratch</pre>

---+++ How to check ALL your =WN:/scratch= disk quotas

<pre>
for H in `qhost | grep t3wn | awk '{print $1}'` ; do echo bash -x -c \"qrsh -q debug.q -l hostname=$H -l h_vmem=400M sudo /usr/sbin/repquota -s /scratch\" ; done > /tmp/qrsh
source /tmp/qrsh 2>&1 | egrep "$USER|t3wn"
rm -f /tmp/qrsh
</pre>

---+++ How to erase ALL your files and dirs from a =WN:/scratch/=

You want to do this when you are probably exceeding your [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/Tier3Policies][/scratch disk quota]] on that particular WN.

<pre>$ qrsh -q debug.q -l hostname=t3wn10 %RED%-l h_vmem=400M%ENDCOLOR% find /scratch/$USER/ -user $USER -exec rm -rf {} \;</pre>

---+++ How to check the batch system job logs on a WN

---++++ ACTIVE JOB

Given the *running* job ID =%ORANGE%669786%ENDCOLOR%= we can check its batch system logs by:

<pre>
$ %ORANGE%JOBID%ENDCOLOR%=669786
$ %GREEN%qrsh -q debug.q %RED%-l h_vmem=400M%ENDCOLOR% -l hostname=%ENDCOLOR%%BLUE%`qstat -u \* | grep %ORANGE%$JOBID%ENDCOLOR% | grep -e "t3wn[0-9]\{2\}" -o `%ENDCOLOR% %GREEN%cat /gridware/sge/default/spool/%ENDCOLOR%%BLUE%`qstat -u \* | grep %ORANGE%$JOBID%ENDCOLOR% | grep -e "t3wn[0-9]\{2\}" -o `%ENDCOLOR%%GREEN%/active_jobs/%ORANGE%${JOBID}%ENDCOLOR%.1/trace%ENDCOLOR%
cat: cat: No such file or directory
05/20/2016 14:01:48 [0:18856]: shepherd called with uid = 0, euid = 0
05/20/2016 14:01:48 [0:18856]: starting up 6.2u5
05/20/2016 14:01:48 [0:18856]: setpgid(18856, 18856) returned 0
05/20/2016 14:01:48 [0:18856]: do_core_binding: "binding" parameter not found in config file
05/20/2016 14:01:48 [0:18856]: no prolog script to start
05/20/2016 14:01:48 [0:18856]: parent: forked "job" with pid 18857
05/20/2016 14:01:48 [0:18856]: parent: job-pid: 18857
05/20/2016 14:01:48 [0:18857]: child: starting son(job, /gridware/sge/default/spool/t3wn32/job_scripts/669786, 0);
05/20/2016 14:01:48 [0:18857]: pid=18857 pgrp=18857 sid=18857 old pgrp=18856 getlogin()=<no login set>
05/20/2016 14:01:48 [0:18857]: reading passwd information for user 'ursl'
05/20/2016 14:01:48 [0:18857]: setosjobid: uid = 0, euid = 0
05/20/2016 14:01:48 [0:18857]: setting limits
05/20/2016 14:01:48 [0:18857]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 3221225472 hard 3221225472) resulting: (soft 3221225472 hard 3221225472)
05/20/2016 14:01:48 [0:18857]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: setting environment
05/20/2016 14:01:48 [0:18857]: Initializing error file
05/20/2016 14:01:48 [0:18857]: switching to intermediate/target user
05/20/2016 14:01:48 [521:18857]: closing all filedescriptors
05/20/2016 14:01:48 [521:18857]: further messages are in "error" and "trace"
05/20/2016 14:01:48 [521:18857]: now running with uid=521, euid=521
05/20/2016 14:01:48 [521:18857]: execvp(/mnt/t3nfs01/data01/shome/sgeadmin/t3scripts/starter_method.emi-wn.sl6.sh, "starter_method.emi-wn.sl6.sh" "/gridware/sge/default/spool/t3wn32/job_scripts/669786")
</pre>

---++++ FINISHED JOB

Given the *finished* job ID =%ORANGE%4675956%ENDCOLOR%= we can check its batch system
logs by:

<pre>
$ %ORANGE%JOBID%ENDCOLOR%=4675956
$ %GREEN%qrsh -q debug.q %RED%-l h_vmem=400M%ENDCOLOR% -l hostname=%ENDCOLOR%%BLUE%`qacct -j %ORANGE%$JOBID%ENDCOLOR% | grep hostname | awk '{print $2}' | cut -d\. -f1`%ENDCOLOR%%GREEN% cat /gridware/sge/default/spool/%ENDCOLOR%%BLUE%`qacct -j %ORANGE%$JOBID%ENDCOLOR% | grep hostname | awk '{print $2}' | cut -d\. -f1`%ENDCOLOR%%GREEN%/active_jobs/%ENDCOLOR%%ORANGE%${JOBID}%ENDCOLOR%%GREEN%.1/trace%ENDCOLOR%
02/03/2014 11:05:19 [0:10068]: shepherd called with uid = 0, euid = 0
02/03/2014 11:05:19 [0:10068]: starting up 6.2u5
02/03/2014 11:05:19 [0:10068]: setpgid(10068, 10068) returned 0
02/03/2014 11:05:19 [0:10068]: do_core_binding: "binding" parameter not found in config file
02/03/2014 11:05:19 [0:10068]: no prolog script to start
02/03/2014 11:05:19 [0:10070]: child: starting son(job, /gridware/sge/default/spool/t3wn34/job_scripts/4675956, 0);
02/03/2014 11:05:19 [0:10070]: pid=10070 pgrp=10070 sid=10070 old pgrp=10068 getlogin()=<no login set>
02/03/2014 11:05:19 [0:10070]: reading passwd information for user 'bianchi'
02/03/2014 11:05:19 [0:10070]: setosjobid: uid = 0, euid = 0
02/03/2014 11:05:19 [0:10070]: setting limits
02/03/2014 11:05:19 [0:10070]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 1073741824 hard 1073741824) resulting: (soft 1073741824 hard 1073741824)
02/03/2014 11:05:19 [0:10070]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: setting environment
02/03/2014 11:05:19 [0:10070]: Initializing error file
02/03/2014 11:05:19 [0:10070]: switching to intermediate/target user
02/03/2014 11:05:19 [0:10068]: parent: forked "job" with pid 10070
02/03/2014 11:05:19 [0:10068]: parent: job-pid: 10070
02/03/2014 11:05:19 [579:10070]: closing all filedescriptors
02/03/2014 11:05:19 [579:10070]: further messages are in "error" and "trace"
02/03/2014 11:05:19 [579:10070]: now running with uid=579, euid=579
02/03/2014 11:05:19 [579:10070]: execvp(/shome/sgeadmin/t3scripts/starter_method.emi-wn.sh, "starter_method.emi-wn.sh" "/gridware/sge/default/spool/t3wn34/job_scripts/4675956")
02/03/2014 11:08:04 [0:10068]: wait3 returned -1
02/03/2014 11:08:04 [0:10068]: %RED%forward_signal_to_job(): mapping signal 20 TSTP%ENDCOLOR%
02/03/2014 11:08:04 [0:10068]: mapped signal TSTP to signal KILL
02/03/2014 11:08:04 [0:10068]: queued signal KILL
02/03/2014 11:08:04 [0:10068]: kill(-10070, KILL)
02/03/2014 11:08:04 [0:10068]: now sending signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: pdc_kill_addgrpid: 20073 9
02/03/2014 11:08:04 [0:10068]: killing pid 10070/3
02/03/2014 11:08:04 [0:10068]: killing pid 11530/3
02/03/2014 11:08:04 [0:10068]: wait3 returned 10070 (status: 9; WIFSIGNALED: 1, WIFEXITED: 0, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: job exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "job" with pid 10070
02/03/2014 11:08:04 [0:10068]: job exited due to signal
02/03/2014 11:08:04 [0:10068]: job signaled: 9
02/03/2014 11:08:04 [0:10068]: ignored signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: writing usage file to "usage"
02/03/2014 11:08:04 [0:10068]: no tasker to notify
02/03/2014 11:08:04 [0:10068]: parent: forked "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: using signal delivery delay of 120 seconds
02/03/2014 11:08:04 [0:10068]: parent: epilog-pid: 11531
02/03/2014 11:08:04 [0:11531]: child: starting son(epilog, /shome/sgeadmin/t3scripts/epilog.sh, 0);
02/03/2014 11:08:04 [0:11531]: pid=11531 pgrp=11531 sid=11531 old pgrp=10068 getlogin()=<no login set>
02/03/2014 11:08:04 [0:11531]: reading passwd information for user 'bianchi'
02/03/2014 11:08:04 [0:11531]: setting limits
02/03/2014 11:08:04 [0:11531]: setting environment
02/03/2014 11:08:04 [0:11531]: Initializing error file
02/03/2014 11:08:04 [0:11531]: switching to intermediate/target user
02/03/2014 11:08:04 [579:11531]: closing all filedescriptors
02/03/2014 11:08:04 [579:11531]: further messages are in "error" and "trace"
02/03/2014 11:08:04 [579:11531]: using "/bin/bash" as shell of user "bianchi"
02/03/2014 11:08:04 [579:11531]: now running with uid=579, euid=579
02/03/2014 11:08:04 [579:11531]: execvp(/shome/sgeadmin/t3scripts/epilog.sh, "/shome/sgeadmin/t3scripts/epilog.sh")
02/03/2014 11:08:04 [0:10068]: wait3 returned 11531 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: epilog exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: epilog exited not due to signal
02/03/2014 11:08:04 [0:10068]: epilog exited with status 0
</pre>

Be aware that the cleaning procedures regularly running on the WNs will delete these logs after a few days; if a log file is important to you, save it in your =/shome=.
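The all-nodes quota check earlier on this page stages commands in a temporary =/tmp/qrsh= file and sources it; the same result can be obtained with a direct loop. A sketch under the same assumptions (=qhost=, =qrsh= and =debug.q= as used on this page; the function name is illustrative, not an official tool):

```shell
# Check your /scratch quota on every t3wn worker node, without the
# temporary /tmp/qrsh file used in the earlier example.
# Assumptions: qhost/qrsh from SGE; the function name is illustrative.
all_scratch_quotas() {
    for H in $(qhost | grep t3wn | awk '{print $1}'); do
        qrsh -q debug.q -l hostname="$H" -l h_vmem=400M \
            sudo /usr/sbin/repquota -s /scratch 2>&1
    done | egrep "$USER|t3wn"
}

# Example:
# all_scratch_quotas
```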
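The ACTIVE JOB and FINISHED JOB one-liners pack several backtick substitutions into a single command; the same logic reads more easily broken into steps. A sketch reusing only the commands and paths shown above (=qstat=, =qacct=, =qrsh=, the spool directory); the function name is illustrative:

```shell
# Print the shepherd "trace" log of a job, looking up the worker node
# via qstat (running job) or qacct (finished job).
# Assumptions: SGE commands and the spool path shown on this page;
# the function name "job_trace" is purely illustrative.
job_trace() {
    jobid=$1
    # Running job: the qstat line contains the t3wnNN hostname.
    wn=$(qstat -u \* | grep "$jobid" | grep -o -e "t3wn[0-9]\{2\}" | head -n 1)
    if [ -z "$wn" ]; then
        # Finished job: ask the accounting database instead.
        wn=$(qacct -j "$jobid" | grep hostname | awk '{print $2}' | cut -d. -f1)
    fi
    [ -n "$wn" ] || { echo "job $jobid: worker node not found" >&2; return 1; }
    qrsh -q debug.q -l h_vmem=400M -l hostname="$wn" \
        cat /gridware/sge/default/spool/"$wn"/active_jobs/"$jobid".1/trace
}

# Example: job_trace 669786
```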
Topic revision: r21 - 2016-10-11 - FabioMartinelli