<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page to be viewable only by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup,Main.CMSAdminReaderGroup
-->
---+ !!How to interactively debug jobs on a worker node

%TOC%

---++ Important note

The queue *debug.q* allows you to debug jobs running on the worker nodes. *Please do not abuse this facility*, since it has the potential to interfere with other users' jobs. No CPU-intensive tasks should ever be run on this queue. The use cases for this queue are:

   1 Peeking at the output and log files of a running job
   1 Manual emergency cleanup of your scratch area when you discover that your jobs filled it up due to some misconfiguration in your code, causing problems for others
   1 Debugging a troublesome job in place by looking at its processes

Be aware that by default each T3 job asks for 3GB of RAM, so you have to request less RAM by setting %RED%-l h_vmem=400M%ENDCOLOR%; otherwise your debug job could easily get blocked.

---++ Interactive access by =qlogin=

To debug a job running on =t3wn22=:
<pre>
$ qlogin -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR%
Your job 1442007 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 1442007 has been successfully scheduled.
Establishing builtin session to host t3wn12.psi.ch ...
$ cd /scratch/$USER
...
</pre>

---++ Running an arbitrary command by =qrsh=

Some examples:

---+++ How to list your =/scratch= directory content
<pre>
$ qrsh -q debug.q -l hostname=t3wn14 %RED%-l h_vmem=400M%ENDCOLOR% ls /scratch/$USER
</pre>

---+++ How to run a =tail=
Do not abuse this by attaching =tail -f= for prolonged times:
<pre>
$ qrsh -q debug.q -l hostname=t3wn14 %RED%-l h_vmem=400M%ENDCOLOR% tail /scratch/$USER/MyDIR/MyLogFile.log
</pre>

---+++ How to check a =WN:/tmp= disk quota
<pre>
$ qrsh -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR% sudo /usr/sbin/repquota -s /tmp
</pre>
The output could be:
<pre>
[t3ui12:auser] $ qrsh -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR% sudo /usr/sbin/repquota -s /tmp
cat: sudo: No such file or directory
*** Report for user quotas on device /dev/md3
Block grace time: 7days; Inode grace time: 7days
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --      52       0       0              4     0     0
nagios    --       4    774M    968M              2     0     0
auser     --       4    774M    968M              2     0     0
</pre>

---+++ How to erase ALL your files and dirs from a =WN:/tmp/=
You want to do this when you are probably exceeding your [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/Tier3Policies][/tmp disk quota]] on that particular WN.
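Since the deletion below is irreversible, you may first want a non-destructive preview of what would be removed, by running the same =find= with =-print= instead of the =rm= action. A minimal sketch (the throwaway demo directory stands in for a WN's =/tmp=; wrapping the =find= in =qrsh= as shown in the comment is the assumed cluster usage):

```shell
# Non-destructive preview: list what the cleanup would delete, deleting nothing.
# On the cluster the same find would be wrapped in qrsh (an assumption), e.g.:
#   qrsh -q debug.q -l hostname=t3wn10 -l h_vmem=400M find /tmp -user $USER -print
# Demonstrated here on a throwaway directory instead of a WN's /tmp:
DEMO=$(mktemp -d)
touch "$DEMO/old_output.log" "$DEMO/histos.root"
find "$DEMO" -user "$(id -un)" -type f -print   # prints the two paths, removes nothing
rm -rf "$DEMO"                                  # clean up the demo directory
```

Once the listed files are really the ones you want gone, rerun with the =-exec rm -rf {} \;= action as in the command below.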
<pre>
$ qrsh -q debug.q -l hostname=t3wn10 %RED%-l h_vmem=400M%ENDCOLOR% find /tmp -user $USER -exec rm -rf {} \;
</pre>

---+++ How to check a =WN:/scratch= disk quota
<pre>
$ qrsh -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR% sudo /usr/sbin/repquota -s /scratch
</pre>

---+++ How to check ALL your =WN:/scratch= disk quotas
<pre>
$ for H in `qhost | grep t3wn | awk '{print $1}'` ; do echo bash -x -c \"qrsh -q debug.q -l hostname=$H -l h_vmem=400M sudo /usr/sbin/repquota -s /scratch\" ; done > /tmp/qrsh
$ source /tmp/qrsh 2>&1 | egrep "$USER|t3wn"
$ rm -f /tmp/qrsh
</pre>

---+++ How to erase ALL your files and dirs from a =WN:/scratch/=
You want to do this when you are probably exceeding your [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/Tier3Policies][/scratch disk quota]] on that particular WN.
<pre>
$ qrsh -q debug.q -l hostname=t3wn10 %RED%-l h_vmem=400M%ENDCOLOR% find /scratch/$USER/ -user $USER -exec rm -rf {} \;
</pre>

---+++ How to check the batch system job logs on a WN

---++++ RUNNING JOB
Given the *running* job ID =%ORANGE%669786%ENDCOLOR%=, we can check its batch system logs by:
<pre>
$ %ORANGE%JOBID%ENDCOLOR%=669786
$ %GREEN%qrsh -q debug.q %RED%-l h_vmem=400M%ENDCOLOR% -l hostname=%ENDCOLOR%%BLUE%`qstat -u \* | grep %ORANGE%$JOBID%ENDCOLOR% | grep -e "t3wn[0-9]\{2\}" -o`%ENDCOLOR% %GREEN%cat /gridware/sge/default/spool/%ENDCOLOR%%BLUE%`qstat -u \* | grep %ORANGE%$JOBID%ENDCOLOR% | grep -e "t3wn[0-9]\{2\}" -o`%ENDCOLOR%%GREEN%/active_jobs/%ORANGE%${JOBID}%ENDCOLOR%.1/trace%ENDCOLOR%
cat: cat: No such file or directory
05/20/2016 14:01:48 [0:18856]: shepherd called with uid = 0, euid = 0
05/20/2016 14:01:48 [0:18856]: starting up 6.2u5
05/20/2016 14:01:48 [0:18856]: setpgid(18856, 18856) returned 0
05/20/2016 14:01:48 [0:18856]: do_core_binding: "binding" parameter not found in config file
05/20/2016 14:01:48 [0:18856]: no prolog script to start
05/20/2016 14:01:48 [0:18856]: parent: forked "job" with pid 18857
05/20/2016 14:01:48 [0:18856]: parent: job-pid: 18857
05/20/2016 14:01:48 [0:18857]: child: starting son(job, /gridware/sge/default/spool/t3wn32/job_scripts/669786, 0);
05/20/2016 14:01:48 [0:18857]: pid=18857 pgrp=18857 sid=18857 old pgrp=18856 getlogin()=<no login set>
05/20/2016 14:01:48 [0:18857]: reading passwd information for user 'ursl'
05/20/2016 14:01:48 [0:18857]: setosjobid: uid = 0, euid = 0
05/20/2016 14:01:48 [0:18857]: setting limits
05/20/2016 14:01:48 [0:18857]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 3221225472 hard 3221225472) resulting: (soft 3221225472 hard 3221225472)
05/20/2016 14:01:48 [0:18857]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: setting environment
05/20/2016 14:01:48 [0:18857]: Initializing error file
05/20/2016 14:01:48 [0:18857]: switching to intermediate/target user
05/20/2016 14:01:48 [521:18857]: closing all filedescriptors
05/20/2016 14:01:48 [521:18857]: further messages are in "error" and "trace"
05/20/2016 14:01:48 [521:18857]: now running with uid=521, euid=521
05/20/2016 14:01:48 [521:18857]: execvp(/mnt/t3nfs01/data01/shome/sgeadmin/t3scripts/starter_method.emi-wn.sl6.sh, "starter_method.emi-wn.sl6.sh" "/gridware/sge/default/spool/t3wn32/job_scripts/669786")
</pre>

---++++ FINISHED JOB
Given the *finished* job ID =%ORANGE%4675956%ENDCOLOR%=, we can check its batch system logs by:
<pre>
$ %ORANGE%JOBID%ENDCOLOR%=4675956
$ %GREEN%qrsh -q debug.q %RED%-l h_vmem=400M%ENDCOLOR% -l hostname=%ENDCOLOR%%BLUE%`qacct -j %ORANGE%$JOBID%ENDCOLOR% | grep hostname | awk '{print $2}' | cut -d\. -f1`%ENDCOLOR%%GREEN% cat /gridware/sge/default/spool/%ENDCOLOR%%BLUE%`qacct -j %ORANGE%$JOBID%ENDCOLOR% | grep hostname | awk '{print $2}' | cut -d\. -f1`%ENDCOLOR%%GREEN%/active_jobs/%ENDCOLOR%%ORANGE%${JOBID}%ENDCOLOR%%GREEN%.1/trace%ENDCOLOR%
02/03/2014 11:05:19 [0:10068]: shepherd called with uid = 0, euid = 0
02/03/2014 11:05:19 [0:10068]: starting up 6.2u5
02/03/2014 11:05:19 [0:10068]: setpgid(10068, 10068) returned 0
02/03/2014 11:05:19 [0:10068]: do_core_binding: "binding" parameter not found in config file
02/03/2014 11:05:19 [0:10068]: no prolog script to start
02/03/2014 11:05:19 [0:10070]: child: starting son(job, /gridware/sge/default/spool/t3wn34/job_scripts/4675956, 0);
02/03/2014 11:05:19 [0:10070]: pid=10070 pgrp=10070 sid=10070 old pgrp=10068 getlogin()=<no login set>
02/03/2014 11:05:19 [0:10070]: reading passwd information for user 'bianchi'
02/03/2014 11:05:19 [0:10070]: setosjobid: uid = 0, euid = 0
02/03/2014 11:05:19 [0:10070]: setting limits
02/03/2014 11:05:19 [0:10070]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 1073741824 hard 1073741824) resulting: (soft 1073741824 hard 1073741824)
02/03/2014 11:05:19 [0:10070]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: setting environment
02/03/2014 11:05:19 [0:10070]: Initializing error file
02/03/2014 11:05:19 [0:10070]: switching to intermediate/target user
02/03/2014 11:05:19 [0:10068]: parent: forked "job" with pid 10070
02/03/2014 11:05:19 [0:10068]: parent: job-pid: 10070
02/03/2014 11:05:19 [579:10070]: closing all filedescriptors
02/03/2014 11:05:19 [579:10070]: further messages are in "error" and "trace"
02/03/2014 11:05:19 [579:10070]: now running with uid=579, euid=579
02/03/2014 11:05:19 [579:10070]: execvp(/shome/sgeadmin/t3scripts/starter_method.emi-wn.sh, "starter_method.emi-wn.sh" "/gridware/sge/default/spool/t3wn34/job_scripts/4675956")
02/03/2014 11:08:04 [0:10068]: wait3 returned -1
02/03/2014 11:08:04 [0:10068]: %RED%forward_signal_to_job(): mapping signal 20 TSTP%ENDCOLOR%
02/03/2014 11:08:04 [0:10068]: mapped signal TSTP to signal KILL
02/03/2014 11:08:04 [0:10068]: queued signal KILL
02/03/2014 11:08:04 [0:10068]: kill(-10070, KILL)
02/03/2014 11:08:04 [0:10068]: now sending signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: pdc_kill_addgrpid: 20073 9
02/03/2014 11:08:04 [0:10068]: killing pid 10070/3
02/03/2014 11:08:04 [0:10068]: killing pid 11530/3
02/03/2014 11:08:04 [0:10068]: wait3 returned 10070 (status: 9; WIFSIGNALED: 1, WIFEXITED: 0, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: job exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "job" with pid 10070
02/03/2014 11:08:04 [0:10068]: job exited due to signal
02/03/2014 11:08:04 [0:10068]: job signaled: 9
02/03/2014 11:08:04 [0:10068]: ignored signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: writing usage file to "usage"
02/03/2014 11:08:04 [0:10068]: no tasker to notify
02/03/2014 11:08:04 [0:10068]: parent: forked "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: using signal delivery delay of 120 seconds
02/03/2014 11:08:04 [0:10068]: parent: epilog-pid: 11531
02/03/2014 11:08:04 [0:11531]: child: starting son(epilog, /shome/sgeadmin/t3scripts/epilog.sh, 0);
02/03/2014 11:08:04 [0:11531]: pid=11531 pgrp=11531 sid=11531 old pgrp=10068 getlogin()=<no login set>
02/03/2014 11:08:04 [0:11531]: reading passwd information for user 'bianchi'
02/03/2014 11:08:04 [0:11531]: setting limits
02/03/2014 11:08:04 [0:11531]: setting environment
02/03/2014 11:08:04 [0:11531]: Initializing error file
02/03/2014 11:08:04 [0:11531]: switching to intermediate/target user
02/03/2014 11:08:04 [579:11531]: closing all filedescriptors
02/03/2014 11:08:04 [579:11531]: further messages are in "error" and "trace"
02/03/2014 11:08:04 [579:11531]: using "/bin/bash" as shell of user "bianchi"
02/03/2014 11:08:04 [579:11531]: now running with uid=579, euid=579
02/03/2014 11:08:04 [579:11531]: execvp(/shome/sgeadmin/t3scripts/epilog.sh, "/shome/sgeadmin/t3scripts/epilog.sh")
02/03/2014 11:08:04 [0:10068]: wait3 returned 11531 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: epilog exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: epilog exited not due to signal
02/03/2014 11:08:04 [0:10068]: epilog exited with status 0
</pre>
Be aware that the cleaning procedures regularly running on the WNs will delete these logs after some days; if a log file is important to you, save it in your =/shome= area.
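The back-quoted %BLUE%blue%ENDCOLOR% expressions in the two commands above derive the WN hostname from the =qstat= or =qacct= output before building the trace-file path. A minimal sketch of how that =grep -o= extraction behaves, run on a made-up sample line (the sample text and job values are illustrative assumptions, not real accounting data):

```shell
# Made-up qstat-like line, for illustration only:
SAMPLE='669786 0.55000 myjob auser r 05/20/2016 14:01:48 all.q@t3wn32.psi.ch 1'
# grep -o prints only the matching part: "t3wn" followed by exactly two digits.
WN=$(echo "$SAMPLE" | grep -o -e 't3wn[0-9]\{2\}')
echo "$WN"                                                  # t3wn32
# The trace-file path is then assembled from that hostname:
echo "/gridware/sge/default/spool/$WN/active_jobs/669786.1/trace"
```

The same pattern works on the =qacct -j $JOBID= output, where the =hostname= field is instead isolated with =awk= and =cut= as shown above.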
Topic revision: r20 - 2016-07-22 - FabioMartinelli