<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page to be viewable only by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup,Main.CMSAdminReaderGroup
-->
---+ !!How to interactively debug jobs on a worker node

%TOC%

---++ Important note

The queue *debug.q* allows you to debug jobs running on the worker nodes. *Please do not abuse this facility*, since it has the potential to interfere with other users' jobs. No CPU-intensive tasks should ever be run on this queue. The use cases for this queue are:

   1 Peeking at the output and log files of a running job
   1 Manual emergency cleanup of your scratch area when you discover that your jobs filled it up due to some misconfiguration in your code, causing problems for others
   1 Debugging a troublesome job in place by looking at its processes

Be aware that by default each T3 job asks for 3GB of RAM, so you have to request less RAM by setting %RED%-l h_vmem=400M%ENDCOLOR%; otherwise your debug job could easily get blocked.

---++ Interactive access by =qlogin=

To debug a job running on =t3wn22=:
<pre>
$ qlogin -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR%
Your job 1442007 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 1442007 has been successfully scheduled.
Establishing builtin session to host t3wn12.psi.ch ...
$ cd /scratch/$USER
...
</pre>

---++ Running an arbitrary command by =qrsh=

Some examples:

---+++ How to list your =/scratch= directory content
<pre>
$ qrsh -q debug.q -l hostname=t3wn14 %RED%-l h_vmem=400M%ENDCOLOR% ls /scratch/$USER
</pre>

---+++ How to run a =tail=
Do not abuse this by attaching =tail -f= for prolonged times:
<pre>
$ qrsh -q debug.q -l hostname=t3wn14 %RED%-l h_vmem=400M%ENDCOLOR% tail /scratch/$USER/MyDIR/MyLogFile.log
</pre>

---+++ How to check a =WN:/tmp= disk quota
<pre>
$ qrsh -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR% sudo /usr/sbin/repquota -s /tmp
</pre>
The output could be:
<pre>
[t3ui12:auser] $ qrsh -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR% sudo /usr/sbin/repquota -s /tmp
cat: sudo: No such file or directory
*** Report for user quotas on device /dev/md3
Block grace time: 7days; Inode grace time: 7days
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --      52       0       0              4     0     0
nagios    --       4    774M    968M              2     0     0
auser     --       4    774M    968M              2     0     0
</pre>

---+++ How to erase ALL your files and dirs from a =WN:/tmp/=
You want to do this when you are probably exceeding your [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/Tier3Policies][/tmp disk quota]] on that particular WN.
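Since the deletion below is irreversible, you may first want a non-destructive preview of what would be removed, by running the same =find= with =-print= instead of the =rm= action. A minimal sketch (the throwaway demo directory stands in for a WN's =/tmp=; wrapping the =find= in =qrsh= as shown in the comment is the assumed cluster usage):

```shell
# Non-destructive preview: list what the cleanup would delete, deleting nothing.
# On the cluster the same find would be wrapped in qrsh (an assumption), e.g.:
#   qrsh -q debug.q -l hostname=t3wn10 -l h_vmem=400M find /tmp -user $USER -print
# Demonstrated here on a throwaway directory instead of a WN's /tmp:
DEMO=$(mktemp -d)
touch "$DEMO/old_output.log" "$DEMO/histos.root"
find "$DEMO" -user "$(id -un)" -type f -print   # prints the two paths, removes nothing
rm -rf "$DEMO"                                  # clean up the demo directory
```

Once the listed files are really the ones you want gone, rerun with the =-exec rm -rf {} \;= action as in the command below.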
<pre>
$ qrsh -q debug.q -l hostname=t3wn10 %RED%-l h_vmem=400M%ENDCOLOR% find /tmp -user $USER -exec rm -rf {} \;
</pre>

---+++ How to check a =WN:/scratch= disk quota
<pre>
$ qrsh -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR% sudo /usr/sbin/repquota -s /scratch
</pre>

---+++ How to check ALL your =WN:/scratch= disk quotas
<pre>
$ for H in `qhost | grep t3wn | awk '{print $1}'` ; do echo bash -x -c \"qrsh -q debug.q -l hostname=$H -l h_vmem=400M sudo /usr/sbin/repquota -s /scratch\" ; done > /tmp/qrsh
$ source /tmp/qrsh 2>&1 | egrep "$USER|t3wn"
$ rm -f /tmp/qrsh
</pre>

---+++ How to erase ALL your files and dirs from a =WN:/scratch/=
You want to do this when you are probably exceeding your [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/Tier3Policies][/scratch disk quota]] on that particular WN.
<pre>
$ qrsh -q debug.q -l hostname=t3wn10 %RED%-l h_vmem=400M%ENDCOLOR% find /scratch/$USER/ -user $USER -exec rm -rf {} \;
</pre>

---+++ How to check the batch system job logs on a WN

---++++ RUNNING JOB
Given the *running* job ID =%ORANGE%669786%ENDCOLOR%=, we can check its batch system logs by:
<pre>
$ %ORANGE%JOBID%ENDCOLOR%=669786
$ %GREEN%qrsh -q debug.q %RED%-l h_vmem=400M%ENDCOLOR% -l hostname=%ENDCOLOR%%BLUE%`qstat -u \* | grep %ORANGE%$JOBID%ENDCOLOR% | grep -e "t3wn[0-9]\{2\}" -o`%ENDCOLOR% %GREEN%cat /gridware/sge/default/spool/%ENDCOLOR%%BLUE%`qstat -u \* | grep %ORANGE%$JOBID%ENDCOLOR% | grep -e "t3wn[0-9]\{2\}" -o`%ENDCOLOR%%GREEN%/active_jobs/%ORANGE%${JOBID}%ENDCOLOR%.1/trace%ENDCOLOR%
cat: cat: No such file or directory
05/20/2016 14:01:48 [0:18856]: shepherd called with uid = 0, euid = 0
05/20/2016 14:01:48 [0:18856]: starting up 6.2u5
05/20/2016 14:01:48 [0:18856]: setpgid(18856, 18856) returned 0
05/20/2016 14:01:48 [0:18856]: do_core_binding: "binding" parameter not found in config file
05/20/2016 14:01:48 [0:18856]: no prolog script to start
05/20/2016 14:01:48 [0:18856]: parent: forked "job" with pid 18857
05/20/2016 14:01:48 [0:18856]: parent: job-pid: 18857
05/20/2016 14:01:48 [0:18857]: child: starting son(job, /gridware/sge/default/spool/t3wn32/job_scripts/669786, 0);
05/20/2016 14:01:48 [0:18857]: pid=18857 pgrp=18857 sid=18857 old pgrp=18856 getlogin()=<no login set>
05/20/2016 14:01:48 [0:18857]: reading passwd information for user 'ursl'
05/20/2016 14:01:48 [0:18857]: setosjobid: uid = 0, euid = 0
05/20/2016 14:01:48 [0:18857]: setting limits
05/20/2016 14:01:48 [0:18857]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 3221225472 hard 3221225472) resulting: (soft 3221225472 hard 3221225472)
05/20/2016 14:01:48 [0:18857]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: setting environment
05/20/2016 14:01:48 [0:18857]: Initializing error file
05/20/2016 14:01:48 [0:18857]: switching to intermediate/target user
05/20/2016 14:01:48 [521:18857]: closing all filedescriptors
05/20/2016 14:01:48 [521:18857]: further messages are in "error" and "trace"
05/20/2016 14:01:48 [521:18857]: now running with uid=521, euid=521
05/20/2016 14:01:48 [521:18857]: execvp(/mnt/t3nfs01/data01/shome/sgeadmin/t3scripts/starter_method.emi-wn.sl6.sh, "starter_method.emi-wn.sl6.sh" "/gridware/sge/default/spool/t3wn32/job_scripts/669786")
</pre>

---++++ FINISHED JOB
Given the *finished* job ID =%ORANGE%4675956%ENDCOLOR%=, we can check its batch system logs by:
<pre>
$ %ORANGE%JOBID%ENDCOLOR%=4675956
$ %GREEN%qrsh -q debug.q %RED%-l h_vmem=400M%ENDCOLOR% -l hostname=%ENDCOLOR%%BLUE%`qacct -j %ORANGE%$JOBID%ENDCOLOR% | grep hostname | awk '{print $2}' | cut -d\. -f1`%ENDCOLOR%%GREEN% cat /gridware/sge/default/spool/%ENDCOLOR%%BLUE%`qacct -j %ORANGE%$JOBID%ENDCOLOR% | grep hostname | awk '{print $2}' | cut -d\. -f1`%ENDCOLOR%%GREEN%/active_jobs/%ENDCOLOR%%ORANGE%${JOBID}%ENDCOLOR%%GREEN%.1/trace%ENDCOLOR%
02/03/2014 11:05:19 [0:10068]: shepherd called with uid = 0, euid = 0
02/03/2014 11:05:19 [0:10068]: starting up 6.2u5
02/03/2014 11:05:19 [0:10068]: setpgid(10068, 10068) returned 0
02/03/2014 11:05:19 [0:10068]: do_core_binding: "binding" parameter not found in config file
02/03/2014 11:05:19 [0:10068]: no prolog script to start
02/03/2014 11:05:19 [0:10070]: child: starting son(job, /gridware/sge/default/spool/t3wn34/job_scripts/4675956, 0);
02/03/2014 11:05:19 [0:10070]: pid=10070 pgrp=10070 sid=10070 old pgrp=10068 getlogin()=<no login set>
02/03/2014 11:05:19 [0:10070]: reading passwd information for user 'bianchi'
02/03/2014 11:05:19 [0:10070]: setosjobid: uid = 0, euid = 0
02/03/2014 11:05:19 [0:10070]: setting limits
02/03/2014 11:05:19 [0:10070]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 1073741824 hard 1073741824) resulting: (soft 1073741824 hard 1073741824)
02/03/2014 11:05:19 [0:10070]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: setting environment
02/03/2014 11:05:19 [0:10070]: Initializing error file
02/03/2014 11:05:19 [0:10070]: switching to intermediate/target user
02/03/2014 11:05:19 [0:10068]: parent: forked "job" with pid 10070
02/03/2014 11:05:19 [0:10068]: parent: job-pid: 10070
02/03/2014 11:05:19 [579:10070]: closing all filedescriptors
02/03/2014 11:05:19 [579:10070]: further messages are in "error" and "trace"
02/03/2014 11:05:19 [579:10070]: now running with uid=579, euid=579
02/03/2014 11:05:19 [579:10070]: execvp(/shome/sgeadmin/t3scripts/starter_method.emi-wn.sh, "starter_method.emi-wn.sh" "/gridware/sge/default/spool/t3wn34/job_scripts/4675956")
02/03/2014 11:08:04 [0:10068]: wait3 returned -1
02/03/2014 11:08:04 [0:10068]: %RED%forward_signal_to_job(): mapping signal 20 TSTP%ENDCOLOR%
02/03/2014 11:08:04 [0:10068]: mapped signal TSTP to signal KILL
02/03/2014 11:08:04 [0:10068]: queued signal KILL
02/03/2014 11:08:04 [0:10068]: kill(-10070, KILL)
02/03/2014 11:08:04 [0:10068]: now sending signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: pdc_kill_addgrpid: 20073 9
02/03/2014 11:08:04 [0:10068]: killing pid 10070/3
02/03/2014 11:08:04 [0:10068]: killing pid 11530/3
02/03/2014 11:08:04 [0:10068]: wait3 returned 10070 (status: 9; WIFSIGNALED: 1, WIFEXITED: 0, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: job exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "job" with pid 10070
02/03/2014 11:08:04 [0:10068]: job exited due to signal
02/03/2014 11:08:04 [0:10068]: job signaled: 9
02/03/2014 11:08:04 [0:10068]: ignored signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: writing usage file to "usage"
02/03/2014 11:08:04 [0:10068]: no tasker to notify
02/03/2014 11:08:04 [0:10068]: parent: forked "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: using signal delivery delay of 120 seconds
02/03/2014 11:08:04 [0:10068]: parent: epilog-pid: 11531
02/03/2014 11:08:04 [0:11531]: child: starting son(epilog, /shome/sgeadmin/t3scripts/epilog.sh, 0);
02/03/2014 11:08:04 [0:11531]: pid=11531 pgrp=11531 sid=11531 old pgrp=10068 getlogin()=<no login set>
02/03/2014 11:08:04 [0:11531]: reading passwd information for user 'bianchi'
02/03/2014 11:08:04 [0:11531]: setting limits
02/03/2014 11:08:04 [0:11531]: setting environment
02/03/2014 11:08:04 [0:11531]: Initializing error file
02/03/2014 11:08:04 [0:11531]: switching to intermediate/target user
02/03/2014 11:08:04 [579:11531]: closing all filedescriptors
02/03/2014 11:08:04 [579:11531]: further messages are in "error" and "trace"
02/03/2014 11:08:04 [579:11531]: using "/bin/bash" as shell of user "bianchi"
02/03/2014 11:08:04 [579:11531]: now running with uid=579, euid=579
02/03/2014 11:08:04 [579:11531]: execvp(/shome/sgeadmin/t3scripts/epilog.sh, "/shome/sgeadmin/t3scripts/epilog.sh")
02/03/2014 11:08:04 [0:10068]: wait3 returned 11531 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: epilog exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: epilog exited not due to signal
02/03/2014 11:08:04 [0:10068]: epilog exited with status 0
</pre>
Be aware that the cleaning procedures regularly running on the WNs will delete these logs after some days; if a log file is important to you, save it in your =/shome= area.
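The back-quoted %BLUE%blue%ENDCOLOR% expressions in the two commands above derive the WN hostname from the =qstat= or =qacct= output before building the trace-file path. A minimal sketch of how that =grep -o= extraction behaves, run on a made-up sample line (the sample text and job values are illustrative assumptions, not real accounting data):

```shell
# Made-up qstat-like line, for illustration only:
SAMPLE='669786 0.55000 myjob auser r 05/20/2016 14:01:48 all.q@t3wn32.psi.ch 1'
# grep -o prints only the matching part: "t3wn" followed by exactly two digits.
WN=$(echo "$SAMPLE" | grep -o -e 't3wn[0-9]\{2\}')
echo "$WN"                                                  # t3wn32
# The trace-file path is then assembled from that hostname:
echo "/gridware/sge/default/spool/$WN/active_jobs/669786.1/trace"
```

The same pattern works on the =qacct -j $JOBID= output, where the =hostname= field is instead isolated with =awk= and =cut= as shown above.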
Topic revision: r20 - 2016-07-22 - FabioMartinelli