How to interactively debug jobs on a worker node

Important note
Interactive accesses by qlogin
Running an arbitrary command by qrsh

Important note

The queue debug.q allows to debug the running jobs on the worker nodes. Please do not abuse this facility since it has the potential to interfere with other users' jobs. No CPU intensive tasks should ever be run on this queue.

The use cases for this queue are :

Peeking at output and log files of a running job
Manual emergency cleanup of your scratch area when you discover that your jobs filled it up due to some misconfiguration in your code, causing problems for others.
Debugging a troublesome job in place by looking at its processes

Be aware that by default each T3 job asks for 3GB of RAM so you've to request less RAM by setting -l h_vmem=400M ; viceversa the job could get easily blocked

Interactive accesses by `qlogin`

To debug a job running on t3wn22

$ qlogin -q debug.q -l hostname=t3wn22 -l h_vmem=400M

Your job 1442007 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 1442007 has been successfully scheduled.
Establishing builtin session to host t3wn12.psi.ch ...
$ cd /scratch/$USER
...

Running an arbitrary command by `qrsh`

Some examples :

How to list your `/scratch` directory content

$ qrsh -q debug.q -l hostname=t3wn14 -l h_vmem=400M ls /scratch/$USER

How to run a `tail`

Do not abuse of this by attaching "tail -f" for prolonged times :

$ qrsh -q debug.q -l hostname=t3wn14 -l h_vmem=400M tail /scratch/$USER/MyDIR/MyLogFile.log

How to check a `WN:/tmp` disk quota :

$ qrsh -q debug.q -l hostname=t3wn22 -l h_vmem=400M sudo /usr/sbin/repquota -s /tmp

Output could be :

[t3ui12:auser] $ qrsh -q debug.q -l hostname=t3wn22 -l h_vmem=400M sudo /usr/sbin/repquota -s /tmp
cat: sudo: No such file or directory
*** Report for user quotas on device /dev/md3
Block grace time: 7days; Inode grace time: 7days
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --      52       0       0              4     0     0       
nagios    --       4    774M    968M              2     0     0       
auser --       4    774M    968M              2     0     0

How to erase ALL your files and dirs from a `WN:/tmp/`

You want to do it because probably you're overusing your /tmp disk quota on that particular WN.

qrsh -q debug.q -l hostname=t3wn10 -l h_vmem=400M find /tmp -user $USER -exec rm -rf {} \;

How to check a `WN:/scratch` disk quota :

$ qrsh -q debug.q -l hostname=t3wn22 -l h_vmem=400M sudo /usr/sbin/repquota -s /scratch

How to check ALL your `WN:/scratch` disk quota :

for H in `qhost | grep t3wn | awk '{print $1}' ` ; do echo bash -x -c \"qrsh -q debug.q -l hostname=$H -l h_vmem=400M sudo /usr/sbin/repquota -s /scratch\"  ; done > /tmp/qrsh
source /tmp/qrsh 2>&1 | egrep "$USER|t3wn"
rm -f /tmp/qrsh

How to erase ALL your files and dirs from a `WN:/scratch/`

You want to do it because probably you're overusing your /scratch disk quota on that particular WN.

qrsh -q debug.q -l hostname=t3wn10 -l h_vmem=400M find /scratch/$USER/ -user $USER -exec rm -rf {} \;

How to check the batch system Job logs on a WN

ACTIVE JOB

Given the running Job ID 669786 we can check its batch system logs by :

$ JOBID=669786
$ qrsh -q debug.q -l h_vmem=400M -l hostname=`qstat -u \* | grep  $JOBID  | grep -e "t3wn[0-9]\{2\}" -o ` cat /gridware/sge/default/spool/`qstat -u \* | grep  $JOBID  | grep -e "t3wn[0-9]\{2\}" -o `/active_jobs/${JOBID}.1/trace 
cat: cat: No such file or directory
05/20/2016 14:01:48 [0:18856]: shepherd called with uid = 0, euid = 0
05/20/2016 14:01:48 [0:18856]: starting up 6.2u5
05/20/2016 14:01:48 [0:18856]: setpgid(18856, 18856) returned 0
05/20/2016 14:01:48 [0:18856]: do_core_binding: "binding" parameter not found in config file
05/20/2016 14:01:48 [0:18856]: no prolog script to start
05/20/2016 14:01:48 [0:18856]: parent: forked "job" with pid 18857
05/20/2016 14:01:48 [0:18856]: parent: job-pid: 18857
05/20/2016 14:01:48 [0:18857]: child: starting son(job, /gridware/sge/default/spool/t3wn32/job_scripts/669786, 0);
05/20/2016 14:01:48 [0:18857]: pid=18857 pgrp=18857 sid=18857 old pgrp=18856 getlogin()=
05/20/2016 14:01:48 [0:18857]: reading passwd information for user 'ursl'
05/20/2016 14:01:48 [0:18857]: setosjobid: uid = 0, euid = 0
05/20/2016 14:01:48 [0:18857]: setting limits
05/20/2016 14:01:48 [0:18857]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 3221225472 hard 3221225472) resulting: (soft 3221225472 hard 3221225472)
05/20/2016 14:01:48 [0:18857]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
05/20/2016 14:01:48 [0:18857]: setting environment
05/20/2016 14:01:48 [0:18857]: Initializing error file
05/20/2016 14:01:48 [0:18857]: switching to intermediate/target user
05/20/2016 14:01:48 [521:18857]: closing all filedescriptors
05/20/2016 14:01:48 [521:18857]: further messages are in "error" and "trace"
05/20/2016 14:01:48 [521:18857]: now running with uid=521, euid=521
05/20/2016 14:01:48 [521:18857]: execvp(/mnt/t3nfs01/data01/shome/sgeadmin/t3scripts/starter_method.emi-wn.sl6.sh, "starter_method.emi-wn.sl6.sh" "/gridware/sge/default/spool/t3wn32/job_scripts/669786")

FINISHED JOB

Given the finished job ID 4675956 we can check its batch system logs by :

$ JOBID=4675956
$ qrsh -q debug.q -l h_vmem=400M -l hostname=`qacct -j $JOBID | grep hostname| awk '{print $2'}| cut -d\. -f1` cat /gridware/sge/default/spool/`qacct -j $JOBID | grep hostname| awk '{print $2'}| cut -d\. -f1`/active_jobs/${JOBID}.1/trace 
02/03/2014 11:05:19 [0:10068]: shepherd called with uid = 0, euid = 0
02/03/2014 11:05:19 [0:10068]: starting up 6.2u5
02/03/2014 11:05:19 [0:10068]: setpgid(10068, 10068) returned 0
02/03/2014 11:05:19 [0:10068]: do_core_binding: "binding" parameter not found in config file
02/03/2014 11:05:19 [0:10068]: no prolog script to start
02/03/2014 11:05:19 [0:10070]: child: starting son(job, /gridware/sge/default/spool/t3wn34/job_scripts/4675956, 0);
02/03/2014 11:05:19 [0:10070]: pid=10070 pgrp=10070 sid=10070 old pgrp=10068 getlogin()=
02/03/2014 11:05:19 [0:10070]: reading passwd information for user 'bianchi'
02/03/2014 11:05:19 [0:10070]: setosjobid: uid = 0, euid = 0
02/03/2014 11:05:19 [0:10070]: setting limits
02/03/2014 11:05:19 [0:10070]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 1073741824 hard 1073741824) resulting: (soft 1073741824 hard 1073741824)
02/03/2014 11:05:19 [0:10070]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: setting environment
02/03/2014 11:05:19 [0:10070]: Initializing error file
02/03/2014 11:05:19 [0:10070]: switching to intermediate/target user
02/03/2014 11:05:19 [0:10068]: parent: forked "job" with pid 10070
02/03/2014 11:05:19 [0:10068]: parent: job-pid: 10070
02/03/2014 11:05:19 [579:10070]: closing all filedescriptors
02/03/2014 11:05:19 [579:10070]: further messages are in "error" and "trace"
02/03/2014 11:05:19 [579:10070]: now running with uid=579, euid=579
02/03/2014 11:05:19 [579:10070]: execvp(/shome/sgeadmin/t3scripts/starter_method.emi-wn.sh, "starter_method.emi-wn.sh" "/gridware/sge/default/spool/t3wn34/job_scripts/4675956")
02/03/2014 11:08:04 [0:10068]: wait3 returned -1
02/03/2014 11:08:04 [0:10068]: forward_signal_to_job(): mapping signal 20 TSTP
02/03/2014 11:08:04 [0:10068]: mapped signal TSTP to signal KILL
02/03/2014 11:08:04 [0:10068]: queued signal KILL
02/03/2014 11:08:04 [0:10068]: kill(-10070, KILL)
02/03/2014 11:08:04 [0:10068]: now sending signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: pdc_kill_addgrpid: 20073 9
02/03/2014 11:08:04 [0:10068]: killing pid 10070/3
02/03/2014 11:08:04 [0:10068]: killing pid 11530/3
02/03/2014 11:08:04 [0:10068]: wait3 returned 10070 (status: 9; WIFSIGNALED: 1,  WIFEXITED: 0, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: job exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "job" with pid 10070
02/03/2014 11:08:04 [0:10068]: job exited due to signal
02/03/2014 11:08:04 [0:10068]: job signaled: 9
02/03/2014 11:08:04 [0:10068]: ignored signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: writing usage file to "usage"
02/03/2014 11:08:04 [0:10068]: no tasker to notify
02/03/2014 11:08:04 [0:10068]: parent: forked "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: using signal delivery delay of 120 seconds
02/03/2014 11:08:04 [0:10068]: parent: epilog-pid: 11531
02/03/2014 11:08:04 [0:11531]: child: starting son(epilog, /shome/sgeadmin/t3scripts/epilog.sh, 0);
02/03/2014 11:08:04 [0:11531]: pid=11531 pgrp=11531 sid=11531 old pgrp=10068 getlogin()=
02/03/2014 11:08:04 [0:11531]: reading passwd information for user 'bianchi'
02/03/2014 11:08:04 [0:11531]: setting limits
02/03/2014 11:08:04 [0:11531]: setting environment
02/03/2014 11:08:04 [0:11531]: Initializing error file
02/03/2014 11:08:04 [0:11531]: switching to intermediate/target user
02/03/2014 11:08:04 [579:11531]: closing all filedescriptors
02/03/2014 11:08:04 [579:11531]: further messages are in "error" and "trace"
02/03/2014 11:08:04 [579:11531]: using "/bin/bash" as shell of user "bianchi"
02/03/2014 11:08:04 [579:11531]: now running with uid=579, euid=579
02/03/2014 11:08:04 [579:11531]: execvp(/shome/sgeadmin/t3scripts/epilog.sh, "/shome/sgeadmin/t3scripts/epilog.sh")
02/03/2014 11:08:04 [0:10068]: wait3 returned 11531 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: epilog exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: epilog exited not due to signal
02/03/2014 11:08:04 [0:10068]: epilog exited with status 0

Be aware that the cleaning procedures regularly running on the WNs are going to delete these logs after some day ; if a log file is important to you then save it in your /shome