How to interactively debug jobs on a worker node

Important note

We provide a special interactive queue, debug.q, for debugging directly on the worker nodes (WNs). Please do not abuse this facility, since it can interfere with other users' jobs. No CPU-intensive tasks should ever be run on this queue.

The use cases for this queue are:

Peeking at the output and log files of a running job
Manual emergency cleanup of your scratch area when you discover that a misconfiguration in your code made your jobs fill it up, causing problems for others
Debugging a troublesome job in place by looking at its processes

Be aware that by default each T3 job requests 3GB of RAM. Since we only want to debug, we have to request less RAM with -l h_vmem=400M; otherwise the job could stay blocked for lack of free resources.
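
Since every command on this page repeats the same queue and memory options, a small wrapper can help to avoid forgetting the reduced h_vmem request. The following is only a minimal sketch and not part of the T3 tooling: the function name wn_debug and the DRY_RUN variable are hypothetical, and it uses nothing beyond the qrsh options shown on this page.

```shell
# Hypothetical helper (not part of the T3 tooling): run a short command on a
# given WN through debug.q, always requesting the reduced 400M of memory.
# Set DRY_RUN=1 to print the qrsh command line instead of executing it.
wn_debug() {
    wn=$1
    shift
    if [ -n "${DRY_RUN:-}" ]; then
        echo "qrsh -q debug.q -l hostname=$wn -l h_vmem=400M $*"
    else
        qrsh -q debug.q -l "hostname=$wn" -l h_vmem=400M "$@"
    fi
}
```

With DRY_RUN=1 the wrapper only shows what it would run, which is a cheap way to check the options before occupying the queue:

$ DRY_RUN=1 wn_debug t3wn14 ls /scratch/$USER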

Interactive shell access with the qlogin command

$ qlogin -q debug.q -l hostname=t3wn12 -l h_vmem=400M

Your job 1442007 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 1442007 has been successfully scheduled.
Establishing builtin session to host t3wn12.psi.ch ...
$ cd /scratch/$USER
...

Running a command on a WN with qrsh

How to list your `/scratch` directory on a particular WN:

$ qrsh -q debug.q -l hostname=t3wn14 -l h_vmem=400M ls /scratch/$USER

How to 'tail' a WN file

Please do not abuse this by leaving "tail -f" attached for a long time: it blocks the single slot of the debug.q queue.

$ qrsh -q debug.q -l hostname=t3wn14 -l h_vmem=400M tail /scratch/$USER/MyDIR/MyLogFile.log
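
If you need to watch a log over a longer period, one possible alternative to "tail -f" is to take a few short snapshots and give the slot back in between. The sketch below assumes that each qrsh call releases the slot as soon as its command finishes. The function name poll_snapshots is hypothetical.

```shell
# Hypothetical helper: run a command a fixed number of times, sleeping between
# the runs, so that a short-lived command is repeated instead of kept attached.
# Usage: poll_snapshots COUNT DELAY_SECONDS COMMAND [ARGS...]
poll_snapshots() {
    count=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@"
        i=$((i + 1))
        # Sleep only between runs, not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$delay"
        fi
    done
}
```

For example, three snapshots of the last 20 lines, one minute apart:

$ poll_snapshots 3 60 qrsh -q debug.q -l hostname=t3wn14 -l h_vmem=400M tail -n 20 /scratch/$USER/MyDIR/MyLogFile.log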

How to check the `WN:/tmp` disk quotas:

$ qrsh -q debug.q -l hostname=t3wn22 -l h_vmem=400M sudo /usr/sbin/repquota -s /tmp

For instance:

[t3ui12:auser] $ qrsh -q debug.q -l hostname=t3wn22 -l h_vmem=400M sudo /usr/sbin/repquota -s /tmp
cat: sudo: No such file or directory
*** Report for user quotas on device /dev/md3
Block grace time: 7days; Inode grace time: 7days
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --      52       0       0              4     0     0       
nagios    --       4    774M    968M              2     0     0       
auser     --       4    774M    968M              2     0     0
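
The report lists every user, so it can be handy to extract only your own line. The following is a minimal sketch that assumes the column layout shown above (user, status flags, then block used, soft and hard limits). The function name quota_for_user is hypothetical, and the values are printed as reported, without unit conversion.

```shell
# Hypothetical helper: read `repquota -s` output on stdin and print the block
# usage and block limits of one user.
# Assumed columns: $1 user, $2 status flags, $3 used, $4 soft, $5 hard.
quota_for_user() {
    awk -v user="$1" '$1 == user { print $1, "used=" $3, "soft=" $4, "hard=" $5 }'
}
```

It can be attached to the end of the command above:

$ qrsh -q debug.q -l hostname=t3wn22 -l h_vmem=400M sudo /usr/sbin/repquota -s /tmp | quota_for_user $USER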

How to erase ALL your files and dirs from a `WN:/tmp/`

You will typically need this when you have exceeded your /tmp disk quota on that WN:

$ qrsh -q debug.q -l hostname=t3wn10 -l h_vmem=400M find /tmp -user $USER -exec rm -rf {} \;
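
Because the deletion cannot be undone, it may be worth listing what would be removed first. The following is a minimal sketch using only standard find options. The function name preview_cleanup is hypothetical, and -mindepth assumes GNU find.

```shell
# Hypothetical helper: list everything below a directory that belongs to a
# given user (name or numeric uid) without deleting anything.
# -mindepth 1 keeps the directory itself out of the list.
preview_cleanup() {
    find "$1" -mindepth 1 -user "$2" -print
}
```

On a WN the same find expression can be passed to qrsh directly:

$ qrsh -q debug.q -l hostname=t3wn10 -l h_vmem=400M find /tmp -mindepth 1 -user $USER -print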

How to check the `WN:/scratch` disk quotas:

$ qrsh -q debug.q -l hostname=t3wn22 -l h_vmem=400M sudo /usr/sbin/repquota -s /scratch

How to erase ALL your files and dirs from a `WN:/scratch/`

You will typically need this when you have exceeded your /scratch disk quota on that WN:

$ qrsh -q debug.q -l hostname=t3wn10 -l h_vmem=400M find /scratch/$USER/ -user $USER -exec rm -rf {} \;

How to check the job logs

Given a finished job with ID 4675956, we can check its logs on the WN side with:

$ JOBID=4675956
$ WN=$(qacct -j $JOBID | awk '/hostname/ {print $2}' | cut -d. -f1)
$ qrsh -q debug.q -l h_vmem=400M -l hostname=$WN cat /gridware/sge/default/spool/$WN/active_jobs/${JOBID}.1/trace
02/03/2014 11:05:19 [0:10068]: shepherd called with uid = 0, euid = 0
02/03/2014 11:05:19 [0:10068]: starting up 6.2u5
02/03/2014 11:05:19 [0:10068]: setpgid(10068, 10068) returned 0
02/03/2014 11:05:19 [0:10068]: do_core_binding: "binding" parameter not found in config file
02/03/2014 11:05:19 [0:10068]: no prolog script to start
02/03/2014 11:05:19 [0:10070]: child: starting son(job, /gridware/sge/default/spool/t3wn34/job_scripts/4675956, 0);
02/03/2014 11:05:19 [0:10070]: pid=10070 pgrp=10070 sid=10070 old pgrp=10068 getlogin()=
02/03/2014 11:05:19 [0:10070]: reading passwd information for user 'bianchi'
02/03/2014 11:05:19 [0:10070]: setosjobid: uid = 0, euid = 0
02/03/2014 11:05:19 [0:10070]: setting limits
02/03/2014 11:05:19 [0:10070]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 1073741824 hard 1073741824) resulting: (soft 1073741824 hard 1073741824)
02/03/2014 11:05:19 [0:10070]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: setting environment
02/03/2014 11:05:19 [0:10070]: Initializing error file
02/03/2014 11:05:19 [0:10070]: switching to intermediate/target user
02/03/2014 11:05:19 [0:10068]: parent: forked "job" with pid 10070
02/03/2014 11:05:19 [0:10068]: parent: job-pid: 10070
02/03/2014 11:05:19 [579:10070]: closing all filedescriptors
02/03/2014 11:05:19 [579:10070]: further messages are in "error" and "trace"
02/03/2014 11:05:19 [579:10070]: now running with uid=579, euid=579
02/03/2014 11:05:19 [579:10070]: execvp(/shome/sgeadmin/t3scripts/starter_method.emi-wn.sh, "starter_method.emi-wn.sh" "/gridware/sge/default/spool/t3wn34/job_scripts/4675956")
02/03/2014 11:08:04 [0:10068]: wait3 returned -1
02/03/2014 11:08:04 [0:10068]: forward_signal_to_job(): mapping signal 20 TSTP
02/03/2014 11:08:04 [0:10068]: mapped signal TSTP to signal KILL
02/03/2014 11:08:04 [0:10068]: queued signal KILL
02/03/2014 11:08:04 [0:10068]: kill(-10070, KILL)
02/03/2014 11:08:04 [0:10068]: now sending signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: pdc_kill_addgrpid: 20073 9
02/03/2014 11:08:04 [0:10068]: killing pid 10070/3
02/03/2014 11:08:04 [0:10068]: killing pid 11530/3
02/03/2014 11:08:04 [0:10068]: wait3 returned 10070 (status: 9; WIFSIGNALED: 1,  WIFEXITED: 0, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: job exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "job" with pid 10070
02/03/2014 11:08:04 [0:10068]: job exited due to signal
02/03/2014 11:08:04 [0:10068]: job signaled: 9
02/03/2014 11:08:04 [0:10068]: ignored signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: writing usage file to "usage"
02/03/2014 11:08:04 [0:10068]: no tasker to notify
02/03/2014 11:08:04 [0:10068]: parent: forked "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: using signal delivery delay of 120 seconds
02/03/2014 11:08:04 [0:10068]: parent: epilog-pid: 11531
02/03/2014 11:08:04 [0:11531]: child: starting son(epilog, /shome/sgeadmin/t3scripts/epilog.sh, 0);
02/03/2014 11:08:04 [0:11531]: pid=11531 pgrp=11531 sid=11531 old pgrp=10068 getlogin()=
02/03/2014 11:08:04 [0:11531]: reading passwd information for user 'bianchi'
02/03/2014 11:08:04 [0:11531]: setting limits
02/03/2014 11:08:04 [0:11531]: setting environment
02/03/2014 11:08:04 [0:11531]: Initializing error file
02/03/2014 11:08:04 [0:11531]: switching to intermediate/target user
02/03/2014 11:08:04 [579:11531]: closing all filedescriptors
02/03/2014 11:08:04 [579:11531]: further messages are in "error" and "trace"
02/03/2014 11:08:04 [579:11531]: using "/bin/bash" as shell of user "bianchi"
02/03/2014 11:08:04 [579:11531]: now running with uid=579, euid=579
02/03/2014 11:08:04 [579:11531]: execvp(/shome/sgeadmin/t3scripts/epilog.sh, "/shome/sgeadmin/t3scripts/epilog.sh")
02/03/2014 11:08:04 [0:10068]: wait3 returned 11531 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: epilog exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: epilog exited not due to signal
02/03/2014 11:08:04 [0:10068]: epilog exited with status 0
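
A trace like the one above is long, and usually only the lines about signals and exit statuses matter when you want to know why a job ended. The following is a minimal sketch of a filter for that purpose. The function name trace_summary is hypothetical, and the patterns are simply taken from the wording of the trace shown here.

```shell
# Hypothetical helper: reduce a shepherd trace (on stdin) to the lines that
# mention a signal or an exit status, and strip the leading
# "date time [uid:pid]: " prefix from each of them.
trace_summary() {
    grep -E 'signal|exit status' | sed 's/^[^]]*]: //'
}
```

Piping the trace through trace_summary keeps, among others, the line "job signaled: 9", which shows that this job was killed rather than finishing on its own.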