How to interactively debug jobs on a worker node
Important note
We provide a special interactive queue, debug.q, for debugging directly on the worker nodes.
Please do not abuse this facility, since it has the potential to interfere with other users' jobs. No CPU-intensive tasks should ever be run on this queue.
The use cases for this queue are:
- Peeking at output and log files of a running job
- Manual emergency cleanup of your scratch area when you discover that your jobs filled it up due to some misconfiguration in your code, causing problems for others.
- Debugging a troublesome job in place by looking at its processes
Be aware that by default each T3 job requests 3GB of RAM. For debugging much less is needed, so request less RAM with -l h_vmem=400M; otherwise the job may stay blocked because no free resources are available.
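The reduced memory request described above is easy to forget when typing commands by hand. A minimal sketch of a wrapper, assuming a bash-like shell on the UI; the function name debug_qrsh_cmd is hypothetical, and it only prints the command line so that it can be inspected before it is run:

```shell
# Hypothetical helper: print a qrsh command line for debug.q that always
# carries the reduced memory request (-l h_vmem=400M).
# Usage: debug_qrsh_cmd <worker-node> <command> [args...]
debug_qrsh_cmd() {
    if [ "$#" -lt 2 ]; then
        echo "usage: debug_qrsh_cmd <worker-node> <command> [args...]" >&2
        return 1
    fi
    wn=$1
    shift
    # Only the queue, host and memory options used on this page are emitted.
    echo "qrsh -q debug.q -l hostname=$wn -l h_vmem=400M $*"
}
```

For example, debug_qrsh_cmd t3wn14 ls /scratch/$USER prints the same qrsh line that is used for listing a scratch directory on t3wn14.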
Interactive shell access with the qlogin command
$ qlogin -q debug.q -l hostname=t3wn12 -l h_vmem=400M
Your job 1442007 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 1442007 has been successfully scheduled.
Establishing builtin session to host t3wn12.psi.ch ...
$ cd /scratch/$USER
...
Running a command on a WN with qrsh
How to list your /scratch directory on a particular WN:
$ qrsh -q debug.q -l hostname=t3wn14 -l h_vmem=400M ls /scratch/$USER
How to 'tail' a WN file
Please do not abuse this by leaving "tail -f" attached for a long time: you would block the single slot on the debug.q queue.
$ qrsh -q debug.q -l hostname=t3wn14 -l h_vmem=400M tail /scratch/$USER/MyDIR/MyLogFile.log
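If you do need to follow a file for a while, one way to make sure the slot is released is to put a time limit on the tail. A minimal sketch, assuming that timeout from GNU coreutils is installed on the WN (not confirmed here); the function name bounded_tail and its defaults are hypothetical:

```shell
# Hypothetical helper: follow a file for at most <seconds>, then stop,
# so the debug.q slot is freed even if the session is forgotten.
# Assumes timeout(1) from GNU coreutils is available.
# Usage: bounded_tail <file> [seconds] [lines]
bounded_tail() {
    file=$1
    secs=${2:-60}
    lines=${3:-20}
    rc=0
    timeout "$secs" tail -n "$lines" -f "$file" || rc=$?
    # timeout exits with 124 when the time limit was reached; that is the
    # expected way for this helper to end, so report success in that case.
    if [ "$rc" -eq 124 ]; then
        return 0
    fi
    return "$rc"
}
```

The function is meant for use inside a qlogin session on the WN. The same idea works directly through qrsh by putting timeout 60 in front of the tail -f command.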
How to check the WN:/tmp disk quotas:
$ qrsh -q debug.q -l hostname=t3wn22 -l h_vmem=400M sudo /usr/sbin/repquota -s /tmp
For instance :
[t3ui12:auser] $ qrsh -q debug.q -l hostname=t3wn22 -l h_vmem=400M sudo /usr/sbin/repquota -s /tmp
cat: sudo: No such file or directory
*** Report for user quotas on device /dev/md3
Block grace time: 7days; Inode grace time: 7days
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --      52       0       0              4     0     0
nagios    --       4    774M    968M              2     0     0
auser     --       4    774M    968M              2     0     0
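On a WN with many users the report above can be long. A minimal sketch of a filter that keeps only the users who have reached their soft block limit; the function name over_soft_quota is hypothetical, and the unit handling is an assumption (values without a suffix are treated as KiB, and K, M, G, T as powers of 1024):

```shell
# Hypothetical helper: read `repquota -s` output on stdin and print
# "user used soft" for users whose block usage has reached the soft limit.
# Assumption: plain numbers are KiB; M, G and T suffixes are powers of 1024.
over_soft_quota() {
    awk '
        # Convert a size such as "52", "774M" or "2G" to KiB.
        function to_kib(v,    n, u) {
            n = v + 0                       # numeric prefix
            u = substr(v, length(v), 1)     # last character, possibly a unit
            if (u == "M") return n * 1024
            if (u == "G") return n * 1024 * 1024
            if (u == "T") return n * 1024 * 1024 * 1024
            return n                        # plain number or "K"
        }
        # Data rows carry the two status flags ("--", "+-", ...) in column 2;
        # header and separator lines do not match and are skipped.
        $2 ~ /^[-+][-+]$/ {
            used = to_kib($3)
            soft = to_kib($4)
            if (soft > 0 && used >= soft) print $1, $3, $4
        }
    '
}
```

It can be attached to the repquota command with a pipe, for example by appending | over_soft_quota to the qrsh line used above. Users without a soft limit (soft = 0) are never reported.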
How to erase ALL your files and dirs from a WN:/tmp/
You will typically need this because you have exceeded your /tmp disk quota on that WN.
$ qrsh -q debug.q -l hostname=t3wn10 -l h_vmem=400M find /tmp -user $USER -exec rm -rf {} \;
How to check the WN:/scratch disk quotas:
$ qrsh -q debug.q -l hostname=t3wn22 -l h_vmem=400M sudo /usr/sbin/repquota -s /scratch
How to erase ALL your files and dirs from a WN:/scratch/
You will typically need this because you have exceeded your /scratch disk quota on that WN.
$ qrsh -q debug.q -l hostname=t3wn10 -l h_vmem=400M find /scratch/$USER/ -user $USER -exec rm -rf {} \;
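Since the cleanup commands above delete without asking, it can be worth listing what would be removed first. A minimal sketch with a dry-run mode, assuming GNU find on the WN (for -mindepth and -delete); the function name clean_own_files and the -n switch are hypothetical:

```shell
# Hypothetical helper: delete everything below <dir> that belongs to the
# current user. With "-n" as second argument it only lists the matches.
# Assumes GNU find (for -mindepth and -delete).
# Usage: clean_own_files <dir> [-n]
clean_own_files() {
    dir=$1
    mode=${2:-delete}
    if [ ! -d "$dir" ]; then
        echo "clean_own_files: not a directory: $dir" >&2
        return 1
    fi
    if [ "$mode" = "-n" ]; then
        # Dry run: print the matching paths, remove nothing.
        find "$dir" -mindepth 1 -user "$(id -un)" -print
    else
        # -delete implies -depth, so files are removed before the directory
        # that contains them; <dir> itself is kept because of -mindepth 1.
        find "$dir" -mindepth 1 -user "$(id -un)" -delete
    fi
}
```

Inside a qlogin session on the WN, clean_own_files /scratch/$USER -n shows the candidates and clean_own_files /scratch/$USER removes them. A directory that still contains files of another user is left in place and find reports an error for it.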
How to check the job logs
Given a finished job ID, for example 4675956, we can check its logs on the WN side with:
$ JOBID=4675956
$ WN=$(qacct -j $JOBID | grep hostname | awk '{print $2}' | cut -d. -f1)
$ qrsh -q debug.q -l h_vmem=400M -l hostname=$WN cat /gridware/sge/default/spool/$WN/active_jobs/${JOBID}.1/trace
02/03/2014 11:05:19 [0:10068]: shepherd called with uid = 0, euid = 0
02/03/2014 11:05:19 [0:10068]: starting up 6.2u5
02/03/2014 11:05:19 [0:10068]: setpgid(10068, 10068) returned 0
02/03/2014 11:05:19 [0:10068]: do_core_binding: "binding" parameter not found in config file
02/03/2014 11:05:19 [0:10068]: no prolog script to start
02/03/2014 11:05:19 [0:10070]: child: starting son(job, /gridware/sge/default/spool/t3wn34/job_scripts/4675956, 0);
02/03/2014 11:05:19 [0:10070]: pid=10070 pgrp=10070 sid=10070 old pgrp=10068 getlogin()=
02/03/2014 11:05:19 [0:10070]: reading passwd information for user 'bianchi'
02/03/2014 11:05:19 [0:10070]: setosjobid: uid = 0, euid = 0
02/03/2014 11:05:19 [0:10070]: setting limits
02/03/2014 11:05:19 [0:10070]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 1073741824 hard 1073741824) resulting: (soft 1073741824 hard 1073741824)
02/03/2014 11:05:19 [0:10070]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: setting environment
02/03/2014 11:05:19 [0:10070]: Initializing error file
02/03/2014 11:05:19 [0:10070]: switching to intermediate/target user
02/03/2014 11:05:19 [0:10068]: parent: forked "job" with pid 10070
02/03/2014 11:05:19 [0:10068]: parent: job-pid: 10070
02/03/2014 11:05:19 [579:10070]: closing all filedescriptors
02/03/2014 11:05:19 [579:10070]: further messages are in "error" and "trace"
02/03/2014 11:05:19 [579:10070]: now running with uid=579, euid=579
02/03/2014 11:05:19 [579:10070]: execvp(/shome/sgeadmin/t3scripts/starter_method.emi-wn.sh, "starter_method.emi-wn.sh" "/gridware/sge/default/spool/t3wn34/job_scripts/4675956")
02/03/2014 11:08:04 [0:10068]: wait3 returned -1
02/03/2014 11:08:04 [0:10068]: forward_signal_to_job(): mapping signal 20 TSTP
02/03/2014 11:08:04 [0:10068]: mapped signal TSTP to signal KILL
02/03/2014 11:08:04 [0:10068]: queued signal KILL
02/03/2014 11:08:04 [0:10068]: kill(-10070, KILL)
02/03/2014 11:08:04 [0:10068]: now sending signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: pdc_kill_addgrpid: 20073 9
02/03/2014 11:08:04 [0:10068]: killing pid 10070/3
02/03/2014 11:08:04 [0:10068]: killing pid 11530/3
02/03/2014 11:08:04 [0:10068]: wait3 returned 10070 (status: 9; WIFSIGNALED: 1, WIFEXITED: 0, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: job exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "job" with pid 10070
02/03/2014 11:08:04 [0:10068]: job exited due to signal
02/03/2014 11:08:04 [0:10068]: job signaled: 9
02/03/2014 11:08:04 [0:10068]: ignored signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: writing usage file to "usage"
02/03/2014 11:08:04 [0:10068]: no tasker to notify
02/03/2014 11:08:04 [0:10068]: parent: forked "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: using signal delivery delay of 120 seconds
02/03/2014 11:08:04 [0:10068]: parent: epilog-pid: 11531
02/03/2014 11:08:04 [0:11531]: child: starting son(epilog, /shome/sgeadmin/t3scripts/epilog.sh, 0);
02/03/2014 11:08:04 [0:11531]: pid=11531 pgrp=11531 sid=11531 old pgrp=10068 getlogin()=
02/03/2014 11:08:04 [0:11531]: reading passwd information for user 'bianchi'
02/03/2014 11:08:04 [0:11531]: setting limits
02/03/2014 11:08:04 [0:11531]: setting environment
02/03/2014 11:08:04 [0:11531]: Initializing error file
02/03/2014 11:08:04 [0:11531]: switching to intermediate/target user
02/03/2014 11:08:04 [579:11531]: closing all filedescriptors
02/03/2014 11:08:04 [579:11531]: further messages are in "error" and "trace"
02/03/2014 11:08:04 [579:11531]: using "/bin/bash" as shell of user "bianchi"
02/03/2014 11:08:04 [579:11531]: now running with uid=579, euid=579
02/03/2014 11:08:04 [579:11531]: execvp(/shome/sgeadmin/t3scripts/epilog.sh, "/shome/sgeadmin/t3scripts/epilog.sh")
02/03/2014 11:08:04 [0:10068]: wait3 returned 11531 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: epilog exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: epilog exited not due to signal
02/03/2014 11:08:04 [0:10068]: epilog exited with status 0
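A trace like the one above is long, and usually only the way the job ended is of interest. A minimal sketch that condenses it, based only on the message texts visible in this sample (other Grid Engine versions may word them differently); the function name trace_summary is hypothetical:

```shell
# Hypothetical helper: read a shepherd trace on stdin and report how the
# job ended, using the message texts seen in the sample trace.
trace_summary() {
    awk '
        /job exited with exit status/ { status = $NF }
        /job signaled:/               { sig = $NF }
        /epilog exited with status/   { epilog = $NF }
        END {
            if (sig != "")         print "job killed by signal " sig
            else if (status != "") print "job exited with status " status
            else                   print "no job exit record found"
            if (epilog != "")      print "epilog exit status " epilog
        }
    '
}
```

It reads the trace from standard input, so it can be attached with a pipe to the command that prints the trace. For the sample above it would report that the job was killed by signal 9 and that the epilog exited with status 0.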
Be aware that the automatic cleaning procedures on the WNs might delete these logs at some point; if a log is important, save it in your /shome directory.
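Saving a trace before it is cleaned up requires knowing which WN ran the job. A minimal sketch of that lookup as a reusable filter, assuming that the qacct -j output contains a line starting with the word hostname followed by the fully qualified host name, as the trace command above relies on; the function name job_host is hypothetical:

```shell
# Hypothetical helper: read `qacct -j <jobid>` output on stdin and print the
# short name of the execution host (everything after the first dot removed).
job_host() {
    awk '$1 == "hostname" { sub(/\..*/, "", $2); print $2; exit }'
}
```

With WN=$(qacct -j $JOBID | job_host) the host name is available for the qrsh command that prints the trace, and the output of that command can then be redirected with > into a file of your choice under /shome.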