<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup,Main.CMSAdminReaderGroup
-->

---+ !!How to interactively debug jobs on a worker node

%TOC%

---++ Important note

We provide a special interactive queue *debug.q* for debugging directly on the worker nodes (WNs). *Please do not abuse this facility*, since it can interfere with other users' jobs. Never run CPU-intensive tasks on this queue. The use cases for this queue are:
   1 Peeking at the output and log files of a running job
   1 Manual emergency cleanup of your scratch area when you discover that your jobs filled it up because of a misconfiguration in your code, causing problems for others
   1 Debugging a troublesome job in place by looking at its processes

Be aware that by default each T3 job asks for 3GB of RAM. Since we only want to debug, we have to request less RAM with %RED%-l h_vmem=400M%ENDCOLOR%; otherwise the job could be blocked (no free resources).

---++ Interactive shell access with the *qlogin* command

<pre>
$ qlogin -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR%
Your job 1442007 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 1442007 has been successfully scheduled.
Establishing builtin session to host t3wn12.psi.ch ...
$ cd /scratch/$USER
...
</pre>

---++ Running a command on a WN with *qrsh*

---+++ How to list your =/scratch= directory on a particular WN:

<pre>
$ qrsh -q debug.q -l hostname=t3wn14 %RED%-l h_vmem=400M%ENDCOLOR% ls /scratch/$USER
</pre>

---+++ How to 'tail' a WN file

Please do not abuse this by leaving =tail -f= attached for prolonged times: you will block the single slot on the debug.q queue.

<pre>
$ qrsh -q debug.q -l hostname=t3wn14 %RED%-l h_vmem=400M%ENDCOLOR% tail /scratch/$USER/MyDIR/MyLogFile.log
</pre>

---+++ How to check the =WN:/tmp= disk quotas:

<pre>
$ qrsh -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR% sudo /usr/sbin/repquota -s /tmp
</pre>

For instance:

<pre>
[t3ui12:auser] $ qrsh -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR% sudo /usr/sbin/repquota -s /tmp
cat: sudo: No such file or directory
*** Report for user quotas on device /dev/md3
Block grace time: 7days; Inode grace time: 7days
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --      52       0       0              4     0     0
nagios    --       4    774M    968M              2     0     0
auser     --       4    774M    968M              2     0     0
</pre>

---+++ How to erase ALL your files and dirs from a =WN:/tmp/=

You will probably need this because you are exceeding your [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/Tier3Policies][/tmp disk quota]] on that WN.

<pre>
$ qrsh -q debug.q -l hostname=t3wn10 %RED%-l h_vmem=400M%ENDCOLOR% find /tmp -user $USER -exec rm -rf {} \;
</pre>

---+++ How to check the =WN:/scratch= disk quotas:

<pre>
$ qrsh -q debug.q -l hostname=t3wn22 %RED%-l h_vmem=400M%ENDCOLOR% sudo /usr/sbin/repquota -s /scratch
</pre>

---+++ How to erase ALL your files and dirs from a =WN:/scratch/=

You will probably need this because you are exceeding your [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/Tier3Policies][/scratch disk quota]] on that WN.
<pre>
$ qrsh -q debug.q -l hostname=t3wn10 %RED%-l h_vmem=400M%ENDCOLOR% find /scratch/$USER/ -user $USER -exec rm -rf {} \;
</pre>

---+++ How to check the Job logs

Given a *finished* job ID =%ORANGE%4675956%ENDCOLOR%=, we can check its logs on the WN side with:

<pre>
$ %ORANGE%JOBID%ENDCOLOR%=4675956
$ %GREEN%qrsh -q debug.q %RED%-l h_vmem=400M%ENDCOLOR% -l hostname=%ENDCOLOR%%BLUE%`qacct -j %ORANGE%$JOBID%ENDCOLOR% | grep hostname | awk '{print $2}' | cut -d\. -f1`%ENDCOLOR%%GREEN% cat /gridware/sge/default/spool/%ENDCOLOR%%BLUE%`qacct -j %ORANGE%$JOBID%ENDCOLOR% | grep hostname | awk '{print $2}' | cut -d\. -f1`%ENDCOLOR%%GREEN%/active_jobs/%ENDCOLOR%%ORANGE%${JOBID}%ENDCOLOR%%GREEN%.1/trace%ENDCOLOR%
02/03/2014 11:05:19 [0:10068]: shepherd called with uid = 0, euid = 0
02/03/2014 11:05:19 [0:10068]: starting up 6.2u5
02/03/2014 11:05:19 [0:10068]: setpgid(10068, 10068) returned 0
02/03/2014 11:05:19 [0:10068]: do_core_binding: "binding" parameter not found in config file
02/03/2014 11:05:19 [0:10068]: no prolog script to start
02/03/2014 11:05:19 [0:10070]: child: starting son(job, /gridware/sge/default/spool/t3wn34/job_scripts/4675956, 0);
02/03/2014 11:05:19 [0:10070]: pid=10070 pgrp=10070 sid=10070 old pgrp=10068 getlogin()=<no login set>
02/03/2014 11:05:19 [0:10070]: reading passwd information for user 'bianchi'
02/03/2014 11:05:19 [0:10070]: setosjobid: uid = 0, euid = 0
02/03/2014 11:05:19 [0:10070]: setting limits
02/03/2014 11:05:19 [0:10070]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 1073741824 hard 1073741824) resulting: (soft 1073741824 hard 1073741824)
02/03/2014 11:05:19 [0:10070]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
02/03/2014 11:05:19 [0:10070]: setting environment
02/03/2014 11:05:19 [0:10070]: Initializing error file
02/03/2014 11:05:19 [0:10070]: switching to intermediate/target user
02/03/2014 11:05:19 [0:10068]: parent: forked "job" with pid 10070
02/03/2014 11:05:19 [0:10068]: parent: job-pid: 10070
02/03/2014 11:05:19 [579:10070]: closing all filedescriptors
02/03/2014 11:05:19 [579:10070]: further messages are in "error" and "trace"
02/03/2014 11:05:19 [579:10070]: now running with uid=579, euid=579
02/03/2014 11:05:19 [579:10070]: execvp(/shome/sgeadmin/t3scripts/starter_method.emi-wn.sh, "starter_method.emi-wn.sh" "/gridware/sge/default/spool/t3wn34/job_scripts/4675956")
02/03/2014 11:08:04 [0:10068]: wait3 returned -1
02/03/2014 11:08:04 [0:10068]: %RED%forward_signal_to_job(): mapping signal 20 TSTP%ENDCOLOR%
02/03/2014 11:08:04 [0:10068]: mapped signal TSTP to signal KILL
02/03/2014 11:08:04 [0:10068]: queued signal KILL
02/03/2014 11:08:04 [0:10068]: kill(-10070, KILL)
02/03/2014 11:08:04 [0:10068]: now sending signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: pdc_kill_addgrpid: 20073 9
02/03/2014 11:08:04 [0:10068]: killing pid 10070/3
02/03/2014 11:08:04 [0:10068]: killing pid 11530/3
02/03/2014 11:08:04 [0:10068]: wait3 returned 10070 (status: 9; WIFSIGNALED: 1, WIFEXITED: 0, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: job exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "job" with pid 10070
02/03/2014 11:08:04 [0:10068]: job exited due to signal
02/03/2014 11:08:04 [0:10068]: job signaled: 9
02/03/2014 11:08:04 [0:10068]: ignored signal KILL to pid -10070
02/03/2014 11:08:04 [0:10068]: writing usage file to "usage"
02/03/2014 11:08:04 [0:10068]: no tasker to notify
02/03/2014 11:08:04 [0:10068]: parent: forked "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: using signal delivery delay of 120 seconds
02/03/2014 11:08:04 [0:10068]: parent: epilog-pid: 11531
02/03/2014 11:08:04 [0:11531]: child: starting son(epilog, /shome/sgeadmin/t3scripts/epilog.sh, 0);
02/03/2014 11:08:04 [0:11531]: pid=11531 pgrp=11531 sid=11531 old pgrp=10068 getlogin()=<no login set>
02/03/2014 11:08:04 [0:11531]: reading passwd information for user 'bianchi'
02/03/2014 11:08:04 [0:11531]: setting limits
02/03/2014 11:08:04 [0:11531]: setting environment
02/03/2014 11:08:04 [0:11531]: Initializing error file
02/03/2014 11:08:04 [0:11531]: switching to intermediate/target user
02/03/2014 11:08:04 [579:11531]: closing all filedescriptors
02/03/2014 11:08:04 [579:11531]: further messages are in "error" and "trace"
02/03/2014 11:08:04 [579:11531]: using "/bin/bash" as shell of user "bianchi"
02/03/2014 11:08:04 [579:11531]: now running with uid=579, euid=579
02/03/2014 11:08:04 [579:11531]: execvp(/shome/sgeadmin/t3scripts/epilog.sh, "/shome/sgeadmin/t3scripts/epilog.sh")
02/03/2014 11:08:04 [0:10068]: wait3 returned 11531 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
02/03/2014 11:08:04 [0:10068]: epilog exited with exit status 0
02/03/2014 11:08:04 [0:10068]: reaped "epilog" with pid 11531
02/03/2014 11:08:04 [0:10068]: epilog exited not due to signal
02/03/2014 11:08:04 [0:10068]: epilog exited with status 0
</pre>

Be aware that the automatic cleaning procedures on the WNs might delete these logs at some point; if a log is important, save it in your =/shome=.
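
Saving a trace file before the cleanup removes it can be scripted. The following is only a minimal sketch, not an official T3 tool: the function names =short_host_from_qacct=, =trace_path= and =save_trace= are hypothetical, and the default destination =/shome/$USER/job_traces= is an assumption about where your =/shome= area lives. It resolves the execution host once from =qacct= instead of twice, then copies the trace through =debug.q= with the reduced memory request.

```shell
#!/bin/bash
# Hypothetical helpers (illustrative names, not part of SGE or the T3 setup).

# Read `qacct -j JOBID` output on stdin and print the short host name,
# e.g. "t3wn34" for "hostname     t3wn34.psi.ch".
short_host_from_qacct() {
    awk '$1 == "hostname" { split($2, parts, "."); print parts[1]; exit }'
}

# Print the shepherd trace path for task 1 of a job, following the
# spool layout used in the command above.
trace_path() {
    local host=$1 jobid=$2
    printf '/gridware/sge/default/spool/%s/active_jobs/%s.1/trace\n' "$host" "$jobid"
}

# Copy the trace of a finished job to a local directory.
# Assumption: /shome/$USER exists and is writable for you.
save_trace() {
    local jobid=$1 dest=${2:-/shome/$USER/job_traces}
    local host
    host=$(qacct -j "$jobid" | short_host_from_qacct)
    if [ -z "$host" ]; then
        echo "no execution host found for job $jobid" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # Same queue and reduced memory request as in the examples above.
    qrsh -q debug.q -l h_vmem=400M -l hostname="$host" \
        cat "$(trace_path "$host" "$jobid")" > "$dest/trace.$jobid"
}
```

After sourcing the file on a UI node, =save_trace 4675956= would write the trace to =trace.4675956= in the destination directory, provided the log still exists on the WN.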
Topic revision: r16 - 2016-05-20 - FabioMartinelli
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.