qstat
job-ID  prior    name        user     state  submit/start at      queue  slots  ja-task-ID
-------------------------------------------------------------------------------------------
  1261  0.55500  chen_z_cra  chen_z   qw     10/27/2008 10:17:41          1
  1262  0.55500  chen_z_cra  chen_z   qw     10/27/2008 10:17:41          1
qstat -u '*'
shows the jobs of all users, not only your own.
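A few other standard qstat selections are often useful; these are sketches, with the queue name taken from the listing above:

$ qstat -u $USER -s r    # only your running jobs
$ qstat -u $USER -s p    # only your pending jobs
$ qstat -q all.q         # jobs in a specific queue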
[chen_z@t3ui01 ~]$ qstat -f
queuename            qtype  used/tot.  load_avg  arch        states
----------------------------------------------------------------------------
all.q@t3wn01         BIP    8/8        8.00      lx24-amd64
----------------------------------------------------------------------------
all.q@t3wn02         BIP    8/8        8.05      lx24-amd64
----------------------------------------------------------------------------
all.q@t3wn03         BIP    8/8        8.00      lx24-amd64
----------------------------------------------------------------------------
all.q@t3wn04         BIP    8/8        8.00      lx24-amd64
----------------------------------------------------------------------------
all.q@t3wn05         BIP    8/8        8.00      lx24-amd64
----------------------------------------------------------------------------
all.q@t3wn06         BIP    8/8        7.87      lx24-amd64
----------------------------------------------------------------------------
all.q@t3wn07         BIP    8/8        8.00      lx24-amd64  d
----------------------------------------------------------------------------
all.q@t3wn08.psi.ch  BIP    0/8        -NA-      lx24-amd64  au
############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
  1261  0.55500  chen_z_cra  chen_z   qw     10/27/2008 10:17:41          1
  1262  0.55500  chen_z_cra  chen_z   qw     10/27/2008 10:17:41          1
[chen_z@t3ui01 ~]$ qstat -j
scheduling info:
  queue instance "all.q@t3wn08.psi.ch" dropped because it is temporarily not available
  queue instance "all.q@t3wn07" dropped because it is disabled
  queue instance "all.q@t3wn05" dropped because it is full
  queue instance "all.q@t3wn01" dropped because it is full
  queue instance "all.q@t3wn06" dropped because it is full
  queue instance "all.q@t3wn03" dropped because it is full
  queue instance "all.q@t3wn02" dropped because it is full
  queue instance "all.q@t3wn04" dropped because it is full
  All queues dropped because of overload or full
qdel JOB_ID
Use the qstat command to find a job's ID.
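For example, to remove one of the pending jobs from the qstat listing above, or all of your own jobs at once (the -u option is a standard SGE qdel option):

$ qdel 1261        # delete the job with ID 1261
$ qdel -u $USER    # delete all of your own jobs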
[chen_z@t3ui01 ~]$ qhost
HOSTNAME  ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global    -           -     -     -       -       -       -
t3wn01    lx24-amd64  8     8.00  15.7G   10.2G   1.9G    208.0K
t3wn02    lx24-amd64  8     7.97  15.7G   6.8G    1.9G    208.0K
t3wn03    lx24-amd64  8     8.00  15.7G   8.3G    1.9G    208.0K
t3wn04    lx24-amd64  8     8.05  15.7G   7.0G    1.9G    208.0K
t3wn05    lx24-amd64  8     8.01  15.7G   10.7G   1.9G    208.0K
t3wn06    lx24-amd64  8     8.25  15.7G   7.9G    1.9G    4.3M
t3wn07    lx24-amd64  8     8.00  15.7G   9.0G    1.9G    208.0K
t3wn08    lx24-amd64  8     -     15.7G   -       1.9G    -
[chen_z@t3ui01 ~]$ qhost -j
HOSTNAME  ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global    -           -     -     -       -       -       -
t3wn01    lx24-amd64  8     8.00  15.7G   10.2G   1.9G    208.0K
   job-ID  prior    name        user     state  submit/start at      queue       master  ja-task-ID
   -------------------------------------------------------------------------------------------------
     1218  0.55500  sd_backgro  dambach  r      10/27/2008 19:23:01  all.q@t3wn  MASTER
     1220  0.55500  sd_backgro  dambach  r      10/27/2008 19:37:19  all.q@t3wn  MASTER
     1222  0.55500  sd_backgro  dambach  r      10/27/2008 19:44:44  all.q@t3wn  MASTER
     1224  0.55500  sd_backgro  dambach  r      10/27/2008 20:11:16  all.q@t3wn  MASTER
     1225  0.55500  sd_backgro  dambach  r      10/27/2008 20:16:09  all.q@t3wn  MASTER
     1227  0.55500  sd_backgro  dambach  r      10/27/2008 20:24:35  all.q@t3wn  MASTER
     1229  0.55500  sd_backgro  dambach  r      10/27/2008 20:30:56  all.q@t3wn  MASTER
     1230  0.55500  sd_backgro  dambach  r      10/27/2008 20:32:12  all.q@t3wn  MASTER
t3wn02    lx24-amd64  8     7.91  15.7G   7.0G    1.9G    208.0K
     1177  0.55500  sd_backgro  dambach  r      10/27/2008 08:55:35  all.q@t3wn  MASTER
     1201  0.55500  sd_backgro  dambach  r      10/27/2008 08:55:35  all.q@t3wn  MASTER
     1237  0.55500  sd_backgro  dambach  r      10/27/2008 21:13:36  all.q@t3wn  MASTER
...
t3wn07    lx24-amd64  8     8.02  15.7G   9.1G    1.9G    208.0K
     1221  0.55500  sd_backgro  dambach  r      10/27/2008 19:40:54  all.q@t3wn  MASTER
     1226  0.55500  sd_backgro  dambach  r      10/27/2008 20:20:15  all.q@t3wn  MASTER
     1228  0.55500  sd_backgro  dambach  r      10/27/2008 20:29:10  all.q@t3wn  MASTER
     1231  0.55500  sd_backgro  dambach  r      10/27/2008 20:40:06  all.q@t3wn  MASTER
     1233  0.55500  sd_backgro  dambach  r      10/27/2008 20:58:59  all.q@t3wn  MASTER
     1236  0.55500  sd_backgro  dambach  r      10/27/2008 21:09:43  all.q@t3wn  MASTER
     1239  0.55500  sd_backgro  dambach  r      10/27/2008 21:27:22  all.q@t3wn  MASTER
     1243  0.55500  sd_backgro  dambach  r      10/27/2008 21:32:47  all.q@t3wn  MASTER
t3wn08    lx24-amd64  8     -     15.7G   -       1.9G    -
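qhost also accepts a host selection (a standard SGE option), which is handy when you only care about a single worker node; for example, using a node name from the listing above:

$ qhost -j -h t3wn01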
qstat -j JOB_ID
This command prints the reason (scheduler information) why your job is sitting in the queue. For example:
[chen_z@t3ui01 sge]$ qstat -j 1264
==============================================================
job_number:       1264
exec_file:        job_scripts/1264
...
script_file:      test.job
scheduling info:  queue instance "all.q@t3wn08.psi.ch" dropped because it is temporarily not available
                  queue instance "all.q@t3wn07" dropped because it is disabled
                  queue instance "all.q@t3wn05" dropped because it is full
                  queue instance "all.q@t3wn01" dropped because it is full
                  queue instance "all.q@t3wn03" dropped because it is full
                  queue instance "all.q@t3wn04" dropped because it is full
                  (-l h_rt=460000) cannot run in queue "all.q@t3wn06" because it offers only qf:h_rt=4:00:30:00
                  (-l h_rt=460000) cannot run in queue "all.q@t3wn02" because it offers only qf:h_rt=4:00:30:00

So in this example, the reason is that the requested maximum run time (h_rt) is larger than the run time limit of the queue.
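One way out is to resubmit with a smaller h_rt request. A minimal sketch, assuming the script test.job from the output above and an illustrative request of 24 hours:

$ qsub -l h_rt=24:00:00 test.job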
qacct
is meant to check the post-execution statistics of a job; have a look at its parameters with
qacct --help
For instance, you might want to check your RAM usage during the last 30 days (only jobs that exited successfully):
$ qacct -f /gridware/sge/default/common/accounting.complete -o $USER -d 30 -j | egrep 'maxvmem|exit_status|jobnumber|jobname' | paste - - - - | grep "exit_status 0"
It is important to request the correct amount of maximum RAM (h_vmem) for your jobs: if you constantly and erroneously ask for too much RAM, your jobs might wait longer to start.
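A minimal sketch of such a request, assuming an illustrative limit of 4 GB and the script test.job; check Tier3Policies for the values actually allowed on this cluster:

$ qsub -l h_vmem=4G test.job

or, equivalently, inside the job script itself:

#$ -l h_vmem=4G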
qquota
shows the current resource limits and, basically, who is using what inside those limits; it is useful, for instance, to understand why your jobs are pending despite tens of free CPU cores:
$ qquota -u \*
resource quota rule          limit          filter
--------------------------------------------------------------------------------
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn26
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn29
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn12
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn16
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn22
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn23
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn18
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn28
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn14
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn24
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn15
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn20
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn25
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn10
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn27
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn13
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn17
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn11
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn19
max_jobs_per_sun_host/1      slots=8/8      queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn21
max_jobs_per_intel_host/1    slots=9/16     queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn33
max_jobs_per_intel_host/1    slots=9/16     queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn34
max_jobs_per_intel_host/1    slots=9/16     queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn40
max_jobs_per_intel_host/1    slots=9/16     queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn32
max_jobs_per_intel_host/1    slots=10/16    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn39
max_jobs_per_intel_host/1    slots=10/16    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn36
max_jobs_per_intel_host/1    slots=9/16     queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn38
max_jobs_per_intel_host/1    slots=11/16    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn35
max_jobs_per_intel_host/1    slots=10/16    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn30
max_jobs_per_intel_host/1    slots=10/16    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn31
max_jobs_per_intel_host/1    slots=10/16    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn37
max_jobs_per_intel2_host/1   slots=53/64    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn59
max_jobs_per_intel2_host/1   slots=54/64    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn51
max_jobs_per_intel2_host/1   slots=54/64    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn52
max_jobs_per_intel2_host/1   slots=53/64    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn58
max_jobs_per_intel2_host/1   slots=51/64    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn56
max_jobs_per_intel2_host/1   slots=54/64    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn54
max_jobs_per_intel2_host/1   slots=55/64    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn53
max_jobs_per_intel2_host/1   slots=55/64    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn55
max_jobs_per_intel2_host/1   slots=53/64    queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q hosts t3wn57
max_allq_jobs/1              slots=740/740  queues all.q,long.q
max_longq_jobs/1             slots=99/360   queues long.q
max_user_jobs_per_queue/1    slots=396/400  users ursl queues all.q
max_user_jobs_per_queue/1    slots=237/400  users cgalloni queues all.q
max_user_jobs_per_queue/1    slots=6/400    users grauco queues all.q
max_user_jobs_per_queue/1    slots=2/400    users gaperrin queues all.q
max_user_jobs_per_queue/2    slots=8/460    users ursl queues short.q
max_user_jobs_per_queue/3    slots=96/340   users ursl queues long.q
max_user_jobs_per_queue/3    slots=3/340    users pandolf queues long.q
max_jobs_per_user/1          slots=500/500  users ursl queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q
max_jobs_per_user/1          slots=237/500  users cgalloni queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q
max_jobs_per_user/1          slots=6/500    users grauco queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q
max_jobs_per_user/1          slots=2/500    users gaperrin queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q
max_jobs_per_user/1          slots=3/500    users pandolf queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q,sherpa.int.vlong.q

The agreed T3 policies, and specifically also the Batch System Policies, are documented on Tier3Policies.
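To restrict the qquota listing to the limits that currently apply to your own jobs:

$ qquota -u $USER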