How to manage jobs with SGE Utilities

SGE provides many command-line utilities and a GUI program (qmon) to interact with the Sun Grid Engine software.

For information on how to submit jobs, please consult this page.

Command Line Client Commands

qstat - show job/queue status

  • with no arguments, the command shows your currently running/pending jobs
    qstat
    
    job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
    -----------------------------------------------------------------------------------------------------------------
       1261 0.55500 chen_z_cra chen_z       qw    10/27/2008 10:17:41                                    1        
       1262 0.55500 chen_z_cra chen_z       qw    10/27/2008 10:17:41                                    1    
        
  • the -u flag can be used to look at other users' jobs. You can use the wildcard '*' to specify all users:
    qstat -u '*'

  • the -f flag shows a full listing of all queues and related information.
    [chen_z@t3ui01 ~]$ qstat -f
    queuename                      qtype used/tot. load_avg arch          states
    ----------------------------------------------------------------------------
    all.q@t3wn01                   BIP   8/8       8.00     lx24-amd64    
    ----------------------------------------------------------------------------
    all.q@t3wn02                   BIP   8/8       8.05     lx24-amd64    
    ----------------------------------------------------------------------------
    all.q@t3wn03                   BIP   8/8       8.00     lx24-amd64    
    ----------------------------------------------------------------------------
    all.q@t3wn04                   BIP   8/8       8.00     lx24-amd64    
    ----------------------------------------------------------------------------
    all.q@t3wn05                   BIP   8/8       8.00     lx24-amd64    
    ----------------------------------------------------------------------------
    all.q@t3wn06                   BIP   8/8       7.87     lx24-amd64    
    ----------------------------------------------------------------------------
    all.q@t3wn07                   BIP   8/8       8.00     lx24-amd64    d
    ----------------------------------------------------------------------------
    all.q@t3wn08.psi.ch            BIP   0/8       -NA-     lx24-amd64    au
    
    ############################################################################
     - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
    ############################################################################
       1261 0.55500 chen_z_cra chen_z       qw    10/27/2008 10:17:41     1        
       1262 0.55500 chen_z_cra chen_z       qw    10/27/2008 10:17:41     1        
       

  • -j shows detailed information on pending/running jobs
    [chen_z@t3ui01 ~]$ qstat -j
    scheduling info:            queue instance "all.q@t3wn08.psi.ch" dropped because it is temporarily not available
                                queue instance "all.q@t3wn07" dropped because it is disabled
                                queue instance "all.q@t3wn05" dropped because it is full
                                queue instance "all.q@t3wn01" dropped because it is full
                                queue instance "all.q@t3wn06" dropped because it is full
                                queue instance "all.q@t3wn03" dropped because it is full
                                queue instance "all.q@t3wn02" dropped because it is full
                                queue instance "all.q@t3wn04" dropped because it is full
                                All queues dropped because of overload or full
        

qdel - delete a job from the queue

Syntax:
qdel JOB_ID
Use the qstat command to find a job's ID.
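
For example (the job IDs below are illustrative), you can delete one job, several jobs at once, or, on a standard SGE installation, all of your own jobs:
qdel 1261
qdel 1261 1262
qdel -u $USER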

qhost - show job/host status

  • with no arguments, the command shows a table of all execution hosts and information about their configuration
    [chen_z@t3ui01 ~]$ qhost
    HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
    -------------------------------------------------------------------------------
    global                  -               -     -       -       -       -       -
    t3wn01                  lx24-amd64      8  8.00   15.7G   10.2G    1.9G  208.0K
    t3wn02                  lx24-amd64      8  7.97   15.7G    6.8G    1.9G  208.0K
    t3wn03                  lx24-amd64      8  8.00   15.7G    8.3G    1.9G  208.0K
    t3wn04                  lx24-amd64      8  8.05   15.7G    7.0G    1.9G  208.0K
    t3wn05                  lx24-amd64      8  8.01   15.7G   10.7G    1.9G  208.0K
    t3wn06                  lx24-amd64      8  8.25   15.7G    7.9G    1.9G    4.3M
    t3wn07                  lx24-amd64      8  8.00   15.7G    9.0G    1.9G  208.0K
    t3wn08                  lx24-amd64      8     -   15.7G       -    1.9G       -
       
  • -j shows detailed information on pending/running jobs, grouped by worker node
    [chen_z@t3ui01 ~]$ qhost -j
    HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
    -------------------------------------------------------------------------------
    global                  -               -     -       -       -       -       -
    t3wn01                  lx24-amd64      8  8.00   15.7G   10.2G    1.9G  208.0K
       job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID 
       ----------------------------------------------------------------------------------------------
          1218 0.55500 sd_backgro dambach      r     10/27/2008 19:23:01 all.q@t3wn MASTER        
          1220 0.55500 sd_backgro dambach      r     10/27/2008 19:37:19 all.q@t3wn MASTER        
          1222 0.55500 sd_backgro dambach      r     10/27/2008 19:44:44 all.q@t3wn MASTER        
          1224 0.55500 sd_backgro dambach      r     10/27/2008 20:11:16 all.q@t3wn MASTER        
          1225 0.55500 sd_backgro dambach      r     10/27/2008 20:16:09 all.q@t3wn MASTER        
          1227 0.55500 sd_backgro dambach      r     10/27/2008 20:24:35 all.q@t3wn MASTER        
          1229 0.55500 sd_backgro dambach      r     10/27/2008 20:30:56 all.q@t3wn MASTER        
          1230 0.55500 sd_backgro dambach      r     10/27/2008 20:32:12 all.q@t3wn MASTER        
    t3wn02                  lx24-amd64      8  7.91   15.7G    7.0G    1.9G  208.0K
          1177 0.55500 sd_backgro dambach      r     10/27/2008 08:55:35 all.q@t3wn MASTER        
          1201 0.55500 sd_backgro dambach      r     10/27/2008 08:55:35 all.q@t3wn MASTER        
          1237 0.55500 sd_backgro dambach      r     10/27/2008 21:13:36 all.q@t3wn MASTER        
    ... ... 
    ... ...       
    t3wn07                  lx24-amd64      8  8.02   15.7G    9.1G    1.9G  208.0K
          1221 0.55500 sd_backgro dambach      r     10/27/2008 19:40:54 all.q@t3wn MASTER        
          1226 0.55500 sd_backgro dambach      r     10/27/2008 20:20:15 all.q@t3wn MASTER        
          1228 0.55500 sd_backgro dambach      r     10/27/2008 20:29:10 all.q@t3wn MASTER        
          1231 0.55500 sd_backgro dambach      r     10/27/2008 20:40:06 all.q@t3wn MASTER        
          1233 0.55500 sd_backgro dambach      r     10/27/2008 20:58:59 all.q@t3wn MASTER        
          1236 0.55500 sd_backgro dambach      r     10/27/2008 21:09:43 all.q@t3wn MASTER        
          1239 0.55500 sd_backgro dambach      r     10/27/2008 21:27:22 all.q@t3wn MASTER        
          1243 0.55500 sd_backgro dambach      r     10/27/2008 21:32:47 all.q@t3wn MASTER        
    t3wn08                  lx24-amd64      8     -   15.7G       -    1.9G       -
       
  • -q shows detailed information on the queues at each host; see the example below
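
For example, to inspect the queues on a single execution host (a minimal sketch; t3wn01 is one of the hosts listed above):
qhost -q -h t3wn01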

Why Won't My Job Run Correctly?

Does your job show the "Eqw" or "qw" state when you run qstat, and just sit there refusing to run? Get more information on what is wrong with it using:
qstat -j JOB_ID 
This command prints the reason (scheduler information) why your job just sits in the queue. For example:
[chen_z@t3ui01 sge]$ qstat -j 1264
==============================================================
job_number:                 1264
exec_file:                  job_scripts/1264
... ...
... ...
script_file:                test.job
scheduling info:            queue instance "all.q@t3wn08.psi.ch" dropped because it is temporarily not available
                            queue instance "all.q@t3wn07" dropped because it is disabled
                            queue instance "all.q@t3wn05" dropped because it is full
                            queue instance "all.q@t3wn01" dropped because it is full
                            queue instance "all.q@t3wn03" dropped because it is full
                            queue instance "all.q@t3wn04" dropped because it is full
                            (-l h_rt=460000) cannot run in queue "all.q@t3wn06" because it offers only qf:h_rt=4:00:30:00
                            (-l h_rt=460000) cannot run in queue "all.q@t3wn02" because it offers only qf:h_rt=4:00:30:00

   
So in this example the job does not run because the requested maximum run time (-l h_rt=460000) exceeds the run time limit (h_rt) offered by the remaining queues.
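
If the state is "Eqw" (error) rather than plain "qw", qstat -j JOB_ID also prints an error reason. Once the cause is fixed, on a standard SGE installation you can clear the error state so that the scheduler retries the job:
qmod -cj JOB_ID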

qacct - post-execution stats

qacct reports the post-execution statistics of a job; have a look at its parameters with qacct -help.

For instance, you might want to check your RAM usage during the last 30 days (successful jobs only); the pipeline below extracts jobnumber, jobname, exit_status and maxvmem for each job and keeps only the jobs that exited with status 0:

$ qacct -f /gridware/sge/default/common/accounting.complete -o $USER -d 30 -j  | egrep 'maxvmem|exit_status|jobnumber|jobname' | paste - - - - | grep "exit_status  0"

It is important to request the correct amount of maximum RAM (h_vmem) for your jobs: if you constantly and erroneously ask for too much RAM, your jobs might wait longer to start.
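
For instance, if qacct reports that your jobs peak at around 2G of maxvmem (the value is illustrative), request a limit just above that at submission time instead of a generous default:
qsub -l h_vmem=3G test.job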

qquota - current resource limits

qquota shows the current resource limits and who is using what within those limits; it is useful, for instance, to understand why your jobs are pending despite tens of free CPU cores:
$ qquota -u \*
resource quota rule limit                filter
--------------------------------------------------------------------------------
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn26
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn29
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn12
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn16
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn22
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn23
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn18
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn28
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn14
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn24
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn15
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn20
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn25
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn10
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn27
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn13
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn17
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn11
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn19
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn21
max_jobs_per_intel_host/1 slots=9/16           queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn33
max_jobs_per_intel_host/1 slots=9/16           queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn34
max_jobs_per_intel_host/1 slots=9/16           queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn40
max_jobs_per_intel_host/1 slots=9/16           queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn32
max_jobs_per_intel_host/1 slots=10/16          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn39
max_jobs_per_intel_host/1 slots=10/16          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn36
max_jobs_per_intel_host/1 slots=9/16           queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn38
max_jobs_per_intel_host/1 slots=11/16          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn35
max_jobs_per_intel_host/1 slots=10/16          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn30
max_jobs_per_intel_host/1 slots=10/16          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn31
max_jobs_per_intel_host/1 slots=10/16          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn37
max_jobs_per_intel2_host/1 slots=53/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn59
max_jobs_per_intel2_host/1 slots=54/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn51
max_jobs_per_intel2_host/1 slots=54/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn52
max_jobs_per_intel2_host/1 slots=53/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn58
max_jobs_per_intel2_host/1 slots=51/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn56
max_jobs_per_intel2_host/1 slots=54/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn54
max_jobs_per_intel2_host/1 slots=55/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn53
max_jobs_per_intel2_host/1 slots=55/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn55
max_jobs_per_intel2_host/1 slots=53/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn57
max_allq_jobs/1    slots=740/740        queues all.q,long.q
max_longq_jobs/1   slots=99/360         queues long.q
max_user_jobs_per_queue/1 slots=396/400        users ursl queues all.q
max_user_jobs_per_queue/1 slots=237/400        users cgalloni queues all.q
max_user_jobs_per_queue/1 slots=6/400          users grauco queues all.q
max_user_jobs_per_queue/1 slots=2/400          users gaperrin queues all.q
max_user_jobs_per_queue/2 slots=8/460          users ursl queues short.q
max_user_jobs_per_queue/3 slots=96/340         users ursl queues long.q
max_user_jobs_per_queue/3 slots=3/340          users pandolf queues long.q
max_jobs_per_user/1 slots=500/500        users ursl queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q
max_jobs_per_user/1 slots=237/500        users cgalloni queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q
max_jobs_per_user/1 slots=6/500          users grauco queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q
max_jobs_per_user/1 slots=2/500          users gaperrin queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q
max_jobs_per_user/1 slots=3/500          users pandolf queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q 
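
To see only the rules that currently affect your own jobs, restrict the report to your user:
qquota -u $USER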

The agreed T3 policies, and in particular the batch system policies, are documented on Tier3Policies.
