How to manage jobs with SGE Utilities

SGE provides many command-line utilities and a GUI program (qmon) to interact with the Sun Grid Engine software.

For information on how to submit jobs, please consult this page.

Command Line Client Commands

qstat - show job/queue status

  • with no arguments, the command shows your currently running/pending jobs
    qstat
    
    job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
    -----------------------------------------------------------------------------------------------------------------
       1261 0.55500 chen_z_cra chen_z       qw    10/27/2008 10:17:41                                    1        
       1262 0.55500 chen_z_cra chen_z       qw    10/27/2008 10:17:41                                    1    
        
  • the -u flag can be used to look at other users' jobs. You can use the wildcard '*' to specify all users:
    qstat -u '*'

  • the -f flag shows a full listing of all queues and related information.
    [chen_z@t3ui01 ~]$ qstat -f
    queuename                      qtype used/tot. load_avg arch          states
    ----------------------------------------------------------------------------
    all.q@t3wn01                   BIP   8/8       8.00     lx24-amd64    
    ----------------------------------------------------------------------------
    all.q@t3wn02                   BIP   8/8       8.05     lx24-amd64    
    ----------------------------------------------------------------------------
    all.q@t3wn03                   BIP   8/8       8.00     lx24-amd64    
    ----------------------------------------------------------------------------
    all.q@t3wn04                   BIP   8/8       8.00     lx24-amd64    
    ----------------------------------------------------------------------------
    all.q@t3wn05                   BIP   8/8       8.00     lx24-amd64    
    ----------------------------------------------------------------------------
    all.q@t3wn06                   BIP   8/8       7.87     lx24-amd64    
    ----------------------------------------------------------------------------
    all.q@t3wn07                   BIP   8/8       8.00     lx24-amd64    d
    ----------------------------------------------------------------------------
    all.q@t3wn08.psi.ch            BIP   0/8       -NA-     lx24-amd64    au
    
    ############################################################################
     - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
    ############################################################################
       1261 0.55500 chen_z_cra chen_z       qw    10/27/2008 10:17:41     1        
       1262 0.55500 chen_z_cra chen_z       qw    10/27/2008 10:17:41     1        
       

  • -j shows detailed information on pending/running jobs
    [chen_z@t3ui01 ~]$ qstat -j
    scheduling info:            queue instance "all.q@t3wn08.psi.ch" dropped because it is temporarily not available
                                queue instance "all.q@t3wn07" dropped because it is disabled
                                queue instance "all.q@t3wn05" dropped because it is full
                                queue instance "all.q@t3wn01" dropped because it is full
                                queue instance "all.q@t3wn06" dropped because it is full
                                queue instance "all.q@t3wn03" dropped because it is full
                                queue instance "all.q@t3wn02" dropped because it is full
                                queue instance "all.q@t3wn04" dropped because it is full
                                All queues dropped because of overload or full
        

qdel - delete a job from the queue

Syntax:
qdel JOB_ID
Use the qstat command to find a job's ID.
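
For example (the job IDs below are illustrative), you can delete one job, several jobs at once, or, on a standard SGE installation, all of your own jobs:
qdel 1261
qdel 1261 1262
qdel -u $USER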

qhost - show job/host status

  • with no arguments, the command shows a table of all execution hosts and information about their configuration
    [chen_z@t3ui01 ~]$ qhost
    HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
    -------------------------------------------------------------------------------
    global                  -               -     -       -       -       -       -
    t3wn01                  lx24-amd64      8  8.00   15.7G   10.2G    1.9G  208.0K
    t3wn02                  lx24-amd64      8  7.97   15.7G    6.8G    1.9G  208.0K
    t3wn03                  lx24-amd64      8  8.00   15.7G    8.3G    1.9G  208.0K
    t3wn04                  lx24-amd64      8  8.05   15.7G    7.0G    1.9G  208.0K
    t3wn05                  lx24-amd64      8  8.01   15.7G   10.7G    1.9G  208.0K
    t3wn06                  lx24-amd64      8  8.25   15.7G    7.9G    1.9G    4.3M
    t3wn07                  lx24-amd64      8  8.00   15.7G    9.0G    1.9G  208.0K
    t3wn08                  lx24-amd64      8     -   15.7G       -    1.9G       -
       
  • -j shows detailed information on pending/running jobs, grouped by worker node
    [chen_z@t3ui01 ~]$ qhost -j
    HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
    -------------------------------------------------------------------------------
    global                  -               -     -       -       -       -       -
    t3wn01                  lx24-amd64      8  8.00   15.7G   10.2G    1.9G  208.0K
       job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID 
       ----------------------------------------------------------------------------------------------
          1218 0.55500 sd_backgro dambach      r     10/27/2008 19:23:01 all.q@t3wn MASTER        
          1220 0.55500 sd_backgro dambach      r     10/27/2008 19:37:19 all.q@t3wn MASTER        
          1222 0.55500 sd_backgro dambach      r     10/27/2008 19:44:44 all.q@t3wn MASTER        
          1224 0.55500 sd_backgro dambach      r     10/27/2008 20:11:16 all.q@t3wn MASTER        
          1225 0.55500 sd_backgro dambach      r     10/27/2008 20:16:09 all.q@t3wn MASTER        
          1227 0.55500 sd_backgro dambach      r     10/27/2008 20:24:35 all.q@t3wn MASTER        
          1229 0.55500 sd_backgro dambach      r     10/27/2008 20:30:56 all.q@t3wn MASTER        
          1230 0.55500 sd_backgro dambach      r     10/27/2008 20:32:12 all.q@t3wn MASTER        
    t3wn02                  lx24-amd64      8  7.91   15.7G    7.0G    1.9G  208.0K
          1177 0.55500 sd_backgro dambach      r     10/27/2008 08:55:35 all.q@t3wn MASTER        
          1201 0.55500 sd_backgro dambach      r     10/27/2008 08:55:35 all.q@t3wn MASTER        
          1237 0.55500 sd_backgro dambach      r     10/27/2008 21:13:36 all.q@t3wn MASTER        
    ... ... 
    ... ...       
    t3wn07                  lx24-amd64      8  8.02   15.7G    9.1G    1.9G  208.0K
          1221 0.55500 sd_backgro dambach      r     10/27/2008 19:40:54 all.q@t3wn MASTER        
          1226 0.55500 sd_backgro dambach      r     10/27/2008 20:20:15 all.q@t3wn MASTER        
          1228 0.55500 sd_backgro dambach      r     10/27/2008 20:29:10 all.q@t3wn MASTER        
          1231 0.55500 sd_backgro dambach      r     10/27/2008 20:40:06 all.q@t3wn MASTER        
          1233 0.55500 sd_backgro dambach      r     10/27/2008 20:58:59 all.q@t3wn MASTER        
          1236 0.55500 sd_backgro dambach      r     10/27/2008 21:09:43 all.q@t3wn MASTER        
          1239 0.55500 sd_backgro dambach      r     10/27/2008 21:27:22 all.q@t3wn MASTER        
          1243 0.55500 sd_backgro dambach      r     10/27/2008 21:32:47 all.q@t3wn MASTER        
    t3wn08                  lx24-amd64      8     -   15.7G       -    1.9G       -
       
  • -q shows detailed information on the queues at each host; see the example below
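
For example, to inspect the queues on a single execution host (a minimal sketch; t3wn01 is one of the hosts listed above):
qhost -q -h t3wn01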

Why Won't My Job Run Correctly?

Does your job show the "Eqw" or "qw" state when you run qstat, and just sit there refusing to run? Get more information on what is wrong with it using:
qstat -j JOB_ID 
This command prints the reason (scheduler information) why your job just sits in the queue. For example:
[chen_z@t3ui01 sge]$ qstat -j 1264
==============================================================
job_number:                 1264
exec_file:                  job_scripts/1264
... ...
... ...
script_file:                test.job
scheduling info:            queue instance "all.q@t3wn08.psi.ch" dropped because it is temporarily not available
                            queue instance "all.q@t3wn07" dropped because it is disabled
                            queue instance "all.q@t3wn05" dropped because it is full
                            queue instance "all.q@t3wn01" dropped because it is full
                            queue instance "all.q@t3wn03" dropped because it is full
                            queue instance "all.q@t3wn04" dropped because it is full
                            (-l h_rt=460000) cannot run in queue "all.q@t3wn06" because it offers only qf:h_rt=4:00:30:00
                            (-l h_rt=460000) cannot run in queue "all.q@t3wn02" because it offers only qf:h_rt=4:00:30:00

   
So in this example the job does not run because the requested maximum run time (-l h_rt=460000) exceeds the run time limit (h_rt) offered by the remaining queues.
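
If the state is "Eqw" (error) rather than plain "qw", qstat -j JOB_ID also prints an error reason. Once the cause is fixed, on a standard SGE installation you can clear the error state so that the scheduler retries the job:
qmod -cj JOB_ID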

qacct - post-execution stats

qacct reports the post-execution statistics of a job; have a look at its parameters with qacct -help.

For instance, you might want to check your RAM usage during the last 30 days (successful jobs only); the pipeline below extracts jobnumber, jobname, exit_status and maxvmem for each job and keeps only the jobs that exited with status 0:

$ qacct -f /gridware/sge/default/common/accounting.complete -o $USER -d 30 -j  | egrep 'maxvmem|exit_status|jobnumber|jobname' | paste - - - - | grep "exit_status  0"

It is important to request the correct amount of maximum RAM (h_vmem) for your jobs: if you constantly and erroneously ask for too much RAM, your jobs might wait longer to start.
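
For instance, if qacct reports that your jobs peak at around 2G of maxvmem (the value is illustrative), request a limit just above that at submission time instead of a generous default:
qsub -l h_vmem=3G test.job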

qquota - current resource limits

qquota shows the current resource limits and who is using what within those limits; it is useful, for instance, to understand why your jobs are pending despite tens of free CPU cores:
$ qquota -u \*
resource quota rule limit                filter
--------------------------------------------------------------------------------
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn26
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn29
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn12
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn16
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn22
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn23
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn18
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn28
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn14
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn24
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn15
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn20
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn25
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn10
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn27
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn13
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn17
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn11
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn19
max_jobs_per_sun_host/1 slots=8/8            queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn21
max_jobs_per_intel_host/1 slots=9/16           queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn33
max_jobs_per_intel_host/1 slots=9/16           queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn34
max_jobs_per_intel_host/1 slots=9/16           queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn40
max_jobs_per_intel_host/1 slots=9/16           queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn32
max_jobs_per_intel_host/1 slots=10/16          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn39
max_jobs_per_intel_host/1 slots=10/16          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn36
max_jobs_per_intel_host/1 slots=9/16           queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn38
max_jobs_per_intel_host/1 slots=11/16          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn35
max_jobs_per_intel_host/1 slots=10/16          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn30
max_jobs_per_intel_host/1 slots=10/16          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn31
max_jobs_per_intel_host/1 slots=10/16          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn37
max_jobs_per_intel2_host/1 slots=53/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn59
max_jobs_per_intel2_host/1 slots=54/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn51
max_jobs_per_intel2_host/1 slots=54/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn52
max_jobs_per_intel2_host/1 slots=53/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn58
max_jobs_per_intel2_host/1 slots=51/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn56
max_jobs_per_intel2_host/1 slots=54/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn54
max_jobs_per_intel2_host/1 slots=55/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn53
max_jobs_per_intel2_host/1 slots=55/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn55
max_jobs_per_intel2_host/1 slots=53/64          queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q hosts t3wn57
max_allq_jobs/1    slots=740/740        queues all.q,long.q
max_longq_jobs/1   slots=99/360         queues long.q
max_user_jobs_per_queue/1 slots=396/400        users ursl queues all.q
max_user_jobs_per_queue/1 slots=237/400        users cgalloni queues all.q
max_user_jobs_per_queue/1 slots=6/400          users grauco queues all.q
max_user_jobs_per_queue/1 slots=2/400          users gaperrin queues all.q
max_user_jobs_per_queue/2 slots=8/460          users ursl queues short.q
max_user_jobs_per_queue/3 slots=96/340         users ursl queues long.q
max_user_jobs_per_queue/3 slots=3/340          users pandolf queues long.q
max_jobs_per_user/1 slots=500/500        users ursl queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q
max_jobs_per_user/1 slots=237/500        users cgalloni queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q
max_jobs_per_user/1 slots=6/500          users grauco queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q
max_jobs_per_user/1 slots=2/500          users gaperrin queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q
max_jobs_per_user/1 slots=3/500          users pandolf queues all.q,short.q,long.q,sherpa.gen.q,sherpa.int.long.q.sherpa.int.vlong.q 
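
To see only the rules that currently affect your own jobs, restrict the report to your user:
qquota -u $USER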

The agreed T3 policies, and in particular the batch system policies, are documented on Tier3Policies.
