Sun Grid Engine on CMS Tier3
Revision 22, 2019-02-14 10:46:33
This document describes the planning, installation, and basic configuration of SGE
on the PSI Tier3 cluster.
The advanced configuration and policies will be described in a separate document.
Useful links
Centrally starting/stopping the service
On the admin master node:
Starting:
ssh t3ce01 /etc/init.d/sgemaster start
cexec wn: /etc/init.d/sgeexecd start
Stopping:
cexec wn: /etc/init.d/sgeexecd stop
ssh t3ce01 /etc/init.d/sgemaster stop
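Note that the master is started before the execution daemons and stopped after them. The same sequence can be wrapped into a small helper script on the admin node; a minimal sketch (the script name is hypothetical, the commands are the ones above):
#!/bin/bash
# sge-cluster.sh (hypothetical helper): start/stop SGE in the right order
case "$1" in
  start)
    ssh t3ce01 /etc/init.d/sgemaster start   # master first
    cexec wn: /etc/init.d/sgeexecd start     # then the execution daemons
    ;;
  stop)
    cexec wn: /etc/init.d/sgeexecd stop      # execution daemons first
    ssh t3ce01 /etc/init.d/sgemaster stop    # master last
    ;;
  *)
    echo "Usage: $0 start|stop" >&2
    exit 1
    ;;
esac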
SGE Installation
Cluster Planning
Hosts
========================================
host      execute  submit  admin  master
----------------------------------------
t3ce01    N        Y       Y      Y
t3ui01    N        Y       Y      N
t3wn01    Y        Y       N      N
t3wn02    Y        Y       N      N
t3wn03    Y        Y       N      N
t3wn04    Y        Y       N      N
t3wn05    Y        Y       N      N
t3wn06    Y        Y       N      N
t3wn07    Y        Y       N      N
t3wn08    Y        Y       N      N
========================================
Environment
SGE_ROOT=/swshare/sge/n1ge6
SGE_CELL=tier3
SGE Admin
The SGE administrator account is sgeadmin.
SGE Services
# to be added to /etc/services
sge_qmaster 536/tcp # SGE batch system master
sge_execd 537/tcp # SGE batch system execd
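One possible way to push these entries to all nodes from the admin node (a sketch only; it reuses the node names from the cluster table and skips nodes that already have the entries):
for h in t3ce01 t3ui01 t3wn01 t3wn02 t3wn03 t3wn04 t3wn05 t3wn06 t3wn07 t3wn08; do
  # append the SGE service entries only if they are not there yet
  ssh "$h" 'grep -q "^sge_qmaster" /etc/services || printf "sge_qmaster 536/tcp # SGE batch system master\nsge_execd 537/tcp # SGE batch system execd\n" >> /etc/services'
done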
Prerequisites
- SGE_ROOT is NFS-mounted with root read/write access on all nodes.
- The root user can do passwordless SSH between the nodes (to be relaxed later).
- The sgeadmin HOME is NFS-mounted read/write on all nodes.
- The sgeadmin user can do passwordless SSH between any two nodes.
- The SGE services are defined in /etc/services on all nodes.
Here, "all nodes" means all the nodes listed in the cluster table above.
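A rough check script can verify these prerequisites before starting the installation (a sketch only; the script name is hypothetical and the node list is taken from the cluster table):
#!/bin/bash
# check_sge_prereqs.sh (hypothetical helper): verify the SGE prerequisites on all nodes
NODES="t3ce01 t3ui01 t3wn01 t3wn02 t3wn03 t3wn04 t3wn05 t3wn06 t3wn07 t3wn08"
for h in $NODES; do
  echo "=== $h ==="
  # root passwordless SSH and the NFS-mounted SGE_ROOT
  ssh -o BatchMode=yes root@$h 'test -d /swshare/sge/n1ge6 && echo "SGE_ROOT mounted" || echo "SGE_ROOT MISSING"'
  # sgeadmin passwordless SSH and a writable NFS-mounted HOME
  ssh -o BatchMode=yes sgeadmin@$h 'test -w "$HOME" && echo "sgeadmin HOME ok" || echo "sgeadmin HOME PROBLEM"'
  # the SGE services defined in /etc/services
  ssh -o BatchMode=yes root@$h 'grep -c "^sge_" /etc/services'
done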
Download and Install the SGE Software
mkdir /swshare/sge/download
cd /swshare/sge/download
rsync -av -e ssh markushin@pc4731:/scratch/sge /swshare/sge/download/
export SGE_ROOT=/swshare/sge/n1ge6
export SGE_CELL=tier3
mkdir $SGE_ROOT
cd $SGE_ROOT
pwd
/swshare/sge/n1ge6
tar xvzf /swshare/sge/download/sge/ge-6.1u4-common.tar.gz
tar xvzf /swshare/sge/download/sge/ge-6.1u4-bin-lx24-amd64.tar.gz
tree -d /swshare/sge/n1ge6
/swshare/sge/n1ge6
|-- 3rd_party
| `-- qmon
|-- bin
| `-- lx24-amd64
|-- catman
| |-- a_man
| | |-- cat5
| | `-- cat8
| |-- cat
| | |-- cat1
| | |-- cat3
| | |-- cat5
| | `-- cat8
| |-- p_man
| | `-- cat3
| `-- u_man
| `-- cat1
|-- ckpt
|-- doc
| |-- bdbdocs
| | `-- utility
| `-- javadocs
| |-- com
| | `-- sun
| | `-- grid
| | `-- drmaa
| |-- org
| | `-- ggf
| | `-- drmaa
| `-- resources
|-- dtrace
|-- examples
| |-- drmaa
| |-- jobs
| `-- jobsbin
| `-- lx24-amd64
|-- include
|-- lib
| `-- lx24-amd64
|-- man
| |-- man1
| |-- man3
| |-- man5
| `-- man8
|-- mpi
| |-- SunHPCT5
| `-- myrinet
|-- pvm
| `-- src
|-- qmon
| `-- PIXMAPS
| `-- big
|-- util
| |-- install_modules
| |-- rctemplates
| |-- resources
| | |-- calendars
| | |-- centry
| | |-- loadsensors
| | |-- pe
| | |-- schemas
| | | |-- qhost
| | | |-- qquota
| | | `-- qstat
| | |-- starter_methods
| | `-- usersets
| `-- sgeCA
`-- utilbin
`-- lx24-amd64
69 directories
Edit the SGE Configuration File
See the comments in $SGE_ROOT/util/install_modules/tier3.conf for details.
$SGE_ROOT/util/install_modules/tier3.conf:
grep -v ^# $SGE_ROOT/util/install_modules/tier3.conf | sed '/^$/d'
SGE_ROOT="/swshare/sge/n1ge6"
SGE_QMASTER_PORT="536"
SGE_EXECD_PORT="537"
CELL_NAME="tier3"
ADMIN_USER="sgeadmin"
QMASTER_SPOOL_DIR="/var/spool/sge/qmaster"
EXECD_SPOOL_DIR="/var/spool/sge"
GID_RANGE="50700-50800"
SPOOLING_METHOD="classic"
DB_SPOOLING_SERVER="none"
DB_SPOOLING_DIR="/var/spool/sge/spooldb"
PAR_EXECD_INST_COUNT="8"
ADMIN_HOST_LIST="t3admin01 t3ce01 t3ui01"
SUBMIT_HOST_LIST="t3ce01 t3ui01 t3wn01 t3wn02 t3wn03 t3wn04 t3wn05 t3wn06 t3wn07 t3wn08"
EXEC_HOST_LIST="t3wn01 t3wn02 t3wn03 t3wn04 t3wn05 t3wn06 t3wn07 t3wn08"
EXECD_SPOOL_DIR_LOCAL=""
HOSTNAME_RESOLVING="true"
SHELL_NAME="ssh"
COPY_COMMAND="scp"
DEFAULT_DOMAIN="none"
ADMIN_MAIL="none"
ADD_TO_RC="false"
SET_FILE_PERMS="true"
RESCHEDULE_JOBS="wait"
SCHEDD_CONF="1"
SHADOW_HOST=""
EXEC_HOST_LIST_RM=""
REMOVE_RC="true"
WINDOWS_SUPPORT="false"
WIN_ADMIN_NAME="Administrator"
WIN_DOMAIN_ACCESS="false"
CSP_RECREATE="true"
CSP_COPY_CERTS="false"
CSP_COUNTRY_CODE="CH"
CSP_STATE="Switzerland"
CSP_LOCATION="Building"
CSP_ORGA="PSI"
CSP_ORGA_UNIT="AIT"
CSP_MAIL_ADDRESS="derek.feichtinger@psi.ch"
Install the SGE Master
Login as root to the master host.
The SGE_ROOT and QMASTER_SPOOL_DIR directories must be writable by root
(see $SGE_ROOT/util/install_modules/tier3.conf).
Run the following commands:
hostname
t3ce01
whoami
root
export SGE_ROOT=/swshare/sge/n1ge6
export SGE_CELL=tier3
cd $SGE_ROOT
./inst_sge -m -auto $SGE_ROOT/util/install_modules/tier3.conf
...
Install log can be found in: \
/swshare/sge/n1ge6/tier3/common/install_logs/qmaster_install_t3ce01_2008-08-11_17:47:44.log
Starting qmaster installation!
Installing Grid Engine as user >sgeadmin<
Your $SGE_ROOT directory: /swshare/sge/n1ge6
Using SGE_QMASTER_PORT >536<.
Using SGE_EXECD_PORT >537<.
Using >tier3< as CELL_NAME.
Using >/var/spool/sge/qmaster< as QMASTER_SPOOL_DIR.
Verifying and setting file permissions and owner in >3rd_party<
Verifying and setting file permissions and owner in >bin<
Verifying and setting file permissions and owner in >ckpt<
Verifying and setting file permissions and owner in >examples<
Verifying and setting file permissions and owner in >inst_sge<
Verifying and setting file permissions and owner in >install_execd<
Verifying and setting file permissions and owner in >install_qmaster<
Verifying and setting file permissions and owner in >lib<
Verifying and setting file permissions and owner in >mpi<
Verifying and setting file permissions and owner in >pvm<
Verifying and setting file permissions and owner in >qmon<
Verifying and setting file permissions and owner in >util<
Verifying and setting file permissions and owner in >utilbin<
Verifying and setting file permissions and owner in >catman<
Verifying and setting file permissions and owner in >doc<
Verifying and setting file permissions and owner in >include<
Verifying and setting file permissions and owner in >man<
Your file permissions were set
Using >true< as IGNORE_FQDN_DEFAULT.
If it's >true<, the domain name will be ignored.
Making directories
Setting spooling method to dynamic
Dumping bootstrapping information
Initializing spooling database
Using >50700-50800< as gid range.
Using >/var/spool/sge< as EXECD_SPOOL_DIR.
Using >none< as ADMIN_MAIL.
Reading in complex attributes.
Adding default parallel environments (PE)
Reading in parallel environments:
PE "make.sge_pqs_api".
Reading in usersets:
Userset "deadlineusers".
Userset "defaultdepartment".
starting sge_qmaster
starting sge_schedd
starting up GE 6.1u4 (lx24-amd64)
Adding ADMIN_HOST t3admin01
t3admin01 added to administrative host list
Adding ADMIN_HOST t3ce01
adminhost "t3ce01" already exists
Adding ADMIN_HOST t3ui01
t3ui01 added to administrative host list
Adding SUBMIT_HOST t3ce01
t3ce01 added to submit host list
Adding SUBMIT_HOST t3ui01
t3ui01 added to submit host list
Adding SUBMIT_HOST t3wn01
t3wn01 added to submit host list
Adding SUBMIT_HOST t3wn02
t3wn02 added to submit host list
Adding SUBMIT_HOST t3wn03
t3wn03 added to submit host list
Adding SUBMIT_HOST t3wn04
t3wn04 added to submit host list
Adding SUBMIT_HOST t3wn05
t3wn05 added to submit host list
Adding SUBMIT_HOST t3wn06
t3wn06 added to submit host list
Adding SUBMIT_HOST t3wn07
t3wn07 added to submit host list
Adding SUBMIT_HOST t3wn08
t3wn08 added to submit host list
Creating the default queue and hostgroup
root@t3ce01 added "@allhosts" to host group list
root@t3ce01 added "all.q" to cluster queue list
No CSP system installed!
No CSP system installed!
Setting scheduler configuration to >Normal< setting!
changed scheduler configuration
sge_qmaster successfully installed!
Test the master:
ps auxwf | grep [s]ge
sgeadmin 1322 0.0 0.0 77744 3248 ? Sl 17:47 0:00 /swshare/sge/n1ge6/bin/lx24-amd64/sge_qmaster
sgeadmin 1341 0.0 0.0 66672 2228 ? Sl 17:47 0:00 /swshare/sge/n1ge6/bin/lx24-amd64/sge_schedd
export PATH=$SGE_ROOT/bin/lx24-amd64:$PATH
qconf -sconf
[root@t3ce01 n1ge6]# qconf -sconf
global:
execd_spool_dir /var/spool/sge
mailer /bin/mail
xterm /usr/bin/X11/xterm
load_sensor none
prolog none
epilog none
shell_start_mode posix_compliant
login_shells sh,ksh,csh,tcsh
min_uid 0
min_gid 0
user_lists none
xuser_lists none
projects none
xprojects none
enforce_project false
enforce_user auto
load_report_time 00:00:40
max_unheard 00:05:00
reschedule_unknown 00:00:00
loglevel log_warning
administrator_mail none
set_token_cmd none
pag_cmd none
token_extend_time none
shepherd_cmd none
qmaster_params none
execd_params none
reporting_params accounting=true reporting=false \
flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs 100
gid_range 50700-50800
qlogin_command telnet
qlogin_daemon /usr/sbin/in.telnetd
rlogin_daemon /usr/sbin/in.rlogind
max_aj_instances 2000
max_aj_tasks 75000
max_u_jobs 0
max_jobs 0
auto_user_oticket 0
auto_user_fshare 0
auto_user_default_project none
auto_user_delete_time 86400
delegated_file_staging false
reprioritize 0
Install the SGE Execute Hosts
NOTE: to run the install script below locally on an execute node, you first need to define it as an admin and submit host. This is done from an admin node with the
qconf -ah hostname
(admin host) and
qconf -as hostname
(submit host) commands. Otherwise, the new node will not be allowed to contact the master's services.
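For this cluster the worker nodes are already on the submit host list via SUBMIT_HOST_LIST in tier3.conf, so normally only the admin host registration is missing. One possible way to register all planned worker nodes in one go from the master:
qconf -ah t3wn01,t3wn02,t3wn03,t3wn04,t3wn05,t3wn06,t3wn07,t3wn08
qconf -as t3wn01,t3wn02,t3wn03,t3wn04,t3wn05,t3wn06,t3wn07,t3wn08   # only needed for nodes not yet on the submit host list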
On each execute host, where SGE_ROOT and EXECD_SPOOL_DIR must be writable by root (see $SGE_ROOT/util/install_modules/tier3.conf), do the following:
hostname
t3wn01
whoami
root
export SGE_ROOT=/swshare/sge/n1ge6
export SGE_CELL=tier3
mkdir /var/spool/sge/t3wn01
chown sgeadmin.root /var/spool/sge/t3wn01
cd $SGE_ROOT
./inst_sge -x -noremote -auto $SGE_ROOT/util/install_modules/tier3.conf
...
Install log can be found in: \
/swshare/sge/n1ge6/tier3/common/install_logs/execd_install_t3wn01_2008-08-11_21:04:43.lo
Note: If the script fails uncleanly, you can find the logs in /tmp/install.[PID].
Add the sgeexecd service manually:
cp -a $SGE_ROOT/$SGE_CELL/common/sgeexecd /etc/init.d/
chkconfig --add sgeexecd
chkconfig --list sgeexecd
sgeexecd 0:off 1:off 2:off 3:on 4:off 5:on 6:off
Check the SGE services on this host - there should be sge_execd running as sgeadmin on every execute host, e.g.:
ps auxwf | grep [s]ge
sgeadmin 25309 0.0 0.0 56108 1624 ? S 21:30 0:00 /swshare/sge/n1ge6/bin/lx24-amd64/sge_execd
Test some SGE commands:
export PATH=$SGE_ROOT/bin/lx24-amd64:$PATH
qconf -sel
t3wn01
qconf -sql
all.q
ls -lA /var/spool/sge/t3wn01
drwxr-xr-x 2 sgeadmin sgeadmin 4096 Aug 11 21:30 active_jobs
-rw-r--r-- 1 sgeadmin sgeadmin 6 Aug 11 21:30 execd.pid
drwxr-xr-x 2 sgeadmin sgeadmin 4096 Aug 11 21:30 jobs
drwxr-xr-x 2 sgeadmin sgeadmin 4096 Aug 11 21:30 job_scripts
-rw-r--r-- 1 sgeadmin sgeadmin 69 Aug 11 21:30 messages
How to remove SGE Execute Hosts from the configuration
Remove the host from the relevant queues:
qconf -mq all.q
Delete the host from its host group (e.g. the @allhosts group):
qconf -mhgrp @allhosts
Remove host from exec host list (and possibly also from admin and submission lists):
qconf -de t3vm02
qconf -dh t3vm02
qconf -ds t3vm02
Remove from configuration list:
qconf -dconf t3vm02
Files
The following files must be installed on all hosts:
/etc/profile.d/sge.sh
# SGE configuration for CMS Tier3
# 2008-08-11
export SGE_ROOT=/swshare/sge/n1ge6
export SGE_CELL=tier3
export PATH=$SGE_ROOT/bin/lx24-amd64:$PATH
export MANPATH=$MANPATH:$SGE_ROOT/man:
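One possible way to push this file to all nodes (a sketch; in practice this is handled by the configuration management / rsync area mentioned in the Bug Fixes section below):
for h in t3ce01 t3ui01 t3wn01 t3wn02 t3wn03 t3wn04 t3wn05 t3wn06 t3wn07 t3wn08; do
  scp /etc/profile.d/sge.sh root@$h:/etc/profile.d/sge.sh
done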
Install the SGE Submit Hosts
A submission host just needs access to the shared SGE installation for the binaries. It then needs to be configured as one of the allowed submit hosts by running the following command on the master:
qconf -as [hostname]
After that, you should be able to run commands like qhost from the new host.
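A minimal sequence for a hypothetical new submit host (t3ui02 is only an example name):
# on the master:
qconf -as t3ui02
# on the new submit host (needs the /swshare NFS mount and /etc/profile.d/sge.sh):
source /etc/profile.d/sge.sh
qhost
qstat -g c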
SGE Postinstallation Configuration
Configure SGE Queues
all.q configuration
Edit the queue configuration using the qconf -mq all.q command:
ssh sgeadmin@t3ce01
qconf -sq all.q > ~/config/all.q.orig.conf
export EDITOR=vim
qconf -mq all.q
sgeadmin@t3ce01 modified "all.q" in cluster queue list
qconf -sq all.q > ~/config/all.q.conf
diff /shome/sgeadmin/config/all.q.{orig.,}conf
17c17
< shell /bin/csh
---
> shell /bin/bash
20c20
< shell_start_mode posix_compliant
---
> shell_start_mode unix_behavior
35,38c35,38
< s_rt INFINITY
< h_rt INFINITY
< s_cpu INFINITY
< h_cpu INFINITY
---
> s_rt 48:00:00
> h_rt 48:30:00
> s_cpu 24:00:00
> h_cpu 24:30:00
qconf -sq all.q
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make
rerun FALSE
slots 1,[t3wn01=8],[t3wn02=8],[t3wn03=8]
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode unix_behavior
starter_method /shome/sgeadmin/t3scripts/starter_method.sh
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt 48:00:00
h_rt 48:30:00
s_cpu 24:00:00
h_cpu 24:30:00
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
Setting the queue's Grid/CMS Environment
In order to set the correct Grid environment on the worker nodes, the default starter method of the queue is overridden by a simple script:
#!/bin/bash
######### STARTER METHOD FOR SETTING USER'S ENVIRONMENT #####################
# settings for Grid credentials
if test x"$DBG" != x; then
echo "STARTER METHOD SCRIPT: Setting grid environment"
fi
source /swshare/glite/external/etc/profile.d/grid-env.sh
if test $? -ne 0; then
echo "WARNING: Failed to source grid environment" >&2
fi
#source $VO_CMS_SW_DIR/cmsset_default.sh
#if test $? -ne 0; then
# echo "WARNING: Failed to source the CMS environment ($VO_CMS_SW_DIR/cmsset_default.sh)" >&2
#fi
# now we execute the real job script
exec "$@"
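The script is then hooked into the queue through its starter_method attribute (compare the all.q dump above, which already points at /shome/sgeadmin/t3scripts/starter_method.sh); a sketch of the relevant line when editing with qconf -mq all.q:
qconf -mq all.q
...
starter_method /shome/sgeadmin/t3scripts/starter_method.sh
...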
Printing out accounting information at the end of a job
One can use a queue's epilog setting to execute a script at the end of every job (use qconf -mq). E.g. this script will attach some accounting information to the job's stdout (file /shome/sgeadmin/t3scripts/epilog.sh):
echo "# JOB Resource USAGE for job $JOB_ID:"
echo -n "# "
/swshare/sge/n1ge6/bin/lx24-amd64/qstat -j "$JOB_ID"| grep -e '^usage.*cpu='
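The queue's epilog attribute would then point at this script; a sketch of the relevant line when editing with qconf -mq all.q:
qconf -mq all.q
...
epilog /shome/sgeadmin/t3scripts/epilog.sh
...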
Configuring the scheduling policy
The original configuration did no real fair-share scheduling. After reading up a bit on how to implement share tree policies, and seeing that this needs a lot of configuration and additional maintenance of user lists (I think), I decided to go for the simpler functional policy mentioned on a mailing list.
Modify the master configuration
# qconf -mconf
...
enforce_user auto
...
auto_user_fshare 100
...
And then the scheduler configuration
# qconf -msconf
...
weight_tickets_functional 10000
...
The rationale for doing this is described as follows by an expert (Chris @ gridengine.info):
If you are only using the functional policy in a way described by that article, then ...
- The number "10000" shown in that configuration suggestion is arbitrary
- Any number higher than zero simply "turns on" the policy within the scheduler
- The number of functional tickets you have does not matter all that much
- The *ratio* of tickets you hold vs. tickets others hold matters very very much
- No relation to halftime
In the simple functional setup described in that article the key is
that we (a) enable the functional policy by telling SGE there are
10000 tickets in the system and (b) we automatically assign every user
100 functional share tickets.
What makes the scheduling policy then become "fair" is the fact that
all users have the same number/ratio of functional share tickets
(100). This makes them all get treated equally by the scheduler.
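Once set, the two values can be verified without opening an editor; qconf -sconf and qconf -ssconf print the current global and scheduler configurations:
qconf -sconf | grep -E 'enforce_user|auto_user_fshare'
qconf -ssconf | grep weight_tickets_functional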
Configuring time dependent resource quota sets
Resource quota sets:
- Syntax
- Multiple rule sets contain one or more rules
- keywords: users, projects, queues (cluster queues), hosts, or pes (parallel environments)
- First matching rule from each set wins
- Strictest rule set wins
- Rules can contain
- Wildcard (*)
- Logical not operator (!)
- Brackets ({}): Means “apply this rule to each list member, singly” instead of to the group as a whole.
Example:
{
name max_user_jobs_per_queue
description Limit a user to a maximal number of concurrent jobs in each \
queue
enabled TRUE
limit users {*} queues all.q to slots=50
limit users {*} queues short.q to slots=80
}
{
name max_allq_jobs
description limit all.q to a maximal number of slots
enabled TRUE
limit queues all.q to slots=80
}
The rules can be set/viewed with the usual SGE variations of the qconf command:
qconf -srqs # show all the sets
qconf -srqs max_allq_jobs # show a specific set
qconf -Mrqs file.rqs # replace all sets by the ones found in file.rqs
qconf -mrqs # modify the existing sets in the editor
Time dependent resource quota sets:
We implemented a change of the rules based on day, night, and weekend. This is controlled by a cron job in /etc/cron.d/change_sge_policies; a possible sketch is shown below.
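The cron job itself is not reproduced here. A minimal sketch of how such a switch could look, assuming one .rqs file per period kept in a scripts directory (the wrapper name and the file locations are assumptions):
#!/bin/bash
# set_rqs.sh (hypothetical wrapper invoked from /etc/cron.d/change_sge_policies):
# replace the active resource quota sets with the rules for "day", "night" or "weekend"
source /etc/profile.d/sge.sh
RULES_DIR=/shome/sgeadmin/t3scripts/rqs        # assumed location of the .rqs files
PERIOD=${1:?usage: $0 day|night|weekend}
qconf -Mrqs "$RULES_DIR/$PERIOD.rqs"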
Configure a parallel environment for SMP parallel jobs
Show all existing parallel environments:
qconf -spl
Define a new parallel environment "smp":
qconf -ap smp
pe_name smp
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots # this forces all slots to be on the same host!
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
Now, this environment must be added to all queues that should give access to it:
qconf -mq all.q
...
pe_list smp make
...
Users can submit jobs by using the -pe flag together with the environment name and the number of requested slots:
qsub -pe smp 4 myjob.sge
Limiting memory consumption on a per host basis
The h_vmem complex property is the hard limit on job memory consumption. This is actually enforced by SGE, and a job will be killed when it tries to allocate beyond this limit.
In order to do correct bookkeeping for jobs already present on the node, it is necessary to declare this property to be a "consumable" property. Also, one should immediately assign a default value for jobs which do not explicitly declare the requirement.
This can be done by editing the complex configuration:
qconf -mc
#name shortcut type relop requestable consumable default urgency
#----------------------------------------------------------------------------------------
...
h_vmem h_vmem MEMORY <= YES YES 2.5g 0
...
Now, one can assign explicit h_vmem limits to hosts using:
qconf -me t3wn04
hostname t3wn04.psi.ch
load_scaling NONE
complex_values h_vmem=15g
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
A user can now declare the vmem requirement in the submit statement:
qsub -l h_vmem=10g simple_env.sge
qsub -pe smp 4 -l h_vmem=2g simple_env.sge
Note that the requirement is per requested slot, so in the latter case the total required vmem is 8 GB!
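The currently available h_vmem can be checked per host and per queue instance with the -F option restricted to the h_vmem resource:
qhost -F h_vmem
qstat -F h_vmem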
SGE Basic Tests
Available Queues and Slots
which qstat
/swshare/sge/n1ge6/bin/lx24-amd64/qstat
# Queues and slots (see "man qstat" for details):
qstat -g c
CLUSTER QUEUE CQLOAD USED AVAIL TOTAL aoACDS cdsuE
-------------------------------------------------------------------------------
all.q 0.00 0 16 16 0 0
# Show the execute hosts (only the installed hosts are shown):
qconf -sel
t3wn01
t3wn02
t3wn03
# Show the list of queues:
qconf -sql
all.q
Show the admin hosts (see "man qconf" for details):
qconf -sh
t3admin01
t3ce01
t3ui01
t3wn01
t3wn02
t3wn03
t3wn04
t3wn05
t3wn06
t3wn07
t3wn08
Show the given execution host:
qconf -se t3wn01
hostname t3wn01
load_scaling NONE
complex_values NONE
load_values load_avg=0.000000,load_short=0.000000, \
load_medium=0.000000,load_long=0.000000,arch=lx24-amd64, \
num_proc=8,mem_free=15851.125000M, \
swap_free=1992.425781M,virtual_free=17843.550781M, \
mem_total=16033.703125M,swap_total=1992.425781M, \
virtual_total=18026.128906M,mem_used=182.578125M, \
swap_used=0.000000M,virtual_used=182.578125M, \
cpu=0.000000,np_load_avg=0.000000, \
np_load_short=0.000000,np_load_medium=0.000000, \
np_load_long=0.000000
processors 8
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
Show the given queue:
qconf -sq all.q
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make
rerun FALSE
slots 1,[t3wn01=8],[t3wn02=8],[t3wn03=8]
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode unix_behavior
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt 48:00:00
h_rt 48:30:00
s_cpu 24:00:00
h_cpu 24:30:00
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
Test Jobs
Use the simple_env.sge script to submit a simple single-CPU job:
qsub ./simple_env.sge
Your job 2 ("simple_env") has been submitted
qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2 0.00000 simple_env sgeadmin qw 08/13/2008 14:12:44 1
qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2 0.55500 simple_env sgeadmin r 08/13/2008 14:12:49 all.q@t3wn01 1
ls -lA
...
-rw-r--r-- 1 sgeadmin sgeadmin 2245 Aug 13 14:12 simple_env.o2
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:12 simple_env.e2
simple_env.sge
#!/bin/bash
# SGE single-CPU job example
### Job name
#$ -N simple_env
### Run time soft and hard limits hh:mm:ss
#$ -l s_rt=00:01:00,h_rt=00:01:30
### Change to the current working directory
#$ -cwd
MY_HOST=`hostname`
MY_DATE=`date`
echo "Running on $MY_HOST at $MY_DATE"
echo "Running environment:"
env
echo "================================================================"
# Put your single-CPU script here
################################################################################
Use the simple_job_array.sge script to test a job array:
qsub -q all.q -t 1-16 ./simple_job_array.sge
Your job-array 3.1-16:1 ("simple_job_array") has been submitted
qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
3 0.55500 simple_job sgeadmin r 08/13/2008 14:25:49 all.q@t3wn01 1 16
ls -lA simple_job_array*
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.1
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.10
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.11
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.12
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.13
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.14
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.15
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.16
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.2
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.3
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.4
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.5
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.6
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.7
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.8
-rw-r--r-- 1 sgeadmin sgeadmin 0 Aug 13 14:25 simple_job_array.e3.9
-rw-r--r-- 1 sgeadmin sgeadmin 731 Aug 13 14:25 simple_job_array.o3.1
-rw-r--r-- 1 sgeadmin sgeadmin 736 Aug 13 14:25 simple_job_array.o3.10
-rw-r--r-- 1 sgeadmin sgeadmin 736 Aug 13 14:25 simple_job_array.o3.11
-rw-r--r-- 1 sgeadmin sgeadmin 736 Aug 13 14:25 simple_job_array.o3.12
-rw-r--r-- 1 sgeadmin sgeadmin 736 Aug 13 14:25 simple_job_array.o3.13
-rw-r--r-- 1 sgeadmin sgeadmin 736 Aug 13 14:25 simple_job_array.o3.14
-rw-r--r-- 1 sgeadmin sgeadmin 736 Aug 13 14:25 simple_job_array.o3.15
-rw-r--r-- 1 sgeadmin sgeadmin 736 Aug 13 14:25 simple_job_array.o3.16
-rw-r--r-- 1 sgeadmin sgeadmin 731 Aug 13 14:25 simple_job_array.o3.2
-rw-r--r-- 1 sgeadmin sgeadmin 731 Aug 13 14:25 simple_job_array.o3.3
-rw-r--r-- 1 sgeadmin sgeadmin 731 Aug 13 14:25 simple_job_array.o3.4
-rw-r--r-- 1 sgeadmin sgeadmin 731 Aug 13 14:25 simple_job_array.o3.5
-rw-r--r-- 1 sgeadmin sgeadmin 731 Aug 13 14:25 simple_job_array.o3.6
-rw-r--r-- 1 sgeadmin sgeadmin 731 Aug 13 14:25 simple_job_array.o3.7
-rw-r--r-- 1 sgeadmin sgeadmin 731 Aug 13 14:25 simple_job_array.o3.8
-rw-r--r-- 1 sgeadmin sgeadmin 731 Aug 13 14:25 simple_job_array.o3.9
-rw-r--r-- 1 sgeadmin sgeadmin 100 Aug 13 14:25 simple_job_array.out.3-1
-rw-r--r-- 1 sgeadmin sgeadmin 101 Aug 13 14:25 simple_job_array.out.3-10
-rw-r--r-- 1 sgeadmin sgeadmin 101 Aug 13 14:25 simple_job_array.out.3-11
-rw-r--r-- 1 sgeadmin sgeadmin 101 Aug 13 14:25 simple_job_array.out.3-12
-rw-r--r-- 1 sgeadmin sgeadmin 101 Aug 13 14:25 simple_job_array.out.3-13
-rw-r--r-- 1 sgeadmin sgeadmin 101 Aug 13 14:25 simple_job_array.out.3-14
-rw-r--r-- 1 sgeadmin sgeadmin 101 Aug 13 14:25 simple_job_array.out.3-15
-rw-r--r-- 1 sgeadmin sgeadmin 101 Aug 13 14:25 simple_job_array.out.3-16
-rw-r--r-- 1 sgeadmin sgeadmin 100 Aug 13 14:25 simple_job_array.out.3-2
-rw-r--r-- 1 sgeadmin sgeadmin 100 Aug 13 14:25 simple_job_array.out.3-3
-rw-r--r-- 1 sgeadmin sgeadmin 100 Aug 13 14:25 simple_job_array.out.3-4
-rw-r--r-- 1 sgeadmin sgeadmin 100 Aug 13 14:25 simple_job_array.out.3-5
-rw-r--r-- 1 sgeadmin sgeadmin 100 Aug 13 14:25 simple_job_array.out.3-6
-rw-r--r-- 1 sgeadmin sgeadmin 100 Aug 13 14:25 simple_job_array.out.3-7
-rw-r--r-- 1 sgeadmin sgeadmin 100 Aug 13 14:25 simple_job_array.out.3-8
-rw-r--r-- 1 sgeadmin sgeadmin 100 Aug 13 14:25 simple_job_array.out.3-9
grep t3wn simple_job_array.out.*
simple_job_array.out.3-1:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=1 on t3wn02 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-10:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=10 on t3wn01 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-11:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=11 on t3wn03 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-12:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=12 on t3wn02 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-13:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=13 on t3wn02 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-14:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=14 on t3wn03 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-15:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=15 on t3wn01 at Wed Aug 13 14:25:50 CEST 2008
simple_job_array.out.3-16:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=16 on t3wn01 at Wed Aug 13 14:25:50 CEST 2008
simple_job_array.out.3-2:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=2 on t3wn03 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-3:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=3 on t3wn01 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-4:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=4 on t3wn01 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-5:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=5 on t3wn03 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-6:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=6 on t3wn02 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-7:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=7 on t3wn02 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-8:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=8 on t3wn03 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.out.3-9:Running job JOB_NAME=simple_job_array task SGE_TASK_ID=9 on t3wn01 at Wed Aug 13 14:25:49 CEST 2008
simple_job_array.sge
#!/bin/bash
# SGE single-CPU job array example
### Job name
#$ -N simple_job_array
### Run time soft and hard limits hh:mm:ss
#$ -l s_rt=00:01:00,h_rt=00:01:30
### Change to the current working directory
#$ -cwd
### Export some environment variables:
#$ -v MY_PREFIX=simple_job_array.out
MY_HOST=`hostname`
MY_DATE=`date`
# (the heredoc body is reconstructed here: it writes one line per task, matching
#  the $MY_PREFIX.$JOB_ID-$SGE_TASK_ID output files shown above)
cat <<EOF | tee "$MY_PREFIX.$JOB_ID-$SGE_TASK_ID"
Running job JOB_NAME=$JOB_NAME task SGE_TASK_ID=$SGE_TASK_ID on $MY_HOST at $MY_DATE
EOF
Bug Fixes
The directory /etc/init.d must be a link
Note: due to a glitch in the configuration management area (the management uses rsync from a config area), /etc/init.d had been replaced by a real directory. The directory /etc/init.d must be a link:
/etc/init.d -> rc.d/init.d
(strange things may start to happen if this is not the case). Fix it on t3ce01:
[root@t3ce01 rc3.d]# date
Mon Aug 11 22:22:10 CEST 2008
ls -lA /etc/init.d
-rwxr-xr-x 1 root root 1243 Aug 11 10:06 gmond
-rwxr-xr-x 1 root root 4210 Aug 11 09:58 ramdisk
-rwxr-xr-x 1 sgeadmin sgeadmin 15679 Aug 11 17:47 sgemaster
cp -a /etc/init.d/sgemaster /etc/rc.d/init.d/
ls -lAtr /etc/rc.d/init.d/
...
-rwxr-xr-x 1 root root 4210 Aug 11 09:58 ramdisk
-rwxr-xr-x 1 root root 1243 Aug 11 10:06 gmond
-rwxr-xr-x 1 sgeadmin sgeadmin 15679 Aug 11 17:47 sgemaster
rm -rf /etc/init.d
ln -s /etc/rc.d/init.d /etc/init.d
# Now chkconfig works as it should (it did not before):
chkconfig --add sgemaster
chkconfig --list sgemaster
sgemaster 0:off 1:off 2:off 3:on 4:off 5:on 6:off
Troubleshooting
Installation Troubleshooting
Missing output
The inst_sge script tries to hide its output (omitting my comments on its design), so nothing may be printed on the console if things go wrong, even if you uncomment the "# set -x" line. If this happens, check the file(s) /tmp/install.NNNNN for possible reasons, like:
Command failed: mkdir -p /var/spool/sge/qmaster
This is not a qmaster host!
On a start of the SGE master on t3ce01 I got this error message:
/etc/init.d/sgemaster start
sge_qmaster didn't start!
This is not a qmaster host!
Please, check your act_qmaster file!
Check what $SGE_ROOT/utilbin/lx24-amd64/gethostname returns as hostname. The entry in $SGE_ROOT/$SGE_CELL/common/act_qmaster must exactly match this name. In my (Derek's) case the hostname returned by the tool was t3ce01.psi.ch, while the file only contained t3ce01.
Afterwards I got the following message during startup:
/etc/init.d/sgemaster start
starting sge_qmaster
starting sge_schedd
local configuration t3ce01.psi.ch not defined - using global configuration
starting up GE 6.1u4 (lx24-amd64)
Taking a closer look at the startup with strace reveals that SGE is looking for an entry for t3ce01.psi.ch in the $SGE_ROOT/$SGE_CELL/common/local_conf directory. Since there had not been one for t3ce01 before, I ignored this.
"This hostname is not known at qmaster as an administrative host"
This message is written to the log file when you try to execute a command that can be run only on an admin host, e.g.:
./inst_sge -x -noremote -auto $SGE_ROOT/util/install_modules/tier3.conf
Your $SGE_ROOT directory: /swshare/sge/n1ge6
Using cell: >tier3<
Installation failed!
This hostname is not known at qmaster as an administrative host.
Solution: log in to any admin host and add the new host to the administrative hosts using the qconf -ah command:
export SGE_ROOT=/swshare/sge/n1ge6
export SGE_CELL=tier3
export PATH=$SGE_ROOT/bin/lx24-amd64:$PATH
qconf -ah t3wn01,t3wn02,t3wn03,t3wn04,t3wn05,t3wn06,t3wn07,t3wn08
t3wn01 added to administrative host list
t3wn02 added to administrative host list
t3wn03 added to administrative host list
t3wn04 added to administrative host list
t3wn05 added to administrative host list
t3wn06 added to administrative host list
t3wn07 added to administrative host list
t3wn08 added to administrative host list
qconf -sh
t3admin01
t3ce01
t3ui01
t3wn01
t3wn02
t3wn03
t3wn04
t3wn05
t3wn06
t3wn07
t3wn08
You can add hosts even if they are not available yet.
"Local execd spool directory [undef] is not a valid path"
The reason must be investigated, but a workaround is simple: create the corresponding directory manually, e.g.:
mkdir /var/spool/sge/t3wn01
chown sgeadmin.root /var/spool/sge/t3wn01
Installation Log Files
ls -lA /swshare/sge/n1ge6/tier3/common/install_logs
-rw-r--r-- 1 sgeadmin sgeadmin 159 Aug 11 21:04 execd_install_t3wn01_2008-08-11_21:04:43.log
-rw-r--r-- 1 sgeadmin sgeadmin 567 Aug 11 21:20 execd_install_t3wn01_2008-08-11_21:20:31.log
-rw-r--r-- 1 sgeadmin sgeadmin 15249 Aug 11 21:24 execd_install_t3wn01_2008-08-11_21:24:51.log
-rw-r--r-- 1 sgeadmin sgeadmin 15249 Aug 11 21:27 execd_install_t3wn01_2008-08-11_21:27:38.log
-rw-r--r-- 1 sgeadmin sgeadmin 15117 Aug 11 21:30 execd_install_t3wn01_2008-08-11_21:30:50.log
-rw-r--r-- 1 sgeadmin sgeadmin 707 Aug 11 21:31 execd_install_t3wn01_2008-08-11_21:31:00.log
-rw-r--r-- 1 sgeadmin sgeadmin 443 Aug 11 21:57 execd_install_t3wn02_2008-08-11_21:57:48.log
-rw-r--r-- 1 sgeadmin sgeadmin 3068 Aug 11 17:47 qmaster_install_t3ce01_2008-08-11_17:47:44.log
Uninstalling Execution Hosts
Note - Uninstall all compute hosts before you uninstall the master host. If you uninstall the master host first, you have to uninstall all execution hosts manually.
During the execution host uninstallation, all configuration information for the targeted hosts is deleted. The uninstallation first tries to stop the execution hosts gracefully, then shuts down the execution daemons, and finally removes their configuration and their global or local spool directories.
$SGE_ROOT/util/install_modules/tier3.conf has a section for identifying hosts that can be uninstalled automatically:
# Remove this execution hosts in automatic mode
EXEC_HOST_LIST_RM="host1 host2 host3 host4"
Every host in the EXEC_HOST_LIST_RM list will be automatically removed from the cluster.
To start the automatic uninstallation of execution hosts, type the following command:
./inst_sge -ux -auto $SGE_ROOT/util/install_modules/tier3.conf
For more information, please consult the page:
http://docs.sun.com/app/docs/doc/820-0697/gesal?a=view
Runtime Troubleshooting
Logfiles
The default log file for the master can be found in the master's spool area at /var/spool/sge/qmaster/messages.