
Basic job submission and monitoring on the Tier-3

The Tier-3 cluster is running the Sun Grid Engine batch queueing system. You can submit any shell script to the batch system, and you will get back the stdout and stderr.

Special things to note

  • The User Interface node as well as the worker nodes offer a fully working Grid environment, i.e. you can interact with our and others' storage elements
  • Your home directory /shome/$username is shared between all nodes
  • Intensive IO should be done locally in the /scratch/$username directory on the worker node.
  • Big result files should be written to our dCache storage element (SE). Your user area on the element is located here
     srm://t3se01.psi.ch:8443/srm/managerv2?SFN=/pnfs/psi.ch/cms/trivcat/store/user/$your-CMS-hn-name 
  • You need a valid Grid proxy certificate for interacting with the SE. So, if you plan to run a job which lasts for 30 hours, you need a Grid proxy valid for at least this time:
     voms-proxy-init -voms cms -valid 32 
    (Your proxy is saved to your shared home directory as $HOME/.x509up_u${UID}, so it is visible from all worker nodes. Use the -valid flag instead of the older -hours flag, since -valid extends the lifetime of both your proxy and the VOMS extensions granted by CMS.)
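As a concrete sketch of the proxy rule above, you can derive the validity to request from the expected job runtime plus a safety margin (the 2-hour margin and the variable names are illustrative assumptions, not site policy):

```shell
#!/bin/bash
# Compute the proxy validity to request: expected runtime plus a margin.
JOB_RUNTIME_HOURS=30   # how long the job is expected to run
MARGIN_HOURS=2         # safety margin (assumption, pick your own)
NEEDED_HOURS=$((JOB_RUNTIME_HOURS + MARGIN_HOURS))
echo "requesting a proxy valid for ${NEEDED_HOURS} hours"
# On the UI you would then run:
#   voms-proxy-init -voms cms -valid ${NEEDED_HOURS}:00
```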

The next sections will provide you with example scripts that observe all of the mentioned points. You may use these or cook up your own.

Basic job submission

You can get extensive information through the man pages on the UI, or by referring to the documentation on the SGE home site.

Submitting a job script with qsub:

  qsub example_jobscript.sh
Note: Various options can be passed to the qsub command, either on the command line (e.g. -q short.q for submitting to the short queue) or from within the job script, using a line beginning with #$ (e.g. #$ -q short.q; this is the preferred way used in the templates below).

The job's stdout/stderr will be copied to the directory from which you submitted if you used the -cwd flag to the qsub command or placed #$ -cwd into your job file (q.v. the generic job example below). It is also possible to define the paths for these files explicitly using the -o and -e flags. If no option at all is given, the files will be copied to the user's home directory.

Example:

# Job name (defines name seen in monitoring by qstat and the
#     job script's stderr/stdout names)
#$ -N example_job

### Specify the queue on which to run
#$ -q all.q

# Change to the current working directory from which the job got
# submitted . This will also result in the job report stdout/stderr being
# written to this directory, if you do not override it (below).
#$ -cwd

# here you could change location of the job report stdout/stderr files
#  if you did not want them in the submission directory
#$ -o /shome/username/mydir/
#$ -e /shome/username/mydir/

Queues

For queue limits and policy, please have a look here

We offer the following batch system queues:

  • all.q: Supports jobs with up to 10h running time. This is the default queue
  • long.q: Supports jobs with up to 96h running time
  • short.q: Supports testing or small jobs. 90 min maximal running time.
  • debug.q: Supports the debugging of a live or past job on a specific server t3wnXX. 20min time limit. Just 1 debugging session per server.
  • bigmem.q: A queue with two slots and a maximum virtual memory limit of 20G. Useful for MVA training.

You can choose the target queue using the -q option to the qsub command or from within the job script (q.v. the example job below):

qsub -q short.q example_jobscript.sh

For learning how to use the debug.q queue please consult the HowToDebugJobs section.

To inspect a queue's configuration, run for instance the command qconf -sq all.q.
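If you are only after the runtime limits, you can filter the qconf output. The sketch below parses a captured sample (the sample values are assumptions, but h_rt and s_rt are the standard SGE attributes for hard and soft wallclock limits) so that it can run without an SGE environment; on the UI you would pipe the real qconf command instead:

```shell
# Extract the wallclock limits from `qconf -sq <queue>` style output.
# Here we parse a captured sample instead of calling qconf directly.
sample_qconf_output() {
cat <<'EOF'
qname                 all.q
hostlist              @allhosts
s_rt                  09:50:00
h_rt                  10:00:00
EOF
}
# On the UI you would run instead:
#   qconf -sq all.q | awk '$1=="h_rt"{print $2}'
H_RT=$(sample_qconf_output | awk '$1=="h_rt"{print $2}')
echo "hard runtime limit: $H_RT"
```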

Specifying job's memory requirements

If your job needs memory in excess of 3 GB, you need to specify this in the job submission; if not specified, the job will inherit this 3 GB default limit. Any job exceeding its specified requirement will be killed by the queueing system. This is a safety feature that prevents a job from affecting other jobs running on the same node and destabilizing the node. If you explicitly specify extra memory requirements, the queueing system will guarantee that this amount of memory is available on the host while your job is running.

  • CMS specifies that jobs should usually be able to stay below 2 GB. So, if your jobs need more, try to understand why and whether you may have made an error
  • Please note that you should never specify high but unnecessary memory limits by default, since this restricts the number of jobs that can run on a node. This reduces the cluster's throughput and hurts you and all other users.
  • You may request up to 6 GB of memory. Jobs above this limit will not be run. Ask the system administrators should you have need of such jobs.

Use the following command to submit a job with a reservation for 4GB of memory

  qsub -l h_vmem=4g example_jobscript.sh
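The same reservation can also be set from within the job script, using the usual #$ directive lines:

```shell
# Request 4 GB of virtual memory from within the job script
# (equivalent to passing -l h_vmem=4g on the qsub command line)
#$ -l h_vmem=4g
```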

Setting default job requirements in $HOME/.sge_request

Every T3 user can set their own default job requirements by creating a file $HOME/.sge_request, as described at http://linux.die.net/man/5/sge_request (or by reading man sge_request). The system-wide defaults are in /gridware/sge/default/common/sge_request.
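A minimal $HOME/.sge_request could look like the following sketch; each line holds the same options you would otherwise pass to qsub (the chosen values are purely illustrative, not site recommendations):

```shell
# $HOME/.sge_request -- per-user default submit options for qsub
# (illustrative values only)
-cwd
-q short.q
-l h_vmem=3g
```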

Job monitoring

Looking at the job status using qstat:

  qstat            # lists your jobs only
  qstat -u '*'    # lists everyone's jobs

Typical output:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    197 0.00000 example_jo feichtinger  qw    10/04/2008 22:45:01                                    1

Dealing with jobs with queue error states

It may happen that your jobs will exhibit an error state in the queue job list:

qstat -u '*'

...
401915 0.55018 myjob someuser      Eqw   09/24/2010 18:57:36                                    1
...

You can use qstat with the following flags to get more details on the error

qstat -explain E -j  401915

...
stdout_path_list:           CMSSW_499.stdout
jobshare:                   0
hard_queue_list:            all.q
env_list:
script_file:                /shome/someuser/CMSSW_3_1_4/src/test/crab_0_100924_185610/res//batchscript
error reason    1:          09/24/2010 19:28:16 [543:7918]: error: can't chdir to /shome/someuser/CMSSW_3_1_4/src/test/crab_0_100
...

To resubmit a job you have to create a 'clone' job with qresub and delete the original job with qdel; you cannot resubmit the same job under the same job ID.
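The qresub/qdel pair can be scripted for all jobs stuck in Eqw state. The sketch below parses a captured qstat sample (the second job ID is invented) so it runs anywhere, and only prints the commands; on the UI you would replace the sample function with the real qstat and drop the echo:

```shell
# Collect the IDs of jobs in Eqw state and print the qresub/qdel
# pair for each. We parse a captured qstat sample here.
sample_qstat_output() {
cat <<'EOF'
job-ID  prior   name  user     state submit/start at     queue slots
--------------------------------------------------------------------
401915 0.55018 myjob someuser Eqw   09/24/2010 18:57:36       1
401916 0.55020 myjob someuser r     09/24/2010 18:57:40       1
EOF
}
# On the UI, pipe the real `qstat` instead of the sample.
for id in $(sample_qstat_output | awk '$5=="Eqw"{print $1}'); do
    echo "qresub $id && qdel $id"   # remove the echo to actually run it
done
```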

Killing a job

Use the qstat command to obtain the ID of the job you want to cancel. Then invoke the qdel command on this ID, e.g.

qdel 197

A generic job script

You can find this example script at /swshare/sge/examples/generic_job.sh.

This job will

  • create a sandbox in the /scratch area on the worker node
  • run there
  • copy back the output files defined by you to a job directory under your submission directory in the shared home
  • copy back the output files defined by you to a directory on the SE
  • clean up the working area on the execution node
  • write batch-related information to the job script's stdout/stderr for diagnosing problems (this can be made more verbose by setting a debug flag)

You will find the job script's stdout/stderr files in your submission directory.

Do not forget to create a grid proxy (with voms-proxy-init -voms cms) if you want to interact with the SE!

You can add your own shell code in the section labelled "YOUR FUNCTIONALITY...". All entries that you would normally want to adapt are printed in green.

#!/bin/bash
#################################
# PSI Tier-3 example batch Job  #
#################################

##### CONFIGURATION ##############################################
# Output files to be copied back to the User Interface
# (the file path must be given relative to the working directory)
OUTFILES="myout.txt myerr.txt"

# Output files to be copied to the SE
# (as above the file path must be given relative to the working directory)
SEOUTFILES="mybigfile"
#
# By default, the files will be copied to $USER_SRM_HOME/$HN_NAME/$JOBDIR,
# but here you can define the subdirectory under your SE storage path
# to which files will be copied (uncomment line)
#SEUSERSUBDIR="mytestsubdir/somedir"
#
# User's CMS hypernews name (needed for user's SE storage home path
# USER_SRM_HOME below)
HN_NAME=your-hn-name

# set DBG=1 for additional debug output in the job report files
# DBG=2 will also give detailed output on SRM operations
DBG=0

#### The following configurations you should not need to change
# The SE's user home area (SRMv2 URL)
USER_SRM_HOME="srm://t3se01.psi.ch:8443/srm/managerv2?SFN=/pnfs/psi.ch/cms/trivcat/store/user/"

# Top working directory on worker node's local disk. The batch
# job working directory will be created below this
TOPWORKDIR=/scratch/`whoami`

# Basename of job sandbox (job workdir will be $TOPWORKDIR/$JOBDIR)
JOBDIR=sgejob-$JOB_ID
##################################################################


############ BATCH QUEUE DIRECTIVES ##############################
# Lines beginning with #$ are used to set options for the SGE
# queueing system (same as specifying these options to the qsub
# command)

# Job name (defines name seen in monitoring by qstat and the
#     job script's stderr/stdout names)
#$ -N example_job

### Specify the queue on which to run
#$ -q all.q

# Change to the current working directory from which the job got
# submitted (will also result in the job report stdout/stderr being
# written to this directory)
#$ -cwd

# here you could change location of the job report stdout/stderr files
#  if you did not want them in the submission directory
#  #$ -o /shome/username/mydir/
#  #$ -e /shome/username/mydir/

##################################################################



##### MONITORING/DEBUG INFORMATION ###############################
DATE_START=`date +%s`
echo "Job started at " `date`
cat <<EOF
################################################################
## QUEUEING SYSTEM SETTINGS:
HOME=$HOME
USER=$USER
JOB_ID=$JOB_ID
JOB_NAME=$JOB_NAME
HOSTNAME=$HOSTNAME
TASK_ID=$TASK_ID
QUEUE=$QUEUE

EOF

echo "################################################################"

if test 0"$DBG" -gt 0; then
   echo "######## Environment Variables ##########"
   env
   echo "################################################################"
fi


##### SET UP WORKDIR AND ENVIRONMENT ######################################
STARTDIR=`pwd`
WORKDIR=$TOPWORKDIR/$JOBDIR
RESULTDIR=$STARTDIR/$JOBDIR
if test x"$SEUSERSUBDIR" = x; then
   SERESULTDIR=$USER_SRM_HOME/$HN_NAME/$JOBDIR
else
   SERESULTDIR=$USER_SRM_HOME/$HN_NAME/$SEUSERSUBDIR
fi
if test -e "$WORKDIR"; then
   echo "ERROR: WORKDIR ($WORKDIR) already exists! Aborting..." >&2
   exit 1
fi
mkdir -p $WORKDIR
if test ! -d "$WORKDIR"; then
   echo "ERROR: Failed to create workdir ($WORKDIR)! Aborting..." >&2
   exit 1
fi

cd $WORKDIR
cat <<EOF
################################################################
## JOB SETTINGS:
STARTDIR=$STARTDIR
WORKDIR=$WORKDIR
RESULTDIR=$RESULTDIR
SERESULTDIR=$SERESULTDIR
EOF

###########################################################################
## YOUR FUNCTIONALITY CODE GOES HERE
# set up CMS environment

## 17-11-2014
## If you really want to use the old /swshare/cms instead of the constantly
## updated /cvmfs/cms.cern.ch, then uncomment the following row
## Further info: https://wiki.chipp.ch/twiki/bin/view/CmsTier3/HowToWorkInCmsEnv
# export VO_CMS_SW_DIR=/swshare/cms

source $VO_CMS_SW_DIR/cmsset_default.sh

# Here we produce some output for the files that are copied back to
#  our shared home
scramv1 list > myout.txt 2>myerr.txt

# lcg-ls and all the others lcg-* tools got deprecated
# https://wiki.chipp.ch/twiki/bin/view/CmsTier3/HowToAccessSe#gfal_tools
#lcg-ls -b -D srmv2 -l srm://t3se01.psi.ch:8443/srm/managerv2?SFN=/pnfs/psi.ch/cms/ >> myout.txt 2>>myerr.txt
gfal-ls srm://t3se01.psi.ch:8443/srm/managerv2?SFN=/pnfs/psi.ch/cms/ >> myout.txt 2>>myerr.txt

# create a dummy file for copying back to the SE
dd if=/dev/urandom of=mybigfile count=100 &>/dev/null


#### RETRIEVAL OF OUTPUT FILES AND CLEANING UP ############################
cd $WORKDIR
if test 0"$DBG" -gt 0; then
    echo "########################################################"
    echo "############# Working directory contents ###############"
    echo "pwd: " `pwd`
    ls -Rl
    echo "########################################################"
    echo "YOUR OUTPUT WILL BE MOVED TO $RESULTDIR"
    echo "########################################################"
fi

if test x"$OUTFILES" != x; then
   mkdir -p $RESULTDIR
   if test ! -e "$RESULTDIR"; then
          echo "ERROR: Failed to create $RESULTDIR ...Aborting..." >&2
          exit 1
   fi
   for n in $OUTFILES; do
       if test ! -e $WORKDIR/$n; then
          echo "WARNING: Cannot find output file $WORKDIR/$n. Ignoring it" >&2
       else
          cp -a $WORKDIR/$n $RESULTDIR/$n
          if test $? -ne 0; then
             echo "ERROR: Failed to copy $WORKDIR/$n to $RESULTDIR/$n" >&2
          fi
       fi
   done
fi

if test x"$SEOUTFILES" != x; then
   if test 0"$DBG" -ge 2; then
      srmdebug="-v"
   fi
   for n in $SEOUTFILES; do
       if test ! -e $WORKDIR/$n; then
          echo "WARNING: Cannot find output file $WORKDIR/$n. Ignoring it" >&2
       else
          # lcg-cp and all the others lcg-* tools got deprecated
          # https://wiki.chipp.ch/twiki/bin/view/CmsTier3/HowToAccessSe#gfal_tools
          #lcg-cp $srmdebug -b -D srmv2 file:$WORKDIR/$n $SERESULTDIR/$n
          gfal-copy $srmdebug file:$WORKDIR/$n $SERESULTDIR/$n
          if test $? -ne 0; then
             echo "ERROR: Failed to copy $WORKDIR/$n to $SERESULTDIR/$n" >&2
          fi
       fi
   done
fi

echo "Cleaning up $WORKDIR"
rm -rf $WORKDIR

###########################################################################
DATE_END=`date +%s`
RUNTIME=$((DATE_END-DATE_START))
echo "################################################################"
echo "Job finished at " `date`
echo "Wallclock running time: $RUNTIME s"
exit 0

An example CMSSW job

Note: Check whether the CMS CRAB tool may be better suited for what you want to do. This example shows how to submit standalone CMSSW jobs to the SGE batch system.

This example uses the code of the generic job above, but with the functionality section adapted as shown in the code below (i.e. the section below must be inserted into the generic job script above). Note that the job does not need to create a CMSSW area or unpack any shared libraries on the worker node; it can take its environment from the shared home in which you work from the UI. The local work directory (/scratch) on the worker node is needed for the large output files generated by the job, since writing those to the shared home area would be inefficient. After the job, the output files will be copied either to the SE or to your home area, as defined in the header of the generic_job.sh script.

###########################################################################
## YOUR FUNCTIONALITY CODE GOES HERE

CMSSW_DIR=/shome/feichtinger/cmssw_test/CMSSW_2_1_6
CMSSW_CONFIG_FILE=$CMSSW_DIR/demoanalyzer-classic.cfg

source $VO_CMS_SW_DIR/cmsset_default.sh
# shopt -s expand_aliases is needed if you want to use the alias 'cmsenv'
# created by $VO_CMS_SW_DIR/cmsset_default.sh instead of the less mnemonic
# eval `scramv1 runtime -sh`
shopt -s expand_aliases 

cd $CMSSW_DIR/src
eval `scramv1 runtime -sh`
if test $? -ne 0; then
   echo "ERROR: Failed to source scram environment" >&2
   exit 1
fi

cd $WORKDIR
cmsRun $CMSSW_CONFIG_FILE > myout.txt 2>myerr.txt

NOTE: The result files produced by cmsRun should be listed in the OUTFILES or SEOUTFILES variables of the generic_job.sh header, in order for them to be copied back to your job result directory or the SE.

Submitting Multicore (Multithreading) jobs

Your program might require a number of threads. For instance, the following example demands 8 threads (the number in parentheses):

#Setup FWK for multithreaded 
process.options.numberOfThreads=cms.untracked.uint32(8)

You must declare the number of cores that your job will require by using the parallel environment option -pe smp $num-of-cores in your job submission.

Example of submitting an 8-core parallel SMP job:

qsub -pe smp 8 myjob.sge
An SMP parallel job is a job running in parallel on several cores on the same node.
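Inside the job script, the granted slot count is available in the SGE variable $NSLOTS, so the thread count need not be hard-coded twice. A sketch (the myprog call is a hypothetical placeholder):

```shell
# Ask SGE for 8 slots on one node; read the granted count from $NSLOTS
# at runtime, defaulting to 1 for tests outside the batch system.
#$ -pe smp 8
THREADS=${NSLOTS:-1}
echo "running with $THREADS threads"
# pass it on to your program, e.g.: myprog --threads "$THREADS"
```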

Job accounting

You can get accounting information on the cluster usage by executing

  qacct -d 7 -o -f /gridware/sge/default/common/accounting.complete    # info for the last 7 days on a per-user basis

You can get detailed information on a job if you know its job ID:

  qacct -j 6049949 -f /gridware/sge/default/common/accounting.complete
The following e-mail to T3 users shows how to print your last month's jobs together with their mem and maxvmem values:
Dear T3 users

since presently the final report of a T3 job reports senseless values about vmem and maxvmem:

# ================================================================
# JOB Live Resources USAGE for job 6040187: ( don't consider mem values, they are wrong )
# usage    1:                 cpu=00:00:01, mem=9.00000 GBs, io=0.00004, vmem=3.000G, maxvmem=3.000G
# JOB Historical Resources USAGE for job 6040187: you have to manually run
# qacct -j 6040187 2&> /dev/null || qacct -f /gridware/sge/default/common/accounting.complete -j 6040187
# JOB ran over queue@host = short.q@t3wn41.psi.ch
# JOBs executed on t3wn[30-40] should run ~1.13 faster than t3wn[10-29]
# removing TMPDIR: /scratch/tmpdir-6040187.1.short.q

some of you asked how to check the RAM used by your finished jobs; that can be done with this pipe:

$ qacct -o $USER -d 30 -j -f /gridware/sge/default/common/accounting.complete | egrep 'jobnumber|mem|maxvmem' | paste -s -d '  \n' | awk '{ print $1,$2,$3,$4,$5,$6}'
jobnumber 5916517 mem 0.000 maxvmem 0.000
jobnumber 5916518 mem 0.000 maxvmem 120.898M
jobnumber 5916526 mem 0.005 maxvmem 105.305M
jobnumber 5937531 mem 0.003 maxvmem 105.508M
jobnumber 5937532 mem 0.347 maxvmem 2.678G
...

Tune the -d 30 (last month) parameter according to your needs.

from 'man accounting'
***********************
mem:
The integral memory usage in Gbytes cpu seconds
( i.e. mem is the integral of the memory usage curve, i.e. the sum of the amount of memory used in each time interval for the life of the job. )

maxvmem:
The maximum vmem size in bytes.
***********************
Honestly, the meaning, and accordingly the values, of the 'mem' field are a bit obscure, while the 'maxvmem' values are understandable and usable.
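Since mem is an integral in GB * (cpu) seconds, dividing it by the job's cpu time gives a rough average memory footprint in GB. A small sketch with invented numbers (plug in the mem and cpu fields reported by qacct):

```shell
# Average memory footprint ~= mem (GB*s integral) / cpu time (s).
# The numbers below are invented for illustration.
MEM_GBS=0.347     # 'mem' field from qacct
CPU_S=10          # 'cpu' field from qacct
AVG_GB=$(awk -v m="$MEM_GBS" -v t="$CPU_S" 'BEGIN{printf "%.4f", m/t}')
echo "average memory: $AVG_GB GB"
```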
Topic revision: r47 - 2018-11-20 - NinaLoktionova