Site Specific Modifications
This page contains details about modifications made to software that are specific to CSCS.
Full details should be found on the relevant service page; this is intended to give a brief overview and to be more readable than cfengine.
If something is mentioned here but not covered in detail on the wiki page for the service, please inform the service maintainer listed at the link below.
https://wiki.chipp.ch/twiki/bin/view/LCGTier2/ServiceInformation
General modifications
Polyinstantiated /tmp on WNs
To make sure that jobs see a specific /tmp directory on GPFS, we need to polyinstantiate /tmp and put it on the shared filesystem. Following
http://tech.ryancox.net/2013/07/per-user-tmp-and-devshm-directories.html, we can configure it by doing this:
- Add the following to /etc/rc.local:
# MG 05.05.14 10:00 as per http://tech.ryancox.net/2013/07/per-user-tmp-and-devshm-directories.html
#
#mkdir -pm 000 /tmp/usertmp
mkdir -pm 000 /dev/shm/usertmp
mount --make-shared /
mount --bind /tmp /tmp
mount --make-private /tmp
mount --bind /dev/shm /dev/shm
mount --make-private /dev/shm
mount --bind /cvmfs /cvmfs
mount --make-rshared /cvmfs
mount --bind /gpfs_pp /gpfs_pp
mount --make-rshared /gpfs_pp
- Add the proper PAM configuration:
- At the end of /etc/pam.d/sshd, add the following:
session required pam_namespace.so ignore_instance_parent_mode
- At the end of /etc/pam.d/slurm, add the following:
auth required pam_localuser.so
account required pam_unix.so
session required pam_limits.so
session required pam_namespace.so ignore_instance_parent_mode
NOTE: the argument ignore_instance_parent_mode is there to allow /tmp to be polyinstantiated on a subdirectory whose parent was not created with 000 permissions (i.e. /gpfs_pp).
- Now the file /etc/security/namespace.conf needs to contain the following lines:
/tmp /gpfs_pp/usertmp/ user root
/dev/shm /dev/shm/usertmp/ user root
- And all that is left is to create the directory where the polyinstantiated /tmp will be stored:
mkdir -pm 000 /gpfs_pp/usertmp
NOTE: Please take into account that this directory is shared across the whole cluster, so if users' jobs create directories on /tmp, they have to be uniquely named.
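Because the backing directory on GPFS is shared cluster-wide per user, a job script can guarantee a uniquely named work area with mktemp. This is a generic sketch (the `job.` prefix is an arbitrary example, not a CSCS convention):

```shell
#!/bin/bash
# Create a uniquely named scratch directory under the (polyinstantiated) /tmp.
# mktemp guarantees a unique name, so two jobs of the same user running on
# different nodes cannot collide on the shared GPFS backing directory.
JOBTMP=$(mktemp -d /tmp/job.XXXXXXXX)
trap 'rm -rf "$JOBTMP"' EXIT   # clean up when the job script exits
cd "$JOBTMP"
# ... job payload would run here ...
```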
SLURM
To enable this for SLURM jobs, additional steps need to be done:
- Make sure the directive UsePAM is enabled in /etc/slurm/slurm.conf:
UsePAM=1
- Make sure that there is no slurmd information on /tmp by properly configuring the following variables:
#--------------------------------------------------------------------------------------
# PATHS
#--------------------------------------------------------------------------------------
SlurmdSpoolDir=/var/spool/slurmd # this is for slurmd, must be local to each system
StateSaveLocation=/var/spool # in PROD /slurm/spool/state, this is for slurmctld
- Also, due to cvmfs working via autofs and the way namespaces behave, a Prolog script that makes sure cvmfs is mounted on the WN before the job runs is required:
- Add the following entries to slurm.conf (TaskProlog and TaskEpilog existed before). The idea is not to verify that CVMFS works (that is done by the Node Health Check), but to make sure that those filesystems are mounted if they are supposed to be there. We could of course add more complex checks here, but that would stress the system much more.
# http://slurm.schedmd.com/prolog_epilog.html
# Prolog & Epilog to be run before/after each batch task, to set default environment variables etc.
TaskProlog=/etc/slurm/TaskProlog.sh
TaskEpilog=/etc/slurm/TaskEpilog.sh
# Prolog & Epilog to be run before/after the job is actually executed
Prolog=/etc/slurm/Prolog.sh
Epilog=/etc/slurm/Epilog.sh
- The contents of Prolog.sh and Epilog.sh are these:
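The production scripts are maintained in cfengine; purely as an illustration of the idea described above, a minimal Prolog only needs to touch each autofs-managed path so that autofs triggers the mount before the job starts. This is a hypothetical sketch, not the actual CSCS script:

```shell
#!/bin/bash
# Hypothetical Prolog sketch -- NOT the production CSCS script.
# Accessing a path below an autofs mount point is enough to make autofs
# mount the filesystem, so a cheap stat per filesystem suffices.
ensure_mounted() {
    stat -t "$1" >/dev/null 2>&1
}

# On a WN this would be invoked with e.g. /cvmfs/atlas.cern.ch and /gpfs_pp
for fs in "$@"; do
    if ! ensure_mounted "$fs"; then
        echo "Prolog: required filesystem $fs is not available" >&2
        exit 1
    fi
done
```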
Keeping a mixed environment with polyinstantiated /tmp only on some nodes
In order to accomplish this, make sure that all modifications shown above are applied. It is especially important that /etc/pam.d/slurm exists on all nodes, otherwise those on which it doesn't exist will be marked as DOWN. Then, the following change must be applied:
- On those nodes in which we want to enforce the polyinstantiated /tmp, the following line must be present:
session required pam_namespace.so ignore_instance_parent_mode
- While on those nodes which should show a standard behaviour, it must be commented out:
#session required pam_namespace.so ignore_instance_parent_mode
du on GPFS
When running du on a file on the GPFS filesystem the disk usage is reported as twice the size of the file. This is because we use the GPFS native RAID which ensures there are two copies of the file.
The ATLAS pilot jobs had an issue with this so du on the worker nodes was replaced with a script that halves the size reported when run on GPFS.
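The replacement script itself is managed outside this page; the following is a sketch of the idea only, under the assumptions that the unmodified du is found on PATH and that GPFS mounts report filesystem type "gpfs" via stat -f (both assumptions, not verified against the production script). The halving is just an awk pass over du's output:

```shell
#!/bin/bash
# Hypothetical sketch of the du wrapper -- NOT the production CSCS script.

halve_sizes() {
    # du prints "<size><TAB><path>"; halve the size column
    awk -F'\t' 'BEGIN { OFS = "\t" } { $1 = int($1 / 2); print }'
}

on_gpfs() {
    # Assumption: GPFS mounts report fstype "gpfs" in stat -f
    [ "$(stat -f -c %T "$1" 2>/dev/null)" = "gpfs" ]
}

du_wrapper() {
    # Use the last argument as the target (defaults to the current directory)
    local target="." arg
    for arg in "$@"; do target="$arg"; done
    if on_gpfs "$target"; then
        command du "$@" | halve_sizes
    else
        command du "$@"
    fi
}
```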
Cream
Publishing
Only cream01 should publish values for the CPUs in the cluster. Other creams should not report this; the relevant file is
/var/lib/bdii/gip/ldif/static-file-Cluster.ldif
The following file determines whether the CREAM CE publishes as production or draining; this is controlled via cfengine.
/var/lib/bdii/gip/plugin/glite-info-dynamic-ce
This in turn runs a script from root's home directory that queries slurm in order to generate certain values, e.g.
vim /root/fakeinfo.bash
...
MaxJobsPerGroup=$(${SACCTMGR_CMD} -n -r list association account=${VO} format=grpjobs | awk '{print $1}')
...
Sym links
/usr/local/bin/sacct needs to be a symbolic link to /usr/bin/sacct
ll /usr/local/bin/sacct
lrwxrwxrwx 1 root root 14 9. Okt 10:14 /usr/local/bin/sacct -> /usr/bin/sacct
Job working directory
On the worker nodes, to ensure jobs run in tmpdir_slurm rather than the user's home, the following file is modified:
/etc/glite/cp_1.sh
Priorities
The following file is modified so that jobs submitted with sgm/ops accounts are run on a reservation:
/usr/libexec/slurm_local_submit_attributes.sh
We saw pilot failures across VOs due to their pilot jobs being queued for too long; as such, the reservation described below was implemented.
Accounting
The Slurm log parser used by APEL:
/usr/lib/python2.6/site-packages/apel/parsers/slurm.py
/usr/lib/python2.6/site-packages/apel/common/datetime_utils.py
have been patched according to this GGUS ticket. A few issues still need to be solved: the correct version should be included in the next release of the apel-parsers package (currently 1.1.2), which will be updated when available.
Reservation for ops
As we are unable to reserve a single core on a node, we have reserved two whole nodes to run these jobs:
ReservationName=priority_jobs StartTime=24 Oct 16:29 EndTime=24 Oct 2014 Duration=365-00:00:00
Nodes=wn65,wn73 NodeCnt=2 CoreCnt=64 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
Users=(null) Accounts=ops,dteam Licenses=(null) State=ACTIVE
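For reference, a reservation of this shape could be created with scontrol roughly as follows (a sketch only; the actual start/end times above come from the live system and are not reproduced here):

```
scontrol create reservation ReservationName=priority_jobs \
    starttime=now duration=365-00:00:00 \
    nodes=wn65,wn73 flags=ignore_jobs,spec_nodes \
    accounts=ops,dteam
```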
Jobs are directed to this reservation by a modification to the submission script.
vim /usr/libexec/slurm_local_submit_attributes.sh
...
# DTEAM
REGEX="dteam[0-9][0-9][0-9]"
USER=`whoami`
if [[ ( $USER =~ $REGEX ) ]] ; then
    # This extracts the queue from the SUDO command and assigns the dteam reservation if required
    QUEUE=$(echo $SUDO_COMMAND |awk -F'-q ' '{print $2}' | sed 's/ -n.*$//g' 2>&1)
    if [ "$QUEUE" == "cscs" ]; then
        echo "#SBATCH --reservation=priority_jobs"
    fi
fi
...
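The awk/sed extraction above can be checked in isolation. Here SUDO_COMMAND is set to a made-up sample value of the form the CE passes through sudo (the path and options are hypothetical):

```shell
# Sample value (hypothetical) of the kind the CE passes through sudo
SUDO_COMMAND="/usr/bin/sbatch -q cscs -n 1 /tmp/job.sh"
# Same extraction as in slurm_local_submit_attributes.sh
QUEUE=$(echo $SUDO_COMMAND | awk -F'-q ' '{print $2}' | sed 's/ -n.*$//g' 2>&1)
echo "$QUEUE"   # prints the queue name, here: cscs
```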
Arc
Submit with a non-ATLAS user
The file below can be modified to allow a user to submit to the ARC CE.
vim /usr/share/arc/ARC0ClusterInfo.pm
if ($q->{'name'} eq "cscs" and $sn !~ m/Pablo Fern/) { next; }
Accounting
Currently the accounting data publishing via jura is under investigation, but since temporary accounting data were filling up arc[01,02]'s disks, those data have been stored on the NAS for future reference:
nas.lcg.cscs.ch:/ifs/LCG/shared/apel_accounting_backup
/opt/apel_accounting_backup
For each machine a specific directory has been created where temporary data (i.e. APEL-compliant records not yet sent) can be moved from time to time to free some space on the disk:
[root@arc02:~]# mv /var/spool/arc/ssm/test-msg02.afroditi.hellasgrid.gr/outgoing/00000000/* /opt/apel_accounting_backup/arc02_outgoing_tmp/ssm/test-msg02/
[root@arc01:~]# mv /var/spool/arc/ssm/test-msg02.afroditi.hellasgrid.gr/outgoing/00000000/* /opt/apel_accounting_backup/arc01_outgoing_tmp/ssm/test-msg02/
Another file that can be moved in order to free some space is:
[root@arc02:~]# mv /var/spool/nordugrid/jobstatus/job.logger.errors /gpfs/apel_test/job.logger.errors_arc02_20140205
This file can easily grow to a few GB in case of sending errors reported by jura.
Modify the job comment to reflect the DN
The file "/usr/share/arc/submit-SLURM-job" has been modified to make the DN that submitted the job visible in the job comment. This gives much more detail when looking at tools like squeue.
MYUSERDN=$(/usr/bin/openssl x509 -in ${X509_USER_PROXY} -subject -noout | sed -r 's/.*= (.*)/\1/g' 2>&1)
MYHN=$(hostname -s)
COMMENT="\"$MYHN,$MYUSERDN\""
echo "#SBATCH --comment=$COMMENT" >> $LRMS_JOB_SCRIPT
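The sed expression can be exercised with a sample subject line. The DN below is made up; the real value comes from the user's X509 proxy via openssl as shown above, assuming an OpenSSL version that prints "subject= /...":

```shell
# Hypothetical subject line of the form printed by "openssl x509 -subject -noout"
SUBJECT="subject= /DC=ch/DC=cscs/OU=Users/CN=Jane Doe"
# Same extraction as in submit-SLURM-job: strip everything up to the last "= "
MYUSERDN=$(echo "$SUBJECT" | sed -r 's/.*= (.*)/\1/g')
echo "$MYUSERDN"   # prints: /DC=ch/DC=cscs/OU=Users/CN=Jane Doe
```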
Previously there were issues with the memory size requested by jobs; however, this has since been resolved upstream.
dCache
No real modifications specific to dCache itself for CSCS. See dCache wiki page for set-up.
Prevent publishing of the file access protocol
With the NFS41 domain the file access protocol is published and not easily disabled. As the WNs do not mount /pnfs, pilots will fail. To work around this we perform a sed. Below is an example taken prior to modifying the info provider script.
/var/lib/bdii/gip/provider/info-based-infoProvider.sh > /tmp/info.orig
sed -e '232,246d' /tmp/info.orig > /tmp/info.mod
diff /tmp/info.*
231a232,246
> dn: GlueSEAccessProtocolLocalID=NFSv41-storage02@nfs-storage02Domain,GlueSE
> UniqueID=storage01.lcg.cscs.ch,mds-vo-name=resource,o=grid
> objectClass: GlueSETop
> objectClass: GlueSEAccessProtocol
> objectClass: GlueKey
> objectClass: GlueSchemaVersion
> GlueSEAccessProtocolLocalID: NFSv41-storage02@nfs-storage02Domain
> GlueSEAccessProtocolType: file
> GlueSEAccessProtocolEndpoint: file://storage02.lcg.cscs.ch:2049
> GlueSEAccessProtocolMaxStreams: 5
> GlueSEAccessProtocolCapability: file transfer
> GlueSEAccessProtocolVersion: file
> GlueSchemaVersionMajor: 1
> GlueSchemaVersionMinor: 3
> GlueChunkKey: GlueSEUniqueID=storage01.lcg.cscs.ch
It appears that in later versions of dCache a second modification is required: the following property must be set to a blank value (by default it is set to file). We noticed this when upgrading to 2.6.27.
nfs.published.name =
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105586#update#6
--
GeorgeBrown - 2013-11-11