<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->
---+ Site Specific Modifications

This page contains details about modifications made to software that are specific to CSCS. Full details should be found on the relevant service page; this page is intended to give a brief overview and to be more readable than cfengine. If something is mentioned here but not described in detail on the wiki page for the service, please inform the service maintainer listed at the link below.

https://wiki.chipp.ch/twiki/bin/view/LCGTier2/ServiceInformation

%TOC%

---++ General modifications

---+++ Polyinstantiated /tmp on WNs

To make sure that jobs see a specific /tmp directory on GPFS, we need to polyinstantiate /tmp and put it on the shared filesystem. Following http://tech.ryancox.net/2013/07/per-user-tmp-and-devshm-directories.html, we can configure it as follows:

   1. Add the following to =/etc/rc.local=: <verbatim>
# MG 05.05.14 10:00 as per http://tech.ryancox.net/2013/07/per-user-tmp-and-devshm-directories.html
#
#mkdir -pm 000 /tmp/usertmp
mkdir -pm 000 /dev/shm/usertmp
mount --make-shared /
mount --bind /tmp /tmp
mount --make-private /tmp
mount --bind /dev/shm /dev/shm
mount --make-private /dev/shm
mount --bind /cvmfs /cvmfs
mount --make-rshared /cvmfs
mount --bind /gpfs_pp /gpfs_pp
mount --make-rshared /gpfs_pp</verbatim>
   1. Add the proper PAM configuration:
      a. At the end of =/etc/pam.d/sshd=, add the following: <verbatim>
session    required     pam_namespace.so ignore_instance_parent_mode</verbatim>
      a. At the end of =/etc/pam.d/slurm=, add the following: <verbatim>
auth       required     pam_localuser.so
account    required     pam_unix.so
session    required     pam_limits.so
session    required     pam_namespace.so ignore_instance_parent_mode</verbatim> NOTE: the argument =ignore_instance_parent_mode= is there to allow =/tmp= to be polyinstantiated on a subdirectory whose parent is not created with =000= permissions (i.e. =/gpfs_pp=).
   1. The file =/etc/security/namespace.conf= needs to have the following lines: <verbatim>
/tmp        /gpfs_pp/usertmp/    user    root
/dev/shm    /dev/shm/usertmp/    user    root</verbatim>
   1. All that is left is to create the directory where the polyinstantiated /tmp instances will be stored: <verbatim>
mkdir -pm 000 /gpfs_pp/usertmp</verbatim>

*NOTE* Please take into account that this directory is shared across the whole cluster, so if users' jobs create directories in /tmp, they have to be *uniquely named*.
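A quick way to check that the polyinstantiation works is to log in over SSH as an unprivileged test user and confirm that =/tmp= is now a per-user bind mount on GPFS. The snippet below is only a sketch: it assumes =findmnt= (util-linux) is available (=mount | grep /tmp= works as well) and that the =user= method configured in =namespace.conf= creates instance directories named =/gpfs_pp/usertmp/&lt;username&gt;=. <verbatim>
# As an unprivileged user (root is excluded in namespace.conf), after logging in via ssh:
findmnt /tmp                     # SOURCE should point at a per-user directory on GPFS, not at the root filesystem
touch /tmp/namespace_canary      # drop a marker file into the polyinstantiated /tmp

# As root (only root can traverse the mode-000 /gpfs_pp/usertmp directory):
ls -l /gpfs_pp/usertmp/*/namespace_canary   # the marker should appear under that user's instance directory</verbatim>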
---++++ SLURM

To enable this for SLURM jobs, additional steps need to be taken:

   1. Make sure the directive =UsePAM= is enabled in =/etc/slurm/slurm.conf=: <verbatim>
UsePAM=1</verbatim>
   1. Make sure that there *is no slurmd information on /tmp* by properly configuring the following variables: <verbatim>
#--------------------------------------------------------------------------------------
# PATHS
#--------------------------------------------------------------------------------------
SlurmdSpoolDir=/var/spool/slurmd    # this is for slurmd, must be local to each system
StateSaveLocation=/var/spool        # in PROD /slurm/spool/state, this is for slurmctld</verbatim>
   1. Also, because *cvmfs* works via =autofs= and because of the way namespaces behave, a Prolog script that makes sure cvmfs is mounted on the WN before the job runs is required:
      a. Add the following entries to =slurm.conf= (TaskProlog and TaskEpilog existed before). The idea is not to verify that CVMFS works (that is done by the Node Health Check), but to make sure that those filesystems are mounted if they are supposed to be there. Of course we could add more complex checks here, but that would stress the system much more. <verbatim>
# http://slurm.schedmd.com/prolog_epilog.html
# Prolog & Epilog to be run before/after each batch task, to set default environment variables etc.
TaskProlog=/etc/slurm/TaskProlog.sh
TaskEpilog=/etc/slurm/TaskEpilog.sh
# Prolog & Epilog to be run before/after the job is actually executed
Prolog=/etc/slurm/Prolog.sh
Epilog=/etc/slurm/Epilog.sh</verbatim>
      a. The contents of =Prolog.sh= and =Epilog.sh= are these:
         * = -- Prolog.sh -- = <verbatim>
#! /bin/bash
if [ -e /usr/bin/cvmfs_config ]; then
   /usr/bin/cvmfs_config probe 2>&1 >/dev/null
fi
exit 0</verbatim>
         * = -- Epilog.sh -- = <verbatim>
#! /bin/bash</verbatim>

---++++ Keeping a mixed environment with polyinstantiated /tmp only on some nodes

In order to accomplish this, make sure that all modifications shown above are applied. It is especially important that =/etc/pam.d/slurm= exists on *all nodes*, otherwise the nodes on which it does not exist will be marked as =DOWN=. Then, the following change must be applied:

   1. On the nodes on which we want to enforce the polyinstantiated /tmp, the following line must be present: <verbatim>
session    required     pam_namespace.so ignore_instance_parent_mode</verbatim>
   1. On the nodes which should keep the standard behaviour, it must be commented out: <verbatim>
#session    required     pam_namespace.so ignore_instance_parent_mode</verbatim>

---+++ du on GPFS

When running du on a file on the GPFS filesystem, the disk usage is reported as twice the size of the file. This is because we use the GPFS native RAID, which ensures there are two copies of the file. The ATLAS pilot jobs had an issue with this, so du on the worker nodes was replaced with a script that halves the size reported when run on GPFS; a sketch of the idea is shown below.
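The wrapper actually deployed at CSCS is managed via cfengine and is not reproduced here. The following is only a minimal sketch of the idea, and every name in it is an assumption: it presumes the original binary has been kept as =/usr/bin/du.real= and that GPFS is mounted under =/gpfs=. A real wrapper would also have to cope with options such as =-h=. <verbatim>
#!/bin/bash
# Hypothetical du wrapper -- NOT the script deployed at CSCS, just an illustration.
# Assumes the original binary was moved to /usr/bin/du.real and GPFS is mounted on /gpfs.
REAL_DU=/usr/bin/du.real
GPFS_MOUNT=/gpfs

# Check whether any argument points into GPFS.
on_gpfs=0
for arg in "$@"; do
    case "$arg" in
        ${GPFS_MOUNT}|${GPFS_MOUNT}/*) on_gpfs=1 ;;
    esac
done

# Outside GPFS: behave exactly like the real du.
if [ $on_gpfs -eq 0 ]; then
    exec "$REAL_DU" "$@"
fi

# On GPFS: halve the size column to compensate for the second copy of the data.
"$REAL_DU" "$@" | awk '{ $1 = int($1 / 2); print }'</verbatim>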
---++ Cream

---+++ Publishing

Only cream01 should publish values for the CPUs in the cluster; the other CREAMs should not report this. The relevant file is: <verbatim>
/var/lib/bdii/gip/ldif/static-file-Cluster.ldif
</verbatim>

The following file determines whether the CREAM CE publishes as production or draining. This is controlled via cfengine. <verbatim>
/var/lib/bdii/gip/plugin/glite-info-dynamic-ce
</verbatim>

This in turn runs a script from root's home directory which queries SLURM in order to generate certain values, e.g. <verbatim>
vim /root/fakeinfo.bash
...
MaxJobsPerGroup=$(${SACCTMGR_CMD} -n -r list association account=${VO} format=grpjobs | awk '{print $1}')
...
</verbatim>

---+++ Sym links

=/usr/local/bin/sacct= needs to be a symbolic link to =/usr/bin/sacct=: <verbatim>
ll /usr/local/bin/sacct
lrwxrwxrwx 1 root root 14  9. Okt 10:14 /usr/local/bin/sacct -> /usr/bin/sacct
</verbatim>

---+++ Job working directory

On the worker nodes, to ensure jobs run in =tmpdir_slurm= rather than the user's home directory, the following file is modified: <verbatim>
/etc/glite/cp_1.sh
</verbatim>

---+++ Priorities

The following file is modified so that jobs submitted with sgm/ops accounts are run on a reservation: <verbatim>
/usr/libexec/slurm_local_submit_attributes.sh
</verbatim>

We saw pilot failures across VOs because their pilot jobs were queued for too long; the reservation described under "Reservation for ops" below was implemented as a result.

---+++ Accounting

The following files of the SLURM log parser used by APEL have been patched according to this [[https://ggus.eu/ws/ticket_info.php?ticket=98409][GGUS ticket]]: <verbatim>
/usr/lib/python2.6/site-packages/apel/parsers/slurm.py
/usr/lib/python2.6/site-packages/apel/common/datetime_utils.py
</verbatim>
A few issues still need to be solved: the correct version should be included in a new release of the =apel-parsers= package (currently 1.1.2), which will be installed when available.

---+++ Reservation for ops

As we are unable to reserve a single core on a node, we have reserved two nodes to run these jobs: <verbatim>
ReservationName=priority_jobs StartTime=24 Oct 16:29 EndTime=24 Oct 2014 Duration=365-00:00:00
   Nodes=wn65,wn73 NodeCnt=2 CoreCnt=64 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   Users=(null) Accounts=ops,dteam Licenses=(null) State=ACTIVE
</verbatim>

Jobs are directed to this reservation by a modification to the submission script: <verbatim>
vim /usr/libexec/slurm_local_submit_attributes.sh
...
# DTEAM
REGEX="dteam[0-9][0-9][0-9]"
USER=`whoami`
if [[ ( $USER =~ $REGEX ) ]] ; then
    # This extracts the queue from the SUDO command and assigns the dteam reservation if required
    QUEUE=$(echo $SUDO_COMMAND | awk -F'-q ' '{print $2}' | sed 's/ -n.*$//g' 2>&1)
    if [ "$QUEUE" == "cscs" ]; then
        echo "#SBATCH --reservation=priority_jobs"
    fi
fi
...
</verbatim>

---++ Arc

---+++ Submit with non-ATLAS user

The file below can be modified to allow a user to submit to the ARC CE: <verbatim>
vim /usr/share/arc/ARC0ClusterInfo.pm

if ($q->{'name'} eq "cscs" and $sn !~ m/Pablo Fern/) {
    next;
}
</verbatim>

---+++ Accounting

Currently the publishing of accounting data via _jura_ is under investigation, but since temporary accounting data were filling up the disks of =arc[01,02]=, those data have been stored on the NAS for future reference: <verbatim>
nas.lcg.cscs.ch:/ifs/LCG/shared/apel_accounting_backup    /opt/apel_accounting_backup
</verbatim>

For each machine a specific directory has been created to which temporary data (i.e. APEL-compliant records not yet sent) can be moved from time to time to free some space on the disk: <verbatim>
[root@arc02:~]# mv /var/spool/arc/ssm/test-msg02.afroditi.hellasgrid.gr/outgoing/00000000/* /opt/apel_accounting_backup/arc02_outgoing_tmp/ssm/test-msg02/
[root@arc01:~]# mv /var/spool/arc/ssm/test-msg02.afroditi.hellasgrid.gr/outgoing/00000000/* /opt/apel_accounting_backup/arc01_outgoing_tmp/ssm/test-msg02/
</verbatim>

Another file that can be moved in order to free some space is: <verbatim>
[root@arc02:~]# mv /var/spool/nordugrid/jobstatus/job.logger.errors /gpfs/apel_test/job.logger.errors_arc02_20140205
</verbatim>
This file can easily grow to a few GB if _jura_ reports sending errors.

---+++ Modify the job comment to reflect the DN

The file =/usr/share/arc/submit-SLURM-job= has been modified so that the DN under which a job was submitted is visible in the job comment. This gives much more detail when looking at things like squeue. <verbatim>
MYUSERDN=$(/usr/bin/openssl x509 -in ${X509_USER_PROXY} -subject -noout | sed -r 's/.*= (.*)/\1/g' 2>&1)
MYHN=$(hostname -s)
COMMENT="\"$MYHN,$MYUSERDN\""
echo "#SBATCH --comment=$COMMENT" >> $LRMS_JOB_SCRIPT
</verbatim>

Previously there were issues with the memory size requested by jobs; this has since been resolved upstream.
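As an illustration (the exact format string is not part of the CSCS configuration), the comment added by =submit-SLURM-job= can be displayed next to each job with =squeue='s =%k= field, or inspected for a single job with =scontrol=: <verbatim>
# Show the job comment (submit host + DN) as an extra column; %k is the comment field.
squeue -o "%.10i %.9P %.8u %.2t %.10M %k"

# Or look at a single job:
scontrol show job <jobid> | grep -i comment</verbatim>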
---++ dCache

No real modifications specific to dCache itself for CSCS. See the dCache wiki page for the set-up.

   * Storage pools =se0[1-8]= have the python26 package installed by hand from *epel*: <verbatim>
pdsh -w se0[1-8] 'yum install python26 --enablerepo=epel -y' | dshbak -c</verbatim>

---+++ Prevent publishing of the file access protocol

With the NFSv4.1 domain the =file= access protocol is published and not easily disabled. As the WNs do not mount =/pnfs=, pilots will fail. To work around this we perform a =sed= on the info provider script. Below is an example run prior to modifying it, showing the lines that get removed: <verbatim>
/var/lib/bdii/gip/provider/info-based-infoProvider.sh > /tmp/info.orig
sed -e '232,246d' /tmp/info.orig > /tmp/info.mod
diff /tmp/info.*
231a232,246
> dn: GlueSEAccessProtocolLocalID=NFSv41-storage02@nfs-storage02Domain,GlueSE
> UniqueID=storage01.lcg.cscs.ch,mds-vo-name=resource,o=grid
> objectClass: GlueSETop
> objectClass: GlueSEAccessProtocol
> objectClass: GlueKey
> objectClass: GlueSchemaVersion
> GlueSEAccessProtocolLocalID: NFSv41-storage02@nfs-storage02Domain
> GlueSEAccessProtocolType: file
> GlueSEAccessProtocolEndpoint: file://storage02.lcg.cscs.ch:2049
> GlueSEAccessProtocolMaxStreams: 5
> GlueSEAccessProtocolCapability: file transfer
> GlueSEAccessProtocolVersion: file
> GlueSchemaVersionMajor: 1
> GlueSchemaVersionMinor: 3
> GlueChunkKey: GlueSEUniqueID=storage01.lcg.cscs.ch
</verbatim>

It appears that in later versions of dCache a second modification is required: the following property must be set to a blank value (by default it is set to =file=). We noticed this when upgrading to 2.6.27. <verbatim>
nfs.published.name =
</verbatim>

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105586#update#6

-- Main.GeorgeBrown - 2013-11-11