Site Specific Modifications
This page contains details about modifications made to software that are specific to CSCS.
Full details should be found on the relevant service page; this is intended to give a brief overview and to be more readable than cfengine.
If something is mentioned here but not covered in detail on the wiki page for the service, please inform the service maintainer listed at the link below.
https://wiki.chipp.ch/twiki/bin/view/LCGTier2/ServiceInformation
General modifications
Polyinstantiated /tmp on WNs
To make sure that jobs see a specific /tmp directory on GPFS, we need to polyinstantiate /tmp and put it on the shared filesystem. Following
http://tech.ryancox.net/2013/07/per-user-tmp-and-devshm-directories.html, we can configure it by doing this:
- Add the following to /etc/rc.local:
# MG 05.05.14 10:00 as per http://tech.ryancox.net/2013/07/per-user-tmp-and-devshm-directories.html
#
#mkdir -pm 000 /tmp/usertmp
mkdir -pm 000 /dev/shm/usertmp
mount --make-shared /
mount --bind /tmp /tmp
mount --make-private /tmp
mount --bind /dev/shm /dev/shm
mount --make-private /dev/shm
mount --bind /cvmfs /cvmfs
mount --make-rshared /cvmfs
mount --bind /gpfs_pp /gpfs_pp
mount --make-rshared /gpfs_pp
- Add the proper PAM configuration:
- At the end of /etc/pam.d/sshd, add the following:
session required pam_namespace.so ignore_instance_parent_mode
- At the end of /etc/pam.d/slurm, add the following:
auth required pam_localuser.so
account required pam_unix.so
session required pam_limits.so
session required pam_namespace.so ignore_instance_parent_mode
NOTE: the argument ignore_instance_parent_mode is there to allow /tmp to be polyinstantiated on a subdirectory whose parent was not created with 000 permissions (i.e. /gpfs_pp).
- Now the file /etc/security/namespace.conf needs to contain the following lines:
/tmp /gpfs_pp/usertmp/ user root
/dev/shm /dev/shm/usertmp/ user root
- And all that is left is to create the directory where the polyinstantiated /tmp will be stored:
mkdir -pm 000 /gpfs_pp/usertmp
NOTE: Please take into account that this directory is shared across the whole cluster, so if users' jobs create directories on /tmp, they have to be uniquely named.
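Because the backing directory on GPFS is shared cluster-wide per user, a job script can guarantee a uniquely named work area with mktemp. This is a generic sketch (the `job.` prefix is an arbitrary example, not a CSCS convention):

```shell
#!/bin/bash
# Create a uniquely named scratch directory under the (polyinstantiated) /tmp.
# mktemp guarantees a unique name, so two jobs of the same user running on
# different nodes cannot collide on the shared GPFS backing directory.
JOBTMP=$(mktemp -d /tmp/job.XXXXXXXX)
trap 'rm -rf "$JOBTMP"' EXIT   # clean up when the job script exits
cd "$JOBTMP"
# ... job payload would run here ...
```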
SLURM
To enable this for SLURM jobs, additional steps need to be done:
- Make sure the directive UsePAM is enabled in /etc/slurm/slurm.conf:
UsePAM=1
- Make sure that there is no slurmd information on /tmp by properly configuring the following variables:
#--------------------------------------------------------------------------------------
# PATHS
#--------------------------------------------------------------------------------------
SlurmdSpoolDir=/var/spool/slurmd # this is for slurmd, must be local to each system
StateSaveLocation=/var/spool # in PROD /slurm/spool/state, this is for slurmctld
- Also, due to cvmfs working via autofs and the way namespaces behave, a Prolog script that makes sure cvmfs is mounted on the WN before the job runs is required:
- Add the following entries to slurm.conf (TaskProlog and TaskEpilog existed before). The idea is not to verify that CVMFS works (that is done by the Node Health Check), but to make sure that those filesystems are mounted if they are supposed to be there. We could of course add more complex checks here, but that would stress the system much more.
# http://slurm.schedmd.com/prolog_epilog.html
# Prolog & Epilog to be run before/after each batch task, to set default environment variables etc.
TaskProlog=/etc/slurm/TaskProlog.sh
TaskEpilog=/etc/slurm/TaskEpilog.sh
# Prolog & Epilog to be run before/after the job is actually executed
Prolog=/etc/slurm/Prolog.sh
Epilog=/etc/slurm/Epilog.sh
- The contents of Prolog.sh and Epilog.sh are these:
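The production scripts are maintained in cfengine; purely as an illustration of the idea described above, a minimal Prolog only needs to touch each autofs-managed path so that autofs triggers the mount before the job starts. This is a hypothetical sketch, not the actual CSCS script:

```shell
#!/bin/bash
# Hypothetical Prolog sketch -- NOT the production CSCS script.
# Accessing a path below an autofs mount point is enough to make autofs
# mount the filesystem, so a cheap stat per filesystem suffices.
ensure_mounted() {
    stat -t "$1" >/dev/null 2>&1
}

# On a WN this would be invoked with e.g. /cvmfs/atlas.cern.ch and /gpfs_pp
for fs in "$@"; do
    if ! ensure_mounted "$fs"; then
        echo "Prolog: required filesystem $fs is not available" >&2
        exit 1
    fi
done
```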
Keeping a mixed environment with polyinstantiated /tmp only on some nodes
In order to accomplish this, make sure that all modifications shown above are applied. It is especially important that /etc/pam.d/slurm exists on all nodes, otherwise those on which it doesn't exist will be marked as DOWN. Then, the following change must be applied:
- On those nodes in which we want to enforce the polyinstantiated /tmp, the following line must be present:
session required pam_namespace.so ignore_instance_parent_mode
- While on those nodes which should show a standard behaviour, it must be commented out:
#session required pam_namespace.so ignore_instance_parent_mode
du on GPFS
When running du on a file on the GPFS filesystem the disk usage is reported as twice the size of the file. This is because we use the GPFS native RAID which ensures there are two copies of the file.
The ATLAS pilot jobs had an issue with this so du on the worker nodes was replaced with a script that halves the size reported when run on GPFS.
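The replacement script itself is managed outside this page; the following is a sketch of the idea only, under the assumptions that the unmodified du is found on PATH and that GPFS mounts report filesystem type "gpfs" via stat -f (both assumptions, not verified against the production script). The halving is just an awk pass over du's output:

```shell
#!/bin/bash
# Hypothetical sketch of the du wrapper -- NOT the production CSCS script.

halve_sizes() {
    # du prints "<size><TAB><path>"; halve the size column
    awk -F'\t' 'BEGIN { OFS = "\t" } { $1 = int($1 / 2); print }'
}

on_gpfs() {
    # Assumption: GPFS mounts report fstype "gpfs" in stat -f
    [ "$(stat -f -c %T "$1" 2>/dev/null)" = "gpfs" ]
}

du_wrapper() {
    # Use the last argument as the target (defaults to the current directory)
    local target="." arg
    for arg in "$@"; do target="$arg"; done
    if on_gpfs "$target"; then
        command du "$@" | halve_sizes
    else
        command du "$@"
    fi
}
```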
Cream
Publishing
Only cream01 should publish values for the CPUs in the cluster. Other creams should not report this; the relevant file is
/var/lib/bdii/gip/ldif/static-file-Cluster.ldif
The following file determines whether the CREAM CE publishes as production or draining; this is controlled via cfengine.
/var/lib/bdii/gip/plugin/glite-info-dynamic-ce
This in turn runs a script from root's home directory that queries slurm in order to generate certain values, e.g.
vim /root/fakeinfo.bash
...
MaxJobsPerGroup=$(${SACCTMGR_CMD} -n -r list association account=${VO} format=grpjobs | awk '{print $1}')
...
Sym links
/usr/local/bin/sacct needs to be a symbolic link to /usr/bin/sacct
ll /usr/local/bin/sacct
lrwxrwxrwx 1 root root 14 9. Okt 10:14 /usr/local/bin/sacct -> /usr/bin/sacct
Job working directory
On the worker nodes, to ensure jobs run in tmpdir_slurm rather than the user's home, the following file is modified:
/etc/glite/cp_1.sh
Priorities
The following file is modified so that jobs submitted with sgm/ops accounts are run on a reservation:
/usr/libexec/slurm_local_submit_attributes.sh
We saw pilot failures across VOs due to their pilot jobs being queued for too long; as such, the reservation described below was implemented.
Accounting
The Slurm log parser used by APEL:
/usr/lib/python2.6/site-packages/apel/parsers/slurm.py
/usr/lib/python2.6/site-packages/apel/common/datetime_utils.py
have been patched according to this GGUS ticket. A few issues still need to be solved: the correct version should be included in the next release of the apel-parsers package (currently 1.1.2), which will be updated when available.
Reservation for ops
As we are unable to reserve a single core on a node, we have reserved two whole nodes to run these jobs:
ReservationName=priority_jobs StartTime=24 Oct 16:29 EndTime=24 Oct 2014 Duration=365-00:00:00
Nodes=wn65,wn73 NodeCnt=2 CoreCnt=64 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
Users=(null) Accounts=ops,dteam Licenses=(null) State=ACTIVE
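For reference, a reservation of this shape could be created with scontrol roughly as follows (a sketch only; the actual start/end times above come from the live system and are not reproduced here):

```
scontrol create reservation ReservationName=priority_jobs \
    starttime=now duration=365-00:00:00 \
    nodes=wn65,wn73 flags=ignore_jobs,spec_nodes \
    accounts=ops,dteam
```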
Jobs are directed to this reservation by a modification to the submission script.
vim /usr/libexec/slurm_local_submit_attributes.sh
...
# DTEAM
REGEX="dteam[0-9][0-9][0-9]"
USER=`whoami`
if [[ ( $USER =~ $REGEX ) ]] ; then
    # This extracts the queue from the SUDO command and assigns the dteam reservation if required
    QUEUE=$(echo $SUDO_COMMAND |awk -F'-q ' '{print $2}' | sed 's/ -n.*$//g' 2>&1)
    if [ "$QUEUE" == "cscs" ]; then
        echo "#SBATCH --reservation=priority_jobs"
    fi
fi
...
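The awk/sed extraction above can be checked in isolation. Here SUDO_COMMAND is set to a made-up sample value of the form the CE passes through sudo (the path and options are hypothetical):

```shell
# Sample value (hypothetical) of the kind the CE passes through sudo
SUDO_COMMAND="/usr/bin/sbatch -q cscs -n 1 /tmp/job.sh"
# Same extraction as in slurm_local_submit_attributes.sh
QUEUE=$(echo $SUDO_COMMAND | awk -F'-q ' '{print $2}' | sed 's/ -n.*$//g' 2>&1)
echo "$QUEUE"   # prints the queue name, here: cscs
```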
Arc
Submit with a non-ATLAS user
The file below can be modified to allow a user to submit to the ARC CE.
vim /usr/share/arc/ARC0ClusterInfo.pm
if ($q->{'name'} eq "cscs" and $sn !~ m/Pablo Fern/) { next; }
Accounting
Currently the accounting data publishing via jura is under investigation, but since temporary accounting data were filling up arc[01,02]'s disks, those data have been stored on the NAS for future reference:
nas.lcg.cscs.ch:/ifs/LCG/shared/apel_accounting_backup
/opt/apel_accounting_backup
For each machine a specific directory has been created where temporary data (i.e. APEL-compliant records not yet sent) can be moved from time to time to free some space on the disk:
[root@arc02:~]# mv /var/spool/arc/ssm/test-msg02.afroditi.hellasgrid.gr/outgoing/00000000/* /opt/apel_accounting_backup/arc02_outgoing_tmp/ssm/test-msg02/
[root@arc01:~]# mv /var/spool/arc/ssm/test-msg02.afroditi.hellasgrid.gr/outgoing/00000000/* /opt/apel_accounting_backup/arc01_outgoing_tmp/ssm/test-msg02/
Another file that can be moved in order to free some space is:
[root@arc02:~]# mv /var/spool/nordugrid/jobstatus/job.logger.errors /gpfs/apel_test/job.logger.errors_arc02_20140205
This file can easily grow to a few GB in case of sending errors reported by jura.
Modify the job comment to reflect the DN
The file "/usr/share/arc/submit-SLURM-job" has been modified to make the DN that submitted the job visible in the job comment. This gives much more detail when looking at tools like squeue.
MYUSERDN=$(/usr/bin/openssl x509 -in ${X509_USER_PROXY} -subject -noout | sed -r 's/.*= (.*)/\1/g' 2>&1)
MYHN=$(hostname -s)
COMMENT="\"$MYHN,$MYUSERDN\""
echo "#SBATCH --comment=$COMMENT" >> $LRMS_JOB_SCRIPT
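The sed expression can be exercised with a sample subject line. The DN below is made up; the real value comes from the user's X509 proxy via openssl as shown above, assuming an OpenSSL version that prints "subject= /...":

```shell
# Hypothetical subject line of the form printed by "openssl x509 -subject -noout"
SUBJECT="subject= /DC=ch/DC=cscs/OU=Users/CN=Jane Doe"
# Same extraction as in submit-SLURM-job: strip everything up to the last "= "
MYUSERDN=$(echo "$SUBJECT" | sed -r 's/.*= (.*)/\1/g')
echo "$MYUSERDN"   # prints: /DC=ch/DC=cscs/OU=Users/CN=Jane Doe
```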
Previously there were issues with the memory size requested by jobs; however, this has since been resolved upstream.
dCache
No real modifications specific to dCache itself for CSCS. See dCache wiki page for set-up.
Prevent publishing of the file access protocol
With the NFS41 domain the file access protocol is published and not easily disabled. As the WNs do not mount /pnfs, pilots will fail. To work around this we perform a sed. Below is an example taken prior to modifying the info provider script.
/var/lib/bdii/gip/provider/info-based-infoProvider.sh > /tmp/info.orig
sed -e '232,246d' /tmp/info.orig > /tmp/info.mod
diff /tmp/info.*
231a232,246
> dn: GlueSEAccessProtocolLocalID=NFSv41-storage02@nfs-storage02Domain,GlueSE
> UniqueID=storage01.lcg.cscs.ch,mds-vo-name=resource,o=grid
> objectClass: GlueSETop
> objectClass: GlueSEAccessProtocol
> objectClass: GlueKey
> objectClass: GlueSchemaVersion
> GlueSEAccessProtocolLocalID: NFSv41-storage02@nfs-storage02Domain
> GlueSEAccessProtocolType: file
> GlueSEAccessProtocolEndpoint: file://storage02.lcg.cscs.ch:2049
> GlueSEAccessProtocolMaxStreams: 5
> GlueSEAccessProtocolCapability: file transfer
> GlueSEAccessProtocolVersion: file
> GlueSchemaVersionMajor: 1
> GlueSchemaVersionMinor: 3
> GlueChunkKey: GlueSEUniqueID=storage01.lcg.cscs.ch
It appears that in later versions of dCache a second modification is required: the following property must be set to a blank value (by default it is set to file). We noticed this when upgrading to 2.6.27.
nfs.published.name =
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105586#update#6
--
GeorgeBrown - 2013-11-11