Service Card for Worker Node
Definition
At the time of writing these notes, all Worker Nodes are configured as follows:
- Scientific Linux 6.5 x86_64
- In-kernel IB stack
- EMI-3 middleware
- GPFS 3.5 client.
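A quick way to check that a given node matches this baseline (treat this as a sketch, since the exact package names may vary between nodes):
# cat /etc/redhat-release
# uname -mr
# rpm -qa | grep -i -e emi -e gpfs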
Operations
Client tools
Testing
See ServiceLRMS for how to test the batch system.
Start/stop procedures
Failover check
Checking logs
- If the system has glexec installed, check /var/log/glexec/lcas_lcmaps.log
- SLURM logs are in /var/log/slurmd.log
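For a quick look at recent activity in these logs (assuming the default locations above):
# tail -n 50 /var/log/glexec/lcas_lcmaps.log
# tail -f /var/log/slurmd.log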
Set up
Dependencies (other services, mount points, ...)
- The WN depends on NFS for the experiment software area:
  - nas.lcg.cscs.ch:/ifs/LCG/shared/exp_soft_arc/atlas/ mounted on /experiment_software/atlas
- The scratch file system from GPFS also has to be mounted and linked:
  - /home/nordugrid-atlas --> /gpfs2/gridhome/nordugrid-atlas-slurm
  - /home/nordugrid-atlas-slurm --> /gpfs2/gridhome/nordugrid-atlas-slurm
  - /home/wlcg --> /gpfs2/gridhome/wlcg
  - /tmpdir_slurm --> /gpfs2/scratch/tmpdir_slurm
Refer to HOWTORecreateSCRATCH for more information on this.
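A minimal sanity check of these dependencies, assuming the mount points and links listed above, is to verify that the NFS and GPFS mounts are present and that the symlinks resolve:
# mount | grep /experiment_software/atlas
# mount | grep gpfs2
# ls -ld /home/nordugrid-atlas /home/nordugrid-atlas-slurm /home/wlcg /tmpdir_slurm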
Installation
- PXE boot the system:
# ireset wnXX
# ipxe wnXX
- Once the node has finished the OS installation, install the following packages if CFEngine has not done it already:
# yum install emi-slurm-client emi-wn emi-glexec_wn globus-proxy-utils globus-gass-copy-progs --enablerepo=epel
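To confirm the packages actually ended up on the node (same package names as in the yum command above):
# rpm -q emi-slurm-client emi-wn emi-glexec_wn globus-proxy-utils globus-gass-copy-progs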
- If adding a new worker node, ensure that its FQDN is listed in /opt/cscs/siteinfo/wn-list.conf, which is managed under the SLURM group in CFEngine. Also make sure that /etc/ssh/shosts.equiv properly reflects the values in wn-list.conf (see the check below).
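A quick check that the new node appears in both files (wn01.lcg.cscs.ch is only an example hostname):
# grep wn01.lcg.cscs.ch /opt/cscs/siteinfo/wn-list.conf /etc/ssh/shosts.equiv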
- Also be sure to add the node to the DSH group for slurm
cd /srv/cfengine/DSHGROUPS
touch INPUT/groups/ALL/WN_SLURM/wn01.lcg.cscs.ch
make all
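Assuming dsh picks up the generated group, the new membership can then be exercised with something like:
# dsh -g WN_SLURM -- uptime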
- Run YAIM
/opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n WN -n GLEXEC_wn -n SLURM_utils -n SLURM_client
- Make sure the munge and slurm daemons are installed and running:
# service munge status
munged (pid 1651) is running...
# service slurm status
slurmd (pid 25551) is running...
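It is also worth confirming that SLURM sees the node and reports a sane state (wn01 is only an example node name):
# sinfo -N -n wn01
# scontrol show node wn01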
- If adding a new node, ensure the reservations are updated as detailed on the ServiceLRMS page; the current ones can be listed as shown below.
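A read-only check of the existing reservations, to confirm the new node has been included:
# scontrol show reservation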
Configuration
- LVM is used on all nodes to define the partitioning:
- On nodes with a single hard disk, a similar approach is followed, but with only one VG:
TODO
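Until the partitioning scheme is documented here, the actual LVM layout of any node can be inspected with the standard LVM tools:
# pvs
# vgs
# lvs
# df -h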
Upgrade
Monitoring
Instructions about monitoring the service
Pakiti
Pakiti provides a monitoring and notification mechanism to check the patching status of systems.
- Installation
Installation is only needed on the WNs here at CSCS-LCG2. Download the client from: https://pakiti.egi.eu/client.php?site=CSCS-LCG2
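A sketch of fetching the client manually (the destination path is arbitrary; how the client is actually invoked, e.g. via cron or CFEngine, should follow the Pakiti client documentation):
# wget -O /usr/local/bin/pakiti-client 'https://pakiti.egi.eu/client.php?site=CSCS-LCG2'
# chmod +x /usr/local/bin/pakiti-client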
- Host monitoring
At the following link you can check the patching status of all our WNs: https://pakiti.egi.eu/hosts.php
Note: to access this page you need the Security Operator role associated with your certificate.
Nagios
Ganglia
Self Sanity / revival?
Other?
Manuals
Issues
Issue 1: Intel software RAID out of sync (OLD, kept for reference).
Sometimes, when reinstalling a machine or replacing a hard disk, we need to activate the RAID so that it returns to OK status:
# dmraid -s -d
DEBUG: _find_set: searching isw_bgidcceegc
DEBUG: _find_set: not found isw_bgidcceegc
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: not found isw_bgidcceegc_Volume0
DEBUG: _find_set: not found isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc
DEBUG: _find_set: found isw_bgidcceegc
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: found isw_bgidcceegc_Volume0
DEBUG: _find_set: found isw_bgidcceegc_Volume0
DEBUG: set status of set "isw_bgidcceegc_Volume0" to 8
*** Group superset isw_bgidcceegc
--> Active Subset
name : isw_bgidcceegc_Volume0
size : 927985664
stride : 128
type : mirror
status : nosync
subsets: 0
devs : 2
spares : 0
DEBUG: freeing devices of RAID set "isw_bgidcceegc_Volume0"
DEBUG: freeing device "isw_bgidcceegc_Volume0", path "/dev/sda"
DEBUG: freeing device "isw_bgidcceegc_Volume0", path "/dev/sdb"
DEBUG: freeing devices of RAID set "isw_bgidcceegc"
DEBUG: freeing device "isw_bgidcceegc", path "/dev/sda"
DEBUG: freeing device "isw_bgidcceegc", path "/dev/sdb"
# dmraid -ay
# dmraid -s
*** Group superset isw_bgidcceegc
--> Active Subset
name : isw_bgidcceegc_Volume0
size : 927985664
stride : 128
type : mirror
status : ok
subsets: 0
devs : 2
spares : 0