Service Card for Worker Node
Definition
At the time of writing these notes, all Worker Nodes are configured as follows:
- Scientific Linux 6.4 x86_64
- In-kernel IB stack
- EMI-3 middleware
- GPFS 3.5 client.
Operations
Client tools
Testing
See how to test the batch system on the ServiceLRMS page.
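As a quick complement to the ServiceLRMS procedure, the sketch below is a minimal smoke test: it submits a trivial job and lists the caller's queue. It assumes the SLURM client tools are installed and that the caller may submit to some partition; the `/tmp/wn_smoke_%j.out` output path is just an illustration.

```shell
# Minimal batch-system smoke test (a sketch; run from a host with the
# SLURM client tools and submit permission on some partition).
if command -v sbatch >/dev/null 2>&1; then
  # Submit a trivial job that just prints the execution host.
  sbatch --wrap='hostname' --output=/tmp/wn_smoke_%j.out
  # Show the caller's queue to confirm the job was accepted.
  squeue -u "$(whoami)"
  smoke=submitted
else
  smoke=skipped   # no SLURM client tools on this host
fi
echo "smoke test: $smoke"
```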
Start/stop procedures
Failover check
Checking logs
- If the system has glexec installed, check /var/log/glexec/lcas_lcmaps.log
- SLURM logs are in /var/log/slurmd.log
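A read-only sketch that tails both logs named above, skipping any that are absent (e.g. lcas_lcmaps.log on a node without glexec):

```shell
# Show the tail of each log mentioned above; tolerate missing files.
checked=0
for log in /var/log/glexec/lcas_lcmaps.log /var/log/slurmd.log; do
  checked=$((checked + 1))
  if [ -r "$log" ]; then
    echo "== $log =="
    tail -n 20 "$log"
  else
    echo "== $log (not present or not readable on this node) =="
  fi
done
```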
Set up
Dependencies (other services, mount points, ...)
- WN depends on NFS for the experiment software area:
- nfs:/experiment_software/atlas on /experiment_software/atlas
- nfs:/experiment_software/cms on /experiment_software/cms
- nfs:/experiment_software/lhcb on /experiment_software/lhcb
- nfs:/experiment_software/others/dech on /experiment_software/dech
- nfs:/experiment_software/others/dteam on /experiment_software/dteam
- nfs:/experiment_software/others/gear on /experiment_software/gear
- nfs:/experiment_software/others/ops on /experiment_software/ops
- nfs:/experiment_software/others/hone on /experiment_software/hone
- The scratch file system from GPFS also has to be mounted, with the following symlinks in place:
- /home/nordugrid-atlas-slurm -> /gpfs/home/nordugrid-atlas-slurm
- /home/wlcg -> /gpfs/home/wlcg
- /tmpdir_slurm -> /gpfs/tmpdir_slurm
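The dependencies above can be verified with a read-only sketch: it reports the mount state of each NFS software area and the target of each GPFS symlink, changing nothing, so it is safe on a live WN.

```shell
# Check the NFS software areas and GPFS symlinks listed above.
n_areas=0
for m in /experiment_software/atlas /experiment_software/cms \
         /experiment_software/lhcb /experiment_software/dech \
         /experiment_software/dteam /experiment_software/gear \
         /experiment_software/ops /experiment_software/hone; do
  n_areas=$((n_areas + 1))
  if mountpoint -q "$m" 2>/dev/null; then
    echo "mounted:     $m"
  else
    echo "NOT mounted: $m"
  fi
done
for l in /home/nordugrid-atlas-slurm /home/wlcg /tmpdir_slurm; do
  echo "$l -> $(readlink "$l" 2>/dev/null || echo '(no symlink)')"
done
```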
Installation
- PXE boot the system:
# ireset wnXX
# ipxe wnXX
- Once the node has finished the OS installation, install the following packages (if CFEngine has not already done so):
# yum install emi-slurm-client emi-wn emi-glexec_wn globus-proxy-utils globus-gass-copy-progs --enablerepo=epel
- If adding a new worker node, ensure that its FQDN is listed in /opt/cscs/siteinfo/wn-list.conf, which is under the SLURM group in CFEngine. Also, do not forget to make sure /etc/ssh/shosts.equiv properly reflects the values in wn-list.conf.
- Also be sure to add the node to the DSH group for SLURM:
cd /srv/cfengine/DSHGROUPS
touch INPUT/groups/ALL/WN_SLURM/wn01.lcg.cscs.ch
make all
- Run YAIM:
/opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n WN -n GLEXEC_wn -n SLURM_utils -n SLURM_client
- Make sure the munge and slurm daemons are installed and running:
# service munge status
munged (pid 1651) is running...
# service slurm status
slurmd (pid 25551) is running...
- If adding a new node, ensure the reservations are updated as detailed on the LRMS page.
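After the steps above, two quick sanity checks are worth running. This is a sketch, all read-only: `munge -n | unmunge` is the standard munge self-test, and the file paths are the ones given in the installation steps (the check assumes one hostname per line in both files).

```shell
# 1. Munge credential round-trip: encode and decode a credential
#    against the local munged daemon.
if command -v munge >/dev/null 2>&1; then
  if munge -n | unmunge >/dev/null 2>&1; then
    munge_ok=yes
  else
    munge_ok=no      # munged not running, or key mismatch
  fi
else
  munge_ok=skipped   # munge not installed on this host
fi
echo "munge round-trip: $munge_ok"

# 2. Every FQDN in wn-list.conf should also appear in shosts.equiv.
wn_list=/opt/cscs/siteinfo/wn-list.conf
shosts=/etc/ssh/shosts.equiv
missing=""
if [ -r "$wn_list" ]; then
  while read -r host; do
    [ -n "$host" ] || continue
    grep -qx "$host" "$shosts" 2>/dev/null || missing="$missing $host"
  done < "$wn_list"
fi
echo "hosts missing from shosts.equiv:${missing:- none}"
```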
Configuration
- LVM is used on all nodes to define the partitioning:
- On nodes with a single hard disk a similar approach is followed, but with only one VG:
TODO
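Until the TODO above is filled in, the current layout of any node can be inspected with LVM's own reporting commands; this sketch is read-only (root is needed to see the devices):

```shell
# Show PVs, VGs and LVs as currently defined on the node.
inspected=0
for cmd in pvs vgs lvs; do
  inspected=$((inspected + 1))
  echo "== $cmd =="
  if command -v "$cmd" >/dev/null 2>&1; then
    "$cmd" 2>/dev/null || echo "($cmd failed: need root?)"
  else
    echo "($cmd not available on this host)"
  fi
done
```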
Upgrade
Monitoring
Instructions about monitoring the service
Nagios
Ganglia
Self Sanity / revival?
Other?
Manuals
Issues
Issue 1: Intel software RAID out of sync (OLD, kept for reference).
Sometimes, when reinstalling a machine or replacing a hard disk, the RAID must be re-activated to bring it back to OK status:
# dmraid -s -d
DEBUG: _find_set: searching isw_bgidcceegc
DEBUG: _find_set: not found isw_bgidcceegc
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: not found isw_bgidcceegc_Volume0
DEBUG: _find_set: not found isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc
DEBUG: _find_set: found isw_bgidcceegc
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: found isw_bgidcceegc_Volume0
DEBUG: _find_set: found isw_bgidcceegc_Volume0
DEBUG: set status of set "isw_bgidcceegc_Volume0" to 8
*** Group superset isw_bgidcceegc
--> Active Subset
name : isw_bgidcceegc_Volume0
size : 927985664
stride : 128
type : mirror
status : nosync
subsets: 0
devs : 2
spares : 0
DEBUG: freeing devices of RAID set "isw_bgidcceegc_Volume0"
DEBUG: freeing device "isw_bgidcceegc_Volume0", path "/dev/sda"
DEBUG: freeing device "isw_bgidcceegc_Volume0", path "/dev/sdb"
DEBUG: freeing devices of RAID set "isw_bgidcceegc"
DEBUG: freeing device "isw_bgidcceegc", path "/dev/sda"
DEBUG: freeing device "isw_bgidcceegc", path "/dev/sdb"
# dmraid -ay
# dmraid -s
*** Group superset isw_bgidcceegc
--> Active Subset
name : isw_bgidcceegc_Volume0
size : 927985664
stride : 128
type : mirror
status : ok
subsets: 0
devs : 2
spares : 0