
Service Card for Worker Node

Definition

At the time of writing these notes, all Worker Nodes are configured as follows:

  • Scientific Linux 6.5 x86_64
  • In-kernel IB stack
  • EMI-3 middleware
  • GPFS 3.5 client.
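
A quick way to confirm that a node matches this baseline (a minimal sketch; the GPFS package name is an assumption and may vary by release):
  # cat /etc/redhat-release      # expect Scientific Linux release 6.5
  # uname -rm                    # kernel and architecture (x86_64)
  # rpm -q gpfs.base             # GPFS client package (name assumed)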

Operations

Client tools

Testing

See ServiceLRMS for how to test the batch system.
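
A quick local sanity check (a minimal sketch; wn01 is just an example node name, and the authoritative test procedure is on the ServiceLRMS page):
  # sinfo -n wn01                # node should not be drained or down
  # srun -w wn01 -N1 hostname    # run a trivial job pinned to the node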

Start/stop procedures

Failover check

Checking logs

  • If the system has glexec installed, check /var/log/glexec/lcas_lcmaps.log
  • SLURM logs are in /var/log/slurmd.log
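
For example (a sketch; the paths are those listed above):
  # tail -n 50 /var/log/glexec/lcas_lcmaps.log   # recent authorization/mapping decisions
  # grep -i error /var/log/slurmd.log            # slurmd errors on this node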

Set up

Dependencies (other services, mount points, ...)

  • WN depends on NFS for the experiment software area:
    • nas.lcg.cscs.ch:/ifs/LCG/shared/exp_soft_arc/atlas/ on /experiment_software/atlas

  • Also the scratch file system from GPFS has to be mounted and linked:
    • /home/nordugrid-atlas --> /gpfs2/gridhome/nordugrid-atlas-slurm
    • /home/nordugrid-atlas-slurm --> /gpfs2/gridhome/nordugrid-atlas-slurm
    • /home/wlcg --> /gpfs2/gridhome/wlcg
    • /tmpdir_slurm --> /gpfs2/scratch/tmpdir_slurm
Refer to HOWTORecreateSCRATCH for more information on this.
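
A quick check that the mounts and links above are in place (a sketch; missing links are recreated as described in HOWTORecreateSCRATCH):
  # mount | grep experiment_software                   # NFS experiment software area mounted?
  # ls -ld /home/nordugrid-atlas /home/nordugrid-atlas-slurm /home/wlcg /tmpdir_slurm   # should all point into /gpfs2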

Installation

  • PXE boot the system:
    # ireset wnXX
    # ipxe wnXX
  • Once the node has finished the OS installation, install the following packages if CFEngine hasn't already done so:
    # yum install emi-slurm-client emi-wn emi-glexec_wn globus-proxy-utils globus-gass-copy-progs --enablerepo=epel
    
  • If adding a new worker node, ensure that its FQDN is listed in /opt/cscs/siteinfo/wn-list.conf (managed under the SLURM group in CFEngine), and make sure /etc/ssh/shosts.equiv properly reflects the values in wn-list.conf (a quick consistency check is sketched at the end of this list).
  • Also be sure to add the node to the DSH group for SLURM:
    cd /srv/cfengine/DSHGROUPS
    touch INPUT/groups/ALL/WN_SLURM/wn01.lcg.cscs.ch
    make all
  • Run YAIM
    /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def  -n WN -n GLEXEC_wn -n SLURM_utils -n SLURM_client
  • Make sure the munge and slurm daemons are installed and running:
    # service munge status
    munged (pid 1651) is running...
    # service slurm status
    slurmd (pid 25551) is running...

  • Bring up CVMFS:
    service autofs start   # autofs is chkconfig'd by CFEngine, but we need to start it by hand since the machine hasn't rebooted yet
    cvmfs_config probe

  • If adding a new node, ensure the reservations are updated as detailed on the LRMS page.

  • Additionally, before putting the node online, you can run the node health check script and make sure its exit code is 0:
    # /etc/slurm/nodeHealthCheck.sh
    # echo $?
    0

  • If everything is OK, you can set the node online using scontrol:
    scontrol update nodename=$(hostname -s) state=resume
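
A quick consistency check for the wn-list.conf / shosts.equiv step above (a sketch; wn01.lcg.cscs.ch is just an example FQDN):
  # grep wn01.lcg.cscs.ch /opt/cscs/siteinfo/wn-list.conf   # node is listed for the SLURM group
  # grep wn01.lcg.cscs.ch /etc/ssh/shosts.equiv             # shosts.equiv reflects wn-list.conf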

Configuration

  • LVM is used on all nodes to define the partitioning (a quick mount check is sketched after this list):
    • On those nodes with two hard disks, one is configured for the OS and the other for the CVMFS cache (/cvmfs_local).
      # pvdisplay |grep -A 1 'PV Name'
        PV Name               /dev/sdb1
        VG Name               vg_cvmfs
      --
        PV Name               /dev/sda2
        VG Name               vg_root
      # lvdisplay |grep 'LV Path' -A 2
        LV Path                /dev/vg_cvmfs/lv_cvmfs
        LV Name                lv_cvmfs
        VG Name                vg_cvmfs
      --
        LV Path                /dev/vg_root/lv_swap
        LV Name                lv_swap
        VG Name                vg_root
      --
        LV Path                /dev/vg_root/lv_root
        LV Name                lv_root
        VG Name                vg_root
  • On those nodes with one hard disk, a similar approach is followed, but with only one VG:
    TODO
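
To verify that the CVMFS cache volume described above is mounted (a minimal sketch, assuming it is mounted on /cvmfs_local as noted):
  # df -h /cvmfs_local        # should show /dev/mapper/vg_cvmfs-lv_cvmfs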

Upgrade

Monitoring

Instructions about monitoring the service

Pakiti

Pakiti provides a monitoring and notification mechanism to check the patching status of systems.

  • Installation
    Installation is only needed on the WNs here at CSCS-LCG2. Download the client from: https://pakiti.egi.eu/client.php?site=CSCS-LCG2 (a fetch example is sketched after this list).
  • Host monitoring
    At the following link you can check the patching status of all our WNs: https://pakiti.egi.eu/hosts.php
    Note: to access these pages you need the Security Operator role associated with your certificate.
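
A possible way to fetch the client on a node (a sketch, assuming outbound HTTPS from the WN; deployment, e.g. running it regularly from cron, should follow the instructions shipped with the client):
  # wget -O pakiti-client "https://pakiti.egi.eu/client.php?site=CSCS-LCG2"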

Nagios

Ganglia

Self Sanity / revival?

Other?

Manuals

Issues

Issue 1: Intel software RAID out of sync (OLD, kept for reference).

Sometimes, when reinstalling a machine or replacing a hard disk, we need to reactivate the RAID set to bring it back to OK status:

# dmraid -s -d
DEBUG: _find_set: searching isw_bgidcceegc
DEBUG: _find_set: not found isw_bgidcceegc
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: not found isw_bgidcceegc_Volume0
DEBUG: _find_set: not found isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc
DEBUG: _find_set: found isw_bgidcceegc
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: found isw_bgidcceegc_Volume0
DEBUG: _find_set: found isw_bgidcceegc_Volume0
DEBUG: set status of set "isw_bgidcceegc_Volume0" to 8
*** Group superset isw_bgidcceegc
--> Active Subset
name   : isw_bgidcceegc_Volume0
size   : 927985664
stride : 128
type   : mirror
status : nosync
subsets: 0
devs   : 2
spares : 0
DEBUG: freeing devices of RAID set "isw_bgidcceegc_Volume0"
DEBUG: freeing device "isw_bgidcceegc_Volume0", path "/dev/sda"
DEBUG: freeing device "isw_bgidcceegc_Volume0", path "/dev/sdb"
DEBUG: freeing devices of RAID set "isw_bgidcceegc"
DEBUG: freeing device "isw_bgidcceegc", path "/dev/sda"
DEBUG: freeing device "isw_bgidcceegc", path "/dev/sdb"
# dmraid -ay
# dmraid -s
*** Group superset isw_bgidcceegc
--> Active Subset
name   : isw_bgidcceegc_Volume0
size   : 927985664
stride : 128
type   : mirror
status : ok
subsets: 0
devs   : 2
spares : 0

ServiceCardForm
  Service name: WN
  Machines this service is installed in: wn[01-48,50,52-79]
  Is Grid service: Yes
  Depends on the following services: cvmfs, gpfs2, nfs, lrms
  Expert: Gianni Ricciardi
  CM: CfEngine
  Provisioning: PuppetForeman