<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->
---+ Service Card for Worker Node

Short description of the service.

%TOC%

---++ Definition

As of today, all Worker Nodes are configured as follows:
   * Scientific Linux 5.7 x86_64
   * Mellanox IB stack
   * UMD 1 middleware
   * GPFS 3.4.0-13 client

---++ Operations

Interesting information like how to deal with the service.

---+++ Client tools

---+++ Testing

See ServiceLRMS for how to test the batch system.

---+++ Start/stop procedures

---+++ Failover check

---+++ Checking logs

   * Standard PBS logs are in =/var/spool/pbs/mom_logs/=
   * If the system has glexec installed, check =/var/log/glexec/lcas_lcmaps.log=

---++ Set up

Instructions on how to set up the service:

---+++ Dependencies (other services, mount points, ...)

   * The WN depends on NFS for the experiment software area:
      * <verbatim>nfs:/experiment_software/atlas on /experiment_software/atlas</verbatim>
      * <verbatim>nfs:/experiment_software/cms on /experiment_software/cms</verbatim>
      * <verbatim>nfs:/experiment_software/lhcb on /experiment_software/lhcb</verbatim>
      * <verbatim>nfs:/experiment_software/others/dech on /experiment_software/dech</verbatim>
      * <verbatim>nfs:/experiment_software/others/dteam on /experiment_software/dteam</verbatim>
      * <verbatim>nfs:/experiment_software/others/gear on /experiment_software/gear</verbatim>
      * <verbatim>nfs:/experiment_software/others/ops on /experiment_software/ops</verbatim>
      * <verbatim>nfs:/experiment_software/others/hone on /experiment_software/hone</verbatim>
   * Moreover, a scratch file system from GPFS has to be mounted and linked from =/tmpdir_slurm= and =/home/wlcg=. <verbatim>
Apr 18 15:29 [root@wn101:~]# ls -ld /tmpdir_slurm
lrwxrwxrwx 1 root root 16 Mar 29 10:36 /tmpdir_slurm -> /gpfs/tmpdir_slurm
</verbatim>

---+++ Installation

   * If the system is installed by cfengine, GPFS should also be installed already. Verify that =/gpfs= is mounted *via NFS* from =ppnfs=. Information about GPFS on Phoenix is available at ServiceGPFS.
   * Install these packages as follows: <verbatim>
# cfagent -q
umount /lcg.cscs.ch/packages/rpms
echo "touch /var/lock/subsys/local" > /etc/rc.d/rc.local
rm -fv /etc/yum.repos.d/sl-security.repo # old SL59 security repo
yum update -y # Update all possible from sl-security but NOT any IB or kernel related package.
yum install ca-policy-egi-core -y
yum install libtorque-2.4.16-1.cri $(ssh lrms01 'rpm -qa |grep torque-client') torque --disableexcludes=main --enablerepo=cscs -y
yum install emi-torque-client --enablerepo=epel -y
yum install cvmfs cvmfs-keys cvmfs-init-scripts emi-wn emi-glexec_wn --enablerepo=epel -y
chkconfig autofs on
service autofs start
scp ppnfs:/var/mmfs/gen/mmsdrfs /var/mmfs/gen/
mmrefresh -f
mmstartup
mmgetstate
</verbatim>
   * Reboot the system to make sure everything gets mounted and GPFS is started on each reboot. <verbatim>
reboot
</verbatim>
   * Now, make sure that all the mount points are in place and that cvmfs is working: <verbatim>
mmstartup ; sleep 5s; ls /gpfs; df -h |grep gpfs
cvmfs_config probe
Probing /cvmfs/atlas.cern.ch... OK
Probing /cvmfs/atlas-condb.cern.ch... OK
Probing /cvmfs/lhcb.cern.ch... OK
Probing /cvmfs/hone.cern.ch... Failed!
Probing /cvmfs/cms.cern.ch... OK

mount |grep 'gpfs\|experiment'
ppnfs:/gpfs on /gpfs type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.64.227)
ppnfs:/gpfs/preproduction on /gpfs_pp type nfs (rw,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.64.227)
nfs:/experiment_software/atlas on /experiment_software/atlas type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/cms on /experiment_software/cms type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/lhcb on /experiment_software/lhcb type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/others/dech on /experiment_software/dech type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/others/dteam on /experiment_software/dteam type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/others/gear on /experiment_software/gear type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/others/ops on /experiment_software/ops type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/others/hone on /experiment_software/hone type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
</verbatim>
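   * Optionally, the mount points and symlinks listed under Dependencies can be cross-checked with a small script. This is only a minimal sketch (paths taken from the list above; =/home/wlcg= being a symlink is an assumption, adjust to the actual layout): <verbatim>
# Check the experiment software NFS mounts, the GPFS mount and the scratch symlinks
for fs in atlas cms lhcb dech dteam gear ops hone; do
  mountpoint -q /experiment_software/$fs || echo "MISSING mount: /experiment_software/$fs"
done
mountpoint -q /gpfs || echo "MISSING mount: /gpfs"
for link in /tmpdir_slurm /home/wlcg; do
  [ -L "$link" ] || echo "MISSING symlink: $link"
done
</verbatim>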
   * At this point, we need to configure the installed software with YAIM: <verbatim>
## /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n WN -n TORQUE_client -n GLEXEC_wn
nohup /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n WN -n TORQUE_client -n GLEXEC_wn 2>&1 | tee /root/yaim.log &
</verbatim>
   * And run =cfengine= and =grid-service2 restart=: <verbatim>
cfagent -q; grid-service2 restart
</verbatim>

---++++ EMI-3 (SLURM)

   * Install the following packages: <verbatim>
# yum install emi-slurm-client emi-wn emi-glexec_wn globus-proxy-utils globus-gass-copy-progs --enablerepo=epel
</verbatim>
   * If adding a new worker node, ensure that its FQDN is listed in =/opt/cscs/siteinfo/wn-list.conf=, which is under the SLURM group in cfengine. Also, do not forget to make sure =/etc/ssh/shosts.equiv= properly reflects the values in =wn-list.conf=.
   * Also be sure to add the node to the DSH group for SLURM: <verbatim>
cd /srv/cfengine/DSHGROUPS
touch INPUT/groups/ALL/WN_SLURM/wn01.lcg.cscs.ch
make all
</verbatim>
   * Run YAIM: <verbatim>
/opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n WN -n GLEXEC_wn -n SLURM_utils -n SLURM_client
</verbatim>
   * Make sure the munge and slurm daemons are installed and running: <verbatim>
# service munge status
munged (pid 1651) is running...
# service slurm status
slurmd (pid 25551) is running...
</verbatim>
   * Bring up cvmfs: <verbatim>
service autofs start # autofs is chkconfig'd by cfengine, but we need to start it here as the machine hasn't rebooted
cvmfs_config probe
</verbatim>
   * If adding a new node, ensure the reservations are updated as detailed on the LRMS page.
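   * Once =slurmd= is up, it is worth cross-checking that the controller sees the new node. A minimal sketch, assuming the SLURM client tools from =emi-slurm-client= are available and using =wn01= as an example node name: <verbatim>
scontrol show node wn01    # the node should be listed with a sensible State once slurmd has registered
sinfo -N -l | grep wn01    # cross-check partition membership and node state
</verbatim>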
---+++ Configuration

<verbatim>
Apr 18 15:13 [root@wn101:~]# cat /var/spool/pbs/mom_priv/config
$logevent 255
# MOM interval in seconds. Should be <= server's job_stat_rate
$check_poll_time 90
# Interval of information update to server. Should be <= scheduling interval
$status_update_time 90
$timeout 30
# Moab takes care of killing jobs. This allows jobs to overrun walltime by some time
$ignwalltime true
$usecp arc01.lcg.cscs.ch:/home/nordugrid-atlas /home/nordugrid-atlas
$usecp arc02.lcg.cscs.ch:/home/nordugrid-atlas /home/nordugrid-atlas
$usecp ce01.lcg.cscs.ch:/home /home
$usecp ce02.lcg.cscs.ch:/home /home
# gLite 3.2 CREAM
$usecp cream01.lcg.cscs.ch:/opt/glite/var/cream_sandbox /lustre/scratch/CREAM_CE/cream01/cream_sandbox
# For EMI 1 (UMD 1.0.0 & UMD 1.1.0 releases) this line must be the following:
$usecp cream02.lcg.cscs.ch:/var/cream_sandbox /lustre/scratch/CREAM_CE/cream02/cream_sandbox
$tmpdir /tmpdir_pbs
# Torque's default connection timeout is 10ms instead of 10s... should be fixed in a later release, but for now:
# 4s works fine in production at Cyfronet (should be fine for Phoenix too)
$max_conn_timeout_micro_sec 4000000
# scale cputime and walltime to the average HEP-SPEC06 published
# Average HEP-SPEC06/core (C+D): 9.69
# PhaseC: 10  --> 1.03
# PhaseD: 8.2 --> 0.85
$cputmult 1.03
$wallmult 1.03
# in case Lustre is slow we want to prevent that the job gets requeued
$prologalarm 600
</verbatim>
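The =$cputmult= and =$wallmult= values above are, as the comments note, the per-core HEP-SPEC06 of the node generation divided by the published average per-core HEP-SPEC06 (9.69). A minimal sketch of the arithmetic, using the numbers from the comments: <verbatim>
# multiplier = per-core HEP-SPEC06 of the node generation / published average per-core HEP-SPEC06
printf "PhaseC: %.2f\n" $(echo "10.0/9.69" | bc -l)   # 1.03 -> the $cputmult / $wallmult set on this node
printf "PhaseD: %.2f\n" $(echo "8.2/9.69"  | bc -l)   # 0.85
</verbatim>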
---+++ Upgrade

   * gLite 3.2: Run <verbatim>
/usr/local/bin/yum-with-glite groupupdate --enablerepo=cscs glite-WN
</verbatim>
   * EMI 1: Run a simple update of =emi-wn= and =emi-glexec_wn=. Do not attempt to do it for all packages, as there is a newer version of libtorque in the CSCS repo that would otherwise be pulled in. <verbatim>
yum update --enablerepo=cscs --enablerepo=epel emi-wn emi-glexec_wn
</verbatim> %ICON{"warning"}% Make sure that the torque packages are taken from the CSCS repo!

---++ Monitoring

Instructions about monitoring the service.

---+++ Nagios

---+++ Ganglia

---+++ Self Sanity / revival?

---+++ Other?

---++ Manuals

   * [[https://twiki.cern.ch/twiki/bin/view/EGEE/GliteWN][glite-WN Service Reference Card]]
   * [[http://glite.cern.ch/glite-WN/][gLite Release Notes]]
   * [[http://www.adaptivecomputing.com/resources/docs/torque/index.php][Torque]]

---++ Issues

Information about issues found with this service, and how to deal with them.

---+++ Issue1

If, after the installation of a new node, jobs fail on that node and you get this message in the node's =/var/log/messages=: <verbatim>
Sep 7 17:37:37 ppwn04 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB dteam001@ppcream02.lcg.cscs.ch:/var/local_cream_sandbox/dteam/_DC_com_DC_quovadisglobal_DC_grid_DC_switch_DC_users_C_CH_O_ETH_Zuerich_CN_Miguel_Angel_Gila_Arrondo_dteam_Role_NULL_Capability_NULL_dteam001/proxy/005d0b069e96cba166a0f1caf82a7ad25cc7b77612719093722029 crpp2_788806913.proxy' failed with status=1, giving up after 4 attempts
Sep 7 17:37:37 ppwn04 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file dteam001@ppcream02.lcg.cscs.ch:/var/local_cream_sandbox/dteam/_DC_com_DC_quovadisglobal_DC_grid_DC_switch_DC_users_C_CH_O_ETH_Zuerich_CN_Miguel_Angel_Gila_Arrondo_dteam_Role_NULL_Capability_NULL_dteam001/proxy/005d0b069e96cba166a0f1caf82a7ad25cc7b77612719093722029 to crpp2_788806913.proxy
</verbatim> then make sure that the =ssh_known_hosts= file has been generated recently and contains the new keys by running the following command on the cfengine server: <verbatim>
/srv/cfengine/scripts/new_known_hosts
</verbatim>

---+++ Issue2: Intel sw RAID out of sync

Sometimes, when reinstalling a machine or replacing a hard disk, we need to activate the RAID set to bring it back to OK status: <verbatim>
# dmraid -s -d
DEBUG: _find_set: searching isw_bgidcceegc
DEBUG: _find_set: not found isw_bgidcceegc
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: not found isw_bgidcceegc_Volume0
DEBUG: _find_set: not found isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc
DEBUG: _find_set: found isw_bgidcceegc
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: found isw_bgidcceegc_Volume0
DEBUG: _find_set: found isw_bgidcceegc_Volume0
DEBUG: set status of set "isw_bgidcceegc_Volume0" to 8
*** Group superset isw_bgidcceegc
--> Active Subset
name   : isw_bgidcceegc_Volume0
size   : 927985664
stride : 128
type   : mirror
status : nosync
subsets: 0
devs   : 2
spares : 0
DEBUG: freeing devices of RAID set "isw_bgidcceegc_Volume0"
DEBUG: freeing device "isw_bgidcceegc_Volume0", path "/dev/sda"
DEBUG: freeing device "isw_bgidcceegc_Volume0", path "/dev/sdb"
DEBUG: freeing devices of RAID set "isw_bgidcceegc"
DEBUG: freeing device "isw_bgidcceegc", path "/dev/sda"
DEBUG: freeing device "isw_bgidcceegc", path "/dev/sdb"

# dmraid -ay

# dmraid -s
*** Group superset isw_bgidcceegc
--> Active Subset
name   : isw_bgidcceegc_Volume0
size   : 927985664
stride : 128
type   : mirror
status : ok
subsets: 0
devs   : 2
spares : 0
</verbatim>
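A quick way to re-check the RAID state afterwards is to filter the =dmraid -s= output for the name and status lines (a minimal sketch; the =isw_*= set name varies per machine). The status should read =ok=; if it still shows =nosync=, activate the set again with =dmraid -ay= as above. <verbatim>
# Show only the set name and sync status of the Intel software RAID
dmraid -s | grep -E 'name|status'
</verbatim>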
---++ ServiceCardForm

| *Service name* | WN |
| *Machines this service is installed in* | wn[01-79] |
| *Is Grid service* | Yes |
| *Depends on the following services* | cvmfs, gpfs, nfs, lrms |
| *Expert* | |