<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->
---+ Service Card for Worker Node

Short description of the service.

%TOC%

---++ Definition

As of today, all Worker Nodes are configured as follows:
   * Scientific Linux 5.7 x86_64
   * Mellanox IB stack
   * UMD 1 middleware
   * GPFS 3.4.0-13 client

---++ Operations

Interesting information like how to deal with the service.

---+++ Client tools

---+++ Testing

See ServiceLRMS for how to test the batch system.

---+++ Start/stop procedures

---+++ Failover check

---+++ Checking logs

   * Standard PBS logs are in =/var/spool/pbs/mom_logs/=
   * If the system has glexec installed, check =/var/log/glexec/lcas_lcmaps.log=

---++ Set up

Instructions on how to set up the service:

---+++ Dependencies (other services, mount points, ...)

   * The WN depends on NFS for the experiment software area:
      * <verbatim>nfs:/experiment_software/atlas on /experiment_software/atlas</verbatim>
      * <verbatim>nfs:/experiment_software/cms on /experiment_software/cms</verbatim>
      * <verbatim>nfs:/experiment_software/lhcb on /experiment_software/lhcb</verbatim>
      * <verbatim>nfs:/experiment_software/others/dech on /experiment_software/dech</verbatim>
      * <verbatim>nfs:/experiment_software/others/dteam on /experiment_software/dteam</verbatim>
      * <verbatim>nfs:/experiment_software/others/gear on /experiment_software/gear</verbatim>
      * <verbatim>nfs:/experiment_software/others/ops on /experiment_software/ops</verbatim>
      * <verbatim>nfs:/experiment_software/others/hone on /experiment_software/hone</verbatim>
   * Moreover, a scratch file system from GPFS has to be mounted and linked from =/tmpdir_slurm= and =/home/wlcg=. <verbatim>
Apr 18 15:29 [root@wn101:~]# ls -ld /tmpdir_slurm
lrwxrwxrwx 1 root root 16 Mar 29 10:36 /tmpdir_slurm -> /gpfs/tmpdir_slurm
</verbatim>

---+++ Installation

   * If the system is installed by cfengine, GPFS should also be installed already. Verify that =/gpfs= is mounted *via NFS* from =ppnfs=. Information about GPFS on Phoenix is available at ServiceGPFS.
   * Install these packages as follows: <verbatim>
# cfagent -q
umount /lcg.cscs.ch/packages/rpms
echo "touch /var/lock/subsys/local" > /etc/rc.d/rc.local
rm -fv /etc/yum.repos.d/sl-security.repo # old SL59 security repo
yum update -y # Update all possible from sl-security but NOT any IB or kernel related package.
yum install ca-policy-egi-core -y
yum install libtorque-2.4.16-1.cri $(ssh lrms01 'rpm -qa |grep torque-client') torque --disableexcludes=main --enablerepo=cscs -y
yum install emi-torque-client --enablerepo=epel -y
yum install cvmfs cvmfs-keys cvmfs-init-scripts emi-wn emi-glexec_wn --enablerepo=epel -y
chkconfig autofs on
service autofs start
scp ppnfs:/var/mmfs/gen/mmsdrfs /var/mmfs/gen/
mmrefresh -f
mmstartup
mmgetstate
</verbatim>
   * Reboot the system to make sure everything gets mounted and GPFS is started on each reboot. <verbatim>
reboot
</verbatim>
   * Now, make sure that all the mount points are in place and that cvmfs is working: <verbatim>
mmstartup ; sleep 5s; ls /gpfs; df -h |grep gpfs
cvmfs_config probe
Probing /cvmfs/atlas.cern.ch... OK
Probing /cvmfs/atlas-condb.cern.ch... OK
Probing /cvmfs/lhcb.cern.ch... OK
Probing /cvmfs/hone.cern.ch... Failed!
Probing /cvmfs/cms.cern.ch... OK

mount |grep 'gpfs\|experiment'
ppnfs:/gpfs on /gpfs type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.64.227)
ppnfs:/gpfs/preproduction on /gpfs_pp type nfs (rw,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.64.227)
nfs:/experiment_software/atlas on /experiment_software/atlas type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/cms on /experiment_software/cms type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/lhcb on /experiment_software/lhcb type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/others/dech on /experiment_software/dech type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/others/dteam on /experiment_software/dteam type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/others/gear on /experiment_software/gear type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/others/ops on /experiment_software/ops type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
nfs:/experiment_software/others/hone on /experiment_software/hone type nfs (ro,bg,proto=tcp,rsize=32768,wsize=32768,soft,intr,nfsvers=3,addr=148.187.67.100)
</verbatim>
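   * Optionally, the mount points and symlinks listed under Dependencies can be cross-checked with a small script. This is only a minimal sketch (paths taken from the list above; =/home/wlcg= being a symlink is an assumption, adjust to the actual layout): <verbatim>
# Check the experiment software NFS mounts, the GPFS mount and the scratch symlinks
for fs in atlas cms lhcb dech dteam gear ops hone; do
  mountpoint -q /experiment_software/$fs || echo "MISSING mount: /experiment_software/$fs"
done
mountpoint -q /gpfs || echo "MISSING mount: /gpfs"
for link in /tmpdir_slurm /home/wlcg; do
  [ -L "$link" ] || echo "MISSING symlink: $link"
done
</verbatim>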
   * At this point, we need to configure the installed software with YAIM: <verbatim>
## /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n WN -n TORQUE_client -n GLEXEC_wn
nohup /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n WN -n TORQUE_client -n GLEXEC_wn 2>&1 | tee /root/yaim.log &
</verbatim>
   * And run =cfengine= and =grid-service2 restart=: <verbatim>
cfagent -q; grid-service2 restart
</verbatim>

---++++ EMI-3 (SLURM)

   * Install the following packages: <verbatim>
# yum install emi-slurm-client emi-wn emi-glexec_wn globus-proxy-utils globus-gass-copy-progs --enablerepo=epel
</verbatim>
   * If adding a new worker node, ensure that its FQDN is listed in =/opt/cscs/siteinfo/wn-list.conf=, which is under the SLURM group in cfengine. Also, do not forget to make sure =/etc/ssh/shosts.equiv= properly reflects the values in =wn-list.conf=.
   * Also be sure to add the node to the DSH group for SLURM: <verbatim>
cd /srv/cfengine/DSHGROUPS
touch INPUT/groups/ALL/WN_SLURM/wn01.lcg.cscs.ch
make all
</verbatim>
   * Run YAIM: <verbatim>
/opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n WN -n GLEXEC_wn -n SLURM_utils -n SLURM_client
</verbatim>
   * Make sure the munge and slurm daemons are installed and running: <verbatim>
# service munge status
munged (pid 1651) is running...
# service slurm status
slurmd (pid 25551) is running...
</verbatim>
   * Bring up cvmfs: <verbatim>
service autofs start # autofs is chkconfig'd by cfengine, but we need to start it here as the machine hasn't rebooted
cvmfs_config probe
</verbatim>
   * If adding a new node, ensure the reservations are updated as detailed on the LRMS page.
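   * Once =slurmd= is up, it is worth cross-checking that the controller sees the new node. A minimal sketch, assuming the SLURM client tools from =emi-slurm-client= are available and using =wn01= as an example node name: <verbatim>
scontrol show node wn01    # the node should be listed with a sensible State once slurmd has registered
sinfo -N -l | grep wn01    # cross-check partition membership and node state
</verbatim>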
---+++ Configuration

<verbatim>
Apr 18 15:13 [root@wn101:~]# cat /var/spool/pbs/mom_priv/config
$logevent 255
# MOM interval in seconds. Should be <= server's job_stat_rate
$check_poll_time 90
# Interval of information update to server. Should be <= scheduling interval
$status_update_time 90
$timeout 30
# Moab takes care of killing jobs. This allows jobs to overrun walltime by some time
$ignwalltime true
$usecp arc01.lcg.cscs.ch:/home/nordugrid-atlas /home/nordugrid-atlas
$usecp arc02.lcg.cscs.ch:/home/nordugrid-atlas /home/nordugrid-atlas
$usecp ce01.lcg.cscs.ch:/home /home
$usecp ce02.lcg.cscs.ch:/home /home
# gLite 3.2 CREAM
$usecp cream01.lcg.cscs.ch:/opt/glite/var/cream_sandbox /lustre/scratch/CREAM_CE/cream01/cream_sandbox
# For EMI 1 (UMD 1.0.0 & UMD 1.1.0 releases) this line must be the following:
$usecp cream02.lcg.cscs.ch:/var/cream_sandbox /lustre/scratch/CREAM_CE/cream02/cream_sandbox
$tmpdir /tmpdir_pbs
# Torque's default connection timeout is 10ms instead of 10s... should be fixed in a later release, but for now:
# 4s works fine in production at Cyfronet (should be fine for Phoenix too)
$max_conn_timeout_micro_sec 4000000
# scale cputime and walltime to the average HEP-SPEC06 published
# Average HEP-SPEC06/core (C+D): 9.69
# PhaseC: 10  --> 1.03
# PhaseD: 8.2 --> 0.85
$cputmult 1.03
$wallmult 1.03
# in case Lustre is slow we want to prevent that the job gets requeued
$prologalarm 600
</verbatim>
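The =$cputmult= and =$wallmult= values above are, as the comments note, the per-core HEP-SPEC06 of the node generation divided by the published average per-core HEP-SPEC06 (9.69). A minimal sketch of the arithmetic, using the numbers from the comments: <verbatim>
# multiplier = per-core HEP-SPEC06 of the node generation / published average per-core HEP-SPEC06
printf "PhaseC: %.2f\n" $(echo "10.0/9.69" | bc -l)   # 1.03 -> the $cputmult / $wallmult set on this node
printf "PhaseD: %.2f\n" $(echo "8.2/9.69"  | bc -l)   # 0.85
</verbatim>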
---+++ Upgrade

   * gLite 3.2: Run <verbatim>
/usr/local/bin/yum-with-glite groupupdate --enablerepo=cscs glite-WN
</verbatim>
   * EMI 1: Run a simple update of =emi-wn= and =emi-glexec_wn=. Do not attempt to do it for all packages, as there is a newer version of libtorque in the CSCS repo that would otherwise be pulled in. <verbatim>
yum update --enablerepo=cscs --enablerepo=epel emi-wn emi-glexec_wn
</verbatim> %ICON{"warning"}% Make sure that the torque packages are taken from the CSCS repo!

---++ Monitoring

Instructions about monitoring the service.

---+++ Nagios

---+++ Ganglia

---+++ Self Sanity / revival?

---+++ Other?

---++ Manuals

   * [[https://twiki.cern.ch/twiki/bin/view/EGEE/GliteWN][glite-WN Service Reference Card]]
   * [[http://glite.cern.ch/glite-WN/][gLite Release Notes]]
   * [[http://www.adaptivecomputing.com/resources/docs/torque/index.php][Torque]]

---++ Issues

Information about issues found with this service, and how to deal with them.

---+++ Issue1

If, after the installation of a new node, jobs fail on that node and you get this message in the node's =/var/log/messages=: <verbatim>
Sep 7 17:37:37 ppwn04 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB dteam001@ppcream02.lcg.cscs.ch:/var/local_cream_sandbox/dteam/_DC_com_DC_quovadisglobal_DC_grid_DC_switch_DC_users_C_CH_O_ETH_Zuerich_CN_Miguel_Angel_Gila_Arrondo_dteam_Role_NULL_Capability_NULL_dteam001/proxy/005d0b069e96cba166a0f1caf82a7ad25cc7b77612719093722029 crpp2_788806913.proxy' failed with status=1, giving up after 4 attempts
Sep 7 17:37:37 ppwn04 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file dteam001@ppcream02.lcg.cscs.ch:/var/local_cream_sandbox/dteam/_DC_com_DC_quovadisglobal_DC_grid_DC_switch_DC_users_C_CH_O_ETH_Zuerich_CN_Miguel_Angel_Gila_Arrondo_dteam_Role_NULL_Capability_NULL_dteam001/proxy/005d0b069e96cba166a0f1caf82a7ad25cc7b77612719093722029 to crpp2_788806913.proxy
</verbatim> then make sure that the =ssh_known_hosts= file has been generated recently and contains the new keys by running the following command on the cfengine server: <verbatim>
/srv/cfengine/scripts/new_known_hosts
</verbatim>

---+++ Issue2: Intel sw RAID out of sync

Sometimes, when reinstalling a machine or replacing a hard disk, we need to activate the RAID set to bring it back to OK status: <verbatim>
# dmraid -s -d
DEBUG: _find_set: searching isw_bgidcceegc
DEBUG: _find_set: not found isw_bgidcceegc
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: not found isw_bgidcceegc_Volume0
DEBUG: _find_set: not found isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc
DEBUG: _find_set: found isw_bgidcceegc
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: found isw_bgidcceegc_Volume0
DEBUG: _find_set: found isw_bgidcceegc_Volume0
DEBUG: set status of set "isw_bgidcceegc_Volume0" to 8
*** Group superset isw_bgidcceegc
--> Active Subset
name   : isw_bgidcceegc_Volume0
size   : 927985664
stride : 128
type   : mirror
status : nosync
subsets: 0
devs   : 2
spares : 0
DEBUG: freeing devices of RAID set "isw_bgidcceegc_Volume0"
DEBUG: freeing device "isw_bgidcceegc_Volume0", path "/dev/sda"
DEBUG: freeing device "isw_bgidcceegc_Volume0", path "/dev/sdb"
DEBUG: freeing devices of RAID set "isw_bgidcceegc"
DEBUG: freeing device "isw_bgidcceegc", path "/dev/sda"
DEBUG: freeing device "isw_bgidcceegc", path "/dev/sdb"

# dmraid -ay

# dmraid -s
*** Group superset isw_bgidcceegc
--> Active Subset
name   : isw_bgidcceegc_Volume0
size   : 927985664
stride : 128
type   : mirror
status : ok
subsets: 0
devs   : 2
spares : 0
</verbatim>
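A quick way to re-check the RAID state afterwards is to filter the =dmraid -s= output for the name and status lines (a minimal sketch; the =isw_*= set name varies per machine). The status should read =ok=; if it still shows =nosync=, activate the set again with =dmraid -ay= as above. <verbatim>
# Show only the set name and sync status of the Intel software RAID
dmraid -s | grep -E 'name|status'
</verbatim>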
---++ ServiceCardForm

| *Service name* | WN |
| *Machines this service is installed in* | wn[01-79] |
| *Is Grid service* | Yes |
| *Depends on the following services* | cvmfs, gpfs, nfs, lrms |
| *Expert* | |