<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->
---+ Service Card for Worker Node

%TOC%

---++ Definition

At the time of writing these notes, all Worker Nodes are configured as follows:
   * Scientific Linux 6.5 x86_64
   * In-kernel IB stack
   * EMI-3 middleware
   * GPFS 3.5 client

---++ Operations

---+++ Client tools

---+++ Testing

See ServiceLRMS for how to test the batch system.

---+++ Start/stop procedures

---+++ Failover check

---+++ Checking logs

   * If the system has glexec installed, check =/var/log/glexec/lcas_lcmaps.log=
   * SLURM logs are in =/var/log/slurmd.log=

---++ Set up

---+++ Dependencies (other services, mount points, ...)

   * The WN depends on NFS for the experiment software area:
      * =nas.lcg.cscs.ch:/ifs/LCG/shared/exp_soft_arc/atlas/= on =/experiment_software/atlas=
   * The scratch file system from GPFS also has to be mounted and linked:
      * =/home/nordugrid-atlas --> /gpfs2/gridhome/nordugrid-atlas-slurm=
      * =/home/nordugrid-atlas-slurm --> /gpfs2/gridhome/nordugrid-atlas-slurm=
      * =/home/wlcg --> /gpfs2/gridhome/wlcg=
      * =/tmpdir_slurm --> /gpfs2/scratch/tmpdir_slurm=

Refer to HOWTORecreateSCRATCH for more information on this. A quick way to verify these mounts and links is given in the sanity-check sketch at the end of the Installation section below.

---+++ Installation

   * PXE boot the system: <verbatim># ireset wnXX
# ipxe wnXX</verbatim>
   * Once the node has finished the OS installation, and if CFEngine hasn't done it yet, install the following packages: <verbatim># yum install emi-slurm-client emi-wn emi-glexec_wn globus-proxy-utils globus-gass-copy-progs --enablerepo=epel</verbatim>
   * If adding a new worker node, ensure that its FQDN is listed in =/opt/cscs/siteinfo/wn-list.conf=, which is managed under the SLURM group in CFEngine. Also make sure =/etc/ssh/shosts.equiv= properly reflects the values in =wn-list.conf=.
   * Also be sure to add the node to the DSH group for SLURM: <verbatim>cd /srv/cfengine/DSHGROUPS
touch INPUT/groups/ALL/WN_SLURM/wn01.lcg.cscs.ch
make all</verbatim>
   * Run YAIM: <verbatim>/opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n WN -n GLEXEC_wn -n SLURM_utils -n SLURM_client</verbatim>
   * Make sure the munge and slurm daemons are running: <verbatim># service munge status
munged (pid 1651) is running...
# service slurm status
slurmd (pid 25551) is running...</verbatim>
   * Bring up CVMFS: <verbatim>service autofs start   # autofs is chkconfig'd by CFEngine, but it must be started by hand as the machine has not rebooted
cvmfs_config probe</verbatim>
   * If adding a new node, ensure the reservations are updated as detailed on the LRMS page.
   * Additionally, before putting the node online, you can run the node health checker script and make sure its exit status is 0: <verbatim># /etc/slurm/nodeHealthCheck.sh
# echo $?
0</verbatim>
   * If everything is OK, set the node online using =scontrol=: <verbatim>scontrol update nodename=$(hostname -s) state=resume</verbatim>
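The individual checks above can be bundled into a quick pre-resume sanity check. The following is only a minimal sketch and is not part of the standard procedure or of CFEngine: the script name =checkWN.sh= is hypothetical, and it simply re-runs the checks already listed in the Dependencies and Installation sections (experiment software mount, scratch symlinks, munge/slurmd daemons, CVMFS probe).
<verbatim>#!/bin/bash
# checkWN.sh -- hypothetical helper, not deployed by CFEngine.
# Re-checks the items from the Dependencies and Installation sections
# before the node is resumed with scontrol.

rc=0

# Experiment software area must be NFS-mounted
mountpoint -q /experiment_software/atlas || { echo "experiment software area not mounted"; rc=1; }

# Scratch file system links from the Dependencies section
for link in /home/nordugrid-atlas /home/nordugrid-atlas-slurm /home/wlcg /tmpdir_slurm; do
    [ -L "$link" ] && [ -d "$link" ] || { echo "missing or dangling link: $link"; rc=1; }
done

# munge and slurmd must be running
service munge status >/dev/null || { echo "munged not running"; rc=1; }
service slurm status >/dev/null || { echo "slurmd not running"; rc=1; }

# CVMFS must answer a probe
cvmfs_config probe >/dev/null || { echo "cvmfs probe failed"; rc=1; }

exit $rc</verbatim>
If the script prints nothing and exits 0, the node can be resumed with =scontrol= as shown above.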
---+++ Configuration

   * LVM is used on all nodes to define the partitioning:
      * On nodes with two hard disks, one is configured for the OS and the other for the CVMFS cache ( =/cvmfs_local= ): <verbatim># pvdisplay |grep -A 1 'PV Name'
  PV Name               /dev/sdb1
  VG Name               vg_cvmfs
--
  PV Name               /dev/sda2
  VG Name               vg_root
# lvdisplay |grep 'LV Path' -A 2
  LV Path               /dev/vg_cvmfs/lv_cvmfs
  LV Name               lv_cvmfs
  VG Name               vg_cvmfs
--
  LV Path               /dev/vg_root/lv_swap
  LV Name               lv_swap
  VG Name               vg_root
--
  LV Path               /dev/vg_root/lv_root
  LV Name               lv_root
  VG Name               vg_root</verbatim>
      * On nodes with one hard disk, a similar approach is followed, but with only one VG: <verbatim>TODO</verbatim>

---+++ Upgrade

---++ Monitoring

Instructions about monitoring the service.

---+++ Pakiti

Pakiti provides a monitoring and notification mechanism to check the patching status of systems.
   * *Installation*: installation is only needed on the WNs here at CSCS-LCG2. Download the client from https://pakiti.egi.eu/client.php?site=CSCS-LCG2
   * *Host monitoring*: the patching status of all our WNs can be checked at https://pakiti.egi.eu/hosts.php (note: to access these pages your certificate needs the Security Operator role).

---+++ Nagios

---+++ Ganglia

---+++ Self Sanity / revival?

---+++ Other?

---++ Manuals

---++ Issues

---+++ Issue 1: Intel SW RAID out of sync (OLD, kept for reference)

Sometimes, when reinstalling a machine or replacing a hard disk, the RAID has to be activated to bring it back to OK status: <verbatim># dmraid -s -d
DEBUG: _find_set: searching isw_bgidcceegc
DEBUG: _find_set: not found isw_bgidcceegc
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: not found isw_bgidcceegc_Volume0
DEBUG: _find_set: not found isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc
DEBUG: _find_set: found isw_bgidcceegc
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: searching isw_bgidcceegc_Volume0
DEBUG: _find_set: found isw_bgidcceegc_Volume0
DEBUG: _find_set: found isw_bgidcceegc_Volume0
DEBUG: set status of set "isw_bgidcceegc_Volume0" to 8
*** Group superset isw_bgidcceegc
--> Active Subset
name   : isw_bgidcceegc_Volume0
size   : 927985664
stride : 128
type   : mirror
status : nosync
subsets: 0
devs   : 2
spares : 0
DEBUG: freeing devices of RAID set "isw_bgidcceegc_Volume0"
DEBUG: freeing device "isw_bgidcceegc_Volume0", path "/dev/sda"
DEBUG: freeing device "isw_bgidcceegc_Volume0", path "/dev/sdb"
DEBUG: freeing devices of RAID set "isw_bgidcceegc"
DEBUG: freeing device "isw_bgidcceegc", path "/dev/sda"
DEBUG: freeing device "isw_bgidcceegc", path "/dev/sdb"
# dmraid -ay
# dmraid -s
*** Group superset isw_bgidcceegc
--> Active Subset
name   : isw_bgidcceegc_Volume0
size   : 927985664
stride : 128
type   : mirror
status : ok
subsets: 0
devs   : 2
spares : 0</verbatim>
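The =nosync= state above is easy to miss, so a periodic check could be added, for example alongside the node health check mentioned in the Installation section. The following is only a minimal sketch, not part of the current setup: the script name =checkRAID.sh= is hypothetical, and it merely parses the =status= field of =dmraid -s= output as shown above.
<verbatim>#!/bin/bash
# checkRAID.sh -- hypothetical helper, not deployed by CFEngine.
# Flags any dmraid set whose status is not "ok" (e.g. "nosync" as in the
# example above). Only relevant on nodes that still use the Intel SW RAID.

if ! command -v dmraid >/dev/null 2>&1; then
    exit 0   # dmraid not installed on this node, nothing to check
fi

# Collect the value of every "status :" line that is not "ok"
bad=$(dmraid -s 2>/dev/null | awk -F: '/status/ {gsub(/ /,"",$2); if ($2 != "ok") print $2}')

if [ -n "$bad" ]; then
    echo "dmraid set not in OK status: $bad"
    exit 1
fi
exit 0</verbatim>
If it reports a set out of sync, reactivate the RAID with =dmraid -ay= as shown in the issue above.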
| *ServiceCardForm* ||
| Service name | WN |
| Machines this service is installed in | wn[01-48,50,52-79] |
| Is Grid service | Yes |
| Depends on the following services | cvmfs, gpfs2, nfs, lrms |
| Expert | Gianni Ricciardi |
| CM | CfEngine |
| Provisioning | PuppetForeman |