Service Configuration
Service Nodes
General Instructions
- install the OS: XenSampleImageReplication
- check kernel version, should be kernel-xen ≥ 2.6.18-194.17.1
- create cfengine key in
cfengine:/srv/cfengine/ppkeys
- copy the keys to
nfs:/export/kickstarts/private/cfengine/
-
scp /srv/cfengine/ppkeys/root-IPADRESS* nfs:/export/kickstarts/private/cfengine/
- copy
newmachine
script from xen03 and run it
- NOTE: This step takes a long time, wait until it's done and the machine is automatically rebooted.
-
scp xen03:/nfs/kickstarts/newmachine /root/ && /root/newmachine
- copy ssh keys to cfengine server:
-
cd /srv/cfengine/private/ssh/
-
mkdir HOSTNAME
-
ls se30|xargs -n1 --replace scp HOSTNAME:/etc/ssh/{} HOSTNAME/
- check in ssh key to svn
-
asvn add HOSTNAME
-
asvn commit HOSTNAME --username poettl -m'New SSH keys for host HOSTNAME'
- create new known_hosts file
-
/srv/cfengine/scripts/new_known_hosts
- run
/opt/cscs/sbin/install-glite
to configure gLite middleware (or do it by hand step by step...)
-
cfagent -qv
- reboot
Service Specific Notes
Worker Nodes
Worker Nodes
[PP] WNs
Once all the previous steps have been done, Lustre has to be loaded to be able to sucessfuly run the last part of
/opt/cscs/sbin/install-glite
. In order to do that, you must make sure that the VM guest has two NICs, one public IP and the 10.10 IP. In the XEN host:
Apr 06 16:00 [root@xen17:xen]# cat /etc/xen/ppwn02
name = "ppwn02"
vcpus = 2
memory = 4096
disk = ['phy:/dev/vg_root/ppwn02_root,xvda,w']
#vif = ['mac=00:16:3E:64:00:50,bridge=xenbr0','mac=00:16:10:64:00:50,bridge=xenbr2']
vif = ['mac=00:16:3E:67:00:02,bridge=xenbr1','mac=00:16:10:67:00:02,bridge=xenbr2']
bootloader = "/usr/bin/pygrub"
on_reboot = 'restart'
on_crash = 'destroy'
In the XEN guest, prepare the network:
Apr 06 16:02 [root@ppwn02:~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+
DEVICE=eth1
BOOTPROTO=static
IPADDR=10.10.64.202
NETMASK=255.255.252.0
IPV6INIT=no
IPV6_AUTOCONF=no
ONBOOT=yes
TYPE=Ethernet
Apr 06 16:02 [root@ppwn02:~]# ifup eth1
Apr 06 16:02 [root@ppwn02:~]# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:16:10:67:00:02
inet addr:10.10.64.202 Bcast:10.10.67.255 Mask:255.255.252.0
inet6 addr: fe80::216:10ff:fe67:2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:18531 errors:0 dropped:0 overruns:0 frame:0
TX packets:1134 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:21221791 (20.2 MiB) TX bytes:236364 (230.8 KiB)
Apr 06 16:04 [root@ppwn02:~]# ping -c 4 10.10.64.201
PING 10.10.64.201 (10.10.64.201) 56(84) bytes of data.
64 bytes from 10.10.64.201: icmp_seq=1 ttl=64 time=0.112 ms
64 bytes from 10.10.64.201: icmp_seq=2 ttl=64 time=0.082 ms
64 bytes from 10.10.64.201: icmp_seq=3 ttl=64 time=0.081 ms
64 bytes from 10.10.64.201: icmp_seq=4 ttl=64 time=0.088 ms
--- 10.10.64.201 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2997ms
rtt min/avg/max/mdev = 0.081/0.090/0.112/0.017 ms
Now you need to install lustre RPMs for the running kernel and start it up. In the XEN Guest:
Apr 06 16:04 [root@ppwn02:~]# mount xen11:/nfs /media
Apr 06 16:06 [root@ppwn02:~]# uname -r
2.6.18-238.5.1.el5xen
Apr 06 16:06 [root@ppwn02:~]# rpm -ivh /media/rpms/xen_guest_lustre_1.8.4_238/lustre-*
Preparing... ########################################### [100%]
package lustre-modules-1.8.4-2.6.18_238.5.1.el5xen_201104061032.x86_64 is already installed
package lustre-1.8.4-2.6.18_238.5.1.el5xen_201104061032.x86_64 is already installed
Apr 06 16:06 [root@ppwn02:~]# mkdir -p /lustre/scratch
Apr 06 16:07 [root@ppwn02:~]# service lustre start
Apr 06 16:07 [root@ppwn02:~]# lfs df -h
UUID bytes Used Available Use% Mounted on
scratch-MDT0000_UUID 1.4T 1.8G 1.3T 0% /lustre/scratch[MDT:0]
scratch-OST0000_UUID 3.6T 174.3G 3.2T 4% /lustre/scratch[OST:0]
scratch-OST0001_UUID 3.6T 175.4G 3.2T 4% /lustre/scratch[OST:1]
scratch-OST0002_UUID 3.6T 181.0G 3.2T 4% /lustre/scratch[OST:2]
[...]
At this point you can run the last part of the archive and it will (hopefully) work!:
Apr 06 16:07 [root@ppwn02:~]# umount /media
Apr 06 16:07 [root@ppwn02:~]# /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n WN -n TORQUE_client
CREAM-CE
References
[PP] CREAM-CEs
In order for CREAM-CE to work well lustre has to be mounted. So the same steps executed before have to be followed.
Problem when installing tomcat rpms
Problem description: When running
rpm -qa | grep tomcat5
you don't see the tomcat5 rpm installed.
Apr 12 10:34 [root@ppcream02:~]# rpm -qa |grep tomcat5
tomcat5-server-lib-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-jasper-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-jsp-2.0-api-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-common-lib-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-servlet-2.4-api-5.5.23-0jpp.17.el5_6.x86_64
And when you try to install it you get some errors:
Loaded plugins: kernel-module
Excluding Packages in global exclude list
Finished
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package tomcat5.x86_64 0:5.5.23-0jpp.17.el5_6 set to be updated
--> Finished Dependency Resolution
Beginning Kernel Module Plugin
Finished Kernel Module Plugin
Dependencies Resolved
==================================================================================================================================================================================
Package Arch Version Repository Size
==================================================================================================================================================================================
Installing:
tomcat5 x86_64 5.5.23-0jpp.17.el5_6 sl-security 362 k
Transaction Summary
==================================================================================================================================================================================
Install 1 Package(s)
Update 0 Package(s)
Remove 0 Package(s)
Total size: 362 k
Is this ok [y/N]: y
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : tomcat5 1/1
Error unpacking rpm package tomcat5-5.5.23-0jpp.17.el5_6.x86_64
warning: /etc/tomcat5/server.xml created as /etc/tomcat5/server.xml.rpmnew
warning: /etc/tomcat5/tomcat5.conf created as /etc/tomcat5/tomcat5.conf.rpmnew
error: unpacking of archive failed on file /usr/share/tomcat5/webapps: cpio: rename
And/or you have broken links in
/usr/share/tomcat5
and/or
/var/lib/tomcat5
Solution: You have to completely erase all files within
/usr/share/tomcat5
and
/var/lib/tomcat5
and run yum and yaim again:
Apr 12 10:37 [root@ppcream02:~]# yum install tomcat5-5.5.23-0jpp.17.el5_6.x86_64 # Replace the tomcat5 version with the relevant one!!!!
Apr 12 10:37 [root@ppcream02:~]# rpm -qa |grep tomcat
tomcat5-server-lib-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-jasper-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-jsp-2.0-api-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-common-lib-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-servlet-2.4-api-5.5.23-0jpp.17.el5_6.x86_64
Apr 12 10:38 [root@ppcream02:~]# /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n creamCE -n TORQUE_utils
Problem when submitting jobs
Problem description: When submitting a job from the UI you get the following message
Apr 12 10:33 [pablof@ui64:test_ppcream01]$ glite-ce-job-submit -a -r ppcream02/cream-pbs-atlas $PWD/jobs/hostname.jdl
2011-04-12 10:46:40,635 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection refused]
And then you look into
/var/lib/tomcat5/webapps/
and you only see this
Apr 12 10:46 [root@ppcream02:~]# ls -lh /var/lib/tomcat5/webapps/
total 4.4M
-rw-r--r-- 1 root root 4.4M Apr 12 10:45 ce-cream.war
Note also check that the time on the CREAM host is correct.
Solution: You need to copy the directory
/var/lib/tomcat5/webapps/*
from another running instance of the cream-ce
Apr 12 10:48 [root@ppcream02:~]# scp -r ppcream01:/usr/share/tomcat5/webapps/ce-crea* /usr/share/tomcat5/webapps/
pr 12 10:49 [root@ppcream02:~]# ls -lh /var/lib/tomcat5/webapps/
total 4.4M
drwxr-xr-x 5 root root 4.0K Apr 12 10:49 ce-cream
-rw-r--r-- 1 root root 4.4M Apr 12 10:49 ce-cream.war
Apr 12 10:49 [root@ppcream02:~]# service gLite restart
STOPPING SERVICES
*** glite-ce-blahparser:
Shutting down BNotifier: [FAILED]
Shutting down BUpdaterPBS: [FAILED]
*** glite-lb-locallogger:
Stopping glite-lb-logd ... done
Stopping glite-lb-interlogd ... done
*** tomcat5:
Stopping tomcat5: [ OK ]
STARTING SERVICES
*** tomcat5:
Starting tomcat5: [ OK ]
*** glite-lb-locallogger:
Starting glite-lb-logd ...This is LocalLogger, part of Workload Management System in EU DataGrid & EGEE.
[20453] Initializing...
[20453] Parse messages for correctness... [yes]
[20453] Send messages also to inter-logger... [yes]
[20453] Messages will be stored with the filename prefix "/var/glite/log/dglogd.log".
[20453] Server running with certificate: /DC=com/DC=quovadisglobal/DC=grid/DC=switch/DC=hosts/C=CH/ST=Zuerich/L=Zuerich/O=ETH Zuerich/CN=ppcream02.lcg.cscs.ch
[20453] Listening on port 9002
[20453] Running as daemon... [yes]
done
Starting glite-lb-interlogd ... done
*** glite-ce-blahparser:
Starting BNotifier: [ OK ]
Starting BUpdaterPBS: [ OK ]
lrms
Compile Torque 2.5.x with HA and create RPM's
- download newest version of torque
-
./configure --prefix=/usr --with-server-home=/var/spool/pbs --with-default-server=lrms02.lcg.cscs.ch,lrms01.lcg.cscs.ch --enable-high-availability
-
make rpm
- copy rpms to repo
-
scp /usr/src/redhat/RPMS/x86_64/torque{,-server,-mom,-client}-2.5.2-1cri.x86_64.rpm nfs01:/export/packages/repo
- on nfs01:
cd /export/packages/repo; createrepo .
lcg-CE
After the reboot the gridmap files have to be created. Either wait for the cron job running or run:
-
/opt/edg/sbin/edg-mkgridmap --output=/etc/grid-security/dn-grid-mapfile --safe
-
cp /etc/grid-security/dn-grid-mapfile /etc/grid-security/grid-mapfile.tmp; cat /etc/grid-security/voms-grid-mapfile >> /etc/grid-security/grid-mapfile.tmp; mv /etc/grid-security/grid-mapfile.tmp /etc/grid-security/grid-mapfile
BDII
Site BDII
For a detailed log of the last installation refer to:
https://webrt.cscs.ch/Ticket/Display.html?id=7962 , In short:
BDII_REGIONS="SE CE"
BDII_CE_URL="ldap://ce01.lcg.cscs.ch:2170/mds-vo-name=resource,o=grid"
BDII_SE_URL="ldap://storage01.lcg.cscs.ch:2170/mds-vo-name=resource,o=grid"
- Run the Yaim conf tool: /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site
- wget/configure/make/install LBCD, from http://archives.eyrie.org/software/system/lbcd-3.3.0.tar.gz
- Check iptables
- service lbcd start # that's it, it should appear in the DNS list, IFF DT has included it in the master LBCD node
[PP] Top BDII
Make sure that you have run cfengine and that the following files are installed in your system:
- /etc/glite/glite-info-update-endpoints.conf: it specifies which extra sites must be queried (in our case, preproduction bdii). Should look like this:
PPCSCS-LCG2 ldap://ppbdii01.lcg.cscs.ch:2170/mds-vo-name=ppcscs-lcg2,o=grid
- /opt/cscs/etc/glite-info-update-extra-endpoints: tells bdii which file has the configuration for extra sites. Should look like this:
[configuration]
EGI = True
OSG = True
manual = True
manual_file = /opt/cscs/etc/glite-info-update-extra-endpoints
output_file = /opt/glite/etc/gip/top-urls.conf
cache_dir = /var/cache/glite/glite-info-update-endpoints
ui64
You need to do
yum groupinstall glite-UI
/opt/glite/yaim/bin/yaim -c -s /misc/siteinfo/site-info.def -n UI
nagios
References
--
PeterOettl - 2010-03-01