Service Configuration

Service Nodes

General Instructions

  • install the OS: XenSampleImageReplication
  • check kernel version, should be kernel-xen ≥ 2.6.18-194.17.1
    • yum upgrade kernel-xen
  • create cfengine key in cfengine:/srv/cfengine/ppkeys
    • cfkey -f root-IPADDRESS
  • copy the keys to nfs:/export/kickstarts/private/cfengine/
    • scp /srv/cfengine/ppkeys/root-IPADDRESS* nfs:/export/kickstarts/private/cfengine/
  • copy newmachine script from xen03 and run it
    • NOTE: This step takes a long time; wait until it is done and the machine has been rebooted automatically.
    • scp xen03:/nfs/kickstarts/newmachine /root/ && /root/newmachine
  • copy ssh keys to cfengine server:
    • cd /srv/cfengine/private/ssh/
    • mkdir HOSTNAME
    • ls se30|xargs -n1 --replace scp HOSTNAME:/etc/ssh/{} HOSTNAME/
  • check in ssh key to svn
    • asvn add HOSTNAME
    • asvn commit HOSTNAME --username poettl -m'New SSH keys for host HOSTNAME'
  • create new known_hosts file
    • /srv/cfengine/scripts/new_known_hosts
  • run /opt/cscs/sbin/install-glite to configure gLite middleware (or do it by hand step by step...)
  • cfagent -qv
  • reboot
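
The kernel version check near the top of the list above can be scripted. The following is a minimal sketch, not part of the original procedure: version_ge is a hypothetical helper name, and it compares only the numeric fields of the version string (a suffix such as "el5xen" is treated as 0).

```shell
# version_ge A B: succeed if version A is greater than or equal to version B.
# Fields are split on '.' and '-' and compared numerically, left to right.
version_ge() {
    a=$(printf '%s' "$1" | tr '.-' '  ')
    b=$(printf '%s' "$2" | tr '.-' '  ')
    set -- $a
    for want in $b; do
        have=${1:-0}
        [ $# -gt 0 ] && shift
        # Keep only the leading digits of each field.
        have=${have%%[!0-9]*}; want=${want%%[!0-9]*}
        have=${have:-0}; want=${want:-0}
        [ "$have" -gt "$want" ] && return 0
        [ "$have" -lt "$want" ] && return 1
    done
    return 0
}

# Example (on the new node):
#   version_ge "$(uname -r)" 2.6.18-194.17.1 || yum upgrade kernel-xen
```

If the running kernel is too old, the upgrade only takes effect after the next reboot.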

Service Specific Notes

Worker Nodes

[PP] WNs

Once all the previous steps have been done, Lustre has to be loaded before the last part of /opt/cscs/sbin/install-glite can run successfully. For that, the VM guest must have two NICs: one with the public IP and one with the 10.10 IP. In the XEN host:

Apr 06 16:00 [root@xen17:xen]# cat /etc/xen/ppwn02 
name = "ppwn02"

vcpus = 2
memory = 4096
disk = ['phy:/dev/vg_root/ppwn02_root,xvda,w']
#vif = ['mac=00:16:3E:64:00:50,bridge=xenbr0','mac=00:16:10:64:00:50,bridge=xenbr2']
vif = ['mac=00:16:3E:67:00:02,bridge=xenbr1','mac=00:16:10:67:00:02,bridge=xenbr2']

bootloader = "/usr/bin/pygrub"
on_reboot = 'restart'
on_crash = 'destroy'
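
To confirm that a guest definition really has two interfaces, the active vif line can be counted. This is a small sketch under the assumption that each interface entry carries exactly one bridge= setting, as in the file above; count_vifs is a hypothetical helper name.

```shell
# count_vifs FILE: print how many interfaces the active (uncommented)
# vif line of a Xen guest config defines, counting one per bridge= entry.
count_vifs() {
    grep '^vif' "$1" | grep -o 'bridge=' | wc -l | tr -d ' '
}

# Example (on the XEN host):
#   count_vifs /etc/xen/ppwn02
```

Lines starting with "#vif" are ignored, so an old commented-out definition does not distort the count.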

In the XEN guest, prepare the network:

Apr 06 16:02 [root@ppwn02:~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+
Apr 06 16:02 [root@ppwn02:~]# ifup eth1
Apr 06 16:02 [root@ppwn02:~]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:16:10:67:00:02  
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::216:10ff:fe67:2/64 Scope:Link
          RX packets:18531 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1134 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:21221791 (20.2 MiB)  TX bytes:236364 (230.8 KiB)

Apr 06 16:04 [root@ppwn02:~]# ping -c 4 
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=0.112 ms
64 bytes from icmp_seq=2 ttl=64 time=0.082 ms
64 bytes from icmp_seq=3 ttl=64 time=0.081 ms
64 bytes from icmp_seq=4 ttl=64 time=0.088 ms

--- ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2997ms
rtt min/avg/max/mdev = 0.081/0.090/0.112/0.017 ms

Now install the Lustre RPMs that match the running kernel and start Lustre. In the XEN guest:

Apr 06 16:04 [root@ppwn02:~]# mount xen11:/nfs /media
Apr 06 16:06 [root@ppwn02:~]# uname -r
Apr 06 16:06 [root@ppwn02:~]#  rpm -ivh /media/rpms/xen_guest_lustre_1.8.4_238/lustre-*
Preparing...                ########################################### [100%]
        package lustre-modules-1.8.4-2.6.18_238.5.1.el5xen_201104061032.x86_64 is already installed
        package lustre-1.8.4-2.6.18_238.5.1.el5xen_201104061032.x86_64 is already installed
Apr 06 16:06 [root@ppwn02:~]# mkdir -p /lustre/scratch
Apr 06 16:07 [root@ppwn02:~]# service lustre start
Apr 06 16:07 [root@ppwn02:~]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
scratch-MDT0000_UUID        1.4T        1.8G        1.3T   0% /lustre/scratch[MDT:0]
scratch-OST0000_UUID        3.6T      174.3G        3.2T   4% /lustre/scratch[OST:0]
scratch-OST0001_UUID        3.6T      175.4G        3.2T   4% /lustre/scratch[OST:1]
scratch-OST0002_UUID        3.6T      181.0G        3.2T   4% /lustre/scratch[OST:2]
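
Since the remaining configuration depends on Lustre being mounted, it may help to test for the mount explicitly before going on. The sketch below reads the mounts table; is_mounted is a hypothetical helper name, and the optional third argument exists only so the function can be tried against a sample file.

```shell
# is_mounted MOUNTPOINT FSTYPE [MOUNTS_FILE]
# Succeed if MOUNTPOINT is mounted with filesystem type FSTYPE.
# MOUNTS_FILE defaults to /proc/mounts.
is_mounted() {
    awk -v mp="$1" -v fs="$2" \
        '$2 == mp && $3 == fs { found = 1 } END { exit !found }' \
        "${3:-/proc/mounts}"
}

# Example (in the XEN guest):
#   is_mounted /lustre/scratch lustre || echo "Lustre is not mounted" >&2
```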

At this point you can run the last part of the installation and it will (hopefully) work:

Apr 06 16:07 [root@ppwn02:~]# umount /media 
Apr 06 16:07 [root@ppwn02:~]# /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n WN -n TORQUE_client




For CREAM-CE to work properly, Lustre has to be mounted, so follow the same steps as described above for the worker nodes.

Problem when installing tomcat rpms

Problem description: When running rpm -qa | grep tomcat5 you don't see the tomcat5 rpm installed.

Apr 12 10:34 [root@ppcream02:~]# rpm -qa |grep tomcat5

And when you try to install it you get some errors:

Loaded plugins: kernel-module
Excluding Packages in global exclude list
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package tomcat5.x86_64 0:5.5.23-0jpp.17.el5_6 set to be updated
--> Finished Dependency Resolution
Beginning Kernel Module Plugin
Finished Kernel Module Plugin

Dependencies Resolved

 Package                               Arch                                 Version                                               Repository                                 Size
 tomcat5                               x86_64                               5.5.23-0jpp.17.el5_6                                  sl-security                               362 k

Transaction Summary
Install      1 Package(s)         
Update       0 Package(s)         
Remove       0 Package(s)         

Total size: 362 k
Is this ok [y/N]: y
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing     : tomcat5                                                                                                                                                    1/1 
Error unpacking rpm package tomcat5-5.5.23-0jpp.17.el5_6.x86_64
warning: /etc/tomcat5/server.xml created as /etc/tomcat5/server.xml.rpmnew
warning: /etc/tomcat5/tomcat5.conf created as /etc/tomcat5/tomcat5.conf.rpmnew
error: unpacking of archive failed on file /usr/share/tomcat5/webapps: cpio: rename

And/or you have broken links in /usr/share/tomcat5 and/or /var/lib/tomcat5
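
To see whether broken links are present, the two directories can be scanned for symlinks whose target no longer exists. This is a sketch only; broken_links is a hypothetical helper name.

```shell
# broken_links DIR...: print every symbolic link under the given
# directories whose target does not exist.
broken_links() {
    find "$@" -type l ! -exec test -e {} \; -print 2>/dev/null
}

# Example:
#   broken_links /usr/share/tomcat5 /var/lib/tomcat5
```

An empty result means no dangling links were found in the directories given.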

Solution: You have to completely erase all files within /usr/share/tomcat5 and /var/lib/tomcat5 and run yum and yaim again:

Apr 12 10:37 [root@ppcream02:~]# yum install tomcat5-5.5.23-0jpp.17.el5_6.x86_64 # Replace the tomcat5 version with the relevant one!!!!
Apr 12 10:37 [root@ppcream02:~]# rpm -qa |grep tomcat

Apr 12 10:38 [root@ppcream02:~]# /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n creamCE -n TORQUE_utils

Problem when submitting jobs

Problem description: When submitting a job from the UI you get the following message:

Apr 12 10:33 [pablof@ui64:test_ppcream01]$ glite-ce-job-submit -a -r ppcream02/cream-pbs-atlas $PWD/jobs/hostname.jdl
2011-04-12 10:46:40,635 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection refused]

If you then look into /var/lib/tomcat5/webapps/, you only see this:

Apr 12 10:46 [root@ppcream02:~]# ls -lh /var/lib/tomcat5/webapps/
total 4.4M
-rw-r--r-- 1 root root 4.4M Apr 12 10:45 ce-cream.war

Note: also check that the time on the CREAM host is correct.

Solution: Copy the contents of /var/lib/tomcat5/webapps/ from another running instance of the CREAM-CE:

Apr 12 10:48 [root@ppcream02:~]#  scp -r ppcream01:/usr/share/tomcat5/webapps/ce-crea* /usr/share/tomcat5/webapps/
Apr 12 10:49 [root@ppcream02:~]# ls -lh /var/lib/tomcat5/webapps/
total 4.4M
drwxr-xr-x 5 root root 4.0K Apr 12 10:49 ce-cream
-rw-r--r-- 1 root root 4.4M Apr 12 10:49 ce-cream.war

Apr 12 10:49 [root@ppcream02:~]# service gLite restart
*** glite-ce-blahparser:
Shutting down BNotifier:                                   [FAILED]
Shutting down BUpdaterPBS:                                 [FAILED]

*** glite-lb-locallogger:
Stopping glite-lb-logd ... done
Stopping glite-lb-interlogd ... done

*** tomcat5:
Stopping tomcat5:                                          [  OK  ]
*** tomcat5:
Starting tomcat5:                                          [  OK  ]

*** glite-lb-locallogger:
Starting glite-lb-logd ...This is LocalLogger, part of Workload Management System in EU DataGrid & EGEE.
[20453] Initializing...
[20453] Parse messages for correctness... [yes]
[20453] Send messages also to inter-logger... [yes]
[20453] Messages will be stored with the filename prefix "/var/glite/log/dglogd.log".
[20453] Server running with certificate: /DC=com/DC=quovadisglobal/DC=grid/DC=switch/DC=hosts/C=CH/ST=Zuerich/L=Zuerich/O=ETH Zuerich/
[20453] Listening on port 9002
[20453] Running as daemon... [yes]
Starting glite-lb-interlogd ... done

*** glite-ce-blahparser:
Starting BNotifier:                                        [  OK  ]
Starting BUpdaterPBS:                                      [  OK  ]


Compile Torque 2.5.x with HA and create RPMs

  • download newest version of torque
  • ./configure --prefix=/usr --with-server-home=/var/spool/pbs --enable-high-availability
  • make rpm
  • copy rpms to repo
    • scp /usr/src/redhat/RPMS/x86_64/torque{,-server,-mom,-client}-2.5.2-1cri.x86_64.rpm nfs01:/export/packages/repo
    • on nfs01: cd /export/packages/repo; createrepo .
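
The scp line above relies on bash brace expansion to name four packages at once. For shells without brace expansion, or to check which files are expected before copying, the same list can be produced explicitly. This is a sketch; torque_rpms is a hypothetical helper name and the version string is passed in by the caller.

```shell
# torque_rpms VERSION-RELEASE: print the four RPM file names that
# torque{,-server,-mom,-client}-VERSION-RELEASE.x86_64.rpm expands to.
torque_rpms() {
    for pkg in torque torque-server torque-mom torque-client; do
        printf '%s-%s.x86_64.rpm\n' "$pkg" "$1"
    done
}

# Example:
#   cd /usr/src/redhat/RPMS/x86_64 && ls -l $(torque_rpms 2.5.2-1cri)
```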


After the reboot the gridmap files have to be created. Either wait for the cron job to run, or run:

  • /opt/edg/sbin/edg-mkgridmap --output=/etc/grid-security/dn-grid-mapfile --safe
  • cp /etc/grid-security/dn-grid-mapfile /etc/grid-security/grid-mapfile.tmp; cat /etc/grid-security/voms-grid-mapfile >> /etc/grid-security/grid-mapfile.tmp; mv /etc/grid-security/grid-mapfile.tmp /etc/grid-security/grid-mapfile
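
The second command builds the merged file under a temporary name and only then moves it into place, so readers of grid-mapfile never see a half-written file. The same idea as a reusable sketch (merge_gridmaps is a hypothetical helper name, not an existing tool):

```shell
# merge_gridmaps DN_FILE VOMS_FILE OUT_FILE
# Concatenate the two map files into OUT_FILE.tmp, then rename it over
# OUT_FILE. The rename happens only if the concatenation succeeded.
merge_gridmaps() {
    dn=$1; voms=$2; out=$3
    tmp="$out.tmp"
    cat "$dn" "$voms" > "$tmp" && mv "$tmp" "$out"
}

# Example:
#   merge_gridmaps /etc/grid-security/dn-grid-mapfile \
#                  /etc/grid-security/voms-grid-mapfile \
#                  /etc/grid-security/grid-mapfile
```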



For a detailed log of the last installation refer to: , In short:

  • Run the Yaim conf tool: /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site
  • wget/configure/make/install LBCD, from
  • Check iptables
  • service lbcd start # that's it, it should appear in the DNS list, IFF DT has included it in the master LBCD node


Make sure that you have run cfengine and that the following files are installed in your system:

  • /opt/cscs/etc/glite-info-update-extra-endpoints: specifies which extra sites must be queried (in our case, the preproduction BDII). It should look like this:
    PPCSCS-LCG2    ldap://,o=grid
  • /etc/glite/glite-info-update-endpoints.conf: tells the BDII which file holds the configuration for extra sites. It should look like this:
    EGI = True
    OSG = True
    manual = True
    manual_file = /opt/cscs/etc/glite-info-update-extra-endpoints
    output_file = /opt/glite/etc/gip/top-urls.conf
    cache_dir = /var/cache/glite/glite-info-update-endpoints
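
After cfengine has run, the key = value settings can be read back to confirm that the manual endpoints file is the one expected. The following is a sketch; conf_get is a hypothetical helper name, and the path of the settings file is passed in by the caller.

```shell
# conf_get KEY FILE: print the value of "KEY = value" from FILE,
# with surrounding whitespace removed.
conf_get() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" |
        sed 's/[[:space:]]*$//'
}

# Example (CONF is the file holding the key = value settings):
#   [ -r "$(conf_get manual_file "$CONF")" ] || echo "manual_file missing" >&2
```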


You need to run:

yum groupinstall glite-UI
/opt/glite/yaim/bin/yaim -c -s /misc/siteinfo/site-info.def -n UI


  • wget
  • rpm -ihv sa1-release-2-1.el5.noarch.rpm
  • yum install httpd
  • yum install libyaml.i386
  • yum install egee-NAGIOS lcg-CA


