Service Configuration

Service Nodes

General Instructions

  • install the OS: XenSampleImageReplication
  • check kernel version, should be kernel-xen ≥ 2.6.18-194.17.1
    • yum upgrade kernel-xen
  • create cfengine key in cfengine:/srv/cfengine/ppkeys
    • cfkey -f root-IPADDRESS
  • copy the keys to nfs:/export/kickstarts/private/cfengine/
    • scp /srv/cfengine/ppkeys/root-IPADDRESS* nfs:/export/kickstarts/private/cfengine/
  • copy newmachine script from xen03 and run it
    • NOTE: This step takes a long time. Wait until it has finished and the machine has rebooted automatically.
    • scp xen03:/nfs/kickstarts/newmachine /root/ && /root/newmachine
  • copy ssh keys to cfengine server:
    • cd /srv/cfengine/private/ssh/
    • mkdir HOSTNAME
    • ls se30|xargs -n1 --replace scp HOSTNAME:/etc/ssh/{} HOSTNAME/
  • check in ssh key to svn
    • asvn add HOSTNAME
    • asvn commit HOSTNAME --username poettl -m'New SSH keys for host HOSTNAME'
  • create new known_hosts file
    • /srv/cfengine/scripts/new_known_hosts
  • run /opt/cscs/sbin/install-glite to configure gLite middleware (or do it by hand step by step...)
  • cfagent -qv
  • reboot
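
The kernel check at the top of the list can be scripted. This is only a sketch: version_ge is a helper name made up here, and it assumes the string printed by uname -r can be compared field by field against the minimum quoted above.

```shell
# version_ge A B: succeed when version A is at least version B.
# Fields are split on "." and "-"; non-numeric fields such as "el5xen" count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, /[.-]/)
        nb = split(b, y, /[.-]/)
        for (i = 1; i <= nb; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

min=2.6.18-194.17.1
if version_ge "$(uname -r)" "$min"; then
    echo "kernel $(uname -r) is recent enough"
else
    echo "kernel $(uname -r) is older than $min: run yum upgrade kernel-xen"
fi
```

Only the fields present in the minimum version are compared, so a trailing suffix like .el5xen does not affect the result.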

Service Specific Notes

Worker Nodes

[PP] WNs

Once all the previous steps are done, Lustre has to be loaded before the last part of /opt/cscs/sbin/install-glite can run successfully. To do that, make sure the VM guest has two NICs: one with the public IP and one with the 10.10 IP. On the Xen host:

Apr 06 16:00 [root@xen17:xen]# cat /etc/xen/ppwn02 
name = "ppwn02"

vcpus = 2
memory = 4096
disk = ['phy:/dev/vg_root/ppwn02_root,xvda,w']
#vif = ['mac=00:16:3E:64:00:50,bridge=xenbr0','mac=00:16:10:64:00:50,bridge=xenbr2']
vif = ['mac=00:16:3E:67:00:02,bridge=xenbr1','mac=00:16:10:67:00:02,bridge=xenbr2']

bootloader = "/usr/bin/pygrub"
on_reboot = 'restart'
on_crash = 'destroy'
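
A quick way to confirm that a guest definition really has two NICs is to count the entries on the active vif line. This is a sketch: count_vifs is a hypothetical helper, and it assumes every interface entry carries a bridge= key, as in the file above.

```shell
# count_vifs FILE: print how many interfaces the active (uncommented) vif line defines.
count_vifs() {
    grep -E '^[[:space:]]*vif[[:space:]]*=' "$1" | grep -o 'bridge=' | wc -l | tr -d ' '
}

# Demo on a copy of the two vif lines from the ppwn02 definition.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
#vif = ['mac=00:16:3E:64:00:50,bridge=xenbr0','mac=00:16:10:64:00:50,bridge=xenbr2']
vif = ['mac=00:16:3E:67:00:02,bridge=xenbr1','mac=00:16:10:67:00:02,bridge=xenbr2']
EOF
count_vifs "$cfg"    # prints 2
rm -f "$cfg"
```

On the Xen host the same function can be pointed at /etc/xen/ppwn02.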

In the XEN guest, prepare the network:

Apr 06 16:02 [root@ppwn02:~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+
DEVICE=eth1
BOOTPROTO=static
IPADDR=10.10.64.202
NETMASK=255.255.252.0
IPV6INIT=no
IPV6_AUTOCONF=no
ONBOOT=yes
TYPE=Ethernet
Apr 06 16:02 [root@ppwn02:~]# ifup eth1
Apr 06 16:02 [root@ppwn02:~]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:16:10:67:00:02  
          inet addr:10.10.64.202  Bcast:10.10.67.255  Mask:255.255.252.0
          inet6 addr: fe80::216:10ff:fe67:2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:18531 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1134 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:21221791 (20.2 MiB)  TX bytes:236364 (230.8 KiB)

Apr 06 16:04 [root@ppwn02:~]# ping -c 4 10.10.64.201 
PING 10.10.64.201 (10.10.64.201) 56(84) bytes of data.
64 bytes from 10.10.64.201: icmp_seq=1 ttl=64 time=0.112 ms
64 bytes from 10.10.64.201: icmp_seq=2 ttl=64 time=0.082 ms
64 bytes from 10.10.64.201: icmp_seq=3 ttl=64 time=0.081 ms
64 bytes from 10.10.64.201: icmp_seq=4 ttl=64 time=0.088 ms

--- 10.10.64.201 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2997ms
rtt min/avg/max/mdev = 0.081/0.090/0.112/0.017 ms
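
The Bcast value reported by ifconfig follows from IPADDR and NETMASK, so a wrong netmask in ifcfg-eth1 shows up there first. A small sketch of the arithmetic (broadcast is a helper name invented for this example):

```shell
# broadcast IP NETMASK: print the broadcast address of the subnet.
broadcast() {
    ip=$1 mask=$2
    old_ifs=$IFS
    IFS=.
    set -- $ip $mask      # $1..$4 = IP octets, $5..$8 = netmask octets
    IFS=$old_ifs
    # Each octet: set every host bit (the bits that are 0 in the netmask).
    echo "$(( $1 | (255 - $5) )).$(( $2 | (255 - $6) )).$(( $3 | (255 - $7) )).$(( $4 | (255 - $8) ))"
}

broadcast 10.10.64.202 255.255.252.0   # prints 10.10.67.255
```

The result matches the Bcast field in the ifconfig output above.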

Now install the Lustre RPMs built for the running kernel and start Lustre. In the Xen guest:

Apr 06 16:04 [root@ppwn02:~]# mount xen11:/nfs /media
Apr 06 16:06 [root@ppwn02:~]# uname -r
2.6.18-238.5.1.el5xen
Apr 06 16:06 [root@ppwn02:~]#  rpm -ivh /media/rpms/xen_guest_lustre_1.8.4_238/lustre-*
Preparing...                ########################################### [100%]
        package lustre-modules-1.8.4-2.6.18_238.5.1.el5xen_201104061032.x86_64 is already installed
        package lustre-1.8.4-2.6.18_238.5.1.el5xen_201104061032.x86_64 is already installed
Apr 06 16:06 [root@ppwn02:~]# mkdir -p /lustre/scratch
Apr 06 16:07 [root@ppwn02:~]# service lustre start
Apr 06 16:07 [root@ppwn02:~]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
scratch-MDT0000_UUID        1.4T        1.8G        1.3T   0% /lustre/scratch[MDT:0]
scratch-OST0000_UUID        3.6T      174.3G        3.2T   4% /lustre/scratch[OST:0]
scratch-OST0001_UUID        3.6T      175.4G        3.2T   4% /lustre/scratch[OST:1]
scratch-OST0002_UUID        3.6T      181.0G        3.2T   4% /lustre/scratch[OST:2]
[...]
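
The Lustre packages are tied to one kernel build, so it is worth checking that the RPM names match uname -r before starting the service. This sketch assumes the naming seen in the rpm output above, where the kernel release appears in the package name with "-" replaced by "_". Both function names are made up here.

```shell
# kernel_tag RELEASE: kernel release as it appears inside the Lustre RPM names,
# e.g. 2.6.18-238.5.1.el5xen -> 2.6.18_238.5.1.el5xen
kernel_tag() {
    printf '%s\n' "$1" | sed 's/-/_/g'
}

# rpm_matches_kernel PACKAGE RELEASE: succeed if PACKAGE was built for RELEASE.
rpm_matches_kernel() {
    case $1 in
        *"$(kernel_tag "$2")"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use on the guest (not executed here):
#   rpm -qa 'lustre*' | while read -r pkg; do
#       rpm_matches_kernel "$pkg" "$(uname -r)" || echo "kernel mismatch: $pkg"
#   done
```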

At this point you can run the last part of the installation (the YAIM configuration), and it should now work:

Apr 06 16:07 [root@ppwn02:~]# umount /media 
Apr 06 16:07 [root@ppwn02:~]# /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n WN -n TORQUE_client

CREAM-CE

References

[PP] CREAM-CEs

CREAM-CE only works properly with Lustre mounted, so follow the same Lustre steps described above for the worker nodes.

Problem when installing tomcat rpms

Problem description: When running rpm -qa | grep tomcat5, the main tomcat5 package is missing from the list:

Apr 12 10:34 [root@ppcream02:~]# rpm -qa |grep tomcat5
tomcat5-server-lib-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-jasper-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-jsp-2.0-api-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-common-lib-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-servlet-2.4-api-5.5.23-0jpp.17.el5_6.x86_64

And when you try to install it you get some errors:

Loaded plugins: kernel-module
Excluding Packages in global exclude list
Finished
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package tomcat5.x86_64 0:5.5.23-0jpp.17.el5_6 set to be updated
--> Finished Dependency Resolution
Beginning Kernel Module Plugin
Finished Kernel Module Plugin

Dependencies Resolved

==================================================================================================================================================================================
 Package                               Arch                                 Version                                               Repository                                 Size
==================================================================================================================================================================================
Installing:
 tomcat5                               x86_64                               5.5.23-0jpp.17.el5_6                                  sl-security                               362 k

Transaction Summary
==================================================================================================================================================================================
Install      1 Package(s)         
Update       0 Package(s)         
Remove       0 Package(s)         

Total size: 362 k
Is this ok [y/N]: y
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing     : tomcat5                                                                                                                                                    1/1 
Error unpacking rpm package tomcat5-5.5.23-0jpp.17.el5_6.x86_64
warning: /etc/tomcat5/server.xml created as /etc/tomcat5/server.xml.rpmnew
warning: /etc/tomcat5/tomcat5.conf created as /etc/tomcat5/tomcat5.conf.rpmnew
error: unpacking of archive failed on file /usr/share/tomcat5/webapps: cpio: rename

In addition, or instead, you may have broken symbolic links in /usr/share/tomcat5 and/or /var/lib/tomcat5.
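
To see whether the broken-link symptom applies, list the dangling symlinks under both directories. A minimal sketch (broken_links is a hypothetical helper; it relies on test -e following the link and failing when the target is missing):

```shell
# broken_links DIR...: print every symbolic link under DIR whose target is missing.
broken_links() {
    find "$@" -type l ! -exec test -e {} \; -print
}

# Possible use on the CREAM host (not executed here):
#   broken_links /usr/share/tomcat5 /var/lib/tomcat5
```

Empty output means no broken links were found.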

Solution: Completely erase all files within /usr/share/tomcat5 and /var/lib/tomcat5, then run yum and yaim again:

Apr 12 10:37 [root@ppcream02:~]# yum install tomcat5-5.5.23-0jpp.17.el5_6.x86_64 # Replace the tomcat5 version with the relevant one!!!!
Apr 12 10:37 [root@ppcream02:~]# rpm -qa |grep tomcat
tomcat5-server-lib-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-jasper-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-jsp-2.0-api-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-common-lib-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-5.5.23-0jpp.17.el5_6.x86_64
tomcat5-servlet-2.4-api-5.5.23-0jpp.17.el5_6.x86_64

Apr 12 10:38 [root@ppcream02:~]# /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n creamCE -n TORQUE_utils
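
The transcript does not show the erase step itself. One way to do it, as a sketch only: empty_dir is a helper invented here, and it assumes a find that supports -mindepth and -delete (GNU findutils does). It removes everything inside the directory but keeps the directory, and refuses obviously dangerous arguments.

```shell
# empty_dir DIR: delete everything inside DIR but keep DIR itself.
# Refuses an empty argument, "/" and anything that is not a directory.
empty_dir() {
    case $1 in
        ''|/) echo "refusing to empty '$1'" >&2; return 1 ;;
    esac
    [ -d "$1" ] || { echo "$1 is not a directory" >&2; return 1; }
    # -delete removes depth-first; symlinks are removed, not followed.
    find "$1" -mindepth 1 -delete
}

# Possible use before re-running yum (not executed here):
#   empty_dir /usr/share/tomcat5 && empty_dir /var/lib/tomcat5
```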

Problem when submitting jobs

Problem description: When submitting a job from the UI you get the following message

Apr 12 10:33 [pablof@ui64:test_ppcream01]$ glite-ce-job-submit -a -r ppcream02/cream-pbs-atlas $PWD/jobs/hostname.jdl
2011-04-12 10:46:40,635 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection refused]

Looking into /var/lib/tomcat5/webapps/, you see only the packed archive and no unpacked ce-cream directory:

Apr 12 10:46 [root@ppcream02:~]# ls -lh /var/lib/tomcat5/webapps/
total 4.4M
-rw-r--r-- 1 root root 4.4M Apr 12 10:45 ce-cream.war

Note: also check that the time on the CREAM host is correct.
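
The symptom (a .war file with no unpacked directory next to it) is easy to test for. A sketch with a hypothetical helper name:

```shell
# war_unpacked WEBAPPS NAME: succeed if NAME.war has an unpacked NAME/ directory beside it.
war_unpacked() {
    [ -f "$1/$2.war" ] && [ -d "$1/$2" ]
}

# Possible use on the CREAM host (not executed here):
#   war_unpacked /var/lib/tomcat5/webapps ce-cream || echo "ce-cream.war has not been unpacked"
```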

Solution: Copy the unpacked ce-cream directory from the webapps directory of another running CREAM-CE instance, then restart the services:

Apr 12 10:48 [root@ppcream02:~]# scp -r ppcream01:/usr/share/tomcat5/webapps/ce-crea* /usr/share/tomcat5/webapps/
Apr 12 10:49 [root@ppcream02:~]# ls -lh /var/lib/tomcat5/webapps/
total 4.4M
drwxr-xr-x 5 root root 4.0K Apr 12 10:49 ce-cream
-rw-r--r-- 1 root root 4.4M Apr 12 10:49 ce-cream.war

Apr 12 10:49 [root@ppcream02:~]# service gLite restart
STOPPING SERVICES
*** glite-ce-blahparser:
Shutting down BNotifier:                                   [FAILED]
Shutting down BUpdaterPBS:                                 [FAILED]

*** glite-lb-locallogger:
Stopping glite-lb-logd ... done
Stopping glite-lb-interlogd ... done

*** tomcat5:
Stopping tomcat5:                                          [  OK  ]
STARTING SERVICES
*** tomcat5:
Starting tomcat5:                                          [  OK  ]

*** glite-lb-locallogger:
Starting glite-lb-logd ...This is LocalLogger, part of Workload Management System in EU DataGrid & EGEE.
[20453] Initializing...
[20453] Parse messages for correctness... [yes]
[20453] Send messages also to inter-logger... [yes]
[20453] Messages will be stored with the filename prefix "/var/glite/log/dglogd.log".
[20453] Server running with certificate: /DC=com/DC=quovadisglobal/DC=grid/DC=switch/DC=hosts/C=CH/ST=Zuerich/L=Zuerich/O=ETH Zuerich/CN=ppcream02.lcg.cscs.ch
[20453] Listening on port 9002
[20453] Running as daemon... [yes]
 done
Starting glite-lb-interlogd ... done

*** glite-ce-blahparser:
Starting BNotifier:                                        [  OK  ]
Starting BUpdaterPBS:                                      [  OK  ]

lrms

Compile Torque 2.5.x with HA and create RPMs

  • download newest version of torque
  • ./configure --prefix=/usr --with-server-home=/var/spool/pbs --with-default-server=lrms02.lcg.cscs.ch,lrms01.lcg.cscs.ch --enable-high-availability
  • make rpm
  • copy rpms to repo
    • scp /usr/src/redhat/RPMS/x86_64/torque{,-server,-mom,-client}-2.5.2-1cri.x86_64.rpm nfs01:/export/packages/repo
    • on nfs01: cd /export/packages/repo; createrepo .
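
The scp line relies on bash brace expansion to name four packages. Before copying, it can help to confirm that make rpm actually produced all of them. A sketch, with both function names invented for this example:

```shell
# torque_rpms VERSION: print the file names of the four packages copied to the repo.
torque_rpms() {
    for sub in "" -server -mom -client; do
        echo "torque${sub}-$1.x86_64.rpm"
    done
}

# missing_rpms DIR VERSION: print the expected files that are absent from DIR.
missing_rpms() {
    torque_rpms "$2" | while read -r f; do
        [ -f "$1/$f" ] || echo "$f"
    done
}

# Possible use on the build host (not executed here):
#   missing_rpms /usr/src/redhat/RPMS/x86_64 2.5.2-1cri
```

Empty output from missing_rpms means all four files are in place.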

lcg-CE

After the reboot the gridmap files have to be created. Either wait for the cron job to run, or run:

  • /opt/edg/sbin/edg-mkgridmap --output=/etc/grid-security/dn-grid-mapfile --safe
  • cp /etc/grid-security/dn-grid-mapfile /etc/grid-security/grid-mapfile.tmp; cat /etc/grid-security/voms-grid-mapfile >> /etc/grid-security/grid-mapfile.tmp; mv /etc/grid-security/grid-mapfile.tmp /etc/grid-security/grid-mapfile
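
The second command builds grid-mapfile from the DN and VOMS map files through a temporary file. The same idea as a small function, as a sketch only (merge_gridmap is a made-up name); unlike the one-liner, the rename is skipped if the concatenation fails.

```shell
# merge_gridmap DN_FILE VOMS_FILE OUT: write DN_FILE followed by VOMS_FILE to OUT.
# The result is built in OUT.tmp and then renamed, so OUT is never seen half-written.
merge_gridmap() {
    cat "$1" "$2" > "$3.tmp" && mv "$3.tmp" "$3"
}

# Possible use on the lcg-CE (not executed here):
#   merge_gridmap /etc/grid-security/dn-grid-mapfile \
#                 /etc/grid-security/voms-grid-mapfile \
#                 /etc/grid-security/grid-mapfile
```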

BDII

Site BDII

For a detailed log of the last installation, refer to https://webrt.cscs.ch/Ticket/Display.html?id=7962. In short, set the following in the site configuration and then follow the steps below:

BDII_REGIONS="SE CE"
BDII_CE_URL="ldap://ce01.lcg.cscs.ch:2170/mds-vo-name=resource,o=grid"
BDII_SE_URL="ldap://storage01.lcg.cscs.ch:2170/mds-vo-name=resource,o=grid"
  • Run the Yaim conf tool: /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site
  • wget/configure/make/install LBCD, from http://archives.eyrie.org/software/system/lbcd-3.3.0.tar.gz
  • Check iptables
  • service lbcd start # that's it, it should appear in the DNS list, IFF DT has included it in the master LBCD node
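
Each BDII_*_URL value combines an LDAP URI with a search base. Splitting them makes it easy to query a resource BDII by hand with ldapsearch and confirm it answers before the site BDII depends on it. A sketch; bdii_split_url is a hypothetical helper:

```shell
# bdii_split_url URL: print the LDAP URI and the search base, separated by a space.
bdii_split_url() {
    uri=${1%/*}      # everything before the last "/": ldap://host:port
    base=${1##*/}    # everything after the last "/":  the search base
    printf '%s %s\n' "$uri" "$base"
}

bdii_split_url "ldap://ce01.lcg.cscs.ch:2170/mds-vo-name=resource,o=grid"
# prints: ldap://ce01.lcg.cscs.ch:2170 mds-vo-name=resource,o=grid

# Possible manual query with the two parts (not executed here):
#   ldapsearch -x -LLL -H ldap://ce01.lcg.cscs.ch:2170 -b mds-vo-name=resource,o=grid
```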

[PP] Top BDII

Make sure that you have run cfengine and that the following files are installed in your system:

  • /etc/glite/glite-info-update-endpoints.conf: tells the BDII which file holds the configuration for the extra sites. Should look like this:
    [configuration]
    EGI  = True
    OSG = True
    manual = True
    manual_file = /opt/cscs/etc/glite-info-update-extra-endpoints
    output_file = /opt/glite/etc/gip/top-urls.conf
    cache_dir = /var/cache/glite/glite-info-update-endpoints
  • /opt/cscs/etc/glite-info-update-extra-endpoints: specifies which extra sites must be queried (in our case, the preproduction BDII). Should look like this:
    PPCSCS-LCG2    ldap://ppbdii01.lcg.cscs.ch:2170/mds-vo-name=ppcscs-lcg2,o=grid
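
After cfengine has run, it is worth confirming that the file named by manual_file exists. The sketch below reads a key from a [configuration]-style file like the one shown above; conf_value is a name invented for this example, and it assumes one key = value pair per line.

```shell
# conf_value FILE KEY: print the value assigned to KEY in a "key = value" file.
conf_value() {
    awk -v k="$2" '
        /^[ \t]*(#|\[)/ { next }    # skip comments and section headers
        index($0, "=") {
            key = $0; sub(/[ \t]*=.*/, "", key); sub(/^[ \t]+/, "", key)
            if (key == k) {
                val = $0; sub(/^[^=]*=[ \t]*/, "", val); sub(/[ \t]+$/, "", val)
                print val
                exit
            }
        }' "$1"
}

# Possible use on the top BDII (not executed here; CONF is whichever file
# holds the [configuration] block):
#   f=$(conf_value "$CONF" manual_file)
#   [ -f "$f" ] || echo "manual_file $f is missing"
```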

ui64

Install the glite-UI package group and configure it with YAIM:

yum groupinstall glite-UI
/opt/glite/yaim/bin/yaim -c -s /misc/siteinfo/site-info.def -n UI

nagios

  • wget http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/centos5/x86_64/sa1-release-2-1.el5.noarch.rpm
  • rpm -ihv sa1-release-2-1.el5.noarch.rpm
  • yum install httpd
  • yum install libyaml.i386
  • yum install egee-NAGIOS lcg-CA

References

-- PeterOettl - 2010-03-01

Topic revision: r38 - 2013-03-20 - GeorgeBrown
 