Service Card for BDII

Definition

Our site BDII host is bdii.lcg.cscs.ch which is a DNS alias for sbdii[01-03].

The three site BDIIs are set up in high-availability mode using DNS load balancing. While the lbcd daemon is running on a node, it publishes that machine's load to our DNS server, which directs queries to the least busy node. If lbcd is stopped on a node, the DNS server stops sending queries to that node.

Using this mechanism, we can easily do rolling upgrades or installs.
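
To see which addresses the alias currently hands out, a small helper along these lines can be used. It is only a sketch: the function name alias_addresses is made up for illustration, and it assumes the host command (bind-utils) is installed on the machine you run it from.

```shell
#!/bin/sh
# List the IPv4 addresses currently returned for a DNS name.
# Assumption: `host` is available; the helper name is hypothetical.
alias_addresses() {
    # `host NAME` prints lines such as "NAME has address 192.0.2.10";
    # keep only those lines and print the last field (the address).
    host "$1" | awk '/has address/ {print $NF}' | sort
}

# Example:
# alias_addresses bdii.lcg.cscs.ch
```

Running it before and after stopping lbcd on a node is one way to watch that node leave the rotation.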

As of today, Site BDII hosts are configured as follows:

  • sbdii[01,02,03]:
    • Scientific Linux 6.4 x86_64
    • EMI-3 site bdii (bdii-5.2.22-1)
There is no top BDII deployed at CSCS-LCG2 at the moment.

Operations

Normally this service requires no intervention. If the BDII service on a machine fails, however, it is important to take that machine out of rotation by stopping the lbcd service on it, so that bdii.lcg.cscs.ch no longer resolves to that machine.

$ service lbcd stop
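
The same reaction can be scripted. The following is a minimal sketch, not something deployed at the site: the function name bdii_guard, the localhost default and the 10 second search time limit are all assumptions, and it has to run as root on the sbdii node itself.

```shell
#!/bin/sh
# Take the local node out of the DNS rotation when its BDII stops answering.
# Assumptions: run as root on an sbdii node; function name, default host
# and time limit are illustrative choices, not site configuration.
bdii_guard() {
    host_name=${1:-localhost}
    # Same query as in the Testing section, with the output discarded;
    # only the exit status of ldapsearch matters here.
    if ldapsearch -x -LLL -l 10 -h "$host_name" -p 2170 -b "o=grid" >/dev/null 2>&1; then
        echo "BDII answers on $host_name:2170, leaving lbcd alone"
    else
        echo "BDII does not answer on $host_name:2170, stopping lbcd"
        service lbcd stop
    fi
}

# Example:
# bdii_guard
```

The sketch only ever stops lbcd; starting it again is left as a manual step so that a node returns to production only after someone has checked it.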

Client tools

Testing

The best way to test the service is by using ldapsearch. Here are some examples of usage:

$ ldapsearch -x -LLL -h ppcream01 -p 2170 -b "o=grid"  # to test BDII of ppcream01
$ ldapsearch -x -LLL -h bdii.lcg.cscs.ch -p 2170 -b "o=grid" 

The result of these queries should be a long document in LDIF format.
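
"A long document" can be turned into a rough check by counting entries, since every LDIF entry starts with a dn: line. The helpers below are a sketch; their names are hypothetical and the threshold in the example is a guess that has to be tuned per site.

```shell
#!/bin/sh
# Count the entries in an LDIF stream read from stdin.
# Every LDIF entry starts with a "dn:" line, so counting those counts entries.
ldif_entry_count() {
    # grep -c exits non-zero when the count is 0; mask that so the
    # helper still works under `set -e`.
    grep -c '^dn:' || true
}

# Succeed only if the stream on stdin holds at least $1 entries.
ldif_has_at_least() {
    count=$(ldif_entry_count)
    echo "$count entries"
    [ "$count" -ge "$1" ]
}

# Example (the threshold 100 is an arbitrary illustration):
# ldapsearch -x -LLL -h bdii.lcg.cscs.ch -p 2170 -b "o=grid" | ldif_has_at_least 100
```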

Failover check

Checking logs

Set up

Dependencies (other services, mount points, ...)

This service does not depend on any other service. It does, however, need access to the BDII port (2170 in EMI/gLite, 2135 in ARC) of the other machines defined in the siteinfo.
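
Reachability of those ports can be checked from a site BDII node with a sketch like the one below. It assumes bash and coreutils timeout are installed; the function names and the 5 second limit are illustrative, and the ARC host in the example is hypothetical.

```shell
#!/bin/sh
# Probe one TCP port; succeeds if a connection opens within 5 seconds.
# Assumption: bash (for its /dev/tcp redirection) and `timeout` exist.
probe() {
    # Inside `bash -c`, $0 is the host and $1 is the port.
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Report the state of each HOST:PORT pair given as an argument.
report_ports() {
    for target in "$@"; do
        host_name=${target%:*}
        port=${target#*:}
        if probe "$host_name" "$port"; then
            echo "$target open"
        else
            echo "$target unreachable"
        fi
    done
}

# Example (ppcream01 is the host from the Testing section):
# report_ports ppcream01:2170 arc-host.example:2135
```

An open port only shows that something is listening; the ldapsearch queries from the Testing section remain the real functional check.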

Redundancy notes

As stated above, three machines provide the service behind DNS load balancing.

Installation

Site BDII (EMI/UMD release)

After you bring the VM up, run cfengine once. Then try:

yum update --enablerepo=epel
yum install emi-bdii-site --enablerepo=cscs,epel
cfagent -q
/opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site
cfagent -q
grid-service restart

ls -l lbcd-3.3.0.tar.gz # you should have got this 75K tarball via cfengine
# OR wget http://archives.eyrie.org/software/system/lbcd-3.3.0.tar.gz
tar -zxvf lbcd-3.3.0.tar.gz
cd lbcd-3.3.0
./configure && make && make install

# By now, you must check that bdii is in chkconfig, is UP and CORRECT!!!
chkconfig ntpd off # this should not be needed normally
cfagent -qv ; reboot

# wait for the machine to come back and bring it in production
service iptables stop
service lbcd start

Upgrade

Simply stop the services (including lbcd), update the packages, run YAIM and start the services again. At least two instances of lbcd must be running on two different servers at any time to keep DNS load balancing enabled. To perform a rolling update, stop lbcd on one of the three sbdii[01-03] nodes, update the node, test it and start lbcd again; then repeat for the other two nodes, one node at a time.
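
The rolling update can be written down as a per-node plan. The sketch below only prints the steps and does not execute anything; the commands are the ones that appear elsewhere on this page, while their exact order (in particular restarting with service bdii restart after YAIM) is an assumption, and how they are run on each node is left to the operator.

```shell
#!/bin/sh
# Print the rolling-update plan, one node at a time, so that lbcd keeps
# running on the other two nodes while one is being updated.
# Assumption: the step order is illustrative; nothing is executed here.
rolling_plan() {
    for node in "$@"; do
        echo "== $node =="
        echo "service lbcd stop"
        echo "yum update --enablerepo=epel"
        echo "/opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site"
        echo "service bdii restart"
        echo "ldapsearch -x -LLL -h $node -p 2170 -b \"o=grid\"   # must return LDIF before continuing"
        echo "service lbcd start"
    done
}

rolling_plan sbdii01 sbdii02 sbdii03
```

Do not move on to the next node until the ldapsearch test has succeeded and lbcd is running again on the current one.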

Monitoring

The best way to see whether the service works okay is by running the ldapsearch command stated before, but there is also another important thing to do: check the status of GSTAT.

Nagios

Ganglia

Self Sanity / revival?

Other?

Manuals

Issues

Information about issues found with this service, and how to deal with them.

BDII dies without notification

Sometimes, when the BDII has many entries to handle, the ramdisk it uses fills up and the service dies without any notification. If you have chosen to use the RAMDISK for performance, you usually need to create a bigger ramdisk than the default.

Originally, /etc/init.d/bdii contains something like this:

    # Create RAM Disk
    if [ "${BDII_RAM_DISK}" = "yes" ]; then
        mount -t tmpfs -o size=1500M,mode=0744 tmpfs ${SLAPD_DB_DIR}
    fi

This needs to be changed to something like:

    # Create RAM Disk
    if [ "${BDII_RAM_DISK}" = "yes" ]; then
        mount -t tmpfs -o size=3000M,mode=0744 tmpfs ${SLAPD_DB_DIR}
    fi
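
Because the service dies silently, it can help to watch how full the ramdisk is. The following is a sketch only: the function names and the 80 % threshold are assumptions, and the mount point must be supplied by the caller (it is whatever ${SLAPD_DB_DIR} expands to in the init script).

```shell
#!/bin/sh
# Print the usage percentage (without the % sign) from `df -P` output on stdin.
# `df -P` prints one header line, then one line per file system with the
# capacity in the fifth column.
df_usage_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the file system holding the given path is nearly full.
# Assumption: the 80 % threshold is an arbitrary example value.
ramdisk_check() {
    pct=$(df -P "$1" | df_usage_pct)
    if [ "$pct" -ge 80 ]; then
        echo "WARNING: $1 is ${pct}% full"
    else
        echo "OK: $1 is ${pct}% full"
    fi
}

# Example, with the mount point taken from the BDII configuration:
# ramdisk_check "$SLAPD_DB_DIR"
```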

For a top BDII, we also need to add these settings to /etc/bdii/DB_CONFIG_top:

[...]
# test values

set_cachesize 1 0 1
set_flags DB_CDB_ALLDB
set_flags DB_LOG_AUTOREMOVE
#set_flags DB_LOG_INMEMORY
#set_flags DB_TXN_NOSYNC
set_lk_max_locks 10000
set_tas_spins 100

tmpfs /dev/shm tmpfs defaults,size=3G 0 0

Issue2

Issue3

OLD - BDII Reference Guide

Implementation details

Operations

For a restart of the bdii service you can type either grid-service restart or service bdii restart

Functionality Testing

It is recommended to use a tool like ldapsearch; in the meantime try basic testing seen below.

Automated Testing

gstat-validate-sanity-check -H sbdii01 -p 2170 -b o=grid
OK - errors 0, warnings 0, info 0

Scripts

  • From the UI run ~fotis/bin/UI_testBDIIs
    • Queries 3 32bit and 3 64bit BDIIs and CERN's topBDIIs in relation to ce02.lcg.cscs.ch's TAGs
    • output provided should be fully consistent.
    • combine it with watch to see the values as they propagate through the system
    • IT DOES NOT CHECK THE COMPLETE LDAP TREE (THAT IS WAY MORE COMPLICATED BUSINESS)

Nagios Checks

Basic Testing

[fotis@ui ~]$ ldapsearch -x -h sbdii01.lcg.cscs.ch -p 2170 -b o=grid|wc
  17020   33154  681594
[fotis@ui ~]$ ldapsearch -x -h sbdii02.lcg.cscs.ch -p 2170 -b o=grid|wc
  17020   33154  681602
[fotis@ui ~]$ ldapsearch -x -h sbdii03.lcg.cscs.ch -p 2170 -b o=grid|wc
  17020   33154  681604

OR

[fotis@ui ~]$ cat bin/UI_testBDIIs
#!/bin/bash
TMPFILE=/tmp/testbdii.$$

bdiicheck() {
  echo -ne "$i:  \t "; ldapsearch -x -h $i -p 2170 -b 'o=grid' GlueSubClusterUniqueID=ce02.lcg.cscs.ch \
        |sed 's/lcg.cscs.ch, CSCS-LCG2, local, grid/lcg.cscs.ch, CSCS-LCG2, grid/g' \
        |sed 's/Mds-Vo-name=CSCS-LCG2,Mds-Vo-name=local,o=grid/Mds-Vo-name=CSCS-LCG2,o=grid/g' \
        |sort|tee $TMPFILE|wc|xargs echo -n|sed 's/$/\tsha1sum: /g';cat $TMPFILE|sha1sum
        #| grep GlueHostApplicationSoftwareRunTimeEnvironment |tee $TMPFILE|wc -l|xargs echo -n|sed 's/$/ sha1sum: /g';cat $TMPFILE|sha1sum
}

echo Testing siteBDIIs
for i in {bdii01.lcg.cscs.ch,bdii02.lcg.cscs.ch,bdii03.lcg.cscs.ch}; do bdiicheck ;done
echo Testing topBDIIs @ CERN
for i in `host lcg-bdii.cern.ch|cut -f4 -d' '|sort`; do bdiicheck ;done
echo Testing siteBDIIs-glite3.2
for i in {sbdii01.lcg.cscs.ch,sbdii02.lcg.cscs.ch,sbdii03.lcg.cscs.ch}; do bdiicheck ;done
rm $TMPFILE
[fotis@ui ~]$ !$
bin/UI_testBDIIs
Testing siteBDIIs
bdii01.lcg.cscs.ch:      479 808 24988  sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a  -
bdii02.lcg.cscs.ch:      479 808 24988  sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a  -
bdii03.lcg.cscs.ch:      479 808 24988  sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a  -
Testing topBDIIs @ CERN
128.142.142.161:         479 808 24988  sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a  -
128.142.198.40:          479 808 24988  sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a  -
128.142.198.41:          479 808 24988  sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a  -
128.142.198.43:          479 808 24988  sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a  -
Testing siteBDIIs-glite3.2
sbdii01.lcg.cscs.ch:     479 808 24988  sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a  -
sbdii02.lcg.cscs.ch:     479 808 24988  sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a  -
sbdii03.lcg.cscs.ch:     479 808 24988  sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a  -
[fotis@ui ~]$

Installation Notes

After you bring the VM up, run cfengine once. Then try:

time /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site

ls -l lbcd-3.3.0.tar.gz # you should have got this 75K tarball via cfengine
# OR wget http://archives.eyrie.org/software/system/lbcd-3.3.0.tar.gz
tar -zxvf lbcd-3.3.0.tar.gz
cd lbcd-3.3.0
./configure && make && make install

# By now, you must check that bdii is in chkconfig, is UP and CORRECT!!!
chkconfig ntpd off # this should not be needed normally
cfagent -qv ; reboot

# wait for the machine to come back and bring it in production
service iptables stop
service lbcd start

cfengine

  • classes: BDII, LUSTRE_XEN, xen_domU, # lcgnodes?

References

-- FotisGeorgatos - 2010-09-23
ServiceCardForm
Service name: BDII-site
Machines this service is installed in: sbdii[01-03]
Is Grid service: Yes
Depends on the following services: -
Expert: Gianni Ricciardi
Topic revision: r16 - 2013-11-19 - GianniRicciardi
 