Service Card for BDII
Definition
Our site BDII host is bdii.lcg.cscs.ch, which is a DNS alias for sbdii[01-03].
The three site BDIIs are set up in High Availability mode, using DNS load balancing. While the lbcd daemon is running on a node, it publishes that machine's load to our DNS server, which directs queries to the least busy node. If the lbcd service is stopped on a node, the DNS server stops sending queries to that node.
Using this mechanism, we can easily do rolling upgrades or installs.
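To see which addresses the alias currently hands out (for example before and after stopping lbcd on a node), the output of host can be filtered. This is only a sketch: the helper name alias_members is made up for illustration, and it assumes host prints lines of the form "<name> has address <ip>".

```shell
# Print the IPv4 addresses found in `host` output read from stdin, sorted.
# Assumption: answer lines look like "<name> has address <ip>".
alias_members() {
    awk '/ has address / { print $4 }' | sort
}

# Example (needs DNS access, not run here):
#   host bdii.lcg.cscs.ch | alias_members
```

Whether a drained node disappears from the answers immediately depends on how the DNS server and its TTLs are configured.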
As of today, Site BDII hosts are configured as follows:
- sbdii[01,02,03]:
- Scientific Linux 6.4 x86_64
- EMI-3 site bdii (bdii-5.2.22-1)
There is no top BDII deployed at CSCS-LCG2 at the moment.
Operations
Normally this service does not require any intervention. If the BDII service fails on a node, however, it is important to take that node out of the pool by stopping the lbcd service on it, so that queries to bdii.lcg.cscs.ch are no longer directed to that particular machine:
$ service lbcd stop
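The check-then-drain step could be wrapped in a small helper run as root on the sbdii node itself. This is a sketch under assumptions: the function names are hypothetical, and "healthy" here only means that the anonymous ldapsearch query used elsewhere on this page returns successfully.

```shell
# Return 0 if the BDII on host $1 answers an anonymous query on port 2170.
bdii_healthy() {
    ldapsearch -x -LLL -h "$1" -p 2170 -b "o=grid" >/dev/null 2>&1
}

# Stop lbcd on the local machine when the local BDII does not answer,
# so that the DNS alias stops pointing at this node.
drain_if_unhealthy() {
    if bdii_healthy localhost; then
        echo "BDII OK, lbcd left running"
    else
        echo "BDII not answering, stopping lbcd"
        service lbcd stop
    fi
}
```

Calling drain_if_unhealthy by hand (or from cron) only ever stops lbcd; starting it again after the BDII is repaired stays a manual step.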
Client tools
Testing
The best way to test the service is by using ldapsearch. Here are some examples of usage:
$ ldapsearch -x -LLL -h ppcream01 -p 2170 -b "o=grid" # to test BDII of ppcream01
$ ldapsearch -x -LLL -h bdii.lcg.cscs.ch -p 2170 -b "o=grid"
The result of these queries should be a long document in LDIF format.
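Since the LDIF output is long, a quick sanity check is to count the entries instead of reading them. A minimal sketch, with a hypothetical helper name; it simply counts lines that start with "dn:".

```shell
# Count LDIF entries (lines starting with "dn:") read from stdin.
count_entries() {
    grep -c '^dn:'
}

# Example (needs network access, not run here):
#   ldapsearch -x -LLL -h bdii.lcg.cscs.ch -p 2170 -b "o=grid" | count_entries
```

A count of 0, or a count that differs a lot between the three nodes, suggests that something is wrong.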
Failover check
Checking logs
Set up
Dependencies (other services, mount points, ...)
This service does not depend on any other system. It does, however, need access to the BDII port (2170 in EMI/gLite, 2135 in ARC) of the other machines defined in the siteinfo.
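Port access from a site BDII node to the resource BDIIs can be checked with a short probe. This is a sketch only: both function names are hypothetical, and port_open relies on bash's /dev/tcp feature and the coreutils timeout command being available.

```shell
# Map a middleware flavour to its BDII port, as listed above.
bdii_port() {
    case "$1" in
        emi|glite) echo 2170 ;;
        arc)       echo 2135 ;;
        *)         echo "unknown flavour: $1" >&2; return 1 ;;
    esac
}

# Return 0 if a TCP connection to host $1, port $2 succeeds within 3 seconds.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (needs network access, not run here):
#   port_open ppcream01 "$(bdii_port emi)" || echo "ppcream01 not reachable"
```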
Redundancy notes
As stated before, three machines provide the service behind DNS load balancing.
Installation
Site BDII (EMI/UMD release)
After you bring the VM up, run cfengine once. Then try:
yum update --enablerepo=epel
yum install emi-bdii-site --enablerepo=cscs,epel
cfagent -q
/opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site
cfagent -q
grid-service restart
ls -l lbcd-3.3.0.tar.gz # you should have got this 75K tarball via cfengine
# OR wget http://archives.eyrie.org/software/system/lbcd-3.3.0.tar.gz
tar -zxvf lbcd-3.3.0.tar.gz
cd lbcd-3.3.0
./configure && make && make install
# By now, you must check that bdii is in chkconfig, is UP and CORRECT!!!
chkconfig ntpd off # this should not be needed normally
cfagent -qv ; reboot
# wait for the machine to come back and bring it in production
service iptables stop
service lbcd start
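The "bdii is in chkconfig" check from the steps above can be scripted. A sketch, assuming the node boots into runlevel 3 and that chkconfig --list prints the usual "3:on" style columns; the helper name is made up for illustration.

```shell
# Read `chkconfig --list bdii` output on stdin; succeed if runlevel 3 is "on".
enabled_at_boot() {
    grep -q '3:on'
}

# Example (run on the sbdii node, not run here):
#   chkconfig --list bdii | enabled_at_boot || echo "bdii is NOT enabled at boot"
```

This only covers the chkconfig part; whether the service is up and publishing correct data still needs an ldapsearch query.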
Upgrade
Simply stop the services (including lbcd), update the packages, run YAIM and start the services again. At least two instances of lbcd must be running on two different servers at any time in order to keep DNS load balancing working. To perform a rolling update, stop lbcd on one of the three sbdii[01-03] nodes, update the node, test it and start lbcd again; repeat for the other two nodes, one node at a time.
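The rolling update can be written down as a dry run that only prints the commands for each node. This is a sketch, not a tested procedure: the exact sequence (in particular stopping and starting bdii with service, and the yum repositories) is an assumption pieced together from the installation and operations notes on this page, and rolling_plan is a hypothetical helper.

```shell
# Print (do not execute) the per-node command sequence for a rolling update.
rolling_plan() {
    node=$1
    cat <<EOF
# --- $node ---
service lbcd stop
service bdii stop
yum update --enablerepo=epel
/opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site
service bdii start
ldapsearch -x -LLL -h $node -p 2170 -b "o=grid" | grep -c '^dn:'
service lbcd start
EOF
}

# One node at a time, so that two lbcd instances are always running.
for n in sbdii01 sbdii02 sbdii03; do
    rolling_plan "$n"
done
```

Only move on to the next node once the ldapsearch count on the current one looks sane and lbcd has been started again.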
Monitoring
The best way to check that the service is working is to run the ldapsearch command shown above. It is also important to check the site's status in GSTAT.
Nagios
Ganglia
Self Sanity / revival?
Other?
Manuals
Issues
Information about issues found with this service, and how to deal with them.
BDII dies without notification
Sometimes, when the BDII has a lot of entries to handle, the ramdisk it uses fills up and the service dies without any notification. If you have chosen to use the RAM disk for performance (BDII_RAM_DISK set to yes), you usually need to create a bigger ramdisk than the default.
Originally, /etc/init.d/bdii contains something like this:
# Create RAM Disk
if [ "${BDII_RAM_DISK}" = "yes" ]; then
mount -t tmpfs -o size=1500M,mode=0744 tmpfs ${SLAPD_DB_DIR}
fi
This needs to be changed to something like:
# Create RAM Disk
if [ "${BDII_RAM_DISK}" = "yes" ]; then
mount -t tmpfs -o size=3000M,mode=0744 tmpfs ${SLAPD_DB_DIR}
fi
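Because the service dies silently, it may help to watch how full the ramdisk is. A sketch under assumptions: the helper names and the 80% default threshold are made up for illustration, and the input is expected to be the POSIX-format output of df -P for the directory mounted by the init script.

```shell
# Read `df -P <dir>` output on stdin and print the use percentage without "%".
usage_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the usage read from stdin is at or above the threshold
# given as $1 (default 80).
ramdisk_nearly_full() {
    pct=$(usage_pct)
    [ "$pct" -ge "${1:-80}" ]
}

# Example (run on the sbdii node with SLAPD_DB_DIR set as in the init script):
#   df -P "$SLAPD_DB_DIR" | ramdisk_nearly_full 80 && echo "BDII ramdisk filling up"
```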
For a top BDII, these settings also need to be added to
/etc/bdii/DB_CONFIG_top
[...]
# test values
set_cachesize 1 0 1
set_flags DB_CDB_ALLDB
set_flags DB_LOG_AUTOREMOVE
#set_flags DB_LOG_INMEMORY
#set_flags DB_TXN_NOSYNC
set_lk_max_locks 10000
set_tas_spins 100
tmpfs /dev/shm tmpfs defaults,size=3G 0 0
OLD - BDII Reference Guide
Implementation details
Operations
For a restart of the bdii service you can type either
grid-service restart or
service bdii restart
Functionality Testing
A tool such as ldapsearch is recommended for functionality testing; the basic tests further below are a good starting point.
Automated Testing
gstat-validate-sanity-check -H sbdii01 -p 2170 -b o=grid
OK - errors 0, warnings 0, info 0
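For use in a script, the error count can be pulled out of that summary line. A minimal sketch: gstat_errors is a hypothetical helper, and it assumes the summary always contains "errors <number>," as in the output shown above.

```shell
# Extract the error count from a line like "OK - errors 0, warnings 0, info 0".
gstat_errors() {
    sed -n 's/.*errors \([0-9][0-9]*\),.*/\1/p'
}

# Example (needs the gstat validation tools, not run here):
#   errs=$(gstat-validate-sanity-check -H sbdii01 -p 2170 -b o=grid | gstat_errors)
#   [ "$errs" = "0" ] || echo "sbdii01 reports $errs errors"
```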
Scripts
- From the UI run
~fotis/bin/UI_testBDIIs
- Queries the three 32-bit and three 64-bit BDIIs and CERN's top BDIIs for the tags of ce02.lcg.cscs.ch
- The output should be fully consistent across all hosts.
- Combine it with watch to see the values as they propagate through the system
- IT DOES NOT CHECK THE COMPLETE LDAP TREE (THAT IS WAY MORE COMPLICATED BUSINESS)
Nagios Checks
Basic Testing
[fotis@ui ~]$ ldapsearch -x -h sbdii01.lcg.cscs.ch -p 2170 -b o=grid|wc
17020 33154 681594
[fotis@ui ~]$ ldapsearch -x -h sbdii02.lcg.cscs.ch -p 2170 -b o=grid|wc
17020 33154 681602
[fotis@ui ~]$ ldapsearch -x -h sbdii03.lcg.cscs.ch -p 2170 -b o=grid|wc
17020 33154 681604
OR
[fotis@ui ~]$ cat bin/UI_testBDIIs
#!/bin/sh
TMPFILE=/tmp/testbdii.$$
bdiicheck() {
echo -ne "$i: \t "; ldapsearch -x -h $i -p 2170 -b 'o=grid' GlueSubClusterUniqueID=ce02.lcg.cscs.ch \
|sed 's/lcg.cscs.ch, CSCS-LCG2, local, grid/lcg.cscs.ch, CSCS-LCG2, grid/g' \
|sed 's/Mds-Vo-name=CSCS-LCG2,Mds-Vo-name=local,o=grid/Mds-Vo-name=CSCS-LCG2,o=grid/g' \
|sort|tee $TMPFILE|wc|xargs echo -n|sed 's/$/\tsha1sum: /g';cat $TMPFILE|sha1sum
#| grep GlueHostApplicationSoftwareRunTimeEnvironment |tee $TMPFILE|wc -l|xargs echo -n|sed 's/$/ sha1sum: /g';cat $TMPFILE|sha1sum
}
echo Testing siteBDIIs
for i in {bdii01.lcg.cscs.ch,bdii02.lcg.cscs.ch,bdii03.lcg.cscs.ch}; do bdiicheck ;done
echo Testing topBDIIs @ CERN
for i in `host lcg-bdii.cern.ch|cut -f4 -d' '|sort`; do bdiicheck ;done
echo Testing siteBDIIs-glite3.2
for i in {sbdii01.lcg.cscs.ch,sbdii02.lcg.cscs.ch,sbdii03.lcg.cscs.ch}; do bdiicheck ;done
rm $TMPFILE
[fotis@ui ~]$ !$
bin/UI_testBDIIs
Testing siteBDIIs
bdii01.lcg.cscs.ch: 479 808 24988 sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a -
bdii02.lcg.cscs.ch: 479 808 24988 sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a -
bdii03.lcg.cscs.ch: 479 808 24988 sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a -
Testing topBDIIs @ CERN
128.142.142.161: 479 808 24988 sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a -
128.142.198.40: 479 808 24988 sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a -
128.142.198.41: 479 808 24988 sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a -
128.142.198.43: 479 808 24988 sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a -
Testing siteBDIIs-glite3.2
sbdii01.lcg.cscs.ch: 479 808 24988 sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a -
sbdii02.lcg.cscs.ch: 479 808 24988 sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a -
sbdii03.lcg.cscs.ch: 479 808 24988 sha1sum: da83daff50a4a4f0cac59a3d972fcf2969ea423a -
[fotis@ui ~]$
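Comparing the checksums by eye gets tedious when this is run repeatedly. The consistency check can be sketched as a filter over the script's output; checksums_consistent is a hypothetical helper and assumes each result line carries a "sha1sum:" field followed by a 40-character hex digest, as above.

```shell
# Read UI_testBDIIs output on stdin; succeed only if all lines carrying a
# "sha1sum:" field show one and the same checksum.
checksums_consistent() {
    n=$(sed -n 's/.*sha1sum: *\([0-9a-f]\{40\}\).*/\1/p' | sort -u | wc -l | tr -d ' ')
    [ "$n" -eq 1 ]
}

# Example (from the UI, not run here):
#   bin/UI_testBDIIs | checksums_consistent || echo "BDIIs disagree"
```

Like the script itself, this only compares the entries returned for ce02.lcg.cscs.ch, not the complete LDAP tree.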
Installation Notes
After you bring the VM up, run cfengine once. Then try:
time /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site
ls -l lbcd-3.3.0.tar.gz # you should have got this 75K tarball via cfengine
# OR wget http://archives.eyrie.org/software/system/lbcd-3.3.0.tar.gz
tar -zxvf lbcd-3.3.0.tar.gz
cd lbcd-3.3.0
./configure && make && make install
# By now, you must check that bdii is in chkconfig, is UP and CORRECT!!!
chkconfig ntpd off # this should not be needed normally
cfagent -qv ; reboot
# wait for the machine to come back and bring it in production
service iptables stop
service lbcd start
cfengine
- classes: BDII, LUSTRE_XEN, xen_domU, # lcgnodes?
References
--
FotisGeorgatos - 2010-09-23