Service Card for BDII
Definition
Our site BDII host is bdii.lcg.cscs.ch, which is a DNS alias for sbdii[01-03].
The three site BDIIs are set up in high-availability mode using DNS load balancing: while the lbcd daemon is running on a machine, it publishes that machine's load to our DNS server, which directs queries to the least busy node. If the lbcd service is stopped, the DNS server stops sending queries to that machine.
Using this mechanism we can easily do rolling upgrades or installs.
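To see which site BDIIs are currently in the DNS rotation, the alias can be queried directly (a sketch; assumes `dig` from bind-utils is installed):

```shell
# List the A records currently behind the bdii.lcg.cscs.ch alias.
# A node whose lbcd daemon is stopped should disappear from this list.
bdii_rotation() {
    dig +short "${1:-bdii.lcg.cscs.ch}" | sort
}
# bdii_rotation    # prints one address per sbdii node still in rotation
```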
As of today, site BDII hosts are configured as follows:
- sbdii[01,02,03]:
- Scientific Linux 6.4 x86_64
- EMI-3 site bdii (bdii-5.2.22-1)
There is no top BDII deployed at CSCS-LCG2 at the moment.
Operations
Normally this service requires no operation, but when the BDII service fails on a machine for some reason, it is important to take that machine out of rotation by stopping its lbcd service. This stops the DNS server from resolving bdii.lcg.cscs.ch to that particular machine:
$ service lbcd stop
To restart the bdii service itself, either grid-service2 restart or service bdii restart can be used.
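The drain-and-restart steps above can be combined into one sequence (a sketch, to be run on the affected sbdii node; it only uses the commands already shown on this page):

```shell
# Take this node out of the DNS rotation, restart the BDII, then re-enable it.
drain_and_restart_bdii() {
    service lbcd stop       # DNS server stops sending queries here
    service bdii restart    # restart the BDII service itself
    # ...verify with ldapsearch (see Testing) before re-enabling...
    service lbcd start      # put the node back into rotation
}
```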
Client tools
Testing
The best way to test the service is by using ldapsearch. Here are some examples of usage:
$ ldapsearch -x -LLL -h ppcream01 -p 2170 -b "o=grid" # to test BDII of ppcream01
$ ldapsearch -x -LLL -h bdii.lcg.cscs.ch -p 2170 -b "o=grid"
The result of these queries should be a long document in LDIF format.
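Beyond eyeballing the LDIF, a quick sanity check is to count the returned entries (a sketch; a healthy site BDII should return far more than zero DNs):

```shell
# Count "dn:" lines in an LDIF stream (small helper for the query below).
count_dns() { grep -c '^dn:'; }

# Query a BDII endpoint and count the entries it publishes.
bdii_entry_count() {
    ldapsearch -x -LLL -h "$1" -p "${2:-2170}" -b "o=grid" dn | count_dns
}
# bdii_entry_count bdii.lcg.cscs.ch    # zero (or an error) indicates a problem
```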
A useful tool to check published values is the GLUE validator (already installed on all sbdii* machines, otherwise available in the official EMI-3 Updates repository). It can be run against a site BDII or even a resource BDII (e.g. a CE) to check the conformity of the published data to the GLUE schema (version 2.0 by default):
[sbdii03] # glue-validator -H bdii.lcg.cscs.ch -p 2170 -b o=glue -k
CRITICAL - errors 38, warnings 8, info 258 | errors=38;warnings=8;info=258
In this case there are several critical errors that can be investigated further by increasing the verbosity:
[sbdii03] # glue-validator -H sbdii01.lcg.cscs.ch -p 2170 -b o=glue -k -v2
CRITICAL - errors 38, warnings 8, info 258 | errors=38;warnings=8;info=258
Summary per type of error, warning and info message:
E002 - Obsolete entry (GLUE2EntityValidity): 38
I007 - Unknown WLCG Name (GLUE2EntityOtherInfo): 2
I032 - Default value published (GLUE2ComputingShareMaxTotalJobs): 36
I033 - Default value published (GLUE2ComputingShareMaxRunningJobs): 36
I034 - Default value published (GLUE2ComputingShareMaxWaitingJobs): 36
I043 - Memory higher than 100,000 MB (GLUE2ComputingShareMaxMainMemory): 36
I045 - Memory higher than 100,000 MB (GLUE2ComputingShareMaxVirtualMemory): 36
I091 - Total share capacity size less than 1000 GB (GLUE2StorageShareCapacityTotalSize): 4
I096 - Default value published (GLUE2ComputingShareMaxMainMemory): 36
I097 - Default value published (GLUE2ComputingShareMaxVirtualMemory): 36
W023 - Incoherent attribute range (GLUE2ComputingShareMaxUserRunningJobs): 6
W025 - Incoherent number of total jobs (GLUE2ComputingShareTotalJobs): 2
Using glue-validator to check data published by a single resource:
[sbdii03] # glue-validator -H cream01.lcg.cscs.ch -p 2170 -b o=glue -k
OK - errors 0, warnings 0, info 84 | errors=0;warnings=0;info=84
Failover check
Checking logs
Set up
Dependencies (other services, mount points, ...)
This service does not depend on any other service. What it does need is access to the BDII port (2170 in EMI/gLite, 2135 in ARC) of the other machines defined in the siteinfo.
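Reachability of those ports can be checked from the site BDII itself (a sketch; assumes `nc` is available, and the host names in the commented loop are examples only):

```shell
# Probe the BDII port (2170 for EMI/gLite, 2135 for ARC) on one resource.
bdii_port_open() {
    nc -z -w 3 "$1" "${2:-2170}"
}
# for h in cream01.lcg.cscs.ch ppcream01; do
#     bdii_port_open "$h" && echo "$h: port open" || echo "$h: UNREACHABLE"
# done
```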
Redundancy notes
As stated above, three machines provide the service behind DNS load balancing.
Installation
Site BDII (EMI/UMD release)
After you bring the VM up, run cfengine once. Then try:
yum update --enablerepo=epel
yum install emi-bdii-site --enablerepo=cscs,epel
cfagent -q
/opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site
cfagent -q
grid-service restart
ls -l lbcd-3.3.0.tar.gz # you should have got this 75K tarball via cfengine
# OR wget http://archives.eyrie.org/software/system/lbcd-3.3.0.tar.gz
tar -zxvf lbcd-3.3.0.tar.gz
cd lbcd-3.3.0
./configure && make && make install
# By now, you must check that bdii is in chkconfig, is UP and CORRECT!!!
chkconfig ntpd off # this should not be needed normally
cfagent -qv ; reboot
# wait for the machine to come back and bring it in production
service iptables stop
service lbcd start
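Before declaring the node in production, it is worth verifying locally that the BDII answers and that lbcd is running (a sketch reusing commands already shown on this page):

```shell
# Quick post-install sanity check for a freshly (re)installed sbdii node.
verify_sbdii() {
    chkconfig --list bdii                    # bdii should be on for runlevels 3-5
    ldapsearch -x -LLL -h localhost -p 2170 -b "o=grid" dn >/dev/null \
        && echo "local BDII answers" || echo "local BDII NOT answering"
    service lbcd status                      # must be running to rejoin the rotation
}
```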
Upgrade
Simply stop the services (including lbcd), update the packages, run YAIM, and start the services again. At least two instances of lbcd must be running on two different servers at any time for DNS load balancing to work. To perform a rolling update, stop lbcd on one of the three sbdii[01-03] nodes, update the node, test it, and start lbcd again; then repeat on the other two nodes, one node at a time.
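The rolling-update procedure can be summarised as a loop over the three nodes (an illustrative sketch; root ssh access and the exact update commands are assumptions, and each node is tested before rejoining the rotation so that at least two lbcd instances always stay up):

```shell
# One-node-at-a-time rolling update of the site BDIIs (illustrative only).
rolling_update() {
    for node in sbdii01 sbdii02 sbdii03; do
        ssh "$node" "service lbcd stop"    # drop the node from the DNS rotation
        ssh "$node" "yum update -y && /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site"
        # test before re-enabling, so two lbcd instances remain up throughout
        ldapsearch -x -LLL -h "$node" -p 2170 -b "o=grid" dn >/dev/null || return 1
        ssh "$node" "service lbcd start"
    done
}
```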
Monitoring
The best way to see whether the service works correctly is by running the ldapsearch command shown above, but there is another important thing to do: check the status in GSTAT.
Nagios
A few specific checks have been implemented to check the status of slapd, bdii-update and lbcd.
Ganglia
Usual monitoring deployed, no specific checks implemented.
Self Sanity / revival?
Other?
Manuals
Issues
Information about issues found with this service, and how to deal with them.
BDII dies without notification
Sometimes, when there are a lot of entries for the BDII to handle, the ramdisk it uses fills up and the service dies without any notification. If you have chosen to use the ramdisk for performance, you usually need to create a bigger ramdisk than the default:
Originally,
/etc/init.d/bdii
contains something like this:
# Create RAM Disk
if [ "${BDII_RAM_DISK}" = "yes" ]; then
mount -t tmpfs -o size=1500M,mode=0744 tmpfs ${SLAPD_DB_DIR}
fi
This needs to be changed to something like:
# Create RAM Disk
if [ "${BDII_RAM_DISK}" = "yes" ]; then
mount -t tmpfs -o size=3000M,mode=0744 tmpfs ${SLAPD_DB_DIR}
fi
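To catch the full-ramdisk condition before it kills the service, the tmpfs usage can be checked periodically (a sketch; the default path below is an assumption and should be matched to the SLAPD_DB_DIR value used in the bdii init script):

```shell
# Print the usage percentage of the filesystem holding the BDII database
# (default path is an assumption; adjust to SLAPD_DB_DIR from /etc/init.d/bdii).
ramdisk_usage() {
    df -P "${1:-/var/lib/bdii/db}" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}
# [ "$(ramdisk_usage)" -lt 90 ] || echo "BDII ramdisk above 90% - enlarge the tmpfs"
```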
Also, in the case of a top BDII, we need to add these settings to /etc/bdii/DB_CONFIG_top:
[...]
# test values
set_cachesize 1 0 1
set_flags DB_CDB_ALLDB
set_flags DB_LOG_AUTOREMOVE
#set_flags DB_LOG_INMEMORY
#set_flags DB_TXN_NOSYNC
set_lk_max_locks 10000
set_tas_spins 100
and enlarge /dev/shm accordingly, via an fstab-style entry such as:
tmpfs /dev/shm tmpfs defaults,size=3G 0 0
Issue2
Issue3
References
--
FotisGeorgatos - 2010-09-23