Service Card for BDII
Definition
Our site BDII host is bdii.lcg.cscs.ch, which is a DNS alias for sbdii[01-03].
The three site BDIIs are set up in high-availability mode using DNS load balancing: while the lbcd daemon is running on a machine, it publishes that machine's load to our DNS server, which directs queries to the least busy node. If the lbcd service is stopped, the DNS server stops sending queries to that machine.
Using this mechanism we can easily do rolling upgrades or installs.
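To see which site BDIIs are currently in the DNS rotation, the alias can be queried directly (a sketch; assumes `dig` from bind-utils is installed):

```shell
# List the A records currently behind the bdii.lcg.cscs.ch alias.
# A node whose lbcd daemon is stopped should disappear from this list.
bdii_rotation() {
    dig +short "${1:-bdii.lcg.cscs.ch}" | sort
}
# bdii_rotation    # prints one address per sbdii node still in rotation
```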
As of today, site BDII hosts are configured as follows:
- sbdii[01,02,03]:
- Scientific Linux 6.4 x86_64
- EMI-3 site bdii (bdii-5.2.22-1)
There is no top BDII deployed at CSCS-LCG2 at the moment.
Operations
Normally this service requires no operation, but when the BDII service fails on a machine for some reason, it is important to take that machine out of rotation by stopping its lbcd service. This stops the DNS server from resolving bdii.lcg.cscs.ch to that particular machine:
$ service lbcd stop
To restart the bdii service itself, either grid-service2 restart or service bdii restart can be used.
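The drain-and-restart steps above can be combined into one sequence (a sketch, to be run on the affected sbdii node; it only uses the commands already shown on this page):

```shell
# Take this node out of the DNS rotation, restart the BDII, then re-enable it.
drain_and_restart_bdii() {
    service lbcd stop       # DNS server stops sending queries here
    service bdii restart    # restart the BDII service itself
    # ...verify with ldapsearch (see Testing) before re-enabling...
    service lbcd start      # put the node back into rotation
}
```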
Client tools
Testing
The best way to test the service is by using ldapsearch. Here are some examples of usage:
$ ldapsearch -x -LLL -h ppcream01 -p 2170 -b "o=grid" # to test BDII of ppcream01
$ ldapsearch -x -LLL -h bdii.lcg.cscs.ch -p 2170 -b "o=grid"
The result of these queries should be a long document in LDIF format.
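Beyond eyeballing the LDIF, a quick sanity check is to count the returned entries (a sketch; a healthy site BDII should return far more than zero DNs):

```shell
# Count "dn:" lines in an LDIF stream (small helper for the query below).
count_dns() { grep -c '^dn:'; }

# Query a BDII endpoint and count the entries it publishes.
bdii_entry_count() {
    ldapsearch -x -LLL -h "$1" -p "${2:-2170}" -b "o=grid" dn | count_dns
}
# bdii_entry_count bdii.lcg.cscs.ch    # zero (or an error) indicates a problem
```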
A useful tool to check published values is the GLUE validator (already installed on all sbdii* machines, otherwise available in the official EMI-3 Updates repository). It can be run against a site BDII or even a resource BDII (e.g. a CE) to check the conformity of the published data to the GLUE schema (version 2.0 by default):
[sbdii03] # glue-validator -H bdii.lcg.cscs.ch -p 2170 -b o=glue -k
CRITICAL - errors 38, warnings 8, info 258 | errors=38;warnings=8;info=258
In this case there are several critical errors that can be investigated further by increasing the verbosity:
[sbdii03] # glue-validator -H sbdii01.lcg.cscs.ch -p 2170 -b o=glue -k -v2
CRITICAL - errors 38, warnings 8, info 258 | errors=38;warnings=8;info=258
Summary per type of error, warning and info message:
E002 - Obsolete entry (GLUE2EntityValidity): 38
I007 - Unknown WLCG Name (GLUE2EntityOtherInfo): 2
I032 - Default value published (GLUE2ComputingShareMaxTotalJobs): 36
I033 - Default value published (GLUE2ComputingShareMaxRunningJobs): 36
I034 - Default value published (GLUE2ComputingShareMaxWaitingJobs): 36
I043 - Memory higher than 100,000 MB (GLUE2ComputingShareMaxMainMemory): 36
I045 - Memory higher than 100,000 MB (GLUE2ComputingShareMaxVirtualMemory): 36
I091 - Total share capacity size less than 1000 GB (GLUE2StorageShareCapacityTotalSize): 4
I096 - Default value published (GLUE2ComputingShareMaxMainMemory): 36
I097 - Default value published (GLUE2ComputingShareMaxVirtualMemory): 36
W023 - Incoherent attribute range (GLUE2ComputingShareMaxUserRunningJobs): 6
W025 - Incoherent number of total jobs (GLUE2ComputingShareTotalJobs): 2
Using glue-validator to check data published by a single resource:
[sbdii03] # glue-validator -H cream01.lcg.cscs.ch -p 2170 -b o=glue -k
OK - errors 0, warnings 0, info 84 | errors=0;warnings=0;info=84
Failover check
Checking logs
Set up
Dependencies (other services, mount points, ...)
This service does not depend on any other service. What it does need is access to the BDII port (2170 in EMI/gLite, 2135 in ARC) of the other machines defined in the siteinfo.
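Reachability of those ports can be checked from the site BDII itself (a sketch; assumes `nc` is available, and the host names in the commented loop are examples only):

```shell
# Probe the BDII port (2170 for EMI/gLite, 2135 for ARC) on one resource.
bdii_port_open() {
    nc -z -w 3 "$1" "${2:-2170}"
}
# for h in cream01.lcg.cscs.ch ppcream01; do
#     bdii_port_open "$h" && echo "$h: port open" || echo "$h: UNREACHABLE"
# done
```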
Redundancy notes
As stated above, three machines provide the service behind DNS load balancing.
Installation
Site BDII (EMI/UMD release)
After you bring the VM up, run cfengine once. Then try:
yum update --enablerepo=epel
yum install emi-bdii-site --enablerepo=cscs,epel
cfagent -q
/opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site
cfagent -q
grid-service restart
ls -l lbcd-3.3.0.tar.gz # you should have got this 75K tarball via cfengine
# OR wget http://archives.eyrie.org/software/system/lbcd-3.3.0.tar.gz
tar -zxvf lbcd-3.3.0.tar.gz
cd lbcd-3.3.0
./configure && make && make install
# By now, you must check that bdii is in chkconfig, is UP and CORRECT!!!
chkconfig ntpd off # this should not be needed normally
cfagent -qv ; reboot
# wait for the machine to come back and bring it in production
service iptables stop
service lbcd start
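Before declaring the node in production, it is worth verifying locally that the BDII answers and that lbcd is running (a sketch reusing commands already shown on this page):

```shell
# Quick post-install sanity check for a freshly (re)installed sbdii node.
verify_sbdii() {
    chkconfig --list bdii                    # bdii should be on for runlevels 3-5
    ldapsearch -x -LLL -h localhost -p 2170 -b "o=grid" dn >/dev/null \
        && echo "local BDII answers" || echo "local BDII NOT answering"
    service lbcd status                      # must be running to rejoin the rotation
}
```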
Upgrade
Simply stop the services (including lbcd), update the packages, run YAIM, and start the services again. At least two instances of lbcd must be running on two different servers at any time for DNS load balancing to work. To perform a rolling update, stop lbcd on one of the three sbdii[01-03] nodes, update the node, test it, and start lbcd again; then repeat on the other two nodes, one node at a time.
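The rolling-update procedure can be summarised as a loop over the three nodes (an illustrative sketch; root ssh access and the exact update commands are assumptions, and each node is tested before rejoining the rotation so that at least two lbcd instances always stay up):

```shell
# One-node-at-a-time rolling update of the site BDIIs (illustrative only).
rolling_update() {
    for node in sbdii01 sbdii02 sbdii03; do
        ssh "$node" "service lbcd stop"    # drop the node from the DNS rotation
        ssh "$node" "yum update -y && /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n BDII_site"
        # test before re-enabling, so two lbcd instances remain up throughout
        ldapsearch -x -LLL -h "$node" -p 2170 -b "o=grid" dn >/dev/null || return 1
        ssh "$node" "service lbcd start"
    done
}
```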
Monitoring
The best way to see whether the service works correctly is by running the ldapsearch command shown above, but there is another important thing to do: check the status in GSTAT.
Nagios
A few specific checks have been implemented to check the status of slapd, bdii-update and lbcd.
Ganglia
Usual monitoring deployed, no specific checks implemented.
Self Sanity / revival?
Other?
Manuals
Issues
Information about issues found with this service, and how to deal with them.
BDII dies without notification
Sometimes, when there are a lot of entries for the BDII to handle, the ramdisk it uses fills up and the service dies without any notification. If you have chosen to use the ramdisk for performance, you usually need to create a bigger ramdisk than the default:
Originally,
/etc/init.d/bdii
contains something like this:
# Create RAM Disk
if [ "${BDII_RAM_DISK}" = "yes" ]; then
mount -t tmpfs -o size=1500M,mode=0744 tmpfs ${SLAPD_DB_DIR}
fi
This needs to be changed to something like:
# Create RAM Disk
if [ "${BDII_RAM_DISK}" = "yes" ]; then
mount -t tmpfs -o size=3000M,mode=0744 tmpfs ${SLAPD_DB_DIR}
fi
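To catch the full-ramdisk condition before it kills the service, the tmpfs usage can be checked periodically (a sketch; the default path below is an assumption and should be matched to the SLAPD_DB_DIR value used in the bdii init script):

```shell
# Print the usage percentage of the filesystem holding the BDII database
# (default path is an assumption; adjust to SLAPD_DB_DIR from /etc/init.d/bdii).
ramdisk_usage() {
    df -P "${1:-/var/lib/bdii/db}" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}
# [ "$(ramdisk_usage)" -lt 90 ] || echo "BDII ramdisk above 90% - enlarge the tmpfs"
```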
Also, in the case of a top BDII, we need to add these settings to /etc/bdii/DB_CONFIG_top:
[...]
# test values
set_cachesize 1 0 1
set_flags DB_CDB_ALLDB
set_flags DB_LOG_AUTOREMOVE
#set_flags DB_LOG_INMEMORY
#set_flags DB_TXN_NOSYNC
set_lk_max_locks 10000
set_tas_spins 100
and enlarge /dev/shm accordingly, via an fstab-style entry such as:
tmpfs /dev/shm tmpfs defaults,size=3G 0 0
Issue2
Issue3
References
--
FotisGeorgatos - 2010-09-23