KeyWords: SysAdmin, Nagios
LCG-enabled Nagios probes
An experimental Nagios server has been installed on host ui.lcg.cscs.ch to monitor the availability of LCG services running on the local PHOENIX cluster.
Two types of Nagios probes are run:
- remote passive checks, gathering results from LCG SAM;
- local active checks, comprising both simple network-level probes (e.g., TCP ports open, LDAP response time, etc.) and Grid-level probes (e.g., SRM and GridFTP operations).
The Grid-level probes require a valid certificate to run, which the system operators must supply; therefore, the Grid-level tests are currently run within the DECH VO.
Anyone registered with the VOMS servers of any one of the supported VOs (that is: atlas, cms, dech, dteam, gear, hone, lhcb, ops) can view the Nagios server pages. Of course, only local administrators can issue commands through the Nagios web interface.
You need to have your LCG certificate loaded into the browser!
By default, Firefox keeps popping up a dialog box to confirm that you want
to send your certificate to the server; to disable this behavior,
follow this procedure:
- Open the Preferences... item from the Edit menu;
- Click on the Advanced tab;
- Select "Select one automatically" under "When server requests my personal certificate";
- Click on OK.
Note: In the future we may drop the web interface support and integrate the Nagios reporting into the central CSCS Nagios; accessing that will require a local CSCS account.
Notes on the installation
The service configuration scripts and instructions are provided by the Grid Monitoring Group; installation has been done as described in the LCG TWiki pages GridMonitoringNcg and GridMonitoringNcgYaim.
There are, however, a few issues still to be ironed out, both in the docs and in the supplied scripts; here's a list of what we found:
- The credential name (set with the -k option to myproxy-init) must be unique among those uploaded to the same server: a simple solution is to embed the site name into the credential name, that is, use NagiosRetrieve-$SITENAME instead of NagiosRetrieve. A patch to this effect has to modify two files:
  - YAIM's config_nagios_proxy_renew, to generate /etc/nagios-proxy-refresh.conf with the site name included in the credential name;
  - NCG's ConfigGen/Nagios.pm, to include the credential name NagiosRetrieve-$SITENAME in /etc/nagios/wlcg_resources.cfg (used by tests checking for credential availability).
  (A different approach could be to use the host DN as MyProxy "username" (-l option) and store the credential under a well-known name.)
- Retrieval of credentials by name is inherently insecure and lends itself to easy denial of service: anyone can upload a different proxy under that name, and have tests running with his/her signature, possibly disrupting the service.
- TWiki page GridMonitoringNcg:
  - In the example "myproxy-init" command line:
    - the argument to the "-k" option must match what is in /etc/nagios-proxy-retrieve.conf (currently it is NagiosRetrieve, not NagiosRefresh as written on the TWiki);
    - option "-s" is repeated twice;
    - -s $MYPROXY_SERVER would be a better choice (works also with copy+paste).
- TWiki page GridMonitoringNcgYaim:
  - one must use -n glite-NAGIOS -n glite-UI in the YAIM incantation even if the UI software has already been installed and configured; otherwise the UI software and environment (notably, myproxy-init and $MYPROXY_SERVER) might not work.
- In the generated file /etc/nagios/wlcg.d/services.cfg:
  - the org.nagios-BDII service entry uses the SRMv2 object entry DN instead of the top-level "mds-vo-name=resource,o=grid". Is this intentional?
- /usr/sbin/nagios-proxy-refresh:
  - error message doesn't match file location (patch).
- The wrong LDAP DN for SRM GlueService* objects is configured in the org.nagios-BDII service probe, causing tests to fail. Lines 335 and 337 in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetricsAttrs/Active.pm should be changed as follows (patch):
  - line 335:
    $self->{SITEDB}->hostAttribute($hostname, "MDS_DN", "GlueServiceUniqueID=httpg://$hostname:".$self->{SITEDB}->hostAttribute($hostname,"SRM2_PORT")."/srm/managerv2,GlueSEUniqueID=".$hostname.",Mds-Vo-Name=local,O=Grid");
  - line 337:
    $self->{SITEDB}->hostAttribute($hostname, "MDS_DN", "GlueServiceUniqueID=httpg://$hostname:".$self->{SITEDB}->hostAttribute($hostname,"SRM1_PORT")."/srm/managerv1,GlueSEUniqueID=".$hostname.",Mds-Vo-Name=local,O=Grid");
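As an illustration of the credential-naming issue in the list above, here is a minimal shell sketch of how a per-site credential name could be derived and passed to myproxy-init. The helper names (nagios_cred_name, nagios_myproxy_cmd), the placeholder site name EXAMPLE-SITE and the host myproxy.example.org are assumptions made up for this example; they are not part of the YAIM or NCG scripts. Only the -k and -s options discussed above are shown; any other options from the TWiki example are left out, and the command is printed rather than executed.

```shell
#!/bin/sh
# Sketch only: helper names and default values below are hypothetical.

# Build a per-site credential name, so that several sites sharing one
# MyProxy server do not overwrite each other's stored proxy.
nagios_cred_name() {
    # $1: site name
    printf 'NagiosRetrieve-%s\n' "$1"
}

# Print (do not run) the myproxy-init command line, for review.
nagios_myproxy_cmd() {
    # $1: site name, $2: MyProxy server host
    printf 'myproxy-init -s %s -k %s\n' "$2" "$(nagios_cred_name "$1")"
}

# Fall back to placeholder values when the variables are not set.
nagios_myproxy_cmd "${SITENAME:-EXAMPLE-SITE}" "${MYPROXY_SERVER:-myproxy.example.org}"
```

Whatever name is chosen, it has to be the same string in the file generated by config_nagios_proxy_renew and in /etc/nagios/wlcg_resources.cfg, otherwise the tests checking for credential availability would look for the wrong credential.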
-- RiccardoMurri - 18 Aug 2008
Additional comments