KeyWords: SysAdmin, Nagios
LCG-enabled Nagios probes
An experimental Nagios server has been installed on host ui.lcg.cscs.ch to monitor the availability of LCG services running on the local PHOENIX cluster.
Two types of Nagios probes are run:
- remote passive checks, gathering results from LCG SAM;
- local active checks, comprising both simple network-level probes (e.g., TCP ports open, LDAP response time) and Grid-level probes (e.g., SRM and GridFTP operations).
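As an illustration of the simple network-level probes mentioned above, here is a minimal sketch of a "TCP port open" check that follows the standard Nagios plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). It is not one of the probes actually installed on the server; the function name check_tcp and the warn_seconds threshold are assumptions made for the example.

```python
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, timeout=5.0, warn_seconds=1.0):
    """Try to open a TCP connection; return (exit_code, status_line).

    Hypothetical helper for illustration: `warn_seconds` is an assumed
    threshold above which a successful connection is reported as WARNING.
    """
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        return CRITICAL, f"TCP CRITICAL - {host}:{port} unreachable ({exc})"
    if elapsed > warn_seconds:
        return WARNING, f"TCP WARNING - {host}:{port} answered in {elapsed:.3f}s"
    return OK, f"TCP OK - {host}:{port} answered in {elapsed:.3f}s"
```

A wrapper script would print the status line and pass the code to sys.exit(), which is how Nagios reads the result of an active check.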
The Grid-level probes require a valid certificate to run, which the system operators must supply; therefore, the Grid-level tests are currently run within the DECH VO.
Anyone registered with the VOMS servers of any one of the supported VOs (that is: atlas, cms, dech, dteam, gear, hone, lhcb, ops) can view the Nagios server pages. Of course, only local administrators can issue commands through the Nagios web interface.
You need to have your LCG certificate loaded into the browser!
By default, Firefox keeps popping up a dialog box to confirm that you want
to send your certificate to the server; to disable this behavior,
follow this procedure:
- Open the Preferences... item from the Edit menu;
- Click on the Advanced tab;
- Select the item "Select one automatically" under "When server requests my personal certificate";
- Click on OK.
Note: In the future we may drop the web interface support and integrate the Nagios reporting into the central CSCS Nagios, which requires a local CSCS account to access.
Notes on the installation
The service configuration scripts and instructions are provided by the Grid Monitoring Group; installation has been done as described in the LCG TWiki pages GridMonitoringNcg and GridMonitoringNcgYaim.
Note: All the issues reported in this page have been promptly fixed by Steve Traylen. Thanks!!
There are, however, a few issues still to be ironed out, both in the docs and in the supplied scripts; here's a list of what we found:
- The credential name (set with the -k option to myproxy-init) must be unique among those uploaded to the same server: a simple solution is to embed the site name into the credential name, that is, use NagiosRetrieve-$SITENAME instead of NagiosRetrieve. A patch to this effect has to modify two files:
  - YAIM's config_nagios_proxy_renew, to generate /etc/nagios-proxy-refresh.conf with the site name included in the credential name;
  - NCG's ConfigGen/Nagios.pm, to include the credential name NagiosRetrieve-$SITENAME in /etc/nagios/wlcg_resources.cfg (used by tests checking for credential availability).
  (A different approach could be to use the host DN as MyProxy "username" (-l option) and store the credential under a well-known name.)
- Retrieval of credentials by name is inherently insecure and lends itself to easy denial of service: anyone can upload a different proxy under that name, and have tests running with his/her signature, possibly disrupting the service.
- TWiki page GridMonitoringNcg:
  - In the example "myproxy-init" command line:
    - the argument to the "-k" option must match what is in /etc/nagios-proxy-retrieve.conf (currently it is NagiosRetrieve, not NagiosRefresh as written on the TWiki);
    - the option "-s" appears twice;
    - writing -s $MYPROXY_SERVER would be a better choice (it also works with copy+paste).
- TWiki page GridMonitoringNcgYaim:
  - you must use -n glite-NAGIOS -n glite-UI in the YAIM incantation even if the UI software has already been installed and configured; otherwise the UI software and environment (notably, myproxy-init and $MYPROXY_SERVER) might not work.
- In the generated file /etc/nagios/wlcg.d/services.cfg:
  - the org.nagios-BDII service entry uses the SRMv2 object entry DN instead of the top-level "mds-vo-name=resource,o=grid". Is this intentional?
- /usr/sbin/nagios-proxy-refresh:
  - the error message doesn't match the file location (patch).
- The wrong LDAP DN for SRM GlueService* objects is configured in the org.nagios-BDII service probe, causing tests to fail. Lines 335 and 337 in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetricsAttrs/Active.pm should be changed as follows (patch):
  - line 335:
    $self->{SITEDB}->hostAttribute($hostname, "MDS_DN", "GlueServiceUniqueID=httpg://$hostname:".$self->{SITEDB}->hostAttribute($hostname,"SRM2_PORT")."/srm/managerv2,GlueSEUniqueID=".$hostname.",Mds-Vo-Name=local,O=Grid");
  - line 337:
    $self->{SITEDB}->hostAttribute($hostname, "MDS_DN", "GlueServiceUniqueID=httpg://$hostname:".$self->{SITEDB}->hostAttribute($hostname,"SRM1_PORT")."/srm/managerv1,GlueSEUniqueID=".$hostname.",Mds-Vo-Name=local,O=Grid");
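To make the shape of the corrected DN easier to see, the string concatenation in the two Perl lines above can be sketched as a small standalone function. This is only an illustration, not part of NCG: the name srm_mds_dn is made up for the example, and the host name and port used below are hypothetical values, not taken from the PHOENIX configuration.

```python
def srm_mds_dn(hostname, port, version):
    """Build the MDS_DN string that the patched lines assign for an SRM host.

    `version` selects the SRM flavour: 2 mirrors line 335 (managerv2),
    1 mirrors line 337 (managerv1). Hypothetical helper, not an NCG API.
    """
    if version not in (1, 2):
        raise ValueError("version must be 1 or 2")
    return (
        f"GlueServiceUniqueID=httpg://{hostname}:{port}/srm/managerv{version},"
        f"GlueSEUniqueID={hostname},Mds-Vo-Name=local,O=Grid"
    )


# Example with a made-up storage element and port.
example_dn = srm_mds_dn("se.example.org", 8443, 2)
```

A DN built this way names the GlueService entry below the storage element's GlueSEUniqueID entry, which is the entry the org.nagios-BDII probe is then pointed at.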
-- RiccardoMurri - 18 Aug 2008
Additional comments