KeyWords: SysAdmin, Nagios

LCG-enabled Nagios probes

An experimental Nagios server has been installed on host ui.lcg.cscs.ch to monitor the availability of the LCG services running on the local PHOENIX cluster. Two types of Nagios probes are run:

  • remote passive checks, gathering results from LCG SAM;
  • local active checks, comprising both simple network-level probes (e.g., TCP ports open, LDAP response time, etc.) and Grid-level probes (e.g., SRM and GridFTP operations).
The Grid-level probes require a valid certificate to run, which the system operators must supply; therefore, the Grid-level tests are currently run within the DECH VO.
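
As an illustration of the simple network-level probes mentioned above (e.g., LDAP response time), here is a minimal sketch of how a measured response time could be turned into a Nagios result. The exit codes follow the standard Nagios plugin convention; the function name classify_response_time and the thresholds are hypothetical and are not taken from the probes actually installed.

```shell
#!/bin/sh
# Sketch only: map a response time (in ms) onto a Nagios plugin state.
# Standard Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
# classify_response_time and its thresholds are hypothetical names/values.
classify_response_time() {
    elapsed_ms=$1
    warn_ms=$2
    crit_ms=$3
    # Reject empty or non-numeric measurements.
    case "$elapsed_ms" in
        ''|*[!0-9]*)
            echo "UNKNOWN - invalid measurement"
            return 3
            ;;
    esac
    if [ "$elapsed_ms" -ge "$crit_ms" ]; then
        echo "CRITICAL - response time ${elapsed_ms}ms"
        return 2
    elif [ "$elapsed_ms" -ge "$warn_ms" ]; then
        echo "WARNING - response time ${elapsed_ms}ms"
        return 1
    fi
    echo "OK - response time ${elapsed_ms}ms"
    return 0
}
```

For example, classify_response_time 120 500 1000 reports an OK state, while a measurement of 700 with the same thresholds would be reported as WARNING.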

Anyone registered with the VOMS servers of any one of the supported VOs (that is: atlas, cms, dech, dteam, gear, hone, lhcb, ops) can view the Nagios server pages. Of course, only local administrators can issue commands through the Nagios web interface.

You need to have your LCG certificate loaded into the browser! By default, Firefox keeps popping up a dialog box to confirm that you want to send your certificate to the server; to disable this behavior, follow this procedure:

  1. Open the "Preferences..." item from the "Edit" menu;
  2. Click on the "Advanced" tab;
  3. Select "Select one automatically" under "When server requests my personal certificate";
  4. Click on "OK".

Note: In the future we may drop the web interface support and integrate the Nagios reporting into the central CSCS Nagios; to access it you will need a local CSCS account.

Notes on the installation

The service configuration scripts and instructions are provided by the Grid Monitoring Group; installation has been done as described in the LCG TWiki pages GridMonitoringNcg and GridMonitoringNcgYaim.

Note: All the issues reported in this page have been promptly fixed by Steve Traylen. Thanks!!

There are, however, a few issues still to be ironed out, both in the docs and in the supplied scripts; here's a list of what we found:

  • The credential name (set with the -k option to myproxy-init) must be unique among those uploaded to the same server. A simple solution is to embed the site name in the credential name, that is, use NagiosRetrieve-$SITENAME instead of NagiosRetrieve. A patch to this effect has to modify two files:
    1. YAIM's config_nagios_proxy_renew, to generate /etc/nagios-proxy-refresh.conf with the site name included in the credential name;
    2. NCG's ConfigGen/Nagios.pm, to include the credential name NagiosRetrieve-$SITENAME in /etc/nagios/wlcg_resources.cfg (used by tests checking for credential availability). (A different approach could be to use the host DN as the MyProxy "username" (-l option) and store the credential under a well-known name.)
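
The naming rule described in this item can be sketched as a small shell helper. This is only an illustration of the idea, not the content of the attached patch; the function name credential_name and the site name EXAMPLE-SITE are hypothetical.

```shell
#!/bin/sh
# Sketch only: build a per-site MyProxy credential name.
# Falls back to the plain "NagiosRetrieve" when no site name is given.
# credential_name is a hypothetical helper, not part of YAIM or NCG.
credential_name() {
    site=$1
    if [ -n "$site" ]; then
        printf 'NagiosRetrieve-%s\n' "$site"
    else
        printf 'NagiosRetrieve\n'
    fi
}

# Example: show (without running it) the -k argument one would pass
# to myproxy-init for a hypothetical site called EXAMPLE-SITE.
echo "myproxy-init -k $(credential_name EXAMPLE-SITE) ..."
```

Keeping the fallback means that hosts where SITENAME is not defined keep using the original credential name.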

  • Retrieval of credentials by name is inherently insecure and lends itself to easy denial of service: anyone can upload a different proxy under that name, and have tests running with his/her signature, possibly disrupting the service.

  • TWiki page GridMonitoringNcg:
    • In the example "myproxy-init" command line:
      • the argument to the "-k" option must match what is in /etc/nagios-proxy-retrieve.conf (currently it's NagiosRetrieve, not NagiosRefresh as written on the TWiki);
      • option "-s" is given twice;
      • -s $MYPROXY_SERVER would be a better choice (works also with copy+paste).

  • TWiki page GridMonitoringNcgYaim:
    • must use -n glite-NAGIOS -n glite-UI in the YAIM incantation even if the UI software has already been installed and configured; otherwise the UI software and environment (notably, myproxy-init and $MYPROXY_SERVER) might not work.
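
A quick way to see whether the UI environment survived the YAIM run is to check for the two things named above. The following is a minimal sketch; the function name check_ui_env is hypothetical, and the YAIM path and site-info.def location in the comment are assumptions about a typical installation.

```shell
#!/bin/sh
# Sketch only: verify that the UI command and variable mentioned above
# are available after running YAIM, e.g. (path and file are assumptions):
#   /opt/glite/yaim/bin/yaim -c -s site-info.def -n glite-NAGIOS -n glite-UI
# check_ui_env is a hypothetical helper; it defaults to myproxy-init.
check_ui_env() {
    tool=${1:-myproxy-init}
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "missing command: $tool"
        return 1
    fi
    if [ -z "${MYPROXY_SERVER:-}" ]; then
        echo "MYPROXY_SERVER is not set"
        return 1
    fi
    echo "UI environment looks usable"
    return 0
}
```

Calling check_ui_env with no arguments in a fresh login shell checks for myproxy-init and $MYPROXY_SERVER.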

  • In YAIM functions:
    • config_voms2htpasswd: several issues (patch):
      1. VOMS server is hard-coded (it uses the one at CERN, irrespective of what's in the vomss:// URL);
      2. if the VOMS server uses the old XML format, DNs are not printed;
      3. wrong path in the installed cron job: voms2htpasswd is in /usr/bin, not /bin.
    • config_nagios_proxy_renew: need to set MYPROXY_NAME to include site name (see above) (patch);
    • config_httpd_nagios: Apache needs to be restarted every few hours to re-read the latest CRLs, but the cron job is too verbose; the commands need to be bracketed and their output discarded (patch):
                 (service httpd status && service httpd graceful) > /dev/null 2>&1     
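
For the cron-job path problem noted under config_voms2htpasswd above, one defensive option is to resolve the absolute path of a tool before writing it into a cron entry, instead of hard-coding a directory. A minimal sketch, assuming a POSIX shell; the helper name resolve_tool is hypothetical and this is not what the attached patch does.

```shell
#!/bin/sh
# Sketch only: print the absolute path of an executable found in $PATH.
# Fails for unknown commands and for shell builtins, since those do not
# resolve to an absolute path. resolve_tool is a hypothetical helper.
resolve_tool() {
    path=$(command -v "$1" 2>/dev/null) || return 1
    case "$path" in
        /*) printf '%s\n' "$path" ;;
        *)  return 1 ;;
    esac
}
```

On the host described here, resolve_tool voms2htpasswd would be expected to print the /usr/bin location mentioned above.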
              

  • In the generated file /etc/nagios/wlcg.d/services.cfg:
    • org.nagios-BDII service entry uses SRMv2 object entry DN instead of the top-level "mds-vo-name=resource,o=grid". Is this intentional?

  • /usr/sbin/nagios-proxy-refresh:
    • the error message doesn't match the file location (patch).

  • The wrong LDAP DN for SRM GlueService* objects is configured in the org.nagios-BDII service probe, causing tests to fail. Lines 335 and 337 in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetricsAttrs/Active.pm should be changed as follows (patch):
    Line 335:
    $self->{SITEDB}->hostAttribute($hostname, "MDS_DN", "GlueServiceUniqueID=httpg://$hostname:".$self->{SITEDB}->hostAttribute($hostname,"SRM2_PORT")."/srm/managerv2,GlueSEUniqueID=".$hostname.",Mds-Vo-Name=local,O=Grid");
    Line 337:
    $self->{SITEDB}->hostAttribute($hostname, "MDS_DN", "GlueServiceUniqueID=httpg://$hostname:".$self->{SITEDB}->hostAttribute($hostname,"SRM1_PORT")."/srm/managerv1,GlueSEUniqueID=".$hostname.",Mds-Vo-Name=local,O=Grid");
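
To check by hand what DN the corrected lines produce for a given storage element, the same string can be assembled in shell. This is a sketch that mirrors the string concatenation in the Perl lines above; the function name srm_mds_dn, the host se.example.org and port 8443 are hypothetical.

```shell
#!/bin/sh
# Sketch only: build the SRM GlueService DN in the same shape as the
# corrected Perl lines. Arguments: host name, SRM port, SRM version (1 or 2).
# srm_mds_dn is a hypothetical helper; host and port below are examples.
srm_mds_dn() {
    host=$1
    port=$2
    version=$3
    printf 'GlueServiceUniqueID=httpg://%s:%s/srm/managerv%s,GlueSEUniqueID=%s,Mds-Vo-Name=local,O=Grid\n' \
        "$host" "$port" "$version" "$host"
}

srm_mds_dn se.example.org 8443 2
```

The resulting string can then be used as the search base of a base-scope LDAP query against the resource's information system, for instance with ldapsearch -x -H ldap://HOST:PORT -b "DN" -s base, where HOST and PORT depend on the local setup.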

-- RiccardoMurri - 18 Aug 2008

Topic attachments

  • Active.pm.diff (r1, 1.6 K, 2008-09-23 11:07, RiccardoMurri)
  • config_httpd_nagios.diff (r1, 0.6 K, 2008-09-23 11:06, RiccardoMurri)
  • config_voms2htpasswd.diff (r1, 0.7 K, 2008-09-23 11:06, RiccardoMurri)
  • nagios-proxy-refresh.diff (r1, 0.4 K, 2008-09-23 11:05, RiccardoMurri)
  • use_sitename_in_myproxy_credential.diff (r1, 2.2 K, 2008-09-23 13:08, RiccardoMurri): If SITENAME is defined, use NagiosRetrieve-$SITENAME as myproxy credential name, otherwise fall back to "NagiosRetrieve"; patches /usr/lib/perl5/vendor_perl/5.8.5/NCG/ConfigGen/Nagios.pm and /opt/glite/yaim/functions/config_nagios_proxy_renew
Topic revision: r2 - 2008-10-02 - RiccardoMurri
 