Phoenix Ganglia configuration
Note: All configuration files, non-standard startup files, and gmetric scripts needed for this ganglia cluster are kept in the CSCS SVN at https://svn.cscs.ch/LCG/monitoring/ganglia. The configuration files are deployed mainly through cfengine. Please consult the CfEngine#Ganglia page to learn how to implement changes.
Ganglia main server: mon.lcg.cscs.ch
This node runs the central ganglia services: a number of collector processes for the principal node groups, a daemon that stores the information in round robin databases, and a web server that displays it.
gmond
There are three gmond processes running as collectors. They listen for UDP transmissions from the various nodes to be monitored. Each of these nodes runs a gmond process configured to send UDP packets to the respective collector port.
The collector gmonds on mon.lcg.cscs.ch use these configuration files:
- /etc/gmond-wn-collector.conf
- /etc/gmond-service-collector.conf
- /etc/gmond-fileserver-collector.conf
The standard service startup file /etc/init.d/gmond is modified to start/stop all three of these services (the init file can be found at https://svn.cscs.ch/LCG/monitoring/ganglia/ganglia-config/mon-box/init.d/gmond).
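The send/receive pairing described above can be sketched as a pair of gmond configuration fragments. This is a minimal illustration assuming Ganglia 3.x configuration syntax; the port number is a placeholder, since the actual ports are defined in the collector files listed above.

```
/* Hypothetical excerpt from a monitored node's /etc/gmond.conf:
   send metrics to the collector on the main server (port illustrative). */
udp_send_channel {
  host = mon.lcg.cscs.ch
  port = 8649
}

/* Matching excerpt from the collector gmond's configuration on
   mon.lcg.cscs.ch: listen for those UDP packets. */
udp_recv_channel {
  port = 8649
}
```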
gmetad
The gmetad daemon records the history of ganglia monitoring information in round robin databases (located under /var/lib/ganglia/rrds). Its configuration file /etc/gmetad.conf contains directives for polling the three gmond collectors.
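The polling directives take the form of data_source lines in /etc/gmetad.conf. The following is only a sketch: the cluster names and port numbers are illustrative placeholders, not the actual values from the deployed file.

```
# Hypothetical gmetad.conf excerpt: one data_source line per collector.
# Names and ports are placeholders for the real configuration.
data_source "Phoenix worker nodes"  localhost:8650
data_source "Phoenix service nodes" localhost:8651
data_source "Phoenix fileservers"   localhost:8652
```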
The gmetad web pages reside under /var/www/html/ganglia and also contain a small configuration file, conf.php.
Note: In order to get pie charts, you need to have the php-gd package installed!
ramdisk / tmpfs for RRD files
The gmetad daemon writes the ganglia monitoring information to RRD (round robin database) files, using one file per sensor. For a large cluster this leads to a high frequency of I/O operations on a large number of files, always in the same disk area. The CPU tends to sit in I/O-wait states most of the time, and people have reported fast degradation of the hard disk.
Therefore the RRD files are hosted in a tmpfs area in memory (earlier we used a RAM disk), and the contents of this area are synchronized to a disk area every few minutes to prevent information loss in case of a system breakdown.
Note: The standard location for the ganglia RRDs is a symbolic link to the tmpfs area:
/var/lib/ganglia/rrds -> /dev/shm/ganglia/rrds
The tmpfs area is managed as a service by the custom /etc/init.d/tmpfs-sync-area init script. The script resides in the CSCS SVN at https://svn.cscs.ch/LCG/monitoring/ganglia/ganglia-config/mon-box/init.d. It is started before gmetad and does the following:
- upon start
  - initializes the tmpfs area with the contents of the disk area
  - does sanity checks on every important operation
  - installs a cron job for synchronizing the tmpfs area to the disk area at regular intervals
- upon stop
  - makes sure that dependent services (gmetad) are stopped first
  - synchronizes the tmpfs area to the disk area
  - uninstalls the cron job
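The periodic synchronization step can be sketched as a small shell script. This is a self-contained demo: temporary directories stand in for the real /dev/shm/ganglia/rrds tmpfs area and its disk backing, and the actual init script may use different tooling.

```shell
#!/bin/sh
# Demo of the tmpfs-to-disk sync step. Temporary directories stand in
# for the real tmpfs area (/dev/shm/ganglia/rrds) and its disk backing.
set -e

TMPFS_AREA=$(mktemp -d)   # stands in for /dev/shm/ganglia/rrds
DISK_AREA=$(mktemp -d)    # stands in for the persistent disk area

# Pretend gmetad wrote an RRD file into the tmpfs area.
echo "rrd-data" > "$TMPFS_AREA/cpu_user.rrd"

# Mirror the tmpfs contents to the disk area (the installed cron job
# would run this step every few minutes).
cp -a "$TMPFS_AREA/." "$DISK_AREA/"

ls "$DISK_AREA"   # -> cpu_user.rrd
```

On stop, the same copy runs one final time after gmetad has been shut down, so no updates are lost.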
Note: The same functionality is available in the form of a ramdisk-based service (https://svn.cscs.ch/LCG/monitoring/ganglia/ganglia-config/mon-box/init.d/ramdisk). The scripts are quite generic and use configuration information from an appropriate file in /etc/sysconfig/.
httpd
The web server needs to allow running PHP scripts in the ganglia web directory.
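An Apache configuration fragment along the following lines would achieve this. It is entirely illustrative (Apache 2.2-era syntax): the alias and the PHP handler line are assumptions, not taken from the actual setup.

```
# Hypothetical httpd fragment; alias and PHP handler are illustrative.
Alias /ganglia /var/www/html/ganglia
<Directory "/var/www/html/ganglia">
    AddType application/x-httpd-php .php
    DirectoryIndex index.php
    Order allow,deny
    Allow from all
</Directory>
```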
CSCS custom graphs
The script /root/CSCS_custom_graphs/custom_rrd_cscs.pl produces the CSCS custom graphs; it is executed by the cron job /etc/cron.d/CSCS_custom_graphs. The custom graphs are stored in the /var/www/html/ganglia/CSCS-custom directory. They are used for the PhoenixMonOverview page and other statistics pages.
A similar script pulls down the pie charts for the subclusters for display on the monitoring page.
The sources for the scripts can be found under https://svn.cscs.ch/LCG/monitoring/ganglia/custom_graphs.
Client nodes
gmond
For every class of node there exists a specific gmond.conf configuration file. These can be found in our SVN at https://svn.cscs.ch/LCG/monitoring/ganglia/ganglia-config. The files need to be copied to /etc/gmond.conf on the node in order to work with the standard init procedure. There are three classes of nodes:
- worker-nodes
- fileservers: the dCache pool servers
- service-nodes: all remaining nodes, including the dCache head and database nodes
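The deployment of a per-class configuration can be sketched as follows. Everything here is hypothetical: the node class variable and the per-class file names are illustrative placeholders (the real file names live in the SVN directory above), and the echo stands in for the actual copy and service restart.

```shell
#!/bin/sh
# Hypothetical sketch of how a per-class config ends up as
# /etc/gmond.conf. NODE_CLASS and the file names are illustrative.
NODE_CLASS=worker-node

case "$NODE_CLASS" in
    worker-node)  CONF=gmond-wn.conf ;;
    fileserver)   CONF=gmond-fileserver.conf ;;
    service-node) CONF=gmond-service.conf ;;
    *) echo "unknown node class: $NODE_CLASS" >&2; exit 1 ;;
esac

# The real deployment step would be something like:
#   cp "$CONF" /etc/gmond.conf && service gmond restart
echo "would install $CONF as /etc/gmond.conf"
```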
gmetric scripts
Some nodes send additional information by using ganglia's gmetric utility. Every node making use of this feature has the same kind of basic configuration:
- /root/gmetric: contains the specific scripts issuing the gmetric command lines. This is a direct checkout of the corresponding https://svn.cscs.ch/LCG/monitoring/ganglia/gmetric-scripts subdirectory. Keep this up to date when you make changes.
- /etc/cron.d/gmetric: cron job to regularly run the scripts
The following nodes have gmetric scripts:
- CE: sends queue length and running jobs per VO information
- SE head node: collects information from dCache
- WN: (not installed) There is a script that collects the job ID and user for each node and displays them as a string.
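A gmetric publisher script of the kind kept in /root/gmetric can be sketched as below. The metric name and value source are illustrative placeholders; the -n/-v/-t/-u options (name, value, type, units) are gmetric's standard flags, and the command is echoed here rather than executed so the sketch runs without a ganglia installation.

```shell
#!/bin/sh
# Sketch of a gmetric publisher script. Metric name and data source
# are illustrative; the echo stands in for actually running gmetric.
set -e

# Stand-in for a real queue listing (one line per queued job).
QUEUE_FILE=$(mktemp)
printf 'job1\njob2\n' > "$QUEUE_FILE"
QUEUE_LENGTH=$(wc -l < "$QUEUE_FILE")

# The cron-driven script would execute this command instead of echoing it.
echo gmetric -n ce_queue_length -v "$QUEUE_LENGTH" -t uint32 -u jobs
```

The cron job in /etc/cron.d/gmetric would invoke such a script every few minutes, so the metric shows up in the ganglia graphs alongside the built-in sensors.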
Ganglia 3.1.7 on ganglia.lcg.cscs.ch
Build
- yum install apr apr-devel pango pango-devel pcre-devel
- download confuse
- ./configure CFLAGS=-fPIC --disable-nls
- make && make install
- download rrdtool, configure and install
- configure Ganglia: ./configure --with-gmetad --with-librrd=/opt/rrdtool-1.4.3/ --sysconfdir=/etc/ganglia
- cp gmond/gmond.init /etc/init.d/gmond
- cp gmetad/gmetad.init /etc/init.d/gmetad
- cp -r web/ /var/www/ganglia
-- PeterOettl - 2010-05-04
-- DerekFeichtinger - 30 Jan 2008