Other tools that may be of interest, internal and external.
Full Cluster Restart
Each service's own page describes how to start it up. The general procedure below is useful after a power loss or an intentional full shutdown.
Phase 1. Start all network services
- All Infiniband and ethernet switches
- Firewalls: fw[01-02]
- DHCP services, currently: xen03
Phase 2. Start shared filesystem servers
- Boot both NFS servers (nfs[01-02]) at the same time; otherwise the first one to boot will take all the mounts for itself.
- Boot storage02. It starts the pnfs server (which storage01 needs in order to start srm) and also the dCache LM server (the Location Manager for the whole dCache service).
- Boot all GPFS nodes (mds[1-2],oss[11-42]). The order does not matter, but they must all be started before the clients.
Phase 3. Start the rest. The order does not matter too much. Services should be smart enough to wait for their dependencies to work.
- dCache service nodes: storage01, pools
- All other non-virtualized services, probably cream[01-02]
- All VM hosts and guests
- All WNs
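When the restart is scripted, it can help to confirm that the hosts of one phase answer before booting the next one. The following is only a minimal sketch: the function name wait_for_hosts, the CHECK_CMD variable and the ping options are assumptions, not part of any existing tool on the cluster, and a ping reply does not prove that the service itself is ready.

```shell
# Command used to test a host. Any command that takes a host name as its
# last argument and returns 0 when the host is up can be substituted.
CHECK_CMD=${CHECK_CMD:-"ping -c 1 -W 2"}

# wait_for_hosts TIMEOUT_SECONDS HOST...
# Returns 0 once every host answers, 1 if the overall timeout expires.
wait_for_hosts() {
    local timeout="$1"; shift
    local deadline=$(( $(date +%s) + timeout ))
    local host
    for host in "$@"; do
        # $CHECK_CMD is intentionally unquoted so its options are split.
        until $CHECK_CMD "$host" >/dev/null 2>&1; do
            if [ "$(date +%s)" -ge "$deadline" ]; then
                echo "timeout waiting for $host" >&2
                return 1
            fi
            sleep 5
        done
        echo "$host is up"
    done
}
```

For example, `wait_for_hosts 600 nfs01 nfs02` could be run before starting the GPFS nodes and the clients.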
Black Hole detection mechanism
lrms02 runs a script that detects failures in PBS. If a node fails a lot of jobs in a short period of time (the threshold is determined in the script, with a complex formula), the script automatically offlines that node and sends an email. As a safety measure, it offlines at most three nodes per day, and each node only once per day, so that you can manually bring a node back online later if needed.
It also sends an email with statistics at the end of the day.
It consists of two scripts: a bash wrapper that runs all the time and detects changes in the date (to pick up the new PBS accounting file), and a Perl script that does the real detection. This has been running very successfully in our cluster for one year. Both scripts are attached below.
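The attached scripts are the reference. Purely as an illustration of the idea, the sketch below counts failed jobs per node in an accounting file and lists the nodes at or above a fixed threshold. It is not the production logic: the real script uses a time-based formula, while this one uses a plain count. The function name failing_nodes is hypothetical, and the record layout (semicolon-separated fields, E records carrying exec_host= and Exit_status=) is assumed from the usual Torque/PBS accounting format.

```shell
# failing_nodes ACCOUNTING_FILE THRESHOLD
# Prints "node failed_jobs" for every node with at least THRESHOLD
# job-end records whose exit status is not 0.
failing_nodes() {
    awk -F';' -v limit="$2" '
        $2 == "E" {                         # E = job ended
            node = ""; status = ""
            n = split($4, fields, " ")
            for (i = 1; i <= n; i++) {
                if (fields[i] ~ /^exec_host=/) {
                    node = substr(fields[i], 11)    # strip "exec_host="
                    sub(/\/.*/, "", node)           # keep the first host name
                } else if (fields[i] ~ /^Exit_status=/) {
                    status = substr(fields[i], 13)  # strip "Exit_status="
                }
            }
            if (node != "" && status != "" && status != "0")
                failed[node]++
        }
        END {
            for (node in failed)
                if (failed[node] >= limit)
                    print node, failed[node]
        }
    ' "$1" | sort
}
```

The node names printed this way are the candidates that a wrapper could offline (in Torque, with `pbsnodes -o`) and report by email.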
qtop
QTOP is a set of scripts that show current activity in Compute Nodes inside a Batch system. It shows active jobs on each core, and classifies jobs per username (with the DN identity of the current user account). An example can be viewed here:
http://ganglia.lcg.cscs.ch/qtop.html
Its source code can be downloaded from here:
http://fotis.web.cern.ch/fotis/QTOP/
rpm_clone
rpm_clone is a script that can be very useful to back up, clone and restore the RPM status of a machine. It saves all package names and versions in a file that can later be used to restore that status. It is based entirely on YUM, so it is important to use the same repositories on the source and destination hosts. It also cannot handle manually installed RPMs, simply because they are (supposedly) not in a repository.
You can invoke it as /opt/cscs/bin/rpm_clone, or download it from the attachments section below, or here: rpm_clone (if you are very interested, please ask for a newer version, if available).
Remove spikes in RRD Ganglia graphs
There is a script that does the trick at /opt/cscs/bin/remove_spikes. The script is also attached to this page (remove_spikes). Please ask for a current version if you are interested.
The process is something like this:
- First make a backup of the rrd file
cp storage01_free_cms.rrd storage01_free_cms.rrd.backup
- Then dump the RRD data into an XML file:
rrdtool dump storage01_free_cms.rrd > storage01_free_cms.rrd.xml
- Change the values you want to change inside the XML to NaN, for example:
sed 's/[0-9]\.[0-9]*e.13/NaN/' -i storage01_free_cms.rrd.xml
- Finally, restore the data from the XML into the RRD file; the -f option makes rrdtool overwrite the existing file:
rrdtool restore -f storage01_free_cms.rrd.xml storage01_free_cms.rrd
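Before editing the XML in place, it may be worth checking which lines the sed pattern above would touch. The helper below is a small sketch (the name preview_spikes is made up for this page) that lists the matching lines with their line numbers, using the same regular expression. Note that the dot in e.13 matches any character, so values with an exponent of e-13 are listed as well as e+13.

```shell
# preview_spikes XML_FILE
# Lists (with line numbers) the lines that the sed command would modify.
preview_spikes() {
    grep -n '[0-9]\.[0-9]*e.13' "$1"
}
```

For example, `preview_spikes storage01_free_cms.rrd.xml | less` shows the candidate values; if unwanted lines appear, the pattern can be tightened (for instance e+13 instead of e.13) before running sed with -i.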
ILOM tips and Tricks
We use ILOMs to access the machines. Here are some helpful hints:
1. Use the itools (istat, ioff, ion, ireset, idisk, etc.) in /root/bin from a Xen host.
2. To set up a new ILOM from, say, Supermicro or IBM, start with the default credentials:
supermicro - username ADMIN, password ADMIN
ibm - username ADMIN, password PASSW0RD (with a zero, I believe)
Next, add a user, set its password and privilege, and enable it:
ipmitool -I lan -H 192.168.66.79 -U ADMIN user set name 3 root
ipmitool -I lan -H 192.168.66.79 -U ADMIN user set password 3 changeme
ipmitool -I lan -H 192.168.66.79 -U ADMIN user priv 3 4 1
ipmitool -I lan -H 192.168.66.79 -U ADMIN user enable 3
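The four commands above can be wrapped in a small function when several ILOMs need the same setup. This is only a sketch: the function name setup_ilom_user and the IPMITOOL variable are assumptions made for this example. Like the commands above, it logs in as ADMIN without -P, so ipmitool will prompt for the ADMIN password on each call.

```shell
# The ipmitool binary; can be pointed at a stub for a dry run.
IPMITOOL=${IPMITOOL:-ipmitool}

# setup_ilom_user HOST SLOT NAME PASSWORD
# Creates user NAME in user slot SLOT with administrator privilege (4)
# on channel 1, then enables it. Stops at the first failing step.
setup_ilom_user() {
    local host="$1" slot="$2" name="$3" pass="$4"
    $IPMITOOL -I lan -H "$host" -U ADMIN user set name "$slot" "$name" &&
    $IPMITOOL -I lan -H "$host" -U ADMIN user set password "$slot" "$pass" &&
    $IPMITOOL -I lan -H "$host" -U ADMIN user priv "$slot" 4 1 &&
    $IPMITOOL -I lan -H "$host" -U ADMIN user enable "$slot"
}
```

For example, `setup_ilom_user 192.168.66.79 3 root changeme` repeats the steps above. Running `ipmitool -I lan -H 192.168.66.79 -U ADMIN user list 1` afterwards shows whether the slot was populated as expected.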
Wiki tools
Tools to manage the insides of twiki, add strange functionality, stuff like that
--
PabloFernandez - 2011-01-13