Other tools that may be of interest, internal and external.

Black Hole detection mechanism

lrms02 runs a script that detects job failures in PBS. If a node fails a lot of jobs within a short period of time (the threshold is determined in the script by a fairly complex formula), the script automatically offlines that node and sends an email. As a safety measure, it offlines at most three nodes per day and offlines any given node only once per day, so that you can later bring a node back online manually if needed.

It also sends an email with statistics at the end of the day.
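The daily safety limits described above can be sketched in a few lines of shell. This is only an illustration of the capping logic, not the attached Perl script: the names may_offline, reset_daily_limits and MAX_OFFLINE_PER_DAY are hypothetical, the pbsnodes call is left as a comment, and the failure-rate formula itself is not reproduced.

```shell
#!/bin/sh
# Sketch of the per-day safety limits only; all names here are hypothetical.
MAX_OFFLINE_PER_DAY=3
offlined_today=""   # space-separated list of nodes already offlined today
offline_count=0

# may_offline NODE: returns 0 and records the node if it may be offlined now;
# returns 1 if it was already offlined today or the daily cap is reached.
may_offline() {
    case " $offlined_today " in
        *" $1 "*) return 1 ;;
    esac
    if [ "$offline_count" -ge "$MAX_OFFLINE_PER_DAY" ]; then
        return 1
    fi
    offlined_today="$offlined_today $1"
    offline_count=$((offline_count + 1))
    # The real action would go here, for example: pbsnodes -o "$1"
    return 0
}

# Call this when the date changes so that the limits apply per day.
reset_daily_limits() {
    offlined_today=""
    offline_count=0
}
```

Because a node is remembered for the rest of the day, bringing it back online by hand does not lead to it being offlined again until the limits are reset.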

It consists of two scripts: a bash wrapper that runs continuously and detects changes in the date (to pick up the new PBS accounting file), and a Perl script that does the actual detection. It has been running on our cluster for one year, very successfully. Both scripts are attached below.
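The date handling in the wrapper could look roughly like the sketch below. It is not the attached BH_wrapper.sh: the ACCT_DIR path, the DETECTOR location, the POLL_SECONDS interval and the idea that the detector takes the accounting file as its only argument are all assumptions made for illustration. It relies on PBS naming each accounting file after its date (YYYYMMDD).

```shell
#!/bin/sh
# Hypothetical sketch of a date-watching wrapper; adjust paths to the local setup.
ACCT_DIR="${ACCT_DIR:-/var/spool/pbs/server_priv/accounting}"   # assumed path
DETECTOR="${DETECTOR:-./check_black_hole.pl}"                    # assumed location
POLL_SECONDS="${POLL_SECONDS:-60}"                               # assumed interval

# accounting_file_for YYYYMMDD: print the accounting file path for that date.
accounting_file_for() {
    printf '%s/%s\n' "$ACCT_DIR" "$1"
}

# Loop forever; whenever the date changes, switch to the new accounting file.
watch_accounting() {
    current=""
    while true; do
        today=$(date +%Y%m%d)
        if [ "$today" != "$current" ]; then
            current=$today
            file=$(accounting_file_for "$current")
        fi
        # Assumption: the detector accepts the file path as its only argument.
        if [ -r "$file" ]; then
            "$DETECTOR" "$file"
        fi
        sleep "$POLL_SECONDS"
    done
}
```

Calling watch_accounting starts the loop; nothing runs until then.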

qtop

QTOP is a set of scripts that shows the current activity on the compute nodes of a batch system. It shows the active jobs on each core and groups jobs by username (with the DN identity of the current user account). An example can be viewed here: http://ganglia.lcg.cscs.ch/qtop.html

Its source code can be downloaded from here: http://fotis.web.cern.ch/fotis/QTOP/

rpm_clone

It's a script that can be very useful to back up, clone and restore the RPM state of a machine. It saves all package names and versions in a file, which can later be used to restore that state. It is based entirely on YUM, so it's important to use the same repositories on the source and destination hosts. It also can't handle manually installed RPMs, simply because they're (supposedly) not in a repository.

You can invoke it as /opt/cscs/bin/rpm_clone, or download it from the attachments section below (rpm_clone).

(If you are very interested, please ask for a newer version, if available.)
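The save-and-restore idea behind rpm_clone can be sketched as follows. This is a minimal illustration, not the attached script: the function names are hypothetical, the gpg-pubkey filter is an added assumption (those entries are imported keys, not repository packages), and only the installation of missing packages is covered, not the removal of extra ones.

```shell
#!/bin/sh
# Hypothetical sketch of the save/restore idea; not the attached rpm_clone.

# save_rpm_state FILE: write one name-version-release.arch per line, sorted.
save_rpm_state() {
    rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' |
        grep -v '^gpg-pubkey' | LC_ALL=C sort > "$1"
}

# missing_packages SAVED CURRENT: list entries in SAVED that are not in CURRENT.
# Both files must be sorted the way save_rpm_state sorts them.
missing_packages() {
    LC_ALL=C comm -23 "$1" "$2"
}

# restore_rpm_state SAVED: install with yum whatever SAVED lists but this host lacks.
restore_rpm_state() {
    current=$(mktemp) || return 1
    save_rpm_state "$current"
    missing_packages "$1" "$current" | xargs -r yum -y install
    rm -f "$current"
}
```

A typical run would be save_rpm_state on the source host, copying the file across, and restore_rpm_state on the destination. Since yum resolves the exact versions, both hosts need the same repositories, as noted above.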

Remove spikes in RRD Ganglia graphs

A script at /opt/cscs/bin/remove_spikes does the trick. It is also attached to this page (remove_spikes). Please ask for a current version if you are interested.

The process is something like this:

  • First, make a backup of the RRD file: cp storage01_free_cms.rrd storage01_free_cms.rrd.backup
  • Then dump the RRD data into an XML file: rrdtool dump storage01_free_cms.rrd > storage01_free_cms.rrd.xml
  • Change the values you want to remove to NaN inside the XML, for example: sed -i 's/[0-9]\.[0-9]*e.13/NaN/' storage01_free_cms.rrd.xml
  • Finally, restore the data from the XML into the RRD file; with -f it overwrites the existing file: rrdtool restore -f storage01_free_cms.rrd.xml storage01_free_cms.rrd
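The manual steps above can be wrapped in a small script along these lines. It is a sketch, not the attached remove_spikes: the function names and the exponent parameter are hypothetical. It also differs from the one-liner in two ways that are assumptions on my part: it matches a literal e+ (the . in the one-liner matches any character, so it would also catch e-13), and it replaces every match on a line rather than only the first.

```shell
#!/bin/sh
# Hypothetical wrapper around the manual steps; not the attached remove_spikes.

# nan_spikes EXP: filter an rrdtool XML dump on stdin, replacing every value
# whose exponent is e+EXP (for example 13) with NaN.
nan_spikes() {
    sed "s/[0-9]\.[0-9]*e+$1/NaN/g"
}

# remove_spikes RRD_FILE EXP: back up, dump, edit and restore one RRD file.
remove_spikes() {
    cp "$1" "$1.backup" &&
    rrdtool dump "$1" | nan_spikes "$2" > "$1.xml" &&
    rrdtool restore -f "$1.xml" "$1"
}
```

For the file used above this would be called as remove_spikes storage01_free_cms.rrd 13. The backup copy stays in place, so the original data can be put back if the result is not what was intended.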

ILOM tips and tricks

We use ILOMs to access the machines. Here are some helpful hints:

1. Use the itools (istat, ioff, ion, ireset, idisk, etc.) in /root/bin from a Xen host.

2. To set up a new ILOM from, say, Supermicro or IBM, start with the default credentials:

Supermicro: username ADMIN, password ADMIN
IBM: username ADMIN, password PASSW0RD (with a zero, I believe)

Next, create the user in slot 3, set its password, give it privilege level 4 (administrator) on channel 1, and enable it:

ipmitool -I lan -H 192.168.66.79 -U ADMIN user set name 3 root
ipmitool -I lan -H 192.168.66.79 -U ADMIN user set password 3 changeme
ipmitool -I lan -H 192.168.66.79 -U ADMIN user priv 3 4 1
ipmitool -I lan -H 192.168.66.79 -U ADMIN user enable 3
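When several ILOMs need the same treatment, the four commands can be grouped into one function, as in the sketch below. The function name and the IPMITOOL and ILOM_ADMIN variables are hypothetical. Setting IPMITOOL to "echo ipmitool" gives a dry run that only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper that runs the four ipmitool steps for one ILOM.
IPMITOOL="${IPMITOOL:-ipmitool}"      # set to "echo ipmitool" for a dry run
ILOM_ADMIN="${ILOM_ADMIN:-ADMIN}"     # account used to log in to the ILOM

# setup_ilom_user HOST SLOT NAME PASSWORD
# Creates user NAME in SLOT with privilege level 4 on channel 1, then enables it.
# Stops at the first step that fails.
setup_ilom_user() {
    $IPMITOOL -I lan -H "$1" -U "$ILOM_ADMIN" user set name "$2" "$3" &&
    $IPMITOOL -I lan -H "$1" -U "$ILOM_ADMIN" user set password "$2" "$4" &&
    $IPMITOOL -I lan -H "$1" -U "$ILOM_ADMIN" user priv "$2" 4 1 &&
    $IPMITOOL -I lan -H "$1" -U "$ILOM_ADMIN" user enable "$2"
}
```

The example above corresponds to setup_ilom_user 192.168.66.79 3 root changeme. As with the plain commands, ipmitool asks for the ADMIN password at each step unless one is supplied some other way.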

Wiki tools

Tools to manage the internals of TWiki, add unusual functionality, and things like that.

-- PabloFernandez - 2011-01-13

Topic attachments
| Attachment | Type | Revision | Size | Date | Who |
| BH_wrapper.sh | Unix shell script | r1 | 0.6 K | 2011-05-20 15:51 | PabloFernandez |
| check_black_hole.pl.txt | Text | r1 | 12.1 K | 2011-05-20 15:51 | PabloFernandez |
| remove_spikes | Unknown file format | r1 | 1.8 K | 2011-02-16 09:38 | PabloFernandez |
| rpm_clone | Unknown file format | r1 | 13.6 K | 2011-05-20 15:26 | PabloFernandez |
Topic revision: r11 - 2011-06-01 - PabloFernandez
 