Other tools that may be of interest, internal and external.
Full Cluster Restart
Each service's page describes how to start it up individually. This section covers the general startup order, and is probably most useful after a power loss or an intentional full shutdown.
Phase 1. Start all network services
- All Infiniband and ethernet switches
- Firewalls: fw[01-02]
- DHCP services, currently: xen03
Phase 2. Start shared filesystem servers
- Boot both NFS servers (nfs[01-02]) at the same time, otherwise the first one to boot will take all the mounts for itself.
- Boot storage02. It starts the pnfs server (so that storage01 can start SRM) and also the dCache LM server (the Location Manager for the whole dCache service).
- Boot all GPFS nodes (mds[1-2], oss[11-42]). The order does not matter, but they all need to be started before the clients; a quick status check is sketched just after this list.
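Before moving on to Phase 3 it can be useful to verify that the GPFS nodes are really up. A minimal check, assuming the standard GPFS command mmgetstate is in the PATH on one of the mds nodes:
mmgetstate -a
All nodes should report the active state before any GPFS client is started.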
Phase 3. Start the rest. The order does not matter too much; services should be smart enough to wait for their dependencies to become available.
- dCache service nodes: storage01, pools
- All other non-virtualized services, probably cream[01-02]
- All VM hosts and guests
- All WNs (a quick sanity check is sketched just after this list)
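Once everything is back, a quick sanity check of the batch system is to list the nodes that PBS/Torque still considers down or offline, for example from lrms02:
pbsnodes -l
An empty output means all WNs are back; any node listed either still needs attention or has to be onlined manually (see the Black Hole detection mechanism below).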
Black Hole detection mechanism
lrms02 runs a script that detects failures in PBS: if one node fails a large number of jobs in a short period of time (determined in the script by a somewhat complex formula), it automatically offlines that node and sends an email. As a safety measure it will only offline three nodes per day, and only offline each node once (again, per day), so that you can manually bring it back online later if needed.
It also sends an email with statistics at the end of the day.
It consists of two scripts: a bash wrapper that runs all the time and detects changes in the date (to pick up a new PBS accounting file), and a Perl script that does the actual detection. This has been running in our cluster very successfully for a year. Both scripts are attached below.
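Purely to illustrate the wrapper logic (this is a sketch, not the attached code; the detector name and the Torque accounting directory are assumptions), the wrapper does something along these lines:
#!/bin/bash
# Sketch: re-run the detector regularly, switching to the new PBS accounting
# file (named YYYYMMDD) whenever the date changes.
ACCT_DIR=/var/spool/torque/server_priv/accounting   # assumed Torque location
DETECTOR=/opt/cscs/bin/blackhole_detect.pl           # hypothetical script name
while true; do
    DAY=$(date +%Y%m%d)
    perl "$DETECTOR" "$ACCT_DIR/$DAY"
    sleep 60
done
The real, tested scripts are the ones attached to this page.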
qtop
QTOP is a set of scripts that shows the current activity on the compute nodes of a batch system. It shows the active jobs on each core and classifies jobs per username (with the DN identity of the current user account). An example can be viewed here:
http://ganglia.lcg.cscs.ch/qtop.html
Its source code can be downloaded from here:
http://fotis.web.cern.ch/fotis/QTOP/
rpm_clone
It is a script that can be very useful to back up, clone and restore the RPM status of a machine. It saves all package names and versions in a file that can later be used to restore that status. It is completely based on YUM, so it is important to use the same repositories on the source and destination hosts. Also, it cannot handle manually installed RPMs, simply because they are (supposedly) not in any repository.
You invoke it using /opt/cscs/bin/rpm_clone, or download it from the attachments section below, or here: rpm_clone (if you are very interested please ask for a newer version, if available).
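Just to illustrate the idea behind it (this is not the actual script), saving and restoring the package set by hand would look roughly like this, assuming the same YUM repositories are configured on both hosts:
rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' | sort > packages.txt
yum -y install $(cat packages.txt)
rpm_clone wraps this kind of logic and takes care of the bookkeeping for you.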
Remove spikes in RRD Ganglia graphs
There is a script that does the trick at /opt/cscs/bin/remove_spikes. The script is also attached to this page (remove_spikes). Please ask for a current version if you are interested.
The process is something like this:
- First make a backup of the rrd file
cp storage01_free_cms.rrd storage01_free_cms.rrd.backup
- Then dump the RRD data into an XML file:
rrdtool dump storage01_free_cms.rrd > storage01_free_cms.rrd.xml
- Change the values you want to remove inside the XML to NaN, for example (this particular pattern matches spike values of the order of 1e+13):
sed 's/[0-9]\.[0-9]*e.13/NaN/' -i storage01_free_cms.rrd.xml
- And finally restore the data back into the RRD file; it should overwrite the previous values:
rrdtool restore -f storage01_free_cms.rrd.xml storage01_free_cms.rrd
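If several RRD files are affected, the same four steps can be wrapped in a small loop; a minimal sketch (the file glob and the sed pattern are just examples and must be adapted to the actual files and spike values):
for f in storage01_free_*.rrd; do
    cp "$f" "$f.backup"
    rrdtool dump "$f" > "$f.xml"
    sed 's/[0-9]\.[0-9]*e.13/NaN/' -i "$f.xml"
    rrdtool restore -f "$f.xml" "$f"
done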
ILOM tips and Tricks
We use ILOMs to access the machines. Here are some helpful hints:
1. Use the itools (istat, ioff, ion, ireset, idisk, etc.) in /root/bin from a Xen host.
2. In order to set up a new ILOM from, say, Supermicro or IBM, the factory default credentials are:
Supermicro - username ADMIN, password ADMIN
IBM - username ADMIN, password PASSW0RD (with a zero, I believe)
Next, add a user (here user ID 3), set its password, give it ADMINISTRATOR privileges (level 4) on channel 1, and enable it:
ipmitool -I lan -H 192.168.66.79 -U ADMIN user set name 3 root
ipmitool -I lan -H 192.168.66.79 -U ADMIN user set password 3 changeme
ipmitool -I lan -H 192.168.66.79 -U ADMIN user priv 3 4 1
ipmitool -I lan -H 192.168.66.79 -U ADMIN user enable 3
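To verify the result, listing the users on channel 1 should now show the root account with ADMINISTRATOR privileges (same example IP address as above):
ipmitool -I lan -H 192.168.66.79 -U ADMIN user list 1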
Wiki tools
Tools to manage the internals of TWiki, add extra functionality, and the like.
SMclient (IBM Storage)
How to Update SMclient using a DVD
After installing a new disk controller, the currently installed SMclient may complain that it has detected a new firmware version that it does not support and suggest an update. The DVD shipped with the controller and labeled IBM System Storage DS Storage Manager can be used for such an update.
Mount it on the machine where the current version of SMclient is installed and, assuming the Linux 64-bit version is to be installed, just run:
# cd /media/cdrom
# cd Code/Storage_Manager/Linux_2.6_x86-64/
# ./SMIA-LINUXX64-10.86.0A05.0028.bin
Preparing to install...
Extracting the JRE from the installer archive...
Unpacking the JRE...
Extracting the installation resources from the installer archive...
Configuring the installer for this system's environment...
Launching installer...
then a GUI should start and it should be enough to confirm the default options, accept the license and so on. The installation script should remove the old version of SMclient and install the new one. It should then be possible to run the client and gradually see all the previously installed storage systems, along with the new ones, appear in the Enterprise Management GUI. The current version (September 2013) is 10.86.
How to Copy & Clone a Storage Configuration
Using SMclient it is possible to save the current configuration of a storage system and use it to clone that configuration onto other similar storage systems.
# SMclient
then the Enterprise Management GUI should start. Select a storage system in the list on the left with a double click and the Subsystem Management interface should pop up. To save the configuration it is enough to go to:
Storage Subsystem -> Configuration -> Save
select all the elements to be saved (Storage Subsystem, Logical Drive configuration, Logical Drive-to-LUN mappings, Topology) and save the file locally on the machine SMclient has been launched from. The saved configuration must be customized with the array names, host group names, LUN mappings, etc. of the storage to be cloned before being applied.
To apply one of these configurations to a brand new identical storage system it is enough to run the Enterprise Management interface as above, right-click on the Storage Subsystem to be configured and run:
Execute Script -> File -> Load Script
which should open a new window with an editor: here it is possible to load a previously saved configuration file, check its syntax and run it using the appropriate drop-down menu options. Please note that:
- creating arrays can take several hours; the progress can be checked in the Subsystem Management interface under Summary -> Operations in Progress;
- saved configuration files should not be used for disaster recovery (please refer to the official documentation);
- they should be run on brand new storage systems or on previously installed ones, but in the latter case the configuration file must be completed with specific commands to delete the current configuration (please refer to the official documentation);
- do not save the configuration while there are pending or running operations that may change it.
How to retrieve the WWIDs associated with logical drives
When configuring multipath.conf on the servers attached to the storage systems it is necessary to know the WWID (World Wide IDentifier) associated with each logical array created on the storage and seen by the servers as a logical device.
To do this it is enough to run the Subsystem Management interface as described above (run SMclient and then select the appropriate storage subsystem where the arrays have been created):
SMClient -> Subsystem Mangement (for the specific storage subsys) ->
Storage & Copy Services -> Arrays -> (select a Logical Array)
for each Logical Array an information page is shown: the WWID can be derived from the Logical Drive ID shown on that page by removing the colons used as separators and prefixing a 3, in this way:
Storage11_LogicalArray0
Logical Drive ID: 60:08:0e:50:00:3e:1c:f8:00:00:0a:75:52:39:4c:1a
becomes
WWN 360080e50003e1cf800000a7552394c1a
which can be used as a WWID in a multipath.conf file.
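As a side note, the same transformation can be done on the command line:
echo 60:08:0e:50:00:3e:1c:f8:00:00:0a:75:52:39:4c:1a | sed 's/://g; s/^/3/'
and the resulting WWID can then be given an alias in multipath.conf, roughly like this (the alias here is just the example array name from above):
multipaths {
    multipath {
        wwid  360080e50003e1cf800000a7552394c1a
        alias Storage11_LogicalArray0
    }
}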
Please note that the prefix 3 is often used by IBM on its storage systems, but to make sure the same format applies just run
# multipath -ll
on the server attached to the storage to check what kind of format that particular storage uses (before configuring the array names through multipath.conf).
--
PabloFernandez - 2011-01-13