Other tools that may be of interest, internal and external.

Full Cluster Restart

Each service's own page describes how to start it. This section gives the general order for bringing the whole cluster back up, which is useful after a power loss or an intentional full shutdown.

Phase 1. Start all network services

  • All Infiniband and ethernet switches
  • Firewalls: fw[01-02]
  • DHCP services, currently: xen03
Phase 2. Start shared filesystem servers
  • Boot both NFS servers (nfs[01-02]) at the same time; otherwise the first one to boot will take all the mounts for itself.
  • Boot storage02. It starts the pnfs server (which storage01 needs in order to start srm) and the dCache LM server (the Location Manager for the whole dCache service).
  • Boot all GPFS nodes (mds[1-2], oss[11-42]). The order does not matter, but they must all be started before the clients.
Phase 3. Start the rest. The order does not matter too much. Services should be smart enough to wait for their dependencies to work.
  • dCache service nodes: storage01, pools
  • All other non-virtualized services, probably cream[01-02]
  • All VM hosts and guests
  • All WNs
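
The "finish one phase before starting the next" rule above can be sketched as a small wait loop. This is only an illustration: wait_for_hosts is a hypothetical helper, not an existing site tool, and the check command is whatever you consider a sign of life for that phase.

```shell
#!/bin/bash
# Sketch only; wait_for_hosts is a hypothetical helper, not a site tool.

# Run "$check host" for every host, once per second, until all succeed
# or $tries rounds have passed. Returns 0 when every host answered.
wait_for_hosts() {
    local check="$1" tries="$2"
    shift 2
    local host pending round=0
    while [ "$round" -lt "$tries" ]; do
        pending=""
        for host in "$@"; do
            # $check is left unquoted on purpose so it may carry options
            $check "$host" >/dev/null 2>&1 || pending="$pending $host"
        done
        [ -z "$pending" ] && return 0
        round=$((round + 1))
        sleep 1
    done
    echo "still down:$pending" >&2
    return 1
}
```

For example, wait_for_hosts "ping -c1 -W1" 600 nfs01 nfs02 storage02 would block until the Phase 2 servers named above answer a ping, or give up after about ten minutes.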

Black Hole detection mechanism

lrms02 runs a script that detects job failures in pbs. If a node fails many jobs within a short period of time (the threshold is computed in the script with a fairly complex formula), the script automatically offlines that node and sends an email. As a safety measure, it offlines at most three nodes per day, and offlines any given node only once per day, so that you can manually bring it back online later if needed.

It also sends an email with statistics at the end of the day.

It consists of two scripts: a bash wrapper that runs continuously and detects changes in the date (to pick up the new pbs accounting file), and a perl script that does the actual detection. This has been running in our cluster for one year very successfully. Both scripts are attached below.
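
The attached scripts are the reference. The following is only a minimal sketch of the idea, assuming Torque-style accounting records in which job-end (E) lines carry exec_host= and Exit_status= fields. The name bh_candidates, the fixed failure threshold and the node cap are illustrative stand-ins for the more complex, time-dependent formula in check_black_hole.pl; the sketch ignores the time window entirely.

```shell
#!/bin/bash
# Sketch only: list nodes whose failed-job count reaches a threshold.
# Assumes Torque-style accounting lines on stdin, such as:
#   05/20/2011 15:51:02;E;123.lrms02;user=u exec_host=wn01/0 Exit_status=1
# bh_candidates, the threshold and the cap are hypothetical.
bh_candidates() {
    local threshold="$1" max_nodes="$2"
    awk -F';' -v t="$threshold" '
        $2 == "E" {
            host = ""; status = ""
            n = split($4, kv, " ")
            for (i = 1; i <= n; i++) {
                # keep only the first node name of exec_host (before "/")
                if (kv[i] ~ /^exec_host=/)   { host = substr(kv[i], 11); sub(/\/.*/, "", host) }
                if (kv[i] ~ /^Exit_status=/) { status = substr(kv[i], 13) }
            }
            # any non-zero exit status counts as a failed job
            if (host != "" && status != "" && status != "0") failed[host]++
        }
        END { for (h in failed) if (failed[h] >= t) print failed[h], h }
    ' | sort -k1,1nr -k2,2 | head -n "$max_nodes" | awk '{ print $2 }'
}
```

Each node name printed could then be passed to pbsnodes -o to mark it offline; the cap argument mirrors the three-nodes-per-day limit described above.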

qtop

QTOP is a set of scripts that show current activity in Compute Nodes inside a Batch system. It shows active jobs on each core, and classifies jobs per username (with the DN identity of the current user account). An example can be viewed here: http://ganglia.lcg.cscs.ch/qtop.html

Its source code can be downloaded from here: http://fotis.web.cern.ch/fotis/QTOP/

rpm_clone

This script is useful for backing up, cloning and restoring the RPM state of a machine. It saves all package names and versions in a file that can later be used to restore that state. It is based entirely on YUM, so it is important to use the same repositories on the source and destination hosts. It cannot handle manually installed RPMs, simply because they are (supposedly) not in a repository.

You invoke it using /opt/cscs/bin/rpm_clone, or download it from the attachments section, below, or here: rpm_clone

(if you are very interested please ask for a newer version, if available)
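
As a rough illustration of the save-and-restore idea (not the attached rpm_clone script), the sketch below records the installed package set and reinstalls whatever is missing through YUM. All function names are hypothetical, and the choice to skip gpg-pubkey pseudo-packages is an assumption of this sketch.

```shell
#!/bin/bash
# Sketch only, not the attached rpm_clone script.
# Function names here are hypothetical.

# Save the installed package set to file $1, one
# name-version-release.arch per line, sorted for comm.
rpm_state_save() {
    rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' \
        | grep -v '^gpg-pubkey-' | LC_ALL=C sort -u > "$1"
}

# Print packages listed in the saved state ($1) but absent from the
# current state ($2). Both files must be sorted as rpm_state_save does.
rpm_state_missing() {
    LC_ALL=C comm -23 "$1" "$2"
}

# Install whatever is missing, relying on the configured YUM repositories.
rpm_state_restore() {
    local saved="$1" current
    current=$(mktemp)
    rpm_state_save "$current"
    rpm_state_missing "$saved" "$current" | xargs -r yum -y install
    rm -f "$current"
}
```

On the source host one would run rpm_state_save with a file name, copy that file over, and run rpm_state_restore with it on the destination host. Packages that are not in any configured repository would simply fail to install, which matches the limitation noted above.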

Remove spikes in RRD Ganglia graphs

A script at /opt/cscs/bin/remove_spikes does the trick. It is also attached to this page (remove_spikes). Please ask for a current version if you are interested.

The process is something like this:

  • First make a backup of the RRD file: cp storage01_free_cms.rrd storage01_free_cms.rrd.backup
  • Then dump the RRD data into an XML file: rrdtool dump storage01_free_cms.rrd > storage01_free_cms.rrd.xml
  • Replace the values you want to remove with NaN inside the XML, for example (this matches values with an exponent of 13): sed 's/[0-9]\.[0-9]*e.13/NaN/' -i storage01_free_cms.rrd.xml
  • Finally, restore the RRD file from the edited XML; the -f option overwrites the existing file: rrdtool restore -f storage01_free_cms.rrd.xml storage01_free_cms.rrd
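
The substitution step can also be written as a filter that only touches values inside <v> elements. This is a sketch under assumptions, not the attached remove_spikes script: despike_xml is a hypothetical name, and it assumes spikes can be recognised by their exponent alone.

```shell
#!/bin/bash
# Sketch only, not the attached remove_spikes script.
# despike_xml and its exponent argument are hypothetical.

# Replace every <v> value whose exponent is e+NN (NN = $1) with NaN.
# Reads "rrdtool dump" XML on stdin and writes the edited XML to stdout.
despike_xml() {
    local exp="$1"
    sed -E "s#<v>[[:space:]]*[0-9]\.[0-9]+e\+${exp}[[:space:]]*</v>#<v>NaN</v>#g"
}
```

It could be used as rrdtool dump storage01_free_cms.rrd | despike_xml 13 > storage01_free_cms.rrd.xml, followed by the restore step from the list above. Anchoring on the <v> tags keeps the substitution away from timestamps and other numbers in the dump.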

ILOM tips and Tricks

We use ILOMs to access the machines. Here are some helpful hints:

1. Use the itools (istat, ioff, ion, ireset, idisk, etc.) in /root/bin from a Xen host.

2. To set up a new ILOM from, say, Supermicro or IBM, log in with the factory default credentials:

supermicro - username ADMIN, password ADMIN
ibm - username ADMIN, password PASSW0RD (with a zero, I believe)

Next, add a user (here user ID 3, named root), set its password, give it privilege level 4 (administrator) on channel 1, and enable it:

ipmitool -I lan -H 192.168.66.79 -U ADMIN user set name 3 root
ipmitool -I lan -H 192.168.66.79 -U ADMIN user set password 3 changeme
ipmitool -I lan -H 192.168.66.79 -U ADMIN user priv 3 4 1
ipmitool -I lan -H 192.168.66.79 -U ADMIN user enable 3

Wiki tools

Tools to manage the insides of twiki, add strange functionality, stuff like that

SMclient (IBM Storage)

How to Update SMclient using a DVD

After a new disk controller is installed, the currently installed SMclient may complain that it has detected a newer firmware version that it does not support, and suggest an update. The DVD shipped with the controller, labeled IBM System Storage DS Storage Manager, can be used for this update.

Mount it on the machine where the current version of SMclient is installed and, assuming you want the Linux 64-bit version, run:

# cd /media/cdrom
# cd Code/Storage_Manager/Linux_2.6_x86-64/
# ./SMIA-LINUXX64-10.86.0A05.0028.bin
Preparing to install...
Extracting the JRE from the installer archive...
Unpacking the JRE...
Extracting the installation resources from the installer archive...
Configuring the installer for this system's environment...

Launching installer...

A GUI should then start; it should be enough to confirm the default options, accept the license, and so on. The installer should remove the old version of SMclient and install the new one. After that it should be possible to run the client and gradually see all the previously installed storage systems, along with the new ones, appear in the Enterprise Management GUI. The current version (September 2013) is 10.86.

How to Copy & Clone a Storage Configuration

Using SMclient it is possible to save the current configuration of a storage system and use it to clone the configuration to other similar storage systems.

# SMclient

The Enterprise Management GUI should then start. Double-click a storage system in the list on the left and the Subsystem Management interface should pop up. To save the configuration, select:

Storage Subsystem -> Configuration -> Save

Select all the elements to be saved (Storage Subsystem, Logical Drive configuration, Logical Drive-to-LUN mappings, Topology) and save the file locally on the machine SMclient was launched from. Before it is applied, the saved configuration must be customized with the array names, host group names, LUN mappings, etc. of the storage system to be cloned.

To apply one of these configurations to a brand new, identical storage system, start the Enterprise Management interface as above, right-click the Storage Subsystem to be configured and select:

Execute Script -> File -> Load Script

This should open a new window with an editor, where you can load a previously saved configuration file, check its syntax and run it using the appropriate drop-down menu options. Please note that:

  • creating an array can take several hours; the progress can be checked in the Subsystem Management interface under Summary -> Operations in Progress;

  • saved configuration files should not be used for disaster recovery (please refer to the official documentation);

  • they should be run on a brand new storage system; they can also be run on a previously installed one, but in that case the configuration file must be completed with specific commands to delete the current configuration (please refer to the official documentation);

  • do not save the configuration while there are pending or running operations attempting to change it.

How to retrieve the WWIDs associated to logical drives

When configuring multipath.conf on the servers attached to the storage systems, you need to know the WWIDs (World Wide Identifiers) of the logical arrays created on the storage, which the servers see as logical devices.

To do this, open the Subsystem Management interface as described above (run SMclient and then select the storage subsystem where the arrays have been created):

SMclient -> Subsystem Management (for the specific storage subsystem) ->
Storage & Copy Services -> Arrays -> (select a Logical Array)

An information page is shown for each Logical Array. The WWID can be derived from the Logical Drive ID item on that page by removing the colons used as separators and prefixing a 3, like this:

Storage11_LogicalArray0

Logical Drive ID:60:08:0e:50:00:3e:1c:f8:00:00:0a:75:52:39:4c:1a

becomes

WWID 360080e50003e1cf800000a7552394c1a

that can be used as a WWID in a multipath.conf file.
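
The conversion is mechanical, so it can be scripted. The helpers below are a sketch: ldid_to_wwid and multipath_stanza are hypothetical names, and the alias is whatever name you want the device to have on the server.

```shell
#!/bin/bash
# Sketch only; ldid_to_wwid and multipath_stanza are hypothetical helpers.

# Turn a colon-separated Logical Drive ID into a WWID by dropping the
# colons, lowercasing the hex digits, and prefixing a 3.
ldid_to_wwid() {
    printf '3%s\n' "$(printf '%s' "$1" | tr -d ':' | tr 'A-F' 'a-f')"
}

# Print a multipath entry (to be placed inside the multipaths section
# of multipath.conf) for Logical Drive ID $1 with alias $2.
multipath_stanza() {
    local wwid name="$2"
    wwid=$(ldid_to_wwid "$1")
    cat <<EOF
multipath {
    wwid  $wwid
    alias $name
}
EOF
}
```

Calling ldid_to_wwid with the Logical Drive ID from the example above yields the same 360080e5... value, and multipath_stanza with that ID and Storage11_LogicalArray0 as alias prints an entry ready to paste into multipath.conf.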

Please note that the leading 3 is common on IBM storage systems, but to confirm the format used by a particular storage system, run

# multipath -ll

on the server attached to it before configuring the array names in multipath.conf.

-- PabloFernandez - 2011-01-13

Topic attachments
Attachment               Revision  Size    Date              Who
BH_wrapper.sh            r1        0.6 K   2011-05-20 15:51  PabloFernandez
check_black_hole.pl.txt  r1        12.1 K  2011-05-20 15:51  PabloFernandez
remove_spikes            r1        1.8 K   2011-02-16 09:38  PabloFernandez
rpm_clone                r1        13.6 K  2011-05-20 15:26  PabloFernandez
Topic revision: r15 - 2014-02-12 - PabloFernandez
 