SGE 6.2u5 and ARCO MySQL hosted on ZFS

Revision 11, 2012-06-03 19:37:58

Sun Grid Engine project home page: http://gridengine.sunsource.net/



This document describes our experience upgrading the SGE installation from 6.1 to 6.2u5, the last free version of this batch system. Apart from the SGE upgrade itself, which introduced several new features in the batch system, we also migrated the O.S., changed the way accounting is managed by introducing a DB, and deployed the ZFS Linux driver to use this advanced filesystem in our context.

HW installation

For this installation we detached t3ui07 from the cluster and converted it into t3ce02, the new SGE master described in this document. To protect its data we created a HW RAID1 configuration using the LSI BIOS at boot time; the resulting disk layout is a 140GB LSI Virtual Volume, which we partitioned during the SL6 installation as follows:

[root@t3ce02 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             9.7G  2.3G  6.9G  25% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
/dev/sda1             485M   34M  426M   8% /boot

[root@t3ce02 ~]# mount 
/dev/sda3 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)

Since the server has 4 Gigabit NICs, it is worth connecting as many of them as possible to the switch and then setting up a Linux bonding configuration of type 6 (balance-alb) to improve the server's bandwidth and availability; for the time being we skipped this step.
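Since we skipped the bonding step, the following is only a sketch of what a mode 6 (balance-alb) setup would look like on SL6. The files are written to a scratch directory here for review; on the real host they belong under /etc/sysconfig/network-scripts/, and the IP address is a placeholder:

```shell
# Sketch of a Linux bonding mode 6 (balance-alb) config for two of the
# four NICs; written to a scratch dir so it can be reviewed first.
dir=$(mktemp -d)

cat > "$dir/ifcfg-bond0" <<'EOF'
DEVICE=bond0
BONDING_OPTS="mode=6 miimon=100"
IPADDR=192.0.2.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
EOF

cat > "$dir/ifcfg-eth0" <<'EOF'
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
EOF
```

A second slave file (ifcfg-eth1, and so on) follows the same pattern; after copying the files in place, a network restart activates the bond.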

SL6 64bit Installation

We simply pointed the virtual CD of t3ce02 to a SL6 DVD ISO file saved in t3admin01:/home/ and performed a "Basic Server" installation, which is enough to get utilities like SSH, yum, etc.; we selected the remaining RPMs later at run time. The "Basic Server" installation turns SELinux ON by default; to disable it, edit this file and then reboot the system:

[root@t3ce02 ~]# grep -v \# /etc/sysconfig/selinux 
SELINUX=disabled
SELINUXTYPE=targeted 
[root@t3ce02 ~]#

Also turn OFF the automatic yum updates cron job by editing this file:

/etc/sysconfig/yum-autoupdate
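The exact knob inside this file depends on the SL release; the variable name ENABLED used below is an assumption, so check your own copy first. The change can be rehearsed on a scratch copy of the file:

```shell
# Rehearse the edit on a temp copy before touching
# /etc/sysconfig/yum-autoupdate itself (variable name is an assumption).
cfg=$(mktemp)
printf 'ENABLED="true"\n' > "$cfg"
sed -i 's/^ENABLED=.*/ENABLED="false"/' "$cfg"
grep '^ENABLED' "$cfg"
```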

Then install these i686 RPMs, since they are needed later by the Sun Web Console and by the LSI RAID utility mpt-status:

[root@t3ce02 ~]# yum install glibc.i686
...
Dependencies Resolved
================================================================================================================================
 Package                            Arch                   Version                            Repository                   Size
================================================================================================================================
Installing:
 glibc                              i686                   2.12-1.7.el6_0.3                   sl-security                 4.3 M
Installing for dependencies:
 nss-softokn-freebl                 i686                   3.12.8-1.el6_0                     sl-security                 108 k
Updating for dependencies:
 glibc                              x86_64                 2.12-1.7.el6_0.3                   sl-security                 3.7 M
 glibc-common                       x86_64                 2.12-1.7.el6_0.3                   sl-security                  14 M
 nss-softokn-freebl                 x86_64                 3.12.8-1.el6_0                     sl-security                 114 k

Transaction Summary
================================================================================================================================
Install       2 Package(s)
Upgrade       3 Package(s)

Total size: 22 M
Total download size: 4.4 M
Is this ok [y/N]: y
Downloading Packages:
(1/2): glibc-2.12-1.7.el6_0.3.i686.rpm                                                                   | 4.3 MB     00:09     
(2/2): nss-softokn-freebl-3.12.8-1.el6_0.i686.rpm                                                        | 108 kB     00:00     
--------------------------------------------------------------------------------------------------------------------------------
... 
Complete!
[root@t3ce02 ~]#

Now you can install the LSI RAID checker "mpt-status" to monitor the HW RAID status:

[root@t3ce02 ~]# rpm -Uv http://www.drugphish.ch/~ratz/mpt-status/RPMS/1.2.0_RC7/mpt-status-1.2.0_RC7-3.i386.rpm
Retrieving http://www.drugphish.ch/~ratz/mpt-status/RPMS/1.2.0_RC7/mpt-status-1.2.0_RC7-3.i386.rpm
Preparing packages for installation...
mpt-status-1.2.0_RC7-3
[root@t3ce02 ~]#

load the driver and verify the HW RAID1:

[root@t3ce02 ~]# modprobe mptctl
[root@t3ce02 ~]# mpt-status 
ioc0 vol_id 0 type IM, 2 phy, 135 GB, state OPTIMAL, flags ENABLED
ioc0 phy 1 scsi_id 2 SEAGATE  ST914602SSUN146G 0603, 136 GB, state ONLINE, flags NONE
ioc0 phy 0 scsi_id 1 SEAGATE  ST914602SSUN146G 0603, 136 GB, state ONLINE, flags NONE
[root@t3ce02 ~]#
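The output above lends itself to a simple cron health check. The sketch below parses a captured sample of the same output (on the live host you would use status=$(mpt-status) instead) and flags any volume or disk whose state is not OPTIMAL/ONLINE:

```shell
# Flag any mpt-status line whose state is not OPTIMAL or ONLINE.
# Sample output captured from the transcript above stands in for
# a live call to mpt-status.
status='ioc0 vol_id 0 type IM, 2 phy, 135 GB, state OPTIMAL, flags ENABLED
ioc0 phy 1 scsi_id 2 SEAGATE  ST914602SSUN146G 0603, 136 GB, state ONLINE, flags NONE
ioc0 phy 0 scsi_id 1 SEAGATE  ST914602SSUN146G 0603, 136 GB, state ONLINE, flags NONE'

bad=$(printf '%s\n' "$status" | grep -cvE 'state (OPTIMAL|ONLINE)')
if [ "$bad" -eq 0 ]; then echo "RAID OK"; else echo "RAID DEGRADED"; fi
```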

Curiously, I couldn't find /etc/modprobe.conf; SL6 probably uses another mechanism for this, so I just appended the command:

[root@t3ce02 etc]# echo modprobe mptctl >> /etc/rc.local

Reboot the system now if you have not already done so.

ZFS on SL6 64bit

We found it interesting to run ZFS filesystems on SL6; at the time this document was produced we used version zfs-linux-20110214.tar.bz2.

Once extracted, the ZFS .tar.bz2 lets you build RPMs, which are always appreciated by Red Hat admins, so make sure the rpm-build package is installed on your O.S. before trying to build the ZFS RPMs.

Once you have downloaded zfs-linux-20110214.tar.bz2, create the directory /opt/zfs-build in which to build the ZFS RPMs, copy the file there, extract it with tar -xjvf zfs-linux-20110214.tar.bz2, and then follow these 'macro' steps:

[root@t3ce02 zfs-build]# ll
total 19680
drwxr-xr-x  9 root root     4096 Mar  3 14:53 lzfs
drwxr-xr-x  4 root root     4096 Mar  3 14:28 misc-scripts
drwxr-xr-x 11 root root     4096 Mar  3 14:35 spl
drwxr-xr-x 14 root root     4096 Mar  3 14:32 zfs
-rw-r--r--  1 root root 20132179 Feb 14 15:28 zfs-linux-20110214.tar.bz2

cd /opt/zfs-build/lzfs
./configure && make rpm

cd /opt/zfs-build/spl
./configure && make rpm

cd /opt/zfs-build/zfs
./configure && make rpm

yum install /opt/zfs-build/spl/*.rpm
yum install /opt/zfs-build/zfs/*.rpm
yum install /opt/zfs-build/lzfs/*.rpm

Here you can see the RPMs involved in the O.S. installation so far, plus the ZFS RPMs just built and installed by yum: t3ce02.RPMs.list.after.ZFS.installation.txt.

Here is the md5sum list of the ZFS RPMs produced; all the RPMs are available at the bottom of this Wiki page, but remember that they are for SL6 64bit:

e6b0b62d710689586ee9cbbe8f6defdd  ./spl/spl-0.5.2-1.x86_64.rpm
a36c6797ba234f3935ea351c07002c61  ./spl/spl-modules-0.5.2-1_2.6.32_71.18.1.el6.x86_64.rpm
f462f15ab6c5a38db10290b38fcede8c  ./spl/spl-modules-devel-0.5.2-1_2.6.32_71.18.1.el6.x86_64.rpm
9397f335a0d33196a652e37b3a52b6ba  ./spl/spl-modules-0.5.2-1.src.rpm
e05f6da1226dd47b171b9764a15f488b  ./spl/spl-0.5.2-1.src.rpm
afe350394b3e9edd833dd15f1506e675  ./lzfs/lzfs-1.0-1.src.rpm
d3f9f6b6f0344bf95620c01b5fad3b2e  ./lzfs/lzfs-1.0-1_2.6.32_71.18.1.el6.x86_64.rpm
fecce1786206c71c20701d7872a5ca87  ./zfs/zfs-modules-0.5.1-1.src.rpm
e646e0ea853f8ce8c4166fa388dd1ecd  ./zfs/zfs-test-0.5.1-1.x86_64.rpm
1c7f4d7b34e4a8b92981b6b2bce875e4  ./zfs/zfs-0.5.1-1.x86_64.rpm
547d3680339b99ac99217f2f43e2b544  ./zfs/zfs-devel-0.5.1-1.x86_64.rpm
01368ff2a044612573481b3cb154ab58  ./zfs/zfs-0.5.1-1.src.rpm
23892b48ed147a166ac7d1b0ff3fb9ee  ./zfs/zfs-modules-devel-0.5.1-1_2.6.32_71.18.1.el6.x86_64.rpm
5e767ad12087ed18c7d72b05b39530d1  ./zfs/zfs-modules-0.5.1-1_2.6.32_71.18.1.el6.x86_64.rpm

We partitioned the rest of the disk as 'sda4' in order to create a ZFS pool there, in which ZFS filesystems are instantiated later:

[root@t3ce02 ~]# fdisk  -l
Disk /dev/sda: 146.0 GB, 145999527936 bytes
255 heads, 63 sectors/track, 17750 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000d12bc

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          64      512000   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              64        1339    10240000   82  Linux swap / Solaris
/dev/sda3            1339        2614    10240000   83  Linux
/dev/sda4            2614       17751   121584640   83  Linux

This is the command we ran to create the pool:

[root@t3ce02 ~]# zpool create -f zfspool -m /mnt/zfs sda4
[root@t3ce02 ~]# df -h 
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             9.7G  2.9G  6.3G  31% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
/dev/sda1             485M   57M  403M  13% /boot
zfspool               114G   21K  114G   1% /mnt/zfs
[root@t3ce02 ~]#

MySQL Database

MySQL ZFS filesystem

The official MySQL website reports good performance when running MySQL on ZFS, so we followed that procedure to create the ZFS filesystem that stores our MySQL data; this DB is going to be used by the SGE ARCO tool.

[root@t3ce02 zfs]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             9.7G  2.9G  6.3G  32% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
/dev/sda1             485M   57M  403M  13% /boot
zfspool               114G  5.9G  108G   6% /mnt/zfs
[root@t3ce02 zfs]# zfs create zfspool/mysql
[root@t3ce02 zfs]# zfs set recordsize=16K zfspool/mysql

MySQL RPMs

Since we prepared a ZFS filesystem for MySQL, let's continue by installing mysql-server and relocating its files onto ZFS; please follow these macro steps:

yum install mysql-server
/etc/init.d/mysqld stop
cd /var/lib
mv mysql /mnt/zfs/mysql && ln -s /mnt/zfs/mysql/mysql .
/etc/init.d/mysqld start
chkconfig mysqld on
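The relocation pattern used above (stop the service, move the directory onto ZFS, symlink it back) is generic. This sketch exercises it on temp directories with illustrative names, not the real paths:

```shell
# Move a directory under a new parent and leave a symlink behind,
# mimicking the mv + ln -s step above (illustrative paths only).
relocate() {   # usage: relocate <dir> <new_parent>
    src=$1; dest=$2
    mv "$src" "$dest/" && ln -s "$dest/$(basename "$src")" "$src"
}

work=$(mktemp -d)
mkdir -p "$work/var_lib/mysql" "$work/zfs"
relocate "$work/var_lib/mysql" "$work/zfs"
readlink "$work/var_lib/mysql"   # now points into the ZFS area
```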

Several tools can be used to manage MySQL; probably the most common choice is to deploy mysql-workbench or phpMyAdmin.

MySQL PhPMyAdmin

We liked phpMyAdmin and installed it at https://t3ce02.psi.ch/phpmyadmin/.

Features

A few phpMyAdmin features are worth reporting; with version 3.4 you can:

    * browse and drop databases, tables, views, columns and indexes
    * create, copy, drop, rename and alter databases, tables, columns and indexes
    * maintain the server, databases and tables, with proposals on server configuration
    * execute, edit and bookmark any SQL-statement, even batch-queries
    * load text files into tables
    * create and read dumps of tables
    * export data to various formats: CSV, XML, PDF, ISO/IEC 26300 - OpenDocument Text and Spreadsheet, Word, Excel and LATEX formats
    * import data and MySQL structures from Microsoft Excel and OpenDocument spreadsheets, as well as XML, CSV, and SQL files
    * administer multiple servers
    * manage MySQL users and privileges
    * check referential integrity in MyISAM tables
    * using Query-by-example (QBE), create complex queries automatically connecting required tables
    * create PDF graphics of your Database layout
    * search globally in a database or a subset of it
    * transform stored data into any format using a set of predefined functions, like displaying BLOB-data as image or download-link
    * track changes on databases, tables and views
    * support InnoDB tables and foreign keys (see FAQ 3.6)
    * support mysqli, the improved MySQL extension (see FAQ 1.17)
    * communicate in 62 different languages
    * synchronize two databases residing on the same as well as remote servers (see FAQ 9.1)

Dedicated READONLY ZFS filesystem

With ZFS you can create as many filesystems as your ZFS pool can hold, so we created a dedicated ZFS filesystem for phpMyAdmin and, once the configuration was done, set its readonly property to ON:

[root@t3ce02 ~]# cd /var/www/html/
[root@t3ce02 html]# ll
lrwxrwxrwx 1 root root 20 Mar  6 19:54 phpmyadmin -> /mnt/zfs/phpmyadmin/

[root@t3ce02 html]# df -h  /mnt/zfs/phpmyadmin/
Filesystem            Size  Used Avail Use% Mounted on
zfspool/phpmyadmin     53G   18M   53G   1% /mnt/zfs/phpmyadmin

[root@t3ce02 html]# zfs get readonly  zfspool/phpmyadmin
NAME                PROPERTY  VALUE   SOURCE
zfspool/phpmyadmin  readonly  on      local

MySQL ARCO DB

Now we can prepare the sge_arco DB and the two related MySQL users: 'arco_read', used by the ARCO web application to run queries, and 'arco_write', used by the reporting module to parse the SGE reporting file /gridware/sge/default/common/reporting and insert new rows into the sge_arco DB.

We followed the ARCO procedure for the MySQL case.

This is our final permissions layout in MySQL:

User          Host           Password   Global privileges   Grant
arco_read     %              Yes        USAGE               No
arco_read     localhost      Yes        USAGE               No
arco_write    %              Yes        ALL PRIVILEGES      Yes
arco_write    localhost      Yes        ALL PRIVILEGES      Yes
root          127.0.0.1      Yes        ALL PRIVILEGES      Yes
root          localhost      Yes        ALL PRIVILEGES      Yes
root          t3ce02         Yes        ALL PRIVILEGES      Yes
root          t3ce02.psi.ch  Yes        ALL PRIVILEGES      Yes
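As a sketch, statements along these lines roughly produce the layout in the table above. Passwords are placeholders, and this variant grants 'arco_read' only SELECT on sge_arco instead of a global privilege, so check it against the official ARCO procedure before use:

```shell
# Generate the grant statements; pipe the result into `mysql -u root -p`.
# Passwords and the SELECT-only grant for arco_read are assumptions.
sql=$(cat <<'EOF'
CREATE DATABASE IF NOT EXISTS sge_arco;
CREATE USER 'arco_write'@'localhost' IDENTIFIED BY 'CHANGE_ME';
CREATE USER 'arco_read'@'localhost'  IDENTIFIED BY 'CHANGE_ME';
GRANT ALL PRIVILEGES ON sge_arco.* TO 'arco_write'@'localhost';
GRANT SELECT ON sge_arco.* TO 'arco_read'@'localhost';
EOF
)
printf '%s\n' "$sql"
```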

MySQL Query logging

To debug what is happening in our DB it is worth enabling MySQL's query logging feature; this is our /etc/my.cnf, note the 'log' entry:

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
log=/var/lib/mysql/general.log
# Disabling symbolic-links is recommended to prevent assorted security risks
# symbolic-links=0
#
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

At run time we can use the 'tail' command to follow the queries:

[root@t3ce02 sun]# tail -f /mnt/zfs/mysql/mysql/general.log
/usr/libexec/mysqld, Version: 5.1.52-log (Source distribution). started with:
Tcp port: 0  Unix socket: /var/lib/mysql/mysql.sock
Time                 Id Command    Argument
110303 17:48:13       1 Connect   Access denied for user 'UNKNOWN_MYSQL_USER'@'localhost' (using password: NO)

Sun Web Console installation

The first step in installing SGE and SGE ARCO is to deploy the Sun Web Console, basically a Java framework developed by Sun to host their Java web applications. An installation procedure is available online, but we preferred to report the steps here.

So starting from these SGE6.2u5 files in /opt:

[root@t3ce02 SGE6.2u5]# ll
total 221396
-rw-r--r-- 1 root root  3865332 Feb 24 10:20 sdm10u5_core_rpm.zip
-rw-r--r-- 1 root root  3868219 Feb 24 10:20 sdm10u5_core_targz.zip
-rw-r--r-- 1 root root 10271047 Feb 24 10:20 sge62u5_arco_rpm.zip
-rw-r--r-- 1 root root 10305829 Feb 24 10:20 sge62u5_arco_targz.zip
-rw-r--r-- 1 root root 18839411 Feb 24 10:21 sge62u5_inspect_rpm.zip
-rw-r--r-- 1 root root 18899376 Feb 24 10:21 sge62u5_inspect_targz.zip
-rw-r--r-- 1 root root 29514366 Feb 24 11:17 sge62u5_linux24-i586_rpm.zip
-rw-r--r-- 1 root root 29533073 Feb 24 10:20 sge62u5_linux24-x64_rpm.zip
-rw-r--r-- 1 root root 34009465 Feb 24 10:20 sge62u5_sources+gpl-code_targz.zip
-rw-r--r-- 1 root root 67576445 Feb 24 10:21 webconsole3.0.2-linux.targz.zip

[root@t3ce02 SGE6.2u5]# md5sum *
c89ab2b3db585a5df092ac3399bcdb21  sdm10u5_core_rpm.zip
0bbccb40251dd189c22496d5f945c4f6  sdm10u5_core_targz.zip
188d3e28313b629f19dae761a8b6522b  sge62u5_arco_rpm.zip
e24d3b8e7e11447312771c3cdaf03687  sge62u5_arco_targz.zip
fe8f85829bb57938e8edc09186a93afa  sge62u5_inspect_rpm.zip
d40484210cde65a880e3eab86651ab9e  sge62u5_inspect_targz.zip
68f232beeb66a94c12f286860f07185e  sge62u5_linux24-i586_rpm.zip
23a81889b532253f1a1573ac3145111b  sge62u5_linux24-x64_rpm.zip
0d1fd15da1aee3bb159eb0b5dccae0cb  sge62u5_sources+gpl-code_targz.zip
b931ec2bde0137ebaeae4c4669a65df1  webconsole3.0.2-linux.targz.zip
[root@t3ce02 SGE6.2u5]#

Let's unzip the webconsole package:

[root@t3ce02 SGE6.2u5]# unzip webconsole3.0.2-linux.targz.zip
Archive:  webconsole3.0.2-linux.targz.zip
  inflating: sge6_2u5/webconsole3.0.2-linux.tar.gz  
[root@t3ce02 SGE6.2u5]# cd sge6_2u5/
[root@t3ce02 sge6_2u5]# tar -xzvf webconsole3.0.2-linux.tar.gz 
SUNWjato-2.1.5.i386.rpm
SUNWjatodmo-2.1.5.i386.rpm
SUNWjatodoc-2.1.5.i386.rpm
SUNWmcon-3.0.2-5.i386.rpm
SUNWmconr-3.0.2-5.i386.rpm
SUNWmcos-3.0.2-5.i386.rpm
SUNWmcosx-3.0.2-5.i386.rpm
SUNWmctag-3.0.2-5.i386.rpm
config_properties.tpl
jdk-1_5_0_04-linux-i586.rpm
setup
sun-javahelp-2.0_01-fcs.i586.rpm
.pkgrc
.setup_default
[root@t3ce02 sge6_2u5]#

Be sure to install the pam.i686 RPM, because the Sun Web Console is 32bit software, and then install the framework:

[root@t3ce02 sge6_2u5]# ./setup 
Preparing packages for installation...
jdk-1.5.0_04-fcs
Preparing packages for installation...
sun-javahelp-2.0-fcs
Linking JavaHelp to /usr/java/jdk1.5.0_04 ...
Preparing packages for installation...
SUNWjato-2.1.5-9
Preparing packages for installation...
SUNWjatodoc-2.1.5-9
Preparing packages for installation...
SUNWjatodmo-2.1.5-9
Preparing packages for installation...
SUNWmctag-3.0.2-5
Preparing packages for installation...
SUNWmconr-3.0.2-5
Preparing packages for installation...
SUNWmcon-3.0.2-5
Preparing packages for installation...
SUNWmcos-3.0.2-5
Preparing packages for installation...
SUNWmcosx-3.0.2-5

Installation complete.

Starting Sun Java(TM) Web Console Version 3.0.2 ...
The console is running.
[root@t3ce02 sge6_2u5]#

The Sun Web Console is listening on TCP 6789:

[root@t3ce02 sge6_2u5]# netstat -tpln |grep java
tcp        0      0 ::ffff:127.0.0.1:41086      :::*                        LISTEN      7013/java           
tcp        0      0 :::6788                     :::*                        LISTEN      7013/java           
tcp        0      0 :::6789                     :::*                        LISTEN      7013/java           

and you can log in with your Linux root credentials by pointing your browser to https://t3ce02.psi.ch:6789/

Here you can see the Sun Web Console logs:

[root@t3ce02 sun]# tail /var/log/webconsole/console/console_debug_log
==============================================================
Java Web Console Version 3.0.2 started on Thu Mar  3 17:17:05 CET 2011
==============================================================
[root@t3ce02 sun]#

and here are the applications registered in the console:

[root@t3ce common]# wcadmin list -a
 
Deployed web applications (application name, context name, status):
 
    com.sun.grid.arco_6.2u5  reporting       [running]
    console           ROOT            [running]
    console           com_sun_web_ui  [running]
    console           console         [running]
    console           manager         [running]

More info at this OpenSolaris link

SGE QMASTER 6.2u5 installation

Now we can install SGE; please follow these steps:

[root@t3ce02 SGE6.2u5]# unzip sge62u5_linux24-x64_rpm.zip
Archive:  sge62u5_linux24-x64_rpm.zip
  inflating: sge6_2u5/sun-sge-bin-linux24-x64-6.2-5.x86_64.rpm  
  inflating: sge6_2u5/sun-sge-common-6.2-5.noarch.rpm  
[root@t3ce02 SGE6.2u5]# cd sge6_2u5/
[root@t3ce02 sge6_2u5]# ll
total 161640
-r--r--r-- 1 root   bin       1235 Dec  9  2006 config_properties.tpl
-rw-r--r-- 1 102852 wheel 47286234 Jul 27  2005 jdk-1_5_0_04-linux-i586.rpm
-r-xr-xr-x 1 root   bin      48781 Dec  9  2006 setup
-rw-r--r-- 1   5074 wheel  6340876 May 11  2004 sun-javahelp-2.0_01-fcs.i586.rpm
-rw-r--r-- 1 root   root  25583219 Dec 15  2009 sun-sge-bin-linux24-x64-6.2-5.x86_64.rpm
-rw-r--r-- 1 root   root   4161238 Dec 15  2009 sun-sge-common-6.2-5.noarch.rpm
-r--r--r-- 1 root   bin     731610 Nov  8  2005 SUNWjato-2.1.5.i386.rpm
-r--r--r-- 1 root   bin    1216562 Nov  8  2005 SUNWjatodmo-2.1.5.i386.rpm
-r--r--r-- 1 root   bin    1049729 Nov  8  2005 SUNWjatodoc-2.1.5.i386.rpm
-rw-rw-r-- 1 root   bin   10504152 Dec  9  2006 SUNWmcon-3.0.2-5.i386.rpm
-rw-rw-r-- 1 root   bin      29130 Dec  9  2006 SUNWmconr-3.0.2-5.i386.rpm
-rw-rw-r-- 1 root   bin      46593 Dec  9  2006 SUNWmcos-3.0.2-5.i386.rpm
-rw-rw-r-- 1 root   bin       3803 Dec  9  2006 SUNWmcosx-3.0.2-5.i386.rpm
-rw-rw-r-- 1 root   bin     919212 Dec  9  2006 SUNWmctag-3.0.2-5.i386.rpm
-rw-r--r-- 1 root   root  67566632 Dec 15  2009 webconsole3.0.2-linux.tar.gz
[root@t3ce02 sge6_2u5]# yum install sun-sge-bin-linux24-x64-6.2-5.x86_64.rpm sun-sge-common-6.2-5.noarch.rpm
...
Dependencies Resolved

================================================================================================================================
 Package                          Arch            Version                  Repository                                      Size
================================================================================================================================
Installing:
 sun-sge-bin-linux24-x64          x86_64          6.2-5                    /sun-sge-bin-linux24-x64-6.2-5.x86_64           61 M
 sun-sge-common                   noarch          6.2-5                    /sun-sge-common-6.2-5.noarch                    11 M
Installing for dependencies:
 ksh                              x86_64          20100621-2.el6           sl                                             655 k
 libXp                            x86_64          1.0.0-15.1.el6           sl                                              22 k
 libXpm                           x86_64          3.5.8-2.el6              sl                                              58 k
 openmotif22                      x86_64          2.2.3-19.el6             sl                                             1.2 M
 tcl                              x86_64          1:8.5.7-6.el6            sl                                             1.9 M

...
Complete!

Move the SGE installation onto the ZFS filesystem:

[root@t3ce02 /]# mv gridware/ /mnt/zfs/sge/ && ln -s /mnt/zfs/sge/gridware .
[root@t3ce02 /]# ll gridware
lrwxrwxrwx 1 root root 21 Mar  3 17:58 gridware -> /mnt/zfs/sge/gridware

Let's install SGE by running the start_gui_installer script; this is the final configuration we used:

Grid Engine cluster configuration
Grid Engine root directory ($SGE_ROOT)
/mnt/zfs/sge/gridware/sge
Cell name ($SGE_CELL)
default
Cluster name ($SGE_CLUSTER_NAME)
p6444
Qmaster port ($SGE_QMASTER_PORT)
6444
Execd port ($SGE_EXECD_PORT)
6445
Group id range ($SGE_GID_RANGE)
20000-20100
Qmaster spool directory
/mnt/zfs/sge/gridware/sge/default/spool/qmaster
Global execd spool directory
/mnt/zfs/sge/gridware/sge/default/spool
Spooling method
berkeleydb
Spooling directory
/mnt/zfs/sge/gridware/sge/default/spool/spooldb
JMX port
6446
JVM library path
/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/lib/amd64/server/libjvm.so
JMX SSL server keystore path
/var/sgeCA/port6444/default/private/keystore
Administrator mail
fabio.martinelli@psi.ch
 

Succeeded:
    Qmaster host: t3ce02.psi.ch
    Execution host(s): t3ce02.psi.ch
    Shadow host(s): -
    Berkeley db host: -
    Admin host(s): t3ce02.psi.ch
    Submit host(s): t3ce02.psi.ch
Failed: -

How to start with Grid Engine
Set the environment:
... if you are a csh/tcsh user: source /mnt/zfs/sge/gridware/sge/default/common/settings.csh
... if you are a sh/ksh user: . /mnt/zfs/sge/gridware/sge/default/common/settings.sh
This will set or expand the following environment variables:
$SGE_ROOT (always necessary)
$SGE_CELL (if you are using a cell other than default)
$SGE_CLUSTER_NAME (always necessary)
$SGE_QMASTER_PORT (if you haven't added the service sge_qmaster)
$SGE_EXECD_PORT (if you haven't added the service sge_execd)
$PATH/$path (to find the Grid Engine binaries)
$MANPATH (to access the manual pages)
 
Submit one of the sample scripts contained in the /mnt/zfs/sge/gridware/sge/examples/jobs directory:
qsub /mnt/zfs/sge/gridware/sge/examples/jobs/simple.sh
or
qsub /mnt/zfs/sge/gridware/sge/examples/jobs/sleeper.sh
 
Use the qstat command to monitor the job's behavior. qstat -f
 
After the job finishes executing, check your home directory for the redirected stdout/stderr files script-name.ejob-id and script-name.ojob-id. The job-id is a consecutive unique integer number assigned to each job.
Administering Grid Engine
Grid Engine startup scripts can be found at:
Qmaster: /mnt/zfs/sge/gridware/sge/default/common/sgemaster start/stop
Exec daemon: /mnt/zfs/sge/gridware/sge/default/common/sgeexecd start/stop
After startup the daemons log their messages in their spool directories.
Qmaster: /mnt/zfs/sge/gridware/sge/default/spool/qmaster/messages
Exec daemon: //messages
Useful links
Sun Grid Engine Information Center
http://wikis.sun.com/display/SunGridEngine/Home
Grid Engine project home
http://gridengine.sunsource.net

SGE setting scripts

Please create these symbolic links:

[root@t3ce02 profile.d]# pwd
/etc/profile.d
[root@t3ce02 profile.d]# ll se*
lrwxrwxrwx 1 root root 53 Mar  3 18:03 settings.csh -> /mnt/zfs/sge/gridware/sge/default/common/settings.csh
lrwxrwxrwx 1 root root 52 Mar  3 18:03 settings.sh -> /mnt/zfs/sge/gridware/sge/default/common/settings.sh
[root@t3ce02 profile.d]#

then log out and log back in via SSH.

SGE configuration tuning

Now, to enable the SGE reporting file and to keep job logs on the server where each job ran, we tuned the SGE configuration with the 'qconf -mconf' command, taking into account this fragment:

...
execd_params                 KEEP_ACTIVE=1 ENABLE_ADDGRP_KILL=TRUE H_MEMORYLOCKED=infinity
reporting_params             accounting=true reporting=true flush_time=00:00:15 joblog=true sharelog=00:00:00
...

and

[root@t3ce02 sge6_2u5]# qconf -se global
hostname              global
load_scaling          NONE
complex_values        NONE
load_values           NONE
processors            0
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      cpu,np_load_avg,mem_free,virtual_free
[root@t3ce02 sge6_2u5]#

and on the SGE scheduler

[root@t3ce02 ~]# qconf -ssconf
...
schedd_job_info                   true
...

SGE dbwriter

Once the SGE master is working properly we can install the dbwriter tool, which involves the MySQL user 'arco_write', and the ARCO reporting software, which involves the MySQL user 'arco_read' and the Sun Web Console. Please refer to the official SGE documentation.

Reporting file duplication for safety reasons

Once up and running, dbwriter parses and eventually deletes the SGE reporting file. Even though this is the nominal behaviour of the software, we did not like this side effect, so we started the following command to preserve the reporting file content in another file; the tail terminates when sge_qmaster terminates:

nohup tail --pid=$(pidof sge_qmaster) -n 0 -F /gridware/sge/default/common/reporting >> /gridware/sge/default/common/reporting.not.deleted.by.dbwriter &
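The duplicated file can then be inspected offline. The reporting file is colon-delimited (first field an epoch timestamp, second the record type, e.g. acct or job_log); treat that layout as an assumption here and verify it against the reporting(5) man page. A quick way to see which record types accumulate:

```shell
# Count record types in reporting data; fabricated sample lines stand
# in for the live file (colon-delimited layout assumed per reporting(5)).
sample='1307000000:job_log:...
1307000010:acct:...
1307000020:acct:...'

printf '%s\n' "$sample" | awk -F: '{n[$2]++} END {for (t in n) print t, n[t]}' | sort
```

On the real file, run the same awk command against reporting.not.deleted.by.dbwriter instead of the sample.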

RPM installation

This was our installation experience:

[root@t3ce02 SGE6.2u5]# unzip sge62u5_arco_rpm.zip
Archive:  sge62u5_arco_rpm.zip
  inflating: sge6_2u5/sun-sge-arco-6.2-5.noarch.rpm  
[root@t3ce02 SGE6.2u5]# cd sge6_2u5
[root@t3ce02 sge6_2u5]# yum install sun-sge-arco-6.2-5.noarch.rpm
Setting up Install Process
Examining sun-sge-arco-6.2-5.noarch.rpm: sun-sge-arco-6.2-5.noarch
Marking sun-sge-arco-6.2-5.noarch.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package sun-sge-arco.noarch 0:6.2-5 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================================================================
 Package                       Arch                    Version                Repository                                   Size
================================================================================================================================
Installing:
 sun-sge-arco                  noarch                  6.2-5                  /sun-sge-arco-6.2-5.noarch                   19 M

Transaction Summary
================================================================================================================================
Install       1 Package(s)
Upgrade       0 Package(s)

Total size: 19 M
Installed size: 19 M
Is this ok [y/N]: y
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing     : sun-sge-arco-6.2-5.noarch                                                                                1/1 

Installed:
  sun-sge-arco.noarch 0:6.2-5                                                                                                   

Complete!
[root@t3ce02 sge6_2u5]#

MySQL JDBC driver

Make sure you have a MySQL JDBC driver file and link it inside the SGE dbwriter directory:

[root@t3ce02 sge6_2u5]# yum install mysql-connector-java.x86_64
...
[root@t3ce02 lib]# pwd
/mnt/zfs/sge/gridware/sge/dbwriter/lib

[root@t3ce02 lib]# ln -s /usr/share/java/mysql-connector-java.jar

Installation /inst_dbwriter

During the dbwriter installation itself, which is well documented on the official SGE website, we were prompted for several settings; one is which Java to use, and there we specified '/etc/alternatives/jre/' so as to be protected from Java updates. So we ran:

cd $SGE_ROOT/dbwriter && ./inst_dbwriter
...
All parameters are now collected
--------------------------------

        SGE_ROOT=/mnt/zfs/sge/gridware/sge
        SGE_CELL=default
       JAVA_HOME=/etc/alternatives/jre (1.6.0_17)
          DB_URL=jdbc:mysql://localhost:3306/sge_arco
         DB_USER=arco_write
       READ_USER=arco_read
        INTERVAL=120
       SPOOL_DIR=/mnt/zfs/sge/gridware/sge/default/spool/dbwriter
    DERIVED_FILE=/mnt/zfs/sge/gridware/sge/dbwriter/database/mysql/dbwriter.xml
     DEBUG_LEVEL=FINE

Are these settings correct? (y/n) [y] >>

Please note the phase in which the MySQL sge_arco tables and views are created:

Update version table
commiting changes
Version 6.1u3 (id=6) successfully installed
Install version 6.1u4 (id=7) -------
Create table sge_version
Insert first value in the checkpoint table
Update version table
commiting changes
Version 6.1u4 (id=7) successfully installed
Install version 6.2 (id=8) -------
Drop primary key constraint on sge_version table
Create compound primary key for sge_version
Create table sge_ar
Create index sge_ar_idx0 on column ar_number
Create index sge_ar_idx1 on column ar_owner
Create table sge_ar_attribute
Create index sge_ar_attribute_idx0 on column ara_end_time
Create table sge_ar_usage
Create table sge_ar_log
Create index sge_ar_log_idx0 on column arl_event
Create table sge_ar_resource_usage
Add the column ju_ar_parent to sge_job_usage table
Create index sge_job_usage_idx2 on column ju_ar_parent
Drop view view_job_times
Drop view view_accounting
Drop view view_jobs_completed
Update view view_accounting
Create view view_job_times_subquery
Update view view_job_times
Update view view_jobs_completed
Create view view_ar_attribute
Create view view_ar_log
Create view view_ar_usage
Create view view_ar_resource_usage

Create view view_ar_time_usage
Drop the column ju_state from sge_job_usage table
Drop the column j_open from sge_job table
Updating derived host values variable h_jobs to h_jobs_finished
Update version table
commiting changes
Version 6.2 (id=8) successfully installed
Install version 6.1u6 (id=9) -------
Extend too small integer field sge_department_values.dv_id,
drop temporarily constraint for foreign key sge_department_values.dv_parent
and extend too small integer field sge_department_values.dv_parent
Extend too small integer field sge_department.d_id
Recreate foreign key sge_department_values.dv_parent
Extend too small integer field sge_group_values.gv_id,
drop temporarily constraint for foreign key sge_group_values.gv_parent
and extend too small integer field sge_group_values.gv_parent
Extend too small integer field sge_group.g_id
Recreate foreign key sge_group_values.gv_parent
Extend too small integer field sge_host_values.hv_id,
drop temporarily constraint for foreign key sge_host_values.hv_parent
and extend too small integer field sge_host_values.hv_parent
Extend too small integer field sge_host.h_id
Recreate foreign key sge_host_values.hv_parent
Extend too small integer field sge_job_log.jl_id,
drop temporarily constraint for foreign key sge_job_log.jl_parent
and extend too small integer field sge_job_log.jl_parent
Extend too small integer field sge_job_request.jr_id,
drop temporarily constraint for foreign key sge_job_request.jr_parent
and extend too small integer field sge_job_request.jr_parent
Extend too small integer field sge_job_usage.ju_id,
drop temporarily constraint for foreign key sge_job_usage.ju_parent
and extend too small integer field sge_job_usage.ju_parent
Extend too small integer field sge_job.j_id
Recreate foreign key sge_job_log.jl_parent
Recreate foreign key sge_job_request.jr_parent
Recreate foreign key sge_job_usage.ju_parent
Extend too small integer field sge_project_values.pv_id,
drop temporarily constraint for foreign key sge_project_values.pv_parent
and extend too small integer field sge_project_values.pv_parent
Extend too small integer field sge_project.p_id
Recreate foreign key sge_project_values.pv_parent
Extend too small integer field sge_queue_values.qv_id,
drop temporarily constraint for foreign key sge_queue_values.qv_parent
and extend too small integer field sge_queue_values.qv_parent
Extend too small integer field sge_queue.q_id
Recreate foreign key sge_queue_values.qv_parent
Extend too small integer field sge_share_log.sl_id
Extend too small integer field sge_statistic_values.sv_id,
drop temporarily constraint for foreign key sge_statistic_values.sv_parent
and extend too small integer field sge_statistic_values.sv_parent
Extend too small integer field sge_statistic.s_id
Recreate foreign key sge_statistic_values.sv_parent
Extend too small integer field sge_user_values.uv_id,
drop temporarily constraint for foreign key sge_user_values.uv_parent
and extend too small integer field sge_user_values.uv_parent
Extend too small integer field sge_user.u_id
Recreate foreign key sge_user_values.uv_parent
Update version table
commiting changes
Version 6.1u6 (id=9) successfully installed
Install version 6.2u1 (id=10) -------
Extend too small integer field sge_ar_attribute.ara_id,
drop temporarily constraint for foreign key sge_ar_attribute.ara_parent
and extend too small integer field sge_ar_attribute.ara_parent
Extend too small integer field sge_ar_log.arl_id,
drop temporarily constraint for foreign key sge_ar_log.arl_parent
and extend too small integer field sge_ar_log.arl_parent
Extend too small integer field sge_ar_resource_usage.arru_id,
drop temporarily constraint for foreign key sge_ar_resource_usage.arru_parent
and extend too small integer field sge_ar_resource_usage.arru_parent
Extend too small integer field sge_ar_usage.aru_id,
drop temporarily constraint for foreign key sge_ar_usage.aru_parent
and extend too small integer field sge_ar_usage.aru_parent
Extend too small integer field sge_ar.ar_id
Extend too small integer field sge_job_usage.ju_parent and sge_job_usage.ju_ar_parent
Recreate foreign key sge_ar_attribute.ara_parent
Recreate foreign key sge_ar_log.arl_parent
Recreate foreign key sge_ar_resource_usage.arru_parent
Recreate foreign key sge_ar_usage.aru_parent
Drop primary key constraint on sge_version table
Create compound primary key for sge_version
Update version table
commiting changes
Version 6.2u1 (id=10) successfully installed
OK

Create start script sgedbwriter in /mnt/zfs/sge/gridware/sge/default/common

Create configuration file for dbwriter in /mnt/zfs/sge/gridware/sge/default/common

Hit <RETURN> to continue >>

When the dbwriter installation completed, we got:

dbwriter startup script
-----------------------

We can install the startup script that will
start dbwriter at machine boot (y/n) [y] >> 

cp /mnt/zfs/sge/gridware/sge/default/common/sgedbwriter /etc/init.d/sgedbwriter.p6444
/usr/lib/lsb/install_initd /etc/init.d/sgedbwriter.p6444
Creating dbwriter spool directory /mnt/zfs/sge/gridware/sge/default/spool/dbwriter
starting dbwriter
dbwriter started (pid=11098)
Installation of dbwriter completed
[root@t3ce02 dbwriter]#

Checking dbwriter logs

The dbwriter program is now a service on your system; you can start/stop it with:

/etc/init.d/sgedbwriter.p6444

You can double-check what's going on by running tail on these two log files:

[root@t3ce02 ~]# tail -f /mnt/zfs/sge/gridware/sge/default/spool/dbwriter/dbwriter.log
06/03/2011 16:24:51|t3ce02.psi.ch|ivedValueThread.commitExecuted|D|new object received, timestampOfLastRowData is 1,299,428,609,000
06/03/2011 16:24:51|t3ce02.psi.ch|iter.file.FileParser.parseFile|I|Deleting file reporting.processing
06/03/2011 16:24:51|t3ce02.psi.ch|.RecordCache.getStoredDBRecord|D|Object for key 'dbwriter' = [sge_statistic, id=1, parent=0, key=['dbwriter'], addr=0x7f712b3a]
06/03/2011 16:24:51|t3ce02.psi.ch|le.FileParser.createStatistics|I|Processed 6 lines in 0s (1500 lines/s)
06/03/2011 16:24:51|t3ce02.psi.ch|ter.RecordManager.executeBatch|D|Batch success. Number of statements executed: 0 table: 'sge_host_values'
06/03/2011 16:24:51|t3ce02.psi.ch|ter.RecordManager.executeBatch|D|Batch success. Number of statements executed: 1 table: 'sge_statistic_values'
06/03/2011 16:24:51|t3ce02.psi.ch|r.Controller.flushBatchesAtEnd|D|All Batches flushed and commited
06/03/2011 16:24:51|t3ce02.psi.ch|ng.dbwriter.db.Database.commit|D|Thread dbwriter commits Connection 3 (null@jdbc:mysql://localhost:3306/sge_arco)
06/03/2011 16:24:51|t3ce02.psi.ch|g.dbwriter.db.Database.release|D|Thread dbwriter releases Connection 3 (null@jdbc:mysql://localhost:3306/sge_arco)
06/03/2011 16:24:51|t3ce02.psi.ch|ter.ReportingDBWriter.mainLoop|C|Sleeping for 119,992 milli seconds

[root@t3ce02 ~]# tail -f /mnt/zfs/mysql/mysql/general.log
          3 Query   INSERT INTO sge_host_values (hv_id, hv_parent, hv_time_start, hv_time_end, hv_variable, hv_svalue, hv_dvalue, hv_dconfig) VALUES (97031, 37, '2011-03-06 16:25:29', '2011-03-06 16:25:29', 'mem_free', '1805.183594M', 1.892872192262144E+9, 0.0)
          3 Query   INSERT INTO sge_host_values (hv_id, hv_parent, hv_time_start, hv_time_end, hv_variable, hv_svalue, hv_dvalue, hv_dconfig) VALUES (97032, 37, '2011-03-06 16:25:29', '2011-03-06 16:25:29', 'virtual_free', '4127.070312M', 4.327546879475712E+9, 0.0)
          3 Query   INSERT INTO sge_host_values (hv_id, hv_parent, hv_time_start, hv_time_end, hv_variable, hv_svalue, hv_dvalue, hv_dconfig) VALUES (97033, 37, '2011-03-06 16:26:09', '2011-03-06 16:26:09', 'cpu', '0.000000', 0.0, 0.0)
          3 Query   INSERT INTO sge_host_values (hv_id, hv_parent, hv_time_start, hv_time_end, hv_variable, hv_svalue, hv_dvalue, hv_dconfig) VALUES (97034, 37, '2011-03-06 16:26:09', '2011-03-06 16:26:09', 'np_load_avg', '0.000000', 0.0, 0.0)
          3 Query   INSERT INTO sge_host_values (hv_id, hv_parent, hv_time_start, hv_time_end, hv_variable, hv_svalue, hv_dvalue, hv_dconfig) VALUES (97035, 37, '2011-03-06 16:26:09', '2011-03-06 16:26:09', 'mem_free', '1805.183594M', 1.892872192262144E+9, 0.0)
          3 Query   INSERT INTO sge_host_values (hv_id, hv_parent, hv_time_start, hv_time_end, hv_variable, hv_svalue, hv_dvalue, hv_dconfig) VALUES (97036, 37, '2011-03-06 16:26:09', '2011-03-06 16:26:09', 'virtual_free', '4127.070312M', 4.327546879475712E+9, 0.0)
          3 Query   UPDATE sge_checkpoint SET ch_line = 0, ch_time = '2011-03-06 16:26:51' WHERE ch_id=1
          3 Query   commit
          3 Query   INSERT INTO sge_statistic_values (sv_id, sv_parent, sv_time_start, sv_time_end, sv_variable, sv_dvalue) VALUES (2357, 1, '2011-03-06 16:26:51', '2011-03-06 16:26:51', 'lines_per_second', 1166.6666666666667)
          3 Query   commit

dbwriter RE-importing a reporting file

If for whatever reason you need to re-ingest the reporting file into the MySQL DB sge_arco, then first run this sequence of TRUNCATE statements:

TRUNCATE `sge_checkpoint`;
TRUNCATE `sge_department`;
TRUNCATE `sge_group`;
TRUNCATE `sge_host`;
TRUNCATE `sge_host_values`;
TRUNCATE `sge_job`;
TRUNCATE `sge_job_log`;
TRUNCATE `sge_job_request`;
TRUNCATE `sge_job_usage`;
TRUNCATE `sge_project`;
TRUNCATE `sge_project_values`;
TRUNCATE `sge_queue`;
TRUNCATE `sge_queue_values`;
TRUNCATE `sge_statistic`;
TRUNCATE `sge_statistic_values`;
TRUNCATE `sge_user`;
TRUNCATE `sge_user_values`;
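Since the table list is long, a small helper can generate the statements above into a file to feed to MySQL; a minimal sketch (the MySQL credentials in the comment are placeholders):

```shell
# Generate the TRUNCATE list above into truncates.sql, to be applied with e.g.:
#   mysql -u root -p sge_arco < truncates.sql
tables="sge_checkpoint sge_department sge_group sge_host sge_host_values \
sge_job sge_job_log sge_job_request sge_job_usage sge_project sge_project_values \
sge_queue sge_queue_values sge_statistic sge_statistic_values sge_user sge_user_values"
for t in $tables; do
    printf 'TRUNCATE `%s`;\n' "$t"    # backticks are literal inside single quotes
done > truncates.sql
```

Remember to stop dbwriter first (/etc/init.d/sgedbwriter.p6444 stop) and restart it afterwards, so that it can re-parse the reporting file from scratch.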

SGE ARCO

Now it's time to install the reporting layer; please have a look at the official ARCO documentation.

Here follows our installation experience:

MySQL JDBC driver

ARCO is a Java application that needs to communicate with MySQL, so we created another symbolic link, as in the dbwriter case:

[root@t3ce02 ~]# ll /mnt/zfs/sge/gridware/sge/reporting/WEB-INF/lib/mysql-connector-java.jar
lrwxrwxrwx 1 root root 40 Mar  3 20:30 /mnt/zfs/sge/gridware/sge/reporting/WEB-INF/lib/mysql-connector-java.jar -> /usr/share/java/mysql-connector-java.jar
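A sketch of how such a link is created, demonstrated here under /tmp so it can be run anywhere; on the real host the target is the RPM-provided /usr/share/java/mysql-connector-java.jar and the link lives in /mnt/zfs/sge/gridware/sge/reporting/WEB-INF/lib/:

```shell
# Demo directory standing in for the real paths above
mkdir -p /tmp/arco-demo/WEB-INF/lib
touch /tmp/arco-demo/mysql-connector-java.jar        # stand-in for the RPM jar
# -sfn: symbolic, force-replace an existing link, don't dereference it
ln -sfn /tmp/arco-demo/mysql-connector-java.jar \
        /tmp/arco-demo/WEB-INF/lib/mysql-connector-java.jar
```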

a link that was properly recognized by the ARCO installation procedure:

...
Searching for the jdbc driver com.mysql.jdbc.Driver 
in directory /mnt/zfs/sge/gridware/sge/reporting/WEB-INF/lib 

OK, jdbc driver found

Should the connection to the database be tested? (y/n) [y] >> 

Test database connection to 'jdbc:mysql://localhost:3306/sge_arco' ... OK

Hit <RETURN> to continue >> 

DB parameters are now collected
-------------------------------
    CLUSTER_NAME=T3_PSI_CH
          DB_URL=jdbc:mysql://localhost:3306/sge_arco
         DB_USER=arco_read

Are these settings correct? (y/n) [y] >> 

Do you want to add another cluster? (y/n) [n] >>n

Configure users with write access
---------------------------------

Users: default
Enter a user login name. (Hit <RETURN> to finish) >> root

Users: default root
Enter a user login name. (Hit <RETURN> to finish) >> martinelli_f

Users: default root martinelli_f
Enter a user login name. (Hit <RETURN> to finish) >> 

All parameters are now collected
--------------------------------
       SPOOL_DIR=/var/spool/arco
      APPL_USERS=default root martinelli_f

Are these settings correct? (y/n) [y] >> 

   found incorrect permissions lrwxrwxrwx for /mnt/zfs/sge/gridware/sge/reporting/WEB-INF/lib/mysql-connector-java.jar
   Correcting file permissions ... done

Standard ARCO Queries

The SGE engineers designed some standard queries useful for any kind of SGE cluster:

....
Install predefined queries
--------------------------

Directory /var/spool/arco does not exist, create it? (y/n) [y] >> y

Create directory /var/spool/arco
Create directory /var/spool/arco/queries
Copy examples queries into /var/spool/arco/queries
Copy query Accounting_per_AR.xml ... OK
Copy query Accounting_per_Department.xml ... OK
Copy query Accounting_per_Project.xml ... OK
Copy query Accounting_per_User.xml ... OK
Copy query AR_Attributes.xml ... OK
Copy query AR_by_User.xml ... OK
Copy query AR_Log.xml ... OK
Copy query AR_Reserved_Time_Usage.xml ... OK
Copy query Average_Job_Turnaround_Time.xml ... OK
Copy query Average_Job_Wait_Time.xml ... OK
Copy query DBWriter_Performance.xml ... OK
Copy query Host_Load.xml ... OK
Copy query Job_Log.xml ... OK
Copy query Number_of_Jobs_Completed_per_AR.xml ... OK
Copy query Number_of_Jobs_completed.xml ... OK
Copy query Queue_Consumables.xml ... OK
Copy query Statistic_History.xml ... OK
Copy query Statistics.xml ... OK
Copy query Wallclock_time.xml ... OK
Create directory /var/spool/arco/results

Hit <RETURN> to continue >> 

ARCo reporting module setup
---------------------------
Copying ARCo reporting file into /mnt/zfs/sge/gridware/sge/default/arco/reporting

Setting up ARCo reporting configuration file. After registration of
the ARCo reporting module in the Sun Java Web Console you can find 
this file at

      /mnt/zfs/sge/gridware/sge/default/arco/reporting/config.xml

Hit <RETURN> to continue >> 

Importing Sun Java Web Console 3.0 or 3.1 files
-----------------------------------------------
Imported files to /mnt/zfs/sge/gridware/sge/default/arco/reporting
Created product images in /mnt/zfs/sge/gridware/sge/default/arco/reporting/com_sun_web_ui/images

Hit <RETURN> to continue >> 

Registering the SGE reporting module in the Sun Java Web Console
----------------------------------------------------------------
The reporting web application has been successfully deployed.
Set 1 properties for the com.sun.grid.arco_6.2u5 application.
Set 1 properties for the com.sun.grid.arco_6.2u5 application.
Set 1 properties for the com.sun.grid.arco_6.2u5 application.
Creating the TOC file ... OK

Restarting Sun Java Web Console
-------------------------------
Shutting down Sun Java(TM) Web Console Version 3.0.2 ...
Starting Sun Java(TM) Web Console Version 3.0.2 ...
The console is running.
SGE  ARCo reporting successfully installed

Finally, the ARCO web access

At the end of the installation script we were able to access ARCO at https://t3ce02.psi.ch:6789/

SGE, importing a previous reporting file

It's possible to ingest a previous reporting file coming from another SGE installation; since we had one from our old cluster, we ingested more than 1.5 years of statistics in this way:

[root@t3ce02 common]# ll /root/reporting 
-rw-r--r--. 1 root root 740664403 Feb 28 23:08 /root/reporting
[root@t3ce02 common]# pwd
/mnt/zfs/sge/gridware/sge/default/common
[root@t3ce02 common]# cp -p /root/reporting .
cp: overwrite `./reporting'? y

SGE, inspect tool

It's possible to graphically monitor several SGE clusters and their queues by using the Java tool Inspect, which we installed in the following way:

[root@t3ce02 SGE6.2u5]# unzip sge62u5_inspect_rpm.zip
Archive:  sge62u5_inspect_rpm.zip
  inflating: sge6_2u5/sun-sge-inspect-6.2-5.noarch.rpm  
[root@t3ce02 SGE6.2u5]# yum install sge6_2u5/sun-sge-inspect-6.2-5.noarch.rpm  
Setting up Install Process
Examining sge6_2u5/sun-sge-inspect-6.2-5.noarch.rpm: sun-sge-inspect-6.2-5.noarch
Marking sge6_2u5/sun-sge-inspect-6.2-5.noarch.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package sun-sge-inspect.noarch 0:6.2-5 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================================================================
 Package                        Arch                  Version                Repository                                    Size
================================================================================================================================
Installing:
 sun-sge-inspect                noarch                6.2-5                  /sun-sge-inspect-6.2-5.noarch                 40 M

Transaction Summary
================================================================================================================================
Install       1 Package(s)
Upgrade       0 Package(s)

Total size: 40 M
Installed size: 40 M
Is this ok [y/N]: y
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing     : sun-sge-inspect-6.2-5.noarch                                                                             1/1 

Installed:
  sun-sge-inspect.noarch 0:6.2-5                                                                                                

Complete!
[root@t3ce02 SGE6.2u5]#

Install the OpenJDK development package (java-1.6.0-openjdk-devel) by using yum:

...
================================================================================================================================
 Package                               Arch                Version                               Repository                Size
================================================================================================================================
Installing:
 java-1.6.0-openjdk-devel              x86_64              1:1.6.0.0-1.39.b17.el6_0              sl-security              8.5 M

Transaction Summary
================================================================================================================================
...

You need to create users and keys (the reason is not entirely clear to us):

[root@t3ce02 bin]# cat /opt/SGE6.2u5/myusers.txt
root:iamroot:fabio.martinelli@psi.ch
[root@t3ce02 bin]#

[root@t3ce02 bin]# /mnt/zfs/sge/gridware/sge/util/sgeCA/sge_ca -usercert /opt/SGE6.2u5/myusers.txt 
Generating user certificate and key for 'root' ('iamroot','fabio.martinelli@psi.ch').

Creating 'user' certificate and key for iamroot
-----------------------------------------------
Generating a 1024 bit RSA private key
......++++++
...++++++
writing new private key to '/var/sgeCA/port6444/default/userkeys/root/key.pem'
-----
Using configuration from /tmp/sge_ca115195.tmp
Check that the request matches the signature
Signature ok
The Subject's Distinguished Name is as follows
countryName           :PRINTABLE:'DE'
stateOrProvinceName   :PRINTABLE:'GERMANY'
localityName          :PRINTABLE:'Building'
organizationName      :PRINTABLE:'Organisation'
organizationalUnitName:T61STRING:'Organisation_unit'
userId                :PRINTABLE:'root'
commonName            :PRINTABLE:'iamroot'
emailAddress          :IA5STRING:'fabio.martinelli@psi.ch'
Certificate is to be certified until Mar  2 22:35:05 2012 GMT (365 days)

Write out database with 1 new entries
Data Base Updated
created and signed certificate for user 'root' in '/var/sgeCA/port6444/default/userkeys/root'
[root@t3ce02 bin]#

Create the user keystores, protected by a password file:

[root@t3ce02 bin]# /mnt/zfs/sge/gridware/sge/util/sgeCA/sge_ca -userks -kspwf /tmp/mysecret.txt
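The -kspwf option points to a file that must already contain the keystore password; a minimal sketch of creating it (the password below is a placeholder):

```shell
umask 077                          # keep the secret readable by its owner only
echo 'changeit' > /tmp/mysecret.txt
```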

We made a script to set up JAVA_HOME and run Inspect:

[root@t3ce02 ~]# ll /usr/local/bin/sgeinspect.sh 
lrwxrwxrwx 1 root root 42 Mar  3 23:51 /usr/local/bin/sgeinspect.sh -> /gridware/sge/sgeinspect/bin/sgeinspect.sh

[root@t3ce02 ~]# cat /usr/local/bin/sgeinspect.sh
export JAVA_HOME=/etc/alternatives/java_sdk
cd /gridware/sge/sgeinspect/bin
./sgeinspect
cd -
[root@t3ce02 ~]#

SGE EXECD 6.2u5 installation

Installing the SGE execution side was easier than installing the master, but there are some steps to follow, well described in the official SGE How to Install Execution Hosts guide. One step is missing from the official documentation: after the RPM installation and before running ./install_execd, run this command from the master:

rsync -av /gridware/sge/default/common EXECUTION_HOST:/gridware/sge/default/common

so that ./install_execd receives the configuration files that identify the master and the parameters affecting execution.

We did the installation without any NFS dependency; this should avoid global job crashes when an NFS server becomes unreachable, and it improves I/O performance.

Dropping old job files

It's a good policy to keep recent SGE job directories on the WN to troubleshoot what went wrong during a job failure, but after a month or two it doesn't make sense to preserve those files, and keeping them risks hitting filesystem limits on the number of subdirectories. When I was at ESA I wrote this cron script to move out and later delete old SGE job dirs (the script has to be adapted to your cluster):

#!/bin/bash

# by martinelli @ ESA - 26/05/2010
# /var/sge/spool/$HOST/active_jobs was found with 31199 dirs inside; because that
# dir is hosted on an EXT3 filesystem it cannot hold more than ~32k subdirectories,
# so on a cron basis we first move old job dirs into a local XFS /stage/active_jobs_old/
# dir, which doesn't have this constraint, and eventually drop the oldest ones.

HOST=$(hostname | cut -d\. -f1)
ACTIVE_JOBS="/var/sge/spool/$HOST/active_jobs"
BASENAME=$(basename $0)

[ ! -d /stage/active_jobs_old/ ] && echo "/stage/active_jobs_old/ not a dir, exiting" && exit 1
[ ! -d $ACTIVE_JOBS ]            && echo "$ACTIVE_JOBS not a dir, exiting"            && exit 1
[ ! -d /tmp ]                    && echo "/tmp not a dir, exiting"                    && exit 1

# prepare the move commands; -mindepth 1 -maxdepth 1 restricts the match to the
# top-level job dirs and keeps '.' itself out of the generated commands
cd $ACTIVE_JOBS
/usr/bin/find . -mindepth 1 -maxdepth 1 -mtime +15 -type d -exec echo mv '{}' /stage/active_jobs_old/ \; > /tmp/$BASENAME-mv.sh
# execute the move commands
source /tmp/$BASENAME-mv.sh

# later drop the old job dirs: prepare the rm commands
cd /stage/active_jobs_old/
/usr/bin/find . -mindepth 1 -maxdepth 1 -mtime +45 -type d -exec echo rm -rf '{}' \; > /tmp/$BASENAME-rm.sh
# execute them
source /tmp/$BASENAME-rm.sh && exit 0

# something went wrong
exit 1
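A hypothetical crontab entry for scheduling the script above on each worker node; the script path, log path, and schedule are assumptions to adapt to your cluster:

```
# m  h  dom mon dow  command
15   3  *   *   *    /usr/local/sbin/purge_old_sge_jobs.sh >> /var/log/purge_old_sge_jobs.log 2>&1
```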

Customize SGE ARCO

Once the installation was completed and the old SGE reporting file was ingested into MySQL, we started to design some SQL queries in ARCO and to produce graphs; that's the most interesting part. So we produced:

1 day CPU usage

1 day MEM usage
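The queries behind such graphs are of this flavour; a hedged sketch against the ARCO view_accounting view (the column names username, cpu, and start_time are assumed from the stock ARCO schema, so verify them with DESCRIBE view_accounting first):

```sql
-- Per-user CPU hours consumed during one day, e.g. 2011-01-24
SELECT username,
       SUM(cpu) / 3600 AS cpu_hours
FROM   view_accounting
WHERE  start_time >= '2011-01-24 00:00:00'
  AND  start_time <  '2011-01-25 00:00:00'
GROUP  BY username
ORDER  BY cpu_hours DESC;
```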

-- FabioMartinelli - 2011-03-03

  • ARCO Graph showing 1 day of CPU usage in the T3 Cluster, date 2011-01-24.:
    1dayCPUusage.png

  • ARCO Graph showing 1 day of MEM usage in the T3 Cluster, date 2011-01-24.:
    1dayMEMusage.png

Topic attachments

Attachment                                               Size      Date        Who              Comment
1_day_CPU_usage.xml                                      1.2 K     2011-03-09  FabioMartinelli  ARCO query configuration, to be installed in /var/spool/arco/queries/
1_day_MEM_usage.xml                                      1.2 K     2011-03-09  FabioMartinelli  ARCO query configuration, to be installed in /var/spool/arco/queries/
1dayCPUusage.png                                         8.7 K     2011-03-09  FabioMartinelli  ARCO graph showing 1 day of CPU usage in the T3 cluster, date 2011-01-24
1dayMEMusage.png                                         20.9 K    2011-03-09  FabioMartinelli  ARCO graph showing 1 day of MEM usage in the T3 cluster, date 2011-01-24
lzfs-1.0-1.src.rpm                                       271.3 K   2011-03-03  FabioMartinelli  ZFS LZFS layer
lzfs-1.0-1_2.6.32_71.18.1.el6.x86_64.rpm                 589.2 K   2011-03-03  FabioMartinelli  ZFS LZFS layer
spl-0.5.2-1.src.rpm                                      420.8 K   2011-03-03  FabioMartinelli  ZFS SPL layer
spl-0.5.2-1.x86_64.rpm                                   28.4 K    2011-03-03  FabioMartinelli  ZFS SPL layer
spl-modules-0.5.2-1.src.rpm                              422.6 K   2011-03-03  FabioMartinelli  ZFS SPL layer
spl-modules-0.5.2-1_2.6.32_71.18.1.el6.x86_64.rpm        2178.9 K  2011-03-03  FabioMartinelli  ZFS SPL layer
spl-modules-devel-0.5.2-1_2.6.32_71.18.1.el6.x86_64.rpm  70.0 K    2011-03-03  FabioMartinelli  ZFS SPL layer
t3ce02.RPMs.list.after.SGE.MySQL.DBWRITER.ARCO.installation.txt  22.5 K  2011-03-06  FabioMartinelli  RPMs installed on t3ce02 (SL6) after the complete SGE installation
t3ce02.RPMs.list.after.ZFS.installation.txt              17.0 K    2011-03-03  FabioMartinelli  RPMs installed on t3ce02 (SL6) just after the ZFS installation, the first task after the O.S. installation
zfs-0.5.1-1.src.rpm                                      1815.3 K  2011-03-03  FabioMartinelli  ZFS main layer
zfs-0.5.1-1.x86_64.rpm                                   2505.8 K  2011-03-03  FabioMartinelli  ZFS main layer
zfs-devel-0.5.1-1.x86_64.rpm                             275.6 K   2011-03-03  FabioMartinelli  ZFS main layer
zfs-modules-0.5.1-1.src.rpm                              1816.3 K  2011-03-03  FabioMartinelli  ZFS main layer
zfs-modules-0.5.1-1_2.6.32_71.18.1.el6.x86_64.rpm        7585.8 K  2011-03-03  FabioMartinelli  ZFS main layer
zfs-modules-devel-0.5.1-1_2.6.32_71.18.1.el6.x86_64.rpm  224.3 K   2011-03-03  FabioMartinelli  ZFS main layer
zfs-test-0.5.1-1.x86_64.rpm                              26.4 K    2011-03-03  FabioMartinelli  ZFS main layer