SGE 6.2u5 plus ARCO MySQL on SL6 64bit powered by ZFS

Revision 3, 2011-03-03 22:16:21

Sun Grid Engine project home page: http://gridengine.sunsource.net/



This document describes the experience gained while upgrading our SGE installation from 6.1 to 6.2u5, the last free version of this batch system. Apart from the SGE upgrade itself, which introduced several new features in the batch system, we also migrated the O.S. to SL6, moved the accounting into a DB, and introduced the ZFS driver so that we can use this advanced filesystem in our Linux context.

HW installation

For our installation we detached t3ui07 from the cluster and converted it into t3ce02, our new SGE master. Because this new machine is critical, we configured a HW RAID1 in the LSI BIOS at boot time. The final layout is a 140GB LSI Virtual Volume that we partitioned during the SL6 installation as shown by the output of these commands:

[root@t3ce02 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             9.7G  2.3G  6.9G  25% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
/dev/sda1             485M   34M  426M   8% /boot

[root@t3ce02 ~]# mount 
/dev/sda3 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)

Because the server has 4 Gigabit NICs, it is worth connecting as many of them as possible to the switch and later configuring Linux bonding in mode 6 (balance-alb) to improve the server bandwidth and availability. For the time being we skipped this step.
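Since we skipped the bonding step, the following is only a minimal sketch of what it could look like on SL6, assuming the standard Red Hat style ifcfg files. The interface names, the address 192.0.2.10 and the miimon value are placeholders, not values from our installation. The function only writes the files into a directory of your choice, so they can be reviewed before being copied to /etc/sysconfig/network-scripts.

```shell
#!/bin/sh
# Sketch: generate ifcfg files for a mode 6 (balance-alb) bond.
# Interface names, IP address and miimon value are placeholders (assumptions).
write_bond_cfg() {
    outdir=$1; shift        # directory that receives the generated files

    cat > "$outdir/ifcfg-bond0" <<'EOF'
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.0.2.10
NETMASK=255.255.255.0
BONDING_OPTS="mode=6 miimon=100"
EOF

    # One slave file per NIC passed on the command line.
    for nic in "$@"; do
        cat > "$outdir/ifcfg-$nic" <<EOF
DEVICE=$nic
ONBOOT=yes
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
EOF
    done
}

# Example: write_bond_cfg /tmp/bond-review eth0 eth1 eth2 eth3
```

Writing to a scratch directory first keeps the running network configuration untouched until the files have been checked.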

SL6 64bit Installation

So far we have just one server like this, and it will probably stay that way, so we simply pointed the Virtual CD of t3ce02 to an SL6 DVD ISO file saved in t3admin01:/home/ and made a "Basic Server" installation. That is enough to get utilities like SSH and yum installed, so the other RPMs can be selected at run time. The "Basic Server" installation turns SELinux ON by default; to disable it edit this file (a reboot is needed for the change to take effect):
[root@t3ce02 ~]# grep -v \# /etc/sysconfig/selinux 
SELINUX=disabled
SELINUXTYPE=targeted 
[root@t3ce02 ~]#
Also turn OFF the automatic yum updates run by cron by editing this file:
/etc/sysconfig/yum-autoupdate
Then install these i686 RPMs; they are needed later by the Sun Web Console and by the LSI RAID utility mpt-status:
[root@t3ce02 ~]# yum install glibc.i686
...
Dependencies Resolved
================================================================================================================================
 Package                            Arch                   Version                            Repository                   Size
================================================================================================================================
Installing:
 glibc                              i686                   2.12-1.7.el6_0.3                   sl-security                 4.3 M
Installing for dependencies:
 nss-softokn-freebl                 i686                   3.12.8-1.el6_0                     sl-security                 108 k
Updating for dependencies:
 glibc                              x86_64                 2.12-1.7.el6_0.3                   sl-security                 3.7 M
 glibc-common                       x86_64                 2.12-1.7.el6_0.3                   sl-security                  14 M
 nss-softokn-freebl                 x86_64                 3.12.8-1.el6_0                     sl-security                 114 k

Transaction Summary
================================================================================================================================
Install       2 Package(s)
Upgrade       3 Package(s)

Total size: 22 M
Total download size: 4.4 M
Is this ok [y/N]: y
Downloading Packages:
(1/2): glibc-2.12-1.7.el6_0.3.i686.rpm                                                                   | 4.3 MB     00:09     
(2/2): nss-softokn-freebl-3.12.8-1.el6_0.i686.rpm                                                        | 108 kB     00:00     
--------------------------------------------------------------------------------------------------------------------------------
... 
Complete!
[root@t3ce02 ~]#

now you can install the LSI RAID checker "mpt-status":

[root@t3ce02 ~]# rpm -Uv http://www.drugphish.ch/~ratz/mpt-status/RPMS/1.2.0_RC7/mpt-status-1.2.0_RC7-3.i386.rpm
Retrieving http://www.drugphish.ch/~ratz/mpt-status/RPMS/1.2.0_RC7/mpt-status-1.2.0_RC7-3.i386.rpm
Preparing packages for installation...
mpt-status-1.2.0_RC7-3
[root@t3ce02 ~]#
load the driver and verify the RAID1 status:
[root@t3ce02 ~]# modprobe mptctl
[root@t3ce02 ~]# mpt-status 
ioc0 vol_id 0 type IM, 2 phy, 135 GB, state OPTIMAL, flags ENABLED
ioc0 phy 1 scsi_id 2 SEAGATE  ST914602SSUN146G 0603, 136 GB, state ONLINE, flags NONE
ioc0 phy 0 scsi_id 1 SEAGATE  ST914602SSUN146G 0603, 136 GB, state ONLINE, flags NONE
[root@t3ce02 ~]#
Curiously I couldn't find /etc/modprobe.conf, so to have the driver loaded at boot I just ran:
[root@t3ce02 etc]# echo modprobe mptctl >> /etc/rc.local
If you still have to reboot, now is the time to do it.
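The RAID1 check shown above can be automated. The following is a minimal sketch, not something we deployed: the function name raid_is_healthy is our own invention, and it reads the mpt-status output on stdin so that it can be tried without the hardware.

```shell
#!/bin/sh
# Sketch: succeed only if every volume is OPTIMAL and every disk is ONLINE.
# Input is the text printed by mpt-status, read from stdin.
raid_is_healthy() {
    awk '
        / vol_id / { vols++;  if ($0 !~ /state OPTIMAL/) bad++ }
        / phy /    { disks++; if ($0 !~ /state ONLINE/)  bad++ }
        END {
            if (vols > 0 && disks > 0 && bad == 0) exit 0
            exit 1
        }
    '
}

# Possible use from cron on the real host:
#   mpt-status | raid_is_healthy || echo "RAID problem on t3ce02" >&2
```

Requiring at least one volume line and one disk line means that empty output, for example when the mptctl module is not loaded, is reported as a failure too.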

ZFS on SL6 64bit.

A new O.S. release always brings some novelties; for the SL6 kernel one of them is the interesting opportunity to run ZFS filesystems. So far we have used the ZFS driver version zfs-linux-20110214.tar.bz2.

The ZFS sources can build RPMs, which Sys Admins always appreciate, so be sure to have the rpm-build RPM installed in your O.S. before trying to compile ZFS.

Once you have downloaded the file zfs-linux-20110214.tar.bz2, create the directory /opt/zfs-build to build the ZFS RPMs, copy the file there and unpack it with tar -xjvf zfs-linux-20110214.tar.bz2, then follow these macro steps:

[root@t3ce02 zfs-build]# ll
total 19680
drwxr-xr-x  9 root root     4096 Mar  3 14:53 lzfs
drwxr-xr-x  4 root root     4096 Mar  3 14:28 misc-scripts
drwxr-xr-x 11 root root     4096 Mar  3 14:35 spl
drwxr-xr-x 14 root root     4096 Mar  3 14:32 zfs
-rw-r--r--  1 root root 20132179 Feb 14 15:28 zfs-linux-20110214.tar.bz2

cd /opt/zfs-build/lzfs
./configure && make rpm

cd /opt/zfs-build/spl
./configure && make rpm

cd /opt/zfs-build/zfs
./configure && make rpm

yum install /opt/zfs-build/spl/*.rpm
yum install /opt/zfs-build/zfs/*.rpm
yum install /opt/zfs-build/lzfs/*.rpm
Here you can see the RPMs installed so far by the O.S. installation plus the ZFS RPMs just built and installed: t3ce02.RPMs.list.after.ZFS.installation.txt.

Here is the md5sums list of the ZFS RPMs produced; all the RPMs are available at the bottom of this Wiki page:

[root@t3ce02 zfs-build]# find . | grep \\.rpm | xargs -iI md5sum I
e6b0b62d710689586ee9cbbe8f6defdd  ./spl/spl-0.5.2-1.x86_64.rpm
a36c6797ba234f3935ea351c07002c61  ./spl/spl-modules-0.5.2-1_2.6.32_71.18.1.el6.x86_64.rpm
f462f15ab6c5a38db10290b38fcede8c  ./spl/spl-modules-devel-0.5.2-1_2.6.32_71.18.1.el6.x86_64.rpm
9397f335a0d33196a652e37b3a52b6ba  ./spl/spl-modules-0.5.2-1.src.rpm
e05f6da1226dd47b171b9764a15f488b  ./spl/spl-0.5.2-1.src.rpm
afe350394b3e9edd833dd15f1506e675  ./lzfs/lzfs-1.0-1.src.rpm
d3f9f6b6f0344bf95620c01b5fad3b2e  ./lzfs/lzfs-1.0-1_2.6.32_71.18.1.el6.x86_64.rpm
fecce1786206c71c20701d7872a5ca87  ./zfs/zfs-modules-0.5.1-1.src.rpm
e646e0ea853f8ce8c4166fa388dd1ecd  ./zfs/zfs-test-0.5.1-1.x86_64.rpm
1c7f4d7b34e4a8b92981b6b2bce875e4  ./zfs/zfs-0.5.1-1.x86_64.rpm
547d3680339b99ac99217f2f43e2b544  ./zfs/zfs-devel-0.5.1-1.x86_64.rpm
01368ff2a044612573481b3cb154ab58  ./zfs/zfs-0.5.1-1.src.rpm
23892b48ed147a166ac7d1b0ff3fb9ee  ./zfs/zfs-modules-devel-0.5.1-1_2.6.32_71.18.1.el6.x86_64.rpm
5e767ad12087ed18c7d72b05b39530d1  ./zfs/zfs-modules-0.5.1-1_2.6.32_71.18.1.el6.x86_64.rpm
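Because the list above is already in the format that md5sum writes, downloaded RPMs can be checked against it directly. A minimal sketch, assuming the list has been saved to a file and the RPMs sit in the same spl/, zfs/ and lzfs/ subdirectories; the function name verify_rpms and the file names in the example are hypothetical.

```shell
#!/bin/sh
# Sketch: check files against a list of "md5  ./relative/path" lines.
verify_rpms() {
    list=$1       # file holding the md5sum list
    basedir=$2    # directory the listed paths are relative to
    # The list is passed on stdin so that a relative $list still works after cd.
    ( cd "$basedir" && md5sum -c ) < "$list"
}

# Example: verify_rpms zfs-rpms.md5 /opt/zfs-build
```

md5sum -c returns a non-zero status if any file is missing or differs, so the function can be used directly in an if statement.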

We partitioned the rest of the disk as sda4 to become a ZFS pool in which to create ZFS filesystems:

[root@t3ce02 ~]# fdisk  -l

Disk /dev/sda: 146.0 GB, 145999527936 bytes
255 heads, 63 sectors/track, 17750 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000d12bc

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          64      512000   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              64        1339    10240000   82  Linux swap / Solaris
/dev/sda3            1339        2614    10240000   83  Linux
/dev/sda4            2614       17751   121584640   83  Linux

this is the command we ran to create the pool:

[root@t3ce02 ~]# zpool create -f zfspool -m /mnt/zfs sda4
[root@t3ce02 ~]# df -h 
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             9.7G  2.9G  6.3G  31% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
/dev/sda1             485M   57M  403M  13% /boot
zfspool               114G   21K  114G   1% /mnt/zfs
[root@t3ce02 ~]#

MySQL Database

MySQL ZFS filesystem

The official MySQL website reports good performance for MySQL on ZFS, so we applied that procedure to create the ZFS filesystem that stores our MySQL data; this DB is going to be used by the SGE ARCO tool.
[root@t3ce02 zfs]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             9.7G  2.9G  6.3G  32% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
/dev/sda1             485M   57M  403M  13% /boot
zfspool               114G  5.9G  108G   6% /mnt/zfs
[root@t3ce02 zfs]# zfs create zfspool/mysql
[root@t3ce02 zfs]# zfs set recordsize=16K zfspool/mysql

MySQL RPMs

Because we prepared a ZFS filesystem for MySQL, let's continue by installing mysql-server and relocating its files on ZFS; please follow these macro steps:
yum install mysql-server
/etc/init.d/mysqld stop
cd /var/lib
mv mysql /mnt/zfs/mysql && ln -s /mnt/zfs/mysql/mysql .
/etc/init.d/mysqld start
chkconfig mysqld on
To manage MySQL you can use several tools; probably the most common choices are mysql-workbench or phpMyAdmin.
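The "move the directory and leave a symbolic link behind" step used for /var/lib/mysql can be wrapped in a small function with a few safety checks. This is a sketch under our own assumptions, not part of the original procedure; relocate_dir is a hypothetical name, and the service that owns the directory must be stopped before calling it.

```shell
#!/bin/sh
# Sketch: move a directory under a new parent and symlink the old path to it.
relocate_dir() {
    src=$1            # e.g. /var/lib/mysql
    dest_parent=$2    # e.g. /mnt/zfs/mysql
    name=$(basename "$src")

    # Refuse to run twice or to overwrite an existing destination.
    [ -d "$src" ] && [ ! -L "$src" ] || { echo "$src is not a plain directory" >&2; return 1; }
    [ -d "$dest_parent" ] || { echo "$dest_parent does not exist" >&2; return 1; }
    [ ! -e "$dest_parent/$name" ] || { echo "$dest_parent/$name already exists" >&2; return 1; }

    mv "$src" "$dest_parent/" && ln -s "$dest_parent/$name" "$src"
}

# Example (with mysqld stopped): relocate_dir /var/lib/mysql /mnt/zfs/mysql
```

The checks make the function safe to call a second time: once the source path is a symbolic link it simply refuses to do anything.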

PhPMyAdmin

In the end we chose phpMyAdmin and installed it at https://t3ce02.psi.ch/phpmyadmin/.

MySQL ARCO DB

Now we can prepare the sge_arco DB and the 2 related MySQL users: 'arco_read', used by the ARCO Web application to run queries, and 'arco_write', used by the reporting module to parse the SGE reporting file /gridware/sge/default/common/reporting and insert new rows in the DB sge_arco.

We followed the ARCO procedure for the MySQL case.

This is our final permissions layout in MySQL:

User        Host            Password  Global privileges  Grant
arco_read   %               Yes       USAGE              No
arco_read   localhost       Yes       USAGE              No
arco_write  %               Yes       ALL PRIVILEGES     Yes
arco_write  localhost       Yes       ALL PRIVILEGES     Yes
root        127.0.0.1       Yes       ALL PRIVILEGES     Yes
root        localhost       Yes       ALL PRIVILEGES     Yes
root        t3ce02          Yes       ALL PRIVILEGES     Yes
root        t3ce02.psi.ch   Yes       ALL PRIVILEGES     Yes
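For reference, the global privileges listed above for the two ARCO users could be reproduced with statements like the ones printed by the sketch below. This is our own reconstruction from the table, not a copy of the official ARCO procedure: the function name and the passwords are placeholders, and any database-level privileges of 'arco_read' (the table only shows global ones) are not covered.

```shell
#!/bin/sh
# Sketch: print MySQL 5.1 statements matching the global privileges above.
# Passwords are placeholders supplied by the caller.
arco_grants_sql() {
    read_pw=$1     # password for arco_read
    write_pw=$2    # password for arco_write
    cat <<EOF
CREATE DATABASE IF NOT EXISTS sge_arco;
GRANT ALL PRIVILEGES ON *.* TO 'arco_write'@'localhost' IDENTIFIED BY '$write_pw' WITH GRANT OPTION;
GRANT ALL PRIVILEGES ON *.* TO 'arco_write'@'%' IDENTIFIED BY '$write_pw' WITH GRANT OPTION;
GRANT USAGE ON *.* TO 'arco_read'@'localhost' IDENTIFIED BY '$read_pw';
GRANT USAGE ON *.* TO 'arco_read'@'%' IDENTIFIED BY '$read_pw';
FLUSH PRIVILEGES;
EOF
}

# Example: arco_grants_sql 'read-secret' 'write-secret' | mysql -u root -p
```

Printing the statements instead of executing them lets you review the SQL before it is piped into the mysql client.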

MySQL Query logging

To debug what's happening in our DB it is worth enabling the query logging feature of MySQL; this is our /etc/my.cnf, please look at the 'log' option:
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
log=/var/lib/mysql/general.log
# Disabling symbolic-links is recommended to prevent assorted security risks
# symbolic-links=0
#
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

At run time we can use the command 'tail' to debug the queries (the path below is the same log file, reached without the /var/lib/mysql symbolic link):

[root@t3ce02 sun]# tail -f /mnt/zfs/mysql/mysql/general.log
/usr/libexec/mysqld, Version: 5.1.52-log (Source distribution). started with:
Tcp port: 0  Unix socket: /var/lib/mysql/mysql.sock
Time                 Id Command    Argument
110303 17:48:13	    1 Connect	Access denied for user 'UNKNOWN_MYSQL_USER'@'localhost' (using password: NO)
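Lines like the "Access denied" one above are the first thing to look for when the ARCO users are misconfigured. A minimal sketch of a filter that summarises them, assuming the log format shown above; the function name denied_logins is our own.

```shell
#!/bin/sh
# Sketch: print "user@host count" for every failed login in a general log.
# Reads the files given as arguments, or stdin if there are none.
denied_logins() {
    sed -n "s/.*Access denied for user '\([^']*\)'@'\([^']*\)'.*/\1@\2/p" "$@" |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example: denied_logins /mnt/zfs/mysql/mysql/general.log
```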

Sun Web Console installation

The first step to install SGE and SGE ARCO is to deploy the Sun Web Console, basically a Java framework developed by Sun to host their Java web applications. An installation procedure is available online, but we preferred to report the steps here:

So starting from these SGE6.2u5 files in /opt:

[root@t3ce02 SGE6.2u5]# ll
total 221396
-rw-r--r-- 1 root root  3865332 Feb 24 10:20 sdm10u5_core_rpm.zip
-rw-r--r-- 1 root root  3868219 Feb 24 10:20 sdm10u5_core_targz.zip
-rw-r--r-- 1 root root 10271047 Feb 24 10:20 sge62u5_arco_rpm.zip
-rw-r--r-- 1 root root 10305829 Feb 24 10:20 sge62u5_arco_targz.zip
-rw-r--r-- 1 root root 18839411 Feb 24 10:21 sge62u5_inspect_rpm.zip
-rw-r--r-- 1 root root 18899376 Feb 24 10:21 sge62u5_inspect_targz.zip
-rw-r--r-- 1 root root 29514366 Feb 24 11:17 sge62u5_linux24-i586_rpm.zip
-rw-r--r-- 1 root root 29533073 Feb 24 10:20 sge62u5_linux24-x64_rpm.zip
-rw-r--r-- 1 root root 34009465 Feb 24 10:20 sge62u5_sources+gpl-code_targz.zip
-rw-r--r-- 1 root root 67576445 Feb 24 10:21 webconsole3.0.2-linux.targz.zip

[root@t3ce02 SGE6.2u5]# md5sum *
c89ab2b3db585a5df092ac3399bcdb21  sdm10u5_core_rpm.zip
0bbccb40251dd189c22496d5f945c4f6  sdm10u5_core_targz.zip
188d3e28313b629f19dae761a8b6522b  sge62u5_arco_rpm.zip
e24d3b8e7e11447312771c3cdaf03687  sge62u5_arco_targz.zip
fe8f85829bb57938e8edc09186a93afa  sge62u5_inspect_rpm.zip
d40484210cde65a880e3eab86651ab9e  sge62u5_inspect_targz.zip
68f232beeb66a94c12f286860f07185e  sge62u5_linux24-i586_rpm.zip
23a81889b532253f1a1573ac3145111b  sge62u5_linux24-x64_rpm.zip
0d1fd15da1aee3bb159eb0b5dccae0cb  sge62u5_sources+gpl-code_targz.zip
b931ec2bde0137ebaeae4c4669a65df1  webconsole3.0.2-linux.targz.zip
[root@t3ce02 SGE6.2u5]#

Let's unzip the webconsole package:

[root@t3ce02 SGE6.2u5]# unzip webconsole3.0.2-linux.targz.zip
Archive:  webconsole3.0.2-linux.targz.zip
  inflating: sge6_2u5/webconsole3.0.2-linux.tar.gz  
[root@t3ce02 SGE6.2u5]# cd sge6_2u5/
[root@t3ce02 sge6_2u5]# tar -xzvf webconsole3.0.2-linux.tar.gz 
SUNWjato-2.1.5.i386.rpm
SUNWjatodmo-2.1.5.i386.rpm
SUNWjatodoc-2.1.5.i386.rpm
SUNWmcon-3.0.2-5.i386.rpm
SUNWmconr-3.0.2-5.i386.rpm
SUNWmcos-3.0.2-5.i386.rpm
SUNWmcosx-3.0.2-5.i386.rpm
SUNWmctag-3.0.2-5.i386.rpm
config_properties.tpl
jdk-1_5_0_04-linux-i586.rpm
setup
sun-javahelp-2.0_01-fcs.i586.rpm
.pkgrc
.setup_default
[root@t3ce02 sge6_2u5]#

Be sure to install the RPM pam.i686 because Sun Web Console is 32bit software and then install the framework:

[root@t3ce02 sge6_2u5]# ./setup 
Preparing packages for installation...
jdk-1.5.0_04-fcs
Preparing packages for installation...
sun-javahelp-2.0-fcs
Linking JavaHelp to /usr/java/jdk1.5.0_04 ...
Preparing packages for installation...
SUNWjato-2.1.5-9
Preparing packages for installation...
SUNWjatodoc-2.1.5-9
Preparing packages for installation...
SUNWjatodmo-2.1.5-9
Preparing packages for installation...
SUNWmctag-3.0.2-5
Preparing packages for installation...
SUNWmconr-3.0.2-5
Preparing packages for installation...
SUNWmcon-3.0.2-5
Preparing packages for installation...
SUNWmcos-3.0.2-5
Preparing packages for installation...
SUNWmcosx-3.0.2-5

Installation complete.

Starting Sun Java(TM) Web Console Version 3.0.2 ...
The console is running.
[root@t3ce02 sge6_2u5]#

The Sun Web Console is listening on TCP 6789:

[root@t3ce02 sge6_2u5]# netstat -tpln |grep java
tcp        0      0 ::ffff:127.0.0.1:41086      :::*                        LISTEN      7013/java           
tcp        0      0 :::6788                     :::*                        LISTEN      7013/java           
tcp        0      0 :::6789                     :::*                        LISTEN      7013/java           

You can log in with your Linux credentials (root/pwd) by pointing your browser to https://t3ce02.psi.ch:6789/
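The check on the listening port can be scripted as well. A minimal sketch that parses `netstat -tln` style output from stdin; the function name port_is_listening is hypothetical, and the parsing assumes the column layout shown above, where the local address is the fourth field.

```shell
#!/bin/sh
# Sketch: succeed if a TCP socket is in LISTEN state on the given port.
# Input is netstat -tln (or -tpln) output, read from stdin.
port_is_listening() {
    port=$1
    awk -v p="$port" '
        $1 ~ /^tcp/ && /LISTEN/ && $4 ~ (":" p "$") { found = 1 }
        END { if (found) exit 0; exit 1 }
    '
}

# Example: netstat -tln | port_is_listening 6789 && echo "Web Console is up"
```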

Here you can see the Sun Web Console logs:

[root@t3ce02 sun]# tail /var/log/webconsole/console/console_debug_log
==============================================================
Java Web Console Version 3.0.2 started on Thu Mar  3 17:17:05 CET 2011
==============================================================
[root@t3ce02 sun]#

SGE 6.2u5 installation

Now we can install SGE; please have a look at the following steps:
[root@t3ce02 SGE6.2u5]# unzip sge62u5_linux24-x64_rpm.zip
Archive:  sge62u5_linux24-x64_rpm.zip
  inflating: sge6_2u5/sun-sge-bin-linux24-x64-6.2-5.x86_64.rpm  
  inflating: sge6_2u5/sun-sge-common-6.2-5.noarch.rpm  
[root@t3ce02 SGE6.2u5]# cd sge6_2u5/
[root@t3ce02 sge6_2u5]# ll
total 161640
-r--r--r-- 1 root   bin       1235 Dec  9  2006 config_properties.tpl
-rw-r--r-- 1 102852 wheel 47286234 Jul 27  2005 jdk-1_5_0_04-linux-i586.rpm
-r-xr-xr-x 1 root   bin      48781 Dec  9  2006 setup
-rw-r--r-- 1   5074 wheel  6340876 May 11  2004 sun-javahelp-2.0_01-fcs.i586.rpm
-rw-r--r-- 1 root   root  25583219 Dec 15  2009 sun-sge-bin-linux24-x64-6.2-5.x86_64.rpm
-rw-r--r-- 1 root   root   4161238 Dec 15  2009 sun-sge-common-6.2-5.noarch.rpm
-r--r--r-- 1 root   bin     731610 Nov  8  2005 SUNWjato-2.1.5.i386.rpm
-r--r--r-- 1 root   bin    1216562 Nov  8  2005 SUNWjatodmo-2.1.5.i386.rpm
-r--r--r-- 1 root   bin    1049729 Nov  8  2005 SUNWjatodoc-2.1.5.i386.rpm
-rw-rw-r-- 1 root   bin   10504152 Dec  9  2006 SUNWmcon-3.0.2-5.i386.rpm
-rw-rw-r-- 1 root   bin      29130 Dec  9  2006 SUNWmconr-3.0.2-5.i386.rpm
-rw-rw-r-- 1 root   bin      46593 Dec  9  2006 SUNWmcos-3.0.2-5.i386.rpm
-rw-rw-r-- 1 root   bin       3803 Dec  9  2006 SUNWmcosx-3.0.2-5.i386.rpm
-rw-rw-r-- 1 root   bin     919212 Dec  9  2006 SUNWmctag-3.0.2-5.i386.rpm
-rw-r--r-- 1 root   root  67566632 Dec 15  2009 webconsole3.0.2-linux.tar.gz
[root@t3ce02 sge6_2u5]# yum install sun-sge-bin-linux24-x64-6.2-5.x86_64.rpm sun-sge-common-6.2-5.noarch.rpm
...
Dependencies Resolved

================================================================================================================================
 Package                          Arch            Version                  Repository                                      Size
================================================================================================================================
Installing:
 sun-sge-bin-linux24-x64          x86_64          6.2-5                    /sun-sge-bin-linux24-x64-6.2-5.x86_64           61 M
 sun-sge-common                   noarch          6.2-5                    /sun-sge-common-6.2-5.noarch                    11 M
Installing for dependencies:
 ksh                              x86_64          20100621-2.el6           sl                                             655 k
 libXp                            x86_64          1.0.0-15.1.el6           sl                                              22 k
 libXpm                           x86_64          3.5.8-2.el6              sl                                              58 k
 openmotif22                      x86_64          2.2.3-19.el6             sl                                             1.2 M
 tcl                              x86_64          1:8.5.7-6.el6            sl                                             1.9 M

...
Complete!

Move the SGE installation on the ZFS filesystem:

[root@t3ce02 /]# mv gridware/ /mnt/zfs/sge/ && ln -s /mnt/zfs/sge/gridware .
[root@t3ce02 /]# ll gridware
lrwxrwxrwx 1 root root 21 Mar  3 17:58 gridware -> /mnt/zfs/sge/gridware

Let's install SGE by running the script start_gui_installer; this is the final configuration we did:

Grid Engine cluster configuration
Grid Engine root directory ($SGE_ROOT):   /mnt/zfs/sge/gridware/sge
Cell name ($SGE_CELL):                    default
Cluster name ($SGE_CLUSTER_NAME):         p6444
Qmaster port ($SGE_QMASTER_PORT):         6444
Execd port ($SGE_EXECD_PORT):             6445
Group id range ($SGE_GID_RANGE):          20000-20100
Qmaster spool directory:                  /mnt/zfs/sge/gridware/sge/default/spool/qmaster
Global execd spool directory:             /mnt/zfs/sge/gridware/sge/default/spool
Spooling method:                          berkeleydb
Spooling directory:                       /mnt/zfs/sge/gridware/sge/default/spool/spooldb
JMX port:                                 6446
JVM library path:                         /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/lib/amd64/server/libjvm.so
JMX SSL server keystore path:             /var/sgeCA/port6444/default/private/keystore
Administrator mail:                       fabio.martinelli@psi.ch
 

Succeeded
Failed
Qmaster host
t3ce02.psi.ch

Execution host(s)
t3ce02.psi.ch

Shadow host(s)


Berkeley db host


Admin host(s)
t3ce02.psi.ch

Submit host(s)
t3ce02.psi.ch

How to start with Grid Engine
Set the environment...
... if you are a csh/tcsh user: source /mnt/zfs/sge/gridware/sge/default/common/settings.csh
... if you are a sh/ksh user: . /mnt/zfs/sge/gridware/sge/default/common/settings.sh

This will set or expand the following environment variables:
$SGE_ROOT (always necessary)
$SGE_CELL (if you are using a cell other than default)
$SGE_CLUSTER_NAME (always necessary)
$SGE_QMASTER_PORT (if you haven't added the service sge_qmaster)
$SGE_EXECD_PORT (if you haven't added the service sge_execd)
$PATH/$path (to find the Grid Engine binaries)
$MANPATH (to access the manual pages)
 
Submit one of the sample scripts contained in the /mnt/zfs/sge/gridware/sge/examples/jobs directory:
qsub /mnt/zfs/sge/gridware/sge/examples/jobs/simple.sh
or
qsub /mnt/zfs/sge/gridware/sge/examples/jobs/sleeper.sh
 
Use the qstat command to monitor the job's behavior:
qstat -f
 
After the job finishes executing, check your home directory for the redirected stdout/stderr files script-name.ejob-id and script-name.ojob-id. The job-id is a consecutive unique integer number assigned to each job.

Administering Grid Engine
Grid Engine startup scripts can be found at:
Qmaster: /mnt/zfs/sge/gridware/sge/default/common/sgemaster start/stop
Exec daemon: /mnt/zfs/sge/gridware/sge/default/common/sgeexecd start/stop

After startup the daemons log their messages in their spool directories.
Qmaster: /mnt/zfs/sge/gridware/sge/default/spool/qmaster/messages
Exec daemon: //messages
Useful links
Sun Grid Engine Information Center
http://wikis.sun.com/display/SunGridEngine/Home
Grid Engine project home
http://gridengine.sunsource.net
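As a small illustration of the submission step described in the installer summary above, this is a minimal sketch of a job script of our own, not one of the shipped examples. The job name hello_sge and the function name are placeholders. Lines starting with "#$" are read by qsub as embedded options, so the explanatory comments are kept on separate lines.

```shell
#!/bin/sh
# Interpreter used to run the job.
#$ -S /bin/sh
# Run the job in the directory it was submitted from.
#$ -cwd
# Merge stderr into stdout.
#$ -j y
# Job name (placeholder).
#$ -N hello_sge

# JOB_ID and NSLOTS are set by Grid Engine inside a job; outside of one the
# fallback values after ":-" are printed instead.
job_banner() {
    echo "job ${JOB_ID:-none} on $(uname -n) with ${NSLOTS:-1} slot(s)"
}

job_banner
```

Saved for example as hello.sh, it would be submitted with qsub hello.sh; with the options above the output should end up in a file named hello_sge.o followed by the job id, in the submission directory.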

SGE setting scripts

Please create these symbolic links:
[root@t3ce02 profile.d]# pwd
/etc/profile.d
[root@t3ce02 profile.d]# ll se*
lrwxrwxrwx 1 root root 53 Mar  3 18:03 settings.csh -> /mnt/zfs/sge/gridware/sge/default/common/settings.csh
lrwxrwxrwx 1 root root 52 Mar  3 18:03 settings.sh -> /mnt/zfs/sge/gridware/sge/default/common/settings.sh
[root@t3ce02 profile.d]#

Then log out and log in again via SSH.
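The two symbolic links listed above can be created as in the following sketch. The function name link_sge_settings is hypothetical, and the second argument exists only so that the function can be tried against a scratch directory instead of /etc/profile.d.

```shell
#!/bin/sh
# Sketch: link the SGE settings scripts into a profile.d style directory.
link_sge_settings() {
    sge_common=$1                       # e.g. /mnt/zfs/sge/gridware/sge/default/common
    profile_dir=${2:-/etc/profile.d}    # where the login shells look for scripts

    for f in settings.sh settings.csh; do
        [ -f "$sge_common/$f" ] || { echo "missing $sge_common/$f" >&2; return 1; }
        ln -sfn "$sge_common/$f" "$profile_dir/$f"
    done
}

# Example: link_sge_settings /mnt/zfs/sge/gridware/sge/default/common
```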

SGE configuration tuning

Now, to enable the SGE reporting file and to save job logs on the server where the job ran, we tuned the SGE configuration with the 'qconf -mconf' command according to this fragment:
...
execd_params                 KEEP_ACTIVE=1 ENABLE_ADDGRP_KILL=TRUE \
                             H_MEMORYLOCKED=infinity
reporting_params             accounting=true reporting=true \
                             flush_time=00:00:15 joblog=true sharelog=00:00:00
...
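After editing the configuration it is easy to verify that the reporting file is really enabled. A minimal sketch, assuming the configuration is dumped with `qconf -sconf` and uses the backslash line continuation shown in the fragment above; the function name reporting_enabled is our own.

```shell
#!/bin/sh
# Sketch: succeed if reporting_params contains reporting=true.
# Input is the text printed by qconf -sconf, read from stdin.
reporting_enabled() {
    awk '
        # Join a line ending in a backslash with the line that follows it.
        { if (sub(/\\[ \t]*$/, "")) { buf = buf $0 " "; next } }
        {
            full = buf $0; buf = ""
            if (full ~ /^reporting_params/)
                ok = (full ~ /(^|[ \t])reporting=true([ \t]|$)/)
        }
        END { if (ok) exit 0; exit 1 }
    '
}

# Example: qconf -sconf | reporting_enabled || echo "reporting is OFF" >&2
```

The match is anchored on whitespace so that accounting=true alone is not mistaken for reporting=true.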

SGE dbwriter

Once the SGE master is working properly we can install the dbwriter tool, which involves the MySQL user 'arco_write', and the ARCO reporting software, which involves the MySQL user 'arco_read' and the Sun Web Console. Please have a look at the official SGE documentation. This was our experience:
[root@t3ce02 SGE6.2u5]# unzip sge62u5_arco_rpm.zip
Archive:  sge62u5_arco_rpm.zip
  inflating: sge6_2u5/sun-sge-arco-6.2-5.noarch.rpm  
[root@t3ce02 SGE6.2u5]# cd sge6_2u5
[root@t3ce02 sge6_2u5]# yum install sun-sge-arco-6.2-5.noarch.rpm
Setting up Install Process
Examining sun-sge-arco-6.2-5.noarch.rpm: sun-sge-arco-6.2-5.noarch
Marking sun-sge-arco-6.2-5.noarch.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package sun-sge-arco.noarch 0:6.2-5 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================================================================
 Package                       Arch                    Version                Repository                                   Size
================================================================================================================================
Installing:
 sun-sge-arco                  noarch                  6.2-5                  /sun-sge-arco-6.2-5.noarch                   19 M

Transaction Summary
================================================================================================================================
Install       1 Package(s)
Upgrade       0 Package(s)

Total size: 19 M
Installed size: 19 M
Is this ok [y/N]: y
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing     : sun-sge-arco-6.2-5.noarch                                                                                1/1 

Installed:
  sun-sge-arco.noarch 0:6.2-5                                                                                                   

Complete!
[root@t3ce02 sge6_2u5]#

This step affects the reporting variables:

[root@t3ce02 sge6_2u5]# qconf -se global
hostname              global
load_scaling          NONE
complex_values        NONE
load_values           NONE
processors            0
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      cpu,np_load_avg,mem_free,virtual_free
[root@t3ce02 sge6_2u5]#

Be sure that you have a MySQL JDBC driver file and link that file inside the SGE dir:

[root@t3ce02 sge6_2u5]# yum install mysql-connector-java.x86_64
...
[root@t3ce02 lib]# pwd
/mnt/zfs/sge/gridware/sge/dbwriter/lib

[root@t3ce02 lib]# ln -s /usr/share/java/mysql-connector-java.jar

During the dbwriter installation you will be prompted for several things; one is which Java to use, and there you should specify '/etc/alternatives/jre/'. Now run cd $SGE_ROOT/dbwriter && ./inst_dbwriter

All parameters are now collected
--------------------------------

        SGE_ROOT=/mnt/zfs/sge/gridware/sge
        SGE_CELL=default
       JAVA_HOME=/etc/alternatives/jre (1.6.0_17)
          DB_URL=jdbc:mysql://localhost:3306/sge_arco
         DB_USER=arco_write
       READ_USER=arco_read
        INTERVAL=120
       SPOOL_DIR=/mnt/zfs/sge/gridware/sge/default/spool/dbwriter
    DERIVED_FILE=/mnt/zfs/sge/gridware/sge/dbwriter/database/mysql/dbwriter.xml
     DEBUG_LEVEL=FINE

Are these settings correct? (y/n) [y] >>

This is the MySQL sge_arco tables and views creation phase:

Update version table
commiting changes
Version 6.1u3 (id=6) successfully installed
Install version 6.1u4 (id=7) -------
Create table sge_version
Insert first value in the checkpoint table
Update version table
commiting changes
Version 6.1u4 (id=7) successfully installed
Install version 6.2 (id=8) -------
Drop primary key constraint on sge_version table
Create compound primary key for sge_version
Create table sge_ar
Create index sge_ar_idx0 on column ar_number
Create index sge_ar_idx1 on column ar_owner
Create table sge_ar_attribute
Create index sge_ar_attribute_idx0 on column ara_end_time
Create table sge_ar_usage
Create table sge_ar_log
Create index sge_ar_log_idx0 on column arl_event
Create table sge_ar_resource_usage
Add the column ju_ar_parent to sge_job_usage table
Create index sge_job_usage_idx2 on column ju_ar_parent
Drop view view_job_times
Drop view view_accounting
Drop view view_jobs_completed
Update view view_accounting
Create view view_job_times_subquery
Update view view_job_times
Update view view_jobs_completed
Create view view_ar_attribute
Create view view_ar_log
Create view view_ar_usage
Create view view_ar_resource_usage

            Create view view_ar_time_usage 
         
Drop the column ju_state from sge_job_usage table
Drop the column j_open from sge_job table
Updating derived host values variable h_jobs to h_jobs_finished
Update version table
commiting changes
Version 6.2 (id=8) successfully installed
Install version 6.1u6 (id=9) -------
Extend too small integer field sge_department_values.dv_id,
drop temporarily constraint for foreign key sge_department_values.dv_parent
and extend too small integer field sge_department_values.dv_parent
Extend too small integer field sge_department.d_id
Recreate foreign key sge_department_values.dv_parent
Extend too small integer field sge_group_values.gv_id,
drop temporarily constraint for foreign key sge_group_values.gv_parent
and extend too small integer field sge_group_values.gv_parent
Extend too small integer
      field sge_group.g_id
Recreate foreign key sge_group_values.gv_parent
Extend too small integer field sge_host_values.hv_id,
drop temporarily constraint for foreign key sge_host_values.hv_parent
and extend too small integer field sge_host_values.hv_parent
Extend too small integer field sge_host.h_id
Recreate foreign key sge_host_values.hv_parent
Extend too small integer field sge_job_log.jl_id,
drop temporarily constraint for foreign key sge_job_log.jl_parent
and extend too small integer field sge_job_log.jl_parent
Extend too small integer field sge_job_request.jr_id,
drop temporarily constraint for foreign key sge_job_request.jr_parent
and extend too small integer field sge_job_request.jr_parent
Extend too small integer field sge_job_usage.ju_id,
drop temporarily constraint for foreign key sge_job_usage.ju_parent
and extend too small integer field sge_job_usage.ju_parent
Extend too small integer field sge_job.j_id
Recreate foreign key sge_job_log.jl_parent
Recreate foreign key sge_job_request.jr_parent
Recreate foreign key sge_job_usage.ju_parent
Extend too small integer field sge_project_values.pv_id,
drop temporarily constraint for foreign key sge_project_values.pv_parent
and extend too small integer field sge_project_values.pv_parent
Extend too small integer field sge_project.p_id
Recreate foreign key sge_project_values.pv_parent
Extend too small integer field sge_queue_values.qv_id,
drop temporarily constraint for foreign key sge_queue_values.qv_parent
and extend too small integer field sge_queue_values.qv_parent
Extend too small integer field sge_queue.q_id
Recreate foreign key sge_queue_values.qv_parent
Extend too small integer field sge_share_log.sl_id
Extend too small integer field sge_statistic_values.sv_id,
drop temporarily constraint for foreign key sge_statistic_values.sv_parent
and extend too small integer field sge_statistic_values.sv_parent
Extend too small integer field sge_statistic.s_id
Recreate foreign key sge_statistic_values.sv_parent
Extend too small integer field sge_user_values.uv_id,
drop temporarily constraint for foreign key sge_user_values.uv_parent
and extend too small integer field sge_user_values.uv_parent
Extend too small integer field sge_user.u_id
Recreate foreign key sge_user_values.uv_parent
Update version table
commiting changes
Version 6.1u6 (id=9) successfully installed
Install version 6.2u1 (id=10) -------
Extend too small integer field sge_ar_attribute.ara_id,
drop temporarily constraint for foreign key sge_ar_attribute.ara_parent
and extend too small integer field sge_ar_attribute.ara_parent
Extend too small integer field sge_ar_log.arl_id,
drop temporarily constraint for foreign key sge_ar_log.arl_parent
and extend too small integer field sge_ar_log.arl_parent
Extend too small integer field sge_ar_resource_usage.arru_id,
drop temporarily constraint for foreign key sge_ar_resource_usage.arru_parent
and extend too small integer field sge_ar_resource_usage.arru_parent
Extend too small integer field sge_ar_usage.aru_id,
drop temporarily constraint for foreign key sge_ar_usage.aru_parent
and extend too small integer field sge_ar_usage.aru_parent
Extend too small integer field sge_ar.ar_id
Extend too small integer field sge_job_usage.ju_parent and sge_job_usage.ju_ar_parent
Recreate foreign key sge_ar_attribute.ara_parent
Recreate foreign key sge_ar_log.arl_parent
Recreate foreign key sge_ar_resource_usage.arru_parent
Recreate foreign key sge_ar_usage.aru_parent
Drop primary key constraint on sge_version table
Create compound primary key for sge_version
Update version table
commiting changes
Version 6.2u1 (id=10) successfully installed
OK

Create start script sgedbwriter in /mnt/zfs/sge/gridware/sge/default/common

Create configuration file for dbwriter in /mnt/zfs/sge/gridware/sge/default/common

Hit <RETURN> to continue >>

When the dbwriter installation completed we got:

dbwriter startup script
-----------------------

We can install the startup script that will
start dbwriter at machine boot (y/n) [y] >> 

cp /mnt/zfs/sge/gridware/sge/default/common/sgedbwriter /etc/init.d/sgedbwriter.p6444
/usr/lib/lsb/install_initd /etc/init.d/sgedbwriter.p6444
Creating dbwriter spool directory /mnt/zfs/sge/gridware/sge/default/spool/dbwriter
starting dbwriter
dbwriter started (pid=11098)
Installation of dbwriter completed
[root@t3ce02 dbwriter]#
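
The installer reports "dbwriter started (pid=11098)", but it can be useful to re-check later that the daemon is still alive. The following is only a minimal sketch: the helper name pidfile_alive is hypothetical, and the location of the PID file (we assume a dbwriter.pid inside the dbwriter spool directory created above) should be verified on your own installation.

```shell
# Succeeds if the PID stored in the given file belongs to a live process.
# Note: for processes owned by another user, "kill -0" only works as root.
pidfile_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric content
    esac
    kill -0 "$pid" 2>/dev/null
}

# Usage (the path is an assumption, not taken from the installer output):
#   pidfile_alive /mnt/zfs/sge/gridware/sge/default/spool/dbwriter/dbwriter.pid \
#       && echo "dbwriter is running"
```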

SGE ARCO

Now it's time to install the reporting layer; please have a look at the official ARCo documentation.

Here is our installation story:

Searching for the jdbc driver com.mysql.jdbc.Driver 
in directory /mnt/zfs/sge/gridware/sge/reporting/WEB-INF/lib 

OK, jdbc driver found

Should the connection to the database be tested? (y/n) [y] >> 

Test database connection to 'jdbc:mysql://localhost:3306/sge_arco' ... OK

Hit <RETURN> to continue >> 

DB parameters are now collected
-------------------------------
    CLUSTER_NAME=T3_PSI_CH
          DB_URL=jdbc:mysql://localhost:3306/sge_arco
         DB_USER=arco_read

Are these settings correct? (y/n) [y] >> 

Do you want to add another cluster? (y/n) [n] >>n

Configure users with write access
---------------------------------

Users: default
Enter a user login name. (Hit <RETURN> to finish) >> root

Users: default root
Enter a user login name. (Hit <RETURN> to finish) >> martinelli_f

Users: default root martinelli_f
Enter a user login name. (Hit <RETURN> to finish) >> 

All parameters are now collected
--------------------------------
       SPOOL_DIR=/var/spool/arco
      APPL_USERS=default root martinelli_f

Are these settings correct? (y/n) [y] >> 

   found incorrect permissions lrwxrwxrwx for /mnt/zfs/sge/gridware/sge/reporting/WEB-INF/lib/mysql-connector-java.jar
   Correcting file permissions ... done

Install predefined queries
--------------------------

Directory /var/spool/arco does not exist, create it? (y/n) [y] >> y

Create directory /var/spool/arco
Create directory /var/spool/arco/queries
Copy examples queries into /var/spool/arco/queries
Copy query Accounting_per_AR.xml ... OK
Copy query Accounting_per_Department.xml ... OK
Copy query Accounting_per_Project.xml ... OK
Copy query Accounting_per_User.xml ... OK
Copy query AR_Attributes.xml ... OK
Copy query AR_by_User.xml ... OK
Copy query AR_Log.xml ... OK
Copy query AR_Reserved_Time_Usage.xml ... OK
Copy query Average_Job_Turnaround_Time.xml ... OK
Copy query Average_Job_Wait_Time.xml ... OK
Copy query DBWriter_Performance.xml ... OK
Copy query Host_Load.xml ... OK
Copy query Job_Log.xml ... OK
Copy query Number_of_Jobs_Completed_per_AR.xml ... OK
Copy query Number_of_Jobs_completed.xml ... OK
Copy query Queue_Consumables.xml ... OK
Copy query Statistic_History.xml ... OK
Copy query Statistics.xml ... OK
Copy query Wallclock_time.xml ... OK
Create directory /var/spool/arco/results

Hit <RETURN> to continue >> 

ARCo reporting module setup
---------------------------
Copying ARCo reporting file into /mnt/zfs/sge/gridware/sge/default/arco/reporting

Setting up ARCo reporting configuration file. After registration of
the ARCo reporting module in the Sun Java Web Console you can find 
this file at

      /mnt/zfs/sge/gridware/sge/default/arco/reporting/config.xml

Hit <RETURN> to continue >> 

Importing Sun Java Web Console 3.0 or 3.1 files
-----------------------------------------------
Imported files to /mnt/zfs/sge/gridware/sge/default/arco/reporting
Created product images in /mnt/zfs/sge/gridware/sge/default/arco/reporting/com_sun_web_ui/images

Hit <RETURN> to continue >> 

Registering the SGE reporting module in the Sun Java Web Console
----------------------------------------------------------------
The reporting web application has been successfully deployed.
Set 1 properties for the com.sun.grid.arco_6.2u5 application.
Set 1 properties for the com.sun.grid.arco_6.2u5 application.
Set 1 properties for the com.sun.grid.arco_6.2u5 application.
Creating the TOC file ... OK

Restarting Sun Java Web Console
-------------------------------
Shutting down Sun Java(TM) Web Console Version 3.0.2 ...
Starting Sun Java(TM) Web Console Version 3.0.2 ...
The console is running.
SGE  ARCo reporting successfully installed

Now try to log in to the Sun Java Web Console, where you will find the ARCo reporting console: https://t3ce02.psi.ch:6789/
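
If the console shows no data, one possible first check is whether the read-only account reported by the installer (DB_USER=arco_read) can reach the database behind DB_URL. The sketch below splits a MySQL JDBC URL into the pieces the mysql command-line client expects; the function name parse_mysql_jdbc_url is hypothetical and only plain host:port/database URLs are handled.

```shell
# Split a MySQL JDBC URL into host, port and database name.
# Prints "host port database" on one line; fails on anything else.
parse_mysql_jdbc_url() {
    rest=${1#jdbc:mysql://}
    [ "$rest" != "$1" ] || return 1      # prefix missing
    hostport=${rest%%/*}
    db=${rest#*/}
    [ "$db" != "$rest" ] || return 1     # no "/database" part
    db=${db%%\?*}                        # drop optional ?key=value parameters
    case $hostport in
        *:*) host=${hostport%%:*}; port=${hostport##*:} ;;
        *)   host=$hostport;       port=3306 ;;   # MySQL default port
    esac
    [ -n "$host" ] && [ -n "$db" ] || return 1
    echo "$host $port $db"
}

# Usage: query the version table mentioned in the dbwriter output above.
# "-p" without a value makes the client prompt for the password.
#   set -- $(parse_mysql_jdbc_url "jdbc:mysql://localhost:3306/sge_arco")
#   mysql -h "$1" -P "$2" -u arco_read -p "$3" -e 'SELECT * FROM sge_version'
```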

SGE, importing an old reporting file

It's possible to ingest an old reporting file coming from another SGE installation. Since our old cluster had one, we imported more than 1.5 years of statistics in this way:
[root@t3ce02 common]# ll /root/reporting 
-rw-r--r--. 1 root root 740664403 Feb 28 23:08 /root/reporting
[root@t3ce02 common]# pwd
/mnt/zfs/sge/gridware/sge/default/common
[root@t3ce02 common]# cp -p /root/reporting .
cp: overwrite `./reporting'? y
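
Before handing a 700 MB file to dbwriter it may be worth checking what it contains. The sketch below counts the records per type; it assumes the colon-separated layout described in reporting(5), where the first field is a timestamp and the second the record type (for example acct or new_job). The function name count_reporting_records is hypothetical.

```shell
# Print "record_type count" for every record type found in a reporting file.
count_reporting_records() {
    awk -F: '
        /^#/ || NF < 2 { next }      # skip comments and malformed lines
        { count[$2]++ }
        END { for (t in count) printf "%s %d\n", t, count[t] }
    ' "$1" | sort
}

# Usage:
#   count_reporting_records /root/reporting
```

Note that the cp above replaces the reporting file that the new installation was already writing, so any records dbwriter had not yet processed are lost; if that matters, appending the old file instead of overwriting may be an option.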

SGE, inspect tool

Several SGE clusters and their queues can be monitored graphically with the Java tool Inspect, which we installed as follows:
[root@t3ce02 SGE6.2u5]# unzip sge62u5_inspect_rpm.zip
Archive:  sge62u5_inspect_rpm.zip
  inflating: sge6_2u5/sun-sge-inspect-6.2-5.noarch.rpm  
[root@t3ce02 SGE6.2u5]# yum install sge6_2u5/sun-sge-inspect-6.2-5.noarch.rpm  
Setting up Install Process
Examining sge6_2u5/sun-sge-inspect-6.2-5.noarch.rpm: sun-sge-inspect-6.2-5.noarch
Marking sge6_2u5/sun-sge-inspect-6.2-5.noarch.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package sun-sge-inspect.noarch 0:6.2-5 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================================================================
 Package                        Arch                  Version                Repository                                    Size
================================================================================================================================
Installing:
 sun-sge-inspect                noarch                6.2-5                  /sun-sge-inspect-6.2-5.noarch                 40 M

Transaction Summary
================================================================================================================================
Install       1 Package(s)
Upgrade       0 Package(s)

Total size: 40 M
Installed size: 40 M
Is this ok [y/N]: y
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing     : sun-sge-inspect-6.2-5.noarch                                                                             1/1 

Installed:
  sun-sge-inspect.noarch 0:6.2-5                                                                                                

Complete!
[root@t3ce02 SGE6.2u5]#

Install the JDK development package (java-1.6.0-openjdk-devel) using yum:

...
================================================================================================================================
 Package                               Arch                Version                               Repository                Size
================================================================================================================================
Installing:
 java-1.6.0-openjdk-devel              x86_64              1:1.6.0.0-1.39.b17.el6_0              sl-security              8.5 M

Transaction Summary
================================================================================================================================
...

You need to create users and keys (it's not really clear why):

[root@t3ce02 bin]# cat /opt/SGE6.2u5/myusers.txt
root:iamroot:fabio.martinelli@psi.ch
[root@t3ce02 bin]#
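
Each line of the users file has three colon-separated fields; judging from the sge_ca output below they end up as userId, commonName and emailAddress of the certificate. A malformed line is easy to overlook, so here is a small sketch of a pre-check. The function name check_sge_ca_userfile is hypothetical and the check is deliberately simple (three non-empty fields per line).

```shell
# Verify that every line of a users file has exactly three non-empty,
# colon-separated fields. Prints the offending line numbers and fails
# if any line does not match.
check_sge_ca_userfile() {
    awk -F: '
        NF != 3 || $1 == "" || $2 == "" || $3 == "" {
            printf "line %d: expected user:name:email\n", NR
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Usage:
#   check_sge_ca_userfile /opt/SGE6.2u5/myusers.txt && echo "users file looks fine"
```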

[root@t3ce02 bin]# /mnt/zfs/sge/gridware/sge/util/sgeCA/sge_ca -usercert /opt/SGE6.2u5/myusers.txt 
Generating user certificate and key for 'root' ('iamroot','fabio.martinelli@psi.ch').

Creating 'user' certificate and key for iamroot
-----------------------------------------------
Generating a 1024 bit RSA private key
......++++++
...++++++
writing new private key to '/var/sgeCA/port6444/default/userkeys/root/key.pem'
-----
Using configuration from /tmp/sge_ca115195.tmp
Check that the request matches the signature
Signature ok
The Subject's Distinguished Name is as follows
countryName           :PRINTABLE:'DE'
stateOrProvinceName   :PRINTABLE:'GERMANY'
localityName          :PRINTABLE:'Building'
organizationName      :PRINTABLE:'Organisation'
organizationalUnitName:T61STRING:'Organisation_unit'
userId                :PRINTABLE:'root'
commonName            :PRINTABLE:'iamroot'
emailAddress          :IA5STRING:'fabio.martinelli@psi.ch'
Certificate is to be certified until Mar  2 22:35:05 2012 GMT (365 days)

Write out database with 1 new entries
Data Base Updated
created and signed certificate for user 'root' in '/var/sgeCA/port6444/default/userkeys/root'
[root@t3ce02 bin]#

Create and use passwords:

[root@t3ce02 bin]# /mnt/zfs/sge/gridware/sge/util/sgeCA/sge_ca -userks -kspwf /tmp/mysecret.txt
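
The password file passed with -kspwf holds the keystore password in clear text, and /tmp is readable by every local user. As a hedged suggestion, the file can be created with owner-only permissions; the helper name write_secret_file is hypothetical and the password is read from standard input.

```shell
# Write standard input to the given file, readable and writable by the
# owner only (mode 600), even if the file already existed with looser bits.
write_secret_file() {
    (
        umask 077
        : > "$1" && chmod 600 "$1" && cat > "$1"
    )
}

# Usage (KEYSTORE_PW is a placeholder variable):
#   printf '%s\n' "$KEYSTORE_PW" | write_secret_file /tmp/mysecret.txt
```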

We wrote a script that sets JAVA_HOME and runs Inspect:
[root@t3ce02 ~]# ll /usr/local/bin/sgeinspect.sh 
lrwxrwxrwx 1 root root 42 Mar  3 23:51 /usr/local/bin/sgeinspect.sh -> /gridware/sge/sgeinspect/bin/sgeinspect.sh

[root@t3ce02 ~]# cat /usr/local/bin/sgeinspect.sh
export JAVA_HOME=/etc/alternatives/java_sdk
cd /gridware/sge/sgeinspect/bin
./sgeinspect
cd -
[root@t3ce02 ~]#
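
The script above has no interpreter line and keeps going if the cd fails. A slightly more defensive variant could look like the sketch below; the function name run_sgeinspect and the SGEINSPECT_DIR override are hypothetical, while both default paths are the ones from the listing.

```shell
#!/bin/sh
# Run sgeinspect with JAVA_HOME set, without changing the caller's directory.
run_sgeinspect() {
    JAVA_HOME="${JAVA_HOME:-/etc/alternatives/java_sdk}"
    export JAVA_HOME
    inspect_dir="${SGEINSPECT_DIR:-/gridware/sge/sgeinspect/bin}"

    if [ ! -x "$inspect_dir/sgeinspect" ]; then
        echo "sgeinspect not found in $inspect_dir" >&2
        return 1
    fi
    # The subshell keeps the cd local, so no "cd -" is needed afterwards.
    ( cd "$inspect_dir" && ./sgeinspect "$@" )
}
```

When used as a standalone wrapper, the last line of the file would simply be run_sgeinspect "$@".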

-- FabioMartinelli - 2011-03-03

Topic attachments

Attachment                                               History  Size      Date                Who              Comment
lzfs-1.0-1.src.rpm                                       r1       271.3 K   2011-03-03 - 13:23  FabioMartinelli  ZFS LZFS layer
lzfs-1.0-1_2.6.32_71.18.1.el6.x86_64.rpm                 r1       589.2 K   2011-03-03 - 13:23  FabioMartinelli  ZFS LZFS layer
spl-0.5.2-1.src.rpm                                      r1       420.8 K   2011-03-03 - 13:22  FabioMartinelli  ZFS SPL layer
spl-0.5.2-1.x86_64.rpm                                   r1       28.4 K    2011-03-03 - 13:22  FabioMartinelli  ZFS SPL layer
spl-modules-0.5.2-1.src.rpm                              r1       422.6 K   2011-03-03 - 13:22  FabioMartinelli  ZFS SPL layer
spl-modules-0.5.2-1_2.6.32_71.18.1.el6.x86_64.rpm        r1       2178.9 K  2011-03-03 - 13:22  FabioMartinelli  ZFS SPL layer
spl-modules-devel-0.5.2-1_2.6.32_71.18.1.el6.x86_64.rpm  r1       70.0 K    2011-03-03 - 13:22  FabioMartinelli  ZFS SPL layer
t3ce02.RPMs.list.after.ZFS.installation.txt              r1       17.0 K    2011-03-03 - 13:01  FabioMartinelli  This file lists the RPMs involved in the t3ce02 SL6 installation just after the ZFS installation, which was the first task I did after the O.S. installation.
zfs-0.5.1-1.src.rpm                                      r1       1815.3 K  2011-03-03 - 13:24  FabioMartinelli  ZFS Main Layer
zfs-0.5.1-1.x86_64.rpm                                   r1       2505.8 K  2011-03-03 - 13:24  FabioMartinelli  ZFS Main Layer
zfs-devel-0.5.1-1.x86_64.rpm                             r1       275.6 K   2011-03-03 - 13:24  FabioMartinelli  ZFS Main Layer
zfs-modules-0.5.1-1.src.rpm                              r1       1816.3 K  2011-03-03 - 13:24  FabioMartinelli  ZFS Main Layer
zfs-modules-0.5.1-1_2.6.32_71.18.1.el6.x86_64.rpm        r1       7585.8 K  2011-03-03 - 13:24  FabioMartinelli  ZFS Main Layer
zfs-modules-devel-0.5.1-1_2.6.32_71.18.1.el6.x86_64.rpm  r1       224.3 K   2011-03-03 - 13:24  FabioMartinelli  ZFS Main Layer
zfs-test-0.5.1-1.x86_64.rpm                              r1       26.4 K    2011-03-03 - 13:24  FabioMartinelli  ZFS Main Layer
Topic revision: r3 - 2011-03-03 - FabioMartinelli