SGE 6.2u5 and ARCO, MySQL hosted on ZFS

Revision 10, 2011-10-17 - FabioMartinelli
--> Running transaction check
---> Package sun-sge-arco.noarch 0:6.2-5 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

MySQL JDBC driver

Be sure that you have a MySQL JDBC driver file and link it inside the SGE directory:
[root@t3ce02 sge6_2u5]# yum install mysql-connector-java.x86_64

...
[root@t3ce02 lib]# pwd
/mnt/zfs/sge/gridware/sge/dbwriter/lib
 [root@t3ce02 lib]# ln -s /usr/share/java/mysql-connector-java.jar

Installation ./inst_dbwriter

During the dbwriter installation itself, which is well documented on the official SGE site, we were prompted for several settings; one is which Java to use, and there we specified '/etc/alternatives/jre/' so it stays protected from system Java updates. So we ran:
cd $SGE_ROOT/dbwriter && ./inst_dbwriter
 ... All parameters are now collected
DERIVED_FILE=/mnt/zfs/sge/gridware/sge/dbwriter/database/mysql/dbwriter.xml
DEBUG_LEVEL=FINE
Are these settings correct? (y/n) [y] >>
 

Please note the MySQL sge_arco Tables and Views creation phase:

Update version table
commiting changes
Version 6.1u3 (id=6) successfully installed
Install version 6.1u4 (id=7) -------
  Create configuration file for dbwriter in /mnt/zfs/sge/gridware/sge/default/common
Hit <RETURN> to continue >>
 

When the dbwriter installation completed, we got:

dbwriter startup script
 

We can install the startup script that will
start dbwriter at machine boot (y/n) [y] >>
cp /mnt/zfs/sge/gridware/sge/default/common/sgedbwriter /etc/init.d/sgedbwriter.p6444
/usr/lib/lsb/install_initd /etc/init.d/sgedbwriter.p6444
Installation of dbwriter completed
[root@t3ce02 dbwriter]#
 

Checking dbwriter logs

The dbwriter program is now a service on your system; you can start/stop it with:
/etc/init.d/sgedbwriter.p6444

 
And double-check what's going on with a tail command on the log files:
[root@t3ce02 ~]# tail -f /mnt/zfs/sge/gridware/sge/default/spool/dbwriter/dbwriter.log

06/03/2011 16:24:51|t3ce02.psi.ch|ivedValueThread.commitExecuted|D|new object received, timestampOfLastRowData is 1,299,428,609,000
06/03/2011 16:24:51|t3ce02.psi.ch|iter.file.FileParser.parseFile|I|Deleting file reporting.processing
06/03/2011 16:24:51|t3ce02.psi.ch|.RecordCache.getStoredDBRecord|D|Object for key 'dbwriter' = [sge_statistic, id=1, parent=0, key=['dbwriter'], addr=0x7f712b3a]
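For day-to-day monitoring it helps to filter the log down to problem lines. A minimal sketch: the single-letter field visible in the lines above is the log level (D debug, I info); we assume W and E mark warnings and errors.

```shell
# filter a dbwriter log stream down to warning/error lines; the |W| / |E|
# level codes are an assumption based on the |D| / |I| codes visible above
dbwriter_errors() { grep -E '\|[WE]\|' "$@"; }

# usage on the live log:
#   tail -f /mnt/zfs/sge/gridware/sge/default/spool/dbwriter/dbwriter.log | dbwriter_errors
```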
 

dbwriter RE-importing a reporting file

If for whatever reason you need to ingest the reporting file again into the MySQL DB sge_arco, then please run this sequence of truncates:

TRUNCATE `sge_checkpoint`;
TRUNCATE `sge_department`;
TRUNCATE `sge_group`;
TRUNCATE `sge_host`;
TRUNCATE `sge_host_values`;
TRUNCATE `sge_job`;
TRUNCATE `sge_job_log`;
TRUNCATE `sge_job_request`;
TRUNCATE `sge_job_usage`;
TRUNCATE `sge_project`;
TRUNCATE `sge_project_values`;
TRUNCATE `sge_queue`;
TRUNCATE `sge_queue_values`;
TRUNCATE `sge_statistic`;
TRUNCATE `sge_statistic_values`;
TRUNCATE `sge_user`;
TRUNCATE `sge_user_values`;
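To avoid pasting the statements by hand, here is a small sketch that generates them into a file you can review and then feed to the mysql client; the table list is exactly the one above, while the output path is an arbitrary choice of ours.

```shell
# generate the TRUNCATE statements for the 17 sge_arco tables listed above;
# review the file, then run: mysql sge_arco < /tmp/truncate_sge_arco.sql
for t in sge_checkpoint sge_department sge_group sge_host sge_host_values \
         sge_job sge_job_log sge_job_request sge_job_usage sge_project \
         sge_project_values sge_queue sge_queue_values sge_statistic \
         sge_statistic_values sge_user sge_user_values; do
    printf 'TRUNCATE `%s`;\n' "$t"
done > /tmp/truncate_sge_arco.sql
```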
 

SGE ARCO

Now it's time to install the reporting layer; please have a look at the official ARCO documentation.

Here follows our installation experience:


MySQL JDBC driver

ARCO is a Java application that needs to communicate with MySQL, so we created another symbolic link, as in the dbwriter case:
[root@t3ce02 ~]# ll /mnt/zfs/sge/gridware/sge/reporting/WEB-INF/lib/mysql-connector-java.jar
lrwxrwxrwx 1 root root 40 Mar  3 20:30 /mnt/zfs/sge/gridware/sge/reporting/WEB-INF/lib/mysql-connector-java.jar -> /usr/share/java/mysql-connector-java.jar

 
The link was properly recognized by the ARCO installation procedure:
...

 Searching for the jdbc driver com.mysql.jdbc.Driver in directory /mnt/zfs/sge/gridware/sge/reporting/WEB-INF/lib

OK, jdbc driver found

Should the connection to the database be tested? (y/n) [y] >>
  Test database connection to 'jdbc:mysql://localhost:3306/sge_arco' ... OK
Hit <RETURN> to continue >>
  DB parameters are now collected
DB_URL=jdbc:mysql://localhost:3306/sge_arco
DB_USER=arco_read
Are these settings correct? (y/n) [y] >>
 
Do you want to add another cluster? (y/n) [n] >>n
  Configure users with write access

Users: default

Enter a user login name. (Hit <RETURN> to finish) >> root
  Users: default root
Enter a user login name. (Hit <RETURN> to finish) >> martinelli_f
  Users: default root martinelli_f
Enter a user login name. (Hit <RETURN> to finish) >>
  All parameters are now collected
SPOOL_DIR=/var/spool/arco
APPL_USERS=default root martinelli_f
Are these settings correct? (y/n) [y] >>
found incorrect permissions lrwxrwxrwx for /mnt/zfs/sge/gridware/sge/reporting/WEB-INF/lib/mysql-connector-java.jar
Correcting file permissions ... done

Standard ARCO Queries

 SGE Engineers designed some standard queries useful for any kind of SGE cluster:
....

 Install predefined queries
Directory /var/spool/arco does not exist, create it? (y/n) [y] >> y
Create directory /var/spool/arco
Create directory /var/spool/arco/queries
Copy query Wallclock_time.xml ... OK
Create directory /var/spool/arco/results
Hit <RETURN> to continue >>
  ARCo reporting module setup
  /mnt/zfs/sge/gridware/sge/default/arco/reporting/config.xml
Hit <RETURN> to continue >>
  Importing Sun Java Web Console 3.0 or 3.1 files
Imported files to /mnt/zfs/sge/gridware/sge/default/arco/reporting
Created product images in /mnt/zfs/sge/gridware/sge/default/arco/reporting/com_sun_web_ui/images
Hit <RETURN> to continue >>
  Registering the SGE reporting module in the Sun Java Web Console
 

Finally, the ARCO web access

At the end of the installation script we were able to access ARCO at https://t3ce02.psi.ch:6789/
 

SGE, importing a previous reporting file

It's possible to ingest a previous reporting file coming from another SGE installation; because our old cluster had one, we ingested more than 1.5 years of statistics in this way:
[root@t3ce02 common]# ll /root/reporting

-rw-r--r--. 1 root root 740664403 Feb 28 23:08 /root/reporting
[root@t3ce02 common]# pwd
/mnt/zfs/sge/gridware/sge/default/common
 

SGE, inspect tool

It's possible to graphically monitor several SGE clusters and their queues by using the Java tool Inspect, which we installed in the following way:
[root@t3ce02 SGE6.2u5]# unzip sge62u5_inspect_rpm.zip

Archive: sge62u5_inspect_rpm.zip
inflating: sge6_2u5/sun-sge-inspect-6.2-5.noarch.rpm
[root@t3ce02 SGE6.2u5]# yum install sge6_2u5/sun-sge-inspect-6.2-5.noarch.rpm
Examining sge6_2u5/sun-sge-inspect-6.2-5.noarch.rpm: sun-sge-inspect-6.2-5.noarch
Marking sge6_2u5/sun-sge-inspect-6.2-5.noarch.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package sun-sge-inspect.noarch 0:6.2-5 set to be updated
--> Finished Dependency Resolution
  Dependencies Resolved
 

Install jdk-develop by using yum:

...
============================================================================================================================
 Package                        Arch                 Version                    Repository                             Size
============================================================================================================================
 

You need to create users and keys (not really clear why):

[root@t3ce02 bin]# cat /opt/SGE6.2u5/myusers.txt

root:iamroot:fabio.martinelli@psi.ch
[root@t3ce02 bin]#
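From the sample above the file format looks like login:password:email; that is our reading of the single sample line, not something documented on this page. The fields are easy to pull apart if you ever need them in a script:

```shell
# split a myusers.txt line into its fields; the login:password:email
# layout is assumed from the sample above
echo 'root:iamroot:fabio.martinelli@psi.ch' | awk -F: '{print $1, $3}'
# -> root fabio.martinelli@psi.ch
```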
 

Create and use passwords:

[root@t3ce02 bin]# /mnt/zfs/sge/gridware/sge/util/sgeCA/sge_ca -userks -kspwf /tmp/mysecret.txt

 
We made a script to set up JAVA_HOME and run Inspect:
[root@t3ce02 ~]# ll /usr/local/bin/sgeinspect.sh 
lrwxrwxrwx 1 root root 42 Mar  3 23:51 /usr/local/bin/sgeinspect.sh -> /gridware/sge/sgeinspect/bin/sgeinspect.sh

[root@t3ce02 ~]# cat /usr/local/bin/sgeinspect.sh
export JAVA_HOME=/etc/alternatives/java_sdk
cd -
[root@t3ce02 ~]#
 

SGE EXECD 6.2u5 installation

Installing the SGE execution side was easier than installing the master, but it requires some steps that are well described in the official SGE How to Install Execution Hosts. One step is missing from the official documentation: after the RPM installation, and before running ./install_execd, run this command from the master:
rsync -av /gridware/sge/default/common EXECUTION_HOST:/gridware/sge/default/common

 
so that ./install_execd finds the configuration files that state which host is the master and the parameters affecting execution.
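With many execution hosts this copy is worth scripting. A small sketch that only prints the per-host commands so you can review them before running (the host names here are made-up examples, not hosts from this cluster):

```shell
# print one rsync command per execution host; run them once reviewed
gen_rsync_cmds() {
    for h in "$@"; do
        echo rsync -av /gridware/sge/default/common "$h":/gridware/sge/default/
    done
}
gen_rsync_cmds wn01.example.org wn02.example.org
```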

We did an installation without any NFS dependency; this should avoid global job crashes when an NFS server becomes unreachable, and it improves I/O performance.


Dropping old job files

It's a good policy to keep recent SGE computations on the WN to troubleshoot what went wrong during a job failure, but after a month or two it doesn't make sense to preserve those files, and the directory may fill up: when I was at ESA I wrote this cron script to move out and later delete old SGE job dirs (the script has to be adapted to your cluster):
#!/bin/bash 

# by martinelli @ ESA - 26/05/2010
# /var/sge/spool/$HOST/active_jobs dir was found with 31199 dirs inside; because that dir is hosted on an EXT3 filesystem it can store at most 32k subdirectories
# prepares move commands
cd $ACTIVE_JOBS
/usr/bin/find . -mtime +15 -type d -exec echo mv '{}' /stage/active_jobs_old/ \; > /tmp/$BASENAME-mv.sh
# executes move commands
source /tmp/$BASENAME-mv.sh

# later drops old files, prepare the rm commands
cd /stage/active_jobs_old/

/usr/bin/find . -mtime +45 -type d -exec echo rm -rf '{}' \; > /tmp/$BASENAME-rm.sh
# execute them
source /tmp/$BASENAME-rm.sh && exit 0
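The move stage of the script can be packed into a reusable function; a minimal sketch exercised on plain directories, assuming GNU find and touch (the day threshold mirrors the +15 predicate above):

```shell
# move subdirectories of $1 older than $3 days into $2 -- same idea as the
# find/mv stage of the cron script above, minus the echo/source round trip
age_out() {
    src=$1; dest=$2; days=$3
    mkdir -p "$dest"
    find "$src" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" \
         -exec mv '{}' "$dest"/ \;
}
```

Note the design choice in the original script: generating the commands with `echo` into /tmp and then sourcing the file lets you inspect exactly what will be moved or removed before anything runs; this function trades that safety for brevity.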
 
  • ARCO Graph showing 1 day of CPU usage in the T3 Cluster, date 2011-01-24:
    1dayCPUusage.png
 
  • ARCO Graph showing 1 day of MEM usage in the T3 Cluster, date 2011-01-24:
    1dayMEMusage.png
 
 