Admin info on this node type
Firewall requirements
Regular Maintenance work
Nagios
Check out our t3nagios
Monitoring the SGE dbwriter daemon
A Jan '17 email from Fabio about an unexpected SGE dbwriter daemon "stuck" state.
The Java daemon in charge of importing the SGE reporting file /gridware/sge/default/common/reporting into the ARCO MySQL DB is indeed "dbwriter" ; this reporting file contains both jobs info and hosts info
dbwriter will periodically rename :
/gridware/sge/default/common/reporting
as
/gridware/sge/default/common/reporting.processing
and process it ; no new renaming of the reporting file will happen until the reporting.processing file has been fully consumed ; eventually the reporting.processing file will be deleted by dbwriter
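The broken-loop symptom described below (an oversized reporting.processing file) can be spotted mechanically. A minimal sketch ; the `check_stuck` helper name and the 100 MB threshold are our assumptions, not part of dbwriter:

```shell
#!/bin/sh
# Report whether a reporting.processing file looks "stuck".
# The 100 MB threshold is an assumption, not an official dbwriter limit.
check_stuck() {
  file=$1
  limit=$((100 * 1024 * 1024))
  [ -f "$file" ] || { echo "absent"; return; }
  size=$(stat -c %s "$file")
  if [ "$size" -gt "$limit" ]; then
    echo "stuck"
  else
    echo "ok"
  fi
}

check_stuck /gridware/sge/default/common/reporting.processing
```

A cron job printing "stuck" could then page an admin before the SGE stats fall days behind.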
dbwriter will also produce hosts/jobs stats and remove old rows from the ARCO MySQL DB according to the settings defined in this .xml file :
/gridware/sge/dbwriter/database/mysql/dbwriter.xml
please have a look at it ; no dbwriter restart is needed after a change
now, a 0.5GB reporting.processing file is a clear symptom that the perpetual renaming/processing/deleting loop got broken ; indeed our SGE stats were stuck at :
# mysql --defaults-extra-file=/root/arco_read_my.cnf -u arco_read -D sge_arco -h t3ce02 --execute="select job_number,username,submission_time,end_time from view_accounting order by end_time desc limit 1 ; "
+------------+----------+---------------------+---------------------+
| job_number | username | submission_time | end_time |
+------------+----------+---------------------+---------------------+
| 7694385 | ursl | 2016-12-27 22:36:05 | 2016-12-29 14:46:49 |
I usually try a simple dbwriter restart and wait to see what happens, but seemingly this time the reporting.processing file was too big and it never got deleted by dbwriter, so I extracted the jobs info from the reporting.processing file by :
# pwd
/gridware/sge/default/common
# /etc/init.d/sgedbwriter.p6444 stop
# grep job reporting.processing > reporting.processing.only.jobs
# mv reporting.processing reporting.processing.full
# cp -p reporting.processing.only.jobs reporting.processing
# /etc/init.d/sgedbwriter.p6444 start
since then the perpetual renaming/processing/deleting loop seems to be working fine
Fabio
Emergency Measures
VM past Snapshots
if you've really corrupted this VM then ask Peter to restore a past snapshot.
Tuning the h_vmem value on each t3wn server
Each t3wn server features a custom h_vmem setting that's usually ~ 1.8 * Tot RAM(t3wn), because the likelihood of several RAM-hungry jobs colliding on the same t3wn server at the same time is usually quite low ; each user job will implicitly, or explicitly, consume a slice of this custom h_vmem setting and Sun Grid Engine will decrease it accordingly ; eventually no more jobs will be allowed to enter the t3wn server ; if needed we can tune these custom settings by :
# to print the current settings
[root@t3ce02 ~]# for x in `seq 10 59` ; do qconf -se t3wn$x ; done 2>/dev/null | egrep 't3wn|h_vmem' | paste - -
hostname t3wn10.psi.ch complex_values h_vmem=40G,os=sl6
hostname t3wn11.psi.ch complex_values h_vmem=40G,os=sl6
...
# to change the settings, select a sensible setting for each kind of t3wn server
[root@t3ce02 ~]# for x in `seq 10 29` ; do echo qconf -rattr exechost complex_values h_vmem=40G,os=sl6 t3wn$x.psi.ch ; done | bash -x
[root@t3ce02 ~]# for x in `seq 30 40` ; do echo qconf -rattr exechost complex_values h_vmem=80G,os=sl6 t3wn$x.psi.ch ; done | bash -x
[root@t3ce02 ~]# for x in 41 43 44 50 ; do echo qconf -rattr exechost complex_values h_vmem=180G,os=sl6 t3wn$x.psi.ch ; done | bash -x
[root@t3ce02 ~]# for x in `seq 51 59` ; do echo qconf -rattr exechost complex_values h_vmem=200G,os=sl6 t3wn$x.psi.ch ; done | bash -x
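The ~ 1.8 * Tot RAM rule above can be computed before touching qconf ; a sketch in integer shell arithmetic (the `suggest_h_vmem` helper name and the round-down to whole GB are our conventions):

```shell
#!/bin/sh
# Suggest an h_vmem complex value for a node, given its RAM in GB.
# 1.8 * ram is computed as ram * 9 / 5, rounding down to whole GB.
suggest_h_vmem() {
  ram_gb=$1
  echo "$(( ram_gb * 9 / 5 ))G"
}

suggest_h_vmem 48   # a 48 GB worker node -> 86G
```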
recall that each user job implicitly requests 3GB of RAM because of this global setting :
[root@t3ce02 ~]# qconf -sc | egrep '#name|h_vmem'
#name shortcut type relop requestable consumable default urgency
h_vmem h_vmem MEMORY <= YES YES 3G 0
and that each user job can request at most 6GB of RAM because of the custom queue settings :
[root@t3ce02 ~]# for Q in `qconf -sql` ; do echo $Q ; qconf -sq $Q | grep h_vmem ; done | paste - - | awk '{ printf "%-20s %s %s\n" ,$1,$2,$3}'
all.q h_vmem 6G
all.q.admin h_vmem 6G
debug.q h_vmem 6G
long.q h_vmem 6G
sherpa.gen.q h_vmem 6G
sherpa.int.long.q h_vmem 6G
sherpa.int.vlong.q h_vmem 6G
short.q h_vmem 6G
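Taken together, the node-level complex value and the per-job request determine how many jobs a node will accept ; a back-of-the-envelope sketch (the `jobs_that_fit` helper name is ours):

```shell
#!/bin/sh
# How many jobs fit on a node, given its h_vmem complex value (GB) and
# the per-job h_vmem request (default 3G, max 6G per the queue settings).
jobs_that_fit() {
  node_gb=$1
  per_job_gb=$2
  echo $(( node_gb / per_job_gb ))
}

jobs_that_fit 40 3   # a 40G node with default 3G requests -> 13
jobs_that_fit 80 6   # an 80G node with jobs requesting the 6G max -> 13
```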
Installation
Fabio uses these aliases ; Puppet recipes are in pdirmanifests :
alias ROOT='. /afs/cern.ch/sw/lcg/external/gcc/4.8/x86_64-slc6/setup.sh && . /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.26/x86_64-slc6-gcc48-opt/root/bin/thisroot.sh'
alias cscsela='ssh -AX fmartine@ela.cscs.ch'
alias cscslogin='ssh -AX fmartine@login.lcg.cscs.ch'
alias cscspub='ssh -AX fmartinelli@pub.lcg.cscs.ch'
alias dcache='ssh -2 -l admin -p 22224 t3dcachedb.psi.ch'
alias dcache04='ssh -2 -l admin -p 22224 t3dcachedb04.psi.ch'
alias gempty='git commit --allow-empty-message -m '\'''\'''
alias kscustom54='cd /afs/psi.ch/software/linux/dist/scientific/54/custom'
alias kscustom57='cd /afs/psi.ch/software/linux/dist/scientific/57/custom'
alias kscustom60='cd /afs/psi.ch/software/linux/dist/scientific/60/custom'
alias kscustom64='cd /afs/psi.ch/software/linux/dist/scientific/64/custom'
alias kscustom66='cd /afs/psi.ch/software/linux/dist/scientific/66/x86_64/custom'
alias ksdir='cd /afs/psi.ch/software/linux/kickstart/configs'
alias ksprepostdir='cd /afs/psi.ch/software/linux/dist/scientific/60/kickstart/bin'
alias l.='ls -d .* --color=auto'
alias ll='ls -l --color=auto'
alias ls='ls --color=tty'
alias mc='. /usr/libexec/mc/mc-wrapper.sh'
alias pdir='cd /afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/'
alias pdirf='cd /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/'
alias pdirmanifests='cd /afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/manifests/'
alias pdirredhat='cd /afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat'
alias pdirsolaris='cd /afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/Solaris/5.10'
alias vi='vim'
alias which='alias | /usr/bin/which --tty-only --read-alias --show-dot --show-tilde'
alias yumdir5='cd /afs/psi.ch/software/linux/dist/scientific/57/scripts'
alias yumdir6='cd /afs/psi.ch/software/linux/dist/scientific/6/scripts'
alias yumdir7='cd /afs/psi.ch/software/linux/dist/scientificlinux/7x/x86_64/Tier3/all'
alias yumdir7old='cd /afs/psi.ch/software/linux/dist/scientific/70.PLEASE_DO_NOT_USE_AND_DO_NOT_RENAME/scripts'
- SL5_ce.pp
- tier3-baseclasses.pp
Services
[root@t3ce02 ~]# netstat -tpl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 *:nfs *:* LISTEN -
tcp 0 0 *:7937 *:* LISTEN 3148/nsrexecd
tcp 0 0 *:962 *:* LISTEN 20711/rpc.mountd <--- t3ui* mount RO /gridware/sge/default/common
tcp 0 0 *:5666 *:* LISTEN 16337/nrpe
tcp 0 0 *:7938 *:* LISTEN 3148/nsrexecd
tcp 0 0 *:7939 *:* LISTEN 3148/nsrexecd
tcp 0 0 *:smc-http *:* LISTEN 3276/java
tcp 0 0 *:7940 *:* LISTEN 3148/nsrexecd
tcp 0 0 *:smc-https *:* LISTEN 3276/java
tcp 0 0 *:rpasswd *:* LISTEN 20520/rpc.statd
tcp 0 0 localhost.localdomain:smux *:* LISTEN 16151/snmpd
tcp 0 0 *:8649 *:* LISTEN 3031/gmond
tcp 0 0 *:mysql *:* LISTEN 20233/mysqld <--- local DB for accounting
tcp 0 0 *:34571 *:* LISTEN 2715/sge_qmaster
tcp 0 0 *:6444 *:* LISTEN 2715/sge_qmaster
tcp 0 0 *:6446 *:* LISTEN 2715/sge_qmaster
tcp 0 0 *:sunrpc *:* LISTEN 2326/portmap
tcp 0 0 localhost.localdomain:33714 *:* LISTEN 3276/java
tcp 0 0 *:948 *:* LISTEN 20696/rpc.rquotad
tcp 0 0 *:ssh *:* LISTEN 16448/sshd
tcp 0 0 localhost.lo:x11-ssh-offset *:* LISTEN 17412/0
tcp 0 0 localhost.localdomain:6011 *:* LISTEN 25438/2
tcp 0 0 *:58940 *:* LISTEN -
[root@t3ce02 ~]# netstat -upl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
udp 0 0 *:768 *:* 20520/rpc.statd
udp 0 0 *:nfs *:* -
udp 0 0 localhost.locald:syslog *:* 26284/syslog-ng
udp 0 0 *:7938 *:* 3148/nsrexecd
udp 0 0 *:rtip *:* 20520/rpc.statd
udp 0 0 *:snmp *:* 16151/snmpd
udp 0 0 *:945 *:* 20696/rpc.rquotad
udp 0 0 *:959 *:* 20711/rpc.mountd <--- t3ui* mount RO /gridware/sge/default/common
udp 0 0 *:bootpc *:* 2209/dhclient
udp 0 0 *:48608 *:* -
udp 0 0 *:sunrpc *:* 2326/portmap
udp 0 0 t3ce02.psi.ch:ntp *:* 15996/ntpd
udp 0 0 localhost.localdomain:ntp *:* 15996/ntpd
udp 0 0 *:ntp *:* 15996/ntpd
Sun Grid Engine - old doc
I should sort out this old and somewhat messy doc, but it's still valuable in many respects ; have a quick look :
SGE6dot2u5andARCOMySQLhostedonZFS
Sun Grid Engine
It's installed by RPMs in /gridware/sge ; be aware of the Tier3Policies#Batch_system_policies
Sun Grid Engine doesn't take the Unix secondary groups into account !
The SGE queue short.q.validation@t3wn10.psi.ch will accept only users having cms as their primary group ; here the account martinelli_f belonged to the cms group but that was NOT his primary group !
SGE Man page about ACL
[martinelli_f@t3ui10 QSUB_TESTs]$ qstat -j 3642032
==============================================================
job_number: 3642032
exec_file: job_scripts/3642032
submission_time: Mon May 6 16:35:48 2013
owner: martinelli_f
uid: 2980
group: ethz-ecal
gid: 529
sge_o_home: /shome/martinelli_f
sge_o_log_name: martinelli_f
sge_o_path: /bin:/opt/d-cache/srm/bin:/opt/d-cache/dcap/bin:/gridware/sge/bin/lx24-amd64:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/swshare/psit3/bin:/shome/martinelli_f/shellutils:/shome/martinelli_f/bin:/shome/martinelli_f/eclipse-IDE/
sge_o_shell: /bin/bash
sge_o_workdir: /shome/martinelli_f/QSUB_TESTs
sge_o_host: t3ui10
account: sge
cwd: /shome/martinelli_f/QSUB_TESTs
mail_list: martinelli_f@t3ui10.psi.ch
notify: FALSE
job_name: hostname.sh
jobshare: 0
hard_queue_list: short.q.validation@t3wn10.psi.ch
env_list:
script_file: hostname.sh
scheduling info: queue instance "all.q@t3wn35.psi.ch" dropped because it is full
queue instance "all.q@t3wn36.psi.ch" dropped because it is full
queue instance "all.q@t3wn34.psi.ch" dropped because it is full
queue instance "all.q@t3wn32.psi.ch" dropped because it is full
...
cannot run in queue "debug.q" because it is not contained in its hard queue list (-q)
cannot run in queue "short.q" because it is not contained in its hard queue list (-q)
has no permission for cluster queue "short.q.validation"
cannot run in queue "all.q" because it is not contained in its hard queue list (-q)
cannot run in queue "long.q" because it is not contained in its hard queue list (-q)
cannot run in queue "all.q.admin" because it is not contained in its hard queue list (-q)
Sun Grid Engine MySQL DB - ARCO
Apart from running the CLI qacct, a Sun Grid Engine Admin can monitor the cluster usage through the ARCO MySQL DB hosted on t3ce02 ; that will produce more detailed reports than the CLI qacct ; the ARCO MySQL DB is constantly updated with new rows ( both raw values, and further values derived from these raw values ) and cleaned of its oldest rows ; these add/delete operations are executed by the Java daemon sgedbwriter, an optional component of a Sun Grid Engine setup.
Here is the official Oracle Grid Engine Website, but consider that you'll usually consult the ARCO MySQL DB through a direct mysql session without interacting with the ARCO Web Console, which is both very old and slow ; accordingly you can avoid fully understanding the Web Console logic.
The daemon sgedbwriter is started as a normal init service ; regrettably it was found dead many times, as pointed out by
https://t3nagios.psi.ch/nagios/cgi-bin/extinfo.cgi?type=2&host=t3ce02&service=SGE+ARCO+file+dbwriter+log
so you may have to restart it from time to time :
/etc/init.d/sgedbwriter.p6444 start
sgedbwriter accesses the following files :
/gridware/sge/dbwriter/lib/mysql-connector-java.jar <-- to connect to MySQL by Java
/gridware/sge/default/common/reporting <--- Sun Grid Engine will create and constantly update the reporting file with new usage info ; sgedbwriter will read it, fill accordingly the ARCO MySQL DB and eventually *delete* it.
/gridware/sge/default/common/dbwriter.conf
/gridware/sge/dbwriter/database/mysql/dbwriter.xml
/gridware/sge/default/spool/dbwriter/dbwriter.log <--- Nagios constantly checks its freshness to monitor if sgedbwriter is at least alive or not.
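The Nagios freshness check on dbwriter.log can be reproduced by hand ; a sketch, not the actual Nagios plugin, where the `log_fresh` helper name and the 600 s staleness window are our assumptions:

```shell
#!/bin/sh
# Declare the log "stale" if it hasn't been modified for max_age seconds,
# which would mean sgedbwriter is probably dead.
log_fresh() {
  file=$1
  max_age=${2:-600}
  [ -f "$file" ] || { echo "missing"; return; }
  age=$(( $(date +%s) - $(stat -c %Y "$file") ))
  if [ "$age" -le "$max_age" ]; then
    echo "fresh"
  else
    echo "stale"
  fi
}

log_fresh /gridware/sge/default/spool/dbwriter/dbwriter.log
```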
How to run a SQL query
If everything works fine you can run a query producing interesting stats like :
[root@t3ce02 ~]# mysql --defaults-extra-file=/root/arco_read_my.cnf -u arco_read -D sge_arco -h t3ce02 --execute="SELECT date_format(time, '%Y-%m-%d') AS day, sum(completed) AS jobs FROM view_jobs_completed WHERE time > (current_timestamp - interval 1 year) GROUP BY day"
+------------+-------+
| day | jobs |
+------------+-------+
| 2012-07-17 | 5501 |
| 2012-07-18 | 1161 |
| 2012-07-19 | 1165 |
| 2012-07-20 | 2848 |
| 2012-07-21 | 1097 |
| 2012-07-22 | 805 |
...
How to check if dbwriter is really filling the MySQL DB with new rows
[root@t3ce02 sge]# mysql --defaults-extra-file=/root/arco_read_my.cnf -u arco_read -D sge_arco -h t3ce02 --execute="select job_number,username,submission_time,end_time from view_accounting order by submission_time desc limit 4 ; "
+------------+----------+---------------------+---------------------+
| job_number | username | submission_time | end_time |
+------------+----------+---------------------+---------------------+
| 7693720 | ursl | 2016-12-27 22:35:09 | 2016-12-29 00:00:08 |
| 7693719 | ursl | 2016-12-27 22:35:09 | 2016-12-29 00:00:03 |
| 7693718 | ursl | 2016-12-27 22:35:09 | 2016-12-29 00:00:32 |
| 7693717 | ursl | 2016-12-27 22:35:09 | 2016-12-29 00:00:29 |
+------------+----------+---------------------+---------------------+
In this case dbwriter got stuck, since today is 2017-01-03
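That "how far behind is ARCO?" judgement can be turned into arithmetic ; a sketch using GNU date (the `days_behind` helper name is ours):

```shell
#!/bin/sh
# Days elapsed between the newest end_time seen in ARCO and "now".
# Being more than a day or two behind suggests dbwriter is stuck.
days_behind() {
  last=$1                      # e.g. 2016-12-29 from the query above
  now=${2:-$(date +%Y-%m-%d)}  # defaults to today
  echo $(( ( $(date -d "$now" +%s) - $(date -d "$last" +%s) ) / 86400 ))
}

days_behind 2016-12-29 2017-01-03   # -> 5
```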
/var/spool/arco/queries
Here are some default ARCO queries ; you just have to extract the SQL part from these files :
/var/spool/arco/queries/1_Month_CPU_Time_per_day_per_user.xml
/var/spool/arco/queries/1_Month_SUM_Wall_Time_per_User.xml
/var/spool/arco/queries/1_Month_SUM_Wall_time_and_SUM_CPU_Time_per_User.xml
/var/spool/arco/queries/1_day_CPU_User_and_System_usage.xml
/var/spool/arco/queries/24HoursJobs.xml
/var/spool/arco/queries/AR_Attributes.xml
/var/spool/arco/queries/AR_Log.xml
/var/spool/arco/queries/AR_Reserved_Time_Usage.xml
/var/spool/arco/queries/AR_by_User.xml
/var/spool/arco/queries/Accounting_per_AR.xml
/var/spool/arco/queries/Accounting_per_Department.xml
/var/spool/arco/queries/Accounting_per_Project.xml
/var/spool/arco/queries/Accounting_per_User.xml
/var/spool/arco/queries/Average_Job_Turnaround_Time.xml
/var/spool/arco/queries/Average_Job_Wait_Time.xml
/var/spool/arco/queries/Average_job_length_per_user_per_month.xml
/var/spool/arco/queries/DBWriter_Performance.xml
/var/spool/arco/queries/Failed_overlong_jobs_per_user.xml
/var/spool/arco/queries/Host_Load.xml
/var/spool/arco/queries/JOBs_MORE_3GB_RAM_LAST_2_MONTHS.xml
/var/spool/arco/queries/Job_Log.xml
/var/spool/arco/queries/Job_efficiency_per_user.xml
/var/spool/arco/queries/Job_length_histogram.xml
/var/spool/arco/queries/Jobs_per_a_specific_hour_per_users.xml
/var/spool/arco/queries/Jobs_per_hours_per_users.xml
/var/spool/arco/queries/Jobs_shorter_than_1h.xml
/var/spool/arco/queries/Jobs_shorter_that_1h_per_user.xml
/var/spool/arco/queries/Number_of_Jobs_Completed_per_AR.xml
/var/spool/arco/queries/Number_of_Jobs_completed.xml
/var/spool/arco/queries/Queue_Consumables.xml
/var/spool/arco/queries/Statistic_History.xml
/var/spool/arco/queries/Statistics.xml
/var/spool/arco/queries/Wallclock_time.xml
/var/spool/arco/queries/average2.xml
/var/spool/arco/queries/cumul_walltime_vs_job_walltime.xml
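Extracting the SQL part can be scripted ; a sketch assuming the query text sits inside a `<sql>...</sql>` element (the tag name is an assumption about the ARCO file format ; inspect one of the files above and adjust):

```shell
#!/bin/sh
# Print whatever lies between <sql> and </sql> in an ARCO query file.
# The <sql> element name is an assumption; check the files to confirm.
extract_sql() {
  f=$1
  [ -f "$f" ] || return 0
  sed -n '/<sql>/,/<\/sql>/p' "$f" \
    | sed -e 's/.*<sql>//' -e 's|</sql>.*||'
}

extract_sql /var/spool/arco/queries/Number_of_Jobs_completed.xml
```

The output can then be fed straight to the mysql CLI shown in the earlier sections.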
RAM usage during the last 6 months
mysql> select username, RAM_RANGE, count(*) as JOBs from ( SELECT username, CASE WHEN maxvmem > 0 and maxvmem <= 1000000000 THEN '0GB-1GB' WHEN maxvmem > 1000000000 and maxvmem <= 2000000000 THEN '1GB-2GB' WHEN maxvmem > 2000000000 and maxvmem <= 3000000000 THEN '2GB-3GB' ELSE '>3GB' END as RAM_RANGE from view_accounting where exit_status=0 and submission_time > (current_timestamp - interval 6 month) ) as job_summaries GROUP BY username,RAM_RANGE ;
+--------------+-----------+--------+
| username | RAM_RANGE | JOBs |
+--------------+-----------+--------+
| aspiezia | 0GB-1GB | 2 |
| aspiezia | 1GB-2GB | 19 |
| bianchi | 0GB-1GB | 32 |
| bianchi | 1GB-2GB | 665 |
| bianchi | 2GB-3GB | 4348 |
| bianchi | >3GB | 327 |
| casal | 0GB-1GB | 195 |
| casal | 1GB-2GB | 1547 |
| casal | 2GB-3GB | 1215 |
| casal | >3GB | 94 |
| cgalloni | 0GB-1GB | 96945 |
| cgalloni | 1GB-2GB | 12072 |
| cgalloni | 2GB-3GB | 5263 |
| cgalloni | >3GB | 1827 |
| cheidegg | 0GB-1GB | 929 |
| cheidegg | 1GB-2GB | 17 |
| cheidegg | 2GB-3GB | 16 |
| cheidegg | >3GB | 2 |
| clange | 0GB-1GB | 72903 |
| clange | 1GB-2GB | 2726 |
| clange | 2GB-3GB | 192761 |
| clange | >3GB | 1222 |
| cmssgm | 0GB-1GB | 3 |
| dmeister | 0GB-1GB | 2 |
| dsalerno | 0GB-1GB | 308 |
| dsalerno | 1GB-2GB | 2058 |
| dsalerno | 2GB-3GB | 1235 |
| dsalerno | >3GB | 70 |
| gaperrin | 0GB-1GB | 829 |
| gaperrin | 1GB-2GB | 132 |
| gaperrin | 2GB-3GB | 54 |
| gaperrin | >3GB | 1263 |
| grauco | 0GB-1GB | 2 |
| grauco | 1GB-2GB | 15 |
| grauco | 2GB-3GB | 1 |
| grauco | >3GB | 1 |
| gregor | 0GB-1GB | 667 |
| gregor | 1GB-2GB | 3051 |
| hinzmann | 0GB-1GB | 919 |
| hinzmann | 1GB-2GB | 2204 |
| hinzmann | 2GB-3GB | 38 |
| hinzmann | >3GB | 1141 |
| jhoss | 0GB-1GB | 3274 |
| jngadiub | 0GB-1GB | 51809 |
| jngadiub | 1GB-2GB | 10967 |
| jngadiub | 2GB-3GB | 1995 |
| jngadiub | >3GB | 1549 |
| jpata | 0GB-1GB | 20958 |
| jpata | 1GB-2GB | 109 |
| jpata | 2GB-3GB | 4781 |
| jpata | >3GB | 242 |
| kotlinski | 0GB-1GB | 42 |
| kotlinski | 1GB-2GB | 63 |
| kotlinski | 2GB-3GB | 175 |
| kotlinski | >3GB | 25 |
| leac | 0GB-1GB | 769 |
| leac | 1GB-2GB | 1211 |
| leac | 2GB-3GB | 5107 |
| leac | >3GB | 486 |
| martinelli_f | 0GB-1GB | 81205 |
| martinelli_f | >3GB | 264 |
| micheli | 1GB-2GB | 18 |
| mmasciov | 0GB-1GB | 12352 |
| mmasciov | 1GB-2GB | 12845 |
| mmasciov | 2GB-3GB | 7299 |
| mmasciov | >3GB | 2646 |
| mquittna | 0GB-1GB | 1193 |
| mquittna | 1GB-2GB | 1072 |
| mschoene | 0GB-1GB | 3943 |
| mschoene | 1GB-2GB | 757 |
| mschoene | >3GB | 1273 |
| musella | 0GB-1GB | 307 |
| musella | 1GB-2GB | 376 |
| musella | >3GB | 5 |
| mwang | 0GB-1GB | 3493 |
| mwang | 1GB-2GB | 27 |
| mwang | 2GB-3GB | 51 |
| mwang | >3GB | 497 |
| nchernya | 0GB-1GB | 7888 |
| nchernya | 1GB-2GB | 19 |
| pandolf | 0GB-1GB | 285 |
| pandolf | >3GB | 4 |
| perrozzi | 0GB-1GB | 399 |
| perrozzi | 1GB-2GB | 4 |
| perrozzi | 2GB-3GB | 16 |
| perrozzi | >3GB | 217 |
| thaarres | 0GB-1GB | 43538 |
| thaarres | 1GB-2GB | 3710 |
| thaarres | 2GB-3GB | 5 |
| thaarres | >3GB | 644 |
| tklijnsm | 0GB-1GB | 873 |
| tklijnsm | 1GB-2GB | 3907 |
| tklijnsm | 2GB-3GB | 87575 |
| tklijnsm | >3GB | 1562 |
| ursl | 0GB-1GB | 1012 |
| ursl | 1GB-2GB | 2663 |
| ursl | 2GB-3GB | 16513 |
| ursl | >3GB | 2536 |
| vlambert | 1GB-2GB | 20 |
| vlambert | 2GB-3GB | 20 |
| wiederkehr_s | 0GB-1GB | 24 |
| wiederkehr_s | 1GB-2GB | 46 |
| wiederkehr_s | 2GB-3GB | 241 |
| wiederkehr_s | >3GB | 173 |
| yangyong | 0GB-1GB | 877 |
| yangyong | >3GB | 1 |
+--------------+-----------+--------+
/gridware/sge/default/common/reporting
To ask Sun Grid Engine to generate this file you need to turn on the reporting=true setting :
[root@t3ce02 ~]# qconf -sconf |grep reporting_params
reporting_params accounting=true reporting=true
/gridware/sge/default/common/reporting.not.deleted.by.dbwriter
By default, and regrettably we can't really change it, sgedbwriter will delete
/gridware/sge/default/common/reporting
once it has been read ; in order to save a copy for the future we've decided to run a permanent tail in the background, started during the initial init sequence by :
# ll /gridware/sge/default/common/reporting*
-rw-r--r-- 1 root root 8337 Jul 16 14:59 /gridware/sge/default/common/reporting
-rw-r--r-- 1 root root 25221477 Jul 4 2011 /gridware/sge/default/common/reporting.4-Jul-2001_15:42
lrwxrwxrwx 1 root root 42 Apr 24 18:34 /gridware/sge/default/common/reporting.not.deleted.by.dbwriter -> /mnt/sdb/reporting.not.deleted.by.dbwriter
# cat /etc/rc.local <--- last commands executed during the initial init sequence
#!/bin/sh
# Puppet Managed File
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.
touch /var/lock/subsys/local
#http://yoshinorimatsunobu.blogspot.com/2009/04/linux-io-scheduler-queue-size-and.html
echo 100000 > /sys/block/sdb/queue/nr_requests
echo deadline > /sys/block/sdb/queue/scheduler
# by martinelli to start Sun Web Console + SGE ARCO
/usr/sbin/smcwebserver stop
/usr/sbin/smcwebserver start
# 2 May 2013 - F.Martinelli
# needed by VMWare I/O path failover,
# if you add an other disk then add an other line here
echo 180 > /sys/block/sda/device/timeout
echo 180 > /sys/block/sdb/device/timeout
nohup tail --pid=$(pidof sge_qmaster) -n 0 -F /gridware/sge/default/common/accounting >> /gridware/sge/default/common/accounting.not.deleted.by.logrotate &
nohup tail --pid=$(pidof sge_qmaster) -n 0 -F /gridware/sge/default/common/reporting >> /gridware/sge/default/common/reporting.not.deleted.by.dbwriter &
/gridware/sge/default/common/accounting.not.deleted.by.logrotate
See the previous section.
/gridware/sge/default/common/dbwriter.conf
DBWRITER_USER_PW=:)
DBWRITER_USER=arco_write
READ_USER=arco_read
READ_USER_PW=
DBWRITER_URL=jdbc:mysql://localhost:3306/sge_arco
DB_SCHEMA=n/a
TABLESPACE=n/a
TABLESPACE_INDEX=n/a
DBWRITER_CONTINOUS=true
DBWRITER_INTERVAL=180
DBWRITER_DRIVER=com.mysql.jdbc.Driver
DBWRITER_REPORTING_FILE=/gridware/sge/default/common/reporting
DBWRITER_CALCULATION_FILE=/gridware/sge/dbwriter/database/mysql/dbwriter.xml
DBWRITER_SQL_THRESHOLD=3
SPOOL_DIR=/gridware/sge/default/spool/dbwriter
DBWRITER_DEBUG=INFO
/gridware/sge/dbwriter/database/mysql/dbwriter.xml
..
average queue utilization per hour
Not really correct value, as each entry for slot usage is weighted equally.
It would be necessary to have time_start and time_end per value and weight
the values by time.
...
number of jobs finished per host
...
number of jobs finished per user
...
number of jobs finished per project
...
build daily values from hourly ones
...
=========== Statistic Rules ========================================== -->
SELECT sge_host, sge_queue, sge_user, sge_group, sge_project, sge_department,
sge_host_values, sge_queue_values, sge_user_values, sge_group_values, sge_project_values, sge_department_values,
sge_job, sge_job_log, sge_job_request, sge_job_usage, sge_statistic, sge_statistic_values,
sge_share_log, sge_ar, sge_ar_attribute, sge_ar_usage, sge_ar_log, sge_ar_resource_usage
FROM (SELECT count(*) AS sge_host FROM sge_host) AS c_host,
(SELECT count(*) AS sge_queue FROM sge_queue) AS c_queue,
(SELECT count(*) AS sge_user FROM sge_user) AS c_user,
(SELECT count(*) AS sge_group FROM sge_group) AS c_group,
(SELECT count(*) AS sge_project FROM sge_project) AS c_project,
(SELECT count(*) AS sge_department FROM sge_department) AS c_department,
(SELECT count(*) AS sge_host_values FROM sge_host_values) AS c_host_values,
(SELECT count(*) AS sge_queue_values FROM sge_queue_values) AS c_queue_values,
(SELECT count(*) AS sge_user_values FROM sge_user_values) AS c_user_values,
(SELECT count(*) AS sge_group_values FROM sge_group_values) AS c_group_values,
(SELECT count(*) AS sge_project_values FROM sge_project_values) AS c_project_values,
(SELECT count(*) AS sge_department_values FROM sge_department_values) AS c_department_values,
(SELECT count(*) AS sge_job FROM sge_job) AS c_job,
(SELECT count(*) AS sge_job_log FROM sge_job_log) AS c_job_log,
(SELECT count(*) AS sge_job_request FROM sge_job_request) AS c_job_request,
(SELECT count(*) AS sge_job_usage FROM sge_job_usage) AS c_job_usage,
(SELECT count(*) AS sge_share_log FROM sge_share_log) AS c_share_log,
(SELECT count(*) AS sge_statistic FROM sge_statistic) AS c_sge_statistic,
(SELECT count(*) AS sge_statistic_values FROM sge_statistic_values) AS c_sge_statistic_values,
(SELECT count(*) AS sge_ar FROM sge_ar) AS c_sge_ar,
(SELECT count(*) AS sge_ar_attribute FROM sge_ar_attribute) AS c_sge_ar_attribute,
(SELECT count(*) AS sge_ar_usage FROM sge_ar_usage) AS c_sge_ar_usage,
(SELECT count(*) AS sge_ar_log FROM sge_ar_log) AS c_sge_ar_log,
(SELECT count(*) AS sge_ar_resource_usage FROM sge_ar) AS c_sge_ar_resource_usage
=========== Deletion Rules ========================================== -->
keep host raw values only 7 days
...
Backups
OS snapshots are taken nightly by the PSI VMWare Team ( e.g. Peter Huesser ), and we have LinuxBackupsByLegato to recover a single file.
Also you have :
[root@t3ce02 gridware]# /gridware/sge/util/upgrade_modules/save_sge_config.sh /gridware/sge_backup
Configuration successfully saved to /gridware/sge_backup directory.