<!-- keep this as a security measure:
#uncomment if the topic should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page to be viewable only by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->
---+!! Node Type: %CALC{"$SUBSTITUTE(%TOPIC%,NodeType,)"}%

[[Adm%TOPIC%?topictemplate=NodeTypeTemplate][Admin info on this node type]]

---++!! Firewall requirements

| *local port* | *open to* | *reason* |
<!--
#| 22/tcp | * | Example entry for ssh |
-->
----------------
%TOC{title="Table of contents"}%

---+ Regular Maintenance work
<!--
#List any regular activities which do not run automatically and need an administrator's action.
-->
Check out our [[https://t3nagios.psi.ch/nagios/cgi-bin/status.cgi?host=t3ce02][t3nagios]] page.

---+ Emergency Measures
<!--
#List any measures that must be taken in case of some major incident, e.g. whether a mailing
#list must be contacted or whether other services need to be shut down, etc.
-->

---++ VM past Snapshots
If you have really corrupted this VM, ask Peter to restore a past snapshot.
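As a CLI complement to the t3nagios check above, the cluster queue summary from =qstat -g c= can be scanned for queues that are nearly full. This is only a sketch in our own style, not something this page mandates: the =flag_busy_queues= helper name is ours, the =qstat -g c= column layout (queue, load, used, reserved, available, total, ...) is assumed, and the canned sample below stands in for a live call on =t3ce02=.

```shell
# Hypothetical helper: flag cluster queues whose slot usage exceeds a
# threshold, given `qstat -g c` style output on stdin.
# Assumed columns: QUEUE CQLOAD USED RES AVAIL TOTAL aoACDS cdsuE
flag_busy_queues() {
  threshold=${1:-80}   # percent of TOTAL slots in use
  awk -v t="$threshold" 'NR > 2 && $6 > 0 {
    pct = 100 * $3 / $6
    if (pct >= t) printf "%s %.0f%%\n", $1, pct
  }'
}

# Demo with canned output instead of a live `qstat -g c` call:
sample='CLUSTER QUEUE CQLOAD USED RES AVAIL TOTAL aoACDS cdsuE
--------------------------------------------------------------
all.q   0.55 410 0 70 480 0 0
short.q 0.10  20 0 80 100 0 0'
printf '%s\n' "$sample" | flag_busy_queues 80
```

Against the sample data only =all.q= (410 of 480 slots used) crosses the 80% threshold.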
---++ Tuning the =h_vmem= value on each =t3wn= server

Each =t3wn= server features a custom =h_vmem= setting that is usually ~1.8 * the total RAM of the =t3wn=, because the likelihood that several jobs using a lot of RAM collide on the same =t3wn= server at the same time is usually low. Each user job implicitly or explicitly requests a portion of this custom =h_vmem= setting and Sun Grid Engine decreases it accordingly; eventually no more jobs are allowed to enter the =t3wn= server. If needed, we can tune these custom settings as follows:
<pre>
# print the current settings
[root@t3ce02 ~]# for x in `seq 10 59` ; do qconf -se t3wn$x ; done 2>/dev/null | egrep 't3wn|h_vmem' | paste - -
hostname t3wn10.psi.ch   complex_values h_vmem=%BLUE%40G%ENDCOLOR%,os=sl6
hostname t3wn11.psi.ch   complex_values h_vmem=%BLUE%40G%ENDCOLOR%,os=sl6
...

# change the settings, choosing a sensible value for each kind of t3wn server
[root@t3ce02 ~]# for x in `seq 10 29` ; do echo qconf -rattr exechost complex_values h_vmem=%BLUE%40G%ENDCOLOR%,os=sl6 t3wn$x.psi.ch ; done | bash -x
[root@t3ce02 ~]# for x in `seq 30 40` ; do echo qconf -rattr exechost complex_values h_vmem=%BLUE%80G%ENDCOLOR%,os=sl6 t3wn$x.psi.ch ; done | bash -x
[root@t3ce02 ~]# for x in 41 43 44 50 ; do echo qconf -rattr exechost complex_values h_vmem=%BLUE%180G%ENDCOLOR%,os=sl6 t3wn$x.psi.ch ; done | bash -x
[root@t3ce02 ~]# for x in `seq 51 59` ; do echo qconf -rattr exechost complex_values h_vmem=%BLUE%200G%ENDCOLOR%,os=sl6 t3wn$x.psi.ch ; done | bash -x
</pre>
Recall that each user job implicitly requests 3GB of RAM because of this global setting:
<pre>
[root@t3ce02 ~]# qconf -sc | egrep '#name|h_vmem'
#name     shortcut  type    relop  requestable  consumable  default  urgency
h_vmem    h_vmem    MEMORY  <=     YES          %BLUE%YES%ENDCOLOR%         %BLUE%3G%ENDCOLOR%       0
</pre>
and that each user job can request at most 6GB of RAM because of the custom queue settings:
<pre>
[root@t3ce02 ~]# for Q in `qconf -sql` ; do echo $Q ; qconf -sq $Q | grep h_vmem ; done | paste - - | awk '{ printf "%-20s %s %s\n" ,$1,$2,$3}'
all.q                h_vmem %BLUE%6G%ENDCOLOR%
all.q.admin          h_vmem %BLUE%6G%ENDCOLOR%
debug.q              h_vmem %BLUE%6G%ENDCOLOR%
long.q               h_vmem %BLUE%6G%ENDCOLOR%
sherpa.gen.q         h_vmem %BLUE%6G%ENDCOLOR%
sherpa.int.long.q    h_vmem %BLUE%6G%ENDCOLOR%
sherpa.int.vlong.q   h_vmem %BLUE%6G%ENDCOLOR%
short.q              h_vmem %BLUE%6G%ENDCOLOR%
</pre>

---+ Installation
<!--
#Comment here on any peculiarities of the installation, e.g. on special packages needed, special setup
#procedures which are not obvious
-->
Fabio uses these aliases; the Puppet recipes are in =puppetdirnodes=:
<pre>
alias kscustom57='cd /afs/psi.ch/software/linux/dist/scientific/57/custom'
alias kscustom64='cd /afs/psi.ch/software/linux/dist/scientific/64/custom'
alias ksdir='cd /afs/psi.ch/software/linux/kickstart/configs'
alias puppetdir='cd /afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/'
alias puppetdirnodes='cd /afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/manifests/nodes'
alias puppetdirredhat='cd /afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat'
alias puppetdirsolaris='cd /afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/Solaris/5.10'
alias yumdir5='cd /afs/psi.ch/software/linux/dist/scientific/57/scripts'
alias yumdir6='cd /afs/psi.ch/software/linux/dist/scientific/6/scripts'
</pre>
   1 =SL5_ce.pp=
   1 =tier3-baseclasses.pp=

---+ Services
<!--
#List all the important services, their installation, configuration and how to start and stop them
-->
<pre>
[root@t3ce02 ~]# netstat -tpl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address                Foreign Address   State    PID/Program name
tcp   0      0      *:nfs                        *:*               LISTEN   -
tcp   0      0      *:7937                       *:*               LISTEN   3148/nsrexecd
tcp   0      0      *:962                        *:*               LISTEN   20711/rpc.mountd   <--- t3ui* mount RO /gridware/sge/default/common
tcp   0      0      *:5666                       *:*               LISTEN   16337/nrpe
tcp   0      0      *:7938                       *:*               LISTEN   3148/nsrexecd
tcp   0      0      *:7939                       *:*               LISTEN   3148/nsrexecd
tcp   0      0      *:smc-http                   *:*               LISTEN   3276/java
tcp   0      0      *:7940                       *:*               LISTEN   3148/nsrexecd
tcp   0      0      *:smc-https                  *:*               LISTEN   3276/java
tcp   0      0      *:rpasswd                    *:*               LISTEN   20520/rpc.statd
tcp   0      0      localhost.localdomain:smux   *:*               LISTEN   16151/snmpd
tcp   0      0      *:8649                       *:*               LISTEN   3031/gmond
tcp   0      0      *:mysql                      *:*               LISTEN   20233/mysqld       <--- local DB for accounting
tcp   0      0      *:34571                      *:*               LISTEN   2715/sge_qmaster
tcp   0      0      *:6444                       *:*               LISTEN   2715/sge_qmaster
tcp   0      0      *:6446                       *:*               LISTEN   2715/sge_qmaster
tcp   0      0      *:sunrpc                     *:*               LISTEN   2326/portmap
tcp   0      0      localhost.localdomain:33714  *:*               LISTEN   3276/java
tcp   0      0      *:948                        *:*               LISTEN   20696/rpc.rquotad
tcp   0      0      *:ssh                        *:*               LISTEN   16448/sshd
tcp   0      0      localhost.lo:x11-ssh-offset  *:*               LISTEN   17412/0
tcp   0      0      localhost.localdomain:6011   *:*               LISTEN   25438/2
tcp   0      0      *:58940                      *:*               LISTEN   -

[root@t3ce02 ~]# netstat -upl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address                Foreign Address   State    PID/Program name
udp   0      0      *:768                        *:*                        20520/rpc.statd
udp   0      0      *:nfs                        *:*                        -
udp   0      0      localhost.locald:syslog      *:*                        26284/syslog-ng
udp   0      0      *:7938                       *:*                        3148/nsrexecd
udp   0      0      *:rtip                       *:*                        20520/rpc.statd
udp   0      0      *:snmp                       *:*                        16151/snmpd
udp   0      0      *:945                        *:*                        20696/rpc.rquotad
udp   0      0      *:959                        *:*                        20711/rpc.mountd   <--- t3ui* mount RO /gridware/sge/default/common
udp   0      0      *:bootpc                     *:*                        2209/dhclient
udp   0      0      *:48608                      *:*                        -
udp   0      0      *:sunrpc                     *:*                        2326/portmap
udp   0      0      t3ce02.psi.ch:ntp            *:*                        15996/ntpd
udp   0      0      localhost.localdomain:ntp    *:*                        15996/ntpd
udp   0      0      *:ntp                        *:*                        15996/ntpd
</pre>

---++ Sun Grid Engine - old doc
I should reorganize this info, as it is still valuable in many respects; please quickly read SGE6dot2u5andARCOMySQLhostedonZFS, but consider it outdated.

---++ Sun Grid Engine
It is installed by RPMs in =/gridware/sge=. <br/>
Consult also the Tier3Policies#Batch_system_policies.

---+++ Sun Grid Engine doesn't take into account the Unix secondary groups !
SGE queue =short.q.validation@t3wn10.psi.ch= accepts only users whose *primary* group is =cms=. During my tests the account =martinelli_f= was a member of =cms=, but NOT as his primary group, which was =ethz-ecal= instead. <br>
[[http://arc.liv.ac.uk/SGE/htmlman/htmlman5/access_list.html][SGE Man page about ACL]]
<pre>
[martinelli_f@t3ui10 QSUB_TESTs]$ qstat -j 3642032
==============================================================
job_number:                 3642032
exec_file:                  job_scripts/3642032
submission_time:            Mon May  6 16:35:48 2013
owner:                      martinelli_f
uid:                        2980
group:                      %BLUE%ethz-ecal%ENDCOLOR%
gid:                        %BLUE%529%ENDCOLOR%
sge_o_home:                 /shome/martinelli_f
sge_o_log_name:             martinelli_f
sge_o_path:                 /bin:/opt/d-cache/srm/bin:/opt/d-cache/dcap/bin:/gridware/sge/bin/lx24-amd64:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/swshare/psit3/bin:/shome/martinelli_f/shellutils:/shome/martinelli_f/bin:/shome/martinelli_f/eclipse-IDE/
sge_o_shell:                /bin/bash
sge_o_workdir:              /shome/martinelli_f/QSUB_TESTs
sge_o_host:                 t3ui10
account:                    sge
cwd:                        /shome/martinelli_f/QSUB_TESTs
mail_list:                  martinelli_f@t3ui10.psi.ch
notify:                     FALSE
job_name:                   hostname.sh
jobshare:                   0
hard_queue_list:            short.q.validation@t3wn10.psi.ch
env_list:
script_file:                hostname.sh
scheduling info:            queue instance "all.q@t3wn35.psi.ch" dropped because it is full
                            queue instance "all.q@t3wn36.psi.ch" dropped because it is full
                            queue instance "all.q@t3wn34.psi.ch" dropped because it is full
                            queue instance "all.q@t3wn32.psi.ch" dropped because it is full
                            ...
                            cannot run in queue "debug.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "short.q" because it is not contained in its hard queue list (-q)
                            %BLUE%has no permission for cluster queue "short.q.validation"%ENDCOLOR%
                            cannot run in queue "all.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "long.q" because it is not contained in its hard queue list (-q)
                            cannot run in queue "all.q.admin" because it is not contained in its hard queue list (-q)
</pre>

---++ Sun Grid Engine MySQL DB - ARCO
Apart from running =qacct= on the CLI, an SGE admin can check the cluster usage by running SELECTs against the ARCO MySQL DB hosted on =t3ce02=; that produces more detailed reports than =qacct=. The ARCO DB gets constantly updated with new rows ( both raw values and values derived from the raw values ) and cleaned of old rows; both operations are performed by the Java daemon =sgedbwriter=. <br/>
Here is the official [[http://docs.oracle.com/cd/E24901_01/doc.62/e21976/chapter2.htm#BGBHBAGE][Oracle Grid Engine documentation]], but consider that we usually consult the ARCO DB in a direct =mysql= session without interacting with the ARCO Web Console, which is very old and slow, so you can safely avoid fully understanding the Web Console logic.

=sgedbwriter= is started as a normal =init= service; *in the remote past* it was found dead many times, as pointed out by https://t3nagios.psi.ch/nagios/cgi-bin/extinfo.cgi?type=2&host=t3ce02&service=SGE+ARCO+file+dbwriter+log , so you may have to restart it:
<pre>
/etc/init.d/sgedbwriter.p6444 start
</pre>
=sgedbwriter= uses the following files:
<pre>
/gridware/sge/dbwriter/lib/mysql-connector-java.jar  <--- to connect to MySQL from Java
/gridware/sge/default/common/reporting               <--- Sun Grid Engine creates and constantly updates this reporting file with new usage info; sgedbwriter analyzes it, fills the ARCO DB accordingly and eventually deletes the reporting file.
/gridware/sge/default/common/dbwriter.conf
/gridware/sge/dbwriter/database/mysql/dbwriter.xml
/gridware/sge/default/spool/dbwriter/dbwriter.log    <--- Nagios constantly checks its freshness to understand whether sgedbwriter is alive or not.
</pre>

---+++ How to run a SQL query
If everything works, you can run a query like:
<pre>
[root@t3ce02 ~]# mysql --defaults-extra-file=/root/arco_read_my.cnf -u arco_read -D sge_arco -h t3ce02 --execute="SELECT date_format(time, '%Y-%m-%d') AS day, sum(completed) AS jobs FROM view_jobs_completed WHERE time > (current_timestamp - interval 1 year) GROUP BY day"
+------------+-------+
| day        | jobs  |
+------------+-------+
| 2012-07-17 |  5501 |
| 2012-07-18 |  1161 |
| 2012-07-19 |  1165 |
| 2012-07-20 |  2848 |
| 2012-07-21 |  1097 |
| 2012-07-22 |   805 |
...
</pre>

---+++ /var/spool/arco/queries
Here you find some default ARCO queries; you just have to extract the SQL part from them:
<pre>
/var/spool/arco/queries/1_Month_CPU_Time_per_day_per_user.xml
/var/spool/arco/queries/1_Month_SUM_Wall_Time_per_User.xml
/var/spool/arco/queries/1_Month_SUM_Wall_time_and_SUM_CPU_Time_per_User.xml
/var/spool/arco/queries/1_day_CPU_User_and_System_usage.xml
/var/spool/arco/queries/24HoursJobs.xml
/var/spool/arco/queries/AR_Attributes.xml
/var/spool/arco/queries/AR_Log.xml
/var/spool/arco/queries/AR_Reserved_Time_Usage.xml
/var/spool/arco/queries/AR_by_User.xml
/var/spool/arco/queries/Accounting_per_AR.xml
/var/spool/arco/queries/Accounting_per_Department.xml
/var/spool/arco/queries/Accounting_per_Project.xml
/var/spool/arco/queries/Accounting_per_User.xml
/var/spool/arco/queries/Average_Job_Turnaround_Time.xml
/var/spool/arco/queries/Average_Job_Wait_Time.xml
/var/spool/arco/queries/Average_job_length_per_user_per_month.xml
/var/spool/arco/queries/DBWriter_Performance.xml
/var/spool/arco/queries/Failed_overlong_jobs_per_user.xml
/var/spool/arco/queries/Host_Load.xml
/var/spool/arco/queries/JOBs_MORE_3GB_RAM_LAST_2_MONTHS.xml
/var/spool/arco/queries/Job_Log.xml
/var/spool/arco/queries/Job_efficiency_per_user.xml
/var/spool/arco/queries/Job_length_histogram.xml
/var/spool/arco/queries/Jobs_per_a_specific_hour_per_users.xml
/var/spool/arco/queries/Jobs_per_hours_per_users.xml
/var/spool/arco/queries/Jobs_shorter_than_1h.xml
/var/spool/arco/queries/Jobs_shorter_that_1h_per_user.xml
/var/spool/arco/queries/Number_of_Jobs_Completed_per_AR.xml
/var/spool/arco/queries/Number_of_Jobs_completed.xml
/var/spool/arco/queries/Queue_Consumables.xml
/var/spool/arco/queries/Statistic_History.xml
/var/spool/arco/queries/Statistics.xml
/var/spool/arco/queries/Wallclock_time.xml
/var/spool/arco/queries/average2.xml
/var/spool/arco/queries/cumul_walltime_vs_job_walltime.xml
</pre>

---+++ RAM usage during the last 6 months
%TWISTY%<pre>
mysql> select username, RAM_RANGE, count(*) as JOBs
       from ( SELECT username,
                     CASE
                       WHEN maxvmem > 0          and maxvmem <= 1000000000 THEN '0GB-1GB'
                       WHEN maxvmem > 1000000000 and maxvmem <= 2000000000 THEN '1GB-2GB'
                       WHEN maxvmem > 2000000000 and maxvmem <= 3000000000 THEN '2GB-3GB'
                       ELSE '>3GB'
                     END as RAM_RANGE
              from view_accounting
              where exit_status=0
                and submission_time > (current_timestamp - interval 6 month)
            ) as job_summaries
       GROUP BY username, RAM_RANGE ;
+--------------+-----------+--------+
| username     | RAM_RANGE | JOBs   |
+--------------+-----------+--------+
| aspiezia     | 0GB-1GB   |      2 |
| aspiezia     | 1GB-2GB   |     19 |
| bianchi      | 0GB-1GB   |     32 |
| bianchi      | 1GB-2GB   |    665 |
| bianchi      | 2GB-3GB   |   4348 |
| bianchi      | >3GB      |    327 |
| casal        | 0GB-1GB   |    195 |
| casal        | 1GB-2GB   |   1547 |
| casal        | 2GB-3GB   |   1215 |
| casal        | >3GB      |     94 |
| cgalloni     | 0GB-1GB   |  96945 |
| cgalloni     | 1GB-2GB   |  12072 |
| cgalloni     | 2GB-3GB   |   5263 |
| cgalloni     | >3GB      |   1827 |
| cheidegg     | 0GB-1GB   |    929 |
| cheidegg     | 1GB-2GB   |     17 |
| cheidegg     | 2GB-3GB   |     16 |
| cheidegg     | >3GB      |      2 |
| clange       | 0GB-1GB   |  72903 |
| clange       | 1GB-2GB   |   2726 |
| clange       | 2GB-3GB   | 192761 |
| clange       | >3GB      |   1222 |
| cmssgm       | 0GB-1GB   |      3 |
| dmeister     | 0GB-1GB   |      2 |
| dsalerno     | 0GB-1GB   |    308 |
| dsalerno     | 1GB-2GB   |   2058 |
| dsalerno     | 2GB-3GB   |   1235 |
| dsalerno     | >3GB      |     70 |
| gaperrin     | 0GB-1GB   |    829 |
| gaperrin     | 1GB-2GB   |    132 |
| gaperrin     | 2GB-3GB   |     54 |
| gaperrin     | >3GB      |   1263 |
| grauco       | 0GB-1GB   |      2 |
| grauco       | 1GB-2GB   |     15 |
| grauco       | 2GB-3GB   |      1 |
| grauco       | >3GB      |      1 |
| gregor       | 0GB-1GB   |    667 |
| gregor       | 1GB-2GB   |   3051 |
| hinzmann     | 0GB-1GB   |    919 |
| hinzmann     | 1GB-2GB   |   2204 |
| hinzmann     | 2GB-3GB   |     38 |
| hinzmann     | >3GB      |   1141 |
| jhoss        | 0GB-1GB   |   3274 |
| jngadiub     | 0GB-1GB   |  51809 |
| jngadiub     | 1GB-2GB   |  10967 |
| jngadiub     | 2GB-3GB   |   1995 |
| jngadiub     | >3GB      |   1549 |
| jpata        | 0GB-1GB   |  20958 |
| jpata        | 1GB-2GB   |    109 |
| jpata        | 2GB-3GB   |   4781 |
| jpata        | >3GB      |    242 |
| kotlinski    | 0GB-1GB   |     42 |
| kotlinski    | 1GB-2GB   |     63 |
| kotlinski    | 2GB-3GB   |    175 |
| kotlinski    | >3GB      |     25 |
| leac         | 0GB-1GB   |    769 |
| leac         | 1GB-2GB   |   1211 |
| leac         | 2GB-3GB   |   5107 |
| leac         | >3GB      |    486 |
| martinelli_f | 0GB-1GB   |  81205 |
| martinelli_f | >3GB      |    264 |
| micheli      | 1GB-2GB   |     18 |
| mmasciov     | 0GB-1GB   |  12352 |
| mmasciov     | 1GB-2GB   |  12845 |
| mmasciov     | 2GB-3GB   |   7299 |
| mmasciov     | >3GB      |   2646 |
| mquittna     | 0GB-1GB   |   1193 |
| mquittna     | 1GB-2GB   |   1072 |
| mschoene     | 0GB-1GB   |   3943 |
| mschoene     | 1GB-2GB   |    757 |
| mschoene     | >3GB      |   1273 |
| musella      | 0GB-1GB   |    307 |
| musella      | 1GB-2GB   |    376 |
| musella      | >3GB      |      5 |
| mwang        | 0GB-1GB   |   3493 |
| mwang        | 1GB-2GB   |     27 |
| mwang        | 2GB-3GB   |     51 |
| mwang        | >3GB      |    497 |
| nchernya     | 0GB-1GB   |   7888 |
| nchernya     | 1GB-2GB   |     19 |
| pandolf      | 0GB-1GB   |    285 |
| pandolf      | >3GB      |      4 |
| perrozzi     | 0GB-1GB   |    399 |
| perrozzi     | 1GB-2GB   |      4 |
| perrozzi     | 2GB-3GB   |     16 |
| perrozzi     | >3GB      |    217 |
| thaarres     | 0GB-1GB   |  43538 |
| thaarres     | 1GB-2GB   |   3710 |
| thaarres     | 2GB-3GB   |      5 |
| thaarres     | >3GB      |    644 |
| tklijnsm     | 0GB-1GB   |    873 |
| tklijnsm     | 1GB-2GB   |   3907 |
| tklijnsm     | 2GB-3GB   |  87575 |
| tklijnsm     | >3GB      |   1562 |
| ursl         | 0GB-1GB   |   1012 |
| ursl         | 1GB-2GB   |   2663 |
| ursl         | 2GB-3GB   |  16513 |
| ursl         | >3GB      |   2536 |
| vlambert     | 1GB-2GB   |     20 |
| vlambert     | 2GB-3GB   |     20 |
| wiederkehr_s | 0GB-1GB   |     24 |
| wiederkehr_s | 1GB-2GB   |     46 |
| wiederkehr_s | 2GB-3GB   |    241 |
| wiederkehr_s | >3GB      |    173 |
| yangyong     | 0GB-1GB   |    877 |
| yangyong     | >3GB      |      1 |
+--------------+-----------+--------+
</pre>%ENDTWISTY%

---+++ /gridware/sge/default/common/reporting
To make Sun Grid Engine generate this file you need to turn on the =reporting=true= setting:
<pre>
[root@t3ce02 ~]# qconf -sconf | grep reporting_params
reporting_params             accounting=true reporting=true \
</pre>

---+++ /gridware/sge/default/common/reporting.not.deleted.by.dbwriter
By default, and regrettably we can't change it, =sgedbwriter= deletes =/gridware/sge/default/common/reporting= once it has been processed; to save a copy for the future we run a permanent =tail= left in the background, started during the initial =init= sequence:
<pre>
# ll /gridware/sge/default/common/reporting*
-rw-r--r-- 1 root root     8337 Jul 16 14:59 /gridware/sge/default/common/reporting
-rw-r--r-- 1 root root 25221477 Jul  4  2011 /gridware/sge/default/common/reporting.4-Jul-2001_15:42
lrwxrwxrwx 1 root root       42 Apr 24 18:34 /gridware/sge/default/common/reporting.not.deleted.by.dbwriter -> /mnt/sdb/reporting.not.deleted.by.dbwriter

# cat /etc/rc.local    <--- last commands executed during the initial init sequence
#!/bin/sh
# Puppet Managed File
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.
touch /var/lock/subsys/local

#http://yoshinorimatsunobu.blogspot.com/2009/04/linux-io-scheduler-queue-size-and.html
echo 100000 > /sys/block/sdb/queue/nr_requests
echo deadline > /sys/block/sdb/queue/scheduler

# by martinelli to start Sun Web Console + SGE ARCO
/usr/sbin/smcwebserver stop
/usr/sbin/smcwebserver start

# 2 May 2013 - F.Martinelli
# needed by VMWare I/O path failover,
# if you add an other disk then add an other line here
echo 180 > /sys/block/sda/device/timeout
echo 180 > /sys/block/sdb/device/timeout

nohup tail --pid=$(pidof sge_qmaster) -n 0 -F /gridware/sge/default/common/accounting >> /gridware/sge/default/common/accounting.not.deleted.by.logrotate &
nohup tail --pid=$(pidof sge_qmaster) -n 0 -F /gridware/sge/default/common/reporting >> /gridware/sge/default/common/reporting.not.deleted.by.dbwriter &
</pre>

---+++ /gridware/sge/default/common/accounting.not.deleted.by.logrotate
See the previous section.

---+++ /gridware/sge/default/common/dbwriter.conf
<pre>
DBWRITER_USER_PW=:)
DBWRITER_USER=arco_write
READ_USER=arco_read
READ_USER_PW=
DBWRITER_URL=jdbc:mysql://localhost:3306/sge_arco
DB_SCHEMA=n/a
TABLESPACE=n/a
TABLESPACE_INDEX=n/a
DBWRITER_CONTINOUS=true
DBWRITER_INTERVAL=180
DBWRITER_DRIVER=com.mysql.jdbc.Driver
DBWRITER_REPORTING_FILE=/gridware/sge/default/common/reporting
DBWRITER_CALCULATION_FILE=/gridware/sge/dbwriter/database/mysql/dbwriter.xml
DBWRITER_SQL_THRESHOLD=3
SPOOL_DIR=/gridware/sge/default/spool/dbwriter
DBWRITER_DEBUG=INFO
</pre>

---+++ /gridware/sge/dbwriter/database/mysql/dbwriter.xml
Excerpt of the calculation rules defined in this file:
<pre>
..
average queue utilization per hour
  Not really correct value, as each entry for slot usage is weighted equally.
  It would be necessary to have time_start and time_end per value and weight the values by time.
...
number of jobs finished per host
...
number of jobs finished per user
...
number of jobs finished per project
...
build daily values from hourly ones
...
=========== Statistic Rules ========================================== -->
SELECT sge_host, sge_queue, sge_user, sge_group, sge_project, sge_department,
       sge_host_values, sge_queue_values, sge_user_values, sge_group_values,
       sge_project_values, sge_department_values, sge_job, sge_job_log,
       sge_job_request, sge_job_usage, sge_statistic, sge_statistic_values,
       sge_share_log, sge_ar, sge_ar_attribute, sge_ar_usage, sge_ar_log,
       sge_ar_resource_usage
FROM (SELECT count(*) AS sge_host FROM sge_host) AS c_host,
     (SELECT count(*) AS sge_queue FROM sge_queue) AS c_queue,
     (SELECT count(*) AS sge_user FROM sge_user) AS c_user,
     (SELECT count(*) AS sge_group FROM sge_group) AS c_group,
     (SELECT count(*) AS sge_project FROM sge_project) AS c_project,
     (SELECT count(*) AS sge_department FROM sge_department) AS c_department,
     (SELECT count(*) AS sge_host_values FROM sge_host_values) AS c_host_values,
     (SELECT count(*) AS sge_queue_values FROM sge_queue_values) AS c_queue_values,
     (SELECT count(*) AS sge_user_values FROM sge_user_values) AS c_user_values,
     (SELECT count(*) AS sge_group_values FROM sge_group_values) AS c_group_values,
     (SELECT count(*) AS sge_project_values FROM sge_project_values) AS c_project_values,
     (SELECT count(*) AS sge_department_values FROM sge_department_values) AS c_department_values,
     (SELECT count(*) AS sge_job FROM sge_job) AS c_job,
     (SELECT count(*) AS sge_job_log FROM sge_job_log) AS c_job_log,
     (SELECT count(*) AS sge_job_request FROM sge_job_request) AS c_job_request,
     (SELECT count(*) AS sge_job_usage FROM sge_job_usage) AS c_job_usage,
     (SELECT count(*) AS sge_share_log FROM sge_share_log) AS c_share_log,
     (SELECT count(*) AS sge_statistic FROM sge_statistic) AS c_sge_statistic,
     (SELECT count(*) AS sge_statistic_values FROM sge_statistic_values) AS c_sge_statistic_values,
     (SELECT count(*) AS sge_ar FROM sge_ar) AS c_sge_ar,
     (SELECT count(*) AS sge_ar_attribute FROM sge_ar_attribute) AS c_sge_ar_attribute,
     (SELECT count(*) AS sge_ar_usage FROM sge_ar_usage) AS c_sge_ar_usage,
     (SELECT count(*) AS sge_ar_log FROM sge_ar_log) AS c_sge_ar_log,
     (SELECT count(*) AS sge_ar_resource_usage FROM sge_ar) AS c_sge_ar_resource_usage

=========== Deletion Rules ========================================== -->
keep host raw values only 7 days
...
</pre>

---+ Backups
OS snapshots are taken nightly by the PSI VMWare team ( e.g. Peter Huesser ); additionally we have LinuxBackupsByLegato to recover single files. Also:
<pre>
[root@t3ce02 gridware]# %BLUE%/gridware/sge/util/upgrade_modules/save_sge_config.sh /gridware/sge_backup
Configuration successfully saved to /gridware/sge_backup directory.%ENDCOLOR%
</pre>
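If dated copies of the saved SGE configuration are kept around, old ones eventually need pruning. A minimal sketch of a retention helper we might wrap around the =save_sge_config.sh= call above; the =prune_backups= name and the =sge_config.YYYY-MM-DD= naming scheme are our own assumptions, and =head -n -N= is GNU coreutils. The demo below runs against a throwaway directory tree rather than =/gridware/sge_backup=:

```shell
# Hypothetical retention helper: keep only the newest N dated backup
# directories under a base path. Lexicographic sort works because the
# assumed naming scheme is sge_config.YYYY-MM-DD.
prune_backups() {
  base=$1; keep=$2
  ls -1d "$base"/sge_config.* 2>/dev/null | sort | head -n -"$keep" |
    while read -r d; do rm -rf "$d"; done   # drop all but the newest $keep
}

# Demo against a throwaway directory tree:
base=$(mktemp -d)
for day in 2016-09-25 2016-09-26 2016-09-27 2016-09-28; do
  mkdir "$base/sge_config.$day"
done
prune_backups "$base" 2
ls -1 "$base"
```

After the demo only the two newest directories (2016-09-27 and 2016-09-28) remain.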
---+ NodeTypeForm
| *Hostnames* | t3ce02 |
| *Services* | Sun Grid Engine 6.2u5 |
| *Hardware* | PSI DMZ VMWare cluster |
| *Install Profile* | t3ce |
| *Guarantee/maintenance until* | VM |
Topic revision: r17 - 2016-09-28 - FabioMartinelli