Firewall requirements
Emergency Measures
Too many frequent or heavy writes
You can identify which user files are opened in write mode as follows:
[root@t3admin01 ~]# salt 't3*' cmd.run " lsof -w -N | grep shome | grep REG | egrep ' [0-9]*u | [0-9]*w '| awk '{ print \$9}' | xargs -I {} -i bash -c 'ls -lh {}' "
t3bdii02.psi.ch:
t3ldap02.psi.ch:
t3frontier01.psi.ch:
t3ldap01:
t3wn42.psi.ch:
t3ce01.psi.ch:
t3wn18.psi.ch:
t3wn28.psi.ch:
t3wn38.psi.ch:
t3wn40.psi.ch:
t3wn12.psi.ch:
t3wn23.psi.ch:
t3wn26.psi.ch:
t3wn17.psi.ch:
t3wn19.psi.ch:
t3fs14.psi.ch:
t3bdii01.psi.ch:
t3nagios.psi.ch:
t3wn31.psi.ch:
t3wn21.psi.ch:
t3ui17.psi.ch:
-rw-r--r-- 1 mquittna ethz-ecal 16K Jul 8 09:18 /shome/mquittna/CMSSW/EXO_7_4_0_pre9/src/diphotons/Analysis/macros/.combine_maker.py.swp
-rw-r--r-- 1 gaperrin ethz-susy 20K Jul 8 13:31 /shome/gaperrin/tnp_gael/SSDLBkgEstimationTP/TandP/.FitInvMassBkg.C.swp
-rw-r--r-- 1 gaperrin ethz-susy 12K Jul 8 13:34 /shome/gaperrin/tnp_gael/SSDLBkgEstimationTP/TandP/.start_job2.sh.swp
-rw-r--r-- 1 gaperrin ethz-susy 12K Jul 8 11:43 /shome/gaperrin/tnp_gael/SSDLBkgEstimationTP/TandP/.job2.sh.swp
-rw-r--r-- 1 gaperrin ethz-susy 48K Jul 8 13:50 /shome/gaperrin/tnp_gael/SSDLBkgEstimationTP/TandP/.DrawInvMassBkg_combi.cc.swp
-rw-r--r-- 1 gaperrin ethz-susy 52K Jul 8 13:45 /shome/gaperrin/tnp_gael/SSDLBkgEstimationTP/TandP/.MC_Ratio.C.swp
-rw-r--r-- 1 gaperrin ethz-susy 48K Jul 8 13:47 /shome/gaperrin/tnp_gael/SSDLBkgEstimationTP/TandP/.TandP.C.swp
-rw-r--r-- 1 gaperrin ethz-susy 36K Jul 8 13:50 /shome/gaperrin/tnp_gael/SSDLBkgEstimationTP/TandP/.CompareMCvsTandP.cc.swp
-rw-r--r-- 1 gaperrin ethz-susy 12K Jul 8 13:52 /shome/gaperrin/tnp_gael/SSDLBkgEstimationTP/TandP/.start_job.sh.swp
-rw-r--r-- 1 mdunser ethz-susy 88K May 5 10:33 /shome/mdunser/FakeLeptonFW/macros/.closure.py.swo
ls: cannot access /shome/bianchi/TTH-72X-heppy/CMSSW/src/TTH/MEIntegratorStandalone/test/validate_^W7^A: No such file or directory
-rw-r--r-- 1 mdunser ethz-susy 102K May 5 16:21 /shome/mdunser/.ipython/profile_default/history.sqlite
-rw-r--r-- 1 mdunser ethz-susy 102K May 5 16:21 /shome/mdunser/.ipython/profile_default/history.sqlite
t3wn27.psi.ch:
t3wn22.psi.ch:
t3wn16.psi.ch:
t3wn11.psi.ch:
t3wn10.psi.ch:
t3wn33.psi.ch:
t3ui05.psi.ch:
-rw-r--r-- 1 casal ethz-susy 16K May 18 15:22 /shome/casal/CMSSW/sms_prod/CMSSW_5_3_7_patch5/src/MT2analysis/Code/MT2AnalysisCode/RootMacros/.treeConversion.py.swp
t3wn32.psi.ch:
t3wn34.psi.ch:
t3service01:
t3ce02.psi.ch:
t3vmui01.psi.ch:
t3wn20.psi.ch:
t3cmsvobox01.psi.ch:
t3wn39.psi.ch:
t3wn43.psi.ch:
t3wn35.psi.ch:
t3fs13.psi.ch:
t3ui19.psi.ch:
t3ui12.psi.ch:
-rw-rw-r-- 1 jngadiub uniz-higgs 0 Jul 8 14:05 /shome/jngadiub/EXOVVAnalysisRunII/CMSSW_7_4_3/tmp/slc6_amd64_gcc491/src/CalibMuon/DTCalibration/plugins/CalibMuonDTCalibrationPlugins/SealModule.o
-rw-rw-r-- 1 jngadiub uniz-higgs 0 Jul 8 14:05 /shome/jngadiub/EXOVVAnalysisRunII/CMSSW_7_4_3/tmp/slc6_amd64_gcc491/src/TrackingTools/TrackAssociator/test/testTrackingToolsTrackAssociator/TestTrackAssociator.o
-rw-rw-r-- 1 jngadiub uniz-higgs 0 Jul 8 14:05 /shome/jngadiub/EXOVVAnalysisRunII/CMSSW_7_4_3/tmp/slc6_amd64_gcc491/src/TrackingTools/TrackAssociator/test/testCaloMatchingExample/CaloMatchingExample.o
-rw-rw-r-- 1 jngadiub uniz-higgs 0 Jul 8 14:05 /shome/jngadiub/EXOVVAnalysisRunII/CMSSW_7_4_3/tmp/slc6_amd64_gcc491/src/TrackingTools/TrackAssociator/plugins/TrackingToolsTrackAssociatorPlugins/DetIdAssociatorESProducer.o
-rw-rw-r-- 1 jngadiub uniz-higgs 0 Jul 8 14:05 /shome/jngadiub/EXOVVAnalysisRunII/CMSSW_7_4_3/tmp/slc6_amd64_gcc491/src/TrackingTools/TrackAssociator/plugins/TrackingToolsTrackAssociatorPlugins/MuonDetIdAssociator.o
-rw-rw-r-- 1 jngadiub uniz-higgs 0 Jul 8 14:05 /shome/jngadiub/EXOVVAnalysisRunII/CMSSW_7_4_3/tmp/slc6_amd64_gcc491/src/TrackingTools/TrackAssociator/plugins/TrackingToolsTrackAssociatorPlugins/modules.o
-rw-rw-r-- 1 jngadiub uniz-higgs 0 Jul 8 14:05 /shome/jngadiub/EXOVVAnalysisRunII/CMSSW_7_4_3/tmp/slc6_amd64_gcc491/src/SimTracker/TrackerHitAssociation/plugins/SimTrackerTrackerHitAssociationPlugins/ClusterTPAssociationProducer.o
-rw-rw-r-- 1 jngadiub uniz-higgs 0 Jul 8 14:05 /shome/jngadiub/EXOVVAnalysisRunII/CMSSW_7_4_3/tmp/slc6_amd64_gcc491/src/SimTracker/VertexAssociatorESProducer/src/SimTrackerVertexAssociatorESProducer/SealModules.o
-rw-rw-r-- 1 jngadiub uniz-higgs 0 Jul 8 14:05 /shome/jngadiub/EXOVVAnalysisRunII/CMSSW_7_4_3/tmp/slc6_amd64_gcc491/src/SimTracker/VertexAssociatorESProducer/src/SimTrackerVertexAssociatorESProducer/VertexAssociatorByTracksESProducer.o
-rw-r--r-- 1 jpata ethz-higgs 0 Jul 8 13:40 /shome/jpata/TTH-72X-heppy-dev/CMSSW/src/VHbbAnalysis/Heppy/test/test/log.txt
-rw-r--r-- 1 jpata ethz-higgs 228 Jul 8 13:40 /shome/jpata/TTH-72X-heppy-dev/CMSSW/src/VHbbAnalysis/Heppy/test/test/pileup.root
-rw-r--r-- 1 jpata ethz-higgs 14M Jul 8 14:05 /shome/jpata/TTH-72X-heppy-dev/CMSSW/src/VHbbAnalysis/Heppy/test/test/tree.root
t3wn13.psi.ch:
t3wn24.psi.ch:
t3mon01:
t3wn37.psi.ch:
t3wn41.psi.ch:
t3ui18.psi.ch:
t3wn15.psi.ch:
t3se01.psi.ch:
t3wn29.psi.ch:
t3wn25.psi.ch:
t3wn30.psi.ch:
t3wn44.psi.ch:
t3wn50.psi.ch:
t3ui16.psi.ch:
t3ui15.psi.ch:
t3wn14.psi.ch:
t3wn36.psi.ch:
t3dcachedb03.psi.ch:
RPC program nfs version 3 tcp is not running
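Note the RPC error reported by t3dcachedb03 above: the node could not reach an NFS v3 service over TCP. A quick way to probe whether a file server still answers NFS RPCs is rpcinfo (standard syntax; t3fs06 as the target host is just an example):
# probe the NFS v3 service over TCP on the file server
rpcinfo -t t3fs06 nfs 3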
In Nov 2014 we hit the case documented in CMSTier3Log67:
- Check Nagios.
- If t3fs06 fails, the t3ui1* and t3wn* servers that mount t3fs06:/shome will be affected immediately. If you cannot quickly recover t3fs06:/shome (e.g. due to a failed motherboard), you will have to umount /shome from those servers and mount t3fs05:/shome2 instead, which is supposed to be an identical copy of t3fs06:/shome; you will probably also need to create a symbolic link /shome2 -> /shome (see the sketch after this list).
- On t3fs05, stop the cron job that rsyncs /swshare to t3fs06.
- Tweak t3nagios to forget about t3fs06.
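A minimal sketch of the remount step, assuming the Salt setup shown above and that t3fs05 exports /shome2 over NFS; the hostnames, export options and compound target are assumptions to verify before running anything like this:
# HYPOTHETICAL failover sketch, not a tested procedure.
# Force-unmount the dead t3fs06:/shome on all UIs and WNs:
salt -C 't3ui* or t3wn*' cmd.run 'umount -f -l /shome'
# Mount the backup copy exported by t3fs05:
salt -C 't3ui* or t3wn*' cmd.run 'mkdir -p /shome2 && mount -t nfs t3fs05:/shome2 /shome2'
# Keep the old /shome paths valid through a symbolic link:
salt -C 't3ui* or t3wn*' cmd.run 'rmdir /shome && ln -s /shome2 /shome'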
Regular Maintenance work
Nagios
Check Nagios.
Installation
Root's crontab on the file server (crontab -l root):
#ident "@(#)root 1.21 04/03/23 SMI"
#
# The root crontab should be used to perform accounting data collection.
#
#
10 3 * * * /usr/sbin/logadm
15 3 * * 0 /usr/lib/fs/nfs/nfsfind
30 3 * * * [ -x /usr/lib/gss/gsscred_clean ] && /usr/lib/gss/gsscred_clean
#
# The rtc command is run to adjust the real time clock if and when
# daylight savings time changes.
#
1 2 * * * [ -x /usr/sbin/rtc ] && /usr/sbin/rtc -c > /dev/null 2>&1
#
# create regular snapshots of the shome file system
#
#20 00 * * * /root/psit3-tools/regular-snapshot-new -f shome -v -s t3fs05 -r shome2/shomebup 2>&1 | /usr/bin/tee /var/cron/lastsnap.txt 2>&1 ; [[ $? -ne 0 ]] && /usr/bin/mail cms-tier3@lists.psi.ch < /var/cron/lastsnap.txt
#
# Added by cswcrontab for CSWlogwatch
02 4 * * * /opt/csw/bin/logwatch
#
# for ganglia monitoring of shome space
53 * * * * /root/gmetric/gmetric_partition_space-cron.sh
#
# for detailed local monitoring of user space
44 01 * * * /shome/monuser/shome-du.cron.sh
#
43 3 * * * [ -x /opt/csw/bin/gupdatedb ] && /opt/csw/bin/gupdatedb --prunepaths="/shome /dev /devices /proc /tmp /var/tmp" 1>/dev/null 2>&1 # Added by CSWfindutils
# 09/03/2015 - F.Martinelli
22 03 * * * /opt/zfsnap/zfssnap -v shome && /opt/csw/bin/rsync --progress -v --delete -a -e "ssh -c arcfour" /shome/ t3fs05:/shome2 2>&1 | /usr/bin/tee /var/cron/zfssnap.shome.log 2>&1
Shared File Systems on ZFS - OLD
- /shome : two 9-disk raidz2 sets are used for shome
- /vmshare : a raidz2 set plus spares; hosts some of the older VMs
- spare disks (see the disk map below; a zpool layout sketch follows it)
---------------------SunFireX4500------Rear----------------------------
36: 37: 38: 39: 40: 41: 42: 43: 44: 45: 46: 47:
c6t3 c6t7 c5t3 c5t7 c8t3 c8t7 c7t3 c7t7 c1t3 c1t7 c0t3 c0t7
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35:
c6t2 c6t6 c5t2 c5t6 c8t2 c8t6 c7t2 c7t6 c1t2 c1t6 c0t2 c0t6
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23:
c6t1 c6t5 c5t1 c5t5 c8t1 c8t5 c7t1 c7t5 c1t1 c1t5 c0t1 c0t5
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
0: 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11:
c6t0 c6t4 c5t0 c5t4 c8t0 c8t4 c7t0 c7t4 c1t0 c1t4 c0t0 c0t4
^b+ ^b+ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
-------*-----------*-SunFireX4500--*---Front-----*-----------*----------
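For reference, a pool with two 9-disk raidz2 vdevs plus hot spares could be assembled roughly as below; the device names are purely illustrative and do not reflect the actual t3fs06 cabling shown in the map above:
# illustrative only: two 9-disk raidz2 vdevs plus two hot spares
zpool create shome \
    raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c1t0d0 \
    raidz2 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c5t0d0 c5t1d0 \
    spare c5t2d0 c5t3d0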
User quotas
After upgrading the ZFS version, it was necessary to initialise the accounting information. This can take quite some time...
zfs userspace shome
User quotas can be set and viewed in the following way (either the user name or the numeric uid can be used):
zfs set userquota@3896=500G shome
zfs get userquota@3896 shome
The current usage of all users can be seen with:
zfs userspace shome
zfs userspace -p -s used shome # exact value and sorted
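Since -p prints exact byte counts, a quick over-quota report can be derived from the same command; a small sketch (the -H and -o options are standard zfs userspace flags, the awk filter is our own addition):
# list users whose used bytes exceed their quota (quota column is non-numeric when unset)
zfs userspace -pH -o name,used,quota shome | awk '$3 ~ /^[0-9]+$/ && $2+0 > $3+0 {print $1, $2, $3}'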
zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/s10x_u8wos_08a@2011-Feb-18_11-51 46.3M - 3.70G -
rpool/ROOT/s10x_u8wos_08a@python-20110303 49.7M - 3.77G -
rpool/ROOT/s10x_u8wos_08a@31-May-2012 142M - 4.11G -
rpool/ROOT/s10x_u8wos_08a@09-Apr-2013 118M - 5.35G -
rpool/ROOT/s10x_u8wos_08a@05-Jun-2013 133M - 5.43G -
rpool/ROOT/s10x_u8wos_08a@28-Nov-2013 121M - 5.53G -
rpool/ROOT/s10x_u8wos_08a@21-03-2014 131M - 5.55G -
rpool/ROOT/s10x_u8wos_08a@24-Jun-2014 144M - 5.62G -
rpool/ROOT/s10x_u8wos_08a@11-Sep-2014 155M - 5.64G -
rpool/ROOT/s10x_u8wos_08a@20-01-2015 163M - 5.66G -
rpool/ROOT/s10x_u8wos_08a@30-01-2015 164M - 5.66G -
rpool/ROOT/s10x_u8wos_08a@06-03-2015 165M - 5.66G -
rpool/ROOT/s10x_u8wos_08a@03-06-2015 0 - 5.67G -
shome@zfssnap_2015-05-25_03.22.00--10d 3.08G - 4.98T - <-- /opt/zfsnap/zfssnap -v shome && /opt/csw/bin/rsync --progress -v --delete -a -e "ssh -c arcfour" /shome/ t3fs05:/shome2
shome@zfssnap_2015-05-26_03.22.00--10d 3.12G - 4.98T -
shome@zfssnap_2015-05-27_03.22.00--10d 6.70G - 4.94T -
shome@zfssnap_2015-05-28_03.22.00--10d 6.00G - 4.95T -
shome@zfssnap_2015-05-29_03.22.00--10d 4.08G - 4.95T -
shome@zfssnap_2015-05-30_03.22.00--10d 2.87G - 4.94T -
shome@zfssnap_2015-05-31_03.22.00--10d 2.87G - 4.94T -
shome@zfssnap_2015-06-01_03.22.00--10d 6.48G - 4.94T -
shome@zfssnap_2015-06-02_03.22.00--10d 3.06G - 4.96T -
shome@zfssnap_2015-06-03_03.22.00--10d 3.22G - 4.95T -
swshare2/swsharebup@auto2015-05-27_06:00:00 2.36G - 560G -
swshare2/swsharebup@auto2015-05-28_06:00:01 1.07M - 560G -
swshare2/swsharebup@auto2015-05-29_06:00:00 950K - 559G -
swshare2/swsharebup@auto2015-05-30_06:00:00 1.47M - 559G -
swshare2/swsharebup@auto2015-05-31_06:00:00 1.36M - 559G -
swshare2/swsharebup@auto2015-06-01_06:00:00 725K - 560G -
swshare2/swsharebup@auto2015-06-02_06:00:00 1.00M - 559G -
swshare2/swsharebup@auto2015-06-03_06:00:00 0 - 560G -
daily snapshots and backup
A script, /root/psit3-tools/regular-snapshot, is called from root's crontab to take a daily snapshot of the shome ZFS file system and to transfer it incrementally to t3fs05; it also deletes the older snapshots. Users can retrieve files from these snapshots by themselves, as documented in HowToRetrieveBackupFiles. Look also at the tests for doing incremental snapshot transfers in CMSTier3Log12.
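Conceptually, an incremental transfer of this kind boils down to a zfs send/receive pair; a sketch using two snapshot names from the listing above (this is not the actual regular-snapshot script):
# incremental send of the newest daily snapshot to the backup pool on t3fs05
PREV=shome@zfssnap_2015-06-02_03.22.00--10d
CURR=shome@zfssnap_2015-06-03_03.22.00--10d
zfs send -i "$PREV" "$CURR" | ssh t3fs05 zfs receive -F shome2/shomebup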
ZFS Backup server on t3fs05 - OLD
- shome2 : Backup of shome area
- vmshare : Backup of virtual machine area (for the older vmware-server based machines)
- swshare : cluster's shared software space (e.g. experiment SW)
- spare disks
---------------------SunFireX4500------Rear----------------------------
36: 37: 38: 39: 40: 41: 42: 43: 44: 45: 46: 47:
c6t3 c6t7 c5t3 c5t7 c8t3 c8t7 c7t3 c7t7 c1t3 c1t7 c0t3 c0t7
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35:
c6t2 c6t6 c5t2 c5t6 c8t2 c8t6 c7t2 c7t6 c1t2 c1t6 c0t2 c0t6
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23:
c6t1 c6t5 c5t1 c5t5 c8t1 c8t5 c7t1 c7t5 c1t1 c1t5 c0t1 c0t5
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
0: 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11:
c6t0 c6t4 c5t0 c5t4 c8t0 c8t4 c7t0 c7t4 c1t0 c1t4 c0t0 c0t4
^b+ ^b+ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
-------*-----------*-SunFireX4500--*---Front-----*-----------*----------