Swiss Grid Operations Meeting on 2013-05-02
Agenda
Status
- CSCS (report Pablo, Miguel & George):
- Moved cream02 to new IBM hardware and upgraded it to SL 6.3 and CREAM-CE UMD-2.
- Comment: upgrade of cream01 is planned right before the maintenance (Monday 13 May).
- Prepared dCache 1.9.12 migration to 2.2.10 on preproduction.
- Comment: the process seems straightforward, but we have to be careful since it also involves moving the head nodes to new hardware and SL 6.4.
- 'Fixed' issues with WNs not being reinstalled and moved installation procedure to SL 5.9.
- Comment: migration to SL 6 on WNs is planned for July, one month after ATLAS expected upgrade calendar.
- Installed new VM (atlas01) to replace atlasvobox and requested certificate.
- Question: When could this machine be ready? Gianfranco: hope for 2 weeks from now
- Increased the specs of the future replacement for cmsvobox (cms01) in order to cope with the service's memory leak.
- Question: When could cmsvobox be shut down?
- Maintenance scheduled for May 15: https://wiki.chipp.ch/twiki/bin/view/LCGTier2/SiteMaintenance20130515
- PSI (reports Fabio):
- Once again, a reminder that ZFS is now available on SL6; I will certainly try it.
- PSI is migrating its VMware cluster to new HW/ESX. Several of our VMs ended up with a read-only filesystem or completely stuck. This was due to a mistake by the local admins, but I had also not increased the SCSI timeout. So even if your VM is not VMware-based, it is a good idea to run:
echo 180 > /sys/block/sda/device/timeout
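A minimal sketch of applying the same timeout to every SCSI disk instead of only sda. The function name set_scsi_timeout and its optional sysfs-root argument are hypothetical, added only so the loop can be tried on a scratch directory. Values written to sysfs do not survive a reboot, so the call would have to be repeated at boot time.

```shell
# set_scsi_timeout SECONDS [SYSFS_ROOT]
# Writes SECONDS into the timeout file of every sd* block device.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
set_scsi_timeout() {
    timeout="$1"
    root="${2:-/sys}"
    for f in "$root"/block/sd*/device/timeout; do
        # Skip the unexpanded glob when no sd* device exists.
        [ -e "$f" ] || continue
        echo "$timeout" > "$f"
    done
}

# Example (as root): set_scsi_timeout 180
```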
- Implementing a "soft" dCache 2.2 quota system based on GIDs; we are:
- Using LDAP PosixGroups (defined in /etc/openldap/schema/nis.schema) to partition our 100 CMS users into ~10 new primary groups (a user belongs to just 1 group)
- Because of this partitioning, a user logged in to a UI where /pnfs is mounted will see dirs like:
ls -l /pnfs/psi.ch/cms/trivcat/store/user
drwxrwxr-x  3 cmsuser  cms        512 Jun 15  2012 acaudron
drwxr-xr-x  2 alschmid uniz-bphys 512 Feb 21 11:04 alschmid
drwxr-xr-x  2 amarini  ethz-ewk   512 Jan 24 15:53 amarini
drwxr-xr-x 18 andis    ethz-bphys 512 Jan  5  2010 andis
drwxr-xr-x 36 arizzi   ethz-bphys 512 Aug  3  2011 arizzi
- By setting the file/dir permissions so that only their own group can write, users can protect their group's files.
- With these new primary groups it is easy to do /pnfs accounting by GID:
chimera=> select igid,sum(isize) as sum from t_inodes group by igid order by sum desc ;
 igid |       sum
------+-----------------
  500 | 212853755148360  # 500 is the CMS group that stores all the PhEDEx files.
  533 | 130072146368598
  534 |  25902761600489
  536 |  23833376390438
  532 |  18310416555193
  531 |  16152944599625
  530 |  10590297019580
  538 |   1140547783970
  537 |    316829040978
  535 |     36287017607
  800 |        42217296
  550 |           43560
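The byte sums above are hard to read at a glance. As a sketch, the per-GID totals could be converted to TiB with a small filter. The function name gid_usage_tib is hypothetical, and the psql connection options in the usage comment are assumptions that depend on the local setup.

```shell
# Reads "gid bytes" pairs (one per line) on stdin and prints "gid TiB"
# with two decimals. Lines that do not have exactly two fields are ignored.
gid_usage_tib() {
    awk 'NF == 2 { printf "%s %.2f\n", $1, $2 / (1024 ^ 4) }'
}

# Possible usage (assumed: unaligned, tuples-only output with a space separator):
#   psql -At -F ' ' -d chimera \
#        -c "select igid, sum(isize) from t_inodes group by igid" | gid_usage_tib
```

For the first row of the listing, 212853755148360 bytes comes out as about 193.59 TiB.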
- We defined a formula to compute quota(GID), and we will check with Nagios the real /pnfs group usage vs quota(GID)
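A sketch of what such a Nagios check might look like. The quota formula itself is not given in these notes, so the quota is simply passed in as a number; the function name check_gid_quota and the default 90% warning threshold are assumptions. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention. Since the quota is "soft", going over it only raises an alert.

```shell
# check_gid_quota USED_BYTES QUOTA_BYTES [WARN_PERCENT]
# Prints a Nagios-style status line and returns the matching plugin exit code.
# Arguments are expected as plain decimal integers without leading zeros.
check_gid_quota() {
    used="$1"; quota="$2"; warn_pct="${3:-90}"
    for v in "$used" "$quota" "$warn_pct"; do
        case "$v" in
            ''|*[!0-9]*) echo "UNKNOWN - arguments must be non-negative integers"; return 3 ;;
        esac
    done
    if [ "$quota" -eq 0 ]; then
        echo "UNKNOWN - quota is zero"; return 3
    fi
    # Integer percentage of the quota in use (rounded down).
    pct=$(( used * 100 / quota ))
    if [ "$used" -gt "$quota" ]; then
        echo "CRITICAL - usage ${pct}% of quota"; return 2
    elif [ "$pct" -ge "$warn_pct" ]; then
        echo "WARNING - usage ${pct}% of quota"; return 1
    fi
    echo "OK - usage ${pct}% of quota"; return 0
}

# Example with made-up numbers: check_gid_quota 50 100
```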
- On the dCache side, we dynamically create /etc/grid-security/storage-authzdb according to these new GIDs when a user leaves/joins the cluster.
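A rough sketch of how the storage-authzdb file could be regenerated from the account data. The function name make_authzdb is hypothetical, and the "version 2.1" header and "authorize user read-write uid gid / / /" line layout reflect the dCache storage-authzdb format as we understand it; the home and root paths in particular are assumptions and should be checked against the dCache documentation before use.

```shell
# Reads "username uid gid" triples on stdin and prints storage-authzdb content.
# Home and root are both set to "/" here, which is an assumption.
make_authzdb() {
    echo "version 2.1"
    awk 'NF == 3 { printf "authorize %s read-write %s %s / / /\n", $1, $2, $3 }'
}

# Possible usage (the GID range 530-538 is taken from the accounting listing
# and is only an example selection):
#   getent passwd | awk -F: '$4 >= 530 && $4 <= 538 { print $1, $3, $4 }' \
#       | make_authzdb > storage-authzdb.new
```

Writing to a new file first and moving it into place afterwards would avoid dCache reading a half-written file.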
- UNIBE (reports Gianfranco):
- Progress in commissioning of Phoenix PhaseC hardware: 75% of the WNs installed, SLC6.3, ROCKS 6.1, 15 thumpers for Lustre 2.4 OSS, 1 MDS for Lustre 2.4.
- Installation is generally fiddly (thumpers need to be tried 2 or 3 times, with exactly the same procedure, before it goes through), but it eventually works.
- A number of WNs report: "No disk is available for installation - Your BIOS is broken". Investigation is ongoing; the BIOS version seems to be the same on all nodes.
- The installed ARC and SGE versions (nordugrid-arc-2.0.1-1.el6.x86_64 and GE-2011.11p1-1.x86_64) do not work together. /usr/share/arc/SGEmod.pm fills the infoprovider with the info from the batch service. The script supports versions 5 and 6 of GE, but the qstat header turns out to be totally different in version 2011.11p1. Will open a feature request on the NorduGrid bugzilla, but the solution for immediate production will be to hack the script. Also GGUS ticket about this:
- UNIGE (reports Szymon):
- UZH (reports Sergio):
- Switch (reports Alessandro):
- End of EMI: impact on the EGI releases, bug fixes/update/support under discussion (MeDIA consortium)
- SGAS server at SWITCH to be shut down or migrated to a SWING partner by the end of 2013. ARC in the current EMI3 release cannot publish to an APEL server; this will come with the next ARC release. ATLAS could decide to use the NorduGrid repository though...
- NGI_CH ARGUS server? Any thoughts?
- ARC gridftp test in Nagios was deprecated by EGI, but not by WLCG, now fixed
- Next week we will upgrade the Nagios production instance to update 20 (test instance was updated 2 weeks ago)
Other topics
- Next CSCS Phoenix maintenances are scheduled for May 15 (dCache update) and then July 3 (WNs move to SL6).
Next meeting date:
Proposed date is Thursday, June 6, 2013
Attendants
- CSCS: Pablo, Miguel, George
- CMS:
- ATLAS:
- LHCb: Roland
- EGI:
Action items
Topic revision: r8 - 2013-05-02 - GianfrancoSciacca