Swiss Grid Operations Meeting on 2013-01-10
Agenda
Status
- CSCS (reports Pablo):
- New colleague: George Brown
- Transfer quality was degraded during the week of 10 December. The cause was the InfiniBand Subnet Manager role, which had been taken over by a physical switch a few weeks earlier. We raised the priority of the software-based subnet managers to prevent this from happening again, and a Nagios check is going to be developed.
- Site operation during the Christmas period was good:
- There was an ARGUS caching problem
- Some jobs hammered the scratch FS for a week. Scary, but stable.
- The new DCS3700 systems are almost production-ready. Benchmarks are good (almost 2 GB/s per controller and I/O server), but we still need to solve a couple of small issues (the XFS module and a boot problem under special conditions).
- PSI (reports Fabio):
- dCache migration from 1.9.5 to 1.9.12 over the coming days, 11-14 January.
- PostgreSQL upgrade from 8.4 to 9.2
- We will probably migrate to dCache 2.2 during the first quarter of 2013. I want to invest my time in a recent version, but this needs to be agreed with the T3 users. I count 18 WLCG sites running dCache 2.*
- UNIBE (reports Gianfranco):
- Stable unattended operation over the holiday season
- No updates since last meeting (status slide attached for reference)
- Time spent dealing with ARC bugs and on several tasks involving local resources
- Serious issue with the ARC upgrade from 1.1.0 to 2.0.1 on the production cluster; it might be solved now
- UNIGE (reports Szymon):
- Unfortunate story with the site BDII last December
- got fixed, but I did not like how it went
- I would also like to know why we are running this
- New hardware
- CPU nodes
- 8 IBM x3755 M3 (2U, 32 cores, 96 GB RAM, 1.1 TB disk in HW RAID)
- in production; the batch system now has 469 slots (+70%)
- disk servers
- 2 IBM x3630 M3 (2U, 14 x 3 TB disks, 32 TB for data)
- not yet in production, most likely in the SE
- HW upgrade plan for 2013
- replace the Solaris disk servers in the DPM SE, so that the SE runs SLC only
- replace the oldest NFS disk servers, staying with Solaris
- Upgrade of the DPM head node, ongoing
- Test instance on a VM first
- The real head node will run on a VM as well
- MySQL database on a physical machine
- Plan to virtualize all the services
- DPM head, ARC, Web + Ganglia + Installation, batch server
- Switch (reports Alessandro):
- UNIGE problem upgrading the BDII -> general support issue: can admin responsibilities be shared within NGI_CH?
- UNIBE-ID excluded from the A/R table: the problem is under investigation (December recalculation possible, November unfortunately not)
- Note that gLite 3.2 is deprecated (e.g. end-of-January deadline for the upgrade of the DPM)
- Sustainability survey and Amsterdam meeting (end of January): Sigve will be there.
- NGI_CH will not contribute extra resources for non-HEP VOs (there was a request to this effect from EGI)
- Contribution to the Nagios probes working group still to be discussed (pending EGI decision)
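
Regarding the Nagios check planned in the CSCS report above (making sure the InfiniBand Subnet Manager role stays on the software-based managers): a minimal sketch of what the core logic of such a check might look like, not the check CSCS will actually deploy. The report line format, the meaning of state 3 as "master", and the function name `check_master_sm` are all assumptions made for illustration; only the exit codes follow the usual Nagios plugin convention.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed report format (one line per queried subnet manager), for example:
#   sminfo: sm lid 1 sm guid 0x2c90300a1b2c3, activity count 42 priority 14 state 3 SMINFO_MASTER
SM_LINE = re.compile(r"sm guid (0x[0-9a-fA-F]+),.*priority (\d+) state (\d+)")

# Assumption: state 3 denotes the master subnet manager.
MASTER_STATE = 3


def check_master_sm(line, expected_guids):
    """Return (exit_code, message) for one subnet manager report line.

    expected_guids is a set of integer GUIDs of the hosts that are allowed
    to act as master (the software-based subnet managers).
    """
    match = SM_LINE.search(line)
    if match is None:
        return UNKNOWN, "UNKNOWN: could not parse subnet manager report"

    guid = int(match.group(1), 16)  # compare as integers to ignore zero padding
    priority = int(match.group(2))
    state = int(match.group(3))

    if state != MASTER_STATE:
        return WARNING, "WARNING: reported SM 0x%x is not in master state" % guid
    if guid not in expected_guids:
        return CRITICAL, "CRITICAL: unexpected master SM 0x%x (priority %d)" % (guid, priority)
    return OK, "OK: master SM 0x%x (priority %d)" % (guid, priority)
```

A wrapper script would obtain the report line from the fabric, print the returned message, and exit with the returned code, so that a switch taking over the master role shows up as CRITICAL.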
Other topics
Next meeting date: February 7th
AOB
Attendees
- CSCS: Pablo, Miguel, George
- CMS: Fabio, Derek, Daniel
- ATLAS: Gianfranco, Szymon
- LHCb: Roland
- EGI: Alessandro
Action items
- reboot-pb-notes.txt: logs after a spontaneous reboot of the new x3755 servers at UniGE. (Fabio) Possibly the server was too heavily loaded, in which case the NMI watchdog stops the server. IBM DOC