
Swiss Grid Operations Meeting on 2013-01-10

Agenda

Status

  • CSCS (reports Pablo):
    • New colleague: George Brown
    • Transfer quality was degraded during the week of 10 December. The cause was the InfiniBand Subnet Manager role, which had been taken over by a physical switch a few weeks earlier. We raised the priority of the software-based subnet managers to prevent this from happening again, and a Nagios check is going to be developed.
    • Site operation during Christmas period was good:
      • There was an ARGUS caching problem
      • Some jobs also hammered the Scratch FS for a week. Scary, but stable.
    • The new DCS3700 are almost production ready. Benchmarks are good (almost 2 GB/s per controller and IO server), but a couple of small issues still need to be solved (XFS module, and a boot problem under special conditions).
  • PSI (reports Fabio):
    • dCache 1.9.5 to 1.9.12 migration in the coming days, 11-14 January.
    • PostgreSQL 8.4 to 9.2
    • Created a dedicated 'dcache' Linux/Solaris user instead of running dCache as 'root'.
    • We will probably migrate to dCache 2.2 during the 1st quarter of 2013. I want to invest time in a recent version, but we need to agree on another downtime with our T3 users; I count 18 WLCG sites already running dCache 2.*
  • UNIBE (reports Gianfranco):
    • Stable unmanned operation over Holiday Season
    • No updates since last meeting (status slide attached for reference)
    • Time spent dealing with ARC bugs and on several tasks involving local resources
    • Serious issue with the ARC upgrade from 1.1.0 to 2.0.1 on the production cluster; it might be solved now
  • UNIGE (reports Szymon):
    • Unfortunate story of the site BDII last Dec
      • got fixed, but I did not like how it went
      • I would also like to know why we are running this
    • New hardware
      • CPU nodes
        • 8 IBM x3755 M3 (2U, 32 cores, 96 GB RAM, 1.1 TB disk in HW RAID)
        • now in production: 469 slots in the batch system (+70%)
      • disk servers
        • 2 IBM x3630 M3 (2U, 14 × 3 TB disks, 32 TB for data)
        • not yet in production, most likely in the SE
    • HW upgrade plan for 2013
      • replace disk servers in the DPM SE running Solaris, run SLC only
      • replace oldest disk servers doing NFS, stay with Solaris
    • Upgrade of the DPM head node, ongoing
      • Test instance on a VM first
      • The real head node will run on the VM as well
      • MySQL database on a physical machine
    • Plan to virtualize all the services
      • DPM head, ARC, Web + Ganglia + Installation, batch server
  • Switch (reports Alessandro):
    • UNIGE problem upgrading the BDII -> general support issue: can admin responsibilities be shared within NGI_CH?
    • UNIBE-ID excluded from the A/R table: the problem is under investigation (December recalculation possible, November unfortunately not)
    • Notice that gLite 3.2 is deprecated (e.g. end of January deadline for upgrade of the DPM)
    • Sustainability survey and Amsterdam meeting (end of January): Sigve will be there.
    • NGI_CH will not contribute extra resources for non-HEP VOs (EGI had made a request to this effect)
    • Contribution to the Nagios probes working group still to be discussed (pending EGI decision)
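
As an illustration of the Nagios check mentioned in the CSCS report, here is a minimal sketch of a plugin core that verifies the master subnet manager is one of the expected software-based hosts. It is not the check CSCS developed. It assumes the input is one line of text in the style printed by the `sminfo` tool from infiniband-diags (that line format is an assumption here), and the GUIDs and the function name `check_master` are hypothetical. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical GUIDs of the software-based subnet managers that are
# allowed to hold the master role; replace with the real fabric values.
EXPECTED_MASTER_GUIDS = {"0x2c90300001234", "0x2c90300005678"}

# Assumed line format, for example:
#   sminfo: sm lid 1 sm guid 0x2c90300001234, activity count 100 priority 14 state 3 SMINFO_MASTER
SMINFO_RE = re.compile(r"sm guid (0x[0-9a-fA-F]+).*?priority (\d+) state (\d+)")


def check_master(sminfo_line, expected_guids):
    """Return (exit_code, message) for one line describing the master SM."""
    match = SMINFO_RE.search(sminfo_line)
    if match is None:
        return UNKNOWN, "UNKNOWN - could not parse subnet manager info"
    guid = match.group(1).lower()
    priority = int(match.group(2))
    if guid in expected_guids:
        return OK, f"OK - master SM {guid} (priority {priority}) is an expected host"
    # Any other device (e.g. a physical switch) holding the master role is an error.
    return CRITICAL, f"CRITICAL - master SM {guid} (priority {priority}) is not an expected host"
```

A wrapper script would obtain the line from the fabric, print the returned message, and exit with the returned code so that Nagios can raise an alert when a switch takes over the master role.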

Other topics

  • Next meeting date: February 7th

AOB

Attendants

  • CSCS: Pablo, Miguel, George
  • CMS: Fabio, Derek, Daniel
  • ATLAS: Gianfranco, Szymon
  • LHCb: Roland
  • EGI: Alessandro

Action items

  • reboot-pb-notes.txt: logs after a spontaneous reboot of the new x3755 servers at UniGE. (Fabio) Possibly the server was too heavily loaded, in which case the NMI watchdog will stop the server. IBM DOC

Topic attachments
| Attachment | History | Size | Date | Who | Comment |
| ATLAS-DE-CloudReport09Jan13.pdf | r2 r1 | 2006.1 K | 2013-01-10 - 12:00 | GianfrancoSciacca | ATLAS DE cloud report DEC12 |
| CHIPP-CB-20121206.pdf | r1 | 405.4 K | 2013-01-10 - 10:57 | GianfrancoSciacca | UNIBE-LHEP status 20121206 |
| reboot-pb-notes.txt | r1 | 2.8 K | 2013-01-10 - 11:17 | SzymonGadomski | Logs after spontaneous reboot, new x3755 servers at UniGE |
Topic revision: r17 - 2013-04-04 - PabloFernandez
 