Swiss Grid Operations Meeting on 2015-07-02

Site status

CSCS

PSI

  • Fabio will be on leave until 6th July

UNIBE-LHEP

  • Operations
    • Still bumpy and at about half capacity
    • Restored 320 old cores (from Ubelix), but many tend to crash
    • One more aircon issue (22nd June). Many nodes lost power. Working on a temperature monitoring system in the room (some input from PSI too, thanks!)
    • Likely related to the aircon problem: Lustre disks on 3 nodes went flaky.
    • More issues with nodes crashing on both clusters. Most of the time the jobs remain in state "dr" in gridengine. These somehow prevent new jobs from being submitted (the new jobs remain in PREPARING state in ARC). Now added a cron job to clean these up and log the node names.
    • LAN down on ce01 on Friday 26th June. Very likely a hardware failure, but in the rush to get the cluster back online we did not establish whether that was really the case. Recovery: swap to an unused network interface, register the network and interface changes in ROCKS, redeploy Lustre from scratch, power up and re-install stuck nodes.
  • ATLAS specific operations
    • gridengine multicore scheduling improved. Changes to gridengine were already in place a month ago, but success seemed limited. In addition, removed on one cluster a hack that scaled up the requested walltime by 1.4/1.5. The increased difficulty in scheduling multicore jobs is possibly explained by some ATLAS tasks with quite high walltime.
    • All ARC failures still masked by crons. But bugfix release in apel-testing, will try to upgrade soon.
  • Ongoing work
    • ROCKS 6.2 just came out, prototyping the cluster deployment chain with this version now (CE, WN, Lustre MDS, Lustre OSS)
    • 6 IBM servers from CSCS collected and rackmounted. Will be deployed upon re-installation of ce01 (ROCKS 6.2)
    • Temperature monitoring in the server room is in progress. The new water-cooled rack by Theoretical Physics monitors the inlet water temperature. Will add ambient sensors in some racks. Monitor first to learn trends, then try to automate in the future (e.g. drain clusters when the inlet water temperature exceeds a threshold)
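
The cron clean-up of jobs stuck in state "dr" mentioned under Operations could look roughly like the sketch below. The minutes do not describe the actual script, so everything here is an assumption: the function names are hypothetical, the parser assumes the default column layout of `qstat -u '*'` (job ID, priority, name, user, state, date, time, queue instance), and forced deletion with `qdel -f` is only one possible way to clear such jobs.

```python
import logging
import subprocess

STUCK_STATE = "dr"  # state quoted in the minutes for jobs left behind by crashed nodes


def find_stuck_jobs(qstat_text):
    """Return (job_id, node) pairs for jobs whose state is exactly 'dr'.

    Assumes the default qstat layout, where the state is the 5th column
    and the queue instance ("queue@hostname") is the 8th.
    """
    stuck = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; skip header and separator lines.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        job_id, state, queue_instance = fields[0], fields[4], fields[7]
        if state != STUCK_STATE:
            continue
        # Keep only the host part of "queue@hostname".
        if "@" in queue_instance:
            node = queue_instance.split("@", 1)[1]
        else:
            node = "unknown"
        stuck.append((job_id, node))
    return stuck


def clean_up(dry_run=True):
    """Log the node of every stuck job and, unless dry_run, force-delete the job."""
    result = subprocess.run(
        ["qstat", "-u", "*"], capture_output=True, text=True, check=True
    )
    for job_id, node in find_stuck_jobs(result.stdout):
        logging.warning("job %s stuck in state dr on node %s", job_id, node)
        if not dry_run:
            # Assumption: forced deletion is acceptable for these jobs.
            subprocess.run(["qdel", "-f", job_id], check=False)
```

Keeping the parsing in `find_stuck_jobs` separate from the `qstat`/`qdel` calls means the logic can be checked against saved `qstat` output before the cron job is allowed to delete anything; `clean_up()` defaults to a dry run for the same reason.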

UNIBE-ID

  • Michael/Nico cannot attend due to delivery of ESS
  • Operations:
    • smooth, high usage currently
  • ATLAS-related:
    • mcore jobs are now scheduled better; changes made:
      • resource reservation is only set for mcore jobs (within submit-sge-jobs when priority is set)
      • increased max_reservation in scheduler conf from 7 to 32
      • increased default_duration in scheduler conf from 24h to 97h, equal to the h_rt limit of the queue where ATLAS jobs are running
    • At the moment: WARNING in the GridKa Nagios regarding the latest EGI-trustanchors release
      • IGTF-1.65, 0 days old, all present. - SHA Fingerprint failed for ca-policy-lcg. - SHA Fingerprint failed for ca-policy-egi-core
      • Is this a broken release?
  • UBELIX Puppet Resources
    • As mentioned in the hpc-forum presentation, we now have a public platform for OSS stuff:
      • http://idos-code.unibe.ch - Stash with most of our puppet modules
      • http://idos-issues.unibe.ch - Jira, our issue tracker for the code above
      • Clone as you like. Contributions (aka pull requests) are welcome as soon as we have our Crowd instance ready (end of week). Please don't register yet, even though it is possible.

UNIGE

  • Still unmanned, likely until 1st October 2015

NGI_CH

Other topics

  • Topic1
  • Topic2
Next meeting date:

A.O.B.

  • Reminder: Face To Face meeting to be held on 21 August 2015 at CSCS.

Attendants

  • CSCS: Miguel Gila
  • CMS: Daniel Meister
  • ATLAS: Gianfranco Sciacca
  • LHCb: Roland Bernet
  • EGI: Gianfranco Sciacca

Action items

  • Item1
Topic revision: r10 - 2016-04-29 - FabioMartinelli
 