Swiss Grid Operations Meeting on 2014-10-02

Site status

CSCS

  • Issues
    • Some nodes have been dropping off the IB network at random, so many jobs have failed lately.
    • All these failed jobs produced a huge increase in inode usage on GPFS that was impossible to clean up via policies (files appeared faster than the policies could delete them); a purge-policy sketch follows this block.
    • Added another 2x 400GB SSDs to GPFS to provide even more inodes (150M).
    • GPFS is going to be decommissioned soon.

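A minimal sketch of such a periodic purge policy, assuming GPFS's mmapplypolicy and using the phoenix_scratch file system and 'scratch' fileset from the GPFS2 listing below for concreteness; the 30-day access-time threshold is illustrative, not the actual CSCS setting:

      # purge.rules -- delete scratch files not accessed for 30 days (threshold illustrative)
      RULE 'purge_scratch' DELETE
        FOR FILESET ('scratch')
        WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30

      # dry-run first, then apply for real
      mmapplypolicy phoenix_scratch -P purge.rules -I test
      mmapplypolicy phoenix_scratch -P purge.rules -I yes
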
  • GPFS2
    • Fully configured using two filesets (scratch and gridhome):
      Filesets in file system 'phoenix_scratch':
      Name                            Id      RootInode  ParentId Created                      InodeSpace      MaxInodes    AllocInodes Comment
      root                             0              3        -- Tue Sep 30 09:18:00 2014        0              1000128        1000128 root fileset
      scratch                          1        1048579         0 Tue Sep 30 09:21:51 2014        1             50000000       50000000
      gridhome                         2      134217731         0 Tue Sep 30 09:28:34 2014        2             30000000       30000000
      
                               Block Limits                                    |                     File Limits
      Name       type             KB      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace
      scratch    FILESET   120917920          0          0     105024     none |  5057529 35000000 40000000      254     none
      gridhome   FILESET           0          0          0      20480     none |        1 20000000 25000000       19     none
    • Each fileset has its own inode quota, so we can still clean up the filesystem even if one fileset reaches its maximum.
    • A new method of cleaning up storage is being tested: an epilog that runs right after each job ends, whether it completed or failed (see the sketch below). We hope that this, along with periodic GPFS policies, will solve the inode problems once and for all.

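A minimal sketch of the epilog cleanup under test, assuming SLURM (whose configuration is updated in the maintenance list below) and an assumed per-job scratch layout of /gpfs/scratch/<user>/<jobid>; the paths are placeholders, not the actual CSCS layout:

      #!/bin/bash
      # Epilog script, wired up via Epilog= in slurm.conf; slurmd runs it on each
      # node right after a job ends, whether it completed or failed.
      # SLURM_JOB_USER and SLURM_JOB_ID are provided in the epilog environment.
      SCRATCH_BASE=/gpfs/scratch                                   # assumed mount point
      JOB_DIR="${SCRATCH_BASE}/${SLURM_JOB_USER}/${SLURM_JOB_ID}"
      # Delete the job's scratch tree immediately, freeing its inodes without
      # waiting for the next periodic GPFS policy run.
      [ -d "${JOB_DIR}" ] && rm -rf -- "${JOB_DIR}"
      exit 0
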
  • Swiss users storage
    • Ready for ATLAS and LHCb (ATLASLOCALGROUPDISK = 160TB, LHCB-DISK = 290TB)
    • Still needs to be tested by CMS (CHCMS = 150TB)
  • Swiss users compute
    • To be ready by the next maintenance
  • Next maintenance
    • Downtime set in GOCDB for the whole of CSCS-LCG2 on 15.10.2014 (https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=15651)
    • Changes to be applied:
      1. Update firmware on IB switches
      2. Set the MTU to 1500 on all systems (all IB cards); see the IPoIB sketch after this list
      3. Deploy GPFS2 to all nodes (same gpfs cluster instead of remote cluster)
      4. Reconfigure all grid nodes to use GPFS2 and enable CHCMS VOMS
      5. Deploy ARGUS servers
      6. Remove /experiment_software from all WNs, as it is no longer used
      7. Update the SLURM configuration: increase the priority of atlaschXX, cmschXX and lhcbchXX
      8. Remove non-existent/decommissioned nodes from the SLURM config
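
As a sketch of change 2, an IPoIB interface on a RHEL/CentOS 6 system could be pinned to a 1500-byte MTU as follows; the device name, addressing and datagram-mode choice are assumptions, not the actual CSCS configuration:

      # /etc/sysconfig/network-scripts/ifcfg-ib0 (illustrative values)
      DEVICE=ib0
      TYPE=InfiniBand
      ONBOOT=yes
      BOOTPROTO=static
      IPADDR=10.10.0.15          # placeholder address
      NETMASK=255.255.0.0
      CONNECTED_MODE=no          # datagram mode; connected mode would allow a much larger MTU
      MTU=1500

      # apply and verify
      ifdown ib0 && ifup ib0
      ip link show ib0 | grep mtu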

PSI

UNIBE-ID

  • Infiniband Network with fat tree topology in production
    • Hardware installed and tested
    • Performance measurements done and OK
    • IPoIB setup done (see the verification sketch after this list)
  • Many disk crashes lately, but no outages
  • Currently working on migration from RHEL-6 to CentOS-6
    • future config management with Puppet; the testbed has been working excellently so far
    • new compute nodes ordered to replace the old ones => they provide a testing environment for the new CentOS-based setup
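
For reference, a few standard checks of the kind used for such IB validation, assuming the infiniband-diags and perftest packages are installed; the host name is a placeholder:

      ibstat                     # local HCA ports: state, rate, LID
      iblinkinfo                 # fabric-wide link states and speeds
      ip addr show ib0           # IPoIB address and MTU

      # point-to-point RDMA bandwidth between two nodes (perftest package)
      ib_write_bw                # on the server node
      ib_write_bw server-node    # on the client node; 'server-node' is a placeholder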

UNIBE-LHEP

UNIGE

  • The 2014 upgrade is finished
    • the two oldest disk servers retired (X4500, 2006 models)
    • the ‘user’ and ‘software’ spaces migrated to two new servers (Sun, Solaris)
    • other data migrated to another new disk server (IBM, Linux)
    • two new machines for running four VMs hosting critical services
  • Two astroparticle space experiments are using the cluster
    • the DAMPE group is starting
    • the AMS group will invest in hardware to get more disk space
  • Shellshock emergency...
  • ATLAS production (Andrej Filipicic) is trying multi-core jobs

NGI_CH

  • ARGUS status and support
    • NGI_CH instance: https://ggus.eu/index.php?mode=ticket_info&ticket_id=99533
    • Support:
      • SWITCH bailed out
      • PEP client and server: no future support
      • PAP: INFN
      • PDP: no future support
      • ARGUS EES: NIKHEF
      • LCMAPS plugin: NIKHEF
      • New request to the NGIs to rescue the unsupported components
      • No alternatives/plan B
    • Status of deployment and plans
      • Most NGIs run their national service
      • End Oct 2014: monitoring framework ready (national instances) - Nagios probe, list of instances to monitor via a GOCDB query (see the sketches after this list)
      • Nov 2014: pilot testing with 4-5 sites (possibly with diversified middleware), refine documentation
        • Test whether ban information is applied at the sites' services: CE/SE/WMS (action on EGI-CSIRT); see the pepcli sketch after this list
      • End Nov 2014 to end Mar 2015 (?): wide deployment
      • Beyond (?): sites monitored for this feature
  • New VOMS server configuration for ops/LHC VOs
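
Three hedged sketches for the items above; the GOCDB service-type string, the Argus endpoint, the proxy path and the VOMS server details are assumptions, not values confirmed at the meeting:

      # list the ARGUS instances registered in GOCDB, e.g. to feed the Nagios probe
      # (service type assumed to be emi.ARGUS)
      curl -s 'https://goc.egi.eu/gocdbpi/public/?method=get_service_endpoint&service_type=emi.ARGUS'

      # ask a site PEP daemon for an authorization decision with a given proxy;
      # a banned credential should come back as Decision: Deny
      pepcli --pepd https://argus.example.ch:8154/authz \
             --keyinfo /tmp/x509up_u500 \
             --resource my_resource --action my_action

      # /etc/vomses entry format for an LHC VO on a new VOMS server
      # (host, port and DN are assumptions based on the 2014 CERN VOMS migration)
      "atlas" "voms2.cern.ch" "15001" "/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch" "atlas"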

Other topics

  • Topic1
  • Topic2

Next meeting date:

A.O.B.

Attendants

  • CSCS: George Brown, Miguel Gila, Gianni Ricciardi
  • CMS: Fabio Martinelli
  • ATLAS: Gianfranco Sciacca, Szymon Gadomski
  • LHCb: Roland Bernet
  • EGI: Gianfranco Sciacca

Action items

  • Item1