Swiss Grid Operations Meeting on 2014-10-02

Site status

CSCS

  • Issues
    • Some nodes have been dropping off the IB network at random, so many jobs have failed lately.
    • All these failed jobs produced a huge increase in inode usage on GPFS that was impossible to clean up via policies (files appeared faster than the policies could delete them); a purge-policy sketch follows this block.
    • Added another 2x 400GB SSDs to GPFS to provide even more inodes (150M).
    • GPFS is going to be decommissioned soon.

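A minimal sketch of such a periodic purge policy, assuming GPFS's mmapplypolicy and using the phoenix_scratch file system and 'scratch' fileset from the GPFS2 listing below for concreteness; the 30-day access-time threshold is illustrative, not the actual CSCS setting:

      # purge.rules -- delete scratch files not accessed for 30 days (threshold illustrative)
      RULE 'purge_scratch' DELETE
        FOR FILESET ('scratch')
        WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30

      # dry-run first, then apply for real
      mmapplypolicy phoenix_scratch -P purge.rules -I test
      mmapplypolicy phoenix_scratch -P purge.rules -I yes
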
  • GPFS2
    • Fully configured using two filesets (scratch and gridhome):
      Filesets in file system 'phoenix_scratch':
      Name                            Id      RootInode  ParentId Created                      InodeSpace      MaxInodes    AllocInodes Comment
      root                             0              3        -- Tue Sep 30 09:18:00 2014        0              1000128        1000128 root fileset
      scratch                          1        1048579         0 Tue Sep 30 09:21:51 2014        1             50000000       50000000
      gridhome                         2      134217731         0 Tue Sep 30 09:28:34 2014        2             30000000       30000000
      
                               Block Limits                                    |                     File Limits
      Name       type             KB      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace
      scratch    FILESET   120917920          0          0     105024     none |  5057529 35000000 40000000      254     none
      gridhome   FILESET           0          0          0      20480     none |        1 20000000 25000000       19     none
    • Each fileset has its own inode quota, so we can still clean up the filesystem even if one fileset reaches its maximum.
    • A new method of cleaning up storage is being tested: an epilog that runs right after each job ends, whether it completed or failed (see the sketch below). We hope that this, along with periodic GPFS policies, will solve the inode problems once and for all.

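A minimal sketch of the epilog cleanup under test, assuming SLURM (whose configuration is updated in the maintenance list below) and an assumed per-job scratch layout of /gpfs/scratch/<user>/<jobid>; the paths are placeholders, not the actual CSCS layout:

      #!/bin/bash
      # Epilog script, wired up via Epilog= in slurm.conf; slurmd runs it on each
      # node right after a job ends, whether it completed or failed.
      # SLURM_JOB_USER and SLURM_JOB_ID are provided in the epilog environment.
      SCRATCH_BASE=/gpfs/scratch                                   # assumed mount point
      JOB_DIR="${SCRATCH_BASE}/${SLURM_JOB_USER}/${SLURM_JOB_ID}"
      # Delete the job's scratch tree immediately, freeing its inodes without
      # waiting for the next periodic GPFS policy run.
      [ -d "${JOB_DIR}" ] && rm -rf -- "${JOB_DIR}"
      exit 0
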
  • Swiss users storage
    • Ready for ATLAS and LHCb (ATLASLOCALGROUPDISK = 160TB, LHCB-DISK = 290TB)
    • Still needs to be tested by CMS (CHCMS = 150TB)
  • Swiss users compute
    • To be ready by the next maintenance
  • Next maintenance
    • Downtime set in GOCDB for the whole of CSCS-LCG2 on 15.10.2014 (https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=15651)
    • Changes to be applied:
      1. Update firmware on IB switches
      2. Set the MTU to 1500 on all systems (all IB cards); see the IPoIB sketch after this list
      3. Deploy GPFS2 to all nodes (same gpfs cluster instead of remote cluster)
      4. Reconfigure all grid nodes to use GPFS2 and enable CHCMS VOMS
      5. Deploy ARGUS servers
      6. Remove /experiment_software from all WNs, as it is no longer used
      7. Update the SLURM configuration: increase the priority of atlaschXX, cmschXX and lhcbchXX
      8. Remove non-existent/decommissioned nodes from the SLURM config
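
As a sketch of change 2, an IPoIB interface on a RHEL/CentOS 6 system could be pinned to a 1500-byte MTU as follows; the device name, addressing and datagram-mode choice are assumptions, not the actual CSCS configuration:

      # /etc/sysconfig/network-scripts/ifcfg-ib0 (illustrative values)
      DEVICE=ib0
      TYPE=InfiniBand
      ONBOOT=yes
      BOOTPROTO=static
      IPADDR=10.10.0.15          # placeholder address
      NETMASK=255.255.0.0
      CONNECTED_MODE=no          # datagram mode; connected mode would allow a much larger MTU
      MTU=1500

      # apply and verify
      ifdown ib0 && ifup ib0
      ip link show ib0 | grep mtu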

PSI

UNIBE-ID

  • Infiniband Network with fat tree topology in production
    • Hardware installed and tested
    • Performance measurements done and OK
    • IPoIB setup done (see the verification sketch after this list)
  • Many disk crashes lately, but no outages
  • Currently working on migration from RHEL-6 to CentOS-6
    • future config management with Puppet; the testbed has been working excellently so far
    • new compute nodes ordered to replace the old ones => they provide a testing environment for the new CentOS-based setup
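
For reference, a few standard checks of the kind used for such IB validation, assuming the infiniband-diags and perftest packages are installed; the host name is a placeholder:

      ibstat                     # local HCA ports: state, rate, LID
      iblinkinfo                 # fabric-wide link states and speeds
      ip addr show ib0           # IPoIB address and MTU

      # point-to-point RDMA bandwidth between two nodes (perftest package)
      ib_write_bw                # on the server node
      ib_write_bw server-node    # on the client node; 'server-node' is a placeholder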

UNIBE-LHEP

UNIGE

  • The 2014 upgrade is finished
    • the two oldest disk servers retired (X4500, 2006 models)
    • the ‘user’ and ‘software’ spaces migrated to two new servers (Sun, Solaris)
    • other data migrated to another new disk server (IBM, Linux)
    • two new machines for running four VMs hosting critical services
  • Two astroparticle space experiments are using the cluster
    • the DAMPE group is starting
    • the AMS group will invest in hardware to get more disk space
  • Shellshock emergency...
  • ATLAS production (Andrej Filipicic) is trying multi-core jobs

NGI_CH

  • ARGUS status and support
    • NGI_CH instance: https://ggus.eu/index.php?mode=ticket_info&ticket_id=99533
    • Support:
      • SWITCH bailed out
      • PEP client and server: no future support
      • PAP: INFN
      • PDP: no future support
      • ARGUS EES: NIKHEF
      • LCMAPS plugin: NIKHEF
      • New request to the NGIs to rescue the unsupported components
      • No alternatives/plan B
    • Status of deployment and plans
      • Most NGIs run their national service
      • End Oct 2014: monitoring framework ready (national instances) - Nagios probe, list of instances to monitor via a GOCDB query (see the sketches after this list)
      • Nov 2014: pilot testing with 4-5 sites (possibly with diversified middleware), refine documentation
        • Test whether ban information is applied at the sites' services: CE/SE/WMS (action on EGI-CSIRT); see the pepcli sketch after this list
      • End Nov 2014 to end Mar 2015 (?): wide deployment
      • Beyond (?): sites monitored for this feature
  • New VOMS server configuration for ops/LHC VOs
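
Three hedged sketches for the items above; the GOCDB service-type string, the Argus endpoint, the proxy path and the VOMS server details are assumptions, not values confirmed at the meeting:

      # list the ARGUS instances registered in GOCDB, e.g. to feed the Nagios probe
      # (service type assumed to be emi.ARGUS)
      curl -s 'https://goc.egi.eu/gocdbpi/public/?method=get_service_endpoint&service_type=emi.ARGUS'

      # ask a site PEP daemon for an authorization decision with a given proxy;
      # a banned credential should come back as Decision: Deny
      pepcli --pepd https://argus.example.ch:8154/authz \
             --keyinfo /tmp/x509up_u500 \
             --resource my_resource --action my_action

      # /etc/vomses entry format for an LHC VO on a new VOMS server
      # (host, port and DN are assumptions based on the 2014 CERN VOMS migration)
      "atlas" "voms2.cern.ch" "15001" "/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch" "atlas"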

Other topics

  • Topic1
  • Topic2

Next meeting date:

A.O.B.

Attendants

  • CSCS: George Brown, Miguel Gila, Gianni Ricciardi
  • CMS: Fabio Martinelli
  • ATLAS: Gianfranco Sciacca, Szymon Gadomski
  • LHCb: Roland Bernet
  • EGI: Gianfranco Sciacca

Action items

  • Item1