Swiss Grid Operations Meeting on 2014-10-02
Site status
CSCS
- Issues
- Some nodes have been dropping off the IB network at random, so many jobs have failed lately.
- These failed jobs produced a huge increase in inode usage on GPFS that was impossible to clean up by policies (just too fast!)
- Added two more 400GB SSDs to GPFS to provide even more inodes (150M)
- GPFS is going to be decommissioned soon.
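The inode cleanup mentioned above is done with GPFS policy rules applied by mmapplypolicy. A minimal sketch of such a deletion policy, where the fileset path, age threshold and device name are assumptions and not the actual CSCS configuration:

```shell
# Hypothetical GPFS cleanup policy; path, age and device name are
# assumptions, not the actual CSCS setup.
cat > /tmp/purge.pol <<'EOF'
RULE 'purge_scratch' DELETE
  WHERE PATH_NAME LIKE '/gpfs/scratch/%'
    AND (CURRENT_TIMESTAMP - MODIFICATION_TIME) > INTERVAL '7' DAYS
EOF
# Dry-run first: "-I test" reports what would be deleted without deleting
mmapplypolicy gpfs0 -P /tmp/purge.pol -I test
```

As noted above, policy scans of this kind could not keep up with the failure-driven inode growth, hence the additional SSD metadata capacity.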
- Swiss users storage
- Ready for ATLAS and LHCb (ATLASLOCALGROUPDISK = 160TB, LHCB-DISK = 290TB)
- Needs to be tested by CMS (CHCMS = 150TB)
- Swiss users compute
- To be ready at the next maintenance
- Next maintenance
- Downtime set in GOCDB for the whole of CSCS-LCG2 on 15.10.2014 (https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=15651)
- Changes to be applied:
- Update firmware on IB switches
- Set MTU to 1500 on all systems (all IB cards)
- Deploy GPFS2 to all nodes (same gpfs cluster instead of remote cluster)
- Reconfigure all grid nodes to use GPFS2 and enable CHCMS VOMS
- Deploy ARGUS servers
- Removal of /experiment_software on all WNs, as it is no longer used
- Update SLURM configuration: increase priority of atlaschXX, cmschXX and lhcbchXX
- Remove non-existing/decommissioned nodes from the SLURM config
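The two SLURM changes above could look roughly as follows. This is a sketch only: whether atlaschXX/cmschXX/lhcbchXX are users or accounts, and all names and values, are assumptions:

```shell
# Hypothetical sketch -- names and values are assumptions, not the
# actual CSCS configuration.
# Raise the fairshare (and thus scheduling priority) of a Swiss LHC user:
sacctmgr modify user name=atlasch01 set fairshare=10
# Decommissioned nodes are dropped from the NodeName=/PartitionName=
# lines in slurm.conf, after which the daemons re-read the config:
scontrol reconfigure
```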
PSI
- Maintenances
- dCache updated to 2.6.33
- FW on 120 Seagate disks updated from MS01 to MS04 on an E5400
- SL5 and SL6 bash updated, twice.
- Puppet upgraded from v.2 to puppet-3.5.1-0.1rc1.el6.noarch
- Using Puppet stdlib
- Grid tools
- Next storage: we have to replace 9 dCache fileservers and 2 NFS fileservers
- About the 9 dCache servers: my current guess is we'll buy a server plus a 60*4TB E5500 ~= 180TB net; CSCS also uses the E5500.
- About the 2 NFS fileservers: 3 options
- In this context I used tcptrack to easily see our dcap bandwidth usage; those figures feed into capacity planning.
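A back-of-the-envelope check of the "60*4TB E5500 ~= 180TB net" figure. The RAID layout is an assumption (RAID6 groups of 8 data + 2 parity drives); the minutes do not state it:

```shell
# Hypothetical capacity estimate; the RAID6 8+2 group layout is an
# assumption, not the actual E5500 configuration.
DRIVES=60; SIZE_TB=4; GROUP=10; PARITY=2
DATA_TB=$(( DRIVES / GROUP * (GROUP - PARITY) * SIZE_TB ))
echo "${DATA_TB} TB of data capacity"   # 192 TB before filesystem overhead
```

192 TB of raw data capacity under this layout is consistent with roughly 180TB net once filesystem overhead and hot spares are taken into account.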
- Clouds
UNIBE-ID
- Infiniband Network with fat tree topology in production
- Hardware installed and tested
- Performance measurements done and OK
- IPoIB setup done
- Lots of disk crashes lately, but no outages
- Currently working on migration from RHEL-6 to CentOS-6
- future config management with Puppet; the testbed has been working excellently so far
- new compute nodes ordered to retire the old ones => they serve as a testing environment for the new CentOS-based setup
UNIBE-LHEP
- Operations
- Smooth routine operations with minor issues:
- a-rex crashed three times on ce01 (it used to happen on ce02)
- nodes on ce01 (phase-C Sun Blades) tend to crash. Jobs in the "dr" state in GE must be cleaned up manually, then the node re-installed. Not tragic, yet tedious. Memory starvation is suspected: these nodes have 24GB RAM, 1GB swap and run 16 threads each. The thread count could be reduced, but then 2x 8-core jobs won't fit on one node. Will check whether a RAM upgrade makes sense (4GB DDR3 SDRAM 666, Hynix Semiconductor Inc.)
- Changed the IP address on one DPM pool node. It took a few days for operation to stabilise.
- Memory failure on one ARECA controller on a DPM pool node. After the replacement, a kernel panic at boot required a re-install. This caused further (partial) SE failures for ATLAS, but FTS transfers resumed promptly once the service was back online
- ATLAS specific operations
- New Lustre deployment for the ce01 cluster
- Dalco servers with 5x LSI controllers each. Want RAID1 for the OS HDDs and an mdadm JBOD for the remaining HDDs
- Issue: hangs at re-boot after install. Fine with the pre-installed CentOS and on re-install at the Dalco lab. Hangs after installing ROCKS+lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64, but also hangs with a vanilla SLC6 kernel at LHEP (with and without ROCKS)
- All LSI controllers to which the OS HDDs are attached were flashed to allow hardware RAID1. Then re-installed with ROCKS+lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64: 1 boots, 3 hang, 2 not done yet
- Is the ROCKS pxelinux too old? But it also hangs when booting from the BIOS and selecting the LSI logical volume
- Ongoing... will try to install + boot from vanilla SLC6
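The "mdadm JBOD for the remaining HDDs" could be set up roughly like this. Device names and member count are assumptions, and the minutes do not say whether the disks are concatenated or kept as individual targets:

```shell
# Hypothetical: concatenate the non-OS disks into one linear md device.
# Device names and count are assumptions, not the actual Dalco layout.
mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/sd[b-e]
cat /proc/mdstat   # verify the array assembled
```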
UNIGE
- The 2014 upgrade is finished
- the two oldest disk servers retired (X4500, 2006 model)
- the ‘user’ and ‘software’ space migrated to two new servers (Sun, Solaris)
- other data migrated to another new disk server (IBM, Linux)
- two new machines for running four VMs that have critical services
- Two astroparticle space experiments using the cluster
- the DAMPE group starting
- the AMS group will invest in hardware to have more disk space
- Shellshock emergency...
- ATLAS production (Andrej Filipicic) is trying multi-core jobs
NGI_CH
- ARGUS status and support
- NGI_CH instance: https://ggus.eu/index.php?mode=ticket_info&ticket_id=99533
- Support:
- SWITCH bailed out
- PEP client and server: no future support
- PAP: INFN
- PDP: no future support
- ARGUS EES: NIKHEF
- LCMAPS plugin: NIKHEF
- New request to NGIs to rescue unsupported components
- No alternatives/plan B
- Status of deployment and plans
- Most NGIs run their National service
- End Oct 2014: monitoring framework ready (national instances) - Nagios probe, list of instances to monitor (GOCDB query)
- Nov 2014: pilot testing with 4-5 sites (possibly diversified mw), refine documentation
- Test if ban information is available at the sites services: CE/SE/WMS (action on EGI-CSIRT)
- End Nov 2014 to end Mar 2015 (?): wide deployment
- Beyond (?): sites monitored for this feature
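The "list of instances to monitor (GOCDB query)" presumably refers to the GOCDB programmatic interface. A hedged example, where the service_type value is an assumption:

```shell
# Hypothetical GOCDB PI query listing the ARGUS endpoints to monitor;
# the service_type string is an assumption
curl -s 'https://goc.egi.eu/gocdbpi/public/?method=get_service_endpoint&service_type=emi.argus'
```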
- New VOMS server configuration for ops/LHC VOs
Other topics
Next meeting date:
A.O.B.
Attendants
- CSCS: George Brown, Miguel Gila, Gianni Ricciardi
- CMS: Fabio Martinelli
- ATLAS: Gianfranco Sciacca, Szymon Gadomski
- LHCb: Roland Bernet
- EGI: Gianfranco Sciacca
Action items
Topic revision: r16 - 2014-10-02 - RolandBernet