Swiss Grid Operations Meeting on 2015-10-15

Site status

CSCS

  • dCache
    • Security upgrade to version 2.6.52; an upgrade to the latest version is also planned for next month, Dario will perform it.
    • Dario distributed the remaining storage according to our last f2f meeting.
    • We set up a downtime for tomorrow, October 16, to attempt to resolve the dCache issues experienced this week.
  • Network
    • InfiniBand switches and bridges upgraded to the latest version.

PSI

  • dCache
    • upgraded from 2.10 to the latest 2.13.9; this was an easy upgrade, so CSCS might try the 2.6 -> 2.10 -> 2.13 path in order to avoid a downtime
    • accordingly upgraded the xrootd monitoring plugin RPM and its configuration
    • upgraded to the latest Xrootd RPMs 4.2.3*:
      cms-xrootd-dcache-1.2-7.osg.el6.noarch
      gfal2-plugin-xrootd-0.3.4-1.el6.x86_64
      xrootd-4.2.3-1.el6.x86_64
      xrootd-client-4.2.3-1.el6.x86_64
      xrootd-client-libs-4.2.3-1.el6.x86_64
      xrootd-cmstfc-1.5.1-10.osg32.el6.x86_64
      xrootd-fuse-4.2.3-1.el6.x86_64
      xrootd-libs-4.2.3-1.el6.x86_64
      xrootd-selinux-4.2.3-1.el6.noarch
      xrootd-server-4.2.3-1.el6.x86_64
      xrootd-server-libs-4.2.3-1.el6.x86_64
    • my materialized views still work 'as they are' with 2.13.9
    • Derek's tools, on the other hand, need to be updated because of the new 2.13 admin door commands (e.g. \c Cell instead of cd Cell); I've partially updated them with a bit of sed, but it's still a work in progress (a sketch of the substitution follows this section).
  • Latest Python packages on SL6
    • This is mainly addressed to the T3s because they face the end-user issues; at PSI, scientists use Anaconda on SL6 in order to easily get an updated and extended Python distribution. I run an Anaconda installation as well; it's very easy both to use and to update.
  • NFSv3
  • Nagios4
  • Compact UIs/WNs featuring many disks
  • CMS PhEDEx
  • Son of Grid Engine and cgroups
    • No progress, too busy.
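
  A minimal sketch of the admin-door syntax update mentioned under dCache above, written in Python rather than sed; the file paths and the assumption that the tools embed plain "cd Cell" strings are hypothetical, not Derek's actual scripts:

      #!/usr/bin/env python
      # Sketch: rewrite old-style dCache admin door commands 'cd <Cell>' to the
      # 2.13 syntax '\c <Cell>' inside tool scripts (paths are hypothetical).
      import re
      import sys

      # 'cd SomeCell' at the start of a line becomes '\c SomeCell'
      CD_PATTERN = re.compile(r'^(\s*)cd\s+(\S+)', re.MULTILINE)

      def rewrite(text):
          """Return text with 'cd Cell' calls rewritten to '\\c Cell'."""
          return CD_PATTERN.sub(r'\1\\c \2', text)

      if __name__ == '__main__':
          # e.g. ./update_tools.py dcache-tools/*   (hypothetical invocation)
          for path in sys.argv[1:]:
              with open(path) as f:
                  original = f.read()
              updated = rewrite(original)
              if updated != original:
                  with open(path, 'w') as f:
                      f.write(updated)
                  print('updated %s' % path)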

UNIBE-LHEP

  • Operations
    • Relatively smooth running period (partly unattended)
    • I/O errors on the Lustre MDS (ce02) due to a degraded RAID10. Power-cycled; Lustre self-recovered relatively quickly
  • ATLAS specific operations
    • Nothing specific to report
    • ATLAS has made huge progress on task definitions and on making the workflows considerably more failsafe, which makes sites' life much simpler
    • Also many ARC bugfixes have contributed
  • Ongoing work
    • Cluster re-installation workflow development finalised:
      • Rocks 6.2
      • SLC 6.7
      • ARC 5.0.3
      • SLURM 15.08.0-1
      • Lustre 2.5.3
    • Plan to re-build ce01 starting next Tuesday
    • Temperature monitoring in progress (room/racks/servers); very useful input from PSI (a minimal polling sketch follows this list)
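
  As a starting point for the server part of that monitoring, a minimal sketch that polls local IPMI sensors via ipmitool and flags high readings; the warning threshold and the parsing of "ipmitool sensor" output are assumptions about a generic setup, not the actual UNIBE-LHEP implementation:

      #!/usr/bin/env python
      # Sketch: read local IPMI temperature sensors and flag values above a
      # (hypothetical) warning threshold. Requires ipmitool on the host.
      import subprocess

      WARN_CELSIUS = 30.0  # hypothetical room/inlet warning threshold

      def read_temperatures():
          """Return {sensor_name: celsius} parsed from 'ipmitool sensor' output."""
          out = subprocess.check_output(['ipmitool', 'sensor'])
          temps = {}
          for line in out.decode('utf-8', 'replace').splitlines():
              fields = [f.strip() for f in line.split('|')]
              # ipmitool sensor columns: name | value | units | status | thresholds...
              if len(fields) >= 3 and fields[2] == 'degrees C':
                  try:
                      temps[fields[0]] = float(fields[1])
                  except ValueError:
                      pass  # unreadable sensor ('na')
          return temps

      if __name__ == '__main__':
          for name, value in sorted(read_temperatures().items()):
              flag = '  <-- above %.0f C' % WARN_CELSIUS if value > WARN_CELSIUS else ''
              print('%-24s %6.1f C%s' % (name, value, flag))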

UNIBE-ID

  • Operations
    • Smooth operations, only a few minor issues
  • Storage Migration
    • Migration plan developed to move from generic GPFS cluster to new IBM ESS
    • Online data migration has been running for three weeks (though slow due to the compute workload on the Ethernet network)
    • Planned downtime to move all nodes to the new GPFS cluster on 2015-10-08 (yes, today, therefore no attendee from UNIBE-ID)

UNIGE

  • Xxx

NGI_CH

  • Xxx

Other topics

  • Topic1
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS: Pablo Fernandez, Dino Conciatore
  • CMS: Fabio Martinelli, Daniel Meister (by phone)
  • ATLAS: Gianfranco Sciacca
  • LHCb:
  • EGI: Gianfranco Sciacca

Action items

  • Item1