
Swiss Grid Operations Meeting on 2015-05-07

Site status

CSCS

  • Unscheduled downtime 15.04.15 and 16.04.15
    • In the morning of April 15, we found GPFS was blocked and was spitting out messages such as:
      [...]
      Tue Apr 14 20:22:52.964 2015: Recovered 1 nodes for file system phoenix_scratch.
      Tue Apr 14 20:29:13.455 2015: Accepted and connected to 148.187.65.62 wn62 <c0n14>
      *** glibc detected *** /usr/lpp/mmfs/bin//mmfsd: invalid fastbin entry (free): 0x00007fbf202829b0 ***
      ======= Backtrace: =========
      /lib64/libc.so.6(+0x76166)[0x7fbf4fc59166]
      /usr/lpp/mmfs/bin//mmfsd(_ZN10MsgDataBuf8freeDataEv+0x18)[0x90e5b8]
      /usr/lpp/mmfs/bin//mmfsd(_ZN10MsgDataBufD1Ev+0x9)[0x910469]
      /usr/lpp/mmfs/bin//mmfsd(_ZN7TcpConn9deleteMsgEP6RcvMsg+0x4c)[0x918cac]
      /usr/lpp/mmfs/bin//mmfsd(_ZN10NsdRequest14processRequestEP9NsdBufferP8NsdQueue+0x385)[0x10f0d65]
      /usr/lpp/mmfs/bin//mmfsd[0x10f17ba]
      /usr/lpp/mmfs/bin//mmfsd(_ZN6Thread8callBodyEPS_+0x66)[0x5a4676]
      /usr/lpp/mmfs/bin//mmfsd(_ZN6Thread15callBodyWrapperEPS_+0x79)[0x5963f9]
      /lib64/libpthread.so.0(+0x79d1)[0x7fbf5070c9d1]
      /lib64/libc.so.6(clone+0x6d)[0x7fbf4fccbb6d]
      ======= Memory map: ========
      00400000-0134c000 r-xp 00000000 fd:01 2757550                            /usr/lpp/mmfs/bin/mmfsd
      0144b000-01490000 rwxp 00f4b000 fd:01 2757550                            /usr/lpp/mmfs/bin/mmfsd
      01490000-014f7000 rwxp 00000000 00:00 0
      0237d000-025a2000 rwxp 00000000 00:00 0                                  [heap]
    • Both metadata servers were affected, with a difference of about 8 hours. This also caused one of the metadata servers to go out of sync (its SSD disks were expelled from GPFS).
    • Upon discovery, we announced the problem to the CHIPP community, declared an official downtime and contacted IBM about the issue. In parallel, we manually re-added the SSDs that hold metadata on the server out of sync and started the sync process.
    • After about 5 hours, IBM labs said that they could not reproduce the problem and suggested upgrading to a newer version.
    • Once the sync finished, following IBM's advice, we rebooted all GPFS servers and aligned the GPFS package version on the servers with the rest of CSCS (3.5.0-21).
    • Unfortunately, at this point we realised that the number of inodes was very close to the maximum. This is most likely because, at the time GPFS blocked, the cluster was full and some ~80 million files belonging to the failed jobs were stuck in the filesystem (if the filesystem is not accessible, jobs cannot delete their output). The cleanup process took many hours (until approx. 23:30).
    • When the cleanup finished, we regenerated the scratch structure and slowly brought the system back to life. The structure we recreated was incomplete and it took us a while to make sure all permissions were correct.
    • Actions taken as a result of this downtime:
      1. The procedure to recreate the filesystem has been improved.
      2. We've further increased the frequency of the GPFS cleaning policy to run twice a day, removing everything older than the length of the longest job (an illustrative policy sketch follows this list).
      3. Currently evaluating other possible configurations, such as dividing scratch into filesets (the GPFS term for a directory tree with its own quota). This increases complexity, but might help to mitigate problems in the event of issues with the filesystem (we could recreate filesets in a rolling fashion: first VO1, then VO2 and ultimately VO3).
      4. In addition to upgrading GPFS, we tuned its configuration to avoid the massive swapping we found on the nodes that 'only' have 64GB RAM. GPFS will now never take more than 2GB of physical memory on any given WN.
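    • Illustrative sketch of items 2 and 4 above (the day threshold, policy text and node-class name are assumptions, not the actual CSCS values; only the 2GB memory cap comes from the minutes):
      # /tmp/scratch_purge.pol -- delete anything not accessed for longer than the longest job (5 days here is a placeholder)
      RULE 'purge_old_scratch' DELETE
        WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 5
      # run twice a day from cron: dry-run with -I test first, then apply
      mmapplypolicy phoenix_scratch -P /tmp/scratch_purge.pol -I test
      mmapplypolicy phoenix_scratch -P /tmp/scratch_purge.pol -I yes
      # cap GPFS memory on the WNs (pagepool is the usual knob; the 'worker_nodes' node class is made up)
      mmchconfig pagepool=2G -i -N worker_nodes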
  • ARC update
    • arc01 re-installed with nordugrid-arc 5.0.0-2
    • currently configuring and testing
  • Other operations
    • Currently investigating the feasibility of implementing cgroups in order to contain jobs. This would require a major SLURM upgrade. The following presentation given at HEPiX on this matter is very interesting. (A minimal configuration sketch follows at the end of this list.)
    • Currently testing dCache upgrade as dCache 2.6 will be out of support soon.
    • Work to port Phoenix to Puppet is still ongoing.
    • Installed pakiti on all WNs
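    • For reference, a minimal sketch of the kind of SLURM cgroup configuration being evaluated (illustrative values, not a tested Phoenix configuration):
      # slurm.conf -- track and contain jobs with cgroups
      ProctrackType=proctrack/cgroup
      TaskPlugin=task/cgroup
      # cgroup.conf -- confine each job to its allocated cores and memory
      CgroupAutomount=yes
      ConstrainCores=yes
      ConstrainRAMSpace=yes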
  • Next maintenance on May 20th, 2015 between 8:00 and 20:00
    • Due to work on the cooling systems and a critical maintenance on the NAS infrastructure, Phoenix is forced to go into maintenance.
    • During the downtime, IBM will upgrade firmware on our DCS3700 storage controllers.
    • At this point no other operations are planned.

PSI

  • Studying the Control Groups in Son of Grid Engine
    • The 5' article about Control Groups is to be read; the Full Control Groups RHEL6 Reference can be skipped the first time.
    • One of the recurring issues with the old SGE 6.2u5 running at the PSI T3 is that users are free to consume more CPU cores than those assigned by the batch system. To definitively fix this issue I'm going to upgrade our SGE from 6.2u5 to Son of Grid Engine 8.1.8 because of its support for Control Groups; some details about this integration are here. There are already ATLAS sites using Control Groups in HTCondor; SLURM also supports Control Groups.
    • Just for Gianfranco: you need to apply these configurations to make Control Groups / cpusets work in Son of Grid Engine (it cost me 1 day of attempts!); a quick verification check is sketched after the list:
      1. [wn] cat /etc/sysconfig/sgeexecd
        export SGE_CGROUP_DIR=/dev/cpuset/sge
      2. [wn] grep -Hn setup-cgroups-etc /etc/init.d/sgeexecd.p6444
        /etc/init.d/sgeexecd.p6444:441: /opt/sge/util/resources/scripts/setup-cgroups-etc start
      3. [wn] qconf -sconf | grep -Hn CGR
        (standard input):28:execd_params USE_SMAPS=true KEEP_ACTIVE=true USE_CGROUPS=true ENABLE_BINDING=true \
      4. [submission_host] grep -v \# /opt/sge/default/common/sge_request | strings
        -binding set linear
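    • Quick verification sketch (the per-job directory name under SGE_CGROUP_DIR is an assumption; adjust to whatever SoGE actually creates on the WN):
      [wn] ls /dev/cpuset/sge/
      [wn] cat /dev/cpuset/sge/<job_dir>/cpuset.cpus        # cores granted to the job
      [wn] grep Cpus_allowed_list /proc/<job_pid>/status    # the affinity the kernel actually enforces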
  • Upgraded the PSI PhEDEx from SL5 to SL6
    • Lots of issues here due to the poor QA of the latest SRM client, 2.10.7 (rpm)
    • The implicit X509 proxy delegation required to copy files between 2 remote SRM endpoints (e.g. from CSCS to PSI) doesn't work if one uses the -copyjobfile option, as PhEDEx does; the dCache team acknowledged this bug (an illustrative invocation is shown at the end of this list)
    • Also this is a bug :
      $ srmls -debug=false -x509_user_proxy=/home/phedex/gridcert/proxy.cert -retry_num=0 'srm://t3se01.psi.ch:8443/srm/managerv2?SFN=/pnfs/psi.ch/cms/trivcat/store/mc/RunIIWinter15GS/RSGravToWWToLNQQ_kMpl01_M-4000_TuneCUETP8M1_13TeV-pythia8/GEN-SIM/MCRUN2_71_V1-v1/10000/2898A22B-62B0-E411-B1D4-002590D600EE.root'
      srm client error: java.lang.IllegalArgumentException: Multiple entries with same key: x509_user_proxy=/home/phedex/gridcert/proxy.cert and x509_user_proxy=/tmp/x509up_u205
    • Same here :
      $ srm-advisory-delete -x509_user_proxy=${X509_USER_PROXY} -retry_num=0
      srm client error: java.lang.IllegalArgumentException: Multiple entries with same key: x509_user_proxy=-retry_num=0 and x509_user_proxy=/tmp/x509up_u205
    • Eventually I tweaked the PhEDEx scripts to bypass all these bugs. Just for Daniel: my FileDownloadDelete and FileDownloadSRMVerify corrections
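    • For illustration, the kind of third-party copy PhEDEx drives through -copyjobfile (hostnames and SURLs below are placeholders, not real paths); it is this form that loses the implicit proxy delegation:
      $ cat copyjob.txt
      srm://<cscs_se>:8443/srm/managerv2?SFN=/pnfs/<source_path> srm://t3se01.psi.ch:8443/srm/managerv2?SFN=/pnfs/<destination_path>
      $ srmcp -debug=false -retry_num=0 -copyjobfile=copyjob.txt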
  • Power cut at PSI
    • My 1st power cut since 2011; 4 file servers rebooted, T3 users in panic mode; luckily fixed without data loss
  • CMS Space Monitoring Project for CSCS ; Daniel is going to follow this task

UNIBE-LHEP

  • Operations
    • Mainly stable operation on both clusters (yet at about half capacity)
    • Local network sick on ce01 (22nd April): power-cycled
    • 3 a-rex crashes caught by the cron
  • ATLAS specific operations
    • As last month, mostly multi-core MC production+Reconstruction, mostly smooth. Until:
    • Following the ARC upgrade to 5.0.0-2.el6: /var filled up (grid-manager.log) and caused the system to jam (ce01 only). Laborious clean-up needed
    • Following clean-up: a-rex crashes at start. "Insider" tip: rm gm.fifo in controldir fixed it
    • ARC 5 introduces handling of user-requested job priority. The ARC range [0:100] is mapped in ARC 5 to [-1023:1024], but in Grid Engine the range allowed for user control is [-1023:0]. Needed to hack submit-sge-script (a sketch of the clamping follows this list)
    • Yesterday, ce01 dropped out of the GIIS (all services reported themselves as running): we needed to restart the infosys and a-rex. Update: it has just happened again
    • Removal of the voms.cern.ch alias to voms2.cern.ch caused file transfers to the SRM to fail. Apparently an obscure misconfiguration detail caused the authentication problem [ https://ggus.eu/index.php?mode=ticket_info&ticket_id=113485 ] (ops tests still failing on the SE, not clear why)
    • HammerCloud gangarobot: http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UNIBE-LHEP&startTime=2015-04-02&endTime=2015-05-07&templateType=isGolden
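    • Hedged sketch of the kind of clamping added to submit-sge-script (variable names are illustrative, the actual ARC code may differ); ARC maps its priority [0:100] onto [-1023:1024], but ordinary Grid Engine users may only lower priority, so anything above 0 must be capped:
      if [ -n "$joboption_priority" ]; then
        priority=$(( joboption_priority * 2047 / 100 - 1023 ))   # [0:100] -> [-1023:1024]
        [ "$priority" -gt 0 ] && priority=0                      # user-settable SGE range is [-1023:0]
        echo "#$ -p $priority" >> "$LRMS_JOB_SCRIPT"
      fi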
  • Plans
    • Restore full capacity:
      • 320 (old) cores from UNIBE-ID
      • 144 (old) cores from CSCS
      • 128 (new) cores awaiting installation (E5-2650 v2 2.6GHz)
      • 512 (new) cores to be procured
    • The plan is to re-deploy ce01 with ROCKS 6.2 (coming out any day now). At the moment, with the current deployment of ROCKS 6.1, installation of new hardware works but the kernel freezes at reboot
    • Testing install procedure with ce04 and ROCKS 6.1.1 for now (temporary deployment)
    • Deploy 2 additional UIs (to be procured)

UNIBE-ID

  • Procurement:
    • Tender regarding new storage ended at the beginning of this week => order placed for:
      • IBM ESS GL4 with 4TB NL-SAS disks:
        • Total capacity: 928TB
        • Interconnect: IB
        • Filesystem: GPFS 4.1 STD + GPFS Native RAID
  • Operations:
    • Stable operations most of the time
    • Small glitch after migrating to new OpenLDAP installations
      • pam_ldap doesn't close connections properly => max open files violations on the LDAP server; olcIdleTimeout has now been set to 60, which solved the problem (a minimal LDIF sketch is shown below)
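      • A minimal LDIF sketch of the fix (applied to cn=config, e.g. with ldapmodify -Y EXTERNAL -H ldapi:///; the exact access method on our servers may differ):
        dn: cn=config
        changetype: modify
        replace: olcIdleTimeout
        olcIdleTimeout: 60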
    • Upgraded Nordugrid ARC CE to 5.0.0
    • Still working on moving from the Satellite-based RHEL setup to a Foreman-based CentOS setup

UNIGE

  • The next upgrade
    • Our upgrade plans for 2015 were approved as proposed for the 50% co-funding scheme of the Uni.
    • One order is out for the upgrade
      • 3 x Lenovo x3630 M4 with 6 TB disks (63 TB net in a 2U machine; 1 for neutrino, 2 for ATLAS)
      • 2 x Lenovo x3550 M4 as hosts for running virtual machines
    • We also foresee upgrading the network to 10 Gbps for the 7 disk servers doing NFS
      • no quotes yet, only price estimates
  • Cleanup of the SE UNIGE-DPNC_LOCALGROUPDISK, 90 TB free (21%)
    • With Run 2 starting, we will likely need another round quickly
  • This is my last meeting. Good bye to all of you!

NGI_CH

Other topics

  • Next meeting date: The next meeting should be on June 4, but Gianfranco and Miguel will be away (NorduGrid conference, see below). Other suggestions could be:
    • Wed, Jun 10 at 14:00
    • Thu, Jun 11 at 14:00
    • Thu, Jun 17 at 14:00
  • Training for sysadmins: The Doodle poll is set up and we are waiting to collect input before proposing final dates. Currently evaluating the possibility of doing it online via Vidyo or Scopia.

A.O.B.

  • Gianni will attend the following pre-GDB to be held at CERN on 12.05.15.
  • Miguel will attend the upcoming Annual NorduGrid Conference to be held at UNIBE between 04.06.15 and 05.06.15
  • Gianfranco + Sigve will attend the upcoming EGI Conference 2015 to be held in Lisbon between 18.05.15 and 22.05.15
  • Gianfranco will attend the upcoming Annual NorduGrid Conference to be held at UNIBE between 04.06.15 and 05.06.15

Attendants

  • CSCS: Dino Conciatore, Miguel Gila, Dario Petrusic. Apologies: Gianni Ricciardi, Nick Cardo.
  • CMS:
  • ATLAS: Gianfranco Sciacca, Szymon Gadomski
  • LHCb: Roland Bernet
  • EGI: Gianfranco Sciacca

Action items

  • Item1