Swiss Grid Operations Meeting on 2015-03-05

Date and time: First Thursday of the month, at 14:00
Place: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9305236)
External link: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
Phone gate: From Switzerland: 0227671400 (portal) + 9305236 (extension) + # (pound sign)
IRC chat: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

Swiss Grid Operations Meeting on 2015-03-05
- Site status
  - CSCS
  - PSI
  - UNIBE-LHEP
  - UNIBE-ID
  - UNIGE
  - NGI_CH
- Other topics
- A.O.B.
- Attendants
- Action items

Site status

CSCS

Three new members of CSCS joined Phoenix: two new system engineers (Dario Petrusic and Dino Conciatore) and systems group lead Nicholas Cardo. They will be actively participating in all our meetings and operations starting now.
- Already have access to the systems, including TWiki and chat.
- Working to set certificate roles in /dteam and /dteam/NGI_CH; then they will be added to GOC and Nagios@CSCS
Maintenance 10.03.14:
- dCache upgrade: security updates and dCache to 2.6.46
- GPFS config update: new maxFilesToCache setting, raising the limit from 40k to 50k files per node
- Reinstallation of as many WNs as possible with latest EMI-WN packages (plus security updates)
- Shutdown and physical removal of old hardware (puppet, nfs0[1-2], se[01-06], 3x 1/2 racks of IBM DC3500 storage)
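For reference, the maxFilesToCache change in the list above corresponds to commands like the ones printed by this sketch. It only prints them and does not touch GPFS; the helper name and the node class `phoenix_wn` are made up for illustration. A new maxFilesToCache value normally takes effect only after the GPFS daemon is restarted on the affected nodes, which fits a maintenance window.

```shell
#!/bin/sh
# Sketch only: prints the GPFS commands for raising maxFilesToCache, does not run them.
# The node class name passed below ("phoenix_wn") is hypothetical.
gpfs_mftc_plan() {
    new_limit=$1; node_class=$2
    # Show the current value first, then change it for the given node class.
    echo "mmlsconfig maxFilesToCache"
    echo "mmchconfig maxFilesToCache=$new_limit -N $node_class"
}

gpfs_mftc_plan 50000 phoenix_wn
```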
A.O.B.
- Migration to Puppet 3.6 ongoing; new roles created, but more work is needed. Managed to migrate cfengine from ageing hardware (>1000 days of uptime!) to a new VMware VM.
- At some point before summer, we will need to upgrade GPFS (v. 4) and dCache (v. 2.10).
- (Gianfranco) ATLAS lcgadmin and pilot roles to be enabled/fixed on ARC CEs

PSI

A major NetApp E5400 error, "A drawer in the tray has become degraded", led to the loss of one of the two redundant paths to 12 x 3 TB disks. It was a firmware bug, solved by updating the NetApp E5400 firmware ONLINE to 7.86.49.00. On the Linux side, the RDAC driver gracefully moved the paths from one RAID controller to the other, and back, while the controllers rebooted. I didn't unmount the XFS filesystems or stop dCache. Nothing you can get from the NAS world. CSCS should update the firmware as well.
Again on the same NetApp E5400, two 3 TB disks broke.
In both cases I get the native NetApp e-mails, routed through iptables NAT, and also a Nagios e-mail.
Preparing the dCache 2.6 to 2.10 migration; in my case this will also mean upgrading PostgreSQL from 9.3 to 9.4, also because of this news. Luckily my Chimera materialized views still work out of the box, but there are some new table fields that I should include in the future.

Using Puppet standalone over /afs because it's 10 times faster than having a Puppet master, and the clients don't crash. Each SL6 server in my cluster mounts /afs, and there is a /afs dir where both my Puppet recipes and the conf files are stored. This /afs dir and its descendants are protected by AFS ACLs; only the root account on the SL6 server can access my /afs dir, by using a Kerberos keytab file. Example:

# ll /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/ 
ls: cannot access /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/: Permission denied 

# kinit -k -t /root/afs-keytabs/svcusr-t3_puppet.keytab svcusr-t3_puppet@D.PSI.CH && aklog  

# ll /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/ 
total 4 
lrwxr-xr-x 1 martinelli_f cms 22 Jan 15 16:14 manifests -> puppet/TRUNK/manifests 
lrwxr-xr-x 1 martinelli_f cms 20 Jan 15 16:14 modules -> puppet/TRUNK/modules 
drwxr-xr-x 4 martinelli_f cms 2048 Jan 15 15:48 puppet
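
A masterless run on top of this setup could be wrapped roughly as follows. This is a sketch, not the script used at PSI: the function names, the DRY_RUN switch and the use of manifests/site.pp as entry point are assumptions; only the kinit/aklog step and the paths in the example comment come from the transcript above.

```shell
#!/bin/sh
# Sketch of a masterless Puppet run from AFS (assumed layout; not the PSI script).
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: puppet_afs_apply ENV_DIR KEYTAB PRINCIPAL
puppet_afs_apply() {
    env_dir=$1; keytab=$2; principal=$3

    # Get a Kerberos ticket from the keytab, then an AFS token.
    run kinit -k -t "$keytab" "$principal" || return 1
    run aklog || return 1

    # Apply the manifests straight from AFS; site.pp as entry point is an assumption.
    run puppet apply --modulepath="$env_dir/modules" "$env_dir/manifests/site.pp"
}

# Example (paths and principal as in the transcript):
#   puppet_afs_apply /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment \
#       /root/afs-keytabs/svcusr-t3_puppet.keytab svcusr-t3_puppet@D.PSI.CH
```

Running it once with DRY_RUN=1 shows the three commands without touching Kerberos, AFS or Puppet.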

Many other tasks, but specific to PSI or CMS

UNIBE-LHEP

Operations
- Slow recovery of the ce01 cluster following the kernel+glibc security updates of January.
  - straightforward RPM upgrades would not work, needed to re-image the WNs and re-install
  - issue with the OpenIB modules freezing at shutdown. This implies power-cycling every node whenever a re-boot is needed (or re-installation)
  - turned out our IB stack (not updated for ~2 years) had an outdated setup
  - however: after a general update of the setup, the rdma modules are still not unloaded cleanly after starting up (even if Lustre is not started)
  - cooked up a shutdown script that (theoretically) unloads everything cleverly before running into the system freeze/crash
  - permanent solution is, I suppose, to re-image the WM from scratch. However, this implies re-building ROCKS (and the CE) from scratch
- ce02 cluster needed power-cycling for the ethernet switches on 30th Jan, stable thereafter but almost halved in capacity
- Added cron jobs on both CEs to recover a-rex after crashing. Logging crashes, typically twice a month
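A cron job of this kind could look roughly like the sketch below. It is not the script running on the CEs: the `ensure_running` helper, the log path and the `service a-rex ...` commands in the example comment are assumptions. The check and restart commands are passed in as arguments so they can be adapted to the local init system.

```shell
#!/bin/sh
# Minimal watchdog sketch (hypothetical helper, not the script used on the CEs).
# usage: ensure_running NAME CHECK_CMD RESTART_CMD LOGFILE
#   CHECK_CMD   exits 0 when the service is healthy
#   RESTART_CMD is run only when CHECK_CMD fails
ensure_running() {
    name=$1; check_cmd=$2; restart_cmd=$3; logfile=$4

    # Healthy service: nothing to do, nothing to log.
    if sh -c "$check_cmd" >/dev/null 2>&1; then
        return 0
    fi

    # Record the crash so the number of crashes per month can be read from the log.
    echo "$(date '+%Y-%m-%d %H:%M:%S') $name down, restarting" >>"$logfile"

    if sh -c "$restart_cmd" >/dev/null 2>&1; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') $name restarted" >>"$logfile"
    else
        echo "$(date '+%Y-%m-%d %H:%M:%S') $name restart FAILED" >>"$logfile"
        return 1
    fi
}

# Example call from a script run by cron every few minutes (assumed commands and path):
#   ensure_running a-rex "service a-rex status" "service a-rex restart" /var/log/a-rex-watchdog.log
```

Logging only on failure keeps the log short, so each "down, restarting" line corresponds to one crash.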
ATLAS specific operations
- ATLAS still pretty quiet, picking up now
- Revived the WebDAV access to the SE (ATLAS request)
- Monitoring:
  1. SAM Nagios ATLAS_CRITICAL: http://wlcg-sam-atlas.cern.ch/templates/ember/#/plot?flavours=SRMv2%2CCREAM-CE%2CARC-CE&group=All%20sites&metrics=org.sam.CONDOR-JobSubmit%20%28%2Fatlas%2FRole_lcgadmin%29%2Corg.atlas.WN-swspace%20%28%2Fatlas%2FRole_pilot%29%2Corg.atlas.WN-swspace%20%28%2Fatlas%2FRole_lcgadmin%29%2Corg.atlas.SRM-VOPut%20%28%2Fatlas%2FRole_production%29%2Corg.atlas.SRM-VOGet%20%28%2Fatlas%2FRole_production%29%2Corg.atlas.SRM-VODel%20%28%2Fatlas%2FRole_production%29&profile=ATLAS_CRITICAL&sites=CSCS-LCG2%2CUNIBE-LHEP%2CUNIGE-DPNC&status=MISSING%2CUNKNOWN%2CCRITICAL%2CWARNING%2COK
  2. HammerCloud gangarobot: http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UNIBE-LHEP&startTime=2015-02-01&endTime=2015-03-04&templateType=isGolden

UNIBE-ID

Procurement
- Another 16 Dalco compute nodes are installed, set up and running smoothly; ingredients:
  - 2x 8C Intel Xeon E5-2650v2 2.6GHz
  - 128GB 1866MHz DDR ECC REG (8*16GB)
  - 1x 1TB 7.2k rpm SATA 6.0Gb/s
  - 2x Gigabit-Ethernet onboard
  - Infiniband ConnectX-3 QDR HCA
- Prepared a tender to buy replacement storage
  - IBM GSS24 with 3TB disks => 696 TB total capacity; ~510 TB usable capacity
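The two capacity figures imply a disk count and a usable fraction, with the difference presumably going to redundancy and spare space. A quick check using only the numbers quoted above (the helper names are made up):

```shell
#!/bin/sh
# Capacity figures quoted in the tender item.
raw_tb=696; usable_tb=510; disk_tb=3

# Number of disks = raw capacity / disk size.
disk_count() { echo $(( $1 / $2 )); }
# Usable share of raw capacity, integer percent rounded down.
usable_percent() { echo $(( $2 * 100 / $1 )); }

echo "disks:  $(disk_count "$raw_tb" "$disk_tb")"
echo "usable: $(usable_percent "$raw_tb" "$usable_tb")% of raw"
```

This gives 232 disks and roughly 73% usable capacity.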
Decommissioning
- 23 Sun X2200 Pizza boxes shut down and dumped
- 25 remaining and marked to be dumped within the next two months
Operations
- smooth and reliable, except...
- ... nordugrid-arc-bdii dead for almost a week while I was on holiday => bad performance value in monthly report
  - same happened in January and at the beginning of this week
  - now installed a cron-based guardian like the one we already have for a-rex (which btw was very stable over the last few months)
AOB:
- (Gianfranco) ATLAS pilot role to be enabled/fixed on the ARC CE

UNIGE

AOB:
- (Gianfranco) ATLAS request to enable multicore jobs (instructions for Grid Engine were sent by mistake, but Geneva runs Torque)

NGI_CH

January 2015 - RP/RC OLA performance: http://snf-631462.vm.okeanos.grnet.gr:8080/lavoisier/site_reports?ngi=NGI_CH
- UNIBE-ID low (understood): https://ggus.eu/index.php?mode=ticket_info&ticket_id=111896
Multicore accounting for EGI:
- http://accounting-devel.egi.eu/show.php?ExecutingSite=CSCS-LCG2&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
- http://accounting-devel.egi.eu/show.php?ExecutingSite=UNIBE-LHEP&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN
- http://accounting-devel.egi.eu/show.php?ExecutingSite=UNIBE-ID&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
Pakiti made easy https://pakiti.egi.eu/client.php?site=UNIBE-LHEP (simple cron job on all WNs - requires access to the CAs)
- Site Security Officers can check their own site: https://pakiti.egi.eu/
Issues with Certificates in CH following SWITCH withdrawal from the service as of 31st Aug 2015
- CERN not an option for non-users and for servers not on the CERN network
- TERENA CS (flat fee 27k) would deal only with NRENs (i.e. SWITCH)
- Exploring possible solutions (EGI catch-all CA?)

A.O.B.

Attendants

CSCS: Gianni Ricciardi, Dino Conciatore, Dario Petrusic, Miguel Gila
CMS: Fabio Martinelli, Daniel Meister
ATLAS: Gianfranco Sciacca
UNIBE-ID: Michael Rolli
LHCb: Roland Bernet
EGI:

Action items

Item1

Topic revision: r14 - 2015-06-09 - FabioMartinelli