Swiss Grid Operations Meeting on 2014-12-04
Site status
CSCS
- Maintenance of December 3 went smoothly: CSCS is now connected via a 100G link to SWITCH (Phoenix still at 20G, though)
- ARC monitored on NGI Nagios: WebServices configuration issues (for now enabled only on arc01.lcg.cscs.ch)
- perfSONAR: a couple of old WNs chosen as HW replacements for the old instances
- Reminder: Next F2F meeting on January 29 2015 at CSCS
PSI
- Using the Puppet 3 source_permissions feature to copy files and dirs without specifying owner, group or mode: it behaves like rsync in that respect. I wasn't aware of it.
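- A minimal sketch of how this might look (module name and paths are made up; source_permissions => use tells Puppet to take owner/group/mode from the source):
file { '/opt/tools':
  ensure             => directory,
  recurse            => true,
  source             => 'puppet:///modules/tools/opt_tools',
  source_permissions => use,  # copy owner/group/mode from the source, rsync-style
}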
- Using the SaltStack batch mode feature to run a command on groups of filtered servers:
- To appreciate this I assume you're used to older tools like cexec or pdsh
- Those tools require you to write a static configuration file where you define your cluster(s); these definitions can only use hostnames, as in the sketch below.
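- For comparison, a typical pdsh one-liner against a statically named host range (the range is illustrative):
pdsh -w 't3wn[01-16]' uptime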
- In SaltStack each client (minion) constantly publishes its live info (grains). The core grains are:
SSDs biosreleasedate biosversion cpu_flags cpu_model cpuarch domain fqdn fqdn_ip4 fqdn_ip6 gpus host hwaddr_interfaces id ip4_interfaces ip6_interfaces ip_interfaces ipv4 ipv6 kernel kernelrelease locale_info localhost machine_id manufacturer master mem_total nodename num_cpus num_gpus os os_family osarch oscodename osfinger osfullname osmajorrelease osrelease osrelease_info path productname ps pythonexecutable pythonpath pythonversion saltpath saltversion saltversioninfo selinux serialnumber server_id shell virtual zmqversion
- but you can also define your own grains (prod, dev, webserver, db, rackposition, etc.), as sketched below.
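- A minimal sketch of static custom grains on a minion (keys and values are hypothetical):
# /etc/salt/grains -- static custom grains (YAML), read by the minion at start-up
environment: prod
roles:
  - webserver
rackposition: B2-07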
- By leveraging the grain values you can dynamically filter the minions, split them into groups (of a fixed size or a percentage), and run a command on these groups in sequence.
- Running in small groups is useful when a 3rd-party service (ftp, http, puppet, rsync, NFS, ...) is involved and you don't want to open tens of connections against it.
- My most recurring case is puppet.
saltmaster# salt -b 3 -C 't3wn* and G@osmajorrelease:6' cmd.run 'puppet agent -t'
- All the commands you run are recorded by SaltStack in a kind of job system, as shown below.
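- For instance, past jobs can be listed and inspected with the stock jobs runner (the JID below is made up):
saltmaster# salt-run jobs.list_jobs
saltmaster# salt-run jobs.lookup_jid 20141204120000123456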
- Another example (no groups this time):
salt -C 't3ui* and not G@kernelrelease:2.6.32-358.2.1.el6.x86_64' cmd.run 'uname -a'
- Tried xrootd v4 (http://xrootd.org); I have the impression that it requires IPv6, since I couldn't start it without an IPv6 address. Need to double-check this.
- Working together with my boss Derek to prepare the 5th PSI T3 Steering Board Meeting (UniZ/ETHZ/PSI): a lot of time spent here.
- Reading the dCache 2.6 to 2.10 upgrade guide
- Is somebody going to attend the Condor Workshop at CERN next week? I'll probably attend it remotely.
UNIBE-LHEP
- Operations
- Smooth routine operations, with minor (or quickly remedied) issues:
- 4 workers on ce01 suddenly became black holes: disabled pending investigation (no time so far).
- our main switch went nuts on 17 Nov (during morning working hours, luckily). Packets were dropped all over the place; power-cycled and recovered. No useful traces in the system log.
- a-rex crashed once more on ce02. This is a persistent issue that happens randomly on both clusters; crashing only once in a month is a positive trend.
- the home dirs server (local users) crashed due to a file system problem: it needed a hard power-cycle on site and fs repairs from single-user mode. Night-long downtime, recovered fine.
- deployed a Nagios server with basic checks. Tuning alarm thresholds and progressively adding more sophisticated checks.
- ATLAS-specific operations
UNIBE-ID
- Security incident at site CAMK [EGI-20140130]
- Some attack attempts from the IPs given in the EGI security report; no successful logins found.
- Operations
- smooth and reliable; no issues
- the 16 new DALCO compute nodes are operational => decommissioning of the old Sun Bladecenter on 2014-12-11
UNIGE
- New disk space for the AMS experiment added
- +84 TB in NFS space
- disk now: 709 TB (474 TB in the DPM SE, 235 TB on NFS)
- One incident with a full NFS file system
- a Solaris 9 disk server (a Sun X4540) blocked a few times
- it was impossible to unmount the file system or to shut the server down properly
- we had to reboot all clients and reset many of them
- this does not happen often...
- ARC front end filling up /var
- no logrotate configuration for /var/log/arc/bdii/bdii-update.log; a possible drop-in is sketched below
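- A minimal logrotate drop-in that could address this (file name and rotation policy are illustrative, not taken from the ARC docs):
# /etc/logrotate.d/arc-bdii (hypothetical)
/var/log/arc/bdii/bdii-update.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    copytruncate
}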
- Our /cvmfs over NFS getting slow again, overloaded
- no visible problem to the users, but need to watch this issue
- may need more machines for /cvmfs, we have many directories:
ls /cvmfs
ams.cern.ch atlas.cern.ch atlas-condb.cern.ch atlas-nightlies.cern.ch geant4.cern.ch icecube.wisc.edu na61.cern.ch sft.cern.ch
NGI_CH
Other topics
- Possibility of local accounts for a limited number of power users (direct batch submission) at the T2? (request from ETH CMS group)
Next meeting date:
A.O.B.
Attendants
- CSCS: Gianni Ricciardi
- CMS: Fabio Martinelli, Daniel Meister
- UNIBE-ID: Nico Färber
- ATLAS: Gianfranco Sciacca, Szymon Gadomski
- LHCb: Roland Bernet
- EGI: Gianfranco Sciacca
Action items