<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by the internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2014-12-04

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 9305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS
   * The maintenance of December 3 went smoothly: CSCS is now connected to SWITCH via a 100G link (Phoenix is still at 20G, though).
   * ARC is monitored on NGI Nagios: WebServices configuration issues (for now enabled only on arc01.lcg.cscs.ch).
   * perfSONAR: a couple of old WNs have been chosen as hardware replacements for the old instances.
   * Reminder: next F2F meeting on January 29 2015 at CSCS.

---+++ PSI
   * Using the [[https://docs.puppetlabs.com/references/latest/type.html#file-attribute-source_permissions][Puppet 3 source_permissions]] feature to copy files and directories without specifying owner, group or mode; it behaves like =rsync= in that respect. I wasn't aware of it (see the sketch below).
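A minimal sketch of such a =file= resource, assuming a hypothetical module =mytool=; only the =source_permissions= parameter itself comes from the report above:

<verbatim>
# Copy a directory tree from the Puppet file server, taking owner, group
# and mode from the source files (rsync-like) instead of declaring them
# in the manifest. 'use' is the default value in Puppet 3.
file { '/opt/mytool':
  ensure             => directory,
  recurse            => true,
  source             => 'puppet:///modules/mytool/opt/mytool',
  source_permissions => use,
}
</verbatim>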
   * Using the [[http://docs.saltstack.com/en/latest/topics/targeting/batch.html][SaltStack batch mode]] feature to run a command on groups of filtered servers:
      * To appreciate this, I assume you are used to older tools like [[http://www.csm.ornl.gov/torc/C3/Man/cexec.shtml][cexec]] or [[https://code.google.com/p/pdsh/][pdsh]].
      * Those tools require you to write a static configuration file where you define your cluster(s); these definitions can only use hostnames.
      * In [[http://docs.saltstack.com/en/latest/][SaltStack]] each client (minion) constantly publishes its live info (grains); the core grains are =SSDs biosreleasedate biosversion cpu_flags cpu_model cpuarch domain fqdn fqdn_ip4 fqdn_ip6 gpus host hwaddr_interfaces id ip4_interfaces ip6_interfaces ip_interfaces ipv4 ipv6 kernel %GREEN%kernelrelease%ENDCOLOR% locale_info localhost machine_id manufacturer master mem_total nodename num_cpus num_gpus os os_family osarch oscodename osfinger osfullname %ORANGE%osmajorrelease%ENDCOLOR% osrelease osrelease_info path productname ps pythonexecutable pythonpath pythonversion saltpath saltversion saltversioninfo selinux serialnumber server_id shell virtual zmqversion=, but you can also define your own grains, e.g. =prod dev webserver db rackposition=.
      * By leveraging the grains values you can dynamically filter the minions, split them into groups (a %BLUE%fixed amount%ENDCOLOR% or a percentage), and run a command on these groups in sequence.
      * Running in small groups is useful when a 3rd-party service is involved (=ftp http puppet rsync NFS ...=) and you don't want to open tens of connections against it.
      * *My most recurring case is Puppet:* =saltmaster# salt %BLUE%-b 3%ENDCOLOR% -C 't3wn* and G@%ORANGE%osmajorrelease%ENDCOLOR%:6' cmd.run 'puppet agent -t'=
      * All the commands you run are saved by [[http://docs.saltstack.com/en/latest/][SaltStack]], which acts as a kind of job system.
      * Another example (without groups this time): =salt -C 't3ui* and not G@%GREEN%kernelrelease%ENDCOLOR%:2.6.32-358.2.1.el6.x86_64' cmd.run 'uname -a'=
   * Tried http://xrootd.org v4; I have the impression that it requires IPv6, since I couldn't start it without an IPv6 address. This needs double-checking.
   * Working together with my boss Derek to prepare the 5th PSI T3 Steering Board Meeting (UniZ/ETHZ/PSI): a lot of time spent here.
   * Reading the [[http://www.dcache.org/manuals/upgrade-2.10/upgrade-2.6-to-2.10.html][dCache 2.6 to 2.10 upgrade guide]].
   * Is somebody going to attend [[https://indico.cern.ch/event/272794/][the Condor Workshop at CERN]] next week? I'll probably attend it remotely.

---+++ UNIBE-LHEP
   * Operations
      * Smooth routine operations, with minor (or quickly remedied) issues:
         * 4 workers on ce01 suddenly became black holes: disabled pending investigation (no time so far).
         * Our main switch went nuts on 17 Nov (luckily during morning working hours). Packets were dropped all over the place; power-cycled and recovered. No useful traces in the system log.
         * a-rex crashed once more on ce02. This is a persistent issue that happens randomly on both clusters; only once a month is a positive trend.
         * The home directory server (local users) crashed due to a file system problem: it needed hard power-cycling on site and file system repairs from single-user mode. Night-long downtime, recovered fine.
      * Deployed a Nagios server with basic checks. Tuning alarm thresholds and progressively adding more sophisticated checks.
   * ATLAS-specific operations
      * High failure rate on many WNs [http://bigpanda.cern.ch/wns/UNIBE-LHEP/]:
         * mainly (exclusively?) due to the vmem limit being exceeded
         * tweaked the gridengine settings and restored 16 slots per node on the ce01 SunBlades (24GB RAM): vmem is set as consumable (dynamic allocation), but gridengine needs an explicit total memory value per node, since it doesn't fetch it automatically from e.g. =meminfo=. Once that was set to 24GB, the nodes no longer died from memory starvation (see the sketch after this list).
         * yet the problem is not solved, as jobs are killed instead:
            * the recommended factor-2 up-scaling of the limit in =submit_sge_job= is no longer sufficient
            * the real problem is that the kernels have changed and vmem is no longer RSS+swap: it is the size of the address space. In 32-bit the difference was negligible; in 64-bit it is much larger (and by now all workflows are 64-bit)
            * in gridengine each job is assigned a GID and all the children spawned by the job inherit it; GE adds up the resources used by all processes with the same GID and matches the limits against that total
            * this gets much worse with MCORE jobs, as the shared resources are added multiple times: it turns out that a reco job using e.g. 20GB is accounted as using as much as 40GB
            * cgroups could be the solution, but gridengine (apparently) does not support them
            * either increase the scaling factor or remove the limit altogether (the latter could be OK for production, but not for analysis)
            * increased the factor (conservatively) from 2 to 2.5; watching the failure trend
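A rough sketch of those gridengine settings, for reference; the execution host name is a placeholder, and the exact memory complex used at the site (=h_vmem= here) is an assumption:

<verbatim>
# 1) Mark the memory complex as consumable; edit its line via 'qconf -mc':
#    name    shortcut  type    relop  requestable  consumable  default  urgency
#    h_vmem  h_vmem    MEMORY  <=     YES          YES         2G       0

# 2) gridengine does not read the node's total memory from /proc/meminfo,
#    so state it explicitly per execution host (24GB on the SunBlades):
qconf -mattr exechost complex_values h_vmem=24G wn01.example.org
</verbatim>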
   * Monitoring
      1. SAM Nagios ATLAS_CRITICAL: [[http://wlcg-sam-atlas.cern.ch/templates/ember/#/plot?flavours=SRMv2%2CCREAM-CE%2CARC-CE&group=All%20sites&metrics=org.sam.CONDOR-JobSubmit%20%28%2Fatlas%2FRole_lcgadmin%29%2Corg.atlas.WN-swspace%20%28%2Fatlas%2FRole_pilot%29%2Corg.atlas.WN-swspace%20%28%2Fatlas%2FRole_lcgadmin%29%2Corg.atlas.SRM-VOPut%20%28%2Fatlas%2FRole_production%29%2Corg.atlas.SRM-VOGet%20%28%2Fatlas%2FRole_production%29%2Corg.atlas.SRM-VODel%20%28%2Fatlas%2FRole_production%29&profile=ATLAS_CRITICAL&sites=CSCS-LCG2%2CUNIBE-LHEP%2CUNIGE-DPNC&status=MISSING%2CUNKNOWN%2CCRITICAL%2CWARNING%2COK][plot for CSCS-LCG2, UNIBE-LHEP and UNIGE-DPNC]]
      1. HammerCloud gangarobot: http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UNIBE-LHEP&startTime=2014-11-04&endTime=2014-12-03&templateType=isGolden

---+++ UNIBE-ID
   * Security incident at site CAMK [EGI-20140130]:
      * some attack attempts from the IPs given in the EGI security report; no successful login found
   * Operations:
      * smooth and reliable; no issues
      * the 16 new DALCO compute nodes are operational => decommissioning of the old Sun Bladecenter on 2014-12-11

---+++ UNIGE
   * New disk space for the AMS experiment added:
      * +84 TB of NFS space
      * disk now: 709 TB (474 TB in the DPM SE, 235 TB on NFS)
   * One incident with a full NFS file system.
   * A Solaris 9 disk server (Sun X4540) blocked a few times:
      * impossible to unmount the file system or to shut down properly
      * had to reboot all clients, and to reset many of them
      * this does not happen often...
   * ARC front end filling up /var:
      * caused by the lack of log rotation for =/var/log/arc/bdii/bdii-update.log= (see the sketch after this list)
   * Our /cvmfs over NFS is getting slow again (overloaded):
      * no problem visible to the users, but we need to watch this issue
      * may need more machines for /cvmfs, as we have many directories: <br /> =ls /cvmfs= <br /> =ams.cern.ch atlas.cern.ch atlas-condb.cern.ch atlas-nightlies.cern.ch geant4.cern.ch icecube.wisc.edu na61.cern.ch sft.cern.ch=
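A minimal logrotate snippet for the =bdii-update.log= problem above; only the log path comes from the report, the file name and rotation policy are assumptions:

<verbatim>
# /etc/logrotate.d/arc-bdii (hypothetical file name)
/var/log/arc/bdii/bdii-update.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    # copytruncate keeps rotating without having to signal bdii-update
    copytruncate
}
</verbatim>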
---+++ NGI_CH
   * perfSONAR 3.4 upgrade (re-instantiation) as a response to Shellshock. The new instructions include new mesh configurations.
   * SAM Update-23: release early this week (?). The old OPS VOMS was decommissioned on November 26th.
   * GOCDB: "Prod=Y and Mon=N" changed to "Prod=Y and Mon=Y" for all services except emi.ARGUS and VOMS.
   * NGI_CH ARGUS deployment completed: https://ggus.eu/index.php?mode=ticket_info&ticket_id=99533

---++ Other topics
   * Possibility of local accounts for a limited number of power users (direct batch submission) at the T2? (request from ETH CMS group)
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Gianni Ricciardi
   * CMS: Fabio Martinelli, Daniel Meister
   * UNIBE-ID: Nico Färber
   * ATLAS: Gianfranco Sciacca, Szymon Gadomski
   * LHCb: Roland Bernet
   * EGI: Gianfranco Sciacca

---++ Action items
   * Item1