<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2015-05-07

   * *Date and time*: First Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: From Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS

   * *Unscheduled downtime on 15.04.15 and 16.04.15*
      * On the morning of April 15, we found GPFS blocked, logging messages such as: <verbatim>
[...]
Tue Apr 14 20:22:52.964 2015: Recovered 1 nodes for file system phoenix_scratch.
Tue Apr 14 20:29:13.455 2015: Accepted and connected to 148.187.65.62 wn62 <c0n14>
*** glibc detected *** /usr/lpp/mmfs/bin//mmfsd: invalid fastbin entry (free): 0x00007fbf202829b0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x76166)[0x7fbf4fc59166]
/usr/lpp/mmfs/bin//mmfsd(_ZN10MsgDataBuf8freeDataEv+0x18)[0x90e5b8]
/usr/lpp/mmfs/bin//mmfsd(_ZN10MsgDataBufD1Ev+0x9)[0x910469]
/usr/lpp/mmfs/bin//mmfsd(_ZN7TcpConn9deleteMsgEP6RcvMsg+0x4c)[0x918cac]
/usr/lpp/mmfs/bin//mmfsd(_ZN10NsdRequest14processRequestEP9NsdBufferP8NsdQueue+0x385)[0x10f0d65]
/usr/lpp/mmfs/bin//mmfsd[0x10f17ba]
/usr/lpp/mmfs/bin//mmfsd(_ZN6Thread8callBodyEPS_+0x66)[0x5a4676]
/usr/lpp/mmfs/bin//mmfsd(_ZN6Thread15callBodyWrapperEPS_+0x79)[0x5963f9]
/lib64/libpthread.so.0(+0x79d1)[0x7fbf5070c9d1]
/lib64/libc.so.6(clone+0x6d)[0x7fbf4fccbb6d]
======= Memory map: ========
00400000-0134c000 r-xp 00000000 fd:01 2757550 /usr/lpp/mmfs/bin/mmfsd
0144b000-01490000 rwxp 00f4b000 fd:01 2757550 /usr/lpp/mmfs/bin/mmfsd
01490000-014f7000 rwxp 00000000 00:00 0
0237d000-025a2000 rwxp 00000000 00:00 0 [heap]
</verbatim> Both metadata servers were affected, about 8 hours apart. This also left one of the metadata servers out of sync (its SSD disks were expelled from GPFS).
      * Upon discovery, we announced the problem to the CHIPP community, declared an official downtime and contacted IBM about the issue. In parallel, we manually re-added the SSDs that hold metadata on the out-of-sync server and started the resync.
      * After %BLUE%about 5 hours%ENDCOLOR%, IBM labs said they could not reproduce the problem and suggested upgrading to a newer version.
      * Once the resync finished, following IBM's advice, we rebooted all GPFS servers and aligned the GPFS package version on the servers with the rest of CSCS (3.5.0-21).
      * Unfortunately, at this point we realised that the %BLUE%number of inodes was very close to the maximum%ENDCOLOR%.
This is most likely because, at the time GPFS blocked, the cluster was full and some ~80 million files belonging to the failed jobs were stuck in the filesystem (if the filesystem is not accessible, jobs can't delete their output). The cleanup process took many hours (until approx. 23:30).
      * When the process finished, we *regenerated the scratch* structure and slowly brought the system back to life. The structure we recreated was incomplete and it took us a while to make sure all permissions were correct.
      * %ICON{led-red}% *Actions taken as a result of this downtime:*
         1 The procedure to recreate the filesystem has been improved.
         1 We have further increased the frequency of the GPFS cleaning policy to run twice a day, removing everything older than the length of the longest job.
         1 We are currently evaluating other possible configurations, such as dividing scratch into filesets (the GPFS term for a folder with its own quota). This increases complexity, but might help to mitigate problems in the event of filesystem issues (we could recreate filesets in a rolling fashion: first VO1, then VO2 and finally VO3).
         1 In addition to upgrading GPFS, we tuned its configuration to avoid the massive swapping we found on the nodes that 'only' have 64GB RAM. GPFS will now never take more than 2GB of physical memory on any given WN.
   * *ARC update*
      * =arc01= re-installed with nordugrid-arc 5.0.0-2
      * currently configuring and testing
   * *Other operations*
      * Currently investigating the feasibility of implementing =cgroups= in order to contain jobs. This would require a major SLURM upgrade. The following presentation given at [[https://indico.cern.ch/event/346931/session/5/contribution/72/material/slides/2.pdf][HEPiX]] on this matter is very interesting.
      * Currently testing a dCache upgrade, as =dCache 2.6= will be out of support soon.
      * Work to port Phoenix to Puppet is still ongoing.
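The tightened cleaning policy can be illustrated with a plain =find= sweep over a demo directory. Note this is only a sketch: the production policy runs through GPFS's =mmapplypolicy= ILM machinery, and the 72-hour retention and demo path below are made-up examples, not the site's actual values.

```shell
#!/bin/sh
# Illustrative stand-in for the scratch cleaning policy: delete files
# older than the longest allowed job walltime, then prune emptied
# directories. Production uses a GPFS ILM policy via mmapplypolicy;
# the 72h retention and the demo path here are assumptions.
SCRATCH=${SCRATCH:-/tmp/scratch-demo}
RETENTION_HOURS=${RETENTION_HOURS:-72}

# Build a tiny demo tree so the sketch is self-contained.
mkdir -p "$SCRATCH/job42"
touch -d '10 days ago' "$SCRATCH/job42/stale.out"  # output of a failed job
touch "$SCRATCH/fresh.out"                         # recent file, must survive

# Delete files whose mtime exceeds the retention window...
find "$SCRATCH" -xdev -type f -mmin +$((RETENTION_HOURS * 60)) -delete
# ...and remove any directories the sweep emptied.
find "$SCRATCH" -mindepth 1 -type d -empty -delete

ls "$SCRATCH"   # only fresh.out remains
```

Running such a sweep twice a day from cron mimics the "remove everything older than the longest job" rule, but without the parallel metadata scan that makes =mmapplypolicy= practical on a filesystem with tens of millions of inodes.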
   * Installed pakiti on all WNs
   * *Next maintenance on May 20th, 2015, between 8:00 and 20:00*
      * Due to works on the cooling systems and a critical maintenance of the NAS infrastructure, Phoenix is forced to go into maintenance.
      * During the downtime, IBM will upgrade the firmware on our DCS3700 storage controllers.
      * At this point no other operations are planned.

---+++ PSI

   * *Studying Control Groups in Son of Grid Engine*
      * [[http://www.oracle.com/technetwork/articles/servers-storage-admin/resource-controllers-linux-1506602.html][The 5' article about Control Groups]] _to be read_ + [[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html][Full Control Groups RHEL6 Reference]] _can be skipped on a first read_
      * One of the recurring issues with the old SGE 6.2u5 running at the PSI T3 is the users' freedom to consume more CPU cores than the ones assigned by the batch system; to definitively fix this issue I'm going to upgrade our SGE from 6.2u5 to [[https://arc.liv.ac.uk/trac/SGE][Son of Grid Engine 8.1.8]] because of its support for Control Groups; some details about this integration are [[http://blogs.scalablelogic.com/2012/05/grid-engine-cgroups-integration.html][here]]; there are already [[https://indico.cern.ch/event/346931/session/5/contribution/38/material/slides/0.pdf][ATLAS sites using Control Groups in HTCondor]]; [[http://slurm.schedmd.com/cgroups.html][SLURM also supports Control Groups]]
   * *Just for Gianfranco*: you need to apply these settings to make Control Groups / cpusets work in Son of Grid Engine (it cost me a full day of attempts!
):
      1 <pre>[wn] cat /etc/sysconfig/sgeexecd
%BLUE%export SGE_CGROUP_DIR=/dev/cpuset/sge%ENDCOLOR%</pre>
      1 <pre>[wn] grep -Hn setup-cgroups-etc /etc/init.d/sgeexecd.p6444
/etc/init.d/sgeexecd.p6444:441: %BLUE%/opt/sge/util/resources/scripts/setup-cgroups-etc start%ENDCOLOR%</pre>
      1 <pre>[wn] qconf -sconf | grep -Hn CGR
(standard input):28:execd_params USE_SMAPS=true KEEP_ACTIVE=true %BLUE%USE_CGROUPS=true ENABLE_BINDING=true%ENDCOLOR% \</pre>
      1 <pre>[submission_host] grep -v \# /opt/sge/default/common/sge_request | strings
%BLUE%-binding set linear%ENDCOLOR%</pre>
   * *Upgraded the PSI PhEDEx from SL5 to SL6*
      * Lots of issues here, due to the poor QA of the latest [[http://www.dcache.org/downloads/1.9/srm/][SRM client]] =SRM client 2.10.7 (rpm)=
      * The *implicit* X509 proxy delegation required to copy files between two remote SRM endpoints (e.g. from CSCS to PSI) doesn't work if one uses the =-copyjobfile= option, like PhEDEx does; the dCache team acknowledged this bug
      * This is also a bug: <pre>$ srmls -debug=false -x509_user_proxy=/home/phedex/gridcert/proxy.cert -retry_num=0 'srm://t3se01.psi.ch:8443/srm/managerv2?SFN=/pnfs/psi.ch/cms/trivcat/store/mc/RunIIWinter15GS/RSGravToWWToLNQQ_kMpl01_M-4000_TuneCUETP8M1_13TeV-pythia8/GEN-SIM/MCRUN2_71_V1-v1/10000/2898A22B-62B0-E411-B1D4-002590D600EE.root'
srm client error:
java.lang.IllegalArgumentException: Multiple entries with same key: x509_user_proxy=/home/phedex/gridcert/proxy.cert and x509_user_proxy=/tmp/x509up_u205</pre>
      * Same here: <pre>$ srm-advisory-delete -x509_user_proxy=${X509_USER_PROXY} -retry_num=0
srm client error:
java.lang.IllegalArgumentException: Multiple entries with same key: x509_user_proxy=-retry_num=0 and x509_user_proxy=/tmp/x509up_u205</pre>
      * Eventually I tweaked the PhEDEx scripts to bypass all these bugs.
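The two =IllegalArgumentException= failures above suggest the client merges the explicit =-x509_user_proxy= option with the default =/tmp/x509up_u<uid>= path it probes anyway. A defensive wrapper along these lines could avoid the clash; this is a hypothetical sketch (the paths are illustrative), *not* the actual PhEDEx tweak, which is in the commits linked below.

```shell
#!/bin/sh
# Hypothetical workaround sketch -- NOT the actual PhEDEx fix.
# Idea: stage the delegated proxy at the default path the SRM client
# probes, drop the explicit -x509_user_proxy option, and the
# "Multiple entries with same key" clash cannot occur.
PROXY=${PROXY:-/tmp/proxy-demo.cert}   # assumed location of the real proxy
[ -f "$PROXY" ] || { umask 077; echo demo-proxy > "$PROXY"; }  # demo stand-in

DEFAULT_PROXY="/tmp/x509up_u$(id -u)"  # default path probed by the client
install -m 600 "$PROXY" "$DEFAULT_PROXY"
export X509_USER_PROXY="$DEFAULT_PROXY"
echo "proxy staged at $X509_USER_PROXY"

# The client can now be invoked without -x509_user_proxy, e.g.:
#   srmls -debug=false -retry_num=0 "$SURL"
```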
   * *Just for Daniel*: my [[https://cmsweb.cern.ch/gitweb/?p=siteconf/.git;a=blobdiff;f=T3_CH_PSI/PhEDEx/FileDownloadDelete;h=29e54573861cfb760390132f3be69c5fc3f42fca;hp=45bd4cabea41151a2619e64d3af1733831f0aaec;hb=65b98becf6f952a061a5fc2436c513938755b9d4;hpb=40743a21cbc06e9967530ec6874577f17a1ab9fc][FileDownloadDelete]] and [[https://cmsweb.cern.ch/gitweb/?p=siteconf/.git;a=blobdiff;f=T3_CH_PSI/PhEDEx/FileDownloadSRMVerify;h=8fbe5c5d576a0b78721fdbe824b45fd83fcfd538;hp=b3ebfbb248bc87d8b1519b71840f24eda198533b;hb=40743a21cbc06e9967530ec6874577f17a1ab9fc;hpb=34c72eb9ea126a3d0947c0cf314554b5fb76fd37][FileDownloadSRMVerify]] corrections
   * *Power cut at PSI*
      * My first power cut since 2011; 4 file servers rebooted, T3 users in panic mode; luckily fixed without data loss
   * *CMS Space Monitoring Project* for CSCS; Daniel is going to follow this task
      * [[https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompProjSpaceMon][Task details]]
      * [[http://dashb-ssb.cern.ch/dashboard/request.py/sitehistory?site=T2_CH_CSCS#currentView=test][Currently CSCS is in error concerning this task]]
      * I made a dCache query for PSI related to this task: https://bitbucket.org/fabio79ch/pnfs_space_usage_by_creation_time/wiki/Home ; CSCS might publish it on its website too

---+++ UNIBE-LHEP

   * *Operations*
      * Mainly stable operation on both clusters (yet at about half capacity)
      * Local network sick on ce01 (22nd April): power-cycled
      * 3 a-rex crashes caught by the cron
   * *ATLAS specific operations*
      * As last month, mostly multi-core MC production + reconstruction, mostly smooth. Until:
      * Following the ARC upgrade to <literal>5.0.0-2.el6</literal>: a full <literal>/var</literal> caused the system to jam (ce01 only, grid-manager.log). Laborious clean-up needed
      * Following the clean-up: a-rex crashed at start. "Insider" tip: <literal>rm gm.fifo</literal> in the controldir fixed it
      * ARC 5 introduces handling of user-requested job priority.
The ARC range [0:100] is mapped in ARC 5 to [-1023:1024], but in gridengine the range allowed for user control is [-1023:0]. Needed to hack <literal>submit-sge-script</literal>
      * Yesterday, ce01 dropped out of the GIIS (all services reported themselves as running): needed to restart the infosys *and* a-rex. Update: it just happened again
      * Removal of the voms.cern.ch alias to voms2.cern.ch caused file transfers to the SRM to fail. Apparently an obscure mis-configuration detail caused the authentication problem [ https://ggus.eu/index.php?mode=ticket_info&ticket_id=113485 ] (ops tests still failing on the SE, not clear why)
   * *HammerCloud gangarobot*: http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UNIBE-LHEP&startTime=2015-04-02&endTime=2015-05-07&templateType=isGolden
   * Plans
      * Restore full capacity:
         * 320 (old) cores from UNIBE-ID
         * 144 (old) cores from CSCS
         * 128 (new) cores awaiting installation (E5-2650 v2 2.6GHz)
         * 512 (new) cores to be procured
      * Plan is to re-deploy ce01 with ROCKS 6.2 (coming out any day now).
At the moment, with the current deployment of ROCKS 6.1, installation of new hardware works but the kernel freezes at re-boot
      * Testing the install procedure with ce04 and ROCKS 6.1.1 for now (temporary deployment)
      * Deploy 2 additional UIs (to be procured)

---+++ UNIBE-ID

   * *Procurement*:
      * The tender for the new storage ended at the beginning of this week => order placed for:
         * IBM ESS GL4 with 4TB NL-SAS disks:
            * Total capacity: 928TB
            * Interconnect: IB
            * Filesystem: GPFS 4.1 STD + GPFS Native RAID
   * *Operations*:
      * Stable operations most of the time
      * Small glitch after migrating to the new OpenLDAP installations
         * pam_ldap doesn't close connections properly => max-open-files violations on the LDAP server; we have now set olcIdleTimeout to 60 - problem solved
      * Upgraded the Nordugrid ARC CE to 5.0.0
         * The upgrade process was seamless; no problems so far
         * According to http://goc-accounting.grid-support.ac.uk/apel/jobs2.html we get accounted, don't we?
      * Still working on moving from the Satellite-based RHEL setup to a Foreman-based CentOS setup

---+++ UNIGE

   * The next upgrade
      * Our upgrade plans for 2015 were approved as proposed for the 50% co-funding scheme of the University.
      * One order is out for the upgrade:
         * 3 x Lenovo x3630 M4 with 6 TB disks (63 TB net in a 2U machine) (1 for neutrino, 2 for ATLAS)
         * 2 x Lenovo x3550 M4 as hosts for running virtual machines
      * We also foresee upgrading the network to 10 Gbps for the 7 disk servers doing NFS
         * no quotes yet, only price estimates
   * Cleanup of the SE UNIGE-DPNC_LOCALGROUPDISK, 90 TB free (21%)
      * With Run 2 starting, we will likely need another round quickly
   * This is my last meeting. Good bye to all of you!
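The olcIdleTimeout fix mentioned under UNIBE-ID above can be applied to a running server without a restart when slapd uses the dynamic =cn=config= backend. A sketch, assuming the stock =cn=config= layout and root access via the =ldapi:///= socket:

```ldif
# idle-timeout.ldif -- assumes the dynamic cn=config backend; apply with:
#   ldapmodify -Y EXTERNAL -H ldapi:/// -f idle-timeout.ldif
dn: cn=config
changetype: modify
replace: olcIdleTimeout
olcIdleTimeout: 60
```

With this set, slapd closes connections that have been idle for 60 seconds, which keeps clients that never close their handles (like pam_ldap here) from exhausting the server's file-descriptor limit.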
---+++ NGI_CH

   * EGI monthly OPS report for March circulated (24th April):
      * NGI_CH 88% / 88% - CSCS and UNIGE have lower than usual numbers.
      * There is also a reported 79% in the "Unknown" column (but no ticket from EGI about it)
   * WLCG monthly reports for the experiments (April):
      * CH-CHIPP-CSCS is: 98%/98% (ATLAS) - 96%/96% (CMS) - 97%/97% (LHCb)
      * http://wlcg-sam.cern.ch/reports/2015/201504/wlcg/WLCG_All_Sites_ATLAS_Apr2015.pdf
      * http://wlcg-sam.cern.ch/reports/2015/201504/wlcg/WLCG_All_Sites_CMS_Apr2015.pdf
      * http://wlcg-sam.cern.ch/reports/2015/201504/wlcg/WLCG_All_Sites_LHCB_Apr2015.pdf
   * Certificates:
      * Agreement with GRNET in place and procedure principle established. Will test the workflow from Bern. WARNING: DNs will change when a new certificate is issued
   * Middleware: issues with Torque 4 in EPEL. If at version < 2.5.13 and wishing to upgrade, EGI offers advice
   * NGIs asked to assess:
      * The need to have middleware on CentOS7
      * The number of sites still using SL5 or equivalent, and the decommissioning plan for them
   * Security:
      * EGI accepted the feedback proposed by NGI_CH about notification capabilities to be added to the pakiti clients (installation of these will become 'somehow' a requirement)
      * Sites are encouraged to carry out a security readiness self-assessment prepared by the National Cyber Security Centrum: https://check.ncsc.nl/questionnaire/

---++ Other topics

   * *Next meeting date*: The next meeting should be on June 4, but Gianfranco and Miguel will be away (NorduGrid conference, see below). Other suggestions could be:
      * Wed, Jun 10 at 14:00
      * Thu, Jun 11 at 14:00
      * Wed, Jun 17 at 14:00
   * *Training for sysadmins*: The [[https://ethz.doodle.com/ua5dm3euw9gukgzi][doodle]] is set and we are waiting to collect input and propose final dates. Currently evaluating the possibility of doing it online via Vidyo or Scopia.

---++ A.O.B.
   * Gianni will attend the following [[https://indico.cern.ch/event/319821/][pre-GDB]] to be held at CERN on 12.05.15.
   * Miguel will attend the upcoming [[http://www.lhep.unibe.ch/shaug/ngc2015/ngc2015.html][Annual NorduGrid Conference]] to be held at UNIBE between 04.06.15 and 05.06.15
   * Gianfranco + Sigve will attend the upcoming [[http://conf2015.egi.eu][EGI Conference 2015]] to be held in Lisbon between 18.05.15 and 22.05.15
   * Gianfranco will attend the upcoming [[http://www.lhep.unibe.ch/shaug/ngc2015/ngc2015.html][Annual NorduGrid Conference]] to be held at UNIBE between 04.06.15 and 05.06.15

---++ Attendants

   * CSCS: Dino Conciatore, Miguel Gila, Dario Petrusic. Apologies: Gianni Ricciardi, Nick Cardo.
   * CMS:
   * ATLAS: Gianfranco Sciacca, Szymon Gadomski
   * LHCb: Roland Bernet
   * EGI: Gianfranco Sciacca

---++ Action items

   * Item1
Topic revision: r16 - 2015-05-07 - RolandBernet