<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2015-03-05

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 9305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via e-mail)

%TOC%

---++ Site status

---+++ CSCS
   * Three new members of CSCS joined Phoenix: two new system engineers (Dario Petrusic and Dino Conciatore) and systems group lead Nicholas Cardo. They will be actively participating in all our meetings and operations starting now.
      * They already have access to the systems, including TWiki and chat.
      * Working to set certificate roles in /dteam and dteam/NGI_CH; then they will be added to GOC and Nagios@CSCS.
   * Maintenance 10.03.15:
      * dCache upgrade: security updates and dCache to 2.6.46
      * GPFS config update: new maxFilesToCache setting in place, raising it from 40k to 50k files per node (a sketch follows at the end of this section)
      * Reinstallation of as many WNs as possible with the latest EMI-WN packages (plus security updates)
      * Shutdown and physical removal of old hardware (puppet, nfs0[1-2], se[01-06], 3x 1/2 racks of IBM DC3500 storage)
   * A.O.B.
      * Migration to Puppet 3.6 ongoing; new roles have been created, but more work is still needed. Managed to migrate cfengine from ageing hardware (>1000 days of uptime!) to a new VMware VM.
      * At some point before summer we will need to upgrade GPFS (to v. 4) and dCache (to v. 2.10).
      * (Gianfranco) ATLAS lcgadmin and pilot roles to be enabled/fixed on the ARC CEs
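A minimal sketch of how a =maxFilesToCache= change like the one above is typically applied with the standard GPFS admin commands; the node class =worker_nodes= is a hypothetical name for illustration:
<pre>
# Raise the per-node file cache from 40k to 50k files
# ("worker_nodes" is a hypothetical node class; the real target list differs):
mmchconfig maxFilesToCache=50000 -N worker_nodes

# Verify the setting; it only takes effect once GPFS is restarted on the
# affected nodes, e.g. during the maintenance window:
mmlsconfig maxFilesToCache
</pre>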
---+++ PSI
   * A major *NetApp E5400 error*, =A drawer in the tray has become degraded=, that led to =lost 1/2 redundant paths to 12*3TB disks=; it was a FW bug, solved by updating the NetApp E5400 FW ONLINE to =7.86.49.00=; the Linux-side RDAC driver gracefully moved the paths from one RAID controller to the other, and back, while the controllers rebooted. I didn't have to unmount the XFS filesystems or stop dCache; nothing you can get from the NAS world. CSCS should update the FW as well.
   * Again in the same NetApp E5400 I got 2*3TB broken disks.
      * In both cases I got the native NetApp e-mails, routed through =iptables NAT=, but also a [[https://bitbucket.org/fabio79ch/check_netapp/wiki/Home][Nagios e-mail]].
   * Preparing the [[http://www.dcache.org/downloads/1.9/timeline-dCache.svg][dCache 2.6 to 2.10 migration]]; in my case this will also mean upgrading PostgreSQL from 9.3 to 9.4, also because of this [[https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.4#REFRESH_MATERIALIZED_VIEW_CONCURRENTLY][news]]; luckily my [[https://bitbucket.org/fabio79ch/v_pnfs/wiki/Home][Chimera Materialized Views]] still work out of the box, but there are some new table fields that I should include in the future.
   * Using *Puppet standalone* over =/afs= because it's 10 times faster than having a Puppet master, and the clients don't crash; each SL6 server in my cluster mounts =/afs=, and there is an =/afs= dir where both my Puppet recipes and the conf files are stored; this =/afs= dir and its descendants are protected by AFS ACLs; only the =root= account on the SL6 server can access my =/afs= dir, by using a [[https://kb.iu.edu/d/aumh][Kerberos keytab]] file (a standalone run is sketched at the end of this section). Example:
<pre>
%BLUE%#%ENDCOLOR% ll /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/
ls: cannot access /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/: %RED%Permission denied%ENDCOLOR%
%BLUE%#%ENDCOLOR% kinit -k -t /root/afs-keytabs/svcusr-t3_puppet.keytab svcusr-t3_puppet@D.PSI.CH && aklog
%BLUE%#%ENDCOLOR% ll /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/
total 4
lrwxr-xr-x 1 martinelli_f cms   22 Jan 15 16:14 manifests -> puppet/TRUNK/manifests
lrwxr-xr-x 1 martinelli_f cms   20 Jan 15 16:14 modules -> puppet/TRUNK/modules
drwxr-xr-x 4 martinelli_f cms 2048 Jan 15 15:48 puppet
</pre>
   * Many other tasks, but specific to PSI or CMS.
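A minimal sketch of what the masterless run looks like once the keytab/token step above has been done; the =site.pp= entry point under =manifests/= is an assumption for illustration:
<pre>
# After kinit + aklog (see the listing above), run Puppet directly against
# the recipes stored in /afs -- no Puppet master involved:
ENV=/afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment
puppet apply --modulepath "$ENV/modules" "$ENV/manifests/site.pp"   # site.pp is a hypothetical entry point
</pre>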
---+++ UNIBE-LHEP
   * Operations
      * Slow recovery of the *ce01* cluster following the kernel+glibc security updates of January:
         * straightforward RPM upgrades would not work; needed to re-image the WNs and re-install
         * issue with the OpenIB modules freezing at shutdown, which implies power-cycling every node whenever a re-boot (or re-installation) is needed
         * turned out our IB stack (not updated for ~2 years) had an outdated setup
         * however, even after a general update of the setup, the rdma modules still cannot be unloaded cleanly once loaded at start-up (even if lustre is not started at all)
         * cooked up a shutdown script that (theoretically) unloads everything cleverly before running into the system freeze/crash
         * the permanent solution, I suppose, is to re-image the WNs from scratch; however, this implies re-building ROCKS (and the CE) from scratch
      * *ce02* cluster needed power-cycling of the ethernet switches on 30th Jan; stable thereafter, but almost halved in capacity
      * Added cron jobs on both CEs to recover a-rex after crashing, logging the crashes; typically twice a month (a guardian of this kind is sketched after the UNIBE-ID section below)
   * ATLAS specific operations
      * ATLAS still pretty quiet, picking up now
      * Revived the WebDAV access to the SE (ATLAS request)
      * Monitoring:
         1 SAM Nagios ATLAS_CRITICAL: http://wlcg-sam-atlas.cern.ch/templates/ember/#/plot?flavours=SRMv2%2CCREAM-CE%2CARC-CE&group=All%20sites&metrics=org.sam.CONDOR-JobSubmit%20%28%2Fatlas%2FRole_lcgadmin%29%2Corg.atlas.WN-swspace%20%28%2Fatlas%2FRole_pilot%29%2Corg.atlas.WN-swspace%20%28%2Fatlas%2FRole_lcgadmin%29%2Corg.atlas.SRM-VOPut%20%28%2Fatlas%2FRole_production%29%2Corg.atlas.SRM-VOGet%20%28%2Fatlas%2FRole_production%29%2Corg.atlas.SRM-VODel%20%28%2Fatlas%2FRole_production%29&profile=ATLAS_CRITICAL&sites=CSCS-LCG2%2CUNIBE-LHEP%2CUNIGE-DPNC&status=MISSING%2CUNKNOWN%2CCRITICAL%2CWARNING%2COK
         1 HammerCloud gangarobot: http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UNIBE-LHEP&startTime=2015-02-01&endTime=2015-03-04&templateType=isGolden

---+++ UNIBE-ID
   * Procurement
      * Another 16 Dalco compute nodes are installed, set up and running smoothly; ingredients:
         * 2x 8C Intel Xeon E5-2650v2 2.6GHz
         * 128GB 1866MHz DDR ECC REG (8*16GB)
         * 1x 1TB 7.2k rpm SATA 6.0Gb/s
         * 2x Gigabit-Ethernet onboard
         * Infiniband !ConnectX-3 QDR HCA
      * Prepared a tender to buy replacement storage
         * IBM GSS24 with 3TB disks => 696 TB total capacity; ~510 TB usable capacity
   * Decommissioning
      * 23 Sun X2200 pizza boxes shut down and dumped
      * 25 remaining, marked to be dumped within the next two months
   * Operations
      * smooth and reliable, except...
      * ... nordugrid-arc-bdii dead for almost a week while I was on holiday => bad performance value in the monthly report
         * the same happened in January and at the beginning of this week
         * now installed a cron-based guardian like we already have for a-rex (which, btw, was very stable over the last few months); see the sketch below
   * AOB:
      * (Gianfranco) ATLAS pilot role to be enabled/fixed on the ARC CE
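A minimal sketch of the kind of cron-based guardian mentioned above for a-rex (UNIBE-LHEP) and nordugrid-arc-bdii (UNIBE-ID); the script path, process names and log location are illustrative assumptions, not the sites' actual scripts:
<pre>
#!/bin/bash
# Hypothetical guardian, run from cron on the CE, e.g.:
#   */5 * * * * root /root/bin/service-guardian.sh a-rex arched
# $1 = init service name, $2 = process to look for (both assumed names).
SERVICE=$1
PROC=$2
LOG=/var/log/${SERVICE}-guardian.log
if ! pgrep -x "$PROC" >/dev/null; then
    echo "$(date '+%F %T') $SERVICE found dead, restarting" >> "$LOG"
    service "$SERVICE" restart >> "$LOG" 2>&1
fi
</pre>
The same wrapper would cover the BDII case by passing the corresponding service and process names.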
---+++ UNIGE
   * Xxx
   * AOB:
      * (Gianfranco) ATLAS request to enable multicore jobs (instructions for gridengine were sent by mistake, but Geneva runs Torque)

---+++ NGI_CH
   * January 2015 - RP/RC OLA performance: http://snf-631462.vm.okeanos.grnet.gr:8080/lavoisier/site_reports?ngi=NGI_CH
      * UNIBE-ID low (understood): https://ggus.eu/index.php?mode=ticket_info&ticket_id=111896
   * Multicore accounting for EGI:
      * http://accounting-devel.egi.eu/show.php?ExecutingSite=CSCS-LCG2&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
      * http://accounting-devel.egi.eu/show.php?ExecutingSite=UNIBE-LHEP&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN
      * http://accounting-devel.egi.eu/show.php?ExecutingSite=UNIBE-ID&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
   * Pakiti made easy: https://pakiti.egi.eu/client.php?site=UNIBE-LHEP (a simple cron job on all WNs - requires access to the CAs; an illustrative sketch closes these minutes)
      * The Site Security Officer can check their own site: https://pakiti.egi.eu/
   * Issues with certificates in CH following the SWITCH withdrawal from the service as of 31st Aug 2015:
      * CERN is not an option for non-users, and the servers are not on the CERN network
      * TERENA CS (flat fee 27k) would deal only with NRENs (i.e. SWITCH)
      * Exploring possible solutions (EGI catch-all CA?)

---++ Other topics
   * UI accounts for CMS super users at the T2 for batch submission: possible?
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Gianni Ricciardi, Dino Conciatore, Dario Petrusic, Miguel Gila
   * CMS: Fabio Martinelli, Daniel Meister
   * ATLAS: Gianfranco Sciacca
   * UNIBE-ID: Michael Rolli
   * LHCb: Roland Bernet
   * EGI:

---++ Action items
   * Item1
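As a footnote to the Pakiti item in the NGI_CH section above: the real EGI =pakiti-client= does the reporting itself; purely to illustrate what the "simple cron job on all WNs" amounts to, a hypothetical minimal reporter might look like this (the endpoint URL is an assumption; the CA requirement enters via the HTTPS server verification):
<pre>
#!/bin/bash
# Illustrative only -- not the real pakiti-client. Hypothetical cron entry:
#   30 4 * * * root /root/bin/pakiti-report.sh
# Posts the installed package list to the Pakiti server over HTTPS;
# the CA directory below is why the WNs need access to the CAs.
rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE} %{ARCH}\n' \
  | curl --capath /etc/grid-security/certificates \
         --data-binary @- \
         https://pakiti.egi.eu/feed/   # hypothetical endpoint
</pre>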