<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2015-03-05

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 9305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via e-mail)

%TOC%

---++ Site status

---+++ CSCS
   * Three new members of CSCS joined Phoenix: two new system engineers (Dario Petrusic and Dino Conciatore) and systems group lead Nicholas Cardo. They will be actively participating in all our meetings and operations starting now.
   * They already have access to the systems, including TWiki and chat.
   * Working to set certificate roles in /dteam and dteam/NGI_CH; then they will be added to GOC and Nagios@CSCS.
   * Maintenance 10.03.15:
      * dCache upgrade: security updates and dCache to 2.6.46
      * GPFS config update: new maxFilesToCache setting in place, raising the limit from 40k to 50k files per node
      * Reinstallation of as many WNs as possible with the latest EMI-WN packages (plus security updates)
      * Shutdown and physical removal of old hardware (puppet, nfs0[1-2], se[01-06], 3x 1/2 racks of IBM DC3500 storage)
   * A.O.B.
      * Migration to Puppet 3.6 ongoing; new roles have been created, but more work needs to be done. Managed to migrate cfengine from ageing hardware (>1000 days of uptime!) to a new VMware VM.
      * At some point before summer we will need to upgrade GPFS (to v. 4) and dCache (to v. 2.10).
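The maxFilesToCache change from the maintenance list can be done with the standard GPFS admin commands; a minimal sketch, assuming a node class named =worker= (the class name is hypothetical, and this is not necessarily CSCS's actual procedure):

```shell
# Raise the per-node GPFS file cache limit from 40k to 50k files.
# 'worker' is a hypothetical node class covering the WNs; adjust to the real one.
mmchconfig maxFilesToCache=50000 -N worker

# Show the configured value; the new limit only takes effect once the
# GPFS daemon is restarted on the affected nodes (e.g. during the maintenance).
mmlsconfig maxFilesToCache
```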
   * (Gianfranco) ATLAS lcgadmin and pilot roles to be enabled/fixed on the ARC CEs

---+++ PSI
   * A major *NetApp E5400 error*, =A drawer in the tray has become degraded=, led to =lost 1/2 redundant paths to 12*3TB disks=. It was a FW bug, solved by updating the NetApp E5400 FW ONLINE to =7.86.49.00=; the Linux-side RDAC driver gracefully moved the paths from one RAID controller to the other, and back, during the controller reboots. I didn't have to unmount the XFS filesystems or stop dCache; something you cannot get in the NAS world. CSCS should update the FW as well.
   * In the same NetApp E5400 I also got 2 broken 3TB disks.
   * In both cases I got the native NetApp e-mails, routed through =iptables NAT=, but also a [[https://bitbucket.org/fabio79ch/check_netapp/wiki/Home][Nagios e-mail]].
   * Preparing the [[http://www.dcache.org/downloads/1.9/timeline-dCache.svg][dCache 2.6 to 2.10 migration]]; in my case this also means upgrading PostgreSQL from 9.3 to 9.4, also because of this [[https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.4#REFRESH_MATERIALIZED_VIEW_CONCURRENTLY][news]]; luckily my [[https://bitbucket.org/fabio79ch/v_pnfs/wiki/Home][Chimera Materialized Views]] still work out of the box, but there are some new table fields that I should include in the future.
   * Using *Puppet standalone* over =/afs= because it's 10 times faster than having a Puppet master, and the clients don't crash. Each SL6 server in my cluster mounts =/afs=, and there is an =/afs= dir where both my Puppet recipes and the conf files are stored; this =/afs= dir and its descendants are protected by AFS ACLs, and only the =root= account on the SL6 server can access it, by using a [[https://kb.iu.edu/d/aumh][Kerberos keytab]] file. Example: <pre>
%BLUE%#%ENDCOLOR% ll /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/
ls: cannot access /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/: %RED%Permission denied%ENDCOLOR%
%BLUE%#%ENDCOLOR% kinit -k -t /root/afs-keytabs/svcusr-t3_puppet.keytab svcusr-t3_puppet@D.PSI.CH && aklog
%BLUE%#%ENDCOLOR% ll /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/
total 4
lrwxr-xr-x 1 martinelli_f cms   22 Jan 15 16:14 manifests -> puppet/TRUNK/manifests
lrwxr-xr-x 1 martinelli_f cms   20 Jan 15 16:14 modules -> puppet/TRUNK/modules
drwxr-xr-x 4 martinelli_f cms 2048 Jan 15 15:48 puppet
</pre>
   * Many other tasks, but specific to PSI or CMS.

---+++ UNIBE-LHEP
   * Operations
      * Slow recovery of the *ce01* cluster following the kernel+glibc security updates of January:
         * straightforward RPM upgrades would not work; needed to re-image the WNs and re-install
         * issue with the OpenIB modules freezing at shutdown; this implies power-cycling every node whenever a re-boot (or re-installation) is needed
         * turned out our IB stack (not updated for ~2 years) had an outdated setup
         * however, even after a general update of the setup, the rdma modules are still not unloaded cleanly at shutdown (even if Lustre was never started)
         * cooked up a shutdown script that (theoretically) unloads everything cleverly before running into the system freeze/crash
         * the permanent solution is, I suppose, to re-image the WNs from scratch; however, this implies re-building ROCKS (and the CE) from scratch
      * *ce02* cluster needed power-cycling of the ethernet switches on 30th Jan; stable thereafter, but almost halved in capacity
      * Added cron jobs on both CEs to recover a-rex after crashing.
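A cron-driven recovery job like that can be as simple as a process check plus a restart. A minimal sketch; the =arched= process name and the init-script path are assumptions, not the actual UNIBE-LHEP script:

```shell
#!/bin/sh
# Minimal watchdog: restart a daemon when no process matches a pattern.
# Usage: guard_service PATTERN RESTART_CMD [ARGS...]
guard_service() {
    pattern="$1"; shift
    if pgrep -f "$pattern" >/dev/null 2>&1; then
        return 0        # daemon is alive, nothing to do
    fi
    "$@"                # daemon is gone: run the restart command
}

# Hypothetical invocation ('arched' daemon name and path are assumptions):
# guard_service arched /etc/init.d/a-rex start
```

Run from cron every few minutes, e.g. =*/10 * * * * root /usr/local/sbin/arex-guardian.sh= (path again an assumption).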
      * Crash logs show typically two crashes per month.
   * ATLAS specific operations
      * ATLAS still pretty quiet, picking up now
      * Revived the webdav access to the SE (ATLAS request)
      * Monitoring:
         1 SAM Nagios ATLAS_CRITICAL: http://wlcg-sam-atlas.cern.ch/templates/ember/#/plot?flavours=SRMv2%2CCREAM-CE%2CARC-CE&group=All%20sites&metrics=org.sam.CONDOR-JobSubmit%20%28%2Fatlas%2FRole_lcgadmin%29%2Corg.atlas.WN-swspace%20%28%2Fatlas%2FRole_pilot%29%2Corg.atlas.WN-swspace%20%28%2Fatlas%2FRole_lcgadmin%29%2Corg.atlas.SRM-VOPut%20%28%2Fatlas%2FRole_production%29%2Corg.atlas.SRM-VOGet%20%28%2Fatlas%2FRole_production%29%2Corg.atlas.SRM-VODel%20%28%2Fatlas%2FRole_production%29&profile=ATLAS_CRITICAL&sites=CSCS-LCG2%2CUNIBE-LHEP%2CUNIGE-DPNC&status=MISSING%2CUNKNOWN%2CCRITICAL%2CWARNING%2COK
         1 HammerCloud gangarobot: http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UNIBE-LHEP&startTime=2015-02-01&endTime=2015-03-04&templateType=isGolden

---+++ UNIBE-ID
   * Procurement
      * Another 16 Dalco compute nodes are installed, set up and running smoothly; ingredients:
         * 2x 8C Intel Xeon E5-2650v2 2.6GHz
         * 128GB 1866MHz DDR ECC REG (8*16GB)
         * 1x 1TB 7.2k rpm SATA 6.0Gb/s
         * 2x Gigabit-Ethernet onboard
         * Infiniband ConnectX-3 QDR HCA
      * Prepared a tender to buy replacement storage
         * IBM GSS24 with 3TB disks => 696 TB total capacity; ~510 TB usable capacity
   * Decommissioning
      * 23 Sun X2200 pizza boxes shut down and dumped
      * 25 remaining, marked to be dumped within the next two months
   * Operations
      * smooth and reliable, except...
      * ... nordugrid-arc-bdii dead for almost a week while I was on holiday => bad performance value in the monthly report
         * the same happened in January and at the beginning of this week
         * now installed a cron-based guardian like we already have for a-rex (which, btw, was very stable over the last few months)
   * AOB:
      * (Gianfranco) ATLAS pilot role to be enabled/fixed on the ARC CE

---+++ UNIGE
   * Xxx
   * AOB:
      * (Gianfranco) ATLAS request to enable multicore jobs (instructions for Grid Engine were sent by mistake, but Geneva runs Torque)

---+++ NGI_CH
   * January 2015 - RP/RC OLA performance: http://snf-631462.vm.okeanos.grnet.gr:8080/lavoisier/site_reports?ngi=NGI_CH
      * UNIBE-ID low (understood): https://ggus.eu/index.php?mode=ticket_info&ticket_id=111896
   * Multicore accounting for EGI:
      * http://accounting-devel.egi.eu/show.php?ExecutingSite=CSCS-LCG2&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
      * http://accounting-devel.egi.eu/show.php?ExecutingSite=UNIBE-LHEP&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN
      * http://accounting-devel.egi.eu/show.php?ExecutingSite=UNIBE-ID&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
   * Pakiti made easy: https://pakiti.egi.eu/client.php?site=UNIBE-LHEP (simple cron job on all WNs; requires access to the CAs)
      * Site Security Officers can check their own site: https://pakiti.egi.eu/
   * Issues with certificates in CH following SWITCH's withdrawal from the service as of 31st Aug 2015
      * CERN not an option for non-CERN users, nor for servers not on the CERN network
      * TERENA CS (flat fee 27k) would deal only with NRENs (i.e. SWITCH)
      * Exploring possible solutions (EGI catch-all CA?)

---++ Other topics
   * UI accounts for CMS super users at the T2 for batch submission possible?
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Gianni Ricciardi, Dino Conciatore, Dario Petrusic, Miguel Gila
   * CMS: Fabio Martinelli, Daniel Meister
   * ATLAS: Gianfranco Sciacca
   * UNIBE-ID: Michael Rolli
   * LHCb: Roland Bernet
   * EGI:

---++ Action items
   * Item1
Topic revision: r14 - 2015-06-09 - FabioMartinelli