<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2016-04-07 at 15:30

   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS
   * New ARC CE instance (arc03) installed, along with a new SLURM instance (15.08.8) and all the recently purchased WNs (this cluster is integrated into the CSCS LDAP and the central SLURM DB)
   * Certificate mess last week (Gianni's fault!): thanks to Gianfranco and Sigve for their help
   * Some time spent fixing the Information System [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=118922][(GGUS 118922)]]
   * Tentative planned maintenance on 2016-05-03 to replace the IB/Eth bridges, move some VMs and reinstall arc02
   * CREAM CEs to be decommissioned by the beginning of June
   * Accounting numbers (from the scheduler) for last month

*GPFS*
   * No issues to report
   * Metadata migration from local SSDs to FC flash should be performed on May 3rd

*dCache*
   * Almost ready to deploy the first 500 TB of new storage (from the NETAPP 5560)
   * The additional 500 TB (from the SFA12K) will be ready by the first part of May
   * Investigating some "unexpected" file deletions (CMS)

---+++ PSI
   * Put the new CentOS7/ZFS/NFSv4 /homes hierarchy into production
      * [[http://zfsonlinux.org/][ZFS On Linux]]
      * [[http://t3mon.psi.ch/ganglia/PSIT3-custom/space.report][/homes hierarchy space report]]
      * [[http://t3mon.psi.ch/ganglia/host_gmetrics.php?c=PSI%20Tier3%20services&h=t3nfs01.psi.ch][/homes hierarchy Ganglia ZFS/NFSv4 stats]]
   * Installing 9 new Dalco servers (2 disks were dead on arrival); each:
      * Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 64 cores (HT on)
      * 128GB RAM
      * 4 x 900GB 10k SAS disks in mdadm RAID 1+0, set up by Kickstart
      * made a 100GB partition formatted as XFS in order to test [[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-fscache.html][FS-Cache]] + NFSv4
   * [[http://t3mon.psi.ch/ganglia/PSIT3-custom/accounting.txt][Accounting numbers (from the scheduler) for last month]]

---+++ UNIBE-LHEP

*Operations*
   * Mostly stable operation on both systems, except for:
      * Some random failures on some ce01 nodes ( *trans:* Transformation not installed in CE)
         * leads to flipping between black- and white-listing by HC
         * usually a cvmfs-related problem, but cvmfs reports fine on all nodes
         * under investigation right now
      * eth0 dropped twice within 12h on the ce01 Lustre MDS: <br />Mar 31 08:26:14 mds-2-1 kernel: irq 75: nobody cared (try booting with the "irqpoll" option)<br />...<br />Mar 31 08:26:31 mds-2-1 kernel: e1000e 0000:03:00.0: eth0: Reset adapter unexpectedly
         * leaves Lustre hanging; needs power-cycling to recover (Lustre comes back quickly)
         * possibly flaky hardware; getting a spare card to plug in in case of recurrence

*ATLAS-specific operations*
   * HC online 33% (last month, single-core only - not a huge impact, since over 80% of the work is MCORE): http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE
   * 63% of ATLAS/CH walltime, 70% of CPU time in March: http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites[]=CSCS-LCG2&sites[]=UNIBE-LHEP&sitesCat[]=CH-CHIPP-CSCS&resourcetype=All&sitesSort=2&sitesCatSort=2&start=2016-03-01&end=2016-03-31&timerange=daily&granularity=Monthly&generic=0&sortby=0&series=All&activities[]=all
   * Still on ice: no progress on the storage dumps requested by ATLAS (due to no progress in the re-deployment of the DPM head node on SLC6)
      * but I have asked to re-discuss this within ADC (in my view this should be implemented at the middleware level)
   * UNIBE-LHEP_CLOUD and UNIBE-LHEP_CLOUD_MCORE operating stably
   * *Accounting numbers (from the scheduler) for last month (Mar 2016)* - *NOTE*: ce03/CLOUD not reported yet
      * WC h: 936908 (ATLAS) - 149450 (t2k.org) - 13838 (uboone) - 13 (ops)
   * *Accounting numbers (from the ATLAS dashboard) for last month* (Mar 2016)
      * CPU h: 672148 (933386.8 with cloud)
      * WC h: 909450 (1243195.7 with cloud)

---+++ UNIBE-ID
   * All servers (but one) moved from RHEL to CentOS and all puppetized - finally
   * Short storage outages in March
      * in February: upgrade ESS-3.0 (GPFS-4.1.0) => ESS-3.5 (GPFS-4.1.1)
      * => GPFS cluster overload at certain moments => stale file handles
      * turned off certain logging/tracing facilities in GPFS
      * now perfectly stable since
~3 weeks ago
   * Ordered 76 additional nodes, on top of the 32 nodes ordered last December:
      * Intel Xeon E5-2630v4 @ 2.2GHz, 20 cores (HT off)
      * 128GB RAM
      * => a homogeneous queue of 108 nodes (2160 cores) exclusively for MPI usage

---+++ UNIGE
   * Production:
      * Running smoothly in test mode for ATLAS (still pending some checks)
      * High load on the cluster from local users (need to watch the batch system more closely, since nodes go down more often)
      * Host certificates recently replaced for the DPM head and disk nodes + ARC-CE (running late because the e-mails were sent to Szymon)
   * Storage:
      * The ATLASLOCALGROUPDISK space token was almost full; now (after some cleaning of old datasets) it is ~75% full (~106 TB free)
      * Only one user from UniGe with a useful dataset at CSCS; moving the datasets to UniGe. Then, merge ATLASLOCALGROUPDISK with ATLASDISK
      * Providing ATLAS storage dumps every month
   * Outlook:
      * 3 User Interfaces with SLC5 will be decommissioned; maybe start the move to CentOS
   * Accounting numbers (from the scheduler) for last month

---+++ NGI_CH
   * Nothing of relevance
   * NGI-CH open tickets review
      * CSCS-LCG2
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120551">120551</a>: CSCS-LCG2_MCORE : 75%+ jobs failed with ... (ATLAS team) - Not fully fixed yet (blacklisted right now, some HC jobs do not run)
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120505">120505</a>: Large amount of GLEXEC ERRORS on T2_CH_C.. (CMS) - Not touched for a week, changed to "waiting for reply"
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120405">120405</a>: Problem with accessing files at CSCS-LCG... (LHCb team) - In progress
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=119171">119171</a>: Workflow failures at T2_CH_CSCS (CMS) - Changed to "waiting for reply"
      * UNIBE-LHEP
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120257">120257</a>: glidein validation errors for Microboone... (UBOONE) - Following up in OSG; this should be closed
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899">117899</a>: ATLAS request- storage consistency check...
(ATLAS) - On hold
      * NGI_CH
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120184">120184</a>: NGI_CH - February 2016 - RP/RC OLA performance - Slow response to 2 tickets (average March response: 8.51):
            * <a target="_blank" href="https://ggus.eu/?mode=ticket_info&ticket_id=120045">https://ggus.eu/?mode=ticket_info&ticket_id=120045</a> (LHCb on arcbrisi)
            * <a target="_blank" href="https://ggus.eu/?mode=ticket_info&ticket_id=120293">https://ggus.eu/?mode=ticket_info&ticket_id=120293</a> (duplicate of the above, handled immediately, so: ???)
            * "please remember to set the proper status when handling the tickets"
               * replied to it now

---++ Other topics
   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Pablo, Dario, Dino, Gianni
   * CMS: Fabio
   * ATLAS: Luis
   * LHCb: Roland
   * EGI:

---++ Action items
   * Item1
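The FS-Cache + NFSv4 test mentioned in the PSI report above boils down to two small pieces of configuration; a minimal sketch, assuming the 100GB XFS test partition is mounted at /var/cache/fscache and that t3nfs01.psi.ch exports /homes (cache directory, tag, and export path are assumptions, not the actual PSI setup):

```text
# /etc/cachefilesd.conf -- cachefilesd keeps its cache on the 100GB XFS test partition
dir /var/cache/fscache
tag t3cache

# /etc/fstab -- the 'fsc' mount option routes NFSv4 reads through FS-Cache
t3nfs01.psi.ch:/homes  /homes  nfs4  fsc,rw  0 0
```

With cachefilesd running, cache hit/miss counters can then be read from /proc/fs/fscache/stats on the client.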
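The mdadm 1+0 layout on the new Dalco servers (PSI report above) can be expressed as a Kickstart fragment along these lines; a hypothetical sketch only, with disk names and mount point assumed rather than taken from the actual PSI Kickstart:

```text
# Hypothetical Kickstart fragment: mdadm RAID 10 (1+0) across the 4 x 900GB SAS disks
part raid.01 --size=1 --grow --ondisk=sda
part raid.02 --size=1 --grow --ondisk=sdb
part raid.03 --size=1 --grow --ondisk=sdc
part raid.04 --size=1 --grow --ondisk=sdd
raid / --device=md0 --level=10 --fstype=xfs raid.01 raid.02 raid.03 raid.04
```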
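As a side note on the UNIBE-LHEP accounting above: the per-VO scheduler walltime figures can be cross-checked with a few lines of Python (a quick sketch using only the numbers reported in these minutes):

```python
# Cross-check of the UNIBE-LHEP scheduler walltime numbers quoted above
# (Mar 2016; ce03/CLOUD not included).
walltime_h = {"ATLAS": 936908, "t2k.org": 149450, "uboone": 13838, "ops": 13}

total = sum(walltime_h.values())
atlas_share = walltime_h["ATLAS"] / total

print(total)                  # 1100209 walltime hours across all VOs
print(round(atlas_share, 3))  # 0.852: ATLAS dominates the site's usage
```

The ATLAS scheduler figure (936908 WC h) is within a few percent of the ATLAS dashboard figure (909450 WC h without cloud), which is the usual level of agreement between the two accounting sources.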
Topic revision: r18 - 2016-04-07 - MichaelRolli