<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2015-12-10

   * *Date and time*: 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS

   * *Storage*
      * dCache: stable, but we still have to run the cleaner manually. The upgrade to 2.10 will be performed on Wed 13 Jan 2016.
      * ATLAS: working on the monthly dumps.
      * GPFS (scratch): nothing to report.
      * New hardware: 4 servers for dCache and ~1 PB of storage. Working on moving the GPFS metadata disks to flash-based storage.
   * *Compute*
      * Added some check functions to the node health check:
         * SWAP cleaner
         * auto-resolution of some blackhole scenarios, e.g. automatic remount of file systems
         * after 60 + a random number of days, the node is put in drain for cleaning and reboot
      * Started some tests with the new Slurm version, to migrate sltop.
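The drain and swap-cleaner checks above could be sketched as follows. This is a minimal illustration only: the 1 GiB threshold, the 60 + random(0..30) day window, and all function names are assumptions, not the actual CSCS node health check.

```python
# Hypothetical sketch of the health-check additions described above.
# Thresholds and the drain window are assumptions, not the CSCS values.
import random

SWAP_LIMIT_KB = 1024 * 1024  # assumed: trigger the swap cleaner above 1 GiB


def swap_used_kb(meminfo_text):
    """Parse swap usage (kB) out of /proc/meminfo-style text."""
    fields = dict(line.split(":", 1)
                  for line in meminfo_text.splitlines() if ":" in line)
    total = int(fields["SwapTotal"].split()[0])
    free = int(fields["SwapFree"].split()[0])
    return total - free


def needs_swap_clean(used_kb):
    """SWAP cleaner check: flush swap (swapoff/swapon) when over the limit."""
    return used_kb > SWAP_LIMIT_KB


def drain_due(uptime_days, base_days=60, extra_days=30):
    """Drain the node after 60 + a random number of days of uptime."""
    return uptime_days >= base_days + random.randint(0, extra_days)
```

The per-node random offset spreads the reboots out in time, so the whole partition never drains at once.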
   * Today we will order 40 new compute nodes with E5-2680v4 CPUs.

---+++ PSI

   * Xxx

---+++ UNIBE-LHEP

   * *Operations*
      * ce01 cluster re-installation virtually completed (about 900 worker cores running, 120 still to be installed, 256 awaiting delivery)
      * Started with a simple Slurm setup (slurm-15.08.1) in order to cut down on commissioning time: one partition with <verbatim>SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
MemLimitEnforce=no</verbatim>
      * We no longer over-subscribe memory: nodes don't starve and crash.
      * Memory usage is properly accounted for in 15.08 (PSS): no jobs killed on an (artificial) over-limit of "vmem" (which is the full address space reserved by a process, not what is allocated or used).
      * Comparing job failure rates between ce01 and ce02 (still on the old SGE) has convinced me to rush the re-installation of ce02 (started earlier today).
   * *ATLAS-specific operations*
      * Stable workflows by ATLAS (a very large improvement since the beginning of Run II)
      * Stuck with the implementation of monthly dumps of the namespace on the DPM SE:
         * head node on SLC5: the dump script does not work, and generating a valid proxy is also problematic
         * decided to push the re-deployment of the head node on SLC6
            * legacy config tool (YAIM) no longer supported
            * Puppet-based configuration; got the right docs at the DPM workshop earlier this week at CERN
            * tests ongoing on a pps VM
         * also complicated by the fact that my site BDII is still co-located with the DPM head node
         * this will likely be the first task for 2016

---+++ UNIBE-ID

   * Xxx

---+++ UNIGE

   * *Operations*
      * atlasfs29.unige.ch: new certificate
      * Another file server has already been installed, but this is for the DAMPE experiment (no host certificate needed)
      * We have new hardware to be installed at the cluster: file servers and a couple of PCs for services
      * We will install Puppet for DPM and probably for cluster configuration and setup: we will use a testbed with atlasfs29 + 1 service PC (1 out of 2 of the previous
ones mentioned just above)
   * *Network - outlook*
      * We plan to buy a new 10 Gb/s network switch, but this is still under negotiation
      * Most likely, it will arrive in the beginning of next year
   * *Storage*
      * There was a DPM SE workshop at CERN on December 7th-8th: https://indico.cern.ch/event/432642/
      * Checking the data stored at the DPM SE for cleaning purposes, since ATLAS requested it
      * Checking the data in order to identify files which are registered in the catalogue (Rucio) but not physically at the DPM SE, and vice versa

---+++ NGI_CH

   * Nothing to report

---++ Other topics

   * Proposal to add to this meeting: T2 monthly pledge review (CSCS, UNIBE); GGUS open ticket review
   * Coverage over the holiday season
   * Next meeting date:

---++ A.O.B.

---++ Attendants

   * CSCS:
   * CMS:
   * ATLAS: Gianfranco, Luis March
   * LHCb:
   * EGI: Gianfranco

---++ Action items

   * Item1
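The catalogue-vs-SE consistency check mentioned in the UNIGE storage report could be sketched as below. The one-path-per-line dump format and all function names are assumptions for illustration, not the actual Rucio or DPM tooling.

```python
# Hypothetical sketch of a Rucio-catalogue vs. DPM namespace comparison.
# Assumes both dumps list one logical file name per line.

def load_paths(lines):
    """Collect non-empty, non-comment lines as a set of file names."""
    return {ln.strip() for ln in lines
            if ln.strip() and not ln.strip().startswith("#")}


def compare_dumps(catalogue_lines, storage_lines):
    """Return (lost, dark): 'lost' files are registered in the catalogue
    but missing on the SE; 'dark' files exist on the SE but are unknown
    to the catalogue -- the two directions mentioned above."""
    cat = load_paths(catalogue_lines)
    se = load_paths(storage_lines)
    return sorted(cat - se), sorted(se - cat)
```

In practice the monthly namespace dump of the SE would feed `storage_lines` and a catalogue export would feed `catalogue_lines`; the two result lists are the cleaning candidates.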
Topic revision: r6 - 2015-12-10 - LuisMarch