<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2016-02-04 at 14:00

   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS

*Storage*
   * Hardware / physical installation:
      * 8 Feb: new dCache servers (4x)
      * 8 Feb: MPO cabling to connect Phoenix to the CSCS SAN
      * 9 Feb: NetApp E5660 (~0.5 PB)
   * dCache:
      * The "cleaner problem" (mainly affecting CMS) is no longer present; space is freed automatically, as expected
      * ATLAS dumps are in place; something still to adjust for 'atlasgroupdisk/perf-egamma' and 'atlasscratchdisk' ( https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=428 )
   * GPFS:
      * Unplanned maintenance was needed on Wed 3 Feb to recreate the filesystem because of a metadata inconsistency problem

*Systems*
   * Preparing and consolidating racks for the new arrivals at the end of this month
   * Checking the published HEP-SPEC06 values (see the ldapsearch sketch after the NGI_CH section)
   * Tuned the SLURM configuration to improve cluster performance
   * Fixed two HP nodes: one with InfiniBand failures, the other with a faulty 1G management network card
   * Testing a complete Puppet installation for the worker nodes; it is working fine, only some CVMFS parameters and the CREAM wrapper script remain to be checked

*Accounting*
   * Accounting numbers (from scheduler) from last month (a =sreport= sketch follows the NGI_CH section):
      * http://ganglia.lcg.cscs.ch/ganglia/SLURM_REPORTS/phoenix_slurm_report_201601.txt

---+++ PSI

   * Xxx
   * Accounting numbers (from scheduler) from last month

---+++ UNIBE-LHEP

*Operations*
   * Nothing significant to report; stable operation on both systems
   * 256 new cores were delivered yesterday; we hope to deploy them before the weekend

*ATLAS-specific operations*
   * No progress on the storage dumps requested by ATLAS (due to no progress in the re-deployment of the DPM head node on SLC6)
   * ANALY_UNIBE-LHEP is blacklisted in HammerCloud: no time to debug yet, but the impact is low since there are currently not many ANALY jobs
   * A couple of stable weeks of operation for UNIBE-LHEP_CLOUD_MCORE, then we lost the cluster and could not fix it yet

*Accounting*
   * Accounting numbers (from scheduler) from last month (Jan 2016)
      * CPU h: 792492 (ATLAS) - 12671 (t2k.org) - 1879 (uboone) - 25 (ops)
   * Accounting numbers (from ATLAS dashboard) from last month (Jan 2016)
      * CPU h: 662466 (774848 with cloud)
      * WC h: 679368 (796292 with cloud)

---+++ UNIBE-ID

   * Xxx
   * Accounting numbers (from scheduler) from last month

---+++ UNIGE

*Operations*
   * Running smoothly; higher user activity since the last meeting
   * Grid (ATLAS) jobs: UNIGE-DPNC is in "Test" status and ~1/3 of jobs failed, apparently because they ran out of memory; needs checking
   * A scheduled downtime is planned at some point, needed for system and security upgrades (also related to getting involved in ATLAS production)

*Storage*
   * Dump of the DPM SE for ATLAS finally submitted (this dump should be provided once a month)
   * In addition to these ATLAS checks, we should clean our DPM: old user data and other projects (to be done)

*Outlook*
   * Request for a network switch upgrade to 10 Gb/s plus the acquisition of 3 GPUs has been submitted (resolution expected around March 2016)
   * GPU info (NVIDIA): http://www.microspot.ch/msp/fr/pc-komponenten/grafikkarten/gainward-geforce-gtx-980-grafikkarten-gf-gtx-9-0000948922
   * A more detailed description of the requested GPU system:
      * TYAN B7079F77CV10HR-N 2X10C - 256GB - 4XGTX980 - 64GB
      * 4U, FT77C, C612
      * (10) 2.5" hot-swap bays
      * (8) PCI-E G3 x16, for NV GPU cards
      * 3200W (2+1) 80+ Platinum
      * 2x Intel Xeon E5-2620v3 six-core
      * 4x Samsung 16GB DDR4 DIMM, PC4-17000 (2133MHz), registered, ECC, low voltage (1.2V)
      * 1x Samsung SSD 850 PRO 256GB
      * 8x Gainward GTX980, 4GB GDDR5, PCI-E 3.0 x16
   * Install Puppet for the DPM SE (and probably also for cluster configuration and setup, replacing YAIM)

*Accounting*
   * Accounting numbers (from scheduler) from last month

---+++ NGI_CH

   * Nothing to report
   * NGI_CH open tickets review: https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO
   * CSCS-LCG2
      * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117786][117786]] (ATLAS: storage dumps) almost done - two paths still to be fixed
      * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=119021][119021]] (LHCb team: jobs failed) no information provided - changed to "waiting for reply"
      * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=119171][119171]] (CMS: workflow failures) in progress
   * UNIBE-LHEP
      * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899][117899]] (ATLAS: storage dumps) on hold
   * NGI_CH
      * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=118922][118922]] (affects CSCS-LCG2 and UNIBE-LHEP): GlueSubClusterPhysicalCPUs and GlueSubClusterLogicalCPUs in the BDII - explicit notification to CSCS-LCG2 added; a query sketch follows below
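Related to ticket 118922 and the HEP-SPEC06 check in the CSCS section: the values a site publishes can be inspected directly on its BDII with =ldapsearch=. A minimal sketch, assuming the GLUE 1.3 schema on the standard BDII port 2170; the host name =bdii.example.ch= is a placeholder, and the benchmark attribute actually published may differ per site:

<verbatim>
# Inspect the CPU counts (and the SpecInt2000 benchmark) a site publishes
# in the BDII, GLUE 1.3 schema. Host and site names below are placeholders:
# substitute the real site BDII host and the site's Mds-Vo-name.
SITE_BDII=bdii.example.ch
SITE_NAME=CSCS-LCG2
ldapsearch -x -LLL -h "$SITE_BDII" -p 2170 \
    -b "Mds-Vo-name=$SITE_NAME,o=grid" \
    '(objectClass=GlueSubCluster)' \
    GlueSubClusterPhysicalCPUs GlueSubClusterLogicalCPUs GlueHostBenchmarkSI00
</verbatim>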
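Several of the "accounting numbers (from scheduler)" items above refer to monthly totals from SLURM. A minimal sketch of how such numbers can be pulled from the SLURM accounting database, assuming =slurmdbd= accounting is enabled; the cluster name =phoenix= matches the CSCS report linked above and should be adapted for other sites:

<verbatim>
# Per-account CPU usage in hours covering January 2016, as recorded by
# slurmdbd. Adjust the cluster name and the start/end dates as needed
# (the end date is exclusive).
sreport cluster AccountUtilizationByUser cluster=phoenix \
    start=2016-01-01 end=2016-02-01 -t Hours
</verbatim>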
---++ Other topics

   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants

   * CSCS:
   * CMS:
   * ATLAS: Luis March
   * LHCb:
   * EGI: Luis March

---++ Action items

   * Item1