<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2016-01-14

   * *Date and time*: 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS

   * *Compute*
      * Confirmed the order of 40 worker nodes (WN)
      * *Details*: DALCO r2264i5t 2U Scalable Compute Module (4 nodes in one 2U chassis); each node provides:
         * _2 x 14-core Intel Xeon E5-2680 v4 2.4 GHz processors (~Q1 2016)_
         * _128 GB 2133 MHz DDR4 ECC registered server memory_
         * _1 x 120 GB Intel SSD drive_
         * _2 x Intel 10/100/1000 Gigabit-Ethernet ports onboard_
         * _1 x ETH0/NIC1 LAN interface for IPMI & KVM over IP_
         * _1 x Mellanox ConnectX-3 FDR dual-port adapter_
         * _Two high-efficiency 1600 W hot-plug redundant power supplies_
      * As discussed with Gianfranco, we disabled the reboot request every ~60 days for the WNs, to avoid idle nodes
      * Updated all WNs to SL 6.7 and refreshed their packages
      * Started working on the new Python version of sltop, using the pyslurm module
      * Fixing some small issues with user creation via Puppet on the WNs
      * We have some HP nodes down
      * Accounting numbers (from the scheduler) for last month ([[http://ganglia.lcg.cscs.ch/ganglia/SLURM_REPORTS/phoenix_slurm_report_201512.txt][slurm report]])
   * *Storage*
      * New hardware:
         * 4 x NetApp E5660 (500 TB in total) arrived; another 500 TB from the CSCS SAN will be ready soon
         * 4 x Lenovo M5 servers scheduled for delivery next week
      * *GPFS*: no issues or news
      * *dCache*:
         * The upgrade to v2.10 has been completed
         * Still have to fix some issues
         * Dumps for ATLAS will work starting this month (host mapping from atlas01)

---+++ PSI

   * *Boot issues with the HP Smart Array P440 controller in HBA mode*
      * Had boot issues on CentOS 7 with the controller configured in HBA mode; had to manually upgrade the controller driver to =hpsa-3.4.12=; read about these driver-update cases [[http://cciss.sourceforge.net/][here]]; usually you won't want to do that, but here it was unavoidable
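As an aside on the sltop rewrite mentioned in the CSCS compute notes: a tool like that presumably aggregates the job dictionary that pyslurm exposes. A minimal sketch of such an aggregation, assuming pyslurm's classic =pyslurm.job().get()= API (a dict keyed by job id whose values carry a =job_state= field; this shape is an assumption, not confirmed by the minutes):

```python
from collections import Counter

def summarize_jobs(jobs):
    """Count jobs per Slurm state (e.g. RUNNING, PENDING).

    `jobs` is a dict keyed by job id, each value a dict with at least
    a 'job_state' entry -- the shape assumed for pyslurm.job().get().
    """
    return Counter(info.get("job_state", "UNKNOWN") for info in jobs.values())

# On a real cluster one would feed live data (pyslurm usage is an
# assumption based on the meeting notes):
#   import pyslurm
#   print(summarize_jobs(pyslurm.job().get()))

# Illustrative sample data:
sample = {
    "1001": {"job_state": "RUNNING"},
    "1002": {"job_state": "RUNNING"},
    "1003": {"job_state": "PENDING"},
}
print(summarize_jobs(sample))  # Counter({'RUNNING': 2, 'PENDING': 1})
```

The counting itself is plain stdlib, so the same function works on any job dump regardless of the pyslurm version.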
   * *Nagios and HP HW*
      * I recommend [[https://labs.consol.de/nagios/check_hpasm/][check_hpasm]]
      * Also deploy on your management node the OS-independent [[http://www.gnu.org/software/freeipmi/download.html][FreeIPMI 1.5.1]] plus [[https://www.thomas-krenn.com/en/wiki/IPMI_Sensor_Monitoring_Plugin][check_ipmi_sensor_v3]]; for instance:
         * =/opt/nagios/check_ipmi_sensor -f /opt/nagios/check_ipmi_sensor.user.pwd.privilege -H rmnfs01 -v -O '-D lan_2_0 -t Fan'= reports:
         * IPMI Status: OK | 'Fan 1 DutyCycle'=13.72 'Fan 2 DutyCycle'=13.72 'Fan 3 DutyCycle'=22.34 'Fan 4 DutyCycle'=42.73 'Fan 5 DutyCycle'=53.70 'Fan 6 DutyCycle'=53.70
      * Since I'm using the HP servers in HBA mode, I also check the SMART values with [[http://www.claudiokuenzler.com/nagios-plugins/check_smart.php#.Vpd_bJMrLdQ][check_smart]]
   * *ZFS on CentOS 7*
      * Satisfied so far; be aware of the new January 2016 release [[https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.5.4][0.6.5.4]] ([[https://github.com/zfsonlinux/zfs/issues?utf8=%E2%9C%93&q=is%3Aclosed+milestone%3A0.6.5.4+][here]] are the issues it closes)
      * Still not in production, though; there is a lot that is new in CentOS 7 that I want to learn thoroughly, and the Christmas break didn't help
   * *Finally new HW at PSI in 2016*
      * With the current budget, in Q1 2016 we're going to buy:
         * 1 x Cisco 10 Gb/s Nexus fabric extender [[http://www.cisco.com/c/en/us/products/collateral/switches/nexus-2000-series-fabric-extenders/product_bulletin_c25-715278.html][2232TM-E]]; it's not a switch, read the specs
         * 9 x [[http://ark.intel.com/products/88294/Intel-Compute-Module-HNS2600TP24R][Intel Server HNS2600TP24R]]; each with 64 cores of type [[http://ark.intel.com/products/81060/Intel-Xeon-Processor-E5-2698-v3-40M-Cache-2_30-GHz][E5-2698 v3]], 128 GB RAM, 2 disks (or 4, if the budget allows a RAID10) of type [[https://www.hgst.com/products/hard-drives/ultrastar-c10k900][Hitachi C10K900]], 10 Gb/s BASE-T onboard, 3-year warranty
         * 1 x [[http://www8.hp.com/us/en/products/proliant-servers/product-detail.html?oid=7252836][HP DL360 G9]] as our new management node, 5-year warranty
      * With the new budget after Q1 2016 we're going to buy:
         * 1 x NetApp [[http://www.microspot.ch/msp/de/harddisks-laufwerke/nas-storage/e2760-extreme-capacity-0001007649][E2760]] with 60 x 6 TB SAS disks (the link reports 60 x 4 TB), 5-year warranty; I consider the more powerful [[http://www.netapp.com/us/products/storage-systems/e5600/e5600-tech-specs.aspx][E5660]] overkill for our needs, and I also have to save budget for the other HW
         * 1 x [[http://www8.hp.com/us/en/products/iss-controllers/product-detail.html?oid=6995463][HP Smart Array P841/4GB FBWC 12Gb 4-ports Ext SAS Controller (726903-B21)]], to be deployed inside one of my two HP DL380 G9 servers to connect the E2760 above
         * Probably another smaller batch of [[http://ark.intel.com/products/88294/Intel-Compute-Module-HNS2600TP24R][Intel Server HNS2600TP24R]] modules

---+++ UNIBE-LHEP

*Operations*
   * ce02 re-installed with ROCKS 6.2, SLC 6.7 and SLURM 15.08.04 (320 worker cores still to be installed)
   * ce01 has been operating with SLURM 15.08.01 for two and a half months and is very stable (256 new cores awaiting delivery)
   * 1980 cores once the installation is finalised
   * Finalised the integration of the MicroBooNE VO (OSG, Fermilab) on both clusters
   * New CE (ce04) fronting a SLURM cluster running on the SWITCHengines cloud infrastructure
      * ElastiCluster on OpenStack
      * Integrated in PanDA (UNIBE-LHEP_CLOUD)
      * Currently being commissioned with HammerCloud

*ATLAS-specific operations*
   * Very stable running over the end-of-year break
   * Monthly dumps of the namespace on the DPM SE still depend on re-deploying the DPM head node on SLC6 (moving the configuration from YAIM to Puppet)
   * Accounting numbers (from the scheduler) for last month
      * Incomplete and provisional (ce02 missing)
      * CPU hours: 410292 (ATLAS), 239 (t2k.org), 572 (uboone), 18 (ops)

---+++ UNIBE-ID

   * Xxx

---+++ UNIGE

*Operations*
   * Running smoothly during the last month and the Christmas break (lower user activity on the cluster)
   * Outlook: we will install Puppet for DPM (and probably also for cluster configuration and setup)
      * A testbed with one file server (atlasfs29) and one PC for services (Puppet)
   * Outlook: we are going to request around 3 GPUs (image processing), following a user request

*Network - Outlook*
   * We are going to request a new 10 Gb/s network switch for the cluster (network switch upgrade)
      * Mainly driven by the data management system (DPM and NFS, mostly NFS)
      * The request will be made at the end of the month; if approved, we could have the switch in summer

*Storage*
   * Checking the data stored on the DPM SE (dump) for cleaning purposes, as requested by ATLAS
      * The dump has not been sent to ATLAS yet (scheduled to be done)

*Accounting*
   * Accounting numbers (from the scheduler) for last month
   * For accounting, or at least a rough number, we have Ganglia for monitoring purposes:
      * [[http://atlasgrid.unige.ch/ganglia/?r=month&cs=&ce=&c=slc6-services&h=&tab=m&vn=&hide-hf=false&m=jobs_running&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name][UNIGE Ganglia (jobs running, last month)]]

---+++ NGI_CH

   * Reminder: *SL5 services must be decommissioned by the end of April 2016*
      * Affected:
         * UNIBE-LHEP (DPM head node)
         * UNIGE-DPNC? (some DPM pool nodes?)
         * CSCS-LCG2? (some dCache pool nodes?)
   * *Feedback invited: distributing middleware as Docker images*
      * Releasing UMD4 products as Docker images in addition to RPMs
      * Pros: can run directly on the hardware, no virtualization platform needed
      * Cons: may be hard to create/maintain
      * To be provided by TPs and/or volunteer sites
      * Possible profiles: site/top BDII, CEs
   * *NGI-CH open tickets*: https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO
      * UNIBE-LHEP:
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=118692][118692]] (HTTP support on the DPM SE): still needs investigation
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899][117899]] (storage dumps for ATLAS): on hold, needs the DPM head node on SLC6 first
      * CSCS-LCG2:
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=118253][118253]] (decommissioning dCache 2.6): can be closed
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117786][117786]] (storage dumps for ATLAS)
      * UNIGE-DPNC:
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117900][117900]] (storage dumps for ATLAS)

---++ Other topics

   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.
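One small technical note: the check_ipmi_sensor output quoted in the PSI section follows the usual Nagios plugin convention, with human-readable status before the =|= separator and machine-readable performance data after it. A minimal parser sketch for that kind of line (plain Python; the label/value format is taken from the quoted fan readings, and real perfdata may also carry warn/crit thresholds this sketch ignores):

```python
import re

def parse_perfdata(plugin_output):
    """Split a Nagios plugin line into (status_text, {label: value}).

    Performance data follows the '|' separator as label=value pairs;
    labels are single-quoted when they contain spaces, as in the
    check_ipmi_sensor fan readings.
    """
    status, _, perf = plugin_output.partition("|")
    metrics = {}
    # Match either 'quoted label'=value or bareword=value pairs.
    for q_label, q_val, label, val in re.findall(
            r"'([^']+)'=([-\d.]+)|(\S+)=([-\d.]+)", perf):
        if q_label:
            metrics[q_label] = float(q_val)
        elif label:
            metrics[label] = float(val)
    return status.strip(), metrics

# The exact line quoted in the PSI section:
line = ("IPMI Status: OK | 'Fan 1 DutyCycle'=13.72 'Fan 2 DutyCycle'=13.72 "
        "'Fan 3 DutyCycle'=22.34 'Fan 4 DutyCycle'=42.73 "
        "'Fan 5 DutyCycle'=53.70 'Fan 6 DutyCycle'=53.70")
status, fans = parse_perfdata(line)
print(status)                   # IPMI Status: OK
print(fans["Fan 4 DutyCycle"])  # 42.73
```

Something like this can turn the per-fan duty cycles into trend graphs, instead of only keeping the OK/WARNING/CRITICAL state.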
---++ Attendants

   * CSCS:
   * CMS: Fabio, Joosep
   * ATLAS: Gianfranco
   * LHCb: Roland
   * EGI: Gianfranco

---++ Action items

   * Item1
---++ Topic attachments

| *Attachment* | *History* | *Size* | *Date* | *Who* | *Comment* |
| Accounting_last_month.png | r1 | 39.7 K | 2016-01-14 - 13:45 | LuisMarch | UniGe accounting (last month) |
Topic revision: r14 - 2016-01-22 - FabioMartinelli