<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2016-01-14

   * *Date and time*: at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS

   * *Compute*
      * Confirmed the order of 40 WNs. *Details*: DALCO r2264i5t 2U Scalable Compute Module (4 nodes in one 2U chassis); each node has:
         * 2 x 14-core Intel Xeon E5-2680 v4 2.4 GHz processor (~Q1 2016)
         * 128 GB 2133 MHz DDR4 ECC registered server memory
         * 1 x 120 GB Intel SSD drive
         * 2 x Intel 10/100/1000 Gigabit Ethernet onboard
         * 1 x ETH0/NIC1 LAN interface for IPMI & KVM over IP
         * 1 x Mellanox ConnectX-3 FDR dual-port adapter
         * Two high-efficiency 1600 W hot-plug redundant power supplies
      * As discussed with Gianfranco, we disabled the ~60-day reboot request for the WNs, to avoid nodes sitting idle while draining
      * Updated all WNs to SL 6.7 and refreshed their packages
      * Started working on the new sltop Python version, using the pyslurm Python module
      * Fixing some small issues with user creation via Puppet on the WNs
      * We have some HP nodes down
      * Accounting numbers (from the scheduler) for last month: [[http://ganglia.lcg.cscs.ch/ganglia/SLURM_REPORTS/phoenix_slurm_report_201512.txt][slurm report]] (a way to reproduce such numbers is sketched at the end of this section)
   * *Storage*
      * New hardware:
         * 4 x NetApp E5660 (500 TB in total) arrived; another 500 TB from the CSCS SAN will be ready soon
         * 4 x Lenovo M5 servers scheduled for delivery next week
      * *GPFS*
         * No issues or news
      * *dCache*
         * Upgrade to v2.10 has been completed
         * Still have to fix some issues
         * Dumps for ATLAS will work starting this month (host mapping from atlas01)
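Monthly numbers like the scheduler accounting above can be pulled straight from the SLURM accounting database; a minimal sketch, assuming slurmdbd accounting is enabled (the cluster name =phoenix= is a guess based on the report URL):

<verbatim>
# Overall cluster utilisation for December 2015, reported in hours
sreport -t Hours cluster utilization cluster=phoenix start=2015-12-01 end=2016-01-01

# CPU consumption broken down per account and user over the same window
sreport -t Hours cluster AccountUtilizationByUser cluster=phoenix \
        start=2015-12-01 end=2016-01-01
</verbatim>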
---+++ PSI

   * *Boot issues with HP Smart Array P440 Controller in HBA mode*
      * Had boot issues on CentOS 7 with the HP Smart Array P440 controller configured in HBA mode; had to manually upgrade the controller driver to =hpsa-3.4.12=; read about these driver-update cases [[http://cciss.sourceforge.net/][here]]; usually you won't want to do that, but..
   * *Nagios and HP HW*
      * I recommend [[https://labs.consol.de/nagios/check_hpasm/][check_hpasm]]
      * Deploy also on your management node the OS-independent [[http://www.gnu.org/software/freeipmi/download.html][FreeIPMI 1.5.1]] plus [[https://www.thomas-krenn.com/en/wiki/IPMI_Sensor_Monitoring_Plugin][check_ipmi_sensor_v3]]; for instance:
         * =/opt/nagios/check_ipmi_sensor -f /opt/nagios/check_ipmi_sensor.user.pwd.privilege -H rmnfs01 -v -O '-D lan_2_0 -t Fan'= reports:
         * IPMI Status: OK | 'Fan 1 DutyCycle'=13.72 'Fan 2 DutyCycle'=13.72 'Fan 3 DutyCycle'=22.34 'Fan 4 DutyCycle'=42.73 'Fan 5 DutyCycle'=53.70 'Fan 6 DutyCycle'=53.70
      * And because I'm using the HP servers in HBA mode, I also check the SMART values with [[http://www.claudiokuenzler.com/nagios-plugins/check_smart.php#.Vpd_bJMrLdQ][check_smart]]
   * *ZFS on CentOS 7*
      * Satisfied so far; be aware of the new Jan '16 release [[https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.5.4][0.6.5.4]]; [[https://github.com/zfsonlinux/zfs/issues?utf8=%E2%9C%93&q=is%3Aclosed+milestone%3A0.6.5.4+][here]] are the issues it closes (see also the health-check sketch at the end of this section)
      * Still not in production, though; there is a lot that is new in CentOS 7 that I want to learn thoroughly, and the Christmas break didn't help me
   * *Finally new HW at PSI in 2016*
      * With the current $$, in Q1 2016 we're going to buy:
         * 1 x Cisco 10 Gb/s Nexus fabric extender [[http://www.cisco.com/c/en/us/products/collateral/switches/nexus-2000-series-fabric-extenders/product_bulletin_c25-715278.html][2232TM-E]]; it's not a switch, read the specs
         * 9 x [[http://ark.intel.com/products/88294/Intel-Compute-Module-HNS2600TP24R][Intel Server HNS2600TP24R]]; each with 64 cores of type [[http://ark.intel.com/products/81060/Intel-Xeon-Processor-E5-2698-v3-40M-Cache-2_30-GHz][E5-2698 v3]], 128 GB RAM, 2 (or 4, if I have the $$ to make RAID 10) disks of type [[https://www.hgst.com/products/hard-drives/ultrastar-c10k900][Hitachi C10K900]], 10 Gb/s BASE-T onboard, 3-year warranty
         * 1 x [[http://www8.hp.com/us/en/products/proliant-servers/product-detail.html?oid=7252836][HP DL360 G9]] as our new management node, 5-year warranty
      * With new $$ after Q1 2016 we're going to buy:
         * 1 x NetApp [[http://www.microspot.ch/msp/de/harddisks-laufwerke/nas-storage/e2760-extreme-capacity-0001007649][E2760]], 60 x 6 TB disks (the link reports 60 x 4 TB), SAS, 5-year warranty; I consider the more powerful [[http://www.netapp.com/us/products/storage-systems/e5600/e5600-tech-specs.aspx][E5660]] overkill for our needs, and I also have to save $$ for the other HW
         * 1 x [[http://www8.hp.com/us/en/products/iss-controllers/product-detail.html?oid=6995463][HP Smart Array P841/4GB FBWC 12Gb 4-ports Ext SAS Controller (726903-B21)]], to be deployed inside one of my two HP DL380 G9 to connect the E2760 above
         * Probably another smaller batch of [[http://ark.intel.com/products/88294/Intel-Compute-Module-HNS2600TP24R][Intel Server HNS2600TP24R]]
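Since ZFS on CentOS 7 is still in evaluation here, a minimal sketch of the routine health checks worth scripting before production use (the pool name =data= is a placeholder):

<verbatim>
# Report only pools with problems; prints "all pools are healthy" otherwise
zpool status -x

# Capacity, allocation and compression overview for the hypothetical "data" pool
zpool list data
zfs list -r -o name,used,avail,compressratio data

# Periodic scrub, e.g. from a monthly cron job
zpool scrub data
</verbatim>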
---+++ UNIBE-LHEP

   * *Operations*
      * ce02 re-installed with ROCKS 6.2, SLC 6.7 and SLURM 15.08.04 (320 worker cores still to be installed)
      * ce01 operating with SLURM 15.08.01 for two and a half months, very stable (256 new cores awaiting delivery)
      * 1980 cores in total once the installation is finalised
      * Finalised the integration of the MicroBooNE VO (OSG, Fermilab) on both clusters
      * New CE (ce04) fronting a SLURM cluster running on the SWITCHengines cloud infrastructure
         * Elasticluster on OpenStack
         * Integrated in PanDA (UNIBE-LHEP_CLOUD)
         * Currently being commissioned with HammerCloud
   * *ATLAS-specific operations*
      * Very stable running over the end-of-year break
      * Monthly dumps of the namespace on the DPM SE still depend on the re-deployment of the DPM head node on SLC6 (moving from YAIM to Puppet for configuration)
   * Accounting numbers (from the scheduler) for last month
      * Incomplete and provisional (ce02 missing)
      * CPU hours: 410292 (ATLAS), 239 (t2k.org), 572 (uboone), 18 (ops)

---+++ UNIBE-ID

   * Xxx

---+++ UNIGE

   * *Operations*
      * Running smoothly during the last month and the Christmas break (lower user activity on the cluster)
      * Outlook: we will install Puppet for DPM (and probably also for cluster configuration and setup)
         * A testbed with a file server (atlasfs29) and 1 PC for services (Puppet)
      * Outlook: we are going to request around 3 GPUs (image processing), following a user request
   * *Network (outlook)*
      * We are going to request a new 10 Gb/s network switch (network switch upgrade) for the cluster
         * Mainly for the data management system (DPM and NFS, primarily NFS)
      * The request will be made at the end of the month; if approved, we could have it by summer
   * *Storage*
      * Checking the data stored on the DPM SE (dump) for cleaning purposes, as requested by ATLAS
      * Not sent to ATLAS yet (scheduled to be done)
   * *Accounting*
      * Accounting numbers (from the scheduler) for last month (see the attached =Accounting_last_month.png=)
      * For accounting, at least as a rough number, we have Ganglia for monitoring purposes: http://atlasgrid.unige.ch/ganglia/?r=month&cs=&ce=&c=slc6-services&h=&tab=m&vn=&hide-hf=false&m=jobs_running&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name

---+++ NGI_CH

   * Reminder: *SL5 services must be decommissioned by end of April 2016* (a quick survey sketch follows this list)
      * Affected:
         * UNIBE-LHEP (DPM head node)
         * UNIGE-DPNC? (some DPM pool nodes?)
         * CSCS-LCG2? (some dCache pool nodes?)
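A minimal sketch for the SL5 survey above: loop over the candidate service nodes and print their OS release (the host names are hypothetical placeholders; substitute each site's real DPM/dCache head and pool nodes):

<verbatim>
# Hypothetical host list; replace with the real DPM / dCache nodes
for h in dpm-head.example.ch pool01.example.ch pool02.example.ch; do
    printf '%-25s ' "$h"
    ssh -o BatchMode=yes "$h" cat /etc/redhat-release 2>/dev/null || echo unreachable
done
</verbatim>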
   * *Feedback invited: distributing middleware as Docker images*
      * The proposal is to release UMD4 products as Docker images in addition to RPMs
      * Pros: images can run directly on hardware, no virtualization platform needed
      * Cons: maybe hard to create/maintain
      * To be provided by technology providers (TPs) and/or volunteer sites
      * Possible profiles: site/top BDII, CEs
      * (A sketch of what running such an image could look like is at the end of these minutes)
   * *NGI-CH open tickets*: https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO
      * UNIBE-LHEP:
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=118692][118692]] (HTTP support on the DPM SE): still needs investigation
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899][117899]] (storage dumps for ATLAS): on hold, needs the DPM head node on SLC6 first
      * CSCS-LCG2:
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=118253][118253]] (decommissioning dCache 2.6): can be closed
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117786][117786]] (storage dumps for ATLAS)
      * UNIGE-DPNC:
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117900][117900]] (storage dumps for ATLAS)

---++ Other topics

   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants

   * CSCS:
   * CMS: Fabio, Joosep
   * ATLAS: Gianfranco
   * LHCb: Roland
   * EGI: Gianfranco

---++ Action items

   * Item1
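As a concrete footnote to the Docker discussion under NGI_CH above, a minimal sketch of how a site might run a containerised site BDII. The image name =umd/site-bdii= is purely hypothetical (UMD publishes no such image today); 2170 is the standard BDII LDAP port:

<verbatim>
# Pull and start a hypothetical UMD4 site-BDII image,
# mounting the site configuration read-only from the host
docker pull umd/site-bdii:4.0
docker run -d --name site-bdii -p 2170:2170 -v /etc/bdii:/etc/bdii:ro umd/site-bdii:4.0

# Check that the BDII answers LDAP queries (needs openldap-clients)
ldapsearch -x -H ldap://localhost:2170 -b o=grid
</verbatim>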
---++ Attachments

   * =Accounting_last_month.png= (39.7 K, uploaded 2016-01-14 by LuisMarch): UniGe accounting (last month)