<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2016-01-14

   * *Date and time*: 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS

   * *Compute*
      * Confirmed the order of 40 worker nodes (WN)
      * *Details*: DALCO r2264i5t 2U Scalable Compute Module (4 nodes in one 2U chassis); each node provides:
         * _2 x 14-core Intel Xeon E5-2680 v4 2.4 GHz processors (~Q1 2016)_
         * _128 GB 2133 MHz DDR4 ECC registered server memory_
         * _1 x 120 GB Intel SSD drive_
         * _2 x Intel 10/100/1000 Gigabit-Ethernet ports onboard_
         * _1 x ETH0/NIC1 LAN interface for IPMI & KVM over IP_
         * _1 x Mellanox ConnectX-3 FDR dual-port adapter_
         * _Two high-efficiency 1600 W hot-plug redundant power supplies_
      * As discussed with Gianfranco, we disabled the reboot request every ~60 days for the WNs, to avoid idle nodes
      * Updated all WNs to SL 6.7 and refreshed their packages
      * Started working on the new Python version of sltop, using the pyslurm module
      * Fixing some small issues with user creation via Puppet on the WNs
      * We have some HP nodes down
      * Accounting numbers (from the scheduler) for last month ([[http://ganglia.lcg.cscs.ch/ganglia/SLURM_REPORTS/phoenix_slurm_report_201512.txt][slurm report]])
   * *Storage*
      * New hardware:
         * 4 x NetApp E5660 (500 TB in total) arrived; another 500 TB from the CSCS SAN will be ready soon
         * 4 x Lenovo M5 servers scheduled for delivery next week
      * *GPFS*: no issues or news
      * *dCache*:
         * The upgrade to v2.10 has been completed
         * Still have to fix some issues
         * Dumps for ATLAS will work starting this month (host mapping from atlas01)

---+++ PSI

   * *Boot issues with the HP Smart Array P440 controller in HBA mode*
      * Had boot issues on CentOS 7 with the controller configured in HBA mode; had to manually upgrade the controller driver to =hpsa-3.4.12=; read about these driver-update cases [[http://cciss.sourceforge.net/][here]]; usually you won't want to do that, but here it was unavoidable
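As an aside on the sltop rewrite mentioned in the CSCS compute notes: a tool like that presumably aggregates the job dictionary that pyslurm exposes. A minimal sketch of such an aggregation, assuming pyslurm's classic =pyslurm.job().get()= API (a dict keyed by job id whose values carry a =job_state= field; this shape is an assumption, not confirmed by the minutes):

```python
from collections import Counter

def summarize_jobs(jobs):
    """Count jobs per Slurm state (e.g. RUNNING, PENDING).

    `jobs` is a dict keyed by job id, each value a dict with at least
    a 'job_state' entry -- the shape assumed for pyslurm.job().get().
    """
    return Counter(info.get("job_state", "UNKNOWN") for info in jobs.values())

# On a real cluster one would feed live data (pyslurm usage is an
# assumption based on the meeting notes):
#   import pyslurm
#   print(summarize_jobs(pyslurm.job().get()))

# Illustrative sample data:
sample = {
    "1001": {"job_state": "RUNNING"},
    "1002": {"job_state": "RUNNING"},
    "1003": {"job_state": "PENDING"},
}
print(summarize_jobs(sample))  # Counter({'RUNNING': 2, 'PENDING': 1})
```

The counting itself is plain stdlib, so the same function works on any job dump regardless of the pyslurm version.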
   * *Nagios and HP HW*
      * I recommend [[https://labs.consol.de/nagios/check_hpasm/][check_hpasm]]
      * Also deploy on your management node the OS-independent [[http://www.gnu.org/software/freeipmi/download.html][FreeIPMI 1.5.1]] plus [[https://www.thomas-krenn.com/en/wiki/IPMI_Sensor_Monitoring_Plugin][check_ipmi_sensor_v3]]; for instance:
         * =/opt/nagios/check_ipmi_sensor -f /opt/nagios/check_ipmi_sensor.user.pwd.privilege -H rmnfs01 -v -O '-D lan_2_0 -t Fan'= reports:
         * IPMI Status: OK | 'Fan 1 DutyCycle'=13.72 'Fan 2 DutyCycle'=13.72 'Fan 3 DutyCycle'=22.34 'Fan 4 DutyCycle'=42.73 'Fan 5 DutyCycle'=53.70 'Fan 6 DutyCycle'=53.70
      * Since I'm using the HP servers in HBA mode, I also check the SMART values with [[http://www.claudiokuenzler.com/nagios-plugins/check_smart.php#.Vpd_bJMrLdQ][check_smart]]
   * *ZFS on CentOS 7*
      * Satisfied so far; be aware of the new January 2016 release [[https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.5.4][0.6.5.4]] ([[https://github.com/zfsonlinux/zfs/issues?utf8=%E2%9C%93&q=is%3Aclosed+milestone%3A0.6.5.4+][here]] are the issues it closes)
      * Still not in production, though; there is a lot that is new in CentOS 7 that I want to learn thoroughly, and the Christmas break didn't help
   * *Finally new HW at PSI in 2016*
      * With the current budget, in Q1 2016 we're going to buy:
         * 1 x Cisco 10 Gb/s Nexus fabric extender [[http://www.cisco.com/c/en/us/products/collateral/switches/nexus-2000-series-fabric-extenders/product_bulletin_c25-715278.html][2232TM-E]]; it's not a switch, read the specs
         * 9 x [[http://ark.intel.com/products/88294/Intel-Compute-Module-HNS2600TP24R][Intel Server HNS2600TP24R]]; each with 64 cores of type [[http://ark.intel.com/products/81060/Intel-Xeon-Processor-E5-2698-v3-40M-Cache-2_30-GHz][E5-2698 v3]], 128 GB RAM, 2 disks (or 4, if the budget allows a RAID10) of type [[https://www.hgst.com/products/hard-drives/ultrastar-c10k900][Hitachi C10K900]], 10 Gb/s BASE-T onboard, 3-year warranty
         * 1 x [[http://www8.hp.com/us/en/products/proliant-servers/product-detail.html?oid=7252836][HP DL360 G9]] as our new management node, 5-year warranty
      * With the new budget after Q1 2016 we're going to buy:
         * 1 x NetApp [[http://www.microspot.ch/msp/de/harddisks-laufwerke/nas-storage/e2760-extreme-capacity-0001007649][E2760]] with 60 x 6 TB SAS disks (the link reports 60 x 4 TB), 5-year warranty; I consider the more powerful [[http://www.netapp.com/us/products/storage-systems/e5600/e5600-tech-specs.aspx][E5660]] overkill for our needs, and I also have to save budget for the other HW
         * 1 x [[http://www8.hp.com/us/en/products/iss-controllers/product-detail.html?oid=6995463][HP Smart Array P841/4GB FBWC 12Gb 4-ports Ext SAS Controller (726903-B21)]], to be deployed inside one of my two HP DL380 G9 servers to connect the E2760 above
         * Probably another smaller batch of [[http://ark.intel.com/products/88294/Intel-Compute-Module-HNS2600TP24R][Intel Server HNS2600TP24R]] modules

---+++ UNIBE-LHEP

*Operations*
   * ce02 re-installed with ROCKS 6.2, SLC 6.7 and SLURM 15.08.04 (320 worker cores still to be installed)
   * ce01 has been operating with SLURM 15.08.01 for two and a half months and is very stable (256 new cores awaiting delivery)
   * 1980 cores once the installation is finalised
   * Finalised the integration of the MicroBooNE VO (OSG, Fermilab) on both clusters
   * New CE (ce04) fronting a SLURM cluster running on the SWITCHengines cloud infrastructure
      * ElastiCluster on OpenStack
      * Integrated in PanDA (UNIBE-LHEP_CLOUD)
      * Currently being commissioned with HammerCloud

*ATLAS-specific operations*
   * Very stable running over the end-of-year break
   * Monthly dumps of the namespace on the DPM SE still depend on re-deploying the DPM head node on SLC6 (moving the configuration from YAIM to Puppet)
   * Accounting numbers (from the scheduler) for last month
      * Incomplete and provisional (ce02 missing)
      * CPU hours: 410292 (ATLAS), 239 (t2k.org), 572 (uboone), 18 (ops)

---+++ UNIBE-ID

   * Xxx

---+++ UNIGE

*Operations*
   * Running smoothly during the last month and the Christmas break (lower user activity on the cluster)
   * Outlook: we will install Puppet for DPM (and probably also for cluster configuration and setup)
      * A testbed with one file server (atlasfs29) and one PC for services (Puppet)
   * Outlook: we are going to request around 3 GPUs (image processing), following a user request

*Network - Outlook*
   * We are going to request a new 10 Gb/s network switch for the cluster (network switch upgrade)
      * Mainly driven by the data management system (DPM and NFS, mostly NFS)
      * The request will be made at the end of the month; if approved, we could have the switch in summer

*Storage*
   * Checking the data stored on the DPM SE (dump) for cleaning purposes, as requested by ATLAS
      * The dump has not been sent to ATLAS yet (scheduled to be done)

*Accounting*
   * Accounting numbers (from the scheduler) for last month
   * For accounting, or at least a rough number, we have Ganglia for monitoring purposes:
      * [[http://atlasgrid.unige.ch/ganglia/?r=month&cs=&ce=&c=slc6-services&h=&tab=m&vn=&hide-hf=false&m=jobs_running&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name][UNIGE Ganglia (jobs running, last month)]]

---+++ NGI_CH

   * Reminder: *SL5 services must be decommissioned by the end of April 2016*
      * Affected:
         * UNIBE-LHEP (DPM head node)
         * UNIGE-DPNC? (some DPM pool nodes?)
         * CSCS-LCG2? (some dCache pool nodes?)
   * *Feedback invited: distributing middleware as Docker images*
      * Releasing UMD4 products as Docker images in addition to RPMs
      * Pros: can run directly on the hardware, no virtualization platform needed
      * Cons: may be hard to create/maintain
      * To be provided by TPs and/or volunteer sites
      * Possible profiles: site/top BDII, CEs
   * *NGI-CH open tickets*: https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO
      * UNIBE-LHEP:
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=118692][118692]] (HTTP support on the DPM SE): still needs investigation
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899][117899]] (storage dumps for ATLAS): on hold, needs the DPM head node on SLC6 first
      * CSCS-LCG2:
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=118253][118253]] (decommissioning dCache 2.6): can be closed
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117786][117786]] (storage dumps for ATLAS)
      * UNIGE-DPNC:
         * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117900][117900]] (storage dumps for ATLAS)

---++ Other topics

   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.
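One small technical note: the check_ipmi_sensor output quoted in the PSI section follows the usual Nagios plugin convention, with human-readable status before the =|= separator and machine-readable performance data after it. A minimal parser sketch for that kind of line (plain Python; the label/value format is taken from the quoted fan readings, and real perfdata may also carry warn/crit thresholds this sketch ignores):

```python
import re

def parse_perfdata(plugin_output):
    """Split a Nagios plugin line into (status_text, {label: value}).

    Performance data follows the '|' separator as label=value pairs;
    labels are single-quoted when they contain spaces, as in the
    check_ipmi_sensor fan readings.
    """
    status, _, perf = plugin_output.partition("|")
    metrics = {}
    # Match either 'quoted label'=value or bareword=value pairs.
    for q_label, q_val, label, val in re.findall(
            r"'([^']+)'=([-\d.]+)|(\S+)=([-\d.]+)", perf):
        if q_label:
            metrics[q_label] = float(q_val)
        elif label:
            metrics[label] = float(val)
    return status.strip(), metrics

# The exact line quoted in the PSI section:
line = ("IPMI Status: OK | 'Fan 1 DutyCycle'=13.72 'Fan 2 DutyCycle'=13.72 "
        "'Fan 3 DutyCycle'=22.34 'Fan 4 DutyCycle'=42.73 "
        "'Fan 5 DutyCycle'=53.70 'Fan 6 DutyCycle'=53.70")
status, fans = parse_perfdata(line)
print(status)                   # IPMI Status: OK
print(fans["Fan 4 DutyCycle"])  # 42.73
```

Something like this can turn the per-fan duty cycles into trend graphs, instead of only keeping the OK/WARNING/CRITICAL state.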
---++ Attendants

   * CSCS:
   * CMS: Fabio, Joosep
   * ATLAS: Gianfranco
   * LHCb: Roland
   * EGI: Gianfranco

---++ Action items

   * Item1
---++ Topic attachments

| *Attachment* | *History* | *Size* | *Date* | *Who* | *Comment* |
| Accounting_last_month.png | r1 | 39.7 K | 2016-01-14 - 13:45 | LuisMarch | UniGe accounting (last month) |
Topic revision: r14 - 2016-01-22 - FabioMartinelli