<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2014-10-02

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 9305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS

   * Issues
      * Some nodes have been dropping off the IB network at random, so many jobs have failed lately.
      * All these failed jobs produced a huge increase in inode usage on GPFS that was impossible to clean up with policies (the growth was simply too fast!)
      * Added an additional 2x 400GB SSDs to GPFS to provide even more inodes (150M)
      * GPFS is going to be decommissioned soon.
   * GPFS2
      * Fully configured using two file sets:
<verbatim>
Filesets in file system 'phoenix_scratch':
Name      Id   RootInode  ParentId  Created                   InodeSpace  MaxInodes  AllocInodes  Comment
root       0           3        --  Tue Sep 30 09:18:00 2014           0     1000128      1000128  root fileset
scratch    1     1048579         0  Tue Sep 30 09:21:51 2014           1    50000000     50000000
gridhome   2   134217731         0  Tue Sep 30 09:28:34 2014           2    30000000     30000000

                             Block Limits                   |                 File Limits
Name      type     KB         quota  limit  in_doubt  grace |    files     quota     limit  in_doubt  grace
scratch   FILESET  120917920      0      0    105024  none  |  5057529  35000000  40000000       254  none
gridhome  FILESET          0      0      0     20480  none  |        1  20000000  25000000        19  none
</verbatim>
      * Each file set has its own inode quota, so we can clean up the filesystem even if one file set reaches its maximum quota.
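For reference, periodic GPFS cleanup of the kind discussed above is typically driven by =mmapplypolicy= with a policy rule. The sketch below is only illustrative (the path, rule name and 30-day threshold are made-up values, not the actual CSCS policy):

<verbatim>
/* Illustrative GPFS policy rule: delete scratch files
   not accessed in the last 30 days */
RULE 'purge_old_scratch' DELETE
  WHERE PATH_NAME LIKE '/gpfs/scratch/%'
    AND (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30
</verbatim>

Such a policy file would then be applied periodically, e.g. via =mmapplypolicy phoenix_scratch -P purge.pol=.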
   * A new storage cleanup method (an epilog run right after each job ends, whether completed or failed) is being tested. We hope that this, together with the periodic GPFS policies, will solve the inode problems once and for all.
   * Swiss users storage
      * Ready for ATLAS and LHCb (ATLASLOCALGROUPDISK = 160TB, LHCB-DISK = 290TB)
      * Still to be tested by CMS (CHCMS = 150TB)
   * Swiss users compute
      * To be ready after the next maintenance
   * Next maintenance
      * Downtime set in GOCDB for the whole of CSCS-LCG2 on 15.10.2014 (https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=15651)
      * Changes to be applied:
         1 Update firmware on IB switches
         1 Set MTU to 1500 on all systems (all IB cards)
         1 Deploy GPFS2 to all nodes (same GPFS cluster instead of a remote cluster)
         1 Reconfigure all grid nodes to use GPFS2 and enable the CHCMS VOMS
         1 Deploy ARGUS servers
         1 Remove /experiment_software on all WNs, as it is no longer used
         1 Update the SLURM configuration: increase the priority of *atlaschXX*, *cmschXX* and *lhcbchXX*
         1 Remove non-existing/decommissioned nodes from the SLURM config

---+++ PSI

   * Maintenances
      * dCache updated to 2.6.33
      * Firmware of 120 Seagate disks updated from MS01 to MS04 on an E5400
      * SL5 and SL6 bash updated, twice
      * Puppet upgraded from v.2 to puppet-3.5.1-0.1rc1.el6.noarch
      * Using [[http://puppetlabs.com/blog/module-of-the-week-puppetlabsstdlib-puppet-labs-standard-library][Puppet Stdlib]]; to use it:
         * if you are root on the Puppet master: =# yum install puppetlabs-stdlib.noarch=
         * if you are NOT root on the Puppet master: =[ myenv/modules ]$ git clone https://github.com/puppetlabs/puppetlabs-stdlib.git stdlib=
   * Grid tools
      * [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/HowToAccessSe#gfalFS][gfalFS]]
      * [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/HowToAccessSe#CMSSW_lack_of_support][gfal-copy stops working with CMSSW]]; does the same happen with ATLAS and LHCb?
   * Next storage: we have to replace 9 dCache fileservers and 2 NFS fileservers
      * For the 9 dCache fileservers, my current guess is that we will buy a server plus a 60x 4TB [[http://www.netapp.com/us/products/storage-systems/e5500/][E5500]], ~180TB net; CSCS also uses the E5500.
      * For the 2 NFS fileservers, 3 options:
         * 2x [[http://www.netapp.com/us/products/storage-systems/fas2500/fas2500-product-comparison.aspx][NetApp FAS2525]] to get an *HA and replicated* NFS service.
         * 2x [[http://www.oracle.com/us/products/servers-storage/servers/x86/x4-2l/features/index.html][Oracle x4-2l]] installed with Solaris 11 + ZFS.
         * *Perhaps* PSI will install a central [[http://www.netapp.com/us/products/storage-systems/fas8000/fas8000-tech-specs.aspx][NetApp FAS8000 CIFS/NFS service]]; the cost for us would be ~600CHF*1TB*5y (no backups, but snapshots); it would be great.
      * In this context I used [[http://linux.die.net/man/1/tcptrack][tcptrack]] to easily see our [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/CMSTier3Log65#29_09_2014_dcaps_average_MB_s][dcap bandwidth usage]]; from those figures you can do capacity planning.
   * Clouds
      * [[http://information-technology.web.cern.ch/book/cern-private-cloud-user-guide/openstack-information][CERN offers OpenStack for free]]; I used it, nice. PSI offers a VMware cluster, but to create/modify a VM there I always have to involve a colleague, whereas here I was 100% independent.
      * [[https://indico.cern.ch/event/341357/][CERN hosted Amazon]] to present the [[http://aws.amazon.com/hpc/?nc1=h_l2_bh][Amazon Scientific Computing offer]] and some success stories; *very interesting*; they provide [[http://aws.amazon.com/grants/][free research grants]].
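The capacity-planning idea mentioned above can be sketched roughly as follows. This is a minimal illustration with made-up bandwidth numbers (not PSI's actual dcap figures), and =servers_needed= is a hypothetical helper, not an existing tool:

```python
# Illustrative sketch only: estimate how many dCache fileservers are
# needed to sustain a target aggregate dcap bandwidth, given the average
# per-server throughput observed with a tool like tcptrack.
# All numbers below are invented for illustration.
import math

def servers_needed(target_mb_s: float, per_server_mb_s: float) -> int:
    """Minimum number of fileservers needed to sustain target_mb_s."""
    if per_server_mb_s <= 0:
        raise ValueError("per-server bandwidth must be positive")
    return math.ceil(target_mb_s / per_server_mb_s)

# Hypothetical averaged samples (MB/s) from one fileserver's dcap port
samples = [85, 110, 95, 120, 100]
avg = sum(samples) / len(samples)  # 102.0 MB/s observed on average
print(servers_needed(900, avg))    # -> 9 servers for a 900 MB/s target
```

The same arithmetic applies when sizing the replacement hardware: divide the expected aggregate load by the measured per-server throughput and round up.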
---+++ UNIBE-ID

   * Infiniband network with fat-tree topology in production
      * Hardware installed and tested
      * Performance measurements done and OK
      * IPoIB setup done
   * Lots of disk crashes lately, but no outages
   * Currently working on the migration from RHEL-6 to CentOS-6
      * future config management with Puppet; the testbed has worked excellently so far
      * new compute nodes ordered to replace the old nodes => testing environment for the new CentOS-based setup

---+++ UNIBE-LHEP

   * Operations
      * Smooth routine operations with minor issues:
         * a-rex crashed (x3) on ce01 (it used to happen on ce02)
         * nodes on ce01 (phaseC Sun Blades) tend to crash. Jobs must be cleaned up manually ("dr" state in GE), then the node re-installed. Not tragic, yet tedious. Memory starvation is suspected: these nodes have 24GB RAM and 1GB swap and run 16 threads each. We could reduce that, but then 2x 8-core jobs would not fit on one node. Will check whether a RAM upgrade makes sense (4GB DDR3 SDRAM 666, Hynix Semiconductor Inc.)
         * Changed the IP address on one DPM pool node. It took a few days for operation to stabilise.
         * Memory failure on 1 ARECA controller on a DPM pool node. After the replacement, kernel panic at boot; a re-install was needed. This caused further (partial) SE failures for ATLAS, but FTS transfers resumed promptly once the service was back online.
   * ATLAS-specific operations
      * smooth routine operation
      * HammerCloud gangarobot: http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UNIBE-LHEP&startTime=2014-09-01&endTime=2014-09-30&templateType=isGolden
      * SAM Nagios ATLAS_CRITICAL: http://dashb-atlas-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time[]=lastMonth&granularity[]=default&profile=ATLAS_CRITICAL&group=All+sites&site[]=CSCS-LCG2&site[]=UNIBE-LHEP&site[]=UNIGE-DPNC&type=quality
         * Up to 1st Oct this did not include the ARC CE tests.
         * These have now moved from WMS to Condor submission and have been added to the ATLAS_CRITICAL profile: http://dashb-atlas-sum-dev.cern.ch/dashboard/request.py/historicalsmryview-sum#view=serviceavl&time[]=last48&granularity[]=default&profile=ATLAS_CRITICAL&group=All+sites&site[]=UNIBE-LHEP&flavour[]=All+Service+Flavours&disabledFlavours=true
   * New Lustre deployment for the ce01 cluster
      * Dalco servers with 5x LSI controllers each. We want RAID1 for the OS HDDs and an mdadm JBOD for the remaining HDDs.
      * Issue: the machines hang at reboot after the install. They are fine with the pre-installed CentOS and when re-installed in the Dalco lab, but hang after installing ROCKS + lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64. However, they also hang with a vanilla SLC6 kernel at LHEP (with and without ROCKS).
      * All LSI controllers to which the OS HDDs are attached were flashed to allow hardware RAID1, then re-installed with ROCKS + lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64: 1 boots, 3 hang, 2 not done yet.
      * Is the ROCKS pxelinux too old? But the nodes also hang when booting from the BIOS and selecting the LSI logical volume.
      * Ongoing... will try to install and boot from vanilla SLC6.

---+++ UNIGE

   * The 2014 upgrade is finished
      * the two oldest disk servers retired (X4500, model 2006)
      * the 'user' and 'software' space migrated to two new servers (Sun, Solaris)
      * other data migrated to another new disk server (IBM, Linux)
      * two new machines for running four VMs that host critical services
   * Two astroparticle space experiments are using the cluster
      * the DAMPE group is starting
      * the AMS group will invest in hardware to get more disk space
   * Shellshock emergency...
   * ATLAS production (Andrej Filipicic) is trying multi-core jobs

---+++ NGI_CH

   * ARGUS status and support
      * NGI_CH instance: https://ggus.eu/index.php?mode=ticket_info&ticket_id=99533
      * Support:
         * SWITCH bailed out
         * PEP client and server: no future support
         * PAP: INFN
         * PDP: no future support
         * ARGUS EES: NIKHEF
         * LCMAPS plugin: NIKHEF
      * New request to the NGIs to rescue the unsupported components
      * No alternatives / plan B
   * Status of deployment and plans
      * Most NGIs run their national service
      * End Oct 2014: monitoring framework ready (national instances) - Nagios probe, list of instances to monitor (GOCDB query)
      * Nov 2014: pilot testing with 4-5 sites (possibly with diversified middleware), refine documentation
      * Test whether ban information is available at the site services: CE/SE/WMS (action on EGI-CSIRT)
      * End Nov 2014 to end Mar 2015 (?): wide deployment
      * Beyond (?): sites monitored for this feature
   * New VOMS server configuration for ops/LHC VOs
      * An old-ish issue, but we were asked to confirm that the new setup is in place
      * All confirmed, ticket closed - https://ggus.eu/index.php?mode=ticket_info&ticket_id=108154

---++ Other topics

   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

   * Next F2F meeting doodle: https://ethz.doodle.com/sfmduevfsc64d2vm
      * The only options are Fri 23.01.2015 or Thu 29.01.2015

---++ Attendants

   * CSCS: George Brown, Miguel Gila, Gianni Ricciardi
   * CMS: Fabio Martinelli
   * ATLAS: Gianfranco Sciacca, Szymon Gadomski
   * LHCb: Roland Bernet
   * EGI: Gianfranco Sciacca

---++ Action items

   * Item1
Topic revision: r16 - 2014-10-02 - RolandBernet