<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2015-11-10

   * *Time*: 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: From Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask pw via email)

%TOC%

---++ Site status

---+++ CSCS

*Systems:*
   * HP Smart Array issues (configuration loss and failure to boot); we lost a lot of time with HP support. We found our own workaround: disable the Smart Array and enable legacy mode for the boot disk.
   * Extended the warranty of the IB bridges until spring 2016.
   * Requested new certificates for argus* with the correct DNS AltName.
   * LHCb jobs are still not running well. We suggested that Vladimir use the correct runtime environments (env/proxy and glite), but nothing has changed so far.
   * CMS is testing multicore jobs.
   * Working hard to finalize the arc02 puppet configuration.
   * We are planning to decommission cream04.
   * Planning the upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183); almost all services on the cluster are affected.
   * Getting offers for the Phoenix expansion.

*Storage:*
   * Scratch - GPFS: Netapp storage firmware upgrade (no service interruption).
   * dCache:
      * We still have the cleaner problem, mainly with CMS. At the moment the cleaner has to be run manually, but the situation has stabilized after some big deletions by CMS.
      * This week we should finalise the configuration of a pre-production system where we will test the 2.6 -> 2.10 (2.13) upgrade, so that we can upgrade production by the end of this month.

---+++ PSI

   * *NFSv4*
      * Context: MeetingSwissGridOperations20151015#PSI
      * In the end I built a RAID10 with 24 disks and no spare.
      * Instead of a single ZFS filesystem I made a hierarchy of filesystems, as advised by [[https://docs.oracle.com/cd/E23823_01/html/819-5461/gaypa.html][Oracle]]:
         * Properties set on the root of the hierarchy are propagated to each descendant.
         * Taking a recursive snapshot of the root of the hierarchy takes a snapshot of each descendant, *atomically at the same time*.
         * Taking snapshots (but without granting the destroy permission) can be delegated to each user on his/her own filesystem and also managed by simple NFSv4 =mkdir= commands ([[http://docs.oracle.com/cd/E19253-01/819-5461/gebxb/index.html][Oracle Ref]]); on ZFS on Linux this needs a [[https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.5][tweak]]: <pre>The use of mkdir/rmdir/mv in the .zfs/snapshot directory has been disabled by default both locally and via NFS clients. The zfs_admin_snapshot module option can be used to re-enable this functionality. </pre>
      * Further tasks are ongoing.
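The ZFS layout described in the NFSv4 notes can be sketched roughly as follows. This is only an illustration, not the configuration actually used at PSI: the dataset name =tank/home=, the user names, the snapshot name and the =compression=on= property are all hypothetical, and the exact permission set needed for delegated snapshots (in particular on ZFS on Linux) is an assumption. The helper only builds the =zfs= command lines; it does not execute anything.

```python
from typing import List


def zfs_hierarchy_commands(root: str, users: List[str], snapshot_name: str) -> List[List[str]]:
    """Build (but do not run) zfs command lines for a per-user dataset hierarchy.

    All names passed in are illustrative; nothing here reflects the real PSI setup.
    """
    cmds: List[List[str]] = []
    # Parent dataset: properties set here are inherited by every descendant.
    cmds.append(["zfs", "create", root])
    # Example property only (assumption); any inheritable property works the same way.
    cmds.append(["zfs", "set", "compression=on", root])
    for user in users:
        child = f"{root}/{user}"
        cmds.append(["zfs", "create", child])
        # Delegate snapshot creation only; 'destroy' is deliberately not granted.
        cmds.append(["zfs", "allow", "-u", user, "snapshot", child])
    # -r snapshots the root and all its descendants in one atomic operation.
    cmds.append(["zfs", "snapshot", "-r", f"{root}@{snapshot_name}"])
    return cmds


if __name__ == "__main__":
    for argv in zfs_hierarchy_commands("tank/home", ["alice", "bob"], "daily"):
        print(" ".join(argv))
```

To actually apply such a layout, each list could be passed to =subprocess.run()= as root, after checking it against the local ZFS version.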
   * *dCache*: For CSCS's information: at PSI I tuned the dCache Xrootd threshold to xrootd.limits.threads=160. The default of 1000 was too high for us; we were repeatedly getting 1000 Xrootd sessions from the Internet that eventually expired with a timeout.
   * *Security*: Processed the [[https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-CVE-2015-7183][EGI SVG Advisory - 'Critical' risk. Remote arbitrary code execution vulnerabilities in the core crypto library used by RedHat.]]
   * *General Interest*: 1TB [[https://owncloud.org/][OwnCloud]]/[[http://information-technology.web.cern.ch/services/eos-service][EOS]] @ CERN: http://cernbox.web.cern.ch/

---+++ UNIBE-LHEP

   * *Operations*
      * Smooth(-ish) operation on ce02, quite stable at just over 500 cores. Nothing of relevance to report.
      * Re-deployment of the ce01 cluster under way:
         * SLC 6.7 and ARC 5.0.3 (needed a downgrade of openldap* to get a functional resource bdii on the ARC CE)
         * about 900 worker cores installed
         * new lustre (version 2.5.3, 200 disks), Thumpers decommissioned
         * moved to slurm, cutting my teeth on it
         * hope to go online in the next few hours
      * Patching against CVE-2015-7183 (nss*, nspr* from slc6-testing)
   * *ATLAS specific operations*
      * Implementing the requested monthly dumps of the namespace on the DPM SE.

---+++ UNIBE-ID

   * *Commissioning*
      * Ordered the first 32 new compute nodes (Broadwell) with a total of 640 cores; to be delivered in 12/2015.
      * Another 32 nodes will be ordered early in 2016.
   * *Operations*
      * Prolonged maintenance downtime due to a painful migration to the new GPFS storage.
         * Lesson learned (us + IBM techie!): using AFM and additionally doing rsyncs is a huge no-go and leads to a corrupted filesystem when AFM is disabled at the end,
         * though there was no data loss.
      * Smooth operation again since then.
      * The upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183) will be done tomorrow within the already scheduled maintenance downtime.
   * *ATLAS specific operations*
      * No problems.
      * Ordered a new SSL certificate for nordugrid.unibe.ch because of the STRICT_RFC2818 switch by Globus GSI clients.

---+++ UNIGE

   * *Operations*
      * atlasfs18.unige.ch: ATLAS File Server, users reported problems with data transfers.
         * According to first checks from monitoring (Ganglia and Nagios) the machine was up and running.
         * No remote access was possible.
         * After a manual restart it could not be brought back; a RAID controller problem is assumed.
         * Fortunately, this machine is still under warranty with IBM (they will be contacted for repair).
         * A spare File Server is being used instead (this is temporary); the disks were moved to the temporary machine.
         * No further problems observed since then for atlasfs18.unige.ch.
      * I will ask for a host certificate for a new ATLAS File Server to be added to the cluster.
      * Another File Server has already been installed, but this one is for the DAMPE experiment (no host certificate needed).
      * We have new hardware to be installed in the cluster: File Servers and a couple of PCs for services.
   * *Network - Outlook*
      * We intend to get a new 10 Gb/s network switch, but this is still under negotiation.
      * Most likely, it will come at the beginning of next year.
   * *Storage*
      * There is a DPM SE workshop at CERN on December 7th-8th (probably interesting for other sites with a DPM SE). I will attend it.
      * Checking the data stored on the DPM SE for cleaning purposes, since ATLAS previously had a data management tool called "dq2" and now uses "rucio".
      * Checking data in order to identify files which are registered in the catalogue (rucio) but not physically present on the DPM SE, and vice versa.

---+++ NGI_CH

   * Profile ch.cern.sam-ROC_CRITICAL for ops: http://mon.egi.eu/myegi/sa/?view=2&graph=1&vo=104&profile=26&filters-value-Regions_or_Tiers=115&filters-value-Sites=&production=1&preproduction=1&dateorperiod=pd&period=pM&startdate=01-08-2015&enddate=30-09-2015
   * https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-CVE-2015-7183
   * Survey on "Quality process and ISO certification" (Quality Management, IT Service Management, Information Security Management): https://www.surveymonkey.com/r/isocertification

---++ Other topics

   * Daniel being replaced as CMS contact person
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants

   * CSCS: Pablo, Dario, Dino, Gianni
   * CMS: Fabio Martinelli, Daniel Meister
   * ATLAS: Gianfranco, Luis March
   * LHCb: Roland Bernet
   * EGI: Gianfranco

---++ Action items

   * Item1
Topic revision: r19 - 2015-11-11 - FabioMartinelli
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.