<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only to be viewable by the internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->

---+ Swiss Grid Operations Meeting on 2013-07-04

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9227296)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=Nrq24qRR4V1u
   * *Phone gate*: from Switzerland: 0225330322 (portal) + 9227296 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

---++ Agenda

Status
   * CSCS (reports Miguel):
      * All worker nodes updated to SL6 / UMD 2
      * No longer using fakeraid; one disk for the OS, one for CVMFS
      * Problems with gridftp transfers from certain sites; the cause was our IB/Ethernet bridge not negotiating MTUs correctly
      * atlasvobox decommissioned
      * cmsvobox was found hung and would no longer boot under Xen. Migrated it to KVM and brought the machine back up.
      * lrms02 moved to KVM, no more Xen machines
      * CREAM machines updated to the latest release
   * PSI (reports Fabio/Daniel):
      * Generally quiet (holiday period), but some new users need support and/or additional software packages installed
      * Cluster was "offline" for about 2 hours on June 26th (SWITCHlan network issue that left PSI without any network connection; see the [[http://www.switch.ch/de/network/operation/tts/index.html?action=show&id=726][ticket]])
      * Virtual infrastructure at PSI seems to have stabilized (at least we did not see any other problems with our crucial dCache Chimera VM)
      * Usual fileserver/HDD problems continue; luckily everything so far was recoverable by reboots alone (i.e. no data migrations necessary)
      * Our Chimera DB constantly has >250 connections open (out of the 300 we have configured as the maximum); a short diagnostic sketch is appended after the action items below
         * The number seems to be almost constant, independent of actual usage
         * Maybe it is just trying to improve performance as much as possible within the defined limit, or maybe there is something wrong with our configuration/installation
         * Situation not yet exactly understood (however, as it constantly stays below ~90% this is not a high-priority issue for us)
      * Constantly "fighting" over-usage and clean-up laziness by certain users; our SE is now ~95% full
      * Started doing some tests with OpenMPI (so far single-node only)
   * UNIBE (reports Gianfranco):
      * Older cluster running stably; will move it to SLC6 after the summer and run it until it dies.
      * Newer cluster running stably at full load with I/O-light tasks (had problems with eth0 lockups on at least 3 Lustre OSS nodes)
      * Now ready to move Lustre to the ib0 network (negotiating downtime with Andrej, very likely early next week)
      * Then commission it for I/O-heavy MC tasks (Reco) and eventually Analysis
      * Ready to move to Lustre 2.1.6 on the servers (this supports the latest kernel, 2.6.32-358.11.1.el6.x86_64). Not clear whether we will do this now
      * Network drop last week: we had a longish interruption but were not really affected; jobs resumed happily afterwards
      * Issue: we are not publishing the usage records from the newer cluster to APEL
         * Needs an intervention on the Swiss SGAS side (who is maintaining this now? I guess it is still under SWITCH)
         * The JURA publisher of the latest ARC version is not production-ready yet (but we can piggyback on the efforts in DE and the UK, who are moving to ARC)
      * Not clear whether we are advertising our cluster correctly in the site BDII (e.g. =GlueHostProcessorOtherDescription=)
      * UNIBE-ID cluster operating smoothly, with excellent synergy with its admins. 600 slots for ATLAS; we might pledge resources from there too
   * UNIGE (reports Szymon):
      * Xxx
   * UZH (reports Sergio):
      * Xxx
   * Switch (reports Alessandro):
      * Xxx

Other topics
   * CMS
      * The CMS site configuration was migrated to git on Wednesday, June 26th. A trivial error in the migration script led to a lot of killed jobs everywhere within a few hours that day, so if you saw something that day for CMS it was most probably not a site issue.
      * From the CMS side the main issue of the last month was really the network problem that started mid-June (see e.g. this [[https://cmsweb.cern.ch/phedex/graphs/quality_all?link=src&no_mss=true&to_node=CSCS&from_node=.%2A&conn=Debug%2FWebSite&starttime=1370649600&span=86400&endtime=1371859200][transfer quality plot]] showing that 3 links to T1s went bad around June 13th)
      * After some painstaking investigation we found out it was an MTU/MSS issue that is now temporarily solved by setting =ifconfig ib0 mtu 2044= on all dCache head nodes (I guess more details can/will be discussed in the CSCS part; a short sketch of the workaround is appended after the action items below)
      * Do we also *need to apply this fix on the WNs*? Otherwise stage-out from the WNs to e.g. FNAL could also fail
      * Transfers resumed Monday evening and from our side things look OK now
      * Did *something change in the Scheduled Downtime workflow* from our side? For about two months now CMS seems to be unable to detect those downtimes correctly; also, we no longer receive the automated start/finish e-mails from GOCDB.
      * Successfully [[https://savannah.cern.ch/support/?138461][registered the CSCS queues as SL6 queues]] within the CMS submission infrastructure; waiting for first results
      * =cmsvobox= was down over the last weekend and was then migrated to a KVM machine; will try to migrate the last service that runs there within the next few days
   * ATLAS
      * HammerCloud random failures increasing in frequency (see attachment). Many auto-exclusions/whitelistings for the ANALY queue
      * Hard to hunt down the cause; the error message is misleading (related to the size of the workdir). The real cause is not understood yet (no site issue)
      * FAX and perfSONAR deployment still open
      * ATLAS DE cloud face-to-face meeting at CSCS: date fixed for 30 Sep - 01 Oct 2013 (official announcement will follow)
   * Topic3

Next meeting date:

AOB

---++ Attendants

   * CSCS:
   * CMS:
   * ATLAS:
   * LHCb: Roland
   * EGI:

---++ Action items

   * Item1
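Regarding the PSI Chimera connection count above: a minimal diagnostic sketch, assuming the Chimera database runs on a PostgreSQL backend (as in a standard dCache installation) and that =psql= access is available on the DB host. The database name =chimera= and the role =dcache= are illustrative and need to be adapted to the local setup.

<verbatim>
# Count the connections currently open to the Chimera database
psql -U dcache -d chimera -c "SELECT count(*) FROM pg_stat_activity;"

# Break them down per user and database to see who is holding them
psql -U dcache -d chimera -c \
  "SELECT usename, datname, count(*) AS n FROM pg_stat_activity GROUP BY usename, datname ORDER BY n DESC;"

# Compare against the configured server-side limit
psql -U dcache -d chimera -c "SHOW max_connections;"
</verbatim>

If most connections turn out to come from the dCache services themselves rather than from user activity, the near-constant count is probably just the configured connection pools keeping their connections open.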
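Regarding the =ifconfig ib0 mtu 2044= workaround in the CMS report above: a minimal sketch of how the setting could be checked, applied, and made persistent on an SL6 node. Only the interface name =ib0= and the value 2044 come from the report; the configuration file path is an assumption, and whether the WNs need the same change is still an open question.

<verbatim>
# Check the MTU currently set on the IB interface
ip link show ib0 | grep -o 'mtu [0-9]*'

# Apply the workaround at runtime (as done on the dCache head nodes)
ifconfig ib0 mtu 2044

# Assumed way to make it persistent across reboots on SL6; adjust if the
# interface is configured differently (e.g. by the OFED/IB init scripts)
echo "MTU=2044" >> /etc/sysconfig/network-scripts/ifcfg-ib0
</verbatim>

The value 2044 corresponds to the standard IPoIB datagram-mode MTU, which presumably avoids the fragmentation problem across the IB/Ethernet bridge.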
---++ Topic attachments

| *Attachment* | *Size* | *Date* | *Who* |
| ATLAS_HC-CSCS-June.png | 59.8 K | 2013-07-04 - 09:30 | GianfrancoSciacca |