Swiss Grid Operations Meeting on 2013-07-04

Agenda

Status
  • CSCS (reports Miguel):
    • (report from the last maintenance)
  • PSI (reports Fabio/Daniel):
    • Generally quiet (holiday period), but some new users need support and/or additional software packages installed
    • Cluster was "offline" for about 2 hours on Thursday June 26th (SWITCHlan network issue, tracked in a ticket, that left PSI without any network connection)
    • Virtual infrastructure at PSI seems to have stabilized (at least we did not see any other problems with our crucial dCache Chimera VM)
    • Usual fileserver/HDD problems continue; luckily everything so far was recoverable by reboots only (i.e. no data migrations necessary)
    • Our Chimera DB constantly has >250 connections open (out of the 300 we have configured as a maximum)
      • The number seems to be almost constant, independent of actual usage
      • Maybe dCache is simply keeping as many connections open as the configured limit allows to improve performance, or maybe there is something wrong with our configuration/installation
      • The situation is not yet fully understood (however, as it constantly stays below ~90% this is not a high-priority issue for us); a sketch for inspecting the open connections follows this status list
    • Constantly "fighting" over-usage and clean-up laziness by certain users; our SE is now ~95% full
    • Started doing some tests with OpenMPI (so far single node only)
  • UNIBE (reports Gianfranco):
    • Xxx
  • UNIGE (reports Szymon):
    • Xxx
  • UZH (reports Sergio):
    • Xxx
  • Switch (reports Alessandro):
    • Xxx
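
A minimal sketch of how the Chimera connection usage mentioned in the PSI report might be inspected on the database host; the database name "chimera" and the "postgres" role are assumptions and should be adjusted to the local installation:

      # configured limit and current number of open connections
      psql -U postgres -d chimera -c "SHOW max_connections;"
      psql -U postgres -d chimera -c "SELECT count(*) FROM pg_stat_activity;"
      # break the open connections down by client host and role to see which dCache nodes hold them
      psql -U postgres -d chimera -c "SELECT client_addr, usename, count(*) FROM pg_stat_activity GROUP BY client_addr, usename ORDER BY count(*) DESC;"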

Other topics

  • CMS
    • CMS site configuration was migrated to git on Wednesday June 26th. A trivial error in the migration script led to a lot of killed jobs everywhere within a few hours that day, so if you saw problems for CMS on that day, they were most probably not a site issue.
    • From the CMS side the main issue of the last month was really the network problems that started mid-June (see e.g. the transfer quality plot, which shows that 3 links to T1s degraded around June 13th)
    • After some painstaking investigation we found that it was an MTU/MSS issue, which is now temporarily worked around by setting ifconfig ib0 mtu 2044 on all dCache head nodes (a sketch of the workaround follows this list; more details can/will be discussed in the CSCS part)
      • Do we also need to apply this fix on the WNs? Otherwise stage-out from the WNs to e.g. FNAL could also fail
    • Transfers resumed on Monday evening and from our side things look OK now
    • Successfully registered the CSCS queues as SL6 queues within the CMS submission infrastructure; waiting for first results
    • cmsvobox was down over the last weekend and was then migrated to a KVM machine; we will try to migrate the last service that runs there within the next few days
  • Topic2
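
A minimal sketch of the MTU workaround mentioned in the CMS report; the interface name ib0 comes from the report, while the ifcfg file path assumes a RHEL/SL-style node and should be confirmed with CSCS:

      # check the current MTU on the IPoIB interface
      ip link show ib0
      # temporary workaround applied on the dCache head nodes (not persistent across reboots)
      ifconfig ib0 mtu 2044
      # to make it persistent on a RHEL/SL-style node, add "MTU=2044" to
      # /etc/sysconfig/network-scripts/ifcfg-ib0 and restart networking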

Next meeting date:

AOB

Attendants

  • CSCS:
  • CMS:
  • ATLAS:
  • LHCb:
  • EGI:

Action items

  • Item1