<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->

---+ Swiss Grid Operations Meeting on 2013-08-08

   * *Date and time*: First Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9227296)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=Nrq24qRR4V1u
   * *Phone gate*: From Switzerland: 0225330322 (portal) + 9227296 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask pw via email)

---++ Agenda

Status
   * CSCS (reports George):
      * The Storage01 root partition filled up on 29th July, causing transfers to fail. A full reboot of the machine was required to get it back into service.
      * Restarting the dCache services brought the used space reported by the operating system back to the normal ~20%.
      * The next day we noticed that disk usage was still growing, caused by dCache creating trace files under /tmp.
      * Altering the log levels within the dCache CLI/pcells had no effect.
      * For the time being we are log-rotating these files.
      * The relevant line in logback.xml has been changed, so these files will no longer be created once dCache is next restarted.
      * We have just provisioned slurm1, slurm2 and cream04 to begin testing Slurm/EMI3.
      * Achieved 99% availability and 100% reliability in the Tier 2 report for July: http://sam-reports.web.cern.ch/sam-reports/2013/201307/wlcg/WLCG_Tier2_OPS_Jul2013.pdf
   * PSI (reports Fabio):
      * Good news:
         * I've physically installed our new [[http://www.netapp.com/us/products/storage-systems/e5400/e5400-tech-specs.aspx][NetApp E5460 360TB raw]]. If you have never seen a NetApp E5460, have a look at this [[http://www.youtube.com/watch?v=n5ULb2OPFD8][YouTube video]].
         * The RAID6 arrays (creation took ~4 days), VolGroups, Vols and FC Hosts were *automatically* created by simply loading the configuration of our other E5460, which saved me a lot of time!
         * I'm preparing an SL6 + RDAC + FC installation to stress-test the E5460 before merging it into our production environment.
         * [[http://support.netapp.com/][NetApp support]] was unresponsive and really remote; it's based in India, so *answers are sent in their time zone*. It took days to get a non-guest account, download SANtricity and activate/map the E5460 serial number to my ID.
         * I did not have time to try it, but the [[http://www.netapp.com/us/services-support/autosupport.aspx][NetApp AutoSupport]] service, which lets NetApp monitor the system remotely, looks nice.
      * Bad news: *Our T3 has been partially down since 5th Aug*! On my return from a Sunday in Italy, the following failed:
         * The Milano -> Zurich train; I arrived home at 3am!
         * On 5th Aug, =t3fs07=, a Solaris dCache server (X4540), froze. I rebooted it, but Solaris 10 could not boot because the Flash Card had failed. The server also had 3 broken SATA disks, 2 of them in the same =raidz2=, and 1 of them producing an endless =Disconnected command timeout for Target 1=.
         * In the meantime, 2 disks were failing in another X4540 server, =t3fs11=, again both in the same =raidz2=.
         * To close the chain, another disk failed in yet another X4540 server, =t3fs10=, but that was easy to fix.
      * To fix this I've:
         * Installed Solaris Express 11 onto a new Flash Card, a *precious inheritance coming from a CSCS X4540*, and booted the new =t3fs07=.
         * Stopped dCache on =t3fs11= to avoid additional load. Because ZFS was already rebuilding onto 2 spares, I let it run and, once it was done, promoted the 2 spares to 2 new pool disks.
   * UNIBE (reports Gianfranco - please note: I will not attend the meeting):
      * ce.lhep cluster (older CentOS 5) upgraded to ARC 3.0.2
         * An obscure bug causes slapd not to start, with no trace of an error (unless you increase the log level to 256 (!) *and* instruct syslog to turn the infosys log on). The actual bug is here: [[https://bugzilla.nordugrid.org/show_bug.cgi?id=3226]]
         * The infoprovider does not work if stale job files (typically from previously failed jobs) exist in the controldir. After some lengthy debugging the problem was understood, and Andrej provided a cleanup script.
      * The ce.lhep cluster will grow ~4x in size with nodes from the HLT farm at CERN. At the same time (Sept), it will be re-installed with SLC6.
      * ce01.lhep cluster (newer SLC6): the problems reported previously are not permanently solved yet (issue with Thumper lockups, CVMFS cache becoming full). Still ready to move Lustre to IB, but the transition will require both me and Andrej to be around for some days at least (which has not happened since last month's meeting).
   * UNIGE (reports Szymon):
      * The first batch worker node running SLC6 has been put into operation. Most jobs are OK.
      * Hardware procurement for the 2013 upgrade is under way.
         * Replace the Solaris machines in DPM (six machines, 96 TB net) with new hardware (IBM x3630 M4, 4 machines, 172 TB total).
      * !CernVM file system was set up. We use NFS to deploy it.
      * Adaptation of operational procedures, especially the cleanup of "dark data", to the new version of the ATLAS Distributed Data Management software "Rucio".
      * Hardware failures:
         * A few disk failures on IBM x3630 M3 and Sun X4500+ machines
         * Memory errors on one Sun X4540 disk server
         * Failure of hardware RAID on one IBM x3630, due to overheating
      * Automatic cleanup of /tmp is affecting very long jobs. Files are removed after 5 days. We still don't understand why. It is not tmpwatch.
   * UZH (reports Sergio):
      * Xxx
   * Switch (reports Alessandro):
      * Xxx

Other topics
   * Topic1
   * Topic2

Next meeting date:

AOB

---++ Attendants
   * CSCS:
   * CMS: Daniel
   * ATLAS:
   * LHCb:
   * EGI:

---++ Action items
   * Item1
Topic revision: r6 - 2013-08-08 - FabioMartinelli
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.