Swiss Grid Operations Meeting on 2013-06-06
Agenda
Status
- CSCS (reports Miguel):
- Updated SE head-nodes to SL6 on new hardware with dCache 2.2. This is new IBM hardware on SL6.4 with SL IB stack.
- Updated SE pool nodes to dCache 2.2. Phase D pools stay on SL 5.7 (rdac) and Phase G pools are on SL6.4 (multipath).
- Firmware upgrade on Phase D storage controllers (DS3500).
- Added new CVMFS service on cvmfs1, on more powerful hardware. The old cvmfs system continues to work as a redundant backup.
- Updated remaining CREAM-CEs (cream01 and cream02) to SL6 on new hardware.
- Deployed new atlasvobox on atlas01. This is a VM on SL6.4 with more resources than the previous system.
- Issues:
- A bug in the CREAM-CE software forced us to upgrade the BLAH parser package. During the next maintenance the CREAM-CEs will have to be upgraded to the latest UMD-2 release.
- Found a problem with the SL6.4 kernel (version > 2.6.32-279.19): an unresolved bug in the ipoib rdma module randomly produces a kernel panic (NULL pointer) and makes the machine reboot automatically. The "solution" was to downgrade to 2.6.32-279.19.1.el6.x86_64.
- We have seen a memory leak bug in the CREAM-CE software. Graphs for cream01 and cream02.
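The kernel issue above (kernels newer than the 2.6.32-279.19 series panic in the ipoib rdma module) could be checked on a node with something like the following. This is only a minimal sketch: the helper names `kernel_key` and `is_newer_than_known_good` are made up for illustration, and the comparison looks only at the leading numeric fields of the release string, which is simpler than real RPM version ordering.

```python
import re

# Kernel release CSCS downgraded to, as quoted in the minutes.
KNOWN_GOOD = "2.6.32-279.19.1.el6.x86_64"


def kernel_key(release):
    """Return the leading numeric fields of a kernel release string as a tuple.

    Parsing stops at the first non-numeric field (e.g. 'el6'), so this is a
    simplification and not a full RPM version comparison.
    """
    key = []
    for field in re.split(r"[.-]", release):
        if not field.isdigit():
            break
        key.append(int(field))
    return tuple(key)


def is_newer_than_known_good(release, known_good=KNOWN_GOOD):
    """True if `release` sorts after the known-good kernel (hypothetical check)."""
    return kernel_key(release) > kernel_key(known_good)
```

On a worker node the running kernel could be passed in via `platform.release()` from the Python standard library, e.g. `is_newer_than_known_good(platform.release())`.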
- PSI (reports Fabio):
- (Daniel found this) The VOMS FNAL file for CMS is no longer needed, so the dCache file /etc/grid-security/vomsdir/cms/voms.fnal.gov.lsc is now obsolete, as are the corresponding VOMS files on the UIs; they sometimes led to the creation of a user VOMS FNAL proxy that was rejected by the remote grid services.
- Latest EGI SL5 WN Tarball deployed and running.
- VMWare nightly snapshots freeze our t3dcachedb VM; the PSI VMWare Team is following this issue. Our Site Log here.
- Planned post-summer upgrade of our 360TB raw [SGI IS5500 + Expansion] to 2 x 360TB raw [SGI IS5500 + Expansion]. Link.
- FYI Fabio will be on leave [ 22/06, 14/07 ], Derek on [ 10/06, 19/06 ]
- Please see my dCache notes below.
- UNIBE (reports Gianfranco):
- SunBlade 6048 cluster in production (80% of nodes), with Lustre 2.1.5 on SLC6 (on TCP to begin with). Running with 16 slots per node, trying to make use of some advanced memory management of the batch scheduler gridengine. Ran up to 1.2k jobs simultaneously so far.
- Issues:
- Lost two OSSs (thumpers), 'disappeared' from the network. Disabled them and carried on pending diagnostics.
- 4 nodes had a full CVMFS cache (disabled pending diagnostics)
- 3 nodes did not get the ROCKS customised partition table, so could not run CVMFS. Even after re-install the problem persists. (disabled pending diagnostics)
- 2 nodes flagged as problematic by Andrej (disabled pending diagnostics)
- We do not seem to experience the CVMFS bug reported below (our kernel should be affected)
- Immediate plans:
- Upgrade the offlined WNs to a newer CVMFS (we have cvmfs-2.0.19-1.el6.x86_64 now). Test them for some days; if all is well, do a rolling upgrade of the WNs.
- Investigate OST network failures (tricky as ROCKS assigns random root passwords these days). Have spare thumpers to put into service anyway.
- Polish IB topology, then move Lustre to the IB network
- UNIGE (reports Szymon):
- UZH (reports Sergio):
- Switch (reports Alessandro):
Other topics
- CSCS WN migration to SL6: There is a kernel update that breaks the current cvmfs (kernel > 2.6.32-279). https://cern.service-now.com/service-portal/view-incident.do?n=INC299258
- CSCS next scheduled maintenance: Wed. July 3, 2013 https://wiki.chipp.ch/twiki/bin/view/LCGTier2/ScheduledMaintenanceOn20130703
- CSCS deployment of perfSonar is somewhat problematic: there is no mesh for DE cloud.
- ATLAS specific migration to SL6. There's a meta-package for all needed libs and some migration notes:
- Allow UNIBE-ID and UNIBE-LHEP clusters on ATLAS CVMFS squid (will send limited IP list separately)
- ATLAS DE cloud asks for FAX deployment; I have not found time yet, but hope to start very soon
- Proposed dates for the ATLAS DE cloud f2f meeting at CSCS: 16-27 Sept are excluded (EGI forum & D-grid meeting)
- cms01 is almost ready; unfortunately we'll need cmsvobox for a few more weeks, as some critical software (PhEDEx) does not yet officially support SL6
Next meeting date: Proposed Thu. July 4, 2013
AOB
Attendants
- CSCS: Miguel, George
- CMS: Fabio
- ATLAS: Gianfranco
- LHCb: Roland
- EGI: Alessandro
Action items
http://www.dcache.org/manuals/upgrade/upgrade-2.2-to-2.6.html
Generally speaking, dCache is moving its Web interface toward a Webmin customization, which will also include warnings.
Paul Millar, gPlazma: gPlazma1 will definitely disappear in 2.6, please migrate to gPlazma2! (our T3 + T2 already did it)
Ron from SARA: LDAP for dCache
WebDAV authentications: interesting but out of scope for our T3 and T2.
Christian Bernardt - the most useful talk for our typical day
- They want to introduce the SSH2 shell into the dCache webpage
- IPv6
- Logback configuration --> central dCache logging http://logback.qos.ch/documentation.html
- dcache status provides more info about log files and uptime of the services
- IT Hit as a new WebDAV server for dCache
- counters in PnfsManagers ( like nfsstat )
- a lot of new dCache stats!
- print srm counters
- Nagios friendly JMX method to extract statistics from dCache
dCache Messaging - showing the dCache internals - Roles of the Domains in dCache - talk very related to the dCache internals
- well known cells
- Routing Manager: receives notifications, maintains the routing table, sends notifications
OpenMQ has a number of advantages over the default dCache broker.
dCache queuing (I/O queues for a pool): very related to the dCache internals, probably not useful for our typical day.
Gerd, after a very long talk about the dCache internals, presented a suggested doors/domain map with 3 servers (which door on which domain).
SRM tuning slides - Dimitry - worth reading: some tips on the Thread Pool size, common errors and possible solutions
DB tuning - Gerd: it's possible to dump an .html file describing the dCache DB tables: dcache database. Gerd uses the HSQL Database Manager utility http://hsqldb.org/
Patrick: slides about CMS disks and tapes + work-in-progress dCache initiatives
- Federated Identities
- Hierarchical storage in dCache with SSD disks
- Tape concerns about small files: solved by collecting small files into a dir, then dCache runs a 'tar' and the result is uploaded into the tape system.
- HTTP/WebDAV federation of dCache installations (Fabrizio Furano)
- Client optimizations (but still a work in progress)
- Individual dCache (Private Cloud Storage)
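The small-files-to-tape idea from Patrick's slides (collect small files in a directory, tar them, upload the single archive) can be illustrated roughly as below. This is a sketch under assumptions and not dCache's actual implementation: the function name `pack_small_files` and both paths are hypothetical, and the upload to the tape system is left out entirely.

```python
import os
import tarfile


def pack_small_files(directory, archive_path):
    """Bundle every regular file directly inside `directory` into one tar archive.

    Returns the sorted list of file names that were packed. Subdirectories are
    skipped; the archive is written uncompressed.
    """
    names = sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    with tarfile.open(archive_path, "w") as archive:
        for name in names:
            # Store each file under its bare name, without the directory prefix.
            archive.add(os.path.join(directory, name), arcname=name)
    return names
```

The point of the design is that the tape system then handles one large object instead of many small ones; how the archive members are tracked afterwards is not covered by these notes.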
*********
Not dCache related, but I found it interesting to talk with a CMS INFN Roma1 colleague, Ivano Talamo. I did not have time to verify his hints, but they are worth reporting:
- Puppet + Augeas to avoid having many configuration files with small differences.
- NRPE in Nagios with individual Nagios checks included 1 by 1 instead of a monolithic NRPE file with all the checks listed inside ( include_dir=/etc/nagios/nrpe.d/ )
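As an illustration of the NRPE hint, the sketch below splits the `command[...]` definitions of a monolithic NRPE file into one small file per check, suitable for a directory referenced by `include_dir`. It is unverified against our setup: the function name `split_nrpe_commands` is hypothetical, the `.cfg` suffix is assumed to be what NRPE picks up from an included directory, and the function only returns the file contents rather than writing anything to disk.

```python
import re

# Matches NRPE check definitions of the form: command[check_name]=/path/to/plugin args
COMMAND_RE = re.compile(r"^\s*command\[([^\]]+)\]\s*=\s*(.+?)\s*$")


def split_nrpe_commands(config_text):
    """Map '<check_name>.cfg' to a one-line command definition.

    Comments and all other directives (including include_dir itself) are ignored.
    """
    files = {}
    for line in config_text.splitlines():
        match = COMMAND_RE.match(line)
        if match:
            name, command = match.groups()
            files[name + ".cfg"] = "command[%s]=%s\n" % (name, command)
    return files
```

Each returned entry could then be written into /etc/nagios/nrpe.d/ (or managed individually by Puppet), so that adding or removing a check touches one file only.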