
Swiss Grid Operations Meeting on 2013-06-06

Agenda

Status

  • CSCS (reports Miguel):
    • Updated SE head-nodes to SL6 on new hardware with dCache 2.2. This is new IBM hardware on SL6.4 with SL IB stack.
    • Updated SE pool nodes to dCache 2.2. PhaseD pools stay on SL 5.7 (rdac) and PhaseG pools are on SL6.4 (multipath).
    • Firmware upgrade on Phase D storage controllers (DS3500).
    • Added new CVMFS service on cvmfs1 on more powerful hardware. Old cvmfs system continues to work as a redundant backup.
    • Updated remaining CREAM-CEs (cream01 and cream02) to SL6 on new hardware.
    • Deployed new atlasvobox on atlas01. This is a VM on SL6.4 with more resources than the previous system.
    • Issues:
      • A bug in the CREAM-CE software forced us to upgrade the blah parser package. During the next maintenance the CREAM-CEs will have to be upgraded to the latest UMD-2 release.
      • Found a problem in the SL6.4 kernel (versions > 2.6.32-279.19): an unresolved bug in the ipoib rdma module randomly produces a kernel panic (NULL pointer), after which the machine reboots automatically. The "solution" was to downgrade to 2.6.32-279.19.1.el6.x86_64.
      • We have seen a memory leak in the CREAM-CE software. Graphs for cream01 and cream02.
  • PSI (reports Fabio):
    • (Found by Daniel) The FNAL VOMS file for CMS is no longer needed, so the dCache file /etc/grid-security/vomsdir/cms/voms.fnal.gov.lsc is now outdated, as are the corresponding VOMS files on the UIs. On the UIs they sometimes led to the creation of a user VOMS FNAL proxy, which was then rejected by the remote grid services.
    • Latest EGI SL5 WN Tarball deployed and running.
    • The nightly VMware snapshots freeze our dCache database VM; the PSI VMware team is following up on this issue. See our Site Log.
    • After the summer we plan to upgrade our 360 TB raw SGI IS5500 + Expansion to 2 x 360 TB raw SGI IS5500 + Expansion.
    • Derek will be on leave from 10/06 to 19/06 and Fabio from 22/06 to 14/07; only regular T3 maintenance during this period (no downtimes, no new HW/SW systems, improving documentation).
    • dCache admins, please see my notes below.
  • UNIBE (reports Gianfranco):
    • SunBlade 6048 cluster in production (80% of nodes), with Lustre 2.1.5 on SLC6 (on TCP to begin with). Running with 16 slots per node, trying to use some of the advanced memory management features of the Grid Engine batch scheduler. Ran up to 1.2k jobs simultaneously so far.
    • Issues:
      • Lost two OSSs (thumpers), which 'disappeared' from the network. Disabled them and carried on pending diagnostics.
      • 4 nodes had a full CVMFS cache (disabled pending diagnostics)
      • 3 nodes did not get the ROCKS customised partition table, so they could not run CVMFS. The problem persists even after a re-install (disabled pending diagnostics).
      • 2 nodes flagged as problematic by Andrej (disabled pending diagnostics)
    • We do not seem to experience the CVMFS bug reported below, even though our kernel should be affected.
    • Immediate plans:
      • Upgrade the offlined WNs to a newer CVMFS (we have cvmfs-2.0.19-1.el6.x86_64 now). Run them for some days; if all goes well, do a rolling upgrade of the remaining WNs.
      • Investigate the OST network failures (tricky, as ROCKS assigns random root passwords these days). We have spare thumpers to put into service anyway.
      • Polish the IB topology, then move Lustre to the IB network.
  • UNIGE (reports Szymon):
    • Xxx
  • UZH (reports Sergio):
    • Xxx
  • Switch (reports Alessandro):
    • Xxx
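For the SL6.4 kernel issue reported by CSCS above, a quick check of which nodes run a release newer than the one CSCS downgraded to could look like the sketch below. This is only an illustration, not something CSCS reported using: the function names are made up here, and treating every 2.6.32 release newer than 2.6.32-279.19.1.el6.x86_64 as suspect is an assumption based on the minutes alone.

```python
import platform
import re

# Release CSCS downgraded to, taken from the minutes. Assumption: any newer
# release on the same 2.6.32 line is treated as suspect.
KNOWN_GOOD = "2.6.32-279.19.1.el6.x86_64"


def release_key(release):
    """Turn a kernel release string into a tuple of its leading numeric fields."""
    match = re.match(r"(\d+(?:[.-]\d+)*)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in re.split(r"[.-]", match.group(1)))


def is_suspect(release, known_good=KNOWN_GOOD):
    """True if `release` is on the same base version as `known_good` but newer."""
    key, good = release_key(release), release_key(known_good)
    return key[:3] == good[:3] and key > good


if __name__ == "__main__":
    running = platform.release()
    state = "suspect" if is_suspect(running) else "not flagged"
    print("%s: %s" % (running, state))
```

Run on each node (for example through the batch system or a parallel shell), this prints the running release and whether it falls in the range the minutes describe as affected.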
Other topics

Next meeting date: proposed Thu. July 4, 2013

AOB

Attendants

  • CSCS: Miguel, George
  • CMS: Fabio, Daniel (partly)
  • ATLAS: Gianfranco
  • LHCb: Roland
  • EGI: Alessandro

Action items

  • Item1

Fabio's notes on the 7th dCache Workshop (2013)

New 2.2 to 2.6 Upgrade Guide

Generally speaking, dCache is moving its web interface toward a Webmin customization, which also reports warnings.

Paul Millar, gPlazma - gPlazma1 will definitely disappear in 2.6; please migrate to gPlazma2 (the Swiss CMS T3 and T2 have already done so).

Ron from SARA - LDAP for dCache WebDAV authentication: interesting, but out of scope for us.

Christian Bernardt - the most useful talk for our typical admin day

  • They want to introduce the SSH2 shell into the dCache Webmin webpage
  • IPv6 in dCache
  • Logback configuration --> central dCache logging http://logback.qos.ch/documentation.html
  • dcache status now provides more information about log files and the uptime of the services
  • IT Hit as a new WebDAV server for dCache
  • counters in PnfsManagers ( like nfsstat )
  • a lot of new dCache stats
  • print srm counters
  • Nagios friendly JMX method to extract statistics from dCache

dCache Messaging - showing the dCache internals and the roles of the domains in dCache. A talk closely tied to the dCache internals:

  • well-known cells
  • Routing Manager: receives notifications, maintains the routing table, sends notifications
  • OpenMQ has a number of advantages over the default dCache broker

dCache queuing (I/O queues for a pool) - closely tied to the dCache internals, probably not useful for our typical day.

Gerd, after a very long talk about the dCache internals, presented a suggested door/domain map for 3 servers (which door in which domain).

Dimitry, SRM tuning slides - worth reading for the tips on thread pool size, common errors and possible solutions.

Gerd, DB tuning - it is possible to dump an .html file describing the dCache DB tables (dcache database). Gerd uses the HSQL Database Manager utility: http://hsqldb.org/

Patrick, slides about CMS disks and tapes plus work-in-progress dCache initiatives:

  • Federated identities
  • Hierarchical storage in dCache with SSD disks
  • Tape concerns about small files: solved by collecting the small files into a directory; dCache then runs a 'tar' and the result is uploaded to the tape system
  • HTTP/WebDAV federation of dCache installations (Fabrizio Furano)
  • Client optimizations (still a work in progress)
  • Individual dCache ( Private Cloud Storage )
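
The small-file idea from Patrick's slides (collect small files in a directory, tar them, send the single archive to tape) can be illustrated with a minimal sketch. This is not dCache's implementation, only the general pattern; the function name, the size threshold and its default value are assumptions made for the example.

```python
import os
import tarfile


def bundle_small_files(directory, archive_path, max_size=100 * 1024 * 1024):
    """Pack regular files smaller than `max_size` bytes from `directory` into one tar.

    Returns the names of the files that were added, in sorted order.
    The threshold is a hypothetical value, not one taken from the talk.
    """
    added = []
    archive_abs = os.path.abspath(archive_path)
    with tarfile.open(archive_path, "w") as archive:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.abspath(path) == archive_abs:
                continue  # never pack the archive into itself
            if os.path.isfile(path) and os.path.getsize(path) < max_size:
                archive.add(path, arcname=name)
                added.append(name)
    return added
```

The resulting archive is one large file, which is the kind of object a tape system handles well; the returned list would be what a catalogue needs in order to map each small file back to its archive.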

*********

Not dCache related, but things I found interesting while talking with Ivano Talamo, a CMS colleague from INFN Roma1:

  • Puppet + Augeas, to avoid maintaining many configuration files with small differences.
  • NRPE in Nagios with individual Nagios checks included one by one instead of a monolithic NRPE file listing all the checks (include_dir=/etc/nagios/nrpe.d/) <-- I confirm it works easily
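
To move from a monolithic NRPE file to the include_dir layout described above, the existing command[...] definitions have to be split into one file per check. A minimal sketch of that step follows; the function name is made up, and it assumes check names are safe to use as file names and that NRPE only picks up files ending in .cfg from the include directory.

```python
import os
import re

# Matches active (uncommented) NRPE command definitions: command[name]=command line
COMMAND_RE = re.compile(r"^\s*command\[([^\]]+)\]\s*=\s*(.+?)\s*$")


def split_checks(monolithic_text, include_dir):
    """Write one <check>.cfg file per command[...] line into `include_dir`.

    Returns the remaining configuration text (everything that is not an
    active command definition), which should keep or gain the include_dir line.
    """
    os.makedirs(include_dir, exist_ok=True)
    remaining = []
    for line in monolithic_text.splitlines():
        match = COMMAND_RE.match(line)
        if match is None:
            remaining.append(line)
            continue
        name, command = match.groups()
        with open(os.path.join(include_dir, name + ".cfg"), "w") as handle:
            handle.write("command[%s]=%s\n" % (name, command))
    return "\n".join(remaining) + "\n"
```

With one file per check, a configuration management tool can add or remove a single check on a host without touching the others, which fits the Puppet + Augeas point in the same list.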
Topic revision: r19 - 2016-11-08 - FabioMartinelli
 