Tags: meeting, SwissGridOperationsMeeting

Swiss Grid Operations Meeting on 2014-04-03

Agenda

Status

  • CSCS (reports George):
    • Procurement:
      • nothing to report
    • Operations:
      • 14x HP WNs put in production. 2x need to be replaced (DOA), service request done.
      • 8x HP WNs for Swiss users being tested and benchmarked.
      • IBM servers and NetApp storage not ready yet, currently being tested and benchmarked.
      • Decided against provisioning WNs with Puppet, as the CERN modules are not mature enough. Provisioning and configuration are done with Razor + CFEngine + YAIM (a sketch of the YAIM step follows this report).
      • New GPFS configuration ongoing (ServiceGPFS2)
      • More detailed Slurm CPU monthly reports on the way.
      • ARC CEs publishing correctly and steadily to APEL production server, long-lived GGUS ticket eventually closed.
      • We started an official collaboration with APEL team to check and improve the quality of the APEL parser running on CREAM CEs using SLURM as batch system.
      • Added a new page with SLURM stats for everyone to see (Fairshare multiplier included): http://ganglia.lcg.cscs.ch/ganglia3/?r=hour&cs=&ce=&tab=v&vn=SLURM+dashboard
    • Issues:
      • Need to improve the node health check. We just saw one of the new nodes become a job black hole while the system kept sending jobs to it (a generic check sketch follows this report).
      • Heartbleed implications: all CE certificates have been renewed and the old ones will be revoked.
      • Seeing soft lockups on some WN CPU cores, this appears to trigger a cascading failure until the node hangs. Latest kernel doesn't seem to have this issue so far (2.6.32-431.x)
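    A minimal sketch of the YAIM step mentioned above, assuming it runs as a post-provisioning hook once Razor has installed the OS and CFEngine has laid down the files; the site-info.def path and the node type are illustrative, not the actual CSCS setup:

      #!/bin/bash
      # Post-provisioning hook (sketch): configure the grid middleware on a
      # freshly provisioned WN with YAIM. Paths and node type are examples only.
      set -euo pipefail
      SITE_INFO=/root/siteinfo/site-info.def    # assumed location of the site configuration
      # -c = configure, -s = site configuration file, -n = node type to configure
      /opt/glite/yaim/bin/yaim -c -s "$SITE_INFO" -n WN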
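    On the black-hole issue, the attached nodeHealthCheck.sh is the actual script; the fragment below is only a generic sketch of the idea, i.e. drain a node in SLURM as soon as a basic check fails so the scheduler stops sending jobs to it (mount point, threshold and reason strings are assumptions):

      #!/bin/bash
      # Generic node health check sketch: on any failed check, drain the node.
      NODE=$(hostname -s)

      fail() {
          # Drain the node with a reason that shows up in `sinfo -R`.
          scontrol update NodeName="$NODE" State=DRAIN Reason="healthcheck: $1"
          exit 1
      }

      mountpoint -q /scratch     || fail "scratch not mounted"
      pgrep -x slurmd >/dev/null || fail "slurmd not running"
      USED=$(df -P /scratch | awk 'NR==2 {print $5}' | tr -d '%')
      [ "$USED" -lt 95 ]         || fail "scratch almost full"

      exit 0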
  • PSI (reports Fabio):
    • I was sick for a week, so there are only a few things to report this time.
    • Updated the PSI PhEDEx agents to version 4_1_3 and updated their UI middleware to the latest emi-ui-3.0.3-1.el5; CSCS is still running PhEDEx version 4_1_0.
    • The gfal-tools are now the suggested data management tools at PSI; the T3 users will gradually replace the outdated lcg-* tools in their code (example commands follow this report).
    • Updated ~10 SL6 servers to openssl-1.0.1e-16.el6_5.7.x86_64 and recreated 3 server X.509 certificates because of the critical Heartbleed OpenSSL bug.
    • Updated the dCache Chimera view v_pnfs and set up an SSL PostgreSQL configuration that lets each T3 user independently query v_pnfs with either their LDAP password or their PSI Kerberos password (see attachments; a pg_hba.conf sketch also follows this report). The aim is to confront the T3 users with the TBs of forgotten data they should erase; some erase requests have already come in.
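    For reference on the lcg-* to gfal migration, a few example command equivalences; the SURLs below are placeholders, not real PSI paths:

      # Legacy lcg_util commands (to be phased out):
      lcg-ls "srm://se.example.ch/pnfs/example.ch/data/user/"
      lcg-cp "srm://se.example.ch/pnfs/example.ch/data/user/file.root" "file:///tmp/file.root"

      # gfal2-util replacements suggested to the T3 users:
      gfal-ls   "srm://se.example.ch/pnfs/example.ch/data/user/"
      gfal-copy "srm://se.example.ch/pnfs/example.ch/data/user/file.root" "file:///tmp/file.root"
      gfal-rm   "srm://se.example.ch/pnfs/example.ch/data/user/old_file.root"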
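    The actual PostgreSQL 9.3 configuration is in the attached v_pnfs.jpg; the lines below are only a sketch of the kind of pg_hba.conf entries involved, with database, role, network and LDAP names as placeholders:

      # pg_hba.conf sketch: SSL-only access to the Chimera DB for T3 users,
      # authenticated either via Kerberos (GSSAPI) or via an LDAP simple bind.
      hostssl  chimera  +t3users  192.0.2.0/24  gss   include_realm=0
      hostssl  chimera  +t3users  192.0.2.0/24  ldap  ldapserver=ldap.example.ch ldapprefix="uid=" ldapsuffix=",ou=users,dc=example,dc=ch"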
  • UNIBE (reports Gianfranco):
    • Several weeks of stable operations with close to no maintenance needed.
    • OpenSSL vulnerability:
      • WNs not affected (openssl-1.0.0-20.el6_2.5.x86_64 across all of them)
      • 9 servers affected, certificates issued (delay from RA) and replaced this morning. Old certificates revoked around 11:40 today.
    • ANALY_UNIBE-LHEP queue created in AGIS and added to HammerCloud. This is the first queue to test the functionality of the new ATLAS Control Tower developed by Andrej, running at CERN, which has many new functionalities, e.g.:
      • Ability to have PanDA queues at sites rather than have all sites grouped under ARC and ARC_T2 PanDA queues
      • Ability to receive HammerCloud tests (automated auto-exclusion, whitelisting)
      • Automated exclusion in case of GOCDB downtime
      • Ability to use the DATADISK and SCRATCHDISK at the local site, rather than only use the ND T1 SE
      • ...
      • Andrej is currently debugging; no HC jobs running yet (I think)
    • Accounting on the ARC CEs switched from ur-logger->SGAS to Jura->SGAS on ce02
      • The a-rex->SGAS->APEL chain is working, but archiving is not yet; support from the developers is a bit slow
      • Plan is: get archiving to work, then add the APEL test server to the jobreport= option and send records to both in parallel. When satisfied that it works, switch to production APEL (GGUS ticket needed?) and turn off SGAS (a configuration sketch follows this report)
    • New ARC CE (ce03) set up with SLURM and a WN (all VMs) as a testbed for future HPC submission
    • DPM SE set up and monitored for xrootd; joined the German xrootd federation (FAX) http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=FAX+endpoints&fullscreen=true&highlight=false
    • 6x 36-disk (RAID10) servers with InfiniBand delivered by DALCO, for the new Lustre deployment to replace the Thumpers. Started iozone tests. Not sure yet whether to keep Lustre 2.1.x or go to 2.4.x
    • GIIS (giis.lhep.unibe.ch) and VOMS (voms.lhep.unibe.ch) server deployment complete and both in production. Client and ARC CE configurations should now point to them (previously giis.smscg.ch / voms.smscg.ch)
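    A sketch of what the relevant arc.conf fragment might look like while records go to SGAS and the APEL test server in parallel; the destination URLs are placeholders, only the jobreport= option is taken from the notes above, and any further Jura/archiving options should be checked against the ARC CE documentation:

      [grid-manager]
      # Jura sends a usage record to every destination listed here (placeholder URLs):
      jobreport="https://sgas.example.org:8443/sr https://apel-test.example.org:6163"
      # Local archiving of the records (the part not yet working) is enabled
      # through additional Jura options not shown in this sketch.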
  • UNIGE (reports Szymon):
    • Finished the accounts review
      • 18 closed accounts, 59 active of which 7 “leaving” (a transition phase)
      • 183 people have or had accounts
    • Adding Ex Trigger hardware
      • 35 taken, 2 dead, 26 up and running
      • 7 to do, but physically there
      • cores removed: 8 login, 24 batch
      • cores added: 16 login, 192 batch
      • total batch slots now: 616
      • 56 cores waiting to be added
    • Migration from SLC5 to SLC6
      • Of our 69 batch and login machines, 33 run SLC6
    • ATLAS data transfer rates have become too good
      • risk of saturating the 10 Gbps UniGE-SWITCH link
      • an FTS3 pilot server has been deployed for us, giving control of the total rate
      • the progress of the rates over the years is spectacular
    • FAX (Federated data Access using Xrootd)
      • idea: read data from other sites over the WAN
      • it is working for us
      • performance tests: reading data at ~2 MB/s from the US, not bad (a command sketch follows this report)
      • redirection to CERN was not working; problem found at CERN, fixed
    • Mandatory upgrade of the site BDII
      • new one set up from scratch on another machine
      • new name (a DNS alias) in the GOCDB
    • Maintenance issues
      • two crashes of Solaris disk servers => OS and firmware updates
      • three disk failures in IBM disk servers (in 16 machines x 14 disks)
      • there is a problem with IBM support: we have been waiting 1.5 weeks for a disk ("disk out of stock")
      • Maui and OpenSSL security issues
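    The FAX remote-read tests above amount to something like the command below; the redirector host and file path are placeholders, not the files actually used:

      # Read a file over the WAN through a FAX redirector and time the transfer.
      time xrdcp root://fax-redirector.example.org//atlas/rucio/user.someuser:test_file.root /tmp/test_file.root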
  • UNIBE-ID:
    • Procurement
      • Purchased 4x IBM x3550 M4 servers to replace very old infrastructure servers (jobscheduler, frontend nodes, …)
      • Purchased 7x IBM HS23 (24c, 96GB RAM, 1x SATA disk each) to fill the last BladeCenter with new WNs
    • Operations
      • Mostly quiet operations despite the issues described below
      • Deployed two hardware firewalls in an active/active failover configuration to secure the private networks of UBELIX, as the previous software-based firewall/router servers reached their EOL; the setup is being tested and will go into production in April.
    • Issues
      • Gridengine started segfaulting randomly in CW13; segfaults probably triggered by newly submitted jobs or finished jobs. Restarts/Reboots did not help. Traces/Logs not meaningful.
      • Still have network issues (present since the relocation, despite recently changed switch models), though not as severe anymore. Symptoms: degraded GPFS performance and occasionally killed jobs due to packet loss (discards by the switches due to buffer overflow). Mid-term plan is to move the GPFS storage to IB.
  • UZH (reports Sergio):
    • Xxx
  • Switch (reports Alessandro):
    • Xxx
Other topics
  • Swiss users CPU allocation at CSCS: 8 compute nodes have been purchased to serve Swiss users specifically. There are many ways of implementing this, but before doing anything we need to answer these questions:
    • Are these 8 nodes to be published as part of Phoenix or kept "hidden"? Currently they are NOT published (only 166 CPUs published).
    • Are dedicated CEs to be used for these specific users? If so, ARC and/or CREAM?
    • What is the hierarchy of these users with respect to other VO users? Dedicated VOMS and queues? This greatly affects fairshare (a configuration sketch follows these questions).
    • Is accounting for jobs of these users to be published or not?
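    To make the options above more concrete, one possible (not agreed) implementation would be a dedicated SLURM partition plus a separate account in the fairshare tree; all names and values below are placeholders:

      # slurm.conf sketch: keep the 8 Swiss-user nodes in their own partition.
      NodeName=wn-ch[01-08] CPUs=32 RealMemory=64000 State=UNKNOWN
      PartitionName=swiss Nodes=wn-ch[01-08] AllowGroups=swissusers Default=NO State=UP

      # sacctmgr sketch: give the Swiss-user account its own fairshare weight.
      sacctmgr add account swissusers Description="Swiss users" Fairshare=10
      sacctmgr add user someuser Account=swissusers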

  • Host certificate release procedure by SWITCH/QuoVadis
    • in Bern we have issues with delays in obtaining certs
    • Procedure: request -> get email -> sign it -> bring it in person to the Bern RA -> pray
    • What is the experience at other sites?
    • Is there need for a general revision of the procedure?
    • Apparently the cumbersome procedure exists only in Bern. Elsewhere everything is web based and the certificate is issued the same day, or within 24h at most.
Next meeting date:

AOB

Attendants

  • CSCS: George, Gianni, Miguel
  • CMS: Fabio
  • ATLAS: Gianfranco
  • LHCb:
  • EGI:

Action items

  • Item1

Attachments

  • HC-week14-CSCS.pdf (84.8 K, 2014-04-11, GianfrancoSciacca): HammerCloud-week14-cscs
  • nodeHealthCheck.sh (9.5 K, 2014-04-11, MiguelGila): Node Health Check CSCS
  • v_pnfs.jpg (36.8 K, 2014-04-10, FabioMartinelli): PostgreSQL 9.3 v_pnfs conf