
Swiss Grid Operations Meeting on 2014-07-03

Site status

CSCS

  • GPFS status update
    • We found a damaged IB cable on one of the GPFS storage servers: it was running at ~10MB/s instead of ~1100MB/s. This caused worker nodes to randomly drop off GPFS and, as a consequence, all jobs on the affected node would be kicked out and fail. This is the root cause of the high number of failed jobs over the last months.
    • GPFS filesystem ran out of inodes for two reasons:
      1. All failed jobs left their output on the filesystem
      2. The GPFS policies did not run for a while. This is because, as the system fills up, the policies take longer to run. In addition, a mismatch in the GPFS version across the nodes prevented the policies from running at times.
    • This is the second time this has happened in the last year. After replacing the cable, we dedicated 3 nodes to OPS jobs and to the GPFS cleaning policies, which now run every 2 days and delete all files older than 6 days. The 'good' side of this problem is that we verified that the nodeHealthCheck.sh script works, and this time the filesystem didn't break: CSCS 'only' stopped running jobs.
    • All WNs now run the same IB stack (Mellanox), SL version (6.5), and EMI, GPFS, SLURM and CVMFS packages.
    • We're working on deploying a Nagios check to make sure the performance of the IB cables is what it should be (see the sketch below). This is a complex task that involves other CSCS personnel.
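    A minimal sketch of what such a check might look like is shown below. It is purely illustrative, not the actual CSCS plugin: the device name, port, expected rate and error threshold are assumptions, and a real check would probably also watch measured throughput.

      #!/usr/bin/env python
      # Illustrative Nagios-style check for InfiniBand link health (not the CSCS plugin).
      # Reads the negotiated rate and the symbol error counter from sysfs and returns
      # standard Nagios exit codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
      import sys

      DEVICE, PORT = "mlx4_0", 1   # assumption: typical Mellanox HCA naming
      EXPECTED_GBPS = 40           # assumption: 4X QDR link
      MAX_SYMBOL_ERRORS = 100      # assumption: site-specific tolerance

      def read(path):
          with open(path) as f:
              return f.read().strip()

      base = "/sys/class/infiniband/%s/ports/%d" % (DEVICE, PORT)
      try:
          rate = float(read(base + "/rate").split()[0])       # e.g. "40 Gb/sec (4X QDR)"
          errors = int(read(base + "/counters/symbol_error"))
      except (IOError, OSError, ValueError) as err:
          print("UNKNOWN - cannot read IB status: %s" % err)
          sys.exit(3)

      if rate < EXPECTED_GBPS:
          print("CRITICAL - link at %.0f Gb/s, expected %d Gb/s" % (rate, EXPECTED_GBPS))
          sys.exit(2)
      if errors > MAX_SYMBOL_ERRORS:
          print("WARNING - %d symbol errors on %s port %d" % (errors, DEVICE, PORT))
          sys.exit(1)
      print("OK - %s port %d at %.0f Gb/s, %d symbol errors" % (DEVICE, PORT, rate, errors))
      sys.exit(0)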

  • GPFS2 (new GPFS filesystem) ServiceGPFS2
    • The system is being configured right now by the storage team of CSCS. Extensive tests are being done on the HW and GPFS configuration.
    • This new filesystem will run GPFS 3.5, most likely with a 1MB block size and with metadata also storing data (small files). It will be divided into two areas (filesets), each with its own quota and inode count; this, in turn, will make the policies run faster and will (hopefully) prevent the system from filling up:
      1. /gridhome with ~5TB of storage
      2. /scratch with the rest (~65TB)
    • We intend to move cream04 (not in production) to this new system and test production jobs on that machine. Once we are satisfied, we will need to schedule a downtime and move all the CREAM and ARC-CEs.
  • SLURM issues
    • The fairshare configuration in SLURM seems to be problematic when one VO does not submit jobs for a while (e.g. over weekends) while others submit constantly (ATLAS). Because of the fairshare calculations, there are periods in which only one VO's jobs run. Over time this is fair, but every VO should always be able to run some jobs (the site's availability status depends on this).
      • To limit this behaviour, we need to reserve a minimum number of job slots per VO. To accomplish this, we're going to change the way nodes are assigned to the different partitions. For now, 2 nodes (64 cores) will be reserved exclusively for ATLAS, another 2 (64 cores) for CMS and 1 node (32 cores) for LHCb.
      • If this works well, we'd like to assign 4 32-core nodes to ATLAS, another 4 32-core nodes to CMS and 2 32-core nodes to LHCb.
      • Of course, the downside is that if the partition of a specific VO is empty, a few cores will sit unused; but overall we think that sacrificing 5-10 nodes for short periods of time is a good compromise.
    • The new SLURM release exposes job efficiency values; we are working on turning these into a Ganglia chart (see the sketch below). So far we've seen a huge discrepancy between short jobs (<1h) and long jobs.
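    The sketch below illustrates one possible way to derive such numbers outside of SLURM itself, by computing CPU efficiency as TotalCPU / (Elapsed x AllocCPUS) from sacct and splitting jobs at the 1-hour mark. It is a hypothetical helper, not the actual CSCS/Ganglia tooling, and the start date is a placeholder.

      #!/usr/bin/env python
      # Hypothetical sketch: per-job CPU efficiency from SLURM accounting (sacct),
      # comparing short (<1h) and long jobs. Not the actual CSCS/Ganglia setup.
      import subprocess

      def seconds(t):
          # sacct times look like "1-02:03:04", "02:03:04" or "03:04.123"
          days, _, rest = t.partition("-") if "-" in t else ("0", "", t)
          parts = [float(x) for x in rest.split(":")]
          while len(parts) < 3:
              parts.insert(0, 0.0)
          h, m, s = parts
          return int(days) * 86400 + h * 3600 + m * 60 + s

      out = subprocess.check_output(
          ["sacct", "-a", "-n", "-P", "-X", "-S", "2014-07-01",   # placeholder start date
           "-o", "Elapsed,TotalCPU,AllocCPUS,State"]).decode()

      short, long_jobs = [], []
      for line in out.splitlines():
          if not line:
              continue
          elapsed, totalcpu, alloccpus, state = line.split("|")
          if not state.startswith("COMPLETED"):
              continue
          wall, ncpus = seconds(elapsed), int(alloccpus or 0)
          if wall == 0 or ncpus == 0:
              continue
          (short if wall < 3600 else long_jobs).append(seconds(totalcpu) / (wall * ncpus))

      for label, effs in (("<1h", short), (">=1h", long_jobs)):
          if effs:
              print("%5s jobs: %5d, mean CPU efficiency %5.1f%%"
                    % (label, len(effs), 100.0 * sum(effs) / len(effs)))
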
  • Swiss Resources: after last year's purchase, we have now commissioned resources that are to be used specifically by Swiss Grid users. These consist of 8x 40-core IvyBridge machines and ~360TB of storage capacity (dCache). We need to establish how to differentiate normal Grid users from CH Grid users (VOMS?) and set up the user mapping.
    • Compute: the resources are fully deployed (and used) as part of the cluster. Once the mapping is done, we will give Swiss users higher priority than standard WLCG grid users. VO shares stay the same. Over time we will evaluate how well this works for the users.
    • Storage: the intention is to deploy the storage (already configured) into dCache and use SPACE TOKENS to distribute the space across the different VOs. We suggest the following names and configuration (see the arithmetic check below): ATLASCH 150T (41.6%), CMSCH 150T (41.6%), LHCBCH 60T (~17%).
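    As a quick arithmetic check of the proposed split against the ~360TB total mentioned above (illustrative only):

      # Quick check of the proposed space-token split (TB and share of the ~360TB total).
      tokens = {"ATLASCH": 150, "CMSCH": 150, "LHCBCH": 60}
      total = sum(tokens.values())                  # 360 TB
      for name in ("ATLASCH", "CMSCH", "LHCBCH"):
          print("%-8s %4d TB  %5.1f%%" % (name, tokens[name], 100.0 * tokens[name] / total))
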
  • Availability/Reliability: why is CSCS so much lower for ATLAS than for the other VOs?
    1. May 2014: ATLAS 66%/67%, CMS 95%/97%, LHCb 92%/94%
    2. June 2014: ATLAS 77%/77%, CMS 98%/98%, LHCb 99%/99%
  • Next Downtimes: as mentioned, cream04 will be put in downtime once ServiceGPFS2 is ready, and we will use it to test production jobs on the new storage. Also, in September, CSCS will most likely upgrade its Ethernet infrastructure and a site-wide downtime will need to be scheduled. This will affect Phoenix, but hopefully by then we'll be ready with GPFS2 to move all CEs there and avoid another maintenance.
  • Purchases: as usual, the purchasing period for this year's phase (Phase J) will start soon. The numbers for this phase will be presented during our next F2F meeting on August 19.
  • dCache bugs

PSI

  • Fabio on leave until 6th July

UNIBE-LHEP

  • Operations
    • smooth routine operations
    • a-rex crashed twice; manual restart required
    • fsck needed twice on the NFS homes/LDAP server (UIs). The server is 8 years old; a replacement is now being procured
  • ATLAS specific operations
  • Accounting
    • Jura eventually fixed on both clusters and now running in production
    • It appears that >90k job records were lost for May (during the transition):
      • APEL changed their broker network hosts (without explicit warning) and gave us some random host names to try, which eventually resulted in an (apparently) unrecoverable mess.
      • all records were sent to SGAS in parallel, but re-publishing of job summaries has not changed the numbers
      • the obscure suggestions from JG at APEL were not usable, and the SGAS server will be turned off tomorrow
      • decided to archive the case and live with the loss

UNIBE-ID (written report only; absent due to a short-term maintenance downtime related to the cooling system)

  • Smooth and stable operations in June
  • New submit servers using the LSNAT feature of the core switch for load balancing are on the way (currently in the provisioning/testing stage)
  • GPFS: updated from 3.4.0 to 3.5.0 in two steps:
    • First upgraded all nodes (clients & servers) to 3.5.0 rpms
    • During a downtime, switched on the new features by changing the filesystem attributes (mmchconfig release=LATEST; mmchfs <fs> -V full)
    • Worked like a charm
  • ARC-CE: after some weeks of continuous operation, a-rex again unexpectedly quit its duty two days ago (manually restarted after ~30min). No usable log entries about the cause.

UNIGE

  • Stable operations, nearly constant load of 400 grid jobs
  • FAX (Federated ATLAS data access using XrootD)
    • redirection to CERN is now working
    • tested performance of up to 60 MB/s reading data stored at CERN (see the sketch below)
    • documented for the users
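    A rough sketch of how such a read-performance test could be scripted is shown below; it simply times an xrdcp transfer. The redirector and file path are placeholders, not the actual endpoints used at UNIGE.

      #!/usr/bin/env python
      # Hypothetical sketch: measure FAX read throughput by timing an xrdcp transfer.
      # Redirector and file path are placeholders, not the actual UNIGE test setup.
      import os, subprocess, time

      REDIRECTOR = "root://fax-redirector.example.org:1094/"   # placeholder endpoint
      TESTFILE = "/atlas/some/test/file.root"                  # placeholder path
      DEST = "/tmp/fax_read_test"

      start = time.time()
      subprocess.check_call(["xrdcp", "-f", REDIRECTOR + TESTFILE, DEST])
      elapsed = time.time() - start
      size_mb = os.path.getsize(DEST) / 1e6
      print("read %.0f MB in %.1f s -> %.1f MB/s" % (size_mb, elapsed, size_mb / elapsed))
      os.remove(DEST)
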
  • Hardware RAID cards in certain IBM disk servers
    • concerns IBM x3630 M3 (2011-12), the M4 is OK
    • out of 12 machines, four had overheating RAID cards
    • broken plastic rivets let the heatsink detach from the chip (see photos)
    • opened and inspected all the machines, repaired two
  • Draining and rebooting of batch machines has been automated
  • Procurement of hardware for the 2014 upgrade ongoing

NGI_CH

  • Nothing to report from OMB

Other topics

  • NGI_CH ARGUS: what is the plan for ticket 284? CSCS is open to hosting and running a national ARGUS instance, but needs all NGI_CH sysadmins to agree to co-administer it.
Next meeting date:

A.O.B.

Attendants

  • CSCS: George Brown, Miguel Gila
  • CMS:
  • ATLAS:Szymon Gadomski
  • LHCb: Roland Bernet
  • EGI:

Material

  • IBMhwRAID.pdf: Photos of hardware RAID in IBM x3630 M3 at UNIGE

Action items

Topic attachments

  • prod_vojobs-running-week.gif.png (13.2 K, 2014-07-03, MiguelGila): CSCS jobs running over a week
  • IBMhwRAID.pdf (3183.4 K, 2014-07-03, SzymonGadomski): Photos of hardware RAID in IBM x3630 M3 at UNIGE