Tags: view all tags

Swiss Grid Operations Meeting on 2014-07-03

Date and time: First Thursday of the month, at 14:00
Place: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9305236)
External link: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
Phone gate: From Switzerland: 0227671400 (portal) + 9305236 (extension) + # (pound sign)
IRC chat: irc:gridchat.cscs.ch:994#lcg (ask pw via email)

Swiss Grid Operations Meeting on 2014-07-03

Site status

CSCS

GPFS status update
- We found a damaged IB cable on one of the GPFS storage servers that was running at ~10MB/s instead of ~1100MB/s. This made worker nodes to randomly drop off GPFS and consequently, all jobs on the node being kicked out would fail. This is the root cause of the high number of failed jobs in the last months.
- GPFS filesystem ran out of inodes due to 2 things:
  1. All failed jobs left their output on the filesystem
  2. The GPFS policies did not run for a while. This was because as the system gets filled, policies take longer to run. Also there was a mismatch in the GPFS version on the nodes that prevented policies to run at times.
- This is the second time this happens in the last year. After replacing the cable, we've dedicated 3 nodes to OPS jobs and GPFS cleaning policies, running every 2 days and cleaning all files older than 6 days. The 'good' side of this problem is that we checked that the nodeHealthCheck.sh script works and this time the filesystem didn't break: CSCS 'only' stopped running jobs.
- All WNs run now the same IB stack (Mellanox), SL version (6.5), EMI, GPFS, SLURM and CVMFS packages.
- We're working on deploying a Nagios check to make sure the performance of the IB cables is what it should be. This is a complex task in which other CSCS personnel is involved.

GPFS2 (new GFPS filesystem) ServiceGPFS2
- The system is being configured right now by the storage team of CSCS. Extensive tests are being done on the HW and GPFS configuration.
- This new filesystem will be GPFS 3.5 with most likely 1MB block size and metadata storing data (small files). It will be divided in two areas (file sets) that will have their own quotas and inode counts; this, in turn, will make the policies to run faster and will (hopefully) prevent the system from getting filled:
  1. /gridhome with ~5TB of storage
  2. /scratch with the rest (~65TB)
- We have the intention to move cream04 (not in production) to this new system and test production jobs on the machine. Once we are satisfied, we'd need to establish a downtime and move all the CREAM and ARC-CEs.
SLURM issues
- Fairshare configuration in SLURM seems to be problematic when one VO does not submit jobs for a while (weekends) and others constantly do (ATLAS). Because of the fairshare calculations, there will be times in which jobs of only one VO will run. Over time this is fair, but VOs should always be able to run some jobs (the status of the site depends on this).
  - To limit this behaviour, we need to establish a minimum number of job slots reserved per VO. In order to accomplish this, we're going to change the way nodes are assigned to the different partitions. For now, 2 nodes (64 core) will be reserved completely to ATLAS, another 2 to CMS (64 core) and 1 to LHCb (32 core).
  - If this works well, we'd like to assign 4 32-core nodes to ATLAS, another 4 32-core nodes to CMS and 2 32-core nodes to LHCb.
  - Of course, the downside is that if the partition for a specific VO is empty, there will be a few cores unused; but overall we think sacrificing 5-10 nodes during short periods of time is a good compromise.
- On the new release of SLURM we've seen that efficiency values are being shown. Working on providing a Ganglia chart. So far we've seen a huge discrepancy between short jobs (<1h) and long jobs.
Swiss Resources: after last year's purchase, now we have commissioned some resources that are to be used specifically by Swiss Grid users. This consists on 8x 40-core IvyBridge Machines and ~360TB of storage capacity (dCache). We need to establish how to differentiate normal Grid users from CH Grid users (VOMS?) and do the user mapping.
- Compute: Resources are fully deployed (and used) as part of the cluster. Once we have the mapping done, we will give Swiss users more priority over standard WLCG grid users. VO shares stay the same. Over time we will evaluate how well this works for the users.
- Storage: The intention is to deploy the storage (already configured) to dCache and use SPACE TOKENS to distribute the space across the different VOs. We suggest the following names and configuration ATLASCH 150T (41.6%), CMSCH 150T (41.6%), LHCBCH 60T (~17%).
Availability/Reliability: what is the cause of CSCS being so low for ATLAS compared to other VOs?
1. May 2014: ATLAS 66%/67% CMS 95%/97% LHCb 92%94%
2. June 2014: ATLAS 77%/77% CMS 98%/98% LHCb 99%99%
Next Downtimes: as mentioned, cream04 will be in downtime once ServiceGPFS2 is ready and we will use it to test production jobs on the new storage. Also in september, CSCS will most likely upgrade their Ethernet infrastructure and a site-wide downtime needs to be established. This will affect Phoenix, but hopefully by then we'll be ready with GPFS2 to shift all CEs there and avoid another maintenance.
Purchases: as usual, the purchasing period for this year's phase (Phase J) will start soon. The numbers of this phase will be presented during our next F2F meeting in August 19.
dCache bugs
- Symlinks broken after postgres upgrade, fixed in 2.6.30. https://github.com/DmitryLitvintsev/dcache/commit/1026c389dd02ce493da0236ffd0dd1e50f43deab
- Disable publishing of NFS domain, fixed in 2.6.31. We currently use a sed to hack around this. https://github.com/dCache/dcache/commit/ee43118

PSI

Fabio on leave until 6th July

UNIBE-LHEP

Operations
- smooth routine operations
- twice a-rex crashed, manual restart
- fsck needed twice on NFS homes/LDAP (UIs). 8-year old server, procuring a replacement now
ATLAS specific operations
- smooth ruotine operations
- HammerCloud gangarobot: http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UNIBE-LHEP&startTime=2014-06-01&endTime=2014-06-30&templateType=isGolden
Accounting
- Jura eventually fixed on both clusters and now running in production
- There is a good feel that >90k job records where lost for May (during the transaction):
  - APEL changet their broker network hosts (without explicit warning), and gave us some random host names to try which eventually resulted in some (apparently) unrecoverable mess.
  - all records were sent to SGAS in parallel, but re-publishing of job summaries has not changed the numbers
  - obscure suggestions from JG at APEL not usable, and SGAS server will be turned off tomorrow
  - decided to archive the case and live with the loss

UNIBE-ID (written only, absent due to short-term maintainence down - cooling system related)

Smooth and stable operations in June
New submit servers using LSNAT feature of core switch for loadbalancing on the way (currently in provisioning/testing stage)
GPFS: updated from 3.4.0 to 3.5.0 in two steps:
- First upgraded all nodes (clients & servers) to 3.5.0 rpms
- During downtime switched to using the new features by changing the fs attributes (mmchconfig release=LATEST; mmchfs <fs> -V full
- Worked like a charm
ARC-CE: after some weeks of continuous operation, a-rex again unexpectedly quit it's duty two days ago (manually restarted after ~30min). No usable log entries about the cause.

UNIGE

Stable operations, nearly constant load of 400 grid jobs
FAX (Federated ATLAS data access using XrootD )
- redirection to CERN is now working
- tested performance up to 60 MB/s reading data stored at CERN
- documented for the users
Hardware RAID cards in certain IBM disk servers
- concerns IBM x3630 M3 (2011-12), the M4 is OK
- out of 12 machines, four had overheating RAID
- broken plastic rivets, radiator set aches from the chip (see photos)
- opened and inspected all the machines, repaired two
Draining and rebooting of batch machines have been automatized
Procurement of hardware for the 2014 upgrade ongoing

NGI_CH

Nothing to report from OMB

A.O.B.

Attendants

CSCS: George Brown, Miguel Gila
CMS:
ATLAS:Szymon Gadomski
LHCb:
EGI:

Material

IBMhwRAID.pdf: Photos of hardware RAID in IBM x3630 M3 at UNIGE

Action items

Item1

Attachments

Topic attachments
I	Attachment	History	Action	Size	Date	Who	Comment
png	prod_vojobs-running-week.gif.png	r1	manage	13.2 K	2014-07-03 - 12:11	MiguelGila	CSCS jobs running over a week
pdf	IBMhwRAID.pdf	r1	manage	3183.4 K	2014-07-03 - 08:40	SzymonGadomski	Photos of hardware RAID in IBM x3630 M3 at UNIGE