<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Swiss Grid Operations Meeting on 2014-07-03

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 9305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS

   * *GPFS status update*
      * We found a damaged IB cable on one of the GPFS storage servers, which was running at ~10 MB/s instead of ~1100 MB/s. This made worker nodes randomly drop off GPFS and, consequently, all jobs on a node being kicked out would fail. This is the root cause of the high number of failed jobs in recent months.
      * The GPFS filesystem ran out of inodes for two reasons:
         1 All failed jobs left their output on the filesystem.
         1 The GPFS policies did not run for a while: as the system fills up, policies take longer to run, and a mismatch in the GPFS version across nodes prevented policies from running at times.
      * This is the second time this has happened in the last year. After replacing the cable, we dedicated 3 nodes to OPS jobs and to the GPFS cleaning policies, which now run every 2 days and delete all files older than 6 days. The 'good' side of this problem is that we verified that the =nodeHealthCheck.sh= script works, and this time the filesystem didn't break: CSCS 'only' stopped running jobs.
      * All WNs now run the same IB stack (Mellanox), SL version (6.5), EMI, GPFS, SLURM and CVMFS packages.
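The cleanup described above is done with GPFS's own policy engine; as a rough illustration of the selection criterion only (files untouched for more than 6 days), a minimal sketch in plain Python could look like this. Paths and the cutoff are assumptions, not the actual CSCS policy.

```python
import os
import time

CUTOFF_DAYS = 6  # from the meeting notes: cleaning policies delete files older than 6 days


def files_older_than(root, days=CUTOFF_DAYS, now=None):
    """Return paths under `root` whose mtime is more than `days` days old.

    This only mimics the selection step of a cleaning policy; a real GPFS
    policy run would also perform the deletion.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime < cutoff:
                    old.append(path)
            except OSError:
                pass  # file vanished while scanning; a real policy run tolerates this
    return old
```

In GPFS itself this criterion would be expressed as a policy rule and applied with =mmapplypolicy=, which scales to the inode counts mentioned above far better than a filesystem walk.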
      * We're working on deploying a Nagios check to make sure the performance of the IB cables is what it should be. This is a complex task that involves other CSCS personnel.

<img alt="inode usage" src="http://ganglia.lcg.cscs.ch/ganglia/inodes.png" /> <img alt="GPFS performance" src="http://ganglia.lcg.cscs.ch/ganglia/gpfs_perf.png" />

   * *GPFS2* (new GPFS filesystem) ServiceGPFS2
      * The system is currently being configured by the CSCS storage team. Extensive tests are being done on the HW and the GPFS configuration.
      * The new filesystem will be GPFS 3.5, most likely with a 1 MB block size and with metadata storing data (small files). It will be divided into two areas (filesets) with their own quotas and inode counts; this, in turn, will make the policies run faster and will (hopefully) prevent the system from filling up:
         1 =/gridhome= with ~5 TB of storage
         1 =/scratch= with the rest (~65 TB)
      * We intend to move =cream04= (not in production) to this new system and test production jobs on the machine. Once we are satisfied, we will schedule a downtime and move all the CREAM and ARC-CEs.
   * *SLURM issues*
      * The *fairshare* configuration in SLURM seems to be problematic when one VO does not submit jobs for a while (weekends) while others constantly do (ATLAS). Because of the fairshare calculations, there will be times when jobs of only one VO run. Over time this is fair, but VOs should always be able to run some jobs (the status of the site depends on this).
      * To limit this behaviour, we need to establish a minimum number of job slots reserved per VO. To accomplish this, we're going to change the way nodes are assigned to the different partitions. For now, 2 nodes (64 cores) will be reserved entirely for ATLAS, another 2 (64 cores) for CMS, and 1 (32 cores) for LHCb.
      * If this works well, we'd like to assign 4 32-core nodes to ATLAS, another 4 32-core nodes to CMS and 2 32-core nodes to LHCb.
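The per-VO node reservation sketched above could be expressed in =slurm.conf= roughly as follows. This is a hypothetical excerpt, not the actual CSCS configuration: node names, Unix group names and partition names are placeholders.

```
# Two 32-core nodes reserved for ATLAS, two for CMS, one for LHCb (placeholders).
NodeName=wn[001-002] CPUs=32 State=UNKNOWN
NodeName=wn[003-004] CPUs=32 State=UNKNOWN
NodeName=wn005       CPUs=32 State=UNKNOWN
PartitionName=atlas_resv Nodes=wn[001-002] AllowGroups=atlas State=UP
PartitionName=cms_resv   Nodes=wn[003-004] AllowGroups=cms   State=UP
PartitionName=lhcb_resv  Nodes=wn005       AllowGroups=lhcb  State=UP
```

Restricting each reserved partition with =AllowGroups= keeps the shared fairshare partitions untouched while guaranteeing each VO a minimum number of slots.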
      * Of course, the downside is that if the partition for a specific VO is empty, a few cores will sit unused; but overall we think sacrificing 5-10 nodes during short periods of time is a good compromise.
      * The new release of SLURM shows job *efficiency values*. We are working on providing a Ganglia chart. So far we've seen a huge discrepancy between short jobs (<1h) and long jobs.
   * *Swiss resources*: after last year's purchase, we have now commissioned resources to be used specifically by Swiss Grid users. These consist of 8x 40-core IvyBridge machines and ~360 TB of storage capacity (dCache). We need to establish how to differentiate normal Grid users from CH Grid users (VOMS?) and do the user mapping.
      * Compute: resources are fully deployed (and used) as part of the cluster. Once the mapping is done, we will give Swiss users higher priority than standard WLCG grid users. VO shares stay the same. Over time we will evaluate how well this works for the users.
      * Storage: the intention is to deploy the storage (already configured) to dCache and use SPACE TOKENS to distribute the space across the different VOs. We suggest the following names and configuration: ATLASCH 150 TB (~41.6%), CMSCH 150 TB (~41.6%), LHCBCH 60 TB (~16.7%).
   * *Availability/Reliability*: what is the cause of CSCS being so low for ATLAS compared to other VOs?
      1 May 2014: ATLAS 66%/67%, CMS 95%/97%, LHCb 92%/94%
      1 June 2014: ATLAS 77%/77%, CMS 98%/98%, LHCb 99%/99%
   * *Next downtimes*: as mentioned, =cream04= will be in downtime once ServiceGPFS2 is ready, and we will use it to test production jobs on the new storage. Also, in September CSCS will most likely upgrade its Ethernet infrastructure, and a site-wide downtime will need to be scheduled. This will affect Phoenix, but hopefully by then we'll be ready with [[ServiceGPFS2][GPFS2]] so we can shift all CEs there and avoid another maintenance.
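The IB-cable performance check that CSCS mentions developing for Nagios is not public; purely as a sketch of the Nagios plugin pattern (exit 0 = OK, 2 = CRITICAL), something like the following could flag links reporting well below their nominal rate. The expected rate, the threshold and the input format are assumptions.

```python
import re

EXPECTED_GBPS = 40.0   # assumed nominal link rate (e.g. 4X QDR)
WARN_FRACTION = 0.5    # assumed threshold: flag links below half the nominal rate


def check_links(link_report, expected=EXPECTED_GBPS, warn=WARN_FRACTION):
    """Return (nagios_exit_code, message) for a per-link rate report.

    Scans each line of `link_report` (text in the style of tools such as
    iblinkinfo) for a '<number> Gbps' annotation and flags any link whose
    reported rate is below warn * expected.
    """
    degraded = []
    for line in link_report.splitlines():
        m = re.search(r'(\d+(?:\.\d+)?)\s*Gbps', line)
        if m and float(m.group(1)) < warn * expected:
            degraded.append(line.strip())
    if degraded:
        return 2, "CRITICAL: %d degraded IB link(s): %s" % (
            len(degraded), "; ".join(degraded))
    return 0, "OK: all IB links at >= %.0f Gbps" % (warn * expected)
```

A wrapper script would feed this the output of the fabric diagnostic tool and =sys.exit()= with the returned code so that Nagios picks up the state.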
   * *Purchases*: as usual, the purchasing period for this year's phase (Phase J) will start soon. The numbers for this phase will be presented during our next F2F meeting on August 19.
   * *dCache bugs*
      * Symlinks broken after the postgres upgrade; fixed in 2.6.30. https://github.com/DmitryLitvintsev/dcache/commit/1026c389dd02ce493da0236ffd0dd1e50f43deab
      * Disable publishing of the NFS domain; fixed in 2.6.31. We currently use a sed to hack around this. https://github.com/dCache/dcache/commit/ee43118

---+++ PSI

   * Fabio on leave until 6th July

---+++ UNIBE-LHEP

   * Operations
      * smooth routine operations
      * a-rex crashed twice, manual restart
      * fsck needed twice on NFS homes/LDAP (UIs). 8-year-old server, procuring a replacement now
   * ATLAS-specific operations
      * smooth routine operations
      * HammerCloud gangarobot: http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UNIBE-LHEP&startTime=2014-06-01&endTime=2014-06-30&templateType=isGolden
   * Accounting
      * Jura eventually fixed on both clusters and now running in production
      * There is good reason to believe that >90k job records were lost for May (during the transition):
         * APEL changed their broker network hosts (without explicit warning) and gave us some random host names to try, which eventually resulted in some (apparently) unrecoverable mess.
         * all records were sent to SGAS in parallel, but re-publishing of job summaries has not changed the numbers
         * obscure suggestions from JG at APEL were not usable, and the SGAS server will be turned off tomorrow
         * decided to archive the case and live with the loss

---+++ UNIBE-ID (written report only; absent due to a short-term maintenance downtime related to the cooling system)

   * Smooth and stable operations in June
   * New submit servers using the LSNAT feature of the core switch for load balancing on the way (currently in the provisioning/testing stage)
   * *GPFS*: updated from 3.4.0 to 3.5.0 in two steps:
      * First upgraded all nodes (clients & servers) to the 3.5.0 rpms
      * During the downtime, switched on the new features by changing the filesystem attributes (mmchconfig release=LATEST; mmchfs <fs> -V full)
      * Worked like a charm
   * *ARC-CE*: after some weeks of continuous operation, a-rex again unexpectedly quit its duty two days ago (manually restarted after ~30 min). No usable log entries about the cause.

---+++ UNIGE

   * Stable operations, nearly constant load of 400 grid jobs
   * FAX (Federated ATLAS data access using !XrootD)
      * redirection to CERN is now working
      * tested performance up to 60 MB/s reading data stored at CERN
      * documented for the users
   * Hardware RAID cards in certain IBM disk servers
      * concerns the IBM x3630 M3 (2011-12); the M4 is OK
      * out of 12 machines, four had overheating RAID cards
      * broken plastic rivets let the radiator detach from the chip (see photos)
      * opened and inspected all the machines, repaired two
   * Draining and rebooting of batch machines has been automated
   * Procurement of hardware for the 2014 upgrade is ongoing

---+++ NGI_CH

   * Nothing to report from the OMB

---++ Other topics

   * NGI_CH ARGUS: what is the plan for [[https://xgus.ggus.eu/ngi_ch/?mode=ticket_info&ticket_id=284][ticket 284]]? CSCS is open to hosting and running a national ARGUS instance, but needs all NGI_CH sysadmins to agree on co-administering it.
   * Next meeting date:

---++ A.O.B.
---++ Attendants

   * CSCS: George Brown, Miguel Gila
   * CMS:
   * ATLAS: Szymon Gadomski
   * LHCb:
   * EGI:

---++ Material

   * [[%ATTACHURL%/IBMhwRAID.pdf][IBMhwRAID.pdf]]: photos of the hardware RAID in the IBM x3630 M3 at UNIGE

---++ Action items

   * Item1