<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->
---+ Swiss Grid Operations Meeting on 2013-10-31

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9227296)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=Nrq24qRR4V1u
   * *Phone gate*: from Switzerland: +41227671400 (portal) + 9227296 (extension) + # (pound sign); for more details see [[http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone][the CERN info page]]
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

---++ Agenda

Status

   * *CSCS* (reports Miguel):
      * Scratch filesystem: after this week's events, the system seems to be stable.
         1 We have cleaned more than 50 million inodes and deployed new GPFS policies to prevent this from happening again: <verbatim>
Filesystem   Inodes     IUsed      IFree     IUse%  Mounted on
/dev/gpfs    150626304  51891413   98734891  35%    /gpfs

virident1    293937152  1005  Yes  No   25172992 ( 9%)  40936480 (14%)
virident2    293937152  1006  Yes  No   25139200 ( 9%)  40991648 (14%)
ssd1         390711360  1007  Yes  No  335058944 (86%)   6415328 ( 2%)
ssd2         390711360  1008  Yes  No  335084544 (86%)   6396320 ( 2%)
</verbatim>
         1 New =cp_1.sh= (prolog) and =cp_3.sh= (epilog) scripts have been deployed to make jobs run under =/tmpdir_slurm/$CREAMCENAME/$JOBID= instead of under =$HOME=: <verbatim>
#!/bin/bash
# cp_1.sh: this is _sourced_ by the CREAM job wrapper
export TMPDIR="/gpfs/tmpdir_slurm/${SLURM_SUBMIT_HOST}/${SLURM_JOB_ID}"
export MYJOBDIR=${TMPDIR}
mkdir -p ${TMPDIR}
cd ${TMPDIR}

#!/bin/bash
# cp_3.sh: this is _sourced_ by the CREAM job wrapper
# This gets executed at the end of the job
rmdir ${MYJOBDIR}
</verbatim>
         1 The scratch filesystem runs on very old hardware (~4 years) and needs to be decommissioned ASAP. We are working on finding a good solution in terms of performance/price.
      * SLURM migration status:
         1 The EMI-3 CREAM and ARC CEs work fine and have been accepting/running jobs for a while. No major issues found (except for some mismatches with the information system).
         1 The UMD-2 CREAM and ARC CEs are in downtime and will be migrated to EMI-3 next week.
         1 WNs: 9 nodes remain on Torque/PBS (UMD-2). By the end of next week all will be migrated to SLURM (EMI-3).
         1 Accounting (APEL): the old UMD-2 APEL has been shut down and the new one is fetching data from the EMI-3 CREAM CEs. We are still waiting for the APEL team (John Gordon) to give us the green light to publish to the new APEL system. SLURM accounting for October may be lost because the official APEL migration process is unclear (not working!). ARC will publish directly, without passing through the APEL server at CSCS.
         1 BDII: still running UMD-2; needs to be upgraded to EMI-3 to solve issues with GLUE2 publication. Will be done ASAP.
      * CVMFS upgraded to 2.1.15.
      * dCache migration:
         * Testing 2.6 in pre-production. So far it seems to be a fairly simple migration with minimal changes.
         * We will also use this opportunity to upgrade to PostgreSQL 9.3.
      * Working on a new monitoring system: http://ganglia.lcg.cscs.ch/ganglia3/ (views for GPFS/dCache I/O stats).
   * *PSI* (reports Fabio):
      * Collected HW offers to request funds for the next 2 years.
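The GPFS cleanup policies mentioned in the CSCS report above are site-specific and not quoted in the minutes; as an illustration only, an ILM purge rule in GPFS policy SQL could look like the following sketch (the path pattern and the 30-day threshold are assumptions, not the actual CSCS policy):

```sql
/* Hypothetical purge rule: delete scratch files not accessed for 30 days.
   The fileset/path and threshold are illustrative assumptions. */
RULE 'purge_old_scratch' DELETE
  WHERE PATH_NAME LIKE '/gpfs/tmpdir_slurm/%'
    AND (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30
```

A rule file of this kind would typically be applied (or first tested with =-I test=) via =mmapplypolicy /gpfs -P purge.pol=.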
      * Basically we will need:
         * more WNs
         * a new NetApp
         * a couple of [[http://www.oracle.com/us/products/servers-storage/servers/x86/x4-2l/overview/index.html][2u Oracle NAS]] boxes to relocate the current replicated and shared =/shome= onto new HW (today we use 2 SUN Thumpers)
         * Alternatively, we might build =/shome= as a bunch of 10 x 2 TB disks hosted inside the future new NetApp, [[http://docs.oracle.com/cd/E19965-01/E22493/z4000c271005163.html#scrolltoc][SAS]]-connected and [[http://docs.oracle.com/cd/E19253-01/820-1931/agkar/index.html][MPxIO-managed]], attached to an active/passive pair of [[http://www.oracle.com/us/products/servers-storage/servers/x86/x4-2/overview/index.html][1u Oracle]] servers: the active node formats the 10 x 2 TB disks as a ZFS pool and serves the resulting filesystem over NFSv3; if the active node crashes, we simply import the ZFS pool on the passive node and switch NFSv3 on there. This configuration is cheaper than a pair of replicated [[http://www.oracle.com/us/products/servers-storage/servers/x86/x4-2l/overview/index.html][2u NAS]] boxes (fewer disks, cheaper nodes).
         * maybe a [[http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps10110/data_sheet_c78-507093.html][Cisco N7K extender]] with 32 10 Gbit/s ports; not urgent, but at some point 10 Gbit/s Ethernet will become the default.
      * Updated the CMS Frontier Squid to the [[http://frontier.cern.ch/dist/rpms/RPMS/x86_64/frontier-squid-2.7.STABLE9-16.1.x86_64.rpm][latest version]].
      * Scheduled a downtime on 8th Nov to upgrade to dCache 2.6 and to move our [[http://repository.egi.eu/sw/production/umd/3/sl5/x86_64/updates/emi-ui-3.0.2-1.el5.x86_64.rpm][SL5 UIs]] to UMD3; not sure about the SL5 WNs because of the [[http://repository.egi.eu/mirrors/EMI/tarball/production/sl5/emi3-emi-wn/][lack of the new tarball]].
      * dCache 2.6 upgrade:
         * This may also be useful for CSCS if confirmed: with dCache 2.6 you can avoid having a separate gPlazma1 configuration for the Xrootd door and simply use the common gPlazma2 cell.
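The active/passive ZFS takeover described above boils down to a few commands on the passive node. The following is only a sketch: the pool name =shome= and the service names are assumptions, not the real PSI configuration, and a dry-run guard makes the script echo the commands instead of executing them.

```shell
#!/bin/bash
# Hypothetical sketch of the manual /shome takeover on the passive node.
# "shome" as the ZFS pool name is an assumption, not the real PSI name.
RUN=echo   # set RUN="" on the real passive node to actually run the commands

$RUN zpool import -f shome       # import the SAS-shared pool (-f: active node is dead)
$RUN zfs set sharenfs=on shome   # re-enable the NFS export of the pool
$RUN service nfs restart         # make the NFSv3 service pick up the export
```

With MPxIO both nodes see the same SAS disks, so the takeover is a plain =zpool import= rather than any data copy; the cost is that it is a manual (or scripted) failover, not a transparent one.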
           I wrote up my configuration [[https://twiki.cern.ch/twiki/bin/view/Main/DcacheXrootd#Xrootd_gPlazma2_and_dcache_2_6_1][here]] and asked for confirmation.
         * Since spring we have been trying to partition the T3 users into primary groups representing the disjoint T3 subgroups: this allows us to easily compute the =/pnfs= group space usage by running a simple query against the Chimera DB; such a partitioning can also feed Ganglia plots and the like. I hope to introduce this change on 8th Nov. In the future CSCS could consider adopting the same partitioning, being aware that it increases both complexity and security requirements.
         * The Xrootd dCache pool plugin was incompatible with dCache 2.6.11; I asked for a [[http://linuxsoft.cern.ch/wlcg/sl6/x86_64/dcache26-plugin-xrootd-monitor-5.0.0-2.noarch.rpm][compatible one]] to be produced.
   * *UNIBE* (reports Gianfranco):
      * ce01.lhep cluster:
         * Still stable after moving Lustre to the Infiniband layer. Still one occurrence of the Thumper NIC/PCI lockup a couple of weeks back. Lustre is now under heavier load and working well so far. Will try to commission it for analysis soon.
         * The latest version of CVMFS is said to cure the cache-full issue we suffer from. Will upgrade soon.
         * The new NFSv4 user-mapping defaults broke file ownership on an NFS share on the ce01.lhep cluster. This prevented the SW validation jobs from writing the SW tags to the shared area. The fix consists in explicitly declaring the =Domain= in =/etc/idmapd.conf=.
      * c202.lhep cluster:
         * All ROCKS images (MDS, OSS, WN) are ready; mass installation is under way.
         * ARC not yet installed.
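The "simple query vs Chimera DB" mentioned above is not quoted in the minutes; once users sit in disjoint primary groups, per-group usage can be obtained along these lines (a sketch against Chimera's =t_inodes= table; the column names and the =itype= encoding for regular files are assumptions that should be checked against the actual Chimera schema version):

```sql
-- Hypothetical sketch: total /pnfs space per primary group (GID).
-- t_inodes columns and itype value are assumptions about the Chimera schema.
SELECT igid       AS gid,
       count(*)   AS files,
       sum(isize) AS bytes_used
FROM   t_inodes
WHERE  itype = 32768        -- regular files only (assumed encoding)
GROUP  BY igid
ORDER  BY bytes_used DESC;
```

The per-GID totals map directly to the disjoint T3 subgroups, which is what makes them easy to push into Ganglia plots.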
         * We hope to run test jobs tomorrow.
   * *UNIGE* (reports Szymon):
      * Xxx
   * *UZH* (reports Sergio):
      * Xxx
   * *Switch* (reports Alessandro):
      * Xxx

Other topics
   * Topic1
   * Topic2

Next meeting date:

AOB

---++ Attendants
   * CSCS: Miguel Gila, George Brown
   * CMS: Fabio, Daniel
   * ATLAS:
   * LHCb: Roland Bernet
   * EGI:

---++ Action items
   * Item1
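For reference, the NFSv4 fix mentioned in the UNIBE report amounts to declaring the same domain on server and clients in =/etc/idmapd.conf=, otherwise unknown owners get mapped to =nobody=. The domain below is a placeholder, not necessarily the real UNIBE value:

```ini
# /etc/idmapd.conf -- [General] Domain must match on the NFSv4 server
# and all clients; "lhep.unibe.ch" is a placeholder domain.
[General]
Domain = lhep.unibe.ch

[Mapping]
Nobody-User = nobody
Nobody-Group = nobody
```

On SL5/SL6 this is followed by a restart of the =rpcidmapd= service so the new mapping takes effect.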
Topic revision: r15 - 2013-10-31 - FabioMartinelli