<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->

---+ Swiss WLCG Operations Meeting on 2012-09-06

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: EVO, password: chipp
   * *External link / EVO*: http://evo.caltech.edu/evoNext/koala.jnlp?meeting=vsvivIese2IsIvaiaMItas

---++ Agenda

---+++ CSCS Status (Reports Miguel)

   1. Storage Element:
      * dCache extension: the storage extension has been chosen and is being purchased; it should arrive in November (AFAIK, Miguel).
      * dCache upgrade to 1.9.12: the process has started on preproduction, but since it is a complicated matter, we are being extra careful to ensure no data is lost in the process.
   1. WN:
      * WN hardware: 12 extra Sandy Bridge nodes (384 job slots) are physically installed, but we have had no time to configure them yet. Will do ASAP.
      * WN software: as of today, wn[01-46] are gLite 3.2 WNs and wn[47-59] are UMD 1 WNs. During the next maintenance we plan to upgrade all nodes to UMD 1.
   1. Network:
      * Ethernet network replacement: the Cisco switches have arrived, and the network administrator at CSCS is preparing the infrastructure and configuration required for them.
   1. Problems:
      * Yesterday's problem with Argus: an error on Argus caused all CREAM-CEs to stop accepting jobs. A mail has been sent to argus-support; we are waiting for a reply.
      * We have a problem with our KVM management solution: Convirture is unable to work due to database corruption, so we have to shut down all KVM VMs during one maintenance window in order to re-add them (see the libvirt sketch after this section). We are thinking about a permanent solution, either commercial or open source, but it must be rock-solid.
      * Sun hardware is failing at an alarming rate. This week the old MDT connected to =puppet= lost 3 disks of a RAID-6. We were able to recover with some filesystem corruption, but if this hardware is failing, other hardware from the same batch might start failing too (critical ones: the dCache head nodes). It is unclear yet whether this has affected our ability to install machines (kickstart files).
      * We have detected a problem with the NGI-DE/CH TopBDII: at times it is very slow answering queries and, therefore, the status of CSCS appears degraded in the NGI checks (see the latency probe sketch after this section). We have seen DESY using their own internal TopBDII, so we are thinking about doing the same internally for CSCS, *NOT* for the whole NGI_CH cloud. At the moment the BDII at CERN is being used as the primary BDII, but this is a temporary solution: =lcg-bdii.cern.ch:2170,bdii-fzk.gridka.de:2170=
   1. AOB:
      * Atlasvobox: we have seen that it is possible to use the Squid provided by Scientific Linux (currently used for CVMFS) to also host the atlasvobox. The process seems simple, but some work needs to be done and a lot of testing is important. We are working on it.
      * Fabio requested access to our preproduction cluster to test some changes on the CREAM-CE. Please submit a ticket, so we can get to work on it ASAP.
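Regarding the Convirture recovery above: before the shutdown, the guest definitions can be saved so the VMs can be re-registered afterwards. Below is a minimal sketch using the libvirt Python bindings; the connection URI =qemu:///system= and the dump directory are our assumptions, not something agreed at the meeting.

<verbatim>
# Minimal sketch (assumption: libvirt Python bindings installed, hypervisor
# reachable as qemu:///system; adjust the URI for the actual setup).
# Dumps every domain's XML definition to disk so the guests can later be
# re-registered with `virsh define <file>` once the database is rebuilt.
import os
import libvirt

DUMP_DIR = "/var/tmp/vm-definitions"   # hypothetical backup location

def dump_vm_definitions(uri="qemu:///system", dump_dir=DUMP_DIR):
    if not os.path.isdir(dump_dir):
        os.makedirs(dump_dir)
    conn = libvirt.open(uri)
    try:
        names = conn.listDefinedDomains()            # inactive (defined) guests
        names += [conn.lookupByID(i).name()          # running guests
                  for i in conn.listDomainsID()]
        for name in names:
            dom = conn.lookupByName(name)
            path = os.path.join(dump_dir, name + ".xml")
            with open(path, "w") as f:
                f.write(dom.XMLDesc(0))              # full domain definition as XML
            print("saved %s" % path)
    finally:
        conn.close()

if __name__ == "__main__":
    dump_vm_definitions()
</verbatim>

After the maintenance, each saved file can be re-registered with =virsh define <file>=.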
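To put numbers on the TopBDII slowness, one can simply time connections to each candidate endpoint. The sketch below uses only the Python standard library and measures the TCP connect time on port 2170; a full check would additionally run an LDAP query against base "o=grid", e.g. with =ldapsearch=. The endpoint list reproduces the failover string above; the timeout and output format are our choices.

<verbatim>
# Minimal sketch: time a TCP connect to each BDII endpoint on port 2170.
# The connect time alone is already enough to flag the "very slow to
# answer" behaviour seen in the NGI checks.
import socket
import time

# The failover list quoted in the minutes; order = preference.
BDII_ENDPOINTS = ["lcg-bdii.cern.ch:2170", "bdii-fzk.gridka.de:2170"]

def probe(endpoint, timeout=10.0):
    """Return the connect time in seconds, or None if unreachable."""
    host, port = endpoint.rsplit(":", 1)
    start = time.time()
    try:
        sock = socket.create_connection((host, int(port)), timeout)
        sock.close()
        return time.time() - start
    except (socket.error, socket.timeout):
        return None

if __name__ == "__main__":
    for ep in BDII_ENDPOINTS:
        elapsed = probe(ep)
        status = "%.3fs" % elapsed if elapsed is not None else "UNREACHABLE"
        print("%-30s %s" % (ep, status))
</verbatim>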
---+++ PSI Status (Reports Fabio)

   * Designing a *fast, HA, SAN 10 TB /home* based on *GPFS* with:
      * two servers, e.g. 2U HP Proliant + [[http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.gpfs.v3r50-3.gpfs300.doc%2Fbl1ins_nodqtieb.htm][GPFS 3.5 - Node quorum with tiebreaker disks]]
      * well-tested dual-port Qlogic FC 8 Gbit/s cards
      * a 2U 24-bay IBM [[http://www-03.ibm.com/systems/storage/disk/ds3500/specifications.html][DS3524]] *or* a 2U 24-bay [[http://www.sgi.com/products/storage/raid/5000.html][SGI IS5000]]
      * 6 Gbps SAS 2.5" 900 GB 10k disks, but I would like to put the GPFS metadata on SSDs or 15k disks in RAID 1 (opinions?)
      * total cost with 10k disks: ~50k CHF
   * *BTW*: it is still missing features like snapshots, but *GlusterFS*, now called [[https://access.redhat.com/knowledge/docs/Red_Hat_Storage/][Red Hat Storage 2.0]], can also implement a *cheap HA /home with 2 NAS* boxes.
   * Because of several WN kernel panics, we introduced SGE queue memory limits: the default is 3 GB per job, and users can request up to 6 GB.
   * We introduced a recent/all hierarchical [[https://wiki.chipp.ch/twiki/bin/view/CmsTier3/CMSTier3Log27][SGE accounting file]] to speed up =qacct= response times (a minimal sketch of the idea is at the bottom of this page).
   * Testing dCache 1.9.12 inside our VMware testbed.

---+++ UNIBE Status (Reports Gianfranco)

   * Xxx

---+++ UNIGE Status (Reports Szymon)

   * Xxx

---++ Other topics

   * Topic1
   * Topic2

---++ Next meeting date

---++ AOB

---++ Attendants

   * CSCS: Miguel
   * CMS: Fabio, Daniel
   * ATLAS:
   * LHCb:
   * EGI:

---++ Action items

   * Item1
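---++ Appendix: sketch of the recent/all SGE accounting split

As mentioned in the PSI report, splitting the GridEngine accounting file into a small "recent" file plus the full history speeds up =qacct= considerably, because =qacct -f= can be pointed at the small file. The sketch below only illustrates the idea: the accounting path, the end_time field position (11th colon-separated field in the standard format) and the 30-day cutoff are our assumptions; the actual procedure is documented in the linked CMSTier3Log27 page.

<verbatim>
# Minimal sketch of a recent/all split of a GridEngine accounting file.
# Assumptions: standard colon-separated records, end_time (epoch seconds)
# as the 11th field, 30-day window. Usage afterwards:
#   qacct -f /gridware/sge/default/common/accounting.recent ...
import time

ACCOUNTING = "/gridware/sge/default/common/accounting"  # hypothetical path
RECENT = ACCOUNTING + ".recent"
CUTOFF = time.time() - 30 * 86400   # keep the last 30 days

def split_recent(src=ACCOUNTING, dst=RECENT, cutoff=CUTOFF):
    kept = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith("#"):          # header/comment lines
                continue
            fields = line.split(":")
            try:
                end_time = int(fields[10])    # end_time column
            except (IndexError, ValueError):
                continue
            if end_time == 0 or end_time >= cutoff:
                fout.write(line)              # still running or recent
                kept += 1
    print("kept %d recent records in %s" % (kept, dst))

if __name__ == "__main__":
    split_recent()
</verbatim>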