
Swiss Grid Operations Meeting on 2015-05-07

Site status

CSCS

  • Unscheduled downtime 15.04.15 and 16.04.15
    • In the morning of April 15, we found GPFS was blocked and was spitting out messages such as:
      [...]
      Tue Apr 14 20:22:52.964 2015: Recovered 1 nodes for file system phoenix_scratch.
      Tue Apr 14 20:29:13.455 2015: Accepted and connected to 148.187.65.62 wn62 <c0n14>
      *** glibc detected *** /usr/lpp/mmfs/bin//mmfsd: invalid fastbin entry (free): 0x00007fbf202829b0 ***
      ======= Backtrace: =========
      /lib64/libc.so.6(+0x76166)[0x7fbf4fc59166]
      /usr/lpp/mmfs/bin//mmfsd(_ZN10MsgDataBuf8freeDataEv+0x18)[0x90e5b8]
      /usr/lpp/mmfs/bin//mmfsd(_ZN10MsgDataBufD1Ev+0x9)[0x910469]
      /usr/lpp/mmfs/bin//mmfsd(_ZN7TcpConn9deleteMsgEP6RcvMsg+0x4c)[0x918cac]
      /usr/lpp/mmfs/bin//mmfsd(_ZN10NsdRequest14processRequestEP9NsdBufferP8NsdQueue+0x385)[0x10f0d65]
      /usr/lpp/mmfs/bin//mmfsd[0x10f17ba]
      /usr/lpp/mmfs/bin//mmfsd(_ZN6Thread8callBodyEPS_+0x66)[0x5a4676]
      /usr/lpp/mmfs/bin//mmfsd(_ZN6Thread15callBodyWrapperEPS_+0x79)[0x5963f9]
      /lib64/libpthread.so.0(+0x79d1)[0x7fbf5070c9d1]
      /lib64/libc.so.6(clone+0x6d)[0x7fbf4fccbb6d]
      ======= Memory map: ========
      00400000-0134c000 r-xp 00000000 fd:01 2757550                            /usr/lpp/mmfs/bin/mmfsd
      0144b000-01490000 rwxp 00f4b000 fd:01 2757550                            /usr/lpp/mmfs/bin/mmfsd
      01490000-014f7000 rwxp 00000000 00:00 0
      0237d000-025a2000 rwxp 00000000 00:00 0                                  [heap]
    • Both metadata servers were affected, with a difference of about 8 hours. This also caused one of the metadata servers to go out of sync (its SSD disks were expelled from GPFS).
    • Upon discovery, we announced the problem to the CHIPP community, declared an official downtime and contacted IBM about the issue. In parallel, we manually re-added the SSDs that hold metadata on the server out of sync and started the sync process.
    • After about 5 hours, IBM labs said that they could not reproduce the problem and suggested upgrading to a newer version.
    • Once the sync finished, following IBM's advice, we rebooted all GPFS servers and aligned the GPFS package version on the servers with the rest of CSCS (3.5.0-21).
    • Unfortunately, at this point we realised that the number of inodes was very close to the maximum. This is most likely because, at the time GPFS blocked, the cluster was full and some ~80 million files belonging to the failed jobs were stuck in the filesystem (if the filesystem is not accessible, jobs cannot delete their output). The cleanup process took many hours (until approx. 23:30).
    • When the cleanup finished, we regenerated the scratch structure and slowly brought the system back to life. The structure we recreated was incomplete and it took us a while to make sure all permissions were correct.
    • Actions taken as a result of this downtime:
      1. The procedure to recreate the filesystem has been improved.
      2. We've further increased the frequency of the GPFS cleaning policy to run twice a day, removing everything older than the length of the longest job (an illustrative policy sketch follows this list).
      3. Currently evaluating other possible configurations, such as dividing scratch into filesets (the GPFS term for a directory tree with its own quota). This increases complexity, but might help to mitigate problems in the event of issues with the filesystem (we could recreate filesets in a rolling fashion: first VO1, then VO2 and ultimately VO3).
      4. In addition to upgrading GPFS, we tuned its configuration to avoid the massive swapping we found on the nodes that 'only' have 64GB RAM. GPFS will now never take more than 2GB of physical memory on any given WN.
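    • Illustrative sketch of items 2 and 4 above (the day threshold, policy text and node-class name are assumptions, not the actual CSCS values; only the 2GB memory cap comes from the minutes):
      # /tmp/scratch_purge.pol -- delete anything not accessed for longer than the longest job (5 days here is a placeholder)
      RULE 'purge_old_scratch' DELETE
        WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 5
      # run twice a day from cron: dry-run with -I test first, then apply
      mmapplypolicy phoenix_scratch -P /tmp/scratch_purge.pol -I test
      mmapplypolicy phoenix_scratch -P /tmp/scratch_purge.pol -I yes
      # cap GPFS memory on the WNs (pagepool is the usual knob; the 'worker_nodes' node class is made up)
      mmchconfig pagepool=2G -i -N worker_nodes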
  • ARC update
    • arc01 re-installed with nordugrid-arc 5.0.0-2
    • currently configuring and testing
  • Other operations
    • Currently investigating the feasibility of implementing cgroups in order to contain jobs. This would require a major SLURM upgrade. The following presentation given at HEPiX on this matter is very interesting. (A minimal configuration sketch follows at the end of this list.)
    • Currently testing dCache upgrade as dCache 2.6 will be out of support soon.
    • Work to port Phoenix to Puppet is still ongoing.
    • Installed pakiti on all WNs
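    • For reference, a minimal sketch of the kind of SLURM cgroup configuration being evaluated (illustrative values, not a tested Phoenix configuration):
      # slurm.conf -- track and contain jobs with cgroups
      ProctrackType=proctrack/cgroup
      TaskPlugin=task/cgroup
      # cgroup.conf -- confine each job to its allocated cores and memory
      CgroupAutomount=yes
      ConstrainCores=yes
      ConstrainRAMSpace=yes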
  • Next maintenance on May 20th, 2015 between 8:00 and 20:00
    • Due to work on the cooling systems and a critical maintenance on the NAS infrastructure, Phoenix is forced to go into maintenance.
    • During the downtime, IBM will upgrade firmware on our DCS3700 storage controllers.
    • At this point no other operations are planned.

PSI

  • Studying the Control Groups in Son of Grid Engine
    • The 5' article about Control Groups is to be read; the Full Control Groups RHEL6 Reference can be skipped the first time.
    • One of the recurring issues with the old SGE 6.2u5 running at the PSI T3 is that users are free to consume more CPU cores than those assigned by the batch system. To definitively fix this issue I'm going to upgrade our SGE from 6.2u5 to Son of Grid Engine 8.1.8 because of its support for Control Groups; some details about this integration are here. There are already ATLAS sites using Control Groups in HTCondor; SLURM also supports Control Groups.
    • Just for Gianfranco: you need to apply these configurations to make Control Groups / cpusets work in Son of Grid Engine (it cost me 1 day of attempts!); a quick verification check is sketched after the list:
      1. [wn] cat /etc/sysconfig/sgeexecd
        export SGE_CGROUP_DIR=/dev/cpuset/sge
      2. [wn] grep -Hn setup-cgroups-etc /etc/init.d/sgeexecd.p6444
        /etc/init.d/sgeexecd.p6444:441: /opt/sge/util/resources/scripts/setup-cgroups-etc start
      3. [wn] qconf -sconf | grep -Hn CGR
        (standard input):28:execd_params USE_SMAPS=true KEEP_ACTIVE=true USE_CGROUPS=true ENABLE_BINDING=true \
      4. [submission_host] grep -v \# /opt/sge/default/common/sge_request | strings
        -binding set linear
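    • Quick verification sketch (the per-job directory name under SGE_CGROUP_DIR is an assumption; adjust to whatever SoGE actually creates on the WN):
      [wn] ls /dev/cpuset/sge/
      [wn] cat /dev/cpuset/sge/<job_dir>/cpuset.cpus        # cores granted to the job
      [wn] grep Cpus_allowed_list /proc/<job_pid>/status    # the affinity the kernel actually enforces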
  • Upgraded the PSI PhEDEx from SL5 to SL6
    • Lots of issues here due to the poor QA of the latest SRM client, 2.10.7 (rpm)
    • The implicit X509 proxy delegation required to copy files between 2 remote SRM endpoints (e.g. from CSCS to PSI) doesn't work if one uses the -copyjobfile option, as PhEDEx does; the dCache team acknowledged this bug (an illustrative invocation is shown at the end of this list)
    • Also this is a bug :
      $ srmls -debug=false -x509_user_proxy=/home/phedex/gridcert/proxy.cert -retry_num=0 'srm://t3se01.psi.ch:8443/srm/managerv2?SFN=/pnfs/psi.ch/cms/trivcat/store/mc/RunIIWinter15GS/RSGravToWWToLNQQ_kMpl01_M-4000_TuneCUETP8M1_13TeV-pythia8/GEN-SIM/MCRUN2_71_V1-v1/10000/2898A22B-62B0-E411-B1D4-002590D600EE.root'
      srm client error: java.lang.IllegalArgumentException: Multiple entries with same key: x509_user_proxy=/home/phedex/gridcert/proxy.cert and x509_user_proxy=/tmp/x509up_u205
    • Same here :
      $ srm-advisory-delete -x509_user_proxy=${X509_USER_PROXY} -retry_num=0
      srm client error: java.lang.IllegalArgumentException: Multiple entries with same key: x509_user_proxy=-retry_num=0 and x509_user_proxy=/tmp/x509up_u205
    • Eventually I tweaked the PhEDEx scripts to bypass all these bugs. Just for Daniel: my FileDownloadDelete and FileDownloadSRMVerify corrections
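    • For illustration, the kind of third-party copy PhEDEx drives through -copyjobfile (hostnames and SURLs below are placeholders, not real paths); it is this form that loses the implicit proxy delegation:
      $ cat copyjob.txt
      srm://<cscs_se>:8443/srm/managerv2?SFN=/pnfs/<source_path> srm://t3se01.psi.ch:8443/srm/managerv2?SFN=/pnfs/<destination_path>
      $ srmcp -debug=false -retry_num=0 -copyjobfile=copyjob.txt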
  • Power cut at PSI
    • My 1st power cut since 2011; 4 file servers rebooted, T3 users in panic mode; luckily fixed without data loss
  • CMS Space Monitoring Project for CSCS ; Daniel is going to follow this task

UNIBE-LHEP

  • Operations
    • Mainly stable operation on both clusters (yet at about half capacity)
    • Local network sick on ce01 (22nd April): power-cycled
    • 3 a-rex crashes caught by the cron
  • ATLAS specific operations
    • As last month, mostly multi-core MC production+Reconstruction, mostly smooth. Until:
    • Following the ARC upgrade to 5.0.0-2.el6: /var filled up (grid-manager.log) and caused the system to jam (ce01 only). Laborious clean-up needed
    • Following clean-up: a-rex crashes at start. "Insider" tip: rm gm.fifo in controldir fixed it
    • ARC 5 introduces handling of user-requested job priority. The ARC range [0:100] is mapped in ARC 5 to [-1023:1024], but in Grid Engine the range allowed for user control is [-1023:0]. Needed to hack submit-sge-script (a sketch of the clamping follows this list)
    • Yesterday, ce01 dropped out of the GIIS (all services reported themselves as running): we needed to restart the infosys and a-rex. Update: it has just happened again
    • Removal of the voms.cern.ch alias to voms2.cern.ch caused file transfers to the SRM to fail. Apparently an obscure misconfiguration detail caused the authentication problem [ https://ggus.eu/index.php?mode=ticket_info&ticket_id=113485 ] (ops tests still failing on the SE, not clear why)
    • HammerCloud gangarobot: http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UNIBE-LHEP&startTime=2015-04-02&endTime=2015-05-07&templateType=isGolden
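    • Hedged sketch of the kind of clamping added to submit-sge-script (variable names are illustrative, the actual ARC code may differ); ARC maps its priority [0:100] onto [-1023:1024], but ordinary Grid Engine users may only lower priority, so anything above 0 must be capped:
      if [ -n "$joboption_priority" ]; then
        priority=$(( joboption_priority * 2047 / 100 - 1023 ))   # [0:100] -> [-1023:1024]
        [ "$priority" -gt 0 ] && priority=0                      # user-settable SGE range is [-1023:0]
        echo "#$ -p $priority" >> "$LRMS_JOB_SCRIPT"
      fi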
  • Plans
    • Restore full capacity:
      • 320 (old) cores from UNIBE-ID
      • 144 (old) cores from CSCS
      • 128 (new) cores awaiting installation (E5-2650 v2 2.6GHz)
      • 512 (new) cores to be procured
    • The plan is to re-deploy ce01 with ROCKS 6.2 (coming out any day now). At the moment, with the current deployment of ROCKS 6.1, installation of new hardware works but the kernel freezes at reboot
    • Testing install procedure with ce04 and ROCKS 6.1.1 for now (temporary deployment)
    • Deploy 2 additional UIs (to be procured)

UNIBE-ID

  • Procurement:
    • Tender regarding new storage ended at the beginning of this week => order placed for:
      • IBM ESS GL4 with 4TB NL-SAS disks:
        • Total capacity: 928TB
        • Interconnect: IB
        • Filesystem: GPFS 4.1 STD + GPFS Native RAID
  • Operations:
    • Stable operations most of the time
    • Small glitch after migrating to new OpenLDAP installations
      • pam_ldap doesn't close connections properly => max open files violations on the LDAP server; olcIdleTimeout has now been set to 60, which solved the problem (a minimal LDIF sketch is shown below)
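      • A minimal LDIF sketch of the fix (applied to cn=config, e.g. with ldapmodify -Y EXTERNAL -H ldapi:///; the exact access method on our servers may differ):
        dn: cn=config
        changetype: modify
        replace: olcIdleTimeout
        olcIdleTimeout: 60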
    • Upgraded Nordugrid ARC CE to 5.0.0
    • Still working on moving from the Satellite-based RHEL setup to a Foreman-based CentOS setup

UNIGE

  • The next upgrade
    • Our upgrade plans for 2015 were approved as proposed for the 50% co-funding scheme of the Uni.
    • One order is out for the upgrade
      • 3 x Lenovo x3630 M4 with 6 TB disks (63 TB net in a 2U machine; 1 for neutrino, 2 for ATLAS)
      • 2 x Lenovo x3550 M4 as hosts for running virtual machines
    • We also foresee upgrading the network to 10 Gbps for the 7 disk servers doing NFS
      • no quotes yet, only price estimates
  • Cleanup of the SE UNIGE-DPNC_LOCALGROUPDISK, 90 TB free (21%)
    • With Run 2 starting, we will likely need another round quickly
  • This is my last meeting. Good bye to all of you!

NGI_CH

Other topics

  • Next meeting date: The next meeting should be on June 4, but Gianfranco and Miguel will be away (NorduGrid conference, see below). Other suggestions could be:
    • Wed, Jun 10 at 14:00
    • Thu, Jun 11 at 14:00
    • Thu, Jun 17 at 14:00
  • Training for sysadmins: The Doodle poll is set up and we are waiting to collect input before proposing final dates. Currently evaluating the possibility of doing it online via Vidyo or Scopia.

A.O.B.

  • Gianni will attend the following pre-GDB to be held at CERN on 12.05.15.
  • Miguel will attend the upcoming Annual NorduGrid Conference to be held at UNIBE between 04.06.15 and 05.06.15
  • Gianfranco + Sigve will attend the upcoming EGI Conference 2015 to be held in Lisbon between 18.05.15 and 22.05.15
  • Gianfranco will attend the upcoming Annual NorduGrid Conference to be held at UNIBE between 04.06.15 and 05.06.15

Attendants

  • CSCS: Dino Conciatore, Miguel Gila, Dario Petrusic. Apologies: Gianni Ricciardi, Nick Cardo.
  • CMS:
  • ATLAS: Gianfranco Sciacca, Szymon Gadomski
  • LHCb: Roland Bernet
  • EGI: Gianfranco Sciacca

Action items

  • Item1