
Swiss Grid Operations Meeting on 2015-03-05

Site status

CSCS

  • Three new members of CSCS joined Phoenix: two new system engineers (Dario Petrusic and Dino Conciatore) and systems group lead Nicholas Cardo. They will actively participate in all our meetings and operations starting now.
    • Already have access to the systems, including TWiki and chat.
    • Working to set certificate roles in /dteam and /dteam/NGI_CH; then they will be added to GOC and Nagios@CSCS
  • Maintenance 10.03.14:
    • dCache upgrade: security updates and dCache to 2.6.46
    • GPFS config update: new maxFilesToCache setting in place, raising the limit from 40k to 50k files per node.
    • Reinstallation of as many WNs as possible with latest EMI-WN packages (plus security updates)
    • Shutdown and physical removal of old hardware (puppet, nfs0[1-2], se[01-06], 3x 1/2 racks of IBM DC3500 storage)
  • A.O.B.
    • Migration to Puppet 3.6 ongoing; new roles have been created but more work is needed. Managed to migrate cfengine from ageing hardware (>1000 days of uptime!) to a new VMware VM.
    • At some point before summer, we will need to upgrade GPFS (v. 4) and dCache (v. 2.10).
    • (Gianfranco) ATLAS lcgadmin and pilot roles to be enabled/fixed on ARC CEs
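
The maxFilesToCache change listed under the maintenance items could be applied roughly as sketched here. `mmchconfig` and `mmlsconfig` are the standard GPFS administration commands, but the function name, the `DRY_RUN` switch, the value 50000 (for "50k") and the node class `wn_nodes` are assumptions made for illustration, not CSCS's actual procedure. The new value may only take effect after the GPFS daemon is restarted on the affected nodes.

```shell
# Set DRY_RUN=echo to print the commands instead of running them (assumed helper switch).
DRY_RUN="${DRY_RUN:-}"

set_max_files_to_cache() {
    # $1 = new maxFilesToCache value
    # $2 = node list or node class to apply it to (hypothetical name in the usage below)
    $DRY_RUN mmchconfig maxFilesToCache="$1" -N "$2" || return 1
    # Show the configured value afterwards so the change can be checked.
    $DRY_RUN mmlsconfig maxFilesToCache
}
```

For example, `DRY_RUN=echo; set_max_files_to_cache 50000 wn_nodes` only prints the two commands, which is a cheap way to review them before the maintenance window.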

PSI

  • A major NetApp E5400 error: a drawer in the tray became degraded, which led to the loss of one of the two redundant paths to 12*3TB disks. It was a FW bug, solved by updating the NetApp E5400 FW ONLINE to 7.86.49.00; during the controllers' reboot the Linux-side RDAC driver gracefully moved the paths from one RAID controller to the other, and back. I didn't unmount the XFS filesystems or stop dCache. You get nothing like this from the NAS world. CSCS should update the FW as well.
  • Again on the same NetApp E5400, two 3TB disks broke
  • In both cases I received the native NetApp e-mails, routed through iptables NAT, as well as a Nagios e-mail
  • Preparing the dCache 2.6 to 2.10 migration; in my case this will also mean upgrading PostgreSQL from 9.3 to 9.4, also because of this news; luckily my Chimera Materialized Views still work out of the box, but there are some new table fields that I should include in the future
  • Using Puppet standalone over /afs because it's 10 times faster than having a Puppet master, and the clients don't crash; each SL6 server in my cluster mounts /afs, and there is a /afs dir where both my Puppet recipes and the conf files are stored; this /afs dir and its descendants are protected by AFS ACLs; only the root account on the SL6 server can access my /afs dir, by using a Kerberos keytab file. Example:
    # ll /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/ 
    ls: cannot access /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/: Permission denied 
    
    # kinit -k -t /root/afs-keytabs/svcusr-t3_puppet.keytab svcusr-t3_puppet@D.PSI.CH && aklog  
    
    # ll /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/ 
    total 4 
    lrwxr-xr-x 1 martinelli_f cms 22 Jan 15 16:14 manifests -> puppet/TRUNK/manifests 
    lrwxr-xr-x 1 martinelli_f cms 20 Jan 15 16:14 modules -> puppet/TRUNK/modules 
    drwxr-xr-x 4 martinelli_f cms 2048 Jan 15 15:48 puppet 
  • Many other tasks, but specific to PSI or CMS
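
The Puppet-over-AFS setup described above could be wrapped for unattended runs roughly as follows. The keytab, principal and environment directory are taken from the example in these notes; the function name, the `DRY_RUN` switch and the `site.pp` entry-point manifest are assumptions for illustration, so this is a sketch rather than the script actually used at PSI.

```shell
# Values from the example in these notes; each can be overridden from the environment.
KEYTAB="${KEYTAB:-/root/afs-keytabs/svcusr-t3_puppet.keytab}"
PRINCIPAL="${PRINCIPAL:-svcusr-t3_puppet@D.PSI.CH}"
ENV_DIR="${ENV_DIR:-/afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment}"
MANIFEST="${MANIFEST:-site.pp}"   # assumption: name of the entry-point manifest

# Set DRY_RUN=echo to print the commands instead of running them (assumed helper switch).
DRY_RUN="${DRY_RUN:-}"

puppet_standalone_run() {
    # 1. Obtain a Kerberos ticket non-interactively from the keytab.
    $DRY_RUN kinit -k -t "$KEYTAB" "$PRINCIPAL" || return 1
    # 2. Convert the ticket into an AFS token so the ACL-protected dir becomes readable.
    $DRY_RUN aklog || return 1
    # 3. Apply the manifests straight from /afs; no Puppet master is involved.
    $DRY_RUN puppet apply --modulepath="$ENV_DIR/modules" "$ENV_DIR/manifests/$MANIFEST"
}
```

Calling `puppet_standalone_run` from root's crontab would give periodic runs; with `DRY_RUN=echo` the function only prints the three commands.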

UNIBE-LHEP

UNIBE-ID

  • Procurement
    • Another 16 Dalco compute nodes are installed, set up and running smoothly; ingredients:
      • 2x 8C Intel Xeon E5-2650v2 2.6GHz
      • 128GB 1866MHz DDR ECC REG (8*16GB)
      • 1x 1TB 7.2k rpm SATA 6.0Gb/s
      • 2x Gigabit-Ethernet onboard
      • Infiniband ConnectX-3 QDR HCA
    • Prepared a tender to buy replacement storage
      • IBM GSS24 with 3TB disks => 696 TB total capacity; ~510 TB usable capacity
  • Decommissioning
    • 23 Sun X2200 Pizza boxes shut down and dumped
    • 25 remaining and marked to be dumped within the next two months
  • Operations
    • smooth and reliable, except...
    • ... nordugrid-arc-bdii was dead for almost a week while I was on holiday => bad performance value in monthly report
      • same happened in January and at the beginning of this week
      • now installed a cron-based guardian like the one we already have for a-rex (which btw has been very stable over the last few months)
  • AOB:
    • (Gianfranco) ATLAS pilot role to be enabled/fixed on the ARC CE
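
A cron-based guardian of the kind mentioned under Operations might look like the sketch below. It assumes the init script is called `nordugrid-arc-bdii` and that its `status` action returns non-zero when the service is dead; the function name and log tag are made up for illustration, and the guardian actually deployed at UNIBE-ID may work differently (for instance, it may also need to catch a hung process that still reports as running).

```shell
guard_service() {
    # $1 = name of the init script to watch
    # Nothing to do if the service reports itself as running.
    if service "$1" status >/dev/null 2>&1; then
        return 0
    fi
    # Leave a trace in syslog, then try to bring the service back.
    logger -t guardian "$1 not running, restarting"
    service "$1" restart
}

# Typical call at the end of the cron script (assumed service name):
# guard_service nordugrid-arc-bdii
```

Run every few minutes from root's crontab, this restarts the service shortly after it dies, so an outage no longer lasts until someone is back from holiday.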

UNIGE

  • Xxx
  • AOB:
    • (Gianfranco) ATLAS request to enable multicore jobs (instructions for gridengine were sent by mistake, but Geneva runs Torque)

NGI_CH

  • January 2015 - RP/RC OLA performance: http://snf-631462.vm.okeanos.grnet.gr:8080/lavoisier/site_reports?ngi=NGI_CH
  • Multicore accounting for EGI:
    • http://accounting-devel.egi.eu/show.php?ExecutingSite=CSCS-LCG2&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
    • http://accounting-devel.egi.eu/show.php?ExecutingSite=UNIBE-LHEP&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN
    • http://accounting-devel.egi.eu/show.php?ExecutingSite=UNIBE-ID&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
  • Pakiti made easy https://pakiti.egi.eu/client.php?site=UNIBE-LHEP (simple cron job on all WNs - requires access to the CAs)
  • Issues with Certificates in CH following SWITCH withdrawal from the service as of 31st Aug 2015
    • CERN is not an option for non-users or for servers not on the CERN network
    • TERENA CS (flat fee 27k) would deal only with NRENs (i.e. SWITCH)
    • Exploring possible solutions (EGI catch-all CA?)

Other topics

  • UI accounts for CMS super users at the T2 for batch submission possible?
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS: Gianni Ricciardi, Dino Conciatore, Dario Petrusic, Miguel Gila
  • CMS: Fabio Martinelli, Daniel Meister
  • ATLAS: Gianfranco Sciacca
  • UNIBE-ID: Michael Rolli
  • LHCb: Roland Bernet
  • EGI:

Action items

  • Item1
Topic revision: r14 - 2015-06-09 - FabioMartinelli
 