Swiss Grid Operations Meeting on 2015-03-05
Site status
CSCS
- Three new members of CSCS joined Phoenix: two new system engineers (Dario Petrusic and Dino Conciatore) and systems group lead Nicholas Cardo. They will participate actively in all our meetings and operations from now on.
- They already have access to the systems, including TWiki and chat.
- Working to set certificate roles in /dteam and /dteam/NGI_CH; they will then be added to GOC and Nagios@CSCS
- Maintenance 10.03.14:
- dCache upgrade: security updates and dCache to 2.6.46
- GPFS config update: new maxFilesToCache setting in place, raising the limit from 40k to 50k files per node.
- Reinstallation of as many WNs as possible with latest EMI-WN packages (plus security updates)
- Shutdown and physical removal of old hardware (puppet, nfs0[1-2], se[01-06], 3x 1/2 racks of IBM DC3500 storage)
- A.O.B.
- Migration to Puppet 3.6 ongoing: new roles have been created, but more work is needed. Managed to migrate cfengine from ageing hardware (>1000 days of uptime!) to a new VMware VM.
- At some point before summer, we will need to upgrade GPFS (to v. 4) and dCache (to v. 2.10).
- (Gianfranco) ATLAS lcgadmin and pilot roles to be enabled/fixed on ARC CEs
PSI
UNIBE-LHEP
- Operations
- Slow recovery of the ce01 cluster following the kernel+glibc security updates of January.
- straightforward RPM upgrades would not work; needed to re-image the WNs and re-install
- issue with the OpenIB modules freezing at shutdown. This implies power-cycling every node whenever a reboot (or re-installation) is needed
- it turned out that our IB stack (not updated for ~2 years) had an outdated setup
- however, after a general update of the setup, the rdma modules still do not unload cleanly after start-up (even if Lustre has not been started)
- cooked up a shutdown script that (theoretically) cleverly unloads everything before the system runs into the freeze/crash
- the permanent solution is, I suppose, to re-image the WNs from scratch. However, this implies re-building ROCKS (and the CE) from scratch
- ce02 cluster needed power-cycling for the ethernet switches on 30th Jan, stable thereafter but almost halved in capacity
- Added cron jobs on both CEs to recover a-rex after a crash. Crashes are logged; they typically occur twice a month
- ATLAS specific operations
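The cron-based recovery of a-rex mentioned above could look roughly like the following. This is only a minimal sketch, not the script actually deployed at the site: it assumes the service is controlled through the `service <name> status|restart` commands with `a-rex` as the init script name, and the function names `run` and `guard` are hypothetical.

```python
import logging
import subprocess
from typing import Callable, Sequence

log = logging.getLogger("guardian")

# A runner takes a command line and returns its exit status.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command quietly and return its exit status."""
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def guard(service: str, runner: Runner = run) -> bool:
    """Restart `service` if its status check fails.

    Returns True if a restart was attempted, False if the service was fine.
    Assumption: `service <name> status` exits with 0 only when running.
    """
    if runner(["service", service, "status"]) == 0:
        return False
    # Log every crash so that the frequency can be tracked over time.
    log.warning("%s is not running, restarting", service)
    if runner(["service", service, "restart"]) != 0:
        log.error("restart of %s failed", service)
    return True
```

A cron entry would then call `guard("a-rex")` every few minutes; the runner is passed in as a parameter so that the logic can be exercised without touching a real service.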
UNIBE-ID
- Procurement
- Another 16 Dalco compute nodes are installed, set up and running smoothly; specifications:
- 2x 8C Intel Xeon E5-2650v2 2.6GHz
- 128GB 1866MHz DDR ECC REG (8*16GB)
- 1x 1TB 7.2k rpm SATA 6.0Gb/s
- 2x Gigabit-Ethernet onboard
- Infiniband ConnectX-3 QDR HCA
- Prepared a tender to buy replacement storage
- IBM GSS24 with 3TB disks => 696 TB total capacity; ~510 TB usable capacity
- Decommissioning
- 23 Sun X2200 pizza boxes shut down and disposed of
- 25 remaining, marked for disposal within the next two months
- Operations
- smooth and reliable, except...
- ... nordugrid-arc-bdii was dead for almost a week while I was on holiday => bad performance value in the monthly report
- the same happened in January and at the beginning of this week
- now installed a cron-based guardian like the one we already have for a-rex (which, by the way, has been very stable over the last few months)
- AOB:
- (Gianfranco) ATLAS pilot role to be enabled/fixed on the ARC CE
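The capacity figures quoted for the storage tender (3TB disks, 696 TB total, ~510 TB usable) can be cross-checked with a few lines of arithmetic. This is a sketch only: the disk count and the usable fraction are derived from the quoted numbers and are not stated in the minutes, and the helper names are hypothetical.

```python
def disk_count(total_tb: int, disk_tb: int) -> int:
    """Number of disks implied by a raw capacity and a per-disk size."""
    count, remainder = divmod(total_tb, disk_tb)
    if remainder:
        raise ValueError("raw capacity is not a whole number of disks")
    return count


def usable_fraction(usable_tb: float, total_tb: float) -> float:
    """Share of the raw capacity that remains usable."""
    return usable_tb / total_tb


# Figures quoted in the minutes; the usable value is approximate (~510 TB).
n_disks = disk_count(696, 3)
fraction = usable_fraction(510, 696)
print(f"{n_disks} disks, about {fraction:.0%} of raw capacity usable")
```

The difference between raw and usable capacity would normally go to redundancy and spare space, but the minutes do not break it down.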
UNIGE
- AOB:
- (Gianfranco) ATLAS request to enable multicore jobs (instructions for Grid Engine were sent by mistake, but Geneva runs Torque)
NGI_CH
- January 2015 - RP/RC OLA performance: http://snf-631462.vm.okeanos.grnet.gr:8080/lavoisier/site_reports?ngi=NGI_CH
- Multicore accounting for EGI:
- http://accounting-devel.egi.eu/show.php?ExecutingSite=CSCS-LCG2&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
- http://accounting-devel.egi.eu/show.php?ExecutingSite=UNIBE-LHEP&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN
- http://accounting-devel.egi.eu/show.php?ExecutingSite=UNIBE-ID&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
- Pakiti made easy https://pakiti.egi.eu/client.php?site=UNIBE-LHEP (simple cron job on all WNs - requires access to the CAs)
- Issues with certificates in CH following SWITCH's withdrawal from the service as of 31st Aug 2015
- CERN is not an option for non-CERN users or for servers not on the CERN network
- TERENA CS (flat fee 27k) would deal only with NRENs (i.e. SWITCH)
- Exploring possible solutions (EGI catch-all CA?)
Other topics
- Are UI accounts for CMS super users at the T2 possible for batch submission?
Next meeting date:
A.O.B.
Attendants
- CSCS: Gianni Ricciardi, Dino Conciatore, Dario Petrusic, Miguel Gila
- CMS: Fabio Martinelli, Daniel Meister
- ATLAS: Gianfranco Sciacca
- UNIBE-ID: Michael Rolli
- LHCb: Roland Bernet
- EGI:
Action items
Topic revision: r14 - 2015-06-09 - FabioMartinelli