Swiss Grid Operations Meeting on 2014-12-04
Site status
CSCS
- Maintenance of December 3 went smoothly: CSCS is now connected via a 100G link to SWITCH (Phoenix still at 20G, though)
- ARC monitored on NGI Nagios: WebServices configuration issues (for now enabled only on arc01.lcg.cscs.ch)
- perfSONAR: a couple of old WNs chosen as HW replacements for the old instances
- Reminder: Next F2F meeting on January 29 2015 at CSCS
PSI
- Using the Puppet 3 source_permissions feature to copy files and dirs without specifying owner, group or mode: it behaves like rsync in that respect. I wasn't aware of it.
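- A minimal sketch of how this might look (module name and paths are made up; source_permissions => use tells Puppet to take owner/group/mode from the source):
file { '/opt/tools':
  ensure             => directory,
  recurse            => true,
  source             => 'puppet:///modules/tools/opt_tools',
  source_permissions => use,  # copy owner/group/mode from the source, rsync-style
}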
- Using the SaltStack batch mode feature to run a command on groups of filtered servers:
- To appreciate this I assume you're used to older tools like cexec or pdsh
- Those tools require you to write a static configuration file where you define your cluster(s); these definitions can only use hostnames, as in the sketch below.
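- For comparison, a typical pdsh one-liner against a statically named host range (the range is illustrative):
pdsh -w 't3wn[01-16]' uptime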
- In SaltStack each client (minion) constantly publishes its live info (grains). The core grains are:
SSDs biosreleasedate biosversion cpu_flags cpu_model cpuarch domain fqdn fqdn_ip4 fqdn_ip6 gpus host hwaddr_interfaces id ip4_interfaces ip6_interfaces ip_interfaces ipv4 ipv6 kernel kernelrelease locale_info localhost machine_id manufacturer master mem_total nodename num_cpus num_gpus os os_family osarch oscodename osfinger osfullname osmajorrelease osrelease osrelease_info path productname ps pythonexecutable pythonpath pythonversion saltpath saltversion saltversioninfo selinux serialnumber server_id shell virtual zmqversion
- but you can also define your own grains (prod, dev, webserver, db, rackposition, etc.), as sketched below.
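- A minimal sketch of static custom grains on a minion (keys and values are hypothetical):
# /etc/salt/grains -- static custom grains (YAML), read by the minion at start-up
environment: prod
roles:
  - webserver
rackposition: B2-07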
- By leveraging the grain values you can dynamically filter the minions, split them into groups (of a fixed size or a percentage), and run a command on these groups in sequence.
- Running in small groups is useful when a 3rd-party service (ftp, http, puppet, rsync, NFS, ...) is involved and you don't want to open tens of connections against it.
- My most recurring case is puppet.
saltmaster# salt -b 3 -C 't3wn* and G@osmajorrelease:6' cmd.run 'puppet agent -t'
- All the commands you run are recorded by SaltStack in a kind of job system, as shown below.
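- For instance, past jobs can be listed and inspected with the stock jobs runner (the JID below is made up):
saltmaster# salt-run jobs.list_jobs
saltmaster# salt-run jobs.lookup_jid 20141204120000123456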
- Another example (no groups this time):
salt -C 't3ui* and not G@kernelrelease:2.6.32-358.2.1.el6.x86_64' cmd.run 'uname -a'
- Tried xrootd v4 (http://xrootd.org); I have the impression that it requires IPv6, since I couldn't start it without an IPv6 address. Need to double-check this.
- Working together with my boss Derek to prepare the 5th PSI T3 Steering Board Meeting (UniZ/ETHZ/PSI): a lot of time spent here.
- Reading the dCache 2.6 to 2.10 upgrade guide
- Is somebody going to attend the Condor Workshop at CERN next week? I'll probably attend it remotely.
UNIBE-LHEP
- Operations
- Smooth routine operations, with minor (or quickly remedied) issues:
- 4 workers on ce01 suddenly became black holes: disabled pending investigation (no time so far).
- our main switch went nuts on 17 Nov (during morning working hours, luckily). Packets were dropped all over the place; power-cycled and recovered. No useful traces in the system log.
- a-rex crashed once more on ce02. This is a persistent issue that happens randomly on both clusters; crashing only once in a month is a positive trend.
- the home dirs server (local users) crashed due to a file system problem: it needed a hard power-cycle on site and fs repairs from single-user mode. Night-long downtime, recovered fine.
- deployed a Nagios server with basic checks. Tuning alarm thresholds and progressively adding more sophisticated checks.
- ATLAS-specific operations
UNIBE-ID
- Security incident at site CAMK [EGI-20140130]
- Some attack attempts from the IPs given in the EGI security report; no successful logins found.
- Operations
- smooth and reliable; no issues
- the 16 new DALCO compute nodes are operational => decommissioning of the old Sun Bladecenter on 2014-12-11
UNIGE
- New disk space for the AMS experiment added
- +84 TB in NFS space
- disk now: 709 TB (474 TB in the DPM SE, 235 TB on NFS)
- One incident with a full NFS file system
- a Solaris 9 disk server (a Sun X4540) blocked a few times
- it was impossible to unmount the file system or to shut the server down properly
- we had to reboot all clients and reset many of them
- this does not happen often...
- ARC front end filling up /var
- no logrotate configuration for /var/log/arc/bdii/bdii-update.log; a possible drop-in is sketched below
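- A minimal logrotate drop-in that could address this (file name and rotation policy are illustrative, not taken from the ARC docs):
# /etc/logrotate.d/arc-bdii (hypothetical)
/var/log/arc/bdii/bdii-update.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    copytruncate
}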
- Our /cvmfs over NFS getting slow again, overloaded
- no visible problem to the users, but need to watch this issue
- may need more machines for /cvmfs, we have many directories:
ls /cvmfs
ams.cern.ch atlas.cern.ch atlas-condb.cern.ch atlas-nightlies.cern.ch geant4.cern.ch icecube.wisc.edu na61.cern.ch sft.cern.ch
NGI_CH
Other topics
- Possibility of local accounts for a limited number of power users (direct batch submission) at the T2? (request from ETH CMS group)
Next meeting date:
A.O.B.
Attendants
- CSCS: Gianni Ricciardi
- CMS: Fabio Martinelli, Daniel Meister
- UNIBE-ID: Nico Färber
- ATLAS: Gianfranco Sciacca, Szymon Gadomski
- LHCb: Roland Bernet
- EGI: Gianfranco Sciacca
Action items