<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by the internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2014-12-04

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 9305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS
   * The maintenance of December 3 went smoothly: CSCS is now connected to SWITCH via a 100G link (Phoenix is still at 20G, though).
   * ARC is monitored on NGI Nagios: WebServices configuration issues (for now enabled only on arc01.lcg.cscs.ch).
   * perfSONAR: a couple of old WNs have been chosen as hardware replacements for the old instances.
   * Reminder: next F2F meeting on January 29 2015 at CSCS.

---+++ PSI
   * Using the [[https://docs.puppetlabs.com/references/latest/type.html#file-attribute-source_permissions][Puppet 3 source_permissions]] feature to copy files and directories without specifying owner, group or mode; it behaves like =rsync= in that respect. I wasn't aware of it (see the sketch below).
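A minimal sketch of such a =file= resource, assuming a hypothetical module =mytool=; only the =source_permissions= parameter itself comes from the report above:

<verbatim>
# Copy a directory tree from the Puppet file server, taking owner, group
# and mode from the source files (rsync-like) instead of declaring them
# in the manifest. 'use' is the default value in Puppet 3.
file { '/opt/mytool':
  ensure             => directory,
  recurse            => true,
  source             => 'puppet:///modules/mytool/opt/mytool',
  source_permissions => use,
}
</verbatim>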
   * Using the [[http://docs.saltstack.com/en/latest/topics/targeting/batch.html][SaltStack batch mode]] feature to run a command on groups of filtered servers:
      * To appreciate this, I assume you are used to older tools like [[http://www.csm.ornl.gov/torc/C3/Man/cexec.shtml][cexec]] or [[https://code.google.com/p/pdsh/][pdsh]].
      * Those tools require you to write a static configuration file where you define your cluster(s); these definitions can only use hostnames.
      * In [[http://docs.saltstack.com/en/latest/][SaltStack]] each client (minion) constantly publishes its live info (grains); the core grains are =SSDs biosreleasedate biosversion cpu_flags cpu_model cpuarch domain fqdn fqdn_ip4 fqdn_ip6 gpus host hwaddr_interfaces id ip4_interfaces ip6_interfaces ip_interfaces ipv4 ipv6 kernel %GREEN%kernelrelease%ENDCOLOR% locale_info localhost machine_id manufacturer master mem_total nodename num_cpus num_gpus os os_family osarch oscodename osfinger osfullname %ORANGE%osmajorrelease%ENDCOLOR% osrelease osrelease_info path productname ps pythonexecutable pythonpath pythonversion saltpath saltversion saltversioninfo selinux serialnumber server_id shell virtual zmqversion=, but you can also define your own grains, e.g. =prod dev webserver db rackposition=.
      * By leveraging the grains values you can dynamically filter the minions, split them into groups (a %BLUE%fixed amount%ENDCOLOR% or a percentage), and run a command on these groups in sequence.
      * Running in small groups is useful when a 3rd-party service is involved (=ftp http puppet rsync NFS ...=) and you don't want to open tens of connections against it.
      * *My most recurring case is Puppet:* =saltmaster# salt %BLUE%-b 3%ENDCOLOR% -C 't3wn* and G@%ORANGE%osmajorrelease%ENDCOLOR%:6' cmd.run 'puppet agent -t'=
      * All the commands you run are saved by [[http://docs.saltstack.com/en/latest/][SaltStack]], which acts as a kind of job system.
      * Another example (without groups this time): =salt -C 't3ui* and not G@%GREEN%kernelrelease%ENDCOLOR%:2.6.32-358.2.1.el6.x86_64' cmd.run 'uname -a'=
   * Tried http://xrootd.org v4; I have the impression that it requires IPv6, since I couldn't start it without an IPv6 address. This needs double-checking.
   * Working together with my boss Derek to prepare the 5th PSI T3 Steering Board Meeting (UniZ/ETHZ/PSI): a lot of time spent here.
   * Reading the [[http://www.dcache.org/manuals/upgrade-2.10/upgrade-2.6-to-2.10.html][dCache 2.6 to 2.10 upgrade guide]].
   * Is somebody going to attend [[https://indico.cern.ch/event/272794/][the Condor Workshop at CERN]] next week? I'll probably attend it remotely.

---+++ UNIBE-LHEP
   * Operations
      * Smooth routine operations, with minor (or quickly remedied) issues:
         * 4 workers on ce01 suddenly became black holes: disabled pending investigation (no time so far).
         * Our main switch went nuts on 17 Nov (luckily during morning working hours). Packets were dropped all over the place; power-cycled and recovered. No useful traces in the system log.
         * a-rex crashed once more on ce02. This is a persistent issue that happens randomly on both clusters; only once a month is a positive trend.
         * The home directory server (local users) crashed due to a file system problem: it needed hard power-cycling on site and file system repairs from single-user mode. Night-long downtime, recovered fine.
      * Deployed a Nagios server with basic checks. Tuning alarm thresholds and progressively adding more sophisticated checks.
   * ATLAS-specific operations
      * High failure rate on many WNs [http://bigpanda.cern.ch/wns/UNIBE-LHEP/]:
         * mainly (exclusively?) due to the vmem limit being exceeded
         * tweaked the gridengine settings and restored 16 slots per node on the ce01 SunBlades (24GB RAM): vmem is set as consumable (dynamic allocation), but gridengine needs an explicit total memory value per node, since it doesn't fetch it automatically from e.g. =meminfo=. Once that was set to 24GB, the nodes no longer died from memory starvation (see the sketch after this list).
         * yet the problem is not solved, as jobs are killed instead:
            * the recommended factor-2 up-scaling of the limit in =submit_sge_job= is no longer sufficient
            * the real problem is that the kernels have changed and vmem is no longer RSS+swap: it is the size of the address space. In 32-bit the difference was negligible; in 64-bit it is much larger (and by now all workflows are 64-bit)
            * in gridengine each job is assigned a GID and all the children spawned by the job inherit it; GE adds up the resources used by all processes with the same GID and matches the limits against that total
            * this gets much worse with MCORE jobs, as the shared resources are added multiple times: it turns out that a reco job using e.g. 20GB is accounted as using as much as 40GB
            * cgroups could be the solution, but gridengine (apparently) does not support them
            * either increase the scaling factor or remove the limit altogether (the latter could be OK for production, but not for analysis)
            * increased the factor (conservatively) from 2 to 2.5; watching the failure trend
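A rough sketch of those gridengine settings, for reference; the execution host name is a placeholder, and the exact memory complex used at the site (=h_vmem= here) is an assumption:

<verbatim>
# 1) Mark the memory complex as consumable; edit its line via 'qconf -mc':
#    name    shortcut  type    relop  requestable  consumable  default  urgency
#    h_vmem  h_vmem    MEMORY  <=     YES          YES         2G       0

# 2) gridengine does not read the node's total memory from /proc/meminfo,
#    so state it explicitly per execution host (24GB on the SunBlades):
qconf -mattr exechost complex_values h_vmem=24G wn01.example.org
</verbatim>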
   * Monitoring
      1. SAM Nagios ATLAS_CRITICAL: [[http://wlcg-sam-atlas.cern.ch/templates/ember/#/plot?flavours=SRMv2%2CCREAM-CE%2CARC-CE&group=All%20sites&metrics=org.sam.CONDOR-JobSubmit%20%28%2Fatlas%2FRole_lcgadmin%29%2Corg.atlas.WN-swspace%20%28%2Fatlas%2FRole_pilot%29%2Corg.atlas.WN-swspace%20%28%2Fatlas%2FRole_lcgadmin%29%2Corg.atlas.SRM-VOPut%20%28%2Fatlas%2FRole_production%29%2Corg.atlas.SRM-VOGet%20%28%2Fatlas%2FRole_production%29%2Corg.atlas.SRM-VODel%20%28%2Fatlas%2FRole_production%29&profile=ATLAS_CRITICAL&sites=CSCS-LCG2%2CUNIBE-LHEP%2CUNIGE-DPNC&status=MISSING%2CUNKNOWN%2CCRITICAL%2CWARNING%2COK][plot for CSCS-LCG2, UNIBE-LHEP and UNIGE-DPNC]]
      1. HammerCloud gangarobot: http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UNIBE-LHEP&startTime=2014-11-04&endTime=2014-12-03&templateType=isGolden

---+++ UNIBE-ID
   * Security incident at site CAMK [EGI-20140130]:
      * some attack attempts from the IPs given in the EGI security report; no successful login found
   * Operations:
      * smooth and reliable; no issues
      * the 16 new DALCO compute nodes are operational => decommissioning of the old Sun Bladecenter on 2014-12-11

---+++ UNIGE
   * New disk space for the AMS experiment added:
      * +84 TB of NFS space
      * disk now: 709 TB (474 TB in the DPM SE, 235 TB on NFS)
   * One incident with a full NFS file system.
   * A Solaris 9 disk server (Sun X4540) blocked a few times:
      * impossible to unmount the file system or to shut down properly
      * had to reboot all clients, and to reset many of them
      * this does not happen often...
   * ARC front end filling up /var:
      * caused by the lack of log rotation for =/var/log/arc/bdii/bdii-update.log= (see the sketch after this list)
   * Our /cvmfs over NFS is getting slow again (overloaded):
      * no problem visible to the users, but we need to watch this issue
      * may need more machines for /cvmfs, as we have many directories: <br /> =ls /cvmfs= <br /> =ams.cern.ch atlas.cern.ch atlas-condb.cern.ch atlas-nightlies.cern.ch geant4.cern.ch icecube.wisc.edu na61.cern.ch sft.cern.ch=
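A minimal logrotate snippet for the =bdii-update.log= problem above; only the log path comes from the report, the file name and rotation policy are assumptions:

<verbatim>
# /etc/logrotate.d/arc-bdii (hypothetical file name)
/var/log/arc/bdii/bdii-update.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    # copytruncate keeps rotating without having to signal bdii-update
    copytruncate
}
</verbatim>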
---+++ NGI_CH
   * perfSONAR 3.4 upgrade (re-instantiation) as a response to Shellshock. The new instructions include new mesh configurations.
   * SAM Update-23: release early this week (?). The old OPS VOMS was decommissioned on November 26th.
   * GOCDB: "Prod=Y and Mon=N" changed to "Prod=Y and Mon=Y" for all services except emi.ARGUS and VOMS.
   * NGI_CH ARGUS deployment completed: https://ggus.eu/index.php?mode=ticket_info&ticket_id=99533

---++ Other topics
   * Possibility of local accounts for a limited number of power users (direct batch submission) at the T2? (request from ETH CMS group)
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Gianni Ricciardi
   * CMS: Fabio Martinelli, Daniel Meister
   * UNIBE-ID: Nico Färber
   * ATLAS: Gianfranco Sciacca, Szymon Gadomski
   * LHCb: Roland Bernet
   * EGI: Gianfranco Sciacca

---++ Action items
   * Item1