<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2015-03-05

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 9305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via e-mail)

%TOC%

---++ Site status

---+++ CSCS
   * Three new members of CSCS joined Phoenix: two new system engineers (Dario Petrusic and Dino Conciatore) and systems group lead Nicholas Cardo. They will be actively participating in all our meetings and operations starting now.
      * They already have access to the systems, including TWiki and chat.
      * Working to set certificate roles in /dteam and dteam/NGI_CH; then they will be added to GOC and Nagios@CSCS.
   * Maintenance 10.03.15:
      * dCache upgrade: security updates and dCache to 2.6.46
      * GPFS config update: new maxFilesToCache setting in place, raising it from 40k to 50k files per node (a sketch follows at the end of this section)
      * Reinstallation of as many WNs as possible with the latest EMI-WN packages (plus security updates)
      * Shutdown and physical removal of old hardware (puppet, nfs0[1-2], se[01-06], 3x 1/2 racks of IBM DC3500 storage)
   * A.O.B.
      * Migration to Puppet 3.6 ongoing; new roles have been created, but more work is still needed. Managed to migrate cfengine from ageing hardware (>1000 days of uptime!) to a new VMware VM.
      * At some point before summer we will need to upgrade GPFS (to v. 4) and dCache (to v. 2.10).
      * (Gianfranco) ATLAS lcgadmin and pilot roles to be enabled/fixed on the ARC CEs
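A minimal sketch of how a =maxFilesToCache= change like the one above is typically applied with the standard GPFS admin commands; the node class =worker_nodes= is a hypothetical name for illustration:
<pre>
# Raise the per-node file cache from 40k to 50k files
# ("worker_nodes" is a hypothetical node class; the real target list differs):
mmchconfig maxFilesToCache=50000 -N worker_nodes

# Verify the setting; it only takes effect once GPFS is restarted on the
# affected nodes, e.g. during the maintenance window:
mmlsconfig maxFilesToCache
</pre>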
---+++ PSI
   * A major *NetApp E5400 error*, =A drawer in the tray has become degraded=, that led to =lost 1/2 redundant paths to 12*3TB disks=; it was a FW bug, solved by updating the NetApp E5400 FW ONLINE to =7.86.49.00=; the Linux-side RDAC driver gracefully moved the paths from one RAID controller to the other, and back, while the controllers rebooted. I didn't have to unmount the XFS filesystems or stop dCache; nothing you can get from the NAS world. CSCS should update the FW as well.
   * Again in the same NetApp E5400 I got 2*3TB broken disks.
      * In both cases I got the native NetApp e-mails, routed through =iptables NAT=, but also a [[https://bitbucket.org/fabio79ch/check_netapp/wiki/Home][Nagios e-mail]].
   * Preparing the [[http://www.dcache.org/downloads/1.9/timeline-dCache.svg][dCache 2.6 to 2.10 migration]]; in my case this will also mean upgrading PostgreSQL from 9.3 to 9.4, also because of this [[https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.4#REFRESH_MATERIALIZED_VIEW_CONCURRENTLY][news]]; luckily my [[https://bitbucket.org/fabio79ch/v_pnfs/wiki/Home][Chimera Materialized Views]] still work out of the box, but there are some new table fields that I should include in the future.
   * Using *Puppet standalone* over =/afs= because it's 10 times faster than having a Puppet master, and the clients don't crash; each SL6 server in my cluster mounts =/afs=, and there is an =/afs= dir where both my Puppet recipes and the conf files are stored; this =/afs= dir and its descendants are protected by AFS ACLs; only the =root= account on the SL6 server can access my =/afs= dir, by using a [[https://kb.iu.edu/d/aumh][Kerberos keytab]] file (a standalone run is sketched at the end of this section). Example:
<pre>
%BLUE%#%ENDCOLOR% ll /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/
ls: cannot access /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/: %RED%Permission denied%ENDCOLOR%
%BLUE%#%ENDCOLOR% kinit -k -t /root/afs-keytabs/svcusr-t3_puppet.keytab svcusr-t3_puppet@D.PSI.CH && aklog
%BLUE%#%ENDCOLOR% ll /afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment/
total 4
lrwxr-xr-x 1 martinelli_f cms   22 Jan 15 16:14 manifests -> puppet/TRUNK/manifests
lrwxr-xr-x 1 martinelli_f cms   20 Jan 15 16:14 modules -> puppet/TRUNK/modules
drwxr-xr-x 4 martinelli_f cms 2048 Jan 15 15:48 puppet
</pre>
   * Many other tasks, but specific to PSI or CMS.
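A minimal sketch of what the masterless run looks like once the keytab/token step above has been done; the =site.pp= entry point under =manifests/= is an assumption for illustration:
<pre>
# After kinit + aklog (see the listing above), run Puppet directly against
# the recipes stored in /afs -- no Puppet master involved:
ENV=/afs/psi.ch/service/linux/puppet/var/puppet/environments/FabioDevelopment
puppet apply --modulepath "$ENV/modules" "$ENV/manifests/site.pp"   # site.pp is a hypothetical entry point
</pre>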
---+++ UNIBE-LHEP
   * Operations
      * Slow recovery of the *ce01* cluster following the kernel+glibc security updates of January:
         * straightforward RPM upgrades would not work; needed to re-image the WNs and re-install
         * issue with the OpenIB modules freezing at shutdown, which implies power-cycling every node whenever a re-boot (or re-installation) is needed
         * turned out our IB stack (not updated for ~2 years) had an outdated setup
         * however, even after a general update of the setup, the rdma modules still cannot be unloaded cleanly once loaded at start-up (even if lustre is not started at all)
         * cooked up a shutdown script that (theoretically) unloads everything cleverly before running into the system freeze/crash
         * the permanent solution, I suppose, is to re-image the WNs from scratch; however, this implies re-building ROCKS (and the CE) from scratch
      * *ce02* cluster needed power-cycling of the ethernet switches on 30th Jan; stable thereafter, but almost halved in capacity
      * Added cron jobs on both CEs to recover a-rex after crashing, logging the crashes; typically twice a month (a guardian of this kind is sketched after the UNIBE-ID section below)
   * ATLAS specific operations
      * ATLAS still pretty quiet, picking up now
      * Revived the WebDAV access to the SE (ATLAS request)
      * Monitoring:
         1 SAM Nagios ATLAS_CRITICAL: http://wlcg-sam-atlas.cern.ch/templates/ember/#/plot?flavours=SRMv2%2CCREAM-CE%2CARC-CE&group=All%20sites&metrics=org.sam.CONDOR-JobSubmit%20%28%2Fatlas%2FRole_lcgadmin%29%2Corg.atlas.WN-swspace%20%28%2Fatlas%2FRole_pilot%29%2Corg.atlas.WN-swspace%20%28%2Fatlas%2FRole_lcgadmin%29%2Corg.atlas.SRM-VOPut%20%28%2Fatlas%2FRole_production%29%2Corg.atlas.SRM-VOGet%20%28%2Fatlas%2FRole_production%29%2Corg.atlas.SRM-VODel%20%28%2Fatlas%2FRole_production%29&profile=ATLAS_CRITICAL&sites=CSCS-LCG2%2CUNIBE-LHEP%2CUNIGE-DPNC&status=MISSING%2CUNKNOWN%2CCRITICAL%2CWARNING%2COK
         1 HammerCloud gangarobot: http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UNIBE-LHEP&startTime=2015-02-01&endTime=2015-03-04&templateType=isGolden

---+++ UNIBE-ID
   * Procurement
      * Another 16 Dalco compute nodes are installed, set up and running smoothly; ingredients:
         * 2x 8C Intel Xeon E5-2650v2 2.6GHz
         * 128GB 1866MHz DDR ECC REG (8*16GB)
         * 1x 1TB 7.2k rpm SATA 6.0Gb/s
         * 2x Gigabit-Ethernet onboard
         * Infiniband !ConnectX-3 QDR HCA
      * Prepared a tender to buy replacement storage
         * IBM GSS24 with 3TB disks => 696 TB total capacity; ~510 TB usable capacity
   * Decommissioning
      * 23 Sun X2200 pizza boxes shut down and dumped
      * 25 remaining, marked to be dumped within the next two months
   * Operations
      * smooth and reliable, except...
      * ... nordugrid-arc-bdii dead for almost a week while I was on holiday => bad performance value in the monthly report
         * the same happened in January and at the beginning of this week
         * now installed a cron-based guardian like we already have for a-rex (which, btw, was very stable over the last few months); see the sketch below
   * AOB:
      * (Gianfranco) ATLAS pilot role to be enabled/fixed on the ARC CE
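A minimal sketch of the kind of cron-based guardian mentioned above for a-rex (UNIBE-LHEP) and nordugrid-arc-bdii (UNIBE-ID); the script path, process names and log location are illustrative assumptions, not the sites' actual scripts:
<pre>
#!/bin/bash
# Hypothetical guardian, run from cron on the CE, e.g.:
#   */5 * * * * root /root/bin/service-guardian.sh a-rex arched
# $1 = init service name, $2 = process to look for (both assumed names).
SERVICE=$1
PROC=$2
LOG=/var/log/${SERVICE}-guardian.log
if ! pgrep -x "$PROC" >/dev/null; then
    echo "$(date '+%F %T') $SERVICE found dead, restarting" >> "$LOG"
    service "$SERVICE" restart >> "$LOG" 2>&1
fi
</pre>
The same wrapper would cover the BDII case by passing the corresponding service and process names.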
---+++ UNIGE
   * Xxx
   * AOB:
      * (Gianfranco) ATLAS request to enable multicore jobs (instructions for gridengine were sent by mistake, but Geneva runs Torque)

---+++ NGI_CH
   * January 2015 - RP/RC OLA performance: http://snf-631462.vm.okeanos.grnet.gr:8080/lavoisier/site_reports?ngi=NGI_CH
      * UNIBE-ID low (understood): https://ggus.eu/index.php?mode=ticket_info&ticket_id=111896
   * Multicore accounting for EGI:
      * http://accounting-devel.egi.eu/show.php?ExecutingSite=CSCS-LCG2&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
      * http://accounting-devel.egi.eu/show.php?ExecutingSite=UNIBE-LHEP&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN
      * http://accounting-devel.egi.eu/show.php?ExecutingSite=UNIBE-ID&query=sum_normcpu&startYear=2015&startMonth=1&endYear=2015&endMonth=3&yrange=SubmitHost&xrange=NUMBER+PROCESSORS&groupVO=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
   * Pakiti made easy: https://pakiti.egi.eu/client.php?site=UNIBE-LHEP (a simple cron job on all WNs - requires access to the CAs; an illustrative sketch closes these minutes)
      * The Site Security Officer can check their own site: https://pakiti.egi.eu/
   * Issues with certificates in CH following the SWITCH withdrawal from the service as of 31st Aug 2015:
      * CERN is not an option for non-users, and the servers are not on the CERN network
      * TERENA CS (flat fee 27k) would deal only with NRENs (i.e. SWITCH)
      * Exploring possible solutions (EGI catch-all CA?)

---++ Other topics
   * UI accounts for CMS super users at the T2 for batch submission: possible?
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Gianni Ricciardi, Dino Conciatore, Dario Petrusic, Miguel Gila
   * CMS: Fabio Martinelli, Daniel Meister
   * ATLAS: Gianfranco Sciacca
   * UNIBE-ID: Michael Rolli
   * LHCb: Roland Bernet
   * EGI:

---++ Action items
   * Item1
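As a footnote to the Pakiti item in the NGI_CH section above: the real EGI =pakiti-client= does the reporting itself; purely to illustrate what the "simple cron job on all WNs" amounts to, a hypothetical minimal reporter might look like this (the endpoint URL is an assumption; the CA requirement enters via the HTTPS server verification):
<pre>
#!/bin/bash
# Illustrative only -- not the real pakiti-client. Hypothetical cron entry:
#   30 4 * * * root /root/bin/pakiti-report.sh
# Posts the installed package list to the Pakiti server over HTTPS;
# the CA directory below is why the WNs need access to the CAs.
rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE} %{ARCH}\n' \
  | curl --capath /etc/grid-security/certificates \
         --data-binary @- \
         https://pakiti.egi.eu/feed/   # hypothetical endpoint
</pre>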