<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Swiss Grid Operations Meeting on 2016-04-07 at 15:30

   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: From Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS
   * New ARC CE instance (arc03) installed along with a new SLURM instance (15.08.8) and all the recently purchased WNs<br />(this cluster is integrated into CSCS LDAP and central SLURM DB)
   * Certificate mess last week (Gianni's fault!): thanks to Gianfranco and Sigve for their help
   * Some time spent fixing the Information System [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=118922][(GGUS 118922)]]
   * Tentative planned maintenance on 2016-05-03 to replace IB/Eth bridges, move some VMs, and reinstall arc02
   * CREAM CEs to be decommissioned by the beginning of June
   * Accounting numbers (from scheduler) from last month

*GPFS*
   * No issues to report
   * Metadata migration from local SSD to FC Flash should be performed on May 3rd

*dCache*
   * Almost ready to deploy the first 500TB of new storage (from NETAPP 5560)
   * The additional 500TB will be ready by the first part of May (from SFA12K)
   * Investigating some "unexpected" file deletions (CMS)

---+++ PSI
   * Put the new CentOS7/ZFS/NFSv4 /homes hierarchy in production
      * [[http://zfsonlinux.org/][ZFS On Linux]]
      * [[http://t3mon.psi.ch/ganglia/PSIT3-custom/space.report][/homes hierarchy space report]]
      * [[http://t3mon.psi.ch/ganglia/host_gmetrics.php?c=PSI%20Tier3%20services&h=t3nfs01.psi.ch][/homes hierarchy Ganglia ZFS/NFSv4 stats]]
   * Installing 9 new Dalco servers (2 disks were dead on arrival); each has:
      * Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 64 cores (HT on)
      * 128GB RAM
      * 4 disks 900GB 10k SAS in mdadm 1+0 by Kickstart
      * a 100GB partition formatted as XFS in order to test [[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-fscache.html][FS-Cache]] + NFSv4
   * [[http://t3mon.psi.ch/ganglia/PSIT3-custom/accounting.txt][Accounting numbers (from scheduler) from last month]]

---+++ UNIBE-LHEP

*Operations*
   * Mostly stable operation on both systems, except for:
      * some random failures on some ce01 nodes ( *trans:* Transformation not installed in CE)
         * leads to flipping between blacklisting and whitelisting by HC
         * usually a cvmfs-related problem, but cvmfs reports fine on all nodes
         * under investigation right now
      * eth0 dropped twice within 12h on the ce01 Lustre MDS:<br />Mar 31 08:26:14 mds-2-1 kernel: irq 75: nobody cared (try booting with the "irqpoll" option)<br />...<br />Mar 31 08:26:31 mds-2-1 kernel: e1000e 0000:03:00.0: eth0: Reset adapter unexpectedly
         * leaves Lustre hanging; needs power-cycling to recover (Lustre comes back quickly)
         * possibly flaky hardware; getting a spare card to plug in in case of recurrence

*ATLAS specific operations*
   * HC online 33% (last month, single core only - not huge impact since over 80% of work is MCORE): http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE
   * 63% of ATLAS/CH WT, 70% CPUtime in March: http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites[]=CSCS-LCG2&sites[]=UNIBE-LHEP&sitesCat[]=CH-CHIPP-CSCS&resourcetype=All&sitesSort=2&sitesCatSort=2&start=2016-03-01&end=2016-03-31&timerange=daily&granularity=Monthly&generic=0&sortby=0&series=All&activities[]=all
   * Still on ice: no progress on the storage dumps requested by ATLAS (due to no progress in the re-deployment of the DPM head node on SLC6)
      * but I have asked to re-discuss this within ADC (in my view this should be implemented at the middleware level)
   * UNIBE-LHEP_CLOUD and UNIBE-LHEP_CLOUD_MCORE operating stably
   * *Accounting numbers (from scheduler) from last month (Mar 2016)* - *NOTE*: ce03/CLOUD not reported yet
      * WC h: 936908 (ATLAS) - 149450 (t2k.org) - 13838 (uboone) - 13 (ops)
   * *Accounting numbers (from ATLAS dashboard) from last month* (Mar 2016)
      * CPU h: 672148 (933386.8 with cloud)
      * WC h: 909450 (1243195.7 with cloud)

---+++ UNIBE-ID
   * All servers (but one) moved from RHEL to CentOS and all puppetized - finally
   * Short storage outages in March
      * in Feb: upgrade ESS-3.0 (GPFS-4.1.0) => ESS-3.5 (GPFS-4.1.1)
      * => GPFS cluster overload at certain moments => Stale File Handles
      * Turned off certain logging/tracing facilities in GPFS
      * now perfectly stable again for the past 3 weeks
   * Ordered 76 additional nodes on top of the 32 nodes we ordered last December:
      * Intel Xeon E5-2630v4 @ 2.2GHz, 20 cores (HT off)
      * 128GB RAM
      * => homogeneous queue with 108 nodes (2160 cores) exclusively for MPI usage
   * *Accounting numbers (from scheduler) from last month (Mar 2016):*
      * CPU h: 195476
      * WC h: 67481

---+++ UNIGE
   * Production:
      * Running smoothly in test mode for ATLAS (some checks still pending)
      * High cluster load from local users (need to watch the batch system more closely, since nodes are more likely to go down)
      * Host certificates recently replaced for DPM Head and Disk nodes + ARC-CE (running late because e-mails were sent to Szymon)
   * Storage:
      * ATLASLOCALGROUPDISK space token was almost full; now (after some cleaning of old datasets) it is at ~75% full (~106 TB free)
      * Only one user from UniGe with a useful dataset at CSCS; moving datasets to UniGe. Then, merge ATLASLOCALGROUPDISK with ATLASSCRATCHDISK
      * Providing ATLAS storage dumps every month
   * Outlook:
      * 3 User Interfaces with SLC5 will be decommissioned; this may be a good chance to start moving to CentOS
   * Accounting numbers (from scheduler) from last month (files attached for [[https://wiki.chipp.ch/twiki/pub/LCGTier2/MeetingSwissGridOperations20160407/g07.201602.log][Feb 2016]] and [[https://wiki.chipp.ch/twiki/pub/LCGTier2/MeetingSwissGridOperations20160407/g07.2016.log][Jan-Feb 2016]])

---+++ NGI_CH
   * <span style="color: green;">Nothing of relevance</span>
   * NGI-CH Open Tickets review
      * CSCS-LCG2
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120551">120551</a>: CSCS-LCG2_MCORE : 75%+ jobs failed with ... (ATLAS team) - Not fully fixed yet (blacklisted right now, some HC jobs do not run)
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120505">120505</a>: Large amount of GLEXEC ERRORS on T2_CH_C.. (CMS) - Not touched for a week, changed to "waiting for reply"
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120405">120405</a>: Problem with accessing files at CSCS-LCG... (LHCb team) - In progress
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=119171">119171</a>: Workflow failures at T2_CH_CSCS (CMS) - Changed to "waiting for reply"
      * UNIBE-LHEP
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120257">120257</a>: glidein validation errors for Microboone... (UBOONE) - Following up on OSG, this should be closed
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899">117899</a>: ATLAS request- storage consistency check... (ATLAS) - On hold
      * NGI_CH
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120184">120184</a>: NGI_CH - February 2016 - RP/RC OLA performance - Slow response to 2 tickets (average March response 8.51):
            * <a target="_blank" href="https://ggus.eu/?mode=ticket_info&ticket_id=120045">https://ggus.eu/?mode=ticket_info&ticket_id=120045</a> (LHCb on arcbrisi)
            * <a target="_blank" href="https://ggus.eu/?mode=ticket_info&ticket_id=120293">https://ggus.eu/?mode=ticket_info&ticket_id=120293</a> (duplicate of the above, handled immediately, so: ???)
            * "please remind to set the proper status when handling the tickets"
            * replied to it now

---++ Other topics
   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Pablo, Dario, Dino, Gianni
   * CMS: Fabio
   * ATLAS: Luis
   * LHCb: Roland
   * EGI:

---++ Action items
   * Item1

*Topic attachments*
| *Attachment* | *History* | *Size* | *Date* | *Who* | *Comment* |
| g07.2016.log | r1 | 1.2 K | 2016-04-07 - 13:52 | LuisMarch | |
| g07.201602.log | r1 | 1.0 K | 2016-04-07 - 13:50 | LuisMarch | Feb 2016 |

Topic revision: r21 - 2016-04-07 - MichaelRolli
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.