
Swiss Grid Operations Meeting on 2014-03-06

Agenda

Status

  • CSCS (reports Miguel):
    • Procurement:
      • Received 16x HP WNs (40core, 128GB, 1x SATA disk, each) HP ProLiant SL210t Gen8 1U Server (total of 8U)
      • Purchased additional storage (375TB) for the Storage Element (2x NetApp E5500B FC with 4TB drives + 2x IBM x3650 servers)
      • Purchased 2x IBM x3650 servers for virtualisation (128GB, 1x 10GbE, 5x400GB SAS SSD)
      • Purchased 8x HP WNs (40core, 128GB, 1x SATA disk, each) HP ProLiant SL210t Gen8 1U Server (total of 4U). To be used preferably by Swiss users.
      • With this, the purchases of phase H are complete
    • Operations:
      • Provisioned new GPFS nodes using Razor and Puppet. Configuration ongoing.
      • Working on a new monitoring system.
      • Received WNs to be provisioned soon. They will be puppetized (=no more YAIM).
      • perfSONAR deployed with help from the NGI_DE admins.
    • Issues:
      • DONE Thanks to the ATLAS folks, spotted a mistake in the FairShare configuration that prevented ATLAS and CMS jobs from using the system according to the agreed share. This has now been fixed.
      • DONE ARC accounting successfully published to the APEL testing servers. Publishing to the production servers is under way (currently coordinating with the APEL team), along with the archived records collected by the ARC CEs during previous months.
      • DONE Found a problem on WNs where /tmp/slurmd was not created by SLURM. This turned the node into a job black hole that the health check system did not detect.
      • DONE Resolved an intermittent DHCP issue where the transfer of the initrd could stall during provisioning. Root cause: the Ethernet and InfiniBand fabrics share the same VLAN, so the DHCP server needs the two subnets declared together as a "shared-network".
      • Some CMS jobs writing to /tmp (local) instead of /tmpdir_slurm (GPFS).
      • A lot of accounting data still seems to be missing from the official accounting portal (related to the CREAM CEs). Working to figure out why and how to fix it.
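
The /tmp/slurmd black-hole case reported above could be caught by an extra node health check along the following lines. This is only a minimal sketch, assuming the health check system can run a Python function or script on each WN and drain the node when it fails. Only the path /tmp/slurmd comes from the report; the function name and messages are hypothetical.

```python
import os


def check_spool_dir(path="/tmp/slurmd"):
    """Return (ok, message) for a directory that slurmd needs on the WN."""
    if not os.path.isdir(path):
        return False, "%s is missing" % path
    # The directory must be writable and searchable by the checking user.
    if not os.access(path, os.W_OK | os.X_OK):
        return False, "%s is not writable" % path
    return True, "%s is present and writable" % path
```

A wrapper would call check_spool_dir() and report the node as unhealthy when the first element of the result is False.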

  • PSI (reports Fabio):
  • UNIBE (reports Gianfranco):
    • Fairly stable operations with day-to-day maintenance, except:
      • Issues with stale files in the ARC sessiondir. These are files left over by failed jobs, which end up clogging the directory. Added a weekly cron job to clean them up.
      • Issues with a large number of job files in the sessiondir. In this case, these are genuine job files not cleaned up by the users. The t2k users were not yet experienced and were not aware that they have to retrieve their outputs regularly.
      • In both cases, the symptom is the clusters' infosys dropping out of the GIIS, which makes them invisible. Cleanup and proper service restoration is painstaking, but everything now seems to have stabilised.
    • Kernel panic on one Lustre OSS (thumper). Disabled it on all nodes and carried on with limited pain.
    • Performed crosscheck of accounting data between batch servers and EGI accounting portal from Jan 2013 to date
      • Some months are low on the EGI portal: Alessandro is trying to republish. If the numbers don't change, the records are missing on the SGAS server and are therefore lost.
      • Discovered that gridengine does not seem to account properly for multi-core jobs. See attachment ATLAS-MCORE-setup_for_GE.rtf

    • DPM SE configured for xrootd protocol. Setup to join ATLAS DE FAX federation in progress (working with Geneva). CSCS final setup still pending (setup is different for DPM and dCache)
    • Obsolete Glue 2 entries in site-bdii: https://xgus.ggus.eu/ngi_ch/?mode=ticket_info&ticket_id=303
    • No progress on:
      • Transition SGAS to Jura
      • GIIS final setup
      • Solid roadtest of VOMS server (first tests are OK)
    • Placed an order for six 36-disk servers for Lustre (intending to phase out thumpers). JBODs with 5x8 port LSI RAID controllers
    • Procurement of some new WNs expected in April. The strategy is to let the current ailing hardware die (no effort to rescue or repair) and progressively replace it with new nodes.
  • UNIGE (reports Szymon): absent
    • Xxx
  • UZH (reports Sergio):
    • Xxx
  • Switch (reports Alessandro):
    • Xxx
Other topics
  • "Swiss" Resources at CSCS
    • Maybe we should have a dedicated meeting (a Vidyo conference is fine) to discuss the plans (timeline, configuration, testing) for these non-pledged resources (especially the computing; the storage is trivial, at least from the CMS side). Or should we schedule it for the next monthly meeting?
    • From the CMS side we would like to prioritize the jobs of Swiss users so that their jobs come first on these Swiss resources
      • For that we have an additional group in the CMS VOMS server, /cms/chcms, that can be administered by Christoph and Daniel.
      • For the moment Derek, Fabio, and Daniel are in this group and can provide proxy certificates with this additional role for testing purposes.
    • Are there similar plans from the ATLAS/LHCb side?
  • Topic2
Next meeting date:

AOB

  • Next week the Pre-GDB and GDB will be held in Bologna. Miguel will attend and give a short presentation on the migration of CSCS-LCG2 to SLURM.

Attendants

  • CSCS:
  • CMS:
  • ATLAS:
  • LHCb:
  • EGI:

Action items

  • Item1

A user banned in dCache 2.6

We banned a user because of hundreds of requests, always for the same two files, which amounted to an Xrootd DoS on the dCache pools hosting those two files.
xrootd-queued-requests.png
Relevant dCache banning logs:
04 Mar 2014 17:18:00 (gPlazma) [Xrootd-t3se01 Login AUTH voms] Certificate verification: Verifying certificate 'DC=ch,DC=cern,OU=computers,CN=voms.cern.ch'
04 Mar 2014 17:18:00 (gPlazma) [Xrootd-t3se01 Login MAP vorolemap] VOMS authorization successful for user with DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=silveira/CN=705497/CN=Gustavo Gil Da Silveira and FQAN: /cms for user name: cmsuser.
04 Mar 2014 17:18:00 (gPlazma) [Xrootd-t3se01 Login] Login attempt failed; detailed explanation follows:
LOGIN FAIL
 |    in: X509 Certificate chain:
 |          |
 |          +--CN=1758990799,CN=1394689704,CN=Gustavo Gil Da Silveira,CN=705497,CN=silveira,OU=Users,OU=Organic Units,DC=cern,DC=ch [1758990799]
 |          |    |
 |          |    +--Issuer: CN=1394689704,CN=Gustavo Gil Da Silveira,CN=705497,CN=silveira,OU=Users,OU=Organic Units,DC=cern,DC=ch
 |          |    +--Validity: OK for 16 hours, 54 minutes and 51.3 seconds
 |          |    +--Algorithm: SHA-1 with RSA
 |          |    +--Public key: RSA 1024 bits
 |          |    +--Key usage: digital signature, key encipherment, data encipherment
 |          |
 |          +--CN=1394689704,CN=Gustavo Gil Da Silveira,CN=705497,CN=silveira,OU=Users,OU=Organic Units,DC=cern,DC=ch [1455016582116403591211701295546859998117066254174]
 |          |    |
 |          |    +--Issuer: CN=Gustavo Gil Da Silveira,CN=705497,CN=silveira,OU=Users,OU=Organic Units,DC=cern,DC=ch
 |          |    +--Validity: OK for 7 days, 16 hours, 22 minutes and 19.2 seconds
 |          |    +--Algorithm: SHA-1 with RSA
 |          |    +--Public key: RSA 1024 bits
 |          |    +--Attribute certificates:
 |          |    |    |
 |          |    |    +--DC=ch,DC=cern,OU=computers,CN=voms.cern.ch
 |          |    |         +--Validity: OK for 7 days, 16 hours, 22 minutes and 19.2 seconds
 |          |    |         +--Algorithm: SHA-1 with RSA
 |          |    |         +--FQANs: /cms, /cms/becms
 |          |    +--Key usage: digital signature, key encipherment, data encipherment
 |          |
 |          +--CN=Gustavo Gil Da Silveira,CN=705497,CN=silveira,OU=Users,OU=Organic Units,DC=cern,DC=ch [315385076395555361510222]
 |               |
 |               +--Issuer: CN=CERN Trusted Certification Authority,DC=cern,DC=ch
 |               +--Validity: OK for 28 days, 22 hours, 41 minutes and 41.2 seconds
 |               +--Algorithm: SHA-1 with RSA
 |               +--Public key: RSA 2048 bits
 |               +--Subject alternative names:
 |               |      otherName: 302a060a2b060104018237140203a01ca01a0c186775737461766f2e73696c7665697261406365726e2e6368
 |               |      email: gustavo.silveira@cern.ch
 |               +--Key usage: digital signature, key encipherment, SSL client, email protection, Microsoft EPS
 |        
 |
 +--AUTH OK
 |   |    added: FQANPrincipal[/cms,primary]
 |   |           FQANPrincipal[/cms/becms]
 |   |           /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=silveira/CN=705497/CN=Gustavo Gil Da Silveira
 |   |
 |   +--x509 OPTIONAL:OK => OK
 |   |      added: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=silveira/CN=705497/CN=Gustavo Gil Da Silveira
 |   |
 |   +--voms OPTIONAL:OK => OK
 |          added: FQANPrincipal[/cms,primary]
 |                 FQANPrincipal[/cms/becms]
 |
 +--MAP OK
 |   |    added: GidPrincipal[500,primary]
 |   |           UidPrincipal[501]
 |   |           UserNamePrincipal[cmsuser]
 |   |           GroupNamePrincipal[cmsuser,primary]
 |   |
 |   +--vorolemap REQUISITE:OK => OK
 |   |      added: GroupNamePrincipal[cmsuser,primary]
 |   |
 |   +--authzdb REQUISITE:OK => OK
 |          added: GidPrincipal[500,primary]
 |                 UidPrincipal[501]
 |                 UserNamePrincipal[cmsuser]
 |
 +--ACCOUNT FAIL
 |   |
 |   +--banfile REQUISITE:FAIL (user banned) => FAIL (ends the phase)
 |
 +--(SESSION) skipped
 |
 +--(VALIDATION) skipped
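
A trace like the one above can be reduced to its essentials (which phase failed, and which plugin caused it) with a few lines of Python. This is a minimal sketch, assuming the trace is available as a list of text lines formatted exactly as printed here. The regular expressions are derived from this single example only and may need adjusting for other dCache versions; the function name summarize_login is hypothetical.

```python
import re

# Phase lines look like " +--AUTH OK" or " +--(SESSION) skipped".
PHASE_RE = re.compile(r"^\s*\+--\(?([A-Z]+)\)?\s+(OK|FAIL|skipped)\s*$")
# Plugin lines look like
# " |   +--banfile REQUISITE:FAIL (user banned) => FAIL (ends the phase)".
PLUGIN_RE = re.compile(r"\+--(\w+)\s+[A-Z]+:(OK|FAIL)(?:\s+\(([^)]*)\))?\s+=>")


def summarize_login(lines):
    """Return the outcome of each login phase and the plugins that failed."""
    phases = []
    failed_plugins = []
    current = None
    for line in lines:
        phase = PHASE_RE.match(line)
        if phase:
            current = phase.group(1)
            phases.append((current, phase.group(2)))
            continue
        plugin = PLUGIN_RE.search(line)
        if plugin and plugin.group(2) == "FAIL":
            # (phase, plugin name, reason given in parentheses or None)
            failed_plugins.append((current, plugin.group(1), plugin.group(3)))
    return {"phases": dict(phases), "failed_plugins": failed_plugins}
```

Fed the trace above, the summary should show the ACCOUNT phase failing because of the banfile plugin, while AUTH and MAP succeed.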

A non-trivial Salt run

Here I retrieve, as a Python dict, the top processes running on each of my WNs.
[root@t3admin01 ~]# ipython 
Python 2.7.6 |Anaconda 1.9.0 (64-bit)| (default, Jan 17 2014, 10:13:17) 
Type "copyright", "credits" or "license" for more information.

IPython 1.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import salt.client

In [2]: saltclient = salt.client.LocalClient()  # I connect to the salt master

In [3]: mydict = saltclient.cmd('nodename:t3wn*', 'ps.top', expr_form='grain' )  # I run Salt's ps.top function only on the WNs, matched by the nodename grain

# For each WN, the 5 top processes
In [4]: print mydict
{'t3wn34': [
   {'status': 0, 'mem.vms': 1043345408, 'cmd': ['cmsRun', '/shome/cgalloni/TestSim/CMSSW_5_3_2_patch4/src/SIM/MR_M1000_SIM_cfi.py', 'maxEvents=500', 'skipEvents=228000', 'seed=3936709'], 'pid': 9985, 'cpu.user': 10463.36, 'cpu.system': 3.38, 'create_time': 1394091559.43, 'user': 'cgalloni', 'mem.rss': 725270528}, 
   {'status': 0, 'mem.vms': 1024819200, 'cmd': ['cmsRun', '/shome/cgalloni/TestSim/CMSSW_5_3_2_patch4/src/SIM/MR_M1000_SIM_cfi.py', 'maxEvents=500', 'skipEvents=228500', 'seed=7239021'], 'pid': 10243, 'cpu.user': 10431.48, 'cpu.system': 3.3, 'create_time': 1394091591.95, 'user': 'cgalloni', 'mem.rss': 703696896}, 
   {'status': 0, 'mem.vms': 1021722624, 'cmd': ['cmsRun', '/shome/cgalloni/TestSim/CMSSW_5_3_2_patch4/src/SIM/MR_M1000_SIM_cfi.py', 'maxEvents=500', 'skipEvents=224000', 'seed=2991033'], 'pid': 8030, 'cpu.user': 10838.1, 'cpu.system': 2.81, 'create_time': 1394091186.8, 'user': 'cgalloni', 'mem.rss': 702164992}, 
   {'status': 0, 'mem.vms': 1019928576, 'cmd': ['cmsRun', '/shome/cgalloni/TestSim/CMSSW_5_3_2_patch4/src/SIM/MR_M1000_SIM_cfi.py', 'maxEvents=500', 'skipEvents=234000', 'seed=9386177'], 'pid': 11313, 'cpu.user': 10170.57, 'cpu.system': 2.74, 'create_time': 1394091852.93, 'user': 'cgalloni', 'mem.rss': 703258624}, 
   {'status': 1, 'mem.vms': 4165632, 'cmd': ['mdadm', '--monitor', '--scan', '-f', '--pid-file=/var/run/mdadm/mdadm.pid'], 'pid': 20269, 'cpu.user': 0.0, 'cpu.system': 0.01, 'create_time': 1394036233.94, 'user': 'root', 'mem.rss': 401408}
                            ], 
't3wn33':   .........
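
Once the result is a plain dict, it can be post-processed like any other Python data. As a minimal sketch, the following sums the resident memory per node and user. The function name rss_per_user is hypothetical, the sample is a trimmed excerpt of the output above (only the fields used), and the isinstance guard rests on the assumption that a minion that failed may return something other than a list.

```python
def rss_per_user(top_by_node):
    """Sum resident memory (bytes) per (node, user) from a ps.top result."""
    totals = {}
    for node, processes in top_by_node.items():
        # Assumption: skip minions whose return value is not a process list.
        if not isinstance(processes, list):
            continue
        for proc in processes:
            key = (node, proc["user"])
            totals[key] = totals.get(key, 0) + proc["mem.rss"]
    return totals


# Trimmed excerpt of the output above (only the fields used here).
sample = {
    "t3wn34": [
        {"user": "cgalloni", "pid": 9985, "mem.rss": 725270528},
        {"user": "cgalloni", "pid": 10243, "mem.rss": 703696896},
        {"user": "root", "pid": 20269, "mem.rss": 401408},
    ]
}

for (node, user), rss in sorted(rss_per_user(sample).items()):
    print("%s %s %.1f MiB" % (node, user, rss / (1024.0 * 1024.0)))
```

In the real session, mydict would be passed in place of sample.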

Topic attachments
  • ATLAS-MCORE-setup_for_GE.rtf (r1, 2.0 K, 2014-03-06 15:01, GianfrancoSciacca): ATLAS-MCORE-setup_for_GE
  • xrootd-queued-requests.png (r1, 10.9 K, 2014-03-04 14:41, FabioMartinelli): T3 PSI xrootd queued requests
Topic revision: r22 - 2014-03-06 - FabioMartinelli
 