Swiss Grid Operations Meeting on 2016-07-07 at 14:00

Site status

CSCS

  • Xxx
  • Accounting numbers (from scheduler) from last month

PSI

  • Upgraded my 2 HP CentOS7 NFSv4 NAS servers to ZoL (ZFS on Linux) v0.6.5.7
    • 1st is the primary NAS, with 24 SAS disks (15k rpm, 600GB each)
    • 2nd is the secondary NAS, with 12 SATA disks (7.2k rpm, 3000GB each), used as cold backup
    • both have a dual 10Gb/s card in LACP bonding mode
  • dCache on ZoL
    • on the secondary NAS I also created ZFS filesystems for dCache:
    • [root@t3nfs02 ~]# zfs list -d1
      NAME                    USED  AVAIL  REFER  MOUNTPOINT
      data01                 1.33T  9.15T  32.0K  /zfs/data01
      data01/dcache           100G  9.15T  32.0K  /zfs/data01/dcache
      data01/t3nfs01_data01  1.23T  9.15T  32.0K  /zfs/data01/t3nfs01_data01
      data02                 4.33T  6.15T  32.0K  /zfs/data02
      data02/dcache           100G  6.15T  32.0K  /zfs/data02/dcache
      data02/t3nfs01_data01  4.23T  6.15T  32.0K  /zfs/data02/t3nfs01_data01
      
  • dCache tuning
    • [root@t3se01 layouts]# grep max /etc/dcache/layouts/t3se01.conf 
      srm.request.max-requests=400
      srm.request.put.max-requests=100
      srm.request.get.max-inprogress=100
      srm.request.copy.max-inprogress=100
      srm.request.max-transfers=100
      
  • Accounting numbers (from scheduler) from last month
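
A minimal sketch of how the `zfs list -d1` output above could be turned into per-pool fill levels, for example to keep an eye on the space left for dCache. The function names (`to_bytes`, `parse_zfs_list`, `pool_fill`) are hypothetical helpers for illustration and are not part of the PSI setup; the sample text is copied from the listing above, and sizes are assumed to use binary units (K, M, G, T).

```python
# Sample copied from the `zfs list -d1` output shown in the PSI report.
ZFS_LIST = """\
NAME                    USED  AVAIL  REFER  MOUNTPOINT
data01                 1.33T  9.15T  32.0K  /zfs/data01
data01/dcache           100G  9.15T  32.0K  /zfs/data01/dcache
data01/t3nfs01_data01  1.23T  9.15T  32.0K  /zfs/data01/t3nfs01_data01
data02                 4.33T  6.15T  32.0K  /zfs/data02
data02/dcache           100G  6.15T  32.0K  /zfs/data02/dcache
data02/t3nfs01_data01  4.23T  6.15T  32.0K  /zfs/data02/t3nfs01_data01
"""

# Assumption: suffixes are powers of 1024.
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def to_bytes(size):
    """Convert a human-readable size such as '1.33T' to bytes."""
    if size[-1] in UNITS:
        return float(size[:-1]) * UNITS[size[-1]]
    return float(size)


def parse_zfs_list(text):
    """Return one dict per dataset, keyed by the column names of the header."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def pool_fill(rows):
    """Return {pool: used / (used + avail)} for top-level datasets only."""
    fill = {}
    for row in rows:
        if "/" not in row["NAME"]:  # pools have no '/' in their name
            used = to_bytes(row["USED"])
            avail = to_bytes(row["AVAIL"])
            fill[row["NAME"]] = used / (used + avail)
    return fill


if __name__ == "__main__":
    for pool, fraction in pool_fill(parse_zfs_list(ZFS_LIST)).items():
        print(f"{pool}: {fraction:.1%} used")
```

On a live system the same parsing could be fed from the real command output instead of the pasted sample.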

UNIBE-LHEP

  • Xxx
  • Accounting numbers (from scheduler) from last month

UNIBE-ID

  • Mostly smooth operation
  • Procurement:
    • 80 new servers (76*20 + 4*16 => 1584 new cores); discontinued 144 cores (oldest nodes)
      • installed and provisioned
  • Migration from OGSGE => Slurm planned for Q4
  • Problems with NAMD jobs (using ibverbs directly) => low-level IB errors from mlx4 regarding QPs
    • no errors with MPI jobs using ompi or the like
    • no errors with storage (GPFS over RDMA)
  • ATLAS specific: large number of random a-rex crashes within the last 2 weeks
    • reason unknown; happened 24x between 2016-06-15 and last Monday; no crashes in the last 3 days
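
As a quick cross-check of the procurement figures in the UNIBE-ID report, the arithmetic can be sketched as follows. The variable names are illustrative only; the node and core counts are the ones quoted in the report, and the net gain is derived from them rather than stated in the minutes.

```python
# cores per node -> number of new nodes, as quoted in the procurement item
new_nodes = {20: 76, 16: 4}

new_node_count = sum(new_nodes.values())
new_cores = sum(cores * count for cores, count in new_nodes.items())

# cores on the oldest nodes that were discontinued
retired_cores = 144

# derived figure, not stated in the minutes
net_gain = new_cores - retired_cores

print(f"new nodes: {new_node_count}")
print(f"new cores: {new_cores}")
print(f"net core gain: {net_gain}")
```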

UNIGE

  • Operations
    • 10 machines added to the batch system (80 cores) + 3 machines replaced:
      • DELL - Intel Xeon @ 2.4 GHz - with 8 cores and 48 GB of memory
    • RAID controller: a recurring problem for our DPM and NFS file servers (it happened 3 or 4 times in recent months)
    • Increased activity from DPNC users running in the batch system (other groups, in addition to ATLAS)
    • Still not in ATLAS production; memory-related problems (hints provided by Gianfranco)
  • Data Management:
  • Accounting numbers (from scheduler) from last month

NGI_CH

  • Xxx
  • NGI-CH Open Tickets review

Other topics

  • Topic1
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS:
  • CMS:
  • ATLAS: Michael Rolli (UNIBE-ID) => absent due to illness; notes nevertheless provided above
  • LHCb:
  • EGI:

Action items

  • Item1
Topic attachments

  • g07.2016.06.log (r1, 1.1 K, 2016-07-07 11:05, LuisMarch): Accounting UniGe June 2016
Topic revision: r12 - 2016-07-07 - FabioMartinelli