Swiss Grid Operations Meeting on 2016-07-07 at 14:00

Site status


  • Some accounting numbers
    account  % num jobs  % of wall  count(*)  walltime sec  sum(round(max_vsize/1024))  sum_tres_req_mem      mem_diff  vsize/req %

    atlas       100.00%    100.00%    288913    2694614271               1,617,843,841     1,389,126,500   228,717,341      116.46%
    cms         100.00%    100.00%     50840    1535630187                 230,934,497       356,035,968  -125,101,471       64.86%
    lhcb        100.00%    100.00%     57574    3211019505                 255,594,384       115,148,000   140,446,384      221.97%

    atlas        68.50%     43.09%    197903    1160991397                 547,848,230       386,762,000   161,086,230      141.65%
    cms          74.38%      0.28%     37816       4244836                  30,376,806        75,632,000   -45,255,194       40.16%
    lhcb        100.00%    100.00%     57572    3210873808                 255,585,171       115,144,000   140,441,171      221.97%

    atlas        31.50%     56.91%     91007    1533609961               1,069,984,255     1,002,358,500    67,625,755      106.75%
    cms          25.62%     99.72%     13024    1531385351                 200,557,691       280,403,968   -79,846,277       71.52%

    0.00% 0.00%

    Query used:
    -- tres_req is a comma-separated list of id=value pairs; the nested
    -- substring_index calls take the value after '2=' (requested memory)
    SELECT account, count(*),
           sum(phoenix_job_table.time_end - phoenix_job_table.time_start) AS walltime,
           sum(round(max_vsize/1024)),
           sum(substring_index(substring_index(phoenix_job_table.tres_req,',',2),'2=',-1)) AS sum_tres_req_mem,
           sum(round(max_vsize/1024)) - sum(substring_index(substring_index(phoenix_job_table.tres_req,',',2),'2=',-1)) AS mem_diff
    FROM slurm_acct_db.phoenix_step_table, slurm_acct_db.phoenix_job_table
    WHERE phoenix_job_table.job_db_inx = phoenix_step_table.job_db_inx
      AND substring_index(substring_index(phoenix_job_table.tres_req,',',2),'2=',-1) > 2000
      AND account IN ('atlas', 'cms', 'lhcb')
      AND phoenix_step_table.state = 3  -- completed steps only
    GROUP BY account
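
    To make the memory extraction in the query easier to follow, here is a minimal Python sketch of what the nested substring_index calls do and how the last column of the table appears to be derived (used vsize divided by requested memory). The sample tres_req string and the helper names are assumptions for illustration only; they are not taken from the accounting database.

```python
def substring_index(text: str, delim: str, count: int) -> str:
    """Mimic MySQL SUBSTRING_INDEX for a non-empty delimiter."""
    if count == 0:
        return ""
    parts = text.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter
        return delim.join(parts[:count])
    # Everything to the right of the |count|-th delimiter from the end
    return delim.join(parts[count:])


def requested_mem(tres_req: str) -> int:
    """Same nested extraction as the query: keep the first two fields, then take what follows '2='."""
    return int(substring_index(substring_index(tres_req, ",", 2), "2=", -1))


def usage_ratio_percent(used: int, requested: int) -> float:
    """Used memory as a percentage of requested memory."""
    return round(100.0 * used / requested, 2)


# Hypothetical tres_req value (assumed layout: id=value pairs, id 2 = memory)
sample = "1=1,2=2000,4=1"
mem = requested_mem(sample)

# Totals for atlas from the first block of the table
atlas_diff = 1_617_843_841 - 1_389_126_500
atlas_pct = usage_ratio_percent(1_617_843_841, 1_389_126_500)
print(mem, atlas_diff, atlas_pct)
```

    With the atlas totals this reproduces the mem_diff and percentage shown in the table, which suggests the last column is used/requested rather than a share of the total.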


  • Upgraded my 2 HP CentOS7 NFSv4 NAS to ZoL v0.6.5.7
    • 1st is the primary NAS, with 24 SAS disks (15k rpm, 600 GB)
    • 2nd is the secondary NAS, with 12 SATA disks (7.2k rpm, 3000 GB), used as cold backup
    • both have a dual 10 Gb/s card in LACP bonding mode
  • dCache on ZoL
    • again on the secondary NAS, I created ZFS filesystems for dCache:
    • [root@t3nfs02 ~]# zfs list -d1
      NAME                    USED  AVAIL  REFER  MOUNTPOINT
      data01                 1.33T  9.15T  32.0K  /zfs/data01
      data01/dcache           100G  9.15T  32.0K  /zfs/data01/dcache
      data01/t3nfs01_data01  1.23T  9.15T  32.0K  /zfs/data01/t3nfs01_data01
      data02                 4.33T  6.15T  32.0K  /zfs/data02
      data02/dcache           100G  6.15T  32.0K  /zfs/data02/dcache
      data02/t3nfs01_data01  4.23T  6.15T  32.0K  /zfs/data02/t3nfs01_data01
  • dCache tuning
    • [root@t3se01 layouts]# grep max /etc/dcache/layouts/t3se01.conf
      srm.request.max-requests=400
      srm.request.put.max-requests=100
      srm.request.get.max-inprogress=100
      srm.request.copy.max-inprogress=100
      srm.request.max-transfers=100
  • Accounting numbers (from scheduler) from last month


  • Accounting numbers (from scheduler) from last month


  • Mostly smooth operation
  • Procurement:
    • 80 new servers (76*20 + 4*16 => 1584 new cores); discontinued 144 cores (oldest nodes)
      • installed and provisioned
  • Migration from OGSGE => Slurm planned for Q4
  • Problems with NAMD jobs (using ibverbs directly) => low-level IB errors from mlx4 regarding qp
    • no errors with MPI jobs using ompi or the like
    • no errors with storage (GPFS over RDMA)
  • ATLAS specific: large number of random a-rex crashes within the last 2 weeks
    • reason unknown; happened 24x between 2016-06-15 and last Monday; no crashes in the last 3 days
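
    The core count in the procurement item above can be checked with a few lines of Python. The net figure is an assumption: it treats the 144 discontinued cores as having been part of the same pool, which the notes do not state explicitly.

```python
# Server counts and cores per server as given in the procurement item
new_cores = 76 * 20 + 4 * 16   # 76 servers with 20 cores, 4 servers with 16 cores
discontinued_cores = 144       # oldest nodes, taken out of service

# Assumption: the discontinued cores were counted in the same pool
net_change = new_cores - discontinued_cores
print(new_cores, net_change)
```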


  • Operations
    • 10 machines added into the batch system (80 cores) + 3 machines replaced:
    • DELL - Intel Xeon @ 2.4 GHz - with 8 cores and 48 GB of memory
    • RAID controller: common problem for our DPM and NFS file servers (it has happened 3 or 4 times in recent months)
    • Increased batch system activity from DPNC users (other groups, in addition to ATLAS)
    • Still not in ATLAS production; problems related to memory (hints provided by Gianfranco)
  • Data Management:
  • Accounting numbers (from scheduler) from last month


  • NGI-CH Open Tickets review

Other topics

Next meeting date:



  • CSCS:
  • CMS:
  • ATLAS: Michael Rolli (UNIBE-ID) => absent due to illness; some notes provided above nevertheless
  • LHCb:
  • EGI:

Action items

  • Item1
