
Swiss Grid Operations Meeting on 2016-07-07 at 14:00

Site status

CSCS

  • Some accounting numbers
    account | % num jobs | % of wall | count(*) | walltime sec | sum(round(max_vsize/1024)) | sum_tres_req_mem | mem_diff | mem_diff %
    total:
    atlas | 100.00% | 100.00% | 288913 | 2694614271 | 1,617,843,841 | 1,389,126,500 | 228,717,341 | 116.46%
    cms | 100.00% | 100.00% | 50840 | 1535630187 | 230,934,497 | 356,035,968 | -125,101,471 | 64.86%
    lhcb | 100.00% | 100.00% | 57574 | 3211019505 | 255,594,384 | 115,148,000 | 140,446,384 | 221.97%
    req<=2000:
    atlas | 68.50% | 43.09% | 197903 | 1160991397 | 547,848,230 | 386,762,000 | 161,086,230 | 141.65%
    cms | 74.38% | 0.28% | 37816 | 4244836 | 30,376,806 | 75,632,000 | -45,255,194 | 40.16%
    lhcb | 100.00% | 100.00% | 57572 | 3210873808 | 255,585,171 | 115,144,000 | 140,441,171 | 221.97%
    req>2000:
    atlas | 31.50% | 56.91% | 91007 | 1533609961 | 1,069,984,255 | 1,002,358,500 | 67,625,755 | 106.75%
    cms | 25.62% | 99.72% | 13024 | 1531385351 | 200,557,691 | 280,403,968 | -79,846,277 | 71.52%
    lhcb | 0.00% | 0.00% | | | | | |
    Query used:
  • SELECT account, count(*), sum(phoenix_job_table.time_end - phoenix_job_table.time_start) as walltime, sum(round(max_vsize/1024)),
    sum(substring_index(substring_index(phoenix_job_table.tres_req,',',2),'2=',-1)) as sum_tres_req_mem,
    sum(round(max_vsize/1024)) - sum(substring_index(substring_index(phoenix_job_table.tres_req,',',2),'2=',-1)) as mem_diff
    FROM slurm_acct_db.phoenix_step_table,slurm_acct_db.phoenix_job_table
    WHERE phoenix_job_table.job_db_inx = phoenix_step_table.job_db_inx
    and substring_index(substring_index(phoenix_job_table.tres_req,',',2),'2=',-1) > 2000
    and account in ('atlas', 'cms', 'lhcb')
    and phoenix_step_table.state = 3
    group by account
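  • Note on the query: Slurm stores the requested TRES in tres_req as a comma-separated id=value string, with id 2 being the requested memory in MB, which is why the nested substring_index calls isolate that field. A minimal, self-contained sketch of the extraction (the literal TRES string below is only an illustrative value, not taken from the database):
    -- illustrative tres_req value: id 1 = CPUs, id 2 = memory in MB, id 4 = nodes
    SELECT substring_index('1=8,2=2000,4=1', ',', 2)                            AS first_two_fields,  -- '1=8,2=2000'
           substring_index(substring_index('1=8,2=2000,4=1', ',', 2), '2=', -1) AS req_mem_mb;        -- '2000'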

PSI

  • Upgraded my 2 HP CentOS 7 NFSv4 NAS servers to ZoL v0.6.5.7
    • 1st is the primary NAS, featuring 24 SAS 15k 600 GB disks
    • 2nd is the secondary NAS, featuring 12 SATA 7.2k 3 TB disks (cold backup)
    • both have a dual 10 Gb/s card in LACP bonding mode
  • dCache on ZoL
    • again on the secondary NAS I created ZFS filesystems for dCache:
    • [root@t3nfs02 ~]# zfs list -d1
      NAME                    USED  AVAIL  REFER  MOUNTPOINT
      data01                 1.33T  9.15T  32.0K  /zfs/data01
      data01/dcache           100G  9.15T  32.0K  /zfs/data01/dcache
      data01/t3nfs01_data01  1.23T  9.15T  32.0K  /zfs/data01/t3nfs01_data01
      data02                 4.33T  6.15T  32.0K  /zfs/data02
      data02/dcache           100G  6.15T  32.0K  /zfs/data02/dcache
      data02/t3nfs01_data01  4.23T  6.15T  32.0K  /zfs/data02/t3nfs01_data01
  • dCache tuning
    • [root@t3se01 layouts]# grep max /etc/dcache/layouts/t3se01.conf
      srm.request.max-requests=400
      srm.request.put.max-requests=100
      srm.request.get.max-inprogress=100
      srm.request.copy.max-inprogress=100
      srm.request.max-transfers=100
  • Accounting numbers (from scheduler) from last month

UNIBE-LHEP

  • Operations

    • Tough month: several issues with full root partitions on WNs and one Lustre OSS not working well. The cloud cluster also did not perform well (not yet followed up with SWITCH)
  • ATLAS specific operations
    • ICHEP conference in August => steep rise in analysis jobs (Lustre suffers)
    • One user's jobs were particularly instrumental in killing the shared file system. Could not determine exactly what was wrong with them and had no time to follow up, so ended up banning analysis temporarily
    • Also plenty of data-intensive production workloads (mainly derivations) running concurrently (Lustre suffers more)
    • Issue with some event generation workloads (MadGraph) writing large files in /tmp. Root partitions are too small on SunBlade nodes to absorb that, even with a very aggressive cleanup cron job. Ended up having to ban evgen+simulation from the site as a temporary measure!
    • DPM head node migration to SLC6 and ATLAS storage dumps still on hold
  • HammerCloud report [1]
    • UNIBE-LHEP online 79% (last month). Reflects the instabilities mentioned above
    • UNIBE-ID 99% (this doesn't run the high I/O workloads, but it runs analysis)
    • UNIBE-LHEP_CLOUD* <71% (I believe this is poor network, to follow up on)
[1] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE

  • ATLAS resource delivery UNIBE-LHEP vs CSCS-LCG2 [2]
    • All jobs: 56% of ATLAS/CH (WallTime), 77% of ATLAS/CH (CPUtime)
    • Good jobs: 69% of ATLAS/CH (WallTime), 79% of ATLAS/CH (CPUtime)
[2] http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites%5B%5D=CSCS-LCG2&sites%5B%5D=UNIBE-LHEP&sitesCat%5B%5D=All+Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-06-01&end=2016-06-30&timerange=daily&granularity=Monthly&generic=0&sortby=0&series=All

  • Accounting numbers (from scheduler) for last month (Jun 2016) (includes ce03/CLOUD)
  • WC h: 960084 (ATLAS), 1172 (t2k.org), 1104 (uboone), 16 (ops)
    • Accounting numbers (from ATLAS dashboard) from last month (Jun 2016)
      • CPU h: 858693 (May value: 1194137)
      • WC h: 1057196 (May value: 1358408)

    • Memory accounting numbers
      account | % num jobs | % of wall | count(*) | walltime sec | sum(round(max_vsize/1024)) | sum_tres_req_mem | vmem_diff | vmem_diff %
      total:
      atlas | 100.00% | 100.00% | 483754 | 40601107936 | 1,830,348,765 | 2,866,590,264 | -1,036,241,499 | x%
      req<=2000:
      atlas | x% | x% | 309456 | 13627312862 | 585,601,123 | 579,037,953 | 6,563,170 | x%
      req>2000:
      atlas | x% | x% | 174298 | 26973795074 | 1,244,747,642 | 2,287,552,311 | -1,042,804,669 | x%
       | 0.00% | 0.00% | | | | | |
      Query used:
    • SELECT account, count(*), sum(`unibe-lhep_job_table`.time_end - `unibe-lhep_job_table`.time_start) as walltime, sum(round(max_vsize/1024)), sum(round(max_rss/1024)),
      sum(substring_index(substring_index(`unibe-lhep_job_table`.tres_req,',',2),'2=',-1)) as sum_tres_req_mem,
      sum(round(max_vsize/1024)) - sum(substring_index(substring_index(`unibe-lhep_job_table`.tres_req,',',2),'2=',-1)) as vmem_diff,
      sum(round(max_rss/1024)) - sum(substring_index(substring_index(`unibe-lhep_job_table`.tres_req,',',2),'2=',-1)) as rss_diff
      FROM `unibe-lhep_step_table`,`unibe-lhep_job_table`
      WHERE `unibe-lhep_job_table`.job_db_inx = `unibe-lhep_step_table`.job_db_inx
      and substring_index(substring_index(`unibe-lhep_job_table`.tres_req,',',2),'2=',-1) < 2001
      and account in ('atlasch001', 'atlas-sw', 'atlasplt002', 'atlasprod002', 'atlasplt003', 'atlasch008', 'atlasch002', 'atlasch009')
      and `unibe-lhep_step_table`.state = 3
      group by account;
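    • The x% placeholders in the table above can be derived from the same data (e.g. 309456 of 483754 jobs, roughly 64%, fall in the req<=2000 bucket). A possible way to get the per-bucket shares in a single query, sketched against the same tables and join condition as above (account filter left out for brevity, untested):
      -- sketch: bucket jobs by requested memory (2000 MB threshold, as in the tables above)
      -- and compute each bucket's share of the account's job count and walltime
      SELECT b.account, b.mem_bucket, b.num_jobs,
             round(100 * b.num_jobs / t.tot_jobs, 2)     AS pct_num_jobs,
             round(100 * b.walltime / t.tot_walltime, 2) AS pct_walltime
      FROM (SELECT j.account,
                   IF(substring_index(substring_index(j.tres_req,',',2),'2=',-1) <= 2000,
                      'req<=2000', 'req>2000')     AS mem_bucket,
                   count(*)                        AS num_jobs,
                   sum(j.time_end - j.time_start)  AS walltime
            FROM `unibe-lhep_step_table` s, `unibe-lhep_job_table` j
            WHERE j.job_db_inx = s.job_db_inx AND s.state = 3
            GROUP BY j.account, mem_bucket) b
      JOIN (SELECT j.account,
                   count(*)                        AS tot_jobs,
                   sum(j.time_end - j.time_start)  AS tot_walltime
            FROM `unibe-lhep_step_table` s, `unibe-lhep_job_table` j
            WHERE j.job_db_inx = s.job_db_inx AND s.state = 3
            GROUP BY j.account) t ON t.account = b.account;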

UNIBE-ID

  • Mostly smooth operation
  • Procurement:
    • 80 new servers (76*20 + 4*16 => 1584 new cores); discontinued 144 cores (oldest nodes)
      • installed and provisioned
  • Migration from OGSGE => Slurm planned for Q4
  • Problems with NAMD jobs (using ibverbs directly) => low-level IB errors from mlx4 regarding QPs
    • no errors with MPI jobs using Open MPI or the like
    • no errors with storage (GPFS over RDMA)
  • ATLAS specific: large number of random a-rex crashes within the last 2 weeks
    • reason unknown; happened 24 times between 2016-06-15 and last Monday; no crashes in the last 3 days

UNIGE

  • Operations
    • 10 machines added to the batch system (80 cores) + 3 machines replaced:
    • DELL - Intel Xeon @ 2.4 GHz - with 8 cores and 48 GB of memory
    • RAID controller: common problem for our DPM and NFS file servers (it has happened 3-4 times over the last months)
    • Increased activity from DPNC users running in the batch system (other groups, in addition to ATLAS)
    • Still not in ATLAS production; problems related to memory (hints provided by Gianfranco)
  • Data Management:
  • Accounting numbers (from scheduler) from last month

NGI_CH

  • Xxx
  • NGI-CH Open Tickets review

Other topics

  • Topic1
  • Topic2

Next meeting date:

A.O.B.

Attendants

  • CSCS:
  • CMS:
  • ATLAS: Michael Rolli (UNIBE-ID) => absent due to illness, but provided the text above
  • LHCb: Roland Bernet
  • EGI:

Action items

  • Item1
Topic attachments

  • g07.2016.06.log (1.1 K, 2016-07-07, LuisMarch) - Accounting UniGe June 2016