Swiss Grid Operations Meeting on 2016-07-07 at 14:00

Site status

CSCS

  • Some accounting numbers: memory requested (sum_tres_req_mem) vs peak virtual memory used (sum of max_vsize) per VO, in total and split by requested memory per job (req<=2000 MB vs req>2000 MB). Walltime is in seconds, memory sums in MB; mem_diff = used - requested, last column = used/requested. (An illustrative sacct sketch for pulling similar numbers follows at the end of this section.)

    | account | % num jobs | % of wall | count(*) | walltime sec | sum(round(max_vsize/1024)) | sum_tres_req_mem | mem_diff | % |

    total:
    | atlas | 100.00% | 100.00% | 288913 | 2694614271 | 1,617,843,841 | 1,389,126,500 | 228,717,341 | 116.46% |
    | cms | 100.00% | 100.00% | 50840 | 1535630187 | 230,934,497 | 356,035,968 | -125,101,471 | 64.86% |
    | lhcb | 100.00% | 100.00% | 57574 | 3211019505 | 255,594,384 | 115,148,000 | 140,446,384 | 221.97% |

    req<=2000:
    | atlas | 68.50% | 43.09% | 197903 | 1160991397 | 547,848,230 | 386,762,000 | 161,086,230 | 141.65% |
    | cms | 74.38% | 0.28% | 37816 | 4244836 | 30,376,806 | 75,632,000 | -45,255,194 | 40.16% |
    | lhcb | 100.00% | 100.00% | 57572 | 3210873808 | 255,585,171 | 115,144,000 | 140,441,171 | 221.97% |

    req>2000:
    | atlas | 31.50% | 56.91% | 91007 | 1533609961 | 1,069,984,255 | 1,002,358,500 | 67,625,755 | 106.75% |
    | cms | 25.62% | 99.72% | 13024 | 1531385351 | 200,557,691 | 280,403,968 | -79,846,277 | 71.52% |
    | lhcb | 0.00% | 0.00% | | | | | | |
  • Accounting numbers (from scheduler) from last month
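
The column names in the table above suggest a direct query of the Slurm accounting database; the actual query is not recorded in these minutes. As a rough alternative sketch only, assuming the sacct CLI is available (this is not what was used), per-account job counts, walltime and requested vs peak memory for June 2016 could be pulled along these lines (ReqMem's per-core/per-node suffix is ignored and only the .batch step's MaxVMSize is summed):

sacct -a --noheader --parsable2 \
      --starttime 2016-06-01T00:00:00 --endtime 2016-07-01T00:00:00 \
      --format=JobID,Account,ElapsedRaw,ReqMem,MaxVMSize |
awk -F'|' '
  # crude conversion of sacct memory strings (e.g. 4000Mc, 2Gn, 123456K) to MB
  function mb(s,  v) {
      v = s + 0
      if (s ~ /K/) v /= 1024
      else if (s ~ /G/) v *= 1024
      else if (s ~ /T/) v *= 1024 * 1024
      return v
  }
  $1 !~ /\./      { acct[$1] = $2; n[$2]++; wall[$2] += $3; req[$2] += mb($4) }  # job records
  $1 ~ /\.batch$/ { split($1, j, "."); used[acct[j[1]]] += mb($5) }              # batch steps
  END {
      for (a in n)
          printf "%-8s jobs=%d wall_s=%d req_mem_MB=%d peak_vsize_MB=%d\n",
                 a, n[a], wall[a], req[a], used[a]
  }'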

PSI

  • Upgraded my 2 HP CentOS7 NFSv4 NAS to ZoL v0.6.5.7
    • 1st NAS is the primary featuring 24 SAS disks 15k 600GB
    • 2nd NAS is the secondary featuring 12 SATA disks 7.2k 3000GB (cold backup)
    • both feature a dual 10Gb/s card put in LACP bonding mode
  • dCache on ZoL (an illustrative zfs sketch follows at the end of this section)
    • on the secondary NAS I'm going to make a ZFS fs for dCache and provide ~5TB to the PSI T3; it's a shame to use this HW only for backups (5y warranty)
    • again on the secondary NAS I made a ZFS fs for dCache:
[root@t3nfs02 ~]# zfs list -d1
NAME                    USED  AVAIL  REFER  MOUNTPOINT
data01                 1.33T  9.15T  32.0K  /zfs/data01
data01/dcache           100G  9.15T  32.0K  /zfs/data01/dcache
data01/t3nfs01_data01  1.23T  9.15T  32.0K  /zfs/data01/t3nfs01_data01
data02                 4.33T  6.15T  32.0K  /zfs/data02
data02/dcache           100G  6.15T  32.0K  /zfs/data02/dcache
data02/t3nfs01_data01  4.23T  6.15T  32.0K  /zfs/data02/t3nfs01_data01
  • dCache tuning: current SRM request/transfer limits on t3se01:
[root@t3se01 layouts]# grep max /etc/dcache/layouts/t3se01.conf
srm.request.max-requests=400
srm.request.put.max-requests=100
srm.request.get.max-inprogress=100
srm.request.copy.max-inprogress=100
srm.request.max-transfers=100
  • Accounting numbers (from scheduler) from last month
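
The exact commands used at PSI are not recorded in these minutes; the following is only a minimal sketch, assuming the dataset name shown by zfs list above and a quota matching the planned ~5TB for the T3, of how such a dCache dataset could be created and capped with ZFS:

# hypothetical dataset/quota, mirroring the layout above (not the commands actually run)
zfs create data02/dcache
zfs set quota=5T data02/dcache              # hand at most ~5TB of the backup pool to dCache
zfs get used,available,quota data02/dcache  # verify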

UNIBE-LHEP

  • Operations
    • Tough month: several issues with full root partitions on WNs and one Lustre OSS not working well. Also the cloud cluster didn't perform too well (didn't follow up with SWITCH yet)
  • ATLAS specific operations
    • ICHEP conference in August => steep rise in analysis jobs (Lustre suffers)
    • Also plenty of data-intensive production workloads (mainly derivations) running concurrently (Lustre suffers more)
    • One user's jobs were very instrumental in killing the shared file system. Could not discover exactly what was wrong with them and had no time to follow up, so ended up banning analysis temporarily
    • Issue with some event generation workloads (MadGraph) writing large files in /tmp. Root partitions are too small on the SunBlade nodes to absorb that, even with a very aggressive cleanup cron job (an illustrative sketch follows at the end of this section). Ended up having to ban evgen+simulation from the site as a temporary measure!
  • HammerCloud report [1]
    • UNIBE-LHEP online >79% (last month). Reflects the instabilities mentioned above
    • UNIBE-ID 99%
    • UNIBE-LHEP_CLOUD* 71%
  • ATLAS resource delivery [2]
    • All jobs: 56% of ATLAS/CH (WallTime), 77% of ATLAS/CH (CPUtime)
    • Good jobs: 69% of ATLAS/CH (WallTime), 79% of ATLAS/CH (CPUtime)
  • Accounting numbers (from scheduler) from last month

[1] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE
[2] http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites%5B%5D=CSCS-LCG2&sites%5B%5D=UNIBE-LHEP&sitesCat%5B%5D=All+Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-06-01&end=2016-06-30&timerange=daily&granularity=Monthly&generic=0&sortby=0&series=All
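
The actual cleanup job used on the SunBlade nodes is not given here; this is only an illustrative sketch, with an assumed interval and file-age threshold, of the kind of aggressive /tmp cleanup cron entry mentioned above:

# /etc/cron.d/tmp_cleanup -- hypothetical values: every 15 minutes, delete /tmp files
# not modified for more than 60 minutes (adjust -mmin and the schedule as needed)
*/15 * * * * root find /tmp -xdev -type f -mmin +60 -delete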

UNIBE-ID

  • Accounting numbers (from scheduler) from last month (Jun 2016) (includes ce03/CLOUD); an illustrative qacct sketch follows at the end of this section
    • WC h: 960084 (ATLAS) - 1172 (t2k.org) - 1104 (uboone) - 16 (ops)
  • Accounting numbers (from ATLAS dashboard) from last month (Jun 2016)
    • CPU h: 858693 (May value: 1194137)
    • WC h: 1057196 (May value: 1358408)
  • Mostly smooth operation
    • Problems with NAMD jobs (using ibverbs directly) => low-level IB errors from mlx4 regarding QPs
    • no errors with MPI jobs using Open MPI or the like
    • no errors with storage (GPFS over RDMA)
  • Procurement:
    • 80 new servers (76*20 + 4*16 => 1584 new cores); discontinued 144 cores (oldest nodes)
    • installed and provisioned
  • Migration from OGS/GE => Slurm planned for Q4
  • ATLAS specific: large number of random a-rex crashes within the last 2 weeks
    • reason unknown; happened 24x between 2016-06-15 and last Monday; no crash in the last 3 days
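
The WC hours above come from the scheduler; the exact command is not recorded here. As a rough sketch only, assuming the VOs are mapped to GE projects on the OGS/GE cluster, a per-project summary for June 2016 could be pulled from the accounting data along these lines:

# hypothetical: per-project usage summary for June 2016 (-b/-e take [[CC]YY]MMDDhhmm);
# WALLCLOCK in the summary is in seconds, so divide by 3600 to get the WC hours quoted above
qacct -P -b 201606010000 -e 201607010000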

UNIGE

  • Operations:
    • 10 machines added into the batch system (80 cores) + 3 machines replaced: DELL - Intel Xeon @ 2.4 GHz - with 8 cores and 48 GB of memory
    • RAID controller: a common problem for our DPM and NFS file servers (it happened 3-4 times during the last months)
    • Increased activity from DPNC users running in the batch system (other groups, in addition to ATLAS)
    • Still not in ATLAS production; problems related to memory (hints provided by Gianfranco)
  • Data Management:
  • Accounting numbers (from scheduler) from last month

NGI_CH

  • Xxx
  • NGI-CH Open Tickets review

Other topics

  • Topic1
  • Topic2

Next meeting date:

A.O.B.

Attendants

  • CSCS:
  • CMS:
  • ATLAS: Michael Rolli (UNIBE-ID) => absent (ill), but nevertheless provided some text above
  • LHCb:
  • EGI:

Action items

  • Item1
Topic attachments
  • Screen_Shot_2016-07-07_at_14.04.57.png (76.4 K, 2016-07-07, GianfrancoSciacca): Memory request summary UNIBE-LHEP
  • g07.2016.06.log (1.1 K, 2016-07-07, LuisMarch): Accounting UniGe June 2016