
Swiss Grid Operations Meeting on 2016-07-07 at 14:00

Site status

CSCS

  • Some accounting numbers
    account | % num jobs | % of wall | count(*) | walltime sec | sum(round(max_vsize/1024)) | sum_tres_req_mem | mem_diff | mem_diff %
    total:
    atlas | 100.00% | 100.00% | 288913 | 2694614271 | 1,617,843,841 | 1,389,126,500 | 228,717,341 | 116.46%
    cms | 100.00% | 100.00% | 50840 | 1535630187 | 230,934,497 | 356,035,968 | -125,101,471 | 64.86%
    lhcb | 100.00% | 100.00% | 57574 | 3211019505 | 255,594,384 | 115,148,000 | 140,446,384 | 221.97%
    req<=2000:
    atlas | 68.50% | 43.09% | 197903 | 1160991397 | 547,848,230 | 386,762,000 | 161,086,230 | 141.65%
    cms | 74.38% | 0.28% | 37816 | 4244836 | 30,376,806 | 75,632,000 | -45,255,194 | 40.16%
    lhcb | 100.00% | 100.00% | 57572 | 3210873808 | 255,585,171 | 115,144,000 | 140,441,171 | 221.97%
    req>2000:
    atlas | 31.50% | 56.91% | 91007 | 1533609961 | 1,069,984,255 | 1,002,358,500 | 67,625,755 | 106.75%
    cms | 25.62% | 99.72% | 13024 | 1531385351 | 200,557,691 | 280,403,968 | -79,846,277 | 71.52%
    lhcb | 0.00% | 0.00% | | | | | |
    Query used:
  • SELECT account, count(*), sum(phoenix_job_table.time_end - phoenix_job_table.time_start) as walltime, sum(round(max_vsize/1024)),
    sum(substring_index(substring_index(phoenix_job_table.tres_req,',',2),'2=',-1)) as sum_tres_req_mem,
    sum(round(max_vsize/1024)) - sum(substring_index(substring_index(phoenix_job_table.tres_req,',',2),'2=',-1)) as mem_diff
    FROM slurm_acct_db.phoenix_step_table,slurm_acct_db.phoenix_job_table
    WHERE phoenix_job_table.job_db_inx = phoenix_step_table.job_db_inx
    and substring_index(substring_index(phoenix_job_table.tres_req,',',2),'2=',-1) > 2000
    and account in ('atlas', 'cms', 'lhcb')
    and phoenix_step_table.state = 3
    group by account
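  • Note on the query: Slurm stores the requested TRES in tres_req as a comma-separated id=value string, with id 2 being the requested memory in MB, which is why the nested substring_index calls isolate that field. A minimal, self-contained sketch of the extraction (the literal TRES string below is only an illustrative value, not taken from the database):
    -- illustrative tres_req value: id 1 = CPUs, id 2 = memory in MB, id 4 = nodes
    SELECT substring_index('1=8,2=2000,4=1', ',', 2)                            AS first_two_fields,  -- '1=8,2=2000'
           substring_index(substring_index('1=8,2=2000,4=1', ',', 2), '2=', -1) AS req_mem_mb;        -- '2000'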

PSI

  • Upgraded my 2 HP CentOS 7 NFSv4 NAS servers to ZoL v0.6.5.7
    • 1st is the primary NAS, featuring 24 SAS 15k 600 GB disks
    • 2nd is the secondary NAS, featuring 12 SATA 7.2k 3 TB disks (cold backup)
    • both have a dual 10 Gb/s card in LACP bonding mode
  • dCache on ZoL
    • again on the secondary NAS I created ZFS filesystems for dCache:
    • [root@t3nfs02 ~]# zfs list -d1
      NAME                    USED  AVAIL  REFER  MOUNTPOINT
      data01                 1.33T  9.15T  32.0K  /zfs/data01
      data01/dcache           100G  9.15T  32.0K  /zfs/data01/dcache
      data01/t3nfs01_data01  1.23T  9.15T  32.0K  /zfs/data01/t3nfs01_data01
      data02                 4.33T  6.15T  32.0K  /zfs/data02
      data02/dcache           100G  6.15T  32.0K  /zfs/data02/dcache
      data02/t3nfs01_data01  4.23T  6.15T  32.0K  /zfs/data02/t3nfs01_data01
  • dCache tuning
    • [root@t3se01 layouts]# grep max /etc/dcache/layouts/t3se01.conf
      srm.request.max-requests=400
      srm.request.put.max-requests=100
      srm.request.get.max-inprogress=100
      srm.request.copy.max-inprogress=100
      srm.request.max-transfers=100
  • Accounting numbers (from scheduler) from last month

UNIBE-LHEP

  • Operations

    • Tough month: several issues with full root partitions on WNs and one Lustre OSS not working well. The cloud cluster also did not perform well (not yet followed up with SWITCH)
  • ATLAS specific operations
    • ICHEP conference in August => steep rise in analysis jobs (Lustre suffers)
    • One user's jobs were particularly instrumental in killing the shared file system. Could not determine exactly what was wrong with them and had no time to follow up, so ended up banning analysis temporarily
    • Also plenty of data-intensive production workloads (mainly derivations) running concurrently (Lustre suffers more)
    • Issue with some event generation workloads (MadGraph) writing large files in /tmp. Root partitions are too small on SunBlade nodes to absorb that, even with a very aggressive cleanup cron job. Ended up having to ban evgen+simulation from the site as a temporary measure!
    • DPM head node migration to SLC6 and ATLAS storage dumps still on hold
  • HammerCloud report [1]
    • UNIBE-LHEP online 79% (last month). Reflects the instabilities mentioned above
    • UNIBE-ID 99% (this doesn't run the high I/O workloads, but it runs analysis)
    • UNIBE-LHEP_CLOUD* <71% (I believe this is poor network, to follow up on)
[1] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE

  • ATLAS resource delivery UNIBE-LHEP vs CSCS-LCG2 [2]
    • All jobs: 56% of ATLAS/CH (WallTime), 77% of ATLAS/CH (CPUtime)
    • Good jobs: 69% of ATLAS/CH (WallTime), 79% of ATLAS/CH (CPUtime)
[2] http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites%5B%5D=CSCS-LCG2&sites%5B%5D=UNIBE-LHEP&sitesCat%5B%5D=All+Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-06-01&end=2016-06-30&timerange=daily&granularity=Monthly&generic=0&sortby=0&series=All

  • Accounting numbers (from scheduler) for last month (Jun 2016) (includes ce03/CLOUD)
  • WC h: 960084 (ATLAS), 1172 (t2k.org), 1104 (uboone), 16 (ops)
    • Accounting numbers (from ATLAS dashboard) from last month (Jun 2016)
      • CPU h: 858693 (May value: 1194137)
      • WC h: 1057196 (May value: 1358408)

    • Memory accounting numbers
      account | % num jobs | % of wall | count(*) | walltime sec | sum(round(max_vsize/1024)) | sum_tres_req_mem | vmem_diff | vmem_diff %
      total:
      atlas | 100.00% | 100.00% | 483754 | 40601107936 | 1,830,348,765 | 2,866,590,264 | -1,036,241,499 | x%
      req<=2000:
      atlas | x% | x% | 309456 | 13627312862 | 585,601,123 | 579,037,953 | 6,563,170 | x%
      req>2000:
      atlas | x% | x% | 174298 | 26973795074 | 1,244,747,642 | 2,287,552,311 | -1,042,804,669 | x%
       | 0.00% | 0.00% | | | | | |
      Query used:
    • SELECT account, count(*), sum(`unibe-lhep_job_table`.time_end - `unibe-lhep_job_table`.time_start) as walltime, sum(round(max_vsize/1024)), sum(round(max_rss/1024)),
      sum(substring_index(substring_index(`unibe-lhep_job_table`.tres_req,',',2),'2=',-1)) as sum_tres_req_mem,
      sum(round(max_vsize/1024)) - sum(substring_index(substring_index(`unibe-lhep_job_table`.tres_req,',',2),'2=',-1)) as vmem_diff,
      sum(round(max_rss/1024)) - sum(substring_index(substring_index(`unibe-lhep_job_table`.tres_req,',',2),'2=',-1)) as rss_diff
      FROM `unibe-lhep_step_table`,`unibe-lhep_job_table`
      WHERE `unibe-lhep_job_table`.job_db_inx = `unibe-lhep_step_table`.job_db_inx
      and substring_index(substring_index(`unibe-lhep_job_table`.tres_req,',',2),'2=',-1) < 2001
      and account in ('atlasch001', 'atlas-sw', 'atlasplt002', 'atlasprod002', 'atlasplt003', 'atlasch008', 'atlasch002', 'atlasch009')
      and `unibe-lhep_step_table`.state = 3
      group by account;
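    • The x% placeholders in the table above can be derived from the same data (e.g. 309456 of 483754 jobs, roughly 64%, fall in the req<=2000 bucket). A possible way to get the per-bucket shares in a single query, sketched against the same tables and join condition as above (account filter left out for brevity, untested):
      -- sketch: bucket jobs by requested memory (2000 MB threshold, as in the tables above)
      -- and compute each bucket's share of the account's job count and walltime
      SELECT b.account, b.mem_bucket, b.num_jobs,
             round(100 * b.num_jobs / t.tot_jobs, 2)     AS pct_num_jobs,
             round(100 * b.walltime / t.tot_walltime, 2) AS pct_walltime
      FROM (SELECT j.account,
                   IF(substring_index(substring_index(j.tres_req,',',2),'2=',-1) <= 2000,
                      'req<=2000', 'req>2000')     AS mem_bucket,
                   count(*)                        AS num_jobs,
                   sum(j.time_end - j.time_start)  AS walltime
            FROM `unibe-lhep_step_table` s, `unibe-lhep_job_table` j
            WHERE j.job_db_inx = s.job_db_inx AND s.state = 3
            GROUP BY j.account, mem_bucket) b
      JOIN (SELECT j.account,
                   count(*)                        AS tot_jobs,
                   sum(j.time_end - j.time_start)  AS tot_walltime
            FROM `unibe-lhep_step_table` s, `unibe-lhep_job_table` j
            WHERE j.job_db_inx = s.job_db_inx AND s.state = 3
            GROUP BY j.account) t ON t.account = b.account;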

UNIBE-ID

  • Mostly smooth operation
  • Procurement:
    • 80 new servers (76*20 + 4*16 => 1584 new cores); discontinued 144 cores (oldest nodes)
      • installed and provisioned
  • Migration from OGSGE => Slurm planned for Q4
  • Problems with NAMD jobs (using ibverbs directly) => low-level IB errors from mlx4 regarding QPs
    • no errors with MPI jobs using Open MPI or the like
    • no errors with storage (GPFS over RDMA)
  • ATLAS specific: large number of random a-rex crashes within the last 2 weeks
    • reason unknown; happened 24 times between 2016-06-15 and last Monday; no crashes in the last 3 days

UNIGE

  • Operations
    • 10 machines added to the batch system (80 cores) + 3 machines replaced:
    • DELL - Intel Xeon @ 2.4 GHz - with 8 cores and 48 GB of memory
    • RAID controller: common problem for our DPM and NFS file servers (it has happened 3-4 times over the last months)
    • Increased activity from DPNC users running in the batch system (other groups, in addition to ATLAS)
    • Still not in ATLAS production; problems related to memory (hints provided by Gianfranco)
  • Data Management:
  • Accounting numbers (from scheduler) from last month

NGI_CH

  • Xxx
  • NGI-CH Open Tickets review

Other topics

  • Topic1
  • Topic2

Next meeting date:

A.O.B.

Attendants

  • CSCS:
  • CMS:
  • ATLAS: Michael Rolli (UNIBE-ID) => absent due to illness, but provided the text above
  • LHCb: Roland Bernet
  • EGI:

Action items

  • Item1
Topic attachments

  • g07.2016.06.log (1.1 K, 2016-07-07, LuisMarch) - Accounting UniGe June 2016