Acceptance tests for Piz Daint as production resource during 2017

Acceptance tests for Piz Daint as production resource during 2017

Track record for the different production runs for running the CSCS Tier2 compute resources on CRAY.

Run 1. 2017-05-05 to 2017-05-26 (22 days)

Agreed metrics (per VO) for both Phoenix and CRAY:

Produced walltime (good & bad) per core, per type of job
Walltime of good vs failed jobs, per type of job
CPU/Wallclock efficiency for successful jobs, per type of job
Site Availability
Alternative site HepSpec value, if wanted
Any other metric you think important, for discussion

Agreed metrics (for CSCS) for both Phoenix and CRAY:

Produced walltime per core per VO
Fair share distribution

ATLAS

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	Any score	405726	0.92	0.98
Piz Daint	Any score	19445	0.71	0.96
Phoenix	Any mcore	577002	0.86	0.91
Piz Daint	Any mcore	18870	0.48	0.72
Phoenix	Analy	38422	0.61	0.74
Piz Daint	Analy	0	-	-

Site Availability: Phoenix 56.53 - Piz Daint 58.21 (measured with HammerCloud exclusion functional tests [https://tinyurl.com/ybyp6z5t])
Comments: Availability figures: Production queues only (no analy). Availability affected by long dCache related downtime. Analy queue on Piz Daint has been kept offline still.

CMS

Site Availability: X%
Comments

Job statistics for T2_CH_CSCS (Phoenix) from 2017-05-05 to 2017-05-26

jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
analysistest	0.0	0.3	0.0	1.9	0.0	0.0
hctest	4221.3	4233.0	5013.2	7213.2	69.5	99.7
production	267.1	267.1	283.1	390.4	72.5	100.0
reprocessing	1756.9	1756.9	2051.8	2051.9	100.0	100.0
hcxrootd	942.0	944.7	1480.2	1571.9	94.2	99.7
psst	0.0	0.0	0.0	0.0
analysis	105662.1	133938.9	165278.4	326166.2	50.7	78.9
unknown	0.0	0.0	0.0	0.0
ALL_JOBS	112849.4	141140.9	174106.7	337395.5	51.6	80.0

LHCb

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	all jobs	901'404 hours	0.95	0.97
Phoenix	Simulation	725'502 hours	0.96	0.99
Phoenix	Reconstruction	132'014 hours	0.97	0.86
Phoenix	User	42'069 hours	0.80	0.88
Piz Daint	all jobs	324'996 hours	0.90	0.99
Piz Daint	Simulation	306'638 hours	0.95	0.99
Piz Daint	Reconstruction	2 hours	---	---
Piz Daint	User	18'557 hours	0.09	0.99

Site Availability: Phoenix: 55%, Piz Daint: 88%
Site Reliability: Phoenix: 73%, Piz Daint: 100%
Site HS06 delivered: Phoenix: 556 kHS06-days, Piz Daint: 231 kHS06-days -- measured by LHCb (in job)
Site HS06 delivered: Phoenix: 420 kHS06-days, Piz Daint: 159 kHS06-days -- measured by CSCS
The storage is attached to Phoenix, therefore a storage outage affects only the numbers of Phoenix not Piz Daint.
Site Availability/Reliability are measured with SAM jobs.
Piz Daint user jobs: Of the 90% failed jobs, 10% got killed, 80% stalled (most likely pilot got killed).

CSCS

Cluster	VO	Produced walltime core-hours	Share %
Phoenix	ATLAS	1010071	36.4%
Phoenix	CMS	862819	31.1%
Phoenix	LHCb	901593	32.5%
Phoenix	TOTAL	2774483
Piz Daint	ATLAS	63535	17.8%
Piz Daint	CMS	0	0%
Piz Daint	LHCb	294200	82.2%
Piz Daint	TOTAL	357735

Site Availability: X%
Capacity Phoenix 3125760 core-hours (5920*24*22), used at 88%
Capacity Piz Daint 844800 core-hours (25*64*24*22), used at 42%
Published HS06 value: 11.19 HS06/core on Phoenix, 12.96 HS06/core on CRAY (830/64, needs to be re-calculated)
1-week downtime due to problems with dcache

Minutes / action items

Attendants - CSCS: Pablo, Stefano, Miguel, Dino - CHIPP: Christoph, Gianfranco, Thomas, Nina, Derek, Roland
Pablo asks whether we would like him to continue organize the meetings - Yes
Christoph - in resource rev board there was discussion on how to measure - one should not only look at the total - should also look at the /event throughput/. - we need to identify how the Experiments measure the events
Pablo asks in regard to the suitability of the metrics - GF says that the current dashboard is not able to yield all the needed information (I am unsure, but I guess it was in regard to the job types)
GF for ATLAS - notices that there may not have been enough statistics due to some downtime - analysis of errors - thinks that many job failures due to dcache problems but was not yet able to delve deeper - Pablo notes about difference in walltime between what they measure locally and what is showed by dashboard. For LHCb it is almost exact - GF says that analysis jobs were not included in his numbers and if these were added would come close.
Derek for CMS - I apologize for the fact that we noticed too late that there is a problem with the job submission to the Piz Daint system. CMS still is sending jobs to arcbrisi. But the acitive ARC CE is arc04.lcg.cscs.ch. - We will resolve this problem by contacting CMS operations - We will concentrate on the measurement
Roland for LHCb- Discussion on the high user job failure rate - Roland thinks it may have to do with the pilot job being killed, since the jobs show up as "stalled" which often is a sign of that failure mode.
CSCS on the local metrics - GF how does the TOTAL relate to the total theoretical wallclock hours that were available on the system? - Pablo tries to explain the usage distribution on Phoenix. LHCb seemingly not so hit by storage downtime. - 1-week downtime due to problems with dcache - Pablo asks all the experiments to investigate failures on the HPC system so as to identify issues in this period
Action items 1. CMS - [ ] CMS will provide the numbers for Phoenix after the meeting, though we cannot provide the numbers for Piz Daint. - [ ] CMS needs to get jobs routed via arc04.lcg.cscs.ch and no longer via arcbrisi. - [ ] CMS will try to prepare a page where we share the links for the monitoring plots and how we derive the numbers. 2. We will repeat the exercise for three weeks - Measurement period should start as soon as CMS is online - We experiments should try to share the numbers during the period. 3. Pablo will convene another meeting in approx. 1 month of time 4. All VOs need to add the event throughput data to the tables in the already finished and subsequent runs

Run 2. 2017-05-27 to 2017-06-25 (both inclusive, 30 days)

Agreed metrics as in Run 1.

ATLAS

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	Any score	529111.1	0.80	0.96
Piz Daint	Any score	150984.0	0.76	0.92
Phoenix	Any mcore	1491964.9	0.90	0.87
Piz Daint	Any mcore	329364.5	0.82	0.79
Phoenix	Analy	47633.3	0.82	0.77
Piz Daint	Analy	32604.1	0.17	0.49
Phoenix	Total	2068709.3
Piz Daint	Total	512952.6

Site Availability (measured with HC Functional Tests for auto-exclusion)
PROD: Phoenix 96.86 - Piz Daint 96.93
ANALY:Phoenix 97.49 - Pix Daint 68.29
Comments: Piz Daint downtimes not recorded (HC cannot blacklist if no jobs run)

CMS

Site Availability: X%
Comments

Job statistics for T2_CH_CSCS from 2017-05-27 to 2017-06-25

jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
analysistest	45.3	68.2	74.4	109.1	68.2	66.4
hctest	4709.4	4711.1	5003.3	5649.4	88.6	100.0
production	5777.8	5823.2	6213.2	6329.6	98.2	99.2
reprocessing	69881.8	69886.3	79344.2	80329.7	98.8	100.0
hcxrootd	1060.3	1062.2	1486.7	1691.5	87.9	99.8
psst	0.0	0.0	0.0	0.0
analysis	137818.8	173808.5	230806.3	449626.3	51.3	79.3
unknown	0.0	0.0	0.0	0.0
ALL_JOBS	219293.4	255359.5	322928.1	543735.6	59.4	85.9

LHCb

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	all jobs	960'261 hours	0.96	0.97
Phoenix	Simulation	812'851 hours	0.99	0.99
Phoenix	Reconstruction	66'408 hours	0.98	0.90
Phoenix	User	79'697 hours	0.72	0.87
Piz Daint	all jobs	189'854 hours	0.85	0.99
Piz Daint	Simulation	186'929 hours	0.86	0.99
Piz Daint	Reconstruction	0 hours	---	---
Piz Daint	User	2'959 hours	0.85	0.99

Site Availability: Phoenix: 97%, Piz Daint: 89%
Site Reliability: Phoenix: 98%, Piz Daint: 100%
Site HS06 delivered: Phoenix: 585 kHS06-days, Piz Daint: 150 kHS06-days -- measured by LHCb (in job)
Site HS06 delivered: Phoenix: 439 kHS06-days, Piz Daint: 127 kHS06-days -- measured by CSCS
The storage is attached to Phoenix, therefore a storage outage affects only the numbers of Phoenix not Piz Daint.
Site Availability/Reliability are measured with SAM jobs.
Piz Daint simulation jobs: Of the 15% failed jobs, 50% stalled, 50% had an application error.
Piz Daint user jobs: Of the 15% failed jobs, 0% got killed, 100% stalled (most likely pilot got killed).

CSCS

Cluster	VO	Produced walltime core-hours	Share %
Phoenix	ATLAS	1,978,540	53%
Phoenix	CMS	832,580	22%
Phoenix	LHCb	941,094	25%
Phoenix	TOTAL	3,752,214
Piz Daint	ATLAS	499,775	68%
Piz Daint	CMS	664	0%
Piz Daint	LHCb	235,594	32%
Piz Daint	TOTAL	736,033

Site Availability: X%
Capacity Phoenix 4262400 core-hours (5920*24*30), used at 88%
Capacity Piz Daint 1152000 core-hours (25*64*24*30), used at 64%
Published HS06 value: 11.19 HS06/core on Phoenix, 12.96 HS06/core on CRAY (830/64, needs to be re-calculated)

Run 3. 2017-06-26 to 2017-07-19 (both inclusive, 23 days)

Agreed metrics as in Run 1.

NOTE: Canceled due to serious instability of the system.

ATLAS

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	Any score	255675.6	0.99	0.92
Piz Daint	Any score	212296.1	0.65	0.90
Phoenix	Any mcore	1191852.1	0.94	0.65
Piz Daint	Any mcore	113858.0	0.56	0.67
Phoenix	Analy	56873.0	0.87	0.77
Piz Daint	Analy	19799.1	0.68	0.55
Phoenix	Total	1504400.8
Piz Daint	Total	345953.1

Site Availability (measured with HC Functional Tests for auto-exclusion)
PROD: Phoenix 100 - Piz Daint 97.66
ANALY:Phoenix 100 - Pix Daint 89.6
Comments: too many instabilities

CMS

Job statistics for T2_CH_CSCS from 2017-06-26 to 2017-07-19
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
analysistest	40.0	52.3	123.0	412.4	29.8	76.5
hctest	5446.8	5449.3	5934.5	5984.3	99.2	100.0
production	9683.2	25484.3	53574.9	90813.2	59.0	38.0
reprocessing	0.0	0.0	1.6	1.6	100.0
hcxrootd	1046.5	1048.1	1569.2	1615.6	97.1	99.8
analysis	206448.3	273650.6	357210.2	645075.6	55.4	75.4
unknown	0.0	0.0	0.0	0.0
ALL_JOBS	222664.8	305684.6	418413.4	743902.7	56.2	72.8

LHCb

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	all jobs	799'132 hours	0.98	0.97
Phoenix	Simulation	708'050 hours	0.99	0.98
Phoenix	Reconstruction	40'495 hours	0.96	0.79
Phoenix	User	46'845 hours	0.85	0.96
Piz Daint	all jobs	230'709 hours	0.80	0.99
Piz Daint	Simulation	229'404 hours	0.80	0.99
Piz Daint	Reconstruction	0 hours	---	---
Piz Daint	User	1'305 hours	0.84	0.98

Site Availability: Phoenix: 97%, Piz Daint: 97%
Site Reliability: Phoenix: 100%, Piz Daint: 100%
Site State Unknown: Phoenix: 5%, Piz Daint: 59%
Site HS06 delivered: Phoenix: 538 kHS06-days, Piz Daint: 170 kHS06-days -- measured by LHCb (in job)
Site HS06 delivered: Phoenix: ??? kHS06-days, Piz Daint: ??? kHS06-days -- measured by CSCS
The storage is attached to Phoenix, therefore a storage outage affects only the numbers of Phoenix not Piz Daint.
Site Availability/Reliability are measured with SAM jobs.
Piz Daint simulation jobs: Of the 18% failed jobs, 79% stalled, 5% had an application error.
Piz Daint user jobs: Of the 19% failed jobs, 16% got killed, 75% stalled (most likely pilot got killed), 9% had an application error..

CSCS

Cluster	VO	Produced walltime core-hours	Share %
Phoenix	ATLAS	1461016	46%
Phoenix	CMS	935585	29%
Phoenix	LHCb	794847	25%
Phoenix	TOTAL	3191448
Piz Daint	ATLAS	446023	60%
Piz Daint	CMS	648	0%
Piz Daint	LHCb	296859	40%
Piz Daint	TOTAL	743530

Site Availability: X%
Capacity Phoenix 3267840 core-hours (5920*24*23), used at 97%
Capacity Piz Daint 883200 core-hours (25*64*24*23), used at 84%
Published HS06 value: 11.19 HS06/core on Phoenix, 12.96 HS06/core on CRAY (830/64, needs to be re-calculated)

Run 4. 2017-07-25 to 2017-08-02 (both inclusive, 9 days)

Agreed metrics as in Run 1.

ATLAS

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	Any score	4'817.7	0.98	0.84
Piz Daint	Any score	909.8	0.87	0.50
Phoenix	Any mcore	42'4519.3	0.96	0.88
Piz Daint	Any mcore	47'825.9	0.91	0.83
Phoenix	Analy	46'789.2	0.70	0.78
Piz Daint	Analy	31'322.6	0.75	0.71
Phoenix	Total	476'126.3
Piz Daint	Total	80'058.2

Site Availability (measured with HC Functional Tests for auto-exclusion)
PROD: Phoenix 100 - Piz Daint 100
ANALY:Phoenix 100 - Pix Daint 99.74
Comments: Same efficiency trend as recorded in the previous runs

CMS

Comments by Derek
- Tables generated by automatic scripts
- After a long toil and hunting for responsible contacts within CMS we are getting results! Needed Thomas, Miguel and me in an effort that was much too big. But hopefully we are now reestablishing the contacts and knowledge that was lost in the changes beginning of this year.
- The successful running of CMS production jobs is still very fresh, so I only put in the results for the short period since Sep 21st. But the outlook is good.

Job statistics for T2_CH_CSCS from 2017-07-25 to 2017-08-03
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
analysistest	17.8	19.6	22.9	25.3	90.5	90.8
hctest	2719.8	2719.8	2951.1	2991.7	98.6	100.0
production	11957.9	11957.9	14411.8	14445.4	99.8	100.0
hcxrootd	320.2	321.1	485.7	506.8	95.8	99.7
analysis	122246.0	143367.4	153624.5	208432.8	73.7	85.3
unknown	0.0	0.0	0.0	0.0
ALL_JOBS	137261.7	158385.8	171496.0	226402.0	75.7	86.7

Job statistics for T2_CH_CSCS_HPC from 2017-08-01 to 2017-08-03
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
production	1058.5	1058.5	1304.3	1304.3	100.0	100.0
ALL_JOBS	1058.5	1058.5	1304.3	1304.3	100.0	100.0

LHCb

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	all jobs	357'804 hours	0.98	0.97
Phoenix	Simulation	322'884 hours	0.99	0.98
Phoenix	Reconstruction	12'127 hours	0.99	0.75
Phoenix	User	22'800 hours	0.86	0.97
Piz Daint	all jobs	188'174 hours	0.92	0.99
Piz Daint	Simulation	182'817 hours	0.93	0.99
Piz Daint	Reconstruction	0 hours	---	---
Piz Daint	User	5'356 hours	0.85	0.99

Site Availability: Phoenix: 99%, Piz Daint: 100%
Site Reliability: Phoenix: 100%, Piz Daint: 100%
Site State Unknown: Phoenix: 0%, Piz Daint: 10%
Site HS06 delivered: Phoenix: 231 kHS06-days, Piz Daint: 123 kHS06-days -- measured by LHCb (in job)
Site HS06 delivered: Phoenix: 146 kHS06-days, Piz Daint: 72 kHS06-days -- measured by CSCS
The storage is attached to Phoenix, therefore a storage outage affects only the numbers of Phoenix not Piz Daint.
Site Availability/Reliability are measured with SAM jobs.
Piz Daint simulation jobs: Of the 9% failed jobs, 36% stalled, 7% had an application error.
Piz Daint user jobs: Of the 23% failed jobs, 12% got killed, 2% stalled (most likely pilot got killed), 85% had an application error..

CSCS

Cluster	VO	Produced walltime core-hours	Share %
Phoenix	ATLAS	467'573	38.89%
Phoenix	CMS	419'353	34.88%
Phoenix	LHCb	315'243	26.22%
Phoenix	TOTAL	1'202'169
Piz Daint	ATLAS	85'421	36.23%
Piz Daint	CMS	16'930	7.18%
Piz Daint	LHCb	133'392	56.58%
Piz Daint	TOTAL	235'743

Capacity Phoenix 1'278'720 core-hours (5920*24*9)
Capacity Piz Daint 345'600 core-hours (25*64*24*9)
Published HS06 value: 11.19 HS06/core on Phoenix, 12.96 HS06/core on CRAY (830/64, needs to be re-calculated)
LHConCRAY-Run4_CSCS.pdf: Run4 from a system perspective

Run 5. 2017-08-03 to 2017-08-31 (both inclusive, 28 days)

Agreed metrics as in Run 1.

ATLAS

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	Any mcore	1'070'072	0.67 (0.94)	0.59 (0.65)
Phoenix	Total	1’494’955
Piz Daint	Analy	16'725	0.63 (0.68)	0.50 (0.55)
Piz Daint	Any mcore	64'376	0.70 (0.56)	0.56 (0.67)
Phoenix	Analy	164'167	0.59 (0.87)	0.56 (0.77)
Phoenix	Any score	260'715	0.95 (0.99)	0.93 (0.92)
Piz Daint	Any score	309'055	0.91 (0.65)	0.94 (0.90)
Piz Daint	Total	390’158

Efficiencies in () are from Run 3 (23 days)
Site Availability (measured with HC Functional Tests for auto-exclusion)
PROD: Phoenix 96.4% (100% Run 3) - Piz Daint 96.7% (100%)
ANALY:Phoenix 41% (100% Run 3) - Pix Daint 41% (99.74% Run 3)
Comments: gap in the efficiencies bridging up, but Phoenix efficiences dropped vs. Run

CMS

Job statistics for T2_CH_CSCS from 2017-08-03 to 2017-08-31
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
analysistest	0.0	13.1	0.0	197.1	0.0	0.0
hctest	6371.5	6375.6	7110.5	7447.6	95.5	99.9
production	11892.9	11927.9	254941.4	265243.8	96.1	99.7
hcxrootd	1500.7	1503.9	2630.0	2710.9	97.0	99.8
psst	0.0	0.0	0.0	0.0
analysis	196284.4	212641.1	289323.5	559109.0	51.7	92.3
unknown	0.0	0.0	0.0	0.0
ALL_JOBS	216049.5	232461.6	554005.4	834708.4	66.4	92.9

Job statistics for T2_CH_CSCS_HPC from 2017-08-03 to 2017-08-31
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
hctest	3716.9	3756.2	4816.0	4990.0	96.5	99.0
production	1096.9	1137.0	1570.5	2157.6	72.8	96.5
hcxrootd	47.2	47.2	109.3	111.3	98.2	100.0
psst	0.0	0.0	0.0	0.0
analysis	5647.9	7127.6	15577.5	34449.7	45.2	79.2
ALL_JOBS	10508.9	12068.0	22073.3	41708.6	52.9	87.1

LHCb

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	all jobs	564'720 hours	0.92	0.96
Phoenix	Simulation	498'984 hours	0.91	0.98
Phoenix	Reconstruction	39'912 hours	0.97	0.67
Phoenix	User	25'824 hours	0.91	0.96
Piz Daint	all jobs	315'096 hours	0.91	0.99
Piz Daint	Simulation	311'496 hours	0.91	0.99
Piz Daint	Reconstruction	0 hours	---	---
Piz Daint	User	1'200 hours	0.94	0.96

Site Availability: Phoenix: 93%, Piz Daint: 47%
Site Reliability: Phoenix: 94%, Piz Daint: 50%
Site State Unknown: Phoenix: 19%, Piz Daint: 73%
Site HS06 delivered: Phoenix: 372 kHS06-days, Piz Daint: 220 kHS06-days -- measured by LHCb (in job)
Site HS06 delivered: Phoenix: 295 kHS06-days, Piz Daint: 192 kHS06-days -- measured by CSCS
The storage is attached to Phoenix, therefore a storage outage affects only the numbers of Phoenix not Piz Daint.
Site Availability/Reliability are measured with SAM jobs.
Piz Daint simulation jobs: Of the 9% failed jobs, 24% stalled, 74% had an application error.
Piz Daint user jobs: Of the 6% failed jobs, 11% got killed, 3% stalled (most likely pilot got killed), 80% had an application error..

CSCS

Configuration for reference:
- 64 GB of SWAP using DWS on 4 LHC DWS nodes
- 64 core/node allocatable by jobs
- Memory limits set to 6000MB/core, but memory is not a consumable resource, only cores.
- CVMFS tiered cache: upper layer 6GB in RAM, lower layer preloaded on GPFS

Cluster	VO	Produced walltime core-hours	Share %
Phoenix	ATLAS	1322165	41%
Phoenix	CMS	1240106	39%
Phoenix	LHCb	631976	20%
Phoenix	TOTAL	3'194'247
Piz Daint	ATLAS	408'706	45%
Piz Daint	CMS	152'226	17%
Piz Daint	LHCb	355'457	39%
Piz Daint	TOTAL	916'389

Capacity Phoenix 3'978'240 core-hours (5920*24*28), utilized at 80%
Capacity Piz Daint 1'075'200 core-hours (25*64*24*28), utilized at 85%
Published HS06 value: 11.19 HS06/core on Phoenix, 12.96 HS06/core on CRAY (830/64, needs to be re-calculated)

LHConCRAY-Run5_CSCS.pdf: Run5 from a system perspective

Other remarks/notes

ATLAS and CMS report high failure rates in both Phoenix and CRAY
There is a big mismatch (4-5x) between produced CMS wallclock in the portal and that reported by CSCS, which requires an investigation. CMS will contact CSCS if precise job accounting data is needed.
There are reported big loads in the DVS in Daint, which are going to be addressed (or at least try to) by using DataWarp for Scratch
Fairshare effectiveness in Piz Daint needs to be looked at, since LHCb should have half the produced hours than ATLAS (both VOs seemed to be fully productive during this Run)

Run 6. 2017-09-01 to 2017-10-01 (both inclusive, 31 days)

Agreed metrics as in Run 1.

ATLAS

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	Any score	590900	0.72 (0.95)	0.92 (0.93)
Piz Daint	Any score	362439	0.80 (0.91)	0.80 (0.94)
Phoenix	Any mcore	1133939	0.70 (0.67)	0.71 (0.59)
Piz Daint	Any mcore	61799	0.77 (0.70)	0.64 (0.56)
Phoenix	Analy	82868	0.70 (0.59)	0.65 (0.56)
Piz Daint	Analy	22488	0.76 (0.63)	0.48 (0.50)
Phoenix	Total	1807707	0.72	0.79
Piz Daint	Total	446726	0.78	0.83

Site Availability (measured with HC Functional Tests for auto-exclusion)
PROD: Phoenix 90.27 - Piz Daint % 87.97
ANALY:Phoenix 90.35 - Pix Daint % 87.74
Site Availability and Reliability (measured with the SAM profile ATLAS_CRITICAL)
Availability CSCS-LCG2: 82.86
Reliability CSCS-LCG2: 91.82%
Comments: GPFS instabilities caused more failures than usual

CMS

Job statistics for T2_CH_CSCS from 2017-09-01 to 2017-10-02
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
analysistest	57.1	57.9	68.7	75.9	90.5	98.6
hctest	5617.0	5655.1	6274.7	6877.5	91.2	99.3
production	0.1	0.1	0.1	17.8	0.6	100.0
reprocessing	462.9	462.9	796.4	799.9	99.6	100.0
hcxrootd	729.0	731.7	1144.1	1234.7	92.7	99.6
analysis	143147.3	148428.5	242445.1	401364.0	60.4	96.4
unknown	0.0	0.0	0.0	0.0
ALL_JOBS	150013.4	155336.2	250729.1	410369.8	61.1	96.6

Job statistics for T2_CH_CSCS_HPC from 2017-09-01 to 2017-10-02
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
hctest	4848.4	4871.0	6093.8	6954.6	87.6	99.5
production	629.5	3453.5	1431.5	5837.7	24.5	18.2
reprocessing	2967.0	3069.8	3845.9	4754.4	80.9	96.7
hcxrootd	166.2	167.2	393.1	420.7	93.4	99.4
analysis	25037.5	27624.1	54453.5	115497.7	47.1	90.6
unknown	0.0	0.0	0.0	0.0
ALL_JOBS	33648.6	39185.6	66217.8	133465.1	49.6	85.9

LHCb

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	all jobs	318'081 hours	0.77	0.89
Phoenix	Simulation	210'420 hours	0.71	0.97
Phoenix	Reconstruction	90'933 hours	0.91	0.72
Phoenix	WG Production	376 hours	0.99	0.95
Phoenix	User	16'377 hours	0.72	0.95
Piz Daint	all jobs	235'449 hours	0.93	0.98
Piz Daint	Simulation	213'141 hours	0.93	0.98
Piz Daint	Reconstruction	17'263 hours	0.95	0.96
Piz Daint	WG Production	132 hours	0.98	0.69
Piz Daint	User	4'912 hours	0.79	0.99

Site Availability: Phoenix: 86%, Piz Daint: 80%
Site Reliability: Phoenix: 91%, Piz Daint: 89%
Site State Unknown: Phoenix: 5%, Piz Daint: 38%
Site HS06 delivered: Phoenix: 199 kHS06-days, 174 Piz Daint: kHS06-days -- measured by LHCb (in job)
Site HS06 delivered: Phoenix: 156 kHS06-days, 131 Piz Daint: kHS06-days -- measured by CSCS
The storage is attached to Phoenix, but Piz Daint is not configured as 'helper', therefore has also some access to the storage on Phoenix from the LHCb point of view.
Site Availability/Reliability are measured with SAM jobs.
Piz Daint simulation jobs: Of the 7% failed jobs, 55% stalled, 27% got killed, 16% had an application error.
Piz Daint reconstruction jobs: Of the 3% failed jobs, 62% stalled, 26% had input resolution error.
Piz Daint user jobs: Of the 24% failed jobs, 54% got killed, 37% stalled (most likely pilot got killed), 8% had an application error..

CSCS

Configuration for reference:
- 64 GB of SWAP using DWS on 4 LHC DWS nodes
- 64 core/node allocatable by jobs
- Memory limits set to 6000MB/core, but memory is not a consumable resource, only cores.
- CVMFS tiered cache: upper layer 6GB in RAM, lower layer preloaded on GPFS

Cluster	VO	Produced walltime core-hours	Share %
Phoenix	ATLAS	1758366	57%
Phoenix	CMS	981433	32%
Phoenix	LHCb	338541	11%
Phoenix	TOTAL	3078340
Piz Daint	ATLAS	506462	51%
Piz Daint	CMS	238881	24%
Piz Daint	LHCb	241716	25%
Piz Daint	TOTAL	987059

Capacity Phoenix 4404480 core-hours (5920*24*31), utilized at 70%
Capacity Piz Daint 1264800 core-hours (25*68*24*31), utilized at 78%
Published HS06 value: 11.19 HS06/core on Phoenix, 12.96 HS06/core on CRAY (830/64, needs to be re-calculated)

Other remarks/notes

item

Run 7. 2017-10-10 to 2017-11-05 (both inclusive, 26 days). LAST RUN.

Agreed metrics as in Run 1.

ATLAS

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	Any score	673966	0.89	0.83
Piz Daint	Any score	196668	0.92	0.77
Phoenix	Any mcore	714590	0.84	0.61
Piz Daint	Any mcore	56275	0.74	0.57
Phoenix	Analy	32027	0.68	0.59
Piz Daint	Analy	13665	0.74	0.39
Phoenix	Total	1420583	0.86	0.73
Piz Daint	Total	266609	0.89	0.75

* Site Availability (measured with HC Functional Tests for auto-exclusion)
PROD: Phoenix 98% - Piz Daint 98%
ANALY:Phoenix 98% - Pix Daint 98%

* Site Availability (measured with HC Functional Tests for auto-exclusion)
PROD: Phoenix % - Piz Daint %
ANALY:Phoenix % - Pix Daint % * Comments:CPU/WC efficiency quite poor up to 19 Oct, then it recovered * Comments: * Comments: * *

CMS

Job statistics for T2_CH_CSCS from 2017-10-10 to 2017-11-05
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
hctest	7057.4	7059.4	8038.8	8167.4	98.4	100.0
production	409915.8	431406.9	581209.1	720282.5	80.7	95.0
reprocessing	170.9	170.9	366.8	366.8	100.0	100.0
hcxrootd	532.3	533.4	798.4	2035.6	39.2	99.8
analysis	149900.7	154513.4	205241.6	251789.2	81.5	97.0
unknown	0.0	0.0	0.0	0.0
ALL_JOBS	567577.1	593684.0	795654.7	982641.5	81.0	95.6

Job statistics for T2_CH_CSCS_HPC from 2017-10-10 to 2017-11-05
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
hctest	5510.7	5549.9	6329.1	7171.9	88.2	99.3
production	96071.1	100042.1	139929.2	154781.0	90.4	96.0
reprocessing	156.7	156.7	1003.8	1003.9	100.0	100.0
hcxrootd	193.1	193.6	359.3	1411.5	25.5	99.7
analysis	72703.1	79637.6	91096.4	117774.4	77.3	91.3
unknown	0.0	0.0	0.0	0.0
ALL_JOBS	174634.7	185579.9	238717.8	282142.7	84.6	94.1

weekly CMS efficiency

Job statistics for T2_CH_CSCS_HPC from 2017-10-10 to 2017-10-17
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
hctest	1694.3	1702.3	1978.5	2005.5	98.7	99.5
production	13774.7	13968.2	17279.2	19689.9	87.8	98.6
hcxrootd	35.2	35.3	66.1	74.7	88.5	99.7
analysis	11419.6	11531.6	12782.7	15935.5	80.2	99.0
ALL_JOBS	26923.8	27237.4	32106.5	37705.6	85.2	98.8

Job statistics for T2_CH_CSCS_HPC from 2017-10-17 to 2017-10-24
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
hctest	2052.3	2077.7	2329.0	2956.6	78.8	98.8
production	42200.4	43137.0	72621.6	77336.4	93.9	97.8
hcxrootd	77.1	77.5	115.0	1056.2	10.9	99.5
analysis	21026.6	25143.5	25377.1	38515.5	65.9	83.6
unknown	0.0	0.0	0.0	0.0
ALL_JOBS	65356.4	70435.7	100442.7	119864.7	83.8	92.8

Job statistics for T2_CH_CSCS_HPC from 2017-10-24 to 2017-10-31
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
hctest	1465.6	1474.1	1750.6	1994.0	87.8	99.4
production	32458.7	34446.1	46902.6	52890.0	88.7	94.2
reprocessing	145.8	145.8	983.0	983.0	100.0	100.0
hcxrootd	51.6	51.6	134.6	331.5	40.6	100.0
analysis	31451.5	34825.2	42114.4	51559.4	81.7	90.3
unknown	0.0	0.0	0.0	0.0
ALL_JOBS	65573.2	70942.8	91885.2	107757.9	85.3	92.4

Job statistics for T2_CH_CSCS_HPC from 2017-10-31 to 2017-11-05
jobtype	good_cpu_h	all_cpu_h	good_wallt_h	all_wallt_h	good_wallt_%	good_cpu_eff_%
hctest	1181.1	1183.9	1372.6	1391.3	98.7	99.8
production	21490.6	22789.4	25158.2	29444.8	85.4	94.3
reprocessing	156.7	156.7	1003.8	1003.9	100.0	100.0
hcxrootd	46.3	46.3	89.1	94.0	94.8	100.0
analysis	18164.4	18582.8	23575.4	27271.8	86.4	97.7
unknown	0.0	0.0	0.0	0.0
ALL_JOBS	41039.1	42759.1	51199.1	59205.8	86.5	96.0

LHCb

Cluster	Job type	Produced walltime core-hours	Good vs Bad walltime %	CPU efficiency good jobs %
Phoenix	all jobs	68'378 hours	0.81	0.96
Phoenix	Simulation	53'436 hours	0.85	0.97
Phoenix	Reconstruction	---	---	---
Phoenix	WG Production	199 hours	0.49	0.85
Phoenix	User	14'733 hours	0.66	0.92
Piz Daint	all jobs	191'940 hours	0.92	0.85
Piz Daint	Simulation	119'757 hours	0.96	0.97
Piz Daint	Reconstruction	70'593 hours	0.85	0.62
Piz Daint	WG Production	773 hours	0.90	0.52
Piz Daint	User	799 hours	0.99	0.98

Site Availability: Phoenix: 100%, Piz Daint: 100%
Site Reliability: Phoenix: 100%, Piz Daint: 100%
Site State Unknown: Phoenix: 10%, Piz Daint: 74%

Site HS06 delivered: Phoenix: 40 kHS06-days, Piz Daint: 129 kHS06-days -- measured by LHCb (in job)
Site HS06 delivered: Phoenix: 37 kHS06-days, Piz Daint: 144 kHS06-days -- measured by CSCS

The storage is attached to Phoenix, but Piz Daint is not configured as 'helper', therefore has also access to the storage on Phoenix from the LHCb point of view.
Site Availability/Reliability are measured with SAM jobs.
Piz Daint simulation jobs: Of the 7% failed jobs, 17% stalled, 82% had an application error.
Piz Daint reconstruction jobs: Of the 11% failed jobs, 71% stalled, 4% had an application error, 1% had input resolution error.
Piz Daint user jobs: Of the 8% failed jobs, 28% stalled (most likely pilot got killed), 71% had an application error.

CSCS

Configuration for reference:
- 64 GB of SWAP using DWS on 4 LHC DWS nodes
- 64 core/node allocatable by jobs
- Memory limits set to 6000MB/core, but memory is not a consumable resource, only cores.
- CVMFS tiered cache: upper layer 6GB in RAM, lower layer preloaded on GPFS

Cluster	VO	Produced walltime core-hours	Share %
Phoenix	ATLAS	1720288	56%
Phoenix	CMS	1296112	42%
Phoenix	LHCb	78467	3%
Phoenix	TOTAL	3094867
Piz Daint	ATLAS	334704	35%
Piz Daint	CMS	354490	37%
Piz Daint	LHCb	267907	28%
Piz Daint	TOTAL	957101

Capacity Phoenix 3,694,080 core-hours (5920*24*26), utilized at 83,8%
Capacity Piz Daint 998,400 core-hours (25*64*24*26), utilized at 95,9%
Published HS06 value: 11.19 HS06/core on Phoenix, 12.96 HS06/core on CRAY (830/64, needs to be re-calculated)

Other remarks/notes

item

ATLAS-scale-up-test-Nov2017.pdf: ATLAS scale up test

Attachments

Topic attachments
I	Attachment	History	Action	Size	Date	Who	Comment
pdf	ATLAS-scale-up-test-Nov2017.pdf	r1	manage	1069.7 K	2017-11-07 - 13:00	GianfrancoSciacca	ATLAS scale up test
png	CPU-eff-vs-run.png	r1	manage	112.6 K	2017-09-01 - 13:33	GianfrancoSciacca	CPU-eff vs run up to 5
pdf	LHConCRAY-Run4_CSCS.pdf	r1	manage	2278.0 K	2017-08-03 - 11:45	MiguelGila	Run4 from a system perspective
pdf	LHConCRAY-Run5_CSCS.pdf	r1	manage	1108.6 K	2017-09-01 - 11:55	MiguelGila	Run5 from a system perspective
pdf	LHConCRAY-Run6_CSCS.pdf	r1	manage	1300.9 K	2017-10-02 - 10:58	MiguelGila	Run6 from CSCS perspective
png	ok-fail-vs-run.png	r1	manage	136.3 K	2017-09-01 - 13:34	GianfrancoSciacca	ok-fail vs run up to 5

This topic: LCGTier2 > WebHome > PhoenixClusterBlog > BlogAcceptanceTests2017
Topic revision: r72 - 2017-11-07 - RolandBernet