CHIPP-CSCS Face to Face Meeting on 2019-09-13

Date and time: Friday 13th of September at 10:15
Place: Zurich ETHZ (LEE E 126)
External link / EVO: probably not possible

Agenda

10:15 - Welcome and agenda
10:30 - VO Status report (~last 6 months)
- LHCb (20' - Roland)
- ATLAS (20' - Gianfranco)
- CMS (20' - Vinzenz)
11:30 - Tier-2 status, plans & pledges
- CSCS (45' - Various people)
- Long-term resource provisioning overview (15' - Pablo)
- Discussion (30')
13:15 - Lunch
14:30 - Tier-2 status, plans & pledges
- UNIBE-LHEP (30' - Gianfranco)
15:00 - Tier-3 status and plans
- PSI (15' - Nina)
- UNIBE-ID (15' - Gianfranco)
- UNIGE (15' - Gianfranco)
15:45 - Coffee break
16:00 - NGI_CH (20' - Gianfranco)
- Open tickets:
16:30 - End of meeting

Attendees

CSCS:
CMS:
ATLAS:
LHCb:

Minutes

Please ALSO check the action items and the attachments with the individual reports.

# LHCb
- Check the apparent limit of 2500 jobs
- Higher failure rate compared to other sites (still low, not important)
- SAM tests are VERY bad, but nobody seems to care
- Re-calculate HS06 with Singularity. Don't publish results in the middle of the month.

# CMS
- CSCS ranks low in performance (job success rate)
- The CMS VO box should be updated
- CMS should check whether PhEDEx will still be needed after they adopt Rucio
- Some charts might be wrong: Vinzenz will recheck them soon

# ATLAS
- Generally CSCS is doing well (below the pledge mark, but very close)
- If a service is down for more than 12 hours, one should declare a downtime (you can declare a single service down) and send a message on the chat
- Discussion regarding the 40:40:20 shares - will be addressed in the context of next year's grant application and pledges
- Discussion on the involvement of the VOs in monthly meetings
- Migration to ARC 6 is needed, still not clear how to proceed
- Dashboard: add "cumulative CPU utilization per VO" (pie chart Nick gave Gianfranco on the chat) and "HammerCloud state (green/red box) per VO"
- Bern cluster re-deployment delayed because the decommissioned Phoenix hardware, and details about it, became available late
- Bern could not plan for infrastructure despite having asked for inventory details for over 12 months
- Bern under pledge (largely as a consequence of the above), aiming to level up by the end of Q4
- Pablo points out Bern should not plan pledged resources counting on "opportunistic resources"
- Gianfranco points out Phoenix resources are not opportunistic, but part of the "LHConCray" agreement, upon which ATLAS has planned for pledges until 2021. These plans have been attached to the LHConCray report and later presented in at least TWO face-to-face meetings without objections.
- Information requested on the "not available" Phoenix nodes (27k HS06)

# Long-term provisioning:
- Two goals for VO-reps to inquire with the VOs:
a) increase usage of accelerators (GPUs at CSCS), and
b) inquire about reducing storage (remove dCache?)
- CSCS to inquire internally how to deploy FPGAs
- Meet at the end of October or in November to discuss what we found out from others
- LHCb does not use CSCS as cache, but as real storage (with 1 replica somewhere else)

Action items

Follow up in the monthly meetings on the actions listed in the F2F action items (e.g. on the request for details on the decommissioned HW)

Roland - LHCb
- check the y-axis (jobs submitted). It looks inconsistent
Vinzenz - CMS
- the monitoring moved to Grafana. Re-check the plots shown; compare Grafana with the CMS dashboard.
General:
- modify/optimize Slurm for proper job distribution (to avoid the skewing which happened)
  (remember CMS was low at 33%, ATLAS high at 41%, LHCb up at 26%)
- reminder: all VO contacts and CSCS MUST check the dashboards daily for ALL three experiments (and report monthly);
  in case of serious problems (one VO missing, not submitting jobs, etc.) -> report immediately to the corresponding VO -> tickets
- Monitor running vs. installed cores;
- monitor VO-shares
- Batch vs. Dashboard metrics unification
- Understand the differences in quoted numbers for "delivered resources", which are available on 1) GRAFANA, 2) accounting-next.egi.eu/wlcg/report/tier2/ and 3) the VO-specific dashboards;
- VO-status box: get one per VO ?
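
As an illustration of the "monitor VO-shares" item, the skew quoted above can be expressed as a deviation from the target shares. This is only a minimal sketch: it assumes the 40:40:20 split maps to ATLAS:CMS:LHCb (the minutes do not state the order explicitly), and the function name `share_deviation` is hypothetical. The observed values are the ones quoted in the Slurm item.

```python
# Target shares in percent. ASSUMPTION: 40:40:20 maps to ATLAS:CMS:LHCb.
TARGET = {"ATLAS": 40.0, "CMS": 40.0, "LHCb": 20.0}

# Observed shares in percent, as quoted in the Slurm action item.
OBSERVED = {"ATLAS": 41.0, "CMS": 33.0, "LHCb": 26.0}


def share_deviation(target, observed):
    """Return observed minus target share per VO, in percentage points."""
    return {vo: observed[vo] - target[vo] for vo in target}


deviation = share_deviation(TARGET, OBSERVED)
for vo in sorted(deviation):
    # A positive value means the VO ran above its target share.
    print(f"{vo}: {deviation[vo]:+.1f} percentage points")
```

Under that assumption, CMS comes out 7 points below target and LHCb 6 points above, which is the skew the Slurm tuning is meant to avoid.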

all VOs: discuss internally the possible consequences if storage in general were reduced or even abandoned, and strategies with respect to accelerators.

Attachments

LHCb_CHIPP_Zurich_Sep2019.pdf: LHCb_CHIPP_Zurich_Sep2019.pdf

ATLAS-report-20190913.pdf: ATLAS VO report (T2)

UNIBE-LHEP-20190913.pdf: UNIBe-LHEP T2 report

UNIGE-DPNC-20190913.pdf: UNIGE-DPNC Tier3 report

EGI-20190913.pdf: EGI-topics

CMS_VO_report: CMS - VO report Vinzenz Stampf

f2f_13Sep19_v1.pdf: CMS - Vinzenz

Topic attachments
Attachment	History	Size	Date	Who	Comment
20190913_CHIPP-CSCS_resource_provisioning_overview.pdf	r1	351.6 K	2019-09-16 - 06:26	PabloFernandez
20190913_F2F_CSCS.pdf	r1	7281.3 K	2019-09-13 - 08:37	StefanoGorini
ATLAS-report-20190913.pdf	r1	3405.8 K	2019-09-13 - 08:35	GianfrancoSciacca	ATLAS VO report (T2)
EGI-20190913.pdf	r1	477.8 K	2019-09-13 - 09:03	GianfrancoSciacca	EGI-topics
LHCb_CHIPP_Zurich_Sep2019.pdf	r1	951.1 K	2019-09-13 - 09:25	RolandBernet
UNIBE-LHEP-20190913.pdf	r1	2667.2 K	2019-09-13 - 08:51	GianfrancoSciacca	UNIBe-LHEP T2 report
UNIGE-DPNC-20190913.pdf	r1	114.5 K	2019-09-13 - 08:53	GianfrancoSciacca	UNIGE-DPNC Tier3 report
f2f-19-zurich-cms-t3-status.pdf	r1	800.4 K	2019-09-13 - 08:43	NinaLoktionova	CMS T3 status report
f2f_13Sep19_v1.pdf	r2 r1	9547.7 K	2019-10-10 - 08:41	VinzenzStampf	CMS - Vinzenz